MonoClaw

Using Voice Mode

Voice mode lets you talk to Mona naturally. This guide covers setup, usage, and optimization.

Installation

cd ~/.monoclaw/monoclaw-runtime
uv pip install -e ".[voice]"

This installs:

  • faster-whisper — Local speech-to-text
  • sounddevice — Audio capture
  • numpy — Audio processing
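
To confirm the extra installed cleanly, you can check that the three packages are importable from the same environment. This is a minimal sketch, not part of MonoClaw: the helper name missing_modules is made up for illustration, and the import names assume the usual mapping from the package names above (faster-whisper is imported as faster_whisper).

```python
import importlib.util

# Import names for the packages listed above (assumed mapping from pip names).
VOICE_MODULES = ["faster_whisper", "sounddevice", "numpy"]


def missing_modules(names):
    """Return the names from `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(VOICE_MODULES)
    if missing:
        print("Missing voice dependencies:", ", ".join(missing))
    else:
        print("All voice dependencies are importable.")
```

Run it with the Python interpreter of the environment you installed into; a system-wide interpreter may report packages as missing even when the install succeeded.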

CLI voice mode

Enable

/voice on

Record and send

  1. Press Ctrl+B to start recording
  2. Speak your message
  3. Press Ctrl+B again to stop
  4. Mona transcribes and responds
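
The record-then-transcribe flow above can be approximated outside the CLI with the two libraries the [voice] extra installs. This is a rough sketch under stated assumptions, not MonoClaw's actual implementation: it records for a fixed number of seconds instead of toggling on Ctrl+B, and the function names (record_seconds, transcribe, join_segments) are hypothetical.

```python
def record_seconds(seconds, samplerate=16000):
    """Record mono float32 audio from the default microphone."""
    import sounddevice as sd  # lazy import: needs audio hardware at call time

    frames = int(seconds * samplerate)
    audio = sd.rec(frames, samplerate=samplerate, channels=1, dtype="float32")
    sd.wait()  # block until the recording has finished
    return audio[:, 0]  # shape (frames, 1) -> (frames,)


def join_segments(texts):
    """Join transcribed segment texts into one line, dropping empty pieces."""
    return " ".join(t.strip() for t in texts if t.strip())


def transcribe(audio, model_size="small", language=None):
    """Transcribe a 16 kHz float32 array locally with faster-whisper."""
    from faster_whisper import WhisperModel  # lazy import: heavy dependency

    # CPU with int8 is an assumption chosen for portability, not a MonoClaw default.
    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    segments, _info = model.transcribe(audio, language=language)
    return join_segments(segment.text for segment in segments)


if __name__ == "__main__":
    print("Recording for 5 seconds...")
    print(transcribe(record_seconds(5)))
```

The first call to transcribe downloads the chosen model if it is not cached, so expect a delay on the first run.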

Switch TTS voice

/voice voice zh-HK-HiuMaanNeural

Telegram voice messages

No setup needed — just send a voice message to Mona. She will:

  1. Download the voice note
  2. Transcribe it locally
  3. Reply with text, or with voice if TTS is enabled

Enable TTS replies

monoclaw config set telegram.tts.enabled true

Discord voice channels

Join a voice channel

/voice join

Speak naturally

  • Join the same voice channel as Mona
  • Speak normally — she listens and responds
  • She leaves when you use /voice leave

Push-to-talk alternative

If the channel is noisy, use text chat alongside voice:

@Mona [voice only] summarize what I just said

Choosing a TTS provider

Edge TTS (free, default)

Good for everyday use. Configure:

monoclaw config set tts.provider edge-tts
monoclaw config set tts.edge-tts.voice "en-US-AriaNeural"

ElevenLabs (premium)

Highest quality, with support for voice cloning. Install the extra, then configure:

uv pip install -e ".[tts-premium]"
monoclaw config set tts.provider elevenlabs
monoclaw config set ELEVENLABS_API_KEY "your-key"

Optimizing transcription accuracy

Choose the right whisper model

  Model    Accuracy    Speed     RAM
  tiny     Basic       Fastest   ~1GB
  base     Good        Fast      ~1GB
  small    Very good   Medium    ~2GB
  medium   Excellent   Slow      ~5GB
  large    Best        Slowest   ~10GB

monoclaw config set voice.stt.model "small"
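
If you are unsure which model your machine can hold, the RAM figures from the table can drive a simple choice. The sketch below is illustrative only: pick_model and the 1 GB headroom default are assumptions, and the RAM values are the approximate ones listed above.

```python
# Approximate RAM needs in GB, copied from the table above, smallest first.
MODEL_RAM_GB = [
    ("tiny", 1),
    ("base", 1),
    ("small", 2),
    ("medium", 5),
    ("large", 10),
]


def pick_model(available_gb, headroom_gb=1.0):
    """Return the most accurate model whose RAM need fits the budget.

    `headroom_gb` is memory left free for everything else (an assumed default).
    Falls back to "tiny" when nothing fits.
    """
    budget = available_gb - headroom_gb
    choice = "tiny"
    for name, ram_gb in MODEL_RAM_GB:
        if ram_gb <= budget:
            choice = name  # later entries are more accurate
    return choice


if __name__ == "__main__":
    print(pick_model(4.0))  # budget of 3 GB fits "small" but not "medium"
```

Pass the result to the config command above, for example monoclaw config set voice.stt.model "small". Speed also matters, so a model that fits in RAM may still be too slow for live conversation.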

Set language

Auto-detection works well, but setting the language explicitly improves accuracy. Pick one:

monoclaw config set voice.stt.language "zh"    # Mandarin
monoclaw config set voice.stt.language "yue"   # Cantonese
monoclaw config set voice.stt.language "en"    # English

Best practices

  • Speak clearly — Whisper handles accents but clarity helps
  • Keep messages short — Under 30 seconds for best results
  • Use in quiet environments — Background noise reduces accuracy
  • Test different models — Balance accuracy vs speed for your hardware
  • Use headphones in Discord — Prevents echo and feedback
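
If you need to process a longer recording yourself, one way to follow the 30-second guideline is to split the samples into chunks before transcribing each one. This is a hedged sketch: split_samples is a hypothetical helper, and whether MonoClaw splits long audio on its own is not covered in this guide.

```python
def split_samples(samples, samplerate=16000, max_seconds=30.0):
    """Split a sequence of audio samples into chunks of at most `max_seconds`.

    Works on lists and on NumPy arrays, since it only uses len() and slicing.
    """
    chunk = int(samplerate * max_seconds)
    if chunk <= 0:
        raise ValueError("samplerate * max_seconds must be at least 1 sample")
    return [samples[i:i + chunk] for i in range(0, len(samples), chunk)]


if __name__ == "__main__":
    # 70 seconds of silence at 16 kHz becomes chunks of 30 s, 30 s, and 10 s.
    chunks = split_samples([0.0] * (70 * 16000))
    print([len(c) / 16000 for c in chunks])
```

Cutting at fixed offsets can split a word in half, so for real recordings it is better to cut at pauses where possible.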

Troubleshooting

  Problem                       Fix
  "Voice mode not available"    Install [voice] extra
  "No audio input"              Check microphone permissions in OS settings
  Poor transcription            Try a larger whisper model
  High latency                  Use base or small model; reduce TTS quality
  Discord voice choppy          Check network connection; lower bitrate
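
For the "No audio input" case, it can help to see which input devices the audio library detects before changing OS settings. The following is a small sketch using sounddevice's query_devices; the helper names input_devices and list_microphones are made up for illustration and are not MonoClaw commands.

```python
def input_devices(devices):
    """Return the names of devices that expose at least one input channel."""
    return [d["name"] for d in devices if d.get("max_input_channels", 0) > 0]


def list_microphones():
    """Query sounddevice for all devices and keep only the inputs."""
    import sounddevice as sd  # lazy import: requires the [voice] extra

    return input_devices(sd.query_devices())


if __name__ == "__main__":
    names = list_microphones()
    if names:
        for name in names:
            print("input:", name)
    else:
        print("No input devices found; check microphone permissions.")
```

An empty list usually points to a permissions or driver problem rather than a MonoClaw setting.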