Using Voice Mode
Voice mode lets you talk to Mona naturally. This guide covers setup, usage, and optimization.
Installation
cd ~/.monoclaw/monoclaw-runtime
uv pip install -e ".[voice]"
This installs:
- faster-whisper: local speech-to-text
- sounddevice: audio capture
- numpy: audio processing
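To confirm that the extra installed correctly, a quick check like the one below can help. It is a minimal sketch and not part of monoclaw. The helper name `missing_voice_modules` is hypothetical, and it assumes the three packages use their usual import names (`faster_whisper`, `sounddevice`, `numpy`).

```python
from importlib.util import find_spec

# Assumed import names for the packages installed by the [voice] extra.
# Note that the faster-whisper distribution imports as faster_whisper.
VOICE_MODULES = ("faster_whisper", "sounddevice", "numpy")


def missing_voice_modules(modules=VOICE_MODULES):
    """Return the names in `modules` that cannot be found in this environment.

    find_spec only locates a module; it does not import it, so this check
    works even when no audio backend is available.
    """
    return [name for name in modules if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_voice_modules()
    if missing:
        print("Missing voice dependencies:", ", ".join(missing))
    else:
        print("All voice dependencies are importable.")
```

Run it with the same Python environment that the runtime uses, otherwise the result says nothing about monoclaw itself.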
CLI voice mode
Enable
/voice on
Record and send
- Press Ctrl+B to start recording
- Speak your message
- Press Ctrl+B again to stop
- Mona transcribes and responds
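For readers curious about what happens between the two key presses, the record-then-transcribe flow can be approximated with the two libraries from the installation step. This is a rough sketch under stated assumptions, not monoclaw's actual implementation: it records for a fixed number of seconds instead of toggling on Ctrl+B, and the function names `frames_for`, `record`, and `transcribe` are hypothetical.

```python
SAMPLE_RATE = 16_000  # Whisper models work on 16 kHz mono audio


def frames_for(seconds, sample_rate=SAMPLE_RATE):
    """Number of audio frames needed to hold `seconds` of audio."""
    return int(round(seconds * sample_rate))


def record(seconds, sample_rate=SAMPLE_RATE):
    """Record from the default microphone and return a 1-D float32 array."""
    import sounddevice as sd  # lazy import: requires a working audio backend

    audio = sd.rec(frames_for(seconds, sample_rate),
                   samplerate=sample_rate, channels=1, dtype="float32")
    sd.wait()  # block until the recording has finished
    return audio[:, 0]  # reshape (frames, 1) to (frames,)


def transcribe(audio, model_name="small"):
    """Transcribe a 16 kHz float32 array (or a file path) with faster-whisper."""
    from faster_whisper import WhisperModel  # lazy import: loads model weights

    # CPU with int8 weights is an assumed, conservative default.
    model = WhisperModel(model_name, device="cpu", compute_type="int8")
    segments, _info = model.transcribe(audio)
    return " ".join(segment.text.strip() for segment in segments)
```

With a microphone attached, `print(transcribe(record(5)))` would record five seconds and print the transcript. The first call downloads the model weights, so it can take a while.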
Switch TTS voice
/voice voice zh-HK-HiuMaanNeural
Telegram voice messages
No setup needed — just send a voice message to Mona. She will:
- Download the voice note
- Transcribe it locally
- Reply with text, or with voice if TTS is on
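The three steps above form a small pipeline. The sketch below shows its shape only: every callable is passed in by the caller, so nothing here is a real Telegram or monoclaw API, and the name `handle_voice_note` and its parameters are hypothetical.

```python
from typing import Callable, Optional


def handle_voice_note(
    download: Callable[[], bytes],
    transcribe: Callable[[bytes], str],
    respond: Callable[[str], str],
    synthesize: Optional[Callable[[str], bytes]] = None,
) -> dict:
    """Run download -> transcribe -> reply for one voice note.

    `synthesize` stands in for TTS; leave it as None when TTS is off.
    """
    audio = download()               # step 1: fetch the voice note
    transcript = transcribe(audio)   # step 2: local speech-to-text
    answer = respond(transcript)     # step 3: produce the reply text
    result = {"transcript": transcript, "text": answer, "voice": None}
    if synthesize is not None:       # only render audio when TTS is enabled
        result["voice"] = synthesize(answer)
    return result
```

Passing the stages as arguments keeps the flow easy to test with stand-in functions before wiring in real audio.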
Enable TTS replies
monoclaw config set telegram.tts.enabled true
Discord voice channels
Join a voice channel
/voice join
Speak naturally
- Join the same voice channel as Mona
- Speak normally — she listens and responds
- She leaves when you use /voice leave
Push-to-talk alternative
If the channel is noisy, use text chat alongside voice:
@Mona [voice only] summarize what I just said
Choosing a TTS provider
Edge TTS (free, default)
Good for everyday use. Configure:
monoclaw config set tts.provider edge-tts
monoclaw config set tts.edge-tts.voice "en-US-AriaNeural"
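To audition a voice name before setting it, one option is to call the `edge-tts` Python package directly. This sketch assumes that package is installed and that monoclaw's `edge-tts` provider accepts the same voice names. Neither is confirmed by this guide. The function names and the output path are hypothetical.

```python
import asyncio


async def synthesize(text, voice="en-US-AriaNeural", out_path="reply.mp3"):
    """Render `text` to an MP3 file with the given Edge TTS voice."""
    import edge_tts  # lazy import: third-party package, needs network access

    communicate = edge_tts.Communicate(text, voice)
    await communicate.save(out_path)
    return out_path


def synthesize_sync(text, **kwargs):
    """Blocking wrapper for scripts that do not run their own event loop."""
    return asyncio.run(synthesize(text, **kwargs))
```

For example, `synthesize_sync("Hello from Mona", voice="zh-HK-HiuMaanNeural")` would write `reply.mp3` in the current directory for listening.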
ElevenLabs (premium)
Best quality, supports voice cloning:
uv pip install -e ".[tts-premium]"
monoclaw config set tts.provider elevenlabs
monoclaw config set ELEVENLABS_API_KEY "your-key"
Optimizing transcription accuracy
Choose the right whisper model
| Model | Accuracy | Speed | RAM |
|---|---|---|---|
| tiny | Basic | Fastest | ~1GB |
| base | Good | Fast | ~1GB |
| small | Very good | Medium | ~2GB |
| medium | Excellent | Slow | ~5GB |
| large | Best | Slowest | ~10GB |
monoclaw config set voice.stt.model "small"
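As a starting point for picking a model, the table can be turned into a simple lookup. The sketch below is illustrative only: the RAM figures are the approximate values from the table, the function name `largest_model_that_fits` is hypothetical, and the 1 GB headroom is an assumed safety margin.

```python
# (model, approximate RAM in GB), smallest first, copied from the table above.
WHISPER_MODELS = [
    ("tiny", 1),
    ("base", 1),
    ("small", 2),
    ("medium", 5),
    ("large", 10),
]


def largest_model_that_fits(free_ram_gb, headroom_gb=1.0):
    """Return the largest model whose RAM estimate fits, or None if none do.

    `headroom_gb` is memory left aside for everything else on the machine.
    """
    budget = free_ram_gb - headroom_gb
    best = None
    for name, ram_gb in WHISPER_MODELS:
        if ram_gb <= budget:
            best = name  # the list is ordered, so later matches are larger
    return best


if __name__ == "__main__":
    choice = largest_model_that_fits(free_ram_gb=4)
    if choice is not None:
        # Print the command from this guide rather than running it.
        print(f'monoclaw config set voice.stt.model "{choice}"')
```

This only considers memory. A model that fits may still be too slow for your hardware, so treat the result as an upper bound and test from there.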
Set language
Auto-detection works well, but specifying the language improves accuracy:
monoclaw config set voice.stt.language "zh" # Mandarin
monoclaw config set voice.stt.language "yue" # Cantonese
monoclaw config set voice.stt.language "en" # English
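To decide whether pinning a language is worth it, you can inspect how confident auto-detection is on one of your own recordings. This is a sketch using faster-whisper directly, outside monoclaw. The helper names are hypothetical, and the 0.8 threshold is an arbitrary illustration, not a recommended value.

```python
def should_pin_language(probability, threshold=0.8):
    """Suggest setting the language explicitly when detection confidence is low."""
    return probability < threshold


def detect_language(audio_path, model_name="small"):
    """Return (language_code, probability) as detected by faster-whisper."""
    from faster_whisper import WhisperModel  # lazy import: loads model weights

    model = WhisperModel(model_name, device="cpu", compute_type="int8")
    # With no language argument, faster-whisper detects the language itself
    # and reports the result on the returned info object.
    _segments, info = model.transcribe(audio_path)
    return info.language, info.language_probability
```

If `should_pin_language` returns True for typical recordings, setting `voice.stt.language` as shown above is likely to help.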
Best practices
- Speak clearly — Whisper handles accents but clarity helps
- Keep messages short — Under 30 seconds for best results
- Use in quiet environments — Background noise reduces accuracy
- Test different models — Balance accuracy vs speed for your hardware
- Use headphones in Discord — Prevents echo and feedback
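The 30-second guideline is easy to check programmatically if you work with raw audio. The helpers below are a small sketch. The names are hypothetical, and the 16 kHz default is an assumption about the capture sample rate.

```python
RECOMMENDED_MAX_SECONDS = 30  # from the guideline above


def clip_seconds(num_samples, sample_rate=16_000):
    """Duration of a mono clip in seconds, given its sample count."""
    return num_samples / sample_rate


def is_too_long(num_samples, sample_rate=16_000, limit=RECOMMENDED_MAX_SECONDS):
    """True when the clip is longer than the recommended maximum."""
    return clip_seconds(num_samples, sample_rate) > limit
```

For a NumPy recording `audio`, `is_too_long(len(audio))` reports whether it exceeds the recommended length.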
Troubleshooting
| Problem | Fix |
|---|---|
| "Voice mode not available" | Install the [voice] extra: uv pip install -e ".[voice]" |
| "No audio input" | Check microphone permissions in OS settings |
| Poor transcription | Try a larger whisper model |
| High latency | Use the base or small model; reduce TTS quality |
| Discord voice choppy | Check network connection; lower bitrate |
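For the "No audio input" case, it can help to see which devices the audio library detects before changing OS settings. The sketch below uses sounddevice's `query_devices()`. The helper names are hypothetical, and it assumes sounddevice is installed in the same environment as the runtime.

```python
def input_device_names(devices):
    """Names of devices that expose at least one input channel."""
    return [d["name"] for d in devices if d["max_input_channels"] > 0]


def list_microphones():
    """Ask sounddevice for all devices and keep only the inputs."""
    import sounddevice as sd  # lazy import: requires a working audio backend

    return input_device_names(sd.query_devices())
```

An empty result from `list_microphones()` suggests that no microphone is visible to the process at all, which points to drivers or hardware. A listed microphone that still records silence points more towards OS permissions.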