Using Voice Mode

Voice mode lets you talk to Mona naturally. This guide covers setup, usage, and optimization.

Installation

cd ~/.monoclaw/monoclaw-runtime
uv pip install -e ".[voice]"

This installs:

faster-whisper — Local speech-to-text
sounddevice — Audio capture
numpy — Audio processing
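
To confirm that the packages listed above ended up in the active environment, a small check like the following can help. It is a minimal sketch, not part of Monoclaw: the function name `missing_voice_modules` is hypothetical, and the import names (`faster_whisper`, `sounddevice`, `numpy`) are the usual ones for those packages.

```python
from importlib.util import find_spec

# Import names of the packages the [voice] extra is described as installing.
VOICE_MODULES = ("faster_whisper", "sounddevice", "numpy")


def missing_voice_modules(modules=VOICE_MODULES):
    """Return the module names that cannot be found in the current environment."""
    # find_spec locates a top-level module without importing it.
    return [name for name in modules if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_voice_modules()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All voice dependencies are importable.")
```

Run it with the same Python interpreter that Monoclaw uses; an empty result suggests the extra installed correctly.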

CLI voice mode

Enable

/voice on

Record and send

Press Ctrl+B to start recording
Speak your message
Press Ctrl+B again to stop
Mona transcribes and responds

Switch TTS voice

/voice voice zh-HK-HiuMaanNeural

Telegram voice messages

No setup needed — just send a voice message to Mona. She will:

Download the voice note
Transcribe it locally
Reply with text, or with voice if TTS is enabled

Enable TTS replies

monoclaw config set telegram.tts.enabled true

Discord voice channels

Join a voice channel

/voice join

Speak naturally

Join the same voice channel as Mona
Speak normally — she listens and responds
She leaves when you run /voice leave

Push-to-talk alternative

If the channel is noisy, use text chat alongside voice:

@Mona [voice only] summarize what I just said

Choosing a TTS provider

Edge TTS (free, default)

Good for everyday use. Configure:

monoclaw config set tts.provider edge-tts
monoclaw config set tts.edge-tts.voice "en-US-AriaNeural"
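
To preview how a voice sounds before setting it, you can call Edge TTS directly from Python. This sketch assumes the `edge-tts` Python package is installed (`pip install edge-tts`) and that you have network access; whether Monoclaw's `edge-tts` provider wraps this same package is an assumption, and `speak_to_file` is a hypothetical helper name.

```python
import asyncio


async def speak_to_file(text, voice="en-US-AriaNeural", out_path="preview.mp3"):
    """Synthesize text with the given Edge voice and save it as an MP3 file."""
    # Imported lazily so this file loads even if edge-tts is not installed.
    import edge_tts

    await edge_tts.Communicate(text, voice).save(out_path)
    return out_path


if __name__ == "__main__":
    asyncio.run(speak_to_file("Hello, this is a voice preview."))
```

Swap the `voice` argument for any other Edge voice name, for example `zh-HK-HiuMaanNeural`, to compare results.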

ElevenLabs (premium)

Highest quality, with support for voice cloning. Install the extra, then configure:

uv pip install -e ".[tts-premium]"
monoclaw config set tts.provider elevenlabs
monoclaw config set ELEVENLABS_API_KEY "your-key"

Optimizing transcription accuracy

Choose the right Whisper model

Model	Accuracy	Speed	RAM
tiny	Basic	Fastest	~1GB
base	Good	Fast	~1GB
small	Very good	Medium	~2GB
medium	Excellent	Slow	~5GB
large	Best	Slowest	~10GB

monoclaw config set voice.stt.model "small"
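
The table can be turned into a simple rule of thumb: pick the most accurate model whose approximate RAM need fits what you can spare. The sketch below is illustrative only; `pick_model` is a hypothetical helper, and the figures are the approximate values from the table, not measurements.

```python
# Approximate memory needs in GB, copied from the table above.
MODEL_RAM_GB = {"tiny": 1, "base": 1, "small": 2, "medium": 5, "large": 10}


def pick_model(available_gb, table=MODEL_RAM_GB):
    """Return the most accurate model whose approximate RAM need fits the budget."""
    # The dict is ordered from least to most accurate, so the last fit wins.
    fitting = [name for name, need in table.items() if need <= available_gb]
    if not fitting:
        raise ValueError(f"No model fits in {available_gb} GB")
    return fitting[-1]


print(pick_model(4))   # small
print(pick_model(16))  # large
```

Memory is only one constraint. If responses feel slow, step down a size even when a larger model fits.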

Set language

Auto-detection works well, but setting the language explicitly improves accuracy. Pick the one that matches your speech:

monoclaw config set voice.stt.language "zh"    # Mandarin
monoclaw config set voice.stt.language "yue"   # Cantonese (needs a Whisper model that supports it, such as large-v3)
monoclaw config set voice.stt.language "en"    # English
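
To compare auto-detection with a fixed language on your own recordings, you can call faster-whisper directly. This guide does not show Monoclaw's internal transcription code, so treat the following as a standalone sketch: `transcribe` is a hypothetical helper, and the CPU and `int8` settings are conservative assumptions.

```python
def transcribe(path, model_size="small", language=None):
    """Transcribe an audio file; language=None lets Whisper auto-detect."""
    # Imported lazily so this file can be loaded without the [voice] extra.
    from faster_whisper import WhisperModel

    # CPU with int8 is a conservative default; adjust for your hardware.
    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    segments, info = model.transcribe(path, language=language)
    # segments is a generator; consuming it is what actually runs the decoding.
    text = " ".join(segment.text.strip() for segment in segments)
    return text, info.language


if __name__ == "__main__":
    # "sample.wav" is a placeholder; point this at one of your own recordings.
    print(transcribe("sample.wav"))                 # auto-detect
    print(transcribe("sample.wav", language="en"))  # fixed language
```

If the auto-detected language in the first result is wrong for your recordings, setting `voice.stt.language` is likely to help.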

Best practices

Speak clearly — Whisper handles accents but clarity helps
Keep messages short — Under 30 seconds for best results
Use in quiet environments — Background noise reduces accuracy
Test different models — Balance accuracy vs speed for your hardware
Use headphones in Discord — Prevents echo and feedback
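
To check a recording against the 30-second guideline before sending it, the Python standard library is enough for uncompressed WAV files. This is a minimal sketch with hypothetical helper names; compressed formats such as Telegram's voice notes would need a different reader.

```python
import wave

MAX_SECONDS = 30.0  # guideline from the best practices above


def wav_duration_seconds(path):
    """Return the length of an uncompressed WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_short_enough(path, limit=MAX_SECONDS):
    """Return True if the recording is no longer than the limit."""
    return wav_duration_seconds(path) <= limit
```

For example, `is_short_enough("note.wav")` returns `False` for a 45-second recording, which suggests splitting it into two messages.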

Troubleshooting

Problem	Fix
"Voice mode not available"	Install the `[voice]` extra (see Installation)
"No audio input"	Check microphone permissions in your OS settings
Poor transcription	Try a larger Whisper model, or set the language explicitly
High latency	Use the `base` or `small` model; reduce TTS quality
Discord voice choppy	Check your network connection; lower the bitrate