Voice Mode
Voice mode lets you talk to Mona instead of typing. It works in the CLI, Telegram, and Discord.
Prerequisites
Install the voice extra:
cd ~/.monoclaw/monoclaw-runtime
uv pip install -e ".[voice]"
This installs:
- faster-whisper: local speech-to-text
- sounddevice: audio recording
- numpy: audio processing
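To confirm the install worked, a quick check like the one below can help. It is a minimal sketch, not a Monoclaw command: `missing_voice_modules` is a hypothetical helper, and it only assumes the standard import names of the three packages (`faster_whisper`, `sounddevice`, `numpy`). Run it with the same Python environment that `uv` installed into.

```python
from importlib.util import find_spec

# Import names of the three packages in the voice extra.
# Note that faster-whisper is imported as "faster_whisper" (underscore).
VOICE_MODULES = ("faster_whisper", "sounddevice", "numpy")


def missing_voice_modules(modules=VOICE_MODULES):
    """Return the module names that cannot be found in this environment."""
    return [name for name in modules if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_voice_modules()
    if missing:
        print("Missing voice dependencies:", ", ".join(missing))
    else:
        print("All voice dependencies are importable.")
```

If anything is reported missing, re-running the install command above is the first thing to try.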
CLI voice mode
Enable
/voice on
Record
Press Ctrl+B to start recording. Press Ctrl+B again to stop.
Mona will:
- Transcribe your audio (locally, no cloud)
- Process your request
- Speak the response (using your configured TTS provider)
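The press-to-start, press-to-stop behaviour can be pictured as a small toggle around a microphone stream. The sketch below is illustrative only and is not Monoclaw's implementation: `ToggleRecorder` and its `open_stream` parameter are hypothetical names, and the 16 kHz mono float32 format is an assumption chosen because faster-whisper accepts audio arrays in that format. It uses `sounddevice.InputStream` with a callback to collect audio blocks.

```python
class ToggleRecorder:
    """Collect microphone blocks between two toggle() calls.

    Hypothetical helper for illustration; not part of Monoclaw.
    """

    def __init__(self, samplerate=16_000, open_stream=None):
        self.samplerate = samplerate
        # open_stream lets callers swap in another stream factory (e.g. in tests).
        self._open_stream = open_stream or self._open_sounddevice_stream
        self._stream = None
        self._chunks = []

    @property
    def recording(self):
        return self._stream is not None

    def _on_audio(self, indata, frames, time, status):
        # sounddevice calls this for each captured block. Copy the data
        # because the buffer is reused after the callback returns.
        self._chunks.append(indata.copy())

    def _open_sounddevice_stream(self):
        import sounddevice as sd  # imported lazily: needs the voice extra

        return sd.InputStream(
            samplerate=self.samplerate,
            channels=1,
            dtype="float32",
            callback=self._on_audio,
        )

    def toggle(self):
        """First call starts recording and returns None.

        Second call stops recording and returns the captured blocks.
        """
        if not self.recording:
            self._chunks = []
            self._stream = self._open_stream()
            self._stream.start()
            return None
        self._stream.stop()
        self._stream.close()
        self._stream = None
        return self._chunks
```

The returned blocks can be joined with `numpy.concatenate` into a single array before being handed to the speech-to-text model.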
Disable
/voice off
Telegram voice messages
Send a voice message to Mona on Telegram. She will:
- Download the voice note
- Transcribe it locally
- Reply with text (or voice, if TTS is enabled)
No special setup needed — voice messages work out of the box.
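For readers curious what the local transcription step might look like, here is a minimal sketch using faster-whisper directly. It is not Monoclaw's code: `transcribe_voice_note` and `join_segments` are hypothetical helpers, the `cpu`/`int8` settings are assumptions picked for modest hardware, and the file path stands for a voice note that has already been downloaded (Telegram voice notes are OGG/Opus files).

```python
def join_segments(segments):
    """Join the text of transcription segments into one message string."""
    texts = (seg.text.strip() for seg in segments)
    return " ".join(text for text in texts if text)


def transcribe_voice_note(path, model_size="base"):
    """Transcribe a downloaded voice note locally.

    Returns the transcript and the detected language code.
    """
    from faster_whisper import WhisperModel  # lazy import: needs the voice extra

    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    # transcribe() returns a lazy generator of segments plus detection info;
    # the audio is only processed as the generator is consumed.
    segments, info = model.transcribe(path)
    return join_segments(segments), info.language
```

Loading the model is the slow part, so a long-running bot would normally create the `WhisperModel` once and reuse it across messages.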
Discord voice channels
Mona can join Discord voice channels:
- Invite Mona to a voice channel
- Speak naturally — she listens and responds
- Use /voice leave to disconnect her
Voice channel commands
| Command | Action |
|---|---|
| /voice join | Join the voice channel |
| /voice leave | Leave the voice channel |
| /voice mute | Stop listening |
| /voice unmute | Resume listening |
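The four commands in the table can be read as transitions on a small session state. The sketch below models that reading. It is an illustration under assumptions, not Mona's actual behaviour: `VoiceSession` and `apply_command` are hypothetical names, and it assumes that joining starts in the listening state and that mute and unmute are ignored while disconnected.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceSession:
    """Hypothetical model of Mona's state in one Discord server."""

    connected: bool = False
    listening: bool = False


def apply_command(session, command):
    """Return the session state after a /voice command."""
    if command == "/voice join":
        # Assumption: joining a channel starts listening immediately.
        return VoiceSession(connected=True, listening=True)
    if command == "/voice leave":
        return VoiceSession()
    if command == "/voice mute" and session.connected:
        return VoiceSession(connected=True, listening=False)
    if command == "/voice unmute" and session.connected:
        return VoiceSession(connected=True, listening=True)
    # Unknown commands, or mute/unmute while disconnected, change nothing.
    return session
```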
TTS configuration
Voice mode uses your configured TTS provider for responses:
# ~/.monoclaw/config.yaml
tts:
  provider: edge-tts  # edge-tts | elevenlabs
  edge-tts:
    voice: "zh-HK-HiuMaanNeural"
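To preview how a voice sounds before committing it to the config, the `edge-tts` Python package can be called directly. This is a rough sketch rather than Monoclaw's own TTS path: `resolve_edge_voice` and `speak_to_file` are hypothetical helpers, the config is assumed to be already parsed into a dict with the `edge-tts` block nested under `tts`, and the fallback voice is simply the one from the example above.

```python
FALLBACK_VOICE = "zh-HK-HiuMaanNeural"  # fallback chosen for this sketch


def resolve_edge_voice(config):
    """Return the edge-tts voice from a parsed config dict, or the fallback."""
    tts = config.get("tts") or {}
    edge = tts.get("edge-tts") or {}
    return edge.get("voice", FALLBACK_VOICE)


async def speak_to_file(text, voice, out_path="reply.mp3"):
    """Synthesize text with edge-tts and save it as an audio file."""
    import edge_tts  # lazy import: third-party package, needs network access

    await edge_tts.Communicate(text, voice).save(out_path)
    return out_path


# Example usage (not run automatically because it contacts the TTS service):
#   import asyncio
#   voice = resolve_edge_voice({"tts": {"edge-tts": {"voice": "en-US-AriaNeural"}}})
#   asyncio.run(speak_to_file("Hello from Mona", voice))
```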
For higher quality, use ElevenLabs:
uv pip install -e ".[tts-premium]"
monoclaw config set tts.provider elevenlabs
STT configuration
Speech-to-text uses faster-whisper locally:
voice:
  stt:
    model: "base"     # tiny | base | small | medium | large
    language: "auto"  # auto-detect or specify (en, zh, etc.)
Larger models are more accurate but slower and use more memory.
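The two settings map naturally onto faster-whisper's own arguments: the model name goes to `WhisperModel`, and the language goes to `transcribe()`, where `None` means auto-detect. The helper below sketches that mapping. `stt_settings` is a hypothetical function, the config is assumed to be already parsed into a dict with `stt` nested under `voice`, and the defaults mirror the example values above rather than any documented Monoclaw default.

```python
VALID_MODELS = ("tiny", "base", "small", "medium", "large")


def stt_settings(config):
    """Return (model_size, language) from a parsed config dict.

    language is None when the config says "auto", which is how
    faster-whisper is asked to detect the language itself.
    """
    stt = (config.get("voice") or {}).get("stt") or {}
    model_size = stt.get("model", "base")
    if model_size not in VALID_MODELS:
        raise ValueError(f"unknown STT model: {model_size!r}")
    language = stt.get("language", "auto")
    return model_size, (None if language == "auto" else language)


# Example usage with faster-whisper (requires the voice extra):
#   from faster_whisper import WhisperModel
#   model_size, language = stt_settings(config)
#   model = WhisperModel(model_size, device="cpu", compute_type="int8")
#   segments, info = model.transcribe("note.wav", language=language)
```

Setting an explicit language skips the detection step, which can help when short recordings are misdetected.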
Best practices
- Use a good microphone — Reduces transcription errors
- Speak clearly — Faster-whisper handles accents well but clarity helps
- Keep messages short — 30 seconds or less for best results
- Use in quiet environments — Background noise reduces accuracy
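The 30-second guideline is easy to check for a recording saved as a WAV file. The sketch below uses only the standard library. `wav_duration_seconds` and `is_short_enough` are hypothetical helpers, and the limit is taken from the guideline above rather than from any hard cap in Monoclaw.

```python
import wave

MAX_SECONDS = 30.0  # from the "keep messages short" guideline


def wav_duration_seconds(path):
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_short_enough(path, limit=MAX_SECONDS):
    """Return True if the recording is within the recommended length."""
    return wav_duration_seconds(path) <= limit
```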
Limitations
- Local only — STT runs on your machine; no cloud processing
- English/Chinese best — Other languages may have lower accuracy
- Resource usage — Medium/large models use 2–8GB RAM
- No real-time streaming — Records full message before processing
Troubleshooting
| Problem | Fix |
|---|---|
| "Voice mode not available" | Install the voice extra |
| "No audio input detected" | Check microphone permissions |
| Transcription is poor | Try a larger Whisper model (for example, small or medium) |
| High latency | Use a smaller model, such as base instead of large |
| Discord voice not working | Ensure Mona has voice channel permissions |
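When "No audio input detected" persists after checking permissions, it can help to see which input devices the audio library itself can find. The sketch below uses `sounddevice.query_devices()`, which returns one dict per device with fields such as `name` and `max_input_channels`. `input_devices` and `list_microphones` are hypothetical helpers, not Monoclaw commands.

```python
def input_devices(devices):
    """Return the names of devices that have at least one input channel."""
    return [d["name"] for d in devices if d.get("max_input_channels", 0) > 0]


def list_microphones():
    """Return the names of recording devices visible to sounddevice."""
    import sounddevice as sd  # lazy import: needs the voice extra

    return input_devices(sd.query_devices())


if __name__ == "__main__":
    try:
        names = list_microphones()
    except ImportError:
        print("sounddevice is not installed; install the voice extra first.")
    else:
        print("\n".join(names) if names else "No input devices found.")
```

An empty list usually points at the operating system (permissions, drivers, or an unplugged device) rather than at Mona.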