MonoClaw

Voice Mode

Voice mode lets you talk to Mona instead of typing. It works in the CLI, Telegram, and Discord.

Prerequisites

Install the voice extra:

cd ~/.monoclaw/monoclaw-runtime
uv pip install -e ".[voice]"

This installs:

  • faster-whisper — Local speech-to-text
  • sounddevice — Audio recording
  • numpy — Audio processing
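
To confirm the install worked, the three packages can be checked from Python. The following is a minimal sketch, not part of MonoClaw: the function name `missing_voice_modules` is hypothetical, and the import names (`faster_whisper`, `sounddevice`, `numpy`) are the standard ones for these packages. It only checks that the modules can be found. `sounddevice` can still fail at import time if the PortAudio system library is missing.

```python
from importlib.util import find_spec

# Import names of the three packages listed above.
VOICE_MODULES = ("faster_whisper", "sounddevice", "numpy")


def missing_voice_modules(modules=VOICE_MODULES):
    """Return the module names that cannot be found in this environment."""
    return [name for name in modules if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_voice_modules()
    if missing:
        print("Missing voice dependencies:", ", ".join(missing))
    else:
        print("All voice dependencies are importable.")
```

Run it with the same interpreter that MonoClaw uses, otherwise the result reflects a different environment.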

CLI voice mode

Enable

/voice on

Record

Press Ctrl+B to start recording, then press Ctrl+B again to stop.

Mona will:

  1. Transcribe your audio (locally, no cloud)
  2. Process your request
  3. Speak the response (using your configured TTS provider)
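
The three steps above can be pictured as a small pipeline. This is an illustrative sketch only, not MonoClaw's actual code: `run_voice_turn` and its three callable parameters are hypothetical names, and the stand-in lambdas replace the real STT, agent, and TTS components.

```python
from typing import Callable


def run_voice_turn(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    respond: Callable[[str], str],
    speak: Callable[[str], None],
) -> str:
    """Run one voice turn: transcribe, process, then speak the reply."""
    text = transcribe(audio)   # step 1: local speech-to-text
    reply = respond(text)      # step 2: process the request
    speak(reply)               # step 3: hand the reply to the TTS provider
    return reply


if __name__ == "__main__":
    # Stand-in callables so the sketch runs without audio hardware.
    spoken = []
    reply = run_voice_turn(
        b"\x00\x01",
        transcribe=lambda audio: "what time is it",
        respond=lambda text: f"You asked: {text}",
        speak=spoken.append,
    )
    print(reply)
```

Each step waits for the previous one to finish, which matches the non-streaming behavior described under Limitations.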

Disable

/voice off

Telegram voice messages

Send a voice message to Mona on Telegram. She will:

  1. Download the voice note
  2. Transcribe it locally
  3. Reply with text (or voice, if TTS is enabled)

No special setup needed — voice messages work out of the box.

Discord voice channels

Mona can join Discord voice channels:

  1. Invite Mona to a voice channel
  2. Speak naturally — she listens and responds
  3. Use /voice leave to disconnect her

Voice channel commands

Command          Action
/voice join      Join the voice channel
/voice leave     Leave the voice channel
/voice mute      Stop listening
/voice unmute    Resume listening

TTS configuration

Voice mode uses your configured TTS provider for responses:

# ~/.monoclaw/config.yaml
tts:
  provider: edge-tts    # edge-tts | elevenlabs
  edge-tts:
    voice: "zh-HK-HiuMaanNeural"
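
As a rough illustration of how these settings could be consumed, the sketch below reads the provider and its settings from an already-parsed config mapping, then synthesizes speech with the `edge-tts` Python package. The helper names `resolve_tts` and `speak_with_edge_tts` are hypothetical, and how MonoClaw itself calls the provider is not documented here. `edge_tts.Communicate(...).save(...)` is the package's own API and needs network access.

```python
# The two provider values named in the config comment above.
SUPPORTED_PROVIDERS = ("edge-tts", "elevenlabs")


def resolve_tts(config: dict) -> tuple:
    """Return (provider, provider_settings) from a parsed config mapping."""
    tts = config.get("tts", {})
    provider = tts.get("provider")
    if provider not in SUPPORTED_PROVIDERS:
        raise ValueError(f"Unsupported tts.provider: {provider!r}")
    return provider, tts.get(provider, {})


async def speak_with_edge_tts(text: str, voice: str, out_path: str) -> None:
    """Synthesize text to an audio file with the edge-tts package."""
    import edge_tts  # lazy import: requires the edge-tts package and network access

    await edge_tts.Communicate(text, voice).save(out_path)
```

With the config shown above, `resolve_tts` would return the provider `edge-tts` together with its `voice` setting, which can then be passed to `speak_with_edge_tts` through `asyncio.run`.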

For higher quality, use ElevenLabs:

uv pip install -e ".[tts-premium]"
monoclaw config set tts.provider elevenlabs

STT configuration

Speech-to-text uses faster-whisper locally:

voice:
  stt:
    model: "base"        # tiny | base | small | medium | large
    language: "auto"     # auto-detect or specify (en, zh, etc.)

Larger models are more accurate but slower and use more memory.
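
For reference, this is roughly what the two settings correspond to when faster-whisper is used directly. It is a sketch under assumptions: the helper names are hypothetical, and it assumes MonoClaw passes the model name through unchanged and treats "auto" as "let faster-whisper detect the language", which in the library means `language=None`.

```python
def transcribe_options(stt: dict) -> dict:
    """Map the voice.stt settings to keyword arguments for transcription."""
    language = stt.get("language", "auto")
    # faster-whisper detects the language itself when language is None.
    return {"language": None if language == "auto" else language}


def transcribe_file(path: str, stt: dict) -> str:
    """Transcribe an audio file locally with faster-whisper."""
    from faster_whisper import WhisperModel  # lazy import: needs the voice extra

    model = WhisperModel(stt.get("model", "base"))
    segments, _info = model.transcribe(path, **transcribe_options(stt))
    # segments is a generator; transcription runs as it is consumed.
    return " ".join(segment.text.strip() for segment in segments)
```

The first call to `WhisperModel` downloads the model weights if they are not cached, so the initial run with a new model size takes longer than later ones.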

Best practices

  • Use a good microphone — Reduces transcription errors
  • Speak clearly — Faster-whisper handles accents well but clarity helps
  • Keep messages short — 30 seconds or less for best results
  • Use in quiet environments — Background noise reduces accuracy
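
The 30-second guideline is easy to check before sending a recording. The sketch below uses only the Python standard library and applies to WAV files only. The function names and the `MAX_SECONDS` constant are hypothetical and are not MonoClaw settings.

```python
import wave

MAX_SECONDS = 30.0  # the guideline from the list above


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_within_guideline(path: str, limit: float = MAX_SECONDS) -> bool:
    """True if the recording is no longer than the recommended length."""
    return wav_duration_seconds(path) <= limit
```

Other containers, such as the OGG voice notes Telegram produces, would need a different reader.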

Limitations

  • Local only — STT runs on your machine; no cloud processing
  • English/Chinese best — Other languages may have lower accuracy
  • Resource usage — Medium/large models use 2–8GB RAM
  • No real-time streaming — Records full message before processing

Troubleshooting

Problem                       Fix
"Voice mode not available"    Install the voice extra
"No audio input detected"     Check microphone permissions
Transcription is poor         Try a larger whisper model
High latency                  Use base model instead of large
Discord voice not working     Ensure Mona has voice channel permissions
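
When "No audio input detected" appears, it can help to see which input devices the audio library can find. The sketch below is a diagnostic under assumptions, not a MonoClaw command: the helper names are hypothetical, while `sounddevice.query_devices()` and the `max_input_channels` field are part of the sounddevice library.

```python
def input_devices(devices) -> list:
    """Return names of devices that expose at least one input channel."""
    return [d["name"] for d in devices if d.get("max_input_channels", 0) > 0]


def list_microphones() -> list:
    """Query PortAudio through sounddevice and list usable inputs."""
    import sounddevice as sd  # lazy import: needs the voice extra

    return input_devices(sd.query_devices())


if __name__ == "__main__":
    names = list_microphones()
    print("Input devices:", names if names else "none found")
```

An empty list suggests the operating system is not exposing a microphone to the process, which usually points to a permissions or driver issue and not to MonoClaw itself.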