MonoClaw

Local LLMs on Mac

Running models locally on your Mac keeps data private and eliminates API costs. Apple Silicon (M1/M2/M3/M4) is particularly well-suited for this.

Option 1: Ollama (easiest)

Installation

brew install ollama

Download a model

ollama pull gemma4:e4b
ollama pull qwen3.5:8b
ollama pull gemma4:26b-moe

Start the server

ollama serve

Configure MonoClaw

monoclaw config set model.custom.endpoint "http://localhost:11434/v1"
monoclaw config set model.custom.api_key "ollama"
monoclaw config set model.custom.name "gemma4:e4b"
monoclaw config set model.custom.context_length 65536

Option 2: llama.cpp (most control)

Installation

brew install llama.cpp

Download a GGUF model

Download from Hugging Face:

huggingface-cli download bartowski/Gemma-4-E4B-Instruct-GGUF --include "*Q4_K_M.gguf"

Start the server

llama-server \
  -m Gemma-4-E4B-Instruct-Q4_K_M.gguf \
  -c 65536 \
  --host 0.0.0.0 \
  --port 8080

Configure MonoClaw

monoclaw config set model.custom.endpoint "http://localhost:8080/v1"
monoclaw config set model.custom.api_key "none"

Option 3: MLX (Apple Silicon optimized)

Installation

pip install mlx-lm

Start the server

python -m mlx_lm.server --model mlx-community/Gemma-4-E4B-Instruct-4bit

Configure MonoClaw

monoclaw config set model.custom.endpoint "http://localhost:8080/v1"

Model recommendations for Mac

ModelSizeRAM neededSpeed (tok/s)Best for
Gemma 4 E4B2.8GB3.5GB60Default choice, lightning speed, low RAM impact
Qwen 3.5 8B4.7GB6.0GB55Excellent general coding and multi-step tasks
Gemma 4 26B MoE14.0GB15.5GB45Deep reasoning, runs fast (3.8B active params) but tight on RAM
Qwen 3.5 14B8.2GB9.5GB30High logic complexity, requires closing background apps

Benchmarks on M4 Mac (16GB unified RAM). Speed varies by quantization (recommending Q4_K_M).

Memory optimization

Quantization levels

LevelSize ratioQualityUse case
Q4_K_M~60%GoodMost use cases
Q5_K_M~70%Very goodCritical reasoning
Q6_K~80%ExcellentNear-original quality
Q8_0~90%BestMaximum quality

Running multiple models

Macs with unified memory can run multiple models simultaneously:

# Terminal 1
ollama serve
# Terminal 2
ollama pull gemma4:e4b
ollama pull qwen3.5:8b

Switch between them in MonoClaw:

/model local-gemma
/model local-qwen

Performance tips

  • Use Metal GPU — Ollama and MLX use Apple Silicon GPU automatically
  • Close other apps — Free up unified memory for the model
  • Lower context — Reduce -c if you don't need 64K context
  • Batch requests — Send multiple prompts in one request

Troubleshooting

ProblemFix
"Out of memory"Use a smaller model or Q4 quantization
Slow generationClose other apps; use MLX for best Apple Silicon performance
"Model not found"Verify the model name with ollama list
Garbled outputIncrease context size or use a larger quantization