Local LLMs on Mac
Running models locally on your Mac keeps data private and eliminates API costs. Apple Silicon (M1/M2/M3/M4) is particularly well-suited for this.
Option 1: Ollama (easiest)
Installation
brew install ollama
Download a model
ollama pull gemma4:e4b
ollama pull qwen3.5:8b
ollama pull gemma4:26b-moe
Start the server
ollama serve
Configure MonoClaw
monoclaw config set model.custom.endpoint "http://localhost:11434/v1"
monoclaw config set model.custom.api_key "ollama"
monoclaw config set model.custom.name "gemma4:e4b"
monoclaw config set model.custom.context_length 65536
Option 2: llama.cpp (most control)
Installation
brew install llama.cpp
Download a GGUF model
Download from Hugging Face:
huggingface-cli download bartowski/Gemma-4-E4B-Instruct-GGUF --include "*Q4_K_M.gguf"
Start the server
llama-server \
-m Gemma-4-E4B-Instruct-Q4_K_M.gguf \
-c 65536 \
--host 0.0.0.0 \
--port 8080
Configure MonoClaw
monoclaw config set model.custom.endpoint "http://localhost:8080/v1"
monoclaw config set model.custom.api_key "none"
Option 3: MLX (Apple Silicon optimized)
Installation
pip install mlx-lm
Start the server
python -m mlx_lm.server --model mlx-community/Gemma-4-E4B-Instruct-4bit
Configure MonoClaw
monoclaw config set model.custom.endpoint "http://localhost:8080/v1"
Model recommendations for Mac
| Model | Size | RAM needed | Speed (tok/s) | Best for |
|---|---|---|---|---|
| Gemma 4 E4B | 2.8GB | 3.5GB | 60 | Default choice, lightning speed, low RAM impact |
| Qwen 3.5 8B | 4.7GB | 6.0GB | 55 | Excellent general coding and multi-step tasks |
| Gemma 4 26B MoE | 14.0GB | 15.5GB | 45 | Deep reasoning, runs fast (3.8B active params) but tight on RAM |
| Qwen 3.5 14B | 8.2GB | 9.5GB | 30 | High logic complexity, requires closing background apps |
Benchmarks on M4 Mac (16GB unified RAM). Speed varies by quantization (recommending Q4_K_M).
Memory optimization
Quantization levels
| Level | Size ratio | Quality | Use case |
|---|---|---|---|
| Q4_K_M | ~60% | Good | Most use cases |
| Q5_K_M | ~70% | Very good | Critical reasoning |
| Q6_K | ~80% | Excellent | Near-original quality |
| Q8_0 | ~90% | Best | Maximum quality |
Running multiple models
Macs with unified memory can run multiple models simultaneously:
# Terminal 1
ollama serve
# Terminal 2
ollama pull gemma4:e4b
ollama pull qwen3.5:8b
Switch between them in MonoClaw:
/model local-gemma
/model local-qwen
Performance tips
- Use Metal GPU — Ollama and MLX use Apple Silicon GPU automatically
- Close other apps — Free up unified memory for the model
- Lower context — Reduce
-cif you don't need 64K context - Batch requests — Send multiple prompts in one request
Troubleshooting
| Problem | Fix |
|---|---|
| "Out of memory" | Use a smaller model or Q4 quantization |
| Slow generation | Close other apps; use MLX for best Apple Silicon performance |
| "Model not found" | Verify the model name with ollama list |
| Garbled output | Increase context size or use a larger quantization |