Local LLMs on Mac

Running models locally on your Mac keeps data private and eliminates API costs. Apple Silicon (M1/M2/M3/M4) is particularly well-suited for this.

Option 1: Ollama (easiest)

Installation

brew install ollama

Download a model

ollama pull gemma4:e4b
ollama pull qwen3.5:8b
ollama pull gemma4:26b-moe

Start the server

ollama serve

Configure MonoClaw

monoclaw config set model.custom.endpoint "http://localhost:11434/v1"
monoclaw config set model.custom.api_key "ollama"
monoclaw config set model.custom.name "gemma4:e4b"
monoclaw config set model.custom.context_length 65536

Option 2: llama.cpp (most control)

Installation

brew install llama.cpp

Download a GGUF model

Download from Hugging Face:

huggingface-cli download bartowski/Gemma-4-E4B-Instruct-GGUF --include "*Q4_K_M.gguf"

Start the server

llama-server \
  -m Gemma-4-E4B-Instruct-Q4_K_M.gguf \
  -c 65536 \
  --host 0.0.0.0 \
  --port 8080

Configure MonoClaw

monoclaw config set model.custom.endpoint "http://localhost:8080/v1"
monoclaw config set model.custom.api_key "none"

Option 3: MLX (Apple Silicon optimized)

Installation

pip install mlx-lm

Start the server

python -m mlx_lm.server --model mlx-community/Gemma-4-E4B-Instruct-4bit

Configure MonoClaw

monoclaw config set model.custom.endpoint "http://localhost:8080/v1"

Model recommendations for Mac

Model	Size	RAM needed	Speed (tok/s)	Best for
Gemma 4 E4B	2.8GB	3.5GB	60	Default choice, lightning speed, low RAM impact
Qwen 3.5 8B	4.7GB	6.0GB	55	Excellent general coding and multi-step tasks
Gemma 4 26B MoE	14.0GB	15.5GB	45	Deep reasoning, runs fast (3.8B active params) but tight on RAM
Qwen 3.5 14B	8.2GB	9.5GB	30	High logic complexity, requires closing background apps

Benchmarks on M4 Mac (16GB unified RAM). Speed varies by quantization (recommending Q4_K_M).

Memory optimization

Quantization levels

Level	Size ratio	Quality	Use case
Q4_K_M	~60%	Good	Most use cases
Q5_K_M	~70%	Very good	Critical reasoning
Q6_K	~80%	Excellent	Near-original quality
Q8_0	~90%	Best	Maximum quality

Running multiple models

Macs with unified memory can run multiple models simultaneously:

# Terminal 1
ollama serve
# Terminal 2
ollama pull gemma4:e4b
ollama pull qwen3.5:8b

Switch between them in MonoClaw:

/model local-gemma
/model local-qwen

Performance tips

Use Metal GPU — Ollama and MLX use Apple Silicon GPU automatically
Close other apps — Free up unified memory for the model
Lower context — Reduce -c if you don't need 64K context
Batch requests — Send multiple prompts in one request

Troubleshooting

Problem	Fix
"Out of memory"	Use a smaller model or Q4 quantization
Slow generation	Close other apps; use MLX for best Apple Silicon performance
"Model not found"	Verify the model name with `ollama list`
Garbled output	Increase context size or use a larger quantization