LM Studio Guide

LM Studio is an offline-first LLM runner designed to run Llama, DeepSeek, Qwen, Gemma, and other open-source models completely locally on your computer. Since Apple Silicon (such as the M4 chip) features high-bandwidth unified memory, it is highly optimized for local model execution.

This guide covers how to set up LM Studio, configure its local server to connect with MonoClaw, select the best models for an M4 Mac with 16GB unified RAM, and understand the critical parameters when researching local LLMs.

1. Installation and Setup

LM Studio can be run either as a desktop application (GUI) or as a headless CLI service via llmster.

Option A: Desktop Application (GUI)

Download the macOS installer from the LM Studio website.
Drag and drop the application into your Applications folder.
Open LM Studio.

Option B: Headless Service (CLI)

For headless setups, remote servers, or command-line enthusiasts, LM Studio provides a core background daemon called llmster.

Install the command-line tool lms:

curl -fsSL https://lmstudio.ai/install.sh | bash

Once installed, restart your terminal or reload your path, then control the daemon:

lms daemon up          # Starts the LM Studio background daemon
lms server start       # Starts the local server on the default port (1234)

2. Local Server Configuration

To act as a model provider for MonoClaw, LM Studio runs a local server that exposes an OpenAI-compatible API.

Starting the Server in the GUI

Open LM Studio.
Click on the Developer / Local Server tab (represented by a double arrow or terminal-like icon <-> on the left sidebar).
Select a model from the top dropdown to load it into memory.
Set the port to 1234 (the LM Studio default).
Click Start Server.

Connecting MonoClaw

MonoClaw treats LM Studio as a custom endpoint. Configure MonoClaw to route its queries locally by running the following commands in your terminal:

monoclaw config set model.custom.endpoint "http://localhost:1234/v1"
monoclaw config set model.custom.api_key "lm-studio"
monoclaw config set model.custom.name "gemma-4-e4b"
monoclaw config set model.custom.context_length 65536

Once configured, tell MonoClaw to use the custom local endpoint:

monoclaw config set model gemma-4-e4b

3. Optimizing for M4 Macs (16GB Unified RAM)

Apple Silicon Macs use Unified Memory Architecture (UMA), meaning the CPU and GPU share the same pool of RAM. Out of 16GB of unified RAM:

macOS and system services consume roughly 3GB to 4GB.
This leaves 10GB to 12GB of RAM comfortably free for loading local LLM weights and maintaining the context memory (KV cache).

Recommended Models for M4 (16GB RAM)

Model Name	Param Count	File Size	Memory Footprint (Weights)	Context Potential	Performance on M4	Best For
Gemma 4 E4B Instruct	4.5 Billion	~2.8 GB	~3.3 GB	256K tokens	Blistering (60+ tok/s)	Default choice (Pre-provisioned): Coding, agent workflows, highly optimized context headroom
Qwen 3.5 8B Instruct	8.0 Billion	~4.7 GB	~5.5 GB	128K tokens	Blistering (55+ tok/s)	All-round task execution, strong coding
Gemma 4 26B MoE	25.2B (3.8B Active)	~14.0 GB	~15.5 GB (weights + KV cache)	256K tokens	Responsive (45 tok/s)	MoE power choice: high intelligence, tight RAM but runs very fast due to active routing
Qwen 3.5 14B Instruct	14.0 Billion	~8.2 GB	~9.5 GB	128K tokens	Moderate (30 tok/s)	Power choice: heavy logic, requires closing all other heavy apps

The Pre-Provisioned Choice: Gemma 4 E4B

We pre-provision Gemma 4 E4B at Q4_K_M for our users on their M4 Macs. It delivers an outstanding balance because:

Its smaller memory footprint (~3.3GB) leaves plenty of RAM (over 7GB) for your browser, system, and other active processes.
It executes at over 60 tokens per second on the M4 GPU.
It fully supports MonoClaw's 64K context window without risk of system swap or out-of-memory crashes.

Meta Llama 4 Scale Constraints

Meta's 2025 Llama 4 models — including Llama 4 Scout (109B) and Llama 4 Maverick (400B MoE) — are designed for large-scale enterprise infrastructure. Even with aggressive quantization, they are far too massive to run on a 16GB Mac. Attempting to load them will result in failure or extreme slowdowns as the system relies on swap memory. Keep your local choices restricted to Gemma 4 and Qwen 3.5.

Gemma 4 Thinking Mode

Gemma 4 introduces a highly capable native Thinking Mode. When engaged in complex reasoning or code planning, it outputs step-by-step cognitive reasoning inside <thought> tags before generating the final answer. The MonoClaw runtime is fully compatible with this behavior: it automatically parses, renders, and non-destructively strips these <thought> sequences in your chat window so you get clean, elegant, and highly structured final outputs.

Memory Headroom Warning (for 14B+ and MoE Models)

Running models larger than 9B parameters (such as Qwen 3.5 14B or the Gemma 4 26B MoE) on a 16GB Mac is possible, but it pushes the hardware near its limits.

You must close memory-heavy desktop applications (such as Google Chrome, Slack, Docker Desktop, or Xcode) before loading the model in LM Studio.
If the system runs out of physical RAM, macOS will begin paging memory to your SSD (swap), which will drastically slow model inference speeds to less than 2-3 tokens per second.

4. Key LLM Parameters and Terms to Know

When searching Hugging Face or downloading models in LM Studio, pay close attention to the following parameters to ensure the model will work properly with your Mac and MonoClaw:

Parameter Count (e.g., 7B, 8B, 14B, 70B)

What it is: The number of weights or connections in the neural network (in billions).
Why it matters: Higher parameter counts generally yield higher reasoning capabilities, better logic, and stronger coding performance. However, they also require proportional increases in RAM and compute power.
M4 16GB Rule: Keep models in the 4B to 8B range for optimal daily use with ample system headroom (such as Gemma 4 E4B or Qwen 3.5 8B).

Quantization (e.g., FP16, Q8_0, Q4_K_M)

What it is: A compression process that lowers the numeric precision of the model's weights (e.g., from 16-bit floating point down to 4-bit or 5-bit integers).
Why it matters: This reduces the file size and RAM requirement by up to 70% with negligible loss in reasoning ability.
Common Levels:
- FP16 / BF16 (Uncompressed): Extremely heavy. An 8B model requires 16GB of RAM just for weights. Do not run unquantized models on a 16GB Mac.
- Q8_0 (8-bit): Very high fidelity, but heavy. An 8B model requires ~8.5GB of RAM.
- Q4_K_M (4-bit Medium): The recommended sweet spot. It offers a near-perfect balance, retaining almost all of the base model's intelligence while keeping the memory footprint under 5GB for 8B models.
- Q3_K_M (3-bit): Compresses models further to fit larger parameter sizes, but you will notice a degradation in formatting, logic, and instruction following.

Instruct vs. Base Models

Instruct (or Chat) Models: Fine-tuned on instruction-following datasets. They understand prompts, answer questions, and converse like an assistant (e.g., Gemma-4-E4B-Instruct).
Base (or Foundation) Models: Trained only to predict the next word in a sentence (e.g., Gemma-4-E4B). They do not understand chat instructions and will repeatedly echo your prompt or output incoherent text.
Mona Requirement: Always use "Instruct" or "Chat" models. MonoClaw will not work with Base models.

Context Window (n_ctx)

What it is: The maximum number of tokens (words and code blocks) the model can read and remember in a single session.
MonoClaw Hard Invariant: MonoClaw requires a minimum context window of 64,000 (64K) tokens.
Setting Context in LM Studio:
1. Go to the local server settings on the right panel in LM Studio.
2. Under Hardware Settings / Context Window, set the value of n_ctx to 65536.
3. Ensure your downloaded model natively supports a high context window (e.g., Gemma 4 supports up to 256K context and Qwen 3.5 supports up to 128K context, making them ideal).
Memory Note: A larger context window reserves more memory for active session history (known as the KV Cache). Setting n_ctx to 64K will consume an additional 1.5GB to 2GB of RAM during active conversations, which is why choosing a Q4_K_M quantized Gemma 4 E4B or Qwen 3.5 8B model is critical to leaving enough memory headroom.

5. Troubleshooting Local Connections

Error: "Out of Memory" or Slow Generation

Symptom: Model takes minutes to reply, or LM Studio crashes when loading.
Fix: You have exceeded your Mac's physical RAM. Close other desktop applications, unload the model, and download a smaller variant (e.g., transition from Q8_0 to Q4_K_M, or choose a 4.5B model like Gemma 4 E4B instead of a 14B model).

Error: "Minimum context of 64K tokens required"

Symptom: MonoClaw rejects the connection during startup.
Fix: LM Studio's default server context is often set to 2048 or 4096. In LM Studio's right-hand configuration panel, search for Context Window or n_ctx and change it to 65536. Restart the server.

Error: "Connection Refused"

Symptom: MonoClaw cannot connect to the endpoint.
Fix: Ensure the local server is actively running in LM Studio. Double-check that the port in LM Studio matches the port in the MonoClaw configuration command (usually 1234 for LM Studio, whereas ollama defaults to 11434 and llama.cpp to 8080). You can test the server in your terminal by running:
```
curl http://localhost:1234/v1/models
```
If it returns a JSON list of loaded models, the server is running correctly.