LM Studio Guide
LM Studio is an offline-first LLM runner designed to run Llama, DeepSeek, Qwen, Gemma, and other open-source models completely locally on your computer. Since Apple Silicon (such as the M4 chip) features high-bandwidth unified memory, it is highly optimized for local model execution.
This guide covers how to set up LM Studio, configure its local server to connect with MonoClaw, select the best models for an M4 Mac with 16GB unified RAM, and understand the critical parameters when researching local LLMs.
1. Installation and Setup
LM Studio can be run either as a desktop application (GUI) or as a headless CLI service via llmster.
Option A: Desktop Application (GUI)
- Download the macOS installer from the LM Studio website.
- Drag and drop the application into your
Applicationsfolder. - Open LM Studio.
Option B: Headless Service (CLI)
For headless setups, remote servers, or command-line enthusiasts, LM Studio provides a core background daemon called llmster.
Install the command-line tool lms:
curl -fsSL https://lmstudio.ai/install.sh | bash
Once installed, restart your terminal or reload your path, then control the daemon:
lms daemon up # Starts the LM Studio background daemon
lms server start # Starts the local server on the default port (1234)
2. Local Server Configuration
To act as a model provider for MonoClaw, LM Studio runs a local server that exposes an OpenAI-compatible API.
Starting the Server in the GUI
- Open LM Studio.
- Click on the Developer / Local Server tab (represented by a double arrow or terminal-like icon
<->on the left sidebar). - Select a model from the top dropdown to load it into memory.
- Set the port to
1234(the LM Studio default). - Click Start Server.
Connecting MonoClaw
MonoClaw treats LM Studio as a custom endpoint. Configure MonoClaw to route its queries locally by running the following commands in your terminal:
monoclaw config set model.custom.endpoint "http://localhost:1234/v1"
monoclaw config set model.custom.api_key "lm-studio"
monoclaw config set model.custom.name "gemma-4-e4b"
monoclaw config set model.custom.context_length 65536
Once configured, tell MonoClaw to use the custom local endpoint:
monoclaw config set model gemma-4-e4b
3. Optimizing for M4 Macs (16GB Unified RAM)
Apple Silicon Macs use Unified Memory Architecture (UMA), meaning the CPU and GPU share the same pool of RAM. Out of 16GB of unified RAM:
- macOS and system services consume roughly 3GB to 4GB.
- This leaves 10GB to 12GB of RAM comfortably free for loading local LLM weights and maintaining the context memory (KV cache).
Recommended Models for M4 (16GB RAM)
| Model Name | Param Count | File Size | Memory Footprint (Weights) | Context Potential | Performance on M4 | Best For |
|---|---|---|---|---|---|---|
| Gemma 4 E4B Instruct | 4.5 Billion | ~2.8 GB | ~3.3 GB | 256K tokens | Blistering (60+ tok/s) | Default choice (Pre-provisioned): Coding, agent workflows, highly optimized context headroom |
| Qwen 3.5 8B Instruct | 8.0 Billion | ~4.7 GB | ~5.5 GB | 128K tokens | Blistering (55+ tok/s) | All-round task execution, strong coding |
| Gemma 4 26B MoE | 25.2B (3.8B Active) | ~14.0 GB | ~15.5 GB (weights + KV cache) | 256K tokens | Responsive (45 tok/s) | MoE power choice: high intelligence, tight RAM but runs very fast due to active routing |
| Qwen 3.5 14B Instruct | 14.0 Billion | ~8.2 GB | ~9.5 GB | 128K tokens | Moderate (30 tok/s) | Power choice: heavy logic, requires closing all other heavy apps |
The Pre-Provisioned Choice: Gemma 4 E4B
We pre-provision Gemma 4 E4B at Q4_K_M for our users on their M4 Macs. It delivers an outstanding balance because:
- Its smaller memory footprint (~3.3GB) leaves plenty of RAM (over 7GB) for your browser, system, and other active processes.
- It executes at over 60 tokens per second on the M4 GPU.
- It fully supports MonoClaw's 64K context window without risk of system swap or out-of-memory crashes.
Meta Llama 4 Scale Constraints
Meta's 2025 Llama 4 models — including Llama 4 Scout (109B) and Llama 4 Maverick (400B MoE) — are designed for large-scale enterprise infrastructure. Even with aggressive quantization, they are far too massive to run on a 16GB Mac. Attempting to load them will result in failure or extreme slowdowns as the system relies on swap memory. Keep your local choices restricted to Gemma 4 and Qwen 3.5.
Gemma 4 Thinking Mode
Gemma 4 introduces a highly capable native Thinking Mode. When engaged in complex reasoning or code planning, it outputs step-by-step cognitive reasoning inside <thought> tags before generating the final answer. The MonoClaw runtime is fully compatible with this behavior: it automatically parses, renders, and non-destructively strips these <thought> sequences in your chat window so you get clean, elegant, and highly structured final outputs.
Memory Headroom Warning (for 14B+ and MoE Models)
Running models larger than 9B parameters (such as Qwen 3.5 14B or the Gemma 4 26B MoE) on a 16GB Mac is possible, but it pushes the hardware near its limits.
- You must close memory-heavy desktop applications (such as Google Chrome, Slack, Docker Desktop, or Xcode) before loading the model in LM Studio.
- If the system runs out of physical RAM, macOS will begin paging memory to your SSD (swap), which will drastically slow model inference speeds to less than 2-3 tokens per second.
4. Key LLM Parameters and Terms to Know
When searching Hugging Face or downloading models in LM Studio, pay close attention to the following parameters to ensure the model will work properly with your Mac and MonoClaw:
Parameter Count (e.g., 7B, 8B, 14B, 70B)
- What it is: The number of weights or connections in the neural network (in billions).
- Why it matters: Higher parameter counts generally yield higher reasoning capabilities, better logic, and stronger coding performance. However, they also require proportional increases in RAM and compute power.
- M4 16GB Rule: Keep models in the 4B to 8B range for optimal daily use with ample system headroom (such as Gemma 4 E4B or Qwen 3.5 8B).
Quantization (e.g., FP16, Q8_0, Q4_K_M)
- What it is: A compression process that lowers the numeric precision of the model's weights (e.g., from 16-bit floating point down to 4-bit or 5-bit integers).
- Why it matters: This reduces the file size and RAM requirement by up to 70% with negligible loss in reasoning ability.
- Common Levels:
- FP16 / BF16 (Uncompressed): Extremely heavy. An 8B model requires 16GB of RAM just for weights. Do not run unquantized models on a 16GB Mac.
- Q8_0 (8-bit): Very high fidelity, but heavy. An 8B model requires ~8.5GB of RAM.
- Q4_K_M (4-bit Medium): The recommended sweet spot. It offers a near-perfect balance, retaining almost all of the base model's intelligence while keeping the memory footprint under 5GB for 8B models.
- Q3_K_M (3-bit): Compresses models further to fit larger parameter sizes, but you will notice a degradation in formatting, logic, and instruction following.
Instruct vs. Base Models
- Instruct (or Chat) Models: Fine-tuned on instruction-following datasets. They understand prompts, answer questions, and converse like an assistant (e.g.,
Gemma-4-E4B-Instruct). - Base (or Foundation) Models: Trained only to predict the next word in a sentence (e.g.,
Gemma-4-E4B). They do not understand chat instructions and will repeatedly echo your prompt or output incoherent text. - Mona Requirement: Always use "Instruct" or "Chat" models. MonoClaw will not work with Base models.
Context Window (n_ctx)
- What it is: The maximum number of tokens (words and code blocks) the model can read and remember in a single session.
- MonoClaw Hard Invariant: MonoClaw requires a minimum context window of 64,000 (64K) tokens.
- Setting Context in LM Studio:
- Go to the local server settings on the right panel in LM Studio.
- Under Hardware Settings / Context Window, set the value of
n_ctxto65536. - Ensure your downloaded model natively supports a high context window (e.g., Gemma 4 supports up to 256K context and Qwen 3.5 supports up to 128K context, making them ideal).
- Memory Note: A larger context window reserves more memory for active session history (known as the KV Cache). Setting
n_ctxto 64K will consume an additional 1.5GB to 2GB of RAM during active conversations, which is why choosing a Q4_K_M quantized Gemma 4 E4B or Qwen 3.5 8B model is critical to leaving enough memory headroom.
5. Troubleshooting Local Connections
Error: "Out of Memory" or Slow Generation
- Symptom: Model takes minutes to reply, or LM Studio crashes when loading.
- Fix: You have exceeded your Mac's physical RAM. Close other desktop applications, unload the model, and download a smaller variant (e.g., transition from Q8_0 to Q4_K_M, or choose a 4.5B model like Gemma 4 E4B instead of a 14B model).
Error: "Minimum context of 64K tokens required"
- Symptom: MonoClaw rejects the connection during startup.
- Fix: LM Studio's default server context is often set to
2048or4096. In LM Studio's right-hand configuration panel, search forContext Windoworn_ctxand change it to65536. Restart the server.
Error: "Connection Refused"
- Symptom: MonoClaw cannot connect to the endpoint.
- Fix: Ensure the local server is actively running in LM Studio. Double-check that the port in LM Studio matches the port in the MonoClaw configuration command (usually
1234for LM Studio, whereasollamadefaults to11434andllama.cppto8080). You can test the server in your terminal by running:If it returns a JSON list of loaded models, the server is running correctly.curl http://localhost:1234/v1/models