RL Training Stack
MonoClaw includes a reinforcement learning (RL) training pipeline for fine-tuning models based on agent trajectories. This is powered by Atropos, Tinker, and Weights & Biases.
When to use RL training
- You want Mona to learn from her own successful trajectories
- You need a model specialized for your specific workflows
- You want to improve tool-calling accuracy
- You're building a custom agent for a domain
Installation
The RL extra is not included in [all] by default. Install it explicitly:
cd ~/.monoclaw/monoclaw-runtime
uv pip install -e ".[rl]"
This installs:
atroposlib— RL training librarytinker— Trajectory processingwandb— Experiment trackingfastapi+uvicorn— Training server
Components
Atropos
Atropos is the RL training engine. It implements:
- PPO (Proximal Policy Optimization)
- GRPO (Group Relative Policy Optimization)
- DPO (Direct Preference Optimization)
Tinker
Tinker processes agent trajectories into training data:
- Extracts tool calls and outcomes
- Labels successful vs failed trajectories
- Generates preference pairs for DPO
Weights & Biases
W&B tracks experiments:
- Loss curves
- Reward metrics
- Model checkpoints
- Hyperparameter sweeps
Quick start
1. Generate training data
Run batch jobs to generate trajectories:
monoclaw batch run \
--input prompts/training-set.jsonl \
--prompt "Solve the problem step by step" \
--output data/trajectories.jsonl \
--save-tool-calls true
2. Process trajectories
python -m tinker.process \
--input data/trajectories.jsonl \
--output data/training-data.jsonl \
--filter-success-only
3. Start training
python -m atropos.train \
--model unsloth/llama-3-8b \
--data data/training-data.jsonl \
--method grpo \
--output models/monoclaw-rl-v1
4. Track with W&B
export WANDB_API_KEY="your-key"
python -m atropos.train --wandb-project monoclaw-rl
Configuration
# ~/.monoclaw/config.yaml
rl:
training:
batch_size: 4
learning_rate: 5e-5
max_length: 8192
save_steps: 100
wandb:
project: "monoclaw-rl"
entity: "your-team"
Using the trained model
After training, use your model in MonoClaw:
monoclaw config set model.custom.endpoint "http://localhost:8000/v1"
monoclaw config set model.custom.name "monoclaw-rl-v1"
Hardware requirements
| Model size | VRAM needed | Hardware |
|---|---|---|
| 7B | 16GB | RTX 4090 / A100 40GB |
| 13B | 32GB | A100 40GB / 2x RTX 4090 |
| 70B | 80GB | A100 80GB / 2x A100 40GB |
For smaller GPUs, use quantization (Q4, Q8) or LoRA.
Best practices
- Start with quality data — Filter for successful trajectories only
- Use GRPO for tool-calling — It works well for structured outputs
- Track everything — Use W&B to compare experiments
- Validate frequently — Test the model on held-out prompts
- Iterate gradually — Small models fine-tune faster and cheaper
Troubleshooting
| Problem | Fix |
|---|---|
| "Out of memory" | Reduce batch size or use quantization |
| "NaN loss" | Lower learning rate, check data quality |
| "W&B not logging" | Verify WANDB_API_KEY is set |
| Slow training | Use flash attention, gradient checkpointing |