RL Training Stack

MonoClaw includes a reinforcement learning (RL) training pipeline for fine-tuning models based on agent trajectories. This is powered by Atropos, Tinker, and Weights & Biases.

When to use RL training

You want Mona to learn from her own successful trajectories
You need a model specialized for your specific workflows
You want to improve tool-calling accuracy
You're building a custom agent for a domain

Installation

The RL extra is not included in [all] by default. Install it explicitly:

cd ~/.monoclaw/monoclaw-runtime
uv pip install -e ".[rl]"

This installs:

atroposlib — RL training library
tinker — Trajectory processing
wandb — Experiment tracking
fastapi + uvicorn — Training server

Components

Atropos

Atropos is the RL training engine. It implements:

PPO (Proximal Policy Optimization)
GRPO (Group Relative Policy Optimization)
DPO (Direct Preference Optimization)

Tinker

Tinker processes agent trajectories into training data:

Extracts tool calls and outcomes
Labels successful vs failed trajectories
Generates preference pairs for DPO

Weights & Biases

W&B tracks experiments:

Loss curves
Reward metrics
Model checkpoints
Hyperparameter sweeps

Quick start

1. Generate training data

Run batch jobs to generate trajectories:

monoclaw batch run \
  --input prompts/training-set.jsonl \
  --prompt "Solve the problem step by step" \
  --output data/trajectories.jsonl \
  --save-tool-calls true

2. Process trajectories

python -m tinker.process \
  --input data/trajectories.jsonl \
  --output data/training-data.jsonl \
  --filter-success-only

3. Start training

python -m atropos.train \
  --model unsloth/llama-3-8b \
  --data data/training-data.jsonl \
  --method grpo \
  --output models/monoclaw-rl-v1

4. Track with W&B

export WANDB_API_KEY="your-key"
python -m atropos.train --wandb-project monoclaw-rl

Configuration

# ~/.monoclaw/config.yaml
rl:
  training:
    batch_size: 4
    learning_rate: 5e-5
    max_length: 8192
    save_steps: 100
  wandb:
    project: "monoclaw-rl"
    entity: "your-team"

Using the trained model

After training, use your model in MonoClaw:

monoclaw config set model.custom.endpoint "http://localhost:8000/v1"
monoclaw config set model.custom.name "monoclaw-rl-v1"

Hardware requirements

Model size	VRAM needed	Hardware
7B	16GB	RTX 4090 / A100 40GB
13B	32GB	A100 40GB / 2x RTX 4090
70B	80GB	A100 80GB / 2x A100 40GB

For smaller GPUs, use quantization (Q4, Q8) or LoRA.

Best practices

Start with quality data — Filter for successful trajectories only
Use GRPO for tool-calling — It works well for structured outputs
Track everything — Use W&B to compare experiments
Validate frequently — Test the model on held-out prompts
Iterate gradually — Small models fine-tune faster and cheaper

Troubleshooting

Problem	Fix
"Out of memory"	Reduce batch size or use quantization
"NaN loss"	Lower learning rate, check data quality
"W&B not logging"	Verify `WANDB_API_KEY` is set
Slow training	Use flash attention, gradient checkpointing