MonoClaw

RL Training Stack

MonoClaw includes a reinforcement learning (RL) training pipeline for fine-tuning models based on agent trajectories. This is powered by Atropos, Tinker, and Weights & Biases.

When to use RL training

  • You want Mona to learn from her own successful trajectories
  • You need a model specialized for your specific workflows
  • You want to improve tool-calling accuracy
  • You're building a custom agent for a domain

Installation

The RL extra is not included in [all] by default. Install it explicitly:

cd ~/.monoclaw/monoclaw-runtime
uv pip install -e ".[rl]"

This installs:

  • atroposlib — RL training library
  • tinker — Trajectory processing
  • wandb — Experiment tracking
  • fastapi + uvicorn — Training server

Components

Atropos

Atropos is the RL training engine. It implements:

  • PPO (Proximal Policy Optimization)
  • GRPO (Group Relative Policy Optimization)
  • DPO (Direct Preference Optimization)

Tinker

Tinker processes agent trajectories into training data:

  • Extracts tool calls and outcomes
  • Labels successful vs failed trajectories
  • Generates preference pairs for DPO

Weights & Biases

W&B tracks experiments:

  • Loss curves
  • Reward metrics
  • Model checkpoints
  • Hyperparameter sweeps

Quick start

1. Generate training data

Run batch jobs to generate trajectories:

monoclaw batch run \
  --input prompts/training-set.jsonl \
  --prompt "Solve the problem step by step" \
  --output data/trajectories.jsonl \
  --save-tool-calls true

2. Process trajectories

python -m tinker.process \
  --input data/trajectories.jsonl \
  --output data/training-data.jsonl \
  --filter-success-only

3. Start training

python -m atropos.train \
  --model unsloth/llama-3-8b \
  --data data/training-data.jsonl \
  --method grpo \
  --output models/monoclaw-rl-v1

4. Track with W&B

export WANDB_API_KEY="your-key"
python -m atropos.train --wandb-project monoclaw-rl

Configuration

# ~/.monoclaw/config.yaml
rl:
  training:
    batch_size: 4
    learning_rate: 5e-5
    max_length: 8192
    save_steps: 100
  wandb:
    project: "monoclaw-rl"
    entity: "your-team"

Using the trained model

After training, use your model in MonoClaw:

monoclaw config set model.custom.endpoint "http://localhost:8000/v1"
monoclaw config set model.custom.name "monoclaw-rl-v1"

Hardware requirements

Model sizeVRAM neededHardware
7B16GBRTX 4090 / A100 40GB
13B32GBA100 40GB / 2x RTX 4090
70B80GBA100 80GB / 2x A100 40GB

For smaller GPUs, use quantization (Q4, Q8) or LoRA.

Best practices

  • Start with quality data — Filter for successful trajectories only
  • Use GRPO for tool-calling — It works well for structured outputs
  • Track everything — Use W&B to compare experiments
  • Validate frequently — Test the model on held-out prompts
  • Iterate gradually — Small models fine-tune faster and cheaper

Troubleshooting

ProblemFix
"Out of memory"Reduce batch size or use quantization
"NaN loss"Lower learning rate, check data quality
"W&B not logging"Verify WANDB_API_KEY is set
Slow trainingUse flash attention, gradient checkpointing