Modal

Modal is a serverless compute platform that lets you run Mona on GPUs and CPUs in the cloud, scaling from zero to thousands of workers.

When to use Modal

You need GPU acceleration for local model inference
You want ephemeral, isolated execution environments
You want to scale concurrent agent sessions horizontally
You don't want to manage servers

Installation

The Modal extra is included in the [all] install. If you installed the minimal bundle:

cd ~/.monoclaw/monoclaw-runtime
uv pip install -e ".[modal]"

Authentication

Sign up at modal.com
Install the Modal CLI: pip install modal
Run modal token new to authenticate

Configure MonoClaw

Set Modal as your terminal backend:

monoclaw config set terminal.backend modal

Or configure in config.yaml:

terminal:
  backend: modal
  modal:
    app_name: "monoclaw-agent"
    gpu: "a10g"        # or "t4", "a100", "h100"
    cpu: 4
    memory: 16384      # MB
    timeout: 3600      # seconds

How it works

When Mona needs to run a command:

MonoClaw spins up a Modal sandbox
The command executes inside the sandbox
Output streams back to Mona in real time
The sandbox shuts down when idle (or stays warm if configured)

Cost optimization

Modal charges only for compute time. Tips to minimize costs:

Use keep_warm: 1 to keep one sandbox warm for fast responses
Use cheaper GPUs (t4) for light workloads
Set timeout low to prevent runaway processes
Use spot instances for non-critical tasks

Example: GPU-accelerated local model

Run a local model inside Modal:

model:
  default: "custom"
  custom:
    endpoint: "https://your-modal-app.modal.run/v1"
    api_key: "${MODAL_API_KEY}"

Deploy an OpenAI-compatible endpoint on Modal using vLLM or TGI, then point Mona at it.

Limitations

Cold start latency: 5–30 seconds for the first command if no sandbox is warm
Network: Sandboxes have internet access but no persistent local storage
Secrets: Pass secrets via environment variables, not command arguments