Run GPU workloads on Modal — training, fine-tuning, inference, batch processing. Zero-config serverless: no SSH, no Docker, auto scale-to-zero. Use when user says "modal run", "modal training", "modal inference", "deploy to modal", "need a GPU", "run on modal", "serverless GPU", or needs remote GPU compute.
Task: $ARGUMENTS
Modal is a serverless GPU cloud. Key advantages over SSH-based platforms (vast.ai, remote servers):
- Zero setup: `modal run` → done. No SSH keys, no Docker, no instance provisioning.
- Local-first: `modal run` from your laptop. Code, data, and results stay local; only the GPU function runs remotely.
- Reproducible environments: dependencies are declared in code via `modal.Image`, not system-level packages.
- Best for: users without a local GPU who need to debug CUDA code, run small-scale tests, or iterate quickly on experiments. The $5 free tier (no card) is enough for code debugging; $30 (with card) covers most small-scale experiment runs.
Trade-off: Modal costs more per GPU-hour than vast.ai or Lightning for some GPU tiers, but eliminates setup time and idle billing, often making it cheaper for short/medium workloads. For long training runs (>4 hours), consider vast.ai for lower $/hr.
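A rough sketch of that break-even logic (the rented-instance rate and setup overhead below are illustrative assumptions, not quoted prices):

```python
# Break-even sketch: serverless (billed only while running) vs. a rented
# instance (setup/idle time is billed too). Rates are illustrative:
# $2.10/hr ≈ Modal A100-40GB; $1.80/hr and 0.5 hr setup are assumptions.
def serverless_cost(run_hours, rate_hr=2.10):
    return run_hours * rate_hr

def rented_cost(run_hours, rate_hr=1.80, setup_hr=0.5):
    return (run_hours + setup_hr) * rate_hr

short = (serverless_cost(1.0), rented_cost(1.0))      # serverless cheaper
long_run = (serverless_cost(8.0), rented_cost(8.0))   # rented cheaper
```

Short jobs favor serverless because setup and idle time are never billed; past a few hours, the cheaper hourly rate wins.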
```bash
pip install modal
modal setup   # Opens browser login, writes token to ~/.modal.toml

# Verify credentials (should succeed; the app list may be empty):
modal app list
```
```bash
modal secret create huggingface-secret HF_TOKEN=hf_xxxxx
```

Recommended setup: bind a card to unlock $30/month, then immediately set a spending limit (e.g., $30) so you never exceed the free tier. Modal will pause your workloads when the limit is hit.
SECURITY WARNING: Always bind your card and set spending limits directly on https://modal.com/settings in your browser. NEVER enter payment information, card numbers, or billing details through Claude Code or any CLI tool. Only the official Modal website is safe for payment operations.
| GPU | $/sec | ≈$/hr | VRAM | Bandwidth GB/s | Free budget → hours |
|---|---|---|---|---|---|
| T4 | $0.000164 | $0.59 | 16GB | 300 | ~8.5 hr ($5) / 50.8 hr ($30) |
| L4 | $0.000222 | $0.80 | 24GB | 300 | ~6.3 hr / 37.5 hr |
| A10 | $0.000306 | $1.10 | 24GB | 600 | ~4.5 hr / 27.3 hr |
| L40S | $0.000542 | $1.95 | 48GB | 864 | ~2.6 hr / 15.4 hr |
| A100-40GB | $0.000583 | $2.10 | 40GB | 1555 | ~2.4 hr / 14.3 hr |
| A100-80GB | $0.000694 | $2.50 | 80GB | 2039 | ~2.0 hr / 12.0 hr |
| H100 | $0.001097 | $3.95 | 80GB | 3352 | ~1.3 hr / 7.6 hr |
| H200 | $0.001261 | $4.54 | 141GB | 4800 | ~1.1 hr / 6.6 hr |
| B200 | $0.001736 | $6.25 | 192GB | 8000 | ~0.8 hr / 4.8 hr |
CPU: $0.047/core/hr | RAM: $0.008/GiB/hr (GPU typically 90%+ of total cost)
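The derived columns in the table follow directly from the per-second rate; a quick sanity check:

```python
# Derive ≈$/hr and free-tier hours from the $/sec rate (matches the table above).
def per_hour(rate_sec):
    return rate_sec * 3600

def free_hours(rate_sec, budget):
    return budget / per_hour(rate_sec)

t4 = 0.000164                        # T4 rate from the table
print(round(per_hour(t4), 2))        # 0.59  ($/hr)
print(round(free_hours(t4, 5), 1))   # 8.5   hours on the $5 tier
print(round(free_hours(t4, 30), 1))  # 50.8  hours on the $30 tier
```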
Before EVERY run, estimate the cost and show it to the user for confirmation.
```
Cost estimate (Modal):
  Model: [name] ([params], [precision])
  VRAM: ~[X]GB (weights + KV cache + overhead)
  GPU: [type] ([VRAM]GB, $[X]/sec = $[X]/hr, bandwidth [X] GB/s)
  Estimate: ~[N] min, ~$[X]
```
| GPU | Speed (tok/s) | $/hr | Cost (1000 samples × 200 tok) | Duration |
|---|---|---|---|---|
| H100 | 224 | $3.95 | $0.98 | 15 min |
| A100-40GB | 104 | $2.10 | $1.12 | 32 min |
| L4 | 20 | $0.80 | $2.22 | 167 min |
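The rows above are plain throughput arithmetic: total tokens ÷ tok/s gives the duration, and duration × hourly rate gives the cost. For the H100 row:

```python
# 1000 samples × 200 generated tokens at a given throughput (tok/s).
def batch_cost(tok_per_s, rate_hr, n_samples=1000, tok_per_sample=200):
    seconds = n_samples * tok_per_sample / tok_per_s
    return seconds / 60, seconds / 3600 * rate_hr   # (minutes, dollars)

minutes, dollars = batch_cost(224, 3.95)   # H100 row from the table
print(round(minutes), round(dollars, 2))   # 15 0.98
```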
Same analysis as any GPU skill — determine VRAM needs from model size, pick GPU, estimate hours, calculate cost. See pricing table above.
VRAM Rules of Thumb:
| Model Size | FP16 VRAM | Recommended GPU |
|---|---|---|
| ≤3B | ~8GB | T4, L4 |
| 7-8B | ~22GB | L4, A10, A100-40GB |
| 13B | ~30GB | L40S, A100-40GB |
| 30B | ~65GB | A100-80GB, H100 |
| 70B | ~140GB | H100:2, H200 |
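These figures can be approximated with a simple rule: FP16 weights take 2 bytes per parameter, plus extra for KV cache, activations, and CUDA context. The 1.4 overhead factor and the GPU picker below are rough illustrative assumptions, not Modal guidance:

```python
def fp16_vram_gb(params_b, overhead=1.4):
    # 2 bytes/param in FP16, plus ~40% for KV cache / activations / CUDA context.
    return params_b * 2 * overhead

def pick_gpu(vram_needed_gb):
    # Smallest single GPU from the pricing table that fits the estimate.
    gpus = [("T4", 16), ("L4", 24), ("L40S", 48), ("A100-80GB", 80), ("H200", 141)]
    for name, vram in gpus:
        if vram >= vram_needed_gb:
            return name
    return "H100:2"  # nothing fits on one card: fall back to multi-GPU

print(round(fp16_vram_gb(8)))      # 22  (GB, for an 8B model)
print(pick_gpu(fp16_vram_gb(8)))   # L4  (24GB, per the table above)
```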
Based on the task type, generate the appropriate launcher script.
This is the most common pattern for run-experiment integration; it wraps an existing training script:
```python
import modal

app = modal.App("experiment-name")

image = modal.Image.debian_slim(python_version="3.11").pip_install(
    "torch", "transformers", "accelerate", "datasets", "wandb"
)

# Mount local project code into the container
local_code = modal.Mount.from_local_dir(".", remote_path="/workspace")

# Persistent volume for checkpoints and results
volume = modal.Volume.from_name("experiment-results", create_if_missing=True)

@app.function(
    image=image,
    gpu="A100-80GB",  # Chosen based on Step 1 analysis
    mounts=[local_code],
    volumes={"/results": volume},
    timeout=3600 * 6,  # 6 hours max
    secrets=[modal.Secret.from_name("wandb-secret")],  # Optional
)
def train():
    import subprocess
    subprocess.run(
        ["python", "train.py", "--output_dir", "/results/run_001"],
        cwd="/workspace",
        check=True,
    )
    volume.commit()  # Persist results to volume

@app.local_entrypoint()
def main():
    train.remote()
    print("Training complete. Results saved to Modal volume 'experiment-results'.")
```
Run: `modal run launcher.py`
```python
import modal

app = modal.App("inference-api")

# fastapi must be installed in the image for @modal.fastapi_endpoint to work
image = modal.Image.debian_slim(python_version="3.11").pip_install(
    "torch", "transformers", "accelerate", "fastapi[standard]"
)

@app.cls(image=image, gpu="L40S")
@modal.concurrent(max_inputs=10)
class InferenceAPI:
    @modal.enter()
    def load_model(self):
        # Runs once per container, not once per request
        from transformers import AutoModelForCausalLM, AutoTokenizer
        self.tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
        self.model = AutoModelForCausalLM.from_pretrained(
            "meta-llama/Llama-3.2-1B", device_map="auto"
        )

    @modal.fastapi_endpoint(method="POST")
    def generate(self, request: dict):
        inputs = self.tokenizer(request.get("prompt", ""), return_tensors="pt").to("cuda")
        outputs = self.model.generate(**inputs, max_new_tokens=256)
        return {"text": self.tokenizer.decode(outputs[0], skip_special_tokens=True)}
```
Deploy: `modal deploy app.py`
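After deploying, Modal prints the endpoint URL. A minimal client call might look like this; the URL below is a placeholder, use the one `modal deploy` prints for your workspace:

```python
import json
import urllib.request

# Placeholder URL — `modal deploy` prints the real one for your workspace.
URL = "https://<workspace>--inference-api-inferenceapi-generate.modal.run"

payload = json.dumps({"prompt": "Hello, Modal!"}).encode()
req = urllib.request.Request(
    URL, data=payload, headers={"Content-Type": "application/json"}
)
# resp = urllib.request.urlopen(req)          # uncomment to actually call it
# print(json.loads(resp.read())["text"])
```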
```python
import modal, subprocess

app = modal.App("vllm-server")

image = modal.Image.debian_slim(python_version="3.11").pip_install("vllm")
VOLUME = modal.Volume.from_name("model-cache", create_if_missing=True)
MODEL = "Qwen/Qwen3-4B"

@app.function(image=image, gpu="H100", volumes={"/models": VOLUME}, timeout=3600)
@modal.concurrent(max_inputs=100)
@modal.web_server(port=8000)
def serve():
    subprocess.Popen([
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", MODEL, "--download-dir", "/models", "--port", "8000",
    ])
```
```python
@app.function(image=image, gpu="T4", timeout=600)
def process_item(item: dict) -> dict:
    # ... process one item ...
    return {"result": "processed"}

@app.local_entrypoint()
def main():
    # Fan out over 1000 inputs; Modal auto-scales containers.
    results = list(process_item.map([{"id": i} for i in range(1000)]))
```
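Semantically, `.map` is an ordered parallel map over the inputs, with each call running in its own container. A plain-Python sketch of the same fan-out (a hypothetical local stand-in for the remote function):

```python
# What process_item.map(...) does semantically: apply the function to each
# input in parallel across containers, yielding results in input order.
def process_item_local(item: dict) -> dict:
    return {"id": item["id"], "result": "processed"}

items = [{"id": i} for i in range(1000)]
results = list(map(process_item_local, items))  # Modal runs this fan-out remotely
print(len(results), results[0])
```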
```python
@app.function(
    image=image, gpu="A100-80GB", volumes={"/output": volume},
    timeout=3600 * 6, secrets=[modal.Secret.from_name("huggingface-secret")],
)
def train():
    # ... transformers + peft + trl training code ...
    trainer.save_model("/output/final")
    volume.commit()
```
```python
@app.function(image=image, gpu="H100:4", volumes={"/output": volume}, timeout=3600 * 12)
def train_distributed():
    import subprocess
    subprocess.run(["accelerate", "launch", "--num_processes", "4",
                    "--mixed_precision", "bf16", "train.py"], check=True)
```
```bash
modal run launcher.py        # One-shot execution (most common for experiments)
modal deploy app.py          # Persistent service deployment
modal app list               # List running apps
modal app logs <app-name>    # Stream logs
```
Results collection depends on the pattern used:
Volume-based (recommended for training):
```bash
# Download results from volume after run completes
# Option A: In the launcher script, copy results to local mount before exit
# Option B: Use modal volume commands
modal volume ls experiment-results
modal volume get experiment-results /run_001/results.json ./results/
```
Stdout/return-based (for evaluation/benchmarks): Results are printed to terminal or returned from the function — already local.
Modal auto-scales to zero — no manual instance destruction needed. But clean up unused resources:
```bash
modal app stop <app-name>        # Stop a deployed service
modal volume delete <volume-name>  # Delete a volume when done (`rm` only removes files inside a volume)
```
```bash
modal run app.py                         # Run once
modal deploy app.py                      # Deploy persistent service
modal app logs <app>                     # View logs
modal app list                           # List apps
modal app stop <app>                     # Stop
modal volume ls                          # List volumes
modal volume get <vol> <remote> <local>  # Download from volume
modal secret create NAME KEY=VALUE       # Create secret
```
- `gpu=["H100", "A100-80GB", "L40S"]` — GPU fallback: Modal tries each in order
- `gpu="H100:4"` — multi-GPU (up to 8 GPUs, cost scales linearly)
- `modal.Volume.from_name("x", create_if_missing=True)` for persistent storage
- `@modal.enter()` loads the model once per container | `@modal.concurrent()` for concurrent requests
- `timeout=3600 * N` — the default timeout is only 5 min
- `modal.Mount.from_local_dir(".", remote_path="/workspace")` mounts local code
- `secrets=[modal.Secret.from_name("wandb-secret")]` + `wandb.init()` in your script

/run-experiment "train model" <- detects gpu: modal, calls /serverless-modal
-> /serverless-modal <- analyzes task, generates launcher, runs
-> Results returned locally or to Modal Volume
-> No destroy step needed (auto scale-to-zero)
/serverless-modal <- standalone: any Modal GPU workload
/serverless-modal "deploy vLLM" <- inference service deployment
## Modal
- gpu: modal # tells run-experiment to use Modal serverless
- modal_gpu: A100-80GB # optional: override GPU selection (default: auto-select)
- modal_timeout: 21600 # optional: max seconds (default: 6 hours)
- modal_volume: my-results # optional: named volume for results persistence
No SSH keys, no Docker images, no instance management needed. Just pip install modal && modal setup.
Cost protection: After `modal setup`, go to https://modal.com/settings in your browser (NEVER through the CLI) → bind a payment method to unlock the $30/month free tier (without a card: only $5/month). Then set a workspace spending limit equal to your free tier amount — Modal will auto-pause workloads when the limit is reached, preventing any surprise charges.