Run GPU workloads on Modal — training, fine-tuning, inference, batch processing. Zero-config serverless: no SSH, no Docker, auto scale-to-zero. Use when user says "modal run", "modal training", "modal inference", "deploy to modal", "need a GPU", "run on modal", "serverless GPU", or needs remote GPU compute.
Task: $ARGUMENTS
Modal is a serverless GPU cloud. Key advantages over SSH-based platforms (vast.ai, remote servers):
- Zero setup: `modal run` → done. No SSH keys, no Docker, no instance provisioning.
- Local-first: `modal run` from your laptop. Code, data, and results stay local; only the GPU function runs remotely.
- Reproducible environments: dependencies are declared in code via `modal.Image`, not system-level packages.
- Best for: users without a local GPU who need to debug CUDA code, run small-scale tests, or iterate quickly on experiments. The $5 free tier (no card) is enough for code debugging; $30 (with card) covers most small-scale experiment runs.
Trade-off: Modal costs more per GPU-hour than vast.ai or Lightning for some GPU tiers, but eliminates setup time and idle billing, often making it cheaper for short/medium workloads. For long training runs (>4 hours), consider vast.ai for lower $/hr.
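A rough sketch of that break-even logic (the rented-instance rate and setup overhead below are illustrative assumptions, not quoted prices):

```python
# Break-even sketch: serverless (billed only while running) vs. a rented
# instance (setup/idle time is billed too). Rates are illustrative:
# $2.10/hr ≈ Modal A100-40GB; $1.80/hr and 0.5 hr setup are assumptions.
def serverless_cost(run_hours, rate_hr=2.10):
    return run_hours * rate_hr

def rented_cost(run_hours, rate_hr=1.80, setup_hr=0.5):
    return (run_hours + setup_hr) * rate_hr

short = (serverless_cost(1.0), rented_cost(1.0))      # serverless cheaper
long_run = (serverless_cost(8.0), rented_cost(8.0))   # rented cheaper
```

Short jobs favor serverless because setup and idle time are never billed; past a few hours, the cheaper hourly rate wins.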
```bash
pip install modal
modal setup   # Opens browser login, writes token to ~/.modal.toml

# Verify credentials (should succeed; the app list may be empty):
modal app list
```
```bash
modal secret create huggingface-secret HF_TOKEN=hf_xxxxx
```

Recommended setup: bind a card to unlock $30/month, then immediately set a spending limit (e.g., $30) so you never exceed the free tier. Modal will pause your workloads when the limit is hit.
SECURITY WARNING: Always bind your card and set spending limits directly on https://modal.com/settings in your browser. NEVER enter payment information, card numbers, or billing details through Claude Code or any CLI tool. Only the official Modal website is safe for payment operations.
| GPU | $/sec | ≈$/hr | VRAM | Bandwidth GB/s | Free budget → hours |
|---|---|---|---|---|---|
| T4 | $0.000164 | $0.59 | 16GB | 300 | ~8.5 hr ($5) / 50.8 hr ($30) |
| L4 | $0.000222 | $0.80 | 24GB | 300 | ~6.3 hr / 37.5 hr |
| A10 | $0.000306 | $1.10 | 24GB | 600 | ~4.5 hr / 27.3 hr |
| L40S | $0.000542 | $1.95 | 48GB | 864 | ~2.6 hr / 15.4 hr |
| A100-40GB | $0.000583 | $2.10 | 40GB | 1555 | ~2.4 hr / 14.3 hr |
| A100-80GB | $0.000694 | $2.50 | 80GB | 2039 | ~2.0 hr / 12.0 hr |
| H100 | $0.001097 | $3.95 | 80GB | 3352 | ~1.3 hr / 7.6 hr |
| H200 | $0.001261 | $4.54 | 141GB | 4800 | ~1.1 hr / 6.6 hr |
| B200 | $0.001736 | $6.25 | 192GB | 8000 | ~0.8 hr / 4.8 hr |
CPU: $0.047/core/hr | RAM: $0.008/GiB/hr (GPU typically 90%+ of total cost)
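The derived columns in the table follow directly from the per-second rate; a quick sanity check:

```python
# Derive ≈$/hr and free-tier hours from the $/sec rate (matches the table above).
def per_hour(rate_sec):
    return rate_sec * 3600

def free_hours(rate_sec, budget):
    return budget / per_hour(rate_sec)

t4 = 0.000164                        # T4 rate from the table
print(round(per_hour(t4), 2))        # 0.59  ($/hr)
print(round(free_hours(t4, 5), 1))   # 8.5   hours on the $5 tier
print(round(free_hours(t4, 30), 1))  # 50.8  hours on the $30 tier
```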
Before EVERY run, estimate the cost and show it to the user for confirmation.
```
Cost estimate (Modal):
  Model: [name] ([params], [precision])
  VRAM: ~[X]GB (weights + KV cache + overhead)
  GPU: [type] ([VRAM]GB, $[X]/sec = $[X]/hr, bandwidth [X] GB/s)
  Estimate: ~[N] min, ~$[X]
```
| GPU | Speed (tok/s) | $/hr | Cost (1000 samples × 200 tok) | Duration |
|---|---|---|---|---|
| H100 | 224 | $3.95 | $0.98 | 15 min |
| A100-40GB | 104 | $2.10 | $1.12 | 32 min |
| L4 | 20 | $0.80 | $2.22 | 167 min |
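The rows above are plain throughput arithmetic: total tokens ÷ tok/s gives the duration, and duration × hourly rate gives the cost. For the H100 row:

```python
# 1000 samples × 200 generated tokens at a given throughput (tok/s).
def batch_cost(tok_per_s, rate_hr, n_samples=1000, tok_per_sample=200):
    seconds = n_samples * tok_per_sample / tok_per_s
    return seconds / 60, seconds / 3600 * rate_hr   # (minutes, dollars)

minutes, dollars = batch_cost(224, 3.95)   # H100 row from the table
print(round(minutes), round(dollars, 2))   # 15 0.98
```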
Same analysis as any GPU skill — determine VRAM needs from model size, pick GPU, estimate hours, calculate cost. See pricing table above.
VRAM Rules of Thumb:
| Model Size | FP16 VRAM | Recommended GPU |
|---|---|---|
| ≤3B | ~8GB | T4, L4 |
| 7-8B | ~22GB | L4, A10, A100-40GB |
| 13B | ~30GB | L40S, A100-40GB |
| 30B | ~65GB | A100-80GB, H100 |
| 70B | ~140GB | H100:2, H200 |
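These figures can be approximated with a simple rule: FP16 weights take 2 bytes per parameter, plus extra for KV cache, activations, and CUDA context. The 1.4 overhead factor and the GPU picker below are rough illustrative assumptions, not Modal guidance:

```python
def fp16_vram_gb(params_b, overhead=1.4):
    # 2 bytes/param in FP16, plus ~40% for KV cache / activations / CUDA context.
    return params_b * 2 * overhead

def pick_gpu(vram_needed_gb):
    # Smallest single GPU from the pricing table that fits the estimate.
    gpus = [("T4", 16), ("L4", 24), ("L40S", 48), ("A100-80GB", 80), ("H200", 141)]
    for name, vram in gpus:
        if vram >= vram_needed_gb:
            return name
    return "H100:2"  # nothing fits on one card: fall back to multi-GPU

print(round(fp16_vram_gb(8)))      # 22  (GB, for an 8B model)
print(pick_gpu(fp16_vram_gb(8)))   # L4  (24GB, per the table above)
```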
Based on the task type, generate the appropriate launcher script.
This is the most common pattern for run-experiment integration; it wraps an existing training script:
```python
import modal

app = modal.App("experiment-name")

image = modal.Image.debian_slim(python_version="3.11").pip_install(
    "torch", "transformers", "accelerate", "datasets", "wandb"
)

# Mount local project code into the container
local_code = modal.Mount.from_local_dir(".", remote_path="/workspace")

# Persistent volume for checkpoints and results
volume = modal.Volume.from_name("experiment-results", create_if_missing=True)

@app.function(
    image=image,
    gpu="A100-80GB",  # Chosen based on Step 1 analysis
    mounts=[local_code],
    volumes={"/results": volume},
    timeout=3600 * 6,  # 6 hours max
    secrets=[modal.Secret.from_name("wandb-secret")],  # Optional
)
def train():
    import subprocess
    subprocess.run(
        ["python", "train.py", "--output_dir", "/results/run_001"],
        cwd="/workspace",
        check=True,
    )
    volume.commit()  # Persist results to volume

@app.local_entrypoint()
def main():
    train.remote()
    print("Training complete. Results saved to Modal volume 'experiment-results'.")
```
Run: `modal run launcher.py`
```python
import modal

app = modal.App("inference-api")

# fastapi must be installed in the image for @modal.fastapi_endpoint to work
image = modal.Image.debian_slim(python_version="3.11").pip_install(
    "torch", "transformers", "accelerate", "fastapi[standard]"
)

@app.cls(image=image, gpu="L40S")
@modal.concurrent(max_inputs=10)
class InferenceAPI:
    @modal.enter()
    def load_model(self):
        # Runs once per container, not once per request
        from transformers import AutoModelForCausalLM, AutoTokenizer
        self.tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
        self.model = AutoModelForCausalLM.from_pretrained(
            "meta-llama/Llama-3.2-1B", device_map="auto"
        )

    @modal.fastapi_endpoint(method="POST")
    def generate(self, request: dict):
        inputs = self.tokenizer(request.get("prompt", ""), return_tensors="pt").to("cuda")
        outputs = self.model.generate(**inputs, max_new_tokens=256)
        return {"text": self.tokenizer.decode(outputs[0], skip_special_tokens=True)}
```
Deploy: `modal deploy app.py`
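After deploying, Modal prints the endpoint URL. A minimal client call might look like this; the URL below is a placeholder, use the one `modal deploy` prints for your workspace:

```python
import json
import urllib.request

# Placeholder URL — `modal deploy` prints the real one for your workspace.
URL = "https://<workspace>--inference-api-inferenceapi-generate.modal.run"

payload = json.dumps({"prompt": "Hello, Modal!"}).encode()
req = urllib.request.Request(
    URL, data=payload, headers={"Content-Type": "application/json"}
)
# resp = urllib.request.urlopen(req)          # uncomment to actually call it
# print(json.loads(resp.read())["text"])
```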
```python
import modal, subprocess

app = modal.App("vllm-server")

image = modal.Image.debian_slim(python_version="3.11").pip_install("vllm")
VOLUME = modal.Volume.from_name("model-cache", create_if_missing=True)
MODEL = "Qwen/Qwen3-4B"

@app.function(image=image, gpu="H100", volumes={"/models": VOLUME}, timeout=3600)
@modal.concurrent(max_inputs=100)
@modal.web_server(port=8000)
def serve():
    subprocess.Popen([
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", MODEL, "--download-dir", "/models", "--port", "8000",
    ])
```
```python
@app.function(image=image, gpu="T4", timeout=600)
def process_item(item: dict) -> dict:
    # ... process one item ...
    return {"result": "processed"}

@app.local_entrypoint()
def main():
    # Fan out over 1000 inputs; Modal auto-scales containers.
    results = list(process_item.map([{"id": i} for i in range(1000)]))
```
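Semantically, `.map` is an ordered parallel map over the inputs, with each call running in its own container. A plain-Python sketch of the same fan-out (a hypothetical local stand-in for the remote function):

```python
# What process_item.map(...) does semantically: apply the function to each
# input in parallel across containers, yielding results in input order.
def process_item_local(item: dict) -> dict:
    return {"id": item["id"], "result": "processed"}

items = [{"id": i} for i in range(1000)]
results = list(map(process_item_local, items))  # Modal runs this fan-out remotely
print(len(results), results[0])
```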
```python
@app.function(
    image=image, gpu="A100-80GB", volumes={"/output": volume},
    timeout=3600 * 6, secrets=[modal.Secret.from_name("huggingface-secret")],
)
def train():
    # ... transformers + peft + trl training code ...
    trainer.save_model("/output/final")
    volume.commit()
```
```python
@app.function(image=image, gpu="H100:4", volumes={"/output": volume}, timeout=3600 * 12)
def train_distributed():
    import subprocess
    subprocess.run(["accelerate", "launch", "--num_processes", "4",
                    "--mixed_precision", "bf16", "train.py"], check=True)
```
```bash
modal run launcher.py        # One-shot execution (most common for experiments)
modal deploy app.py          # Persistent service deployment
modal app list               # List running apps
modal app logs <app-name>    # Stream logs
```
Results collection depends on the pattern used:
Volume-based (recommended for training):
```bash
# Download results from volume after run completes
# Option A: In the launcher script, copy results to local mount before exit
# Option B: Use modal volume commands
modal volume ls experiment-results
modal volume get experiment-results /run_001/results.json ./results/
```
Stdout/return-based (for evaluation/benchmarks): Results are printed to terminal or returned from the function — already local.
Modal auto-scales to zero — no manual instance destruction needed. But clean up unused resources:
```bash
modal app stop <app-name>        # Stop a deployed service
modal volume delete <volume-name>  # Delete a volume when done (`rm` only removes files inside a volume)
```
```bash
modal run app.py                         # Run once
modal deploy app.py                      # Deploy persistent service
modal app logs <app>                     # View logs
modal app list                           # List apps
modal app stop <app>                     # Stop
modal volume ls                          # List volumes
modal volume get <vol> <remote> <local>  # Download from volume
modal secret create NAME KEY=VALUE       # Create secret
```
- `gpu=["H100", "A100-80GB", "L40S"]` — GPU fallback: Modal tries each in order
- `gpu="H100:4"` — multi-GPU (up to 8 GPUs, cost scales linearly)
- `modal.Volume.from_name("x", create_if_missing=True)` for persistent storage
- `@modal.enter()` loads the model once per container | `@modal.concurrent()` for concurrent requests
- `timeout=3600 * N` — the default timeout is only 5 min
- `modal.Mount.from_local_dir(".", remote_path="/workspace")` mounts local code
- `secrets=[modal.Secret.from_name("wandb-secret")]` + `wandb.init()` in your script

/run-experiment "train model" <- detects gpu: modal, calls /serverless-modal
-> /serverless-modal <- analyzes task, generates launcher, runs
-> Results returned locally or to Modal Volume
-> No destroy step needed (auto scale-to-zero)
/serverless-modal <- standalone: any Modal GPU workload
/serverless-modal "deploy vLLM" <- inference service deployment
## Modal
- gpu: modal # tells run-experiment to use Modal serverless
- modal_gpu: A100-80GB # optional: override GPU selection (default: auto-select)
- modal_timeout: 21600 # optional: max seconds (default: 6 hours)
- modal_volume: my-results # optional: named volume for results persistence
No SSH keys, no Docker images, no instance management needed. Just pip install modal && modal setup.
Cost protection: After `modal setup`, go to https://modal.com/settings in your browser (NEVER through the CLI) → bind a payment method to unlock the $30/month free tier (without a card: only $5/month). Then set a workspace spending limit equal to your free tier amount — Modal will auto-pause workloads when the limit is reached, preventing any surprise charges.