Apply CUDA Graphs to PyTorch workloads — API selection (torch.compile, PyTorch make_graphed_callables, TE make_graphed_callables, MCore CudaGraphManager, FullCudaGraphWrapper, manual torch.cuda.graph), code compatibility, capture workflows, dynamic pattern handling, and troubleshooting. Triggers: CUDA graph, torch.cuda.graph, make_graphed_callables, reduce-overhead, graph capture, graph replay, kernel launch overhead, CudaGraphManager, FullCudaGraphWrapper, full-iteration graph, stream capture.
CUDA Graphs capture a sequence of GPU operations once and replay them with minimal CPU overhead. This skill guides applying CUDA Graphs to PyTorch training and inference workloads using native PyTorch APIs, Transformer Engine, and Megatron-LM.
Reach for this skill when you encounter:
Do NOT use this skill for:
| Dependency | Version | Notes |
|---|---|---|
| PyTorch | >= 1.10 | torch.cuda.graph() available |
| CUDA | >= 11.0 | Graph update APIs |
| GPU | NVIDIA (any) | Required for CUDA |
| Nsight Systems | any | Optional, for profiling |
| APEX | any | Optional, for capturable optimizers |
| Transformer Engine | >= 2.2 | Optional, for FP8-aware graphing |
| Megatron-LM | core >= 0.14.0 | Optional, for CudaGraphManager / FullCudaGraphWrapper |
Choose the API based on your framework and performance needs.
| Situation | API | Workflow |
|---|---|---|
| Quick experiment, unknown graph boundaries | torch.compile(mode="reduce-overhead") | Workflow 2 |
| Training, need autograd, no FP8/PP | torch.cuda.make_graphed_callables() | Workflow 3 |
| Any PyTorch model, FP8 or PP support | TE make_graphed_callables | Workflow 4 |
| Megatron-LM, per-layer, automatic | MCore CudaGraphManager | Workflow 5 |
| Maximum perf, full-iteration capture | MCore FullCudaGraphWrapper | Workflow 6 |
| Full manual control, custom pipelines | torch.cuda.graph() | Workflow 7 |
Decision flowchart:
Strategy: Start with the highest-level API available for your framework. Move to lower-level APIs only if you need more control, hit limitations, or do not achieve the expected performance improvement.
Goal: Determine if CUDA Graphs will benefit your workload before investing effort.
nsys profile --cuda-graph-trace=graph python train.py
with torch.cuda.nvtx.range("forward"):
output = model(input)
Expected result: Identified bottleneck regions with low GPU occupancy between kernels. Proceed to the appropriate workflow from the API Selection Guide.
Goal: Automatic CUDA Graph capture with zero manual effort.
When to use: Quick experiment, unknown graph boundaries, already using
torch.compile.
Steps:
Decorate the step function with `@torch.compile(mode="reduce-overhead")`:
@torch.compile(mode="reduce-overhead")
def train_step(model, x, target, criterion):
output = model(x)
loss = criterion(output, target)
loss.backward()
return loss
nsys profile --cuda-graph-trace=graph python train.py
Check for graph breaks caused by `.item()`, `print()`, or data-dependent control flow. Fix these or escalate to Workflow 3+.

Trade-offs:
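The graph breaks mentioned above usually come from pulling tensor values back to Python. A minimal, CPU-runnable sketch of the fix (function names here are illustrative, not from any library): replace the `.item()`-plus-branch pattern with an on-device `torch.where` so `torch.compile` can capture the whole step.

```python
import torch

def step_with_break(x):
    # .item() forces a CPU sync and a graph break under torch.compile
    s = x.sum()
    if s.item() > 0:          # data-dependent Python branch: breaks the graph
        return x * 2
    return x

def step_graph_friendly(x):
    # Keep everything on-device: torch.where replaces the Python branch
    s = x.sum()
    return torch.where(s > 0, x * 2, x)

x = torch.tensor([1.0, -3.0, 4.0])
assert torch.equal(step_with_break(x), step_graph_friendly(x))
```

The two functions are numerically identical, but only the second compiles into a single graph.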
Goal: Training with autograd support. Separate forward/backward graphs.
When to use: Training with custom loops, non-FP8, need autograd.
Steps:
sample_input = torch.randn(batch_size, seq_len, hidden_size, device="cuda")
graphed_model = torch.cuda.make_graphed_callables(
model, (sample_input,), num_warmup_iters=3
)
Use `graphed_model` as a drop-in replacement in the training loop:
for data, target in dataloader:
optimizer.zero_grad()
output = graphed_model(data)
loss = criterion(output, target)
loss.backward()
optimizer.step()
If using AMP, run the graphed model under autocast with `cache_enabled=False`:
for data, target in dataloader:
optimizer.zero_grad()
with torch.amp.autocast("cuda", cache_enabled=False):
output = graphed_model(data)
loss = criterion(output, target)
loss.backward()
optimizer.step()
os.environ["TORCH_NCCL_ASYNC_ERROR_HANDLING"] = "0"
s = torch.cuda.Stream()
with torch.cuda.stream(s):
model = DistributedDataParallel(model)
torch.cuda.current_stream().wait_stream(s)
graphed_model = torch.cuda.make_graphed_callables(
model, (sample_input,), num_warmup_iters=11
)
Limitations:
Replay inputs must match `sample_args` exactly.

Goal: Per-callable graphing with FP8 support and pipeline parallelism.
When to use: FP8 training, PP with manual scheduling, non-Megatron models needing FP8, or any PyTorch model that needs FP8-aware CUDA Graphs.
Steps:
from transformer_engine.pytorch.graph import make_graphed_callables
from transformer_engine.pytorch.fp8 import fp8_autocast
sample_args = tuple(
(torch.randn(batch_size, seq_len, hidden_size, device="cuda"),)
for _ in range(num_callables * num_microbatches)
)
# Example: 2 chunks, 3 microbatches
layer_order = [1, 2, 1, 2, 1, 2, -2, -1, -2, -1, -2, -1]
graphed_layers = make_graphed_callables(
tuple(layers),
sample_args=sample_args,
fp8_enabled=True,
fp8_recipe=fp8_recipe,
fp8_weight_caching=True,
_order=layer_order, # None for no PP
)
Wrap replay in `fp8_autocast`:
with fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
for layer in graphed_layers[start:end]:
x = layer(x, is_first_microbatch=(mb_idx == 0))
# FP8 scaling auto-updated on fp8_autocast exit
optimizer.step()
Key points:
- Capture happens inside `make_graphed_callables()`.
- `_order`: The training loop must execute graphs in the same interleaved order as specified during capture.
- `fp8_autocast` required during replay: Without it, FP8 state is not properly configured.
- `fp8_weight_caching=True` caches FP8 weight quantization across microbatches; pass the `is_first_microbatch` kwarg to control when weights are requantized.

For full API details, see references/api-te-megatron.md.
Goal: Automatic per-layer graphing for Megatron-LM training.
When to use: Megatron-LM training, especially with PP > 1. Default choice for Megatron users.
Steps:
python pretrain_gpt.py \
--enable-cuda-graph \
--cuda-graph-num-warmup-steps 3
config = TransformerConfig(
enable_cuda_graph=True,
cuda_graph_num_warmup_steps=3,
)
Key points:
- Applies to `TransformerLayer` and `MambaLayer`.
- FP8 uses `fp8_autocast(..., _graph=True)` to skip per-layer amax reduction; reduction happens once after all backward graphs.
- Set `cuda_graph_share_io_buffers=True` to share I/O buffers between layers (requires no operations between layers).
- Set `cuda_graph_use_single_mempool=True` for a shared pool (higher graph count but may reduce fragmentation).

Goal: Maximum performance. Captures forward+backward for all microbatches as a single graph.
When to use: Maximum performance priority, static workloads, Megatron-LM training.
Steps:
python pretrain_gpt.py \
--enable-cuda-graph \
--cuda-graph-scope full_iteration \
--cuda-graph-warmup-steps 1 \
--te-rng-tracker \
--no-check-for-nan-in-loss-and-grad
Requires a fully static iteration (no `.item()`, no NaN check, no dynamic control flow).

Key points:
- `--te-rng-tracker` required: Standard RNG uses CPU scalars that cannot be captured; TE RNG uses device tensors compatible with graphs.
- `--no-check-for-nan-in-loss-and-grad` mandatory: NaN checking uses `.item()`, which requires a CPU-GPU sync, forbidden during capture.
- Capture happens at iteration `warmup_steps + 1`.

Goal: Full control over capture and replay. Custom pipelines, full-iteration capture without Megatron.
When to use: Need fine-grained control, non-Megatron full-iteration capture, custom pipelines.
Inference pattern:
static_input = torch.randn(batch_size, *shape, device="cuda")
s = torch.cuda.Stream()
with torch.cuda.stream(s):
for _ in range(3):
_ = model(static_input)
torch.cuda.current_stream().wait_stream(s)
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
static_output = model(static_input)
Replay: copy new inputs in with `.copy_()`, clone outputs:
for data in loader:
static_input.copy_(data)
g.replay()
result = static_output.clone()
Full training pattern (fwd+bwd+optimizer in one graph):
model = MyModel().cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = torch.nn.CrossEntropyLoss()
static_input = torch.randn(batch_size, *shape, device="cuda")
static_target = torch.randint(0, num_classes, (batch_size,), device="cuda")
# Warmup
s = torch.cuda.Stream()
with torch.cuda.stream(s):
for _ in range(3):
optimizer.zero_grad()
with torch.amp.autocast("cuda", cache_enabled=False):
out = model(static_input)
loss = criterion(out, static_target)
loss.backward()
torch.cuda.current_stream().wait_stream(s)
# Capture
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
optimizer.zero_grad()
with torch.amp.autocast("cuda", cache_enabled=False):
static_output = model(static_input)
static_loss = criterion(static_output, static_target)
static_loss.backward()
# Replay loop
for data, target in loader:
static_input.copy_(data)
static_target.copy_(target)
g.replay()
optimizer.step()
DDP setup:
os.environ["TORCH_NCCL_ASYNC_ERROR_HANDLING"] = "0"
s = torch.cuda.Stream()
with torch.cuda.stream(s):
model = DistributedDataParallel(model)
# 11 warmup iterations for DDP
with torch.cuda.stream(s):
for _ in range(11):
out = model(static_input)
out.sum().backward()
torch.cuda.current_stream().wait_stream(s)
# Capture on the same side stream
with torch.cuda.graph(g):
static_output = model(static_input)
Memory pool sharing for multiple graphs:
g1 = torch.cuda.CUDAGraph()
with torch.cuda.graph(g1):
out1 = model_a(static_in_a)
# Second graph shares first graph's memory pool
g2 = torch.cuda.CUDAGraph()
with torch.cuda.graph(g2, pool=g1.pool()):
out2 = model_b(static_in_b)
Custom RNG registration:
gen = torch.cuda.default_generators[0]
g = torch.cuda.CUDAGraph()
g.register_generator_state(gen)
with torch.cuda.graph(g):
out = model(static_input) # RNG state properly captured
If manual capture becomes unwieldy, consider `torch.cuda.make_graphed_callables` (Workflow 3) for larger, fewer graphs, or TE `make_graphed_callables` (Workflow 4) for FP8.

These principles apply to all workflows. Code inside the captured region must satisfy three constraints.
Only GPU operations are captured. CPU-side code (Python logic, I/O, logging) executes during capture but is eliminated during replay.
Violations:
- `data = torch.load("file.pt")` won't reload on replay
- `tokens = tokenizer.encode(text)` won't re-tokenize
- `print(f"Step {i}")` won't print during replay
- `random.randint(0, 10)` won't regenerate
- `buffer.append(tensor)` won't populate during replay

Fix: Move all CPU-side operations outside the graphed region.
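One way to apply this fix is to split each iteration into a CPU-side prepare step and a tensor-only compute step. The sketch below uses hypothetical helper names and a fixed-address input buffer; only `graphable_compute` would be captured.

```python
import torch

def cpu_side_prepare(raw_batch, static_input):
    # Runs every iteration OUTSIDE the graph: I/O, Python logic, logging
    batch = torch.as_tensor(raw_batch, dtype=torch.float32)
    static_input.copy_(batch)        # feed the graph through a fixed address

def graphable_compute(static_input):
    # Pure tensor ops only -- this is the region that is safe to capture
    return static_input * 2 + 1

static_input = torch.empty(4)
cpu_side_prepare([0.0, 1.0, 2.0, 3.0], static_input)
out = graphable_compute(static_input)
assert torch.equal(out, torch.tensor([1.0, 3.0, 5.0, 7.0]))
```

During replay, only `cpu_side_prepare` runs in Python each iteration; the captured region sees fresh data purely through the `.copy_()` into the static buffer.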
No CPU-GPU synchronization inside the graph. The CPU queues work continuously without waiting for GPU results.
Violations:
- `.item()` to get scalar values
- `.cpu()` to move tensors for inspection
- `torch.cuda.synchronize()` or `stream.synchronize()`
- `print(tensor)` (implicitly syncs)

Fix: Invoke the perf-torch-sync-free skill for systematic detection and elimination of sync points. Use `torch.cuda.set_sync_debug_mode("warn")` to find hidden syncs.
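A common sync-free rewrite for loss logging, sketched here so it runs on CPU or GPU: accumulate the loss on-device and call `.item()` once after the loop instead of every step.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Accumulate the running loss on-device; a single .item() after the
# loop replaces a per-step sync (which would also break capture).
running_loss = torch.zeros((), device=device)
for step in range(5):
    loss = torch.tensor(0.5, device=device)   # stand-in for criterion(...)
    running_loss += loss                       # no CPU-GPU sync here

avg = (running_loss / 5).item()                # one sync, outside the loop
assert abs(avg - 0.5) < 1e-6
```

The same pattern applies to accuracy counters and gradient norms: keep them as device tensors for the duration of the captured region.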
All operations, control flow, memory addresses, and shapes must be fixed across all replays.
Violations and fixes:
| Dynamic aspect | Fix |
|---|---|
| `if loss > threshold:` | `torch.where(condition, a, b)` |
| `input = new_tensor` (address changes) | Pre-allocate + `.copy_()` |
| Python scalars (lr, temperature) | GPU tensor + .fill_() |
| Variable batch size / sequence length | Padding or bucketing |
| MoE / dynamic routing | Partial graphing |
For detailed patterns, see references/patterns-dynamic.md.
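The padding/bucketing fix for variable shapes can be sketched as follows. The bucket sizes and helper name are assumptions for illustration; the point is that every replay then sees one of a few fixed shapes, each of which can have its own captured graph.

```python
import torch
import torch.nn.functional as F

BUCKETS = (128, 256, 512)   # assumed bucket sizes; one graph per bucket

def pad_to_bucket(x):
    """Pad a (batch, seq) tensor up to the smallest bucket that fits.
    Assumes seq <= max(BUCKETS)."""
    seq = x.shape[1]
    target = next(b for b in BUCKETS if b >= seq)
    return F.pad(x, (0, target - seq))   # zero-pad the last dim on the right

x = torch.ones(2, 200)
padded = pad_to_bucket(x)
assert padded.shape == (2, 256)
assert padded[:, 200:].abs().sum() == 0   # padding region is zeros
```

Remember to mask the padded positions (e.g., in attention or the loss) so they do not affect results.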
Verify every item before attempting capture:
- No `.item()`, `.cpu()`, `.numpy()`, or `print(tensor)` inside the graph
- No `torch.cuda.synchronize()` or `stream.synchronize()`
- No `if tensor_value:` -- use `torch.where()` instead
- Static inputs updated via `.copy_()`
- Python scalars updated via `.fill_()`
- Outputs `.clone()`d before the next replay
- `cache_enabled=False` with `torch.amp.autocast`
- Custom RNG registered via `graph.register_generator_state()`
- `graphsafe_get_state()` / `graphsafe_set_state()` for RNG
- DDP: `TORCH_NCCL_ASYNC_ERROR_HANDLING=0`, construct on side stream
- Side stream joined back via `torch.cuda.current_stream()`, not the default stream
- activation_checkpointing: `preserve_rng_state=False`
- No `torch.compile` functions inside manual capture without prior warmup

For the complete checklist with references, see references/patterns-compatibility.md.
Success indicators:
- `g.replay()` completes without errors
- Outputs match eager mode (`torch.allclose`)

Key metrics:
| Metric | How to Check |
|---|---|
| Correctness | torch.allclose(eager, graphed, rtol=1e-5) |
| Speedup | Wall-clock time comparison |
| GPU utilization | nvidia-smi or Nsight Systems timeline |
| Memory overhead | torch.cuda.memory_summary() |
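The correctness and speedup checks can be combined into a small harness. This is a minimal sketch, not a rigorous benchmark (no warmup passes, small iteration count); `verify_and_time` and its arguments are illustrative names, and on a real GPU run `graphed_fn` would wrap `g.replay()` plus the static-buffer copies.

```python
import time
import torch

def verify_and_time(eager_fn, graphed_fn, inputs, iters=10):
    """Check the graphed output against eager mode, then compare
    average wall-clock time per call for both variants."""
    ref, out = eager_fn(*inputs), graphed_fn(*inputs)
    assert torch.allclose(ref, out, rtol=1e-5), "graphed output diverged"

    def bench(fn):
        if torch.cuda.is_available():
            torch.cuda.synchronize()       # don't time queued async work
        t0 = time.perf_counter()
        for _ in range(iters):
            fn(*inputs)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        return (time.perf_counter() - t0) / iters

    return bench(eager_fn), bench(graphed_fn)

# CPU stand-ins just to demonstrate the harness shape
x = torch.randn(8, 8)
t_eager, t_graphed = verify_and_time(lambda t: t @ t, lambda t: t @ t, (x,))
assert t_eager > 0 and t_graphed > 0
```

Run the comparison on representative batch shapes; a speedup that appears only on tiny inputs usually means the workload was launch-bound, which is exactly where graphs help.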
| Error | Cause | Fix |
|---|---|---|
| `StreamCaptureUnsupported` (900) | Sync op during capture (`.item()`, `.cpu()`) | Move sync outside graph |
| `StreamCaptureInvalidated` (901) | Background thread (e.g., pin_memory) | `capture_error_mode="thread_local"` |
| `StreamCaptureUnjoined` (904) | Side stream didn't rejoin capture stream | `capture_stream.wait_stream(side_stream)` |
| `StreamCaptureImplicit` (906) | AccumulateGrad on default stream | Warmup on side stream before capture |
| Illegal memory access | Input tensor freed/reassigned | Keep persistent ref, use .copy_() |
| Wrong numerical results | Dynamic behavior frozen at capture | See references/patterns-compatibility.md |
| OOM with multiple graphs | Pools can't share memory | pool=g1.pool() for sequential graphs |
| No speedup | Already GPU-bound or wrong capture scope | Profile with nsys first (Workflow 1) |
| FP8 scaling corruption | TE without fp8_autocast during replay | Wrap with fp8_autocast(enabled=True) |
| PP replay order mismatch | Wrong execution order during replay | Match _order / capture sequence exactly |
| FullCudaGraphWrapper capture fail | NaN check or sync enabled | --no-check-for-nan-in-loss-and-grad |
| RNG failure with FullCudaGraphWrapper | Standard RNG not capturable | --te-rng-tracker |
| DDP capture failure | Async error handling watchdog | TORCH_NCCL_ASYNC_ERROR_HANDLING=0 |
| DDP AccumulateGrad on default stream | DDP constructed on default stream | Construct DDP in side stream context |
| Autocast cache invalidation | Cached cast tensors freed on exit | cache_enabled=False |
For detailed troubleshooting, see references/troubleshooting.md.
Use this 3-tier lookup hierarchy -- start at Tier 1 and escalate only when needed.
You are reading it now. The workflows, compatibility checklist, and error table above cover the most common tasks. Search this file first before going deeper.
The references/ directory beside this file contains distilled reference
material -- API details, patterns, and troubleshooting pages.
How to search:
Grep for keywords in references/ -- headers are designed to be grep-friendly.

Available references:
- references/api-pytorch.md -- PyTorch CUDA Graph APIs (torch.cuda.graph, make_graphed_callables, torch.compile reduce-overhead)
- references/api-te-megatron.md -- TE make_graphed_callables, CudaGraphManager, FullCudaGraphWrapper implementations
- references/patterns-compatibility.md -- GPU-only, sync-free, and static principles with full checklist
- references/patterns-dynamic.md -- Dynamic control flow, tensors, scalars, shapes: workarounds and patterns
- references/troubleshooting.md -- Capture failures, numerical errors, memory issues, performance issues

If Tiers 1-2 do not answer the question, consult the original sources:
- https://docs.nvidia.com/dl-cuda-graph/latest/index.html
- https://docs.pytorch.org/docs/stable/notes/cuda.html (CUDA Graphs section)
- https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/index.html
- https://docs.nvidia.com/megatron-core/developer-guide/latest/index.html

Return to Tier 2 afterward and consider whether the answer should be distilled into the references directory for next time.
clip_grad_norm_ (PyTorch >= 1.13)