Guide for debugging distributed training issues in AReaL (FSDP2, Tensor Parallelism, Context Parallelism, Expert Parallelism).

This skill is triggered when the user encounters hangs, wrong results, out-of-memory (OOM) errors, or communication failures during distributed training.
Always follow the minimal-demo principle: reproduce the failure with the least amount of code to narrow down the issue faster.
```python
# Bad: debugging inside the full training loop.
# Good: a minimal script that reproduces just the failing operation.
import os

import torch
import torch.distributed as dist

dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))  # one GPU per rank

# Reproduce the exact operation that fails
tensor = torch.ones(10).cuda()
dist.all_reduce(tensor)  # <-- isolate the failing op
print(f"Rank {rank}: {tensor}")
```
<!--
================================================================================
MAINTAINER GUIDE
================================================================================
Location: .claude/skills/debug-distributed/SKILL.md
Invocation: /debug-distributed
## Purpose
Debugging guide for distributed training issues.
Covers FSDP2, Tensor Parallelism, Context Parallelism, and Expert Parallelism.
## How to Update
### When Adding New Parallelism Features
1. Add section for the parallelism type
2. Document common error patterns and debugging snippets
### When PyTorch Distributed APIs Change
1. Update DTensor/DeviceMesh examples
2. Update environment variable references
### When New Error Patterns Emerge
1. Add to "Common Errors and Solutions" table
2. Reference relevant source files
================================================================================
-->

Reduction strategy: start from the failing job and progressively shrink it (fewer ranks, fewer layers, smaller tensors, one batch) until you have the smallest setup that still fails.
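A reduction might look like this (a sketch; script names and GPU counts are assumptions):

```bash
# Full job: 8 GPUs with all parallelism dimensions enabled -- fails
torchrun --nproc_per_node=8 train.py

# Reduced: 2 GPUs, one parallelism dimension, one batch -- still fails?
# Then debug here instead of in the full job.
torchrun --nproc_per_node=2 minimal_repro.py
```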
Environment Variables for Debugging:

```bash
# Full debug logging
export TORCH_DISTRIBUTED_DEBUG=DETAIL
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL

# torch.compile debugging
export TORCH_LOGS="+dynamo,recompiles"
export TORCHDYNAMO_VERBOSE=1
```
Dump Call Stack with py-spy (for hung processes):

```bash
# Find process IDs
ps aux | grep python

# Dump the call stack of a specific rank
py-spy dump --pid <PID>

# Record a flame graph for performance analysis
py-spy record -o profile.svg --pid <PID> --duration 30
```
Common Causes:
- Mismatched collectives: one rank calls `all_reduce` while another doesn't, so both sides wait forever.

Debug Steps:
```python
# Verify group membership
mesh = parallel_dims.get_mesh("dp_shard_cp")
group = mesh.get_group()
print(f"Rank {dist.get_rank()}: group size = {dist.get_world_size(group)}")

# Print shapes on all ranks
print(f"Rank {dist.get_rank()}: tensor.shape = {tensor.shape}")
dist.barrier()
```
Timeout Adjustment (for debugging only):

```python
from datetime import timedelta

from areal.engine.core.distributed import patch_dist_group_timeout

patch_dist_group_timeout(timedelta(minutes=30))
```
Check DTensor Placements:

```python
from torch.distributed.tensor import DTensor

for name, param in model.named_parameters():
    if isinstance(param, DTensor):
        print(f"Param {name}: placements={param.placements}, mesh={param.device_mesh}")
```
Verify Gradient Reduction:

```python
for name, param in model.named_parameters():
    if param.grad is not None:
        print(f"Rank {dist.get_rank()}: {name} grad_sum = {param.grad.sum().item()}")
```
Check Memory Usage:

```python
print(f"Rank {dist.get_rank()}: "
      f"allocated={torch.cuda.memory_allocated()/1e9:.2f}GB, "
      f"reserved={torch.cuda.memory_reserved()/1e9:.2f}GB")
```
Check FSDP Coverage:

```python
for name, param in model.named_parameters():
    is_dtensor = isinstance(param, DTensor)
    print(f"{name}: is_dtensor={is_dtensor}, shape={param.shape}")
```
Common Errors and Solutions:

| Error | Cause | Solution |
|---|---|---|
| `NCCL WARN Cuda failure` | GPU communication | Check NCCL version, GPU topology |
| `RuntimeError: Timed out` | Rank synchronization | Increase timeout, check for divergent code paths |
| `Invalid device mesh` | Mesh configuration | Verify `world_size = dp * tp * cp` |
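A quick sanity check for the mesh error (the dp/tp/cp values are assumptions standing in for your config):

```python
import torch.distributed as dist

# Assumed parallelism degrees; adjust to match your config.
dp, tp, cp = 4, 2, 1

# The product of all parallelism degrees must equal the launched world size.
world_size = dist.get_world_size()
assert dp * tp * cp == world_size, (
    f"world_size {world_size} != dp*tp*cp = {dp * tp * cp}"
)
```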
| Variable | Purpose |
|---|---|
| `TORCH_DISTRIBUTED_DEBUG=DETAIL` | Detailed distributed logging |
| `NCCL_DEBUG=INFO` | NCCL communication logging |
| `NCCL_DEBUG_SUBSYS=ALL` | All NCCL subsystems |
| `TORCH_LOGS="+dynamo,recompiles"` | torch.compile logging |
| `TORCHDYNAMO_VERBOSE=1` | Dynamo verbose output |
| `CUDA_LAUNCH_BLOCKING=1` | Synchronous CUDA launches (slow; debugging only) |
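These can be combined on a single debugging run (a sketch; the script name is an assumption):

```bash
# CUDA_LAUNCH_BLOCKING=1 makes the stack trace point at the real failing kernel
CUDA_LAUNCH_BLOCKING=1 TORCH_DISTRIBUTED_DEBUG=DETAIL NCCL_DEBUG=INFO \
    torchrun --nproc_per_node=2 minimal_repro.py 2>&1 | tee debug.log
```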
```bash
# Install
pip install py-spy

# Dump the call stack of a hung process
py-spy dump --pid <PID>

# Dump all Python processes
pgrep -f python | xargs -I {} py-spy dump --pid {}

# Record a flame graph
py-spy record -o profile.svg --pid <PID> --duration 30
```
```python
def print_all_ranks(msg):
    """Print a message from every rank, in rank order."""
    for r in range(dist.get_world_size()):
        if dist.get_rank() == r:
            print(f"[Rank {r}] {msg}")
        dist.barrier()
```
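Example usage (`loss` is a stand-in for whatever value you want to compare across ranks):

```python
print_all_ranks(f"loss={loss.item():.4f}")
```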
```python
def debug_mesh(parallel_dims):
    """Print the size of each device-mesh dimension on this rank."""
    mesh = parallel_dims.world_mesh
    for dim_name in mesh.mesh_dim_names:
        submesh = parallel_dims.get_mesh(dim_name)
        if submesh:
            print(f"Rank {dist.get_rank()}: {dim_name} size={submesh.size()}")
```
```python
def check_tensor_consistency(tensor, name, group=None):
    """Gather per-rank sums and warn if they differ across ranks."""
    local_sum = tensor.sum().item()
    tensor_sums = [None] * dist.get_world_size(group)
    dist.all_gather_object(tensor_sums, local_sum, group=group)
    if dist.get_rank() == 0 and len(set(tensor_sums)) > 1:
        print(f"WARNING: {name} inconsistent: {tensor_sums}")
```
| Component | File |
|---|---|
| Parallel Dims | areal/experimental/models/archon/parallel_dims.py |
| Expert Parallel | areal/experimental/models/archon/expert_parallel.py |
| Ulysses (CP) | areal/experimental/models/archon/ulysses.py |
| FSDP/TP Apply | areal/experimental/models/archon/qwen2/infra/parallelize.py |