Use when setting up training infrastructure — data loaders, distributed training, mixed precision, checkpointing, or training loops
DDP+AMP from day 1, always checkpoint, FSDP only when DDP OOMs.
Core principle: Start distributed and stay distributed. Single-GPU training is not a stepping stone — it's a trap.
Violating the letter of the rules is violating the spirit of the rules.
NO MULTI-NODE TRAINING WITHOUT SINGLE-NODE DDP VERIFICATION FIRST
Skipping single-node DDP? You'll debug distributed bugs across machines instead of locally. Prove correctness on one node first. Always.
Always:
Build the DataLoader correctly from the start.
```python
from torch.utils.data import DataLoader

dataloader = DataLoader(
    dataset,
    batch_size=cfg.batch_size,
    num_workers=4,
    pin_memory=True,
    persistent_workers=True,
    shuffle=True,
)
```
Verify before moving on:
Don't guess. Look at your data.
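A minimal smoke test for the loader, sketched with a synthetic `TensorDataset` standing in for your real dataset (shapes and sizes here are illustrative): pull one batch, check shapes, dtypes, and NaNs before any training code runs.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for the real dataset: 64 fake images + labels.
dataset = TensorDataset(torch.randn(64, 3, 32, 32), torch.randint(0, 10, (64,)))
dataloader = DataLoader(dataset, batch_size=16, num_workers=0, shuffle=True)

# Pull one batch and look at it.
x, y = next(iter(dataloader))
print(x.shape, x.dtype)  # expect torch.Size([16, 3, 32, 32]) torch.float32
print(y.shape, y.dtype)  # expect torch.Size([16]) torch.int64
assert not torch.isnan(x).any(), "NaNs in input batch"
assert x.shape[0] == y.shape[0] == 16
```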
Launch with torchrun on a single node with 2 GPUs:
```bash
torchrun --nproc_per_node=2 train.py
```
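Inside `train.py`, each process spawned by torchrun can identify itself through environment variables that torchrun sets. A minimal sketch of reading them (defaults make it runnable outside torchrun too):

```python
import os

# RANK, LOCAL_RANK, and WORLD_SIZE are set by torchrun for each worker.
# The defaults below apply when running the script directly (single process).
rank = int(os.environ.get("RANK", 0))
local_rank = int(os.environ.get("LOCAL_RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))
print(f"rank {rank}/{world_size}, local_rank {local_rank}")
```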
Verify: overfit-one-batch checkpoint round-trip passes (save, reload, loss matches).
Save and reload state. Verify loss continuity after reload.
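The round-trip check can be sketched with a toy model and a fixed batch (the `nn.Linear` model and in-memory buffer here are illustrative; in practice you save your real model's `state_dict` to disk): compute loss, save, reload into a fresh model, and assert the loss matches.

```python
import io
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
model = nn.Linear(8, 1)                     # toy stand-in for the real model
x, y = torch.randn(4, 8), torch.randn(4, 1)  # one fixed batch
loss_before = F.mse_loss(model(x), y).item()

# Save to an in-memory buffer (stand-in for a checkpoint file).
buf = io.BytesIO()
torch.save(model.state_dict(), buf)
buf.seek(0)

# Reload into a freshly initialized model and recompute the loss.
reloaded = nn.Linear(8, 1)
reloaded.load_state_dict(torch.load(buf))
loss_after = F.mse_loss(reloaded(x), y).item()

assert abs(loss_before - loss_after) < 1e-6, "checkpoint round-trip failed"
```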
Rules:
```python
import lightning as L
from lightning.pytorch.callbacks import LearningRateMonitor, ModelCheckpoint

trainer = L.Trainer(
    accelerator="gpu",
    devices=2,
    strategy="ddp",
    precision="bf16-mixed",
    max_steps=cfg.max_steps,
    accumulate_grad_batches=cfg.grad_accum,
    gradient_clip_val=1.0,
    callbacks=[
        ModelCheckpoint(every_n_train_steps=500),
        LearningRateMonitor(),
    ],
    log_every_n_steps=10,
)
```
This is your starting point. Deviate only with justification.
DDP bugs are silent. Check all four every time:
| Item | Why |
|---|---|
| Logging only on rank 0 | All-rank logging floods stdout, corrupts log files, hides real errors |
| Checkpointing only on rank 0 | Multiple ranks writing the same checkpoint causes corruption or race conditions |
| Effective batch size = per_gpu_batch * num_gpus * grad_accum | Wrong batch size silently changes learning dynamics. Always compute and log this. |
| Random seeds: same for model init, different for data shuffle | Same model init ensures identical starting weights. Different data shuffle ensures each GPU sees different data. |
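Two of these checks can be sketched in a few lines: a rank-0 guard (reading `RANK` from the environment, as torchrun sets it) and the effective-batch-size computation. The `cfg`-style values here are illustrative.

```python
import os

# Illustrative values; in practice these come from your config.
per_gpu_batch, num_gpus, grad_accum = 32, 2, 4

# Effective batch size = per_gpu_batch * num_gpus * grad_accum.
effective_batch = per_gpu_batch * num_gpus * grad_accum

# Rank-0 guard: only one process logs, so stdout stays readable.
if int(os.environ.get("RANK", 0)) == 0:
    print(f"effective batch size: {effective_batch}")  # 256 with the values above
```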
DDP OOMs? → FSDP.
DDP doesn't OOM? → Stay on DDP.
That's it. No premature optimization. FSDP adds complexity — sharding strategies, communication overhead, debugging difficulty. Only pay that cost when DDP provably cannot fit your model.
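When DDP does provably OOM, the switch is small in code even if large in behavior: change the strategy and keep everything else identical. A hedged sketch, assuming a Lightning version where `"fsdp"` is an accepted strategy alias:

```python
import lightning as L

trainer = L.Trainer(
    accelerator="gpu",
    devices=2,
    strategy="fsdp",  # was "ddp"; shards params, grads, and optimizer state
    precision="bf16-mixed",
)
```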
| Excuse | Reality |
|---|---|
| "Let me verify single-GPU first" | DDP on 2 GPUs is your verification. Overfit-one-batch proves it. |
| "Mixed precision might cause instability" | bf16 is stable on modern hardware. Add it now, debug later only if NaN. |
| "I'll checkpoint later" | Training is expensive. Any lost work is unacceptable. Checkpoint always. |
| "FSDP because our model is big" | Start DDP. Only FSDP when DDP provably OOMs. |
| "I need to tune data loading first" | 4 workers + pin_memory + persistent_workers. Profile only if GPU is starved. |
Any of these means: stop, go back, fix the pipeline.
Before marking training infrastructure complete:
Can't check all boxes? Your pipeline isn't ready.