Use when setting up training infrastructure — data loaders, distributed training, mixed precision, checkpointing, or training loops
DDP+AMP from day 1, always checkpoint, FSDP only when DDP OOMs.
Core principle: Start distributed and stay distributed. Single-GPU training is not a stepping stone — it's a trap.
Violating the letter of the rules is violating the spirit of the rules.
NO MULTI-NODE TRAINING WITHOUT SINGLE-NODE DDP VERIFICATION FIRST
Skipping single-node DDP? You'll debug distributed bugs across machines instead of locally. Prove correctness on one node first. Always.
Always:
Build the DataLoader correctly from the start.
```python
from torch.utils.data import DataLoader

dataloader = DataLoader(
    dataset,
    batch_size=cfg.batch_size,
    num_workers=4,
    pin_memory=True,
    persistent_workers=True,
    shuffle=True,
)
```
Verify before moving on:
Don't guess. Look at your data.
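A minimal smoke test for the loader, sketched with a synthetic `TensorDataset` standing in for your real dataset (shapes and sizes here are illustrative): pull one batch, check shapes, dtypes, and NaNs before any training code runs.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for the real dataset: 64 fake images + labels.
dataset = TensorDataset(torch.randn(64, 3, 32, 32), torch.randint(0, 10, (64,)))
dataloader = DataLoader(dataset, batch_size=16, num_workers=0, shuffle=True)

# Pull one batch and look at it.
x, y = next(iter(dataloader))
print(x.shape, x.dtype)  # expect torch.Size([16, 3, 32, 32]) torch.float32
print(y.shape, y.dtype)  # expect torch.Size([16]) torch.int64
assert not torch.isnan(x).any(), "NaNs in input batch"
assert x.shape[0] == y.shape[0] == 16
```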
Launch with torchrun on a single node with 2 GPUs:
```bash
torchrun --nproc_per_node=2 train.py
```
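Inside `train.py`, each process spawned by torchrun can identify itself through environment variables that torchrun sets. A minimal sketch of reading them (defaults make it runnable outside torchrun too):

```python
import os

# RANK, LOCAL_RANK, and WORLD_SIZE are set by torchrun for each worker.
# The defaults below apply when running the script directly (single process).
rank = int(os.environ.get("RANK", 0))
local_rank = int(os.environ.get("LOCAL_RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))
print(f"rank {rank}/{world_size}, local_rank {local_rank}")
```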
Verify: overfit-one-batch checkpoint round-trip passes (save, reload, loss matches).
Save and reload state. Verify loss continuity after reload.
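The round-trip check can be sketched with a toy model and a fixed batch (the `nn.Linear` model and in-memory buffer here are illustrative; in practice you save your real model's `state_dict` to disk): compute loss, save, reload into a fresh model, and assert the loss matches.

```python
import io
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
model = nn.Linear(8, 1)                     # toy stand-in for the real model
x, y = torch.randn(4, 8), torch.randn(4, 1)  # one fixed batch
loss_before = F.mse_loss(model(x), y).item()

# Save to an in-memory buffer (stand-in for a checkpoint file).
buf = io.BytesIO()
torch.save(model.state_dict(), buf)
buf.seek(0)

# Reload into a freshly initialized model and recompute the loss.
reloaded = nn.Linear(8, 1)
reloaded.load_state_dict(torch.load(buf))
loss_after = F.mse_loss(reloaded(x), y).item()

assert abs(loss_before - loss_after) < 1e-6, "checkpoint round-trip failed"
```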
Rules:
```python
import lightning as L
from lightning.pytorch.callbacks import LearningRateMonitor, ModelCheckpoint

trainer = L.Trainer(
    accelerator="gpu",
    devices=2,
    strategy="ddp",
    precision="bf16-mixed",
    max_steps=cfg.max_steps,
    accumulate_grad_batches=cfg.grad_accum,
    gradient_clip_val=1.0,
    callbacks=[
        ModelCheckpoint(every_n_train_steps=500),
        LearningRateMonitor(),
    ],
    log_every_n_steps=10,
)
```
This is your starting point. Deviate only with justification.
DDP bugs are silent. Check all four every time:
| Item | Why |
|---|---|
| Logging only on rank 0 | All-rank logging floods stdout, corrupts log files, hides real errors |
| Checkpointing only on rank 0 | Multiple ranks writing the same checkpoint causes corruption or race conditions |
| Effective batch size = per_gpu_batch * num_gpus * grad_accum | Wrong batch size silently changes learning dynamics. Always compute and log this. |
| Random seeds: same for model init, different for data shuffle | Same model init ensures identical starting weights. Different data shuffle ensures each GPU sees different data. |
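Two of these checks can be sketched in a few lines: a rank-0 guard (reading `RANK` from the environment, as torchrun sets it) and the effective-batch-size computation. The `cfg`-style values here are illustrative.

```python
import os

# Illustrative values; in practice these come from your config.
per_gpu_batch, num_gpus, grad_accum = 32, 2, 4

# Effective batch size = per_gpu_batch * num_gpus * grad_accum.
effective_batch = per_gpu_batch * num_gpus * grad_accum

# Rank-0 guard: only one process logs, so stdout stays readable.
if int(os.environ.get("RANK", 0)) == 0:
    print(f"effective batch size: {effective_batch}")  # 256 with the values above
```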
DDP OOMs? → FSDP.
DDP doesn't OOM? → Stay on DDP.
That's it. No premature optimization. FSDP adds complexity — sharding strategies, communication overhead, debugging difficulty. Only pay that cost when DDP provably cannot fit your model.
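When DDP does provably OOM, the switch is small in code even if large in behavior: change the strategy and keep everything else identical. A hedged sketch, assuming a Lightning version where `"fsdp"` is an accepted strategy alias:

```python
import lightning as L

trainer = L.Trainer(
    accelerator="gpu",
    devices=2,
    strategy="fsdp",  # was "ddp"; shards params, grads, and optimizer state
    precision="bf16-mixed",
)
```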
| Excuse | Reality |
|---|---|
| "Let me verify single-GPU first" | DDP on 2 GPUs is your verification. Overfit-one-batch proves it. |
| "Mixed precision might cause instability" | bf16 is stable on modern hardware. Add it now, debug later only if NaN. |
| "I'll checkpoint later" | Training is expensive. Any lost work is unacceptable. Checkpoint always. |
| "FSDP because our model is big" | Start DDP. Only FSDP when DDP provably OOMs. |
| "I need to tune data loading first" | 4 workers + pin_memory + persistent_workers. Profile only if GPU is starved. |
Any of these means: stop, go back, fix the pipeline.
Before marking training infrastructure complete:
Can't check all boxes? Your pipeline isn't ready.