Configure distributed LLM training infrastructure—DDP, FSDP, DeepSpeed ZeRO, multi-node orchestration, checkpointing, fault tolerance, and mixed precision. Use when setting up torchrun/accelerate/deepspeed jobs, writing SLURM scripts, tuning NCCL, or debugging GPU memory and communication bottlenecks.
Set up, configure, and debug distributed training infrastructure for large language models—covering parallelism strategies (DDP, FSDP, DeepSpeed ZeRO), multi-node orchestration, checkpointing, fault tolerance, mixed precision, and GPU profiling.
**Scope:**
- Launch jobs with `torchrun`, `accelerate launch`, or the `deepspeed` launcher
- Shard models with FSDP (e.g. `ShardingStrategy.FULL_SHARD`) or DeepSpeed ZeRO stages 1/2/3
- Save and resume checkpoints (sharded or `safetensors` format)
- Configure mixed precision via `torch.autocast` or the DeepSpeed config
- Profile with `torch.profiler` or NVIDIA Nsight Systems
- Coordinates with the model-architecture, serving-architecture, and pretraining-pipeline skills

**Choosing a strategy:** for models that fit in a single GPU's memory, use plain DDP (`torchrun --nproc_per_node=8`). For models requiring sharding: use FSDP or DeepSpeed ZeRO.
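The DDP path above, sketched as a minimal script. The `Linear` model and random batches are stand-ins for a real model and dataloader; it is meant to be launched with `torchrun --nproc_per_node=8 train.py`, though the env-var defaults also let it run as a single local process for debugging:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# torchrun sets these; the defaults let the script also run as a
# single local process for debugging (plain `python train.py`).
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
os.environ.setdefault("RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")

backend = "nccl" if torch.cuda.is_available() else "gloo"
dist.init_process_group(backend=backend)

local_rank = int(os.environ.get("LOCAL_RANK", 0))
device = torch.device("cuda", local_rank) if torch.cuda.is_available() else torch.device("cpu")
if device.type == "cuda":
    torch.cuda.set_device(device)

model = torch.nn.Linear(128, 128).to(device)   # stand-in for the real model
ddp_model = DDP(model, device_ids=[local_rank] if device.type == "cuda" else None)
opt = torch.optim.AdamW(ddp_model.parameters(), lr=1e-4)

for step in range(5):                          # stand-in training loop
    x = torch.randn(8, 128, device=device)
    loss = ddp_model(x).pow(2).mean()
    opt.zero_grad()
    loss.backward()                            # DDP overlaps gradient allreduce with backward
    opt.step()

final_loss = loss.item()
dist.destroy_process_group()
```

Because DDP replicates the full model on every rank, this only works when parameters, gradients, and optimizer state fit on one GPU; beyond that, switch to FSDP or ZeRO.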
- **FSDP**: wrap the model with `FullyShardedDataParallel(model, sharding_strategy=ShardingStrategy.FULL_SHARD, mixed_precision=MixedPrecision(param_dtype=torch.bfloat16))`.
- **DeepSpeed ZeRO**: example stage-2 config with CPU optimizer offload: `{"zero_optimization": {"stage": 2, "offload_optimizer": {"device": "cpu"}, "allgather_bucket_size": 5e8, "reduce_bucket_size": 5e8}}`.
- **Multi-node launch**: `torchrun --nproc_per_node=8 --nnodes=2 --node_rank=$RANK --master_addr=$MASTER --master_port=29500 train.py`. Set `NCCL_IB_DISABLE=0` for InfiniBand clusters; set `NCCL_SOCKET_IFNAME=eth0` for TCP.
- **Mixed precision**: `torch.cuda.amp.GradScaler(init_scale=2**16)` with `torch.autocast("cuda", dtype=torch.float16)`. In the DeepSpeed config: `{"bf16": {"enabled": true}}`.
- **Checkpointing**: use `torch.distributed.checkpoint.save` for sharded models or `safetensors.torch.save_model()` for single-file saves. Enable async checkpointing to overlap save I/O with the forward pass. Keep the last K checkpoints; delete older ones.
- **Fault tolerance**: use `torchrun` elastic launch (`--max_restarts=3`). Implement heartbeat monitoring between nodes. Log to wandb with `WANDB_RESUME=allow` so interrupted runs resume automatically.
- **Monitoring**: run `nvidia-smi dmon -s u -d 5` during training. Target >80% GPU compute utilization; if lower, profile for communication bottlenecks. Watch for memory fragmentation via `torch.cuda.memory_stats()["allocated_bytes.all.peak"]`.
- **Profiling**: `torch.profiler.profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], schedule=torch.profiler.schedule(wait=1, warmup=1, active=3))`. Export to a Chrome trace or TensorBoard. Identify whether the bottleneck is compute-bound (increase batch size) or communication-bound (overlap allreduce, use gradient compression).
- **SLURM**: `#SBATCH --gpus-per-node=8`, `#SBATCH --ntasks-per-node=1`, `srun torchrun ...`.
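The keep-last-K checkpoint retention policy above can be sketched as a small helper. The `step_<N>` directory naming convention is an assumption for illustration; adapt the pattern to your checkpoint layout:

```python
from pathlib import Path
import re
import shutil

def prune_checkpoints(ckpt_root: str, keep_last: int = 3) -> list[str]:
    """Delete all but the newest `keep_last` checkpoint directories.

    Assumes checkpoints are directories named step_<N> under ckpt_root.
    Returns the names that were removed, oldest first.
    """
    root = Path(ckpt_root)
    pattern = re.compile(r"step_(\d+)$")
    # Sort numerically by step, not lexically (step_1000 > step_300).
    ckpts = sorted(
        (d for d in root.iterdir() if d.is_dir() and pattern.match(d.name)),
        key=lambda d: int(pattern.match(d.name).group(1)),
    )
    removed = []
    for d in (ckpts[:-keep_last] if keep_last > 0 else ckpts):
        shutil.rmtree(d)
        removed.append(d.name)
    return removed
```

Run this right after each successful save (and only after the new checkpoint is fully written) so a crash mid-save never leaves you without a resumable checkpoint.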
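A runnable sketch of the `torch.profiler` recipe above, on a toy model. The trace output filename is illustrative, and CUDA activity is added only when a GPU is present so the same snippet works on a CPU box:

```python
import torch
from torch.profiler import profile, schedule, ProfilerActivity

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

model = torch.nn.Linear(256, 256)   # toy stand-in for the real model
x = torch.randn(32, 256)

results = {}

def trace_handler(prof):
    # Called once the active window completes: summarize and export.
    results["table"] = prof.key_averages().table(
        sort_by="self_cpu_time_total", row_limit=5
    )
    prof.export_chrome_trace("trace_step.json")  # open in chrome://tracing

with profile(
    activities=activities,
    schedule=schedule(wait=1, warmup=1, active=3),
    on_trace_ready=trace_handler,
) as prof:
    for _ in range(6):                  # wait(1) + warmup(1) + active(3) + spare
        model(x).sum().backward()       # stand-in train step
        prof.step()                     # advance the profiler schedule

print(results["table"])
```

In the printed table, look for communication ops (e.g. allreduce kernels) dominating self time: that indicates a communication-bound run, whereas dominant matmul/attention kernels indicate compute-bound.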
Set `--time` conservatively and rely on checkpoint-resume for long jobs.

**Strategy tips:**
- FSDP: `SHARD_GRAD_OP` shards gradients and optimizer state only; `FULL_SHARD` also shards parameters for maximum memory savings.
- Prefer `accelerate` for single-codebase portability across DDP/FSDP/DeepSpeed.

**Deliverables:**
- **Parallelism config** — strategy chosen, DeepSpeed JSON or FSDP wrapper code, with justification
- **Launch command** — exact `torchrun` / `accelerate launch` / `deepspeed` command with all flags
- **Checkpoint plan** — format (sharded vs safetensors), frequency, retention policy, resume procedure
- **Resource budget** — GPU count, memory per GPU, estimated time, SLURM resource request

**Related skills:**
- model-architecture — parallelism strategy depends on model size and layer structure
- pretraining-pipeline — training recipe runs on top of the infrastructure configured here
- serving-architecture — checkpoint format affects serving load path
- model-merging — merging sharded checkpoints requires compatible save format

**Troubleshooting:**
- Hangs or NCCL timeouts: increase `NCCL_TIMEOUT` (default 1800s), and check for straggler GPUs with `nvidia-smi`.
- Out-of-memory: enable `activation_checkpointing` or reduce micro-batch size before switching to CPU offload.
- fp16 loss overflow: lower the `GradScaler` `init_scale`; check for overflow in gradient norms via `torch.nn.utils.clip_grad_norm_`.
- Slow checkpoint saves: use the `safetensors` format, or write to local NVMe then async-copy to shared storage.
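The gradient-norm overflow check above can be sketched as a minimal fp16 loss-scaling loop. The model and data are toy stand-ins, and `enabled=use_cuda` makes both the scaler and autocast pass-throughs for CPU-only debugging:

```python
import torch

use_cuda = torch.cuda.is_available()
device = "cuda" if use_cuda else "cpu"

# GradScaler is a no-op when enabled=False, so the same loop runs on CPU.
scaler = torch.cuda.amp.GradScaler(init_scale=2**16, enabled=use_cuda)

model = torch.nn.Linear(64, 64).to(device)    # toy stand-in
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

for step in range(4):
    x = torch.randn(16, 64, device=device)
    with torch.autocast("cuda", dtype=torch.float16, enabled=use_cuda):
        loss = model(x).pow(2).mean()
    opt.zero_grad()
    scaler.scale(loss).backward()
    scaler.unscale_(opt)  # unscale first so the norm is in real units
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    if not torch.isfinite(grad_norm):
        print(f"step {step}: overflow detected, scale={scaler.get_scale()}")
    scaler.step(opt)      # automatically skipped if grads are inf/nan
    scaler.update()       # grows the scale on success, shrinks it on overflow
```

If overflows persist across many steps, lower `init_scale` (e.g. `2**12`) rather than clipping harder; repeated skipped steps stall training silently.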