Configure distributed LLM training infrastructure—DDP, FSDP, DeepSpeed ZeRO, multi-node orchestration, checkpointing, fault tolerance, and mixed precision. Use when setting up torchrun/accelerate/deepspeed jobs, writing SLURM scripts, tuning NCCL, or debugging GPU memory and communication bottlenecks.
Set up, configure, and debug distributed training infrastructure for large language models—covering parallelism strategies (DDP, FSDP, DeepSpeed ZeRO), multi-node orchestration, checkpointing, fault tolerance, mixed precision, and GPU profiling.
**Scope:**
- Launch jobs with `torchrun`, `accelerate launch`, or the `deepspeed` launcher
- Shard models with FSDP (e.g. `ShardingStrategy.FULL_SHARD`) or DeepSpeed ZeRO stages 1/2/3
- Save and resume checkpoints (sharded or `safetensors` format)
- Configure mixed precision via `torch.autocast` or the DeepSpeed config
- Profile with `torch.profiler` or NVIDIA Nsight Systems
- Coordinates with the model-architecture, serving-architecture, and pretraining-pipeline skills

**Choosing a strategy:** for models that fit in a single GPU's memory, use plain DDP (`torchrun --nproc_per_node=8`). For models requiring sharding: use FSDP or DeepSpeed ZeRO.
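The DDP path above, sketched as a minimal script. The `Linear` model and random batches are stand-ins for a real model and dataloader; it is meant to be launched with `torchrun --nproc_per_node=8 train.py`, though the env-var defaults also let it run as a single local process for debugging:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# torchrun sets these; the defaults let the script also run as a
# single local process for debugging (plain `python train.py`).
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
os.environ.setdefault("RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")

backend = "nccl" if torch.cuda.is_available() else "gloo"
dist.init_process_group(backend=backend)

local_rank = int(os.environ.get("LOCAL_RANK", 0))
device = torch.device("cuda", local_rank) if torch.cuda.is_available() else torch.device("cpu")
if device.type == "cuda":
    torch.cuda.set_device(device)

model = torch.nn.Linear(128, 128).to(device)   # stand-in for the real model
ddp_model = DDP(model, device_ids=[local_rank] if device.type == "cuda" else None)
opt = torch.optim.AdamW(ddp_model.parameters(), lr=1e-4)

for step in range(5):                          # stand-in training loop
    x = torch.randn(8, 128, device=device)
    loss = ddp_model(x).pow(2).mean()
    opt.zero_grad()
    loss.backward()                            # DDP overlaps gradient allreduce with backward
    opt.step()

final_loss = loss.item()
dist.destroy_process_group()
```

Because DDP replicates the full model on every rank, this only works when parameters, gradients, and optimizer state fit on one GPU; beyond that, switch to FSDP or ZeRO.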
- **FSDP**: wrap the model with `FullyShardedDataParallel(model, sharding_strategy=ShardingStrategy.FULL_SHARD, mixed_precision=MixedPrecision(param_dtype=torch.bfloat16))`.
- **DeepSpeed ZeRO**: example stage-2 config with CPU optimizer offload: `{"zero_optimization": {"stage": 2, "offload_optimizer": {"device": "cpu"}, "allgather_bucket_size": 5e8, "reduce_bucket_size": 5e8}}`.
- **Multi-node launch**: `torchrun --nproc_per_node=8 --nnodes=2 --node_rank=$RANK --master_addr=$MASTER --master_port=29500 train.py`. Set `NCCL_IB_DISABLE=0` for InfiniBand clusters; set `NCCL_SOCKET_IFNAME=eth0` for TCP.
- **Mixed precision**: `torch.cuda.amp.GradScaler(init_scale=2**16)` with `torch.autocast("cuda", dtype=torch.float16)`. In the DeepSpeed config: `{"bf16": {"enabled": true}}`.
- **Checkpointing**: use `torch.distributed.checkpoint.save` for sharded models or `safetensors.torch.save_model()` for single-file saves. Enable async checkpointing to overlap save I/O with the forward pass. Keep the last K checkpoints; delete older ones.
- **Fault tolerance**: use `torchrun` elastic launch (`--max_restarts=3`). Implement heartbeat monitoring between nodes. Log to wandb with `WANDB_RESUME=allow` so interrupted runs resume automatically.
- **Monitoring**: run `nvidia-smi dmon -s u -d 5` during training. Target >80% GPU compute utilization; if lower, profile for communication bottlenecks. Watch for memory fragmentation via `torch.cuda.memory_stats()["allocated_bytes.all.peak"]`.
- **Profiling**: `torch.profiler.profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], schedule=torch.profiler.schedule(wait=1, warmup=1, active=3))`. Export to a Chrome trace or TensorBoard. Identify whether the bottleneck is compute-bound (increase batch size) or communication-bound (overlap allreduce, use gradient compression).
- **SLURM**: `#SBATCH --gpus-per-node=8`, `#SBATCH --ntasks-per-node=1`, `srun torchrun ...`.
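The keep-last-K checkpoint retention policy above can be sketched as a small helper. The `step_<N>` directory naming convention is an assumption for illustration; adapt the pattern to your checkpoint layout:

```python
from pathlib import Path
import re
import shutil

def prune_checkpoints(ckpt_root: str, keep_last: int = 3) -> list[str]:
    """Delete all but the newest `keep_last` checkpoint directories.

    Assumes checkpoints are directories named step_<N> under ckpt_root.
    Returns the names that were removed, oldest first.
    """
    root = Path(ckpt_root)
    pattern = re.compile(r"step_(\d+)$")
    # Sort numerically by step, not lexically (step_1000 > step_300).
    ckpts = sorted(
        (d for d in root.iterdir() if d.is_dir() and pattern.match(d.name)),
        key=lambda d: int(pattern.match(d.name).group(1)),
    )
    removed = []
    for d in (ckpts[:-keep_last] if keep_last > 0 else ckpts):
        shutil.rmtree(d)
        removed.append(d.name)
    return removed
```

Run this right after each successful save (and only after the new checkpoint is fully written) so a crash mid-save never leaves you without a resumable checkpoint.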
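A runnable sketch of the `torch.profiler` recipe above, on a toy model. The trace output filename is illustrative, and CUDA activity is added only when a GPU is present so the same snippet works on a CPU box:

```python
import torch
from torch.profiler import profile, schedule, ProfilerActivity

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

model = torch.nn.Linear(256, 256)   # toy stand-in for the real model
x = torch.randn(32, 256)

results = {}

def trace_handler(prof):
    # Called once the active window completes: summarize and export.
    results["table"] = prof.key_averages().table(
        sort_by="self_cpu_time_total", row_limit=5
    )
    prof.export_chrome_trace("trace_step.json")  # open in chrome://tracing

with profile(
    activities=activities,
    schedule=schedule(wait=1, warmup=1, active=3),
    on_trace_ready=trace_handler,
) as prof:
    for _ in range(6):                  # wait(1) + warmup(1) + active(3) + spare
        model(x).sum().backward()       # stand-in train step
        prof.step()                     # advance the profiler schedule

print(results["table"])
```

In the printed table, look for communication ops (e.g. allreduce kernels) dominating self time: that indicates a communication-bound run, whereas dominant matmul/attention kernels indicate compute-bound.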
Set `--time` conservatively and rely on checkpoint-resume for long jobs.

**Strategy tips:**
- FSDP: `SHARD_GRAD_OP` shards gradients and optimizer state only; `FULL_SHARD` also shards parameters for maximum memory savings.
- Prefer `accelerate` for single-codebase portability across DDP/FSDP/DeepSpeed.

**Deliverables:**
- **Parallelism config** — strategy chosen, DeepSpeed JSON or FSDP wrapper code, with justification
- **Launch command** — exact `torchrun` / `accelerate launch` / `deepspeed` command with all flags
- **Checkpoint plan** — format (sharded vs safetensors), frequency, retention policy, resume procedure
- **Resource budget** — GPU count, memory per GPU, estimated time, SLURM resource request

**Related skills:**
- model-architecture — parallelism strategy depends on model size and layer structure
- pretraining-pipeline — training recipe runs on top of the infrastructure configured here
- serving-architecture — checkpoint format affects serving load path
- model-merging — merging sharded checkpoints requires compatible save format

**Troubleshooting:**
- Hangs or NCCL timeouts: increase `NCCL_TIMEOUT` (default 1800s), and check for straggler GPUs with `nvidia-smi`.
- Out-of-memory: enable `activation_checkpointing` or reduce micro-batch size before switching to CPU offload.
- fp16 loss overflow: lower the `GradScaler` `init_scale`; check for overflow in gradient norms via `torch.nn.utils.clip_grad_norm_`.
- Slow checkpoint saves: use the `safetensors` format, or write to local NVMe then async-copy to shared storage.
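The gradient-norm overflow check above can be sketched as a minimal fp16 loss-scaling loop. The model and data are toy stand-ins, and `enabled=use_cuda` makes both the scaler and autocast pass-throughs for CPU-only debugging:

```python
import torch

use_cuda = torch.cuda.is_available()
device = "cuda" if use_cuda else "cpu"

# GradScaler is a no-op when enabled=False, so the same loop runs on CPU.
scaler = torch.cuda.amp.GradScaler(init_scale=2**16, enabled=use_cuda)

model = torch.nn.Linear(64, 64).to(device)    # toy stand-in
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

for step in range(4):
    x = torch.randn(16, 64, device=device)
    with torch.autocast("cuda", dtype=torch.float16, enabled=use_cuda):
        loss = model(x).pow(2).mean()
    opt.zero_grad()
    scaler.scale(loss).backward()
    scaler.unscale_(opt)  # unscale first so the norm is in real units
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    if not torch.isfinite(grad_norm):
        print(f"step {step}: overflow detected, scale={scaler.get_scale()}")
    scaler.step(opt)      # automatically skipped if grads are inf/nan
    scaler.update()       # grows the scale on success, shrinks it on overflow
```

If overflows persist across many steps, lower `init_scale` (e.g. `2**12`) rather than clipping harder; repeated skipped steps stall training silently.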