Framework Internals
Training Infrastructure
Configure distributed LLM training infrastructure—DDP, FSDP, DeepSpeed ZeRO, multi-node orchestration, checkpointing, fault tolerance, and mixed precision. Use when setting up torchrun/accelerate/deepspeed jobs, writing SLURM scripts, tuning NCCL, or debugging GPU memory and communication bottlenecks.
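As one concrete instance of the SLURM + torchrun setup described above, here is a minimal two-node launch sketch. The job name, script name `train.py`, network interface `eth0`, and rendezvous port 29500 are illustrative assumptions to adapt to your cluster, not fixed values:

```shell
#!/bin/bash
#SBATCH --job-name=llm-train        # hypothetical job name
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1         # one launcher task per node; torchrun spawns the workers
#SBATCH --gpus-per-node=8
#SBATCH --time=24:00:00

# NCCL knobs commonly set for multi-node debugging/tuning
export NCCL_DEBUG=INFO              # log topology setup and communication errors
export NCCL_SOCKET_IFNAME=eth0      # assumption: set to your node's interconnect interface

# Rendezvous on the first node of the allocation
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

srun torchrun \
  --nnodes="$SLURM_NNODES" \
  --nproc_per_node=8 \
  --rdzv_backend=c10d \
  --rdzv_endpoint="$MASTER_ADDR:29500" \
  train.py                          # hypothetical training entrypoint
```

The same rendezvous flags work unchanged when switching the entrypoint to `accelerate launch` or `deepspeed`, since all three launchers populate the standard `RANK`/`WORLD_SIZE`/`MASTER_ADDR` environment for the workers.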