**WORKFLOW SKILL** — Training & fine-tuning: PyTorch training loops, distributed training (FSDP/DDP/DeepSpeed), MoE training (DeepSpeed-MoE, expert parallelism), LoRA/QLoRA, mixed precision, hyperparameter tuning, curriculum learning, and open-weight baseline selection. USE FOR: writing training code, configuring distributed training, fine-tuning pre-trained models, debugging training issues, selecting current open-weight baselines, and training sparse MoE models with expert parallelism. USE WHEN: training a model from scratch, fine-tuning, debugging loss curves, scaling to multiple GPUs, choosing a family-specific base model, or configuring MoE expert routing and load balance.
Train and fine-tune ML models with PyTorch, from single-GPU to multi-node distributed.
| Concept | Description |
|---|---|
| Training Loop | Forward → loss → backward → optimizer step → log |
| Mixed Precision | bf16/fp16 forward pass, fp32 master weights |
| Gradient Accumulation | Simulate larger batch size across steps |
| Gradient Checkpointing | Recompute activations to save memory |
| Learning Rate Schedule | Warmup → cosine/linear decay |
| LoRA | Low-rank adapters, train only adapter weights |
| FSDP | Shard parameters, gradients, optimizer across GPUs |
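The loop, accumulation, and schedule rows above can be combined in a few lines of PyTorch. This is a minimal single-GPU-style sketch with a toy model and synthetic data; the model, dataset, and every hyperparameter here are placeholders, not recommendations:

```python
import torch
from torch import nn

# Toy model and synthetic data stand in for a real setup.
model = nn.Linear(10, 1)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
# Warmup -> decay schedule, expressed as a multiplicative LR factor.
sched = torch.optim.lr_scheduler.LambdaLR(
    opt, lambda s: min((s + 1) / 2, max(0.0, (8 - s) / 6)))
accum_steps = 4  # effective batch = accum_steps * micro-batch size
data = [(torch.randn(8, 10), torch.randn(8, 1)) for _ in range(8)]

losses = []
for step, (x, y) in enumerate(data):
    loss = nn.functional.mse_loss(model(x), y) / accum_steps
    loss.backward()  # gradients accumulate across micro-batches
    if (step + 1) % accum_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        opt.step()
        opt.zero_grad(set_to_none=True)
        sched.step()
    losses.append(loss.item() * accum_steps)  # log the unscaled loss
```

On a real GPU run you would additionally wrap the forward pass in `torch.autocast(device_type="cuda", dtype=torch.bfloat16)` for the mixed-precision row; it is omitted here so the sketch runs anywhere.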
See ../_shared/references/llm-landscape.md when the model choice is still open.

Choose strategy based on model size:
| Strategy | Model Size | GPU Count | Memory Savings |
|---|---|---|---|
| DDP | Fits on 1 GPU | 2-8 | None (replication) |
| FSDP | Too large for 1 GPU | 2-128+ | Shard params + grads + optimizer |
| DeepSpeed ZeRO-2 | Large models | 2-64 | Shard grads + optimizer |
| DeepSpeed ZeRO-3 | Very large models | 8-256+ | Full sharding (like FSDP) |
| DeepSpeed-MoE | MoE models | 8-256+ | Expert parallelism (ep_size), capacity_factor, load balance loss. See skills/cutting-edge-architectures/references/moe-sparse-routing.md |
| LoRA | Any pretrained model | 1+ | Train only adapters (~1% params) |
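For the FSDP row, a hedged setup sketch follows. It assumes a multi-GPU node launched with `torchrun`; the `nn.Transformer` model, batch shapes, learning rate, and the dummy objective are placeholders, not a real training recipe:

```python
import torch
import torch.distributed as dist
from torch import nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import MixedPrecision


def train_fsdp():
    # Launch with: torchrun --nproc_per_node=8 train.py
    dist.init_process_group("nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

    model = nn.Transformer(d_model=512, nhead=8).cuda()  # placeholder model
    model = FSDP(
        model,
        mixed_precision=MixedPrecision(
            param_dtype=torch.bfloat16,  # bf16 compute
            reduce_dtype=torch.float32,  # fp32 gradient reduction
        ),
    )
    opt = torch.optim.AdamW(model.parameters(), lr=2e-4)

    for _ in range(10):
        src = torch.randn(16, 4, 512, device="cuda")
        tgt = torch.randn(16, 4, 512, device="cuda")
        loss = model(src, tgt).pow(2).mean()  # dummy objective
        loss.backward()
        opt.step()
        opt.zero_grad(set_to_none=True)

    dist.destroy_process_group()
```

FSDP shards parameters, gradients, and optimizer state across ranks automatically; pairing it with the `MixedPrecision` policy shown here gives bf16 compute with fp32 reductions, matching the table's memory-savings column.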
LoRA Configuration (reference: Kyutai moshi-finetune):