**WORKFLOW SKILL** — Training & fine-tuning: PyTorch training loops, distributed training (FSDP/DDP/DeepSpeed), MoE training (DeepSpeed-MoE, expert parallelism), LoRA/QLoRA, mixed precision, hyperparameter tuning, curriculum learning, and open-weight baseline selection. USE FOR: writing training code, configuring distributed training, fine-tuning pre-trained models, debugging training issues, selecting current open-weight baselines, and training sparse MoE models with expert parallelism. USE WHEN: training a model from scratch, fine-tuning, debugging loss curves, scaling to multiple GPUs, choosing a family-specific base model, or configuring MoE expert routing and load balance.
Train and fine-tune ML models with PyTorch, from single-GPU to multi-node distributed.
| Concept | Description |
|---|---|
| Training Loop | Forward → loss → backward → optimizer step → log |
| Mixed Precision | bf16/fp16 forward pass, fp32 master weights |
| Gradient Accumulation | Simulate larger batch size across steps |
| Gradient Checkpointing | Recompute activations to save memory |
| Learning Rate Schedule | Warmup → cosine/linear decay |
| LoRA | Low-rank adapters, train only adapter weights |
| FSDP | Shard parameters, gradients, optimizer across GPUs |
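The loop, accumulation, and schedule rows above can be combined in a few lines of PyTorch. This is a minimal single-GPU-style sketch with a toy model and synthetic data; the model, dataset, and every hyperparameter here are placeholders, not recommendations:

```python
import torch
from torch import nn

# Toy model and synthetic data stand in for a real setup.
model = nn.Linear(10, 1)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
# Warmup -> decay schedule, expressed as a multiplicative LR factor.
sched = torch.optim.lr_scheduler.LambdaLR(
    opt, lambda s: min((s + 1) / 2, max(0.0, (8 - s) / 6)))
accum_steps = 4  # effective batch = accum_steps * micro-batch size
data = [(torch.randn(8, 10), torch.randn(8, 1)) for _ in range(8)]

losses = []
for step, (x, y) in enumerate(data):
    loss = nn.functional.mse_loss(model(x), y) / accum_steps
    loss.backward()  # gradients accumulate across micro-batches
    if (step + 1) % accum_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        opt.step()
        opt.zero_grad(set_to_none=True)
        sched.step()
    losses.append(loss.item() * accum_steps)  # log the unscaled loss
```

On a real GPU run you would additionally wrap the forward pass in `torch.autocast(device_type="cuda", dtype=torch.bfloat16)` for the mixed-precision row; it is omitted here so the sketch runs anywhere.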
See ../_shared/references/llm-landscape.md when the model choice is still open.

Choose strategy based on model size:
| Strategy | Model Size | GPU Count | Memory Savings |
|---|---|---|---|
| DDP | Fits on 1 GPU | 2-8 | None (replication) |
| FSDP | Too large for 1 GPU | 2-128+ | Shard params + grads + optimizer |
| DeepSpeed ZeRO-2 | Large models | 2-64 | Shard grads + optimizer |
| DeepSpeed ZeRO-3 | Very large models | 8-256+ | Full sharding (like FSDP) |
| DeepSpeed-MoE | MoE models | 8-256+ | Expert parallelism (ep_size), capacity_factor, load balance loss. See skills/cutting-edge-architectures/references/moe-sparse-routing.md |
| LoRA | Any pretrained model | 1+ | Train only adapters (~1% params) |
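For the FSDP row, a hedged setup sketch follows. It assumes a multi-GPU node launched with `torchrun`; the `nn.Transformer` model, batch shapes, learning rate, and the dummy objective are placeholders, not a real training recipe:

```python
import torch
import torch.distributed as dist
from torch import nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import MixedPrecision


def train_fsdp():
    # Launch with: torchrun --nproc_per_node=8 train.py
    dist.init_process_group("nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

    model = nn.Transformer(d_model=512, nhead=8).cuda()  # placeholder model
    model = FSDP(
        model,
        mixed_precision=MixedPrecision(
            param_dtype=torch.bfloat16,  # bf16 compute
            reduce_dtype=torch.float32,  # fp32 gradient reduction
        ),
    )
    opt = torch.optim.AdamW(model.parameters(), lr=2e-4)

    for _ in range(10):
        src = torch.randn(16, 4, 512, device="cuda")
        tgt = torch.randn(16, 4, 512, device="cuda")
        loss = model(src, tgt).pow(2).mean()  # dummy objective
        loss.backward()
        opt.step()
        opt.zero_grad(set_to_none=True)

    dist.destroy_process_group()
```

FSDP shards parameters, gradients, and optimizer state across ranks automatically; pairing it with the `MixedPrecision` policy shown here gives bf16 compute with fp32 reductions, matching the table's memory-savings column.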
LoRA Configuration (reference: Kyutai moshi-finetune):