Build a complete PyTorch training loop with best practices. Use when the user says "training loop", "train a model", "PyTorch training", "write the training code", "fit a model", or "set up training".
Build a production-quality training loop from scratch with proper structure.
Data pipeline:
- Subclass `torch.utils.data.Dataset` for your data source.
- Build the `DataLoader` with:
  - `num_workers=os.cpu_count()` (tune down if I/O-bound or memory-constrained)
  - `pin_memory=True` (CUDA only)
  - a custom `collate_fn` for variable-length inputs
  - `drop_last=True` for training (stable batch-norm statistics)
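A minimal sketch of that configuration (the dataset, sizes, and worker cap are illustrative, not prescriptive):

```python
import os

import torch
from torch.utils.data import DataLoader, Dataset


class ToyDataset(Dataset):
    """Illustrative Dataset; replace with your real data source."""

    def __init__(self, n=100):
        self.x = torch.randn(n, 10)
        self.y = torch.randint(0, 2, (n,))

    def __len__(self):
        return len(self.x)

    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]


train_loader = DataLoader(
    ToyDataset(),
    batch_size=32,
    shuffle=True,
    num_workers=min(os.cpu_count() or 1, 4),  # cap workers on large machines
    pin_memory=torch.cuda.is_available(),     # only helps CPU-to-GPU transfers
    drop_last=True,                           # stable batch statistics for BatchNorm
)
```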
Use a proper train/val split. Never evaluate on training data.
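One way to get a reproducible split is `random_split` with a seeded generator (the 80/20 ratio here is an example, not a rule):

```python
import torch
from torch.utils.data import TensorDataset, random_split

dataset = TensorDataset(torch.randn(100, 10), torch.randint(0, 2, (100,)))
generator = torch.Generator().manual_seed(42)  # same split on every run
train_set, val_set = random_split(dataset, [80, 20], generator=generator)
```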
Model and optimizer setup:
- `model.to(device)` after construction.
- `torch.compile(model)` for PyTorch 2.0+ (stable models, repeated forward passes).
- Count trainable parameters: `sum(p.numel() for p in model.parameters() if p.requires_grad)`.
- Apply `weight_decay` only to non-bias, non-LayerNorm parameters.
- Create a `GradScaler` if using mixed precision.
- Call `model.train()` before the training loop.
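One common way to exclude biases and norm parameters from weight decay is the dimensionality heuristic below (1-D parameters are biases and norm weights); the model and decay value are illustrative:

```python
import torch
from torch import nn

model = nn.Sequential(nn.Linear(10, 32), nn.LayerNorm(32), nn.Linear(32, 2))

decay, no_decay = [], []
for name, param in model.named_parameters():
    if not param.requires_grad:
        continue
    # 1-D params are biases and LayerNorm weights; keep them out of weight decay.
    (no_decay if param.ndim <= 1 else decay).append(param)

optimizer = torch.optim.AdamW([
    {"params": decay, "weight_decay": 0.01},
    {"params": no_decay, "weight_decay": 0.0},
])
```

Matching on parameter names (`"bias"`, `"norm"`) works too; the `ndim` test is just shorter and catches most norm layers.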
```python
model.train()
for batch in train_loader:
    inputs, targets = batch  # unpack; adapt to your batch structure
    inputs = inputs.to(device, non_blocking=True)
    targets = targets.to(device, non_blocking=True)
    optimizer.zero_grad(set_to_none=True)  # sets grads to None instead of zeroing in place
    with torch.autocast(device_type='cuda', dtype=torch.float16):  # mixed precision
        outputs = model(inputs)
        loss = criterion(outputs, targets)
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)  # unscale before clipping so the norm is measured correctly
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)
    scaler.update()
    scheduler.step()  # only if using a per-step (not per-epoch) scheduler
```
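Put together, a minimal end-to-end run of that loop (toy model and synthetic data; AMP is enabled only when CUDA is present, and the disabled `GradScaler` degrades to plain `optimizer.step()`):

```python
import torch
from torch import nn

torch.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"
use_amp = device == "cuda"

model = nn.Linear(10, 2).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)
criterion = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)  # pass-through when disabled

x = torch.randn(64, 10, device=device)
y = torch.randint(0, 2, (64,), device=device)

model.train()
losses = []
for step in range(20):
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type=device, enabled=use_amp):
        loss = criterion(model(x), y)
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)
    scaler.update()
    scheduler.step()
    losses.append(loss.item())  # a toy loop; in real training, avoid .item() per step
```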
```python
model.eval()
total_loss = 0.0  # accumulate metrics as Python floats, NOT tensors
with torch.no_grad():
    for batch in val_loader:
        ...
        total_loss += loss.item()  # .item() is fine here, outside the training hot path
```
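That accumulation pattern as a complete function (the model, loader, and data here are placeholders so the sketch is self-contained):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss()
val_loader = DataLoader(
    TensorDataset(torch.randn(40, 10), torch.randint(0, 2, (40,))),
    batch_size=8,
)


def evaluate(model, loader):
    model.eval()
    total_loss, n_batches = 0.0, 0
    with torch.no_grad():
        for inputs, targets in loader:
            loss = criterion(model(inputs), targets)
            total_loss += loss.item()  # Python float; no graph or GPU memory retained
            n_batches += 1
    return total_loss / max(n_batches, 1)


val_loss = evaluate(model, val_loader)
```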
Save model state, optimizer state, scheduler state, epoch, and best metric. Use atomic writes (save to temp, then rename). Keep last N checkpoints.
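The atomic write can be sketched as follows (the function name and state layout are illustrative; `os.replace` provides the atomic rename):

```python
import os
import tempfile

import torch


def save_checkpoint(state: dict, path: str) -> None:
    """Serialize to a temp file in the target directory, then rename over the target."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            torch.save(state, f)
        os.replace(tmp_path, path)  # atomic on POSIX and Windows
    except BaseException:
        os.remove(tmp_path)  # never leave a half-written checkpoint behind
        raise
```

Writing the temp file in the same directory matters: `os.replace` is only atomic within one filesystem.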
Choose one: TensorBoard (SummaryWriter), W&B (wandb.log), or structured JSON. Log loss, learning rate, gradient norms, and validation metrics per epoch.
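If you choose structured JSON, an append-only JSON-lines logger is enough (the function and field names are illustrative):

```python
import json
import time


def log_metrics(log_path: str, epoch: int, **metrics) -> None:
    """Append one JSON record per epoch; easy to grep or load into pandas later."""
    record = {"epoch": epoch, "wall_time": time.time(), **metrics}
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")
```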
Common pitfalls:
- Calling `.item()` inside the training loop (forces a GPU sync every step).
- Accumulating metrics as tensors instead of float accumulators (retains graph and GPU memory).
- Forgetting to switch `model.train()` / `model.eval()`.
- `optimizer.zero_grad()` without `set_to_none=True` (wastes memory).