Diagnose, monitor, and debug ML model training runs. Use this skill when the user wants to: understand why a model isn't learning, diagnose training instability (NaN loss, spikes, divergence), interpret loss curves and training metrics, decide whether to adjust hyperparameters mid-run, figure out if a run is worth continuing or should be killed, debug quantization-aware training issues, or understand gradient behavior. Trigger when the user mentions: loss curves, training logs, gradient norms, learning rate schedules, NaN loss, training divergence, "model isn't learning", "loss is stuck", "should I kill this run", validation loss, overfitting/underfitting, warmup, cooldown, weight decay tuning, or any question about how training is going. Also use when someone pastes training logs and wants interpretation.
You are helping someone understand and fix their training runs. Training ML models is mostly watching numbers and making judgment calls about whether those numbers look right. This skill is about developing that judgment.
The loss curve is the single most informative signal during training. Here's what different shapes tell you.
```
Loss
|
|\.
| \.
|  \..
|    '...
|        '''''....___
|_________________________ Steps
```
Smooth, monotonically decreasing, with diminishing returns. Validation loss tracks training loss with a small gap. This is what you want.
Flat start then sudden drop (S-curve): Normal for some architectures (ternary models, very deep networks). The model is learning internal representations before they start helping. Don't kill the run during the flat part -- wait at least 2x longer than the flat period before deciding it's not learning.
Occasional small spikes (2-3x the running average) are usually fine, especially early in training. The model recovers. Worry when:
- Spikes grow larger or more frequent as training goes on
- The loss doesn't recover to its pre-spike trend within a few hundred steps
- A spike is followed by a new, permanently higher plateau
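A minimal running-average spike check, as a sketch -- the window size and 3x factor are judgment calls, not fixed constants:

```python
from collections import deque

def is_spike(loss, history, window=100, factor=3.0):
    """Flag a loss value that exceeds `factor` times the running average.

    history: deque of recent loss values, maintained by the caller.
    Returns False until the window has filled, to avoid noisy early flags.
    """
    if len(history) < window:
        return False
    running_avg = sum(history) / len(history)
    return loss > factor * running_avg

# Usage inside a training loop:
history = deque(maxlen=100)
for step, loss in enumerate([1.0] * 100 + [0.9, 4.0]):
    if is_spike(loss, history):
        print(f"step {step}: loss {loss:.2f} spiked vs running avg")
    history.append(loss)
```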
Loss goes to NaN: Almost always one of:
- Learning rate too high, especially without warmup
- A numerical operation producing NaN/Inf: log(0), division by zero, sqrt of a negative
- fp16/bf16 overflow in mixed-precision training
- Corrupt or extreme values in a batch of data
Debug by: lowering LR 10x, adding gradient clipping, checking for numerical operations that could produce NaN, printing intermediate activations to find where the NaN first appears.
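To find where the NaN first appears without sprinkling print statements, forward hooks work well. A debugging sketch (the `add_nan_hooks` helper is illustrative, not a library API):

```python
import torch
import torch.nn as nn

def add_nan_hooks(model: nn.Module):
    """Register forward hooks that raise on the first module whose output
    contains NaN/Inf, naming the culprit. Remove the hooks (h.remove())
    once you've localized the problem -- they slow down every step."""
    handles = []
    for name, module in model.named_modules():
        def hook(mod, inputs, output, name=name):
            if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
                raise RuntimeError(f"non-finite output first appeared in: {name}")
        handles.append(module.register_forward_hook(hook))
    return handles

# Usage: attach, run a forward pass, read the module name from the error.
model = nn.Sequential(nn.Linear(4, 4), nn.ReLU())
handles = add_nan_hooks(model)
model(torch.ones(2, 4))  # finite input: passes silently
```

Child hooks fire before the parent's, so the error names the deepest module that first produced a non-finite value.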
Loss plateaus then drops: This is often grokking -- the model memorizes first, then suddenly generalizes. More common with weight decay and smaller models. Usually a good sign, but verify the drop corresponds to actual generalization by checking validation metrics.
Validation loss diverges from training loss: Overfitting. Training loss keeps improving but validation loss gets worse. Solutions:
- More data or data augmentation
- More regularization (weight decay, dropout)
- A smaller model
- Early stopping at the validation minimum
Loss oscillates without converging: Learning rate too high, or batch size too small. Try: halving the learning rate, doubling the batch size, or both.
Training loss: The primary signal. Should decrease. Log it per step and per epoch.
Validation loss: Check every 100-500 steps. If it diverges from training loss, you're overfitting.
Learning rate: Especially important with schedules. Verify the schedule is doing what you think. Plot it.
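One way to verify a schedule is to dry-run it before training and print the values at a few milestones. A hand-written warmup + cosine sketch (every constant here is illustrative, not a recommendation):

```python
import math

def lr_at(step, max_lr=3e-4, warmup_steps=1000, total_steps=10000, min_lr=3e-5):
    """Linear warmup followed by cosine decay to min_lr, written out
    explicitly so the schedule can be inspected without running training."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps          # linear warmup
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1 + math.cos(math.pi * min(progress, 1.0)))
    return min_lr + (max_lr - min_lr) * cosine             # cosine decay

# Dry-run a few checkpoints instead of trusting the config.
for step in [0, 500, 1000, 5000, 10000]:
    print(step, f"{lr_at(step):.2e}")
```

The same dry-run works with a real `torch.optim.lr_scheduler` object: step it in a loop without an optimizer update and record `get_last_lr()`.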
Step time (ms): Should be stable. If it creeps up, you may have a memory leak or data loading bottleneck. Sudden increases often mean the model fell off GPU onto CPU for some operation.
Memory usage: Should be stable after the first few steps. Increasing memory = memory leak (usually from accumulating computation graphs).
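The most common accumulating-graph leak is summing the loss tensor itself across steps, which keeps every step's computation graph alive. A sketch of the bug and the fix:

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.01)

running_loss = 0.0
for _ in range(10):
    loss = (model(torch.randn(4, 8)) ** 2).mean()
    loss.backward()
    opt.step()
    opt.zero_grad()

    # BUG: `running_loss += loss` would keep each step's graph alive,
    # so memory grows every step until OOM.
    # FIX: convert to a Python float before accumulating:
    running_loss += loss.item()

print(f"mean loss: {running_loss / 10:.4f}")
```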
Gradient norm: The L2 norm of all gradients. Healthy range depends on model size, but:
- It should stay roughly stable (same order of magnitude) through most of training
- A sudden spike usually precedes a loss spike -- log both so you can correlate them
- A norm decaying toward zero suggests vanishing gradients; steady growth suggests brewing instability
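Computing the global norm by hand is a few lines; it is the same quantity `torch.nn.utils.clip_grad_norm_` measures (and returns) before clipping:

```python
import torch
import torch.nn as nn

def global_grad_norm(model: nn.Module) -> float:
    """L2 norm over all parameter gradients, computed after backward()."""
    total = 0.0
    for p in model.parameters():
        if p.grad is not None:
            total += p.grad.detach().float().norm(2).item() ** 2
    return total ** 0.5

model = nn.Linear(4, 2)
loss = model(torch.ones(1, 4)).sum()
loss.backward()
print(f"grad norm: {global_grad_norm(model):.4f}")
```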
Weight statistics: Mean, std, min, max of each parameter group. If weights grow unbounded or collapse to zero, something is wrong.
Activation statistics: Same thing, but for intermediate activations. Dead ReLU neurons (always zero) indicate the model is losing capacity.
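Dead units can be counted with a forward hook. A sketch (the `dead_relu_fraction` helper is illustrative; "dead" here means zero on every example in one batch, so track it across many batches before concluding anything):

```python
import torch
import torch.nn as nn

def dead_relu_fraction(model: nn.Module, batch: torch.Tensor) -> dict:
    """For each ReLU, the fraction of output units that are zero for every
    example in `batch`. Persistently high values indicate dead neurons."""
    fractions, handles = {}, []
    for name, module in model.named_modules():
        if isinstance(module, nn.ReLU):
            def hook(mod, inp, out, name=name):
                dead = (out == 0).all(dim=0)          # zero on all rows
                fractions[name] = dead.float().mean().item()
            handles.append(module.register_forward_hook(hook))
    with torch.no_grad():
        model(batch)
    for h in handles:
        h.remove()
    return fractions

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU())
print(dead_relu_fraction(model, torch.randn(16, 4)))
```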
Gradient statistics per layer: Are all layers receiving gradients? Is one layer's gradient 1000x larger than another's? This indicates an imbalance that will cause some layers to train much faster than others.
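Per-layer norms are a one-liner over `named_parameters()`; eyeballing the spread is usually enough to spot a 1000x imbalance:

```python
import torch
import torch.nn as nn

def per_layer_grad_norms(model: nn.Module) -> dict:
    """Per-parameter gradient L2 norms. A layer whose norm is ~1000x
    another's will train at a wildly different speed."""
    return {name: p.grad.detach().norm(2).item()
            for name, p in model.named_parameters() if p.grad is not None}

model = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 1))
model(torch.ones(2, 4)).sum().backward()
for name, norm in per_layer_grad_norms(model).items():
    print(f"{name:12s} {norm:.4f}")
```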
QAT has its own failure modes:
Quantization gap widens during training: The difference between quantized and unquantized loss grows over time. This means the model is learning features that can't survive quantization. Solutions:
- Start QAT earlier, so the model never learns quantization-fragile features
- Use finer-grained (e.g. per-group) quantization scales
- Check that the scale factors are actually tracking the weight distribution
Ternary weight collapse: All weights in a layer drift to the same ternary value (usually 0). The layer is effectively dead. Solutions:
- Check the layer's scale/threshold -- a threshold that is too wide maps everything to 0
- Lower the learning rate or weight decay for the affected layer
- Monitor the per-layer ternary distribution so you catch the drift early
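A quick way to watch for collapse is to log the ternary bucket fractions per layer. A sketch assuming a simple symmetric-threshold scheme -- a real QAT setup should reuse its own quantizer here:

```python
import torch
import torch.nn as nn

def ternary_distribution(weight: torch.Tensor, threshold: float = 0.05) -> dict:
    """Fraction of weights that would quantize to -1 / 0 / +1 under a
    symmetric threshold (illustrative scheme, not any specific library's).
    A layer approaching 100% in a single bucket is collapsing."""
    neg = (weight < -threshold).float().mean().item()
    pos = (weight > threshold).float().mean().item()
    return {"-1": neg, "0": 1.0 - neg - pos, "+1": pos}

layer = nn.Linear(64, 64)
print(ternary_distribution(layer.weight.detach()))
```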
Mixed-precision instability: When some layers are quantized and others aren't, the gradient scales can be wildly different. Solutions:
- Log per-layer gradient norms and confirm the imbalance before fixing anything
- Use per-layer learning rates (or per-layer gradient scaling) to compensate
- Clip gradients per layer rather than globally
Learning rate -- Increase if: loss is decreasing very slowly and gradient norms are tiny; the model can handle more. Decrease if: loss is noisy, spiky, or diverging, or if gradient norms are very large. Standard approach: use a schedule (warmup + cosine decay) and don't touch it. If the schedule isn't working, redesign it rather than manually adjusting mid-run.
Batch size -- Increase if: training is too noisy, you have memory headroom, and you want smoother gradients. Decrease if: you're overfitting or need more gradient noise for exploration. Note: changing batch size mid-run changes the effective step size. The common linear-scaling heuristic is to scale LR with batch size: if you double the batch size, consider doubling the LR.
Weight decay -- Increase if: overfitting (val loss diverges from train loss) or weights are growing unbounded. Decrease if: the model isn't fitting the training data (underfitting) or weights are collapsing to zero.
Gradient clipping -- Add/tighten if: gradient norms spike or you see NaN loss. Loosen if: gradient norms are consistently well below the clip threshold (the clipping isn't doing anything, but also isn't hurting).
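`clip_grad_norm_` returns the total norm measured before clipping, so you can log how often the clip actually fires. A sketch (the 1.0 threshold is illustrative):

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 16)
clip_threshold = 1.0  # illustrative, not a recommendation
clipped_steps = 0

for step in range(20):
    loss = (model(torch.randn(8, 16)) ** 2).mean()
    loss.backward()
    # Returns the TOTAL norm measured BEFORE clipping is applied.
    pre_clip = torch.nn.utils.clip_grad_norm_(model.parameters(), clip_threshold)
    if pre_clip > clip_threshold:
        clipped_steps += 1
    model.zero_grad()

print(f"clip fired on {clipped_steps}/20 steps")
```

If this counter stays at zero for thousands of steps, the threshold is doing nothing; if it fires on nearly every step, the clip is effectively lowering your LR and the threshold (or LR) deserves a second look.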
Kill it if:
- Loss is NaN and restarting from a checkpoint with a lower LR also NaNs
- Loss has been flat for more than 2x any plausible flat phase and longer than comparable runs took to move
- Loss is steadily diverging and lowering the LR hasn't helped
- You found a config or data bug -- the run is invalid no matter what the curve does
Don't kill it if:
- You're still inside a plausible S-curve flat phase (see above)
- The loss spiked but is recovering
- Progress is slow but steady -- slow convergence is not failure
When someone pastes training logs, look for:
- The overall loss trend: decreasing, flat, oscillating, or diverging
- Spikes, and whether the loss recovers afterward
- The train/validation gap, and whether it's widening
- The learning rate at each logged step -- is the schedule doing what they think?
- Gradient norms, step times, and memory usage, if they're logged
Then give a diagnosis: "Your training looks healthy, loss is decreasing at a reasonable rate, no red flags. At this trajectory, you'll reach approximately X loss by step Y" or "Your loss spiked at step 500 and hasn't recovered. This is likely [cause]. Try [fix]."
| Symptom | Likely Cause | Fix |
|---|---|---|
| Loss NaN | LR too high, numerical instability | Lower LR 10x, add grad clip, check for log(0) |
| Loss flat from start | Model too small, LR too low, data bug | Check data, increase LR, verify forward pass |
| Loss spikes regularly | LR too high, bad batches | Lower LR, check data quality |
| Val loss diverges | Overfitting | More regularization, less model capacity |
| Training very slow | Data loading bottleneck, no compile | Profile, add workers, use torch.compile |
| OOM at step N | Memory leak, activation caching | Find tensors that need `.detach()`/`.item()`, use gradient checkpointing |
| Gradients all zero | Dead model, detached computation | Check requires_grad, verify backward pass |
| Loss decreases then plateaus early | LR schedule wrong, model capacity hit | Check schedule, try larger model |
| Quantized model much worse | QAT not working, precision too low | Start QAT earlier, use group quantization, check scaling |