Machine Learning
ML Training Monitor: Diagnostics and Debugging
Diagnose, monitor, and debug ML model training runs.

Use this skill when the user wants to:
- understand why a model isn't learning
- diagnose training instability (NaN loss, loss spikes, divergence)
- interpret loss curves and training metrics
- decide whether to adjust hyperparameters mid-run
- decide whether a run is worth continuing or should be killed
- debug quantization-aware training issues
- understand gradient behavior

Trigger when the user mentions: loss curves, training logs, gradient norms, learning rate schedules, NaN loss, training divergence, "model isn't learning", "loss is stuck", "should I kill this run", validation loss, overfitting/underfitting, warmup, cooldown, weight decay tuning, or any other question about how training is going. Also trigger when someone pastes training logs and wants them interpreted.
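For a concrete sense of the signals this skill reasons about, here is a minimal sketch of training-loop instrumentation that surfaces the core diagnostics: per-step loss, the global gradient norm, and a guard against non-finite loss. It assumes PyTorch; the model, optimizer settings, and dummy data are hypothetical placeholders, not a prescribed setup.

```python
# Minimal sketch of training-loop instrumentation: log per-step loss and
# global gradient norm, and abort on non-finite loss. The model, data,
# and hyperparameters are illustrative placeholders.
import math

import torch
import torch.nn as nn

model = nn.Linear(32, 1)  # placeholder model
opt = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
loader = [(torch.randn(8, 32), torch.randn(8, 1)) for _ in range(100)]  # dummy data

for step, (x, y) in enumerate(loader):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)

    # A NaN/Inf loss means every later step is garbage: stop and diagnose
    # (common culprits: learning rate too high, fp16 overflow, a bad batch).
    if not math.isfinite(loss.item()):
        raise RuntimeError(f"non-finite loss at step {step}: {loss.item()}")

    loss.backward()

    # clip_grad_norm_ returns the pre-clipping global norm, so it doubles as
    # a cheap gradient-norm monitor; spikes here often precede loss spikes.
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0).item()
    opt.step()

    if step % 10 == 0:
        print(f"step {step:4d}  loss {loss.item():.4f}  grad_norm {grad_norm:.4f}")
```

Read together, these two series cover many of the triggers above: a flat loss with shrinking gradient norms suggests the model isn't learning (learning rate too low, saturated activations), while gradient-norm spikes that precede loss spikes point at instability that warmup, clipping, or a lower learning rate typically addresses.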