Periodically check WandB metrics during training to catch problems early (NaN, loss divergence, idle GPUs). Avoids wasting GPU hours on broken runs. Use when training is running and you want automated health checks.
Periodically read WandB metrics during training to catch problems early. Do not wait until training finishes to discover it was a waste of GPU time.
Reviewer model: gpt-5.4, used via a secondary Codex agent for ambiguous cases only.

Fetch the run's metric history via the WandB API (replace `<entity>/<project>/<run_id>` with your run path):

```python
import wandb

api = wandb.Api()
run = api.run("<entity>/<project>/<run_id>")
history = run.history()
```
If WandB is unreachable (API error, network issue), fall back to reading the log file directly via SSH:
```shell
ssh server "tail -100 /path/to/training.log"
```
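The primary-then-fallback fetch can be sketched as below; `fetch_history`, `log_host`, and `log_file` are hypothetical names, and the SSH fallback returns raw log text rather than structured metrics:

```python
import subprocess

def fetch_history(run_path, log_host=None, log_file=None):
    """Fetch run metrics from WandB; on any API/network error,
    fall back to tailing the training log over SSH.
    The fallback returns raw text, not a metrics table."""
    try:
        import wandb  # imported lazily so the fallback works without it
        api = wandb.Api()
        return api.run(run_path).history()
    except Exception:
        if log_host and log_file:
            result = subprocess.run(
                ["ssh", log_host, f"tail -100 {log_file}"],
                capture_output=True, text=True, check=True,
            )
            return result.stdout
        raise
```

Callers must branch on the return type, since the fallback yields text instead of a history table.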
Check these signals:
| Signal | Judgment | Action |
|---|---|---|
| NaN/Inf in loss | Clearly bad | Stop training, investigate |
| Loss diverging (increasing for >N steps) | Clearly bad | Stop training, investigate |
| Eval metrics significantly worse than baseline | Clearly bad | Stop training, investigate |
| Loss decreasing, metrics improving | Clearly fine | Continue, increase check interval |
| Loss flat but not diverging | Unsure | -> Step 3 (secondary review) |
| Metrics noisy, can't tell trend | Unsure | -> Step 3 (secondary review) |
| Slightly worse than baseline but still early | Unsure | -> Step 3 (secondary review) |
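The rubric above can be sketched as a small classifier over recent loss values; the thresholds (five consecutive rising steps, a 1% drop to count as "decreasing") are illustrative assumptions, not tuned values:

```python
import math

def classify(losses, n_steps=5):
    """Map recent loss values to a judgment per the signals table.
    Returns "bad", "fine", or "unsure" (escalate to secondary review)."""
    if any(math.isnan(x) or math.isinf(x) for x in losses):
        return "bad"        # NaN/Inf in loss: stop and investigate
    if len(losses) > n_steps and all(
        b > a for a, b in zip(losses[-n_steps - 1:], losses[-n_steps:])
    ):
        return "bad"        # increasing for > n_steps: diverging
    if losses[-1] < losses[0] * 0.99:
        return "fine"       # clearly decreasing: continue
    return "unsure"         # flat or noisy: secondary review
```

Eval-metric comparisons against the baseline would need the same three-way split but with run-specific thresholds.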
Only escalate when the signal is ambiguous. For clearly good or clearly bad signals, act directly.
```yaml
spawn_agent:
  model: REVIEWER_MODEL
  reasoning_effort: high
  message: |
    TRAINING HEALTH CHECK - need your judgment on ambiguous metrics.

    Run: <entity>/<project>/<run_id>
    Current epoch/step: X / Y total
    Training loss (last 10 checkpoints): [values]
    Eval metrics (last 3 evals): [values]
    Baseline reference: [numbers from paper/reproduction]

    What I'm unsure about: [specific concern]

    Please respond with exactly one of:
    - STOP: clearly problematic, should kill training
    - CONTINUE: looks fine, check again next interval
    - WAIT: not enough data to judge, check again sooner
```
If delegation is unavailable, make a local judgment using the same rubric and mark the decision [pending external review]. In ambiguous cases with no hard failure, prefer WAIT over STOP.
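The reviewer's one-line verdict can be parsed defensively; in this sketch (`parse_review` is a hypothetical helper name), anything unparseable defaults to WAIT, matching the preference above:

```python
def parse_review(reply: str) -> str:
    """Extract STOP/CONTINUE/WAIT from the reviewer's reply.
    Unrecognized replies default to WAIT (prefer WAIT over STOP)."""
    head = reply.strip().upper()
    for token in ("STOP", "CONTINUE", "WAIT"):
        if head.startswith(token):
            return token
    return "WAIT"
```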
| Decision | Action |
|---|---|
| Stop | Kill the training session. Save the WandB run URL, key metrics, and reason for stopping. Log to project notes for debugging. |
| Continue | Do nothing. Re-run at the next interval (increase interval if consistently healthy). |
| Wait | Do nothing but keep the current short interval (do not increase). |
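The interval bookkeeping implied by the table can be sketched as follows; doubling on Continue and the one-hour cap are assumptions, not prescribed values:

```python
def next_interval(decision, interval, max_interval=3600):
    """Return the next check interval in seconds per the decision table.
    CONTINUE backs off (doubles, capped), WAIT keeps the current short
    interval, STOP returns None since there is no next check."""
    if decision == "CONTINUE":
        return min(interval * 2, max_interval)
    if decision == "WAIT":
        return interval
    return None  # STOP: training is being killed
```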
training-check and watchdog-style monitoring operate at different levels:
| Layer | Tool | What it checks | Frequency |
|---|---|---|---|
| Process health | watchdog | Session alive? GPU active? | Every 60s (continuous) |
| Training quality | training-check | Loss trend? Metrics improving? | Every 10-60 min (periodic) |
Use both together:
- watchdog catches hard process failures (dead session, idle GPU)
- training-check catches subtle quality issues (loss plateau, metric degradation)

After training is confirmed stable:
Create a recurring job (cron, task scheduler, tmux loop, etc.)
that runs `/training-check <entity>/<project>/<run_id>` every 10 minutes.
As the check interval increases, update the old recurring job to match the new interval.
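As a concrete example, a cron entry for the recurring job might look like this (a config fragment, assuming a `training-check` wrapper script on PATH; the `*/10` is what you would edit when the interval grows):

```shell
# crontab fragment: run the health check every 10 minutes.
# When the check interval increases, change */10 accordingly.
*/10 * * * * training-check <entity>/<project>/<run_id> >> ~/training-check.log 2>&1
```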