Periodically check WandB metrics during training to catch problems early (NaN, loss divergence, idle GPUs). Avoids wasting GPU hours on broken runs. Use when training is running and you want automated health checks.
Periodically read WandB metrics during training to catch problems early. Do not wait until training finishes to discover it was a waste of GPU time.
Read the run's history via the WandB API (identified by entity/project/run_id). gpt-5.4 is used via Codex MCP for ambiguous cases only (Step 3).

```python
import wandb

# Connect to the WandB API and pull the logged metric history for the run.
api = wandb.Api()
run = api.run("<entity>/<project>/<run_id>")
history = run.history()
```
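By default `run.history()` returns a sampled pandas DataFrame of the logged metrics; passing `keys=[...]` restricts it to the columns you need (the exact metric names, e.g. `train/loss`, depend on what the training script logs).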
If WandB is unreachable (API error, network issue), fall back to reading the log file directly via SSH:

```bash
ssh server "tail -100 /path/to/training.log"
```
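A minimal sketch for automating that fallback, assuming the host alias `server`, the log path above, and that failures show up as NaN/Inf values or Python tracebacks in the log (all assumptions about this particular setup):

```python
import re
import subprocess

# Read the tail of the training log over SSH (same command as above).
out = subprocess.run(
    ["ssh", "server", "tail -100 /path/to/training.log"],
    capture_output=True, text=True, check=False,
).stdout

# Scan for obvious failure markers; the patterns depend on the trainer's log format.
if re.search(r"\bnan\b|\binf\b|Traceback", out, flags=re.IGNORECASE):
    print("STOP candidate: NaN/Inf or traceback in recent log lines")
```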
Check these signals (a sketch for computing the first two from the run history follows the table):
| Signal | Judgment | Action |
|---|---|---|
| NaN/Inf in loss | Clearly bad | Stop training, investigate |
| Loss diverging (increasing for >N steps) | Clearly bad | Stop training, investigate |
| Eval metrics significantly worse than baseline | Clearly bad | Stop training, investigate |
| Loss decreasing, metrics improving | Clearly fine | Continue, increase check interval |
| Loss flat but not diverging | Unsure | → Step 3 (Codex judgment) |
| Metrics noisy, can't tell trend | Unsure | → Step 3 (Codex judgment) |
| Slightly worse than baseline but still early | Unsure | → Step 3 (Codex judgment) |
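A minimal sketch of the NaN/Inf and divergence checks, assuming the loss is logged under the key `train/loss` and using an arbitrary 20-point window for the divergence test (both assumptions; adjust to the run):

```python
import numpy as np
import wandb

api = wandb.Api()
run = api.run("<entity>/<project>/<run_id>")
# Restrict the (sampled) history to the loss column; the key name is an assumption.
history = run.history(keys=["train/loss"])
losses = history["train/loss"].to_numpy(dtype=float)

# NaN/Inf in loss -> clearly bad.
if not np.all(np.isfinite(losses)):
    print("STOP: NaN/Inf in loss")

# Loss diverging: strictly increasing over the last N sampled points.
N = 20
recent = losses[-N:]
if len(recent) == N and np.all(np.diff(recent) > 0):
    print(f"STOP: loss increasing over the last {N} points")
```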
Only escalate to Codex when the signal is ambiguous. For clearly good or clearly bad signals, act directly.
```yaml
mcp__codex__codex:
  config: {"model_reasoning_effort": "high"}
  prompt: |
    TRAINING HEALTH CHECK — need your judgment on ambiguous metrics.
    Run: <entity>/<project>/<run_id>
    Current epoch/step: X / Y total
    Training loss (last 10 checkpoints): [values]
    Eval metrics (last 3 evals): [values]
    Baseline reference: [numbers from paper/reproduction]
    What I'm unsure about: [specific concern]
    Please respond with exactly one of:
    - STOP: clearly problematic, should kill training
    - CONTINUE: looks fine, check again next interval
    - WAIT: not enough data to judge, check again sooner
```
Act on the response:

| Decision | Action |
|---|---|
| Stop | Kill the training session. Save the WandB run URL, key metrics, and reason for stopping. Log to project notes for debugging. |
| Continue | Do nothing. Will be invoked again at next interval (increase interval if consistently healthy). |
| Wait | Do nothing but keep the current short interval (don't increase). |
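A hypothetical handler for these three outcomes, assuming training runs in a tmux session named `train` and that check intervals are tracked in minutes (the session name, backoff policy, and 60-minute cap are all assumptions):

```python
import subprocess

def apply_decision(decision: str, interval_min: int, max_interval_min: int = 60) -> int:
    """Return the next check interval in minutes; 0 means no further checks."""
    if decision == "STOP":
        # Kill the training session; saving the run URL, key metrics, and the
        # reason for stopping to project notes happens outside this sketch.
        subprocess.run(["tmux", "kill-session", "-t", "train"], check=False)
        return 0
    if decision == "CONTINUE":
        # Healthy: back off so checks get less frequent over time.
        return min(interval_min * 2, max_interval_min)
    # WAIT: keep the current short interval.
    return interval_min
```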
Training-check and watchdog.py operate at different levels:
| Layer | Tool | What it checks | Frequency |
|---|---|---|---|
| Process health | watchdog.py | Session alive? GPU active? | Every 60s (continuous) |
| Training quality | training-check | Loss trend? Metrics improving? | Every 10-60 min (periodic) |
Use both together: watchdog.py catches dead sessions and idle GPUs within a minute, while training-check catches quality problems (bad loss trends, regressing metrics) at each periodic check.
After training is confirmed stable, create a recurring CronCreate job (every 10 minutes initially):

```
"Run /aris-training-check for wandb run <entity>/<project>/<run_id>"
```
As the check interval increases, delete the old CronCreate job and create a new one with the longer interval.