Periodically check WandB metrics during training to catch problems early (NaN, loss divergence, idle GPUs). Avoids wasting GPU hours on broken runs. Use when training is running and you want automated health checks.
Periodically read WandB metrics during training to catch problems early. Do not wait until training finishes to discover it was a waste of GPU time.
Reviewer model: gpt-5.4, used via a secondary Codex agent for ambiguous cases only.

Fetch the run's metric history via the WandB API (replace `<entity>/<project>/<run_id>` with your run path):

```python
import wandb

api = wandb.Api()
run = api.run("<entity>/<project>/<run_id>")
history = run.history()
```
If WandB is unreachable (API error, network issue), fall back to reading the log file directly via SSH:
```shell
ssh server "tail -100 /path/to/training.log"
```
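The primary-then-fallback fetch can be sketched as below; `fetch_history`, `log_host`, and `log_file` are hypothetical names, and the SSH fallback returns raw log text rather than structured metrics:

```python
import subprocess

def fetch_history(run_path, log_host=None, log_file=None):
    """Fetch run metrics from WandB; on any API/network error,
    fall back to tailing the training log over SSH.
    The fallback returns raw text, not a metrics table."""
    try:
        import wandb  # imported lazily so the fallback works without it
        api = wandb.Api()
        return api.run(run_path).history()
    except Exception:
        if log_host and log_file:
            result = subprocess.run(
                ["ssh", log_host, f"tail -100 {log_file}"],
                capture_output=True, text=True, check=True,
            )
            return result.stdout
        raise
```

Callers must branch on the return type, since the fallback yields text instead of a history table.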
Check these signals:
| Signal | Judgment | Action |
|---|---|---|
| NaN/Inf in loss | Clearly bad | Stop training, investigate |
| Loss diverging (increasing for >N steps) | Clearly bad | Stop training, investigate |
| Eval metrics significantly worse than baseline | Clearly bad | Stop training, investigate |
| Loss decreasing, metrics improving | Clearly fine | Continue, increase check interval |
| Loss flat but not diverging | Unsure | -> Step 3 (secondary review) |
| Metrics noisy, can't tell trend | Unsure | -> Step 3 (secondary review) |
| Slightly worse than baseline but still early | Unsure | -> Step 3 (secondary review) |
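The rubric above can be sketched as a small classifier over recent loss values; the thresholds (five consecutive rising steps, a 1% drop to count as "decreasing") are illustrative assumptions, not tuned values:

```python
import math

def classify(losses, n_steps=5):
    """Map recent loss values to a judgment per the signals table.
    Returns "bad", "fine", or "unsure" (escalate to secondary review)."""
    if any(math.isnan(x) or math.isinf(x) for x in losses):
        return "bad"        # NaN/Inf in loss: stop and investigate
    if len(losses) > n_steps and all(
        b > a for a, b in zip(losses[-n_steps - 1:], losses[-n_steps:])
    ):
        return "bad"        # increasing for > n_steps: diverging
    if losses[-1] < losses[0] * 0.99:
        return "fine"       # clearly decreasing: continue
    return "unsure"         # flat or noisy: secondary review
```

Eval-metric comparisons against the baseline would need the same three-way split but with run-specific thresholds.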
Only escalate when the signal is ambiguous. For clearly good or clearly bad signals, act directly.
```yaml
spawn_agent:
  model: REVIEWER_MODEL
  reasoning_effort: high
  message: |
    TRAINING HEALTH CHECK - need your judgment on ambiguous metrics.

    Run: <entity>/<project>/<run_id>
    Current epoch/step: X / Y total
    Training loss (last 10 checkpoints): [values]
    Eval metrics (last 3 evals): [values]
    Baseline reference: [numbers from paper/reproduction]

    What I'm unsure about: [specific concern]

    Please respond with exactly one of:
    - STOP: clearly problematic, should kill training
    - CONTINUE: looks fine, check again next interval
    - WAIT: not enough data to judge, check again sooner
```
If delegation is unavailable, make a local judgment using the same rubric and mark the decision [pending external review]. In ambiguous cases with no hard failure, prefer WAIT over STOP.
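The reviewer's one-line verdict can be parsed defensively; in this sketch (`parse_review` is a hypothetical helper name), anything unparseable defaults to WAIT, matching the preference above:

```python
def parse_review(reply: str) -> str:
    """Extract STOP/CONTINUE/WAIT from the reviewer's reply.
    Unrecognized replies default to WAIT (prefer WAIT over STOP)."""
    head = reply.strip().upper()
    for token in ("STOP", "CONTINUE", "WAIT"):
        if head.startswith(token):
            return token
    return "WAIT"
```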
| Decision | Action |
|---|---|
| Stop | Kill the training session. Save the WandB run URL, key metrics, and reason for stopping. Log to project notes for debugging. |
| Continue | Do nothing. Re-run at the next interval (increase interval if consistently healthy). |
| Wait | Do nothing but keep the current short interval (do not increase). |
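The interval bookkeeping implied by the table can be sketched as follows; doubling on Continue and the one-hour cap are assumptions, not prescribed values:

```python
def next_interval(decision, interval, max_interval=3600):
    """Return the next check interval in seconds per the decision table.
    CONTINUE backs off (doubles, capped), WAIT keeps the current short
    interval, STOP returns None since there is no next check."""
    if decision == "CONTINUE":
        return min(interval * 2, max_interval)
    if decision == "WAIT":
        return interval
    return None  # STOP: training is being killed
```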
training-check and watchdog-style monitoring operate at different levels:
| Layer | Tool | What it checks | Frequency |
|---|---|---|---|
| Process health | watchdog | Session alive? GPU active? | Every 60s (continuous) |
| Training quality | training-check | Loss trend? Metrics improving? | Every 10-60 min (periodic) |
Use both together:
- watchdog catches hard process failures (dead session, idle GPU)
- training-check catches subtle quality issues (loss plateau, metric degradation)

After training is confirmed stable:
Create a recurring job (cron, task scheduler, tmux loop, etc.)
that runs `/training-check <entity>/<project>/<run_id>` every 10 minutes.
As the check interval increases, update the old recurring job to match the new interval.
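As a concrete example, a cron entry for the recurring job might look like this (a config fragment, assuming a `training-check` wrapper script on PATH; the `*/10` is what you would edit when the interval grows):

```shell
# crontab fragment: run the health check every 10 minutes.
# When the check interval increases, change */10 accordingly.
*/10 * * * * training-check <entity>/<project>/<run_id> >> ~/training-check.log 2>&1
```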