Monitor and babysit running ML training jobs for the NL-to-SQL T5 project. Perform comprehensive health checks: process alive, loss/metric trending, GPU utilization, disk space, checkpoint integrity, W&B connectivity, and hung process detection. Auto-restart from last checkpoint on crash. Document any issues found in issues/ with proper frontmatter. Use when: (1) User asks to babysit, monitor, or watch training, (2) User says "check on training", "is training healthy", "monitor the run", (3) Used with /loop for periodic automated monitoring, (4) User asks to check if training crashed or needs restart. Typical invocation: `/loop 10m /babysit-training`
Autonomous training monitor for the NL-to-SQL T5 project. Run through every check below in order, report a status line, and take action on any issues found.
Execute these checks in order. Stop and act on the first critical issue before continuing.
**Process alive**

```bash
ps aux | grep -E "python.*(train|sweep|prompting)" | grep -v grep
```
If no process found:

- Check output/ for the most recent log file: `ls -lt output/*.txt | head -5`

If a process is found but the log has shown no output for >10 minutes:

- Run `tail -20 <log_file>` and look for "wandb: Synced" or "Run history" (the run may have finished while the process hung on exit)
- If hung, `kill <PID>` and report. See references/known-issues.md.

**Read the latest training log**
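The 10-minute staleness rule can be sketched as a small shell helper; the function name and default threshold are illustrative, not project settings:

```shell
# Report whether a log file has gone stale (no writes in N minutes).
log_is_stale() {
  local log_file=$1 threshold_min=${2:-10}
  # find -mmin +N matches files last modified more than N minutes ago
  if find "$log_file" -mmin +"$threshold_min" 2>/dev/null | grep -q .; then
    echo "stale"
  else
    echo "fresh"
  fi
}
```

`log_is_stale output/<latest_log> 10` printing `stale` is the condition under which the hung-process check should fire.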
```bash
# Find active log
ls -lt output/*.txt | head -3
# Read last 50 lines
tail -50 <latest_log>
```
Extract and report the current epoch/step, train loss, and latest eval F1.
Red flags (act immediately):
- `NaN` or `inf` in loss: training diverged. Kill and restart with a lower LR.
- `CUDA out of memory` / `OutOfMemoryError`: OOM. Kill, reduce the batch size or disable auto_batch_size, restart.
- `RuntimeError` tracebacks: read the full traceback, diagnose, fix the code, restart.

**Metric trend**

Compare current metrics against earlier epochs in the same log:
```bash
grep -E "F1 =|train loss =" <log_file> | tail -20
```
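The trend comparison can be sketched as a log parser. This assumes eval lines end in `F1 = <value>` (an assumption about the log format, matching the grep pattern above):

```shell
# Print "improving" if the latest F1 beats every earlier one, else "plateau".
# Assumes matching lines end with "F1 = <value>"; format is not verified.
f1_trend() {
  local log_file=$1
  awk '/F1 =/ { vals[n++] = $NF + 0 }
    END {
      if (n < 2) { print "insufficient"; exit }
      best = vals[0]
      for (i = 1; i < n - 1; i++) if (vals[i] > best) best = vals[i]
      if (vals[n-1] > best) print "improving"; else print "plateau"
    }' "$log_file"
}
```

A "plateau" result feeds the `patience_epochs` flag below; a human (or the agent) still reads the raw lines before acting.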
Flag if:

- the best eval F1 has not improved for more than `patience_epochs` eval cycles (early stopping should handle this, but verify)

**GPU utilization**

```bash
nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total,temperature.gpu --format=csv,noheader,nounits
```
**Disk space**

```bash
df -h / | tail -1
```
Flag if:

- GPU utilization is unexpectedly low or temperature is unusually high
- free disk space is low enough to threaten the next checkpoint save

**Checkpoints and W&B**
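A sketch of flag logic for the two commands above; the thresholds (20% utilization, 85C, 20 GB free) are illustrative defaults, not project settings:

```shell
# Flag one CSV line from the nvidia-smi query: "util, mem_used, mem_total, temp"
check_gpu_line() {
  local util temp
  util=$(echo "$1" | cut -d, -f1 | tr -d ' ')
  temp=$(echo "$1" | cut -d, -f4 | tr -d ' ')
  if [ "$util" -lt 20 ]; then echo "flag: low utilization (${util}%)"
  elif [ "$temp" -gt 85 ]; then echo "flag: hot (${temp}C)"
  else echo "ok"
  fi
}

# Flag low free space on a mount; df -BG reports sizes in whole gigabytes
check_disk() {
  local mount=${1:-/} min_free_gb=${2:-20}
  local free_gb
  free_gb=$(df -BG "$mount" | tail -1 | awk '{ gsub("G", "", $4); print $4 }')
  if [ "$free_gb" -lt "$min_free_gb" ]; then echo "flag: only ${free_gb}G free"
  else echo "ok"
  fi
}
```

Example: `check_gpu_line "95, 12000, 16000, 65"` prints `ok`; `check_disk / 20` flags when less than 20 GB is free on the root filesystem.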
```bash
# Find the active run directory
ls -lt output/ | head -5
# Check checkpoints exist and are recent
ls -la output/<run_dir>/checkpoints/
# Check W&B is syncing (not offline/crashed)
ls -lt output/<run_dir>/wandb/latest-run/
```
Flag if:

- `latest-run/` directory is missing or has no recent sync files
- `wandb-offline-run-*` directories exist (W&B fell back to offline mode)

**Stop file**

```bash
ls -la STOP 2>/dev/null
```
If a STOP file exists, verify training acknowledged it (the log should show "Stop file detected"). If training is still running without acknowledgment after 2+ epochs, the stop check may be broken.
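The acknowledgment check can be sketched as follows; the STOP filename and the "Stop file detected" string come from the text above, while the log path and function name are placeholders:

```shell
# Classify the STOP-file state: no request, acknowledged, or possibly broken.
stop_status() {
  local log_file=$1
  if [ ! -f STOP ]; then echo "no-stop-request"
  elif grep -q "Stop file detected" "$log_file"; then echo "acknowledged"
  else echo "unacknowledged"
  fi
}
```

An `unacknowledged` result that persists for 2+ epochs is the condition worth escalating.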
**Restart procedure**

When a crash is detected and a restart is needed:
```bash
# Find the most recent run directory
ls -lt output/ | grep -E "t5_ft|t5_scr" | head -5
# Verify a resumable training state exists
ls output/<run_dir>/checkpoints/training_state.pt
# Relaunch from the last checkpoint
PYTHONUNBUFFERED=1 nohup python3 train_t5.py --resume output/<run_dir> > output/<descriptive_log>.txt 2>&1 &
# Confirm it came up
sleep 30 && tail -20 output/<descriptive_log>.txt
```
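The restart steps above can be wrapped in a guard so a relaunch only happens when a resumable state actually exists; the function name and argument shape are illustrative:

```shell
# Relaunch training from the last checkpoint, but only if the run is resumable.
restart_run() {
  local run_dir=$1 log_name=$2
  if [ ! -f "$run_dir/checkpoints/training_state.pt" ]; then
    echo "no training_state.pt under $run_dir; refusing to restart" >&2
    return 1
  fi
  PYTHONUNBUFFERED=1 nohup python3 train_t5.py --resume "$run_dir" \
    > "output/${log_name}.txt" 2>&1 &
  echo "restarted as PID $!"
}
```

The guard keeps a crash loop from repeatedly launching runs that cannot resume; a failed guard is itself something to document in issues/.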
For sweep restarts, the sweep agent handles trial-level recovery automatically. Only restart the sweep process itself if it crashed:
```bash
PYTHONUNBUFFERED=1 nohup python3 part1/sweep.py --budget 1.5 --max-hours 12 > output/sweep_restart.txt 2>&1 &
```
When an issue is found, create issues/<descriptive-name>.md:
---