Babysit Training
Monitor and babysit running ML training jobs for the NL-to-SQL T5 project.
Perform comprehensive health checks: process liveness, loss/metric trends,
GPU utilization, disk space, checkpoint integrity, W&B connectivity, and
hung-process detection. Automatically restart from the last checkpoint on crash.
Document any issues found as files under issues/, with proper frontmatter.
Use when: (1) User asks to babysit, monitor, or watch training,
(2) User says "check on training", "is training healthy", "monitor the run",
(3) Running under /loop for periodic automated monitoring,
(4) User asks to check if training crashed or needs restart.
Typical invocation: `/loop 10m /babysit-training`
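
A few of the health checks listed above could be sketched in Python roughly as follows. This is a minimal illustration, not the skill's actual implementation: the function names, the window and tolerance defaults, and the idea of using a log file's modification time as a hang signal are all assumptions made for the example. The process check is POSIX-only.

```python
import math
import os
import shutil
import time
from pathlib import Path


def process_alive(pid: int) -> bool:
    """Return True if a process with this PID exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 tests for existence without delivering a signal
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def disk_free_gb(path: str = ".") -> float:
    """Free space, in GB, on the filesystem containing `path`."""
    return shutil.disk_usage(path).free / 1e9


def seconds_since_update(log_path: str) -> float:
    """Seconds since the file was last modified.

    A large value while the process is still alive hints at a hung job
    (hypothetical heuristic; the threshold is left to the caller).
    """
    return time.time() - Path(log_path).stat().st_mtime


def loss_is_healthy(losses: list[float], window: int = 5,
                    tolerance: float = 0.05) -> bool:
    """Check that the loss is finite and not trending upward.

    Compares the mean of the last `window` values with the mean of the
    `window` values before them. Assumes a positive loss where lower is
    better. An empty history is treated as unhealthy.
    """
    if not losses or not all(math.isfinite(x) for x in losses):
        return False
    if len(losses) < 2 * window:
        return True  # too few points to judge a trend
    recent = sum(losses[-window:]) / window
    previous = sum(losses[-2 * window:-window]) / window
    return recent <= previous * (1 + tolerance)
```

GPU utilization, checkpoint integrity, and W&B connectivity are left out here because they depend on tooling and file layouts that this description does not specify.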