Monitor and babysit running ML training jobs for the NL-to-SQL T5 project. Perform comprehensive health checks: process alive, loss/metric trending, GPU utilization, disk space, checkpoint integrity, W&B connectivity, and hung process detection. Auto-restart from last checkpoint on crash. Document any issues found in issues/ with proper frontmatter. Use when: (1) User asks to babysit, monitor, or watch training, (2) User says "check on training", "is training healthy", "monitor the run", (3) Used with /loop for periodic automated monitoring, (4) User asks to check if training crashed or needs restart. Typical invocation: `/loop 10m /babysit-training`
Autonomous training monitor for the NL-to-SQL T5 project. Run through every check below in order, report a status line, and take action on any issues found.
Execute these checks in order. Stop and act on the first critical issue before continuing.
**Process alive**

```bash
ps aux | grep -E "python.*(train|sweep|prompting)" | grep -v grep
```
If no process found:

- Check output/ for the most recent log file: `ls -lt output/*.txt | head -5`

If a process is found but the log has shown no output for >10 minutes:

- Run `tail -20 <log_file>` and look for "wandb: Synced" or "Run history" (the run may have finished while the process hung on exit)
- If hung, `kill <PID>` and report. See references/known-issues.md.

**Read the latest training log**
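The 10-minute staleness rule can be sketched as a small shell helper; the function name and default threshold are illustrative, not project settings:

```shell
# Report whether a log file has gone stale (no writes in N minutes).
log_is_stale() {
  local log_file=$1 threshold_min=${2:-10}
  # find -mmin +N matches files last modified more than N minutes ago
  if find "$log_file" -mmin +"$threshold_min" 2>/dev/null | grep -q .; then
    echo "stale"
  else
    echo "fresh"
  fi
}
```

`log_is_stale output/<latest_log> 10` printing `stale` is the condition under which the hung-process check should fire.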
```bash
# Find active log
ls -lt output/*.txt | head -3
# Read last 50 lines
tail -50 <latest_log>
```
Extract and report the current epoch/step, train loss, and latest eval F1.
Red flags (act immediately):
- `NaN` or `inf` in loss: training diverged. Kill and restart with a lower LR.
- `CUDA out of memory` / `OutOfMemoryError`: OOM. Kill, reduce the batch size or disable auto_batch_size, restart.
- `RuntimeError` tracebacks: read the full traceback, diagnose, fix the code, restart.

**Metric trend**

Compare current metrics against earlier epochs in the same log:
```bash
grep -E "F1 =|train loss =" <log_file> | tail -20
```
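The trend comparison can be sketched as a log parser. This assumes eval lines end in `F1 = <value>` (an assumption about the log format, matching the grep pattern above):

```shell
# Print "improving" if the latest F1 beats every earlier one, else "plateau".
# Assumes matching lines end with "F1 = <value>"; format is not verified.
f1_trend() {
  local log_file=$1
  awk '/F1 =/ { vals[n++] = $NF + 0 }
    END {
      if (n < 2) { print "insufficient"; exit }
      best = vals[0]
      for (i = 1; i < n - 1; i++) if (vals[i] > best) best = vals[i]
      if (vals[n-1] > best) print "improving"; else print "plateau"
    }' "$log_file"
}
```

A "plateau" result feeds the `patience_epochs` flag below; a human (or the agent) still reads the raw lines before acting.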
Flag if:

- the best eval F1 has not improved for more than `patience_epochs` eval cycles (early stopping should handle this, but verify)

**GPU utilization**

```bash
nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total,temperature.gpu --format=csv,noheader,nounits
```
**Disk space**

```bash
df -h / | tail -1
```
Flag if:

- GPU utilization is unexpectedly low or temperature is unusually high
- free disk space is low enough to threaten the next checkpoint save

**Checkpoints and W&B**
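A sketch of flag logic for the two commands above; the thresholds (20% utilization, 85C, 20 GB free) are illustrative defaults, not project settings:

```shell
# Flag one CSV line from the nvidia-smi query: "util, mem_used, mem_total, temp"
check_gpu_line() {
  local util temp
  util=$(echo "$1" | cut -d, -f1 | tr -d ' ')
  temp=$(echo "$1" | cut -d, -f4 | tr -d ' ')
  if [ "$util" -lt 20 ]; then echo "flag: low utilization (${util}%)"
  elif [ "$temp" -gt 85 ]; then echo "flag: hot (${temp}C)"
  else echo "ok"
  fi
}

# Flag low free space on a mount; df -BG reports sizes in whole gigabytes
check_disk() {
  local mount=${1:-/} min_free_gb=${2:-20}
  local free_gb
  free_gb=$(df -BG "$mount" | tail -1 | awk '{ gsub("G", "", $4); print $4 }')
  if [ "$free_gb" -lt "$min_free_gb" ]; then echo "flag: only ${free_gb}G free"
  else echo "ok"
  fi
}
```

Example: `check_gpu_line "95, 12000, 16000, 65"` prints `ok`; `check_disk / 20` flags when less than 20 GB is free on the root filesystem.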
```bash
# Find the active run directory
ls -lt output/ | head -5
# Check checkpoints exist and are recent
ls -la output/<run_dir>/checkpoints/
# Check W&B is syncing (not offline/crashed)
ls -lt output/<run_dir>/wandb/latest-run/
```
Flag if:

- `latest-run/` directory is missing or has no recent sync files
- `wandb-offline-run-*` directories exist (W&B fell back to offline mode)

**Stop file**

```bash
ls -la STOP 2>/dev/null
```
If a STOP file exists, verify training acknowledged it (the log should show "Stop file detected"). If training is still running without acknowledgment after 2+ epochs, the stop check may be broken.
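The acknowledgment check can be sketched as follows; the STOP filename and the "Stop file detected" string come from the text above, while the log path and function name are placeholders:

```shell
# Classify the STOP-file state: no request, acknowledged, or possibly broken.
stop_status() {
  local log_file=$1
  if [ ! -f STOP ]; then echo "no-stop-request"
  elif grep -q "Stop file detected" "$log_file"; then echo "acknowledged"
  else echo "unacknowledged"
  fi
}
```

An `unacknowledged` result that persists for 2+ epochs is the condition worth escalating.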
**Restart procedure**

When a crash is detected and a restart is needed:
```bash
# Find the most recent run directory
ls -lt output/ | grep -E "t5_ft|t5_scr" | head -5
# Verify a resumable training state exists
ls output/<run_dir>/checkpoints/training_state.pt
# Relaunch from the last checkpoint
PYTHONUNBUFFERED=1 nohup python3 train_t5.py --resume output/<run_dir> > output/<descriptive_log>.txt 2>&1 &
# Confirm it came up
sleep 30 && tail -20 output/<descriptive_log>.txt
```
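The restart steps above can be wrapped in a guard so a relaunch only happens when a resumable state actually exists; the function name and argument shape are illustrative:

```shell
# Relaunch training from the last checkpoint, but only if the run is resumable.
restart_run() {
  local run_dir=$1 log_name=$2
  if [ ! -f "$run_dir/checkpoints/training_state.pt" ]; then
    echo "no training_state.pt under $run_dir; refusing to restart" >&2
    return 1
  fi
  PYTHONUNBUFFERED=1 nohup python3 train_t5.py --resume "$run_dir" \
    > "output/${log_name}.txt" 2>&1 &
  echo "restarted as PID $!"
}
```

The guard keeps a crash loop from repeatedly launching runs that cannot resume; a failed guard is itself something to document in issues/.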
For sweep restarts, the sweep agent handles trial-level recovery automatically. Only restart the sweep process itself if it crashed:
```bash
PYTHONUNBUFFERED=1 nohup python3 part1/sweep.py --budget 1.5 --max-hours 12 > output/sweep_restart.txt 2>&1 &
```
When an issue is found, create issues/<descriptive-name>.md:
---