Skill: Babysit Job

Monitor a job continuously and recover on failure. For Zephyr pipelines, delegate to babysit-zephyr instead. Otherwise, follow this skill — Iris is the execution backend.

Required Info

job_id — Iris job ID in canonical format /<user>/<job> (e.g., /dlwh/iris-run-train_tiny_model_tpu-20260302-185630)
config — Iris config path (e.g., lib/iris/examples/marin.yaml). When the user refers to a cluster by shorthand name (e.g., "marin_dev", "marin-dev", "marin", "coreweave"), resolve it to the matching config file under lib/iris/examples/. Common mappings:
- marin / marin_prod -> lib/iris/examples/marin.yaml
- marin_dev / marin-dev -> lib/iris/examples/marin-dev.yaml
- coreweave -> lib/iris/examples/coreweave.yaml
— exact Iris submit command for resubmission; must include

Skill: Babysit Job

Monitor a job continuously and recover on failure. For Zephyr pipelines, delegate to babysit-zephyr instead. Otherwise, follow this skill — Iris is the execution backend.

Required Info

job_id — Iris job ID in canonical format /<user>/<job> (e.g., /dlwh/iris-run-train_tiny_model_tpu-20260302-185630)
config — Iris config path (e.g., lib/iris/examples/marin.yaml). When the user refers to a cluster by shorthand name (e.g., "marin_dev", "marin-dev", "marin", "coreweave"), resolve it to the matching config file under lib/iris/examples/. Common mappings:
- marin / marin_prod -> lib/iris/examples/marin.yaml
- marin_dev / marin-dev -> lib/iris/examples/marin-dev.yaml
- coreweave -> lib/iris/examples/coreweave.yaml
— exact Iris submit command for resubmission; must include

1. SLEEP - if just submitted/restarted: sleep 120 once - otherwise: sleep 570 2. CHECK LOGS uv run iris --config <CONFIG> job logs --since-seconds 900 --include-children <JOB_ID> | rg -i -e "loss|error|traceback|exception|resource_exhausted|oom|compiler_base\.cc:2587|program hbm requirement|largest program allocations|ownerdiederror|dead node|node death|autoscaler unsatisfied resources|no accelerator found|failed_precondition|device or resource busy" 3. CHECK STATUS uv run iris --config <CONFIG> job list --json --prefix <JOB_ID> Terminal success: JOB_STATE_SUCCEEDED Terminal non-success: JOB_STATE_FAILED, JOB_STATE_KILLED, JOB_STATE_WORKER_FAILED, JOB_STATE_UNSCHEDULABLE Non-terminal: JOB_STATE_PENDING, JOB_STATE_BUILDING, JOB_STATE_RUNNING If `pending_reason` indicates worker scale-up/capacity wait, treat as scheduler capacity wait — do not run cluster update/recreate/restart actions. Continue waiting on cadence, or stop+resubmit only if user explicitly asks. Treat RUNNING as controller-level signal only; confirm allocation via expected W&B run when possible. 3a. ON TERMINAL STATE / OOM-LIKE SIGNAL — get a structured per-task summary (final state, exit, duration, peak memory) instead of grepping logs: uv run iris --config <CONFIG> job summary --json <JOB_ID> Fast postmortem: e.g. "13/14 shards peaked near the container memory limit and failed with exit 137" → cgroup OOM, raise `--memory` on resubmit. 4. PRINT W&B RUN IDS/LINKS (once per training run) 5. REPORT PROGRESS (format: ~<current>/<exact_max>) 6. EVALUATE (terminal? error? stalled? -> recover or continue) 7. RECOVER (STOP -> RESUBMIT) - If current job is still non-terminal, stop it first: uv run iris --config <CONFIG> job stop <JOB_ID> - Then resubmit: <RESUBMIT_COMMAND> - Capture `job_id` from output (line like `Job submitted: /<user>/<job>`). - Iris nuance: - if `resubmit_command` omits `--job-name`, Iris auto-generates a fresh id each resubmission. - if `resubmit_command` uses a fixed `--job-name`, Iris may reuse the same id after terminal completion by replacing the finished job. - Update state file: `job_id=<NEW_JOB_ID>`, `restart_count += 1`. - Go to step 1.

Babysit Job

Skill: Babysit Job

Required Info

Babysit Job

Skill: Babysit Job

Required Info

Scope

Monitoring Ownership and Duration

Cadence and Tooling Notes

State File

Loop

Fixing Small Bugs

Error Patterns

When to Escalate

Notes

Automation Audit Ops

Github Qa Labels

Jupyter Notebook

Tidb Integrationtest Recorder

Quality Nonconformance

Hugging Face Trackio