Multi-job factorial experiment monitoring, aggregated diagnosis, and selective re-launch for SkyPilot jobs. Use when running factorial experiments (model x loss x calibration x fold) via SkyPilot and you need to track all jobs, aggregate failures by root cause, and re-launch only failed conditions. Do NOT use for: single-job monitoring (use ralph-loop), local test failures (use self-learning-iterative-coder), or sequential plan execution (use overnight-runner).
Multi-job factorial experiment monitoring with aggregated diagnosis, batch error resolution, and selective re-launch. This skill is the outer loop that orchestrates ralph-loop (per-job diagnosis) and self-learning-iterative-coder (batch code fixes).
Five rules prevent ALL known anti-patterns. Read instructions/rules.md for full details with rationale. Summary:
| Rule | Name | Prevents |
|---|---|---|
| F1 | WAIT-FOR-TERMINAL | Panic fixing, premature whac-a-mole |
| F2 | AGGREGATE-BEFORE-FIX | Silent dismissal, serial fixing |
| F3 | REBUILD-BEFORE-RELAUNCH | Docker image staleness |
| F4 | MAX-TWO-CYCLES | Infinite fix-relaunch loops, cost overrun |
| F5 | FACTORIAL-MANIFEST | Partial factorial amnesia |
Kill-switch exception (Rule F1): If 3+ jobs fail with IDENTICAL error within 5 min AND remaining running jobs haven't passed the failure point → cancel same-config jobs, begin batch diagnosis. Different-config jobs continue.
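The kill-switch condition can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the skill's actual implementation: the `Failure` record, the `config` label, the `passed_failure_point` flag, and the function name are all hypothetical, and "same-config" is interpreted here as "running jobs that share a configuration label with the failed jobs".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Failure:
    """Hypothetical record of one failed job (not part of any real API)."""
    job_id: int
    config: str       # assumed label identifying the job's configuration
    error: str        # normalized error signature
    failed_at: float  # failure time in seconds


def kill_switch_targets(failures, running, window_s=300, min_failures=3):
    """Return job_ids to cancel under the Rule F1 kill-switch exception.

    running: list of (job_id, config, passed_failure_point) tuples.
    """
    # Group failures by identical error signature.
    by_error = {}
    for f in failures:
        by_error.setdefault(f.error, []).append(f)

    targets = set()
    for group in by_error.values():
        times = sorted(f.failed_at for f in group)
        # Triggered if any min_failures consecutive failures fit in the window.
        triggered = any(
            times[i + min_failures - 1] - times[i] <= window_s
            for i in range(len(times) - min_failures + 1)
        )
        if not triggered:
            continue
        bad_configs = {f.config for f in group}
        # Cancel only same-config jobs that have not passed the failure point;
        # different-config jobs are left running.
        targets.update(
            job_id
            for job_id, config, passed in running
            if config in bad_configs and not passed
        )
    return sorted(targets)
```

For example, three `OOM` failures of configuration `a` within 200 seconds would select a running `a` job that has not yet reached the failure point, while a running `b` job and an `a` job already past that point are left alone.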
Phase 1: LAUNCH → protocols/launch.md
Phase 2: MONITOR → protocols/monitor.md (polling loop, READ-ONLY)
Phase 3: DIAGNOSE → protocols/diagnose.md (batch aggregation)
Phase 4: FIX → protocols/fix.md (reviewer-backed batch fix)
Phase 5: RELAUNCH → protocols/relaunch.md (selective, max 2 cycles)
Phase 6: REPORT → protocols/report.md (final summary + issues)
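The core of Phases 3 and 5 (aggregate failures by root cause, then relaunch only failed conditions within the two-cycle budget) might look like the sketch below. The function names, the `"FAILED"` status string, and the shape of the inputs are assumptions made for illustration; the actual procedures are defined in protocols/diagnose.md and protocols/relaunch.md.

```python
from collections import defaultdict

MAX_CYCLES = 2  # Rule F4: at most two fix-relaunch cycles


def aggregate_by_root_cause(diagnoses):
    """Group job_ids by root cause.

    diagnoses: {job_id: root_cause} (assumed output of per-job diagnosis).
    Returns {root_cause: sorted list of job_ids}.
    """
    groups = defaultdict(list)
    for job_id, cause in diagnoses.items():
        groups[cause].append(job_id)
    return {cause: sorted(ids) for cause, ids in groups.items()}


def select_relaunch(manifest, statuses, cycle):
    """Return the conditions to relaunch in this cycle.

    manifest: {job_id: condition}; statuses: {job_id: status string}.
    "FAILED" is a placeholder status label for this sketch.
    Returns an empty list once the cycle budget is exhausted.
    """
    if cycle >= MAX_CYCLES:
        return []
    return [
        manifest[job_id]
        for job_id in sorted(manifest)
        if statuses.get(job_id) == "FAILED"
    ]
```

Grouping before fixing means one fix is planned per root cause rather than per job, and the cycle check makes the Rule F4 limit explicit in code rather than leaving it to operator discipline.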
- `run_factorial.sh <config.yaml>` or confirm already launched
- `factorial_manifest.json` mapping job_id → condition
- `image_id: docker:...` (Rule #17: no bare VM)
- `sky jobs queue` every 60s
- `| condition | job_id | status | duration |`
- `ralph_monitor.analyze_logs()`
- `sky exec` while jobs run
- `{root_cause → [job_ids], fix_strategy, affected_files, confidence}`
- `make test-staging` → Docker rebuild → push → verify digest
- `relaunch_batch` number
- `outputs/factorial_run_<experiment_id>.jsonl`

| Skill | How It Integrates |
|---|---|
| ralph-loop | Per-job diagnosis via analyze_logs(), reusing the failure pattern library |
| self-learning-iterative-coder | TDD loop when fixes require code changes |
| issue-creator | Unrecoverable failures after 2 cycles become GitHub issues |
| overnight-runner | Factorial runs are a type of batch execution |
See eval/checklist.md for 5 binary pass/fail criteria.
See templates/factorial-manifest.json for the experiment state tracking schema.
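To illustrate what a manifest buys you under Rule F5, here is a small sketch that expands the factorial design and reports conditions that have no job in the manifest. The factor values and the manifest layout (`{job_id: condition dict}`) are hypothetical placeholders; the real schema is the one in templates/factorial-manifest.json.

```python
from itertools import product


def expand_conditions(factors):
    """Expand {factor_name: [levels]} into one dict per factorial condition."""
    names = list(factors)
    return [
        dict(zip(names, combo))
        for combo in product(*(factors[name] for name in names))
    ]


def missing_conditions(factors, manifest):
    """Return conditions from the full factorial that have no job in manifest.

    manifest: {job_id: condition dict} (assumed layout for this sketch).
    """
    # Conditions are compared as sorted (name, level) tuples so that key
    # order in the dicts does not matter.
    launched = {tuple(sorted(cond.items())) for cond in manifest.values()}
    return [
        cond
        for cond in expand_conditions(factors)
        if tuple(sorted(cond.items())) not in launched
    ]
```

Checking the manifest against the full expansion, rather than against the list of jobs that happen to be visible in the queue, is what guards against silently forgetting part of the factorial.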