Meta-skill that standardizes experiment run XML creation, enforces execution protocol on top of /factorial-monitor, and gates on live report writing. The "reproducible harness" — ensures every experiment pass compounds learning. ACTIVATE when: creating a new factorial experiment pass (debug or production), resuming a multi-pass experiment series, or preparing a production launch. DO NOT ACTIVATE when: monitoring a single job (use ralph-loop), running local tests (use self-learning-iterative-coder), continuing an existing pass mid-flight (use factorial-monitor directly), or doing non-experiment development work.
Meta-skill for reproducible experiment execution with compound learning. Creates XML plans, enforces execution protocol, and ensures every pass leaves the codebase measurably better than before.
"Things should compound." — Guiding principle
Across 5 debug factorial passes, Claude Code repeatedly:
- Used /factorial-monitor but bypassed the XML protocol

See: references/failure-history.md
Need to run experiments on cloud GPU?
├── New pass (first time or new fixes applied)?
│ → /experiment-harness (THIS SKILL)
├── Continuing an existing pass mid-flight?
│ → /factorial-monitor (monitoring only)
├── Single job diagnosis?
│ → /ralph-loop
└── Local code changes needed?
    → /self-learning-iterative-coder
Layer 1 (Deterministic): scripts/run_factorial.sh
- Pure sky jobs launch calls in a loop
- ANY researcher can run this without Claude Code
- NO LLM involvement — reproducible, auditable
Layer 2 (Monitoring Harness): /experiment-harness → /factorial-monitor
- Creates experiment XML (standardized template)
- VALIDATES pre-launch gates (9 checks, 0 skips)
- Creates report file BEFORE launching
- Drives /factorial-monitor for monitoring + diagnosis
- Updates report after EVERY state change
- Captures compound learning (new tests, issues, observations)
CRITICAL SEPARATION: The .sh script is the PRODUCT. The harness is the DEVELOPER TOOL. Production runs use ONLY the .sh script.
Phase 1: GENERATE → Create experiment XML from template [protocols/generate.md]
Phase 2: VALIDATE → Pre-launch gates (all must pass) [protocols/validate.md]
Phase 3: EXECUTE → Launch .sh + drive /factorial-monitor [protocols/execute.md]
Phase 4: COMPOUND → Tests, issues, observations, watchlist [protocols/compound.md]
Phase 5: REFLECT → Self-assess harness effectiveness [protocols/reflect.md]
Create the experiment XML from templates/experiment-run.xml.
- Load context: /plan-context-load → navigator → domains → metalearning → registry
- Verify structure with yaml.safe_load() (NEVER regex)
- Cognitive engagement checkpoint (Shen et al. 2026): Before proceeding, articulate in the report: "WHY do I expect these specific watchlist items? WHAT might go wrong that prior passes haven't revealed?"
Output: docs/planning/v0-2_archive/original_docs/run-debug-factorial-experiment-{N}th-pass.xml
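The structure check above can be sketched as follows. This is a minimal sketch, assuming the check targets the factorial YAML config; the key layout (`resources.accelerators`) follows SkyPilot convention and is an assumption — adapt it to the actual schema.

```python
# Sketch: verify config structure with yaml.safe_load (NEVER regex).
# The resources.accelerators layout is an assumed SkyPilot-style schema.
import yaml


def verify_structure(path: str) -> list:
    """Return a list of structural problems; an empty list means OK."""
    with open(path) as f:
        cfg = yaml.safe_load(f)
    if not isinstance(cfg, dict):
        return ["top level is not a mapping"]
    problems = []
    acc = cfg.get("resources", {}).get("accelerators")
    if not isinstance(acc, str):
        # Dict/priority-list accelerators are banned (see H7-H9 rules).
        problems.append("accelerators must be a string like 'L4:1', not a dict")
    return problems
```

Parsing first and inspecting the resulting objects is what makes this auditable; a regex over raw text cannot distinguish a string accelerator from a dict one.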
Pre-launch gates that MUST ALL pass:
| Gate | Command | Criteria |
|---|---|---|
| Staging tests | make test-staging | 0 skipped, 0 failed |
| Prod tests | make test-prod | 0 skipped, 0 failed |
| Preflight GCP | python scripts/preflight_gcp.py | 11/11 passed (incl. YAML contract + cost) |
| YAML contract | python scripts/validate_yaml_contract.py | 0 violations |
| Docker image fresh | GAR timestamp > latest code commit | Image newer |
| Docker image verified | docker run ... python3 -c "import ..." | All imports OK |
| Factorial dry-run | run_factorial.sh --dry-run | Correct conditions |
| Report file created | <report-output> path from XML | File exists with headers |
| Prior pass referenced | <session-context> non-empty | Has root causes |
| Security scan | No credentials in XML, paths valid | Clean |
If ANY gate fails → DO NOT LAUNCH. Fix first.
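The fail-fast gate sequence can be sketched as below. The commands mirror the table above but are illustrative; a real implementation would run all nine gates and log each result to the report.

```python
# Sketch: run pre-launch gates in order; refuse to launch on any failure.
# Gate names/commands mirror the table above (illustrative subset).
import subprocess

GATES = [
    ("Staging tests", ["make", "test-staging"]),
    ("Prod tests", ["make", "test-prod"]),
    ("Preflight GCP", ["python", "scripts/preflight_gcp.py"]),
    ("YAML contract", ["python", "scripts/validate_yaml_contract.py"]),
]


def run_gates(gates=GATES, runner=subprocess.run) -> bool:
    """Return True only if every gate exits 0. Stop at the first failure."""
    for name, cmd in gates:
        result = runner(cmd)
        if result.returncode != 0:
            print(f"GATE FAILED: {name} — DO NOT LAUNCH. Fix first.")
            return False
    return True
```

The injectable `runner` keeps the sequencing logic testable without invoking the real commands.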
The YAML contract gate (configs/cloud/yaml_contract.yaml) enforces:
- allowed_accelerators per cloud provider. A100 was removed — 5.5x cost, never authorized.
- train_factorial.yaml must have string-format accelerators (not dict/priority list), GCP only, spot only, Docker only.
- Applies to configs/cloud/*.yaml.

The contract is the SINGLE SOURCE OF TRUTH. Claude Code must NEVER update it without explicit user instruction. See CLAUDE.md Rule 31.
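A minimal sketch of the contract check, assuming an allow-list keyed by cloud provider; the allowed values here are examples, not the real contract (which lives only in configs/cloud/yaml_contract.yaml):

```python
# Sketch of the accelerator contract check. ALLOWED_ACCELERATORS is
# illustrative — the real allow-list is configs/cloud/yaml_contract.yaml.
ALLOWED_ACCELERATORS = {"gcp": {"L4", "T4"}}  # A100 deliberately absent


def check_accelerator(cloud: str, accelerators) -> list:
    """Return contract violations for one resources.accelerators value."""
    if not isinstance(accelerators, str):
        # Dict/priority lists make GPU selection non-deterministic.
        return ["dict/priority-list accelerators are banned"]
    gpu = accelerators.split(":")[0]
    if gpu not in ALLOWED_ACCELERATORS.get(cloud, set()):
        return [f"{gpu} not in allowed_accelerators for {cloud}"]
    return []
```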
1. Create <report-output> with headers + empty tables (H1)
2. Launch: bash scripts/run_factorial.sh <config.yaml>
3. Drive /factorial-monitor Phase 2 (MONITOR)

After all jobs are terminal, produce ALL of these (H3):
- Append to outputs/harness-state.jsonl

Compounding gate: Session is NOT complete until all 6 artifacts exist.
Cognitive engagement checkpoint: Before writing tests, articulate: "WHAT did I observe that was NOT predictable from prior passes? WHY does this observation matter for production reliability?"
Self-assess after every pass:
- Write harness-reflection-{N}.md with suggested rule changes

This phase prevents the harness from calcifying around rules that no longer apply. Skills are living artifacts that self-improve from deployment (Zhou et al. 2026).
See templates/experiment-run.xml — 8 required sections. Template is loaded on-demand (L3 progressive disclosure per Anthropic guide).
Inherits F1-F5 from /factorial-monitor, plus:
| Rule | Name | Prevents | Detection |
|---|---|---|---|
| H7 | YAML-IS-THE-CONTRACT | Unauthorized GPU types, cloud cost explosion | validate_yaml_contract.py + check_yaml_contract() in preflight |
| H8 | ZERO-YAML-IMPROVISATION | AI adding "helpful" fallbacks to configs | Pre-commit hook + test suite + this rule text |
| H9 | CONTRACT-BEFORE-LAUNCH | Launching with unauthorized resources | Preflight gate check_yaml_contract() MUST pass |
H7-H9 details:
- The contract file (configs/cloud/yaml_contract.yaml) defines EXACTLY what is allowed
- Dict-format accelerators ({L4: 1, A100: 1}) are BANNED in factorial YAML — they create non-deterministic GPU selection. Use string format (L4:1) for determinism.
- Metalearning record: .claude/metalearning/2026-03-24-unauthorized-a100-in-skypilot-yaml.md

Additionally inherits:
| Rule | Name | Prevents | Detection |
|---|---|---|---|
| H1 | REPORT-BEFORE-LAUNCH | Lost observations | Assert file exists before sky jobs launch |
| H2 | UPDATE-EVERY-POLL | Stale report | Timestamp check on report file after each poll |
| H3 | COMPOUND-OR-FAIL | Empty passes | Count artifacts ≥ 6 at session end |
| H4 | REFERENCE-PRIOR-PASS | Context amnesia | Assert <session-context> non-empty in XML |
| H5 | NO-AD-HOC-POLLING | Protocol bypass | All sky jobs queue goes through /factorial-monitor |
| H6 | REASON-BEFORE-TEMPLATE | Template-filling without thinking | Cognitive checkpoints in Phase 1 + 4 |
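The H1–H3 detections above can be sketched as runnable assertions. Function names and the six-artifact threshold follow the rules table; the exact paths are supplied by the XML and are not hard-coded here.

```python
# Sketch of H1-H3 as runnable checks (names illustrative).
import os


def assert_report_before_launch(report_path: str) -> None:
    """H1: the report file must exist before `sky jobs launch`."""
    assert os.path.exists(report_path), "H1 violated: create report first"


def assert_report_fresh(report_path: str, poll_started: float) -> None:
    """H2: the report must have been modified since the poll began."""
    assert os.path.getmtime(report_path) >= poll_started, "H2 violated: stale report"


def assert_compounded(artifact_paths: list) -> None:
    """H3: at least 6 compound-learning artifacts must exist at session end."""
    existing = [p for p in artifact_paths if os.path.exists(p)]
    assert len(existing) >= 6, f"H3 violated: only {len(existing)} artifacts"
```

Running these as asserts rather than log lines is the point: a violated rule halts the session instead of scrolling past.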
| Anti-Pattern | How It Manifests | Detection |
|---|---|---|
| Template zombie | Fill all 8 XML sections with boilerplate, no genuine reasoning | Watchlist items identical to prior pass, no new hypotheses |
| Launch-then-think | Skip VALIDATE, launch immediately, fix on cloud | Gate failures discovered after job submission |
| Report-at-end | Write report only after all jobs terminal | Report file empty during execution |
| Observation amnesia | Cloud observations noted in chat, never written to report | git diff report.md shows no updates between polls |
| Cost blindness | Launching without checking prior pass cost or setting budget | No cost table in report during execution |
After execution, these must all be answerable YES from artifacts:
- Did the report file exist before sky jobs launch?

Append to outputs/harness-state.jsonl after each pass:
{
"pass": 6,
"date": "2026-03-24",
"branch": "test/run-debug-gcp-6th-pass",
"jobs_total": 34,
"jobs_succeeded": 34,
"jobs_failed": 0,
"cost_usd": 8.50,
"observations": 8,
"new_tests_written": 5,
"issues_filed": 2,
"watchlist_carried": 3,
"watchlist_new": 2,
"harness_version": "0.2.0",
"normalized_gain": 0.15
}
This enables: "Is the harness getting better over time?" via duckdb-skills:query.
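Appending a record can be sketched as below; field names mirror the example record above, and one JSON object per line keeps the file queryable by duckdb or jq.

```python
# Sketch: append one pass record to outputs/harness-state.jsonl.
# Field names mirror the example record above.
import json
import os


def append_pass_record(record: dict, path: str = "outputs/harness-state.jsonl") -> None:
    """Append one JSON object per line so the history stays queryable."""
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```

Append-only JSONL means each pass adds exactly one line and never rewrites history, which is what makes trend queries over passes trustworthy.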
Track before/after per pass:
g = (pass_success_rate - prior_pass_success_rate) / (1 - prior_pass_success_rate)
A positive g means the harness + fixes improved outcomes. A negative g means the harness caused harm (bloated context, wrong rules, etc.) — trigger Phase 5 REFLECT with extra scrutiny.
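As a worked example of the formula: a prior pass at 80% success improving to 90% gives g = (0.9 − 0.8) / (1 − 0.8) = 0.5, i.e. half the remaining headroom was captured. A minimal sketch (the zero-headroom convention at 100% prior success is an assumption, since the formula divides by zero there):

```python
# Normalized gain g per the formula above.
def normalized_gain(success_rate: float, prior_success_rate: float) -> float:
    if prior_success_rate >= 1.0:
        return 0.0  # no headroom left; define g as 0 (assumed convention)
    return (success_rate - prior_success_rate) / (1 - prior_success_rate)
```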
| Skill | Role | When |
|---|---|---|
| /plan-context-load | Load context before XML generation | Phase 1 |
| /factorial-monitor | Monitoring + diagnosis | Phase 3 |
| /ralph-loop | Per-job log analysis | Phase 3 (via factorial-monitor) |
| /self-learning-iterative-coder | Writing new tests | Phase 4 |
| /issue-creator | Filing issues | Phase 4 |
| /search-metalearning | Check prior failure patterns | Phase 1 |