Meta-skill that standardizes experiment run XML creation, enforces execution protocol on top of /factorial-monitor, and gates on live report writing. The "reproducible harness" — ensures every experiment pass compounds learning. ACTIVATE when: creating a new factorial experiment pass (debug or production), resuming a multi-pass experiment series, or preparing a production launch. DO NOT ACTIVATE when: monitoring a single job (use ralph-loop), running local tests (use self-learning-iterative-coder), continuing an existing pass mid-flight (use factorial-monitor directly), or doing non-experiment development work.
Meta-skill for reproducible experiment execution with compound learning. Creates XML plans, enforces execution protocol, and ensures every pass leaves the codebase measurably better than before.
"Things should compound." — Guiding principle
Across 5 debug factorial passes, Claude Code repeatedly:
- Used /factorial-monitor but bypassed the XML protocol

See: references/failure-history.md
Need to run experiments on cloud GPU?
├── New pass (first time or new fixes applied)?
│ → /experiment-harness (THIS SKILL)
├── Continuing an existing pass mid-flight?
│ → /factorial-monitor (monitoring only)
├── Single job diagnosis?
│ → /ralph-loop
└── Local code changes needed?
    → /self-learning-iterative-coder
Layer 1 (Deterministic): scripts/run_factorial.sh
- Pure sky jobs launch calls in a loop
- ANY researcher can run this without Claude Code
- NO LLM involvement — reproducible, auditable
Layer 2 (Monitoring Harness): /experiment-harness → /factorial-monitor
- Creates experiment XML (standardized template)
- VALIDATES pre-launch gates (9 checks, 0 skips)
- Creates report file BEFORE launching
- Drives /factorial-monitor for monitoring + diagnosis
- Updates report after EVERY state change
- Captures compound learning (new tests, issues, observations)
CRITICAL SEPARATION: The .sh script is the PRODUCT. The harness is the DEVELOPER TOOL. Production runs use ONLY the .sh script.
Phase 1: GENERATE → Create experiment XML from template [protocols/generate.md]
Phase 2: VALIDATE → Pre-launch gates (all must pass) [protocols/validate.md]
Phase 3: EXECUTE → Launch .sh + drive /factorial-monitor [protocols/execute.md]
Phase 4: COMPOUND → Tests, issues, observations, watchlist [protocols/compound.md]
Phase 5: REFLECT → Self-assess harness effectiveness [protocols/reflect.md]
Create the experiment XML from templates/experiment-run.xml.
- Load context: /plan-context-load → navigator → domains → metalearning → registry
- Verify structure with yaml.safe_load() (NEVER regex)
- Cognitive engagement checkpoint (Shen et al. 2026): Before proceeding, articulate in the report: "WHY do I expect these specific watchlist items? WHAT might go wrong that prior passes haven't revealed?"
Output: docs/planning/v0-2_archive/original_docs/run-debug-factorial-experiment-{N}th-pass.xml
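The structure check above can be sketched as follows. This is a minimal sketch, assuming the check targets the factorial YAML config; the key layout (`resources.accelerators`) follows SkyPilot convention and is an assumption — adapt it to the actual schema.

```python
# Sketch: verify config structure with yaml.safe_load (NEVER regex).
# The resources.accelerators layout is an assumed SkyPilot-style schema.
import yaml


def verify_structure(path: str) -> list:
    """Return a list of structural problems; an empty list means OK."""
    with open(path) as f:
        cfg = yaml.safe_load(f)
    if not isinstance(cfg, dict):
        return ["top level is not a mapping"]
    problems = []
    acc = cfg.get("resources", {}).get("accelerators")
    if not isinstance(acc, str):
        # Dict/priority-list accelerators are banned (see H7-H9 rules).
        problems.append("accelerators must be a string like 'L4:1', not a dict")
    return problems
```

Parsing first and inspecting the resulting objects is what makes this auditable; a regex over raw text cannot distinguish a string accelerator from a dict one.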
Pre-launch gates that MUST ALL pass:
| Gate | Command | Criteria |
|---|---|---|
| Staging tests | make test-staging | 0 skipped, 0 failed |
| Prod tests | make test-prod | 0 skipped, 0 failed |
| Preflight GCP | python scripts/preflight_gcp.py | 11/11 passed (incl. YAML contract + cost) |
| YAML contract | python scripts/validate_yaml_contract.py | 0 violations |
| Docker image fresh | GAR timestamp > latest code commit | Image newer |
| Docker image verified | docker run ... python3 -c "import ..." | All imports OK |
| Factorial dry-run | run_factorial.sh --dry-run | Correct conditions |
| Report file created | <report-output> path from XML | File exists with headers |
| Prior pass referenced | <session-context> non-empty | Has root causes |
| Security scan | No credentials in XML, paths valid | Clean |
If ANY gate fails → DO NOT LAUNCH. Fix first.
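The fail-fast gate sequence can be sketched as below. The commands mirror the table above but are illustrative; a real implementation would run all nine gates and log each result to the report.

```python
# Sketch: run pre-launch gates in order; refuse to launch on any failure.
# Gate names/commands mirror the table above (illustrative subset).
import subprocess

GATES = [
    ("Staging tests", ["make", "test-staging"]),
    ("Prod tests", ["make", "test-prod"]),
    ("Preflight GCP", ["python", "scripts/preflight_gcp.py"]),
    ("YAML contract", ["python", "scripts/validate_yaml_contract.py"]),
]


def run_gates(gates=GATES, runner=subprocess.run) -> bool:
    """Return True only if every gate exits 0. Stop at the first failure."""
    for name, cmd in gates:
        result = runner(cmd)
        if result.returncode != 0:
            print(f"GATE FAILED: {name} — DO NOT LAUNCH. Fix first.")
            return False
    return True
```

The injectable `runner` keeps the sequencing logic testable without invoking the real commands.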
The YAML contract gate (configs/cloud/yaml_contract.yaml) enforces:
- allowed_accelerators per cloud provider. A100 was removed — 5.5x cost, never authorized.
- train_factorial.yaml must have string-format accelerators (not dict/priority list), GCP only, spot only, Docker only.
- Applies to configs/cloud/*.yaml.

The contract is the SINGLE SOURCE OF TRUTH. Claude Code must NEVER update it without explicit user instruction. See CLAUDE.md Rule 31.
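A minimal sketch of the contract check, assuming an allow-list keyed by cloud provider; the allowed values here are examples, not the real contract (which lives only in configs/cloud/yaml_contract.yaml):

```python
# Sketch of the accelerator contract check. ALLOWED_ACCELERATORS is
# illustrative — the real allow-list is configs/cloud/yaml_contract.yaml.
ALLOWED_ACCELERATORS = {"gcp": {"L4", "T4"}}  # A100 deliberately absent


def check_accelerator(cloud: str, accelerators) -> list:
    """Return contract violations for one resources.accelerators value."""
    if not isinstance(accelerators, str):
        # Dict/priority lists make GPU selection non-deterministic.
        return ["dict/priority-list accelerators are banned"]
    gpu = accelerators.split(":")[0]
    if gpu not in ALLOWED_ACCELERATORS.get(cloud, set()):
        return [f"{gpu} not in allowed_accelerators for {cloud}"]
    return []
```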
1. Create <report-output> with headers + empty tables (H1)
2. Launch: bash scripts/run_factorial.sh <config.yaml>
3. Drive /factorial-monitor Phase 2 (MONITOR)

After all jobs are terminal, produce ALL of these (H3):
- Append to outputs/harness-state.jsonl

Compounding gate: Session is NOT complete until all 6 artifacts exist.
Cognitive engagement checkpoint: Before writing tests, articulate: "WHAT did I observe that was NOT predictable from prior passes? WHY does this observation matter for production reliability?"
Self-assess after every pass:
- Write harness-reflection-{N}.md with suggested rule changes

This phase prevents the harness from calcifying around rules that no longer apply. Skills are living artifacts that self-improve from deployment (Zhou et al. 2026).
See templates/experiment-run.xml — 8 required sections. Template is loaded on-demand (L3 progressive disclosure per Anthropic guide).
Inherits F1-F5 from /factorial-monitor, plus:
| Rule | Name | Prevents | Detection |
|---|---|---|---|
| H7 | YAML-IS-THE-CONTRACT | Unauthorized GPU types, cloud cost explosion | validate_yaml_contract.py + check_yaml_contract() in preflight |
| H8 | ZERO-YAML-IMPROVISATION | AI adding "helpful" fallbacks to configs | Pre-commit hook + test suite + this rule text |
| H9 | CONTRACT-BEFORE-LAUNCH | Launching with unauthorized resources | Preflight gate check_yaml_contract() MUST pass |
H7-H9 details:
- The contract file (configs/cloud/yaml_contract.yaml) defines EXACTLY what is allowed
- Dict-format accelerators ({L4: 1, A100: 1}) are BANNED in factorial YAML — they create non-deterministic GPU selection. Use string format (L4:1) for determinism.
- Metalearning record: .claude/metalearning/2026-03-24-unauthorized-a100-in-skypilot-yaml.md

Additionally inherits:
| Rule | Name | Prevents | Detection |
|---|---|---|---|
| H1 | REPORT-BEFORE-LAUNCH | Lost observations | Assert file exists before sky jobs launch |
| H2 | UPDATE-EVERY-POLL | Stale report | Timestamp check on report file after each poll |
| H3 | COMPOUND-OR-FAIL | Empty passes | Count artifacts ≥ 6 at session end |
| H4 | REFERENCE-PRIOR-PASS | Context amnesia | Assert <session-context> non-empty in XML |
| H5 | NO-AD-HOC-POLLING | Protocol bypass | All sky jobs queue goes through /factorial-monitor |
| H6 | REASON-BEFORE-TEMPLATE | Template-filling without thinking | Cognitive checkpoints in Phase 1 + 4 |
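The H1–H3 detections above can be sketched as runnable assertions. Function names and the six-artifact threshold follow the rules table; the exact paths are supplied by the XML and are not hard-coded here.

```python
# Sketch of H1-H3 as runnable checks (names illustrative).
import os


def assert_report_before_launch(report_path: str) -> None:
    """H1: the report file must exist before `sky jobs launch`."""
    assert os.path.exists(report_path), "H1 violated: create report first"


def assert_report_fresh(report_path: str, poll_started: float) -> None:
    """H2: the report must have been modified since the poll began."""
    assert os.path.getmtime(report_path) >= poll_started, "H2 violated: stale report"


def assert_compounded(artifact_paths: list) -> None:
    """H3: at least 6 compound-learning artifacts must exist at session end."""
    existing = [p for p in artifact_paths if os.path.exists(p)]
    assert len(existing) >= 6, f"H3 violated: only {len(existing)} artifacts"
```

Running these as asserts rather than log lines is the point: a violated rule halts the session instead of scrolling past.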
| Anti-Pattern | How It Manifests | Detection |
|---|---|---|
| Template zombie | Fill all 8 XML sections with boilerplate, no genuine reasoning | Watchlist items identical to prior pass, no new hypotheses |
| Launch-then-think | Skip VALIDATE, launch immediately, fix on cloud | Gate failures discovered after job submission |
| Report-at-end | Write report only after all jobs terminal | Report file empty during execution |
| Observation amnesia | Cloud observations noted in chat, never written to report | git diff report.md shows no updates between polls |
| Cost blindness | Launching without checking prior pass cost or setting budget | No cost table in report during execution |
After execution, these must all be answerable YES from artifacts:
- Did the report file exist before sky jobs launch?

Append to outputs/harness-state.jsonl after each pass:
{
"pass": 6,
"date": "2026-03-24",
"branch": "test/run-debug-gcp-6th-pass",
"jobs_total": 34,
"jobs_succeeded": 34,
"jobs_failed": 0,
"cost_usd": 8.50,
"observations": 8,
"new_tests_written": 5,
"issues_filed": 2,
"watchlist_carried": 3,
"watchlist_new": 2,
"harness_version": "0.2.0",
"normalized_gain": 0.15
}
This enables: "Is the harness getting better over time?" via duckdb-skills:query.
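Appending a record can be sketched as below; field names mirror the example record above, and one JSON object per line keeps the file queryable by duckdb or jq.

```python
# Sketch: append one pass record to outputs/harness-state.jsonl.
# Field names mirror the example record above.
import json
import os


def append_pass_record(record: dict, path: str = "outputs/harness-state.jsonl") -> None:
    """Append one JSON object per line so the history stays queryable."""
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```

Append-only JSONL means each pass adds exactly one line and never rewrites history, which is what makes trend queries over passes trustworthy.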
Track before/after per pass:
g = (pass_success_rate - prior_pass_success_rate) / (1 - prior_pass_success_rate)
A positive g means the harness + fixes improved outcomes. A negative g means the harness caused harm (bloated context, wrong rules, etc.) — trigger Phase 5 REFLECT with extra scrutiny.
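As a worked example of the formula: a prior pass at 80% success improving to 90% gives g = (0.9 − 0.8) / (1 − 0.8) = 0.5, i.e. half the remaining headroom was captured. A minimal sketch (the zero-headroom convention at 100% prior success is an assumption, since the formula divides by zero there):

```python
# Normalized gain g per the formula above.
def normalized_gain(success_rate: float, prior_success_rate: float) -> float:
    if prior_success_rate >= 1.0:
        return 0.0  # no headroom left; define g as 0 (assumed convention)
    return (success_rate - prior_success_rate) / (1 - prior_success_rate)
```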
| Skill | Role | When |
|---|---|---|
| /plan-context-load | Load context before XML generation | Phase 1 |
| /factorial-monitor | Monitoring + diagnosis | Phase 3 |
| /ralph-loop | Per-job log analysis | Phase 3 (via factorial-monitor) |
| /self-learning-iterative-coder | Writing new tests | Phase 4 |
| /issue-creator | Filing issues | Phase 4 |
| /search-metalearning | Check prior failure patterns | Phase 1 |