Orchestrator experiment preset — coordinating scientific experiment workflows across specialist agents. Manages structured hypothesis testing: OBSERVE → HYPOTHESIZE → PREDICT → DESIGN → INSTRUMENT → EXECUTE → ANALYZE → RECORD. Guides causal isolation reasoning, predictions-before-execution, and four post-experiment verdicts (CONFIRMED/REJECTED/PARTIALLY CONFIRMED/INCONCLUSIVE). Use this skill when orchestrating experiments, debugging via scientific method, or running structured hypothesis-driven investigations.
This skill guides scientific reasoning — it is not a compliance checklist. Every rule here exists for a specific reason. When reviewing or executing experiments:
An agent that mechanically applies rules without reasoning is no better than a linter. An agent that reasons about experimental design and catches real methodological flaws is invaluable.
Trigger patterns: "experiment", "hypothesis", "debug scientifically", "root cause", "investigate", "why does", "measure", "bisect", "A/B test", "reproduce"
OBSERVE → HYPOTHESIZE → PREDICT → DESIGN → INSTRUMENT → EXECUTE → ANALYZE → RECORD
| Stage | Output |
|---|---|
| OBSERVE | Symptom data |
| HYPOTHESIZE | Falsifiable claims |
| PREDICT | Metric targets |
| DESIGN | Classify observational/interventional |
| INSTRUMENT | Add probes & logging |
| EXECUTE | Run it |
| ANALYZE | Verdict & learn |
| RECORD | Update registry |
DESIGN-REVIEW (HyperSan, once) → EXECUTE (HyperArch, autonomous) → ANALYZE (HyperSan, post-hoc)
Experiment authoring (Steps 1–5) happens BEFORE orchestration begins — the experiment entry must exist with Steps 1–5 filled before HyperOrch starts.
Describe the exact symptom with measurable data. No interpretation, no guessing.
List ALL plausible explanations as falsifiable claims.
[LIKELY], [POSSIBLE], [UNLIKELY] — prior belief, never modified retroactively.

State specific predictions BEFORE execution. Use the 5-column prediction table:
| # | If hypothesis is true... | Metric | Expected Value | Tolerance |
|---|---|---|---|---|
Why this is absolute: Post-hoc rationalization is cognitively invisible — humans and agents naturally construct explanations that fit observed data. Without pre-registered predictions, you cannot distinguish genuine understanding from pattern-matching on noise. This gate has no reasoning exceptions because the failure mode is undetectable from inside.
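The pre-registration gate can be made mechanical. A minimal sketch, assuming a hypothetical `Prediction` record (the names and fields here are illustrative, not part of this preset): predictions are frozen before execution, and comparison against observations happens only afterward.

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen: a prediction is immutable once registered
class Prediction:
    hypothesis_id: str
    metric: str
    expected: float
    tolerance: float  # absolute tolerance around the expected value

    def matches(self, observed: float) -> bool:
        """True if the observed value falls within the pre-registered band."""
        return abs(observed - self.expected) <= self.tolerance

# Written BEFORE execution (Step 3); checked only AFTER execution (Step 7).
p = Prediction("H1", "p99_latency_ms", expected=120.0, tolerance=15.0)
p.matches(128.5)  # within 120 ± 15
```

Freezing the record is the point: the comparison logic cannot be bent to fit the data after the fact.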
Classify the experiment type and create a minimal design:
→ See experiment_types.md for classification rules
Set minimum_confirmation_runs:
| Category | Runs | Examples |
|---|---|---|
| Stochastic | 2+ | Perf benchmarks, load tests, flaky repros |
| Deterministic | 1 | Bug reproduction, logic errors |
| Unreproducible | 0 (observational only) | Production incidents, one-time failures |
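The run-count table above can be sketched as a lookup with an explicit failure on unknown categories (function and key names are illustrative assumptions, not this preset's API):

```python
MIN_RUNS = {
    "stochastic": 2,      # perf benchmarks, load tests, flaky repros
    "deterministic": 1,   # bug reproduction, logic errors
    "unreproducible": 0,  # production incidents: observational analysis only
}

def minimum_confirmation_runs(category: str) -> int:
    """Return the floor on confirmation runs for an experiment category."""
    try:
        return MIN_RUNS[category]
    except KeyError:
        raise ValueError(f"unknown experiment category: {category!r}")
```

Failing loudly on an unrecognized category forces the classification decision in Step 4 rather than silently defaulting to one run.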
Add measurement or intervention code, tagged for clean removal.
# PROBE tagged instrumentation
→ See probing_guideline.md for principles, tag convention, and cleanup

# INTERVENTION tagged code changes or use branch/config strategy
→ See intervention_procedure.md for tiered strategy and cleanup gates

Run the experiment. Collect data. Log everything.
Compare predictions to measurements. Force a verdict into one of four categories:
CONFIRMED | REJECTED | PARTIALLY CONFIRMED | INCONCLUSIVE
Why only four: Soft verdicts like "LIKELY CONFIRMED" let agents avoid committing to a conclusion. If the evidence clearly supports the hypothesis, say CONFIRMED. If not, say what it actually is. Suggestive-but-not-conclusive evidence is INCONCLUSIVE with a recommendation narrative — that's honest, and it drives the next experiment.
PARTIALLY CONFIRMED: Some predictions match, others don't, but the evidence itself is clear (not noisy — the predictions were partially right).
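The verdict rules above can be sketched as a single forcing function. This is an assumption about how the four categories compose, not a mandated implementation; `evidence_clear` stands in for the experimenter's judgment that the data is not dominated by noise:

```python
def verdict(matches: list[bool], evidence_clear: bool) -> str:
    """Force exactly one of the four verdicts from per-prediction results."""
    if not evidence_clear:
        return "INCONCLUSIVE"   # noisy or ambiguous data never confirms
    if not matches:
        return "INCONCLUSIVE"   # nothing was pre-registered to test against
    if all(matches):
        return "CONFIRMED"
    if not any(matches):
        return "REJECTED"
    return "PARTIALLY CONFIRMED"  # evidence clear, predictions partially right
```

Note there is no return path for "LIKELY CONFIRMED": suggestive-but-unclear evidence falls through to INCONCLUSIVE by construction.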
Update all records and clean up experiment artifacts:
# PROBE and # INTERVENTION changes

These are the principles that make experiments trustworthy. Each carries its rationale so you can reason about intent.
Predictions MUST be written before Step 6.
Why absolute: Post-hoc rationalization is cognitively invisible. This is the one rule where "the purpose is obviously served" reasoning fails, because you cannot detect the bias from inside. Always enforced.
Purpose: When an experiment changes something and observes an effect, you need to be able to attribute the effect to a specific cause.
The test: "If a metric moves, can I determine which change caused it?"
Typical application: Each interventional arm changes one variable from baseline — this is the simplest way to ensure attribution.
Reasoning beyond the simple case: The principle is about attributability, not about counting variables. Designs that test multiple variables CAN satisfy causal isolation if the design allows decomposition:
When reviewing: Ask "can the experimenter determine what caused each observed effect?" — not "did each arm change exactly one variable?"
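One concrete decomposable design is a 2×2 factorial: it changes two variables at once yet remains attributable, because each variable's main effect is recovered by averaging over the other. A minimal sketch with hypothetical variables and throughput numbers (nothing here is prescribed by this preset):

```python
from itertools import product
from statistics import mean

# Four arms covering every combination of two variables.
arms = [{"cache": c, "batch": b} for c, b in product(["off", "on"], [16, 64])]

def main_effect(results: dict[tuple[str, int], float], var: str) -> float:
    """Mean outcome difference between the two levels of one variable,
    averaged over the levels of the other."""
    idx = 0 if var == "cache" else 1
    levels = sorted({key[idx] for key in results})
    per_level = [mean(v for k, v in results.items() if k[idx] == lvl)
                 for lvl in levels]
    return per_level[-1] - per_level[0]

# Hypothetical throughput measurements, one per arm:
results = {("off", 16): 100.0, ("off", 64): 120.0,
           ("on", 16): 150.0, ("on", 64): 170.0}
main_effect(results, "cache")  # effect of cache, averaged over batch sizes
```

Each observed effect decomposes cleanly to its cause, so the design satisfies causal isolation despite varying two things.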
Purpose: A single stochastic run can be misleading due to random variation.
The test: "Is the observed result distinguishable from noise?"
Deterministic experiments (same seed, same hardware = same result) need one run. Stochastic experiments need enough runs to establish a pattern. The experimenter should justify their run count relative to expected variance.
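For stochastic experiments, the noise question can at least be screened numerically. A crude sketch, not a substitute for a proper statistical test; the threshold and function name are illustrative assumptions:

```python
from statistics import mean, stdev

def distinguishable_from_noise(baseline: list[float],
                               treatment: list[float],
                               sigmas: float = 2.0) -> bool:
    """Screen: is the mean shift larger than `sigmas` times the average
    run-to-run standard deviation? Requires at least 2 runs per arm."""
    shift = abs(mean(treatment) - mean(baseline))
    noise = (stdev(baseline) + stdev(treatment)) / 2
    return shift > sigmas * noise
```

A result that fails this screen should push the verdict toward INCONCLUSIVE, or push the design toward more runs.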
Purpose: Avoid re-testing hypotheses that prior experiments already resolved.
The test: "Has this hypothesis (or a substantially similar one) already been tested?"
Read the experiment registry before designing new experiments. Previously REJECTED hypotheses need new evidence or a meaningfully different angle to justify re-testing.
If a completed experiment entry is provided (Steps 1–5 already filled):
Invoke HyperSan to validate experiment design. HyperSan should reason about:
If substantial issues: Return to experiment author with specific reasoning. Max 2 revision cycles. If sound: Proceed to Phase 2 autonomously.
HyperOrch MUST invoke runSubagent(HyperArch) for this phase. If runSubagent is unavailable or fails, report the delegation failure to the user — do not improvise alternative execution.
Invoke HyperArch to run the experiment: