Execute experiments from a hypothesis file — run code, work through proofs, gather evidence, analyze data, and write results back. Use this skill whenever a hypothesis file has designed experiments with empty Results sections, when the user says 'run the experiments', 'test the hypothesis', 'try it', or 'execute the experiments', or when the next step in a scientific-method workflow is to actually attempt the experiments.
Execute experiments and record what actually happens. The entire research loop depends on honest results.
The input is a hypothesis file (`hypothesis-NN.md`). The research loop only works when results are trustworthy:

- Record experiments you cannot execute reliably as `not-runnable` with a clear explanation rather than attempting unreliable workarounds — a clean "I can't test this" is more valuable than a misleading result.
- Record experiments made redundant by a decisive earlier result as `skipped` with a pointer to the decisive result.

## Step 1: Read the hypothesis and run pending experiments

Read the hypothesis file. For each experiment, check whether `#### Results` already has content beyond the comment placeholder. Skip experiments that already have results — the file is the checkpoint.
If all experiments already have results, report that experiments are complete and exit.
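The pending-experiment check above can be sketched in Python. This assumes experiments appear under `### Experiment N` headings and that an unfilled Results section holds only an HTML comment placeholder — both are hypothetical conventions for illustration, not guaranteed by this skill:

```python
import re

def pending_experiments(text: str) -> list[int]:
    """Return numbers of experiments whose Results section is still empty.

    Assumes '### Experiment N' headings and HTML-comment placeholders,
    both hypothetical conventions used here for illustration.
    """
    pending = []
    # Split on experiment headings; chunk 0 is the preamble, then pairs
    # of (number, body) follow.
    chunks = re.split(r"^### Experiment (\d+)", text, flags=re.M)
    for i in range(1, len(chunks), 2):
        number, body = int(chunks[i]), chunks[i + 1]
        match = re.search(
            r"^#### Results\n(.*?)(?=^#{1,4} |\Z)", body, flags=re.M | re.S
        )
        content = match.group(1) if match else ""
        # Strip comment placeholders and whitespace; anything left
        # counts as real results.
        stripped = re.sub(r"<!--.*?-->", "", content, flags=re.S).strip()
        if not stripped:
            pending.append(number)
    return pending
```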
Work through pending experiments in order. For each experiment type, follow the approach below. After each experiment, if the outcome is confirmed or refuted with strong evidence, skip remaining experiments (see the results template in Step 2 for how to record skipped experiments).
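The control loop above can be sketched as follows. The dict keys and the `run_one` helper are hypothetical stand-ins for the per-type execution logic described below, not part of this skill:

```python
def run_pending(experiments, run_one):
    """Run experiments in order, stopping early after a decisive result.

    `experiments` is a list of dicts with hypothetical keys: 'status'
    ('done' when the Results section is already filled), and, after
    running, 'outcome' and 'strength'. `run_one` stands in for the
    per-type execution logic.
    """
    decisive = None
    for exp in experiments:
        if exp.get("status") == "done":
            continue  # already has results; the file is the checkpoint
        if decisive is not None:
            exp["outcome"] = "skipped"  # an earlier result was decisive
            continue
        run_one(exp)
        if exp.get("outcome") in ("confirmed", "refuted") and exp.get("strength") == "strong":
            decisive = exp
    return decisive
```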
Code experiments (type: code):
- Write each script to `<problem-dir>/experiments/<hypothesis-slug>/exp<N>.<ext>`, where `<N>` is the next globally sequential integer for that directory (check existing files to determine it) and `<ext>` matches the language (`.py`, `.sh`, `.js`, `.csv`, etc.). Create the directory if it does not exist. Numbers never reset and existing files are never overwritten.
- `<hypothesis-slug>` is the hypothesis filename without its extension (e.g. `hypothesis-01.md` → `hypothesis-01`).
- Verify required tools exist before running (`which <tool>` or `<tool> --version`). A missing dependency means outcome `not-runnable`, not a refutation of the hypothesis.
- Experiments must stay inside `<problem-dir>/` and must not spawn persistent background processes. If an experiment requires either, record it as `not-runnable` with an explanation instead of running it.
- If a script fails, record the outcome as `inconclusive` or `not-runnable` as appropriate. A failure is not evidence against the hypothesis unless the failure itself is informative.

Math proof experiments (type: math-proof):
Evidence-gathering experiments (type: evidence-gathering):
- If `search_papers` is available, prefer it for academic evidence queries — it returns structured results with citation counts and open-access links. Use `get_references`/`get_citations` to follow citation chains from key papers. Fall back to WebSearch for non-academic sources or if MCP tools are unavailable.
- Use `references.md` where available.

Data-analysis experiments (type: data-analysis):
- Locate datasets (`search_papers` if available to find papers with linked datasets, otherwise WebSearch).
- Save data files and analysis scripts under `<problem-dir>/experiments/<hypothesis-slug>/` using the `exp<N>.<ext>` convention from code experiments.

Logical-deduction experiments (type: logical-deduction):
## Step 2: Record results

For each completed experiment, fill in the `#### Results` section using Edit. Do not modify the experiment design above it.
Results template:
#### Results
**Artifact:** experiments/<hypothesis-slug>/exp<N>.<ext>
**Outcome:** <confirmed | refuted | inconclusive | not-runnable>
<Detailed narrative of what was done and what was found. Include code outputs, proof steps, quoted evidence, statistics, or error messages. The reader must be able to reproduce the methodology from this record alone. Report all outcomes honestly, including negative or unexpected results. For quantitative data, include appropriate statistical context (e.g., sample size, confidence interval, p-value). Distinguish between what was directly observed and what was inferred.>
**Evidence strength:** <strong | moderate | weak>
The **Artifact:** line is only included when the experiment wrote files to experiments/. Math-proof, logical-deduction, and evidence-gathering experiments that produce no persistent artifact omit this line.
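Filling the template programmatically can look like this sketch; the `format_results` helper is hypothetical, but the field names and the Artifact rule are the template's own:

```python
def format_results(outcome, narrative, strength, artifact=None):
    """Render a Results block following the template above.

    `artifact` is included only when the experiment wrote files to
    experiments/, mirroring the rule for the **Artifact:** line.
    """
    lines = ["#### Results", ""]
    if artifact is not None:
        lines.append(f"**Artifact:** {artifact}")
    lines += [
        f"**Outcome:** {outcome}",
        "",
        narrative,
        "",
        f"**Evidence strength:** {strength}",
    ]
    return "\n".join(lines)
```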
Evidence strength guidance:
For skipped experiments (when an earlier experiment was decisive):
#### Results
**Outcome:** skipped
Experiment <N> produced a decisive <confirmed|refuted> result with strong evidence. Further testing is unnecessary.
## Step 3: Novelty check

Run this step only for experiments with a confirmed outcome, and only where the hypothesis file contains a `## Literature` section.
Run 1-2 targeted searches comparing the confirmed result against existing published work (prefer search_papers if available, otherwise WebSearch). Search for the specific method, finding, or mechanism confirmed — not general background.
If the literature section has fewer than 3 sources, this search is especially important to avoid false novelty claims.
After the search, append a **Novelty:** tag immediately after **Evidence strength:** in the Results section:
**Novelty:** <novel | incremental | replication> — <one sentence of rationale, citing any newly discovered prior art by URL>
Novelty scale:
- `novel` — no prior art found for this specific result
- `incremental` — extends or improves on prior work
- `replication` — reproduces known findings

The `replication` tag signals the orchestrator that this result may not satisfy success criteria when `Novelty required: yes` is in `problem.md`.
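Appending the tag can be sketched as below. The function is illustrative only, and it assumes the Results section contains exactly one **Evidence strength:** line:

```python
import re

def add_novelty_tag(results: str, rating: str, rationale: str) -> str:
    """Insert a **Novelty:** line right after the **Evidence strength:** line.

    Illustrative sketch; assumes exactly one Evidence strength line.
    """
    tag = f"**Novelty:** {rating} — {rationale}"
    return re.sub(
        r"(^\*\*Evidence strength:\*\*.*$)",
        lambda m: m.group(1) + "\n" + tag,  # keep the matched line, append the tag
        results,
        count=1,
        flags=re.M,
    )
```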
If no `## Literature` section exists in the hypothesis file, skip this step entirely.