Use when designing a new experiment or research protocol and need to ensure methodological rigor
You are designing a rigorous empirical study. Your job is to turn a research question into a concrete experiment design that follows the experiment-design schema in CLAUDE.md — but with the methodological judgment to fill it well, not just structurally.
The argument is a research question, an open question from a project, or a project path. If a project path, read the README to find the relevant open questions and existing results.
Before any design work, review decisions/ for prior methodological choices that constrain the design.

Write a falsifiable hypothesis. A good hypothesis makes a specific, testable prediction that a concrete experimental result could refute.
If the question is exploratory (no clear prediction possible), say so explicitly and frame it as a measurement study with defined quantities of interest rather than forcing a hypothesis.
Independent variables: What are you varying? Justify why these variables and not others. Name alternatives you considered and why you rejected them.
Dependent variables / metrics: What are you measuring? For each metric, state its measurement properties and which alternatives you considered.
If multiple metrics are plausible, recommend a primary metric and state why, then list secondary metrics.
Controlled variables: What must be held fixed for the comparison to be valid? Be specific — "same dataset" is insufficient; specify which subset, what preprocessing, what exclusion criteria.
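One way to satisfy the specificity requirement is to pin the controlled variables in a single explicit record. A minimal sketch, assuming Python tooling; every field name and value below is an illustrative placeholder, not from any real project:

```python
# Sketch: pin controlled variables in one frozen record so every condition
# runs against the identical setup. All field values are placeholders.
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class Controls:
    dataset_subset: str = "validation split, sessions 1-200"   # which subset
    preprocessing: str = "resize to 512x512, RGB"              # what preprocessing
    exclusion: str = "drop sessions with missing labels"       # exclusion criteria

controls = Controls()
print(asdict(controls))  # one dict to paste into the design doc verbatim
```

Freezing the dataclass makes accidental mid-experiment mutation an error rather than a silent validity threat.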
Write a step-by-step procedure. For each step, include concrete tool or command references.
Production code path verification (MANDATORY before anchoring on any production path): Before referencing any production code path in the design:
- Check whether the project has a production-code.md file. If so, read it first.
- Reference the exact symbol, not just the file (e.g., modules/example-service/src/config.py:ServiceConfig.get_default_config()).
- Confirm the path exists with ls -la <path> or equivalent.
- Confirm actual usage with rg "<path_or_module>". Watch for experiment-local copies (e.g., projects/<project>/batch_eval/), deprecated files (*_deprecated.py), or test utilities.
- Check infra/.env for environment-level configuration that may override defaults.

Why this matters: Agents (especially smaller models like Fast Model) may anchor on files that "look authoritative" without verifying actual usage. The verification step prevents config mismatches that would invalidate experiments. See ADR 0039 for incident details.
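The verification steps above can be sketched as a small pre-flight check. This is a hypothetical helper, not part of any real project; the demo path and symbol name are placeholders:

```python
# Hypothetical pre-flight check for a production code path.
# Real usage would target the path named in production-code.md.
from pathlib import Path

def verify_production_path(path: str, symbol: str) -> list[str]:
    """Return a list of problems; an empty list means the path passed."""
    problems = []
    p = Path(path)
    if not p.exists():
        problems.append(f"path does not exist: {path}")
        return problems
    if "_deprecated" in p.name:
        problems.append(f"deprecated file: {path}")
    if symbol not in p.read_text():
        problems.append(f"symbol {symbol!r} not found in {path}")
    return problems

# Usage with a throwaway file so the sketch is runnable:
demo = Path("demo_config.py")
demo.write_text("class ServiceConfig:\n    pass\n")
print(verify_production_path("demo_config.py", "ServiceConfig"))  # -> []
demo.unlink()
```

An empty problem list is the gate for referencing the path in the design; any non-empty result means the path must not be anchored on.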
Upstream limitations review (MANDATORY when consuming prior experiment outputs): Before designing an experiment that uses outputs from prior experiments, read each upstream EXPERIMENT.md "Limitations" section and state explicitly how each limitation does or does not affect this design.
If no upstream experiments are consumed, state: "Upstream limitations reviewed: none (no prior experiment outputs consumed)".
Why this matters: EXPERIMENT.md "Limitations" sections document known issues that may invalidate downstream results. Without explicit review, agents assume upstream outputs are valid, leading to cascading failures. See projects/sample-project/diagnosis/diagnosis-input-validation-gap-2026-02-26.md for incident details.
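The review can be made mechanical by extracting the section programmatically. A sketch that assumes upstream files use a "## Limitations" heading (the sample document below is illustrative):

```python
# Sketch: pull the "Limitations" section out of an upstream EXPERIMENT.md
# so it must be looked at before the downstream design is written.
import re

def limitations(experiment_md: str) -> str:
    """Return the text under '## Limitations', or '' if the section is absent."""
    m = re.search(r"## Limitations\n(.*?)(?=\n## |\Z)", experiment_md, re.S)
    return m.group(1).strip() if m else ""

doc = "## Results\nok\n## Limitations\n- input validation gap\n## Next\n"
print(limitations(doc))  # -> "- input validation gap"
```

An empty return value is itself a finding: it means the upstream experiment documented no limitations, which should be stated rather than silently assumed.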
Model selection verification (MANDATORY when the experiment calls an external model): If the experiment design involves calling any external model (LLM or VLM), verify the exact model identifier and consult the accumulated empirical data on model performance before committing to a model.
Why this matters: Model naming is confusing (e.g., gemini-2.0-flash vs gemini-3-flash-preview are different models with different capabilities). akari has accumulated empirical data on model performance that must be leveraged. See projects/sample-project/postmortem/postmortem-vlm-model-selection-no-capability-lookup-2026-02-26.md.
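A sketch of the lookup gate this implies. The table entries below are placeholders standing in for the real accumulated capability data, not benchmark results:

```python
# Sketch: refuse to select a model that has no recorded capability data.
# Entries are placeholders, not real measurements.
CAPABILITY_DATA = {
    "gemini-2.0-flash": {"vision": True, "notes": "placeholder entry"},
}

def require_capability_data(model_name: str) -> dict:
    if model_name not in CAPABILITY_DATA:
        raise ValueError(
            f"no recorded capability data for {model_name!r}; "
            "consult accumulated results before selecting it"
        )
    return CAPABILITY_DATA[model_name]

print(require_capability_data("gemini-2.0-flash")["vision"])  # -> True
```

The point of raising instead of defaulting is that a confusingly-named model (e.g., a `-preview` variant) fails loudly at design time rather than silently at evaluation time.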
Address these validity threats explicitly:
- Production config mismatch: verify the experiment config against production-code.md if it exists (see ADR 0039). Include: (1) exact file path and function/class of the production config, (2) parameter-by-parameter comparison confirming the experiment config matches, (3) pipeline architecture match (single-stage vs. multi-stage, model checkpoints). Ambiguous references like "from config.py" are insufficient when multiple files share the same name. Verify paths against existing working project scripts, not just production module code. See projects/sample-project/postmortem/postmortem-eval-config-mismatch-production-2026-02-25.md.

Deployment gates (preconditions — must be true to ship, not research):
Evaluation metrics (research — measured over N sessions):
What result would confirm the hypothesis? What would refute it? What would be ambiguous? Be specific about thresholds or effect sizes.
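Pre-registering the confirm/refute/ambiguous reading as an explicit rule keeps the outcome call mechanical. A sketch with placeholder thresholds, assuming the metric comes with a confidence interval:

```python
# Sketch: pre-registered decision rule for reading the result.
# The threshold value is a placeholder, not a recommendation.
def verdict(effect: float, ci_low: float, ci_high: float,
            threshold: float = 0.05) -> str:
    if ci_low > threshold:
        return "confirmed"   # whole interval above the pre-set effect size
    if ci_high < threshold:
        return "refuted"     # whole interval below it
    return "ambiguous"       # interval straddles the threshold

print(verdict(0.10, 0.07, 0.13))  # -> confirmed
```

Writing the rule down before data collection is what makes the "Ambiguous if" branch honest rather than a post-hoc escape hatch.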
Produce the experiment using the schema from CLAUDE.md:
## Experiment: <title>
Hypothesis: <falsifiable statement — or "Measurement study" with defined quantities>
CI layers: <which layers are involved and how>
Variables:
- Independent: <what we vary, with justification>
- Dependent: <what we measure, with metric properties and alternatives considered>
- Controlled: <what we hold fixed, with specifics>
Method:
1. <step with tool/command references>
2. ...
Validity threats:
- <threat>: <how addressed>
Cost estimate:
- API calls: <N calls × $X = $total>
- Compute: <estimate>
- Human time: <estimate>
- Sessions: <single or multi-session>
Success criteria:
- Confirmed if: <specific threshold or pattern>
- Refuted if: <specific threshold or pattern>
- Ambiguous if: <what would leave the question open>
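As a worked example of the Cost estimate arithmetic in the schema above (all counts and prices are illustrative placeholders):

```python
# Illustrative cost arithmetic for the "Cost estimate" section.
n_calls = 500
cost_per_call = 0.002  # dollars per API call (placeholder price)
api_cost = n_calls * cost_per_call
print(f"API calls: {n_calls} calls x ${cost_per_call} = ${api_cost:.2f}")
# -> API calls: 500 calls x $0.002 = $1.00
```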
After the schema output, add a brief Design rationale section explaining the key judgment calls you made and what alternatives you rejected.
When the experiment involves a long-running process (>5 minutes), the analysis task must be split into incremental subtasks per decisions/0023-incremental-analysis-throttling.md, each with a Done-when that is achievable before the full run completes.
This prevents monolithic analysis tasks from being perpetually re-selected by /orient with diminishing returns. A single "Analyze results" task with a Done-when achievable only at completion will loop indefinitely.
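A sketch of what the split looks like in practice; the task wording and batch size are illustrative:

```python
# Sketch: split one monolithic "Analyze results" task into per-batch tasks,
# each with its own achievable Done-when.
def split_analysis(total_items: int, batch_size: int) -> list[str]:
    tasks = []
    for start in range(0, total_items, batch_size):
        end = min(start + batch_size, total_items)
        tasks.append(f"Analyze items {start}-{end - 1}; "
                     f"Done-when: summary for items {start}-{end - 1} written")
    return tasks

for task in split_analysis(10, 4):
    print(task)
```

Each subtask completes independently, so a task selector sees genuine progress instead of re-selecting the same never-finishing task.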
When an experiment requires visual validation (rendering GLB/FBX models, generating images from 3D scenes), an agent can use:
xvfb-run -a python3 -c "
import trimesh

# Load as a Scene so save_image() is available even for single-mesh files
scene = trimesh.load('model.glb', force='scene')
png_bytes = scene.save_image(resolution=[1280, 960])
with open('output.png', 'wb') as f:
    f.write(png_bytes)
"
Do not assume visual rendering is blocked in headless environments. The xvfb + trimesh combination enables headless visual validation.
Follow docs/sops/commit-workflow.md. Commit message: design: <experiment title> — status: planned