Guides structured 4-stage experiment execution with attempt budgets and gate conditions: Stage 1 initial implementation (reproduce baseline), Stage 2 hyperparameter tuning, Stage 3 proposed method validation, Stage 4 ablation study. Integrates with evo-memory (load prior strategies, trigger IVE/ESE) and experiment-craft (5-step diagnostic on failure). Use when: user has a planned experiment, needs to reproduce baselines, organize experiment workflow, or systematically validate a method. Do NOT use for debugging a specific experiment failure (use experiment-craft) or designing which experiments to run (use paper-planning).
A structured 4-stage framework for executing research experiments from initial implementation through ablation study, with attempt budgets and gate conditions that prevent wasted effort. This follows the Experiment Tree Search design from the EvoScientist paper, where the engineer agent iteratively generates executable code, runs experiments, and records structured execution results at each stage.
Experiments fail for two reasons: wrong order and no stopping criteria. Most researchers jump straight to testing their novel method without verifying their baseline setup, then wonder why results don't make sense. Others spend weeks tuning hyperparameters without a budget, hoping the next run will work.
The 4-stage pipeline solves both problems. It enforces a strict order (each stage validates assumptions the next stage depends on) and assigns attempt budgets (forcing systematic thinking over brute-force iteration).
If coming from idea-tournament, your research proposal (Phase 4) provides the experiment plan — datasets, baselines, metrics, and ablation design — that maps directly to Stages 1-4 below.
Before entering the pipeline, load Experimentation Memory (M_E) from prior cycles:
/memory/experiment-memory.md

Each stage follows a generate → execute → record → diagnose → revise loop:
| Stage | Goal | Budget (N_E^s) | Gate Condition |
|---|---|---|---|
| 1. Initial Implementation | Get baseline code running and reproduce known results | ≤20 attempts | Metrics within 2% of reported values (or within reported variance) |
| 2. Hyperparameter Tuning | Optimize config for your setup | ≤12 attempts | Stable config, variance < 5% across 3 runs |
| 3. Proposed Method | Implement & validate novel method | ≤12 attempts | Outperforms tuned baseline on primary metric, consistent across 3 runs |
| 4. Ablation Study | Prove each component's contribution | ≤18 attempts | All claims evidenced with controlled experiments |
Each stage saves artifacts to /experiments/stageN_name/.
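As a minimal sketch, the stage table above can be tracked with a small state object. The names (`Stage`, `Pipeline`, `record_attempt`) are illustrative, not part of any skill API; only the budgets and the strict stage order come from the table.

```python
from dataclasses import dataclass, field

@dataclass
class Stage:
    name: str
    budget: int          # N_E^s, max attempts for this stage
    attempts: int = 0
    gate_passed: bool = False

    def record_attempt(self, passed_gate: bool) -> None:
        if self.attempts >= self.budget:
            raise RuntimeError(f"{self.name}: budget of {self.budget} attempts exhausted")
        self.attempts += 1
        self.gate_passed = passed_gate

@dataclass
class Pipeline:
    stages: list = field(default_factory=lambda: [
        Stage("initial_implementation", 20),
        Stage("hyperparameter_tuning", 12),
        Stage("proposed_method", 12),
        Stage("ablation_study", 18),
    ])

    def current_stage(self) -> Stage:
        # Stage order is strict: the first stage whose gate is not yet
        # passed is the only one allowed to run.
        for s in self.stages:
            if not s.gate_passed:
                return s
        raise RuntimeError("all gates passed; hand off to paper-writing")
```

Exceeding a budget raises instead of silently continuing, which is the point: budget exhaustion is an escalation event (to evo-memory IVE), not a soft limit.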
Within every stage, repeat this cycle for each attempt:
On a failed attempt, load experiment-craft for the 5-step diagnostic flow.

Goal: Find or generate executable baseline code and verify it reproduces published results. This stage corresponds to the paper's "initial implementation" — the engineer agent searches for working code, runs it, and records structured execution results.
Why this matters: If you can't get the baseline running and reproducing known results, every subsequent comparison is meaningless. Initial implementation validates your data pipeline, evaluation code, training infrastructure, and understanding of prior work.
Budget: ≤20 attempts (N_E^1=20). Baselines can be tricky — missing details in papers, version mismatches, unreported preprocessing steps. 20 attempts gives enough room to debug without allowing infinite tinkering.
Gate: Primary metrics within 2% of reported values (or within the reported variance if provided).
Process:
When to load experiment-craft: If attempts 1-5 all fail significantly (>10% gap), switch to the 5-step diagnostic flow to isolate the cause before burning more attempts.
Output: /experiments/stage1_baseline/ containing results, config, and verified baseline code.
See references/stage-protocols.md for detailed initial implementation checklists.
Goal: Find the optimal hyperparameter configuration for YOUR specific setup.
Why this matters: Published hyperparameters are tuned for the authors' setup. Your hardware, data version, framework version, or subtle implementation differences mean their config may not be optimal for you. Tuning now prevents confounding your novel method's results with suboptimal baselines.
Budget: ≤12 attempts. Hyperparameter tuning has diminishing returns. If 12 structured attempts don't find a stable config, the problem is likely deeper than hyperparameters.
Gate: Stable configuration found — variance < 5% across 3 independent runs with different random seeds.
Process:
Priority order for tuning: Learning rate → batch size → loss weights → regularization → architecture-specific params. This order reflects typical sensitivity.
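The priority order and the stability gate can be sketched together. Here `run_experiment` is an assumed callback returning the primary metric for a config and seed, and "variance < 5%" is read as the relative spread of the metric across seeded runs staying under 5% — one plausible interpretation, not a canonical one.

```python
import statistics

# Tune in this order; earlier entries typically dominate sensitivity.
TUNING_ORDER = ["learning_rate", "batch_size", "loss_weights",
                "regularization", "arch_specific"]

def is_stable(run_experiment, config, seeds=(0, 1, 2), max_rel_spread=0.05):
    """Stage 2 gate: metric spread across independent seeded runs
    stays under 5% of the mean."""
    scores = [run_experiment(config, seed=s) for s in seeds]
    mean = statistics.mean(scores)
    if mean == 0:
        return False
    return (max(scores) - min(scores)) / abs(mean) < max_rel_spread
```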
When to load experiment-craft: If results are highly unstable (variance > 20%) across runs, there's likely a training instability issue. Use diagnostic flow.
Output: /experiments/stage2_tuning/ containing tuning logs, final config, and stability verification.
See references/attempt-budget-guide.md for budget rationale and adjustment rules.
Goal: Implement and validate your novel method, demonstrating improvement over the tuned baseline.
Why this matters: This is the core contribution. But because you've verified the baseline (Stage 1) and optimized the config (Stage 2), any improvement you see is genuinely attributable to your method — not to a better-tuned setup or a broken baseline.
Budget: ≤12 attempts. Your method should work within a reasonable number of iterations if the underlying idea is sound. Excessive attempts suggest a fundamental problem, not a tuning issue.
Gate: Outperforms the tuned baseline on the primary metric. The improvement should be consistent across at least 3 runs.
Process:
Integration strategy: Add your method's components one at a time to the working baseline. Each added component should stay within 20% of the baseline's performance — if a single component causes a >20% regression, isolate and debug it before proceeding. Never integrate the full method in one shot.
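The integration strategy above can be sketched as a loop. `run_with` is an assumed callback that runs the baseline plus a list of enabled components and returns the primary metric; the function and its names are hypothetical.

```python
def integrate_incrementally(run_with, components, baseline_score,
                            max_regression=0.20):
    """Add one component at a time; abort if any single addition
    regresses more than 20% below the baseline."""
    enabled = []
    for comp in components:
        score = run_with(enabled + [comp])
        if score < baseline_score * (1 - max_regression):
            raise RuntimeError(
                f"component {comp!r} regressed beyond 20%; "
                "isolate and debug before proceeding")
        enabled.append(comp)
    return enabled
```

The abort-on-regression behavior enforces the rule in the text: a broken component is debugged in isolation, never papered over by integrating the rest of the method on top of it.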
When to load experiment-craft: When your method underperforms the baseline despite correct implementation. The 5-step diagnostic flow will help distinguish between implementation bugs and fundamental issues.
Critical decision — failure classification: If the method underperforms the baseline after exhausting the attempt budget, hand off to evo-memory for IVE (Idea Validation Evolution) — this is evo-memory's job, not this skill's. IVE triggers under two conditions:
The evo-memory skill will classify the failure as:
Output: /experiments/stage3_method/ containing method code, results, comparison with baseline.
Goal: Prove that each component of your method contributes meaningfully to the final result.
Why this matters: Reviewers will ask "is component X really necessary?" for every part of your method. Without ablation, you can't answer. More importantly, ablation helps YOU understand why your method works — sometimes components you thought were important aren't, and vice versa.
Budget: ≤18 attempts. Ablation requires multiple controlled experiments — one per component being ablated, plus interaction effects. 18 attempts covers a method with 4-5 components.
Gate: Every claimed contribution is supported by a controlled experiment showing its effect.
Process:
Three ablation designs:
When to load experiment-craft: If ablation results contradict your hypothesis (removing a component improves results), use diagnostic flow to understand why.
Output: /experiments/stage4_ablation/ containing ablation results table, per-component analysis.
See references/stage-protocols.md for detailed ablation design patterns.
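One common ablation design, leave-one-out, can be sketched as rerunning the full method with each component disabled in turn. `run_with` is again an assumed callback over a set of enabled components; the real design patterns live in references/stage-protocols.md.

```python
def leave_one_out(run_with, components):
    """Return the full-method score and, per component, the metric
    drop caused by removing it (positive delta = real contribution)."""
    full = run_with(set(components))
    deltas = {}
    for comp in components:
        ablated = run_with(set(components) - {comp})
        deltas[comp] = full - ablated
    return full, deltas
```

A near-zero or negative delta is exactly the "component you thought was important isn't" case: route it through the diagnostic flow before claiming a contribution.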
When a stage attempt fails, refer to the experiment-craft skill for structured diagnosis:
Trigger points: After any failed attempt in any stage. Especially important:
Every attempt across all stages should be logged in a structured format that captures not just WHAT you did but WHY and WHAT YOU LEARNED. These logs feed into evo-memory's Experiment Strategy Evolution (ESE) mechanism.
For each attempt, record:
See references/code-trajectory-logging.md for the full logging format and how logs feed into evo-memory.
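As an illustrative shape only, an attempt log entry might look like the following. The field names are assumptions modeled on the WHAT / WHY / WHAT-YOU-LEARNED framing above; the canonical format is defined in references/code-trajectory-logging.md.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class AttemptLog:
    stage: int              # 1-4
    attempt: int            # index within the stage budget
    change: str             # WHAT you did
    hypothesis: str         # WHY you expected it to help
    result: str             # observed metrics or failure mode
    analysis: str           # WHAT YOU LEARNED (feeds evo-memory ESE)
    reusable: bool = False  # tag strategies worth extracting

    def to_json(self) -> str:
        return json.dumps(asdict(self))
```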
Prioritize these rules during experiment execution:
Initial implementation is not wasted time: It validates your entire infrastructure — data pipeline, evaluation code, training setup. Skipping it means every subsequent result is built on unverified ground. Most "method doesn't work" bugs are actually baseline setup bugs.
Budget limits prevent rabbit holes: Fixed attempt budgets force you to think systematically. When you know you have 12 attempts, you design each one to maximize information. Without limits, attempt #47 is rarely more informative than attempt #12 — it's just more desperate.
Stage order is non-negotiable: Each stage validates assumptions the next depends on. Skipping Stage 1 means Stage 3 results could be wrong due to a broken baseline. Skipping Stage 2 means Stage 3 improvements might just be better hyperparameters, not a better method. There are no shortcuts.
Ablation is not optional cleanup: It's the primary evidence that your method works for the right reasons. A method that outperforms the baseline but has no ablation is a method you don't understand. Reviewers know this.
Failed attempts are data, not waste: Each failed attempt narrows the search space and reveals something about the problem. Log failures carefully — they feed into evo-memory and prevent future researchers from repeating the same mistakes.
Early termination is a feature: Stopping before budget exhaustion is smart, not lazy. If the gate is clearly unachievable after systematic attempts, escalate to evo-memory IVE rather than burning remaining budget on increasingly random variations.
When all four stages are complete, pass these artifacts to paper-writing:
| Artifact | Source Stage | Used By |
|---|---|---|
| Initial implementation results | Stage 1 | Comparison tables, setup verification |
| Optimal hyperparameter config | Stage 2 | Reproducibility section |
| Method vs baseline comparison | Stage 3 | Main results table |
| Ablation study results | Stage 4 | Ablation table, contribution claims |
| Code trajectory logs (all stages) | All stages | Method section details, supplementary |
| Implementation details and tricks | Stages 1-3 | Method section, reproducibility (captured in trajectory log Analysis fields and [Reusable] tags) |
Also pass results to evo-memory for evolution updates:
Refer to the evo-memory skill to read Experimentation Memory:
→ Read M_E at /memory/experiment-memory.md
Refer to the experiment-craft skill for 5-step diagnostic: → Run diagnosis → Return to pipeline
Refer to the evo-memory skill for failure classification: → Run IVE protocol
Refer to the evo-memory skill for strategy extraction: → Run ESE protocol with trajectory logs
Refer to the paper-writing skill: → Pass all stage artifacts
| Topic | Reference File | When to Use |
|---|---|---|
| Per-stage checklists and patterns | stage-protocols.md | Detailed guidance for each stage |
| Budget rationale and adjustment | attempt-budget-guide.md | When budgets feel too tight or too loose |
| Code trajectory logging format | code-trajectory-logging.md | Recording attempts for evo-memory |
| Stage log template | stage-log-template.md | Logging a single stage's progress |
| Pipeline tracker template | pipeline-tracker-template.md | Tracking the full 4-stage pipeline |