Use when main results pass result-to-claim (`claim_supported = yes` or `partial`) and ablation studies are needed for paper submission. A secondary Codex agent designs the ablations from a reviewer's perspective; the local executor reviews feasibility and implements them.
Systematically design ablation studies that answer the questions reviewers will ask. The reviewer agent leads the design; the local executor checks feasibility and implements the plan.
gpt-5.4 - Used via a secondary Codex agent for reviewer-style ablation design.

- `/result-to-claim` with `claim_supported = yes` or `partial`
- `/auto-review-loop` identifies missing ablations

Read available project files to build the full picture:

- `docs/research_contract.md`
- `EXPERIMENT_LOG.md`, `EXPERIMENT_TRACKER.md`, or W&B
- `/result-to-claim` output or project notes

spawn_agent:
  model: REVIEWER_MODEL
  reasoning_effort: xhigh
  message: |
    You are a rigorous ML reviewer planning ablation studies.
    Given this method and results, design ablations that:
    1. Isolate the contribution of each novel component
    2. Answer questions reviewers will definitely ask
    3. Test sensitivity to key hyperparameters
    4. Compare against natural alternative design choices
    Method: [description from project files]
    Components: [list of removable or replaceable components]
    Current results: [key metrics from experiments]
    Claims: [what we claim and current evidence]
    For each ablation, specify:
    - name: what to change (for example, "remove module X", "replace Y with Z")
    - what_it_tests: the specific question this answers
    - expected_if_component_matters: what we predict if the component is important
    - priority: 1 (must-run) to 5 (nice-to-have)
    Also provide:
    - coverage_assessment: what reviewer questions these ablations answer
    - unnecessary_ablations: experiments that seem useful but will not add insight
    - suggested_order: run order optimized for maximum early information
    - estimated_compute: total GPU-hours estimate
If delegation is unavailable, generate the same plan locally and mark it [pending external review].
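The per-ablation fields requested in the prompt above can be checked mechanically before the plan is accepted. The following is a minimal sketch, assuming the reviewer's response has already been parsed into plain dictionaries; `AblationSpec`, `parse_ablation`, and `must_run` are hypothetical names introduced for illustration and are not part of any existing tool.

```python
from dataclasses import dataclass

# Field names taken from the reviewer prompt; everything else here is illustrative.
REQUIRED_FIELDS = ("name", "what_it_tests", "expected_if_component_matters", "priority")


@dataclass(frozen=True)
class AblationSpec:
    """One ablation as requested in the reviewer prompt (hypothetical container)."""
    name: str
    what_it_tests: str
    expected_if_component_matters: str
    priority: int  # 1 (must-run) to 5 (nice-to-have)


def parse_ablation(raw: dict) -> AblationSpec:
    """Build an AblationSpec from a dict, rejecting incomplete or out-of-range entries."""
    missing = [field for field in REQUIRED_FIELDS if field not in raw]
    if missing:
        raise ValueError(f"ablation is missing fields: {missing}")
    priority = int(raw["priority"])
    if not 1 <= priority <= 5:
        raise ValueError(f"priority must be 1..5, got {priority}")
    return AblationSpec(
        name=str(raw["name"]).strip(),
        what_it_tests=str(raw["what_it_tests"]).strip(),
        expected_if_component_matters=str(raw["expected_if_component_matters"]).strip(),
        priority=priority,
    )


def must_run(specs: list[AblationSpec]) -> list[AblationSpec]:
    """Return priority-1 ablations, preserving the order they were given in."""
    return [spec for spec in specs if spec.priority == 1]
```

Rejecting malformed entries at this point keeps a plan generated locally (and marked as pending external review) to the same standard as one returned by the reviewer agent.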
Normalize the response into a structured format:
## Ablation Plan
### Component Ablations (highest priority)
| # | Name | What It Tests | Expected If Matters | Priority |
|---|------|---------------|---------------------|----------|
| 1 | remove module X | contribution of X | performance drops on metric Y | 1 |
| 2 | replace X with simpler Z | value of learned vs fixed | drops, especially on dataset A | 2 |
### Hyperparameter Sensitivity
| # | Parameter | Values to Test | What It Tests | Priority |
|---|-----------|----------------|---------------|----------|
| 3 | lambda | [0.01, 0.1, 1.0] | sensitivity to regularization | 3 |
### Design Choice Comparisons
| # | Name | What It Tests | Priority |
|---|------|---------------|----------|
| 4 | joint vs separate matching | whether joint adds value | 4 |
### Coverage Assessment
[What reviewer questions these ablations answer]
### Unnecessary Ablations
[Experiments that seem useful but will not add insight - skip these]
### Run Order
[Optimized for maximum early information]
### Estimated Compute
[Total GPU-hours]
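As an illustration of the normalization step, the Component Ablations table in the template above could be generated from parsed entries rather than written by hand. This is a sketch under the assumption that each entry is a dictionary using the field names from the reviewer prompt; `render_component_table` is a hypothetical helper.

```python
def render_component_table(ablations: list[dict]) -> str:
    """Render component ablations as a Markdown table, lowest priority number first."""
    header = "| # | Name | What It Tests | Expected If Matters | Priority |"
    divider = "|---|------|---------------|---------------------|----------|"
    rows = [header, divider]
    # sorted() is stable, so entries with equal priority keep their given order.
    ordered = sorted(ablations, key=lambda item: item["priority"])
    for index, item in enumerate(ordered, start=1):
        cells = [
            str(index),
            item["name"],
            item["what_it_tests"],
            item["expected_if_component_matters"],
            str(item["priority"]),
        ]
        # Escape pipes so cell text cannot break the table layout.
        rows.append("| " + " | ".join(cell.replace("|", "\\|") for cell in cells) + " |")
    return "\n".join(rows)
```

Sorting by priority here is a design choice for readability only; the reviewer's `suggested_order` remains the authority on run order.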
Before running anything, check:
- `ablation-no-module-X`
- `EXPERIMENT_LOG.md`
- `findings.md` with insights
- `what_it_tests` and `expected_if_component_matters`. No "just try it" experiments.
- `EXPERIMENT_LOG.md`, including negative results (for example, component removal had no effect).
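Two of the points above lend themselves to a small check: every ablation needs a stated question and prediction, and results are recorded even when removing a component changes nothing. The sketch below assumes ablations are plain dictionaries; the helper names and the log line format are hypothetical, since the actual layout of `EXPERIMENT_LOG.md` is not specified here.

```python
def unjustified(plan: list[dict]) -> list[str]:
    """Return names of ablations lacking a stated question or prediction."""
    required = ("what_it_tests", "expected_if_component_matters")
    return [
        item.get("name", "<unnamed>")
        for item in plan
        if any(not str(item.get(field, "")).strip() for field in required)
    ]


def log_entry(name: str, metric: str, full: float, ablated: float) -> str:
    """Format one Markdown bullet for a results log (assumed format).

    The delta is always written out, so a zero-effect (negative) result
    is recorded just like a large drop.
    """
    delta = ablated - full
    return f"- {name}: {metric} {full:.3f} -> {ablated:.3f} (delta {delta:+.3f})"
```

An empty list from `unjustified` would be the signal that no "just try it" experiments remain in the plan.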