Use when experiments complete to judge what claims the results support, what they do not, and what evidence is still missing. A secondary Codex agent evaluates results against intended claims and routes to the next action (pivot, supplement, or confirm). Use after experiments finish - before writing the paper or running ablations.
Experiments produce numbers; this gate decides what those numbers mean. Collect results from available sources, get an objective judgment, then route based on the verdict.
gpt-5.4 - used via a secondary Codex agent for objective claim assessment.

Gather experiment data from whatever sources are available in the project:
- `wandb.Api().run("<entity>/<project>/<run_id>").history()` - metrics, training curves, comparisons
- `EXPERIMENT_LOG.md` - full results table with baselines and verdicts
- `EXPERIMENT_TRACKER.md` - check which experiments are done vs still running
- `ssh server "tail -100 /path/to/training.log"` - if no other source is available
- `docs/research_contract.md` or project notes - intended claims and experiment design

Assemble the key information:
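Assembly can be sketched as a small collector. The file names mirror the sources listed above, but the function name and dict shape are illustrative, and the wandb pull is left as a comment since it needs credentials and a live run id:

```python
from pathlib import Path

# File-based sources from the list above; missing files are skipped, not fatal.
SOURCES = ["EXPERIMENT_LOG.md", "EXPERIMENT_TRACKER.md", "docs/research_contract.md"]

def collect_results(project_dir="."):
    """Gather whatever result sources exist in the project directory."""
    collected = {}
    for name in SOURCES:
        path = Path(project_dir) / name
        if path.exists():
            collected[name] = path.read_text()
    # Metrics would be pulled here when a wandb run id is known, e.g.:
    #   wandb.Api().run("<entity>/<project>/<run_id>").history()
    return collected

print(sorted(collect_results().keys()))  # whichever source files actually exist
```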
Send the collected results to a secondary Codex agent for objective evaluation:
```yaml
spawn_agent:
  model: REVIEWER_MODEL
  reasoning_effort: xhigh
  message: |
    RESULT-TO-CLAIM EVALUATION

    I need you to judge whether experimental results support the intended claim.

    Intended claim: [the claim these experiments test]

    Experiments run:
    [list experiments with method, dataset, metrics]

    Results:
    [paste key numbers, comparison deltas, significance]

    Baselines:
    [baseline numbers and sources - reproduced or from paper]

    Known caveats:
    [any confounding factors, limited datasets, missing comparisons]

    Please evaluate:
    1. claim_supported: yes | partial | no
    2. what_results_support: what the data actually shows
    3. what_results_dont_support: where the data falls short of the claim
    4. missing_evidence: specific evidence gaps
    5. suggested_claim_revision: if the claim should be strengthened, weakened, or reframed
    6. next_experiments_needed: specific experiments to fill gaps (if any)
    7. confidence: high | medium | low

    Be honest. Do not inflate claims beyond what the data supports.
    A single positive result on one dataset does not support a general claim.
```
If delegation is unavailable, run the same evaluation locally and mark the verdict [pending external review] instead of blocking the pipeline.
Extract structured fields from the response:
- claim_supported: yes | partial | no
- what_results_support: "..."
- what_results_dont_support: "..."
- missing_evidence: "..."
- suggested_claim_revision: "..."
- next_experiments_needed: "..."
- confidence: high | medium | low
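Extraction can be sketched as a line-oriented parse. The field names come from the prompt above, but the `field: value` reply format and the regex are assumptions about how the reviewer responds:

```python
import re

# Field names match the numbered list in the evaluation prompt.
FIELDS = ["claim_supported", "what_results_support", "what_results_dont_support",
          "missing_evidence", "suggested_claim_revision",
          "next_experiments_needed", "confidence"]

def parse_review(text):
    """Pull 'field: value' lines out of the reviewer's free-text reply."""
    parsed = {}
    for field in FIELDS:
        match = re.search(rf"{field}\s*:\s*(.+)", text)
        parsed[field] = match.group(1).strip() if match else None
    return parsed

reply = "claim_supported: partial\nconfidence: medium\nmissing_evidence: no second dataset"
print(parse_review(reply)["claim_supported"])  # partial
```

A reply that omits a field yields `None` for it, which downstream routing can treat as inconclusive.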
Route on the verdict:

- no - Claim not supported. Record the analysis in findings.md, then return the idea to IDEA_CANDIDATES.md or try an alternative approach.
- partial - Claim partially supported. Run supplementary experiments and re-run /result-to-claim after they complete. After repeated partial verdicts, record the analysis in findings.md and consider narrowing the claim scope or switching ideas.
- yes - Claim supported. Proceed to /ablation-planner.

In all cases:

- If the verdict is partial, do not round up to yes.
- If confidence is low, treat the judgment as inconclusive and add experiments rather than committing to a claim.
- If delegation was unavailable, mark the verdict [pending external review].
- Record the verdict in findings.md, regardless of outcome.
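The routing rules above can be condensed into a small dispatcher. The action names are illustrative labels, not pipeline commands:

```python
def route(verdict, confidence):
    """Map the reviewer's verdict to the next pipeline action."""
    if confidence == "low":
        return "add-experiments"         # inconclusive: gather more evidence first
    if verdict == "yes":
        return "ablation-planner"        # claim supported: proceed
    if verdict == "partial":
        return "supplement-experiments"  # never round partial up to yes
    return "pivot"                       # 'no': back to IDEA_CANDIDATES.md

assert route("yes", "high") == "ablation-planner"
assert route("partial", "high") == "supplement-experiments"
assert route("yes", "low") == "add-experiments"
```

Note the confidence check comes first: a low-confidence "yes" still routes to more experiments rather than to the ablation planner.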