Use when experiments complete to judge what claims the results support, what they do not support, and what evidence is still missing. A secondary Codex reviewer evaluates the results and routes the next action.
Experiments produce numbers; this gate decides what those numbers mean.
Resolve automation defaults in this precedence order:

1. `PROJECT_AUTOMATION.md` in the project root
2. `CLAUDE.md` in the project root

Before invoking the secondary reviewer, read `../shared-references/agent-role-charter.md` and apply the Claim Judge role.
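To make the precedence rule concrete, here is a minimal sketch of the lookup; the helper name and return shape are illustrative, not part of the skill:

```python
from pathlib import Path

def resolve_automation_defaults(project_root: str) -> Path | None:
    """Return the first defaults file found, honoring the precedence order above."""
    for name in ("PROJECT_AUTOMATION.md", "CLAUDE.md"):
        candidate = Path(project_root) / name
        if candidate.is_file():
            return candidate
    return None  # no overrides found; fall back to built-in defaults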
Gather experiment data from whatever sources are available, then assemble the request to the secondary reviewer:
```yaml
spawn_agent:
  model: gpt-5.4
  reasoning_effort: xhigh
  message: |
    RESULT-TO-CLAIM EVALUATION

    I need you to judge whether experimental results support the intended claim.

    Intended claim: [the claim these experiments test]

    Experiments run:
    [list experiments with method, dataset, metrics]

    Results:
    [paste key numbers, comparison deltas, significance]

    Baselines:
    [baseline numbers and sources]

    Known caveats:
    [confounds, limited datasets, missing comparisons]

    Please act as a skeptical senior experimental scientist and journal reviewer.
    Evaluate only what the current evidence supports. Prefer a narrow correct claim
    to a broad weak claim.

    Please evaluate:
    1. claim_supported: yes | partial | no
    2. what_results_support
    3. what_results_dont_support
    4. missing_evidence
    5. suggested_claim_revision
    6. next_experiments_needed
    7. confidence: high | medium | low

    Be honest. Do not inflate claims beyond what the data supports, and do not
    generalize beyond the tested scope.
```
Extract structured fields from the reviewer's response:
- claim_supported: yes | partial | no
- what_results_support: "..."
- what_results_dont_support: "..."
- missing_evidence: "..."
- suggested_claim_revision: "..."
- next_experiments_needed: "..."
- confidence: high | medium | low
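A minimal sketch of the extraction step, assuming the reviewer echoes each field on its own `key: value` line; the regex approach is illustrative, not prescribed:

```python
import re

FIELDS = (
    "claim_supported",
    "what_results_support",
    "what_results_dont_support",
    "missing_evidence",
    "suggested_claim_revision",
    "next_experiments_needed",
    "confidence",
)

def extract_fields(reply: str) -> dict[str, str]:
    """Pull `field: value` lines out of the reviewer's reply, one per expected field."""
    fields = {}
    for name in FIELDS:
        # Tolerate list markers ("- ", "* ") or numbering ("3. ") before the key.
        match = re.search(rf"^[\s\-*\d.]*{name}\s*:\s*(.+)$", reply, re.MULTILINE)
        if match:
            fields[name] = match.group(1).strip()
    return fields
```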
Route on the verdict:

- `no` → record the verdict and suggested claim revision in `findings.md`
- `partial` → record the partial result in `findings.md`, then re-run `/result-to-claim` after the supplementary experiments complete
- `yes` → proceed to `/ablation-planner`

Write the evaluation to `findings.md`, regardless of outcome.
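As a sketch, the routing and the unconditional write to `findings.md` can be expressed directly; the entry format and function names below are hypothetical, not defined by this skill:

```python
def append_findings(path: str, fields: dict[str, str]) -> None:
    """Append the structured evaluation to findings.md (entry format is illustrative)."""
    with open(path, "a", encoding="utf-8") as f:
        f.write("\n## result-to-claim evaluation\n")
        for key, value in fields.items():
            f.write(f"- {key}: {value}\n")

def route(fields: dict[str, str]) -> str | None:
    """Record the evaluation, then return the next skill to invoke, if any."""
    append_findings("findings.md", fields)  # regardless of outcome
    verdict = fields.get("claim_supported", "no")
    if verdict == "yes":
        return "/ablation-planner"
    if verdict == "partial":
        # Re-run this gate once the supplementary experiments complete.
        return "/result-to-claim"
    return None  # "no": the verdict is on record; the claim needs revision first
```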