Cross-model verification of experiment results. Uses multiple LLMs to independently verify claims, check code correctness, and validate metrics. Use when user says 'verify results', 'cross check', '交叉验证', or wants independent validation of experimental findings.
Use multiple independent LLMs to verify experimental results, catch calculation errors, and validate claims before submission.
**Output:** `VERIFICATION_REPORT.md` in project root
**Model:** gpt-5.4 (via Codex MCP)

Single-model workflows have a blind spot: if Claude makes an error in evaluation code or metric computation, Claude reviewing its own code is unlikely to catch it. Cross-model verification sends the same evidence to an independent model for parallel analysis.
This is particularly important for:

- Metric computations (F1, BLEU, accuracy), where edge cases are easy to get wrong
- Data split integrity and leakage between training and test sets
- Fair comparison with baselines (same data, same preprocessing)
## Step 1: Gather evidence

Gather all materials needed for verification:

- The evaluation script and metric-computation code
- A sample of the test data (the first few entries)
- The reported metrics (the numbers in tables and figures)
- Each claim in the paper/report, paired with its supporting evidence
## Step 2: Independent code review

Send the evaluation code to an external LLM for independent review:
````yaml
mcp__codex__codex:
  config: {"model_reasoning_effort": "high"}
  prompt: |
    Review this evaluation code for correctness. Specifically check:
    1. Is the metric computation correct? (F1, BLEU, accuracy — check edge cases)
    2. Is the model being evaluated on the correct data split?
    3. Are there any data leakage risks? (features from the test set leaking into training)
    4. Are random seeds properly controlled?
    5. Is the comparison with baselines fair? (same data, same preprocessing)

    Evaluation code:
    ```python
    [paste evaluation script]
    ```

    Test data sample (first 5 entries):
    [paste sample]

    Report each check as [VERIFIED], [SUSPICIOUS], or [ERROR].
````
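As an illustration of check 1, here is a minimal hand-rolled F1 (illustrative data, not from any real run) showing the zero-positive edge case an external reviewer should probe:

```python
def f1(y_true, y_pred):
    """F1 with explicit zero-division handling (the check-1 edge case)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    # A naive tp / (tp + fp) raises ZeroDivisionError when the model
    # predicts no positives; define the metric as 0.0 in that case.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(f1([1, 0, 1, 0], [0, 0, 0, 0]))  # 0.0, no positive predictions
print(f1([1, 0, 1, 0], [1, 0, 1, 0]))  # 1.0, perfect predictions
```

An evaluation script that crashes (or silently returns NaN) on such a fold is exactly the kind of error the external reviewer is asked to flag.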
## Step 3: Recompute metrics

Ask Claude to independently recompute key metrics from the raw data, then compare them against the reported numbers.
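A minimal sketch of such a recomputation, assuming a hypothetical JSONL predictions file with `label` and `pred` fields (the path and field names are placeholders, not the project's actual format):

```python
import json

def recompute_accuracy(path, reported, tol=1e-3):
    """Recompute accuracy from raw predictions and compare to the reported value.

    The JSONL layout ({"label": ..., "pred": ...}) is an assumed example format;
    tol absorbs rounding differences between the report and the recomputation.
    """
    with open(path) as f:
        rows = [json.loads(line) for line in f]
    acc = sum(r["label"] == r["pred"] for r in rows) / len(rows)
    return acc, abs(acc - reported) < tol

# Hypothetical usage:
# acc, ok = recompute_accuracy("predictions.jsonl", reported=0.723)
# print(f"recomputed={acc:.3f}, match={ok}")
```

The recomputed values feed directly into the Metric Recomputation table of the final report.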
## Step 4: Verify claims

For each claim in the paper/report, send the claim and its supporting evidence to the external LLM for independent assessment:
```yaml
mcp__codex__codex-reply:
  threadId: [from Step 2]
  config: {"model_reasoning_effort": "high"}
  prompt: |
    Now verify these specific claims against the evidence:

    Claims:
    1. [claim 1] — supported by [Table X / Figure Y]
    2. [claim 2] — supported by [experiment Z]

    For each claim, assess:
    - Is the evidence sufficient?
    - Could there be an alternative explanation?
    - What additional evidence would strengthen/weaken this claim?

    Rate each claim: [SUPPORTED], [WEAKLY SUPPORTED], [UNSUPPORTED], [CONTRADICTED]
```
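Because the prompt fixes the bracketed rating vocabulary, the reply can be tallied mechanically. A small sketch (the reply text below is invented):

```python
import re
from collections import Counter

def tally_ratings(reply: str) -> Counter:
    """Count the bracketed ratings the external model assigned to each claim."""
    found = re.findall(
        r"\[(SUPPORTED|WEAKLY SUPPORTED|UNSUPPORTED|CONTRADICTED)\]", reply
    )
    return Counter(found)

reply = "1. [SUPPORTED] evidence is direct.\n2. [WEAKLY SUPPORTED] single run only."
print(tally_ratings(reply))
```

These counts map onto the Summary line of the report (supported / weakly supported / flagged).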
## Step 5: Write the report

Write `VERIFICATION_REPORT.md`:
```markdown
# Verification Report

## Summary
- Total claims verified: N
- Supported: X | Weakly supported: Y | Flagged: Z

## Code Review
| Check | Status | Notes |
|-------|--------|-------|
| Metric computation | [VERIFIED/SUSPICIOUS/ERROR] | ... |
| Data split integrity | ... | ... |
| Baseline fairness | ... | ... |

## Metric Recomputation
| Metric | Reported | Recomputed | Match |
|--------|----------|------------|-------|
| F1 | 0.723 | 0.721 | ✅ (delta = 0.002, within tolerance) |

## Claim Verification
| Claim | Evidence | External Assessment | Status |
|-------|----------|---------------------|--------|
| ... | ... | ... | [SUPPORTED] |

## Recommendations
- [Any issues that need to be addressed before submission]
```
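Rows of the recomputation table can be generated rather than hand-typed. A sketch, with a tolerance parameter standing in for whatever rounding the project considers acceptable:

```python
def metric_row(name, reported, recomputed, tol=5e-3):
    """One row of the Metric Recomputation table.

    tol is a placeholder tolerance; tighten or loosen it to match how
    aggressively the reported numbers were rounded.
    """
    match = "✅" if abs(reported - recomputed) < tol else "❌"
    return f"| {name} | {reported:.3f} | {recomputed:.3f} | {match} |"

table = "\n".join([
    "| Metric | Reported | Recomputed | Match |",
    "|--------|----------|------------|-------|",
    metric_row("F1", 0.723, 0.721),
])
print(table)
```

Generating the rows from the Step 3 recomputation keeps the report consistent with the numbers actually checked.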