Name: Review Reproduction
Author: LukasWallrich

Review Reproduction | Skills Pool

.md

keep-md: true

Verdict consistency: The verdict in the Abstract must match the verdict in the Conclusion. Both must be one of the five canonical categories:
- COMPUTATIONALLY REPRODUCED
- QUALITATIVELY REPRODUCED
- QUALITATIVELY REPRODUCED (substantial numeric deviations)
- PARTIALLY REPRODUCED
- NOT REPRODUCED
Check that the chosen verdict is justified by the deviation log. For example:
- "COMPUTATIONALLY REPRODUCED" requires ALL statistics to match within rounding
- "QUALITATIVELY REPRODUCED" requires ALL claims confirmed in direction and significance
- "PARTIALLY REPRODUCED" requires at least one claim NOT confirmed
Claims consistency: The numbered claims in "Paper Overview" must match those in claim_result_mapping.md and must all appear in the claim-by-claim conclusion assessment. Check that no claim is dropped or added between sections.
Numbers in narrative vs. tables: Where the report narrative mentions specific statistics (coefficients, p-values, sample sizes), verify they match the computed tables in the rendered HTML. Flag any discrepancy with the exact values found.
Deviation categories vs. thresholds: Verify that the deviation log categories are correctly assigned per the threshold rules:
- Exact: reproduced value rounds to the same reported value at the precision the paper reports. For example, if the paper reports −0.0005 (4 decimal places), the reproduced value must round to −0.0005 at 4 decimal places. The near-zero absolute threshold (0.005) does NOT override this rounding check — it is meant for values like 0.01 vs 0.02, not for high-precision values that happen to be small.
- Minor deviation: ≤5% relative (or ≤0.01 absolute when |paper value| < 0.2 and the parameter is not statistically significant)
- Substantive deviation: >5% relative (and >0.01 absolute)
- Conclusion change: significance or direction differs
- Near-zero rule with significance override: The near-zero absolute threshold (≤0.01 when |paper value| < 0.2) is designed for noisy parameters near zero, not for parameters that are small due to their units but statistically significant. When a near-zero parameter IS significant in the paper, the 5% relative threshold should apply instead — a 50% decline in a significant coefficient should be flagged as substantive regardless of absolute magnitude. Check that the report passes significant = TRUE to classify_deviation() for such parameters.
- Pay special attention to parameters reported at ≥3 decimal places — these are most prone to misclassification by the near-zero rule.
Deviation log completeness: Check that the deviation log covers all claim-relevant parameters mentioned in the claim mapping — i.e., parameters that directly bear on the abstract claims (main predictors, interactions, moderators, sample sizes, fit statistics). Control variable coefficients do not need deviation log entries unless a claim specifically depends on them. Specifically:
- For each table/model reproduced in the Reproduction Results section, verify its claim-relevant parameters appear in the deviation log. If a table has multiple models (e.g., Table 2 Model 1 and Model 2), ALL models must be in the log.
- Check for claim-relevant parameters that appear in comparison tables in the Reproduction Results but are absent from the deviation log (a common omission).
Claim coverage: For each numbered claim in Paper Overview, count the deviation log entries with that claim number. Flag any claim with zero entries. If a claim was not numerically tested, it must either (a) have a deviation log entry with category "Not tested", or (b) be explicitly noted as untested in the claim-by-claim conclusion. A claim with zero deviation log entries that is described as "Reproduced" in the conclusion is a Critical issue.
Open Materials note: If the Open Materials table says replication materials exist, verify the report documents whether/when they were consulted.
Replication code consultation for substantive deviations: If the deviation log contains any "Substantive deviation" or "Conclusion change" entries AND the Open Materials table indicates that replication code is available, verify that:
- The Discrepancy Investigation section documents that the replication code was actually read and compared against the reproduction code
- Specific analytical differences are identified (e.g., variable construction, sample restrictions, model specification, software)
- If the replication code was NOT consulted despite being available, flag this as a Critical issue: "Replication code is available but was not consulted to investigate substantive deviations. Stage 6 of the reproduction workflow requires consulting authors' replication materials when deviations cannot be explained through alternative specifications alone."

Unhedged causal claims about discrepancies: Explanations for deviations must use hedged language ("one possible explanation is...", "this may be due to...") unless there is direct evidence. Flag any statement that presents a hypothesis as fact. Examples of overclaiming:
- BAD: "The deviation is caused by dataset version differences"
- GOOD: "One possible explanation is dataset version differences"
- EXCEPTION: Direct evidence makes a confident statement acceptable (e.g., "We verified in the dataset changelog that variable X was recoded in version 2.1")
Statistical errors in prose: Watch for common mistakes:
- Confusing significance with effect size
- Claiming reference category changes affect other predictors' coefficients
- Saying "results are significant" without specifying the threshold or test
- Misinterpreting interaction terms
Missing substantive direction of deviations: When deviations are described, the report should state whether they strengthen or weaken the paper's claims. Flag any deviation discussion that omits this.
Overclaiming reproduction success: If there are substantive deviations or conclusion changes, the report should not describe the reproduction as fully successful without qualification.
Underclaiming / excessive hedging: If all values match exactly and all claims are confirmed, the report should not add unnecessary caveats. The review should note if the report is being overly cautious.
Phantom references: Search the report text for references to figures or tables (e.g., "Figure 13", "Table 3") that appear in narrative or conclusion text. Verify that each referenced figure/table actually exists in the Reproduction Results section. Flag any reference to an analysis that was not performed — this is a Critical issue because it presents unverified assertions as confirmed findings.
Conclusion overclaiming check: If the conclusion says "all claims confirmed" or equivalent, verify this is actually true: every claim in the Paper Overview must have deviation log entries, and none can be marked "Not tested" or have zero entries. Flag any mismatch as Critical.

# Review: [Report Title]

**Report**: [path to .qmd file]
**Reviewed**: [date]

## Summary

[1-3 sentence overall assessment: Is the report ready for human review, or does it need revisions first?]

## Critical Issues

[Issues that MUST be fixed before the report can be considered reliable. These are errors that could mislead a reader about whether the reproduction succeeded.]

### [Issue title]
- **Location**: [section/line reference]
- **Problem**: [specific description with quotes from the report]
- **Suggestion**: [how to fix it]

## Structural Issues

[Missing sections, wrong section names, template deviations]

### [Issue title]
- **Location**: [section/line reference]
- **Problem**: [specific description]
- **Suggestion**: [how to fix it]

## Consistency Issues

[Contradictions between sections, numbers that don't match]

### [Issue title]
- **Location**: [section/line reference]
- **Problem**: [specific description with quotes]
- **Suggestion**: [how to fix it]

## Language and Hedging Issues

[Overclaiming, unhedged hypotheses, missing substantive direction]

### [Issue title]
- **Location**: [section/line reference]
- **Problem**: [specific description with quotes]
- **Suggestion**: [how to fix it]

## Minor Issues

[Style, formatting, non-critical improvements]

- [item]
- [item]

## What's Done Well

[Briefly note 2-3 things the report does well — this helps calibrate the review and shows it's not just looking for problems]

Review Reproduction

Skill: Review Reproduction Report

Overview

Inputs

Review Reproduction

Skill: Review Reproduction Report

Overview

Inputs

Workflow

Step 1: Read All Inputs

Step 2: Structural Completeness Check

Step 3: Internal Consistency Check

Step 4: Overclaiming and Language Check

Step 5: Code Quality Check (Light)

Step 6: Produce Review Output

Severity Classification

Important Notes

Goplaces

Research Ops

Editor

Fact Checker

Deep Research

Academic Researcher