Enforce the replication-protocol.md rule by cross-checking numeric claims in a manuscript against the actual R / Stata / Python outputs. Report PASS/FAIL per claim against tolerance thresholds. Use before submission and before releasing a replication package.
Compare numeric claims in a manuscript (point estimates, standard errors, p-values, counts) against the actual outputs produced by the analysis pipeline. Report PASS / FAIL per claim against the tolerance thresholds defined in .claude/rules/replication-protocol.md.
Core principle: If the paper says ATT = -1.632 (0.584) and the code produces -1.628 (0.591), we verify — numerically — that the difference is within the documented tolerance. No more "looks close enough" eyeballing.
Pair with a `/commit` pre-commit invocation on manuscript + analysis changes.

**Arguments:**

- `$0` — path to the manuscript (`.tex`, `.qmd`, `.md`, `.pdf`). Required.
- `$1` — path to the outputs directory. Defaults to `scripts/R/_outputs/`. Can be `_targets/objects/`, a Stata `.do`-file log directory, etc.

**Before auditing:**

- Read `replication-protocol.md` for the tolerance thresholds currently in effect.
- Rerun the analysis pipeline (e.g., `Rscript scripts/R/00_run_all.R`) so outputs are fresh.
- Confirm that `sessionInfo.txt` or an equivalent environment capture exists in the outputs dir.

Parse the manuscript for numeric claims. Patterns to match:
- Coefficient–SE pairs: `ATT = -1.632 (0.584)`, `$\beta = 0.342$ (0.091)`
- Starred significance: `\hat{\tau} = 1.28**`; `& -1.632$^{***}$ & 0.584 &` in LaTeX table environments
- Sample sizes and counts: "our sample of 2,847 firms", `$N = 2{,}847$`
- Summary statistics: `mean = 0.423, SD = 0.087`
- P-values: `p < 0.01`, `$p = 0.003$`

Record each claim as a tuple:
```
{
  claim_id: "Table2_col3_ATT",
  location: "Table 2, Column 3, row 'Treatment'",
  kind: "point_estimate" | "standard_error" | "p_value" | "count" | "percentage",
  reported_value: -1.632,
  uncertainty: 0.584,      # only for point estimates
  significance_stars: 3,   # 0-3 or None
  raw_context: "the ATT estimate of -1.632 (0.584) indicates..."
}
```
Write the extracted claims to .claude/session-reports/reproducibility_claims_[manuscript-name].json so the user can review the extraction before audit.
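The coefficient-with-SE pattern above can be sketched as a single regex pass. This is a minimal illustration, not the full matcher; the pattern and record fields are simplified assumptions:

```python
import re

# Matches e.g. "ATT = -1.632 (0.584)": a label, an estimate, an SE in parentheses.
COEF_SE = re.compile(
    r"(?P<label>[\w\\{}^]+)\s*=\s*"
    r"(?P<est>-?\d+\.\d+)\s*"
    r"\((?P<se>\d+\.\d+)\)"
)

def extract_claims(text: str) -> list[dict]:
    """Return one claim record per coefficient-SE match, with surrounding context."""
    claims = []
    for m in COEF_SE.finditer(text):
        claims.append({
            "kind": "point_estimate",
            "reported_value": float(m.group("est")),
            "uncertainty": float(m.group("se")),
            "raw_context": text[max(0, m.start() - 40):m.end() + 40],
        })
    return claims

claims = extract_claims("We estimate ATT = -1.632 (0.584) for the full sample.")
# one claim: reported_value -1.632, uncertainty 0.584
```

The real extractor needs more patterns (starred coefficients, LaTeX table cells, counts with `{,}` separators), but each reduces to the same "match, parse floats, keep context" shape.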
Scan $1 for corresponding values. Priority order:
1. `.rds` files — `readRDS(path)$coef[["treatment"]]`-style lookups. Can use `Rscript -e "saveRDS(summary(readRDS(...)), '/tmp/audit.rds')"` to extract.
2. `.tex` tables — parse LaTeX table cells directly; match on column headers + row labels.
3. `.csv` summary files — pandas/readr parse, key-value lookup.
4. `.out` / `.log` files (Stata, regress output) — regex extraction.
5. `.json` — direct key lookup.

Record each extracted result:
```
{
  source: "scripts/R/_outputs/results.rds",
  lookup_key: "fit_main$coefficients['treated']",
  value: -1.628,
  uncertainty: 0.591,
  p_value: 0.005
}
```
Use fuzzy heuristics when exact labels don't match:
- Synonym matching on coefficient labels (e.g., "treatment effect" ~ "ATT" ~ "treated")
- Matching on the `raw_context` field (table number, row label, description)

For every claim, produce a match candidate with a confidence score. Claims below 0.7 confidence get flagged as "UNMATCHED — manual review needed" rather than silently passing.
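One way to produce that confidence score is normalized string similarity with a synonym table layered on top. A sketch — the synonym map and scoring rule are assumptions; only the 0.7 cutoff comes from the protocol text:

```python
from difflib import SequenceMatcher

# Hypothetical alias groups; extend per project.
SYNONYMS = [{"att", "treatment effect", "treated"}]

def match_confidence(claim_label: str, result_label: str) -> float:
    a, b = claim_label.lower().strip(), result_label.lower().strip()
    # Known aliases beat raw string distance.
    for group in SYNONYMS:
        if a in group and b in group:
            return 1.0
    return SequenceMatcher(None, a, b).ratio()

match_confidence("ATT", "treated")           # 1.0 via the synonym table
match_confidence("beta_firm", "beta_firms")  # high ratio, clears the 0.7 bar
match_confidence("ATT", "n_obs")             # low, flagged as UNMATCHED
```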
For each matched claim, apply the thresholds from replication-protocol.md:
| Kind | Tolerance | Example |
|---|---|---|
| Integers (N, counts) | Exact | 2,847 must equal 2,847 |
| Point estimates | abs(reported - computed) < 0.01 | -1.632 vs -1.628 → diff = 0.004 → PASS |
| Standard errors | abs(reported - computed) < 0.05 | 0.584 vs 0.591 → diff = 0.007 → PASS |
| P-values | Same significance level | p<0.01 and p<0.01 → PASS; p<0.01 and p=0.03 → FAIL |
| Percentages | ±0.1pp | 42.3% vs 42.35% → PASS |
Respect any tolerance overrides the user has written into their replication-protocol.md fork (they may loosen for MC noise or tighten for administrative data).
Write `.claude/session-reports/reproducibility_audit_[manuscript-name].md`:

```markdown
# Reproducibility Audit: [Manuscript Title]

**Date:** [YYYY-MM-DD]
**Manuscript:** [path]
**Outputs directory:** [path]
**Tolerance source:** .claude/rules/replication-protocol.md

## Summary

| Status | Count |
|---|---|
| PASS | N |
| FAIL (diff > tolerance) | M |
| UNMATCHED (manual review) | K |
| **Overall verdict** | **PASS / FAIL** |

## PASS (all within tolerance)

| Claim | Reported | Computed | Diff | Tolerance |
|---|---|---|---|---|
| Table2_col3_ATT | -1.632 (0.584) | -1.628 (0.591) | 0.004 / 0.007 | 0.01 / 0.05 |

## FAIL (outside tolerance — BLOCKER)

| Claim | Reported | Computed | Diff | Tolerance | Location in paper |
|---|---|---|---|---|---|

## UNMATCHED (manual review)

| Claim | Raw context | Candidate sources |
|---|---|---|

## Environment

[sessionInfo excerpt]

## Next steps

1. Fix any FAIL rows — either update the manuscript or rerun analysis.
2. Review UNMATCHED rows — add explicit lookup keys or widen the search scope.
3. After zero FAILs, the paper is replication-ready.
```
**Related:**

- `/commit` pre-commit gate — see `replication-protocol.md` for the enforcement pattern.
- `.claude/rules/replication-protocol.md` — the tolerance contract.
- `.claude/skills/review-r/SKILL.md` — catches code-style issues; this skill catches NUMERICAL reproducibility.
- `.claude/skills/review-paper/SKILL.md` — content review; pair with this skill for a full pre-submission audit.

**Scope notes:**

- This skill verifies that -1.632 is reproducible. Whether -1.632 is the RIGHT estimand is a review-paper / domain-reviewer question.
- The `sessionInfo.txt` capture lets a reviewer see the env; pinning versions is on the user (via `renv.lock` or a `DESCRIPTION` file).