Review code and data supplements of scientific papers for computational reproducibility. Use when asked to: review a code supplement, check if a paper's code is reproducible, audit a simulation study, evaluate a scientific paper's data and code, or assess computational reproducibility. Actively executes code (fixing minor issues), runs reduced simulations, and compares outputs against reported results. Outputs a structured markdown review document.
Review code and data supplements for computational reproducibility. Actively execute code, fix minor issues, run reduced simulations, and verify results match the paper.
Start by reading the paper PDF (and supplement PDF if provided):
Assess the code supplement:
- List all files and folder structure
- Identify README and documentation
- Locate main/master scripts vs. helper functions
- Identify data files, code files, output files
- Note programming languages used
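The assessment steps above can be sketched as a small inventory helper; the extension-to-category mapping here is an assumption to adapt per supplement:

```python
from pathlib import Path

# Rough extension-to-category map -- adjust to the languages the supplement uses.
CATEGORIES = {
    ".R": "code", ".py": "code", ".jl": "code",
    ".csv": "data", ".rds": "data", ".parquet": "data",
    ".md": "docs", ".txt": "docs", ".pdf": "docs",
}

def inventory(root):
    """Walk the supplement and bucket files by category for the review notes."""
    buckets = {}
    for path in Path(root).rglob("*"):
        if path.is_file():
            cat = CATEGORIES.get(path.suffix, "other")
            buckets.setdefault(cat, []).append(str(path.relative_to(root)))
    return buckets
```

Running inventory("supplement/") gives a categorized file listing to paste into the review and makes missing README or data files immediately visible.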
If the paper or supplement references external code repositories (GitHub, GitLab, Bitbucket, Zenodo, etc.), attempt to clone or download them before proceeding:
- Git repositories: git clone into a subdirectory of the working folder (e.g., external/repo-name/)
- R packages: remotes::install_github() or pak::pak()
- Python packages: uv pip install git+https://...

Rationale: Many papers split code across the supplement and external repositories. The supplement is the primary unit of review, but easily accessible external code (public repos, archived releases) should be fetched and included in the review scope. Only flag "missing code" for components that are truly inaccessible.
Document: which external dependencies were fetched, from where, and which version/commit.
Install dependencies flexibly:
- R: install packages named in library()/require() calls
- Python: install from requirements.txt or by scanning import statements. Always use uv.

Run all code, fixing minor issues as needed:
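The run step can be sketched as a small driver that executes every script and collects failures instead of stopping at the first one. The per-language dispatch (Rscript for .R files, the current interpreter for .py) is an assumption; adapt it to the supplement's languages:

```python
import subprocess
import sys
from pathlib import Path

def run_scripts(scripts, timeout=3600):
    """Run each script in order, capturing output; collect failures
    instead of stopping, so one broken script doesn't hide the rest."""
    results = {}
    for script in scripts:
        script = Path(script)
        # Assumed dispatch: Rscript for .R files, current Python otherwise.
        cmd = (["Rscript", str(script)] if script.suffix == ".R"
               else [sys.executable, str(script)])
        try:
            proc = subprocess.run(cmd, capture_output=True, text=True,
                                  timeout=timeout)
            results[script.name] = ("ok" if proc.returncode == 0
                                    else f"failed: {proc.stderr.strip()[:200]}")
        except (subprocess.TimeoutExpired, FileNotFoundError) as err:
            results[script.name] = f"failed: {err}"
    return results
```

The returned dict doubles as a first draft of the review's reproducibility table: one row per script, with status and truncated error text.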
Apply these fixes:
Document every fix — these become review items.
For simulation studies: Invoke the setup-benchmark skill (via the Skill tool) before running reduced simulations. This gives you domain knowledge to evaluate whether the simulation design is sound (well-specified and misspecified DGPs, factorial design, coverage diagnostics, Monte Carlo SEs, practical significance thresholds) and to make informed decisions about what to reduce without destroying the study's ability to support its conclusions.
For computationally intensive code, create reduced versions targeting < 1 hour runtime:
Reduction strategies:
Verify qualitative consistency:
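A toy illustration of a reduced run with a qualitative-consistency check; the DGP and estimators here are hypothetical, not taken from any particular paper:

```python
import random
import statistics

def run_study(n_reps, n_obs=50, seed=1):
    """Toy stand-in for a paper's simulation: compare mean vs. median
    as location estimators under a contaminated normal DGP."""
    rng = random.Random(seed)
    sq_err = {"mean": 0.0, "median": 0.0}
    for _ in range(n_reps):
        # 10% of observations come from a heavy-tailed contaminating component.
        sample = [rng.gauss(0, 1) if rng.random() > 0.1 else rng.gauss(0, 10)
                  for _ in range(n_obs)]
        sq_err["mean"] += statistics.fmean(sample) ** 2
        sq_err["median"] += statistics.median(sample) ** 2
    return {k: (v / n_reps) ** 0.5 for k, v in sq_err.items()}

# Reduced run: e.g. 200 replications instead of a paper's 10,000.
reduced = run_study(n_reps=200)
# Qualitative check: the ranking claimed in the paper (median beats
# mean under contamination) should survive the reduction.
assert reduced["median"] < reduced["mean"]
```

The point is the final assertion: a reduced run cannot reproduce exact numbers, but it should preserve the ordering of methods and the direction of effects the paper reports.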
For unavoidably long computations:
Compare generated outputs to paper:
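The comparison can be sketched as a tolerance check; the table names and numbers below are hypothetical placeholders, and the 5% relative tolerance is an assumption to tighten for full-scale runs:

```python
def compare_to_paper(generated, reported, rel_tol=0.05):
    """Compare regenerated numbers to the paper's reported values.
    For reduced runs a loose relative tolerance is appropriate;
    anything outside it becomes a review item."""
    discrepancies = []
    for key, rep in reported.items():
        gen = generated.get(key)
        if gen is None:
            discrepancies.append((key, "missing from generated output"))
        elif abs(gen - rep) > rel_tol * max(abs(rep), 1e-12):
            discrepancies.append((key, f"generated {gen:.4g} vs reported {rep:.4g}"))
    return discrepancies

# Hypothetical values from a paper's Table 2 vs. a regenerated run.
reported = {"bias_mle": 0.021, "coverage_mle": 0.948}
generated = {"bias_mle": 0.0205, "coverage_mle": 0.951}
```

An empty discrepancy list supports a "reproduced" verdict for that table; a non-empty one is quoted verbatim in the review so the authors can see exactly which cells diverged.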
Output markdown using template in assets/review-template.md.
Severity levels:
Fix before complaining — If fixable in 30 seconds, fix it and note as minor issue.
Verify, don't trust — Run the code. Check outputs. Compare to paper.
Be constructive — Goal is helping authors improve their supplement.
Document thoroughly — Another reviewer should understand exactly what you did.
Qualitative over exact — For reduced runs, patterns and rankings matter more than exact numbers.
- references/checklist.md — Complete checklist for documentation, completeness, organization, quality, reproducibility
- assets/review-template.md — Output template for the review document