# Statistical Review Protocol

Protocol for evaluating the statistical quality of a manuscript: test selection, assumption verification, multiple comparison correction, effect size reporting, and p-hacking detection.
## Test Selection

Review each statistical test used against the following criteria:
| Question | What to Check |
|---|---|
| Does the test match the outcome variable type? | Continuous outcome with t-test/ANOVA, categorical with chi-square/Fisher's, time-to-event with log-rank/Cox |
| Does the test match the number of groups? | Two groups: t-test/Mann-Whitney. Three or more: ANOVA/Kruskal-Wallis |
| Is the test appropriate for the data structure? | Paired data requires paired tests. Clustered data requires mixed models or GEE. Repeated measures requires RM-ANOVA or mixed models |
| Is a parametric test justified? | Check if normality was assessed and reported |
| Is the test two-sided? | One-sided tests require strong a priori justification |
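The decision rules in the table above can be sketched as a small helper. This is an illustrative mapping for simple, unclustered designs only (the function name and signature are hypothetical, not part of the protocol):

```python
def suggest_test(outcome: str, n_groups: int, parametric: bool = True,
                 paired: bool = False) -> str:
    """Suggest a default test for a simple, unclustered design.

    Illustrative sketch of the selection table; clustered or repeated-measures
    designs need mixed models / GEE and are deliberately not covered here.
    """
    if outcome == "time-to-event":
        return "log-rank / Cox regression"
    if outcome == "categorical":
        return "chi-square or Fisher's exact"
    if outcome == "continuous":
        if n_groups == 2:
            if paired:
                return "paired t-test" if parametric else "Wilcoxon signed-rank"
            return "t-test" if parametric else "Mann-Whitney U"
        if n_groups >= 3:
            return "ANOVA" if parametric else "Kruskal-Wallis"
    raise ValueError("design not covered by this sketch")

print(suggest_test("continuous", 2, parametric=False))  # Mann-Whitney U
```

The point of the sketch is that test choice is a deterministic function of outcome type, group count, pairing, and parametric justification; a reviewer can walk the same branches by hand.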
Common errors:

- Unpaired tests applied to paired or matched data
- Multiple pairwise t-tests where ANOVA with post-hoc comparisons is appropriate
- Parametric tests on small, clearly skewed samples with no normality check reported
- Ordinal data (e.g., Likert scales) analyzed as continuous without justification
- One-sided tests without a prespecified rationale
## Assumptions and Multiple Testing

For each statistical test, verify that its assumptions (e.g., normality, equal variances, independence, proportional hazards) were assessed and reported. Where multiple hypotheses are tested, check that the correction matches the scenario:
| Scenario | Expected Action |
|---|---|
| Multiple primary outcomes | Alpha adjustment (Bonferroni, Holm) or clearly designated single primary outcome |
| Multiple pairwise comparisons after ANOVA | Post-hoc test (Tukey, Dunnett, Games-Howell) |
| Subgroup analyses | Interaction test before subgroup comparisons; label as exploratory |
| Multiple secondary outcomes | FDR correction (Benjamini-Hochberg) or state as exploratory |
| Correlation matrices | FDR correction for number of comparisons |
Red flags:

- Many reported comparisons with no correction mentioned anywhere
- A "primary outcome" designated only after the results were known
- Subgroup results highlighted without an interaction test
- Corrections applied selectively to some outcomes but not others
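As an illustration of the alpha adjustments in the table above, a minimal Holm step-down correction can be implemented in a few lines (the function name is a hypothetical helper, not a library API):

```python
def holm_adjust(pvalues):
    """Holm step-down adjusted p-values (controls family-wise error rate).

    The smallest p-value is multiplied by m, the next by m - 1, and so on;
    adjusted values are forced to be monotone nondecreasing and capped at 1.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        adj = min(1.0, (m - rank) * pvalues[idx])
        running_max = max(running_max, adj)  # enforce monotonicity
        adjusted[idx] = running_max
    return adjusted

print(holm_adjust([0.01, 0.04, 0.03]))
```

Holm is uniformly more powerful than plain Bonferroni while controlling the same family-wise error rate, which is why it is an acceptable alternative in the table.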
## Effect Size and CI Reporting

Every comparison or association should report an effect size alongside the p-value.
| Analysis Type | Expected Effect Size | Interpretation Aid |
|---|---|---|
| Two-group comparison (continuous) | Cohen's d or mean difference with CI | Small: 0.2, Medium: 0.5, Large: 0.8 |
| ANOVA | Eta-squared or partial eta-squared | Small: 0.01, Medium: 0.06, Large: 0.14 |
| Correlation | r or rho (already an effect size) | Small: 0.1, Medium: 0.3, Large: 0.5 |
| Chi-square | Cramer's V or phi | Depends on df |
| Logistic regression | Odds ratio with CI | Meaningful thresholds context-dependent |
| Cox regression | Hazard ratio with CI | Meaningful thresholds context-dependent |
| Linear regression | R-squared, standardized beta | Report adjusted R-squared for model |
Red flags:

- P-values reported with no accompanying effect size or confidence interval
- Statistically significant but trivially small effects interpreted as clinically meaningful
- Standardized effect sizes given without the raw differences needed to judge practical relevance
- Confidence intervals omitted for the primary outcome
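As a sketch of the most common entry in the effect size table, Cohen's d for two independent groups can be computed with a pooled standard deviation (the helper name is illustrative):

```python
from math import sqrt

def cohens_d(a, b):
    """Cohen's d for two independent groups, using the pooled SD.

    d = (mean_a - mean_b) / s_pooled, where s_pooled combines the two
    sample variances weighted by their degrees of freedom.
    """
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)  # sample variance of a
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)  # sample variance of b
    pooled_sd = sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (ma - mb) / pooled_sd
```

With the conventional anchors in the table (0.2 / 0.5 / 0.8), a reviewer can quickly check whether a reported d is plausible given the group means and SDs in a manuscript's descriptive tables.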
## P-Hacking Detection

P-hacking is the practice of manipulating data collection or analysis until nonsignificant results become significant. Watch for:

- A cluster of reported p-values just below 0.05
- Outcomes, endpoints, or analyses that differ from the protocol or registration
- Sample size that was not prespecified, or data collection that stopped once significance was reached
- Covariates added or removed without justification
- Outliers excluded without a prespecified rule
- Subgroup analyses reported only when significant
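One crude screening heuristic, checking whether a manuscript's significant p-values cluster just under the 0.05 threshold, can be sketched as follows (the function name and window are illustrative assumptions; this is a flag for closer reading, not proof of p-hacking):

```python
def near_threshold_fraction(pvalues, low=0.04, high=0.05):
    """Fraction of significant p-values that fall in [low, high).

    A large fraction of significant results sitting just under 0.05 is one
    crude p-hacking indicator. Returns 0.0 when nothing is significant.
    """
    significant = [p for p in pvalues if p < high]
    if not significant:
        return 0.0
    return sum(low <= p < high for p in significant) / len(significant)
```

For example, if four of a paper's p-values are below 0.05 and two of those sit between 0.04 and 0.05, the function returns 0.5; whether that warrants concern depends on how many tests were run and how they were selected for reporting.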
Summarize the review using the following template:

## Statistical Review Report
### Overall Statistical Quality: [Adequate / Needs Improvement / Serious Concerns]
### Test Appropriateness
| Analysis | Test Used | Appropriate? | Issue (if any) |
|----------|----------|-------------|----------------|
| [description] | [test] | Yes/No | [issue] |
### Assumption Verification
- [Finding 1]
- [Finding 2]
### Effect Size and CI Reporting
- [Finding]
### Multiple Testing
- [Finding]
### Sample Size Adequacy
- [Finding]
### Missing Data
- [Finding]
### P-Hacking Indicators
- [None detected / Concerns identified: ...]
### Recommendations
1. **Critical:** [...]
2. **Major:** [...]
3. **Minor:** [...]