Domain-specific statistical modeling guidance for cognitive science and neuroscience, encoding when and how to apply mixed models, correction methods, Bayesian approaches, and effect size reporting
This skill encodes domain-specific statistical knowledge for cognitive science and neuroscience research. It addresses the modeling decisions, correction strategies, and reporting conventions that a general-purpose statistician or programmer would get wrong without training in the field. For concrete analysis recipes with code, see references/common-analyses.md.
Before executing the domain-specific steps below, you MUST consult the research-literacy skill for detailed methodology guidance.
This skill was generated by AI from academic literature. All parameters, thresholds, and citations require independent verification before use in research. If you find errors, please open an issue.
Critical domain knowledge: Clark (1973) demonstrated that failing to treat items as random effects inflates Type I error. This remains one of the most common statistical errors in cognitive science. If your stimuli are sampled from a larger population (e.g., words, faces, scenes), you must account for item variability.
```
Are your stimuli sampled from a larger population?
|
+-- YES --> Mixed-effects model with crossed random effects
|           (subjects and items)
|
+-- NO (e.g., fixed set of 4 task conditions) -->
    |
    +-- Any missing data, unbalanced cells, or continuous predictors?
    |   |
    |   +-- YES --> Mixed-effects model (subjects as random effect)
    |   |
    |   +-- NO --> Repeated-measures ANOVA is acceptable
    |
    +-- Need trial-level analysis (e.g., RT distributions)?
        |
        +-- YES --> Mixed-effects model (operates on individual trials)
        +-- NO --> Repeated-measures ANOVA on condition means
```
Barr et al. (2013) recommend fitting the maximal random effects structure justified by the design to minimize Type I error. This means including random intercepts and slopes for all within-unit factors.
For a typical 2x2 design with factors A (within-subjects, within-items) and B (within-subjects, between-items):
```r
# Maximal structure (Barr et al., 2013)
lmer(RT ~ A * B + (1 + A * B | Subject) + (1 + A | Item), data = d)
```
Convergence failures are common with complex random effects structures. Simplify in this order (Barr et al., 2013; Matuschek et al., 2017):

1. Remove correlations among random effects (`||` syntax in lme4)
2. Remove random slopes for interactions
3. Remove random slopes for main effects (last resort; start with control variables, not the effect of interest)

Do NOT simply drop all random slopes to achieve convergence. This inflates Type I error and undermines the purpose of mixed-effects modeling (Barr et al., 2013).
| Design | Random Effects | Rationale |
|---|---|---|
| Lexical decision (words as items) | `(1 + condition \| subj) + (1 + condition \| item)` | Words are sampled from the lexicon; condition varies within subjects and within items |
| Stroop task (fixed conditions) | `(1 + congruency \| subj)` | Congruency levels are fixed, not sampled; no item population to generalize to |
| Picture naming (pictures as items) | `(1 + SOA \| subj) + (1 \| picture)` | Pictures are sampled; random intercept captures item difficulty |
| Multi-site study | `(1 + condition \| subj) + (1 \| site)` | Sites introduce clustering; random site intercept captures between-site variability |
RT data in cognitive experiments are positively skewed, bounded below by physiological limits, and often contaminated by outliers. The choice of trimming and modeling approach therefore affects both Type I error and statistical power.
Apply these criteria before modeling (Ratcliff, 1993; Luce, 1986):
| Criterion | Threshold | Source |
|---|---|---|
| Fast outliers (anticipatory) | < 200 ms | Whelan, 2008; Ratcliff, 1993 |
| Slow absolute cutoff | > 2000-3000 ms (task-dependent) | Ratcliff, 1993 |
| Within-subject SD trimming | > 3 SD from participant's condition mean | Van Selst & Jolicoeur, 1994 |
| Within-subject MAD trimming | > 3 MAD from participant's condition median | Leys et al., 2013 (more robust to skew) |
Task-specific note: For simple RT tasks (e.g., detection), use 100 ms as the fast cutoff (Whelan, 2008). For choice RT tasks (e.g., lexical decision), use 200 ms (Ratcliff, 1993). Always report exclusion rates.
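As an illustration, the trimming criteria above can be combined in a small Python helper (a sketch, not from any package; `trim_rts` and its defaults are illustrative — the 1.4826 constant scales the MAD to be consistent with the SD under normality):

```python
import numpy as np

def trim_rts(rt_ms, fast_cutoff=200.0, slow_cutoff=3000.0, mad_criterion=3.0):
    """Return a boolean mask of RTs to KEEP for one participant in one
    condition: absolute cutoffs first (Ratcliff, 1993; Whelan, 2008),
    then a median +/- k*MAD criterion (Leys et al., 2013)."""
    rt = np.asarray(rt_ms, dtype=float)
    keep = (rt >= fast_cutoff) & (rt <= slow_cutoff)
    med = np.median(rt[keep])
    # Scaled MAD: 1.4826 * median absolute deviation from the median
    mad = 1.4826 * np.median(np.abs(rt[keep] - med))
    keep &= np.abs(rt - med) <= mad_criterion * mad
    return keep
```

Apply it per participant and condition, and report the resulting exclusion rate.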
```
Is your primary interest in RT distributions (not just means)?
|
+-- YES --> Drift Diffusion Model or ex-Gaussian fitting
|
+-- NO --> Choose a modeling approach:
    |
    +-- Option 1: Log-transform RT, then fit LMM (Gaussian)
    |   - Pro: Simple, widely understood
    |   - Con: Back-transformation of means is biased;
    |     changes the hypothesis being tested
    |     (Lo & Andrews, 2015)
    |
    +-- Option 2: Inverse-transform RT (1/RT = speed), then LMM
    |   - Pro: Often achieves better normality than log
    |   - Con: Same back-transformation issues as log
    |     (Ratcliff, 1993)
    |
    +-- Option 3 (Recommended): Generalized LMM with
        Gamma family + identity link
        - Pro: Models RT in original units; handles skew
          directly; avoids transformation issues
          (Lo & Andrews, 2015)
        - Con: Computationally slower; may have convergence
          issues with complex random effects
```
Recommended default: Gamma GLMM with identity link (Lo & Andrews, 2015). Report results on the original millisecond scale.
```r
# Recommended RT model (Lo & Andrews, 2015)
glmer(RT ~ condition * group + (1 + condition | subj) + (1 | item),
      family = Gamma(link = "identity"), data = d)
```
| Scenario | Method | Rationale | Source |
|---|---|---|---|
| Small number of planned contrasts (< 5) | No correction or Holm | Planned contrasts based on a priori hypotheses do not require correction if specified before data collection | Rubin, 2021 |
| All pairwise comparisons after ANOVA | Tukey HSD | Controls family-wise error for all pairwise comparisons; assumes equal variance | Tukey, 1953 |
| Many tests, correlated (e.g., EEG channels) | Cluster-based permutation | Respects spatial/temporal correlation structure | Maris & Oostenveld, 2007 |
| Many tests, independent | Bonferroni-Holm | More powerful than Bonferroni; step-down procedure | Holm, 1979 |
| Large-scale testing (fMRI voxels, genomics) | FDR (Benjamini-Hochberg) | Controls false discovery rate rather than family-wise error; appropriate when some false positives are tolerable | Benjamini & Hochberg, 1995 |
| Exploratory whole-brain fMRI | Cluster-level FWE (with cluster-forming threshold p < 0.001) | Eklund et al. (2016) showed that p < 0.01 cluster-forming threshold inflates false positive rates to ~70% | Eklund et al., 2016 |
| Confirmatory ROI analysis in fMRI | Small volume correction (SVC) with FWE | Restricts search space to a priori ROI | Worsley et al., 1996 |
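For the two generic procedures in the table, the logic is short enough to state in code. A Python sketch (pure numpy; function names are illustrative — in practice use `p.adjust` in R or `statsmodels.stats.multitest.multipletests`):

```python
import numpy as np

def holm(pvals, alpha=0.05):
    """Holm step-down procedure (Holm, 1979): compare the i-th smallest
    p-value against alpha / (m - i + 1); stop at the first failure.
    Returns a boolean rejection mask controlling family-wise error."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    reject = np.zeros(m, dtype=bool)
    for rank, idx in enumerate(np.argsort(p)):
        if p[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # step-down: all larger p-values also fail
    return reject

def benjamini_hochberg(pvals, q=0.05):
    """BH procedure (Benjamini & Hochberg, 1995): reject the k smallest
    p-values, where k is the largest i with p_(i) <= q * i / m.
    Controls the false discovery rate at level q."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    below = p[order] <= q * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])
        reject[order[: k + 1]] = True
    return reject
```

Note that BH is typically less conservative: for the same p-values it can reject tests that Holm retains.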
| BF10 Range | Evidence Category | Source |
|---|---|---|
| < 1/10 | Strong evidence for H0 | Jeffreys, 1961; Lee & Wagenmakers, 2013 |
| 1/10 to 1/3 | Moderate evidence for H0 | Lee & Wagenmakers, 2013 |
| 1/3 to 3 | Anecdotal / inconclusive | Lee & Wagenmakers, 2013 |
| 3 to 10 | Moderate evidence for H1 | Lee & Wagenmakers, 2013 |
| > 10 | Strong evidence for H1 | Lee & Wagenmakers, 2013 |
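For reporting scripts, the table's cut-offs can be encoded in a small lookup (an illustrative helper, not part of any package):

```python
def bf10_category(bf10):
    """Map a BF10 value onto the Lee & Wagenmakers (2013) evidence
    categories from the table above."""
    if bf10 < 1 / 10:
        return "strong evidence for H0"
    if bf10 < 1 / 3:
        return "moderate evidence for H0"
    if bf10 <= 3:
        return "anecdotal / inconclusive"
    if bf10 <= 10:
        return "moderate evidence for H1"
    return "strong evidence for H1"
```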
| Tool | Use Case | Language |
|---|---|---|
| BayesFactor | Standard designs (t-test, ANOVA, correlation, regression) | R |
| brms | Complex models (multilevel, non-Gaussian, multivariate) | R (Stan backend) |
| JASP | GUI-based Bayesian analysis for standard tests | Standalone |
| PyMC | Custom Bayesian models | Python |
Report the exact BF, not just the category (Wagenmakers et al., 2018):
"A Bayesian paired-samples t-test indicated moderate evidence for a difference between conditions, BF10 = 5.3 (default Cauchy prior, r = 0.707)."
Always specify: the prior distribution and its parameters (e.g., the Cauchy scale r), whether the Bayes factor is reported as BF10 or BF01, and the software and version used.
APA 7th edition (2020, Section 6.6) requires reporting effect sizes for all primary analyses. The specific measure depends on the test:
| Test | Effect Size | Interpretation Benchmarks | Source |
|---|---|---|---|
| t-test (between groups) | Cohen's d | 0.2 small, 0.5 medium, 0.8 large | Cohen, 1988 |
| t-test (within subjects) | Cohen's d_z or d_av | d_z uses SD of difference scores | Lakens, 2013 |
| One-way ANOVA | eta-squared or omega-squared | 0.01 small, 0.06 medium, 0.14 large | Cohen, 1988 |
| Factorial ANOVA | partial eta-squared | 0.01 small, 0.06 medium, 0.14 large | Cohen, 1988; Richardson, 2011 |
| Mixed-effects model | semi-partial R-squared | No universal benchmarks; report CI | Rights & Sterba, 2019 |
| Correlation | r | 0.1 small, 0.3 medium, 0.5 large | Cohen, 1988 |
| Chi-square | Cramer's V or phi | Depends on df | Cohen, 1988 |
Domain note: Always report confidence intervals around effect sizes (APA 7th, 2020). Use `effectsize` (R) or `statsmodels` (Python) for computation. The benchmarks above are Cohen's generic guidelines; paradigm-specific benchmarks are more informative (see ../cogsci-power-analysis/references/effect-sizes.md).
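For the two t-test rows of the table, the formulas are simple enough to verify by hand. A Python sketch (illustrative function names; in practice use the packages above):

```python
import numpy as np

def cohens_d(x, y):
    """Between-groups Cohen's d: mean difference over the pooled SD
    (Cohen, 1988)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    nx, ny = x.size, y.size
    pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    return (x.mean() - y.mean()) / np.sqrt(pooled_var)

def cohens_dz(x, y):
    """Within-subjects d_z: mean of the paired difference scores over
    their SD (Lakens, 2013). x and y are paired measurements."""
    diff = np.asarray(x, float) - np.asarray(y, float)
    return diff.mean() / diff.std(ddof=1)
```

Because d_z divides by the SD of difference scores rather than the raw SD, it is not directly comparable to between-groups d (Lakens, 2013).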
Traditional effect sizes are not straightforward for mixed models. Options:

- Semi-partial R-squared via the `r2glmm` or `effectsize` package (Rights & Sterba, 2019)
- Marginal and conditional R-squared via `MuMIn::r.squaredGLMM()` (Nakagawa & Schielzeth, 2013)

Problem: Analyzing condition means averaged over items, ignoring item variability, fails to generalize beyond the specific stimuli used (Clark, 1973).
Fix: Use mixed-effects models with crossed random effects for subjects and items.
Problem: Selecting voxels/channels/time-windows based on the effect of interest, then testing that same effect (Kriegeskorte et al., 2009). Inflates effect sizes by 2x or more (Vul et al., 2009).
Fix: Use independent localizer, leave-one-out cross-validation, or whole-brain corrected analysis.
Problem: ANOVA on proportion correct violates normality and homogeneity assumptions, especially at ceiling (> 90%) or floor (< 10%) (Jaeger, 2008; Dixon, 2008).
Fix: Use logistic mixed-effects model on binary (correct/incorrect) trial-level data.
Problem: Removing "outlier" participants based on the dependent variable (e.g., excluding subjects whose effects go in the wrong direction) without a priori criteria.
Fix: Define exclusion criteria before data collection. Base exclusions on performance metrics (accuracy below chance, excessive RTs), not on the effect of interest.
Problem: ANOVA on raw RT means violates normality. Condition means conceal distributional differences (Ratcliff, 1993).
Fix: Use Gamma GLMM (Lo & Andrews, 2015) or transform RTs, and supplement with distributional analysis if warranted.
Problem: Cluster-based inference with cluster-forming thresholds more lenient than p < 0.001 (uncorrected) produces unacceptable false positive rates up to 70% (Eklund et al., 2016).
Fix: Use voxel-level threshold of p < 0.001 (uncorrected) as minimum cluster-forming threshold, or use voxel-level FWE/FDR correction.
Problem: A "significant" correlation of r = 0.30 with N = 50 has a 95% CI of [0.02, 0.53] -- the true effect could be near zero (Cumming, 2014).
Fix: Always report bootstrap 95% CI for correlations. Use 10000 bootstrap samples (Efron & Tibshirani, 1993).
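The percentile bootstrap for a correlation resamples participants (x, y pairs together) with replacement. A minimal Python sketch, assuming `numpy`; `bootstrap_corr_ci` is a hypothetical helper:

```python
import numpy as np

def bootstrap_corr_ci(x, y, n_boot=10000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for Pearson's r (Efron & Tibshirani, 1993).
    Resamples (x, y) pairs jointly, preserving their dependence."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    rng = np.random.default_rng(seed)
    n = x.size
    rs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, n)  # resample participants with replacement
        rs[b] = np.corrcoef(x[idx], y[idx])[0, 1]
    lo, hi = np.quantile(rs, [alpha / 2, 1 - alpha / 2])
    return lo, hi
```

In practice use `boot::boot.ci` (R) or `scipy.stats.bootstrap` (Python), which also offer bias-corrected (BCa) intervals.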
Reporting conventions in this skill follow APA 7th edition (2020) and Appelbaum et al. (2018).
See references/common-analyses.md for concrete analysis recipes with code patterns.