Expert guidance for designing self-paced reading experiments: region segmentation, timing parameters, comprehension probes, and spillover analysis
This skill encodes expert knowledge for designing self-paced reading (SPR) experiments in psycholinguistics. SPR is the most widely used behavioral method for studying real-time sentence comprehension during reading (Jegerski, 2014). A competent programmer without psycholinguistics training will reliably make errors in region segmentation, spillover design, and comprehension question construction -- all of which invalidate the resulting data.
For detailed region segmentation strategies, see references/region-segmentation.md.
For statistical analysis guidance, see references/analysis-guide.md.
Self-paced reading appears deceptively simple: participants press a button to reveal successive words. But the scientific value of an SPR experiment depends entirely on decisions that require psycholinguistic training:
Before executing the domain-specific steps below, you MUST:
For detailed methodology guidance, see the research-literacy skill.
This skill was generated by AI from academic literature. All parameters, thresholds, and citations require independent verification before use in research. If you find errors, please open an issue.
Choose based on your research question, population, and resources:
references/region-segmentation.md); reduces temporal resolution; risks confounding region length with reading time

| Parameter | Recommended Value | Rationale |
|---|---|---|
| Response timeout | None (self-paced) or 3000-5000 ms per region | No timeout is standard for in-lab SPR; timeout prevents excessively slow responses in web-based studies (Boyce et al., 2020) |
| Inter-stimulus interval (ISI) | 0 ms for non-cumulative moving window | Standard practice; the next word appears immediately when the previous is masked (Just et al., 1982) |
| ISI for phrase-by-phrase | 0 ms (typical) | Any nonzero ISI introduces a blank that disrupts reading and may introduce strategic pausing |
| Pre-sentence fixation | + or * for 500-1000 ms | Orients attention to display location; standard in SPR (Jegerski, 2014) |
| Post-sentence delay | 0-500 ms before comprehension question | Brief delay prevents motor interference between last word button-press and question response |
| Practice trials | 6-10 items minimum | Familiarizes participants with button-press rhythm and comprehension questions; use different sentences than experimental items (Jegerski, 2014; Keating & Jegerski, 2015) |
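
The timing parameters in the table above can be kept in one place and checked before the experiment is programmed. The following is a minimal sketch, not a prescribed implementation: the class name `SprTiming`, its field names, and the default values chosen inside the recommended ranges are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SprTiming:
    """Hypothetical container for the timing parameters in the table."""
    fixation_ms: int = 500             # pre-sentence fixation, 500-1000 ms
    isi_ms: int = 0                    # 0 ms between regions
    timeout_ms: Optional[int] = None   # None means fully self-paced
    post_sentence_delay_ms: int = 250  # 0-500 ms before the question
    n_practice: int = 8                # at least 6-10 practice items

    def problems(self) -> list:
        """Return readable deviations from the recommended ranges."""
        issues = []
        if not 500 <= self.fixation_ms <= 1000:
            issues.append("fixation outside 500-1000 ms")
        if self.isi_ms != 0:
            issues.append("nonzero ISI inserts a blank between regions")
        if self.timeout_ms is not None and not 3000 <= self.timeout_ms <= 5000:
            issues.append("timeout outside 3000-5000 ms")
        if not 0 <= self.post_sentence_delay_ms <= 500:
            issues.append("post-sentence delay outside 0-500 ms")
        if self.n_practice < 6:
            issues.append("fewer than 6 practice trials")
        return issues


# A web-based configuration with a per-region timeout.
web = SprTiming(timeout_ms=4000)
print(web.problems())
```

An empty list means every value sits inside the ranges given in the table; a deliberate deviation (for example, a longer fixation) would show up as a message to justify in the methods section.
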
This is the most consequential design decision in an SPR experiment. See references/region-segmentation.md for full guidelines.
Match critical regions across conditions for word length (in characters) and lexical frequency. If your manipulation requires different words, match them on length (+/- 1 character) and log frequency (use SUBTLEX-US; Brysbaert & New, 2009). Unmatched items introduce confounds that mimic or mask experimental effects.
Include at least 2-3 spillover words after the critical region. Processing difficulty at the critical region reliably spills over to subsequent words in SPR (Just et al., 1982; Mitchell, 2004; Rayner, 1998), so an effect may surface only on the words that follow the critical word. Without spillover regions, you risk missing the effect entirely. These spillover words must be identical across conditions.
Avoid placing critical regions at clause or sentence boundaries. Reading times at clause-final and sentence-final positions are inflated by wrap-up processes -- integration of clause-level meaning, discourse updating, and possibly implicit prosodic boundary effects (Just & Carpenter, 1980; Warren, White, & Reichle, 2009). This inflation is independent of your manipulation and adds noise.
Keep critical regions short (ideally a single word). Multi-word critical regions reduce temporal resolution and introduce length confounds. If you must use a multi-word region, it must have the same number of words and matched total character length across conditions.
Ensure the pre-critical region is identical across conditions. Any difference before the critical word can create baseline differences in reading time that propagate into the critical region via spillover.
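
Several of the segmentation rules above can be checked mechanically for word-by-word items. The sketch below is one possible check, under these assumptions: each condition is a list of words, the critical word sits at the same index in every condition, and trailing punctuation on the critical word is treated as a rough proxy for a clause or sentence boundary. The function name `check_item`, its parameters, and the example sentences are hypothetical. Frequency matching is not covered because it needs an external norm such as SUBTLEX-US.

```python
def check_item(conditions, critical_idx, min_spillover=2, max_len_diff=1):
    """Check one item against the segmentation rules.

    conditions: dict mapping condition label -> list of words
    critical_idx: index of the critical word (same in every condition)
    """
    problems = []
    versions = list(conditions.values())
    first = versions[0]

    # Pre-critical material must be identical across conditions.
    if any(v[:critical_idx] != first[:critical_idx] for v in versions):
        problems.append("pre-critical region differs across conditions")

    # Spillover material must be identical across conditions...
    if any(v[critical_idx + 1:] != first[critical_idx + 1:] for v in versions):
        problems.append("spillover region differs across conditions")
    # ...and long enough in every condition.
    if min(len(v) - critical_idx - 1 for v in versions) < min_spillover:
        problems.append(f"fewer than {min_spillover} spillover words")

    # Heuristic only: punctuation on the critical word suggests a boundary.
    if any(v[critical_idx][-1] in ".,;:!?" for v in versions):
        problems.append("critical word carries punctuation (possible boundary)")

    # Critical words should match in length (characters, punctuation ignored).
    lengths = [len(v[critical_idx].strip(".,;:!?")) for v in versions]
    if max(lengths) - min(lengths) > max_len_diff:
        problems.append("critical words differ in length by more than allowed")

    return problems


item = {
    "plausible": "The chef cooked the pasta for the hungry guests".split(),
    "implausible": "The chef cooked the piano for the hungry guests".split(),
}
print(check_item(item, critical_idx=4))  # no problems expected for this item
```

Passing this check does not make an item well designed; it only catches the mechanical violations. Clause boundaries without punctuation and frequency mismatches still need manual review.
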
Comprehension questions serve two purposes: ensuring participants read for meaning, and providing an exclusion criterion for inattentive participants.
| Parameter | Recommendation | Rationale |
|---|---|---|
| Proportion of trials with questions | 1/3 to 1/2 of all trials (experimental + filler) | Fewer than 1/3: participants may stop reading carefully; more than 1/2: task becomes tedious, and participants may shift to a question-anticipation strategy (Just et al., 1982; Jegerski, 2014) |
| Answer balance | 50% yes / 50% no for yes/no questions | Prevents response bias toward one answer |
| Question content | Target semantic content of the sentence, NOT the critical manipulation | Questions about the manipulation teach participants what you are studying, inducing strategic reading (Jegerski, 2014) |
| Accuracy exclusion threshold | >80% correct to retain participant | Standard criterion; lower accuracy suggests the participant was not reading for comprehension (Jegerski, 2014; common practice across SPR studies) |
| Question timing | Immediately after the sentence (or after the final button press) | Delayed questions test memory, not comprehension |
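
The proportion, balance, and exclusion rules in the table above lend themselves to a quick automated check. The following sketch assumes a trial list in which each trial is a dict with an `answer` key set to `"yes"`, `"no"`, or `None` (no question); that structure and both function names are hypothetical.

```python
def question_plan_problems(trials):
    """Check question proportion (1/3 to 1/2) and yes/no balance."""
    asked = [t["answer"] for t in trials if t["answer"] is not None]
    problems = []
    # Integer comparisons avoid floating-point edge cases at exactly 1/3 or 1/2.
    if 3 * len(asked) < len(trials):
        problems.append("questions follow fewer than 1/3 of trials")
    if 2 * len(asked) > len(trials):
        problems.append("questions follow more than 1/2 of trials")
    if asked.count("yes") != asked.count("no"):
        problems.append("yes/no answers are not balanced")
    return problems


def retain_participant(n_correct, n_questions, threshold=0.80):
    """Apply the accuracy criterion: keep participants scoring above threshold."""
    return n_correct / n_questions > threshold


# 12 trials, 4 with questions (2 yes, 2 no).
trials = (
    [{"answer": "yes"}] * 2 + [{"answer": "no"}] * 2 + [{"answer": None}] * 8
)
print(question_plan_problems(trials))
print(retain_participant(17, 20))
```

The strict comparison mirrors the ">80%" wording in the table; whether a participant at exactly the threshold is retained is a decision to fix before data collection.
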
Suppose the experimental sentence manipulates relative clause attachment:
The maid of the actress who was on the balcony shouted to the crowd.
For within-subjects manipulations, use a Latin square design so that each participant sees each item in exactly one condition, and each condition is seen equally often across participants (Keating & Jegerski, 2015).
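
A Latin square assignment can be generated by rotating conditions across items. The sketch below is one simple way to do it, assuming items are numbered from 0 and that the number of items is a multiple of the number of conditions; the function name `latin_square_lists` is hypothetical.

```python
def latin_square_lists(n_items, conditions):
    """Build one presentation list per condition.

    In list k, item i is shown in condition (i + k) mod n_conditions, so each
    item appears exactly once per list and cycles through every condition
    across lists.
    """
    n = len(conditions)
    if n_items % n != 0:
        raise ValueError("n_items must be a multiple of the number of conditions")
    return [
        [(item, conditions[(item + k) % n]) for item in range(n_items)]
        for k in range(n)
    ]


lists = latin_square_lists(4, ["A", "B"])
print(lists[0])  # [(0, 'A'), (1, 'B'), (2, 'A'), (3, 'B')]
print(lists[1])  # [(0, 'B'), (1, 'A'), (2, 'B'), (3, 'A')]
```

Assigning participants to lists in rotation (participant 1 to list 0, participant 2 to list 1, and so on) keeps the number of participants per list equal, which is what makes each condition equally frequent across participants.
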
| Population | Minimum Items per Condition | Rationale |
|---|---|---|
| L1 speakers, robust effect (e.g., garden-path) | 24 items per condition | Sufficient for medium-to-large effects in mixed models (Keating & Jegerski, 2015) |
| L1 speakers, subtle effect (e.g., pragmatic inference) | 32-40 items per condition | Smaller effects require more items for adequate power (Keating & Jegerski, 2015; Brysbaert & Stevens, 2018) |
| L2 speakers | 32-40 items per condition | Higher variability in L2 populations requires more observations (Marsden, Thompson, & Plonsky, 2018) |
| Parameter | Recommendation | Rationale |
|---|---|---|
| Filler-to-experimental ratio | 2:1 or 3:1 (fillers : experimental items) | Prevents participants from identifying the experimental pattern; higher ratios reduce strategic processing (Keating & Jegerski, 2015) |
| Filler variety | Include multiple sentence types, lengths, and structures | Monotonous fillers fail to mask the experimental manipulation |
| Filler complexity | Include some fillers of similar complexity to experimental items | If only experimental items are complex, participants learn to attend differently to them |
| Comprehension questions on fillers | Yes -- at least the same rate as on experimental items | If questions only follow experimental items, participants learn that complex sentences predict questions |
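
Taken together, the item counts, filler ratio, and question proportions determine how long a session is. The arithmetic can be sketched as follows, assuming a Latin square in which each participant sees every item once; the function name `session_size` and its return keys are hypothetical.

```python
def session_size(items_per_condition, n_conditions, filler_ratio=2):
    """Trials per participant under a Latin square with fillers."""
    experimental = items_per_condition * n_conditions
    fillers = filler_ratio * experimental
    total = experimental + fillers
    return {
        "experimental": experimental,
        "fillers": fillers,
        "total": total,
        # Questions on 1/3 to 1/2 of all trials (rounded inward).
        "questions_min": (total + 2) // 3,
        "questions_max": total // 2,
    }


# Two conditions, 24 items per condition, 2:1 fillers.
print(session_size(24, 2))
# {'experimental': 48, 'fillers': 96, 'total': 144,
#  'questions_min': 48, 'questions_max': 72}
```

A total of 144 sentences is already a long session, so running this calculation early shows whether the design fits a realistic testing time before any materials are written.
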
This is a design-level decision that should be made before programming the experiment.
| Criterion | SPR | Eye-Tracking |
|---|---|---|
| Equipment cost | Low (any computer) | High (dedicated eye-tracker, ~$20,000-$50,000) |
| Online data collection | Yes (web-based SPR and Maze work well) | No (requires in-lab calibration) |
| Temporal resolution | Word-by-word, with substantial spillover | Multiple fixation measures (first fixation, gaze duration, go-past, total time, regressions) |
| Regressions | Not measurable (non-cumulative display prevents rereading) | Yes -- regressions are a primary measure of reanalysis |
| Ecological validity | Moderate (button-press is unnatural, but spatial layout preserved) | Higher (closer to natural reading) |
| Sensitivity to early/late processing stages | Low (only a single RT per region, which blends all processing stages) | High (first-pass vs. second-pass measures separate early from late processing; Rayner, 1998) |
| Best for | Robust syntactic/semantic effects, web-based or underfunded studies, L2 populations without lab access | Nuanced temporal dynamics, distinguishing processing stages, studying regressions, garden-path recovery |
Rule of thumb: If you only need to know whether a manipulation affects reading time, SPR is sufficient. If you need to know when during processing the effect occurs (early lexical access vs. late reanalysis), use eye-tracking.
These are errors that non-specialists routinely make:
No spillover region. The most common fatal flaw. If the sentence ends at or immediately after the critical word, spillover effects have nowhere to appear, and the effect is lost. Always include 2-3 words of identical post-critical material across conditions.
Critical region at a clause boundary. Wrap-up effects at clause-final positions (Just & Carpenter, 1980) inflate reading times by 50-100+ ms regardless of condition, swamping the experimental effect or producing spurious interactions.
Length/frequency mismatch. Longer words take approximately 30-40 ms per additional character in SPR (Ferreira & Clifton, 1986). A 2-character difference between conditions creates a ~60-80 ms confound, which can easily exceed the size of most psycholinguistic effects.
Comprehension questions targeting the manipulation. This transforms the experiment from measuring natural reading into measuring strategic disambiguation. Participants adapt within 10-15 trials (Jegerski, 2014).
Too few items per condition. With fewer than 24 items per condition, even large effects (d = 0.8) may not reach significance in mixed-effects models, particularly with by-item random slopes (Brysbaert & Stevens, 2018).
No fillers or insufficient fillers. Without at least a 2:1 filler-to-experimental ratio, participants can identify the experimental manipulation and shift to strategic reading (Keating & Jegerski, 2015).
Analyzing only the critical region. Even when an effect appears on the critical word, it typically continues into the spillover region. Analyzing only one region provides an incomplete picture and may miss effects that appear exclusively in spillover.
Using raw reading times without controlling for word length. Raw RTs conflate lexical processing speed with the experimental manipulation. Either match word length precisely or use residual RTs / include word length as a covariate in the statistical model (Ferreira & Clifton, 1986).
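Ignoring trial position effects. Reading speed increases across the experiment as participants become practiced. Present items in a different random order for each participant so that practice is not confounded with condition, and consider including trial order as a covariate in the model (Jegerski, 2014).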
Not checking comprehension accuracy before analyzing RTs. Participants with low accuracy (<80%) may not be reading for comprehension. Their RT data are uninterpretable and should be excluded (Jegerski, 2014).
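
One common way to obtain the residual RTs mentioned above is to fit, for each participant, a simple linear regression of reading time on word length and keep the difference between observed and predicted times. The sketch below does this in plain Python; the input format (a list of `(participant, n_chars, rt_ms)` tuples), the function name `residual_rts`, and the example values are assumptions made for illustration.

```python
from collections import defaultdict


def residual_rts(rows):
    """Return length-corrected RTs, in the same order as rows.

    rows: list of (participant, n_chars, rt_ms) tuples.
    """
    by_participant = defaultdict(list)
    for participant, n_chars, rt in rows:
        by_participant[participant].append((n_chars, rt))

    # Ordinary least squares fit of rt on n_chars, one fit per participant.
    fits = {}
    for participant, points in by_participant.items():
        n = len(points)
        mean_x = sum(x for x, _ in points) / n
        mean_y = sum(y for _, y in points) / n
        sxx = sum((x - mean_x) ** 2 for x, _ in points)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
        slope = sxy / sxx if sxx else 0.0  # no length variation: mean-center only
        fits[participant] = (mean_y - slope * mean_x, slope)

    # Residual = observed RT minus the RT predicted from word length.
    residuals = []
    for participant, n_chars, rt in rows:
        intercept, slope = fits[participant]
        residuals.append(rt - (intercept + slope * n_chars))
    return residuals


rows = [("p1", 3, 310), ("p1", 5, 350), ("p1", 7, 430)]
print([round(r, 2) for r in residual_rts(rows)])
```

Positive residuals mark regions read more slowly than that participant's length-based prediction. Including word length as a covariate in the mixed-effects model is the alternative the text mentions; the two approaches should not be stacked without a reason.
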
Before running your experiment, verify:
references/analysis-guide.md)