Specifies norming procedures for linguistic stimuli including cloze probability, plausibility ratings, acceptability judgments, and lexical controls
This skill encodes expert methodological knowledge for norming linguistic stimuli before running psycholinguistic experiments. A competent programmer without linguistics training would likely construct stimuli based on intuition, failing to control for critical lexical variables (word frequency, length, neighborhood density), skipping cloze norming, using inappropriate rating scales, or under-powering the norming study. Poor stimulus norming is the single most common methodological weakness in psycholinguistic research, because confounds in the materials propagate to every analysis.
Use this skill when:
Do not use this skill when:
Before executing the domain-specific steps below, you MUST:
For detailed methodology guidance, see the research-literacy skill.
This skill was generated by AI from academic literature. All parameters, thresholds, and citations require independent verification before use in research. If you find errors, please open an issue.
Cloze probability is the proportion of people who complete a sentence fragment with a particular word (Taylor, 1953). It is the standard measure of a word's predictability in context and is a critical control variable in nearly all sentence processing research.
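Computing cloze probability from collected completions is straightforward: tally the normalized responses for each fragment and divide by the number of raters. A minimal sketch (function name and example responses are illustrative, not from the source):

```python
from collections import Counter

def cloze_probabilities(completions):
    """Proportion of raters producing each completion for one fragment."""
    normalized = [c.strip().lower() for c in completions]
    n = len(normalized)
    counts = Counter(normalized)
    return {word: count / n for word, count in counts.items()}

# Hypothetical data: 30 raters complete "She spread the warm bread with ___"
responses = ["butter"] * 24 + ["jam"] * 4 + ["honey"] * 2
probs = cloze_probabilities(responses)
# "butter" has cloze 24/30 = 0.80; "jam" and "honey" are low-cloze alternatives
```

In practice you would also decide how to treat misspellings, multi-word responses, and morphological variants (e.g., whether "butter" and "some butter" count as the same completion) before tallying.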
| Parameter | Recommended Value | Citation / Rationale |
|---|---|---|
| N per item | Minimum 30 raters | Taylor, 1953; Bloom & Fischler, 1980; standard minimum for stable estimates |
| Preferred N | 40-50 raters | More stable estimates, especially for medium-cloze items |
| Items per participant | 50-100 fragments per norming session | Avoid fatigue; pilot to calibrate |
| Time limit | ~10-15 seconds per item or untimed | Untimed is standard; brief limit prevents overthinking |
| Population | Same as experimental population (e.g., native English speakers, same age range) | Ensures cloze values generalize |
| Cloze Range | Label | Use Case |
|---|---|---|
| > 0.80 | High cloze / highly predictable | N400 amplitude studies; predictability effects (Kutas & Hillyard, 1984) |
| 0.30 - 0.70 | Medium cloze | Moderate predictability manipulations |
| < 0.10 | Low cloze / unpredictable | Baseline; unexpected completions |
| 0.00 | Zero cloze | Anomalous or implausible continuations |
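The bands in the table above leave gaps (0.10-0.30 and 0.70-0.80); items falling there should be flagged for reassignment or exclusion rather than silently binned. A sketch of a band selector (thresholds taken from the table; the "unclassified" label is ours):

```python
def cloze_band(p):
    """Map an observed cloze probability to the bands in the table above."""
    if p == 0.0:
        return "zero"
    if p > 0.80:
        return "high"
    if 0.30 <= p <= 0.70:
        return "medium"
    if p < 0.10:
        return "low"
    # gaps between the published bands: flag rather than force a label
    return "unclassified"
```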
| Aspect | Lab | Online (e.g., Prolific, MTurk) |
|---|---|---|
| Quality control | Direct observation | Must include catch trials and attention checks |
| Sample size | Limited by lab capacity | Easy to reach N = 40-50 per item |
| Population | Typically university students | More diverse; specify inclusion criteria |
| Validity | Gold standard | Comparable for cloze (Schütze & Sprouse, 2014) |
| Cost | Lab time | Participant payment (~$10-15/hour; Prolific standards) |
Recommendation for online norming: Include 10-15% catch trials (sentences with obvious completions, e.g., "The dog chased the ___") and exclude participants who fail > 20% of catch trials.
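The catch-trial criterion above can be applied mechanically at analysis time. A minimal sketch (function name and example data are illustrative):

```python
def passes_catch_trials(responses, expected, max_fail_rate=0.20):
    """True if the participant fails no more than 20% of catch trials."""
    fails = sum(r.strip().lower() != e.strip().lower()
                for r, e in zip(responses, expected))
    return fails / len(expected) <= max_fail_rate

# 1 failure out of 10 catch trials (10%) -> keep the participant
keep = passes_catch_trials(["cat"] * 9 + ["dog"], ["cat"] * 10)
```

Pre-register the threshold and apply it before looking at the experimental responses, so exclusions cannot be influenced by the pattern of results.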
| Parameter | Recommended | Citation / Rationale |
|---|---|---|
| Scale type | Likert scale | Standard for sentence ratings (Schütze & Sprouse, 2014) |
| Number of points | 7-point scale | Balances sensitivity and reliability; standard in psycholinguistics (Schütze & Sprouse, 2014) |
| Anchors | 1 = "very unnatural/implausible" to 7 = "very natural/plausible" | Labeled endpoints with unlabeled intermediate points |
| N per item | Minimum 20 raters; preferred 30+ | Sufficient for stable means per item (Sprouse & Almeida, 2012) |
| Items per rater | 40-80 items per session | Avoid fatigue effects |
| Practice items | 3-5 items spanning the full range before data collection | Calibrate scale use |
"You will read a series of sentences. For each sentence, please rate how natural or plausible it sounds on a scale from 1 to 7, where 1 means 'very unnatural / makes no sense' and 7 means 'perfectly natural / makes complete sense.' There are no right or wrong answers; we are interested in your intuition."
| Method | Description | Pros | Cons | Citation |
|---|---|---|---|---|
| Likert scale (7-point) | Rate acceptability 1-7 | Simple; familiar; sufficient for most purposes | Ceiling/floor possible; ordinal data | Schütze & Sprouse, 2014 |
| Magnitude estimation (ME) | Assign a number proportional to perceived acceptability relative to a reference sentence | Unbounded scale; ratio-level data (in theory) | More complex; participants need training; debated whether it outperforms Likert | Bard et al., 1996; Sprouse, 2011 |
| Forced choice | Choose the more acceptable of two sentences | Binary; easy; avoids scale-use differences | Low sensitivity; many trials needed | Sprouse & Almeida, 2012 |
| Yes/No judgment | "Is this sentence acceptable?" | Simple; binary | Very low sensitivity; cannot distinguish degrees of unacceptability | -- |
Recommendation: Use 7-point Likert as the default. It provides sufficient sensitivity for most research questions and has been shown to replicate formal linguistic judgments as reliably as magnitude estimation (Sprouse & Almeida, 2012; Sprouse, 2011).
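A common preprocessing step with Likert judgment data (discussed in the same literature, though not prescribed above) is to z-score each participant's raw ratings before averaging, which removes individual differences in scale use. A sketch, assuming ratings on the 1-7 scale:

```python
import statistics

def zscore_ratings(ratings):
    """z-score one participant's raw 1-7 ratings to remove scale-use bias."""
    m = statistics.mean(ratings)
    sd = statistics.stdev(ratings)
    if sd == 0:
        # a participant who gave the same rating everywhere is likely
        # straight-lining and should be excluded, not transformed
        raise ValueError("zero variance in ratings")
    return [(r - m) / sd for r in ratings]

z = zscore_ratings([7, 6, 7, 2, 1, 5])
# the transformed ratings are centered on 0 for this participant
```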
| Design | Minimum N | Rationale | Citation |
|---|---|---|---|
| Simple grammatical/ungrammatical | 20 participants | Large effect sizes (d > 1.0 typical) | Sprouse & Almeida, 2012 |
| Factorial (2x2) with interaction | 30-40 participants | Interaction effects are smaller | Sprouse et al., 2012 |
| Subtle contrasts | 50+ participants | Small effect sizes require more power | Power analysis recommended |
Every critical word manipulation must control for confounding lexical variables. The target word and its condition-matched alternatives should be equated on the following:
| Variable | Database / Source | Why It Matters | Citation |
|---|---|---|---|
| Word frequency | SUBTLEX-US (log10 word frequency per million) | Most powerful predictor of reading time; ~30-60 ms effect for high vs. low | Brysbaert & New, 2009 |
| Word length | Character count | Longer words = longer reading times; ~20-30 ms per character | Rayner, 1998 |
| Orthographic neighborhood density (N) | N-Watch; CLEARPOND | Number of words differing by one letter; affects lexical access (Coltheart et al., 1977) | Andrews, 1997 |
| Concreteness | Brysbaert et al. (2014) ratings | Concrete words processed faster than abstract words | Brysbaert et al., 2014 |
| Age of acquisition (AoA) | Kuperman et al. (2012) ratings | Earlier-acquired words processed faster | Kuperman et al., 2012 |
| Number of syllables | Any pronunciation dictionary | Affects phonological processing time | Rayner, 1998 |
| Morphological complexity | Manual coding | Derived words (e.g., un-happi-ness) processed differently than monomorphemic words | Taft, 2004 |
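A quick first check on matching is to compare condition means for each variable (followed by paired t-tests or equivalence tests in practice). A sketch with invented values; real log frequencies would come from SUBTLEX:

```python
import statistics

def condition_means(items, variable):
    """Mean of one lexical variable per condition, as a quick matching check."""
    by_cond = {}
    for item in items:
        by_cond.setdefault(item["condition"], []).append(item[variable])
    return {cond: statistics.mean(vals) for cond, vals in by_cond.items()}

# Hypothetical stimulus set: two conditions, frequencies invented for illustration
stimuli = [
    {"condition": "A", "word": "castle", "log_freq": 3.2, "length": 6},
    {"condition": "A", "word": "violin", "log_freq": 2.8, "length": 6},
    {"condition": "B", "word": "helmet", "log_freq": 3.1, "length": 6},
    {"condition": "B", "word": "petrol", "log_freq": 2.9, "length": 6},
]
freq_means = condition_means(stimuli, "log_freq")
# both conditions have mean log frequency 3.0, and length is held constant
```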
| Database | Language | Measure | Recommended? | Citation |
|---|---|---|---|---|
| SUBTLEX-US | English (US) | Subtitle-based frequency per million | Yes -- best predictor of processing times | Brysbaert & New, 2009 |
| SUBTLEX-UK | English (UK) | Subtitle-based frequency | Yes, for British English materials | van Heuven et al., 2014 |
| HAL | English | Usenet corpus frequency | Outdated; SUBTLEX preferred | Lund & Burgess, 1996 |
| CELEX | English, Dutch, German | Mixed corpus frequency | Acceptable but less predictive than SUBTLEX | Baayen et al., 1995 |
Key recommendation: Use SUBTLEX log frequency values. They explain more variance in lexical decision and naming times than older norms (Brysbaert & New, 2009).
In a within-item design, each item appears in all conditions, but each participant sees each item in only one condition. A Latin square assigns items to conditions across participant lists.
For a design with k conditions and n items (where n is divisible by k), create k lists and rotate each item's condition assignment by one position from list to list, so that every item appears in every condition across lists while each participant sees each item exactly once.

Example with 40 items and 2 conditions (A, B):
| List | Items 1-20 | Items 21-40 |
|---|---|---|
| List 1 | Condition A | Condition B |
| List 2 | Condition B | Condition A |
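List construction can be automated. The sketch below rotates conditions by item index rather than in blocks as in the table, but the counterbalancing property is the same: every item appears in every condition across lists, and each list shows each item in exactly one condition (function name is ours):

```python
def latin_square_lists(n_items, conditions):
    """Build one item-to-condition assignment per list by rotating conditions."""
    k = len(conditions)
    if n_items % k != 0:
        raise ValueError("number of items must be divisible by number of conditions")
    return [
        {item: conditions[(item + shift) % k] for item in range(n_items)}
        for shift in range(k)
    ]

lists = latin_square_lists(40, ["A", "B"])
# two lists; across them, each of the 40 items appears in both conditions
```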
| Parameter | Value | Rationale |
|---|---|---|
| Minimum items per condition per list | 16-24 | Standard for psycholinguistic experiments; fewer items = lower power (Brysbaert & Stevens, 2018) |
| Recommended items | 24-40 per condition | More stable estimates, especially for eye-tracking |
| Participants per list | Equal across lists; minimum 4-6 per list | Ensures balanced representation |
| Total participants | Divisible by number of lists | Critical for balanced design |
Fillers prevent participants from noticing the experimental manipulation and adopting strategies.
| Parameter | Recommended Value | Rationale |
|---|---|---|
| Filler-to-target ratio | 2:1 or 3:1 (fillers:targets) | Standard in psycholinguistics; prevents pattern detection (Schütze & Sprouse, 2014) |
| Filler diversity | Fillers should span the full range of sentence types, lengths, and structures | Prevents target sentences from standing out |
| Filler acceptability range | Include some clearly good and some mildly awkward fillers | Prevents raters from using only part of the scale |
| Filler length | Match the average length of target sentences | Controls for sentence length expectations |
| Parameter | Recommended Value | Rationale |
|---|---|---|
| Number of practice items | 4-6 items (minimum 3) | Familiarize participants with the task and interface |
| Practice item composition | Span the range of difficulty/acceptability | Calibrate participant expectations |
| Practice data | Always exclude from analysis | Practice responses are contaminated by learning effects |
| Warm-up items at start of main experiment | 2-3 additional filler items | Allow settling into the task; exclude from analysis |
| Platform | Pros | Cons | Typical Pay Rate |
|---|---|---|---|
| Prolific | Diverse participants; pre-screening; good data quality | Smaller pool than MTurk | ~$10-15/hour (Prolific minimum: $8/hour) |
| Amazon MTurk | Large pool; fast recruitment | Lower data quality; less diverse; requires careful screening | ~$10-15/hour recommended |
| PCIbex / Ibex Farm | Free hosting; designed for linguistics | Requires programming; no built-in recruitment | (hosting only) |
| Gorilla | GUI-based; good for complex designs | Subscription cost | (hosting only) |
| Measure | Implementation | Threshold |
|---|---|---|
| Catch trials | Include 10-15% filler items with obvious answers | Exclude participants failing > 20% |
| Completion time | Record total time | Exclude participants completing in < 50% of median time |
| Straight-lining | Check for same response on all items | Exclude participants with zero variance in ratings |
| Bot detection | Include reCAPTCHA or similar | Exclude flagged responses |
| Native speaker check | Self-report + brief language background questionnaire | Exclude non-native speakers (unless studying L2) |
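The screening thresholds above can be combined into a single pass over the data. A sketch (function name and data are illustrative; the native-speaker and bot checks happen at recruitment, so they are omitted here):

```python
def exclusion_reasons(ratings, total_time, median_time, catch_fail_rate):
    """Apply the screening thresholds above; empty list means keep."""
    reasons = []
    if catch_fail_rate > 0.20:
        reasons.append("failed >20% of catch trials")
    if total_time < 0.5 * median_time:
        reasons.append("completed in <50% of median time")
    if len(set(ratings)) == 1:
        reasons.append("straight-lining: zero variance in ratings")
    return reasons

flags = exclusion_reasons([4, 4, 4, 4], total_time=240, median_time=900,
                          catch_fail_rate=0.10)
# flagged for speed and straight-lining; the catch-trial rate is acceptable
```

Report the number of participants excluded under each criterion, and apply the criteria before inspecting condition means.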
Not norming cloze probability: Claiming words are "predictable" or "unpredictable" based on experimenter intuition rather than empirical cloze norms. Always collect cloze data (Taylor, 1953).
Too few raters per item: With N < 20 raters for cloze, individual item estimates are unstable. A word with true cloze of 0.50 could yield observed cloze of 0.20-0.80 with only 10 raters. Use minimum 30 raters (Bloom & Fischler, 1980).
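The instability claim is easy to verify with a binomial simulation: treat each rater as producing the target completion with the item's true cloze probability, and look at the spread of observed estimates. A sketch (function name and simulation parameters are ours):

```python
import random

def cloze_interval(true_p, n_raters, n_sims=20000, seed=0):
    """Simulated 95% range of observed cloze around a true value."""
    rng = random.Random(seed)
    estimates = sorted(
        sum(rng.random() < true_p for _ in range(n_raters)) / n_raters
        for _ in range(n_sims)
    )
    return estimates[int(0.025 * n_sims)], estimates[int(0.975 * n_sims)]

lo10, hi10 = cloze_interval(0.50, 10)   # wide: roughly 0.2 to 0.8
lo30, hi30 = cloze_interval(0.50, 30)   # noticeably tighter with 30 raters
```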
Not controlling word frequency: Frequency is the strongest single predictor of reading time. A 1 log-unit difference in SUBTLEX frequency corresponds to ~30-40 ms in gaze duration (Brysbaert & New, 2009; Rayner, 1998). Always match or control.
Using the wrong frequency database: HAL and Kucera-Francis norms are outdated. SUBTLEX-US explains significantly more variance in behavioral data (Brysbaert & New, 2009).
Showing raters multiple conditions of the same item: This introduces contrastive evaluation. Raters must see each item in only one condition (Latin square for norming too).
Insufficient filler items: A 1:1 target-to-filler ratio makes the manipulation transparent. Use at least 2:1 fillers to targets (Schütze & Sprouse, 2014).
Not piloting the norming study: Always pilot with 5-10 participants to catch unclear instructions, ambiguous items, and timing issues before running the full norming sample.
Ignoring age of acquisition: AoA effects are independent of frequency (Kuperman et al., 2012). Failing to control AoA can introduce confounds, especially for studies comparing concrete vs. abstract words.
Based on Schütze & Sprouse (2014) and current psycholinguistic standards:
See references/lexical-databases-guide.md for detailed instructions on accessing and querying lexical control databases.