Specifies norming procedures for linguistic stimuli including cloze probability, plausibility ratings, acceptability judgments, and lexical controls
This skill encodes expert methodological knowledge for norming linguistic stimuli before running psycholinguistic experiments. A competent programmer without linguistics training would likely construct stimuli based on intuition, failing to control for critical lexical variables (word frequency, length, neighborhood density), skipping cloze norming, using inappropriate rating scales, or under-powering the norming study. Poor stimulus norming is the single most common methodological weakness in psycholinguistic research, because confounds in the materials propagate to every analysis.
Use this skill when:
Do not use this skill when:
Before executing the domain-specific steps below, you MUST:
For detailed methodology guidance, see the research-literacy skill.
This skill was generated by AI from academic literature. All parameters, thresholds, and citations require independent verification before use in research. If you find errors, please open an issue.
Cloze probability is the proportion of people who complete a sentence fragment with a particular word (Taylor, 1953). It is the standard measure of a word's predictability in context and is a critical control variable in nearly all sentence processing research.
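Computing cloze probability from collected completions is straightforward: tally the normalized responses for each fragment and divide by the number of raters. A minimal sketch (function name and example responses are illustrative, not from the source):

```python
from collections import Counter

def cloze_probabilities(completions):
    """Proportion of raters producing each completion for one fragment."""
    normalized = [c.strip().lower() for c in completions]
    n = len(normalized)
    counts = Counter(normalized)
    return {word: count / n for word, count in counts.items()}

# Hypothetical data: 30 raters complete "She spread the warm bread with ___"
responses = ["butter"] * 24 + ["jam"] * 4 + ["honey"] * 2
probs = cloze_probabilities(responses)
# "butter" has cloze 24/30 = 0.80; "jam" and "honey" are low-cloze alternatives
```

In practice you would also decide how to treat misspellings, multi-word responses, and morphological variants (e.g., whether "butter" and "some butter" count as the same completion) before tallying.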
| Parameter | Recommended Value | Citation / Rationale |
|---|---|---|
| N per item | Minimum 30 raters | Taylor, 1953; Bloom & Fischler, 1980; standard minimum for stable estimates |
| Preferred N | 40-50 raters | More stable estimates, especially for medium-cloze items |
| Items per participant | 50-100 fragments per norming session | Avoid fatigue; pilot to calibrate |
| Time limit | ~10-15 seconds per item or untimed | Untimed is standard; brief limit prevents overthinking |
| Population | Same as experimental population (e.g., native English speakers, same age range) | Ensures cloze values generalize |
| Cloze Range | Label | Use Case |
|---|---|---|
| > 0.80 | High cloze / highly predictable | N400 amplitude studies; predictability effects (Kutas & Hillyard, 1984) |
| 0.30 - 0.70 | Medium cloze | Moderate predictability manipulations |
| < 0.10 | Low cloze / unpredictable | Baseline; unexpected completions |
| 0.00 | Zero cloze | Anomalous or implausible continuations |
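The bands in the table above leave gaps (0.10-0.30 and 0.70-0.80); items falling there should be flagged for reassignment or exclusion rather than silently binned. A sketch of a band selector (thresholds taken from the table; the "unclassified" label is ours):

```python
def cloze_band(p):
    """Map an observed cloze probability to the bands in the table above."""
    if p == 0.0:
        return "zero"
    if p > 0.80:
        return "high"
    if 0.30 <= p <= 0.70:
        return "medium"
    if p < 0.10:
        return "low"
    # gaps between the published bands: flag rather than force a label
    return "unclassified"
```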
| Aspect | Lab | Online (e.g., Prolific, MTurk) |
|---|---|---|
| Quality control | Direct observation | Must include catch trials and attention checks |
| Sample size | Limited by lab capacity | Easy to reach N = 40-50 per item |
| Population | Typically university students | More diverse; specify inclusion criteria |
| Validity | Gold standard | Comparable for cloze (Schütze & Sprouse, 2014) |
| Cost | Lab time | Participant payment (~$10-15/hour; Prolific standards) |
Recommendation for online norming: Include 10-15% catch trials (sentences with obvious completions, e.g., "The dog chased the ___") and exclude participants who fail > 20% of catch trials.
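The catch-trial criterion above can be applied mechanically at analysis time. A minimal sketch (function name and example data are illustrative):

```python
def passes_catch_trials(responses, expected, max_fail_rate=0.20):
    """True if the participant fails no more than 20% of catch trials."""
    fails = sum(r.strip().lower() != e.strip().lower()
                for r, e in zip(responses, expected))
    return fails / len(expected) <= max_fail_rate

# 1 failure out of 10 catch trials (10%) -> keep the participant
keep = passes_catch_trials(["cat"] * 9 + ["dog"], ["cat"] * 10)
```

Pre-register the threshold and apply it before looking at the experimental responses, so exclusions cannot be influenced by the pattern of results.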
| Parameter | Recommended | Citation / Rationale |
|---|---|---|
| Scale type | Likert scale | Standard for sentence ratings (Schütze & Sprouse, 2014) |
| Number of points | 7-point scale | Balances sensitivity and reliability; standard in psycholinguistics (Schütze & Sprouse, 2014) |
| Anchors | 1 = "very unnatural/implausible" to 7 = "very natural/plausible" | Labeled endpoints with unlabeled intermediate points |
| N per item | Minimum 20 raters; preferred 30+ | Sufficient for stable means per item (Sprouse & Almeida, 2012) |
| Items per rater | 40-80 items per session | Avoid fatigue effects |
| Practice items | 3-5 items spanning the full range before data collection | Calibrate scale use |
"You will read a series of sentences. For each sentence, please rate how natural or plausible it sounds on a scale from 1 to 7, where 1 means 'very unnatural / makes no sense' and 7 means 'perfectly natural / makes complete sense.' There are no right or wrong answers; we are interested in your intuition."
| Method | Description | Pros | Cons | Citation |
|---|---|---|---|---|
| Likert scale (7-point) | Rate acceptability 1-7 | Simple; familiar; sufficient for most purposes | Ceiling/floor possible; ordinal data | Schütze & Sprouse, 2014 |
| Magnitude estimation (ME) | Assign a number proportional to perceived acceptability relative to a reference sentence | Unbounded scale; ratio-level data (in theory) | More complex; participants need training; debated whether it outperforms Likert | Bard et al., 1996; Sprouse, 2011 |
| Forced choice | Choose the more acceptable of two sentences | Binary; easy; avoids scale-use differences | Low sensitivity; many trials needed | Sprouse & Almeida, 2012 |
| Yes/No judgment | "Is this sentence acceptable?" | Simple; binary | Very low sensitivity; cannot distinguish degrees of unacceptability | -- |
Recommendation: Use 7-point Likert as the default. It provides sufficient sensitivity for most research questions and has been shown to replicate formal linguistic judgments as reliably as magnitude estimation (Sprouse & Almeida, 2012; Sprouse, 2011).
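A common preprocessing step with Likert judgment data (discussed in the same literature, though not prescribed above) is to z-score each participant's raw ratings before averaging, which removes individual differences in scale use. A sketch, assuming ratings on the 1-7 scale:

```python
import statistics

def zscore_ratings(ratings):
    """z-score one participant's raw 1-7 ratings to remove scale-use bias."""
    m = statistics.mean(ratings)
    sd = statistics.stdev(ratings)
    if sd == 0:
        # a participant who gave the same rating everywhere is likely
        # straight-lining and should be excluded, not transformed
        raise ValueError("zero variance in ratings")
    return [(r - m) / sd for r in ratings]

z = zscore_ratings([7, 6, 7, 2, 1, 5])
# the transformed ratings are centered on 0 for this participant
```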
| Design | Minimum N | Rationale | Citation |
|---|---|---|---|
| Simple grammatical/ungrammatical | 20 participants | Large effect sizes (d > 1.0 typical) | Sprouse & Almeida, 2012 |
| Factorial (2x2) with interaction | 30-40 participants | Interaction effects are smaller | Sprouse et al., 2012 |
| Subtle contrasts | 50+ participants | Small effect sizes require more power | Power analysis recommended |
Every critical word manipulation must control for confounding lexical variables. The target word and its condition-matched alternatives should be equated on the following:
| Variable | Database / Source | Why It Matters | Citation |
|---|---|---|---|
| Word frequency | SUBTLEX-US (log10 word frequency per million) | Most powerful predictor of reading time; ~30-60 ms effect for high vs. low | Brysbaert & New, 2009 |
| Word length | Character count | Longer words = longer reading times; ~20-30 ms per character | Rayner, 1998 |
| Orthographic neighborhood density (N) | N-Watch; CLEARPOND | Number of words differing by one letter; affects lexical access (Coltheart et al., 1977) | Andrews, 1997 |
| Concreteness | Brysbaert et al. (2014) ratings | Concrete words processed faster than abstract words | Brysbaert et al., 2014 |
| Age of acquisition (AoA) | Kuperman et al. (2012) ratings | Earlier-acquired words processed faster | Kuperman et al., 2012 |
| Number of syllables | Any pronunciation dictionary | Affects phonological processing time | Rayner, 1998 |
| Morphological complexity | Manual coding | Derived words (e.g., un-happi-ness) processed differently than monomorphemic words | Taft, 2004 |
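A quick first check on matching is to compare condition means for each variable (followed by paired t-tests or equivalence tests in practice). A sketch with invented values; real log frequencies would come from SUBTLEX:

```python
import statistics

def condition_means(items, variable):
    """Mean of one lexical variable per condition, as a quick matching check."""
    by_cond = {}
    for item in items:
        by_cond.setdefault(item["condition"], []).append(item[variable])
    return {cond: statistics.mean(vals) for cond, vals in by_cond.items()}

# Hypothetical stimulus set: two conditions, frequencies invented for illustration
stimuli = [
    {"condition": "A", "word": "castle", "log_freq": 3.2, "length": 6},
    {"condition": "A", "word": "violin", "log_freq": 2.8, "length": 6},
    {"condition": "B", "word": "helmet", "log_freq": 3.1, "length": 6},
    {"condition": "B", "word": "petrol", "log_freq": 2.9, "length": 6},
]
freq_means = condition_means(stimuli, "log_freq")
# both conditions have mean log frequency 3.0, and length is held constant
```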
| Database | Language | Measure | Recommended? | Citation |
|---|---|---|---|---|
| SUBTLEX-US | English (US) | Subtitle-based frequency per million | Yes -- best predictor of processing times | Brysbaert & New, 2009 |
| SUBTLEX-UK | English (UK) | Subtitle-based frequency | Yes, for British English materials | van Heuven et al., 2014 |
| HAL | English | Usenet corpus frequency | Outdated; SUBTLEX preferred | Lund & Burgess, 1996 |
| CELEX | English, Dutch, German | Mixed corpus frequency | Acceptable but less predictive than SUBTLEX | Baayen et al., 1995 |
Key recommendation: Use SUBTLEX log frequency values. They explain more variance in lexical decision and naming times than older norms (Brysbaert & New, 2009).
In a within-item design, each item appears in all conditions, but each participant sees each item in only one condition. A Latin square assigns items to conditions across participant lists.
For a design with k conditions and n items (where n is divisible by k), create k lists and rotate each item's condition assignment by one position from list to list, so that every item appears in every condition across lists while each participant sees each item exactly once.

Example with 40 items and 2 conditions (A, B):
| List | Items 1-20 | Items 21-40 |
|---|---|---|
| List 1 | Condition A | Condition B |
| List 2 | Condition B | Condition A |
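List construction can be automated. The sketch below rotates conditions by item index rather than in blocks as in the table, but the counterbalancing property is the same: every item appears in every condition across lists, and each list shows each item in exactly one condition (function name is ours):

```python
def latin_square_lists(n_items, conditions):
    """Build one item-to-condition assignment per list by rotating conditions."""
    k = len(conditions)
    if n_items % k != 0:
        raise ValueError("number of items must be divisible by number of conditions")
    return [
        {item: conditions[(item + shift) % k] for item in range(n_items)}
        for shift in range(k)
    ]

lists = latin_square_lists(40, ["A", "B"])
# two lists; across them, each of the 40 items appears in both conditions
```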
| Parameter | Value | Rationale |
|---|---|---|
| Minimum items per condition per list | 16-24 | Standard for psycholinguistic experiments; fewer items = lower power (Brysbaert & Stevens, 2018) |
| Recommended items | 24-40 per condition | More stable estimates, especially for eye-tracking |
| Participants per list | Equal across lists; minimum 4-6 per list | Ensures balanced representation |
| Total participants | Divisible by number of lists | Critical for balanced design |
Fillers prevent participants from noticing the experimental manipulation and adopting strategies.
| Parameter | Recommended Value | Rationale |
|---|---|---|
| Filler-to-target ratio | 2:1 or 3:1 (fillers:targets) | Standard in psycholinguistics; prevents pattern detection (Schütze & Sprouse, 2014) |
| Filler diversity | Fillers should span the full range of sentence types, lengths, and structures | Prevents target sentences from standing out |
| Filler acceptability range | Include some clearly good and some mildly awkward fillers | Prevents raters from using only part of the scale |
| Filler length | Match the average length of target sentences | Controls for sentence length expectations |
| Parameter | Recommended Value | Rationale |
|---|---|---|
| Number of practice items | 4-6 items (minimum 3) | Familiarize participants with the task and interface |
| Practice item composition | Span the range of difficulty/acceptability | Calibrate participant expectations |
| Practice data | Always exclude from analysis | Practice responses are contaminated by learning effects |
| Warm-up items at start of main experiment | 2-3 additional filler items | Allow settling into the task; exclude from analysis |
| Platform | Pros | Cons | Typical Pay Rate |
|---|---|---|---|
| Prolific | Diverse participants; pre-screening; good data quality | Smaller pool than MTurk | ~$10-15/hour (Prolific minimum: $8/hour) |
| Amazon MTurk | Large pool; fast recruitment | Lower data quality; less diverse; requires careful screening | ~$10-15/hour recommended |
| PCIbex / Ibex Farm | Free hosting; designed for linguistics | Requires programming; no built-in recruitment | (hosting only) |
| Gorilla | GUI-based; good for complex designs | Subscription cost | (hosting only) |
| Measure | Implementation | Threshold |
|---|---|---|
| Catch trials | Include 10-15% filler items with obvious answers | Exclude participants failing > 20% |
| Completion time | Record total time | Exclude participants completing in < 50% of median time |
| Straight-lining | Check for same response on all items | Exclude participants with zero variance in ratings |
| Bot detection | Include reCAPTCHA or similar | Exclude flagged responses |
| Native speaker check | Self-report + brief language background questionnaire | Exclude non-native speakers (unless studying L2) |
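The screening thresholds above can be combined into a single pass over the data. A sketch (function name and data are illustrative; the native-speaker and bot checks happen at recruitment, so they are omitted here):

```python
def exclusion_reasons(ratings, total_time, median_time, catch_fail_rate):
    """Apply the screening thresholds above; empty list means keep."""
    reasons = []
    if catch_fail_rate > 0.20:
        reasons.append("failed >20% of catch trials")
    if total_time < 0.5 * median_time:
        reasons.append("completed in <50% of median time")
    if len(set(ratings)) == 1:
        reasons.append("straight-lining: zero variance in ratings")
    return reasons

flags = exclusion_reasons([4, 4, 4, 4], total_time=240, median_time=900,
                          catch_fail_rate=0.10)
# flagged for speed and straight-lining; the catch-trial rate is acceptable
```

Report the number of participants excluded under each criterion, and apply the criteria before inspecting condition means.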
Not norming cloze probability: Claiming words are "predictable" or "unpredictable" based on experimenter intuition rather than empirical cloze norms. Always collect cloze data (Taylor, 1953).
Too few raters per item: With N < 20 raters for cloze, individual item estimates are unstable. A word with true cloze of 0.50 could yield observed cloze of 0.20-0.80 with only 10 raters. Use minimum 30 raters (Bloom & Fischler, 1980).
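The instability claim is easy to verify with a binomial simulation: treat each rater as producing the target completion with the item's true cloze probability, and look at the spread of observed estimates. A sketch (function name and simulation parameters are ours):

```python
import random

def cloze_interval(true_p, n_raters, n_sims=20000, seed=0):
    """Simulated 95% range of observed cloze around a true value."""
    rng = random.Random(seed)
    estimates = sorted(
        sum(rng.random() < true_p for _ in range(n_raters)) / n_raters
        for _ in range(n_sims)
    )
    return estimates[int(0.025 * n_sims)], estimates[int(0.975 * n_sims)]

lo10, hi10 = cloze_interval(0.50, 10)   # wide: roughly 0.2 to 0.8
lo30, hi30 = cloze_interval(0.50, 30)   # noticeably tighter with 30 raters
```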
Not controlling word frequency: Frequency is the strongest single predictor of reading time. A 1 log-unit difference in SUBTLEX frequency corresponds to ~30-40 ms in gaze duration (Brysbaert & New, 2009; Rayner, 1998). Always match or control.
Using the wrong frequency database: HAL and Kucera-Francis norms are outdated. SUBTLEX-US explains significantly more variance in behavioral data (Brysbaert & New, 2009).
Showing raters multiple conditions of the same item: This introduces contrastive evaluation. Raters must see each item in only one condition (Latin square for norming too).
Insufficient filler items: A 1:1 target-to-filler ratio makes the manipulation transparent. Use at least 2:1 fillers to targets (Schütze & Sprouse, 2014).
Not piloting the norming study: Always pilot with 5-10 participants to catch unclear instructions, ambiguous items, and timing issues before running the full norming sample.
Ignoring age of acquisition: AoA effects are independent of frequency (Kuperman et al., 2012). Failing to control AoA can introduce confounds, especially for studies comparing concrete vs. abstract words.
Based on Schütze & Sprouse (2014) and current psycholinguistic standards:
See references/lexical-databases-guide.md for detailed instructions on accessing and querying lexical control databases.