Domain-validated multi-dimensional scoring system for divergent thinking tasks, including fluency, flexibility, originality, and automated semantic distance methods
This skill encodes expert methodological knowledge for scoring responses from divergent thinking tasks (Alternative Uses Task, Unusual Uses Task, instances tasks, etc.). It covers the four standard scoring dimensions — fluency, flexibility, originality, and elaboration — plus modern automated scoring using semantic distance. A general-purpose programmer would typically count responses (fluency) but would not know the domain-specific decisions around flexibility category systems, originality thresholds, inter-rater reliability requirements, or how to compute semantic distance as a creativity metric.
Before executing the domain-specific steps below, you MUST review the general methodology guidance in the research-literacy skill.
This skill was generated by AI from academic literature. All parameters, thresholds, and citations require independent verification before use in research. If you find errors, please open an issue.
| Dimension | What It Measures | Scoring Method | Automation | Source |
|---|---|---|---|---|
| Fluency | Quantity of responses | Count valid responses | Fully automated | Guilford, 1967 |
| Flexibility | Variety of conceptual categories | Count distinct categories | Semi-automated (COWA) | Reiter-Palmon et al., 2019 |
| Originality | Statistical rarity or novelty | Frequency <5% threshold or subjective rating | Semi-automated | Silvia et al., 2008 |
| Elaboration | Detail and development of ideas | Count additional details per response | Manual only | Guilford, 1967 |
| Semantic distance | Conceptual remoteness from prompt | GloVe/word2vec cosine distance | Fully automated | Beaty & Johnson, 2021 |
Fluency = the total number of valid, non-redundant responses a participant generates.
Domain insight: Fluency is the most reliable but least interesting creativity measure. It correlates with personality traits (openness) and general cognitive ability, but does not distinguish truly creative responses from merely numerous ones (Silvia et al., 2008).
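Fluency counting reduces to deduplicating and counting valid responses. A minimal sketch in Python, assuming responses have already been spell-checked (only case and whitespace are normalized here; fuller normalization is covered under common pitfalls below):

```python
def fluency_score(responses):
    """Count valid, non-redundant responses.

    Assumes responses are already spell-checked; duplicates after
    lowercasing and trimming whitespace are treated as redundant,
    and empty strings are dropped as invalid.
    """
    valid = [r.strip().lower() for r in responses if r.strip()]
    return len(set(valid))

# "Paperweight" and "paperweight " collapse to one response; "" is invalid.
print(fluency_score(["Paperweight", "build a wall", "paperweight ", ""]))  # 2
```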
Flexibility = the number of distinct conceptual categories across a participant's responses.
COWA (Category of Words from AUT) system (Reiter-Palmon et al., 2019):
Provides a standardized taxonomy of response categories for common AUT objects. Example categories for "brick":
| Category | Example Responses |
|---|---|
| Construction/Building | "build a wall," "build a house" |
| Weapon/Violence | "throw at someone," "use as a weapon" |
| Weight/Anchor | "paperweight," "doorstop," "anchor" |
| Art/Decoration | "sculpt into art," "garden decoration" |
| Sport/Exercise | "use as a dumbbell," "exercise weight" |
| Tool | "hammer," "grinding surface" |
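Semi-automated flexibility scoring can be sketched as a keyword-to-category lookup. The keyword map below is a hypothetical fragment built from the table above, not the published COWA taxonomy; responses with no keyword match are routed to manual coding rather than guessed:

```python
# Hypothetical keyword -> category lookup derived from the table above;
# a real system would use the full published taxonomy.
BRICK_CATEGORIES = {
    "wall": "Construction/Building", "house": "Construction/Building",
    "throw": "Weapon/Violence", "weapon": "Weapon/Violence",
    "paperweight": "Weight/Anchor", "doorstop": "Weight/Anchor",
    "anchor": "Weight/Anchor",
    "art": "Art/Decoration", "decoration": "Art/Decoration",
    "dumbbell": "Sport/Exercise", "exercise": "Sport/Exercise",
    "hammer": "Tool",
}

def flexibility_score(responses, lookup):
    """Flexibility = number of distinct categories across responses.
    Responses with no keyword match are returned for manual coding."""
    categories, unmatched = set(), []
    for r in responses:
        hits = [lookup[w] for w in r.lower().split() if w in lookup]
        if hits:
            categories.add(hits[0])
        else:
            unmatched.append(r)
    return len(categories), unmatched

score, to_code = flexibility_score(
    ["build a wall", "throw at someone", "paperweight", "garden gnome stand"],
    BRICK_CATEGORIES)
print(score, to_code)  # 3 ['garden gnome stand']
```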
A response is "original" if it is given by fewer than 5% of the sample (Wallach & Kogan, 1965; Lee & Chung, 2024).
Procedure:
1. Normalize all responses (spelling, capitalization, phrasing).
2. Tally how many participants in the sample gave each normalized response.
3. Score a response 1 (original) if fewer than 5% of participants gave it, 0 otherwise.
4. Sum (or average) originality points per participant.
Alternative thresholds: Some studies use <1% (very strict) or <10% (lenient). The 5% threshold is most common (Reiter-Palmon et al., 2019).
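The frequency-threshold procedure can be sketched as follows, assuming responses are already normalized; each distinct response counts once per participant when computing sample frequency:

```python
from collections import Counter

def originality_rarity(all_responses, threshold=0.05):
    """Score each response 1 (original) if given by fewer than
    `threshold` of the sample, else 0.

    all_responses: list of per-participant lists of normalized responses.
    Returns per-participant originality totals.
    """
    n = len(all_responses)
    # set() so a response counts once per participant
    freq = Counter(r for resp in all_responses for r in set(resp))
    return [sum(1 for r in set(resp) if freq[r] / n < threshold)
            for resp in all_responses]

# 20 of 21 participants say "build a wall"; one says "bird bath" (1/21 < 5%).
sample = [["build a wall"]] * 20 + [["bird bath"]]
print(originality_rarity(sample)[-1])  # 1
```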
Human raters judge each response for creativity on a Likert scale (Silvia et al., 2008).
Procedure:
1. Train 2-3 raters on the scale using anchor examples.
2. Have each rater independently rate every response (e.g., 1 = not at all creative, 5 = highly creative).
3. Check inter-rater reliability (ICC) before aggregating.
4. Average ratings across raters for each response.
```
Is your sample size large (N > 100)?
|
+-- YES --> Do you need fine-grained creativity distinctions?
|           |
|           +-- YES --> Use subjective rating (richer information)
|           |
|           +-- NO --> Use statistical rarity (objective, faster)
|
+-- NO --> Statistical rarity is unreliable with small N
           (rare responses may be rare by chance)
           --> Use subjective rating
```
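The decision tree above reduces to a small function:

```python
def choose_originality_method(n, need_fine_grained):
    """Encode the decision tree: statistical rarity needs a large
    sample; subjective rating gives richer information."""
    if n <= 100:
        return "subjective rating"  # rarity unreliable with small N
    return "subjective rating" if need_fine_grained else "statistical rarity"

print(choose_originality_method(40, False))   # subjective rating
print(choose_originality_method(250, False))  # statistical rarity
```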
Semantic distance measures how conceptually far a response is from the prompt word in a vector space model. More distant = more creative (Beaty & Johnson, 2021).
| Semantic Distance | Interpretation |
|---|---|
| Low (~0.3-0.5) | Response is semantically close to the object (e.g., "brick" → "build a wall") |
| Medium (~0.5-0.7) | Moderately creative (e.g., "brick" → "use as a paperweight") |
| High (~0.7-1.0) | Highly creative / remote association (e.g., "brick" → "use as a canvas for art") |
Validation: Semantic distance correlates with subjective originality ratings at r ≈ 0.40-0.60 and predicts real-world creative achievement (Beaty & Johnson, 2021; Organisciak et al., 2023).
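A sketch of the distance computation with toy 3-dimensional vectors for illustration; real scoring loads pretrained GloVe or word2vec embeddings (typically 300-d), and multi-word responses are commonly averaged element-wise:

```python
from math import sqrt

def cosine_distance(u, v):
    """1 - cosine similarity; higher = more semantically remote."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def response_vector(words, embeddings):
    """Average the word vectors of a multi-word response element-wise
    (this is what loses phrase-level meaning)."""
    vecs = [embeddings[w] for w in words if w in embeddings]
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

# Toy embeddings, hand-made for illustration only.
emb = {"brick": [1.0, 0.2, 0.0],
       "wall":  [0.9, 0.3, 0.1],
       "art":   [0.1, 0.9, 0.8]}

d = cosine_distance(emb["brick"], response_vector(["wall"], emb))
# "wall" is near "brick" (low distance); "art" is remote (high distance)
```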
| Advantage | Limitation |
|---|---|
| Fully automated, no rater training | Misses context — "use as food" for a brick is unusual but gets a moderate distance score |
| Objective and reproducible | Depends on the embedding model's training corpus |
| Scales to large datasets | Multi-word responses require averaging, which loses phrase-level meaning |
| No inter-rater reliability concerns | Not validated for all object types or languages |
Participants who generate more ideas (high fluency) have a higher probability of producing at least one statistically rare idea, inflating their originality scores (Silvia et al., 2008).
Recommendation: Use top-2 scoring (average the 2 most creative responses) when the primary interest is creative quality rather than quantity. This method has the best psychometric properties (Silvia et al., 2008).
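Top-2 scoring is straightforward once each response has a creativity score; adding more low-rated responses cannot raise a participant's total, which is what controls the fluency confound:

```python
def top2_originality(response_scores):
    """Average the two highest-rated responses (top-2 scoring).
    Extra low-rated responses leave the score unchanged."""
    top = sorted(response_scores, reverse=True)[:2]
    return sum(top) / len(top)

print(top2_originality([1.5, 4.0, 2.0, 4.5]))            # 4.25
print(top2_originality([1.5, 4.0, 2.0, 4.5, 1.0, 1.0]))  # still 4.25
```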
Scoring originality without normalizing text: "doorstop," "door stop," and "use as a door stop" are the same response. Normalize spelling, capitalization, and phrasing before computing frequency (Reiter-Palmon et al., 2019).
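A sketch of this normalization step; the filler-word pattern and compound list below are illustrative choices, not a published standard:

```python
import re

def normalize(response):
    """Collapse surface variants so frequency counts treat
    'doorstop', 'Door stop', and 'use as a door stop' alike."""
    r = response.lower().strip()
    r = re.sub(r"[^\w\s]", "", r)                        # drop punctuation
    r = re.sub(r"^(use (it )?as a|a|an|the)\s+", "", r)  # strip leading filler
    r = re.sub(r"\s+", " ", r)                           # collapse whitespace
    return r.replace("door stop", "doorstop")            # merge known compounds

variants = ["doorstop", "Door stop", "use as a door stop"]
print({normalize(v) for v in variants})  # {'doorstop'}
```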
Using statistical rarity with small samples: With N < 50, many responses appear "unique" simply because the sample is small. Use subjective ratings instead, or pool responses with published norms (Reiter-Palmon et al., 2019).
Ignoring inter-rater reliability: Subjective creativity scores reported without an ICC cannot be distinguished from individual rater bias; readers have no evidence the scores reflect genuine creativity differences. Always report the ICC with the model type specified (Lee & Chung, 2024).
Treating semantic distance as a complete creativity measure: Semantic distance captures novelty but not usefulness/appropriateness — the other key dimension of creativity (Runco & Jaeger, 2012). Combine with subjective ratings for a comprehensive assessment.
Averaging semantic distance across all responses including poor ones: Low-quality responses (gibberish, conventional uses) can dilute or inflate average distance. Clean data before computing semantic distance.
Not reporting which scoring method was used: Different methods yield different results. Always specify whether originality is statistical rarity, subjective rating, or semantic distance, and which threshold or scale was used.
Based on Reiter-Palmon et al. (2019) and Silvia et al. (2008):
See references/scoring-rubric.md for detailed scoring examples and training materials.