Domain-validated multi-dimensional scoring system for divergent thinking tasks, including fluency, flexibility, originality, and automated semantic distance methods
This skill encodes expert methodological knowledge for scoring responses from divergent thinking tasks (Alternative Uses Task, Unusual Uses Task, instances tasks, etc.). It covers the four standard scoring dimensions — fluency, flexibility, originality, and elaboration — plus modern automated scoring using semantic distance. A general-purpose programmer would typically count responses (fluency) but would not know the domain-specific decisions around flexibility category systems, originality thresholds, inter-rater reliability requirements, or how to compute semantic distance as a creativity metric.
When to Use This Skill
Scoring responses from an AUT, Unusual Uses Task, or similar divergent thinking task
Choosing between subjective (human-rated) and objective (automated) scoring approaches
Computing semantic distance as an automated creativity metric
Establishing inter-rater reliability for creativity coding
Deciding how to handle the fluency-originality confound
Research Planning Protocol
Before executing the domain-specific steps below, you MUST:
State the research question — What specific aspect of divergent thinking is being measured?
Justify the method choice — Why these scoring dimensions? What alternatives were considered?
Declare expected outcomes — Which dimensions are expected to show effects?
Note assumptions and limitations — What does each scoring method assume? Where could it mislead?
Present the plan to the user and WAIT for confirmation before proceeding.
For detailed methodology guidance, see the research-literacy skill.
⚠️ Verification Notice
This skill was generated by AI from academic literature. All parameters, thresholds, and citations require independent verification before use in research. If you find errors, please open an issue.
Scoring Dimensions Overview
| Dimension | What It Measures | Scoring Method | Automation | Source |
|---|---|---|---|---|
| Fluency | Quantity of responses | Count valid responses | Fully automated | Guilford, 1967 |
| Flexibility | Variety of conceptual categories | Count distinct categories | Semi-automated (COWA) | Reiter-Palmon et al., 2019 |
| Originality | Statistical rarity or novelty | Frequency <5% threshold or subjective rating | Semi-automated | Silvia et al., 2008 |
| Elaboration | Detail and development of ideas | Count additional details per response | Manual only | Guilford, 1967 |
| Semantic distance | Conceptual remoteness from prompt | GloVe/word2vec cosine distance | Fully automated | Beaty & Johnson, 2021 |
Fluency Scoring
Definition
Fluency = the total number of valid, non-redundant responses a participant generates.
Scoring Rules
Count each distinct response as one unit
Exclude:
Exact duplicates
Conventional/typical uses of the object (debated — some protocols include them; Reiter-Palmon et al., 2019)
Gibberish or clearly irrelevant responses
Responses that are minor variations of each other (e.g., "use as a hammer" and "use to pound nails" = 1 response)
When in doubt, count as separate responses and let originality scoring handle quality
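The counting and normalization rules above can be sketched in a few lines. This is a minimal illustration, not a standard implementation: the `normalize` helper's regex rules are an assumption, and responses to exclude (gibberish, conventional uses, near-duplicates in meaning) must still be flagged by human review.

```python
import re

def normalize(response: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace so that
    trivial spelling variants count as one response."""
    text = re.sub(r"[^\w\s]", "", response.lower())
    return re.sub(r"\s+", " ", text).strip()

def fluency_score(responses, excluded=frozenset()):
    """Count distinct valid responses. `excluded` holds normalized
    responses flagged by human review; semantic near-duplicates
    ("use as a hammer" vs. "use to pound nails") still need a human
    merge pass, which this sketch does not attempt."""
    valid = {normalize(r) for r in responses} - set(excluded) - {""}
    return len(valid)
```

For example, `fluency_score(["Paperweight", "paperweight!", "build a wall"])` counts the first two entries as a single response.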
Domain insight: Fluency is the most reliable but least interesting creativity measure. It correlates with personality traits (openness) and general cognitive ability, but does not distinguish truly creative responses from merely numerous ones (Silvia et al., 2008).
Flexibility Scoring
Definition
Flexibility = the number of distinct conceptual categories across a participant's responses.
Category Systems
COWA (Category of Words from AUT) system (Reiter-Palmon et al., 2019):
Provides a standardized taxonomy of response categories for common AUT objects. Example categories for "brick":
| Category | Example Responses |
|---|---|
| Construction/Building | "build a wall," "build a house" |
| Weapon/Violence | "throw at someone," "use as a weapon" |
| Weight/Anchor | "paperweight," "doorstop," "anchor" |
| Art/Decoration | "sculpt into art," "garden decoration" |
| Sport/Exercise | "use as a dumbbell," "exercise weight" |
| Tool | "hammer," "grinding surface" |
Scoring Procedure
Train coders on the category system (minimum 2 coders)
Assign each response to its most appropriate category
Count unique categories per participant = flexibility score
Scoring: Average across raters for each response; then average or sum per participant
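Once responses are coded, the category count itself is mechanical. A sketch, assuming coded data arrives as `(participant_id, category)` pairs after coder disagreements have been resolved:

```python
from collections import defaultdict

def flexibility_scores(coded_responses):
    """coded_responses: iterable of (participant_id, category) pairs,
    where `category` is the COWA (or custom) category assigned by
    trained coders. Returns {participant_id: count of unique
    categories}, i.e., the flexibility score."""
    categories = defaultdict(set)
    for pid, category in coded_responses:
        categories[pid].add(category)
    return {pid: len(cats) for pid, cats in categories.items()}
```

A participant whose responses fall into Construction, Weapon, and Construction again scores 2, not 3.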
Method Selection Decision Logic
```
Is your sample size large (N > 100)?
|
+-- YES --> Do you need fine-grained creativity distinctions?
|            |
|            +-- YES --> Use subjective rating (richer information)
|            |
|            +-- NO --> Use statistical rarity (objective, faster)
|
+-- NO --> Statistical rarity is unreliable with small N
           (rare responses may be rare by chance)
           --> Use subjective rating
```
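The decision logic transcribes directly into code. A sketch only: the N > 100 cutoff is the heuristic stated above, not a hard psychometric rule.

```python
def choose_originality_method(n: int, need_fine_grained: bool) -> str:
    """Pick an originality scoring approach from sample size and the
    need for fine-grained creativity distinctions. Statistical rarity
    requires a large sample; otherwise rare responses may be rare by
    chance, so subjective rating is safer."""
    if n > 100 and not need_fine_grained:
        return "statistical rarity"
    return "subjective rating"
```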
Semantic Distance (Automated Scoring)
Overview
Semantic distance measures how conceptually far a response is from the prompt word in a vector space model. More distant = more creative (Beaty & Johnson, 2021).
Method (Beaty & Johnson, 2021; Organisciak et al., 2023)
Embedding model: GloVe (Global Vectors for Word Representation; Pennington et al., 2014) trained on Common Crawl — 300-dimensional vectors
Computation:
1. Represent the prompt word (e.g., "brick") as a vector
2. Represent each response as a vector (average word vectors for multi-word responses)
3. Compute distance = 1 − cosine similarity between the prompt and response vectors

Interpreting distance values:

| Distance | Interpretation |
|---|---|
| Low (~0.0-0.5) | Response is semantically close to the object (e.g., "brick" → "build a wall") |
| Medium (~0.5-0.7) | Moderately creative (e.g., "brick" → "use as a paperweight") |
| High (~0.7-1.0) | Highly creative / remote association (e.g., "brick" → "use as a canvas for art") |
Validation: Semantic distance correlates with subjective originality ratings at r ≈ 0.40-0.60 and predicts real-world creative achievement (Beaty & Johnson, 2021; Organisciak et al., 2023).
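The pipeline can be sketched with toy vectors. The `VECS` values below are made up purely for illustration; a real analysis would load 300-dimensional GloVe embeddings or use the SemDis platform (Beaty & Johnson, 2021).

```python
import numpy as np

# Toy 3-d vectors for illustration only -- real scoring uses 300-d
# GloVe embeddings trained on Common Crawl.
VECS = {
    "brick":  np.array([1.0, 0.2, 0.0]),
    "build":  np.array([0.9, 0.3, 0.1]),
    "wall":   np.array([0.8, 0.1, 0.2]),
    "canvas": np.array([0.1, 0.9, 0.6]),
    "art":    np.array([0.0, 1.0, 0.5]),
}

def response_vector(response: str) -> np.ndarray:
    """Average the word vectors of a (possibly multi-word) response,
    skipping out-of-vocabulary words."""
    words = [w for w in response.lower().split() if w in VECS]
    if not words:
        raise ValueError("no in-vocabulary words in response")
    return np.mean([VECS[w] for w in words], axis=0)

def semantic_distance(prompt: str, response: str) -> float:
    """1 - cosine similarity between prompt and response vectors;
    larger values = more remote (more creative) responses."""
    a, b = VECS[prompt], response_vector(response)
    cos = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return 1.0 - cos
```

With these toy vectors, a remote response ("canvas art") scores a larger distance from "brick" than a close one ("build wall"), which is the ordering the method relies on.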
Advantages and Limitations

| Advantage | Limitation |
|---|---|
| Fully automated, no rater training | Misses context — "use as food" for a brick is unusual but gets a moderate distance score |
| Objective and reproducible | Depends on the embedding model's training corpus |
| Scales to large datasets | Multi-word responses require averaging, which loses phrase-level meaning |
| No inter-rater reliability concerns | Not validated for all object types or languages |
Handling the Fluency-Originality Confound
The Problem
Participants who generate more ideas (high fluency) have a higher probability of producing at least one statistically rare idea, inflating their originality scores (Silvia et al., 2008).
Solutions
Ratio-based originality: Divide originality sum by fluency → proportion of original responses (Reiter-Palmon et al., 2019)
Top-N scoring: Score only the top 2-3 most creative responses per participant, equalizing opportunity across fluency levels (Silvia et al., 2008)
Statistical control: Include fluency as a covariate in analyses of originality (Lee & Chung, 2024)
Multilevel modeling: Nest responses within participants, accounting for varying response counts
Recommendation: Use top-2 scoring (average the 2 most creative responses) when the primary interest is creative quality rather than quantity. This method has the best psychometric properties (Silvia et al., 2008).
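The ratio and top-N solutions are straightforward to implement. A sketch, assuming each participant's responses already carry per-response originality scores (subjective ratings, or 0/1 rarity flags):

```python
def top_n_originality(scores, n=2):
    """Average the n highest originality scores for one participant
    (top-2 scoring; Silvia et al., 2008). Equalizes opportunity
    across fluency levels by ignoring all but the best responses."""
    top = sorted(scores, reverse=True)[:n]
    return sum(top) / len(top)

def ratio_originality(scores):
    """Originality sum divided by fluency (response count). With
    binary rarity flags this is the proportion of original
    responses (Reiter-Palmon et al., 2019)."""
    return sum(scores) / len(scores)
```

A highly fluent participant with many mundane responses is no longer advantaged: `top_n_originality` looks only at their best two ideas, and `ratio_originality` penalizes padding.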
Common Pitfalls
Scoring originality without normalizing text: "doorstop," "door stop," and "use as a door stop" are the same response. Normalize spelling, capitalization, and phrasing before computing frequency (Reiter-Palmon et al., 2019).
Using statistical rarity with small samples: With N < 50, many responses appear "unique" simply because the sample is small. Use subjective ratings instead, or pool responses with published norms (Reiter-Palmon et al., 2019).
Ignoring inter-rater reliability: Subjective creativity scores reported without an ICC may simply reflect individual rater bias rather than genuine creativity differences. Always report the ICC with the model type specified (Shrout & Fleiss, 1979; Lee & Chung, 2024).
Treating semantic distance as a complete creativity measure: Semantic distance captures novelty but not usefulness/appropriateness — the other key dimension of creativity (Runco & Jaeger, 2012). Combine with subjective ratings for a comprehensive assessment.
Averaging semantic distance across all responses including poor ones: Low-quality responses (gibberish, conventional uses) can dilute or inflate average distance. Clean data before computing semantic distance.
Not reporting which scoring method was used: Different methods yield different results. Always specify whether originality is statistical rarity, subjective rating, or semantic distance, and which threshold or scale was used.
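As a concrete illustration of the normalization and rarity pitfalls above, the sketch below pools every participant's responses, normalizes them, and applies the <5% threshold. The normalization rules are illustrative; note that the threshold is sample-specific, which is exactly why small samples call for subjective ratings or published norms instead.

```python
import re
from collections import Counter

def normalize(text: str) -> str:
    """Collapse case, punctuation, and whitespace before counting
    frequencies; phrase-level merges ("use as a door stop" vs.
    "doorstop") still need human review."""
    cleaned = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", cleaned).strip()

def rarity_originality(all_responses, threshold=0.05):
    """Flag each distinct normalized response as original (True) if
    its frequency in the pooled sample is below `threshold` (the
    <5% rule). `all_responses` pools every participant's already
    cleaned responses."""
    norm = [normalize(r) for r in all_responses]
    counts = Counter(norm)
    total = len(norm)
    return {r: counts[r] / total < threshold for r in set(norm)}
```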
Minimum Reporting Checklist
Based on Reiter-Palmon et al. (2019) and Silvia et al. (2008):
Scoring dimensions used (fluency, flexibility, originality, elaboration, semantic distance)
For fluency: definition of "valid response" and exclusion rules
For flexibility: category system used (COWA or custom) and category list
For originality: method (statistical rarity threshold or subjective rating scale)
For subjective scoring: number of raters, training procedure, ICC values (model type specified)
For semantic distance: embedding model, dimensionality, platform/implementation
How fluency-originality confound was handled (ratio, top-N, covariate, or acknowledged)
Data cleaning steps (text normalization, duplicate removal)
Whether scoring was blind to condition
References
Beaty, R. E., & Johnson, D. R. (2021). Automating creativity assessment with SemDis: An open platform for computing semantic distance. Behavior Research Methods, 53(2), 757-780.
Guilford, J. P. (1967). The nature of human intelligence. McGraw-Hill.
Organisciak, P., Acar, S., Dumas, D., & Berthiaume, K. (2023). Beyond semantic distance: Automated scoring of divergent thinking greatly improves with large language models. Thinking Skills and Creativity, 49, 101356.
Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532-1543.
Reiter-Palmon, R., Forthmann, B., & Barbot, B. (2019). Scoring divergent thinking tests: A review and systematic framework. Psychology of Aesthetics, Creativity, and the Arts, 13(2), 144-152.
Runco, M. A., & Jaeger, G. J. (2012). The standard definition of creativity. Creativity Research Journal, 24(1), 92-96.
Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2), 420-428.
Silvia, P. J., Winterstein, B. P., Willse, J. T., et al. (2008). Assessing creativity with divergent thinking tasks: Exploring the reliability and validity of new subjective scoring methods. Psychology of Aesthetics, Creativity, and the Arts, 2(2), 68-85.
Wallach, M. A., & Kogan, N. (1965). Modes of thinking in young children. Holt, Rinehart and Winston.
See references/scoring-rubric.md for detailed scoring examples and training materials.