Audit a proposed assessment for construct validity, reliability, and alignment to learning objectives. Use when reviewing or quality-assuring assessments before deployment.
Evaluates a proposed assessment against three dimensions: validity (does it measure what it claims to measure?), reliability (would different markers agree on the score?), and authenticity (is the task meaningful and does it require genuine demonstration of the intended learning?). The output identifies specific threats to validity — construct-irrelevant variance (the assessment measures something other than what it claims), construct underrepresentation (the assessment doesn't cover enough of what it claims to measure), and consequential validity problems (unintended negative effects of the assessment) — and provides specific, actionable recommendations for each threat. AI is specifically valuable here because most teacher-designed assessments contain validity threats that are invisible without explicit analytical frameworks — a teacher designing a "reading comprehension" test may inadvertently create a writing test, or a "science understanding" assessment may actually test literacy.
Messick (1989) unified the concept of validity into a single framework: validity is not a property of a test but of the interpretation and use of test scores. A test is not "valid" or "invalid" in the abstract — it is valid FOR a specific purpose with a specific population. This means every assessment must be evaluated against its intended use. Wiliam (2011) applied this framework to classroom assessment, showing that the most common validity threat in teacher-designed assessment is construct-irrelevant variance — where the assessment measures something other than the intended construct. For example, a group presentation assessed for "understanding of climate change" may actually measure public speaking confidence, group dynamics, and technology skills more than climate change understanding. Kane (2006) proposed a validation-as-argument approach: the validity of an assessment depends on the strength of the chain of reasoning from the task → the response → the score → the interpretation → the decision. Any weak link in this chain is a validity threat. Brookhart (2003) adapted measurement theory for classroom contexts, arguing that classroom assessments need not meet the same psychometric standards as standardised tests but must still demonstrate that they measure what they claim. Stobart (2008) highlighted consequential validity — the effects of assessment on learning. If an assessment drives students toward surface learning, test anxiety, or strategic behaviour rather than genuine engagement, its consequential validity is compromised.
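One practical way to surface Wiliam's construct-irrelevant variance before any student sits the assessment is to tally what share of the marks each construct actually carries. A minimal sketch in Python, assuming each marking criterion can be tagged with the construct it mainly rewards (the tags are the auditor's judgement, and the 8/6/6 scheme is taken from the worked example later in this document):

```python
# A minimal sketch: tally the share of marks each construct carries.
# The construct tags are the auditor's judgement, not part of the
# teacher's mark scheme.
mark_scheme = {
    "content": {"marks": 8, "construct": "climate change understanding"},
    "visual design": {"marks": 6, "construct": "visual design skill"},
    "presentation": {"marks": 6, "construct": "public speaking"},
}

total = sum(criterion["marks"] for criterion in mark_scheme.values())

# Sum marks per construct rather than per criterion.
weights: dict[str, int] = {}
for criterion in mark_scheme.values():
    construct = criterion["construct"]
    weights[construct] = weights.get(construct, 0) + criterion["marks"]

for construct, marks in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{construct}: {marks}/{total} = {marks / total:.0%}")
# climate change understanding: 8/20 = 40%
# visual design skill: 6/20 = 30%
# public speaking: 6/20 = 30%
```

If the intended construct carries a minority of the total marks, as here, the construct-irrelevant variance is structural: no amount of careful marking can correct it.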
The teacher must provide:
- **Assessment description** (the task, format, and marking scheme)
- **Intended learning** (what the assessment claims to measure)
- **Student level**

Optional (injected by context engine if available):
- **Subject area**
- **Assessment purpose**
- **Marking approach**
- **Stakes**
You are an expert in educational assessment and measurement, with deep knowledge of Messick's (1989) unified validity framework, Wiliam's (2011) approach to classroom assessment validity, Kane's (2006) argument-based validation, and Stobart's (2008) work on consequential validity. You understand that validity is not a property of the test itself but of the interpretation and use of the scores — an assessment is valid FOR a specific purpose, and the same assessment may be valid for one purpose but invalid for another.
Your task is to evaluate the validity of:
**Assessment description:** {{assessment_description}}
**Intended learning:** {{intended_learning}}
**Student level:** {{student_level}}
The following optional context may or may not be provided. Use whatever is available; ignore any fields marked "not provided."
**Subject area:** {{subject_area}} — if not provided, infer from the assessment description.
**Assessment purpose:** {{assessment_purpose}} — if not provided, infer from the description and stakes.
**Marking approach:** {{marking_approach}} — if not provided, note that marking approach affects reliability and recommend one.
**Stakes:** {{stakes}} — if not provided, analyse for both low-stakes and high-stakes use.
Analyse across these dimensions:
1. **Construct validity (Messick, 1989):**
- Does the assessment task actually require demonstration of the intended learning?
- **Construct-irrelevant variance:** Does the assessment inadvertently measure something ELSE? (E.g., a science assessment that requires essay writing also measures literacy; a group project also measures collaboration and social dynamics.)
- **Construct underrepresentation:** Does the assessment cover enough of the intended construct? (E.g., a test on "understanding photosynthesis" that only asks factual recall questions doesn't assess understanding — it assesses memorisation.)
2. **Content validity:**
- Does the assessment sample appropriately from the domain of intended learning?
- Are important aspects of the learning missing from the assessment?
- Are there aspects of the assessment that go beyond the intended learning?
3. **Reliability (Brookhart, 2003):**
- Would different markers give the same score? (Inter-rater reliability)
- Would the same student get a similar score on a different day? (Test-retest reliability)
- Is the marking scheme clear enough to be applied consistently?
4. **Consequential validity (Stobart, 2008):**
- What behaviours will this assessment drive? Will students engage in genuine learning or strategic/surface-level preparation?
- Does the assessment create unfair barriers for specific groups (EAL students, students with learning differences)?
- Does the assessment format match the learning it claims to measure?
5. **Authenticity:**
- Does the task require genuine demonstration of the learning, or can a student perform well without actually having learned the intended content?
- Can the task be completed through memorisation, copying, or procedural compliance without understanding?
Return your output in this exact format:
## Assessment Validity Analysis
**Assessment:** [Brief description]
**Claims to measure:** [Intended learning]
**For:** [Student level]
### Validity Analysis
**Construct validity:** [Analysis of whether the assessment measures what it claims]
**Construct-irrelevant variance:** [What ELSE the assessment measures that isn't intended]
**Construct underrepresentation:** [What aspects of the intended learning are NOT assessed]
### Reliability Analysis
[Analysis of marking consistency and specific reliability concerns]
### Consequential Validity
[What learning behaviours the assessment is likely to drive]
### Threats Identified
[Numbered list of specific validity threats, each with an explanation]
### Recommendations
[Specific, actionable modifications to address each identified threat]
### Overall Verdict
[Summary judgement: is this assessment fit for its stated purpose? What is the single most important modification?]
**Self-check before returning output:** Verify that (a) the analysis addresses construct validity, reliability, and consequential validity, (b) each threat is specific and explained, (c) each recommendation directly addresses an identified threat, (d) the analysis considers the assessment's purpose and stakes, and (e) the verdict is constructive — identifying what to fix, not just what's wrong.
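The inter-rater reliability question in dimension 3 is ultimately empirical: once two markers grade the same sample of scripts independently, their agreement can be quantified rather than assumed. A minimal sketch using Cohen's kappa, with hypothetical grades:

```python
# A minimal sketch, assuming two markers have independently graded the same
# scripts on a shared grade scale. The grades below are hypothetical.
from collections import Counter

def cohens_kappa(marker_a: list[str], marker_b: list[str]) -> float:
    """Chance-corrected agreement between two markers' grades."""
    n = len(marker_a)
    observed = sum(a == b for a, b in zip(marker_a, marker_b)) / n
    freq_a, freq_b = Counter(marker_a), Counter(marker_b)
    # Agreement expected by chance if each marker graded independently,
    # keeping their own marginal grade distributions.
    expected = sum(freq_a[g] * freq_b[g] for g in freq_a) / n**2
    return (observed - expected) / (1 - expected)

grades_a = ["A", "B", "B", "C", "A", "B", "C", "C"]
grades_b = ["A", "B", "C", "C", "A", "B", "B", "C"]
print(f"kappa = {cohens_kappa(grades_a, grades_b):.2f}")  # kappa = 0.62
```

Kappa near 1 indicates strong chance-corrected agreement; kappa near 0 means the markers agree no more often than chance, which usually points to an underspecified marking scheme.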
Scenario:
**Assessment description:** "Students create a poster about the causes of climate change and present it to the class. The poster must include at least 3 causes, a visual element (graph or diagram), and be 'visually appealing.' Marked out of 20: content (8 marks), visual design (6 marks), presentation (6 marks)."
**Intended learning:** "Students understand the causes and effects of climate change."
**Student level:** Year 9
**Subject area:** Geography
## Assessment Validity Analysis
**Assessment:** Poster presentation on climate change causes
**Claims to measure:** Understanding of the causes and effects of climate change
**For:** Year 9 Geography
### Validity Analysis
**Construct validity:** MODERATE CONCERN. The assessment claims to measure understanding of climate change causes and effects, but the task and marking scheme assess three constructs simultaneously: geography knowledge (8/20 = 40%), visual design skill (6/20 = 30%), and public speaking skill (6/20 = 30%). A student who deeply understands climate change but cannot draw or speak confidently will score lower than a student with superficial knowledge who presents a beautifully designed, confidently delivered poster. The assessment does not primarily measure what it claims to.
**Construct-irrelevant variance:** Visual design skill and public speaking confidence together carry 12 of the 20 marks (60%), yet neither is part of the intended learning. The criterion "visually appealing" rewards aesthetic talent rather than geographic communication, and the live presentation format rewards oracy and confidence. Students with strong understanding but weak drawing or speaking skills are systematically penalised.
**Construct underrepresentation:** The intended learning names both causes and effects of climate change, but the task requires only causes; effects are entirely unassessed. Within causes, the requirement of "at least 3 causes" can be satisfied by listing without explanation, so the depth of understanding the assessment claims to measure is barely sampled.
### Recommendations
1. **Restructure the mark allocation.** Content should be at least 70% of the total marks. Reduce visual design to 0 marks (remove it as a criterion entirely) or cap at 2 marks for "clarity of visual communication" (whether the diagram/graph communicates information effectively — not whether it looks pretty). Reduce presentation to 4 marks maximum, with a rubric focused on clarity of explanation, not confidence of delivery.
2. **Require explanation, not just listing.** Change "at least 3 causes" to "explain 3 causes of climate change, including HOW each cause contributes to increased greenhouse gas concentrations." This requires understanding, not recall.
3. **Add the missing construct.** Add a requirement to address effects as well as causes — or change the intended learning to match the assessment. Alignment must go both ways.
4. **Replace "visually appealing" with "clear and communicative."** If visual elements remain, assess whether the graph/diagram communicates geographic information accurately — not whether it looks attractive.
5. **Provide an alternative to live presentation.** Allow students to submit a recorded explanation, a written paragraph, or a one-to-one explanation to the teacher instead of a class presentation. This removes the public speaking barrier while still assessing whether the student can explain their understanding.
6. **Create a content rubric.** Replace the 8-mark content score with a descriptive rubric that distinguishes between listing causes (lower level), explaining causes (middle level), and analysing the relationships and relative significance of causes (higher level), as sketched below the verdict.
### Overall Verdict
As currently designed, this assessment is not fit for purpose. It claims to measure understanding of climate change but actually measures a blend of geography knowledge (40%), art skills (30%), and public speaking (30%). The single most important modification is restructuring the mark allocation so that content understanding dominates the assessment. If a student deeply understands climate change causes and effects, they should score highly — regardless of their poster design or presentation confidence.
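Recommendation 6 can be made concrete by drafting the rubric as structured data before writing it up for students. A minimal sketch, assuming the reweighted scheme from recommendation 1 (content worth 14 of 20 marks); the descriptors and mark bands are illustrative drafts, not a validated rubric:

```python
# A minimal sketch of recommendation 6 as structured data. The descriptors
# and mark bands are illustrative drafts, not a validated rubric. Bands
# assume content has been reweighted to 14/20 per recommendation 1.
content_rubric = [
    {"level": "Listing", "marks": (1, 5),
     "descriptor": "Names causes of climate change without explanation."},
    {"level": "Explaining", "marks": (6, 10),
     "descriptor": "Explains how each cause contributes to increased "
                   "greenhouse gas concentrations."},
    {"level": "Analysing", "marks": (11, 14),
     "descriptor": "Analyses relationships between causes and weighs their "
                   "relative significance."},
]

for band in content_rubric:
    low, high = band["marks"]
    print(f"{band['level']} ({low}-{high} marks): {band['descriptor']}")
```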
The analysis evaluates the assessment as described, which may differ from how it's implemented. A teacher who marks generously on design and strictly on content may partially compensate for the mark allocation issue in practice — but the structural problem remains. The assessment's design, not just its implementation, should be valid.
Validity is always relative to purpose. This analysis evaluates validity for the STATED purpose (measuring understanding of climate change). If the assessment's actual purpose includes developing presentation skills, the validity analysis would differ — but the assessment should then be labelled as measuring multiple constructs.
Some validity threats are trade-offs, not errors. Including a presentation component may have legitimate pedagogical reasons (building oracy skills, developing confidence). The analysis identifies the validity cost of these design choices — the teacher must decide whether the pedagogical benefits justify the validity compromise. The key is being transparent about what the assessment actually measures.