# Psychometrics and educational assessment design for researchers
A skill for designing, validating, and analyzing educational assessments using modern psychometric methods. Covers classical test theory, item response theory, test construction, validity evidence, and computerized adaptive testing.
Classical test theory (CTT) models each observed score as the sum of a true score and random error:

X = T + E

Reliability is then the proportion of observed-score variance attributable to true scores: rho = Var(T) / Var(X).
Key reliability coefficients:
| Coefficient | Method | Interpretation |
|---|---|---|
| Cronbach's alpha | Internal consistency | Homogeneity of items |
| Test-retest | Stability over time | Temporal consistency |
| Parallel forms | Equivalent test versions | Form equivalence |
| Split-half (Spearman-Brown) | Odd-even item split | Internal consistency |
| Inter-rater (Cohen's kappa) | Multiple raters | Scoring agreement |
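As a concrete illustration of the first row, coefficient alpha can be computed directly from an item-response matrix. This is a minimal sketch (the `cronbach_alpha` helper name is ours, not from a library), assuming items as columns and one examinee per row:

```python
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Cronbach's alpha: (k/(k-1)) * (1 - sum of item variances / total-score variance)."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances / total_variance)

# Two perfectly correlated items yield the maximum alpha
perfect = pd.DataFrame({"q1": [1, 0, 1, 0], "q2": [1, 0, 1, 0]})
print(cronbach_alpha(perfect))  # 1.0
```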
```python
import pandas as pd

def item_analysis(responses: pd.DataFrame, total_scores: pd.Series) -> pd.DataFrame:
    """
    Classical item analysis: difficulty, discrimination, point-biserial.
    responses: binary DataFrame (1=correct, 0=incorrect), items as columns.
    total_scores: total test score for each examinee.
    """
    results = []
    for item in responses.columns:
        scores = responses[item]
        difficulty = scores.mean()  # p-value (proportion correct)
        # Point-biserial correlation with the total score
        corr = scores.corr(total_scores)
        # Upper-lower discrimination index (top vs. bottom 27%)
        cutoff_high = total_scores.quantile(0.73)
        cutoff_low = total_scores.quantile(0.27)
        upper = scores[total_scores >= cutoff_high].mean()
        lower = scores[total_scores <= cutoff_low].mean()
        discrimination = upper - lower
        results.append({
            "item": item,
            "difficulty": round(difficulty, 3),
            "discrimination": round(discrimination, 3),
            "point_biserial": round(corr, 3),
            "flag": "review" if difficulty < 0.2 or difficulty > 0.9
                    or discrimination < 0.2 else "ok",
        })
    return pd.DataFrame(results)
```
IRT provides a more rigorous framework than CTT by modeling the probability of a correct response as a function of ability and item parameters:
```python
import numpy as np

def irt_3pl(theta: float, a: float, b: float, c: float) -> float:
    """
    Three-parameter logistic (3PL) IRT model.
    theta: examinee ability (typically -3 to +3)
    a: discrimination parameter (slope, typically 0.5 to 2.5)
    b: difficulty parameter (location, same scale as theta)
    c: guessing parameter (lower asymptote, typically 0.0 to 0.35)
    Returns: probability of a correct response
    """
    exponent = -a * (theta - b)
    return c + (1 - c) / (1 + np.exp(exponent))

# Item characteristic curves for three items
thetas = np.linspace(-3, 3, 100)
item_easy = [irt_3pl(t, a=1.0, b=-1.0, c=0.2) for t in thetas]
item_medium = [irt_3pl(t, a=1.5, b=0.0, c=0.2) for t in thetas]
item_hard = [irt_3pl(t, a=1.2, b=1.5, c=0.2) for t in thetas]
```
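Each item's contribution to measurement precision is quantified by its Fisher information function, which CAT item selection (below) relies on. A minimal sketch for the 3PL model, redefining the response probability inline so the snippet is self-contained:

```python
import numpy as np

def irt_3pl_info(theta: float, a: float, b: float, c: float) -> float:
    """Fisher information of a 3PL item at ability theta."""
    p = c + (1 - c) / (1 + np.exp(-a * (theta - b)))
    return a**2 * ((1 - p) / p) * ((p - c) / (1 - c)) ** 2

# With c = 0, information peaks exactly at theta = b with value a^2 / 4
info_at_b = irt_3pl_info(0.0, a=1.5, b=0.0, c=0.0)  # 1.5**2 * 0.25 = 0.5625
```

Note that a nonzero guessing parameter reduces information everywhere and shifts its peak slightly above b.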
Model fitting is commonly done with the `mirt` package in R (standalone, or called from Python via rpy2). Fitting a 2PL model:

```r
library(mirt)
# responses: binary matrix (examinees x items)
mod <- mirt(responses, model = 1, itemtype = "2PL")
# Item parameters
coef(mod, simplify = TRUE)
# Ability estimates (Expected A Posteriori)
theta_hat <- fscores(mod, method = "EAP")
# Model fit
M2(mod)                           # limited-information fit statistic
itemfit(mod, fit_stats = "S_X2")  # item-level fit
```
| Model | Parameters | Use Case |
|---|---|---|
| Rasch (1PL) | b only | Equal discrimination assumed; measurement-focused |
| 2PL | a, b | Different discrimination; general purpose |
| 3PL | a, b, c | Multiple choice with guessing |
| Graded Response | a, b_k | Likert-scale or partial credit items |
| Nominal Response | a_k, c_k | Multiple choice with informative distractors |
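For the polytomous case, the Graded Response row of the table can be sketched as follows. `grm_category_probs` is an illustrative helper name (not from the source); it computes Samejima-style category probabilities as differences of cumulative 2PL curves over K increasing thresholds b_k:

```python
import numpy as np

def grm_category_probs(theta: float, a: float, b_thresholds) -> np.ndarray:
    """
    Graded Response Model: probabilities of K+1 ordered categories,
    given K strictly increasing threshold parameters b_k.
    """
    b = np.asarray(b_thresholds, dtype=float)
    # Cumulative probabilities P(X >= k): one 2PL curve per threshold
    p_star = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    cum = np.concatenate(([1.0], p_star, [0.0]))
    # Category probabilities are differences of adjacent cumulative curves
    return cum[:-1] - cum[1:]

# Four-category (e.g., Likert) item with three thresholds
probs = grm_category_probs(0.5, a=1.2, b_thresholds=[-1.0, 0.0, 1.0])
```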
Following the Standards for Educational and Psychological Testing (AERA/APA/NCME, 2014), validity is a unitary concept supported by five sources of evidence: test content, response processes, internal structure, relations to other variables, and consequences of testing. Internal-structure evidence is often gathered with factor analysis:
```python
import pandas as pd
from factor_analyzer import FactorAnalyzer

# Exploratory factor analysis: check dimensionality of item_responses
fa = FactorAnalyzer(n_factors=3, rotation="promax")
fa.fit(item_responses)

# Eigenvalues for a scree plot
eigenvalues, _ = fa.get_eigenvalues()
print("Eigenvalues:", eigenvalues[:10])

# Factor loadings
loadings = pd.DataFrame(
    fa.loadings_,
    columns=["Factor1", "Factor2", "Factor3"],
    index=item_names,
)
print(loadings.round(3))
```
Computerized adaptive testing selects items in real time to match examinee ability:
```text
Initialize: theta_0 = 0 (prior mean)
For each item i = 1, 2, ..., until a stopping rule is met:
  1. Select the item with maximum Fisher information at the current theta
  2. Administer the item and observe the response
  3. Update the theta estimate using maximum likelihood or Bayesian EAP
  4. Check stopping rules:
     - Fixed length (e.g., 30 items)
     - SE(theta) < threshold (e.g., 0.30)
     - Maximum time reached
Return: final theta estimate and standard error
```
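The loop above can be sketched end to end as a simulation. This is a minimal illustration, not a production CAT engine: `run_cat`, `eap_estimate`, and the randomly generated 3PL item bank are our own assumptions, with the EAP update done on a quadrature grid under a standard normal prior:

```python
import numpy as np

def p_3pl(theta, a, b, c):
    """3PL probability of a correct response."""
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

def item_info(theta, a, b, c):
    """Fisher information of a 3PL item at theta."""
    p = p_3pl(theta, a, b, c)
    return a**2 * ((1 - p) / p) * ((p - c) / (1 - c)) ** 2

def eap_estimate(responses, items, grid=np.linspace(-4, 4, 161)):
    """EAP ability estimate and posterior SD under a standard normal prior."""
    posterior = np.exp(-grid**2 / 2.0)  # prior, up to a constant
    for x, (a, b, c) in zip(responses, items):
        p = p_3pl(grid, a, b, c)
        posterior *= p**x * (1.0 - p) ** (1 - x)
    posterior /= posterior.sum()
    mean = (grid * posterior).sum()
    sd = np.sqrt(((grid - mean) ** 2 * posterior).sum())
    return mean, sd

def run_cat(bank, true_theta, rng, max_items=30, se_stop=0.30):
    """Simulate one CAT session: max-information selection, EAP update."""
    available = list(range(len(bank)))
    items, responses = [], []
    theta, se = 0.0, np.inf
    while available and len(items) < max_items and se > se_stop:
        # 1. Select the available item with maximum information at theta
        best = max(available, key=lambda i: item_info(theta, *bank[i]))
        available.remove(best)
        # 2. Administer: simulate a response from the true ability
        x = int(rng.random() < p_3pl(true_theta, *bank[best]))
        items.append(bank[best])
        responses.append(x)
        # 3-4. Update theta via EAP; stopping rules are the loop condition
        theta, se = eap_estimate(responses, items)
    return theta, se, len(items)

rng = np.random.default_rng(42)
bank = [(rng.uniform(0.8, 2.0), rng.uniform(-2.5, 2.5), 0.2) for _ in range(60)]
theta_hat, se, n_items = run_cat(bank, true_theta=1.0, rng=rng)
```

With an informative bank, the SE-based rule typically stops the test well before the fixed-length cap.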
To prevent overuse of high-quality items and maintain test security: