# Psychometrics and educational assessment design for researchers
A skill for designing, validating, and analyzing educational assessments using modern psychometric methods. Covers classical test theory, item response theory, test construction, validity evidence, and computerized adaptive testing.
Classical test theory (CTT) models each observed score as the sum of a true score and random error:

X = T + E

Reliability is then the proportion of observed-score variance attributable to true scores: rho = Var(T) / Var(X).
Key reliability coefficients:
| Coefficient | Method | Interpretation |
|---|---|---|
| Cronbach's alpha | Internal consistency | Homogeneity of items |
| Test-retest | Stability over time | Temporal consistency |
| Parallel forms | Equivalent test versions | Form equivalence |
| Split-half (Spearman-Brown) | Odd-even item split | Internal consistency |
| Inter-rater (Cohen's kappa) | Multiple raters | Scoring agreement |
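As a concrete illustration of the first row, coefficient alpha can be computed directly from an item-response matrix. This is a minimal sketch (the `cronbach_alpha` helper name is ours, not from a library), assuming items as columns and one examinee per row:

```python
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Cronbach's alpha: (k/(k-1)) * (1 - sum of item variances / total-score variance)."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances / total_variance)

# Two perfectly correlated items yield the maximum alpha
perfect = pd.DataFrame({"q1": [1, 0, 1, 0], "q2": [1, 0, 1, 0]})
print(cronbach_alpha(perfect))  # 1.0
```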
```python
import pandas as pd

def item_analysis(responses: pd.DataFrame, total_scores: pd.Series) -> pd.DataFrame:
    """
    Classical item analysis: difficulty, discrimination, point-biserial.
    responses: binary DataFrame (1=correct, 0=incorrect), items as columns.
    total_scores: total test score for each examinee.
    """
    results = []
    for item in responses.columns:
        scores = responses[item]
        difficulty = scores.mean()  # p-value (proportion correct)
        # Point-biserial correlation with the total score
        corr = scores.corr(total_scores)
        # Upper-lower discrimination index (top vs. bottom 27%)
        cutoff_high = total_scores.quantile(0.73)
        cutoff_low = total_scores.quantile(0.27)
        upper = scores[total_scores >= cutoff_high].mean()
        lower = scores[total_scores <= cutoff_low].mean()
        discrimination = upper - lower
        results.append({
            "item": item,
            "difficulty": round(difficulty, 3),
            "discrimination": round(discrimination, 3),
            "point_biserial": round(corr, 3),
            "flag": "review" if difficulty < 0.2 or difficulty > 0.9
                    or discrimination < 0.2 else "ok",
        })
    return pd.DataFrame(results)
```
IRT provides a more rigorous framework than CTT by modeling the probability of a correct response as a function of ability and item parameters:
```python
import numpy as np

def irt_3pl(theta: float, a: float, b: float, c: float) -> float:
    """
    Three-parameter logistic (3PL) IRT model.
    theta: examinee ability (typically -3 to +3)
    a: discrimination parameter (slope, typically 0.5 to 2.5)
    b: difficulty parameter (location, same scale as theta)
    c: guessing parameter (lower asymptote, typically 0.0 to 0.35)
    Returns: probability of a correct response
    """
    exponent = -a * (theta - b)
    return c + (1 - c) / (1 + np.exp(exponent))

# Item characteristic curves for three items
thetas = np.linspace(-3, 3, 100)
item_easy = [irt_3pl(t, a=1.0, b=-1.0, c=0.2) for t in thetas]
item_medium = [irt_3pl(t, a=1.5, b=0.0, c=0.2) for t in thetas]
item_hard = [irt_3pl(t, a=1.2, b=1.5, c=0.2) for t in thetas]
```
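Each item's contribution to measurement precision is quantified by its Fisher information function, which CAT item selection (below) relies on. A minimal sketch for the 3PL model, redefining the response probability inline so the snippet is self-contained:

```python
import numpy as np

def irt_3pl_info(theta: float, a: float, b: float, c: float) -> float:
    """Fisher information of a 3PL item at ability theta."""
    p = c + (1 - c) / (1 + np.exp(-a * (theta - b)))
    return a**2 * ((1 - p) / p) * ((p - c) / (1 - c)) ** 2

# With c = 0, information peaks exactly at theta = b with value a^2 / 4
info_at_b = irt_3pl_info(0.0, a=1.5, b=0.0, c=0.0)  # 1.5**2 * 0.25 = 0.5625
```

Note that a nonzero guessing parameter reduces information everywhere and shifts its peak slightly above b.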
Model fitting is commonly done with the `mirt` package in R (standalone, or called from Python via rpy2). Fitting a 2PL model:

```r
library(mirt)
# responses: binary matrix (examinees x items)
mod <- mirt(responses, model = 1, itemtype = "2PL")
# Item parameters
coef(mod, simplify = TRUE)
# Ability estimates (Expected A Posteriori)
theta_hat <- fscores(mod, method = "EAP")
# Model fit
M2(mod)                           # limited-information fit statistic
itemfit(mod, fit_stats = "S_X2")  # item-level fit
```
| Model | Parameters | Use Case |
|---|---|---|
| Rasch (1PL) | b only | Equal discrimination assumed; measurement-focused |
| 2PL | a, b | Different discrimination; general purpose |
| 3PL | a, b, c | Multiple choice with guessing |
| Graded Response | a, b_k | Likert-scale or partial credit items |
| Nominal Response | a_k, c_k | Multiple choice with informative distractors |
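For the polytomous case, the Graded Response row of the table can be sketched as follows. `grm_category_probs` is an illustrative helper name (not from the source); it computes Samejima-style category probabilities as differences of cumulative 2PL curves over K increasing thresholds b_k:

```python
import numpy as np

def grm_category_probs(theta: float, a: float, b_thresholds) -> np.ndarray:
    """
    Graded Response Model: probabilities of K+1 ordered categories,
    given K strictly increasing threshold parameters b_k.
    """
    b = np.asarray(b_thresholds, dtype=float)
    # Cumulative probabilities P(X >= k): one 2PL curve per threshold
    p_star = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    cum = np.concatenate(([1.0], p_star, [0.0]))
    # Category probabilities are differences of adjacent cumulative curves
    return cum[:-1] - cum[1:]

# Four-category (e.g., Likert) item with three thresholds
probs = grm_category_probs(0.5, a=1.2, b_thresholds=[-1.0, 0.0, 1.0])
```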
Following the Standards for Educational and Psychological Testing (AERA/APA/NCME, 2014), validity is a unitary concept supported by five sources of evidence: test content, response processes, internal structure, relations to other variables, and consequences of testing. Internal-structure evidence is often gathered with factor analysis:
```python
import pandas as pd
from factor_analyzer import FactorAnalyzer

# Exploratory factor analysis: check dimensionality of item_responses
fa = FactorAnalyzer(n_factors=3, rotation="promax")
fa.fit(item_responses)

# Eigenvalues for a scree plot
eigenvalues, _ = fa.get_eigenvalues()
print("Eigenvalues:", eigenvalues[:10])

# Factor loadings
loadings = pd.DataFrame(
    fa.loadings_,
    columns=["Factor1", "Factor2", "Factor3"],
    index=item_names,
)
print(loadings.round(3))
```
Computerized adaptive testing selects items in real time to match examinee ability:
```text
Initialize: theta_0 = 0 (prior mean)
For each item i = 1, 2, ..., until a stopping rule is met:
  1. Select the item with maximum Fisher information at the current theta
  2. Administer the item and observe the response
  3. Update the theta estimate using maximum likelihood or Bayesian EAP
  4. Check stopping rules:
     - Fixed length (e.g., 30 items)
     - SE(theta) < threshold (e.g., 0.30)
     - Maximum time reached
Return: final theta estimate and standard error
```
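The loop above can be sketched end to end as a simulation. This is a minimal illustration, not a production CAT engine: `run_cat`, `eap_estimate`, and the randomly generated 3PL item bank are our own assumptions, with the EAP update done on a quadrature grid under a standard normal prior:

```python
import numpy as np

def p_3pl(theta, a, b, c):
    """3PL probability of a correct response."""
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

def item_info(theta, a, b, c):
    """Fisher information of a 3PL item at theta."""
    p = p_3pl(theta, a, b, c)
    return a**2 * ((1 - p) / p) * ((p - c) / (1 - c)) ** 2

def eap_estimate(responses, items, grid=np.linspace(-4, 4, 161)):
    """EAP ability estimate and posterior SD under a standard normal prior."""
    posterior = np.exp(-grid**2 / 2.0)  # prior, up to a constant
    for x, (a, b, c) in zip(responses, items):
        p = p_3pl(grid, a, b, c)
        posterior *= p**x * (1.0 - p) ** (1 - x)
    posterior /= posterior.sum()
    mean = (grid * posterior).sum()
    sd = np.sqrt(((grid - mean) ** 2 * posterior).sum())
    return mean, sd

def run_cat(bank, true_theta, rng, max_items=30, se_stop=0.30):
    """Simulate one CAT session: max-information selection, EAP update."""
    available = list(range(len(bank)))
    items, responses = [], []
    theta, se = 0.0, np.inf
    while available and len(items) < max_items and se > se_stop:
        # 1. Select the available item with maximum information at theta
        best = max(available, key=lambda i: item_info(theta, *bank[i]))
        available.remove(best)
        # 2. Administer: simulate a response from the true ability
        x = int(rng.random() < p_3pl(true_theta, *bank[best]))
        items.append(bank[best])
        responses.append(x)
        # 3-4. Update theta via EAP; stopping rules are the loop condition
        theta, se = eap_estimate(responses, items)
    return theta, se, len(items)

rng = np.random.default_rng(42)
bank = [(rng.uniform(0.8, 2.0), rng.uniform(-2.5, 2.5), 0.2) for _ in range(60)]
theta_hat, se, n_items = run_cat(bank, true_theta=1.0, rng=rng)
```

With an informative bank, the SE-based rule typically stops the test well before the fixed-length cap.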
To prevent overuse of high-quality items and maintain test security: