Design and implement adaptive testing systems using Item Response Theory (IRT). Use when working with computerized adaptive tests (CAT), psychometric assessment, ability estimation, question calibration, test design, or IRT models (1PL/2PL/3PL). Covers test algorithms, stopping rules, item selection strategies, and practical implementation patterns for K-12, certification, placement, and diagnostic assessments.
Design computerized adaptive tests that measure ability efficiently and accurately using Item Response Theory.
Adaptive tests adjust difficulty in real-time based on student responses. A correct answer → harder question. Incorrect → easier question. The result: accurate ability estimates in ~50% fewer questions than fixed-length tests.
Key advantage: Traditional tests waste time on too-easy or too-hard questions. Adaptive tests spend time where measurement matters most — near the student's ability level.
| You need to... | See |
|---|---|
| Understand IRT models and parameters | IRT Fundamentals |
| Design a new adaptive test | Test Design Workflow |
| Choose item selection algorithm | Item Selection |
| Decide when to stop the test | Stopping Rules |
| Calibrate new questions | references/calibration.md |
| Implement CAT algorithm | references/implementation.md |
Most adaptive tests use the 3PL model. Each question has three parameters:

- **a** (discrimination): how sharply the item separates students just below vs. just above its difficulty
- **b** (difficulty): the ability level at which the item is most informative
- **c** (guessing): the lower asymptote, i.e., the chance a very low-ability student answers correctly by guessing
Probability of correct response:
P(correct | ability, a, b, c) = c + (1 - c) / (1 + e^(-a(ability - b)))
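As a quick sketch, the formula above translates directly to code (the function name and signature here are illustrative):

```python
import math

def p_correct(theta: float, a: float, b: float, c: float) -> float:
    """3PL: probability of a correct response at ability theta."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# When ability equals difficulty (theta == b) the logistic term is 0.5,
# so P = c + (1 - c) / 2 regardless of a.
print(round(p_correct(0.0, a=1.2, b=0.0, c=0.2), 2))  # → 0.6
```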
Simpler models:

- **2PL**: fixes c = 0 (no guessing parameter)
- **1PL (Rasch)**: fixes c = 0 and a = 1, leaving only difficulty
Use 3PL for high-stakes tests. Use 2PL/1PL when sample size is small (<500 responses per item).
Information measures how precisely an item estimates ability at a given level. Peak information occurs when ability ≈ difficulty (b parameter).
Standard Error (SE) is the reciprocal of the square root of total information:
SE = 1 / sqrt(Information)
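A minimal sketch of item information and SE. The Fisher information formula for the 3PL, I(θ) = a²·((1−P)/P)·((P−c)/(1−c))², is not stated above but is consistent with the 3PL probability given earlier; with c = 0 it reduces to the familiar 2PL form a²·P·(1−P):

```python
import math

def p_correct(theta, a, b, c):
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

def item_information(theta, a, b, c):
    """Fisher information of a single 3PL item at ability theta."""
    p = p_correct(theta, a, b, c)
    return a ** 2 * ((1 - p) / p) * ((p - c) / (1 - c)) ** 2

def standard_error(items, theta):
    """SE = 1 / sqrt(sum of informations of administered items)."""
    total = sum(item_information(theta, *it) for it in items)
    return 1 / math.sqrt(total)

# 2PL case (c = 0): information peaks at theta == b with value a^2 / 4.
print(round(item_information(0.0, 1.0, 0.0, 0.0), 4))  # → 0.25
```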
Goal of CAT: Maximize information (minimize SE) at the student's true ability level.
Minimum bank size: 10× the average test length. For a 20-item CAT, you need ≥200 calibrated items.
Distribution targets:

- Difficulty (b): spread across roughly -2 to +2, denser where most of your students fall
- Discrimination (a): favor higher-a items; low-a items contribute little information anywhere
Content balancing: If testing math, ensure geometry/algebra/etc. are proportionally represented.
Pick one from each category:

- **Item selection**: see Item Selection below
- **Ability estimation**: EAP (robust default) or MLE with a fallback for all-correct/all-incorrect patterns
- **Stopping rule**: see Stopping Rules below
Before going live, simulate 1000+ test sessions with known abilities. Check:

- Bias: mean difference between estimated and true ability is near zero across the ability range
- Precision: SE meets its target at all ability levels, not just in the middle
- Exposure: no item appears in a large share of sessions
- Length: average and maximum test lengths are acceptable

Adjust the bank, selection rule, or stopping rule if needed.
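The simulation step can be sketched end to end. This toy harness (hypothetical 60-item bank, fixed 15-item length, maximum-information selection with a grid-based EAP estimator; all settings are illustrative) recovers known abilities and reports mean bias:

```python
import math, random

def p3(theta, it):
    return it["c"] + (1 - it["c"]) / (1 + math.exp(-it["a"] * (theta - it["b"])))

def info(theta, it):
    p = p3(theta, it)
    return it["a"] ** 2 * ((1 - p) / p) * ((p - it["c"]) / (1 - it["c"])) ** 2

def eap(resp, items):
    """EAP ability estimate on a grid with a standard-normal prior."""
    grid = [x / 10 for x in range(-40, 41)]
    weights = []
    for t in grid:
        w = math.exp(-0.5 * t * t)  # N(0, 1) prior (constants cancel)
        for r, it in zip(resp, items):
            p = p3(t, it)
            w *= p if r else (1 - p)
        weights.append(w)
    z = sum(weights)
    return sum(t * w for t, w in zip(grid, weights)) / z

def simulate_session(true_theta, bank, length, rng):
    theta, used, resp, seen = 0.0, set(), [], []
    for _ in range(length):
        item = max((i for i in bank if i["id"] not in used),
                   key=lambda i: info(theta, i))  # maximum-information pick
        used.add(item["id"])
        resp.append(1 if rng.random() < p3(true_theta, item) else 0)
        seen.append(item)
        theta = eap(resp, seen)
    return theta

rng = random.Random(42)
bank = [{"id": k, "a": 0.8 + rng.random(), "b": -2 + 4 * k / 59, "c": 0.2}
        for k in range(60)]
trues = [rng.gauss(0, 1) for _ in range(50)]
ests = [simulate_session(t, bank, 15, rng) for t in trues]
bias = sum(e - t for e, t in zip(ests, trues)) / len(trues)
print(f"mean bias: {bias:+.2f}")
```

A production simulation would also track SE, exposure counts, and length distributions per the checklist above.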
Rule: Select the item with highest information at current ability estimate.
Pros: Optimal precision, shortest tests
Cons: Overuses the "best" items, poor security
Use when: Pilot testing, low-stakes practice
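A minimal sketch of maximum-information selection over a toy bank (the dict-based item format is an assumption, not from the text):

```python
import math

def item_information(theta, it):
    """3PL Fisher information at ability theta."""
    p = it["c"] + (1 - it["c"]) / (1 + math.exp(-it["a"] * (theta - it["b"])))
    return it["a"] ** 2 * ((1 - p) / p) * ((p - it["c"]) / (1 - it["c"])) ** 2

def select_mfi(theta, bank, administered):
    """Return the unused item with maximum information at the current estimate."""
    unused = [it for it in bank if it["id"] not in administered]
    return max(unused, key=lambda it: item_information(theta, it))

bank = [
    {"id": 1, "a": 1.0, "b": -1.0, "c": 0.2},
    {"id": 2, "a": 1.0, "b": 0.0, "c": 0.2},
    {"id": 3, "a": 1.0, "b": 1.5, "c": 0.2},
]
# Item 2's difficulty matches the current estimate, so it wins:
print(select_mfi(0.0, bank, administered={1})["id"])  # → 2
```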
Rule: Select from top N items by information (e.g., top 5), choose randomly from that set.
Pros: Balances precision and security
Cons: Slightly longer tests than pure MFI
Use when: Operational tests, default choice
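The randomesque rule is a small change to MFI: rank, then draw from the top of the ranking (item format and helper names are illustrative):

```python
import math, random

def item_information(theta, it):
    p = it["c"] + (1 - it["c"]) / (1 + math.exp(-it["a"] * (theta - it["b"])))
    return it["a"] ** 2 * ((1 - p) / p) * ((p - it["c"]) / (1 - it["c"])) ** 2

def select_randomesque(theta, bank, administered, n=5, rng=random):
    """Rank unused items by information at theta, then draw one at random
    from the top n, spreading exposure across near-optimal items."""
    unused = [it for it in bank if it["id"] not in administered]
    unused.sort(key=lambda it: item_information(theta, it), reverse=True)
    return rng.choice(unused[:n])

bank = [{"id": k, "a": 1.0, "b": -2 + 4 * k / 9, "c": 0.0} for k in range(10)]
chosen = select_randomesque(0.0, bank, administered=set(), n=5,
                            rng=random.Random(7))
print(chosen["id"])  # one of the five most informative items near theta = 0
```

With n = 1 this degenerates to pure MFI, so n tunes the precision/security trade-off directly.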
Rule: Stratify the bank by discrimination; administer low-a items early, while the ability estimate is still rough, and save high-a items for later stages.

Pros: Evens out item exposure, protects high-a items
Cons: Complex to implement
Use when: Very large item banks, research settings
Rule: Track content area usage, prioritize underrepresented areas when selecting next item.
Implementation: Weight information by content constraint satisfaction.
Use when: Blueprint requirements, multidimensional tests
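One possible weighting scheme (hypothetical; the text does not prescribe a formula): multiply each item's information by a factor that grows when its content area has fallen behind its target share.

```python
import math

def item_information(theta, it):
    p = it["c"] + (1 - it["c"]) / (1 + math.exp(-it["a"] * (theta - it["b"])))
    return it["a"] ** 2 * ((1 - p) / p) * ((p - it["c"]) / (1 - it["c"])) ** 2

def select_content_balanced(theta, bank, administered, targets, counts):
    """Boost information for areas behind their target share
    (one simple scheme among many)."""
    total = max(1, sum(counts.values()))
    def score(it):
        shortfall = targets[it["area"]] - counts.get(it["area"], 0) / total
        return item_information(theta, it) * (1.0 + max(0.0, shortfall))
    unused = [it for it in bank if it["id"] not in administered]
    return max(unused, key=score)

bank = [
    {"id": "alg1", "area": "algebra",  "a": 1.0, "b": 0.0, "c": 0.0},
    {"id": "geo1", "area": "geometry", "a": 1.0, "b": 0.3, "c": 0.0},
]
targets = {"algebra": 0.5, "geometry": 0.5}
# Geometry is behind its 50% target, so its item wins despite lower information:
print(select_content_balanced(0.0, bank, set(), targets,
                              {"algebra": 3, "geometry": 1})["id"])  # → geo1
```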
Stop after N items (e.g., 20 questions).
Pros: Predictable time, simple
Cons: May over- or under-test some students
Use when: Time limits matter, simple implementation needed
Stop when SE < target (e.g., SE < 0.3).
Pros: Consistent precision across ability levels
Cons: Variable test length (harder to schedule)

Typical targets:

- SE < 0.30 (marginal reliability ≈ 0.90): placement and low-to-medium-stakes use
- SE < 0.25 (marginal reliability ≈ 0.94): high-stakes decisions
Use when: Precision matters more than time
Stop when (SE < target) OR (length ≥ max) OR (length ≥ min AND ability estimate stable).
Use when: Production systems (safest approach)
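A sketch of the hybrid rule. The SE target of 0.3 comes from above; the min/max lengths and stability threshold are illustrative defaults, not prescribed values:

```python
def should_stop(se, n_items, theta_trace,
                se_target=0.3, min_len=10, max_len=30, stability=0.05):
    """Hybrid stopping rule: precision reached, OR hard length cap hit,
    OR past the minimum length with a stable ability estimate."""
    if se < se_target:
        return True
    if n_items >= max_len:
        return True
    if (n_items >= min_len and len(theta_trace) >= 2
            and abs(theta_trace[-1] - theta_trace[-2]) < stability):
        return True
    return False

print(should_stop(se=0.25, n_items=8, theta_trace=[0.1, 0.4]))   # → True
print(should_stop(se=0.45, n_items=8, theta_trace=[0.1, 0.4]))   # → False
```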
Options:

- Start at ability 0 (the population mean): the safe default
- Start from prior information (previous scores, grade level) when available
Never start at extremes (-3 or +3).
All correct or all incorrect: MLE fails; the estimate diverges to ±∞. Use EAP or a Bayesian prior (MAP) to keep the estimate finite.
Rapid changes: If ability estimate jumps >1.0, consider response anomaly (cheating, guessing).
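A grid-based EAP sketch with a standard-normal prior (grid bounds and resolution are arbitrary choices). Unlike MLE, it returns a finite estimate even for an all-correct or all-incorrect response pattern:

```python
import math

def p3(theta, item):
    return item["c"] + (1 - item["c"]) / (1 + math.exp(-item["a"] * (theta - item["b"])))

def eap(responses, items, lo=-4.0, hi=4.0, n=81):
    """EAP ability estimate: posterior mean over a theta grid
    with an N(0, 1) prior acting as the regularizer."""
    grid = [lo + i * (hi - lo) / (n - 1) for i in range(n)]
    weights = []
    for t in grid:
        w = math.exp(-0.5 * t * t)  # normal prior (constants cancel)
        for r, it in zip(responses, items):
            p = p3(t, it)
            w *= p if r else (1 - p)
        weights.append(w)
    total = sum(weights)
    return sum(t * w for t, w in zip(grid, weights)) / total

items = [{"a": 1.0, "b": b, "c": 0.2} for b in (-1.0, 0.0, 1.0)]
print(round(eap([1, 1, 1], items), 2))   # finite, positive
print(round(eap([0, 0, 0], items), 2))   # finite, negative
```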
Track how often each item is used. Flag items that appear in >20% of sessions. Consider:

- Exposure control (Sympson-Hetter, or randomesque selection)
- Writing parallel items at the same difficulty to share the load
- Temporarily retiring overexposed items
If testing multiple skills (e.g., algebra + geometry), use separate ability estimates per dimension. Select items to balance information across dimensions.
Warning: MIRT requires larger item banks and more complex calibration.
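A deliberately simplified sketch, assuming simple structure (each item loads on exactly one dimension) and 2PL items, which sidesteps full MIRT calibration: track per-dimension information and target the weakest dimension.

```python
import math

def info_2pl(theta, a, b):
    """2PL item information a^2 * p * (1 - p)."""
    p = 1 / (1 + math.exp(-a * (theta - b)))
    return a * a * p * (1 - p)

def select_multidim(thetas, dim_info, bank, administered):
    """Target the dimension with the least accumulated information
    (i.e., the highest SE); pick the most informative item on it."""
    weakest = min(dim_info, key=dim_info.get)
    unused = [it for it in bank if it["id"] not in administered]
    on_dim = [it for it in unused if it["dim"] == weakest] or unused
    return max(on_dim, key=lambda it: info_2pl(thetas[it["dim"]], it["a"], it["b"]))

thetas = {"algebra": 0.0, "geometry": 0.0}
dim_info = {"algebra": 2.0, "geometry": 0.5}   # algebra is already well measured
bank = [
    {"id": 1, "dim": "algebra",  "a": 1.5, "b": 0.0},
    {"id": 2, "dim": "geometry", "a": 1.0, "b": 0.0},
    {"id": 3, "dim": "geometry", "a": 1.2, "b": 0.5},
]
print(select_multidim(thetas, dim_info, bank, set())["id"])  # → 3
```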
❌ Too few items in bank → High exposure, security risk
✅ Aim for 10× average test length
❌ Poorly distributed difficulties → Accurate only in narrow ability range
✅ Spread items across -2 to +2 difficulty
❌ Ignoring content balance → May skip important topics
✅ Build content constraints into item selection
❌ Using MLE for all incorrect → Returns -∞
✅ Use EAP or cap estimates at -3/+3
❌ No exposure control → Same items every test
✅ Use randomesque or Sympson-Hetter
| Need | File |
|---|---|
| Calibrate new items (collect data, estimate parameters) | references/calibration.md |
| Implement CAT algorithm (code patterns, libraries) | references/implementation.md |
Setup:
Flow:
Result: Average 18 questions, 95% of students placed within ±0.5 grade levels of true ability.
IRT packages:

- R: mirt, TAM, catR
- Python: girth, catsim