Map a health AI tool through the 5-phase evidence chain from lab to deployment
Getting an AI model from "works in the lab" to "safe for patients" requires a chain of evidence across 5 distinct phases. Most health AI tools stall at Phase 1 (technical validation) and never complete Phases 3-5 (clinical utility, implementation, monitoring). This skill teaches you to map any health AI tool's evidence maturity and identify what's missing.
When a vendor says "our AI is 95% accurate," you need to ask: accurate in whose hands, on what population, compared to what, and does accuracy translate to better patient outcomes? The evidence chain framework gives you the vocabulary and structure to ask the right questions.
| Phase | Name | Key Question | Gold Standard |
|---|---|---|---|
| 1 | Technical Validation | Does the model perform well on held-out data? | AUROC, AUPRC on external test set |
| 2 | Clinical Validation | Does performance hold in clinical conditions? | Prospective study in clinical setting |
| 3 | Clinical Utility | Does using the model improve clinical decisions? | Randomized controlled trial or DCA |
| 4 | Implementation | Can the model be deployed safely in real workflows? | Deployment study with workflow integration |
| 5 | Monitoring | Does performance hold over time in production? | Continuous monitoring with drift detection |
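For reference during an audit, the table above can be mirrored as a small lookup structure. A minimal Python sketch (the structure and names are illustrative, not part of any tooling):

```python
# The five evidence-chain phases, mirroring the table above.
PHASES = {
    1: ("Technical Validation", "Does the model perform well on held-out data?"),
    2: ("Clinical Validation", "Does performance hold in clinical conditions?"),
    3: ("Clinical Utility", "Does using the model improve clinical decisions?"),
    4: ("Implementation", "Can the model be deployed safely in real workflows?"),
    5: ("Monitoring", "Does performance hold over time in production?"),
}

for num, (name, question) in sorted(PHASES.items()):
    print(f"Phase {num} ({name}): {question}")
```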
Choose a health AI tool from the `awesome-health-ai-evaluation` model registry. For each phase, search for and document the evidence:
```
Phase 1: Technical Validation
├── Published papers: [list with DOIs]
├── Test set description: [internal? external? multi-site?]
├── Key metrics: [AUROC, sensitivity, specificity, calibration]
├── Comparison to baseline: [what was the comparator?]
└── Assessment: [Strong / Adequate / Weak / Missing]
```
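AUROC, the headline Phase 1 metric, has a useful probabilistic reading: it is the chance that a randomly chosen positive case scores higher than a randomly chosen negative one. A self-contained sketch of that definition (toy scores, not from any real model):

```python
def auroc(labels, scores):
    """AUROC via the rank/probabilistic definition: the fraction of
    (positive, negative) pairs where the positive scores higher.
    Ties count as half. O(n^2), fine for illustration."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# 3 of 4 (positive, negative) pairs are ranked correctly -> 0.75
print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```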
```
Phase 2: Clinical Validation
├── Prospective studies: [list with DOIs]
├── Population: [who was studied? representative?]
├── Setting: [academic? community? LMIC?]
├── Sample size: [adequate for the prevalence?]
└── Assessment: [Strong / Adequate / Weak / Missing]
```
```
Phase 3: Clinical Utility
├── RCTs or comparative studies: [list with DOIs]
├── Outcome measures: [patient outcomes? process measures?]
├── Decision Curve Analysis: [done? results?]
├── Override/adoption rates: [do clinicians actually use it?]
└── Assessment: [Strong / Adequate / Weak / Missing]
```
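Decision Curve Analysis, named above as a Phase 3 method, scores a model by its net benefit at a chosen threshold probability pt: true positives per patient, minus false positives per patient weighted by the odds pt/(1-pt). A sketch with hypothetical counts (not from any published study):

```python
def net_benefit(tp, fp, n, pt):
    """Decision-curve net benefit at threshold probability pt:
    NB = TP/n - (FP/n) * (pt / (1 - pt)).
    False positives are discounted by the odds of the threshold."""
    return tp / n - (fp / n) * (pt / (1 - pt))

# Hypothetical audit: among 1000 patients the model flags 130 true
# positives and 70 false positives; clinical threshold pt = 0.10.
model = net_benefit(tp=130, fp=70, n=1000, pt=0.10)
treat_all = net_benefit(tp=150, fp=850, n=1000, pt=0.10)  # treat everyone
# A useful model should beat both treat-all and treat-none (NB = 0).
print(round(model, 4), round(treat_all, 4))
```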
```
Phase 4: Implementation
├── Deployment studies: [list with DOIs]
├── Workflow integration: [how was it embedded?]
├── User experience: [clinician feedback?]
├── Failure modes: [what went wrong?]
└── Assessment: [Strong / Adequate / Weak / Missing]
```
```
Phase 5: Monitoring
├── Post-deployment monitoring: [is it being tracked?]
├── Performance drift: [has performance changed over time?]
├── Demographic fairness: [different performance by subgroup?]
├── Feedback loop: [are corrections fed back to model?]
└── Assessment: [Strong / Adequate / Weak / Missing]
```
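Once every phase is filled in, the audit reduces to one assessment label per phase, and the tool's maturity is the highest contiguous phase with real evidence. A sketch (the labels come from the templates above; treating "Weak" as breaking the chain is an assumption, and the example values are hypothetical):

```python
def completed_phases(audit):
    """Highest contiguous completed phase, 0-5. The chain breaks at
    the first phase whose evidence is Weak or Missing (an assumed
    rule; adjust if Weak evidence should count)."""
    highest = 0
    for phase in sorted(audit):
        if audit[phase] in ("Strong", "Adequate"):
            highest = phase
        else:
            break
    return highest

# Hypothetical audit of a tool that stalled after clinical validation.
audit = {1: "Strong", 2: "Adequate", 3: "Missing", 4: "Missing", 5: "Missing"}
print(completed_phases(audit))
```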
| Score | Meaning |
|---|---|
| Phase 1 only | Lab-ready. Not clinically validated. |
| Phase 1-2 | Clinically validated. Not proven useful. |
| Phase 1-3 | Clinical utility demonstrated. Ready for deployment planning. |
| Phase 1-4 | Deployed. Needs monitoring plan. |
| Phase 1-5 | Full evidence chain. Gold standard. |
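The score-to-meaning mapping above can be expressed directly, keyed on the highest contiguous completed phase (the entry for 0 is an assumed label for tools with no evidence at all, which the table does not cover):

```python
# Meanings copied from the scoring table; key 0 is an assumption.
MEANINGS = {
    0: "No evidence chain.",
    1: "Lab-ready. Not clinically validated.",
    2: "Clinically validated. Not proven useful.",
    3: "Clinical utility demonstrated. Ready for deployment planning.",
    4: "Deployed. Needs monitoring plan.",
    5: "Full evidence chain. Gold standard.",
}

def score_label(highest_phase):
    """Map a highest contiguous completed phase (0-5) to its meaning."""
    return MEANINGS[highest_phase]

print(score_label(1))
```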
Most health AI tools score Phase 1 only. IDx-DR (diabetic retinopathy screening) is one of the few examples with a complete Phase 1-5 chain.
For the most critical missing phase, design the study that would fill the gap:
| Criterion | Meets Standard | Below Standard |
|---|---|---|
| All 5 phases assessed | Every phase documented with evidence or "Missing" | Phases skipped |
| Evidence accurately represented | Cited papers match claims | Misrepresentation of study findings |
| Gap identification | Critical gap identified with clinical reasoning | Superficial gap analysis |
| Study proposal | Feasible study design addressing the right gap | Unrealistic or misdirected proposal |
- `run-tripod-ai-checklist` — Phase 1-2 reporting quality assessment
- `decision-curve-analysis` — Phase 3 clinical utility quantification
- `model-card-generator` — Document the evidence chain in a model card
- `bridge-tbi-protocol` — Example of a Phase 1-3 evidence chain in TBI