Assess whether a health AI model's predicted probabilities match observed outcomes
A model that says "80% chance of disease" should be correct about 80% of the time. That's calibration. Most health AI papers report discrimination (AUROC) but ignore calibration — yet calibration determines whether clinicians can trust the model's probability estimates for decision-making. This skill teaches you to assess and visualize calibration, a critical gap in health AI evaluation.
Imagine an ER triage AI that predicts "90% probability of severe TBI." If the model is poorly calibrated, that 90% might really mean 50%. The clinician who trusts that 90% may order unnecessary interventions — or worse, a model that says "10%" when the true risk is 40% may lead to missed diagnoses. Calibration is the bridge between model output and clinical trust.
| Concept | Question It Answers |
|---|---|
| Discrimination (AUROC) | Can the model rank patients by risk? (higher risk patients get higher scores) |
| Calibration | Are the predicted probabilities accurate? (80% prediction = 80% observed rate) |
| Clinical Utility (DCA) | Does using the model lead to better decisions than alternatives? |
A model can have perfect discrimination (AUROC=1.0) but terrible calibration (all predictions are 0.99 or 0.01). And vice versa.
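To make that concrete, here is a small self-contained sketch (synthetic data; variable names are illustrative). A strictly monotone "sharpening" transform leaves the ranking, and therefore the AUROC, untouched, while pushing probabilities toward the extremes wrecks the Brier score:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, brier_score_loss

rng = np.random.default_rng(0)

# Synthetic "model": calibrated probabilities, outcomes drawn from them
p = rng.uniform(0.05, 0.95, size=20000)
y = rng.binomial(1, p)

# Strictly increasing transform: identical ranking, overconfident probabilities
p_sharp = p**8 / (p**8 + (1 - p)**8)

auc_cal = roc_auc_score(y, p)
auc_sharp = roc_auc_score(y, p_sharp)        # identical: AUROC only sees ranking
brier_cal = brier_score_loss(y, p)
brier_sharp = brier_score_loss(y, p_sharp)   # clearly worse: calibration destroyed
```

Because AUROC depends only on the ordering of scores, any strictly increasing transform of the probabilities leaves it unchanged, which is exactly why discrimination metrics alone cannot reveal miscalibration.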
Using provided predictions and outcomes (or your own dataset):
```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss
import matplotlib.pyplot as plt

# Sample: model predictions and true outcomes
y_true = [...]  # 0 or 1
y_pred = [...]  # predicted probabilities (0.0 to 1.0)

# Create calibration curve (the data behind a reliability diagram)
prob_true, prob_pred = calibration_curve(y_true, y_pred, n_bins=10)

# Plot
plt.plot(prob_pred, prob_true, marker='o', label='Model')
plt.plot([0, 1], [0, 1], linestyle='--', label='Perfect calibration')
plt.xlabel('Mean predicted probability')
plt.ylabel('Observed frequency')
plt.title('Reliability Diagram')
plt.legend()
plt.savefig('calibration_plot.png')

# Brier score (lower is better, 0 is perfect)
brier = brier_score_loss(y_true, y_pred)

# Expected Calibration Error (lower is better)
def expected_calibration_error(y_true, y_pred, n_bins=10):
    bins = np.linspace(0, 1, n_bins + 1)
    bins[-1] += 1e-8  # so predictions of exactly 1.0 fall in the last bin
    ece = 0.0
    for i in range(n_bins):
        mask = (y_pred >= bins[i]) & (y_pred < bins[i + 1])
        if mask.sum() > 0:
            bin_acc = y_true[mask].mean()    # observed event rate in this bin
            bin_conf = y_pred[mask].mean()   # mean predicted probability in this bin
            ece += mask.sum() / len(y_true) * abs(bin_acc - bin_conf)
    return ece

ece = expected_calibration_error(np.array(y_true), np.array(y_pred))
```
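As a sanity check of the ECE idea, the following self-contained sketch (with a compact re-implementation and simulated data; all names are illustrative) shows that ECE stays near zero when outcomes actually follow the stated probabilities, and grows when they do not:

```python
import numpy as np

def ece(y_true, y_pred, n_bins=10):
    # Weight each bin's |observed rate - mean predicted probability| gap by bin size
    ids = np.clip((y_pred * n_bins).astype(int), 0, n_bins - 1)
    total = 0.0
    for b in range(n_bins):
        m = ids == b
        if m.any():
            total += m.mean() * abs(y_true[m].mean() - y_pred[m].mean())
    return total

rng = np.random.default_rng(1)
p = rng.uniform(0, 1, 50000)

y_calibrated = rng.binomial(1, p)      # outcomes match the stated probabilities
ece_good = ece(y_calibrated, p)        # close to 0

y_mismatched = rng.binomial(1, p**2)   # true risk systematically below the stated one
ece_bad = ece(y_mismatched, p)         # substantially larger
```

Note that even a perfectly calibrated model has a small nonzero ECE from finite-sample noise, so the bin count should be chosen with the dataset size in mind.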
For your model, answer:
- Does the reliability curve track the diagonal, or does the model systematically over- or under-predict risk?
- What are the Brier score and ECE, and how do they compare to a naive baseline (e.g., predicting the event prevalence for everyone)?
- In which probability range is miscalibration worst, and which clinical decisions depend on that range?
If calibration is poor, common fixes:
- **Platt scaling**: fit a logistic regression to map raw scores to probabilities; parametric, so it works with limited calibration data.
- **Isotonic regression**: a non-parametric monotone mapping; more flexible, but needs more data and can overfit small calibration sets.
- **Temperature scaling**: a single scalar applied to the logits; a common choice for neural networks.
Recalibrate on a held-out calibration set, never on the test set.
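A minimal sketch of Platt scaling and isotonic regression, using simulated data and a proper calibration/test split (all names and the simulated "overconfident model" are illustrative assumptions):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss

rng = np.random.default_rng(2)

# Simulated overconfident model: true risk is p, but the model reports a sharpened q
p = rng.uniform(0.05, 0.95, 30000)
y = rng.binomial(1, p)
q = p**4 / (p**4 + (1 - p)**4)   # miscalibrated predictions

# Split into a calibration set and a test set (never recalibrate on the test set)
q_cal, y_cal = q[:15000], y[:15000]
q_test, y_test = q[15000:], y[15000:]

# Platt scaling: logistic regression on the logit of the raw prediction
logit = lambda x: np.log(x / (1 - x))
platt = LogisticRegression().fit(logit(q_cal).reshape(-1, 1), y_cal)
q_platt = platt.predict_proba(logit(q_test).reshape(-1, 1))[:, 1]

# Isotonic regression: non-parametric monotone mapping from score to probability
iso = IsotonicRegression(out_of_bounds='clip').fit(q_cal, y_cal)
q_iso = iso.predict(q_test)

brier_raw = brier_score_loss(y_test, q_test)
brier_platt = brier_score_loss(y_test, q_platt)   # improves on brier_raw
brier_iso = brier_score_loss(y_test, q_iso)       # improves on brier_raw
```

Both recalibrated models should beat the raw predictions on the held-out Brier score; which method wins in practice depends on how much calibration data is available and whether the miscalibration has a simple parametric shape.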
| Criterion | Meets Standard | Below Standard |
|---|---|---|
| Reliability diagram | Correctly plotted with reference line | Missing reference line or incorrect binning |
| ECE calculation | Correct implementation, reasonable bin count | Wrong formula or interpretation |
| Clinical interpretation | Links calibration to specific clinical decisions | Generic interpretation without clinical context |
| Recalibration | Appropriate method recommended with justification | No recalibration plan for poorly calibrated model |
- `run-tripod-ai-checklist` — Item 13a specifically addresses calibration reporting
- `decision-curve-analysis` — Clinical utility assessment that accounts for calibration
- `fairness-audit` — Calibration may differ across demographic subgroups
- `model-card-generator` — Include calibration metrics in model documentation