Oxford CEBM levels of evidence, GRADE certainty ratings, mapping between systems, traditional evidence hierarchies, and their limitations
Evidence classification systems provide a common language for describing the strength and quality of clinical evidence. This skill covers the major systems in use, how they relate to each other, and their important limitations.
The classic hierarchy ranks study designs by their inherent susceptibility to bias:
Each step down the hierarchy offers less control over confounding and bias. Randomization, when allocation is properly concealed, minimizes selection bias and balances both measured and unmeasured confounders on average. Systematic reviews pool evidence across studies and increase precision.
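The Oxford CEBM system is question-type specific, recognizing that the optimal study design depends on the clinical question. The tables below use the sublevel notation (1a, 1b, 1c, and so on) of the 2009 version; the 2011 revision dropped the sublevels in favor of plain Levels 1 to 5.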

Therapy, prevention, etiology, and harm questions:

| Level | Study Type |
|---|---|
| 1a | SR of RCTs (with homogeneity) |
| 1b | Individual RCT (with narrow CI) |
| 1c | All-or-none case series |
| 2a | SR of cohort studies (with homogeneity) |
| 2b | Individual cohort study (or low-quality RCT) |
| 2c | Outcomes research; ecological studies |
| 3a | SR of case-control studies |
| 3b | Individual case-control study |
| 4 | Case series, poor-quality cohort/case-control |
| 5 | Expert opinion without explicit critical appraisal |

Diagnosis questions:

| Level | Study Type |
|---|---|
| 1a | SR of Level 1 diagnostic studies (with homogeneity) |
| 1b | Validating cohort with good reference standard |
| 1c | Absolute SpPins and SnNouts |
| 2a | SR of Level 2+ diagnostic studies |
| 2b | Exploratory cohort with good reference standard |
| 3b | Non-consecutive study or without consistently applied reference standard |
| 4 | Case-control study or poor/non-independent reference standard |
| 5 | Expert opinion |

Prognosis questions:

| Level | Study Type |
|---|---|
| 1a | SR of inception cohort studies |
| 1b | Individual inception cohort with >80% follow-up |
| 1c | All-or-none case series |
| 2a | SR of retrospective cohort studies or untreated control groups in RCTs |
| 2b | Retrospective cohort study or follow-up of untreated control group in RCT |
| 2c | Outcomes research |
| 4 | Case series or poor-quality prognostic cohort |
| 5 | Expert opinion |
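
Because the same label can mean different things for different question types, a lookup keyed by question type makes the distinction explicit. The following is a minimal sketch that transcribes the three tables above; the dictionary keys (`"therapy"`, `"diagnosis"`, `"prognosis"`) and the function name `describe_level` are illustrative assumptions, not part of the CEBM system.

```python
# Study types transcribed from the three tables above.
# The question-type keys and all names here are hypothetical.
CEBM_LEVELS = {
    "therapy": {
        "1a": "SR of RCTs (with homogeneity)",
        "1b": "Individual RCT (with narrow CI)",
        "1c": "All-or-none case series",
        "2a": "SR of cohort studies (with homogeneity)",
        "2b": "Individual cohort study (or low-quality RCT)",
        "2c": "Outcomes research; ecological studies",
        "3a": "SR of case-control studies",
        "3b": "Individual case-control study",
        "4": "Case series, poor-quality cohort/case-control",
        "5": "Expert opinion without explicit critical appraisal",
    },
    "diagnosis": {
        "1a": "SR of Level 1 diagnostic studies (with homogeneity)",
        "1b": "Validating cohort with good reference standard",
        "1c": "Absolute SpPins and SnNouts",
        "2a": "SR of Level 2+ diagnostic studies",
        "2b": "Exploratory cohort with good reference standard",
        "3b": "Non-consecutive study or without consistently applied reference standard",
        "4": "Case-control study or poor/non-independent reference standard",
        "5": "Expert opinion",
    },
    "prognosis": {
        "1a": "SR of inception cohort studies",
        "1b": "Individual inception cohort with >80% follow-up",
        "1c": "All-or-none case series",
        "2a": "SR of retrospective cohort studies or untreated control groups in RCTs",
        "2b": "Retrospective cohort study or follow-up of untreated control group in RCT",
        "2c": "Outcomes research",
        "4": "Case series or poor-quality prognostic cohort",
        "5": "Expert opinion",
    },
}


def describe_level(question_type: str, level: str) -> str:
    """Return the study type listed for a level under a question type.

    Raises KeyError if the question type is unknown or the level is not
    listed in that table (for example, "3b" under prognosis).
    """
    table = CEBM_LEVELS[question_type.lower()]
    return table[level.lower()]
```

For example, `describe_level("therapy", "1c")` and `describe_level("diagnosis", "1c")` return different study types, which is why a level should always be reported together with its question type.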
A special category for dramatic effects observed without controlled trials:
These observations can provide Level 1c evidence despite being observational.
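GRADE has become the dominant system for guideline development worldwide, adopted by WHO, Cochrane, NICE, and more than 100 other organizations. ACC/AHA and ESC guidelines instead use their own Class of Recommendation and Level of Evidence scheme.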
GRADE uses four levels of certainty:
| Certainty | Definition |
|---|---|
| High | Very confident the true effect lies close to the estimate |
| Moderate | Moderately confident; the true effect is likely close but may be substantially different |
| Low | Limited confidence; the true effect may be substantially different from the estimate |
| Very Low | Very little confidence; the true effect is likely substantially different |
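
The four certainty levels form an ordered scale, so they can be modeled as an ordinal type. The sketch below assumes the usual GRADE workflow, in which a body of randomized evidence starts at High and observational evidence starts at Low, and the rating then moves down or up by whole levels. The function name `rate_certainty` and its parameters are hypothetical, and the number of levels to move is a judgment supplied by the reviewer rather than something the code can derive.

```python
from enum import IntEnum


class Certainty(IntEnum):
    """GRADE certainty levels as an ordered scale (higher is more certain)."""
    VERY_LOW = 1
    LOW = 2
    MODERATE = 3
    HIGH = 4


def rate_certainty(randomized: bool, downgrades: int = 0, upgrades: int = 0) -> Certainty:
    """Illustrative rating: pick a starting level, shift it, clamp to the scale.

    Assumption: randomized evidence starts at HIGH, observational at LOW.
    `downgrades` and `upgrades` are the total levels moved, as judged by
    the reviewer.
    """
    start = Certainty.HIGH if randomized else Certainty.LOW
    score = int(start) - downgrades + upgrades
    # Clamp so the result never falls outside VERY_LOW..HIGH.
    score = max(int(Certainty.VERY_LOW), min(int(Certainty.HIGH), score))
    return Certainty(score)
```

For instance, `rate_certainty(True, downgrades=1)` yields `Certainty.MODERATE`, while observational evidence rated up one level also lands at `Certainty.MODERATE`.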

GRADE rates the strength of a recommendation separately from the certainty of the evidence:

| Strength | Meaning |
|---|---|
| Strong | Benefits clearly outweigh risks (or vice versa). Most patients should receive the recommended course of action. |
| Conditional (Weak) | Benefits and risks are closely balanced, the evidence is uncertain, or the choice depends on patient values. The decision should involve shared decision-making (SDM). |
A strong recommendation can be based on low-certainty evidence, because strength also reflects the balance of benefits and harms. Seatbelts are an example: there are no RCTs, but the recommendation is strong based on mechanism and observational evidence.
| Oxford CEBM | GRADE Certainty | Interpretation |
|---|---|---|
| Level 1 (consistent) | High | Strong basis for recommendation |
| Level 1 (with limitations) | Moderate | Probably reliable |
| Level 2-3 (consistent) | Moderate-Low | Effect estimate uncertain |
| Level 4-5 | Low-Very Low | Very uncertain |
This mapping is approximate. GRADE rates a whole body of evidence for each outcome, while CEBM classifies individual studies by design.
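
The mapping table can be expressed as a small function. This is a rough sketch of the table only, not a GRADE assessment: the name `approximate_grade` and the `with_limitations` flag are assumptions made for illustration, and combinations the table does not list return `None` rather than a guess.

```python
from typing import Optional


def approximate_grade(cebm_level: str, with_limitations: bool = False) -> Optional[str]:
    """Map a CEBM level such as "1b" or "4" to the approximate GRADE range.

    Only the leading digit is used, so "2a" and "2b" map to the same range.
    Returns None where the mapping table has no matching row.
    """
    major = int(cebm_level.strip()[0])  # ValueError if the label has no leading digit
    if major == 1:
        return "Moderate" if with_limitations else "High"
    if major in (2, 3):
        # The table only lists the consistent case for Levels 2-3.
        return None if with_limitations else "Moderate-Low"
    if major in (4, 5):
        return "Low-Very Low"
    return None
```

A call such as `approximate_grade("1b", with_limitations=True)` returns `"Moderate"`. The result is a starting point for discussion; an actual GRADE rating requires assessing the full body of evidence.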
Not an evidence level system but a per-study quality assessment (covered in Bias Assessment skill).
When discussing evidence: