Entity schema, BIO-2, controlled vocabularies (UMLS/RxNorm/SNOMED/ICD-10/LOINC), PHI handling rules.
| Label | Definition | Normalizes to |
|---|---|---|
| DIAGNOSIS | Disease, disorder, condition. | SNOMED CT + ICD-10-CM |
| MEDICATION | Drug (generic or brand). | RxNorm RxCUI |
| DOSAGE | Amount + unit (e.g. "500 mg"). | Attribute of MEDICATION |
| FREQUENCY | Schedule ("bid", "q8h"). | Attribute of MEDICATION |
| ROUTE | Administration route ("PO", "IV"). | Attribute of MEDICATION |
| SYMPTOM | Patient-reported finding. | SNOMED CT (Clinical Finding) |
| LAB | Lab test name. | LOINC |
| PROCEDURE | Surgical/diagnostic procedure. | SNOMED CT (Procedure) + CPT |
| ANATOMY | Body site modifier. | UMLS CUI (Anatomy semantic type) |

`B-<LABEL>` starts an entity, `I-<LABEL>` continues it, and `O` is outside any entity.

`I-X` must follow either `B-X` or `I-X`; it must never follow `O` or a different entity type. Post-processing repairs invalid sequences before returning to the caller.

labels.yaml.

    class ExtractedEntity(BaseModel):
        text: str                             # surface form from source
        label: EntityLabel                    # one of the taxonomy labels above
        char_span: tuple[int, int]            # [start, end) in the DE-IDENTIFIED text
        token_span: tuple[int, int]           # for debugging / re-attention
        confidence: float                     # model softmax max, or aggregate
        assertion: Assertion                  # present | negated | hypothetical | historical | family
        attributes: dict[str, str]            # e.g. {"dosage": "500 mg", "frequency": "bid"}
        normalization: Normalization | None   # CUI + vocab codes, or None if unmapped

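The BIO validity rule stated above (`I-X` only after `B-X` or `I-X`) could be enforced with a single pass over the tag sequence. The sketch below is a minimal illustration, not the project's actual post-processor: the function name `repair_bio` is hypothetical, and promoting an orphan `I-X` to `B-X` is an assumed repair strategy (dropping it to `O` would be an equally valid choice).

```python
def repair_bio(tags: list[str]) -> list[str]:
    """Return a copy of `tags` in which every I-X follows B-X or I-X."""
    repaired: list[str] = []
    open_label: str | None = None  # label of the entity currently open, if any
    for tag in tags:
        if tag.startswith("I-"):
            label = tag[2:]
            if open_label != label:
                # Orphan I-X (after O or a different type): assumed repair
                # is to promote it to B-X so it starts a new entity.
                tag = "B-" + label
            open_label = label
        elif tag.startswith("B-"):
            open_label = tag[2:]
        else:
            open_label = None  # O closes any open entity
        repaired.append(tag)
    return repaired


tags = ["O", "I-MEDICATION", "I-MEDICATION", "B-DOSAGE", "I-MEDICATION", "O"]
print(repair_bio(tags))
```

Promotion keeps the model's predicted span instead of discarding it, at the cost of occasionally splitting one entity into two; which trade-off is right depends on how downstream normalization treats short spans.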
Every entity MUST have an assertion. "No chest pain" must not be extracted as a positive SYMPTOM. Default to present only when no modifier matches; log unmapped modifiers for review.
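As a rough illustration of the assertion rule above, a cue-based classifier can look at the words preceding an entity in its sentence and fall back to `present` only when nothing matches. Everything in this sketch is an assumption: the cue lexicon, the priority order, and the name `classify_assertion` are hypothetical, and logging of unmapped modifiers is left out for brevity.

```python
import re

# Hypothetical cue lexicon, checked in priority order (first match wins).
ASSERTION_CUES: list[tuple[str, tuple[str, ...]]] = [
    ("family", ("family history of", "mother", "father")),
    ("negated", ("no", "denies", "without", "negative for")),
    ("hypothetical", ("if", "should")),
    ("historical", ("history of", "h/o", "prior")),
]


def classify_assertion(sentence: str, entity_start: int) -> str:
    """Classify an entity from the cue words that precede it in its sentence."""
    prefix = sentence[:entity_start].lower()
    for assertion, cues in ASSERTION_CUES:
        for cue in cues:
            # Match the cue as a whole word or phrase, not as a substring.
            if re.search(r"(?<!\w)" + re.escape(cue) + r"(?!\w)", prefix):
                return assertion
    return "present"  # default only when no cue matched


sentence = "No chest pain"
print(classify_assertion(sentence, sentence.index("chest pain")))
```

Scanning the whole sentence prefix is deliberately crude; a production system would more likely bound the cue scope or use a trained assertion model.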

    class Normalization(BaseModel):
        cui: str | None                  # UMLS CUI, e.g. "C0011849"
        preferred_term: str              # canonical name
        codes: dict[VocabSource, str]    # {"SNOMED": "73211009", "ICD10": "E11.9", "RXNORM": ...}
        method: Literal["exact", "semantic", "llm_arbiter"]
        score: float                     # retrieval score, 0–1

- `deidentified_text`, `content_hash` (SHA-256 of original for dedup), extracted entities.
- structlog processors redact any `text` / `note` / `content` fields in dev; in prod they're dropped.
- ml/data_card.md.
- labels.yaml).
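The two PHI rules above (a SHA-256 `content_hash` of the original text for dedup, and log fields that are redacted in dev and dropped in prod) might look roughly like the sketch below. The helper names, the `env` argument, and the `"[REDACTED]"` placeholder are assumptions for illustration; only the `(logger, method_name, event_dict)` processor signature is taken from structlog itself.

```python
import hashlib

# Field names the source says must never reach the logs in clear text.
SENSITIVE_KEYS = ("text", "note", "content")


def content_hash(original_text: str) -> str:
    """SHA-256 hex digest of the original note, used only for deduplication."""
    return hashlib.sha256(original_text.encode("utf-8")).hexdigest()


def make_phi_processor(env: str):
    """Build a structlog-style processor (hypothetical factory).

    In "prod" the sensitive fields are dropped; in any other environment
    they are replaced with a placeholder so the key stays visible.
    """
    def processor(logger, method_name, event_dict):
        for key in SENSITIVE_KEYS:
            if key in event_dict:
                if env == "prod":
                    del event_dict[key]
                else:
                    event_dict[key] = "[REDACTED]"
        return event_dict

    return processor


dev = make_phi_processor("dev")
print(dev(None, "info", {"event": "ner_done", "note": "raw clinical text"}))
```

A processor built this way would be placed in the `processors` list passed to `structlog.configure`, ahead of the renderer, so that no later step sees the raw fields.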