Generate synthetic FHIR test data from PheKB phenotype definitions using Synthea. Creates custom Synthea modules, runs data generation, and loads results to the FHIR server.
/synthea <command> [options]
| Command | Description |
|---|---|
| create-module <phenotype> | Create Synthea module from phenotype data |
| generate <phenotype> | Run Synthea to generate FHIR data |
| load <phenotype> | Load generated data to FHIR server |
| full <phenotype> | Create module + generate + load (full pipeline) |
| status | Show status of all phenotypes |
| list | List available phenotypes with modules |
| batch <phenotypes...> | Process multiple phenotypes |
Creates a Synthea GMF (Generic Module Framework) module from phenotype data.
Inputs:
- data/phekb-raw/<phenotype>/document_analysis.json - Extracted codes and criteria
- data/phekb-raw/<phenotype>/description.txt - Phenotype description
- test-cases/phekb/phekb-<phenotype>.json - Test case with required codes

Outputs:
- synthea/modules/custom/phekb_<phenotype>.json - Positive case module
- synthea/modules/custom/phekb_<phenotype>_control.json - Control module

Runs Synthea with the custom module to generate synthetic patients.
Options:
- --patients N - Number of positive cases (default: 20)
- --controls N - Number of control cases (default: 20)
- --seed N - Random seed for reproducibility

Outputs:
- synthea/output/<phenotype>/positive/fhir/*.json - Positive patient bundles
- synthea/output/<phenotype>/control/fhir/*.json - Control patient bundles

Loads generated FHIR bundles to the FHIR server.
Prerequisites:
- A running FHIR server (http://localhost:8080/fhir, or Azure FHIR at http://localhost:9080)
- Generated data in synthea/output/<phenotype>/

Important: Infrastructure bundles (hospitals, practitioners) must load BEFORE patient bundles. The CLI fhir-eval load synthea command handles this automatically. If loading manually, load hospitalInformation*.json and practitionerInformation*.json files first.
Runs the complete pipeline: create-module → generate → load
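Shows which phenotypes have an analysis file (document_analysis.json), a module in synthea/modules/custom/, and generated data in synthea/output/.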
When this skill is invoked, follow these instructions based on the command:
create-module <phenotype>:

Read the phenotype data:
Read: data/phekb-raw/<phenotype>/document_analysis.json
Read: test-cases/phekb/phekb-<phenotype>.json (if exists)
Extract key information: the diagnosis, lab, and medication concepts with their codes, plus the clinical criteria.
Verify and enrich codes using the UMLS MCP server (CRITICAL):
Do NOT trust codes from document_analysis.json blindly — they may be incomplete, outdated, or wrong. Use the UMLS MCP tools to get authoritative codes:
a. For each diagnosis concept, search UMLS and get codes across systems:
search_umls(query="<condition name>", search_type="exact")
get_concept(cui="<CUI>")
crosswalk_codes(source="SNOMEDCT_US", code="<SNOMED>", target_source="ICD10CM")
crosswalk_codes(source="SNOMEDCT_US", code="<SNOMED>", target_source="ICD9CM")
b. For each lab concept, find the correct LOINC code:
search_umls(query="<lab name>", search_type="words")
Look for results with semantic type "Laboratory Procedure" or "Clinical Attribute".
c. For each medication, find the RxNorm code:
search_umls(query="<medication name>", search_type="exact")
crosswalk_codes(source="RXNORM", code="<CODE>", target_source="SNOMEDCT_US")
d. Cross-check codes from the phenotype data against UMLS:
get_source_concept(source="ICD10CM", id="<CODE>")
Verify the display name matches the intended concept. Discard obsolete codes.
e. If crosswalk returns empty (common for SNOMED→ICD-10), search UMLS for the term with search_type="words" and look for CUIs with atoms from the target system.
See /umls skill for full details on UMLS MCP usage patterns and gotchas.
Generate the Synthea module following GMF format:
- Use synthea/modules/custom/phekb_type_2_diabetes.json as a template

Generate the control module.
Write the modules:
Write: synthea/modules/custom/phekb_<phenotype>.json
Write: synthea/modules/custom/phekb_<phenotype>_control.json
Validate JSON syntax by reading back and parsing
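The read-back validation step could be sketched roughly as follows. This is a minimal sketch, not the project's actual validator: `validate_module` is a hypothetical helper name, and the structural checks beyond JSON parsing (an Initial state exists, every direct_transition points to a defined state) are assumptions about what is worth catching early.

```python
import json


def validate_module(text: str) -> list[str]:
    """Return a list of problems found in a GMF module given as JSON text."""
    try:
        module = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]

    if not isinstance(module, dict):
        return ["top-level JSON value is not an object"]
    states = module.get("states")
    if not isinstance(states, dict) or not states:
        return ["module has no 'states' object"]

    problems = []
    # Every module needs an entry point of type Initial.
    if not any(isinstance(s, dict) and s.get("type") == "Initial"
               for s in states.values()):
        problems.append("no state of type 'Initial'")

    for name, state in states.items():
        if not isinstance(state, dict):
            problems.append(f"{name}: state is not an object")
            continue
        # Only direct transitions are checked here; other transition
        # kinds (distributed, conditional) are left out of this sketch.
        target = state.get("direct_transition")
        if target is not None and target not in states:
            problems.append(
                f"{name}: direct_transition to unknown state '{target}'")
    return problems
```

An empty list means the file parsed and passed these basic checks; it does not guarantee that Synthea will accept the module.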
generate <phenotype>:

Check prerequisites:
- Synthea is installed at C:\repos\synthea (or the location in the SYNTHEA_HOME env var)
- The module exists at synthea/modules/custom/phekb_<phenotype>.json

If Synthea is not found, provide install instructions:
git clone https://github.com/synthetichealth/synthea.git C:/repos/synthea
IMPORTANT: Environment-specific issues to handle:
Claude Code runs in a bash environment (git bash), NOT cmd.exe. This causes three issues:
Issue 1: .bat files don't run natively in bash.
- Do not call run_synthea.bat directly; it won't find gradlew.bat.
- Use ./gradlew (the Unix wrapper) directly from the Synthea directory.

Issue 2: JAVA_HOME / Java not on PATH.
export JAVA_HOME="/c/Program Files/Eclipse Adoptium/jdk-17.0.18.8-hotspot"
export PATH="$JAVA_HOME/bin:$PATH"
- Common Java install locations: /c/Program Files/Eclipse Adoptium/, /c/Program Files/Java/, etc.

Issue 3: Backslash paths break Gradle -Params.
- Backslashes (\) in paths are treated as escape characters.
- Use forward slashes in every path passed to ./gradlew.
- Example: C:/repos/llm-fhir-query-eval/synthea/modules/custom (NOT C:\repos\...)

Run the Python helper script (recommended):
python synthea/generate_test_data.py --phenotype <phenotype> --patients 20 --controls 20
The script auto-detects whether it's running in bash or cmd.exe and adjusts:
- In bash: runs ./gradlew run -Params="[...]" directly with forward-slash paths
- In cmd.exe: uses run_synthea.bat as before

Or run Synthea directly via gradlew (if the script has issues):
export JAVA_HOME="/c/Program Files/Eclipse Adoptium/jdk-17.0.18.8-hotspot"
export PATH="$JAVA_HOME/bin:$PATH"
cd C:/repos/synthea && ./gradlew run -Params="['-p','20','-m','phekb_<phenotype>','-d','C:/repos/llm-fhir-query-eval/synthea/modules/custom','--exporter.fhir.export','true','--exporter.fhir.use_us_core_ig','true','--exporter.baseDirectory','C:/repos/llm-fhir-query-eval/synthea/output/<phenotype>/positive','-s','42']"
Then repeat with phekb_<phenotype>_control module and control output subdirectory.
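The forward-slash rule and the -Params quoting above can be sketched in Python. This is an illustrative sketch only; `to_forward_slashes` and `build_params` are hypothetical names and are not taken from synthea/generate_test_data.py. The argument list mirrors the direct gradlew command shown above.

```python
from pathlib import PureWindowsPath


def to_forward_slashes(path: str) -> str:
    """Convert a Windows path such as C:\\repos\\x to C:/repos/x."""
    return PureWindowsPath(path).as_posix()


def build_params(module: str, modules_dir: str, output_dir: str,
                 patients: int = 20, seed: int = 42) -> str:
    """Build the -Params argument for ./gradlew run with safe paths."""
    args = [
        "-p", str(patients),
        "-m", module,
        "-d", to_forward_slashes(modules_dir),
        "--exporter.fhir.export", "true",
        "--exporter.fhir.use_us_core_ig", "true",
        "--exporter.baseDirectory", to_forward_slashes(output_dir),
        "-s", str(seed),
    ]
    # Each argument is single-quoted inside a bracketed list.
    quoted = ",".join(f"'{arg}'" for arg in args)
    return f'-Params="[{quoted}]"'


print(build_params(
    "phekb_asthma_control",
    r"C:\repos\llm-fhir-query-eval\synthea\modules\custom",
    r"C:\repos\llm-fhir-query-eval\synthea\output\asthma\control",
))
```

Calling the function once per module (positive, then control) keeps the two runs consistent apart from the module name and output directory.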
Report results: Count generated files and summarize
load <phenotype>:

Check FHIR server is running:
curl -s http://localhost:8080/fhir/metadata | head -5
Load positive cases:
for f in synthea/output/<phenotype>/positive/fhir/*.json; do
curl -X POST http://localhost:8080/fhir \
-H "Content-Type: application/fhir+json" \
-d @"$f"
done
Load control cases:
for f in synthea/output/<phenotype>/control/fhir/*.json; do
curl -X POST http://localhost:8080/fhir \
-H "Content-Type: application/fhir+json" \
-d @"$f"
done
Verify loaded data by querying the server
Update test case with expected resource IDs (optional)
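When loading manually, a plain *.json glob does not guarantee that infrastructure bundles are posted before patient bundles. A minimal sketch of one way to order the files is shown below; `load_order` is a hypothetical helper, and the only facts it relies on are the hospitalInformation and practitionerInformation file name prefixes mentioned in this document.

```python
from pathlib import PurePath

# Prefixes of the infrastructure bundles that must be loaded first.
INFRA_PREFIXES = ("hospitalInformation", "practitionerInformation")


def load_order(paths: list[str]) -> list[str]:
    """Sort bundle paths so infrastructure bundles precede patient bundles."""
    def key(path: str) -> tuple[int, str]:
        name = PurePath(path).name
        for rank, prefix in enumerate(INFRA_PREFIXES):
            if name.startswith(prefix):
                return (rank, name)
        # Patient bundles sort after every infrastructure bundle.
        return (len(INFRA_PREFIXES), name)

    return sorted(paths, key=key)
```

Each path in the returned list can then be posted in turn, for example with the same curl call used in the loops above.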
full <phenotype>:

Execute in sequence:
- create-module <phenotype>
- generate <phenotype>
- load <phenotype>

Report overall success/failure.
status:

List all phenotypes from data/phekb-raw/*/

For each, check:
- Has document_analysis.json?
- Has a module in synthea/modules/custom/?
- Has generated data in synthea/output/?

Display summary table:
Phenotype | Analysis | Module | Data (pos/ctrl) | Loaded
---------------------|----------|--------|-----------------|--------
type-2-diabetes | ✓ | ✓ | 20/20 | ✓
asthma | ✓ | ✗ | - | -
heart-failure | ✓ | ✗ | - | -
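The per-phenotype checks behind this table could be sketched as below. This is a sketch under stated assumptions: `phenotype_status` is a hypothetical helper, the hyphen-to-underscore mapping of the module file name is inferred from the type-2-diabetes example, and the "Loaded" column is omitted because it would require querying the FHIR server.

```python
from pathlib import Path

INFRA_PREFIXES = ("hospitalInformation", "practitionerInformation")


def phenotype_status(root: Path, phenotype: str) -> dict:
    """Report which pipeline artifacts exist for one phenotype."""
    # Assumption: type-2-diabetes maps to phekb_type_2_diabetes.json.
    module_name = "phekb_" + phenotype.replace("-", "_") + ".json"
    output_dir = root / "synthea" / "output" / phenotype

    def count_patients(cohort: str) -> int:
        # Count patient bundles only; skip infrastructure bundles.
        files = (output_dir / cohort / "fhir").glob("*.json")
        return sum(1 for f in files if not f.name.startswith(INFRA_PREFIXES))

    analysis = root / "data" / "phekb-raw" / phenotype / "document_analysis.json"
    module = root / "synthea" / "modules" / "custom" / module_name
    return {
        "analysis": analysis.is_file(),
        "module": module.is_file(),
        "positive": count_patients("positive"),
        "control": count_patients("control"),
    }
```

Running it for every directory under data/phekb-raw/ gives the rows of the summary table.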
list:

List phenotypes that have Synthea modules ready:
ls synthea/modules/custom/phekb_*.json | grep -v _control
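batch <phenotypes...>:

- For each phenotype, run full <phenotype>

PheKB phenotype algorithms are multi-path decision trees, NOT simple code lookups. A single phenotype may identify patients through DIFFERENT combinations of clinical data, for example a diagnosis with labs and medications, or abnormal labs and medications with no recorded diagnosis.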
When creating Synthea modules for phenotypes with multiple identification paths, generate DISTINCT patient groups:
- Read the phenotype's source documents (data/phekb-raw/<phenotype>/) to identify all paths

Initial → Age_Guard → Set_Diabetes_Flag → Wellness_Encounter → Path_Router
  ├→ Path_With_Diagnosis (70%)
  │   ├→ Diagnose_Condition
  │   ├→ Record_Labs
  │   ├→ Prescribe_Meds
  │   └→ End_Encounter
  └→ Path_No_Diagnosis (30%)
      ├→ Record_Abnormal_Labs (NO ConditionOnset!)
      ├→ Prescribe_Meds
      └→ End_Encounter
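The Path_Router split in the diagram above can be expressed with a GMF distributed_transition. The sketch below builds that state in Python; `path_router` is a hypothetical helper, and the 70/30 shares are simply the ones from the diagram.

```python
import json


def path_router(paths: dict[str, float]) -> dict:
    """Build a GMF Simple state that routes patients to paths by probability."""
    total = sum(paths.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"path probabilities must sum to 1.0, got {total}")
    return {
        "type": "Simple",
        "distributed_transition": [
            {"transition": target, "distribution": share}
            for target, share in paths.items()
        ],
    }


router = path_router({"Path_With_Diagnosis": 0.7, "Path_No_Diagnosis": 0.3})
print(json.dumps({"Path_Router": router}, indent=2))
```

Checking that the shares sum to 1.0 before writing the module catches a common editing mistake when paths are added or removed.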
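When generating observation values, use thresholds from the phenotype algorithm document: positive cases should get values that meet the case threshold, and controls should get values below the control exclusion threshold.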
Example from T2DM:
| Lab | Case Threshold | Control Exclusion |
|---|---|---|
| HbA1c | >= 6.5% | >= 6.0% |
| Fasting glucose | >= 125 mg/dL | >= 110 mg/dL |
| Random glucose | > 200 mg/dL | > 110 mg/dL |
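One way to turn such thresholds into the low/high range of a GMF Observation state is sketched below. The helper names, the `span` parameter (how far the range extends from the threshold), and the `step` used for strict inequalities are all assumptions made for illustration; clinically plausible upper and lower bounds still need to be chosen per lab.

```python
def case_range(threshold: float, span: float,
               inclusive: bool = True, step: float = 0.1) -> dict:
    """Range for positive cases: values at or above the case threshold."""
    # For a strict ">" threshold, start one step above it.
    low = threshold if inclusive else threshold + step
    return {"low": round(low, 2), "high": round(low + span, 2)}


def control_range(exclusion: float, span: float, step: float = 0.1) -> dict:
    """Range for controls: values strictly below the exclusion threshold."""
    high = exclusion - step
    return {"low": round(high - span, 2), "high": round(high, 2)}


# HbA1c from the T2DM example: cases >= 6.5 %, controls below 6.0 %.
print(case_range(6.5, span=3.0))      # {'low': 6.5, 'high': 9.5}
print(control_range(6.0, span=1.5))   # {'low': 4.4, 'high': 5.9}
```

Keeping control values below the exclusion threshold, rather than merely below the case threshold, avoids generating controls that the phenotype algorithm would reject.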
For each phenotype, plan to generate TWO Synthea module variants:
| Variant | Module Suffix | Condition Codes | Meds | Categories | When to Use |
|---|---|---|---|---|---|
| Generic | phekb_<name>.json | SNOMED only | RxNorm SCD | Base FHIR | Tier 1 eval, basic testing |
| US Core | phekb_<name>_uscore.json | SNOMED + ICD-10-CM | RxNorm SCD | US Core categories | Tier 3 eval, profile-aware testing |
The US Core variant adds:
- Condition.category = problem-list-item (US Core requires this)
- Observation.category = laboratory with proper US Core category coding
- MedicationRequest.intent = order and .status = active (US Core requires these)

Output directories:
synthea/output/<phenotype>/
├── generic/
│   ├── positive/fhir/
│   └── control/fhir/
└── uscore/
    ├── positive/fhir/
    └── control/fhir/
Synthea's FHIR exporter works best with SCD-level (Semantic Clinical Drug) RxNorm codes, not ingredient-level codes. Always use SCD-level codes in modules.
IMPORTANT: Test case expected queries and Synthea modules need DIFFERENT code levels:
- Synthea modules: use SCD-level codes (e.g., 860975 for "metformin 500 MG ER Tablet") because Synthea generates FHIR MedicationRequest resources with specific drug forms

These are hard-won lessons from debugging Synthea module generation:
"wellness": true on Encounter states is REQUIRED. Without it, the module's ConditionOnset/Observation/MedicationOrder states will process but produce ZERO FHIR resources. Synthea only writes resources to output when they occur inside a lifecycle-managed encounter.
ConditionOnset MUST be inside an encounter. Place it AFTER the Encounter state and BEFORE the EncounterEnd state. The old pattern of using target_encounter pointing to a future encounter state does NOT work reliably for custom modules.
Use SetAttribute for disease flags, conditional_transition for branching. Match the pattern from Synthea's built-in metabolic_syndrome_disease.json + metabolic_syndrome_care.json. Disease modules set attributes; care/encounter modules check attributes and create resources.
MedicationOrder reason field must reference an attribute name, not a state name. Use "reason": "t2dm_condition" where t2dm_condition was set via assign_to_attribute on a ConditionOnset. For Path 4 patients (no condition), omit the reason field.
Infrastructure bundles must load first on HAPI FHIR. Synthea generates hospitalInformation*.json and practitionerInformation*.json files. These must be loaded before patient bundles, or HAPI returns 404 errors for Practitioner references.
Patient count vs module filter. The -m flag in Synthea keeps only patients who enter the named module. Combined with -p N, it generates N total patients but only outputs those matching the module. If the module has an Age_Guard, young patients may pass the filter but lack clinical data.
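The encounter-related lessons above (wellness flag, ConditionOnset inside the encounter, MedicationOrder reason as an attribute name) can be sketched as a small state builder plus a check. Both function names are hypothetical, the state names follow the path diagram earlier in this document, and the check only follows direct_transition links, so it is a rough lint rather than a full GMF validator.

```python
def encounter_block(condition_code: dict, med_code: dict, attribute: str) -> dict:
    """States for one wellness encounter recording a condition and a medication."""
    return {
        "Wellness_Encounter": {
            "type": "Encounter",
            "wellness": True,  # required, per the lessons above
            "direct_transition": "Diagnose_Condition",
        },
        "Diagnose_Condition": {
            "type": "ConditionOnset",
            "assign_to_attribute": attribute,
            "codes": [condition_code],
            "direct_transition": "Prescribe_Meds",
        },
        "Prescribe_Meds": {
            "type": "MedicationOrder",
            "reason": attribute,  # attribute name, not a state name
            "codes": [med_code],
            "direct_transition": "End_Encounter",
        },
        "End_Encounter": {
            "type": "EncounterEnd",
            "direct_transition": "Terminal",
        },
    }


def resources_outside_encounter(states: dict, start: str) -> list[str]:
    """Walk direct_transition links; list resource states outside an encounter."""
    resource_types = {"ConditionOnset", "Observation", "MedicationOrder"}
    outside, in_encounter, name, seen = [], False, start, set()
    while name in states and name not in seen:
        seen.add(name)
        state = states[name]
        if state["type"] == "Encounter":
            in_encounter = True
        elif state["type"] == "EncounterEnd":
            in_encounter = False
        elif state["type"] in resource_types and not in_encounter:
            outside.append(name)
        name = state.get("direct_transition")
    return outside


block = encounter_block(
    {"system": "SNOMED-CT", "code": "...", "display": "..."},
    {"system": "RxNorm", "code": "860975", "display": "metformin 500 MG ER Tablet"},
    "t2dm_condition",
)
print(resources_outside_encounter(block, "Wellness_Encounter"))  # []
```

For the no-diagnosis path, the same shape applies with the ConditionOnset state removed and the reason field omitted from the MedicationOrder.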
| Server | Healthcheck | Data Persistence | Bundle Load Order | _has Support | Notes |
|---|---|---|---|---|---|
| HAPI FHIR | curl works | Stable in-memory | Infra files first | Yes | Recommended for dev |
| fhir-candle | No curl/wget | Unstable (periodic resets) | Any order | Limited | NOT recommended |
| Azure FHIR | TBD | SQL-backed | Infra files first | Yes | Requires SQL Server |
When creating modules, use this structure:
{
  "name": "PheKB <Phenotype Name>",
  "remarks": [
    "Auto-generated from PheKB phenotype: <phenotype-id>",
    "Clinical criteria: ...",
    "..."
  ],
  "states": {
    "Initial": {
      "type": "Initial",
      "direct_transition": "Age_Guard"
    },
    "Age_Guard": {
      "type": "Guard",
      "allow": { "condition_type": "Age", "operator": ">=", "quantity": 18, "unit": "years" },
      "direct_transition": "..."
    },
    "Condition_Onset": {
      "type": "ConditionOnset",
      "codes": [
        { "system": "SNOMED-CT", "code": "...", "display": "..." },
        { "system": "ICD-10-CM", "code": "...", "display": "..." }
      ],
      "direct_transition": "..."
    },
    "Lab_Observation": {
      "type": "Observation",
      "category": "laboratory",
      "codes": [{ "system": "LOINC", "code": "...", "display": "..." }],
      "unit": "...",
      "range": { "low": ..., "high": ... },
      "direct_transition": "..."
    },
    "Medication_Order": {
      "type": "MedicationOrder",
      "codes": [{ "system": "RxNorm", "code": "...", "display": "..." }],
      "direct_transition": "..."
    },
    "Terminal": {
      "type": "Terminal"
    }
  },
  "gmf_version": 2
}
| System | Synthea Name | FHIR URI |
|---|---|---|
| SNOMED CT | SNOMED-CT | http://snomed.info/sct |
| ICD-10-CM | ICD-10-CM | http://hl7.org/fhir/sid/icd-10-cm |
| ICD-9-CM | ICD-9-CM | http://hl7.org/fhir/sid/icd-9-cm |
| LOINC | LOINC | http://loinc.org |
| RxNorm | RxNorm | http://www.nlm.nih.gov/research/umls/rxnorm |
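The mapping in this table can be kept as a small lookup when translating module codes into FHIR codings, for example to build expected query tokens. The sketch below copies the URIs from the table; `to_fhir_coding` and `search_token` are hypothetical helper names.

```python
# Synthea code system names mapped to FHIR system URIs (from the table above).
SYSTEM_URIS = {
    "SNOMED-CT": "http://snomed.info/sct",
    "ICD-10-CM": "http://hl7.org/fhir/sid/icd-10-cm",
    "ICD-9-CM": "http://hl7.org/fhir/sid/icd-9-cm",
    "LOINC": "http://loinc.org",
    "RxNorm": "http://www.nlm.nih.gov/research/umls/rxnorm",
}


def to_fhir_coding(code: dict) -> dict:
    """Translate a GMF-style code entry into a FHIR Coding."""
    system = code["system"]
    if system not in SYSTEM_URIS:
        raise KeyError(f"unknown Synthea code system: {system}")
    return {
        "system": SYSTEM_URIS[system],
        "code": code["code"],
        "display": code["display"],
    }


def search_token(code: dict) -> str:
    """Format a code as a FHIR token search value (system|code)."""
    coding = to_fhir_coding(code)
    return f"{coding['system']}|{coding['code']}"


print(search_token(
    {"system": "RxNorm", "code": "860975", "display": "metformin 500 MG ER Tablet"}
))
```

Raising on an unknown system name surfaces typos such as "SNOMED" or "ICD10" before they reach a generated module.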
Note: Algorithm PDFs (e.g., Table 7) may list additional LOINC codes beyond the primary ones. Include all relevant codes in modules.
User: /synthea create-module asthma
Claude: I'll create a Synthea module for the asthma phenotype.
[Reads document_analysis.json and test case]
[Extracts codes: ICD-9 493.x, SNOMED 195967001, LOINC for spirometry, RxNorm for albuterol]
[Creates module with age guard, condition onset, observations, medications]
[Writes phekb_asthma.json and phekb_asthma_control.json]
Created Synthea modules for asthma:
- synthea/modules/custom/phekb_asthma.json (positive cases)
- synthea/modules/custom/phekb_asthma_control.json (controls)
To generate test data: /synthea generate asthma
User: /synthea full heart-failure
Claude: Running full pipeline for heart-failure phenotype...
Step 1/3: Creating module...
[Creates module]
Step 2/3: Generating data...
[Runs Synthea - 20 positive, 20 control patients]
Step 3/3: Loading to FHIR server...
[Loads 40 patient bundles]
Complete! Generated and loaded 40 patients for heart-failure phenotype.
- 20 positive cases (should match phenotype query)
- 20 controls (should NOT match)