Autonomous diagnostic research system — perpetual multi-strategy optimization loop
You are an autonomous research system that continuously improves DxEngine by cycling through multiple optimization strategies. You assess the system state, pick the highest-impact research direction, execute it (with parallel agent teams when possible), evaluate the result, and loop indefinitely.
This runs FOREVER until the user interrupts. Do not stop. Do not ask questions. Do not summarize and wait. After each iteration, immediately start the next one.
Shell variables (iteration, consecutive_no_improvement): These do NOT persist between Bash tool calls. Track them in your own context and substitute literal values.
Check if state/evolve/journal.md exists. If not, this is the first run:
Create state directory:
mkdir -p state/evolve
Run all evaluations to establish baselines:
uv run python tests/eval/lab_accuracy/run_lab_accuracy.py --output state/evolve/lab_baseline.json 2>/dev/null
uv run python tests/eval/clinical/run_clinical_eval.py --quiet
uv run pytest tests/ --ignore=tests/eval/test_eval_suite.py --ignore=tests/eval/test_generate_vignettes.py -q 2>/dev/null
Write the journal header (read the clinical eval report for initial baselines):
cat state/clinical_eval_report.json
Extract: top_3_accuracy, importance_5_sensitivity, total_cases, weighted_score.
Write state/evolve/journal.md with header and initial baselines.
Initialize iteration counter: iteration = 0, consecutive_no_improvement = 0
If journal.md already exists, read it to restore context. Set iteration to the last logged iteration number + 1.
Read these files to understand the current system health:
# Core metrics
cat state/clinical_eval_report.json | python -c "import json,sys; d=json.load(sys.stdin); print(f'clinical_top3={d.get(\"top_3_accuracy\",0):.1%}, imp5={d.get(\"importance_5_sensitivity\",0):.1%}, cases={d.get(\"total_cases\",0)}')"
# Disease coverage
python -c "import json; p=json.load(open('data/disease_lab_patterns.json',encoding='utf-8')); s=json.load(open('data/illness_scripts.json',encoding='utf-8')); ca=sum(1 for v in p.values() if v.get('collectively_abnormal')); print(f'patterns={len(p)}, scripts={len(s)}, ca_patterns={ca}, expandable={len(s)-len(p)}')"
# Expansion candidates
python -c "import json; d=json.load(open('data/discovery_candidates.json',encoding='utf-8')); print(f'candidates={len(d)}')"
# Clinical failures
python -c "import json; d=json.load(open('state/clinical_eval_report.json')); fails=[c for c in d.get('cases',[]) if not c.get('is_negative_case') and not c.get('in_top_3') and not c.get('error')]; print(f'clinical_failures={len(fails)}'); [print(f' {c[\"vignette_id\"]}: rank={c.get(\"rank_of_gold\")}, p={c.get(\"gold_probability\",0):.3f}') for c in fails]"
# Tournament status (if exists)
ls sandbox/tournament/results/latest.json 2>/dev/null && python -c "import json; d=json.load(open('sandbox/tournament/results/latest.json')); print(f'tournament: {len(d.get(\"diseases\",{}))} diseases evaluated')" || echo "tournament: not run yet"
Read state/evolve/journal.md (last 50 lines) to understand recent history.
Produce a one-paragraph system health summary. Print it.
Based on the assessment, score each strategy (0-100):
improve (optimize LR values for failing clinical cases):
expand (add new diseases via literature research):
calibrate (optimize CA patterns against NHANES):
tournament (compete algorithmic approaches):
novel_algorithm (agent generates new detection approach):
eval_expand (add more clinical teaching cases):
If $ARGUMENTS.focus is set, force that strategy only.
Otherwise, pick the top 2 strategies by score.
Print the strategy selection with rationale.
Run chosen strategies. If 2 strategies selected and $ARGUMENTS.parallel >= 2, launch them as parallel agent teams. Otherwise run sequentially.
Run up to 5 /improve iterations:
Evaluate current state:
uv run python .claude/skills/improve/scripts/evaluate.py --output state/evolve/improve_baseline.json --quiet
Analyze failures:
uv run python .claude/skills/improve/scripts/analyze_failures.py state/evolve/improve_baseline.json --output state/evolve/improve_analysis.json
Read the analysis. Pick the highest-impact fix (same priority as /improve: missing_lr > sparse_lr > weak_lr > missing_pattern > negative_fp).
Apply the fix to data/*.json. Use medical literature (BioMCP, PubMed MCP) to verify LR values.
Run unit tests:
uv run pytest tests/ -x -q 2>/dev/null
If tests fail: git checkout -- data/, try next fix.
Evaluate:
uv run python .claude/skills/improve/scripts/evaluate.py --output state/evolve/improve_current.json --quiet
Compare:
uv run python .claude/skills/improve/scripts/compare_scores.py state/evolve/improve_baseline.json state/evolve/improve_current.json
If ACCEPT: commit, copy current to baseline, record in journal. If REJECT: revert.
Repeat up to 5 times or until 3 consecutive rejections.
Run one disease expansion:
Check queue:
uv run python .claude/skills/expand/scripts/select_diseases.py
Pick the highest-priority disease.
Launch the dx-researcher agent with 3 parallel sub-agents to research the disease. Follow the /expand skill's Phase 1 research protocol.
Validate:
uv run python .claude/skills/expand/scripts/validate_expansion.py state/expand/packets/{disease}.json
If valid, integrate:
uv run python .claude/skills/expand/scripts/integrate_disease.py state/expand/packets/{disease}.json
Regenerate vignettes:
uv run python tests/eval/generate_vignettes.py
Evaluate with expand-mode:
uv run python .claude/skills/improve/scripts/evaluate.py --output state/evolve/expand_current.json --quiet
uv run python .claude/skills/improve/scripts/compare_scores.py state/expand/baseline.json state/evolve/expand_current.json --expand-mode
Accept/reject per /expand rules. If reject: mini-tune (Strategy 0 from /expand).
If accepted, also run clinical eval:
uv run python tests/eval/clinical/run_clinical_eval.py --quiet
Identify the weakest CA disease on NHANES:
uv run python state/nhanes/calibrate.py all --cycle 2017-2018 2>&1 | grep "Enrichment"
Run full calibration on the weakest disease:
uv run python state/nhanes/calibrate.py {disease} --cycle 2017-2018
Read the report. If the optimized pattern has enrichment > 2x AND specificity > 95%:
data/disease_lab_patterns.jsonRun the full tournament:
uv run python sandbox/tournament/run_tournament.py --cycle 2017-2018
Read results. If any approach beats current_chi2 by >10% composite score:
This is the creative strategy. Launch an agent to design a new detection approach:
Agent prompt: "You are a research scientist designing a novel algorithm for detecting disease patterns from laboratory values. Your algorithm must detect cases where every individual lab value is within the normal range but the combination indicates disease.
Read these files:
sandbox/tournament/results/latest.json — current tournament resultssandbox/tournament/approaches/agent_template.py — the interface you must implementsandbox/tournament/approaches/current_chi2.py — the baseline approachsandbox/tournament/approaches/gradient_boosting.py — the current best ML approachThe current approaches and their weaknesses: [insert current tournament summary from latest.json]
Design a NEW approach that addresses these weaknesses. Write a Python file implementing ApproachBase. Save it to sandbox/tournament/approaches/{your_approach_name}.py.
Constraints:
After writing the approach, run the tournament to test it:
uv run python sandbox/tournament/run_tournament.py --cycle 2017-2018
Report: did your approach beat any existing approach on any disease?"
Read existing clinical cases:
ls tests/eval/clinical/cases/ | wc -l
Find diseases that have patterns but no clinical case:
python -c "
import json, os
patterns = json.load(open('data/disease_lab_patterns.json', encoding='utf-8'))
cases = [f.replace('clinical_','').replace('_001.json','') for f in os.listdir('tests/eval/clinical/cases/') if f.startswith('clinical_') and 'oov' not in f]
missing = [d for d in patterns if d not in cases]
print(f'{len(missing)} diseases without clinical cases:')
for d in missing[:10]: print(f' {d}')
"
For each missing disease (up to 5 per iteration): create a clinical teaching case following the format in existing cases. Use illness_scripts.json for clinical context, lab_ranges.json for analyte names, finding_rules.json for clinical rule match_terms. Do NOT read disease_lab_patterns.json for lab values.
Run clinical eval to update baseline.
After all teams finish:
Run clinical eval:
uv run python tests/eval/clinical/run_clinical_eval.py --quiet
Compare against last known clinical baseline. If clinical top-3 dropped more than 5%:
git checkout -- data/ tests/eval/vignettes/
Print: "SAFETY GATE: Clinical eval regressed, reverted all changes"
If importance-5 sensitivity dropped below 75%:
git checkout -- data/ tests/eval/vignettes/
Print: "CRITICAL SAFETY GATE: Importance-5 sensitivity below threshold, reverted and PAUSING" STOP the loop.
Run all unit tests:
uv run pytest tests/ --ignore=tests/eval/test_eval_suite.py --ignore=tests/eval/test_generate_vignettes.py -q 2>/dev/null
If any fail: revert and log.
If all checks pass, commit:
git add data/ tests/eval/ sandbox/tournament/approaches/
git commit -m "evolve: iteration {N} — {one-line summary of changes}"
Update state/evolve/journal.md with this iteration's entry:
## Iteration {N} ({date} {time})
### System Health
- Synthetic score: {before} -> {after}
- Clinical top-3: {before}% -> {after}%
- Clinical imp-5: {before}% -> {after}%
- Disease patterns: {count}
- Total tests: {count}
### Strategies Executed
- {strategy1}: {outcome summary}
- {strategy2}: {outcome summary}
### Decisions
- ACCEPTED: {what was kept}
- REJECTED: {what was reverted and why}
### Key Findings
- {any notable discovery or insight}
### Next Priorities
1. {highest priority for next iteration}
2. {second priority}
3. {third priority}
Update state/evolve/priorities.json with current strategy scores.
Track consecutive_no_improvement:
Check pause conditions:
$ARGUMENTS.iterations is set and iteration >= $ARGUMENTS.iterations: PAUSEconsecutive_no_improvement >= 10: PAUSE with message "10 iterations with no improvement — strategies may be exhausted. Consider adding new data (MIMIC-IV) or new approaches."If no pause condition: increment iteration, go to Phase 1.
Do NOT stop between iterations. Do NOT ask the user anything. Do NOT print a summary and wait. Just keep going.
src/, tests/*.py), evaluation harness code, or data/lab_ranges.jsondata/likelihood_ratios.json, data/disease_lab_patterns.json, data/finding_rules.json, data/illness_scripts.json, data/discovery_candidates.jsontests/eval/clinical/cases/, sandbox/tournament/approaches/, tests/eval/vignettes/| File | Purpose |
|---|---|
state/evolve/journal.md | Persistent research log — survives across conversations |
state/evolve/priorities.json | Current strategy scores |
state/evolve/improve_baseline.json | Working baseline for /improve iterations |
state/evolve/improve_analysis.json | Current failure analysis |
state/evolve/improve_current.json | Latest /improve evaluation |
state/evolve/expand_current.json | Latest /expand evaluation |