스킬 검색.../

Evolve | Skills Pool

Run all evaluations to establish baselines:

uv run python tests/eval/lab_accuracy/run_lab_accuracy.py --output state/evolve/lab_baseline.json 2>/dev/null
uv run python tests/eval/clinical/run_clinical_eval.py --quiet
uv run pytest tests/ --ignore=tests/eval/test_eval_suite.py --ignore=tests/eval/test_generate_vignettes.py -q 2>/dev/null

# Core metrics
cat state/clinical_eval_report.json | python -c "import json,sys; d=json.load(sys.stdin); print(f'clinical_top3={d.get(\"top_3_accuracy\",0):.1%}, imp5={d.get(\"importance_5_sensitivity\",0):.1%}, cases={d.get(\"total_cases\",0)}')"

# Disease coverage
python -c "import json; p=json.load(open('data/disease_lab_patterns.json',encoding='utf-8')); s=json.load(open('data/illness_scripts.json',encoding='utf-8')); ca=sum(1 for v in p.values() if v.get('collectively_abnormal')); print(f'patterns={len(p)}, scripts={len(s)}, ca_patterns={ca}, expandable={len(s)-len(p)}')"

# Expansion candidates
python -c "import json; d=json.load(open('data/discovery_candidates.json',encoding='utf-8')); print(f'candidates={len(d)}')"

# Clinical failures
python -c "import json; d=json.load(open('state/clinical_eval_report.json')); fails=[c for c in d.get('cases',[]) if not c.get('is_negative_case') and not c.get('in_top_3') and not c.get('error')]; print(f'clinical_failures={len(fails)}'); [print(f'  {c[\"vignette_id\"]}: rank={c.get(\"rank_of_gold\")}, p={c.get(\"gold_probability\",0):.3f}') for c in fails]"

# Tournament status (if exists)
ls sandbox/tournament/results/latest.json 2>/dev/null && python -c "import json; d=json.load(open('sandbox/tournament/results/latest.json')); print(f'tournament: {len(d.get(\"diseases\",{}))} diseases evaluated')" || echo "tournament: not run yet"

Evaluate current state:

uv run python .claude/skills/improve/scripts/evaluate.py --output state/evolve/improve_baseline.json --quiet

Analyze failures:

uv run python .claude/skills/improve/scripts/analyze_failures.py state/evolve/improve_baseline.json --output state/evolve/improve_analysis.json

Read the analysis. Pick the highest-impact fix (same priority as /improve: missing_lr > sparse_lr > weak_lr > missing_pattern > negative_fp).
Apply the fix to data/*.json. Use medical literature (BioMCP, PubMed MCP) to verify LR values.
Run unit tests:
```
uv run pytest tests/ -x -q 2>/dev/null
```
If tests fail: git checkout -- data/, try next fix.

Evaluate:

uv run python .claude/skills/improve/scripts/evaluate.py --output state/evolve/improve_current.json --quiet

Compare:

uv run python .claude/skills/improve/scripts/compare_scores.py state/evolve/improve_baseline.json state/evolve/improve_current.json

If ACCEPT: commit, copy current to baseline, record in journal. If REJECT: revert.
Repeat up to 5 times or until 3 consecutive rejections.

Check queue:

uv run python .claude/skills/expand/scripts/select_diseases.py

Pick the highest-priority disease.
Launch the dx-researcher agent with 3 parallel sub-agents to research the disease. Follow the /expand skill's Phase 1 research protocol.

Validate:

uv run python .claude/skills/expand/scripts/validate_expansion.py state/expand/packets/{disease}.json

If valid, integrate:

uv run python .claude/skills/expand/scripts/integrate_disease.py state/expand/packets/{disease}.json

Regenerate vignettes:

uv run python tests/eval/generate_vignettes.py

Evaluate with expand-mode:

uv run python .claude/skills/improve/scripts/evaluate.py --output state/evolve/expand_current.json --quiet
uv run python .claude/skills/improve/scripts/compare_scores.py state/expand/baseline.json state/evolve/expand_current.json --expand-mode

Accept/reject per /expand rules. If reject: mini-tune (Strategy 0 from /expand).

If accepted, also run clinical eval:

uv run python tests/eval/clinical/run_clinical_eval.py --quiet

Identify the weakest CA disease on NHANES:

uv run python state/nhanes/calibrate.py all --cycle 2017-2018 2>&1 | grep "Enrichment"

Run full calibration on the weakest disease:

uv run python state/nhanes/calibrate.py {disease} --cycle 2017-2018

Read the report. If the optimized pattern has enrichment > 2x AND specificity > 95%:
- Apply the proposed pattern changes to data/disease_lab_patterns.json
- Run eval to check for regressions
- Accept/reject based on regression gates

Run the full tournament:

uv run python sandbox/tournament/run_tournament.py --cycle 2017-2018

Read results. If any approach beats current_chi2 by >10% composite score:
- Log the finding in the journal
- If the winning approach has low overfit gap (<0.10): investigate integrating it

uv run python sandbox/tournament/run_tournament.py --cycle 2017-2018

Read existing clinical cases:
```
ls tests/eval/clinical/cases/ | wc -l
```

Find diseases that have patterns but no clinical case:

python -c "
import json, os
patterns = json.load(open('data/disease_lab_patterns.json', encoding='utf-8'))
cases = [f.replace('clinical_','').replace('_001.json','') for f in os.listdir('tests/eval/clinical/cases/') if f.startswith('clinical_') and 'oov' not in f]
missing = [d for d in patterns if d not in cases]
print(f'{len(missing)} diseases without clinical cases:')
for d in missing[:10]: print(f'  {d}')
"

For each missing disease (up to 5 per iteration): create a clinical teaching case following the format in existing cases. Use illness_scripts.json for clinical context, lab_ranges.json for analyte names, finding_rules.json for clinical rule match_terms. Do NOT read disease_lab_patterns.json for lab values.
Run clinical eval to update baseline.

Run clinical eval:

uv run python tests/eval/clinical/run_clinical_eval.py --quiet

Compare against last known clinical baseline. If clinical top-3 dropped more than 5%:
```
git checkout -- data/ tests/eval/vignettes/
```
Print: "SAFETY GATE: Clinical eval regressed, reverted all changes"
If importance-5 sensitivity dropped below 75%:
```
git checkout -- data/ tests/eval/vignettes/
```
Print: "CRITICAL SAFETY GATE: Importance-5 sensitivity below threshold, reverted and PAUSING" STOP the loop.

Run all unit tests:

uv run pytest tests/ --ignore=tests/eval/test_eval_suite.py --ignore=tests/eval/test_generate_vignettes.py -q 2>/dev/null

If any fail: revert and log.

If all checks pass, commit:

git add data/ tests/eval/ sandbox/tournament/approaches/
git commit -m "evolve: iteration {N} — {one-line summary of changes}"

## Iteration {N} ({date} {time})

### System Health
- Synthetic score: {before} -> {after}
- Clinical top-3: {before}% -> {after}%
- Clinical imp-5: {before}% -> {after}%
- Disease patterns: {count}
- Total tests: {count}

### Strategies Executed
- {strategy1}: {outcome summary}
- {strategy2}: {outcome summary}

### Decisions
- ACCEPTED: {what was kept}
- REJECTED: {what was reverted and why}

### Key Findings
- {any notable discovery or insight}

### Next Priorities
1. {highest priority for next iteration}
2. {second priority}
3. {third priority}

File	Purpose
`state/evolve/journal.md`	Persistent research log — survives across conversations
`state/evolve/priorities.json`	Current strategy scores
`state/evolve/improve_baseline.json`	Working baseline for /improve iterations
`state/evolve/improve_analysis.json`	Current failure analysis
`state/evolve/improve_current.json`	Latest /improve evaluation
`state/evolve/expand_current.json`	Latest /expand evaluation

Evolve

/evolve — Autonomous Diagnostic Research System

Phase 0: INITIALIZE (first run only)

Evolve

/evolve — Autonomous Diagnostic Research System

Phase 0: INITIALIZE (first run only)

Phase 1: ASSESS

Phase 2: STRATEGIZE

Phase 3: EXECUTE

Strategy: improve

Strategy: expand

Strategy: calibrate

Strategy: tournament

Strategy: novel_algorithm

Strategy: eval_expand

Phase 4: INTEGRATE

Phase 5: REFLECT

Phase 6: LOOP

Safety Rules

State Files

Clickhouse Io

Clickhouse Io

Claude Devfleet

Clickhouse Io

Ai First Engineering

Postgres Patterns