Analyze prompt performance across explorers, architects, and reviewers. Promote winners, generate challengers. Triggered by session-learnings or manually.
Closed-loop optimization for subagent prompts across three agent types:
Each agent type has its own variant pool, event log, and scoring model. The goal: prompts get measurably better over time across the entire workflow.
Invoke: After session-learnings, or manually with /prompt-optimization.
- `memory/episodic/exploration-events.jsonl` for the current session
- `/prompt-optimization` to review performance data

Run the tracker to recompute all variant metrics from event history across all agent types:
```bash
python3 ~/.claude/scripts/prompt-tracker.py update-metrics
```
```bash
# Full report (all agent types)
python3 ~/.claude/scripts/prompt-tracker.py report

# Or filter to a specific type
python3 ~/.claude/scripts/prompt-tracker.py report explorer
python3 ~/.claude/scripts/prompt-tracker.py report architect
python3 ~/.claude/scripts/prompt-tracker.py report reviewer
```
Present the report to the user. Key things to highlight:
- Explorers: Which variants have the highest F1, promotion readiness, most commonly missed files
- Architects: Which optimization target users prefer (selection rate), convergence speed, resulting code quality
- Reviewers: Which review style finds more real issues (true positive rate) vs. noise (false positives)
Promotions now use statistical confidence intervals instead of raw score gaps. A variant wins only when the evidence is significant — not just when it's ahead on average.
```bash
# Run CI-aware promotion check
python3 scripts/stat-eval.py promote <agent_type> <category>

# Also check behavioral consistency and flakiness
python3 scripts/stat-eval.py consistency <agent_type> <category>
python3 scripts/stat-eval.py flakiness <agent_type> <category>
```
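The CI logic can be sketched as follows. `stat-eval.py` is the source of truth; the function names and the normal-approximation interval here are illustrative, not its actual implementation. A challenger wins only when the lower bound of its score's 95% CI clears the incumbent's upper bound:

```python
import math

def mean_ci(scores, z=1.96):
    """95% normal-approximation CI for the mean of per-session scores."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)
    half = z * math.sqrt(var / n)
    return mean - half, mean + half

def should_promote(challenger_scores, incumbent_scores, min_sessions=10):
    """Promote only when the challenger's CI lower bound beats the
    incumbent's CI upper bound -- i.e. the gap is statistically clear,
    not just a raw-average lead."""
    if len(challenger_scores) < min_sessions:
        return False
    lo_c, _ = mean_ci(challenger_scores)
    _, hi_i = mean_ci(incumbent_scores)
    return lo_c > hi_i
```

Note how a noisy challenger with a small average lead is rejected: its interval still overlaps the incumbent's.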
Promotion criteria (any of):
Block promotion if:
- A regression is detected (`stat-eval.py regression`)

If promoted:

- Update `current_best_A` / `current_best_B` in `prompt-variants.json`
- Update `prompt-library.md` with the winning prompt text

Aggregate metrics tell you which variant lost, not why. Before synthesizing a challenger, do a brief manual review so the challenger is grounded in actual failure modes rather than the winner's surface differences.
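The promotion bookkeeping above reduces to a small JSON update. This is a minimal sketch assuming `prompt-variants.json` keys variants by agent type with `current_best_A` / `current_best_B` fields; the exact field layout is the tracker's, not guaranteed here:

```python
import json
from pathlib import Path

def promote(path: Path, agent_type: str, slot: str, variant_id: str) -> None:
    """Point current_best_A or current_best_B for an agent type at the
    winning variant id and write the file back."""
    data = json.loads(path.read_text())
    data[agent_type][f"current_best_{slot}"] = variant_id
    path.write_text(json.dumps(data, indent=2))
```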
```bash
# Sample 20 traces (or all, if fewer) from the losing variant's event log
python3 ~/.claude/scripts/prompt-tracker.py sample-traces \
  <agent_type> <variant_id> --n 20
```
For each sampled trace, write ONE sentence about the earliest observable failure (the "open coding" pass — see Hamel Husain's error-analysis workflow for the technique). Stop at the first error per trace; downstream errors are usually consequences, not causes.
Aggregate the notes into error categories (3-7 buckets). Count each bucket. The top 2-3 buckets are the challenger's target — everything else is noise.
When to skip: If the loser has < 10 sessions, skip this step and just use aggregate metrics — there isn't enough signal to warrant manual review.
Why it matters: Aggregate F1 gives a correct answer to the wrong question. A challenger that only copies the winner's surface structure will often regress on cases the winner also handles poorly. Counting beats vibes.
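The open-coding tally is deliberately simple to keep the counting honest. A sketch, where bucket labels are whatever you wrote during the pass:

```python
from collections import Counter

def top_buckets(notes: dict, k: int = 3) -> list:
    """notes maps trace_id -> failure-bucket label (one per trace, from
    the open-coding pass). Returns the k most common buckets -- the
    challenger's targets; everything below the cut is treated as noise."""
    return Counter(notes.values()).most_common(k)
```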
For each losing variant:
Challenger generation prompt:
```
This explorer prompt scored avg F1={loser_f1} over {sessions} sessions.
The winning prompt scored avg F1={winner_f1}.

LOSING PROMPT:
{loser_prompt}

WINNING PROMPT:
{winner_prompt}

COMMONLY MISSED FILES BY LOSER:
{missed_files_list}

TOP FAILURE BUCKETS FROM TRACE-SAMPLING (Step 3.5):
{bucket_counts}   # e.g. "schema-miss: 7, wrong-file-scope: 5, no-test-context: 3"

Rewrite the losing prompt to better discover these file types.
Keep the same exploration scope and thinking budget.
The rewrite should be a single paragraph instruction.
Return ONLY the rewritten prompt text, nothing else.
```
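Filling the template is plain string formatting; the keyword names mirror the placeholders in the template above. The template here is trimmed to two placeholders purely for illustration:

```python
# Trimmed illustration -- the real template is the full block above.
TEMPLATE = (
    "This explorer prompt scored avg F1={loser_f1} over {sessions} sessions.\n"
    "TOP FAILURE BUCKETS FROM TRACE-SAMPLING (Step 3.5):\n{bucket_counts}\n"
    "Return ONLY the rewritten prompt text, nothing else."
)

def build_challenger_prompt(**fields) -> str:
    """Fill the challenger-generation template with tracker metrics."""
    return TEMPLATE.format(**fields)
```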
Files:
| File | Purpose |
|---|---|
| `memory/procedural/prompt-variants.json` | Variant definitions + aggregate metrics (all agent types) |
| `memory/episodic/exploration-events.jsonl` | Explorer outcome data (Phase 2 → Phase 5) |
| `memory/episodic/architect-events.jsonl` | Architect outcome data (Phase 4 → Phase 6) |
| `memory/episodic/reviewer-events.jsonl` | Reviewer outcome data (Phase 6) |
| `scripts/prompt-tracker.py` | CLI for selection, recording, metrics, reporting |
| `scripts/stat-eval.py` | Statistical analysis: CIs, consistency, regression, calibration, flakiness |
```
precision = files_found_and_used / files_found          (less noise)
recall    = files_found_and_used / total_files_needed   (fewer misses)
f1        = harmonic mean of precision and recall
score     = f1 * (1 - retry_rate)                       (penalize bad exploration)
```
Good exploration: High precision (found files were useful) AND high recall (didn't miss critical files). The F1 score balances both.
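The explorer formulas transcribe directly; function and argument names here are illustrative, not the tracker's API:

```python
def explorer_score(found, found_and_used, total_needed, retry_rate):
    """F1 over discovered files, penalized by retry rate."""
    precision = found_and_used / found         # less noise
    recall = found_and_used / total_needed     # fewer misses
    f1 = 2 * precision * recall / (precision + recall)
    return f1 * (1 - retry_rate)
```

For example, finding 10 files of which 8 were used, with all 8 needed files covered and no retries, gives F1 = 8/9 ≈ 0.89.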
```
selection   = 1.0 if user chose this proposal, 0.0 otherwise
convergence = 1.0 - (refinement_rounds / 3.0)             (fewer rounds = better)
quality     = 1.0 - critical_penalty - total_penalty      (fewer review issues = better)
score       = (selection * 0.4) + (quality * 0.35) + (convergence * 0.25)
```
Good architecture: Users choose it, the plan converges quickly, and Phase 6 review finds few critical issues.
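Same transcription for architects (names illustrative); a selected, zero-round, zero-penalty proposal scores the maximum 1.0:

```python
def architect_score(selected, refinement_rounds, critical_penalty, total_penalty):
    """Weighted blend of user selection, review quality, and convergence speed."""
    selection = 1.0 if selected else 0.0
    convergence = 1.0 - refinement_rounds / 3.0
    quality = 1.0 - critical_penalty - total_penalty
    return selection * 0.4 + quality * 0.35 + convergence * 0.25
```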
```
true_positive_rate = issues_fixed / issues_found            (found real problems)
signal_to_noise    = (found - false_positives) / found      (not just noise)
score              = true_positive_rate * signal_to_noise
```
Good review: High signal — issues found are real and worth fixing, not noise that wastes time.
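And for reviewers (names illustrative). Note the product form: a reviewer that finds 10 issues of which 8 get fixed and 2 are false positives scores 0.8 × 0.8 = 0.64, so noise is penalized multiplicatively, not additively:

```python
def reviewer_score(issues_found, issues_fixed, false_positives):
    """High only when found issues are both real (fixed) and low-noise."""
    tpr = issues_fixed / issues_found
    snr = (issues_found - false_positives) / issues_found
    return tpr * snr
```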
| Condition | Threshold |
|---|---|
| Minimum sessions per variant | 10 |
| F1 gap required for promotion | 0.05 |
| Maximum active variants per role | 2 |
| Challenger generation | Requires user approval |
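The numeric thresholds in the table act as a coarse gate before the CI check in `stat-eval.py`; this sketch (field names assumed, not the tracker's schema) shows the gate only:

```python
MIN_SESSIONS = 10   # minimum sessions per variant
F1_GAP = 0.05       # raw F1 gap required before the CI check even runs

def eligible_for_promotion(challenger: dict, incumbent: dict) -> bool:
    """Coarse threshold gate; stat-eval.py's CI check is applied on top."""
    return (challenger["sessions"] >= MIN_SESSIONS
            and incumbent["sessions"] >= MIN_SESSIONS
            and challenger["f1"] - incumbent["f1"] >= F1_GAP)
```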
- Run `prompt-tracker.py select explorer <category> <role>` before dispatching explorers
- Query the `get_prompt_performance` tool (accepts an optional `agent_type` filter)
- Run `/claude-flow` on a real task; it auto-records explorer, architect, and reviewer performance events.
- Check `~/.claude/memory/prompt-variants.json` for variant win rates and F1 scores.
- Run `/prompt-optimization` after 10+ sessions per variant to trigger automatic promotion.