Analyze prompt performance across explorers, architects, and reviewers. Promote winners, generate challengers. Triggered by session-learnings or manually.
Closed-loop optimization for subagent prompts across three agent types:
Each agent type has its own variant pool, event log, and scoring model. The goal: prompts get measurably better over time across the entire workflow.
Invoke: After session-learnings, or manually with /prompt-optimization.
- `memory/episodic/exploration-events.jsonl` for the current session
- `/prompt-optimization` to review performance data

Run the tracker to recompute all variant metrics from event history across all agent types:
```bash
python3 ~/.claude/scripts/prompt-tracker.py update-metrics
```
```bash
# Full report (all agent types)
python3 ~/.claude/scripts/prompt-tracker.py report

# Or filter to a specific type
python3 ~/.claude/scripts/prompt-tracker.py report explorer
python3 ~/.claude/scripts/prompt-tracker.py report architect
python3 ~/.claude/scripts/prompt-tracker.py report reviewer
```
Present the report to the user. Key things to highlight:
- Explorers: Which variants have the highest F1, promotion readiness, most commonly missed files
- Architects: Which optimization target users prefer (selection rate), convergence speed, resulting code quality
- Reviewers: Which review style finds more real issues (true positive rate) vs. noise (false positives)
Promotions now use statistical confidence intervals instead of raw score gaps. A variant wins only when the evidence is significant — not just when it's ahead on average.
```bash
# Run CI-aware promotion check
python3 scripts/stat-eval.py promote <agent_type> <category>

# Also check behavioral consistency and flakiness
python3 scripts/stat-eval.py consistency <agent_type> <category>
python3 scripts/stat-eval.py flakiness <agent_type> <category>
```
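The CI logic can be sketched as follows. `stat-eval.py` is the source of truth; the function names and the normal-approximation interval here are illustrative, not its actual implementation. A challenger wins only when the lower bound of its score's 95% CI clears the incumbent's upper bound:

```python
import math

def mean_ci(scores, z=1.96):
    """95% normal-approximation CI for the mean of per-session scores."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)
    half = z * math.sqrt(var / n)
    return mean - half, mean + half

def should_promote(challenger_scores, incumbent_scores, min_sessions=10):
    """Promote only when the challenger's CI lower bound beats the
    incumbent's CI upper bound -- i.e. the gap is statistically clear,
    not just a raw-average lead."""
    if len(challenger_scores) < min_sessions:
        return False
    lo_c, _ = mean_ci(challenger_scores)
    _, hi_i = mean_ci(incumbent_scores)
    return lo_c > hi_i
```

Note how a noisy challenger with a small average lead is rejected: its interval still overlaps the incumbent's.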
Promotion criteria (any of):
Block promotion if:
- A regression is detected (`stat-eval.py regression`)

If promoted:

- Update `current_best_A` / `current_best_B` in `prompt-variants.json`
- Update `prompt-library.md` with the winning prompt text

Aggregate metrics tell you which variant lost, not why. Before synthesizing a challenger, do a brief manual review so the challenger is grounded in actual failure modes rather than the winner's surface differences.
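The promotion bookkeeping above reduces to a small JSON update. This is a minimal sketch assuming `prompt-variants.json` keys variants by agent type with `current_best_A` / `current_best_B` fields; the exact field layout is the tracker's, not guaranteed here:

```python
import json
from pathlib import Path

def promote(path: Path, agent_type: str, slot: str, variant_id: str) -> None:
    """Point current_best_A or current_best_B for an agent type at the
    winning variant id and write the file back."""
    data = json.loads(path.read_text())
    data[agent_type][f"current_best_{slot}"] = variant_id
    path.write_text(json.dumps(data, indent=2))
```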
```bash
# Sample 20 traces (or all, if fewer) from the losing variant's event log
python3 ~/.claude/scripts/prompt-tracker.py sample-traces \
  <agent_type> <variant_id> --n 20
```
For each sampled trace, write ONE sentence about the earliest observable failure (the "open coding" pass — see Hamel Husain's error-analysis workflow for the technique). Stop at the first error per trace; downstream errors are usually consequences, not causes.
Aggregate the notes into error categories (3-7 buckets). Count each bucket. The top 2-3 buckets are the challenger's target — everything else is noise.
When to skip: If the loser has < 10 sessions, skip this step and just use aggregate metrics — there isn't enough signal to warrant manual review.
Why it matters: Aggregate F1 gives a correct answer to the wrong question. A challenger that only copies the winner's surface structure will often regress on cases the winner also handles poorly. Counting beats vibes.
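The open-coding tally is deliberately simple to keep the counting honest. A sketch, where bucket labels are whatever you wrote during the pass:

```python
from collections import Counter

def top_buckets(notes: dict, k: int = 3) -> list:
    """notes maps trace_id -> failure-bucket label (one per trace, from
    the open-coding pass). Returns the k most common buckets -- the
    challenger's targets; everything below the cut is treated as noise."""
    return Counter(notes.values()).most_common(k)
```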
For each losing variant:
Challenger generation prompt:
```
This explorer prompt scored avg F1={loser_f1} over {sessions} sessions.
The winning prompt scored avg F1={winner_f1}.

LOSING PROMPT:
{loser_prompt}

WINNING PROMPT:
{winner_prompt}

COMMONLY MISSED FILES BY LOSER:
{missed_files_list}

TOP FAILURE BUCKETS FROM TRACE-SAMPLING (Step 3.5):
{bucket_counts}   # e.g. "schema-miss: 7, wrong-file-scope: 5, no-test-context: 3"

Rewrite the losing prompt to better discover these file types.
Keep the same exploration scope and thinking budget.
The rewrite should be a single paragraph instruction.
Return ONLY the rewritten prompt text, nothing else.
```
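Filling the template is plain string formatting; the keyword names mirror the placeholders in the template above. The template here is trimmed to two placeholders purely for illustration:

```python
# Trimmed illustration -- the real template is the full block above.
TEMPLATE = (
    "This explorer prompt scored avg F1={loser_f1} over {sessions} sessions.\n"
    "TOP FAILURE BUCKETS FROM TRACE-SAMPLING (Step 3.5):\n{bucket_counts}\n"
    "Return ONLY the rewritten prompt text, nothing else."
)

def build_challenger_prompt(**fields) -> str:
    """Fill the challenger-generation template with tracker metrics."""
    return TEMPLATE.format(**fields)
```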
Files:
| File | Purpose |
|---|---|
| `memory/procedural/prompt-variants.json` | Variant definitions + aggregate metrics (all agent types) |
| `memory/episodic/exploration-events.jsonl` | Explorer outcome data (Phase 2 → Phase 5) |
| `memory/episodic/architect-events.jsonl` | Architect outcome data (Phase 4 → Phase 6) |
| `memory/episodic/reviewer-events.jsonl` | Reviewer outcome data (Phase 6) |
| `scripts/prompt-tracker.py` | CLI for selection, recording, metrics, reporting |
| `scripts/stat-eval.py` | Statistical analysis: CIs, consistency, regression, calibration, flakiness |
```
precision = files_found_and_used / files_found          (less noise)
recall    = files_found_and_used / total_files_needed   (fewer misses)
f1        = harmonic mean of precision and recall
score     = f1 * (1 - retry_rate)                       (penalize bad exploration)
```
Good exploration: High precision (found files were useful) AND high recall (didn't miss critical files). The F1 score balances both.
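The explorer formulas transcribe directly; function and argument names here are illustrative, not the tracker's API:

```python
def explorer_score(found, found_and_used, total_needed, retry_rate):
    """F1 over discovered files, penalized by retry rate."""
    precision = found_and_used / found         # less noise
    recall = found_and_used / total_needed     # fewer misses
    f1 = 2 * precision * recall / (precision + recall)
    return f1 * (1 - retry_rate)
```

For example, finding 10 files of which 8 were used, with all 8 needed files covered and no retries, gives F1 = 8/9 ≈ 0.89.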
```
selection   = 1.0 if user chose this proposal, 0.0 otherwise
convergence = 1.0 - (refinement_rounds / 3.0)             (fewer rounds = better)
quality     = 1.0 - critical_penalty - total_penalty      (fewer review issues = better)
score       = (selection * 0.4) + (quality * 0.35) + (convergence * 0.25)
```
Good architecture: Users choose it, the plan converges quickly, and Phase 6 review finds few critical issues.
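Same transcription for architects (names illustrative); a selected, zero-round, zero-penalty proposal scores the maximum 1.0:

```python
def architect_score(selected, refinement_rounds, critical_penalty, total_penalty):
    """Weighted blend of user selection, review quality, and convergence speed."""
    selection = 1.0 if selected else 0.0
    convergence = 1.0 - refinement_rounds / 3.0
    quality = 1.0 - critical_penalty - total_penalty
    return selection * 0.4 + quality * 0.35 + convergence * 0.25
```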
```
true_positive_rate = issues_fixed / issues_found            (found real problems)
signal_to_noise    = (found - false_positives) / found      (not just noise)
score              = true_positive_rate * signal_to_noise
```
Good review: High signal — issues found are real and worth fixing, not noise that wastes time.
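And for reviewers (names illustrative). Note the product form: a reviewer that finds 10 issues of which 8 get fixed and 2 are false positives scores 0.8 × 0.8 = 0.64, so noise is penalized multiplicatively, not additively:

```python
def reviewer_score(issues_found, issues_fixed, false_positives):
    """High only when found issues are both real (fixed) and low-noise."""
    tpr = issues_fixed / issues_found
    snr = (issues_found - false_positives) / issues_found
    return tpr * snr
```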
| Condition | Threshold |
|---|---|
| Minimum sessions per variant | 10 |
| F1 gap required for promotion | 0.05 |
| Maximum active variants per role | 2 |
| Challenger generation | Requires user approval |
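The numeric thresholds in the table act as a coarse gate before the CI check in `stat-eval.py`; this sketch (field names assumed, not the tracker's schema) shows the gate only:

```python
MIN_SESSIONS = 10   # minimum sessions per variant
F1_GAP = 0.05       # raw F1 gap required before the CI check even runs

def eligible_for_promotion(challenger: dict, incumbent: dict) -> bool:
    """Coarse threshold gate; stat-eval.py's CI check is applied on top."""
    return (challenger["sessions"] >= MIN_SESSIONS
            and incumbent["sessions"] >= MIN_SESSIONS
            and challenger["f1"] - incumbent["f1"] >= F1_GAP)
```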
- Run `prompt-tracker.py select explorer <category> <role>` before dispatching explorers
- Query the `get_prompt_performance` tool (accepts an optional `agent_type` filter)
- Run `/claude-flow` on a real task; it auto-records explorer, architect, and reviewer performance events.
- Check `~/.claude/memory/prompt-variants.json` for variant win rates and F1 scores.
- Run `/prompt-optimization` after 10+ sessions per variant to trigger automatic promotion.