Evaluate whether a skill routes correctly and produces better output than a baseline — measuring precision, recall, output quality, and win rate against no-skill runs. Use this when validating a new skill before promotion, verifying a refinement worked, or auditing whether an existing skill still adds value. Do not use for building the test harness itself (use skill-testing-harness), benchmarking multiple variants head-to-head (use skill-benchmarking), or debugging a skill that is obviously broken (fix it first with skill-refinement).
Measures whether a skill is working: does it route correctly (triggering when it should and staying silent when it shouldn't), and does it produce better output than a run without the skill? Produces quantitative evidence that the skill adds value.
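A minimal sketch of how the routing half of this measurement could be computed, assuming each test prompt is labeled with whether the skill should trigger and whether it actually did. The `RoutingCase` type and `routing_metrics` function are hypothetical names for illustration; the 95% and 90% thresholds are the targets from the Routing Accuracy table in the output template.

```python
from dataclasses import dataclass


@dataclass
class RoutingCase:
    should_trigger: bool  # expected: the skill ought to fire for this prompt
    did_trigger: bool     # observed: the skill actually fired


def routing_metrics(cases):
    """Compute routing precision and recall over labeled cases (illustrative helper)."""
    tp = sum(c.should_trigger and c.did_trigger for c in cases)      # correct triggers
    fp = sum(not c.should_trigger and c.did_trigger for c in cases)  # false positives
    fn = sum(c.should_trigger and not c.did_trigger for c in cases)  # false negatives
    # Assumption: an empty denominator is reported as 0.0 rather than raising.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "precision_pass": precision >= 0.95,  # target from the template table
        "recall_pass": recall >= 0.90,        # target from the template table
    }
```

False positives lower precision (the skill fired when it shouldn't have), while false negatives lower recall (the skill stayed silent when it should have fired); both kinds of miss belong under **Issues** in the routing section.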
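Use when:
- Validating a new skill before promotion
- Verifying that a refinement worked
- Auditing whether an existing skill still adds value

Do NOT use when:
- Building the test harness itself (use skill-testing-harness)
- … (use skill-eval-runner)
- Benchmarking multiple variants head-to-head (use skill-benchmarking)
- Debugging a skill that is obviously broken (fix it first with skill-refinement)

## Skill Evaluation: [skill-name]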
### Routing Accuracy
| Metric | Value | Target | Pass? |
|--------|-------|--------|-------|
| Precision | X% | ≥95% | ✓/✗ |
| Recall | X% | ≥90% | ✓/✗ |
**Issues**: [false negatives/positives]
### Output Quality (N cases)
**Score**: X/N pass (Y%)
### Baseline Comparison
**Win rate**: X/N (Y%)
### Verdict: [Pass | Fail | Needs Work]
**Issues**: [list or "None"]
skill-refinement
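The Baseline Comparison line of the template can be filled in along these lines. This is a sketch under stated assumptions: each paired comparison is recorded as `"skill"`, `"baseline"`, or `"tie"`, and ties count as non-wins. Neither the outcome labels nor the tie rule comes from the template, and `win_rate` and `format_score` are hypothetical helper names.

```python
def win_rate(outcomes):
    """Return (wins, total, percent) for skill-vs-baseline comparisons.

    Assumption: each outcome is "skill", "baseline", or "tie",
    and a tie is not counted as a win for the skill.
    """
    wins = sum(o == "skill" for o in outcomes)
    total = len(outcomes)
    percent = 100 * wins / total if total else 0.0
    return wins, total, percent


def format_score(label, x, n, percent):
    """Render a template line such as '**Win rate**: X/N (Y%)'."""
    return f"**{label}**: {x}/{n} ({percent:.0f}%)"


# Example: four paired runs, two won by the skill.
wins, total, percent = win_rate(["skill", "skill", "baseline", "tie"])
line = format_score("Win rate", wins, total, percent)
```

Counting ties as non-wins is the conservative choice: the skill has to beat the no-skill run outright to earn credit.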