Compare skill variants or before-and-after versions using pass rate, token usage, latency, and qualitative win rate. Use this when choosing between multiple skill variants, measuring whether a refinement helped, or justifying skill maintenance investment. Do not use for evaluating a single skill in isolation (use skill-evaluation) or for building test infrastructure (use skill-testing-harness).
Compares skill variants or before/after versions using three quantitative metrics (pass rate, token usage, and latency) plus a qualitative win rate. Produces the data needed to decide which variant to keep, whether a refinement helped, or whether a skill is worth maintaining.
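As a rough illustration of how these metrics might be aggregated, here is a minimal Python sketch. The `Run` record, its field names, and the `"tie"` label for undecided pairwise judgments are assumptions made for this example; they are not part of any existing harness.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Run:
    """One execution of a skill variant on one test case (hypothetical shape)."""
    passed: bool
    tokens: int
    latency_s: float


def summarize(runs):
    """Aggregate the quantitative metrics for a single variant."""
    return {
        "pass_rate": sum(r.passed for r in runs) / len(runs),
        "avg_tokens": mean(r.tokens for r in runs),
        "avg_latency_s": mean(r.latency_s for r in runs),
    }


def win_rate(preferences, variant):
    """Share of decided pairwise judgments won by `variant`.

    `preferences` holds one label per head-to-head comparison,
    e.g. "A", "B", or "tie". Ties are excluded from the denominator.
    """
    decided = [p for p in preferences if p != "tie"]
    if not decided:
        return 0.0
    return sum(p == variant for p in decided) / len(decided)
```

Excluding ties from the denominator is one possible convention; counting a tie as half a win for each side is an equally reasonable alternative, as long as the report states which one was used.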
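Use when:
- Choosing between multiple skill variants
- Measuring whether a refinement helped
- Justifying skill maintenance investment

Do NOT use when:
- Evaluating a single skill in isolation (use skill-evaluation)
- Building test infrastructure (use skill-testing-harness)
- Improving a skill rather than comparing versions of it (use skill-refinement)

## Benchmark: [Skill A] vs [Skill B]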
### Summary
| Metric | A | B | Winner |
|--------|---|---|--------|
| Pass Rate | 85% | 92% | B |
| Avg Tokens | 1200 | 980 | B |
| Win Rate | 35% | 65% | B |
### Detailed Results
[By category if applicable]
### Statistical Notes
- Pass rate difference: p=X
- Win rate: significantly > 50%? Yes/No
### Recommendation
**Keep [winner]**, [deprecate/archive] [loser].
Rationale: [why]
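The two entries under Statistical Notes can be filled in with standard tests. The sketch below is one possible approach using only the Python standard library: a pooled two-proportion z-test for the pass rate difference, and a one-sided exact binomial test for whether the win rate exceeds 50%. The choice of tests and the function names are assumptions made for illustration; the template does not prescribe a particular method.

```python
import math


def two_proportion_p_value(pass_a, n_a, pass_b, n_b):
    """Two-sided p-value for the difference between two pass rates.

    Uses a pooled two-proportion z-test (normal approximation),
    which is only reasonable when both samples are fairly large.
    """
    pooled = (pass_a + pass_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # All runs passed or all failed: no observable difference.
        return 1.0
    z = (pass_b / n_b - pass_a / n_a) / se
    # For a standard normal Z, P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


def win_rate_p_value(wins, decided):
    """One-sided exact binomial p-value for 'win rate > 50%'.

    Returns P(X >= wins) for X ~ Binomial(decided, 0.5), where
    `decided` is the number of non-tied pairwise comparisons.
    """
    tail = sum(math.comb(decided, k) for k in range(wins, decided + 1))
    return tail / 2 ** decided
```

For example, `two_proportion_p_value(85, 100, 92, 100)` tests the pass rates from the summary table under the assumption of 100 test cases per variant, and `win_rate_p_value(65, 100)` tests a 65% win rate over 100 decided comparisons. Both sample sizes are hypothetical. With small samples, an exact test such as Fisher's is usually a safer choice than the z-test.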