Run trigger tests, behavior tests, and baseline comparisons for a skill's eval suite, then produce a structured quality verdict. Use when a skill has been modified and needs regression testing, when CI/pre-release validation requires documented eval results, or when measuring quality before catalog inclusion. Do not use when no evals exist yet (build them first) or for manual evaluation without test files.
Executes a skill's eval suite and produces a structured quality report.
Check for eval files in the skill directory:
- evals/triggers.yaml — trigger accuracy tests
- evals/outputs.yaml — behavior correctness tests
- evals/baselines.yaml — baseline comparison tests

Note which exist and which are missing.
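A minimal sketch of this existence check, assuming the `evals/` layout listed above (the `skill_dir` argument is a placeholder for the skill's root directory):

```python
from pathlib import Path

# Relative paths named in the skill spec above
EVAL_FILES = {
    "triggers": "evals/triggers.yaml",   # trigger accuracy tests
    "outputs": "evals/outputs.yaml",     # behavior correctness tests
    "baselines": "evals/baselines.yaml", # baseline comparison tests
}

def check_eval_files(skill_dir: str) -> dict:
    """Report which eval suites exist and which are missing."""
    root = Path(skill_dir)
    present = {name for name, rel in EVAL_FILES.items() if (root / rel).is_file()}
    return {
        "present": sorted(present),
        "missing": sorted(set(EVAL_FILES) - present),
    }
```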
For each case in triggers.yaml:
Run the prompt, check whether the skill actually triggered, and record: prompt, expected, actual, Pass/Fail.
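The precision and recall reported later can be computed from these records. A sketch, assuming each record carries boolean `expected` and `actual` fields where `True` means "should trigger" / "did trigger" (the record shape is an assumption, not part of the spec):

```python
def trigger_metrics(results: list[dict]) -> tuple[float, float]:
    """Precision and recall over trigger test results.

    Assumed record shape: {"prompt": ..., "expected": bool, "actual": bool}.
    """
    tp = sum(1 for r in results if r["expected"] and r["actual"])
    fp = sum(1 for r in results if not r["expected"] and r["actual"])
    fn = sum(1 for r in results if r["expected"] and not r["actual"])
    # Degenerate cases (no positives) count as perfect rather than dividing by zero
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall
```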
For each case in outputs.yaml:
Check each of:

- expected_sections
- required_patterns
- forbidden_patterns

Record: test name, checks passed/total, Pass/Fail.
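One way to sketch these per-case checks, assuming each case is a dict with optional lists under the three keys above, sections matched as literal substrings and patterns as regexes (these matching rules are assumptions):

```python
import re

def check_output(output: str, case: dict) -> tuple[int, int]:
    """Return (checks passed, total checks) for one outputs.yaml case."""
    checks = []
    for section in case.get("expected_sections", []):
        checks.append(section in output)                      # section present
    for pat in case.get("required_patterns", []):
        checks.append(re.search(pat, output) is not None)     # pattern found
    for pat in case.get("forbidden_patterns", []):
        checks.append(re.search(pat, output) is None)         # pattern absent
    return sum(checks), len(checks)
```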
For each case in baselines.yaml:
Run the prompt with the skill and against the baseline (without the skill), compare the two outputs, and record which one wins.

Calculate the overall rates (trigger precision/recall, output pass rate) and the baseline win/lose result, then apply:
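A sketch of the baseline decision, under the assumption that each graded case records a `"winner"` of `"skill"`, `"baseline"`, or `"tie"` (this field name is hypothetical), and that the skill "adds value" when it wins strictly more cases than the baseline:

```python
def skill_beats_baseline(cases: list[dict]) -> bool:
    """True when the skill wins more baseline cases than it loses."""
    skill_wins = sum(1 for c in cases if c["winner"] == "skill")
    baseline_wins = sum(1 for c in cases if c["winner"] == "baseline")
    return skill_wins > baseline_wins
```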
| Verdict | Criteria |
|---|---|
| Pass | All rates ≥80% AND baseline win |
| Pass with issues | Any rate 60-79% |
| Fail | Any rate <60% OR baseline lose |
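The verdict table above maps directly onto a small function. A sketch, taking the fractional rates (precision, recall, output pass rate) plus the baseline result:

```python
def verdict(rates: list[float], beats_baseline: bool) -> str:
    """Apply the verdict criteria table to the computed rates."""
    if min(rates) < 0.60 or not beats_baseline:
        return "Fail"                 # any rate <60% OR baseline lose
    if min(rates) < 0.80:
        return "Pass with issues"     # any rate 60-79%
    return "Pass"                     # all rates >=80% AND baseline win
```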
Produce the final report in this format:

## Eval Report: [skill-name]
Date: [YYYY-MM-DD]
### Trigger Tests
| Prompt | Type | Expected | Actual | Result |
|--------|------|----------|--------|--------|
Precision: X% Recall: Y%
### Output Tests
| Test | Checks Passed | Result |
|------|--------------|--------|
Pass rate: X%
### Baseline Comparison
Skill adds value: [Yes/No]
### Verdict: [Pass | Pass with issues | Fail]
Issues: [list or "None"]