Builds automated evaluation suites to measure skill quality over time. Activates when teams need pre-deploy validation, continuous monitoring, or metric-driven assessment of trigger accuracy and output quality.
This skill provides templates and patterns for testing other skills using automated evaluations (evals).
Use this skill when the user needs pre-deploy validation, continuous monitoring, or metric-driven assessment of trigger accuracy and output quality.
For skills with predictable output, create `scripts/eval_code_grader.py`:
```python
# scripts/eval_code_grader.py
"""
Code-based grader for [skill-name].
Checks deterministic properties of skill output.

Usage: python scripts/eval_code_grader.py <result.json>
Exit 0 = PASS, Exit 1 = FAIL
"""
import json
import sys

with open(sys.argv[1]) as f:
    result = json.load(f)

checks = {
    "has_output": "output" in result,
    "valid_format": isinstance(result.get("data"), dict),
    "no_errors": len(result.get("errors", [])) == 0,
    # Add skill-specific checks:
    # "required_fields": all(f in result["data"] for f in REQUIRED),
}

passed = all(checks.values())
print(json.dumps({"passed": passed, "checks": checks}, indent=2))
sys.exit(0 if passed else 1)
```
For skills where output quality is subjective, create `references/eval_rubric.md`:
```markdown
## Eval Rubric for [skill-name]

Rate output on a 1-5 scale for each criterion:

1. **Completeness** — Does the output cover all required aspects?
2. **Accuracy** — Are the facts and recommendations correct?
3. **Specificity** — Are suggestions actionable (not generic)?
4. **Format** — Does the output match the expected structure?

### Scoring

- Score >= 4 on ALL criteria = PASS
- Any score < 3 = FAIL
- Score 3 on any criterion = REVIEW NEEDED
```
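The scoring rules above can be encoded as a small helper so judge scores are mapped to a verdict consistently. This is a sketch; `score_rubric` and the criterion keys are illustrative names, not part of the skill:

```python
# Hypothetical helper encoding the rubric's verdict rules.
CRITERIA = ["completeness", "accuracy", "specificity", "format"]

def score_rubric(scores):
    """Map per-criterion scores (1-5) to PASS / FAIL / REVIEW NEEDED."""
    values = [scores[c] for c in CRITERIA]
    if any(v < 3 for v in values):
        return "FAIL"          # any score below 3 fails outright
    if all(v >= 4 for v in values):
        return "PASS"          # 4+ on every criterion
    return "REVIEW NEEDED"     # at least one criterion scored exactly 3
```

For example, `score_rubric({"completeness": 5, "accuracy": 4, "specificity": 3, "format": 4})` returns `"REVIEW NEEDED"`.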
For each skill change:
1. Run code-based graders → pass/fail
2. Run LLM-as-judge with rubric → score
3. Compare against baseline metrics
4. If fail OR regression → block deploy
5. Log results for trend analysis
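The steps above can be sketched as a single gate function. This is an illustration, not a prescribed implementation; the function name and the shape of the log entries are assumptions:

```python
# Hypothetical deploy gate: block on grader failure or baseline regression,
# and log every run for trend analysis (steps 1-5 above).
def deploy_gate(grader_passed, judge_score, baseline_score, log):
    """Return True only when the change may deploy."""
    ok = grader_passed and judge_score >= baseline_score
    log.append({
        "grader_passed": grader_passed,
        "judge_score": judge_score,
        "baseline_score": baseline_score,
        "deployed": ok,
    })
    return ok
```

A failing grader or a judge score below the recorded baseline both block the deploy; either way the run is appended to the log so trends remain visible.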
```bash
# Code-based eval
python scripts/eval_code_grader.py test_output.json

# LLM-as-judge (manual):
# paste the rubric from references/eval_rubric.md plus the skill output
# into Claude/Gemini and ask for scoring
```
| Metric | Target | How to Measure |
|---|---|---|
| Trigger rate | > 80% | Fraction of 10 relevant prompts that activate the skill |
| False trigger rate | < 10% | Fraction of 5 irrelevant prompts that wrongly activate it |
| Grader pass rate | > 90% | Automated eval runs |
| Token consumption | Stable ±10% | API logs |
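The first two rows of the table can be computed from a set of trial prompts. A minimal sketch, assuming the per-prompt `triggered` booleans come from your own manual or logged runs:

```python
# Hypothetical metric computation over trial-prompt outcomes.
def trigger_metrics(relevant_triggered, irrelevant_triggered):
    """Trigger rate over relevant prompts; false trigger rate over irrelevant ones."""
    trigger_rate = sum(relevant_triggered) / len(relevant_triggered)
    false_rate = sum(irrelevant_triggered) / len(irrelevant_triggered)
    return {
        "trigger_rate": trigger_rate,
        "false_trigger_rate": false_rate,
        # Targets from the table: > 80% trigger rate, < 10% false triggers.
        "pass": trigger_rate > 0.8 and false_rate < 0.1,
    }
```

For example, 9 of 10 relevant prompts triggering and 0 of 5 irrelevant prompts triggering gives a 90% trigger rate, a 0% false trigger rate, and an overall pass.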
For each critical step, define both the mandatory action and the failure-handling behavior.
Use this pattern in workflow steps:
```markdown
### Step X: [Critical Step Name]

Run: [exact command or procedure]

**MANDATORY:** Always run this step.
**DO NOT:** Skip, shortcut, or assume success without evidence.

If this step fails, fix the issue and re-run before continuing.
```
STOP and re-evaluate if any of these occur:
Prevent infinite eval retry loops:
Avoid: eval fails → tweak output → eval fails → tweak output → ... without ever addressing the root cause.
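A hard retry cap keeps that loop bounded. This is a sketch; `run_eval`, `fix_output`, and the cap value are placeholders for your own steps:

```python
MAX_EVAL_RETRIES = 3  # assumption: tune per skill

def eval_with_retry_cap(run_eval, fix_output, max_retries=MAX_EVAL_RETRIES):
    """Retry a failing eval a bounded number of times, then stop and escalate."""
    for attempt in range(max_retries):
        if run_eval():
            return True
        fix_output(attempt)  # must address the root cause, not just the symptom
    return False  # cap reached: stop tweaking and re-evaluate the approach
```

When the cap is reached the caller gets a clean `False` instead of another silent retry, which is the point at which to stop and re-evaluate rather than keep tweaking.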
- `scripts/eval_code_grader.py` — deterministic output checker
- `references/eval_rubric.md` — LLM-as-judge scoring criteria