Objective eval metrics via code/model/human graders with pass@k/pass^k scoring. USE WHEN eval, evaluate, test agent, benchmark, verify behavior, regression test, capability test, run eval, compare models, compare prompts, create judge, create use case, view results, failure to task, suite manager, transcript capture, trial runner.
Before executing, check for user customizations at:

`~/.claude/PAI/USER/SKILLCUSTOMIZATIONS/Evals/`

If this directory exists, load and apply any PREFERENCES.md, configurations, or resources found there. These override default behavior. If the directory does not exist, proceed with skill defaults.
You MUST send this notification BEFORE doing anything else when this skill is invoked.

Send the voice notification:

```bash
curl -s -X POST http://localhost:8888/notify \
  -H "Content-Type: application/json" \
  -d '{"message": "Running the WORKFLOWNAME workflow in the Evals skill to ACTION"}' \
  > /dev/null 2>&1 &
```

Output the text notification:

Running the **WorkflowName** workflow in the **Evals** skill to ACTION...

This is not optional. Execute the curl command immediately upon skill invocation.
Comprehensive agent evaluation system based on Anthropic's "Demystifying Evals for AI Agents" (Jan 2026).
Key differentiator: Evaluates agent workflows (transcripts, tool calls, multi-turn conversations), not just single outputs.
| Type | Strengths | Weaknesses | Use For |
|---|---|---|---|
| Code-based | Fast, cheap, deterministic, reproducible | Brittle, lacks nuance | Tests, state checks, tool verification |
| Model-based | Flexible, captures nuance, scalable | Non-deterministic, expensive | Quality rubrics, assertions, comparisons |
| Human | Gold standard, handles subjectivity | Expensive, slow | Calibration, spot checks, A/B testing |
| Type | Pass Target | Purpose |
|---|---|---|
| Capability | ~70% | Stretch goals, measuring improvement potential |
| Regression | ~99% | Quality gates, detecting backsliding |
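The pass@k and pass^k scores mentioned above can be computed from repeated trials of the same task. The sketch below is illustrative (the function names are assumptions, not this skill's API): it uses the standard unbiased pass@k estimator `1 - C(n-c, k) / C(n, k)` for "at least one of k trials passes", and the empirical `(c/n)^k` for "all k trials pass".

```typescript
// n = recorded trials, c = trials that passed, k = samples drawn.

// pass@k: probability that at least one of k sampled trials passes.
// Computed as 1 - C(n-c, k) / C(n, k), expanded as a running product
// to avoid large factorials.
function passAtK(n: number, c: number, k: number): number {
  if (n - c < k) return 1; // every size-k subset must contain a pass
  let probAllFail = 1;
  for (let i = 0; i < k; i++) {
    probAllFail *= (n - c - i) / (n - i);
  }
  return 1 - probAllFail;
}

// pass^k: probability that all k sampled trials pass (empirical estimate).
function passPowK(n: number, c: number, k: number): number {
  return Math.pow(c / n, k);
}
```

Note how the two metrics suit the two suite types: capability suites care about pass@k (can the agent do it at all?), while regression suites care about pass^k (does it do it reliably, every time?).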
| Request Pattern | Route To |
|---|---|
| Run eval, evaluate suite, run tests, benchmark | Workflows/RunEval.md |
| Compare models, model comparison, A/B test models | Workflows/CompareModels.md |
| Compare prompts, prompt comparison, test prompts | Workflows/ComparePrompts.md |
| Create judge, model grader, evaluation judge | Workflows/CreateJudge.md |
| Create use case, new eval, test case, create suite | Workflows/CreateUseCase.md |
| View results, eval results, scores, pass rate | Workflows/ViewResults.md |
| Trigger | Tool |
|---|---|
| Run suite | Tools/AlgorithmBridge.ts |
| Log failure | Tools/FailureToTask.ts log |
| Convert failures | Tools/FailureToTask.ts convert-all |
| Create suite | Tools/SuiteManager.ts create |
| Check saturation | Tools/SuiteManager.ts check-saturation |
```bash
# Run an eval suite
bun run ~/.claude/skills/Utilities/Evals/Tools/AlgorithmBridge.ts -s <suite>

# Log a failure for later conversion
bun run ~/.claude/skills/Utilities/Evals/Tools/FailureToTask.ts log "description" -c category -s severity

# Convert failures to test tasks
bun run ~/.claude/skills/Utilities/Evals/Tools/FailureToTask.ts convert-all

# Manage suites
bun run ~/.claude/skills/Utilities/Evals/Tools/SuiteManager.ts create <name> -t capability -d "description"
bun run ~/.claude/skills/Utilities/Evals/Tools/SuiteManager.ts list
bun run ~/.claude/skills/Utilities/Evals/Tools/SuiteManager.ts check-saturation <name>
bun run ~/.claude/skills/Utilities/Evals/Tools/SuiteManager.ts graduate <name>
```
Evals serve as a verification method for THE ALGORITHM ISC rows:

```bash
# Run eval and update the ISC row
bun run ~/.claude/skills/Utilities/Evals/Tools/AlgorithmBridge.ts -s regression-core -r 3 -u
```
ISC rows can specify eval verification:
| # | What Ideal Looks Like | Verify |
|---|----------------------|--------|
| 1 | Auth bypass fixed | eval:auth-security |
| 2 | Tests all pass | eval:regression |
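A small sketch of how the `eval:` prefix in a Verify cell could be parsed (this helper is hypothetical, not part of the shipped tools): cells like `eval:auth-security` name the suite to run; anything without the prefix is not an eval verification.

```typescript
// Hypothetical parser for ISC Verify cells. Returns the suite name for
// cells of the form "eval:<suite>", or null for non-eval verifications.
function parseEvalVerify(cell: string): string | null {
  const match = cell.trim().match(/^eval:([\w-]+)$/);
  return match ? match[1] : null;
}
```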
| Grader | Use Case |
|---|---|
| string_match | Exact substring matching |
| regex_match | Pattern matching |
| binary_tests | Run test files |
| static_analysis | Lint, type-check, security scan |
| state_check | Verify system state after execution |
| tool_calls | Verify specific tools were called |
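To make the code-based grader idea concrete, here is a minimal sketch of a `tool_calls`-style check (the interface and function are illustrative assumptions, not the skill's actual grader): given the tool calls recorded in an agent transcript, verify that every expected tool was invoked.

```typescript
// Illustrative shape of a tool_calls grader: deterministic, cheap, and
// operates on the transcript rather than the final output.
interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}

function gradeToolCalls(transcript: ToolCall[], expected: string[]): boolean {
  const called = new Set(transcript.map(t => t.name));
  return expected.every(name => called.has(name));
}
```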
| Grader | Use Case |
|---|---|
| llm_rubric | Score against a detailed rubric |
| natural_language_assert | Check that assertions hold |
| pairwise_comparison | Compare to a reference, with position swap |
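The position swap in pairwise comparison counters the order bias LLM judges often show. A minimal sketch, with the judge stubbed out (the types and names here are assumptions, not the shipped grader): judge twice with the candidate/reference order swapped, and only count a win or loss when both orderings agree.

```typescript
// A judge picks whichever answer it prefers of the two it is shown.
type Judge = (first: string, second: string) => "first" | "second";

function pairwiseWithSwap(
  judge: Judge,
  candidate: string,
  reference: string,
): "win" | "loss" | "inconsistent" {
  const a = judge(candidate, reference); // candidate shown first
  const b = judge(reference, candidate); // candidate shown second
  if (a === "first" && b === "second") return "win";   // candidate preferred both times
  if (a === "second" && b === "first") return "loss";  // reference preferred both times
  return "inconsistent"; // preference flipped with position: position bias
}
```

A judge that always picks the first answer shown, for example, yields "inconsistent" rather than a spurious 50% win rate.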
Pre-configured grader stacks for common agent types:
| Domain | Primary Graders |
|---|---|
coding | binary_tests + static_analysis + tool_calls + llm_rubric |
conversational | llm_rubric + natural_language_assert + state_check |
research | llm_rubric + natural_language_assert + tool_calls |
computer_use | state_check + tool_calls + llm_rubric |
See Data/DomainPatterns.yaml for full configurations.