Builds automated evaluation suites to measure skill quality over time. Activates when teams need pre-deploy validation, continuous monitoring, or metric-driven assessment of trigger accuracy and output quality.
This skill provides templates and patterns for testing other skills using automated evaluations (evals).
Use this skill when the user needs pre-deploy validation, continuous monitoring, or metric-driven assessment of trigger accuracy and output quality.
For skills with predictable output, create `scripts/eval_code_grader.py`:
```python
# scripts/eval_code_grader.py
"""
Code-based grader for [skill-name].
Checks deterministic properties of skill output.

Usage: python scripts/eval_code_grader.py <result.json>
Exit 0 = PASS, Exit 1 = FAIL
"""
import json
import sys

with open(sys.argv[1]) as f:
    result = json.load(f)

checks = {
    "has_output": "output" in result,
    "valid_format": isinstance(result.get("data"), dict),
    "no_errors": len(result.get("errors", [])) == 0,
    # Add skill-specific checks:
    # "required_fields": all(f in result["data"] for f in REQUIRED),
}

passed = all(checks.values())
print(json.dumps({"passed": passed, "checks": checks}, indent=2))
sys.exit(0 if passed else 1)
```
For skills where output quality is subjective, create `references/eval_rubric.md`:
```markdown
## Eval Rubric for [skill-name]

Rate output on a 1-5 scale for each criterion:

1. **Completeness** — Does the output cover all required aspects?
2. **Accuracy** — Are the facts and recommendations correct?
3. **Specificity** — Are suggestions actionable (not generic)?
4. **Format** — Does the output match the expected structure?

### Scoring

- Score >= 4 on ALL criteria = PASS
- Any score < 3 = FAIL
- Score 3 on any criterion = REVIEW NEEDED
```
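The scoring rules above can be encoded as a small helper so judge scores are mapped to a verdict consistently. This is a sketch; `score_rubric` and the criterion keys are illustrative names, not part of the skill:

```python
# Hypothetical helper encoding the rubric's verdict rules.
CRITERIA = ["completeness", "accuracy", "specificity", "format"]

def score_rubric(scores):
    """Map per-criterion scores (1-5) to PASS / FAIL / REVIEW NEEDED."""
    values = [scores[c] for c in CRITERIA]
    if any(v < 3 for v in values):
        return "FAIL"          # any score below 3 fails outright
    if all(v >= 4 for v in values):
        return "PASS"          # 4+ on every criterion
    return "REVIEW NEEDED"     # at least one criterion scored exactly 3
```

For example, `score_rubric({"completeness": 5, "accuracy": 4, "specificity": 3, "format": 4})` returns `"REVIEW NEEDED"`.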
For each skill change:
1. Run code-based graders → pass/fail
2. Run LLM-as-judge with rubric → score
3. Compare against baseline metrics
4. If fail OR regression → block deploy
5. Log results for trend analysis
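The steps above can be sketched as a single gate function. This is an illustration, not a prescribed implementation; the function name and the shape of the log entries are assumptions:

```python
# Hypothetical deploy gate: block on grader failure or baseline regression,
# and log every run for trend analysis (steps 1-5 above).
def deploy_gate(grader_passed, judge_score, baseline_score, log):
    """Return True only when the change may deploy."""
    ok = grader_passed and judge_score >= baseline_score
    log.append({
        "grader_passed": grader_passed,
        "judge_score": judge_score,
        "baseline_score": baseline_score,
        "deployed": ok,
    })
    return ok
```

A failing grader or a judge score below the recorded baseline both block the deploy; either way the run is appended to the log so trends remain visible.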
```bash
# Code-based eval
python scripts/eval_code_grader.py test_output.json

# LLM-as-judge (manual):
# paste the rubric from references/eval_rubric.md plus the skill output
# into Claude/Gemini and ask for scoring
```
| Metric | Target | How to Measure |
|---|---|---|
| Trigger rate | > 80% | Fraction of 10 relevant prompts that activate the skill |
| False trigger rate | < 10% | Fraction of 5 irrelevant prompts that wrongly activate it |
| Grader pass rate | > 90% | Automated eval runs |
| Token consumption | Stable ±10% | API logs |
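The first two rows of the table can be computed from a set of trial prompts. A minimal sketch, assuming the per-prompt `triggered` booleans come from your own manual or logged runs:

```python
# Hypothetical metric computation over trial-prompt outcomes.
def trigger_metrics(relevant_triggered, irrelevant_triggered):
    """Trigger rate over relevant prompts; false trigger rate over irrelevant ones."""
    trigger_rate = sum(relevant_triggered) / len(relevant_triggered)
    false_rate = sum(irrelevant_triggered) / len(irrelevant_triggered)
    return {
        "trigger_rate": trigger_rate,
        "false_trigger_rate": false_rate,
        # Targets from the table: > 80% trigger rate, < 10% false triggers.
        "pass": trigger_rate > 0.8 and false_rate < 0.1,
    }
```

For example, 9 of 10 relevant prompts triggering and 0 of 5 irrelevant prompts triggering gives a 90% trigger rate, a 0% false trigger rate, and an overall pass.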
For each critical step, define both the mandatory action and the failure-handling behavior.
Use this pattern in workflow steps:
```markdown
### Step X: [Critical Step Name]

Run: [exact command or procedure]

**MANDATORY:** Always run this step.
**DO NOT:** Skip, shortcut, or assume success without evidence.

If this step fails, fix the issue and re-run before continuing.
```
STOP and re-evaluate if any of these occur:
Prevent infinite eval retry loops:
Avoid: eval fails → tweak output → eval fails → tweak output → ... without ever addressing the root cause.
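A hard retry cap keeps that loop bounded. This is a sketch; `run_eval`, `fix_output`, and the cap value are placeholders for your own steps:

```python
MAX_EVAL_RETRIES = 3  # assumption: tune per skill

def eval_with_retry_cap(run_eval, fix_output, max_retries=MAX_EVAL_RETRIES):
    """Retry a failing eval a bounded number of times, then stop and escalate."""
    for attempt in range(max_retries):
        if run_eval():
            return True
        fix_output(attempt)  # must address the root cause, not just the symptom
    return False  # cap reached: stop tweaking and re-evaluate the approach
```

When the cap is reached the caller gets a clean `False` instead of another silent retry, which is the point at which to stop and re-evaluate rather than keep tweaking.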
- `scripts/eval_code_grader.py` — deterministic output checker
- `references/eval_rubric.md` — LLM-as-judge scoring criteria