Use when evaluating whether a skill measurably improves outputs by running controlled comparisons — with-skill vs without-skill, old version vs new version, or A/B tests across skill variants — where quality is graded on produced artifacts like code, docs, migrations, or commit messages
Test skills by comparing output quality with and without the skill. Run the same prompt both ways, grade the difference, and iterate until the skill consistently produces better results.
Core principle: If you can't measure the output quality difference, you don't know if the skill helps. Subjective impressions are not evidence. Run structured comparisons, grade with assertions, and benchmark the delta.
Use when testing:
Do NOT use for:
For quick tests (see skillproof:skill-testing-discipline and the skillproof:skill-tdd testing tiers), reduce scope:
Quick mode is for cheap validation of a hypothesis, not rigorous comparison. Use the full process below when you need quantitative evidence.
Create 2-3 realistic test prompts — the kind of thing a real user would actually say. Share them with the user for review.
Save test cases to evals/evals.json:
```json
{
  "skill_name": "example-skill",
  "evals": [
    {
      "id": 1,
      "prompt": "User's task prompt",
      "expected_output": "Description of expected result",
      "files": []
    }
  ]
}
```
Don't write assertions yet — just prompts. You'll draft assertions while runs are in progress.
See schemas.md for the full schema.
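As a sketch, the file can be written and sanity-checked programmatically. Field names follow the example above; `schemas.md` remains authoritative, and `save_evals` is an illustrative helper, not part of the plugin's scripts:

```python
import json

# Minimal evals.json matching the example above (schemas.md is authoritative).
evals = {
    "skill_name": "example-skill",
    "evals": [
        {
            "id": 1,
            "prompt": "User's task prompt",
            "expected_output": "Description of expected result",
            "files": [],
        }
    ],
}

def save_evals(path, data):
    # Sanity-check before writing: every eval needs an id and a prompt.
    for e in data["evals"]:
        assert {"id", "prompt"} <= e.keys(), f"incomplete eval: {e}"
    with open(path, "w") as f:
        json.dump(data, f, indent=2)
```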
For each test case, spawn TWO subagents simultaneously — one with the skill, one without. Launch everything at once so it all finishes around the same time.
With-skill run:
Execute this task:
- Skill path: <path-to-skill>
- Task: <eval prompt>
- Input files: <eval files if any>
- Save outputs to: <workspace>/iteration-<N>/eval-<ID>/with_skill/outputs/
Baseline run: same task, no skill. Save outputs to `without_skill/outputs/`.

Old-vs-new comparison: snapshot the old skill first (`cp -r`), point the baseline run at the snapshot, and save to `old_skill/outputs/`.

Write an `eval_metadata.json` for each test case:
```json
{
  "eval_id": 0,
  "eval_name": "descriptive-name-here",
  "prompt": "The user's task prompt",
  "assertions": []
}
```
Give each eval a descriptive name based on what it tests, not just "eval-0".
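One way to derive such names (a sketch; the exact naming scheme is up to you, and `eval_dirname` is a hypothetical helper):

```python
import re

def eval_dirname(name, eval_id=0):
    # Turn "Chart axis labels" into "eval-chart-axis-labels";
    # fall back to the numeric id if the name slugs to nothing.
    slug = re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")
    return f"eval-{slug or eval_id}"
```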
While runs are in progress, draft quantitative assertions: specific, objective checks that a grader can verify against the produced artifacts.
Subjective skills (writing style, design quality) are better evaluated qualitatively. Don't force assertions onto things that need human judgment.
Update eval_metadata.json and evals/evals.json with assertions.
When each subagent task completes, the notification contains total_tokens and duration_ms. Save immediately to timing.json:
```json
{
  "total_tokens": 84852,
  "duration_ms": 23332,
  "total_duration_seconds": 23.3
}
```
This is the only opportunity to capture this data — it comes through the notification and isn't persisted elsewhere.
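A minimal sketch of that capture step, using the field names from the example above (`save_timing` is illustrative, not a bundled script):

```python
import json

def save_timing(path, total_tokens, duration_ms):
    # Persist immediately: the notification is the only source of these numbers.
    timing = {
        "total_tokens": total_tokens,
        "duration_ms": duration_ms,
        "total_duration_seconds": round(duration_ms / 1000, 1),
    }
    with open(path, "w") as f:
        json.dump(timing, f, indent=2)
    return timing
```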
Spawn a grader subagent using grader.md:
Save results to `grading.json` in each run directory. The `grading.json` expectations array must use the fields `text`, `passed`, and `evidence`.
For assertions checkable programmatically, write and run a script — scripts are faster and more reliable than eyeballing.
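For example, a script-checkable assertion might look like this. It is a sketch: `run_check` and `has_py_file` are hypothetical, with the result dict mirroring the grading.json expectation fields:

```python
import pathlib

def run_check(outputs_dir, text, predicate):
    # predicate(outputs_dir) -> (passed, evidence); the result uses the
    # grading.json expectation fields: text, passed, evidence.
    passed, evidence = predicate(pathlib.Path(outputs_dir))
    return {"text": text, "passed": passed, "evidence": evidence}

def has_py_file(outputs):
    # Hypothetical assertion: the run produced at least one Python file.
    files = sorted(outputs.glob("**/*.py"))
    return bool(files), f"found {len(files)} .py file(s)"
```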
Aggregate results across all runs:
```
python -m scripts.aggregate_benchmark <workspace>/iteration-N --skill-name <name>
```
Produces benchmark.json and benchmark.md with pass_rate, time, and tokens for each configuration, with mean ± stddev and delta.
See schemas.md for the benchmark.json format.
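The mean ± stddev and delta computation amounts to the following (a sketch of the arithmetic only, not the actual `aggregate_benchmark` implementation):

```python
import statistics

def summarize(with_skill, without_skill):
    # Pass rates (or times/tokens) per run; delta is with-minus-without mean.
    def stats(xs):
        return statistics.mean(xs), (statistics.stdev(xs) if len(xs) > 1 else 0.0)
    mean_w, sd_w = stats(with_skill)
    mean_wo, sd_wo = stats(without_skill)
    return {
        "with_skill": {"mean": mean_w, "stddev": sd_w},
        "without_skill": {"mean": mean_wo, "stddev": sd_wo},
        "delta": mean_w - mean_wo,
    }
```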
Launch the browser-based reviewer:
```
nohup python ../writing-skills/eval-viewer/generate_review.py \
  <workspace>/iteration-N \
  --skill-name "my-skill" \
  --benchmark <workspace>/iteration-N/benchmark.json \
  > /dev/null 2>&1 &
```
For iteration 2+, add `--previous-workspace <workspace>/iteration-<N-1>`.
Cowork / headless: Use --static <output_path> for standalone HTML.
The user sees two tabs:
When done, user clicks "Submit All Reviews" → feedback.json.
```json
{
  "reviews": [
    {"run_id": "eval-0-with_skill", "feedback": "chart missing axis labels", "timestamp": "..."},
    {"run_id": "eval-1-with_skill", "feedback": "", "timestamp": "..."}
  ]
}
```
Empty feedback = user thought it was fine. Focus improvements on test cases with specific complaints.
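Filtering for those runs is straightforward (a sketch against the feedback.json format above; `actionable_reviews` is an illustrative helper):

```python
import json

def actionable_reviews(feedback_path):
    # Empty feedback means the user thought the run was fine;
    # only runs with specific complaints need follow-up.
    with open(feedback_path) as f:
        reviews = json.load(f)["reviews"]
    return [r for r in reviews if r["feedback"].strip()]
```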
The skill will be used many times across many prompts. You and the user iterate on only a few examples because it's fast, but if the skill works only for those examples, it's useless. Rather than making fiddly, overfitted tweaks, try different metaphors and recommend different patterns. It's cheap to try.
Remove things that aren't pulling their weight. Read the transcripts — if the skill makes the agent waste time on unproductive steps, remove those parts and see what happens.
Use theory of mind. Today's LLMs are smart. When given good reasoning, they go beyond rote instructions. If you find yourself writing ALWAYS or NEVER in all caps, reframe: explain the reasoning so the model understands why it matters.
Read transcripts from test runs. If all test cases result in the agent writing a similar helper script, that script should be bundled in scripts/. Write it once, save every future invocation from reinventing the wheel.
The output may look fine while the process was wasteful or fragile.
After improving the skill:
- Re-run all test cases in a fresh `iteration-<N+1>/` directory, including baselines
- Launch the viewer with `--previous-workspace` pointing at the previous iteration

Keep going until the skill consistently produces better results.
For rigorous comparison between two skill versions:
- `comparator.md` — give two outputs labeled A/B without revealing which skill produced which
- `analyzer.md` — analyze WHY the winner won

Optional: this requires subagents, and most users don't need it. The human review loop is usually sufficient.
```
skill-name-workspace/
└── iteration-1/
    ├── eval-descriptive-name/
    │   ├── with_skill/
    │   │   ├── outputs/
    │   │   └── timing.json
    │   ├── without_skill/
    │   │   ├── outputs/
    │   │   └── timing.json
    │   ├── eval_metadata.json
    │   └── grading.json
    ├── benchmark.json
    ├── benchmark.md
    └── feedback.json
```
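A sketch of what creating that layout involves (the actual `init_workspace` script may behave differently; `init_iteration` here is illustrative):

```python
import pathlib

def init_iteration(workspace, n, eval_names):
    # One directory per eval, each with with_skill/ and without_skill/ runs.
    root = pathlib.Path(workspace) / f"iteration-{n}"
    for name in eval_names:
        for config in ("with_skill", "without_skill"):
            (root / f"eval-{name}" / config / "outputs").mkdir(
                parents=True, exist_ok=True
            )
    return root
```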
Claude Code: Full functionality — subagents, browser viewer, all scripts.
Claude.ai: No subagents. Run test cases yourself one at a time. Skip baselines. Present results inline and ask for feedback directly. Skip benchmarking.
Cowork: Subagents work. Use --static for viewer (no browser). Feedback downloads as file.
Run from the skill-testing-output directory within the plugin.
Run scripts from this skill's base directory.
| Script | Purpose | When to use |
|---|---|---|
| `python -m scripts.init_workspace <name> --evals <evals.json>` | Create workspace directory structure | Before running comparisons |
| `python -m scripts.run_comparison --skill-path <path> --evals <evals.json> --workspace <path>` | Run A/B comparisons via `claude -p` | Executing eval runs |
| `python -m scripts.run_comparison --baseline-skill <old-path>` | Compare two skill versions | Old vs new skill testing |
| `python -m scripts.diff_outputs <workspace>/iteration-N/eval-name` | Compare output files between configs | Reviewing output differences |
For grading and benchmarking, use the shared writing-skills scripts:
```
# From the writing-skills sibling directory (../writing-skills):
python -m scripts.aggregate_benchmark <workspace>/iteration-N --skill-name <name>
```
Prerequisites: Python 3.10+, claude CLI. Activate .venv if available.