Use when creating new skills, editing existing skills, or verifying skills work before deployment
Create, test, and refine skills through a 6-phase process. Routes testing strategy by skill type: discipline, workflow, reference, or hybrid.
Core principle: If you didn't watch an agent fail without the skill, you don't know if the skill teaches the right thing.
REQUIRED: Also use skillproof:skill-tdd for the testing methodology (RED-GREEN-REFACTOR cycle).
Understand what the skill should do before writing anything.
Determine:
Classify skill type:
| Type | Examples | Test approach |
|---|---|---|
| Discipline |
| TDD, verification-before-completion, coding standards |
| Pressure scenarios (skillproof:skill-testing-discipline) |
| Technique/Workflow | condition-based-waiting, root-cause-tracing | Output comparison (skillproof:skill-testing-output) |
| Reference | API docs, command references, library guides | Retrieval scenarios (inline) |
| Hybrid | Rules + techniques combined | Both discipline + output testing |
Interview:
Read references/best-practices.md for authoring guidance before drafting.
Document what agents do WITHOUT the skill. Choose the testing tier from skillproof:skill-tdd that matches your situation:
REQUIRED: Follow skillproof:skill-tdd for the TDD methodology and tier selection.
Hypothesized failures don't count as evidence. Spot checks require direct evidence (production reports, user feedback, prior test results). Quick and full tests require running an actual agent and observing what it does.
| Acceptable | Not acceptable |
|---|---|
| Spawned subagent, observed it chose B | "Agents typically choose B in this situation" |
| Agent produced output with 3 specific gaps | "Common gaps include X, Y, Z" |
| Production report: "skill triggers for one-line fixes" | "The skill probably triggers too broadly" |
| Captured verbatim rationalization from run | Listed known rationalizations from memory |
Route by type:
Discipline → Follow skillproof:skill-testing-discipline:
Technique/Workflow → Follow skillproof:skill-testing-output:
evals/evals.json)Reference → Retrieval scenarios (lightweight, inline):
Hybrid → Both discipline testing AND output comparison
Write the minimal skill that addresses the specific baseline failures you documented. Nothing more.
Structure every skill:
---