Create new skills, improve existing skills, and measure skill performance. Use when users want to create a skill from scratch, update or optimize an existing skill, run evals to test a skill, or benchmark skill performance with variance analysis.
A skill for creating new skills and iteratively improving them.
At a high level, the process of creating a skill goes like this:
Your job when using this skill is to figure out where the user is in this process and then jump in and help them progress through these stages. So for instance, maybe they're like "I want to make a skill for X". You can help narrow down what they mean, write a draft, write the test cases, figure out how they want to evaluate, run all the prompts, and repeat.
On the other hand, maybe they already have a draft of the skill. In this case you can go straight to the eval/iterate part of the loop.
Of course, you should always be flexible and if the user is like "I don't need to run a bunch of evaluations, just vibe with me", you can do that instead.
Cool? Cool.
The skill-creator operates on composable building blocks. Each has well-defined inputs and outputs.
| Building Block | Input | Output | Agent |
|---|---|---|---|
| Eval Run | skill + eval prompt + files | transcript, outputs, metrics | agents/executor.md |
| Grade Expectations | outputs + expectations | pass/fail per expectation | agents/grader.md |
| Blind Compare | output A, output B, eval prompt | winner + reasoning | agents/comparator.md |
| Post-hoc Analysis | winner + skills + transcripts | improvement suggestions | agents/analyzer.md |
**Eval Run**: Executes a skill on an eval prompt and produces measurable outputs: transcript.md, outputs/, metrics.json.
**Grade Expectations**: Evaluates whether outputs meet defined expectations, producing grading.json with pass/fail per expectation plus evidence.
**Blind Compare**: Compares two outputs without knowing which skill produced them.
**Post-hoc Analysis**: After blind comparison, analyzes WHY the winner won.
Check whether you can spawn subagents — independent agents that execute tasks in parallel. If you can, you'll delegate work to executor, grader, comparator, and analyzer agents. If not, you'll do all work inline, sequentially.
This affects which modes are available and how they execute. The core workflows are the same — only the execution strategy changes.
Building blocks combine into higher-level workflows for each mode:
| Mode | Purpose | Workflow |
|---|---|---|
| Eval | Test skill performance | Executor → Grader → Results |
| Improve | Iteratively optimize skill | Executor → Grader → Comparator → Analyzer → Apply |
| Create | Interactive skill development | Interview → Research → Draft → Run → Refine |
| Benchmark | Standardized performance measurement (requires subagents) | 3x runs per configuration → Aggregate → Analyze |
See references/mode-diagrams.md for detailed visual workflow diagrams.
Use tasks to track progress on multi-step workflows.
Each eval run becomes a task with stage progression:
pending → planning (prep) → implementing (executor) → reviewing (grader) → verifying (validate) → completed
When running evals, create a task per eval run:
TaskCreate(
subject="Eval 0, run 1 (with_skill)",
description="Execute skill eval 0 with skill and grade expectations",
activeForm="Preparing eval 0"
)
Progress through stages as work completes:
TaskUpdate(task, status="planning") # Prepare files, stage inputs
TaskUpdate(task, status="implementing") # Spawn executor subagent
TaskUpdate(task, status="reviewing") # Spawn grader subagent
TaskUpdate(task, status="verifying") # Validate outputs exist
TaskUpdate(task, status="completed") # Done
For blind comparisons (after all runs complete):
TaskCreate(
subject="Compare skill-v1 vs skill-v2"
)
# planning = gather outputs
# implementing = spawn blind comparators
# reviewing = tally votes, handle ties
# verifying = if tied, run more comparisons or use efficiency as a tiebreaker
# completed = declare winner
The coordinator (this skill):
| Agent | Role | Reference |
|---|---|---|
| Executor | Run skill on a task, produce transcript + outputs + metrics | agents/executor.md |
| Grader | Evaluate expectations against transcript and outputs | agents/grader.md |
| Comparator | Blind A/B comparison between two outputs | agents/comparator.md |
| Analyzer | Post-hoc analysis of comparison results | agents/analyzer.md |
The skill creator is liable to be used by people with a wide range of familiarity with coding jargon. If you haven't heard (and how could you have? it only started recently), there's a trend now where the power of Claude is inspiring plumbers to open up their terminals, and parents and grandparents to google "how to install npm". On the other hand, the bulk of users are probably fairly computer-literate.
So please pay attention to context cues to understand how to phrase your communication! In the default case, just to give you some idea:
It's OK to briefly explain a term with a short definition if you're unsure the user will get it.
Start by understanding the user's intent. The current conversation might already contain a workflow the user wants to capture (e.g., they say "turn this into a skill"). If so, extract answers from the conversation history first — the tools used, the sequence of steps, corrections the user made, input/output formats observed. Ask the user to fill any gaps, and have them confirm before proceeding to the next step.
Proactively ask questions about edge cases, input/output formats, example files, success criteria, and dependencies.
Check available MCPs. If any are useful for research (searching docs, finding similar skills, looking up best practices), do that research in parallel via subagents if available, otherwise inline. Come prepared with context to reduce the burden on the user.
Run the initialization script:
scripts/init_skill.py <skill-name> --path <output-directory>
This creates:
Based on interview, fill:
skill-name/
├── SKILL.md (required)
│ ├── YAML frontmatter (name, description required)
│ └── Markdown instructions
└── Bundled Resources (optional)
├── scripts/ - Executable code for deterministic/repetitive tasks
├── references/ - Docs loaded into context as needed
└── assets/ - Files used in output (templates, icons, fonts)
What NOT to include: README.md, INSTALLATION_GUIDE.md, CHANGELOG.md, or any auxiliary documentation. Skills are for AI agents, not human onboarding.
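A minimal structural check for a skill folder, based only on the requirements above (SKILL.md present; frontmatter with name and description). This is a sketch, not a substitute for the real validation scripts:

```python
import re
from pathlib import Path

def skill_problems(skill_dir):
    """Check that SKILL.md exists and that its YAML frontmatter
    declares the required name and description fields."""
    md = Path(skill_dir) / "SKILL.md"
    if not md.is_file():
        return ["missing SKILL.md"]
    # Frontmatter is the block between the opening and closing "---" lines.
    match = re.match(r"^---\n(.*?)\n---\n", md.read_text(), re.DOTALL)
    if not match:
        return ["missing YAML frontmatter"]
    problems = []
    for field in ("name", "description"):
        if not re.search(rf"^{field}:", match.group(1), re.MULTILINE):
            problems.append(f"frontmatter missing {field}")
    return problems
```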
Skills use a three-level loading system:
These word counts are approximate; feel free to go longer if needed.
Key patterns:
Domain organization: When a skill supports multiple domains/frameworks, organize by variant:
cloud-deploy/
├── SKILL.md (workflow + selection)
└── references/
├── aws.md
├── gcp.md
└── azure.md
Claude reads only the relevant reference file.
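The selection step can be that literal. A sketch for the cloud-deploy example (provider names taken from the tree above):

```python
def reference_for(provider):
    """Map the user's provider to the single reference file worth loading."""
    known = {"aws", "gcp", "azure"}
    if provider not in known:
        raise ValueError(f"unsupported provider: {provider}")
    return f"references/{provider}.md"
```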
This goes without saying, but skills must not contain malware, exploit code, or any content that could compromise system security. A skill's contents should not surprise the user, given its stated intent. Don't go along with requests to create misleading skills or skills designed to facilitate unauthorized access, data exfiltration, or other malicious activities. Things like "roleplay as an XYZ" are OK though.
Prefer using the imperative form in instructions.
Defining output formats - You can do it like this:
## Report structure
ALWAYS use this exact template:
# [Title]
## Executive summary
## Key findings
## Recommendations
Examples pattern - It's useful to include examples. You can format them like this (though if "Input" and "Output" already appear in the examples themselves, you might want to deviate a little):
## Commit message format
**Example 1:**
Input: Added user authentication with JWT tokens
Output: feat(auth): implement JWT-based authentication
Always have something cooking. Every time the user adds an example or input:
Try to explain to the model why things are important, in lieu of heavy-handed musty MUSTs. Use theory of mind, and try to make the skill general rather than narrowly fitted to specific examples. Start by writing a draft, then look at it with fresh eyes and improve it.
After writing the skill draft, come up with 2-3 realistic test prompts — the kind of thing a real user would actually say. Share them with the user: [you don't have to use this exact language] "Here are a few test cases I'd like to try. Do these look right, or do you want to add more?" Then run them.
If the user wants evals, create evals/evals.json with this structure:
{
"skill_name": "example-skill",
"evals": [
{
"id": 1,
"prompt": "User's task prompt",
"expected_output": "Description of expected result",
"files": [],
"assertions": [
"The output includes X",
"The skill correctly handles Y"
]
}
]
}
You can initialize with scripts/init_json.py evals evals/evals.json and validate with scripts/validate_json.py evals/evals.json. See references/schemas.md for the full schema.
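scripts/validate_json.py is the canonical check, but as a quick sanity sketch of the structure above (field names come from the example only, nothing more):

```python
REQUIRED_EVAL_FIELDS = ("id", "prompt", "expected_output", "files", "assertions")

def evals_problems(data):
    """Return a list of structural problems with an evals.json-style dict."""
    problems = []
    if not data.get("skill_name"):
        problems.append("missing skill_name")
    if not data.get("evals"):
        problems.append("no evals defined")
    for i, ev in enumerate(data.get("evals", [])):
        for field in REQUIRED_EVAL_FIELDS:
            if field not in ev:
                problems.append(f"eval {i}: missing {field}")
        # Every eval needs at least one gradable assertion.
        if not ev.get("assertions"):
            problems.append(f"eval {i}: needs at least one assertion")
    return problems
```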
Once gradable criteria are defined (expectations, success metrics), Claude can:
Check whether you have access to the present_files tool. If you don't, skip this step. If you do, package the skill and present the .skill file to the user:
scripts/package_skill.py <path/to/skill-folder>
After packaging, direct the user to the resulting .skill file path so they can install it.
When the user asks to improve a skill, ask:
Claude should then autonomously iterate using the building blocks (run, grade, compare, analyze) to drive the skill toward the goal within the time budget.
Some advice on writing style when improving a skill:
Try to generalize from the feedback rather than fixing specific examples one by one. The big picture: we're trying to create "skills" that can be used a million times (maybe literally, maybe even more, who knows) across many different prompts. Here, you and the user are iterating on only a few examples over and over because it helps move faster; the user knows these examples inside and out, and it's quick for them to assess new outputs. But if the skill you and the user are codeveloping works only for those examples, it's useless. Rather than putting in fiddly, overfitted changes or oppressively constrictive MUSTs, if there's some stubborn issue, try branching out: use different metaphors, or recommend different patterns of working. It's relatively cheap to try, and maybe you'll land on something great.
Keep the prompt lean; remove things that aren't pulling their weight. Make sure to read the transcripts, not just the final outputs -- if it looks like the skill is making the model waste a bunch of time doing things that are unproductive, you can try getting rid of the parts of the skill that are making it do that and seeing what happens.
Last but not least, try hard to explain the why behind everything you're asking the model to do. Today's LLMs are smart. They have good theory of mind, and when given a good harness they go beyond rote instructions and really make things happen. Even if the feedback from the user is terse or frustrated, try to actually understand the task, why the user wrote what they wrote, and what they actually wrote, and then transmit this understanding into the instructions. If you find yourself writing ALWAYS or NEVER in all caps, or using super-rigid structures, that's a yellow flag: reframe and explain the reasoning so the model understands why the thing you're asking for is important. That's a more humane, powerful, and effective approach.
This task is pretty important (we are trying to create billions a year in economic value here!) and your thinking time is not the blocker; take your time and really mull things over. I'd suggest writing a draft skill and then looking at it anew and making improvements. Really try to get into the head of the user and understand what they want and need. Best of luck.
Read output schemas:
Read references/schemas.md # JSON structures for grading, history, comparison, analysis
This ensures you understand the structure of outputs you'll produce and validate.
Choose workspace location:
Ask the user where to put the workspace. Suggest <skill-name>-workspace/ as a sibling to the skill directory, but let the user choose. If the workspace ends up inside a git repo, suggest adding it to .gitignore.
Copy skill to v0:
scripts/copy_skill.py <skill-path> <skill-name>-workspace/v0 --iteration 0
Verify or create evals:
Check that evals/evals.json exists. If not, run scripts/init_json.py evals to create it with the correct structure.
Create tasks for baseline:
for run in range(3):
TaskCreate(
subject=f"Eval baseline, run {run+1}"
)
Initialize history.json:
scripts/init_json.py history <workspace>/history.json
Then edit to fill in skill_name. See references/schemas.md for full structure.
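Filling in skill_name can be as simple as the sketch below (any fields beyond skill_name live in references/schemas.md and aren't assumed here):

```python
import json
from pathlib import Path

def set_skill_name(history_path, skill_name):
    """Fill in skill_name in a freshly initialized history.json."""
    path = Path(history_path)
    history = json.loads(path.read_text())
    history["skill_name"] = skill_name
    path.write_text(json.dumps(history, indent=2))
```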
For each iteration (0, 1, 2, ...):
Spawn 3 executor subagents in parallel (or run sequentially without subagents — see "Without subagents" below). Update task to implementing stage.
Spawn a subagent for each run with these instructions:
Read agents/executor.md at: <skill-creator-path>/agents/executor.md
Execute this task:
- Skill path: workspace/v<N>/skill/
- Task: <eval prompt from evals.json>
- Test files: <eval files if any>
- Save transcript to: workspace/v<N>/runs/run-<R>/transcript.md
- Save outputs to: workspace/v<N>/runs/run-<R>/outputs/
Spawn grader subagents (or grade inline — see "Without subagents" below). Update task to reviewing stage.
Purpose: Grading produces structured pass/fail results for tracking pass rates over iterations. The grader also extracts claims and reads user_notes to surface issues that expectations might miss.
Set the grader up for success: The grader needs to actually inspect the outputs, not just read the transcript. If the outputs aren't plain text, tell the grader how to read them — check the skill for inspection tools it already uses and pass those as hints in the grader prompt.
Spawn a subagent with these instructions:
Read agents/grader.md at: <skill-creator-path>/agents/grader.md
Grade these expectations:
- Assertions: <list from evals.json>
- Transcript: workspace/v<N>/runs/run-<R>/transcript.md
- Outputs: workspace/v<N>/runs/run-<R>/outputs/
- Save grading to: workspace/v<N>/runs/run-<R>/grading.json
To inspect output files:
<include inspection hints from the skill, e.g.:>
<"Use python -m markitdown <file> to extract text content">
Review grading.json: Check user_notes_summary for uncertainties and workarounds flagged by the executor. Also check eval_feedback — if the grader flagged lax assertions or missing coverage, update evals.json before continuing. Improving evals mid-loop is fine and often necessary; you can't meaningfully improve a skill if the evals don't measure anything real.
Eval quality loop: If eval_feedback has suggestions, tighten the assertions and rerun the evals. Keep iterating as long as the grader keeps finding issues. Once eval_feedback says the evals look solid (or has no suggestions), move on to skill improvement. Consult the user about what you're doing, but don't block on approval for each round — just keep making progress.
When picking which eval to use for the quality loop, prefer one where the skill partially succeeds — some expectations pass, some fail. An eval where everything fails gives the grader nothing to critique (there are no false positives to catch). The feedback is most useful when some expectations pass and the grader can assess whether those passes reflect genuine quality or surface-level compliance.
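One way to sketch that preference (assuming each grading.json has an "assertions" list with boolean "passed" fields; the real schema is in references/schemas.md):

```python
def mixed_result_evals(gradings):
    """Return gradings with a partial pass rate: some assertions pass,
    some fail. These give the grader real passes to critique."""
    def pass_rate(grading):
        results = [a["passed"] for a in grading["assertions"]]
        return sum(results) / len(results)
    return [g for g in gradings if 0.0 < pass_rate(g) < 1.0]
```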
For iterations after baseline, use blind comparison:
Purpose: While grading tracks expectation pass rates, the comparator judges holistic output quality using a rubric. Two outputs might both pass all expectations, but one could still be clearly better. The comparator uses expectations as secondary evidence, not the primary decision factor.
Blind A/B Protocol:
Record which version is A and which is B in workspace/grading/v<N>-vs-best/assignment.json. Spawn a subagent with these instructions:
Read agents/comparator.md at: <skill-creator-path>/agents/comparator.md
Blind comparison:
- Eval prompt: <the task that was executed>
- Output A: <path to one version's output>
- Output B: <path to other version's output>
- Assertions: <list from evals.json>
You do NOT know which is old vs new. Judge purely on quality.
Determine winner by majority vote:
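As a rough sketch of the tally (the "A"/"B" verdict labels are assumed from the comparator's output format):

```python
from collections import Counter

def majority_winner(verdicts):
    """Tally blind-comparison verdicts (each "A" or "B").
    Returns the label with a strict majority, or None on a tie,
    in which case run more comparisons or use a tiebreaker."""
    tally = Counter(verdicts)
    if tally["A"] == tally["B"]:
        return None  # tie: gather more evidence
    return "A" if tally["A"] > tally["B"] else "B"
```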
After blind comparison, analyze results. Spawn a subagent with these instructions:
Read agents/analyzer.md at: <skill-creator-path>/agents/analyzer.md
Analyze:
- Winner: <A or B>
- Winner skill: workspace/<winner-version>/skill/
- Winner transcript: workspace/<winner-version>/runs/run-1/transcript.md
- Loser skill: workspace/<loser-version>/skill/
- Loser transcript: workspace/<loser-version>/runs/run-1/transcript.md
- Comparison result: <from comparator>
Update task to completed stage. Record results:
if new_version_wins_majority:
    current_best = new_version

# Update history.json
history["iterations"].append({
    "version": "v<N>",
    "parent": "<previous best>",
    "expectation_pass_rate": 0.85,
    "grading_result": "won",   # "won" | "lost" | "tie"
    "is_current_best": True,   # or False
})
Copy current best to new version:
scripts/copy_skill.py workspace/<current_best>/skill workspace/v<N+1> \
--parent <current_best> \
--iteration <N+1>
Apply improvements from analyzer suggestions
Create new tasks for next iteration
Continue loop or stop if:
When iterations complete:
Copy best skill back to main location:
cp -r workspace/<best_version>/skill/* ./
Check whether you have access to the present_files tool. If you do, package and present the improved skill, and direct the user to the resulting .skill file path so they can install it:
scripts/package_skill.py <path/to/skill-folder>
(If you don't have the present_files tool, don't run package_skill.py)
Without subagents, Improve mode still works but with reduced rigor:
Read agents/executor.md and follow the procedure directly in your main loop. Then read agents/grader.md and follow it directly to grade the results.

Run individual evals to test skill performance and grade expectations.
IMPORTANT: Before running evals, read the full documentation:
Read references/eval-mode.md # Complete Eval workflow
Read references/schemas.md # JSON output structures
Use Eval mode when:
The workflow: Setup → Check Dependencies → Prepare → Execute → Grade → Display Results
Without subagents, execute and grade sequentially in the main loop. Read the agent reference files (agents/executor.md, agents/grader.md) and follow the procedures directly.
Run standardized performance measurement with variance analysis.
Requires subagents. Benchmark mode relies on parallel execution of many runs to produce statistically meaningful results. Without subagents, use Eval mode for individual eval testing instead.
IMPORTANT: Before running benchmarks, read the full documentation:
Read references/benchmark-mode.md # Complete Benchmark workflow
Read references/schemas.md # JSON output structures
Use Benchmark mode when:
Key differences from Eval:
Workspaces are created as sibling directories to the skill being worked on.
parent-directory/
├── skill-name/ # The skill
│ ├── SKILL.md
│ ├── evals/
│ │ ├── evals.json
│ │ └── files/
│ └── scripts/
│
└── skill-name-workspace/ # Workspace (sibling directory)
│
├── [Eval mode]
├── eval-0/
│ ├── with_skill/
│ │ ├── inputs/ # Staged input files
│ │ ├── outputs/ # Skill outputs
│ │ │ ├── transcript.md
│ │ │ ├── user_notes.md # Executor uncertainties
│ │ │ ├── metrics.json
│ │ │ └── [output files]
│ │ ├── grading.json # Assertions + claims + user_notes_summary
│ │ └── timing.json # Wall clock timing
│ └── without_skill/
│ └── ...
├── comparison.json # Blind comparison (A/B testing)
├── summary.json # Aggregate metrics
│
├── [Improve mode]
├── history.json # Score progression across versions
├── v0/
│ ├── META.yaml # Version metadata
│ ├── skill/ # Copy of skill at this version
│ └── runs/
│ ├── run-1/
│ │ ├── transcript.md
│ │ ├── user_notes.md
│ │ ├── outputs/
│ │ └── grading.json
│ ├── run-2/
│ └── run-3/
├── v1/
│ ├── META.yaml
│ ├── skill/
│ ├── improvements/
│ │ └── suggestions.md # From analyzer
│ └── runs/
└── grading/
└── v1-vs-v0/
├── assignment.json # Which version is A vs B
├── comparison-1.json # Blind comparison results
├── comparison-2.json
├── comparison-3.json
└── analysis.json # Post-hoc analysis
│
├── [Benchmark mode]
└── benchmarks/
└── 2026-01-15T10-30-00/ # Timestamp-named directory
├── benchmark.json # Structured results (see schema)
├── benchmark.md # Human-readable summary
└── runs/
├── eval-1/
│ ├── with_skill/
│ │ ├── run-1/
│ │ │ ├── transcript.md
│ │ │ ├── user_notes.md
│ │ │ ├── outputs/
│ │ │ └── grading.json
│ │ ├── run-2/
│ │ └── run-3/
│ └── without_skill/
│ ├── run-1/
│ ├── run-2/
│ └── run-3/
└── eval-2/
└── ...
Key files:
- transcript.md - Execution log from executor
- user_notes.md - Uncertainties and workarounds flagged by executor
- metrics.json - Tool calls, output size, step count
- grading.json - Assertion pass/fail, notes, user_notes summary
- timing.json - Wall clock duration
- comparison-N.json - Blind rubric-based comparison
- analysis.json - Post-hoc analysis with improvement suggestions
- history.json - Version progression with pass rates and winners
- benchmark.json - Structured benchmark results with runs, run_summary, notes
- benchmark.md - Human-readable benchmark summary

The coordinator must:
There are two patterns for delegating work to building blocks:
With subagents: Spawn an independent agent with the reference file instructions. Include the reference file path in the prompt so the subagent knows its role. When tasks are independent (like 3 runs of the same version), spawn all subagents in the same turn for parallelism.
Without subagents: Read the agent reference file (e.g., agents/executor.md) and follow the procedure directly in your main loop. Execute each step sequentially — the procedures are designed to work both as subagent instructions and as inline procedures.
Just pasting in the overall workflow again for reference:
Good luck!