Create new skills, modify and improve existing skills, and measure skill performance. Use when users want to create a new skill (or update an existing skill) that extends Claude's capabilities with specialized knowledge, workflows, or tool integrations. Also use when users want to run evals to test a skill, benchmark skill performance, optimize a skill's description for better triggering accuracy, or when they say "turn this into a skill". Trigger on any mention of creating, writing, building, improving, or testing skills.
Create new skills and iteratively improve them through testing and evaluation.
Skills are modular, self-contained packages that extend Claude's capabilities with specialized knowledge, workflows, and tools. They transform Claude from a general-purpose agent into a specialized one equipped with procedural knowledge no model can fully possess.
Skills provide: specialized workflows, tool integrations, domain expertise, and bundled resources (scripts, references, assets).
The context window is a public good shared with system prompt, conversation history, other skills, and user requests. Claude is already very smart — only add context it doesn't already have. Challenge each piece: "Does Claude really need this?" and "Does this justify its token cost?" Prefer concise examples over verbose explanations.
Match specificity to the task's fragility and variability:
Think of Claude exploring a path: a narrow bridge needs guardrails (low freedom), an open field allows many routes (high freedom).
Explain why things are important rather than using heavy-handed MUSTs. LLMs have good theory of mind — when given reasoning, they go beyond rote instructions. If you find yourself writing ALWAYS or NEVER in all caps, reframe and explain the reasoning instead.
skill-name/
├── SKILL.md (required)
│ ├── YAML frontmatter (name, description — required)
│ └── Markdown instructions
└── Bundled Resources (optional)
├── scripts/ - Executable code for deterministic/repetitive tasks
├── references/ - Docs loaded into context as needed
└── assets/ - Files used in output (templates, icons, fonts)
**Frontmatter**: `name` and `description` fields only. The description is the primary triggering mechanism — Claude reads it to decide when to use the skill. All "when to use" info goes here, not in the body.
**Scripts** (`scripts/`): Executable code for tasks requiring deterministic reliability or that get rewritten repeatedly.
**References** (`references/`): Documentation loaded into context as needed.
**Assets** (`assets/`): Files used in output, not loaded into context.
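A frontmatter sketch, mirroring the description guidance later in this document (the skill name and wording are illustrative):

```yaml
---
name: pdf-processing
description: Extract text and tables from PDF files, fill forms, merge documents. Use when working with PDF files or when the user mentions PDFs, forms, or document extraction.
---
```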
Do NOT create extraneous files: README.md, INSTALLATION_GUIDE.md, QUICK_REFERENCE.md, CHANGELOG.md, etc. The skill should only contain what an AI agent needs to do the job.
Skills use a three-level loading system: the frontmatter metadata (name and description) is always in context, the SKILL.md body is loaded when the skill triggers, and bundled resources are loaded only as needed.
Keep SKILL.md under 500 lines. When approaching this limit, split content into separate files with clear references describing when to read them.
Pattern 1: High-level guide with references
## Quick start
[Minimal working example]
## Advanced features
- **Form filling**: See [FORMS.md](FORMS.md)
- **API reference**: See [REFERENCE.md](REFERENCE.md)
Pattern 2: Domain-specific organization
bigquery-skill/
├── SKILL.md (overview + navigation)
└── references/
├── finance.md
├── sales.md
└── product.md
Claude reads only the relevant reference for the user's query.
Pattern 3: Conditional details — Show basics, link to advanced content only when needed.
Guidelines:
Figure out where the user is in this process and help them progress. Maybe they want to create from scratch, or maybe they already have a draft and want to iterate.
Start by understanding what the skill should do. The conversation may already contain a workflow to capture (e.g., "turn this into a skill") — extract what you can from history first.
Key questions (don't overwhelm — start with the most important):
Check available MCPs for research. Come prepared with context to reduce burden on the user.
Analyze each concrete example by considering how to execute it from scratch, then identifying what scripts, references, and assets would help when executing repeatedly.
Examples:
- `scripts/rotate_pdf.py` (same code rewritten each time)
- `assets/hello-world/` template (same boilerplate each time)
- `references/schema.md` (re-discovering schemas each time)

Create a list of reusable resources: scripts, references, and assets.
Create the directory structure. Only create subdirectories the skill actually needs — most skills need just SKILL.md and perhaps one resource directory.
Implement the planned resources. This may require user input (e.g., brand assets, documentation). Test added scripts by running them.
The skill is for another Claude instance. Include information beneficial and non-obvious to Claude — procedural knowledge, domain details, reusable assets. Use imperative/infinitive form.
The description is the only thing the agent sees when deciding which skill to load. It must provide enough information to know what the skill does and when to use it.
Format:
Good: "Extract text and tables from PDF files, fill forms, merge documents. Use when working with PDF files or when user mentions PDFs, forms, or document extraction."
Bad: "Helps with documents."
Output format patterns:
## Report structure
ALWAYS use this exact template:
# [Title]
## Executive summary
## Key findings
Example patterns:
## Commit message format
**Example 1:**
Input: Added user authentication with JWT tokens
Output: feat(auth): implement JWT-based authentication
Reference any bundled resources and describe clearly when to read them.
After drafting, verify the skill against the guidelines above, then present the draft to the user for review.
After the skill draft is ready, create 2-3 realistic test prompts and run them. See references/eval-workflow.md for the complete testing and evaluation workflow.
Save test cases to evals/evals.json. See references/schemas.md for the JSON schema.
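An evals.json entry might look like the sketch below. The field names here are illustrative; the authoritative schema lives in references/schemas.md.

```json
{
  "evals": [
    {
      "id": "pdf-extract-01",
      "prompt": "Extract all tables from quarterly-report.pdf into CSV files.",
      "expected_behavior": [
        "Uses the skill's bundled extraction script instead of writing one from scratch",
        "Produces one CSV file per table"
      ]
    }
  ]
}
```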
If the user prefers to skip formal evals ("just vibe with me"), that's fine — adapt to their preference.
This is the heart of the loop. Apply feedback from user evaluation:
Generalize from feedback — The skill will be used across many prompts, not just test examples. Avoid fiddly, overfitting changes. If something is stubbornly wrong, try different metaphors or patterns rather than adding rigid constraints.
Keep the skill lean — Remove what isn't pulling its weight. Read transcripts, not just outputs — if the skill makes the model waste time unproductively, trim those parts.
Look for repeated work — If all test runs independently wrote similar helper scripts, bundle that script in scripts/. Save every future invocation from reinventing the wheel.
Apply, rerun, review, repeat — After improving, rerun all test cases into a new iteration directory. Keep going until the user is happy, feedback is empty, or no meaningful progress.
See references/eval-workflow.md for detailed iteration mechanics.
After creating or improving a skill, offer to optimize the description for better triggering accuracy. This uses an automated loop that tests different descriptions against eval queries.
See the "Description Optimization" section in references/eval-workflow.md for the full workflow including trigger eval generation, review, and the optimization loop.
- `scripts/run_loop.py` — Automated description optimization loop
- `scripts/run_eval.py` — Execute evaluation queries
- `scripts/aggregate_benchmark.py` — Calculate benchmark statistics
- `scripts/improve_description.py` — Optimize skill descriptions
- `scripts/generate_report.py` — Create HTML reports
- `scripts/package_skill.py` — Package skills for distribution
- `scripts/quick_validate.py` — Quick validation utility
- `assets/eval_review.html` — HTML template for eval query review
- `eval-viewer/generate_review.py` — Generate the eval results viewer
- `eval-viewer/viewer.html` — HTML viewer for eval results
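The kind of check a quick validation pass performs can be sketched as follows. This is not the bundled quick_validate.py, just a minimal illustration of validating the required frontmatter fields and the 500-line limit:

```python
import re
from pathlib import Path

def validate_skill(skill_dir: str) -> list[str]:
    """Return a list of problems found in a skill directory (empty list = OK)."""
    problems = []
    skill_md = Path(skill_dir) / "SKILL.md"
    if not skill_md.is_file():
        return ["missing SKILL.md"]
    text = skill_md.read_text(encoding="utf-8")
    # Frontmatter is the block between the opening and closing '---' lines.
    match = re.match(r"^---\n(.*?)\n---", text, re.DOTALL)
    if not match:
        return ["missing YAML frontmatter"]
    for field in ("name", "description"):
        if not re.search(rf"^{field}:\s*\S", match.group(1), re.MULTILINE):
            problems.append(f"frontmatter missing '{field}'")
    if len(text.splitlines()) > 500:
        problems.append("SKILL.md exceeds 500 lines")
    return problems
```

A real validator would also flag extraneous files (README.md, CHANGELOG.md, etc.) and check that referenced resource files exist.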