Create new skills, improve existing skills, and measure skill performance. Use when users want to create a skill from scratch, update or optimize an existing skill, run evals to test a skill, or benchmark skill performance with variance analysis.
A skill for creating new skills and iteratively improving them.
At a high level, the process of creating a skill goes like this:
Your job when using this skill is to figure out where the user is in this process and then jump in and help them progress through these stages. So for instance, maybe they're like "I want to make a skill for X". You can help narrow down what they mean, write a draft, write the test cases, figure out how they want to evaluate, run all the prompts, and repeat.
On the other hand, maybe they already have a draft of the skill. In this case you can go straight to the eval/iterate part of the loop.
Of course, you should always be flexible and if the user is like "I don't need to run a bunch of evaluations, just vibe with me", you can do that instead.
Cool? Cool.
The skill-creator operates on composable building blocks. Each has well-defined inputs and outputs.
| Building Block | Input | Output | Agent |
|---|---|---|---|
| Eval Run | skill + eval prompt + files | transcript, outputs, metrics | agents/executor.md |
| Grade Expectations | outputs + expectations | pass/fail per expectation | agents/grader.md |
| Blind Compare | output A, output B, eval prompt | winner + reasoning | agents/comparator.md |
| Post-hoc Analysis | winner + skills + transcripts | improvement suggestions | agents/analyzer.md |
**Eval Run**: Executes a skill on an eval prompt and produces measurable outputs: transcript.md, outputs/, metrics.json.
**Grade Expectations**: Evaluates whether outputs meet defined expectations, producing grading.json with pass/fail per expectation plus evidence.
**Blind Compare**: Compares two outputs without knowing which skill produced them.
**Post-hoc Analysis**: After blind comparison, analyzes WHY the winner won.
Check whether you can spawn subagents — independent agents that execute tasks in parallel. If you can, you'll delegate work to executor, grader, comparator, and analyzer agents. If not, you'll do all work inline, sequentially.
This affects which modes are available and how they execute. The core workflows are the same — only the execution strategy changes.
Building blocks combine into higher-level workflows for each mode:
| Mode | Purpose | Workflow |
|---|---|---|
| Eval | Test skill performance | Executor → Grader → Results |
| Improve | Iteratively optimize skill | Executor → Grader → Comparator → Analyzer → Apply |
| Create | Interactive skill development | Interview → Research → Draft → Run → Refine |
| Benchmark | Standardized performance measurement (requires subagents) | 3x runs per configuration → Aggregate → Analyze |
See references/mode-diagrams.md for detailed visual workflow diagrams.
Use tasks to track progress on multi-step workflows.
Each eval run becomes a task with stage progression:
pending → planning (prep) → implementing (executor) → reviewing (grader) → verifying (validate) → completed
When running evals, create a task per eval run:
TaskCreate(
subject="Eval 0, run 1 (with_skill)",
description="Execute skill eval 0 with skill and grade expectations",
activeForm="Preparing eval 0"
)
Progress through stages as work completes:
TaskUpdate(task, status="planning") # Prepare files, stage inputs
TaskUpdate(task, status="implementing") # Spawn executor subagent
TaskUpdate(task, status="reviewing") # Spawn grader subagent
TaskUpdate(task, status="verifying") # Validate outputs exist
TaskUpdate(task, status="completed") # Done
For blind comparisons (after all runs complete):
TaskCreate(
subject="Compare skill-v1 vs skill-v2"
)
# planning = gather outputs
# implementing = spawn blind comparators
# reviewing = tally votes, handle ties
# verifying = if tied, run more comparisons or use efficiency as a tiebreaker
# completed = declare winner
The coordinator (this skill):
| Agent | Role | Reference |
|---|---|---|
| Executor | Run skill on a task, produce transcript + outputs + metrics | agents/executor.md |
| Grader | Evaluate expectations against transcript and outputs | agents/grader.md |
| Comparator | Blind A/B comparison between two outputs | agents/comparator.md |
| Analyzer | Post-hoc analysis of comparison results | agents/analyzer.md |
The skill creator is liable to be used by people with a wide range of familiarity with coding jargon. If you haven't heard (and how could you have? it only started recently), there's a trend now where the power of Claude is inspiring plumbers to open up their terminals, and parents and grandparents to google "how to install npm". On the other hand, the bulk of users are probably fairly computer-literate.
So please pay attention to context cues to understand how to phrase your communication! In the default case, just to give you some idea:
It's OK to briefly explain a term with a short definition if you're unsure the user will get it.
Start by understanding the user's intent. The current conversation might already contain a workflow the user wants to capture (e.g., they say "turn this into a skill"). If so, extract answers from the conversation history first — the tools used, the sequence of steps, corrections the user made, input/output formats observed. Ask the user to fill any gaps, and have them confirm before proceeding to the next step.
Proactively ask questions about edge cases, input/output formats, example files, success criteria, and dependencies.
Check available MCPs. If any are useful for research (searching docs, finding similar skills, looking up best practices), do that research in parallel via subagents if available, otherwise inline. Come prepared with context to reduce the burden on the user.
Run the initialization script:
scripts/init_skill.py <skill-name> --path <output-directory>
This creates:
Based on interview, fill:
skill-name/
├── SKILL.md (required)
│ ├── YAML frontmatter (name, description required)
│ └── Markdown instructions
└── Bundled Resources (optional)
├── scripts/ - Executable code for deterministic/repetitive tasks
├── references/ - Docs loaded into context as needed
└── assets/ - Files used in output (templates, icons, fonts)
What NOT to include: README.md, INSTALLATION_GUIDE.md, CHANGELOG.md, or any auxiliary documentation. Skills are for AI agents, not human onboarding.
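A minimal structural check for a skill folder, based only on the requirements above (SKILL.md present; frontmatter with name and description). This is a sketch, not a substitute for the real validation scripts:

```python
import re
from pathlib import Path

def skill_problems(skill_dir):
    """Check that SKILL.md exists and that its YAML frontmatter
    declares the required name and description fields."""
    md = Path(skill_dir) / "SKILL.md"
    if not md.is_file():
        return ["missing SKILL.md"]
    # Frontmatter is the block between the opening and closing "---" lines.
    match = re.match(r"^---\n(.*?)\n---\n", md.read_text(), re.DOTALL)
    if not match:
        return ["missing YAML frontmatter"]
    problems = []
    for field in ("name", "description"):
        if not re.search(rf"^{field}:", match.group(1), re.MULTILINE):
            problems.append(f"frontmatter missing {field}")
    return problems
```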
Skills use a three-level loading system:
These word counts are approximate; feel free to go longer if needed.
Key patterns:
Domain organization: When a skill supports multiple domains/frameworks, organize by variant:
cloud-deploy/
├── SKILL.md (workflow + selection)
└── references/
├── aws.md
├── gcp.md
└── azure.md
Claude reads only the relevant reference file.
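The selection step can be that literal. A sketch for the cloud-deploy example (provider names taken from the tree above):

```python
def reference_for(provider):
    """Map the user's provider to the single reference file worth loading."""
    known = {"aws", "gcp", "azure"}
    if provider not in known:
        raise ValueError(f"unsupported provider: {provider}")
    return f"references/{provider}.md"
```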
This goes without saying, but skills must not contain malware, exploit code, or any content that could compromise system security. A skill's contents should not surprise the user, given its stated intent. Don't go along with requests to create misleading skills or skills designed to facilitate unauthorized access, data exfiltration, or other malicious activities. Things like "roleplay as an XYZ" are OK though.
Prefer using the imperative form in instructions.
Defining output formats - You can do it like this:
## Report structure
ALWAYS use this exact template:
# [Title]
## Executive summary
## Key findings
## Recommendations
Examples pattern - It's useful to include examples. You can format them like this (though if "Input" and "Output" already appear in the examples themselves, you might want to deviate a little):
## Commit message format
**Example 1:**
Input: Added user authentication with JWT tokens
Output: feat(auth): implement JWT-based authentication
Always have something cooking. Every time the user adds an example or input:
Try to explain to the model why things are important, in lieu of heavy-handed musty MUSTs. Use theory of mind, and try to make the skill general rather than narrowly fitted to specific examples. Start by writing a draft, then look at it with fresh eyes and improve it.
After writing the skill draft, come up with 2-3 realistic test prompts — the kind of thing a real user would actually say. Share them with the user: [you don't have to use this exact language] "Here are a few test cases I'd like to try. Do these look right, or do you want to add more?" Then run them.
If the user wants evals, create evals/evals.json with this structure:
{
"skill_name": "example-skill",
"evals": [
{
"id": 1,
"prompt": "User's task prompt",
"expected_output": "Description of expected result",
"files": [],
"assertions": [
"The output includes X",
"The skill correctly handles Y"
]
}
]
}
You can initialize with scripts/init_json.py evals evals/evals.json and validate with scripts/validate_json.py evals/evals.json. See references/schemas.md for the full schema.
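scripts/validate_json.py is the canonical check, but as a quick sanity sketch of the structure above (field names come from the example only, nothing more):

```python
REQUIRED_EVAL_FIELDS = ("id", "prompt", "expected_output", "files", "assertions")

def evals_problems(data):
    """Return a list of structural problems with an evals.json-style dict."""
    problems = []
    if not data.get("skill_name"):
        problems.append("missing skill_name")
    if not data.get("evals"):
        problems.append("no evals defined")
    for i, ev in enumerate(data.get("evals", [])):
        for field in REQUIRED_EVAL_FIELDS:
            if field not in ev:
                problems.append(f"eval {i}: missing {field}")
        # Every eval needs at least one gradable assertion.
        if not ev.get("assertions"):
            problems.append(f"eval {i}: needs at least one assertion")
    return problems
```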
Once gradable criteria are defined (expectations, success metrics), Claude can:
Check whether you have access to the present_files tool. If you don't, skip this step. If you do, package the skill and present the .skill file to the user:
scripts/package_skill.py <path/to/skill-folder>
After packaging, direct the user to the resulting .skill file path so they can install it.
When the user asks to improve a skill, ask:
Claude should then autonomously iterate using the building blocks (run, grade, compare, analyze) to drive the skill toward the goal within the time budget.
Some advice on writing style when improving a skill:
Try to generalize from the feedback rather than fixing specific examples one by one. The big picture: we're trying to create "skills" that can be used a million times (maybe literally, maybe even more, who knows) across many different prompts. Here, you and the user are iterating on only a few examples over and over because it helps move faster; the user knows these examples inside and out, and it's quick for them to assess new outputs. But if the skill you and the user are codeveloping works only for those examples, it's useless. Rather than putting in fiddly, overfitted changes or oppressively constrictive MUSTs, if there's some stubborn issue, try branching out: use different metaphors, or recommend different patterns of working. It's relatively cheap to try, and maybe you'll land on something great.
Keep the prompt lean; remove things that aren't pulling their weight. Make sure to read the transcripts, not just the final outputs -- if it looks like the skill is making the model waste a bunch of time doing things that are unproductive, you can try getting rid of the parts of the skill that are making it do that and seeing what happens.
Last but not least, try hard to explain the why behind everything you're asking the model to do. Today's LLMs are smart. They have good theory of mind, and when given a good harness they go beyond rote instructions and really make things happen. Even if the feedback from the user is terse or frustrated, try to actually understand the task, why the user wrote what they wrote, and what they actually wrote, and then transmit this understanding into the instructions. If you find yourself writing ALWAYS or NEVER in all caps, or using super-rigid structures, that's a yellow flag: reframe and explain the reasoning so the model understands why the thing you're asking for is important. That's a more humane, powerful, and effective approach.
This task is pretty important (we are trying to create billions a year in economic value here!) and your thinking time is not the blocker; take your time and really mull things over. I'd suggest writing a draft skill and then looking at it anew and making improvements. Really try to get into the head of the user and understand what they want and need. Best of luck.
Read output schemas:
Read references/schemas.md # JSON structures for grading, history, comparison, analysis
This ensures you understand the structure of outputs you'll produce and validate.
Choose workspace location:
Ask the user where to put the workspace. Suggest <skill-name>-workspace/ as a sibling to the skill directory, but let the user choose. If the workspace ends up inside a git repo, suggest adding it to .gitignore.
Copy skill to v0:
scripts/copy_skill.py <skill-path> <skill-name>-workspace/v0 --iteration 0
Verify or create evals:
Check that evals/evals.json exists. If not, run scripts/init_json.py evals to create it with the correct structure.
Create tasks for baseline:
for run in range(3):
TaskCreate(
subject=f"Eval baseline, run {run+1}"
)
Initialize history.json:
scripts/init_json.py history <workspace>/history.json
Then edit to fill in skill_name. See references/schemas.md for full structure.
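Filling in skill_name can be as simple as the sketch below (any fields beyond skill_name live in references/schemas.md and aren't assumed here):

```python
import json
from pathlib import Path

def set_skill_name(history_path, skill_name):
    """Fill in skill_name in a freshly initialized history.json."""
    path = Path(history_path)
    history = json.loads(path.read_text())
    history["skill_name"] = skill_name
    path.write_text(json.dumps(history, indent=2))
```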
For each iteration (0, 1, 2, ...):
Spawn 3 executor subagents in parallel (or run sequentially without subagents — see "Without subagents" below). Update task to implementing stage.
Spawn a subagent for each run with these instructions:
Read agents/executor.md at: <skill-creator-path>/agents/executor.md
Execute this task:
- Skill path: workspace/v<N>/skill/
- Task: <eval prompt from evals.json>
- Test files: <eval files if any>
- Save transcript to: workspace/v<N>/runs/run-<R>/transcript.md
- Save outputs to: workspace/v<N>/runs/run-<R>/outputs/
Spawn grader subagents (or grade inline — see "Without subagents" below). Update task to reviewing stage.
Purpose: Grading produces structured pass/fail results for tracking pass rates over iterations. The grader also extracts claims and reads user_notes to surface issues that expectations might miss.
Set the grader up for success: The grader needs to actually inspect the outputs, not just read the transcript. If the outputs aren't plain text, tell the grader how to read them — check the skill for inspection tools it already uses and pass those as hints in the grader prompt.
Spawn a subagent with these instructions:
Read agents/grader.md at: <skill-creator-path>/agents/grader.md
Grade these expectations:
- Assertions: <list from evals.json>
- Transcript: workspace/v<N>/runs/run-<R>/transcript.md
- Outputs: workspace/v<N>/runs/run-<R>/outputs/
- Save grading to: workspace/v<N>/runs/run-<R>/grading.json
To inspect output files:
<include inspection hints from the skill, e.g.:>
<"Use python -m markitdown <file> to extract text content">
Review grading.json: Check user_notes_summary for uncertainties and workarounds flagged by the executor. Also check eval_feedback — if the grader flagged lax assertions or missing coverage, update evals.json before continuing. Improving evals mid-loop is fine and often necessary; you can't meaningfully improve a skill if the evals don't measure anything real.
Eval quality loop: If eval_feedback has suggestions, tighten the assertions and rerun the evals. Keep iterating as long as the grader keeps finding issues. Once eval_feedback says the evals look solid (or has no suggestions), move on to skill improvement. Consult the user about what you're doing, but don't block on approval for each round — just keep making progress.
When picking which eval to use for the quality loop, prefer one where the skill partially succeeds — some expectations pass, some fail. An eval where everything fails gives the grader nothing to critique (there are no false positives to catch). The feedback is most useful when some expectations pass and the grader can assess whether those passes reflect genuine quality or surface-level compliance.
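One way to sketch that preference (assuming each grading.json has an "assertions" list with boolean "passed" fields; the real schema is in references/schemas.md):

```python
def mixed_result_evals(gradings):
    """Return gradings with a partial pass rate: some assertions pass,
    some fail. These give the grader real passes to critique."""
    def pass_rate(grading):
        results = [a["passed"] for a in grading["assertions"]]
        return sum(results) / len(results)
    return [g for g in gradings if 0.0 < pass_rate(g) < 1.0]
```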
For iterations after baseline, use blind comparison:
Purpose: While grading tracks expectation pass rates, the comparator judges holistic output quality using a rubric. Two outputs might both pass all expectations, but one could still be clearly better. The comparator uses expectations as secondary evidence, not the primary decision factor.
Blind A/B Protocol:
Record which version is A and which is B in workspace/grading/v<N>-vs-best/assignment.json. Spawn a subagent with these instructions:
Read agents/comparator.md at: <skill-creator-path>/agents/comparator.md
Blind comparison:
- Eval prompt: <the task that was executed>
- Output A: <path to one version's output>
- Output B: <path to other version's output>
- Assertions: <list from evals.json>
You do NOT know which is old vs new. Judge purely on quality.
Determine winner by majority vote:
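As a rough sketch of the tally (the "A"/"B" verdict labels are assumed from the comparator's output format):

```python
from collections import Counter

def majority_winner(verdicts):
    """Tally blind-comparison verdicts (each "A" or "B").
    Returns the label with a strict majority, or None on a tie,
    in which case run more comparisons or use a tiebreaker."""
    tally = Counter(verdicts)
    if tally["A"] == tally["B"]:
        return None  # tie: gather more evidence
    return "A" if tally["A"] > tally["B"] else "B"
```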
After blind comparison, analyze results. Spawn a subagent with these instructions:
Read agents/analyzer.md at: <skill-creator-path>/agents/analyzer.md
Analyze:
- Winner: <A or B>
- Winner skill: workspace/<winner-version>/skill/
- Winner transcript: workspace/<winner-version>/runs/run-1/transcript.md
- Loser skill: workspace/<loser-version>/skill/
- Loser transcript: workspace/<loser-version>/runs/run-1/transcript.md
- Comparison result: <from comparator>
Update task to completed stage. Record results:
if new_version_wins_majority:
    current_best = new_version

# Update history.json
history["iterations"].append({
    "version": "v<N>",
    "parent": "<previous best>",
    "expectation_pass_rate": 0.85,
    "grading_result": "won",   # "won" | "lost" | "tie"
    "is_current_best": True,   # or False
})
Copy current best to new version:
scripts/copy_skill.py workspace/<current_best>/skill workspace/v<N+1> \
--parent <current_best> \
--iteration <N+1>
Apply improvements from analyzer suggestions
Create new tasks for next iteration
Continue loop or stop if:
When iterations complete:
Copy best skill back to main location:
cp -r workspace/<best_version>/skill/* ./
Check whether you have access to the present_files tool. If you do, package and present the improved skill, and direct the user to the resulting .skill file path so they can install it:
scripts/package_skill.py <path/to/skill-folder>
(If you don't have the present_files tool, don't run package_skill.py)
Without subagents, Improve mode still works but with reduced rigor:
Read agents/executor.md and follow the procedure directly in your main loop. Then read agents/grader.md and follow it directly to grade the results.

Run individual evals to test skill performance and grade expectations.
IMPORTANT: Before running evals, read the full documentation:
Read references/eval-mode.md # Complete Eval workflow
Read references/schemas.md # JSON output structures
Use Eval mode when:
The workflow: Setup → Check Dependencies → Prepare → Execute → Grade → Display Results
Without subagents, execute and grade sequentially in the main loop. Read the agent reference files (agents/executor.md, agents/grader.md) and follow the procedures directly.
Run standardized performance measurement with variance analysis.
Requires subagents. Benchmark mode relies on parallel execution of many runs to produce statistically meaningful results. Without subagents, use Eval mode for individual eval testing instead.
IMPORTANT: Before running benchmarks, read the full documentation:
Read references/benchmark-mode.md # Complete Benchmark workflow
Read references/schemas.md # JSON output structures
Use Benchmark mode when:
Key differences from Eval:
Workspaces are created as sibling directories to the skill being worked on.
parent-directory/
├── skill-name/ # The skill
│ ├── SKILL.md
│ ├── evals/
│ │ ├── evals.json
│ │ └── files/
│ └── scripts/
│
└── skill-name-workspace/ # Workspace (sibling directory)
│
├── [Eval mode]
├── eval-0/
│ ├── with_skill/
│ │ ├── inputs/ # Staged input files
│ │ ├── outputs/ # Skill outputs
│ │ │ ├── transcript.md
│ │ │ ├── user_notes.md # Executor uncertainties
│ │ │ ├── metrics.json
│ │ │ └── [output files]
│ │ ├── grading.json # Assertions + claims + user_notes_summary
│ │ └── timing.json # Wall clock timing
│ └── without_skill/
│ └── ...
├── comparison.json # Blind comparison (A/B testing)
├── summary.json # Aggregate metrics
│
├── [Improve mode]
├── history.json # Score progression across versions
├── v0/
│ ├── META.yaml # Version metadata
│ ├── skill/ # Copy of skill at this version
│ └── runs/
│ ├── run-1/
│ │ ├── transcript.md
│ │ ├── user_notes.md
│ │ ├── outputs/
│ │ └── grading.json
│ ├── run-2/
│ └── run-3/
├── v1/
│ ├── META.yaml
│ ├── skill/
│ ├── improvements/
│ │ └── suggestions.md # From analyzer
│ └── runs/
└── grading/
└── v1-vs-v0/
├── assignment.json # Which version is A vs B
├── comparison-1.json # Blind comparison results
├── comparison-2.json
├── comparison-3.json
└── analysis.json # Post-hoc analysis
│
├── [Benchmark mode]
└── benchmarks/
└── 2026-01-15T10-30-00/ # Timestamp-named directory
├── benchmark.json # Structured results (see schema)
├── benchmark.md # Human-readable summary
└── runs/
├── eval-1/
│ ├── with_skill/
│ │ ├── run-1/
│ │ │ ├── transcript.md
│ │ │ ├── user_notes.md
│ │ │ ├── outputs/
│ │ │ └── grading.json
│ │ ├── run-2/
│ │ └── run-3/
│ └── without_skill/
│ ├── run-1/
│ ├── run-2/
│ └── run-3/
└── eval-2/
└── ...
Key files:
- transcript.md - Execution log from executor
- user_notes.md - Uncertainties and workarounds flagged by executor
- metrics.json - Tool calls, output size, step count
- grading.json - Assertion pass/fail, notes, user_notes summary
- timing.json - Wall clock duration
- comparison-N.json - Blind rubric-based comparison
- analysis.json - Post-hoc analysis with improvement suggestions
- history.json - Version progression with pass rates and winners
- benchmark.json - Structured benchmark results with runs, run_summary, notes
- benchmark.md - Human-readable benchmark summary

The coordinator must:
There are two patterns for delegating work to building blocks:
With subagents: Spawn an independent agent with the reference file instructions. Include the reference file path in the prompt so the subagent knows its role. When tasks are independent (like 3 runs of the same version), spawn all subagents in the same turn for parallelism.
Without subagents: Read the agent reference file (e.g., agents/executor.md) and follow the procedure directly in your main loop. Execute each step sequentially — the procedures are designed to work both as subagent instructions and as inline procedures.
Just pasting in the overall workflow again for reference:
Good luck!