Diagnose and optimize Agent Skills (SKILL.md) with real session data and research-backed static analysis. Works with Claude Code, Codex, and any Agent Skills-compatible agent.
Analyze skills using historical session data + static quality checks, output a diagnostic report with P0/P1/P2 prioritized fixes. Scores each skill on a 5-point composite scale across 8 dimensions.
CSO (Claude/Agent Search Optimization) = writing skill descriptions so agents select the right skill at the right time. This skill checks for CSO violations.
- /optimize-skill → scan all skills
- /optimize-skill my-skill → single skill
- /optimize-skill skill-a skill-b → multiple specified skills

Auto-detect the current agent platform and scan the corresponding paths:
| Source | Claude Code | Codex | Shared |
|---|---|---|---|
| Session transcripts | ~/.claude/projects/**/*.jsonl | ~/.codex/sessions/**/*.jsonl | — |
| Skill files | ~/.claude/skills/*/SKILL.md | ~/.codex/skills/*/SKILL.md | ~/.agents/skills/*/SKILL.md |
Platform detection: Check which directories exist. Scan all available sources — a user may have both Claude Code and Codex installed.
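The discovery and dedup steps above can be sketched in a few lines of python3. The function name and argument shape are illustrative; the scan order of the glob patterns determines which copy of a duplicated skill wins.

```python
import glob
import os

def discover_skills(patterns):
    """Glob SKILL.md files and deduplicate by skill (directory) name.
    Earlier patterns win, matching the scan order described above."""
    found = {}
    for pattern in patterns:
        for path in sorted(glob.glob(os.path.expanduser(pattern))):
            name = os.path.basename(os.path.dirname(path))
            found.setdefault(name, path)  # first source wins
    return found
```

Called with the three globs in order (~/.claude/skills, ~/.codex/skills, ~/.agents/skills), a skill present in more than one location is reported once.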
Identify target skills
↓
Collect session data (python3 scripts scan JSONL transcripts)
↓
Run 8 analysis dimensions
↓
Compute composite scores
↓
Output report with P0/P1/P2
Scan skill directories in order: ~/.claude/skills/, ~/.codex/skills/, ~/.agents/skills/. Deduplicate by skill name (same name in multiple locations = same skill). For each, read SKILL.md and extract:
If user specified skill names, filter to only those.
Use python3 scripts via Bash to scan session JSONL files. Extract:
Claude Code sessions (~/.claude/projects/**/*.jsonl):
- Skill tool_use calls (which skills were invoked)

Codex sessions (~/.codex/sessions/**/*.jsonl):
- session_meta events → extract base_instructions for skill loading evidence
- response_item events → assistant outputs (workflow tracking)
- event_msg events → tool execution and skill-related events
- turn_context events (for reaction analysis)

Note: Codex injects skills via context rather than explicit Skill tool calls. Skill loading (present in base_instructions) does NOT equal active invocation. To detect actual use, search for skill-specific workflow markers (step headers, output formats) in response_item content within that session. A skill counts as "invoked" only if the agent produced output following the skill's defined workflow.
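For the Claude Code side, a minimal scanner might look like this. The event schema assumed here (a message.content list of blocks with type and name fields) is an assumption -- verify it against your local transcripts before relying on the counts.

```python
import json
from collections import Counter

def count_skill_calls(jsonl_path):
    """Count Skill tool_use blocks per skill in one Claude Code transcript.
    Schema is assumed, not guaranteed: adjust field names to your data."""
    counts = Counter()
    with open(jsonl_path) as f:
        for line in f:
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip malformed lines
            if not isinstance(event, dict):
                continue
            content = (event.get("message") or {}).get("content")
            if not isinstance(content, list):
                continue  # content may be a plain string
            for block in content:
                if isinstance(block, dict) and block.get("type") == "tool_use" \
                        and block.get("name") == "Skill":
                    skill = (block.get("input") or {}).get("command", "unknown")
                    counts[skill] += 1
    return counts
```

Codex transcripts need the workflow-marker search described above instead; base_instructions presence alone must not increment any counter.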
Aggregated:
You MUST run ALL 8 dimensions. The baseline behavior without this skill is to skip dimensions 4.2, 4.3, 4.5b, and 4.8. These are the most valuable dimensions — do not skip them.
Count how many times each skill was actually invoked vs how many times its trigger keywords appeared in user messages.
Claude Code: count Skill tool_use calls in transcripts.
Codex: count sessions where the agent produced output following the skill's workflow markers (not merely loaded in context).
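The trigger-rate comparison can be sketched as follows; the session dict shape ("user_messages", "invoked") is a hypothetical intermediate format produced by the transcript scan, not something the platforms emit directly.

```python
import re

def trigger_stats(sessions, keywords):
    """sessions: iterable of {"user_messages": [...], "invoked": bool}.
    Returns (opportunities, hits): sessions whose user messages mention a
    trigger keyword, and how many of those actually invoked the skill."""
    pattern = re.compile("|".join(map(re.escape, keywords)), re.IGNORECASE)
    opportunities = hits = 0
    for session in sessions:
        if any(pattern.search(m) for m in session["user_messages"]):
            opportunities += 1
            hits += bool(session["invoked"])
    return opportunities, hits
```

A large gap between opportunities and hits is the undertriggering signal that dimension 4.5b drills into.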
Diagnose:
This dimension is critical and easy to skip. Do not skip it.
After a skill is invoked in a session, read the user's next 3 messages. Classify:
Report per-skill satisfaction rate.
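A crude first-pass classifier for the next-3-messages step might look like this. The cue lists and the positive/negative/neutral buckets are illustrative placeholders, not part of this spec; manual review of ambiguous cases is still expected.

```python
NEGATIVE_CUES = ("undo", "revert", "stop", "wrong", "not what i")
POSITIVE_CUES = ("thanks", "great", "perfect", "exactly")

def classify_reaction(next_messages):
    """Heuristic sentiment over the user's next 3 messages."""
    text = " ".join(next_messages[:3]).lower()
    if any(cue in text for cue in NEGATIVE_CUES):
        return "negative"
    if any(cue in text for cue in POSITIVE_CUES):
        return "positive"
    return "neutral"
```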
This dimension is critical and easy to skip. Do not skip it.
For each skill invocation found in session data:
Report: {skill-name} (N steps): avg completed Step X/N (Y%)
If a specific step is frequently where execution stops, flag it.
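Formatting the completion line and flagging a frequent stop step can be sketched as below; the 50% threshold for "frequently stops" is an illustrative choice, not fixed by this spec.

```python
from collections import Counter

def completion_line(skill, last_steps, total_steps):
    """last_steps: last completed step number per invocation."""
    avg = sum(last_steps) / len(last_steps)
    line = (f"{skill} ({total_steps} steps): "
            f"avg completed Step {avg:.1f}/{total_steps} "
            f"({100 * avg / total_steps:.0f}%)")
    stop, freq = Counter(last_steps).most_common(1)[0]
    # flag only if most runs stop short of the end at the same step
    if stop < total_steps and freq / len(last_steps) >= 0.5:
        line += f" -- frequently stops at Step {stop}"
    return line
```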
Check each SKILL.md against these 14 rules:
| Check | Pass Criteria |
|---|---|
| Frontmatter format | Only name + description, total < 1024 chars |
| Name format | Letters, numbers, hyphens only |
| Description trigger | Starts with "Use when..." or has explicit trigger conditions |
| Description workflow leak | Description does NOT summarize the skill's workflow steps (CSO violation) |
| Description pushiness | Description actively claims the scenarios where it should be used, rather than passively listing capabilities |
| Overview section | Present |
| Rules section | Present |
| MUST/NEVER density | Count ALL-CAPS directive words; >5 per 100 words = flag |
| Word count | < 500 words (flag if over) |
| Narrative anti-pattern | No "In session X, we found..." storytelling |
| YAML quoting safety | description containing : must be wrapped in double quotes |
| Critical info position | Core trigger conditions and primary actions must be in the first 20% of SKILL.md |
| Description 250-char check | Primary trigger keywords must appear within the first 250 characters of description |
| Trigger condition count | ≤ 2 trigger conditions in description is ideal |
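A subset of these checks is mechanical and can run as a python3 script. Pass criteria follow the table above; the ALL-CAPS directive word list is illustrative.

```python
import re

def static_checks(name, description, body):
    """Return a dict of check name -> pass/fail for the mechanical checks."""
    words = len(body.split())
    directives = re.findall(r"\b(?:MUST|NEVER|ALWAYS|SHOULD|DO NOT)\b", body)
    return {
        "frontmatter_length": len(name) + len(description) < 1024,
        "name_format": re.fullmatch(r"[A-Za-z0-9-]+", name) is not None,
        "description_trigger": description.lower().startswith("use when"),
        "word_count": words < 500,
        "directive_density": words == 0 or 100 * len(directives) / words <= 5,
        "yaml_quoting": ":" not in description
                        or (description.startswith('"')
                            and description.endswith('"')),
    }
```

The remaining checks (workflow leak, pushiness, narrative anti-pattern, critical-info position) need judgment and stay with the agent.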
Misfire: the skill was invoked but the user immediately rejected or ignored it.
This is the highest-value dimension. For each skill, extract its capability keywords (not just trigger keywords — what the skill CAN do). Then scan user messages for tasks that match those capabilities but where the skill was NOT invoked.
Report: which user messages SHOULD have triggered the skill but didn't, and suggest description improvements.
Compounding Risk Assessment: For skills with chronic undertriggering (0 triggers across 5+ sessions where relevant tasks appeared), flag as "compounding risk" — undertriggered skills cannot self-improve through usage feedback, causing the gap to widen over time. Recommend immediate description rewrite as P0.
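The missed-trigger scan can be sketched as follows, reusing the same hypothetical session dict shape as the trigger-rate step (these field names are illustrative, not a platform format).

```python
def undertrigger_misses(sessions, capability_keywords):
    """Return user messages that match the skill's capabilities in sessions
    where the skill was never invoked -- candidate missed triggers."""
    misses = []
    lowered = [k.lower() for k in capability_keywords]
    for session in sessions:
        if session["invoked"]:
            continue  # skill fired; not a miss
        for msg in session["user_messages"]:
            if any(k in msg.lower() for k in lowered):
                misses.append(msg)
    return misses
```

Zero hits plus many misses across 5+ sessions is exactly the compounding-risk pattern described above.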
Compare all skill pairs:
For each skill, extract referenced:
- File paths (verify they exist with test -e)
- Commands (verify they resolve with which)

Flag any broken references.
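The same validation can be done in python3 instead of shelling out; os.path.exists and shutil.which mirror test -e and which.

```python
import os
import shutil

def broken_references(file_paths, commands):
    """Return referenced paths that don't exist and commands not on PATH."""
    missing = [p for p in file_paths
               if not os.path.exists(os.path.expanduser(p))]
    missing += [c for c in commands if shutil.which(c) is None]
    return missing
```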
This dimension is critical and easy to skip. Do not skip it.
For each skill:
Progressive Disclosure Tier Check: Evaluate each skill against the 3-tier loading model:
Flag skills that put 500+ words in SKILL.md without using reference files as "poor progressive disclosure".
Rate each skill on a 5-point scale:
| Score | Meaning |
|---|---|
| 5 | Healthy: high trigger rate, positive reactions, complete workflows, clean static |
| 4 | Good: minor issues in 1-2 dimensions |
| 3 | Needs attention: significant gap in 1 dimension or minor gaps in 3+ |
| 2 | Problematic: never triggered, or negative user reactions, or major static issues |
| 1 | Broken: doesn't work, references missing, or fundamentally misaligned |
Scored dimensions (weighted average):
Qualitative dimensions (reported but not scored):
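The composite can be computed as a simple weighted average of the scored dimensions. Equal weights are used by default here because the spec leaves the weighting open; the dimension names in the example are placeholders.

```python
def composite_score(dimension_scores, weights=None):
    """Weighted average of per-dimension 1-5 scores, rounded to one decimal."""
    weights = weights or {d: 1.0 for d in dimension_scores}
    total = sum(weights[d] for d in dimension_scores)
    raw = sum(score * weights[d] for d, score in dimension_scores.items())
    return round(raw / total, 1)
```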
# Skill Optimization Report
**Date**: {date}
**Scope**: {all / specified skills}
**Session data**: {N} sessions, {date range}
## Overview
| Skill | Triggers | Reaction | Completion | Static | Undertrigger | Token | Score |
|-------|----------|----------|------------|--------|--------------|-------|-------|
| example-skill | 2 | 100% | 86% | B+ | 1 miss | 486w | 4/5 |
## P0 Fixes (blocking usage)
1. ...
## P1 Improvements (better experience)
1. ...
## P2 Optional Optimizations
1. ...
## Per-Skill Diagnostics
### {skill-name}
#### 4.1 Trigger Rate
...
#### 4.2 User Reaction
...
(all 8 dimensions)
The analysis dimensions in this report are grounded in the following research: