Use when a skill exists but blind assessment shows it underperforms, or when iteratively improving a skill through experiment cycles while avoiding overfitting to fixed benchmarks
If you only improve against a fixed benchmark, you're training to the test. Every improvement must generalize beyond the tasks that revealed it.
1. EXPERIMENT — Run baseline vs skill on diverse tasks
2. ASSESS — Blind assess (use blind-skill-assessment)
└─ Skill wins consistently? → DONE (see Convergence)
└─ Baseline wins consistently after 2+ cycles? → STOP (see When to Abandon)
└─ Use a separate agent/session for assessment when possible.
If same actor must assess: enforce time gap, strict sanitization, rubric-first.
3. DIAGNOSE — Root cause on dimensions where skill lost
└─ Don't fix symptoms. Ask: "What class of bugs does this represent?"
└─ Example: "bugs at hole boundaries" → missing cross-hole verification
4. TRIAGE — Rank causes by breadth. Fix the widest-impact cause first.
5. EDIT — One targeted change for one root cause. Log it (see Revision Log).
6. SANITIZE — Separate process artifacts from the edit itself.
└─ Submit only the skill edit for blind assessment, not revision logs,
anti-overfitting checklists, or "vs baseline" comparisons.
└─ Process artifacts go in your revision log, not in the assessed output.
7. RE-RUN — New experiments with improved skill
└─ 2+ cycles on same tasks? Add new tasks (see Anti-Overfitting)
└─ 2+ cycles with same judges? Rotate personas or dimensions
8. GOTO 2
Why SANITIZE matters: Process artifacts (revision logs, "vs baseline" framing, anti-overfitting checklists) make methodology-guided output structurally distinguishable from unguided output. If these leak into the assessed version, blind assessment isn't truly blind.
Before committing any skill edit, answer all four:
| Check | Pass | Fail |
|---|---|---|
| Would this help on a completely different task? | Structural improvement | Overfitting |
| Does this add a general step/rule, not task-specific wording? | Genuine | Overfitting |
| After 2+ cycles on same tasks, did you add new tasks? | Fresh signal | Stale benchmark |
| After 2+ cycles with same judges, did you rotate personas? | Diverse signal | Judge-fitted |
Litmus test: "Adds a structural step catching a class of bugs" = genuine. "Tunes wording to pass a specific test" = overfitting.
Stop when all hold:
Stop iterating and rethink/abandon the skill when any hold:
Abandoning is a valid conclusion, not a failure. A methodology that can't conclude "this doesn't work" is confirmation bias, not science.
Every skill edit gets a row. No undocumented changes.
| Date | What changed | Triggered by | Validated by |
|---|---|---|---|
| example | Added VERIFY step | Phase 3: bugs at hole seams | Phase 3b: HDD won 5/5 |
Blind assessment: A won 4/5 (decoded: A=baseline). Bug Hunter found bugs at hole boundaries — shared state and error paths crossing seams. Root cause: no cross-hole verification step. Edit: added VERIFY step (check state, scopes, error paths after each fill). Re-run: B won 5/5 (decoded: B=HDD). New tasks confirmed. Converged.
Counter-example: A skill adding "always validate inputs" was edited 3 times. Each cycle fixed one persona's scores but degraded another (validation noise hurt readability). After cycle 3, baseline still won 3/5. Conclusion: the skill's approach was zero-sum. Abandoned.
If you catch yourself doing any of these: STOP. Diagnose root cause. Plan validation. Then edit.