Core Principle

If you only improve against a fixed benchmark, you're training to the test. Every improvement must generalize beyond the tasks that revealed it.

The Improvement Loop

1. EXPERIMENT — Run baseline vs skill on diverse tasks
2. ASSESS    — Blind assess (use blind-skill-assessment)
   └─ Skill wins consistently? → DONE (see Convergence)
   └─ Baseline wins consistently after 2+ cycles? → STOP (see When to Abandon)
   └─ Use a separate agent/session for assessment when possible.
       If same actor must assess: enforce time gap, strict sanitization, rubric-first.
3. DIAGNOSE  — Root cause on dimensions where skill lost
   └─ Don't fix symptoms. Ask: "What class of bugs does this represent?"
   └─ Example: "bugs at hole boundaries" → missing cross-hole verification
4. TRIAGE    — Rank causes by breadth. Fix the widest-impact cause first.
5. EDIT      — One targeted change for one root cause. Log it (see Revision Log).
6. SANITIZE  — Separate process artifacts from the edit itself.
   └─ Submit only the skill edit for blind assessment, not revision logs,
       anti-overfitting checklists, or "vs baseline" comparisons.
   └─ Process artifacts go in your revision log, not in the assessed output.
7. RE-RUN    — New experiments with improved skill
   └─ 2+ cycles on same tasks? Add new tasks (see Anti-Overfitting)
   └─ 2+ cycles with same judges? Rotate personas or dimensions
8. GOTO 2
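
The branching in steps 2 and 7 can be sketched as a small decision helper. This is only an illustrative sketch, not part of the skill itself: the names `Cycle`, `Verdict`, `Action`, `next_action`, and `rerun_adjustments` are hypothetical, and it assumes each cycle's blind assessment has already been reduced to a single overall verdict.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    SKILL = "skill"        # skill won consistently in this cycle
    BASELINE = "baseline"  # baseline won consistently in this cycle
    MIXED = "mixed"        # no consistent winner


class Action(Enum):
    DONE = "done"        # converged, stop improving
    STOP = "stop"        # abandon the skill
    ITERATE = "iterate"  # diagnose, triage, edit, sanitize, re-run


@dataclass
class Cycle:
    verdict: Verdict
    task_set: str   # hypothetical label for the tasks used in this cycle
    judge_set: str  # hypothetical label for the judge personas/dimensions


def next_action(history: list[Cycle]) -> Action:
    """Apply the step 2 branches to the most recent assessment."""
    latest = history[-1]
    if latest.verdict is Verdict.SKILL:
        return Action.DONE
    # Assumption: "after 2+ cycles" means at least two cycles have been run.
    if latest.verdict is Verdict.BASELINE and len(history) >= 2:
        return Action.STOP
    return Action.ITERATE


def trailing_run(values: list[str]) -> int:
    """Count how many of the most recent entries equal the last one."""
    run = 0
    for value in reversed(values):
        if value != values[-1]:
            break
        run += 1
    return run


def rerun_adjustments(history: list[Cycle]) -> list[str]:
    """Apply the step 7 checks before starting the next cycle."""
    notes = []
    if trailing_run([c.task_set for c in history]) >= 2:
        notes.append("add new tasks")
    if trailing_run([c.judge_set for c in history]) >= 2:
        notes.append("rotate personas or dimensions")
    return notes
```

For example, two mixed cycles on the same task set with different judge sets would yield `Action.ITERATE` together with the note to add new tasks.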

Anti-Overfitting Checklist

Check	Pass	Fail
Would this help on a completely different task?	Structural improvement	Overfitting
Does this add a general step/rule, not task-specific wording?	Genuine	Overfitting
After 2+ cycles on same tasks, did you add new tasks?	Fresh signal	Stale benchmark
After 2+ cycles with same judges, did you rotate personas?	Diverse signal	Judge-fitted
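
One way to make the checklist harder to skip is to record the answers for each edit and derive the failures mechanically. The sketch below is a hypothetical illustration: `EditReview` and `overfitting_flags` are invented names, and the answers to the first two checks are still human judgments entered as booleans.

```python
from dataclasses import dataclass


@dataclass
class EditReview:
    helps_unrelated_task: bool    # would this help on a completely different task?
    is_general_rule: bool         # general step/rule, not task-specific wording?
    cycles_on_same_tasks: int     # consecutive cycles run on the same tasks
    added_new_tasks: bool
    cycles_with_same_judges: int  # consecutive cycles with the same judges
    rotated_personas: bool


def overfitting_flags(review: EditReview) -> list[str]:
    """Return the 'Fail' label for every checklist row the edit does not pass."""
    flags = []
    if not review.helps_unrelated_task:
        flags.append("Overfitting: would not help on a different task")
    if not review.is_general_rule:
        flags.append("Overfitting: task-specific wording")
    if review.cycles_on_same_tasks >= 2 and not review.added_new_tasks:
        flags.append("Stale benchmark")
    if review.cycles_with_same_judges >= 2 and not review.rotated_personas:
        flags.append("Judge-fitted")
    return flags
```

An empty list means the edit passed every row; any returned label names the row that needs attention before the next re-run.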

Iterative Skill Refinement

Core Principle

The Improvement Loop

Anti-Overfitting Checklist

Convergence — When to Stop

When to Abandon

Revision Log

Example: HDD VERIFY Step

Red Flags — STOP
