Run metric-driven iterative optimization loops. Define a measurable goal, build measurement scaffolding, then run parallel experiments that try many approaches, measure each against hard gates and/or LLM-as-judge quality scores, keep improvements, and converge toward the best solution. Use when optimizing clustering quality, search relevance, build performance, prompt quality, or any measurable outcome that benefits from systematic experimentation. Inspired by Karpathy's autoresearch, generalized for multi-file code changes and non-ML domains.
Run metric-driven iterative optimization. Define a goal, build measurement scaffolding, then run parallel experiments that converge toward the best solution.
Use the platform's blocking question tool when available (AskUserQuestion in Claude Code, request_user_input in Codex, ask_user in Gemini). Otherwise, present numbered options in chat and wait for the user's reply before proceeding.
<optimization_input> #$ARGUMENTS </optimization_input>
If the input above is empty, ask: "What would you like to optimize? Describe the goal, or provide a path to an optimization spec YAML file."
Reference the spec schema for validation:
references/optimize-spec-schema.yaml
Reference the experiment log schema for state management:
references/experiment-log-schema.yaml

For a first run, optimize for signal and safety, not maximum throughput:
- `references/example-hard-spec.yaml` when the metric is objective and cheap to measure
- `references/example-judge-spec.yaml` only when actual quality requires semantic judgment
- `execution.mode: serial` and `execution.max_concurrent: 1`
- `stopping.max_iterations: 4` and `stopping.max_hours: 1`
- `sample_size: 10`, `batch_size: 5`, and `max_total_cost_usd: 5`

For a friendly overview of what this skill is for, when to use hard metrics vs LLM-as-judge, and example kickoff prompts, see:
references/usage-guide.md
CRITICAL: The experiment log on disk is the single source of truth. The conversation context is NOT durable storage. Results that exist only in the conversation WILL be lost.
The files under .context/compound-engineering/ce-optimize/<spec-name>/ are local scratch state. They are ignored by git, so they survive local resumes on the same machine but are not preserved by commits, branches, or pushes unless the user exports them separately.
This skill runs for hours. Context windows compact, sessions crash, and agents restart. Every piece of state that matters MUST live on disk, not in the agent's memory.
If you produce a results table in the conversation without writing those results to disk first, you have a bug. The conversation is for the user's benefit. The experiment log file is for durability.
Write each experiment result to disk IMMEDIATELY after measurement — not after the batch, not after evaluation, IMMEDIATELY. Append the experiment entry to the experiment log file the moment its metrics are known, before evaluating the next experiment. This is the #1 crash-safety rule.
VERIFY every critical write — after writing the experiment log, read the file back and confirm the entry is present. This catches silent write failures. Do not proceed to the next experiment until verification passes.
Re-read from disk at every phase boundary and before every decision — never trust in-memory state across phase transitions, batch boundaries, or after any operation that might have taken significant time. Re-read the experiment log and strategy digest from disk.
The experiment log is append-only during Phase 3 — never rewrite the full file. Append new experiment entries. Update the best section in place only when a new best is found. This prevents data loss if a write is interrupted.
Per-experiment result markers for crash recovery — each experiment writes a result.yaml marker in its worktree immediately after measurement. On resume, scan for these markers to recover experiments that were measured but not yet logged.
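A resume-time recovery scan over these markers could be sketched as follows. The `demo-worktrees/` layout, marker field names, and experiment id are all illustrative, not a fixed schema:

```shell
# Illustrative recovery scan: one worktree holds a measured-but-unlogged result.
mkdir -p demo-worktrees/exp-003
printf 'id: exp-003\nmetric: 0.91\n' > demo-worktrees/exp-003/result.yaml
LOG="experiment-log.yaml"
: >> "$LOG"                                 # ensure the log file exists
for marker in demo-worktrees/*/result.yaml; do
  id=$(grep -m1 '^id:' "$marker" | awk '{print $2}')
  # Anything with a marker but no log entry must be re-appended on resume.
  grep -q "id: $id" "$LOG" || echo "unlogged experiment: $id"
done
# prints: unlogged experiment: exp-003
```

Each experiment found this way is appended to the log (CP-3) before any new hypotheses are generated.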
Strategy digest is written after every batch, before generating new hypotheses — the agent reads the digest (not its memory) when deciding what to try next.
Never present results to the user without writing them to disk first — the pattern is: measure -> write to disk -> verify -> THEN show the user. Not the reverse.
These are non-negotiable write-then-verify steps. At each checkpoint, the agent MUST write the specified file and then read it back to confirm the write succeeded.
| Checkpoint | File Written | Phase |
|---|---|---|
| CP-0: Spec saved | spec.yaml | Phase 0, after user approval |
| CP-1: Baseline recorded | experiment-log.yaml (initial with baseline) | Phase 1, after baseline measurement |
| CP-2: Hypothesis backlog saved | experiment-log.yaml (hypothesis_backlog section) | Phase 2, after hypothesis generation |
| CP-3: Each experiment result | experiment-log.yaml (append experiment entry) | Phase 3.3, immediately after each measurement |
| CP-4: Batch summary | experiment-log.yaml (outcomes + best) + strategy-digest.md | Phase 3.5, after batch evaluation |
| CP-5: Final summary | experiment-log.yaml (final state) | Phase 4, at wrap-up |
Format of a verification step:
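The exact shape is up to the agent; one minimal sketch of a write-then-verify step, with a hypothetical log entry, is:

```shell
# Hypothetical checkpoint verification: write, read back, halt on mismatch.
LOG="experiment-log.yaml"
printf '  - id: exp-001\n    metric: 0.72\n' >> "$LOG"
if grep -q 'id: exp-001' "$LOG"; then
  echo "checkpoint verified"
else
  echo "checkpoint verification FAILED -- stop and re-write" >&2
  exit 1
fi
```

The essential property is that the read-back happens against the file on disk, never against in-memory state.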
Scratch state files (under `.context/compound-engineering/ce-optimize/<spec-name>/`):

| File | Purpose | Written When |
|---|---|---|
spec.yaml | Optimization spec (immutable during run) | Phase 0 (CP-0) |
experiment-log.yaml | Full history of all experiments | Initialized at CP-1, appended at CP-3, updated at CP-4 |
strategy-digest.md | Compressed learnings for hypothesis generation | Written at CP-4 after each batch |
<worktree>/result.yaml | Per-experiment crash-recovery marker | Immediately after measurement, before CP-3 |
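A `result.yaml` marker might look like the following sketch; every field name here is illustrative rather than a fixed schema:

```yaml
# Hypothetical result.yaml crash-recovery marker
id: exp-012
iteration: 3
hypothesis: "tighten similarity threshold"
metrics:
  primary: 0.84
measured_at: "2025-01-15T14:31:08Z"
```

What matters is that the marker carries enough to reconstruct the log entry: an id, the measured metrics, and which iteration produced it.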
When Phase 0.4 detects an existing run:
- Scan worktrees for `result.yaml` markers not yet in the log

Check whether the input is:
- A path to a spec file (ending in `.yaml` or `.yml`): read and validate it
- A free-text description of the optimization goal: design the spec interactively

If spec file provided:
Validate it against `references/optimize-spec-schema.yaml`:
- `name` is lowercase kebab-case and safe to use in git refs / worktree paths
- `metric.primary.type` is `hard` or `judge`
- If `judge`, the `metric.judge` section exists with rubric and scoring
- `measurement.command` is non-empty
- `scope.mutable` and `scope.immutable` each have at least one entry
- Gate comparators are valid (`>=`, `<=`, `>`, `<`, `==`, `!=`)
- `execution.max_concurrent` is at least 1
- `execution.max_concurrent` does not exceed 6 when backend is `worktree`

If description provided:
Analyze the project to understand what can be measured
Detect whether the optimization target is qualitative or quantitative — this determines type: hard vs type: judge and is the single most important spec decision:
Use `type: hard` when the metric is objective and cheap to measure with a command (e.g. timings, counts, pass rates).
Use `type: judge` when actual quality requires semantic judgment that no command can compute (e.g. relevance, coherence, cluster quality).
IMPORTANT: If the target is qualitative, strongly recommend type: judge. Explain that hard metrics alone will optimize proxy numbers without checking actual quality. Show the user the three-tier approach:
If the user insists on type: hard for a qualitative target, proceed but warn that the results may optimize a misleading proxy.
Design the sampling strategy (for type: judge):
Guide the user through defining stratified sampling. The key question is: "What parts of the output space do you need to check quality on?"
Walk through these questions:
Example stratified sampling for clustering:
```yaml
stratification:
  - bucket: "top_by_size"     # largest clusters — check for degenerate mega-clusters
    count: 10
  - bucket: "mid_range"       # middle of non-solo cluster size range — representative quality
    count: 10
  - bucket: "small_clusters"  # clusters with 2-3 items — check if connections are real
    count: 10
singleton_sample: 15          # singletons — check for false negatives (items that should cluster)
```
The sampling strategy is domain-specific. For search relevance, strata might be "top-3 results", "results 4-10", "tail results". For summarization, strata might be "short documents", "long documents", "multi-topic documents".
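For instance, a search-relevance spec might declare strata like the following (bucket names are illustrative):

```yaml
stratification:
  - bucket: "top_3_results"      # what users see first — quality here dominates
    count: 10
  - bucket: "results_4_to_10"    # first-page tail
    count: 10
  - bucket: "tail_results"       # check for junk that should not rank at all
    count: 10
```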
Singleton evaluation is critical when the goal involves coverage — sampling singletons with the singleton rubric checks whether the system is missing obvious groupings.
Design the rubric (for type: judge):
Help the user define the scoring rubric. A good rubric:
- Requests auxiliary structured fields where useful (e.g. `distinct_topics`, `outlier_count`)

Example for clustering:
```yaml
rubric: |
  Rate this cluster 1-5:
  - 5: All items clearly about the same issue/feature
  - 4: Strong theme, minor outliers
  - 3: Related but covers 2-3 sub-topics that could reasonably be split
  - 2: Weak connection — items share superficial similarity only
  - 1: Unrelated items grouped together
  Also report: distinct_topics (integer), outlier_count (integer)
```
Guide the user through the remaining spec fields:
- For a first run: `execution.mode: serial`, `execution.max_concurrent: 1`, `stopping.max_iterations: 4`, and `stopping.max_hours: 1`
- For `type: judge`: recommend `sample_size: 10`, `batch_size: 5`, and `max_total_cost_usd: 5` until the rubric and harness are trusted

Write the spec to `.context/compound-engineering/ce-optimize/<spec-name>/spec.yaml`
Present the spec to the user for approval before proceeding
Dispatch compound-engineering:research:learnings-researcher to search for prior optimization work on similar topics. If relevant learnings exist, incorporate them into the approach.
Check if optimize/<spec-name> branch already exists:
git rev-parse --verify "optimize/<spec-name>" 2>/dev/null
If branch exists, check for an existing experiment log at .context/compound-engineering/ce-optimize/<spec-name>/experiment-log.yaml.
Present the user with a choice via the platform question tool:
- Resume: recover any unlogged `result.yaml` markers. Continue from the last iteration number in the log.
- Archive and restart: move the old state to `optimize-archive/<spec-name>/archived-<timestamp>`, clear the experiment log, start from scratch

git checkout -b "optimize/<spec-name>" # or switch to existing if resuming
Create scratch directory:
mkdir -p .context/compound-engineering/ce-optimize/<spec-name>/
This phase is a HARD GATE. The user must approve baseline and parallel readiness before Phase 2.
Verify no uncommitted changes to files within scope.mutable or scope.immutable:
git status --porcelain
Filter the output against the scope paths. If any in-scope files have uncommitted changes:
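One way this filter might look, shown in a throwaway repo where `src` stands in for a `scope.mutable` path:

```shell
# Illustrative scope check: refuse to baseline while in-scope files are dirty.
git init -q scope-demo
mkdir -p scope-demo/src
echo change > scope-demo/src/a.txt          # an uncommitted in-scope file
dirty=$(git -C scope-demo status --porcelain -- src)
if [ -n "$dirty" ]; then
  echo "in-scope files have uncommitted changes -- refusing to baseline"
fi
```

A dirty baseline would make every later diff ambiguous, which is why this check gates Phase 1.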
If user provides a measurement harness (the measurement.command already exists):
bash scripts/measure.sh "<measurement.command>" <timeout_seconds> "<measurement.working_directory or .>"
If agent must build the harness:
- Write a measurement script (`evaluate.py`, `evaluate.sh`, or equivalent)
- Place the harness in `scope.immutable` -- the experiment agent must not modify it

Run the measurement harness on the current code.
If stability mode is repeat:
- Run the measurement `repeat_count` times
- If the spread exceeds `noise_threshold`, warn the user and suggest increasing `repeat_count`

Record the baseline in the experiment log:
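A combined sketch of the repeat-measure-and-record step. The run values, the 0.02 threshold, and the log layout are all illustrative stand-ins:

```shell
# Hypothetical baseline: three stubbed harness runs instead of real measurements.
runs="0.81 0.84 0.79"
spread=$(printf '%s\n' $runs | sort -n | awk 'NR==1{min=$1} {max=$1} END{printf "%.2f", max-min}')
# Warn if spread exceeds a hypothetical noise_threshold of 0.02.
awk -v s="$spread" 'BEGIN{exit !(s > 0.02)}' && echo "noisy: consider raising repeat_count"
# Record the median run as the baseline (CP-1 write).
median=$(printf '%s\n' $runs | sort -n | awk 'NR==2')
printf 'baseline:\n  metric: %s\n  spread: %s\n' "$median" "$spread" >> experiment-log.yaml
```

As with every checkpoint, the write is then read back from disk before proceeding.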