Run metric-driven iterative optimization loops. Define a measurable goal, build measurement scaffolding, then run parallel experiments that try many approaches, measure each against hard gates and/or LLM-as-judge quality scores, keep improvements, and converge toward the best solution. Use when optimizing clustering quality, search relevance, build performance, prompt quality, or any measurable outcome that benefits from systematic experimentation. Inspired by Karpathy's autoresearch, generalized for multi-file code changes and non-ML domains.
Run metric-driven iterative optimization. Define a goal, build measurement scaffolding, then run parallel experiments that converge toward the best solution.
Use the platform's blocking question tool when available (AskUserQuestion in Claude Code, request_user_input in Codex, ask_user in Gemini). Otherwise, present numbered options in chat and wait for the user's reply before proceeding.
<optimization_input> #$ARGUMENTS </optimization_input>
If the input above is empty, ask: "What would you like to optimize? Describe the goal, or provide a path to an optimization spec YAML file."
Reference the spec schema for validation:
references/optimize-spec-schema.yaml
Reference the experiment log schema for state management:
references/experiment-log-schema.yaml

For a first run, optimize for signal and safety, not maximum throughput:
- `references/example-hard-spec.yaml` when the metric is objective and cheap to measure
- `references/example-judge-spec.yaml` only when actual quality requires semantic judgment
- `execution.mode: serial` and `execution.max_concurrent: 1`
- `stopping.max_iterations: 4` and `stopping.max_hours: 1`
- `sample_size: 10`, `batch_size: 5`, and `max_total_cost_usd: 5`

For a friendly overview of what this skill is for, when to use hard metrics vs LLM-as-judge, and example kickoff prompts, see:
references/usage-guide.md
CRITICAL: The experiment log on disk is the single source of truth. The conversation context is NOT durable storage. Results that exist only in the conversation WILL be lost.
The files under .context/compound-engineering/ce-optimize/<spec-name>/ are local scratch state. They are ignored by git, so they survive local resumes on the same machine but are not preserved by commits, branches, or pushes unless the user exports them separately.
This skill runs for hours. Context windows compact, sessions crash, and agents restart. Every piece of state that matters MUST live on disk, not in the agent's memory.
If you produce a results table in the conversation without writing those results to disk first, you have a bug. The conversation is for the user's benefit. The experiment log file is for durability.
Write each experiment result to disk IMMEDIATELY after measurement — not after the batch, not after evaluation, IMMEDIATELY. Append the experiment entry to the experiment log file the moment its metrics are known, before evaluating the next experiment. This is the #1 crash-safety rule.
VERIFY every critical write — after writing the experiment log, read the file back and confirm the entry is present. This catches silent write failures. Do not proceed to the next experiment until verification passes.
Re-read from disk at every phase boundary and before every decision — never trust in-memory state across phase transitions, batch boundaries, or after any operation that might have taken significant time. Re-read the experiment log and strategy digest from disk.
The experiment log is append-only during Phase 3 — never rewrite the full file. Append new experiment entries. Update the best section in place only when a new best is found. This prevents data loss if a write is interrupted.
Per-experiment result markers for crash recovery — each experiment writes a result.yaml marker in its worktree immediately after measurement. On resume, scan for these markers to recover experiments that were measured but not yet logged.
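A resume-time recovery scan over these markers could be sketched as follows. The `demo-worktrees/` layout, marker field names, and experiment id are all illustrative, not a fixed schema:

```shell
# Illustrative recovery scan: one worktree holds a measured-but-unlogged result.
mkdir -p demo-worktrees/exp-003
printf 'id: exp-003\nmetric: 0.91\n' > demo-worktrees/exp-003/result.yaml
LOG="experiment-log.yaml"
: >> "$LOG"                                 # ensure the log file exists
for marker in demo-worktrees/*/result.yaml; do
  id=$(grep -m1 '^id:' "$marker" | awk '{print $2}')
  # Anything with a marker but no log entry must be re-appended on resume.
  grep -q "id: $id" "$LOG" || echo "unlogged experiment: $id"
done
# prints: unlogged experiment: exp-003
```

Each experiment found this way is appended to the log (CP-3) before any new hypotheses are generated.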
Strategy digest is written after every batch, before generating new hypotheses — the agent reads the digest (not its memory) when deciding what to try next.
Never present results to the user without writing them to disk first — the pattern is: measure -> write to disk -> verify -> THEN show the user. Not the reverse.
These are non-negotiable write-then-verify steps. At each checkpoint, the agent MUST write the specified file and then read it back to confirm the write succeeded.
| Checkpoint | File Written | Phase |
|---|---|---|
| CP-0: Spec saved | spec.yaml | Phase 0, after user approval |
| CP-1: Baseline recorded | experiment-log.yaml (initial with baseline) | Phase 1, after baseline measurement |
| CP-2: Hypothesis backlog saved | experiment-log.yaml (hypothesis_backlog section) | Phase 2, after hypothesis generation |
| CP-3: Each experiment result | experiment-log.yaml (append experiment entry) | Phase 3.3, immediately after each measurement |
| CP-4: Batch summary | experiment-log.yaml (outcomes + best) + strategy-digest.md | Phase 3.5, after batch evaluation |
| CP-5: Final summary | experiment-log.yaml (final state) | Phase 4, at wrap-up |
Format of a verification step:
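The exact shape is up to the agent; one minimal sketch of a write-then-verify step, with a hypothetical log entry, is:

```shell
# Hypothetical checkpoint verification: write, read back, halt on mismatch.
LOG="experiment-log.yaml"
printf '  - id: exp-001\n    metric: 0.72\n' >> "$LOG"
if grep -q 'id: exp-001' "$LOG"; then
  echo "checkpoint verified"
else
  echo "checkpoint verification FAILED -- stop and re-write" >&2
  exit 1
fi
```

The essential property is that the read-back happens against the file on disk, never against in-memory state.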
Scratch state files (under `.context/compound-engineering/ce-optimize/<spec-name>/`):

| File | Purpose | Written When |
|---|---|---|
spec.yaml | Optimization spec (immutable during run) | Phase 0 (CP-0) |
experiment-log.yaml | Full history of all experiments | Initialized at CP-1, appended at CP-3, updated at CP-4 |
strategy-digest.md | Compressed learnings for hypothesis generation | Written at CP-4 after each batch |
<worktree>/result.yaml | Per-experiment crash-recovery marker | Immediately after measurement, before CP-3 |
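A `result.yaml` marker might look like the following sketch; every field name here is illustrative rather than a fixed schema:

```yaml
# Hypothetical result.yaml crash-recovery marker
id: exp-012
iteration: 3
hypothesis: "tighten similarity threshold"
metrics:
  primary: 0.84
measured_at: "2025-01-15T14:31:08Z"
```

What matters is that the marker carries enough to reconstruct the log entry: an id, the measured metrics, and which iteration produced it.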
When Phase 0.4 detects an existing run:
- Scan worktrees for `result.yaml` markers not yet in the log

Check whether the input is:
- A path to a spec file (ending in `.yaml` or `.yml`): read and validate it
- A free-text description of the optimization goal: design the spec interactively

If spec file provided:
Validate it against `references/optimize-spec-schema.yaml`:
- `name` is lowercase kebab-case and safe to use in git refs / worktree paths
- `metric.primary.type` is `hard` or `judge`
- If `judge`, the `metric.judge` section exists with rubric and scoring
- `measurement.command` is non-empty
- `scope.mutable` and `scope.immutable` each have at least one entry
- Gate comparators are valid (`>=`, `<=`, `>`, `<`, `==`, `!=`)
- `execution.max_concurrent` is at least 1
- `execution.max_concurrent` does not exceed 6 when backend is `worktree`

If description provided:
Analyze the project to understand what can be measured
Detect whether the optimization target is qualitative or quantitative — this determines type: hard vs type: judge and is the single most important spec decision:
Use `type: hard` when the metric is objective and cheap to measure with a command (e.g. timings, counts, pass rates).
Use `type: judge` when actual quality requires semantic judgment that no command can compute (e.g. relevance, coherence, cluster quality).
IMPORTANT: If the target is qualitative, strongly recommend type: judge. Explain that hard metrics alone will optimize proxy numbers without checking actual quality. Show the user the three-tier approach:
If the user insists on type: hard for a qualitative target, proceed but warn that the results may optimize a misleading proxy.
Design the sampling strategy (for type: judge):
Guide the user through defining stratified sampling. The key question is: "What parts of the output space do you need to check quality on?"
Walk through these questions:
Example stratified sampling for clustering:
```yaml
stratification:
  - bucket: "top_by_size"     # largest clusters — check for degenerate mega-clusters
    count: 10
  - bucket: "mid_range"       # middle of non-solo cluster size range — representative quality
    count: 10
  - bucket: "small_clusters"  # clusters with 2-3 items — check if connections are real
    count: 10
singleton_sample: 15          # singletons — check for false negatives (items that should cluster)
```
The sampling strategy is domain-specific. For search relevance, strata might be "top-3 results", "results 4-10", "tail results". For summarization, strata might be "short documents", "long documents", "multi-topic documents".
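For instance, a search-relevance spec might declare strata like the following (bucket names are illustrative):

```yaml
stratification:
  - bucket: "top_3_results"      # what users see first — quality here dominates
    count: 10
  - bucket: "results_4_to_10"    # first-page tail
    count: 10
  - bucket: "tail_results"       # check for junk that should not rank at all
    count: 10
```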
Singleton evaluation is critical when the goal involves coverage — sampling singletons with the singleton rubric checks whether the system is missing obvious groupings.
Design the rubric (for type: judge):
Help the user define the scoring rubric. A good rubric:
- Requests auxiliary structured fields where useful (e.g. `distinct_topics`, `outlier_count`)

Example for clustering:
```yaml
rubric: |
  Rate this cluster 1-5:
  - 5: All items clearly about the same issue/feature
  - 4: Strong theme, minor outliers
  - 3: Related but covers 2-3 sub-topics that could reasonably be split
  - 2: Weak connection — items share superficial similarity only
  - 1: Unrelated items grouped together
  Also report: distinct_topics (integer), outlier_count (integer)
```
Guide the user through the remaining spec fields:
- For a first run: `execution.mode: serial`, `execution.max_concurrent: 1`, `stopping.max_iterations: 4`, and `stopping.max_hours: 1`
- For `type: judge`: recommend `sample_size: 10`, `batch_size: 5`, and `max_total_cost_usd: 5` until the rubric and harness are trusted

Write the spec to `.context/compound-engineering/ce-optimize/<spec-name>/spec.yaml`
Present the spec to the user for approval before proceeding
Dispatch compound-engineering:research:learnings-researcher to search for prior optimization work on similar topics. If relevant learnings exist, incorporate them into the approach.
Check if optimize/<spec-name> branch already exists:
git rev-parse --verify "optimize/<spec-name>" 2>/dev/null
If branch exists, check for an existing experiment log at .context/compound-engineering/ce-optimize/<spec-name>/experiment-log.yaml.
Present the user with a choice via the platform question tool:
- Resume: recover any unlogged `result.yaml` markers. Continue from the last iteration number in the log.
- Archive and restart: move the old state to `optimize-archive/<spec-name>/archived-<timestamp>`, clear the experiment log, start from scratch

git checkout -b "optimize/<spec-name>" # or switch to existing if resuming
Create scratch directory:
mkdir -p .context/compound-engineering/ce-optimize/<spec-name>/
This phase is a HARD GATE. The user must approve baseline and parallel readiness before Phase 2.
Verify no uncommitted changes to files within scope.mutable or scope.immutable:
git status --porcelain
Filter the output against the scope paths. If any in-scope files have uncommitted changes:
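One way this filter might look, shown in a throwaway repo where `src` stands in for a `scope.mutable` path:

```shell
# Illustrative scope check: refuse to baseline while in-scope files are dirty.
git init -q scope-demo
mkdir -p scope-demo/src
echo change > scope-demo/src/a.txt          # an uncommitted in-scope file
dirty=$(git -C scope-demo status --porcelain -- src)
if [ -n "$dirty" ]; then
  echo "in-scope files have uncommitted changes -- refusing to baseline"
fi
```

A dirty baseline would make every later diff ambiguous, which is why this check gates Phase 1.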
If user provides a measurement harness (the measurement.command already exists):
bash scripts/measure.sh "<measurement.command>" <timeout_seconds> "<measurement.working_directory or .>"
If agent must build the harness:
- Write a measurement script (`evaluate.py`, `evaluate.sh`, or equivalent)
- Place the harness in `scope.immutable` -- the experiment agent must not modify it

Run the measurement harness on the current code.
If stability mode is repeat:
- Run the measurement `repeat_count` times
- If the spread exceeds `noise_threshold`, warn the user and suggest increasing `repeat_count`

Record the baseline in the experiment log:
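A combined sketch of the repeat-measure-and-record step. The run values, the 0.02 threshold, and the log layout are all illustrative stand-ins:

```shell
# Hypothetical baseline: three stubbed harness runs instead of real measurements.
runs="0.81 0.84 0.79"
spread=$(printf '%s\n' $runs | sort -n | awk 'NR==1{min=$1} {max=$1} END{printf "%.2f", max-min}')
# Warn if spread exceeds a hypothetical noise_threshold of 0.02.
awk -v s="$spread" 'BEGIN{exit !(s > 0.02)}' && echo "noisy: consider raising repeat_count"
# Record the median run as the baseline (CP-1 write).
median=$(printf '%s\n' $runs | sort -n | awk 'NR==2')
printf 'baseline:\n  metric: %s\n  spread: %s\n' "$median" "$spread" >> experiment-log.yaml
```

As with every checkpoint, the write is then read back from disk before proceeding.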