Set up and run an autonomous experiment loop for any optimization target. Gathers what to optimize, then starts the loop immediately. Use when asked to "run autoresearch", "optimize X in a loop", "set up autoresearch for X", or "start experiments".
Autonomous experiment loop: try ideas, keep what works, discard what doesn't, never stop.
init_experiment — configure session (name, metric, unit, direction). Call again to re-initialize with a new baseline when the optimization target changes.run_experiment — runs command, times it, captures output.log_experiment — records result. keep auto-commits via jj commit using a Conventional Commit title plus a detailed body. discard/crash/checks_failed auto-reverts code changes (autoresearch files preserved). Always include secondary metrics dict. Dashboard: ctrl+x.jj bookmark create autoresearch/<goal>-<date> if you want a named bookmark).autoresearch.md and autoresearch.sh (see below). Commit both.init_experiment → run baseline → log_experiment → start looping immediately.autoresearch.mdThis is the heart of the session. A fresh agent with no context should be able to read this file and run the loop effectively. Invest time making it excellent.
# Autoresearch: <goal>
## Objective
<Specific description of what we're optimizing and the workload.>
## Metrics
- **Primary**: <name> (<unit>, lower/higher is better) — the optimization target
- **Secondary**: <name>, <name>, ... — independent tradeoff monitors
## How to Run
`./autoresearch.sh` — outputs `METRIC name=number` lines.
## Files in Scope
<Every file the agent may modify, with a brief note on what it does.>
## Off Limits
<What must NOT be touched.>
## Constraints
<Hard rules: tests must pass, no new deps, etc.>
## What's Been Tried
<Update this section as experiments accumulate. Note key wins, dead ends,
and architectural insights so the agent doesn't repeat failed approaches.>
Update autoresearch.md periodically — especially the "What's Been Tried" section — so resuming agents have full context.
autoresearch.shBash script (set -euo pipefail) that: pre-checks fast (syntax errors in <1s), runs the benchmark, and outputs structured lines to stdout. Keep the script fast — every second is multiplied by hundreds of runs.
For fast, noisy benchmarks (< 5s), run the workload multiple times inside the script and report the median. This produces stable data points and makes the confidence score reliable from the start. Slow workloads (ML training, large builds) don't need this — single runs are fine.
METRIC name=value — primary metric (must match init_experiment's metric_name) and any secondary metrics. Parsed automatically by run_experiment.The script should output whatever data helps you make better decisions in the next iteration. Think about what you'll need to see after each run to know where to focus:
The script runs the same code every iteration — but you can update it during the loop if you discover you need more signal. Add instrumentation as you learn what matters.
log_experimentUse log_experiment's asi parameter to annotate each run with whatever would help the next iteration make a better decision. Free-form key/value pairs — you decide what's worth recording. Don't repeat the description or raw output; capture what you'd lose after a context reset.
For kept runs, think of asi as the material for the auto-generated commit body. Include the reasoning you would want in a high-quality jj commit: experiment hypothesis, theory, validation notes, bottleneck explanation, trade-offs, and any metric interpretation that would help a reviewer understand why the change was kept.
Annotate failures and crashes heavily. Discarded and crashed runs are reverted — the code changes are gone. The only record that survives is the description and ASI in autoresearch.jsonl. If you don't capture what you tried and why it failed, future iterations will waste time re-discovering the same dead ends.
autoresearch.config.json (optional)JSON config file that lives in the pi session's working directory (ctx.cwd). Supported fields:
maxIterations (number) — maximum experiments before auto-stopping.workingDir (string) — override the directory for all autoresearch operations: file I/O (autoresearch.jsonl, autoresearch.md, autoresearch.sh, autoresearch.checks.sh, autoresearch.ideas.md), command execution, and version-control operations. Supports absolute paths or relative paths (resolved against ctx.cwd). The config file itself always stays in ctx.cwd. Fails if the directory doesn't exist.{
"workingDir": "/path/to/project",
"maxIterations": 50
}
autoresearch.checks.sh (optional)Bash script (set -euo pipefail) for backpressure/correctness checks: tests, types, lint, etc. Only create this file when the user's constraints require correctness validation (e.g., "tests must pass", "types must check").
When this file exists:
run_experiment.run_experiment reports it clearly — log as checks_failed.keep a result when checks have failed.checks_timeout_seconds).When this file does not exist, everything behaves exactly as before — no changes to the loop.
Keep output minimal. Only the last 80 lines of checks output are fed back to the agent on failure. Suppress verbose progress/success output and let only errors through. This keeps context lean and helps the agent pinpoint what broke.
#!/bin/bash
set -euo pipefail
# Example: run tests and typecheck — suppress success output, only show errors
pnpm test --run --reporter=dot 2>&1 | tail -50
pnpm typecheck 2>&1 | grep -i error || true
LOOP FOREVER. Never ask "should I continue?" — the user expects autonomous work.
keep. Worse/equal → discard. Secondary metrics rarely affect this.description for kept runs. log_experiment uses it as the commit title and builds the detailed body from the run context, metrics, and ASI.asi. Record what you learned — not what you did. What would help the next iteration or a fresh agent resuming this session? For kept runs, this should be rich enough to support a detailed commit body.log_experiment reports a confidence score (best improvement as a multiple of the session noise floor). ≥2.0× means the improvement is likely real. <1.0× means it's within noise — consider re-running to confirm before keeping. The score is advisory — it never auto-discards.autoresearch.md exists, read it + jj log, continue looping.NEVER STOP. The user may be away for hours. Keep going until interrupted.
When you discover complex but promising optimizations that you won't pursue right now, append them as bullets to autoresearch.ideas.md. Don't let good ideas get lost.
On resume (context limit, crash), check autoresearch.ideas.md — prune stale/tried entries, experiment with the rest. When all paths are exhausted, delete the file and write a final summary.
If the user sends a message while an experiment is running, finish the current run_experiment + log_experiment cycle first, then incorporate their feedback in the next iteration. Don't abandon a running experiment.