Runs an autonomous experiment loop that modifies code, measures a target metric, keeps improvements, and discards failures. Operates indefinitely without human intervention. Triggers on "auto research", "run experiments", "optimize autonomously", "research loop", or "find improvements automatically".
Autonomous experiment loop: modify code, evaluate, measure metric, keep wins, discard losses. Inspired by Karpathy's autoresearch.
Works in any project with a measurable metric — ML training, compiler optimization, algorithm tuning, performance benchmarking.
If research.json exists, skip to Setup Phase.
Otherwise:
Explore the project with 2 parallel subagents — launch both simultaneously using the Agent tool:
Agent 1 — Structure & Stack: Scan directory tree, identify language/framework/build system, read README and top-level config files (package.json, pyproject.toml, Cargo.toml, etc.), summarize the project's purpose and tech stack.
Agent 2 — Metrics & Benchmarks: Search for existing benchmarks, test suites, evaluation scripts, and measurable metrics. Look for files with "bench", "test", "eval", "metric" in the name. Read any that exist and report what metrics are already being measured and how they're run.
Wait for both agents to return before proceeding.
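Agent 2's filename search can be sketched as a single find invocation; the excluded paths are illustrative, not a fixed part of the workflow:

```shell
# Find candidate benchmark/eval files by name, skipping common noise dirs.
find . -type f \
  \( -iname '*bench*' -o -iname '*test*' -o -iname '*eval*' -o -iname '*metric*' \) \
  -not -path './.git/*' -not -path './node_modules/*'
```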
Present findings and ask one question (use AskUserQuestion if available):
I've explored your project. Here's what I found:
- [brief project summary from Agent 1]
- [existing benchmarks/metrics from Agent 2]
What do you want to optimize? e.g., a metric (latency, accuracy, bundle size), a general goal ("make it faster"), or a specific area of the code.
Infer everything else from the project exploration:
- run_command: from build system / existing bench scripts found by Agent 2
- modifiable_files: from the user's answer + project structure
- readonly_files: context files related to the modifiable ones
- timeout_seconds: 2x expected runtime, or 300
- metric_direction: infer from name (loss/latency/size → lower; speed/accuracy → higher)
- tag: today's date (e.g., mar26)
- no_new_dependencies: true

Write research.json — see CONFIG.md for field reference.
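A research.json assembled from these inferences might look like the following; every value is illustrative, and CONFIG.md remains the authoritative field reference:

```json
{
  "run_command": "python train.py --config configs/small.yaml",
  "modifiable_files": ["train.py", "model.py"],
  "readonly_files": ["data_loader.py", "configs/small.yaml"],
  "timeout_seconds": 300,
  "metric_direction": "lower",
  "tag": "mar26",
  "no_new_dependencies": true
}
```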
Show the user for confirmation, then proceed.
Read research.json. Verify the autoresearch/<tag> branch doesn't exist — if it does, append -2, -3, etc. Then:
- git checkout -b autoresearch/<tag>
- Read readonly_files and modifiable_files for context
- Run setup_check if configured; halt on failure
- Add results.tsv and run.log to .gitignore
- Create results.tsv with a tab-separated header row
- Run the baseline and record it as keep | baseline

LOOP FOREVER. NEVER stop. NEVER ask permission to continue.
The user may be asleep. They expect you to run indefinitely until manually stopped.
1. Review results.tsv for trends. Re-read modifiable files for current state.
2. Form ONE hypothesis. Prefer changes that are:
- Informed by past results
- Meaningfully different from recent experiments
- Simple — tiny gain + ugly complexity = not worth it
3. Edit ONLY modifiable_files. No new dependencies unless config allows it.
4. Git commit with short description. Do NOT commit results.tsv or run.log.
5. Run experiment: redirect ALL output to run.log (NEVER flood context).
Kill if exceeds timeout_seconds.
6. Extract metric via grep. If empty → crashed.
7. Crashes: read last 50 lines of log. Trivial fix → retry (max 2-3 attempts).
Broken idea → log as crash, revert, move on.
8. Append tab-separated row to results.tsv (untracked, survives git resets).
9. Keep or discard:
- IMPROVED → keep commit, branch advances
- EQUAL or WORSE → git reset --hard HEAD~1
- Exception: equal metric + simpler code → keep
- Exception: tiny gain + lots of ugly code → discard
10. Every 10 experiments, print a progress summary.
11. GOTO 1
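Steps 5-9 above can be sketched in shell. Here RUN_COMMAND and TIMEOUT_SECONDS stand in for values from research.json, the val_bpb grep pattern is an illustrative metric extractor, and the better helper is a hypothetical lower-is-better comparison:

```shell
#!/usr/bin/env bash
RUN_COMMAND="python train.py"
TIMEOUT_SECONDS=300
BEST=0.9979
DESC="increase LR to 0.04"

# Lower-is-better comparison; flip the inequality for higher-is-better metrics.
better() { awk -v m="$1" -v b="$2" 'BEGIN { exit !(m < b) }'; }

# Step 5: redirect ALL output to run.log; kill the run if it exceeds the timeout.
timeout "$TIMEOUT_SECONDS" bash -c "$RUN_COMMAND" > run.log 2>&1

# Step 6: extract the metric; an empty result means the run crashed.
METRIC=$(grep -oE 'val_bpb=[0-9.]+' run.log | tail -1 | cut -d= -f2)

if [ -z "$METRIC" ]; then
  STATUS=crash      # step 7: read the log tail, retry trivial fixes, else revert
  git reset --hard HEAD~1
elif better "$METRIC" "$BEST"; then
  STATUS=keep       # step 9: improvement, branch advances
else
  STATUS=discard
  git reset --hard HEAD~1
fi

# Step 8: append to results.tsv (untracked, survives git resets).
printf '%s\t%s\t%s\t%s\n' "$(git rev-parse --short HEAD)" "${METRIC:-0.0}" "$STATUS" "$DESC" >> results.tsv
```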
```
commit   val_bpb   memory_gb  status   description
a1b2c3d  0.997900  44.0       keep     baseline
b2c3d4e  0.993200  44.2       keep     increase LR to 0.04
c3d4e5f  1.005000  44.0       discard  switch to GeLU activation
d4e5f6g  0.000000  0.0        crash    double model width (OOM)
e5f6g7h  0.990100  43.8       keep     add weight decay scheduling
f6g7h8i  0.990100  43.5       keep     simplify LR scheduler (equal metric, simpler code)
```
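The step-10 progress summary can be sketched as an awk pass over results.tsv; this assumes the column layout above, with a lower-is-better metric in column 2:

```shell
# Count outcomes and report the best kept metric so far.
awk -F'\t' 'NR > 1 {
  n[$4]++
  if ($4 == "keep" && (best == "" || $2+0 < best+0)) best = $2
} END {
  printf "experiments=%d keep=%d discard=%d crash=%d best=%s\n",
         NR - 1, n["keep"], n["discard"], n["crash"], best
}' results.tsv
```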