Autonomous experiment loop for AI agents. Use when the user wants to run systematic experiments — optimizing hyperparameters, searching for better configurations, ablation studies, or any task where an agent should iteratively try changes, measure results, and keep or discard based on a metric. Triggers on phrases like "run experiments", "optimize", "autoresearch", "ablation", "hyperparameter search", "find the best config".
You are now operating as an autonomous researcher. Your job is to systematically explore a search space by running experiments one at a time, measuring results against a clear metric, and building on what works.
Core philosophy: Humans set direction and constraints. You perform exhaustive exploration within those boundaries. Your randomness is a feature — you'll try things humans wouldn't think of. But you must be disciplined: one variable at a time, hypothesis first, measure after.
Autoresearch enforces two things that make AI agents effective researchers:
Discipline: Change only one variable at a time. Form a hypothesis, run the experiment, confirm or refute. Without this, you'll tweak three things at once, get a result, and have no clue which made the difference.
Memory: Git history is your experiment notebook. You can see what you've already tried, what worked, what didn't. Without this, you'd endlessly repeat yourself. With it, you iteratively build on your own results.
- `/autoresearch setup` — Interactive setup: define the experiment scope, metric, target files, and constraints
- `/autoresearch run` — Start the autonomous experiment loop
- `/autoresearch analyze` — Analyze `results.tsv` and summarize findings

If no argument is given, default to `setup` if no `autoresearch.config.md` exists in the project root, otherwise default to `run`.
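The default-command rule can be sketched as a small dispatcher. This is a sketch, not part of the protocol: the `autoresearch` function name and the `echo` placeholder stand in for whatever your agent harness actually invokes.

```shell
# Dispatch: an explicit subcommand wins; otherwise pick setup or run
# based on whether the config file already exists in the project root.
autoresearch() {
  cmd="$1"
  if [ -z "$cmd" ]; then
    if [ -f autoresearch.config.md ]; then
      cmd=run      # config exists: resume experimenting
    else
      cmd=setup    # no config yet: walk through setup
    fi
  fi
  echo "dispatching: $cmd"   # placeholder for the real subcommand
}
```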
## Setup (`/autoresearch setup`)

Before running experiments, you must establish the experiment protocol with the user. Walk through each item and write the answers to `autoresearch.config.md` in the project root.
1. GOAL: What are you trying to optimize? (e.g., "minimize validation loss", "maximize throughput", "reduce latency")
2. METRIC: What is the single number that determines success?
- How is it measured? (command, script, test output)
- What direction is better? (lower/higher)
3. TARGET FILES: Which file(s) can you modify?
- List explicitly. Everything else is READ-ONLY.
4. RUN COMMAND: What command runs one experiment?
- e.g., `python train.py`, `make benchmark`, `npm test`
5. EXTRACT COMMAND: How do you extract the metric from the run output?
- e.g., `grep "^val_loss:" run.log`, parse JSON output, read a file
6. TIME BUDGET: How long should each experiment run?
- Fixed time budget makes experiments directly comparable.
- Also set a kill timeout (e.g., 2x the budget).
7. CONSTRAINTS:
- Files that must NOT be modified (evaluation, data prep, etc.)
- Packages that must NOT be added
- Resource limits (memory, disk, etc.)
- Any invariants that must hold
8. BRANCH TAG: Name for this experiment session.
- Branch will be: autoresearch/<tag>
- e.g., autoresearch/mar17-lr-sweep
9. BASELINE: Do we need to run a baseline first? (usually yes)
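Items 4–6 can be made concrete with a small wrapper. This is a sketch: `python train.py` and the `val_loss:` log-line format are illustrative assumptions, and `timeout` is the GNU coreutils utility.

```shell
# Run one experiment under a kill timeout of 2x the budget, then pull
# the last reported metric out of the log. train.py and the "val_loss:"
# line format are assumptions for illustration only.
BUDGET=600                                   # time budget per run, seconds

run_once() {   # run_once CMD...  -> writes run.log, returns exit status
  timeout "$((BUDGET * 2))" "$@" > run.log 2>&1
}

extract_metric() {   # last val_loss value reported in run.log
  grep '^val_loss:' run.log | tail -n 1 | awk '{ print $2 }'
}
```

Usage: `run_once python train.py` exits with status 124 if the kill timeout fires; on success, `extract_metric` yields the number to record in `results.tsv`.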
After resolving all questions, write autoresearch.config.md:
# Autoresearch Configuration
## Goal
<what we're optimizing>
## Metric
- **Name**: <metric name>
- **Direction**: <lower|higher> is better
- **Extract command**: <how to get the number from run output>
## Target Files
- <file1> (description of what can be changed)
- <file2> (description of what can be changed)
## Read-Only Files
- <file1> (why it's read-only)
## Run Command
<run command>
## Branch
autoresearch/<tag>
When setup is complete:
- Create the branch: run `git checkout -b autoresearch/<tag>` from the current branch
- Create `results.tsv` with header: `commit\t<metric_name>\tstatus\tdescription`

## Run (`/autoresearch run`)

Read `autoresearch.config.md` to load the experiment protocol. Then enter the loop.
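The setup's final bookkeeping can be sketched as a helper. This is a sketch; `init_session` is an illustrative name, and the tag and metric name come from the answers gathered above.

```shell
# Finish setup: create the session branch and initialize the
# experiment log with the header row the loop expects.
init_session() {  # init_session TAG METRIC_NAME
  git checkout -b "autoresearch/$1" &&
  printf 'commit\t%s\tstatus\tdescription\n' "$2" > results.tsv
}
```

For example, `init_session mar17-lr-sweep val_loss` creates `autoresearch/mar17-lr-sweep` and a `results.tsv` whose metric column is `val_loss`.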
Review `results.tsv` and the recent git log to understand what's been tried, then:

```shell
# 1. Make ONE focused change to target file(s)
#    - Change only one variable at a time
#    - Keep the change small and reviewable

# 2. Commit the change
git add <target files>
git commit -m "<concise description of the change>"

# 3. Run the experiment
<run_command> > run.log 2>&1

# 4. Extract the metric
<extract_command>

# 5. Handle crashes
# If the run crashed or timed out:
#   - Read the error from run.log
#   - Record as crash in results.tsv
#   - Revert: git reset --hard HEAD~1
#   - Diagnose and try a different approach
```
Record the result in `results.tsv` (tab-separated; do NOT commit this file):

```
<commit_hash>\t<metric_value>\t<status>\t<description>
```

Where `status` is one of:
- `keep` — metric improved, commit stays on branch
- `discard` — metric equal or worse, revert the commit
- `crash` — run failed, revert the commit

The decision rule after each run:

```
IF metric improved (strictly better than best so far):
  → KEEP the commit (branch advances)
  → Log: "KEEP: <description> (<metric>: <old> → <new>)"
ELIF metric equal or worse:
  → DISCARD: git reset --hard HEAD~1
  → Log: "DISCARD: <description> (<metric>: <value> vs best <best>)"
ELIF crashed or timed out:
  → CRASH: git reset --hard HEAD~1
  → Log: "CRASH: <description> (error: <brief error>)"
```
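The record-and-decide step can be sketched as portable shell helpers. This is a sketch: `DIRECTION` mirrors the Metric section of the config, and `improved`/`record` are illustrative names, not part of the protocol.

```shell
# Append one row to results.tsv and decide keep/discard.
# DIRECTION is "lower" or "higher", from the Metric section of the config.
DIRECTION=lower
BEST=""                         # best metric seen so far on this branch

improved() {  # improved NEW BEST -> exit 0 if NEW is strictly better
  [ -z "$2" ] && return 0       # first successful run always wins
  if [ "$DIRECTION" = lower ]; then
    awk -v n="$1" -v b="$2" 'BEGIN { exit !(n < b) }'
  else
    awk -v n="$1" -v b="$2" 'BEGIN { exit !(n > b) }'
  fi
}

record() {  # record COMMIT METRIC STATUS DESCRIPTION
  printf '%s\t%s\t%s\t%s\n' "$1" "$2" "$3" "$4" >> results.tsv
}
```

After a run: `if improved "$metric" "$BEST"; then record "$(git rev-parse --short HEAD)" "$metric" keep "<description>"; BEST=$metric; else record "$(git rev-parse --short HEAD)" "$metric" discard "<description>"; git reset --hard HEAD~1; fi`. Note the strict inequality: an equal metric is a discard, matching the rule above.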
What to try (roughly in order of expected impact):
When stuck (no improvement in 5+ consecutive experiments):
Simplicity criterion:
## Analyze (`/autoresearch analyze`)

Read `results.tsv` and the git log, then produce a summary:
Format as a clear report. If possible, suggest the user visualize with a progress chart.
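A minimal analysis pass over `results.tsv` might look like the following. This is a sketch assuming the tab-separated format defined above and a lower-is-better metric; it writes a small sample `results.tsv` purely for illustration.

```shell
# Sample results.tsv (three experiments; tab-separated as specified).
printf 'a1\t3.00\tkeep\tbaseline\na2\t2.85\tkeep\tlower lr\na3\t2.97\tdiscard\tbigger batch\n' > results.tsv

# Summarize: count per status and the best kept metric (lower is better;
# flip the comparison for higher-is-better metrics).
summary=$(awk -F'\t' '
  $3 == "keep"    { keeps++; if (best == "" || $2 + 0 < best + 0) { best = $2; desc = $4 } }
  $3 == "discard" { discards++ }
  $3 == "crash"   { crashes++ }
  END { printf "keep=%d discard=%d crash=%d best=%s (%s)", keeps, discards, crashes, best, desc }
' results.tsv)
echo "$summary"
```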
This protocol works for any optimization task, not just ML training. Examples:
| Domain | Metric | Target File | Run Command |
|---|---|---|---|
| ML training | val_loss, val_bpb | train.py | python train.py |
| Compiler optimization | benchmark time | config.toml | make bench |
| Web performance | Lighthouse score | webpack.config.js | npm run build && lighthouse |
| Algorithm tuning | ops/sec | solver.py | python benchmark.py |
| Prompt engineering | eval accuracy | prompts.yaml | python eval.py |
| Database tuning | query latency | postgresql.conf | pgbench |
| CSS/rendering | layout shift score | styles.css | npm run perf-test |
The key insight: any task with a measurable metric and a file to modify can be autoresearched.
This protocol works with any AI agent that can read/write files, run shell commands, and use git. If you're running this outside OpenClaw (e.g., Claude Code, Codex, Cursor, Aider):
- Read `autoresearch.config.md` for the experiment protocol
- Use `results.tsv` as your experiment memory

For the original autoresearch methodology and implementation details, see reference.md.