Autonomous experiment loop: agent iterates on code, runs experiments, keeps improvements, discards failures, repeats indefinitely. Use for rapid metric optimization — model training (val_loss, accuracy), performance tuning (latency, throughput), or any measurable objective. Inspired by Karpathy's autoresearch.
Run an autonomous cycle of rapid experiments against a single target metric. Each iteration: hypothesize → modify code → run → evaluate → keep or discard. No human in the loop until manually interrupted.
Does not replace apm-ds-exp. That skill is for single, carefully-planned experiments with quality gates and user approval. This skill is for high-volume autonomous iteration.
Agree on these parameters before starting:
- Tag: a short name for this run (e.g. apr2). Branch: autoresearch/<tag>.
- Metric: the target metric and its direction (minimize or maximize).
- Runner: the command that runs one experiment (e.g. uv run train.py, python benchmark.py, npm run perf-test).
- Extraction: how to pull the metric from output (e.g. grep "^val_loss:" run.log). If the metric is not printed in a greppable format, agree on a parsing approach.

Setup:

1. Verify the autoresearch/<tag> branch does not exist. Create it from current HEAD.
2. Create results.tsv with a header row (see Results tracking below).
3. Run the baseline and record it in results.tsv. This is the starting point for all comparisons.

LOOP FOREVER:

1. Review results.tsv and recent git history. Identify patterns: what worked, what failed, what's unexplored.
2. Hypothesize one change, apply it, and git commit with a concise message describing the change.
3. Run <runner> > run.log 2>&1. Redirect everything — do NOT use tee or let output flood your context.
4. Extract the metric from run.log.
5. On crash: tail -n 50 run.log for the error. If it's a simple fix (typo, import, off-by-one), fix and re-run. If the idea is fundamentally broken, log as crash and move on.
6. Record the result in results.tsv.
7. If the metric improved, keep: this commit becomes keep_ref — the new baseline for future comparisons.
8. Otherwise discard: git reset --hard <keep_ref> to roll back to the last kept commit. Do not use HEAD~1 — there may be multiple commits since the last keep (e.g. crash fix attempts).
9. If stuck (a keep led to a local optimum that blocks further progress), you may rewind past it to an earlier keep_ref from results.tsv. Do this very sparingly — it discards validated improvements.

All else equal, simpler is better.
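The setup-plus-one-iteration cycle can be sketched as a self-contained shell demo. Everything specific here is an assumption for illustration: the `demo` tag, a stand-in `train.sh` runner that prints `val_loss: <float>` (minimized), and the hard-coded metric values — substitute the agreed tag, runner, and extraction command.

```shell
#!/bin/sh
set -eu
# Sandbox demo in a throwaway repo; train.sh stands in for the real runner
# and prints the metric in the agreed greppable format.
dir=$(mktemp -d); cd "$dir"
git init -q && git config user.email demo@demo && git config user.name demo
git checkout -qb autoresearch/demo

printf 'commit\tval_loss\tstatus\tdescription\n' > results.tsv

# Baseline: run, extract the metric, record the first keep.
printf 'echo "val_loss: 0.9979"\n' > train.sh
git add train.sh && git commit -qm baseline
sh train.sh > run.log 2>&1                       # redirect everything
best=$(grep '^val_loss:' run.log | awk '{print $2}')
keep_ref=$(git rev-parse --short HEAD)
printf '%s\t%s\tkeep\tbaseline\n' "$keep_ref" "$best" >> results.tsv

# One experiment: modify, commit, run, evaluate, keep or discard.
printf 'echo "val_loss: 0.9932"\n' > train.sh    # pretend improvement
git commit -qam "increase LR to 0.04"
sh train.sh > run.log 2>&1
metric=$(grep '^val_loss:' run.log | awk '{print $2}')
# Keep iff strictly lower, since this metric is minimized.
if awk -v m="$metric" -v b="$best" 'BEGIN{exit !(m < b)}'; then
    status=keep; keep_ref=$(git rev-parse --short HEAD); best=$metric
else
    status=discard
fi
printf '%s\t%s\t%s\t%s\n' "$(git rev-parse --short HEAD)" "$metric" \
    "$status" "increase LR to 0.04" >> results.tsv
[ "$status" = keep ] || git reset --hard "$keep_ref"

cat results.tsv
```

Note the discard path resets to `$keep_ref`, not `HEAD~1`, exactly because crash-fix attempts may have stacked extra commits since the last keep.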
Weigh complexity cost against improvement magnitude on every keep/discard decision.
Results tracking: results.tsv — tab-separated, not committed to git (leave untracked).
Required columns: commit, the primary metric, status, description. Beyond these, add any secondary metrics that help interpret results — decide based on the task. For DL: peak memory, training time, MFU, total tokens, num params. For dev: p50/p99 latency, throughput, binary size. Use judgment.
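Appending a row is a single printf. A minimal sketch — the column set and values here follow the DL header example below; adapt both to whatever columns were agreed:

```shell
# Append one experiment row to results.tsv (tab-separated, untracked).
# Columns assumed: commit, val_bpb, peak_vram_gb, mfu_pct, status, description.
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
    "b2c3d4e" "0.9932" "44.2" "40.1" "keep" "increase LR to 0.04" \
    >> results.tsv
```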
Header example (DL task):

```
commit   val_bpb  peak_vram_gb  mfu_pct  status   description
a1b2c3d  0.9979   44.0          39.8     keep     baseline
b2c3d4e  0.9932   44.2          40.1     keep     increase LR to 0.04
c3d4e5f  1.0050   44.0          38.5     discard  switch to GeLU activation
d4e5f6g  0.0000   0.0           0.0      crash    double model width (OOM)
```
Header example (dev task):

```
commit   p99_ms  throughput_rps  status  description
a1b2c3d  142     3200            keep    baseline
b2c3d4e  118     3450            keep    switch to connection pooling
```
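The review step at the top of each iteration — and locating an earlier keep_ref for a rare rewind — reduces to a couple of awk one-liners. A sketch: the column positions ($1 = commit, $5 = status) assume the DL header example; the rows written here are just that example, made self-contained:

```shell
# Recreate the DL example rows so the snippet runs standalone.
printf 'commit\tval_bpb\tpeak_vram_gb\tmfu_pct\tstatus\tdescription\n' > results.tsv
printf 'a1b2c3d\t0.9979\t44.0\t39.8\tkeep\tbaseline\n' >> results.tsv
printf 'b2c3d4e\t0.9932\t44.2\t40.1\tkeep\tincrease LR to 0.04\n' >> results.tsv
printf 'c3d4e5f\t1.0050\t44.0\t38.5\tdiscard\tswitch to GeLU activation\n' >> results.tsv

# Kept commits, oldest first: the last is the current baseline; the one
# before it is the rewind target if the latest keep is a local optimum.
keeps=$(awk -F'\t' '$5 == "keep" { print $1 }' results.tsv)
echo "$keeps"

# Status counts: a quick view of keep/discard/crash ratios so far.
awk -F'\t' 'NR > 1 { n[$5]++ } END { for (s in n) print s, n[s] }' results.tsv
```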
Column notes:

- commit: short git hash (7 chars)
- status: keep, discard, or crash
- description: short text — what this experiment tried
- Use 0 / 0.0 for metrics on crashed runs

Crash handling: use your judgment. If the crash is something dumb and easy to fix (typo, missing import, shape mismatch, off-by-one) — fix it and re-run the same idea. If the idea itself is fundamentally broken (OOM on a model that's 10× too large, an approach that can't converge) — skip it, log crash in the TSV, revert, and move on. Do not waste iterations on a dead end.
If a per-run budget is defined and a run exceeds 2× that budget — kill the process and treat it as a crash.
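The 2× cutoff can be enforced mechanically with coreutils timeout(1). A sketch — the 1-second budget and the `sleep 10` stand-in runner are placeholders chosen so the kill path actually fires; in the real loop, use the agreed budget and runner:

```shell
# Kill any run that exceeds twice the agreed per-run budget.
BUDGET_S=1
status=0
timeout "$((2 * BUDGET_S))" sleep 10 > run.log 2>&1 || status=$?
if [ "$status" -eq 124 ]; then   # timeout(1) exits 124 when it kills the command
    echo "run exceeded 2x budget: log as crash, revert, move on"
fi
```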
Never commit results.tsv — it stays untracked.