Run autonomous deep learning experiments in a loop: modify code, train with fixed time budget, evaluate against a single metric, keep or discard, repeat indefinitely. Use when setting up overnight autonomous research, running hyperparameter sweeps, architecture search, or any iterative experiment loop on a single GPU. Triggers include 'run autoresearch', 'autonomous experiments', 'experiment loop', or 'overnight training'.
Autonomous deep learning experimentation. An AI agent modifies training code, runs fixed-budget experiments, evaluates results against a single metric, and keeps or discards changes -- looping indefinitely until manually stopped.
Inspired by karpathy/autoresearch: the human writes the program (instructions), the agent writes the code.
Before invoking this skill, ensure:
A training script exists that the agent will modify. It should print the evaluation metric when a run completes.
An evaluation metric is defined -- a single scalar, lower-is-better or higher-is-better, printed by the training script. Must be comparable across experiments regardless of what the agent changes (architecture, batch size, etc.).
Data preparation is done -- any one-time setup (data download, tokenizer training) is already completed.
Dependencies are installed -- the environment is ready to uv run or python the training script.
The agent edits exactly one file. This keeps scope manageable and diffs reviewable. Everything else (data loading, evaluation, constants) is read-only.
Every experiment runs for the same wall-clock duration regardless of what the agent changes. This makes experiments directly comparable -- a larger model that trains slower is fairly compared against a smaller model that trains faster within the same budget.
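One way to enforce the fixed budget is a hard wall-clock cutoff around the run command. A minimal sketch using coreutils `timeout` (the helper name, the 300-second budget, and the `run.log` filename are assumptions):

```shell
# Run a command under a fixed wall-clock budget, capturing output in run.log.
# timeout sends SIGTERM when the budget expires; exit code 124 means the
# budget ran out, which is the normal case here, not a failure.
run_with_budget() {   # usage: run_with_budget <seconds> <cmd...>
  timeout "$@" > run.log 2>&1
  local status=$?
  [ "$status" -eq 0 ] || [ "$status" -eq 124 ]
}
# e.g. run_with_budget 300 uv run train.py
```

Treating 124 as success matters: hitting the budget is expected on every healthy run, and only other nonzero codes should be logged as crashes.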
One number decides keep or discard. No multi-objective balancing. The metric must be independent of implementation details (e.g., bits-per-byte instead of cross-entropy loss, so vocab size changes are fairly compared).
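As an illustration of a tokenizer-independent metric, mean cross-entropy in nats per token can be converted to bits per byte by charging the total bits to the byte count rather than the token count; the `ce`, `tokens`, and `bytes` values below are made-up numbers:

```shell
# bits-per-byte = (mean CE in nats/token) * tokens / (bytes * ln 2).
# Dividing by bytes (not tokens) keeps runs with different vocab sizes
# comparable. The three input values are hypothetical.
bpb=$(awk -v ce=1.38 -v tokens=1000000 -v bytes=4200000 \
  'BEGIN { printf "%.6f", ce * tokens / (bytes * log(2)) }')
echo "val_bpb: $bpb"
```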
All else being equal, simpler is better: a change that doesn't improve the metric isn't worth its added complexity.
Once the loop begins, the agent runs indefinitely until manually interrupted. No asking "should I continue?" -- the user might be asleep. If the agent runs out of ideas, it should think harder: re-read the code, try combining near-misses, try radical changes, reverse previous assumptions.
Work with the user to configure the experiment:
Identify:

- The target file the agent will modify (e.g. train.py)
- Read-only supporting files (e.g. prepare.py, evaluate.py)
- The command that runs an experiment (e.g. uv run train.py)
- The metric name (e.g. val_bpb, val_accuracy)

Then create the session branch:

```
# Propose a tag based on today's date
git checkout -b autoresearch/<tag>
```
The branch must not already exist. Each experiment session gets a fresh branch.
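The date-tagged branch setup, including the must-not-exist check, could be sketched as follows (`new_session_branch` is a hypothetical helper name):

```shell
# Create a fresh session branch named after today's date.
# Refuses to reuse an existing branch: each session starts clean.
new_session_branch() {   # usage: new_session_branch [tag]
  local tag="${1:-$(date +%Y-%m-%d)}"
  local branch="autoresearch/$tag"
  if git rev-parse --verify --quiet "refs/heads/$branch" >/dev/null; then
    echo "branch $branch already exists" >&2
    return 1
  fi
  git checkout -q -b "$branch"
}
```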
Read ALL files the agent will work with (the target file plus any read-only supporting files) for full context.
Create results.tsv with a header row:

```
commit	val_bpb	memory_gb	status	description
```

Columns (tab-separated, NOT comma-separated):

- commit -- git short hash (7 chars)
- the metric (e.g. val_bpb) -- use 0.000000 for crashes
- memory_gb -- peak VRAM in GB, rounded to .1f -- use 0.0 for crashes
- status -- keep, discard, or crash
- description -- short text of what this experiment tried

Do NOT commit results.tsv -- leave it untracked by git.
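Appending rows with printf keeps the tabs literal; a sketch, where `log_result` is a hypothetical helper and the column set mirrors the header above:

```shell
# Initialize the log once, then append one tab-separated row per experiment.
[ -f results.tsv ] || printf 'commit\tval_bpb\tmemory_gb\tstatus\tdescription\n' > results.tsv
log_result() {   # usage: log_result <commit> <metric> <memory_gb> <status> <description>
  printf '%s\t%s\t%s\t%s\t%s\n' "$1" "$2" "$3" "$4" "$5" >> results.tsv
}
# e.g. log_result abc1234 0.961234 11.2 keep "wider MLP hidden layer"
```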
The very first experiment is always the baseline: run the training script as-is, record the result. This establishes the reference point for all future comparisons.
Confirm setup with the user, then begin the experiment loop.
See references/experiment-protocol.md for the complete protocol.
Summary:
LOOP FOREVER:
1. Check git state (current branch/commit)
2. Modify the target file with an experimental idea
3. git commit the change
4. Run the experiment (redirect output to run.log)
5. Extract the metric from run.log
6. Log results to results.tsv
7. If improved: KEEP (advance the branch)
8. If equal or worse: DISCARD (git reset to previous commit)
9. Repeat
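The metric-extraction and keep/discard decision (steps 5-8) can be sketched as two small helpers; the `val_bpb: 0.9612` log-line format is an assumption, and the metric is treated as lower-is-better:

```shell
# Pull the last reported metric out of run.log (assumed line: "val_bpb: 0.9612").
extract_metric() {
  grep -oE 'val_bpb: *[0-9.]+' run.log | tail -1 | grep -oE '[0-9.]+$'
}
# Strict improvement check for a lower-is-better metric; equal is NOT better,
# so ties are discarded (simpler is better).
improved() {   # usage: improved <new> <best>
  awk -v n="$1" -v b="$2" 'BEGIN { exit !(n + 0 < b + 0) }'
}
# In the loop:
#   if improved "$metric" "$best"; then best=$metric   # KEEP: branch already advanced
#   else git reset --hard HEAD~1; fi                   # DISCARD the experimental commit
```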
Each iteration takes ~5 minutes (the time budget) plus a few seconds for startup/eval overhead. Expect ~12 experiments/hour, ~100 overnight.
After the session, use the analysis notebook template to visualize results and review the key analyses. See references/analysis-template.md.
| File | Purpose |
|---|---|
| experiment-protocol.md | Detailed experiment loop with crash handling and decision rules |
| analysis-template.md | Jupyter notebook template for post-session analysis |