Universal autonomous optimization loop based on Karpathy's auto research methodology. Accepts any artifact (code, prompt, document, config, template) plus a metric and eval criteria, then runs an iterative improve-measure-keep loop without human involvement. Use when user says "auto research", "optimize this", "run the loop", "improve this autonomously", "auto optimize", "karpathy loop", "iterative improvement", "run evals on this", "make this better automatically", or wants to systematically improve any artifact with measurable outcomes. Also trigger when user mentions "binary evals", "pass rate", "optimization loop", "autonomous improvement", or "auto loop". Works for code performance, website speed, document quality, prompt reliability, config tuning, template optimization, and any domain with an objective metric. For skill-specific optimization, prefer /skill-optimizer which wraps this methodology with skill-aware eval infrastructure.
Autonomous iterative improvement for any artifact with a measurable outcome. Based on Karpathy's autoresearch.
Core principle: remove yourself as the bottleneck. Define the metric, set boundaries, hit go. The loop finds improvements humans miss because it explores systematically.
Every auto research loop needs exactly three things: an artifact to improve, a measurable metric, and binary eval criteria. No exceptions.
If any ingredient is missing, stop and help the user define it before proceeding.
Back up the original to {artifact}.backup. Present the setup to the user for confirmation before starting the loop.
Run the artifact through the measurement tool and score against all evals. Record the baseline score. This is round 0.
Baseline: X/Y evals passed (Z%)
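As a sketch, the baseline pass can be expressed in a few lines of Python. The eval runner and the placeholder checks below are illustrative assumptions, not part of the skill itself:

```python
def run_evals(artifact_path, evals):
    """Score an artifact against a list of binary evals.

    Each eval is a (name, check_fn) pair where check_fn returns True/False.
    """
    results = {name: check(artifact_path) for name, check in evals}
    passed = sum(results.values())
    return results, passed

# Hypothetical evals; real checks would inspect the artifact.
evals = [
    ("loads without error", lambda p: True),
    ("under 100 lines", lambda p: True),
]
results, passed = run_evals("artifact.txt", evals)
print(f"Baseline: {passed}/{len(evals)} evals passed "
      f"({100 * passed // len(evals)}%)")
```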
For each round (1 to max iterations):
1. HYPOTHESIZE — analyze failures, propose ONE targeted change
2. APPLY — modify the artifact (minimum viable mutation)
3. MEASURE — run the metric / evals (multiple times for noisy domains)
4. COMPARE:
├─ Better → KEEP, log the change
├─ Same → KEEP (reduces variance)
└─ Worse → REVERT, try different approach
5. Report round results
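The round structure above can be sketched as a small Python loop. Everything here is hypothetical scaffolding: `measure`, `mutate`, and `revert` stand in for whatever tooling the domain provides, and higher scores are assumed to be better.

```python
import shutil

def auto_research_loop(artifact, measure, mutate, revert, max_rounds=10):
    """Minimal sketch of the improve-measure-keep loop.

    measure(artifact) -> score (higher is better)
    mutate(artifact, log) -> description of the ONE change applied
    revert(artifact) -> undo the last change
    """
    shutil.copy(artifact, artifact + ".backup")  # never lose the original
    best = measure(artifact)                     # round 0 baseline
    log = []
    for round_no in range(1, max_rounds + 1):
        change = mutate(artifact, log)           # HYPOTHESIZE + APPLY
        score = measure(artifact)                # MEASURE
        if score >= best:                        # Better or Same: KEEP
            best = score
            log.append((round_no, change, score, "kept"))
        else:                                    # Worse: REVERT
            revert(artifact)
            log.append((round_no, change, score, "reverted"))
    return best, log
```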
Mutation rules: one targeted change per round, the minimum viable mutation; revert cleanly whenever the score regresses.
After all rounds complete (or target reached), produce:
## Auto Research Report: {artifact}
**Rounds completed:** N
**Starting score:** X/Y (Z%)
**Final score:** X/Y (Z%)
**Improvement:** +N percentage points
### Eval Criteria
1. {criterion} — pass rate: X%
2. {criterion} — pass rate: X%
### Changes Applied
1. Round N: {description of mutation}
### Per-Eval Breakdown
| Eval | Start | Final | Trend |
|------|-------|-------|-------|
### Remaining Failures
- {description and why they're hard to fix}
### Research Log
{all attempted changes, including reverted ones — valuable for future optimization}
Save the report to the same directory as the artifact: {artifact-name}-autoresearch-report.md
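A minimal, hypothetical report writer that follows the template and naming convention above (the log tuple shape and field names are assumptions for illustration):

```python
from pathlib import Path

def write_report(artifact, start, final, total, rounds, log):
    """Render the report skeleton and save it next to the artifact."""
    pct = lambda n: round(100 * n / total)
    lines = [
        f"## Auto Research Report: {Path(artifact).name}",
        f"**Rounds completed:** {rounds}",
        f"**Starting score:** {start}/{total} ({pct(start)}%)",
        f"**Final score:** {final}/{total} ({pct(final)}%)",
        f"**Improvement:** +{pct(final) - pct(start)} percentage points",
        "### Research Log",
    ] + [f"- Round {r}: {desc} ({status})" for r, desc, status in log]
    # Same directory as the artifact, per the naming convention.
    out = Path(artifact).with_name(Path(artifact).stem + "-autoresearch-report.md")
    out.write_text("\n".join(lines))
    return out
```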
Measure website speed with the `npx lighthouse` CLI or Playwright.

Run Karpathy's original auto research loop on Apple Silicon. Gemini acts as the orchestrator defined in program.md — proposing changes, explaining ML concepts, and guiding the learning process. Only triggered when the user explicitly requests it.
- Location: `~/apps/autoresearch-mlx`
- Run: `cd ~/apps/autoresearch-mlx && uv run train.py`
- Artifact: `train.py` (~630 lines — model architecture, optimizer, hyperparameters)
- How to invoke: "auto research mlx", "run a training experiment", "train the model", "optimize train.py", "karpathy loop on mlx"
The loop for ML training:
1. BASELINE — run train.py, record val_bpb
2. READ — Gemini reads train.py and the training output
3. EXPLAIN — Gemini explains what the current architecture/config does
(learning opportunity — explain WHY, not just WHAT)
4. HYPOTHESIZE — Gemini proposes ONE change and explains the ML concept behind it
Examples:
- "Increasing weight decay on value embeddings to reduce overfitting"
- "Adjusting Adam beta2 from 0.99 to 0.95 for faster adaptation"
- "Adding a cosine learning rate schedule for smoother convergence"
5. APPLY — Edit train.py with the mutation
6. TRAIN — Run `uv run train.py` (5 min, read output when done)
7. COMPARE — Did val_bpb improve?
├─ Better → KEEP, explain WHY this worked
├─ Same → KEEP, explain what we learned
└─ Worse → REVERT, explain WHY it didn't work (also valuable)
8. LOG — Record the experiment in the research log
9. REPEAT or STOP — user decides
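When the loop is scripted rather than driven interactively, the COMPARE step needs the val_bpb out of train.py's output. A hedged sketch follows: the log-line format is an assumption, so the regex should be adjusted to whatever the real train.py prints.

```python
import re

def parse_val_bpb(train_output):
    """Extract the final validation bits-per-byte from training stdout.

    Assumes lines like "step 1000 | val_bpb 1.234"; the real train.py
    may format its output differently.
    """
    matches = re.findall(r"val_bpb[\s:=]+([0-9.]+)", train_output)
    if not matches:
        raise ValueError("no val_bpb found in training output")
    return float(matches[-1])  # last reported value is the final score

def improved(new_bpb, old_bpb):
    """Lower bits-per-byte is better for this metric."""
    return new_bpb < old_bpb
```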
Teaching mode: after each round, Gemini explains what was changed, the ML concept behind it, and why the result improved or regressed.
Constraints:
- Never modify prepare.py or the tokenizer (data is fixed)
- Record every experiment in ~/apps/autoresearch-mlx/research-log.md

Good evals make or break auto research. Bad evals produce optimized garbage.
See references/methodology.md for comprehensive eval examples by domain, anti-patterns, and the full methodology.
Karpathy's caveat: "If you can't evaluate, you can't auto research it."
| Skill | Relationship |
|---|---|
| /skill-optimizer | Specialized wrapper — uses auto research for skills specifically |
| /humanizer | Provides eval criteria for document/content quality loops |
| /tdd | Red-green-refactor is a manual version of the same loop |
| /build | Provides measurement (lint, build, test) for code loops |