Universal autonomous optimization loop based on Karpathy's auto research methodology. Accepts any artifact (code, prompt, document, config, template) plus a metric and eval criteria, then runs an iterative improve-measure-keep loop without human involvement. Use when user says "auto research", "optimize this", "run the loop", "improve this autonomously", "auto optimize", "karpathy loop", "iterative improvement", "run evals on this", "make this better automatically", or wants to systematically improve any artifact with measurable outcomes. Also trigger when user mentions "binary evals", "pass rate", "optimization loop", "autonomous improvement", or "auto loop". Works for code performance, website speed, document quality, prompt reliability, config tuning, template optimization, and any domain with an objective metric. For skill-specific optimization, prefer /skill-optimizer which wraps this methodology with skill-aware eval infrastructure.
Autonomous iterative improvement for any artifact with a measurable outcome. Based on Karpathy's autoresearch.
Core principle: remove yourself as the bottleneck. Define the metric, set boundaries, hit go. The loop finds improvements humans miss because it explores systematically.
Every auto research loop needs exactly three things: an artifact to improve, a measurable metric, and binary eval criteria. No exceptions.
If any ingredient is missing, stop and help the user define it before proceeding.
Back up the original to {artifact}.backup. Present the setup to the user for confirmation before starting the loop.
Run the artifact through the measurement tool and score against all evals. Record the baseline score. This is round 0.
Baseline: X/Y evals passed (Z%)
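As a sketch, the baseline pass can be expressed in a few lines of Python. The eval runner and the placeholder checks below are illustrative assumptions, not part of the skill itself:

```python
def run_evals(artifact_path, evals):
    """Score an artifact against a list of binary evals.

    Each eval is a (name, check_fn) pair where check_fn returns True/False.
    """
    results = {name: check(artifact_path) for name, check in evals}
    passed = sum(results.values())
    return results, passed

# Hypothetical evals; real checks would inspect the artifact.
evals = [
    ("loads without error", lambda p: True),
    ("under 100 lines", lambda p: True),
]
results, passed = run_evals("artifact.txt", evals)
print(f"Baseline: {passed}/{len(evals)} evals passed "
      f"({100 * passed // len(evals)}%)")
```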
For each round (1 to max iterations):
1. HYPOTHESIZE — analyze failures, propose ONE targeted change
2. APPLY — modify the artifact (minimum viable mutation)
3. MEASURE — run the metric / evals (multiple times for noisy domains)
4. COMPARE:
├─ Better → KEEP, log the change
├─ Same → KEEP (reduces variance)
└─ Worse → REVERT, try different approach
5. Report round results
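The round structure above can be sketched as a small Python loop. Everything here is hypothetical scaffolding: `measure`, `mutate`, and `revert` stand in for whatever tooling the domain provides, and higher scores are assumed to be better.

```python
import shutil

def auto_research_loop(artifact, measure, mutate, revert, max_rounds=10):
    """Minimal sketch of the improve-measure-keep loop.

    measure(artifact) -> score (higher is better)
    mutate(artifact, log) -> description of the ONE change applied
    revert(artifact) -> undo the last change
    """
    shutil.copy(artifact, artifact + ".backup")  # never lose the original
    best = measure(artifact)                     # round 0 baseline
    log = []
    for round_no in range(1, max_rounds + 1):
        change = mutate(artifact, log)           # HYPOTHESIZE + APPLY
        score = measure(artifact)                # MEASURE
        if score >= best:                        # Better or Same: KEEP
            best = score
            log.append((round_no, change, score, "kept"))
        else:                                    # Worse: REVERT
            revert(artifact)
            log.append((round_no, change, score, "reverted"))
    return best, log
```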
Mutation rules: one targeted change per round, the minimum viable mutation; revert cleanly whenever the score regresses.
After all rounds complete (or target reached), produce:
## Auto Research Report: {artifact}
**Rounds completed:** N
**Starting score:** X/Y (Z%)
**Final score:** X/Y (Z%)
**Improvement:** +N percentage points
### Eval Criteria
1. {criterion} — pass rate: X%
2. {criterion} — pass rate: X%
### Changes Applied
1. Round N: {description of mutation}
### Per-Eval Breakdown
| Eval | Start | Final | Trend |
|------|-------|-------|-------|
### Remaining Failures
- {description and why they're hard to fix}
### Research Log
{all attempted changes, including reverted ones — valuable for future optimization}
Save the report to the same directory as the artifact: {artifact-name}-autoresearch-report.md
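A minimal, hypothetical report writer that follows the template and naming convention above (the log tuple shape and field names are assumptions for illustration):

```python
from pathlib import Path

def write_report(artifact, start, final, total, rounds, log):
    """Render the report skeleton and save it next to the artifact."""
    pct = lambda n: round(100 * n / total)
    lines = [
        f"## Auto Research Report: {Path(artifact).name}",
        f"**Rounds completed:** {rounds}",
        f"**Starting score:** {start}/{total} ({pct(start)}%)",
        f"**Final score:** {final}/{total} ({pct(final)}%)",
        f"**Improvement:** +{pct(final) - pct(start)} percentage points",
        "### Research Log",
    ] + [f"- Round {r}: {desc} ({status})" for r, desc, status in log]
    # Same directory as the artifact, per the naming convention.
    out = Path(artifact).with_name(Path(artifact).stem + "-autoresearch-report.md")
    out.write_text("\n".join(lines))
    return out
```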
Measure website speed with the `npx lighthouse` CLI or Playwright.

Run Karpathy's original auto research loop on Apple Silicon. Gemini acts as the orchestrator defined in program.md — proposing changes, explaining ML concepts, and guiding the learning process. Only triggered when the user explicitly requests it.
- Location: `~/apps/autoresearch-mlx`
- Run: `cd ~/apps/autoresearch-mlx && uv run train.py`
- Artifact: `train.py` (~630 lines — model architecture, optimizer, hyperparameters)
- How to invoke: "auto research mlx", "run a training experiment", "train the model", "optimize train.py", "karpathy loop on mlx"
The loop for ML training:
1. BASELINE — run train.py, record val_bpb
2. READ — Gemini reads train.py and the training output
3. EXPLAIN — Gemini explains what the current architecture/config does
(learning opportunity — explain WHY, not just WHAT)
4. HYPOTHESIZE — Gemini proposes ONE change and explains the ML concept behind it
Examples:
- "Increasing weight decay on value embeddings to reduce overfitting"
- "Adjusting Adam beta2 from 0.99 to 0.95 for faster adaptation"
- "Adding a cosine learning rate schedule for smoother convergence"
5. APPLY — Edit train.py with the mutation
6. TRAIN — Run `uv run train.py` (5 min, read output when done)
7. COMPARE — Did val_bpb improve?
├─ Better → KEEP, explain WHY this worked
├─ Same → KEEP, explain what we learned
└─ Worse → REVERT, explain WHY it didn't work (also valuable)
8. LOG — Record the experiment in the research log
9. REPEAT or STOP — user decides
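When the loop is scripted rather than driven interactively, the COMPARE step needs the val_bpb out of train.py's output. A hedged sketch follows: the log-line format is an assumption, so the regex should be adjusted to whatever the real train.py prints.

```python
import re

def parse_val_bpb(train_output):
    """Extract the final validation bits-per-byte from training stdout.

    Assumes lines like "step 1000 | val_bpb 1.234"; the real train.py
    may format its output differently.
    """
    matches = re.findall(r"val_bpb[\s:=]+([0-9.]+)", train_output)
    if not matches:
        raise ValueError("no val_bpb found in training output")
    return float(matches[-1])  # last reported value is the final score

def improved(new_bpb, old_bpb):
    """Lower bits-per-byte is better for this metric."""
    return new_bpb < old_bpb
```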
Teaching mode: after each round, Gemini explains what was changed, the ML concept behind it, and why the result improved or regressed.
Constraints:
- Never modify prepare.py or the tokenizer (data is fixed)
- Record every experiment in ~/apps/autoresearch-mlx/research-log.md

Good evals make or break auto research. Bad evals produce optimized garbage.
See references/methodology.md for comprehensive eval examples by domain, anti-patterns, and the full methodology.
Karpathy's caveat: "If you can't evaluate, you can't auto research it."
| Skill | Relationship |
|---|---|
| /skill-optimizer | Specialized wrapper — uses auto research for skills specifically |
| /humanizer | Provides eval criteria for document/content quality loops |
| /tdd | Red-green-refactor is a manual version of the same loop |
| /build | Provides measurement (lint, build, test) for code loops |