Run ONE iteration of agent scaffold evolution.

You do NOT run benchmarks. You analyze results + failed trajectories, propose agent variants, and implement them. The outer loop (meta_harness.py) handles benchmarking.

CRITICAL CONSTRAINTS

You MUST produce exactly 1 new agent variant every iteration.
Do NOT write "the frontier is optimal" or "stop iterating", and do NOT abort early.

Anti-overfitting rules

No task-specific hints. Do not hardcode knowledge about specific tasks. Agents must be general-purpose.
Never mention task names in agent code, prompts, or comments. No references like "if task contains 'async'" or "for polyglot tasks." If your improvement only helps one task, it's too specific.
General guidance is OK. Rules like "back up files before opening them with tools that modify on read" are fine -- they happen to help specific tasks but apply broadly. The test: would this advice be useful to a human developer working on MANY unfamiliar tasks?
If in doubt, make it more general. "Always read eval scripts before submitting" > "Read the grading script for DNA assembly tasks."
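
The "never mention task names" rule can be checked mechanically before a variant is handed off. The following is a minimal sketch, not part of meta_harness.py: the function name `find_task_name_leaks` and the task names in the usage example are hypothetical, and it assumes the list of benchmark task names is available as plain strings.

```python
import re


def find_task_name_leaks(text: str, task_names: list[str]) -> list[tuple[int, str]]:
    """Return (line_number, task_name) pairs where a task name appears in text.

    `text` is the agent's code or prompt; `task_names` is an assumed list of
    benchmark task identifiers (hypothetical input, not read from any harness).
    """
    leaks = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        lowered = line.lower()
        for name in task_names:
            # Case-insensitive match on the whole identifier, so a name is
            # not flagged when it is only a fragment of a longer word.
            pattern = rf"(?<![\w-]){re.escape(name.lower())}(?![\w-])"
            if re.search(pattern, lowered):
                leaks.append((lineno, name))
    return leaks


# Usage with made-up task names and a made-up agent prompt.
example_prompt = (
    "Always read eval scripts before submitting.\n"
    "# special case for dna-assembly\n"
)
print(find_task_name_leaks(example_prompt, ["dna-assembly", "async-server"]))
```

A check like this only catches literal name matches. It does not catch paraphrased task-specific hints, so the "useful on MANY unfamiliar tasks" test above still has to be applied by hand.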

Meta Harness Terminal Bench 2

CONTEXT

CANDIDATE DESIGN

What you can and cannot modify

Design principles

WORKFLOW

Step 1: Analyze (1 subagent)

Step 2: Implement (1 subagent)

Step 3: Write pending_eval.json

IMPORTANT NOTES
