Solve competition math problems (IMO, Putnam, USAMO, AIME) with adversarial verification that catches the errors self-verification misses. Activates when asked to 'solve this IMO problem', 'prove this olympiad inequality', 'verify this competition proof', 'find a counterexample', 'is this proof correct', or for any problem with 'IMO', 'Putnam', 'USAMO', 'olympiad', or 'competition math' in it. Uses pure reasoning (no tools) — then a fresh-context adversarial verifier attacks the proof using specific failure patterns, not generic 'check logic'. Outputs calibrated confidence — will say 'no confident solution' rather than bluff. If LaTeX is available, produces a clean PDF after verification passes.
Tool policy: Solvers and verifiers use THINKING ONLY in the tight-budget workflow. Competition math is reasoning. Computation is reserved for deep mode (§6c), and even there it is bounded: a doubly-exponential recurrence can't be computed exactly past n~30, so work mod 2^m instead.
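A minimal sketch of the "work mod 2^m" advice, for deep mode only. The recurrence a_{k+1} = a_k^2 + 1 and the function name `iterate_mod` are hypothetical, chosen only because the exact terms roughly double in digit count at each step. Reducing after every step keeps each value below 2^m.

```python
def iterate_mod(a0: int, steps: int, m: int) -> int:
    """Return a_steps mod 2**m for the illustrative recurrence a_{k+1} = a_k**2 + 1."""
    modulus = 1 << m  # 2**m
    a = a0 % modulus
    for _ in range(steps):
        # Reduce after every step so the value never exceeds m bits.
        a = (a * a + 1) % modulus
    return a


# Exact terms from a0 = 1 are 1, 2, 5, 26, 677, 458330, ...
small = iterate_mod(1, 3, 64)      # a_3 = 26 fits in 64 bits, so this is exact
far = iterate_mod(1, 10_000, 64)   # cheap, although the exact a_10000 is far too large to store
```

This only answers questions about residues (parity, 2-adic valuation, periodicity mod 2^m). It says nothing about the size of the exact terms.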
| Problem | Approach | Verification |
|---|---|---|
| AIME numeric answer | Best-of-N → majority vote | Answer check only |
| Olympiad proof (IMO/Putnam/USAMO) | Full workflow below | 5-pass adversarial |
| "Is this proof correct?" | Skip to verification (step 4) | Adversarial + spec-gaming |
| Full problem set (e.g. all 6 from a competition) | One full workflow per problem, run in parallel; collect results, compile single PDF | Per-problem adversarial |
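The "Best-of-N → majority vote" row for AIME answers could be sketched as below. This is an assumption about how votes are tallied, not the workflow's actual implementation: the function name `majority_vote` is hypothetical, and `None` is used here to stand for a solver that abstained.

```python
from collections import Counter


def majority_vote(answers):
    """Return (winning_answer, share_of_all_runs); None entries count as abstentions."""
    votes = Counter(a for a in answers if a is not None)
    if not votes:
        return None, 0.0
    # most_common breaks ties in favour of the answer seen first.
    answer, count = votes.most_common(1)[0]
    return answer, count / len(answers)


# Five hypothetical solver runs on one AIME problem.
winner, share = majority_vote([204, 204, 17, None, 204])
```

The returned share is a rough confidence signal. A caller might treat a low share as grounds to report "no confident solution" rather than the plurality answer.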
Batch in one Workflow: Set opts.label on every agent() call to include
the problem ID (e.g., label: "P3:solver:2"). Without labels, 36 results come
back with no problem association. Run problems in parallel; the label, not
the return order, is what ties each result to its problem.
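To show why the label matters, here is a small sketch of regrouping results that arrive in arbitrary order. It assumes results are available as `(label, output)` pairs and that labels follow the `"P3:solver:2"` shape from the example above. Both the pair format and the helper name `group_by_problem` are assumptions for illustration.

```python
from collections import defaultdict


def group_by_problem(results):
    """Group (label, output) pairs by the problem-ID prefix of each label."""
    grouped = defaultdict(list)
    for label, output in results:
        # "P3:solver:2" -> problem_id "P3", role "solver:2"
        problem_id, _, role = label.partition(":")
        grouped[problem_id].append((role, output))
    return dict(grouped)


# Results arrive in whatever order the parallel agents finish.
results = [
    ("P3:solver:2", "proof sketch C"),
    ("P1:solver:0", "proof sketch A"),
    ("P3:verifier:0", "gap found in step 2"),
]
by_problem = group_by_problem(results)
```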
Launch one solver workflow per problem (same VERBATIM prompt, different statement). Run them in parallel. When all return, run adversarial verification per problem. Problems that pass get their proof in the PDF; problems that abstain get "No confident solution" with partial notes.
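The per-problem flow just described could be sketched roughly as follows. `solve` and `verify` are hypothetical caller-supplied callables standing in for the solver workflow and the adversarial verifier. The sketch assumes that `solve` returns `None` when it abstains. For brevity it omits the partial notes that accompany an abstention.

```python
from concurrent.futures import ThreadPoolExecutor


def run_problem_set(statements, solve, verify):
    """statements maps problem ID -> statement; returns problem ID -> PDF entry."""
    # Solve every problem in parallel, one independent call per problem.
    with ThreadPoolExecutor() as pool:
        futures = {pid: pool.submit(solve, text) for pid, text in statements.items()}
        proofs = {pid: fut.result() for pid, fut in futures.items()}

    # Verify per problem only after all solvers have returned.
    report = {}
    for pid, proof in proofs.items():
        if proof is not None and verify(statements[pid], proof):
            report[pid] = proof
        else:
            report[pid] = "No confident solution"
    return report


# Stub solver and verifier, for illustration only.
demo = run_problem_set(
    {"P1": "statement one", "P2": "statement two"},
    solve=lambda text: None if text.endswith("two") else "proof of " + text,
    verify=lambda text, proof: True,
)
```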
Don't try to solve all N problems in one agent's context — each problem needs its own thinking budget and its own fresh-context verifier. The composition is