Head-to-head comparison of coding agents (Claude Code, Aider, Codex, etc.) on custom tasks with pass rate, cost, time, and consistency metrics
A lightweight CLI tool for comparing coding agents head-to-head on reproducible tasks. Most "which coding agent is best?" comparisons run on vibes; this tool makes them systematic by running each agent on the same tasks and measuring the results.
Note: Install agent-eval from its repository after reviewing the source.
Define tasks declaratively. Each task specifies what to do, which files to touch, and how to judge success: