Run the eval framework to measure agent output quality mechanically.
Scenarios live in evals/scenarios/; results are written to evals/results/ as timestamped JSON.

Run all scenarios:
./scripts/eval.sh
Run a specific scenario:
./scripts/eval.sh evals/scenarios/$ARGUMENTS
See evals/results/latest.json for detailed results.

To add a new scenario, create a YAML file in evals/scenarios/ following the format in evals/README.md.
Each scenario needs:
- name: what we're testing
- setup: commands to prepare the workspace
- prompt: what to tell the agent
- checks: mechanical verification of output
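A scenario with those four fields might look like the sketch below. The scenario name, setup commands, and the check structure are illustrative assumptions, not taken from evals/README.md; consult that file for the canonical schema and supported check types.

```yaml
# evals/scenarios/summarize-changelog.yaml — hypothetical example scenario
name: summarize-changelog          # what we're testing
setup:                             # commands to prepare the workspace
  - cp fixtures/CHANGELOG.md .
prompt: |                          # what to tell the agent
  Summarize the changes in CHANGELOG.md as three bullet points
  and write them to summary.md.
checks:                            # mechanical verification of output
  - type: file_contains            # check type is an assumption
    path: summary.md
    pattern: "^- "
```

Checks should assert on concrete artifacts (files, exit codes, patterns) so results stay mechanical and reproducible across runs.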