Run an evaluation with the specified cells, handling the full generation + judging pipeline. The user will specify which cell profiles to run and how many runs.
Parse the request: Identify cell profiles, run count, model overrides, and options.
- Cell profiles: `cell_1_base_single_unified`, `cell_5_recog_single_unified`, etc.
- Model overrides: `openrouter.nemotron` or `openrouter.kimi-k2.5`
- Options: `--scenario <id>`, `--cluster <name>`, `--parallelism N`, `--live`, `--transcript`

Pre-flight checks:
- Verify the cell profile exists: `grep "$CELL_NAME" config/tutor-agents.yaml`
- Check rate limits for the model: `node scripts/test-rate-limit.js <model-alias>`

Run generation (skip rubric for speed):
```
node scripts/eval-cli.js run --profiles <cells> --runs N --skip-rubric [--live]
```
Common options:
- `--ego-model <ref>` — override tutor ego model only
- `--superego-model <ref>` — override tutor superego model only
- `--model <ref>` — override ALL agent models
- `--learner-model <ref>` — override learner ego + superego uniformly
- `--scenario <id>` — specific scenario(s)
- `--cluster <name>` — scenario cluster (single-turn, multi-turn, core, etc.)
- `--parallelism N` — parallel tests (default: 2)
- `--live` — stream API calls in real time
- `--transcript` — write play-format transcript files

Note the run ID from the output, then start judging:
```
node scripts/eval-cli.js evaluate <runId> --follow
```
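For example, a complete generate-then-judge session might look like this (the profile name and run count are illustrative values taken from this doc; replace `<runId>` with whatever ID the run step prints):

```shell
# Step 1: generate transcripts, skipping rubric scoring for speed
node scripts/eval-cli.js run --profiles cell_1_base_single_unified --runs 3 --skip-rubric --live

# Step 2: judge the run, following progress as scores land
node scripts/eval-cli.js evaluate <runId> --follow
```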
CAUTION: Do NOT use `--force` unless the user explicitly asks to re-score existing rows.
`--force` overwrites existing scores and is destructive to cross-judge data.
Without `--force`, only NULL-scored rows are evaluated.
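The NULL-row behavior can be sketched with a throwaway database (a minimal illustration only: the table here has just the two columns relevant to the selection logic, not the real `evaluation_results` schema):

```shell
db="$(mktemp)"
sqlite3 "$db" "CREATE TABLE evaluation_results (run_id TEXT, tutor_first_turn_score REAL);"
sqlite3 "$db" "INSERT INTO evaluation_results VALUES ('r1', 7.5), ('r1', NULL);"
# Without --force, only rows still awaiting a score are candidates:
null_rows="$(sqlite3 "$db" "SELECT COUNT(*) FROM evaluation_results WHERE run_id = 'r1' AND tutor_first_turn_score IS NULL;")"
echo "$null_rows"  # prints 1: one row would be judged; the already-scored row is left alone
rm -f "$db"
```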
Report results when complete:
```
sqlite3 -header -column data/evaluations.db "
  SELECT profile_name, judge_model, COUNT(*) AS n,
         ROUND(AVG(tutor_first_turn_score), 1) AS mean
  FROM evaluation_results
  WHERE run_id = '<runId>' AND tutor_first_turn_score IS NOT NULL
  GROUP BY profile_name, judge_model"
```
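If you report repeatedly, the query can be wrapped in a small helper (a sketch only: the DB path and column names come from the query above, but the function name `report_run` is mine):

```shell
# report_run <runId>: per-profile, per-judge mean of first-turn scores
report_run() {
  sqlite3 -header -column data/evaluations.db "
    SELECT profile_name, judge_model, COUNT(*) AS n,
           ROUND(AVG(tutor_first_turn_score), 1) AS mean
    FROM evaluation_results
    WHERE run_id = '$1' AND tutor_first_turn_score IS NOT NULL
    GROUP BY profile_name, judge_model;"
}
```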
Gotchas:
- Model references use dots: `openrouter.nemotron`, NOT `openrouter/nemotron`.
- The flag is `--runs`, NOT `--repeats`.
- The primary metric is `tutor_first_turn_score` (Turn 0); `overall_score` is a deprecated alias.
- Also available: `tutor_last_turn_score` (last turn) and `tutor_development_score`.
- To resume a run: `/resume-run` or: `node scripts/eval-cli.js resume <runId>`
- Eval-only profiles: see the `EVAL_ONLY_PROFILES` array in `services/evaluationRunner.js`.
- Never use `--force` on runs with multiple judge models — it silently destroys cross-judge data.