Design and audit AI/ML experiments for comparability, reproducibility, and realistic compute use. Use when the user needs help defining baselines, ablations, metrics, data splits, seed handling, experiment tracking, statistical interpretation, compute budgets, or a reproducible experiment spec before running large jobs.
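Seed handling in particular benefits from being specified rather than improvised. One way to make runs comparable across machines is to derive each run's seed deterministically from the experiment ID; this is a minimal sketch, and the function name and ID format are illustrative assumptions, not part of any fixed scheme:

```python
import hashlib


def derive_seed(experiment_id: str, run_index: int) -> int:
    """Derive a deterministic per-run seed from the experiment ID,
    so reruns and ablations use the same seeds on any machine."""
    # Hash the ID and run index together, then take 32 bits as the seed.
    digest = hashlib.sha256(f"{experiment_id}:{run_index}".encode()).hexdigest()
    return int(digest[:8], 16)


# Three fixed seeds for a hypothetical "lr-sweep-v1" experiment.
seeds = [derive_seed("lr-sweep-v1", i) for i in range(3)]
```

Recording the derivation rule in the spec, rather than the raw seed values, keeps the spec short and makes adding runs later unambiguous.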
Treat the experiment as a specification, not a pile of runs.
Return one or more of:
- experiment_spec: ready-to-run design
- comparison_table: baselines, metrics, and fairness notes
- reproducibility_risks: likely failure points
- compute_budget: staged plan with must-have and optional runs

See references/experiment-checklists.md for the experiment-spec template, run checklist, and common failure modes.
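As a rough illustration of what an experiment_spec might contain, here is a minimal sketch using a dataclass; the field names and values are hypothetical assumptions, not a required schema:

```python
from dataclasses import dataclass, field


@dataclass
class ExperimentSpec:
    """Illustrative experiment spec; fields are assumptions, not a fixed schema."""
    name: str
    baselines: list[str]          # what every new run is compared against
    metrics: list[str]            # reported for all runs, not cherry-picked
    seeds: list[int]              # fixed up front for comparability
    data_split: dict[str, float]  # fractions for train/val/test
    must_have_runs: int           # minimum runs before drawing conclusions
    optional_runs: int = 0        # extra runs if the compute budget allows


spec = ExperimentSpec(
    name="lr-sweep-v1",
    baselines=["majority-class", "logistic-regression"],
    metrics=["accuracy", "macro-f1"],
    seeds=[0, 1, 2],
    data_split={"train": 0.8, "val": 0.1, "test": 0.1},
    must_have_runs=3,
)
```

Writing the spec down before launching jobs is what makes the comparison_table and compute_budget auditable afterwards.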