Machine Learning
Model Evaluation
**WORKFLOW SKILL** — Evaluate ML models against benchmarks, task-specific metrics, and quality criteria. USE FOR: automated evaluation pipelines, benchmark and harness selection (MMLU, HumanEval, MT-Bench, HELM, lm-evaluation-harness), metric design (accuracy, BLEU, ROUGE, WER, MOS, toxicity), SOTA comparison, regression testing between checkpoints, evaluation harness setup, human evaluation design, model card reporting. USE WHEN: comparing model checkpoints, selecting a model for production, validating fine-tuning results, establishing a regression baseline, or reporting model capabilities in a model card.
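
As an illustration of the regression-testing use case, here is a minimal sketch of a checkpoint comparison. It assumes each checkpoint has already been scored and its results are available as a plain metric-to-score mapping. The names `HIGHER_IS_BETTER`, `Regression`, and `find_regressions`, the metric keys, and the `tolerance` default are all hypothetical and not part of any specific harness.

```python
from dataclasses import dataclass

# Hypothetical registry: True means a higher score is better.
# WER is an error rate, so lower is better.
HIGHER_IS_BETTER = {"accuracy": True, "rouge_l": True, "wer": False}


@dataclass
class Regression:
    metric: str
    baseline: float
    candidate: float
    delta: float  # signed so that a negative value always means "worse"


def find_regressions(baseline, candidate, tolerance=0.01):
    """Return the metrics on which `candidate` is worse than `baseline`
    by more than `tolerance` (an assumed absolute threshold)."""
    regressions = []
    for metric, base_value in baseline.items():
        if metric not in candidate:
            raise KeyError(f"candidate is missing metric {metric!r}")
        # Flip the sign for lower-is-better metrics; unknown metrics
        # are assumed to be higher-is-better.
        sign = 1.0 if HIGHER_IS_BETTER.get(metric, True) else -1.0
        delta = sign * (candidate[metric] - base_value)
        if delta < -tolerance:
            regressions.append(
                Regression(metric, base_value, candidate[metric], delta)
            )
    return regressions


if __name__ == "__main__":
    # Made-up scores, for illustration only.
    old = {"accuracy": 0.80, "rouge_l": 0.40, "wer": 0.10}
    new = {"accuracy": 0.805, "rouge_l": 0.35, "wer": 0.15}
    for r in find_regressions(old, new):
        print(f"{r.metric}: {r.baseline} -> {r.candidate}")
```

A fixed absolute tolerance is the simplest possible gate. On small evaluation sets, a confidence interval or a paired significance test is likely a better choice, since score differences of this size can be noise.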