Use when evaluating model quality, running benchmarks, comparing checkpoints, selecting evaluation tasks, or preparing leaderboard submissions - covers lm-evaluation-harness, lighteval, HELM, benchmark selection, and SkyPilot eval job patterns
Evaluate trained models against standardized benchmarks to measure quality, detect regressions, and compare checkpoints. Run evaluations on cloud GPUs via SkyPilot so they do not block local resources.
Core principle: Every checkpoint decision (keep/discard/deploy) requires quantitative evaluation against a known baseline. Never ship a model without benchmarking it.
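The keep/discard/deploy decision above can be sketched as a simple regression gate. A minimal sketch, assuming benchmark scores are already available as task-to-score dicts (the function name, tolerance, and scores are illustrative, not from any particular harness):

```python
def regression_gate(baseline: dict[str, float],
                    candidate: dict[str, float],
                    tolerance: float = 0.01) -> list[str]:
    """Return benchmarks where the candidate checkpoint regressed
    by more than `tolerance` (absolute score) versus the baseline."""
    regressions = []
    for task, base_score in baseline.items():
        new_score = candidate.get(task)
        # A missing score counts as a regression: never ship unbenchmarked.
        if new_score is None or new_score < base_score - tolerance:
            regressions.append(task)
    return regressions

baseline = {"mmlu": 0.652, "gsm8k": 0.741}
candidate = {"mmlu": 0.648, "gsm8k": 0.702}
print(regression_gate(baseline, candidate))  # → ['gsm8k']
```

A real gate would also track stderr per metric; a fixed absolute tolerance is the simplest defensible default.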
Do not use for:
**lm-evaluation-harness**: the de facto standard, and the backend for the Open LLM Leaderboard. 200+ tasks, YAML task configs, chat template support.
```bash
# Install
pip install lm-eval

# Run evaluation
lm_eval --model hf \
    --model_args pretrained=/path/to/model \
    --tasks mmlu,hellaswag,arc_easy,arc_challenge,winogrande,gsm8k \
    --batch_size auto \
    --output_path /results/

# List available tasks
lm_eval --tasks list

# Use chat template for instruction-tuned models
lm_eval --model hf \
    --model_args pretrained=/path/to/model \
    --tasks mmlu \
    --apply_chat_template \
    --batch_size auto

# vLLM backend for faster inference
lm_eval --model vllm \
    --model_args pretrained=/path/to/model,tensor_parallel_size=2 \
    --tasks mmlu \
    --batch_size auto
```
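The file written to `--output_path` is JSON with a top-level `results` mapping of task name to metric scores. A minimal parsing sketch over a synthetic payload; the exact metric key names (e.g. `acc,none`) vary by task and harness version, so treat the schema here as an assumption to verify against your own output:

```python
import json

def extract_scores(results_json: str, metric_prefix: str = "acc") -> dict[str, float]:
    """Pull the first float metric starting with `metric_prefix` for each task."""
    results = json.loads(results_json)["results"]
    scores = {}
    for task, metrics in results.items():
        for name, value in metrics.items():
            if name.startswith(metric_prefix) and isinstance(value, float):
                scores[task] = value
                break
    return scores

# Synthetic payload mimicking an lm-eval results file (keys are illustrative)
payload = json.dumps({
    "results": {
        "mmlu": {"acc,none": 0.652, "acc_stderr,none": 0.004},
        "gsm8k": {"exact_match,none": 0.741},
    }
})
print(extract_scores(payload))  # → {'mmlu': 0.652}
```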
Key flags:
- `--batch_size auto`: auto-detect the largest batch that fits in available VRAM
- `--num_fewshot N`: override the task's default few-shot count
- `--limit 100`: run only 100 samples per task (fast debugging)
- `--log_samples`: save per-sample predictions for error analysis
- `--apply_chat_template`: required for chat/instruct models

**lighteval**: lighter weight, tighter HF Hub integration, faster iteration cycle.
```bash
pip install lighteval

# Run evaluation
lighteval accelerate \
    --model_args "pretrained=/path/to/model" \
    --tasks "leaderboard|mmlu|5" \
    --output_dir /results/

# Evaluate a model directly from the HF Hub
lighteval accelerate \
    --model_args "pretrained=meta-llama/Llama-3-8B" \
    --tasks "leaderboard|hellaswag|10"
```
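lighteval's `--tasks` argument takes `suite|task|num_fewshot` triples as shown above, comma-separated when running several tasks. A tiny formatting sketch (the helper name is mine, and the three-field form follows the examples here; some lighteval versions use additional fields, so check your installed version):

```python
def format_tasks(suite: str, tasks: dict[str, int]) -> str:
    """Build a lighteval --tasks string from a task -> few-shot mapping."""
    return ",".join(f"{suite}|{task}|{shots}" for task, shots in tasks.items())

spec = format_tasks("leaderboard", {"mmlu": 5, "hellaswag": 10})
print(spec)  # → leaderboard|mmlu|5,leaderboard|hellaswag|10
```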
When to prefer lighteval:
**HELM**: holistic evaluation across accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency. Use when evaluation must cover more than accuracy.
```bash
pip install crfm-helm

helm-run --run-entries mmlu:model=hf/my-model --suite my-eval
helm-summarize --suite my-eval
```
When to prefer HELM:
| Use Case | Benchmarks | Why |
|---|---|---|
| General knowledge | MMLU, ARC, HellaSwag, WinoGrande | Broad coverage of reasoning and knowledge |
| Math/reasoning | GSM8K, MATH, BBH | Chain-of-thought and multi-step |
| Code generation | HumanEval, MBPP, MultiPL-E | Functional correctness |
| Instruction following | MT-Bench, AlpacaEval, IFEval | Chat quality and compliance |
| Safety | TruthfulQA, ToxiGen, BBQ | Hallucination and toxicity |
| Long context | RULER, Needle-in-Haystack | Context window utilization |
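The table above can double as a lookup for assembling an `lm_eval --tasks` list per use case. A sketch with the mapping transcribed from the table; the lowercase task identifiers are assumptions about harness naming (several, e.g. MT-Bench and RULER, are not lm-eval tasks at all), so verify each against `lm_eval --tasks list` before use:

```python
SUITES = {
    "general": ["mmlu", "arc_easy", "arc_challenge", "hellaswag", "winogrande"],
    "math": ["gsm8k", "math", "bbh"],
    "code": ["humaneval", "mbpp"],
    "safety": ["truthfulqa", "toxigen"],
}

def tasks_for(*use_cases: str) -> str:
    """Comma-joined task list for lm_eval --tasks, de-duplicated in order."""
    seen = []
    for case in use_cases:
        for task in SUITES[case]:
            if task not in seen:
                seen.append(task)
    return ",".join(seen)

print(tasks_for("general", "math"))
# → mmlu,arc_easy,arc_challenge,hellaswag,winogrande,gsm8k,math,bbh
```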
For detailed benchmark descriptions and scoring methodology, see references/benchmark-guide.md.
Run evaluations on cloud GPUs via SkyPilot without blocking local machines.