Use when evaluating model quality, running benchmarks, comparing checkpoints, selecting evaluation tasks, or preparing leaderboard submissions - covers lm-evaluation-harness, lighteval, HELM, benchmark selection, and SkyPilot eval job patterns
Evaluate trained models against standardized benchmarks to measure quality, detect regressions, and compare checkpoints. Run evaluations on cloud GPUs via SkyPilot so they do not block local resources.
Core principle: Every checkpoint decision (keep/discard/deploy) requires quantitative evaluation against a known baseline. Never ship a model without benchmarking it.
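The keep/discard/deploy decision above can be sketched as a simple regression gate. A minimal sketch, assuming benchmark scores are already available as task-to-score dicts (the function name, tolerance, and scores are illustrative, not from any particular harness):

```python
def regression_gate(baseline: dict[str, float],
                    candidate: dict[str, float],
                    tolerance: float = 0.01) -> list[str]:
    """Return benchmarks where the candidate checkpoint regressed
    by more than `tolerance` (absolute score) versus the baseline."""
    regressions = []
    for task, base_score in baseline.items():
        new_score = candidate.get(task)
        # A missing score counts as a regression: never ship unbenchmarked.
        if new_score is None or new_score < base_score - tolerance:
            regressions.append(task)
    return regressions

baseline = {"mmlu": 0.652, "gsm8k": 0.741}
candidate = {"mmlu": 0.648, "gsm8k": 0.702}
print(regression_gate(baseline, candidate))  # → ['gsm8k']
```

A real gate would also track stderr per metric; a fixed absolute tolerance is the simplest defensible default.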
Do not use for:
**lm-evaluation-harness**: the de facto standard, and the backend for the Open LLM Leaderboard. 200+ tasks, YAML task configs, chat template support.
```bash
# Install
pip install lm-eval

# Run evaluation
lm_eval --model hf \
    --model_args pretrained=/path/to/model \
    --tasks mmlu,hellaswag,arc_easy,arc_challenge,winogrande,gsm8k \
    --batch_size auto \
    --output_path /results/

# List available tasks
lm_eval --tasks list

# Use chat template for instruction-tuned models
lm_eval --model hf \
    --model_args pretrained=/path/to/model \
    --tasks mmlu \
    --apply_chat_template \
    --batch_size auto

# vLLM backend for faster inference
lm_eval --model vllm \
    --model_args pretrained=/path/to/model,tensor_parallel_size=2 \
    --tasks mmlu \
    --batch_size auto
```
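The file written to `--output_path` is JSON with a top-level `results` mapping of task name to metric scores. A minimal parsing sketch over a synthetic payload; the exact metric key names (e.g. `acc,none`) vary by task and harness version, so treat the schema here as an assumption to verify against your own output:

```python
import json

def extract_scores(results_json: str, metric_prefix: str = "acc") -> dict[str, float]:
    """Pull the first float metric starting with `metric_prefix` for each task."""
    results = json.loads(results_json)["results"]
    scores = {}
    for task, metrics in results.items():
        for name, value in metrics.items():
            if name.startswith(metric_prefix) and isinstance(value, float):
                scores[task] = value
                break
    return scores

# Synthetic payload mimicking an lm-eval results file (keys are illustrative)
payload = json.dumps({
    "results": {
        "mmlu": {"acc,none": 0.652, "acc_stderr,none": 0.004},
        "gsm8k": {"exact_match,none": 0.741},
    }
})
print(extract_scores(payload))  # → {'mmlu': 0.652}
```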
Key flags:
- `--batch_size auto`: auto-detect the largest batch that fits in available VRAM
- `--num_fewshot N`: override the task's default few-shot count
- `--limit 100`: run only 100 samples per task (fast debugging)
- `--log_samples`: save per-sample predictions for error analysis
- `--apply_chat_template`: required for chat/instruct models

**lighteval**: lighter weight, tighter HF Hub integration, faster iteration cycle.
```bash
pip install lighteval

# Run evaluation
lighteval accelerate \
    --model_args "pretrained=/path/to/model" \
    --tasks "leaderboard|mmlu|5" \
    --output_dir /results/

# Evaluate a model directly from the HF Hub
lighteval accelerate \
    --model_args "pretrained=meta-llama/Llama-3-8B" \
    --tasks "leaderboard|hellaswag|10"
```
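lighteval's `--tasks` argument takes `suite|task|num_fewshot` triples as shown above, comma-separated when running several tasks. A tiny formatting sketch (the helper name is mine, and the three-field form follows the examples here; some lighteval versions use additional fields, so check your installed version):

```python
def format_tasks(suite: str, tasks: dict[str, int]) -> str:
    """Build a lighteval --tasks string from a task -> few-shot mapping."""
    return ",".join(f"{suite}|{task}|{shots}" for task, shots in tasks.items())

spec = format_tasks("leaderboard", {"mmlu": 5, "hellaswag": 10})
print(spec)  # → leaderboard|mmlu|5,leaderboard|hellaswag|10
```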
When to prefer lighteval:
**HELM**: holistic evaluation across accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency. Use when evaluation must cover more than accuracy.
```bash
pip install crfm-helm

helm-run --run-entries mmlu:model=hf/my-model --suite my-eval
helm-summarize --suite my-eval
```
When to prefer HELM:
| Use Case | Benchmarks | Why |
|---|---|---|
| General knowledge | MMLU, ARC, HellaSwag, WinoGrande | Broad coverage of reasoning and knowledge |
| Math/reasoning | GSM8K, MATH, BBH | Chain-of-thought and multi-step |
| Code generation | HumanEval, MBPP, MultiPL-E | Functional correctness |
| Instruction following | MT-Bench, AlpacaEval, IFEval | Chat quality and compliance |
| Safety | TruthfulQA, ToxiGen, BBQ | Hallucination and toxicity |
| Long context | RULER, Needle-in-Haystack | Context window utilization |
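The table above can double as a lookup for assembling an `lm_eval --tasks` list per use case. A sketch with the mapping transcribed from the table; the lowercase task identifiers are assumptions about harness naming (several, e.g. MT-Bench and RULER, are not lm-eval tasks at all), so verify each against `lm_eval --tasks list` before use:

```python
SUITES = {
    "general": ["mmlu", "arc_easy", "arc_challenge", "hellaswag", "winogrande"],
    "math": ["gsm8k", "math", "bbh"],
    "code": ["humaneval", "mbpp"],
    "safety": ["truthfulqa", "toxigen"],
}

def tasks_for(*use_cases: str) -> str:
    """Comma-joined task list for lm_eval --tasks, de-duplicated in order."""
    seen = []
    for case in use_cases:
        for task in SUITES[case]:
            if task not in seen:
                seen.append(task)
    return ",".join(seen)

print(tasks_for("general", "math"))
# → mmlu,arc_easy,arc_challenge,hellaswag,winogrande,gsm8k,math,bbh
```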
For detailed benchmark descriptions and scoring methodology, see references/benchmark-guide.md.
Run evaluations on cloud GPUs via SkyPilot without blocking local machines.