Add and manage evaluation results in Hugging Face model cards. Supports extracting eval tables from README content, importing scores from Artificial Analysis API, and running custom model evaluations with vLLM/lighteval. Works with the model-index metadata format.
This skill provides tools to add structured evaluation results to Hugging Face model cards. It supports multiple methods for adding evaluation data:
uv integration1.3.0
Note: vLLM dependencies are installed automatically via PEP 723 script headers when using uv run.
Before creating ANY pull request with --create-pr, you MUST check for existing open PRs:
uv run scripts/evaluation_manager.py get-prs --repo-id "username/model-name"
If open PRs exist:
This prevents spamming model repositories with duplicate evaluation PRs.
Use --help for the latest workflow guidance. Works with plain Python or uv run:
uv run scripts/evaluation_manager.py --help
uv run scripts/evaluation_manager.py inspect-tables --help
uv run scripts/evaluation_manager.py extract-readme --help
Key workflow (matches CLI help):
get-prs → check for existing open PRs firstinspect-tables → find table numbers/columnsextract-readme --table N → prints YAML by default--apply (push) or --create-pr to write changesinspect-tables to see all tables in a README with structure, columns, and sample rows--table N to extract from a specific table (required when multiple tables exist)--model-column-index (index from inspect output). Use --model-name-override only with exact column header text.--task-type sets the task.type field in model-index output (e.g., text-generation, summarization)inspect-ai library⚠️ Important: This approach is only possible on devices with uv installed and sufficient GPU memory.
Benefits: No need to use hf_jobs() MCP tool, can run scripts directly in terminal
When to use: User working in local device directly when GPU is available
nvidia-smiuv run scripts/train_sft_example.py
The skill includes Python scripts in scripts/ to perform operations.
uv run (PEP 723 header auto-installs deps)pip install huggingface-hub markdown-it-py python-dotenv pyyaml requestsHF_TOKEN environment variable with Write-access tokenAA_API_KEY environment variable.env is loaded automatically if python-dotenv is installedRecommended flow (matches --help):
# 1) Inspect tables to get table numbers and column hints
uv run scripts/evaluation_manager.py inspect-tables --repo-id "username/model"
# 2) Extract a specific table (prints YAML by default)
uv run scripts/evaluation_manager.py extract-readme \
--repo-id "username/model" \
--table 1 \
[--model-column-index <column index shown by inspect-tables>] \
[--model-name-override "<column header/model name>"] # use exact header text if you can't use the index
# 3) Apply changes (push or PR)
uv run scripts/evaluation_manager.py extract-readme \
--repo-id "username/model" \
--table 1 \
--apply # push directly
# or
uv run scripts/evaluation_manager.py extract-readme \
--repo-id "username/model" \
--table 1 \
--create-pr # open a PR
Validation checklist:
--model-column-index; if using --model-name-override, the column header text must be exact.Fetch benchmark scores from Artificial Analysis API and add them to a model card.
Basic Usage:
AA_API_KEY="your-api-key" python scripts/evaluation_manager.py import-aa \
--creator-slug "anthropic" \
--model-name "claude-sonnet-4" \
--repo-id "username/model-name"
With Environment File:
# Create .env file
echo "AA_API_KEY=your-api-key" >> .env
echo "HF_TOKEN=your-hf-token" >> .env
# Run import
python scripts/evaluation_manager.py import-aa \
--creator-slug "anthropic" \
--model-name "claude-sonnet-4" \
--repo-id "username/model-name"
Create Pull Request:
python scripts/evaluation_manager.py import-aa \
--creator-slug "anthropic" \
--model-name "claude-sonnet-4" \
--repo-id "username/model-name" \
--create-pr
Submit an evaluation job on Hugging Face infrastructure using the hf jobs uv run CLI.
Direct CLI Usage:
HF_TOKEN=$HF_TOKEN \
hf jobs uv run hf-evaluation/scripts/inspect_eval_uv.py \
--flavor cpu-basic \
--secret HF_TOKEN=$HF_TOKEN \
-- --model "meta-llama/Llama-2-7b-hf" \
--task "mmlu"
GPU Example (A10G):
HF_TOKEN=$HF_TOKEN \
hf jobs uv run hf-evaluation/scripts/inspect_eval_uv.py \
--flavor a10g-small \
--secret HF_TOKEN=$HF_TOKEN \
-- --model "meta-llama/Llama-2-7b-hf" \
--task "gsm8k"
Python Helper (optional):
python scripts/run_eval_job.py \
--model "meta-llama/Llama-2-7b-hf" \
--task "mmlu" \
--hardware "t4-small"
Evaluate custom HuggingFace models directly on GPU using vLLM or accelerate backends. These scripts are separate from inference provider scripts and run models locally on the job's hardware.
| Feature | vLLM Scripts | Inference Provider Scripts |
|---|---|---|
| Model access | Any HF model | Models with API endpoints |
| Hardware | Your GPU (or HF Jobs GPU) | Provider's infrastructure |
| Cost | HF Jobs compute cost | API usage fees |
| Speed | vLLM optimized | Depends on provider |
| Offline | Yes (after download) | No |
lighteval is HuggingFace's evaluation library, supporting Open LLM Leaderboard tasks.
Standalone (local GPU):
# Run MMLU 5-shot with vLLM
python scripts/lighteval_vllm_uv.py \
--model meta-llama/Llama-3.2-1B \
--tasks "leaderboard|mmlu|5"
# Run multiple tasks
python scripts/lighteval_vllm_uv.py \
--model meta-llama/Llama-3.2-1B \
--tasks "leaderboard|mmlu|5,leaderboard|gsm8k|5"
# Use accelerate backend instead of vLLM
python scripts/lighteval_vllm_uv.py \
--model meta-llama/Llama-3.2-1B \
--tasks "leaderboard|mmlu|5" \
--backend accelerate
# Chat/instruction-tuned models
python scripts/lighteval_vllm_uv.py \
--model meta-llama/Llama-3.2-1B-Instruct \
--tasks "leaderboard|mmlu|5" \
--use-chat-template
Via HF Jobs:
hf jobs uv run scripts/lighteval_vllm_uv.py \
--flavor a10g-small \
--secrets HF_TOKEN=$HF_TOKEN \
-- --model meta-llama/Llama-3.2-1B \
--tasks "leaderboard|mmlu|5"
lighteval Task Format:
Tasks use the format suite|task|num_fewshot:
leaderboard|mmlu|5 - MMLU with 5-shotleaderboard|gsm8k|5 - GSM8K with 5-shotlighteval|hellaswag|0 - HellaSwag zero-shotleaderboard|arc_challenge|25 - ARC-Challenge with 25-shotFinding Available Tasks: The complete list of available lighteval tasks can be found at: