Add and manage evaluation results in Hugging Face model cards. Supports extracting eval tables from README content, importing scores from Artificial Analysis API, and running custom model evaluations with vLLM/lighteval. Works with the model-index metadata format.
This skill provides tools to add structured evaluation results to Hugging Face model cards. It supports multiple methods for adding evaluation data:
- extracting existing evaluation tables from README content
- importing scores from the Artificial Analysis API
- running custom model evaluations with vLLM/lighteval
Note: vLLM dependencies are installed automatically via PEP 723 script headers when using uv run.
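For reference, a PEP 723 header is a commented metadata block at the top of the script; a minimal sketch (the dependency list here is illustrative):

```python
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "vllm",
#     "huggingface-hub",
# ]
# ///

# uv run reads the block above and installs the listed packages
# into an ephemeral environment before executing the script.
```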
Before creating ANY pull request with --create-pr, you MUST check for existing open PRs:
uv run scripts/evaluation_manager.py get-prs --repo-id "username/model-name"
If open PRs exist, review them first rather than opening a duplicate. This prevents spamming model repositories with duplicate evaluation PRs.
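The same check can be done programmatically with huggingface_hub (the repo id below is a placeholder); a minimal sketch:

```python
from huggingface_hub import HfApi

api = HfApi()  # picks up HF_TOKEN from the environment

# List open pull requests on the target repo before creating a new one
open_prs = [
    d for d in api.get_repo_discussions(repo_id="username/model-name")
    if d.is_pull_request and d.status == "open"
]
for pr in open_prs:
    print(f"PR #{pr.num}: {pr.title}")
```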
All paths are relative to the directory containing this SKILL.md file. Before running any script, first cd to that directory or use the full path.
Use --help for the latest workflow guidance. Works with plain Python or uv run:
uv run scripts/evaluation_manager.py --help
uv run scripts/evaluation_manager.py inspect-tables --help
uv run scripts/evaluation_manager.py extract-readme --help
Key workflow (matches CLI help):
1. get-prs → check for existing open PRs first
2. inspect-tables → find table numbers/columns
3. extract-readme --table N → prints YAML by default
4. --apply (push) or --create-pr to write changes

Option details:
- inspect-tables to see all tables in a README with structure, columns, and sample rows
- --table N to extract from a specific table (required when multiple tables exist)
- --model-column-index (index from inspect output). Use --model-name-override only with exact column header text.
- --task-type sets the task.type field in model-index output (e.g., text-generation, summarization)

inspect-ai library

⚠️ Important: This approach is only possible on devices with uv installed and sufficient GPU memory.
Benefits: no need for the hf_jobs() MCP tool; scripts can be run directly in the terminal.
When to use: the user is working directly on a local device with a GPU available.
Check for a GPU, then run the script:
nvidia-smi
uv run scripts/train_sft_example.py
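For a programmatic check (assuming PyTorch is available in the environment), something like:

```python
import torch

# Verify a CUDA device is visible before launching a local evaluation
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
else:
    print("No CUDA device found; fall back to HF Jobs instead.")
```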
The skill includes Python scripts in scripts/ that perform these operations.
Requirements:
- uv run (PEP 723 header auto-installs deps)
- pip install huggingface-hub markdown-it-py python-dotenv pyyaml requests
- HF_TOKEN environment variable with Write-access token
- AA_API_KEY environment variable
- .env is loaded automatically if python-dotenv is installed

Recommended flow (matches --help):
# 1) Inspect tables to get table numbers and column hints
uv run scripts/evaluation_manager.py inspect-tables --repo-id "username/model"
# 2) Extract a specific table (prints YAML by default)
uv run scripts/evaluation_manager.py extract-readme \
--repo-id "username/model" \
--table 1 \
[--model-column-index <column index shown by inspect-tables>] \
[--model-name-override "<column header/model name>"] # use exact header text if you can't use the index
# 3) Apply changes (push or PR)
uv run scripts/evaluation_manager.py extract-readme \
--repo-id "username/model" \
--table 1 \
--apply # push directly
# or
uv run scripts/evaluation_manager.py extract-readme \
--repo-id "username/model" \
--table 1 \
--create-pr # open a PR
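The YAML printed in step 2 targets the Hugging Face model-index metadata format. A minimal sketch of its general shape, built with pyyaml (the model name, dataset, and score below are placeholders, not real output):

```python
import yaml

# Illustrative model-index entry; values here are placeholders
model_index = {
    "model-index": [{
        "name": "model-name",
        "results": [{
            "task": {"type": "text-generation"},  # set via --task-type
            "dataset": {"name": "MMLU", "type": "mmlu"},
            "metrics": [
                {"type": "accuracy", "value": 68.2, "name": "MMLU (5-shot)"},
            ],
        }],
    }],
}
print(yaml.safe_dump(model_index, sort_keys=False))
```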
Validation checklist:
- Confirm the model column via --model-column-index; if using --model-name-override, the column header text must be exact.

Fetch benchmark scores from the Artificial Analysis API and add them to a model card.
Basic Usage:
AA_API_KEY="your-api-key" uv run scripts/evaluation_manager.py import-aa \
--creator-slug "anthropic" \
--model-name "claude-sonnet-4" \
--repo-id "username/model-name"
With Environment File:
# Create .env file
echo "AA_API_KEY=your-api-key" >> .env
echo "HF_TOKEN=your-hf-token" >> .env
# Run import
uv run scripts/evaluation_manager.py import-aa \
--creator-slug "anthropic" \
--model-name "claude-sonnet-4" \
--repo-id "username/model-name"
Create Pull Request:
uv run scripts/evaluation_manager.py import-aa \
--creator-slug "anthropic" \
--model-name "claude-sonnet-4" \
--repo-id "username/model-name" \
--create-pr
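Under the hood, import-aa amounts to an authenticated GET against the Artificial Analysis API. A rough sketch of the pattern — the endpoint path, auth header, and response shape here are assumptions for illustration, not the documented API:

```python
import os

import requests

# NOTE: endpoint path, auth header, and response shape are illustrative;
# consult the Artificial Analysis API docs for the real schema.
resp = requests.get(
    "https://artificialanalysis.ai/api/models",  # hypothetical path
    headers={"x-api-key": os.environ["AA_API_KEY"]},
    timeout=30,
)
resp.raise_for_status()
data = resp.json()
```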
Submit an evaluation job on Hugging Face infrastructure using the hf jobs uv run CLI.
Direct CLI Usage:
HF_TOKEN=$HF_TOKEN \
hf jobs uv run hf-evaluation/scripts/inspect_eval_uv.py \
--flavor cpu-basic \
--secrets HF_TOKEN=$HF_TOKEN \
-- --model "meta-llama/Llama-2-7b-hf" \
--task "mmlu"
GPU Example (A10G):
HF_TOKEN=$HF_TOKEN \
hf jobs uv run hf-evaluation/scripts/inspect_eval_uv.py \
--flavor a10g-small \
--secrets HF_TOKEN=$HF_TOKEN \
-- --model "meta-llama/Llama-2-7b-hf" \
--task "gsm8k"
Python Helper (optional):
uv run scripts/run_eval_job.py \
--model "meta-llama/Llama-2-7b-hf" \
--task "mmlu" \
--hardware "t4-small"
Evaluate custom Hugging Face models directly on GPU using vLLM or accelerate backends. These scripts are separate from the inference provider scripts and run models locally on the job's hardware.
| Feature | vLLM Scripts | Inference Provider Scripts |
|---|---|---|
| Model access | Any HF model | Models with API endpoints |
| Hardware | Your GPU (or HF Jobs GPU) | Provider's infrastructure |
| Cost | HF Jobs compute cost | API usage fees |
| Speed | vLLM optimized | Depends on provider |
| Offline | Yes (after download) | No |
lighteval is Hugging Face's evaluation library, supporting Open LLM Leaderboard tasks.
Standalone (local GPU):
# Run MMLU 5-shot with vLLM
uv run scripts/lighteval_vllm_uv.py \
--model meta-llama/Llama-3.2-1B \
--tasks "leaderboard|mmlu|5"
# Run multiple tasks
uv run scripts/lighteval_vllm_uv.py \
--model meta-llama/Llama-3.2-1B \
--tasks "leaderboard|mmlu|5,leaderboard|gsm8k|5"
# Use accelerate backend instead of vLLM
uv run scripts/lighteval_vllm_uv.py \
--model meta-llama/Llama-3.2-1B \
--tasks "leaderboard|mmlu|5" \
--backend accelerate
# Chat/instruction-tuned models
uv run scripts/lighteval_vllm_uv.py \
--model meta-llama/Llama-3.2-1B-Instruct \
--tasks "leaderboard|mmlu|5" \
--use-chat-template
Via HF Jobs:
hf jobs uv run scripts/lighteval_vllm_uv.py \
--flavor a10g-small \
--secrets HF_TOKEN=$HF_TOKEN \
-- --model meta-llama/Llama-3.2-1B \
--tasks "leaderboard|mmlu|5"
lighteval Task Format:
Tasks use the format suite|task|num_fewshot:
- leaderboard|mmlu|5 - MMLU with 5-shot
- leaderboard|gsm8k|5 - GSM8K with 5-shot
- lighteval|hellaswag|0 - HellaSwag zero-shot
- leaderboard|arc_challenge|25 - ARC-Challenge with 25-shot

Finding Available Tasks: The complete list of available lighteval tasks can be found at:
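Independently of the task list, a task spec can be sanity-checked locally by splitting on the pipe separators; a minimal sketch:

```python
def parse_task(spec: str) -> tuple[str, str, int]:
    """Split a lighteval task spec of the form suite|task|num_fewshot."""
    suite, task, num_fewshot = spec.split("|")
    return suite, task, int(num_fewshot)

print(parse_task("leaderboard|mmlu|5"))  # ('leaderboard', 'mmlu', 5)
```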