Evaluates accuracy of quantized or unquantized LLMs using NeMo Evaluator Launcher (NEL). Triggers on "evaluate model", "benchmark accuracy", "run MMLU", "evaluate quantized model", "accuracy drop", "run nel". Handles deployment, config generation, and evaluation execution. Not for quantizing models (use ptq) or deploying/serving models (use deployment).
You're an expert in NeMo Evaluator Launcher! Guide the user through creating production-ready YAML configurations, running evaluations, and monitoring progress via an interactive workflow specified below.
If MODELOPT_WORKSPACE_ROOT is set, read skills/common/workspace-management.md. Check for existing workspaces — especially if evaluating a model from a prior PTQ or deployment step. Reuse the existing workspace so you have access to the quantized checkpoint and any code modifications.
Copy this checklist and track your progress:
Config Generation Progress:
- [ ] Step 0: Check workspace (if MODELOPT_WORKSPACE_ROOT is set)
- [ ] Step 1: Check if nel is installed and if user has existing config
- [ ] Step 2: Build the base config file
- [ ] Step 3: Configure model path and parameters
- [ ] Step 4: Fill in remaining missing values
- [ ] Step 5: Confirm tasks (iterative)
- [ ] Step 6: Advanced - Multi-node (Data Parallel)
- [ ] Step 7: Advanced - Interceptors
- [ ] Step 7.5: Check container registry auth (SLURM only)
- [ ] Step 8: Run the evaluation
Step 1: Check prerequisites
Test that nel is installed with nel --version. If not, instruct the user to pip install nemo-evaluator-launcher.
If the user already has a config file (e.g., "run this config", "evaluate with my-config.yaml"), skip to Step 8. Optionally review it for common issues (missing ??? values, quantization flags) before running.
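The two checks in this step (is nel on PATH, and does an existing config still contain ??? placeholders) can be sketched as small shell helpers. This is a minimal sketch, not part of NEL: the function names have_cmd and count_unset_values are hypothetical, and the placeholder scan assumes missing values appear as the literal string ??? anywhere on a line, including in comments.

```shell
# Hypothetical helpers; names are illustrative and not provided by NEL.

# Succeeds if the given command is available on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Prints how many lines of a config file still contain the literal ???
# placeholder (0 means nothing obvious is left to fill in).
count_unset_values() {
    grep -c -F '???' "$1" || true
}

# Example use:
#   have_cmd nel || echo "nel not found: pip install nemo-evaluator-launcher"
#   count_unset_values my-config.yaml
```

A non-zero count is only a prompt to review the config; quantization flags still need to be checked by reading it.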
Step 2: Build the base config file
Prompt the user with "I'll ask you 5 questions to build the base config we'll adjust in the next steps". Guide the user through the 5 questions using AskUserQuestion:
Only accept options from the categories listed above (Execution, Deployment, Auto-export, Model type, Benchmarks). YOU HAVE TO GATHER THE ANSWERS for the 5 questions before you can build the base config.
Note: These categories come from NEL's build-config CLI. Always run nel skills build-config --help first to get the current options, which may differ from this list (e.g., chat_reasoning instead of separate chat/reasoning, general_knowledge instead of standard). When the CLI's current options differ from this list, prefer the CLI's options.
When you have all the answers, run the script to build the base config:
nel skills build-config --execution <local|slurm> --deployment <none|vllm|sglang|nim|trtllm> --model_type <base|chat|reasoning> --benchmarks <standard|code|math_reasoning|safety|multilingual> [--export <none|mlflow|wandb>] [--output <OUTPUT>]
Where --output depends on what the user provides:
It never overwrites existing files.
Step 3: Configure model path and parameters
Ask for model path. Determine type:
- Checkpoint path (starts with /, ./, ../, ~, or contains no / but exists on disk) → set deployment.checkpoint_path: <path> and deployment.hf_model_handle: null
- HF handle (org/model-name: contains exactly one / and does not exist locally) → set deployment.hf_model_handle: <handle> and deployment.checkpoint_path: null
Auto-detect ModelOpt quantization format (checkpoint paths only):
Check for hf_quant_config.json in the checkpoint directory:
cat <checkpoint_path>/hf_quant_config.json 2>/dev/null
If found, read quantization.quant_algo and set the correct vLLM/SGLang quantization flag in deployment.extra_args:
| quant_algo | Flag to add |
|---|---|
| FP8 | --quantization modelopt |
| W4A8_AWQ | --quantization modelopt |
| NVFP4, NVFP4_AWQ | --quantization modelopt_fp4 |
| Other values | Try --quantization modelopt; consult vLLM/SGLang docs if unsure |
If no hf_quant_config.json, also check config.json for a quantization_config section with quant_method: "modelopt". If neither is found, the checkpoint is unquantized — no flag needed.
Note: Some models require additional env vars for deployment (e.g., VLLM_NVFP4_GEMM_BACKEND=marlin for Nemotron Super). These are not in hf_quant_config.json; they are discovered during model card research below.
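The path-versus-handle rule and the quant_algo table in this step can be sketched in shell as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, inputs that match neither rule are reported as ask_user rather than guessed, and read_quant_algo uses a plain sed match that assumes the quant_algo key and its string value sit on one line (a real JSON parser is more robust). The config.json fallback check is not covered here.

```shell
# Hypothetical helpers sketching the Step 3 rules; names are illustrative.

# Prints checkpoint_path, hf_model_handle, or ask_user for a model string.
classify_model_ref() {
    case "$1" in
        /*|./*|../*|'~'*) echo checkpoint_path ;;
        */*/*) echo ask_user ;;   # more than one slash, not an obvious path
        */*)   if [ -e "$1" ]; then echo ask_user; else echo hf_model_handle; fi ;;
        *)     if [ -e "$1" ]; then echo checkpoint_path; else echo ask_user; fi ;;
    esac
}

# Reads quantization.quant_algo from <checkpoint>/hf_quant_config.json.
# Prints nothing if the file or the key is missing.
read_quant_algo() {
    sed -n 's/.*"quant_algo"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' \
        "$1/hf_quant_config.json" 2>/dev/null | head -n 1
}

# Maps a quant_algo value to the flag from the table above.
quant_flag_for_algo() {
    case "$1" in
        "")              : ;;                                 # unquantized: no flag
        FP8|W4A8_AWQ)    echo "--quantization modelopt" ;;
        NVFP4|NVFP4_AWQ) echo "--quantization modelopt_fp4" ;;
        *)               echo "--quantization modelopt" ;;    # fallback; verify in vLLM/SGLang docs
    esac
}

# Example use:
#   quant_flag_for_algo "$(read_quant_algo /path/to/checkpoint)"
```

The printed flag is what would go into deployment.extra_args; the fallback branch is a first attempt, not a guarantee that the backend supports the format.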
Quantization-aware benchmark defaults:
When a quantized checkpoint is detected, read references/quantization-benchmarks.md for benchmark sensitivity rankings and recommended sets. Present recommendations to the user and ask which to include.
Read references/model-card-research.md for the full extraction checklist (sampling params, reasoning config, ARM64 compatibility, pre_cmd, etc.). Use WebSearch to research the model card, present findings, and ask the user to confirm.
Step 4: Fill in remaining missing values
Fill in all remaining ??? missing values in the config.
Step 5: Confirm tasks (iterative)
Show tasks in the current config. Loop until the user confirms the task list is final:
Tell the user: "Run nel ls tasks to see all available tasks".
Ask if they want to add/remove tasks or add/remove/modify task-specific parameter overrides.
Add per-task nemo_evaluator_config overrides as specified by the user, e.g.:
tasks:
  - name: <task>
    nemo_evaluator_config:
      config:
        params:
          temperature: <value>
          max_new_tokens: <value>
          ...
Apply changes.
Show updated list and ask: "Is the task list final, or do you want to make more changes?"
Known Issues
NeMo-Skills workaround (self-deployment only): If using nemo_skills.* tasks with self-deployment (vLLM/SGLang/NIM), add at top level:
target:
  api_endpoint:
    api_key_name: DUMMY_API_KEY
For the None (External) deployment, api_key_name should already be defined. The DUMMY_API_KEY export is handled in Step 8.
Step 6: Advanced - Multi-node
If the user needs multi-node evaluation (model >120B, or more throughput), read references/multi-node.md for the configuration patterns (HAProxy multi-instance, Ray TP/PP, or combined).
Step 7: Advanced - Interceptors
--overrides syntax but put the values in the YAML config under evaluation.nemo_evaluator_config.config.target.api_endpoint.adapter_config (NOT under target.api_endpoint.adapter_config) instead of using CLI overrides.
Defining an interceptors list overrides the full chain of interceptors, which can have unintended consequences such as disabling the default interceptors. For that reason, configure interceptors in the YAML config using the fields listed after the --overrides keyword in the CLI Configuration section.
Documentation Errata
Use max_logged_requests and max_logged_responses (NOT max_saved_* or max_*).
Step 7.5: Check container registry authentication (SLURM only)
NEL's default deployment images by framework:
| Framework | Default image | Registry |
|---|---|---|
| vLLM | vllm/vllm-openai:latest | DockerHub |
| SGLang | lmsysorg/sglang:latest | DockerHub |
| TRT-LLM | nvcr.io/nvidia/tensorrt-llm/release:... | NGC |
| Evaluation tasks | nvcr.io/nvidia/eval-factory/*:26.03 | NGC |
Before submitting, verify the cluster has credentials for the deployment image. See skills/common/slurm-setup.md section 6 for the full procedure.
ssh <host> "grep -E '^\s*machine\s+' ~/.config/enroot/.credentials 2>/dev/null"
Decision flow (check before submitting):
Check if the cluster has credentials for the default DockerHub image (see command above)
If DockerHub credentials exist → use the default image and submit
If DockerHub credentials are missing but can be added → add them (see slurm-setup.md section 6), then submit
If DockerHub credentials cannot be added → override deployment.image to the NGC alternative and submit:
deployment:
  image: nvcr.io/nvidia/vllm:<YY.MM>-py3 # check https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm for latest tag
Do not retry more than once without fixing the auth issue.
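The decision flow above can be sketched as a helper that inspects a netrc-style credentials file and picks an image. This is a rough sketch, not NEL behavior: the function names are hypothetical, the DockerHub entry is assumed to be machine auth.docker.io (confirm the exact host in slurm-setup.md section 6), and the "credentials can be added" branch is left to the operator. The functions read a local file, so run them on the cluster or against a copy of ~/.config/enroot/.credentials.

```shell
# Hypothetical helpers; names and the default DockerHub host are assumptions.

# Succeeds if the credentials file has a "machine <host>" entry.
# Arguments: <credentials file> <registry host>
has_registry_creds() {
    awk -v h="$2" '$1 == "machine" && $2 == h { found = 1 } END { exit !found }' \
        "$1" 2>/dev/null
}

# Prints the default vLLM image when DockerHub credentials exist, otherwise
# the NGC alternative with its tag left as a placeholder to fill in.
# Arguments: <credentials file> [DockerHub host, default auth.docker.io]
choose_vllm_image() {
    if has_registry_creds "$1" "${2:-auth.docker.io}"; then
        echo "vllm/vllm-openai:latest"
    else
        echo "nvcr.io/nvidia/vllm:<YY.MM>-py3"
    fi
}

# Example use:
#   choose_vllm_image ~/.config/enroot/.credentials
```

When the NGC alternative is printed, the <YY.MM> tag still has to be looked up in the NGC catalog before it is written to deployment.image.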
Step 8: Run the evaluation
Print the following commands to the user. Propose to execute them in order to confirm the config works as expected before the full run.
Important: Export required environment variables based on your config. If any tokens or keys are missing (e.g. HF_TOKEN, NGC_API_KEY, api_key_name from the config), ask the user to put them in a .env file in the project root so you can run set -a && source .env && set +a (or equivalent) before executing nel run commands.
# If using pre_cmd or post_cmd (review pre_cmd content before enabling — it runs arbitrary commands):
export NEMO_EVALUATOR_TRUST_PRE_CMD=1
# If using nemo_skills.* tasks with self-deployment:
export DUMMY_API_KEY=dummy
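Before running, it can help to confirm that the variables the config depends on are actually set after sourcing .env. A minimal sketch, assuming a hypothetical helper named missing_env_vars and variable names that are plain identifiers supplied by you (the helper uses eval, so it should not be fed untrusted input):

```shell
# Hypothetical helper; not part of NEL.

# Prints the name of every listed variable that is unset or empty.
missing_env_vars() {
    for name in "$@"; do
        eval "value=\${$name:-}"
        [ -n "$value" ] || echo "$name"
    done
}

# Example use:
#   set -a && . ./.env && set +a
#   missing_env_vars HF_TOKEN NGC_API_KEY
```

Any name it prints is a key to request from the user; the list to check depends on the config (for example, whatever api_key_name refers to).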
Dry-run (validates config without running):
nel run --config <config_path> --dry-run
Test with limited samples (quick validation run):
nel run --config <config_path> -o ++evaluation.nemo_evaluator_config.config.params.limit_samples=10
Re-run a single task (useful for debugging or re-testing after config changes):
nel run --config <config_path> -t <task_name>
Combine with -o for limited samples: nel run --config <config_path> -t <task_name> -o ++evaluation.nemo_evaluator_config.config.params.limit_samples=10
Full evaluation (production run):
nel run --config <config_path>
After the dry-run, check the output from nel for any problems with the config. If there are no problems, propose to first execute the test run with limited samples and then execute the full evaluation. If there are problems, resolve them before executing the full evaluation.
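The first two stages of that sequence can be chained so the limited-sample test only starts if the dry-run succeeds. This is a sketch under assumptions: staged_validate and the NEL_BIN override are hypothetical, and it relies on nel returning a non-zero exit code on failure, which does not replace reading the dry-run output. The full evaluation is deliberately left as a separate, manual step.

```shell
# Hypothetical wrapper; NEL_BIN exists only so the sequence can be rehearsed
# with a stub command instead of the real nel binary.

# Runs the dry-run, then the limited-sample test, stopping at the first failure.
# Arguments: <config_path> [sample limit, default 10]
staged_validate() {
    nel_bin="${NEL_BIN:-nel}"
    "$nel_bin" run --config "$1" --dry-run || return 1
    "$nel_bin" run --config "$1" \
        -o ++evaluation.nemo_evaluator_config.config.params.limit_samples="${2:-10}"
}

# Example use:
#   staged_validate my-config.yaml && echo "ready for: nel run --config my-config.yaml"
```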
Monitoring Progress
After job submission, you can monitor progress using:
Check job status:
nel status <invocation_id>
nel info <invocation_id>
Stream logs (Local execution only):
nel logs <invocation_id>
Note: nel logs is not supported for SLURM execution.
Inspect logs via SSH (SLURM workaround):
When nel logs is unavailable (SLURM), use SSH to inspect logs directly:
First, get log locations:
nel info <invocation_id> --logs
Then, use SSH to view logs:
Check server deployment logs:
ssh <username>@<hostname> "tail -100 <log path from `nel info <invocation_id> --logs`>/server-<slurm_job_id>-*.log"
Shows vLLM server startup, model loading, and deployment errors (e.g., missing wget/curl).
Check evaluation client logs:
ssh <username>@<hostname> "tail -100 <log path from `nel info <invocation_id> --logs`>/client-<slurm_job_id>.log"
Shows evaluation progress, task execution, and results.
Check SLURM scheduler logs:
ssh <username>@<hostname> "tail -100 <log path from `nel info <invocation_id> --logs`>/slurm-<slurm_job_id>.log"
Shows job scheduling, health checks, and overall execution flow.
Search for errors:
ssh <username>@<hostname> "grep -i 'error\|warning\|failed' <log path from `nel info <invocation_id> --logs`>/*.log"
Direct users with issues to: