This skill should be used when the user asks to "quantize a model", "run PTQ", "post-training quantization", "NVFP4 quantization", "FP8 quantization", "INT8 quantization", "INT4 AWQ", "quantize LLM", "quantize MoE", "quantize VLM", or needs to produce a quantized HuggingFace or TensorRT-LLM checkpoint from a pretrained model using ModelOpt.
Produce a quantized checkpoint from a pretrained model. Read examples/llm_ptq/README.md first — it has the support matrix, CLI flags, and accuracy guidance.
Read skills/common/environment-setup.md and skills/common/workspace-management.md. After completing them you should know:
Check the support table in examples/llm_ptq/README.md for verified HF models.
- Listed: use hf_ptq.py (step 4A/4B).
- Not listed: read references/unsupported-models.md to determine if hf_ptq.py can still work or if a custom script is needed (step 4C).

If the model uses trust_remote_code (check config.json for auto_map), inspect its custom Python files for imports not present in the container:
grep -h "^from \|^import " <model_path>/modeling_*.py | sort -u
Known dependency patterns:
| Import found | Packages to install |
|---|---|
| from mamba_ssm / from causal_conv1d | mamba-ssm causal-conv1d (Mamba/hybrid models: NemotronH, Jamba) |
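
The import check can also be scripted. The following is a minimal sketch, not part of ModelOpt: it scans modeling_*.py for top-level imports, as the grep command does, and maps them to packages using only the one row of the table above. The names KNOWN_DEPS, top_level_imports, and extra_pip_deps are hypothetical helpers introduced for illustration.

```python
import re
import tempfile
from pathlib import Path

# Mirrors the "Known dependency patterns" table: imported module -> pip packages.
# Extend this mapping by hand; it is an illustration, not a ModelOpt data structure.
KNOWN_DEPS = {
    "mamba_ssm": ["mamba-ssm", "causal-conv1d"],
    "causal_conv1d": ["mamba-ssm", "causal-conv1d"],
}

# Matches "import x" / "from x import y" at column 0 only, like the grep pattern.
# Relative imports ("from .x import y") and indented imports are not matched.
IMPORT_RE = re.compile(r"^(?:from|import)\s+([A-Za-z_]\w*)")


def top_level_imports(model_dir):
    """Collect root module names imported at column 0 in modeling_*.py files."""
    modules = set()
    for path in sorted(Path(model_dir).glob("modeling_*.py")):
        for line in path.read_text(encoding="utf-8").splitlines():
            match = IMPORT_RE.match(line)
            if match:
                modules.add(match.group(1))
    return modules


def extra_pip_deps(model_dir):
    """Return the pip packages suggested by KNOWN_DEPS, without duplicates."""
    packages = []
    for module in sorted(top_level_imports(model_dir)):
        for package in KNOWN_DEPS.get(module, []):
            if package not in packages:
                packages.append(package)
    return packages


# Small self-contained demo on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "modeling_demo.py").write_text(
        "import torch\n"
        "from mamba_ssm import Mamba\n"
        "from .configuration_demo import DemoConfig\n",
        encoding="utf-8",
    )
    demo_deps = extra_pip_deps(tmp)

print(" ".join(demo_deps))
```

Imports that are not in KNOWN_DEPS are ignored here, so the grep output is still worth reading by eye for anything unfamiliar.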
If extra deps are needed:
- Launcher (4B): set EXTRA_PIP_DEPS in the task's environment section; ptq.sh installs them automatically.
- Manual (4A): run unset PIP_CONSTRAINT && pip install <deps> before running hf_ptq.py.

First, check for a model-specific recipe:
ls modelopt_recipes/models/ 2>/dev/null
If a model-specific recipe exists, use --recipe <path> — it may contain tuned settings.
If no model-specific recipe, choose a format based on GPU (details in examples/llm_ptq/README.md):
- Blackwell GPUs: nvfp4 variants
- Other GPUs (e.g., Hopper): fp8 or int4_awq

Use --qformat <name> (e.g., --qformat nvfp4). Format definitions are in modelopt/torch/quantization/config.py. General PTQ recipes in modelopt_recipes/general/ptq/ correspond to the same formats; --qformat is the simpler way to use them.
NVFP4 can be calibrated on Hopper but requires Blackwell for inference.
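
The recipe-or-format choice can be sketched as a small function. This is an illustration under stated assumptions, not ModelOpt logic: it assumes Blackwell GPUs report a CUDA compute capability with major version 10 or higher (Hopper reports 9.0), and it picks fp8 as the fallback even though int4_awq may be the better choice on some hardware (see examples/llm_ptq/README.md). The function name quant_args is hypothetical.

```python
def quant_args(compute_capability, recipe_path=None):
    """Return the hf_ptq.py arguments that select the quantization settings.

    compute_capability is the (major, minor) tuple of the *deployment* GPU,
    e.g. what torch.cuda.get_device_capability() returns on that machine.
    """
    # A model-specific recipe takes priority because it may contain tuned settings.
    if recipe_path:
        return ["--recipe", recipe_path]

    major, _minor = compute_capability
    # Assumption: major >= 10 means Blackwell, which can run NVFP4 inference.
    qformat = "nvfp4" if major >= 10 else "fp8"
    return ["--qformat", qformat]


print(quant_args((9, 0)))
print(quant_args((10, 0)))
```

Passing the deployment GPU rather than the calibration GPU matters here, since calibration and inference may run on different hardware.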
Goal: checkpoint on disk (.safetensors + config.json).
For listed models (4A/4B): run full calibration directly (--calib_size 512).
For unlisted models (4C): run a smoke test first (--calib_size 4), wait for success, then full calibration.
In README table? ─→ YES ──→ SLURM (local or remote)? ──→ LAUNCHER (4B)
│                           Local Docker + GPU? ────────→ LAUNCHER (4B)
│                           Remote Docker (no SLURM)? ──→ MANUAL (4A)
│                           Bare GPU (local or remote)? → MANUAL (4A)
│
└→ NOT LISTED ──→ UNLISTED MODEL (4C)
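
The decision tree above reads as a two-level lookup. A minimal sketch, where the environment labels ("slurm", "local_docker", "remote_docker", "bare_gpu") and the function name choose_path are hypothetical names introduced for illustration:

```python
# Environment labels are hypothetical shorthand for the four branches of the tree.
LAUNCHER_ENVS = {"slurm", "local_docker"}
MANUAL_ENVS = {"remote_docker", "bare_gpu"}


def choose_path(listed_in_readme, environment):
    """Map (model listed?, execution environment) to step 4A, 4B, or 4C."""
    if not listed_in_readme:
        return "4C"  # unlisted model, regardless of environment
    if environment in LAUNCHER_ENVS:
        return "4B"  # launcher
    if environment in MANUAL_ENVS:
        return "4A"  # manual
    raise ValueError(f"unknown environment: {environment!r}")


print(choose_path(True, "slurm"))
print(choose_path(False, "slurm"))
```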
Manual execution (4A):
pip install --no-build-isolation "nvidia-modelopt[hf]"
pip install -r examples/llm_ptq/requirements.txt
python examples/llm_ptq/hf_ptq.py \
--pyt_ckpt_path <model> \
--qformat <format> \
--calib_size 512 \
--export_path <output>
Run --help for all options.
For remote: use remote_run from remote_exec.sh (see skills/common/remote-execution.md).
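
The calibration rule stated earlier (full calibration for listed models, a smoke test first for unlisted ones) can be expressed as a list of hf_ptq.py invocations. This is a sketch that only uses the four flags shown in the command above; the helper names ptq_command, calibration_plan, and run_plan are hypothetical, and both runs write to the same export path, which is an assumption rather than documented behavior.

```python
import shlex
import subprocess


def ptq_command(model, qformat, export_path, calib_size):
    """Build one hf_ptq.py invocation as an argv list."""
    return [
        "python", "examples/llm_ptq/hf_ptq.py",
        "--pyt_ckpt_path", model,
        "--qformat", qformat,
        "--calib_size", str(calib_size),
        "--export_path", export_path,
    ]


def calibration_plan(model, qformat, export_path, listed):
    """Listed models: one full run. Unlisted models: smoke test, then full run."""
    sizes = [512] if listed else [4, 512]
    return [ptq_command(model, qformat, export_path, size) for size in sizes]


def run_plan(plan):
    """Run the commands in order; check=True stops at the first failure."""
    for argv in plan:
        print("+", shlex.join(argv))
        subprocess.run(argv, check=True)


# Print the plan for an unlisted model without executing anything.
for argv in calibration_plan("<model>", "fp8", "<output>", listed=False):
    print(shlex.join(argv))
```

Because run_plan raises on a non-zero exit code, a failed smoke test prevents the expensive full calibration from starting.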
For the launcher path (4B), write a YAML config using common/hf_ptq/hf_ptq.sh. See references/launcher-guide.md for the full template.
cd tools/launcher
# SLURM (remote or local):
SLURM_HOST=<host> SLURM_ACCOUNT=<acct> uv run launch.py --yaml <config.yaml> user=<ssh_user> identity=<ssh_key> --yes
# Local Docker:
uv run launch.py --yaml <config.yaml> hf_local=<hf_cache> --yes
The launcher blocks and tails logs until the job completes. If the launcher fails (missing deps, config errors), fall back to path 4A (manual execution).
For unlisted models (4C), follow references/unsupported-models.md. It walks through investigating the model, patching ModelOpt if needed, and running hf_ptq.py. Run manually (like 4A) for easier monitoring and debugging.
For SLURM, see skills/common/slurm-setup.md and references/slurm-setup-ptq.md.
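
When driving SLURM by hand, a blocking poll loop is one simple way to wait for a job. The sketch below assumes squeue is on PATH and waits until the user has no jobs left in the queue, not for one specific job; queued_jobs and wait_until_idle are hypothetical helper names.

```python
import subprocess
import time


def queued_jobs(user):
    """Return the user's squeue lines (-h suppresses the header row)."""
    result = subprocess.run(
        ["squeue", "-u", user, "-h"],
        capture_output=True, text=True, check=True,
    )
    return [line for line in result.stdout.splitlines() if line.strip()]


def wait_until_idle(list_jobs, interval_s=60, sleep=time.sleep):
    """Poll list_jobs() until it returns nothing; return the number of sleeps."""
    polls = 0
    while list_jobs():
        polls += 1
        sleep(interval_s)
    return polls


# Typical use on a SLURM login node (not executed here):
#   import getpass
#   wait_until_idle(lambda: queued_jobs(getpass.getuser()))
```

Injecting list_jobs and sleep keeps the loop testable without a cluster.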
Poll with squeue -u $USER + sleep (not cron or background tasks).
ls -lh <output_path>/
# Expect: config.json, tokenizer files, model-*.safetensors
Report the path and size to the user.
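
The same check can be done programmatically when the path and size need to be reported. A minimal sketch, assuming tokenizer files have names starting with "tokenizer" (an assumption, since the exact file set depends on the model); summarize_checkpoint is a hypothetical helper name.

```python
import tempfile
from pathlib import Path


def summarize_checkpoint(export_dir):
    """List problems with an exported checkpoint and total its size in bytes."""
    root = Path(export_dir)
    files = [p for p in root.iterdir() if p.is_file()]
    names = {p.name for p in files}
    shards = sorted(n for n in names if n.endswith(".safetensors"))

    problems = []
    if "config.json" not in names:
        problems.append("missing config.json")
    if not shards:
        problems.append("no .safetensors files")
    # Assumption: tokenizer files are named tokenizer*, e.g. tokenizer.json.
    if not any(n.startswith("tokenizer") for n in names):
        problems.append("no tokenizer files")

    return {
        "path": str(root),
        "size_bytes": sum(p.stat().st_size for p in files),
        "shards": shards,
        "problems": problems,
    }


# Demo on a throwaway directory with tiny placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "config.json").write_bytes(b"{}")
    Path(tmp, "tokenizer.json").write_bytes(b"{}")
    Path(tmp, "model-00001-of-00001.safetensors").write_bytes(b"abcd")
    demo_summary = summarize_checkpoint(tmp)

print(demo_summary["size_bytes"], demo_summary["problems"])
```

This only confirms that the files exist; it says nothing about whether the right layers were quantized.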
Validate the exported checkpoint's quantization pattern matches the recipe. Quantization config patterns can silently miss layers if the model uses non-standard naming (e.g., Gemma4 experts.* missed by *mlp* patterns) — this only surfaces later as deployment failures. Read references/checkpoint-validation.md for the validation script, expected patterns per recipe, and common pattern gaps.
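
To see how a wildcard pattern can miss or over-match layers, the sketch below applies fnmatch-style matching to a few module names. The names are hypothetical and chosen only to illustrate the two failure modes; the assumption that patterns follow fnmatch semantics should be checked against references/checkpoint-validation.md, which has the actual validation script.

```python
from fnmatch import fnmatchcase

# Hypothetical module names, not taken from any real checkpoint.
LAYERS = [
    "model.layers.0.mlp.gate",                 # MoE router
    "model.layers.0.mlp.experts.0.gate_proj",  # expert linear layer
    "model.layers.0.experts.0.down_proj",      # non-standard naming: no "mlp"
]


def matched(pattern, names=LAYERS):
    """Return the names that a case-sensitive wildcard pattern selects."""
    return [name for name in names if fnmatchcase(name, pattern)]


# Layers that a *mlp* pattern silently skips.
missed_by_mlp = [name for name in LAYERS if name not in matched("*mlp*")]

# *gate* selects the router and the expert projection; *mlp.gate* only the router.
broad = matched("*gate*")
narrow = matched("*mlp.gate*")

print("missed by *mlp*:", missed_by_mlp)
print("*gate* matches:", broad)
print("*mlp.gate* matches:", narrow)
```

Running the same kind of comparison against the module names in an exported checkpoint is one way to spot a pattern gap before deployment.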
- mtq.register() classes must define _setup() and call it from __init__.
- Call mto.enable_huggingface_checkpointing() before quantization.
- The wildcard *gate* matches too broadly; use *mlp.gate* or *router*.
- hf_ptq.py auto-extracts the language model via extract_and_prepare_language_model_from_vl(), so no manual VLM handling is needed in most cases.
- Prefer _QuantFP8Linear (lazy dequant) over FineGrainedFP8Config(dequantize=True), which wastes ~2x memory. See references/unsupported-models.md for details.
- _input_quantizer or _weight_quantizer
- Models using trust_remote_code may import packages not in the container (e.g., mamba-ssm for hybrid Mamba models). See Step 2.5. Use the EXTRA_PIP_DEPS env var with the launcher, or install manually before running hf_ptq.py.
- Check config.json for transformers_version. In containers, beware of PIP_CONSTRAINT blocking upgrades; see references/slurm-setup-ptq.md for workarounds.
- Ensure HF_TOKEN is set in the job environment, or use --dataset cnn_dailymail as a non-gated alternative.
- skills/common/slurm-setup.md section 5

| Reference | When to read |
|---|---|
| skills/common/environment-setup.md | Step 1: always |
| skills/common/workspace-management.md | Step 1: always |
| references/launcher-guide.md | Step 4B only (launcher path) |
| tools/launcher/CLAUDE.md | Step 4B only, if you need more launcher detail |
| references/unsupported-models.md | Step 4C only (unlisted model) |
| references/checkpoint-validation.md | Step 5: validate quantization pattern matches recipe |
| skills/common/remote-execution.md | Step 4A/4C only, if target is remote |
| skills/common/slurm-setup.md | Step 4A/4C only, if using SLURM manually (not launcher) |
| references/slurm-setup-ptq.md | Step 4A/4C only, PTQ-specific SLURM (container, GPU sizing, FSDP2) |
| examples/llm_ptq/README.md | Step 3: support matrix, CLI flags, accuracy |
| modelopt/torch/quantization/config.py | Step 3: format definitions |
| modelopt/torch/export/model_utils.py | Step 4C: TRT-LLM export type mapping |
| modelopt_recipes/ | Step 3: pre-built recipes |