Select and apply model quantization formats (GGUF, GPTQ, AWQ, bitsandbytes) with appropriate bit widths, calibration data, and quality-latency tradeoffs. Use when choosing quantization format for deployment, converting models between formats, tuning bitsandbytes config, or evaluating quantized model quality. Do not use for training, fine-tuning, or inference serving configuration unrelated to quantization.
Use this skill to select the right quantization format and bit width for a given model, hardware target, and quality requirement — then execute the conversion, validate output quality, and document the tradeoff decision.
Use this skill when:

- Choosing a quantization format and bit width for a deployment target
- Converting models between formats with llama.cpp/quantize, auto-gptq, autoawq, or transformers with bitsandbytes
- Tuning a bitsandbytes config
- Evaluating quantized model quality against an FP16 baseline

**Profile the deployment constraints.** Record the target hardware (GPU model and VRAM, CPU cores, RAM), the maximum acceptable latency (tokens/sec), the maximum memory budget, and whether the model must run fully on GPU, fully on CPU, or split between the two.
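The constraints from this step can be captured in a single record before any conversion work begins; a minimal sketch, where the class and field names are illustrative rather than part of any library:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DeploymentProfile:
    """Constraints recorded before choosing a quantization format."""
    gpu_model: Optional[str]    # None for CPU-only targets
    vram_gb: float
    cpu_cores: int
    ram_gb: float
    min_tokens_per_sec: float   # latency requirement expressed as throughput
    max_memory_gb: float
    placement: str              # "gpu", "cpu", or "split"

# Hypothetical target: a single 24 GB consumer GPU
profile = DeploymentProfile("RTX 4090", 24.0, 16, 64.0, 30.0, 20.0, "gpu")
```

Every later decision (format, bit width, accept/reject) can then reference this one object.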
**Estimate model memory at each bit width.** Use the formula: memory_gb ≈ (params_billions × bits_per_weight) / 8 + kv_cache_overhead. For a 7B model at Q4, that gives 7 × 4 / 8 = 3.5 GB of weights (≈4 GB in practice, since Q4_K_M keeps some tensors at higher precision), plus KV cache. Compare against available VRAM.
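The formula above can be sketched as a small helper; the function name and the KV-cache argument are illustrative:

```python
def estimate_memory_gb(params_billions: float, bits_per_weight: float,
                       kv_cache_gb: float = 0.0) -> float:
    """memory_gb ≈ (params_billions × bits_per_weight) / 8, plus KV-cache overhead."""
    return params_billions * bits_per_weight / 8 + kv_cache_gb

# 7B model at 4 bits: 3.5 GB of weights before KV cache
print(estimate_memory_gb(7, 4))  # 3.5
```

Running it across candidate bit widths gives a quick table to compare against the VRAM budget.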
**Select the quantization format based on serving runtime.**

- GGUF via llama.cpp/quantize (CPU or CPU/GPU-split inference)
- auto-gptq (GPU-only, fast inference with Marlin/ExLlama kernels)
- autoawq (GPU-only, slightly better quality than GPTQ at the same bit width)
- bitsandbytes (GPU-only, on-the-fly quantization when loading with transformers; no conversion artifact)

**Prepare the calibration dataset (GPTQ/AWQ only).**
Select 128–512 representative samples from the target domain. Use c4 or wikitext as a fallback if domain data is unavailable. Ensure the samples cover the range of expected input lengths.
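One way to satisfy the "cover the range of input lengths" requirement is a length-stratified draw. A sketch, where the helper name and sampling strategy are assumptions rather than anything auto-gptq or autoawq require:

```python
import random

def select_calibration_samples(corpus, n_samples=256, seed=0):
    """Pick a reproducible subset that spans short and long inputs."""
    rng = random.Random(seed)
    ranked = sorted(corpus, key=len)          # order texts by length
    step = max(1, len(ranked) // n_samples)   # evenly spaced length strata
    samples = ranked[::step][:n_samples]
    rng.shuffle(samples)                      # avoid length-ordered batches
    return samples

# Toy corpus with lengths 1..100; draw 10 length-spread samples
corpus = [str(i % 10) * i for i in range(1, 101)]
calib = select_calibration_samples(corpus, n_samples=10)
```

The resulting list can be tokenized and passed as the `calibration_data` used in the quantization calls below.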
**Execute the quantization.**

- GGUF: `./quantize input.gguf output.gguf Q4_K_M`
- GPTQ: `auto_gptq.AutoGPTQForCausalLM.from_pretrained(...).quantize(calibration_data, bits=4, group_size=128)`
- AWQ: `awq.AutoAWQForCausalLM.from_pretrained(...).quantize(calibration_data, quant_config={"w_bit": 4, "q_group_size": 128})`
- bitsandbytes: `BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16)`

**Evaluate quality against the FP16 baseline.** Run perplexity on a held-out set. Run 3–5 representative task prompts and compare output quality. Measure accuracy on a relevant benchmark (e.g., an MMLU subset or domain-specific QA). Accept if degradation is <2% on the primary metric.
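Of the options above, bitsandbytes quantizes at load time instead of producing a converted artifact. A minimal config sketch expanding the one-liner above; the double-quant flag is an addition beyond the text, and the model id would be your own:

```python
import torch
from transformers import BitsAndBytesConfig

# Starting-point 4-bit config for serving through transformers
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # NF4 generally beats plain fp4 on quality
    bnb_4bit_compute_dtype=torch.bfloat16,   # run matmuls in bf16
    bnb_4bit_use_double_quant=True,          # optional: also quantize the quant constants
)
# Pass as: AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)
```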
**Document the decision.** Record: source model, quantization format, bit width, calibration data, quality delta vs FP16, memory footprint, and throughput measured on target hardware.
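The <2% acceptance rule and the fields listed above can be combined into one reproducible record; all values here are hypothetical placeholders:

```python
def accept_quantized(baseline_ppl: float, quant_ppl: float,
                     max_rel_degradation: float = 0.02) -> bool:
    """Accept if the primary metric degrades by less than 2% vs the FP16 baseline."""
    return (quant_ppl - baseline_ppl) / baseline_ppl < max_rel_degradation

# Hypothetical decision record with the fields from the step above
record = {
    "source_model": "example/model-7b",   # placeholder name
    "format": "GGUF",
    "quant_type": "Q4_K_M",
    "calibration_data": "none (GGUF k-quants need no calibration set)",
    "baseline_ppl": 5.60,
    "quant_ppl": 5.68,
    "memory_footprint_gb": 4.1,
    "throughput_tok_s": 42.0,
    "hardware": "24 GB GPU",
}
record["accepted"] = accept_quantized(record["baseline_ppl"], record["quant_ppl"])
print(record["accepted"])  # True: ~1.4% degradation is under the 2% threshold
```

Checking the record into the repo next to the quantized artifact makes the tradeoff auditable later.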
Defaults:

- Q4_K_M (GGUF) or 4-bit with group_size=128 (GPTQ/AWQ) as the starting point
- Q5_K_M or Q5_K_S when Q4 shows >2% quality degradation on the primary evaluation metric
- bitsandbytes NF4 when serving through transformers directly

Report the following:

- Hardware Profile — GPU/CPU specs, VRAM, RAM, and target throughput
- Format Selection Rationale — chosen format, bit width, and why alternatives were rejected
- Quantization Command or Config — exact command or code to reproduce the conversion
- Quality Evaluation Results — perplexity delta, task accuracy comparison, and sample output diffs
- Deployment Artifact — path to the quantized model file and its measured memory footprint

Read these only when relevant:
- references/gguf-quant-types.md
- references/gptq-awq-comparison.md
- references/bitsandbytes-config.md

Related skills: local-llm, ollama, vllm-serving, llama-cpp

Avoid using a generic calibration set (e.g., c4) for a domain-specific model when domain data is available.