Compile and run llama.cpp for local inference — GGUF quantization, context sizing, and GPU offloading. Use when building llama.cpp from source, converting models to GGUF, configuring n_gpu_layers for partial offload, or tuning context size and batch parameters. Do not use for vLLM/TGI serving (prefer inference-serving) or model selection (prefer model-selection).
Compile, quantize, and run models with llama.cpp for efficient local inference on CPU, GPU, or hybrid setups.
Covers build configuration, GGUF conversion, quantization levels, and GPU offloading (`-ngl` / `n_gpu_layers`) for hybrid CPU/GPU setups. Related skills: `inference-serving` (vLLM), `model-selection`, `agent-memory`.

## Quick start

1. Clone and enter the repo: `git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp`.
2. Configure the build for your backend: `cmake -B build -DGGML_CUDA=ON` (NVIDIA), `-DGGML_METAL=ON` (Apple), or `-DGGML_BLAS=ON` (CPU).
3. Convert a Hugging Face model to GGUF: `python convert_hf_to_gguf.py <model-dir> --outfile model.gguf --outtype f16`.
4. Quantize: `./llama-quantize model.gguf model-Q4_K_M.gguf Q4_K_M`. Choose the quant level by quality/size tradeoff.
5. Run interactively: `./llama-cli -m model-Q4_K_M.gguf -p "prompt" -n 256 -ngl 35 --ctx-size 4096`.
6. Serve an OpenAI-compatible API: `./llama-server -m model.gguf --port 8080 -ngl 35 --ctx-size 8192`.
7. Tune with `-t` (threads), `-b` (batch size), `-ngl` (GPU layers), and `--ctx-size`.
8. Benchmark with `./llama-bench -m model.gguf -ngl 35` to measure tokens/sec.

## Quantization levels

| Quant | Bits | Size (7B) | Quality | Use case |
|---|---|---|---|---|
| Q2_K | 2-3 | ~2.7 GB | Low | Extreme compression |
| Q4_K_M | 4 | ~4.1 GB | Good | Best balance |
| Q5_K_M | 5 | ~4.8 GB | Very good | Quality priority |
| Q6_K | 6 | ~5.5 GB | Near-fp16 | Max local quality |
| Q8_0 | 8 | ~7.2 GB | Excellent | If VRAM allows |
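The sizes in the table follow roughly from parameter count times effective bits per weight. A minimal sketch of that estimate, assuming ~4.5 effective bits/weight for Q4_K_M (K-quants mix block types, so this is an approximation, and file sizes run slightly higher due to metadata and non-quantized tensors):

```shell
#!/bin/sh
# Rough GGUF size estimate: params * effective_bits / 8.
# 4.5 bpw for Q4_K_M is an assumed average, not an exact figure;
# treat the result as a floor (real files carry extra tensors/metadata).
params=7000000000   # 7B model
bpw=4.5             # assumed effective bits/weight for Q4_K_M
awk -v p="$params" -v b="$bpw" \
  'BEGIN { printf "%.1f GB\n", p * b / 8 / 1e9 }'   # prints 3.9 GB
```

The ~0.2 GB gap to the table's ~4.1 GB figure is the embedding/output tensors and metadata that quantization does not shrink as aggressively.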
## GPU offloading

```bash
# Full GPU offload (all layers on GPU)
./llama-cli -m model.gguf -ngl 99 --ctx-size 4096

# Partial offload (first 20 layers on GPU, rest on CPU)
./llama-cli -m model.gguf -ngl 20 --ctx-size 4096

# CPU only
./llama-cli -m model.gguf -ngl 0 -t 8 --ctx-size 2048
```
## Best practices

- Use Q4_K_M as the default quantization: the best quality/size tradeoff.
- Set `-ngl` to as many layers as GPU VRAM allows; even partial offload helps significantly.
- Budget memory for context: `--ctx-size 4096` needs ~2 GB extra for a 7B model.
- Set `-t` equal to the physical core count (not hyperthreads) for CPU inference.

## Related skills

- `inference-serving`: production GPU serving with vLLM
- `offline-cpu-inference`: CPU-only optimization strategies
- `model-selection`: choosing which model to quantize
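The ~2 GB context figure follows from the KV-cache formula: 2 (K and V) × layers × context × KV heads × head dim × bytes per element. A sketch assuming a Llama-2-7B-style architecture (32 layers, 32 KV heads, head dim 128) with an fp16 cache; models using grouped-query attention have fewer KV heads and a proportionally smaller cache:

```shell
#!/bin/sh
# KV-cache size = 2 * layers * ctx * kv_heads * head_dim * bytes_per_elem.
# Architecture numbers below assume Llama-2-7B with fp16 cache;
# GQA models (fewer kv_heads) need proportionally less.
layers=32; ctx=4096; kv_heads=32; head_dim=128; bytes=2
awk -v l="$layers" -v c="$ctx" -v k="$kv_heads" -v d="$head_dim" -v b="$bytes" \
  'BEGIN { printf "%.1f GiB\n", 2 * l * c * k * d * b / 1073741824 }'
```

This yields 2.0 GiB at 4096 context, matching the rule of thumb above; doubling `--ctx-size` doubles the cache.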