Optimize LLM inference for CPU-only environments — quantization, threading, and memory mapping. Use when running models without GPU, optimizing llama.cpp for CPU, choosing quantization for RAM-constrained systems, or deploying inference on commodity hardware. Do not use for GPU inference (prefer inference-serving) or model selection (prefer model-selection).
Optimize LLM inference for CPU-only environments using quantization, memory mapping, thread tuning, and efficient model selection.
Related skills: inference-serving, model-selection, embeddings-indexing.

## Workflow

1. Check available RAM (`free -h`), CPU cores (`nproc`), and instruction set support (AVX2, AVX-512).
2. Choose a quantization: Q4_K_M for the best quality/size balance, Q3_K_S for extreme compression, Q5_K_M if RAM allows.
3. llama.cpp uses mmap by default. Ensure sufficient virtual memory, and use `--mlock` to pin the model in RAM for consistent performance.
4. Set `-t` to the physical core count (not hyperthreads). On NUMA systems, use `numactl --cpunodebind=0`.
5. Use `-b 512` for prompt processing; reduce for interactive use (`-b 128`).
6. Use `--prompt-cache` to avoid re-processing repeated system prompts.
7. Benchmark with `./llama-bench -m model.gguf -t <threads>` to measure prompt eval and token generation speed.

## Sizing guide

| RAM | Model size | Quantization | Context | Speed (est.) |
|---|---|---|---|---|
| 8 GB | 7B | Q4_K_M (4.1GB) | 2048 | ~10 tok/s |
| 16 GB | 7B | Q6_K (5.5GB) | 8192 | ~15 tok/s |
| 16 GB | 13B | Q4_K_M (7.4GB) | 4096 | ~6 tok/s |
| 32 GB | 30B | Q4_K_M (17GB) | 4096 | ~3 tok/s |
| 64 GB | 70B | Q4_K_M (38GB) | 4096 | ~2 tok/s |
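Before picking a row from the table above, the hardware checks from the workflow (RAM, physical cores, SIMD support) can be scripted. A Linux-only sketch; it assumes `/proc/meminfo`, `/proc/cpuinfo`, and `lscpu` are available:

```shell
# Report RAM in GiB, physical core count, and SIMD flags.
ram_gib=$(awk '/MemTotal/ {printf "%d", $2 / 1048576}' /proc/meminfo)
cores=$(lscpu -p=Core,Socket 2>/dev/null | grep -v '^#' | sort -u | wc -l)
echo "RAM: ${ram_gib} GiB, physical cores: ${cores}"
for flag in avx2 avx512f; do
  # -w matches the flag as a whole word, so avx2 does not match avx512f
  if grep -qw "${flag}" /proc/cpuinfo; then
    echo "${flag}: yes"
  else
    echo "${flag}: no"
  fi
done
```

The physical-core count (unique Core,Socket pairs from `lscpu`) is the value to pass to `-t`; `nproc` alone counts hyperthreads.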
```bash
# Optimal CPU inference (adjust -t to your physical core count)
# -t 8: physical cores; -b 512: prompt-processing batch; --mlock: pin model in RAM
./llama-cli -m model-Q4_K_M.gguf \
  -t 8 \
  -b 512 \
  --ctx-size 4096 \
  --mlock \
  -p "Your prompt here"
```
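On NUMA machines, the thread count and node pinning mentioned in the workflow can be combined. A sketch, assuming `numactl` is installed and the model path is illustrative; it falls back to `nproc` when `lscpu` is unavailable:

```shell
# Derive the physical core count (unique Core,Socket pairs), then pin
# both CPU and memory allocation to NUMA node 0 to avoid cross-node traffic.
phys=$(lscpu -p=Core,Socket 2>/dev/null | grep -v '^#' | sort -u | wc -l)
[ "$phys" -gt 0 ] || phys=$(nproc)
echo "using -t ${phys}"
numactl --cpunodebind=0 --membind=0 \
  ./llama-cli -m model-Q4_K_M.gguf -t "${phys}" --ctx-size 4096 -p "Your prompt here"
```

Pinning memory (`--membind`) matters as much as pinning CPUs: with mmap, model pages allocated on a remote node are read on every token.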
```bash
# Server mode with prompt caching
./llama-server -m model-Q4_K_M.gguf \
  -t 8 \
  --ctx-size 4096 \
  --port 8080 \
  --prompt-cache prompt-cache.bin
```
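Once the server is up, it can be queried over HTTP. A sketch assuming the server above is listening on localhost:8080 and exposes llama.cpp's native `/completion` endpoint:

```shell
# Request 32 tokens of completion from the running llama-server.
curl -s http://localhost:8080/completion \
  -H 'Content-Type: application/json' \
  -d '{"prompt": "Hello", "n_predict": 32}'
```

Because the server holds the model and prompt cache, repeated requests with the same system prompt skip re-processing it.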
## Tips

- Q4_K_M is the sweet spot: measurably better than Q4_0/Q4_1 with a negligible size increase.
- Use `--mlock` only if the entire model fits in RAM; partial mlock causes OOM kills.

## Related skills

- llama-cpp: building and running llama.cpp
- inference-serving: GPU-based serving when hardware is available
- model-selection: choosing appropriately sized models