Optimize LLM inference for CPU-only environments — quantization, threading, and memory mapping. Use when running models without GPU, optimizing llama.cpp for CPU, choosing quantization for RAM-constrained systems, or deploying inference on commodity hardware. Do not use for GPU inference (prefer inference-serving) or model selection (prefer model-selection).
Optimize LLM inference for CPU-only environments using quantization, memory mapping, thread tuning, and efficient model selection.
Related skills: inference-serving, model-selection, embeddings-indexing.

## Workflow

1. Check available RAM (`free -h`), CPU cores (`nproc`), and instruction set support (AVX2, AVX-512).
2. Choose a quantization: Q4_K_M for the best quality/size balance, Q3_K_S for extreme compression, Q5_K_M if RAM allows.
3. llama.cpp uses mmap by default. Ensure sufficient virtual memory, and use `--mlock` to pin the model in RAM for consistent performance.
4. Set `-t` to the physical core count (not hyperthreads). On NUMA systems, use `numactl --cpunodebind=0`.
5. Use `-b 512` for prompt processing; reduce for interactive use (`-b 128`).
6. Use `--prompt-cache` to avoid re-processing repeated system prompts.
7. Benchmark with `./llama-bench -m model.gguf -t <threads>` to measure prompt eval and token generation speed.

## Sizing guide

| RAM | Model size | Quantization | Context | Speed (est.) |
|---|---|---|---|---|
| 8 GB | 7B | Q4_K_M (4.1GB) | 2048 | ~10 tok/s |
| 16 GB | 7B | Q6_K (5.5GB) | 8192 | ~15 tok/s |
| 16 GB | 13B | Q4_K_M (7.4GB) | 4096 | ~6 tok/s |
| 32 GB | 30B | Q4_K_M (17GB) | 4096 | ~3 tok/s |
| 64 GB | 70B | Q4_K_M (38GB) | 4096 | ~2 tok/s |
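Before picking a row from the table above, the hardware checks from the workflow (RAM, physical cores, SIMD support) can be scripted. A Linux-only sketch; it assumes `/proc/meminfo`, `/proc/cpuinfo`, and `lscpu` are available:

```shell
# Report RAM in GiB, physical core count, and SIMD flags.
ram_gib=$(awk '/MemTotal/ {printf "%d", $2 / 1048576}' /proc/meminfo)
cores=$(lscpu -p=Core,Socket 2>/dev/null | grep -v '^#' | sort -u | wc -l)
echo "RAM: ${ram_gib} GiB, physical cores: ${cores}"
for flag in avx2 avx512f; do
  # -w matches the flag as a whole word, so avx2 does not match avx512f
  if grep -qw "${flag}" /proc/cpuinfo; then
    echo "${flag}: yes"
  else
    echo "${flag}: no"
  fi
done
```

The physical-core count (unique Core,Socket pairs from `lscpu`) is the value to pass to `-t`; `nproc` alone counts hyperthreads.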
```bash
# Optimal CPU inference (adjust -t to your physical core count)
# -t 8: physical cores; -b 512: prompt-processing batch; --mlock: pin model in RAM
./llama-cli -m model-Q4_K_M.gguf \
  -t 8 \
  -b 512 \
  --ctx-size 4096 \
  --mlock \
  -p "Your prompt here"
```
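On NUMA machines, the thread count and node pinning mentioned in the workflow can be combined. A sketch, assuming `numactl` is installed and the model path is illustrative; it falls back to `nproc` when `lscpu` is unavailable:

```shell
# Derive the physical core count (unique Core,Socket pairs), then pin
# both CPU and memory allocation to NUMA node 0 to avoid cross-node traffic.
phys=$(lscpu -p=Core,Socket 2>/dev/null | grep -v '^#' | sort -u | wc -l)
[ "$phys" -gt 0 ] || phys=$(nproc)
echo "using -t ${phys}"
numactl --cpunodebind=0 --membind=0 \
  ./llama-cli -m model-Q4_K_M.gguf -t "${phys}" --ctx-size 4096 -p "Your prompt here"
```

Pinning memory (`--membind`) matters as much as pinning CPUs: with mmap, model pages allocated on a remote node are read on every token.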
```bash
# Server mode with prompt caching
./llama-server -m model-Q4_K_M.gguf \
  -t 8 \
  --ctx-size 4096 \
  --port 8080 \
  --prompt-cache prompt-cache.bin
```
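Once the server is up, it can be queried over HTTP. A sketch assuming the server above is listening on localhost:8080 and exposes llama.cpp's native `/completion` endpoint:

```shell
# Request 32 tokens of completion from the running llama-server.
curl -s http://localhost:8080/completion \
  -H 'Content-Type: application/json' \
  -d '{"prompt": "Hello", "n_predict": 32}'
```

Because the server holds the model and prompt cache, repeated requests with the same system prompt skip re-processing it.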
## Tips

- Q4_K_M is the sweet spot: measurably better than Q4_0/Q4_1 with a negligible size increase.
- Use `--mlock` only if the entire model fits in RAM; partial mlock causes OOM kills.

## Related skills

- llama-cpp: building and running llama.cpp
- inference-serving: GPU-based serving when hardware is available
- model-selection: choosing appropriately sized models