Name: Model Inference
Author: K4M1coder

Method	Precision	Quality Impact	Speed	Hardware	Calibration
FP16/BF16	16-bit	Negligible	2× vs FP32	Any GPU (native BF16: NVIDIA Ampere+)	None
INT8 (torchao)	W8A8	< 1%	2-3× vs FP16	Ampere+	Activation stats
GPTQ	W4A16	1-3%	2-4× vs FP16	GPU only	Calibration dataset
AWQ	W4A16	< 1%	2-4× vs FP16	GPU only	Activation-aware
GGUF	2-8 bit	Varies	CPU-friendly	CPU / GPU	imatrix calibration
SmoothQuant	W8A8	< 1%	2-3× vs FP16	Ampere+	Smoothing factor
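
To make the "Quality Impact" column concrete, here is a minimal sketch of symmetric per-tensor INT8 round-to-nearest quantization in plain Python. It is a generic illustration only, not the algorithm used by torchao, GPTQ, AWQ, or SmoothQuant, and the function names and sample weights are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize_int8(q, scale):
    """Recover approximate float weights from the integer codes."""
    return [v * scale for v in q]


def max_abs_error(weights):
    """Largest round-trip error introduced by quantize -> dequantize."""
    q, scale = quantize_int8(weights)
    restored = dequantize_int8(q, scale)
    return max(abs(a - b) for a, b in zip(weights, restored))


# Hypothetical weights, chosen only to show the rounding behaviour.
weights = [0.024, -0.513, 0.337, 1.27, -0.081]
codes, scale = quantize_int8(weights)
print("codes:", codes)
print("scale:", scale)
print("max round-trip error:", max_abs_error(weights))
```

With a single scale per tensor, the round-trip error of any weight is bounded by half the scale, so one large outlier coarsens the grid for every other weight. Calibration-based methods such as those in the table aim to reduce that effect.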

Framework	Best For	Key Features
vLLM	LLM serving (high throughput)	PagedAttention, continuous batching, speculative decoding
TGI	HuggingFace models	Tensor parallel, quantization, streaming
Triton Inference Server	Multi-framework, multi-model	Dynamic batching, model ensembles, NVIDIA-maintained
FastAPI + torch	Custom models	Full control, WebSocket streaming
ONNX Runtime	Cross-platform	CPU/GPU/NPU, mobile, edge
Rust / Candle	Low-latency production	Moshi backend pattern
MLX	Apple Silicon	Native Metal acceleration
llama.cpp	CPU/edge	GGUF format, cross-platform
DeepSpeed-Inference (MoE)	MoE model serving	Expert slicing across GPUs, process-group all-to-all dispatch. See `skills/cutting-edge-architectures/references/moe-sparse-routing.md`

Deployment Envelope	Recommended Starting Points	Detailed Reference
CPU / mobile	LFM2.5-1.2B-Thinking, Ministral 3 3B, Tiny Aya	`../_shared/references/models/edge-small.md`
1 x 16-24 GB GPU	DeepSeek-R1 distill 7B / 14B, TranslateGemma 12B, Olmo 3 7B	`../_shared/references/llm-landscape.md`
1 x 48-80 GB GPU	Gemma 4 31B, Olmo 3.1 32B, Nemotron-3-Nano-30B-A3B, GLM-4.7-Flash	`../_shared/references/llm-landscape.md`
2-4 x 80 GB GPUs	MiniMax M2.5 quantized, Qwen3-Coder-Next, Devstral 2	`../_shared/references/llm-landscape.md`
Cluster	DeepSeek V3.2, Kimi K2.5, GLM-5, Qwen3.5 397B-A17B	`../_shared/references/llm-landscape.md`
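
A rough way to place a model in one of these envelopes is to estimate the memory its weights need at a given precision. The sketch below assumes weights dominate memory and reserves a hypothetical 20% headroom for activations and KV cache. The function names and the headroom value are illustrative assumptions, not part of the referenced guides.

```python
def weight_memory_gb(params_billions, bits_per_weight):
    """Approximate memory for weights alone, in GB (1 GB = 1e9 bytes)."""
    # params * bits / 8 gives bytes; the 1e9 factors cancel for billions -> GB.
    return params_billions * bits_per_weight / 8


def fits(params_billions, bits_per_weight, gpu_gb, gpu_count=1, headroom=0.8):
    """True if the weights fit within `headroom` of the total GPU memory.

    `headroom` is an assumed rule of thumb that leaves room for
    activations and the KV cache; tune it for the actual workload.
    """
    budget_gb = gpu_gb * gpu_count * headroom
    return weight_memory_gb(params_billions, bits_per_weight) <= budget_gb


print(weight_memory_gb(7, 16))   # 7B model in FP16/BF16
print(weight_memory_gb(32, 4))   # 32B model at 4-bit
print(fits(7, 16, gpu_gb=24))    # 7B FP16 on one 24 GB GPU
print(fits(32, 16, gpu_gb=24))   # 32B FP16 on one 24 GB GPU
```

Under these assumptions a 7B model in 16-bit needs about 14 GB for weights and fits a single 24 GB GPU, while a 32B model in 16-bit does not and needs either quantization or a larger envelope.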

Input Audio → Ring Buffer → Mimi Encoder (12.5 Hz) → LM (Moshi) → Mimi Decoder → Output Audio
                                                   ↕
                                              KV Cache (per session)
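
The ring-buffer stage of this pipeline can be sketched as follows. The 12.5 Hz frame rate comes from the diagram. The 24 kHz sample rate is an assumption not stated above, and the class name and `max_frames` bound are hypothetical.

```python
from collections import deque

SAMPLE_RATE_HZ = 24_000  # assumption: not stated in the diagram
FRAME_RATE_HZ = 12.5     # from the diagram (encoder frame rate)
FRAME_SAMPLES = int(SAMPLE_RATE_HZ / FRAME_RATE_HZ)  # samples per 80 ms frame


class FrameRingBuffer:
    """Accumulates audio samples and releases fixed-size frames for the encoder."""

    def __init__(self, frame_samples=FRAME_SAMPLES, max_frames=8):
        self.frame_samples = frame_samples
        # Bounded deque: once full, the oldest samples are silently dropped,
        # so a slow consumer loses old audio instead of growing memory.
        self._buf = deque(maxlen=frame_samples * max_frames)

    def push(self, samples):
        """Append newly captured samples (any chunk size)."""
        self._buf.extend(samples)

    def pop_frames(self):
        """Yield every complete frame currently buffered, oldest first."""
        while len(self._buf) >= self.frame_samples:
            yield [self._buf.popleft() for _ in range(self.frame_samples)]


buf = FrameRingBuffer()
buf.push([0.0] * 2500)  # an arbitrary capture chunk, larger than one frame
frames = list(buf.pop_frames())
print(len(frames), "frame(s) ready;", len(buf._buf), "samples still buffered")
```

Decoupling the capture chunk size from the encoder frame size lets the audio callback push whatever the device delivers while the model always sees whole frames.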

Project	Framework	Optimization	Notes
Moshi (Rust)	Candle	Custom CUDA kernels	WebSocket streaming
Moshi (Python)	PyTorch + torchao	INT8 quantization	`export_quantized.py`
Moshi (MLX)	MLX	Apple Silicon native	`moshi_mlx/`
Pocket-TTS	PyTorch / ONNX	CPU-targeted	Lightweight
Unmute	Docker + FastAPI	Multi-model routing	Docker Compose

Model: [name, params, format]
Hardware: [GPU model, count, driver version]
Quantization: [method, calibration details]
Results:
  Latency (p50): X ms
  Latency (p95): X ms
  Latency (p99): X ms
  Throughput: X req/s
  Memory: X GB
  Quality: [metric name] = X (vs baseline Y, delta Z%)
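
One possible way to fill in the latency and throughput fields is sketched below. It times sequential calls to a request function and reports nearest-rank percentiles. `send_request` is a hypothetical stand-in for a real inference call, and sequential timing does not capture behaviour under concurrent load.

```python
import math
import time


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an ascending, non-empty list (0 < pct <= 100)."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def benchmark(fn, n_requests=100):
    """Call `fn` sequentially and summarise per-call latency and throughput."""
    latencies_ms = []
    start = time.perf_counter()
    for _ in range(n_requests):
        t0 = time.perf_counter()
        fn()
        latencies_ms.append((time.perf_counter() - t0) * 1000)
    elapsed_s = time.perf_counter() - start
    latencies_ms.sort()
    return {
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
        "p99_ms": percentile(latencies_ms, 99),
        "throughput_rps": n_requests / elapsed_s,
    }


def send_request():
    """Hypothetical placeholder: replace with a real model or HTTP call."""
    sum(range(1000))


for name, value in benchmark(send_request, n_requests=200).items():
    print(f"{name}: {value:.3f}")
```

Memory and quality still have to be measured separately, for example from the serving framework's own metrics and an evaluation set.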

Concept	Description
Quantization	Reduce numeric precision (FP32 → INT8/INT4) to save memory and speed up inference
KV Cache	Cache each layer's key/value tensors so autoregressive generation does not recompute attention inputs for earlier tokens
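
The KV cache grows linearly with sequence length and batch size, which is why it often limits how many sessions fit on one GPU. The sketch below applies the standard size formula to a hypothetical model configuration. The layer, head, and dimension values are illustrative and do not describe any model named in this document.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Bytes needed to cache keys and values for `seq_len` tokens.

    The leading factor 2 counts one key tensor and one value tensor per layer.
    bytes_per_value=2 corresponds to FP16/BF16 cache entries.
    """
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Hypothetical configuration: 32 layers, 8 KV heads of dimension 128.
per_token = kv_cache_bytes(32, 8, 128, seq_len=1)
full_context = kv_cache_bytes(32, 8, 128, seq_len=4096)

print(per_token / 1024, "KiB per token")
print(full_context / 2**20, "MiB for a 4096-token session")
```

For this configuration each cached token costs 128 KiB, so a single 4096-token session holds 512 MiB of cache on top of the model weights.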

Model Inference

When to Use

Core Concepts

Procedure

Phase 1: Model Profiling

Phase 2: Quantization

Phase 3: Serving Framework Selection

Phase 3.5: Open-Weight Family Routing

Phase 4: Streaming (Audio/Real-time)

Phase 5: Deployment

Kyutai Open-Source Reference — Inference

Benchmark Template
