Detect local hardware (RAM, CPU, GPU/VRAM) and recommend the best-fit local LLM models with optimal quantization, speed estimates, and fit scoring.
Hardware-aware local LLM advisor. Detects your system specs (RAM, CPU, GPU/VRAM) and recommends models that actually fit, with optimal quantization and speed estimates.
Use this skill immediately when the user asks any of:

- "What local models can I run?"
- "Recommend a local coding model"

Also use this skill when the user is configuring `models.providers.ollama` or `models.providers.lmstudio`.

Detect system specs:

```
llmfit --json system
```
Returns JSON with CPU, RAM, GPU name, VRAM, multi-GPU info, and whether memory is unified (Apple Silicon).
Get ranked recommendations:

```
llmfit recommend --json --limit 5
```
Returns the top 5 models ranked by a composite score (quality, speed, fit, context) with optimal quantization for the detected hardware.
Filter by use case:

```
llmfit recommend --json --use-case coding --limit 3
llmfit recommend --json --use-case reasoning --limit 3
llmfit recommend --json --use-case chat --limit 3
```
Valid use cases: general, coding, reasoning, chat, multimodal, embedding.
Filter by minimum fit level:

```
llmfit recommend --json --min-fit good --limit 10
```
Valid fit levels (best to worst): perfect, good, marginal.
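The `--min-fit` filter keeps only models at or above the given fit level. A minimal sketch of that ordering logic, assuming the three levels from above; the comparison below is an illustration, not llmfit's actual implementation:

```python
# Fit levels from best to worst, per the doc. Lower index = better fit.
FIT_ORDER = ["perfect", "good", "marginal"]

def meets_min_fit(fit_level: str, min_fit: str) -> bool:
    """True if fit_level is at least as good as min_fit."""
    return FIT_ORDER.index(fit_level.lower()) <= FIT_ORDER.index(min_fit.lower())

print(meets_min_fit("Perfect", "good"))   # True
print(meets_min_fit("marginal", "good"))  # False
```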
Example output of `llmfit --json system`:

```json
{
  "system": {
    "cpu_name": "Apple M2 Max",
    "cpu_cores": 12,
    "total_ram_gb": 32.0,
    "available_ram_gb": 24.5,
    "has_gpu": true,
    "gpu_name": "Apple M2 Max",
    "gpu_vram_gb": 32.0,
    "gpu_count": 1,
    "backend": "Metal",
    "unified_memory": true
  }
}
```
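As a rough intuition for why quantization choice matters against these numbers, here is a back-of-the-envelope memory check. The bits-per-weight figures and the 1.2x overhead factor are assumptions for illustration only; llmfit's real fit scoring is more detailed:

```python
# Approximate bits per weight for common GGUF quantizations (assumed values).
BITS_PER_WEIGHT = {"Q4_K_M": 4.5, "Q5_K_M": 5.5, "Q8_0": 8.5, "F16": 16.0}

def est_memory_gb(params_b: float, quant: str, overhead: float = 1.2) -> float:
    """Rough memory footprint: weights plus a flat overhead factor
    for KV cache and runtime buffers (illustrative, not exact)."""
    bytes_per_weight = BITS_PER_WEIGHT[quant] / 8
    return params_b * bytes_per_weight * overhead

needed = est_memory_gb(8.0, "Q5_K_M")  # ~6.6 GB for an 8B model
print(needed <= 32.0)                  # True: fits easily in 32 GB unified memory
```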
Each model in the `models` array includes:

| Field | Meaning |
|---|---|
| `name` | HuggingFace model ID (e.g. `meta-llama/Llama-3.1-8B-Instruct`) |
| `provider` | Model provider (Meta, Alibaba, Google, etc.) |
| `params_b` | Parameter count in billions |
| `score` | Composite score 0–100 (higher is better) |
| `score_components` | Breakdown: quality, speed, fit, context (each 0–100) |
| `fit_level` | Perfect, Good, Marginal, or TooTight |
| `run_mode` | GPU, CPU+GPU Offload, or CPU |
| `category` | Model category (e.g. Reasoning, Coding, Chat, Embedding) |
| `is_moe` | Whether the model uses a Mixture of Experts architecture |
| `parameter_count` | Human-readable parameter count string (e.g. "7.6B") |
| `notes` | Array of human-readable notes about the recommendation |
| `best_quant` | Optimal quantization for the hardware (e.g. Q5_K_M, Q4_K_M) |
| `estimated_tps` | Estimated tokens per second |
| `memory_required_gb` | VRAM/RAM needed at this quantization |
| `memory_available_gb` | Available VRAM/RAM detected |
| `utilization_pct` | How much of available memory the model uses |
| `use_case` | What the model is designed for |
| `context_length` | Maximum context window |
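A minimal sketch of consuming one entry shaped like the field table above. The entry itself is fabricated sample data for illustration, not real llmfit output:

```python
# Fabricated example entry matching the documented fields.
model = {
    "name": "Qwen/Qwen2.5-Coder-7B-Instruct",
    "params_b": 7.6,
    "best_quant": "Q5_K_M",
    "fit_level": "Perfect",
    "memory_required_gb": 5.6,
    "memory_available_gb": 32.0,
    "estimated_tps": 45.0,
}

def utilization_pct(m: dict) -> float:
    """Recompute utilization_pct from the memory fields."""
    return round(100 * m["memory_required_gb"] / m["memory_available_gb"], 1)

print(utilization_pct(model))  # 17.5
```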
After getting recommendations, configure the user's local model provider.
Map the HuggingFace model name to its Ollama tag. Common mappings:
| llmfit name | Ollama tag |
|---|---|
| `meta-llama/Llama-3.1-8B-Instruct` | `llama3.1:8b` |
| `meta-llama/Llama-3.3-70B-Instruct` | `llama3.3:70b` |
| `Qwen/Qwen2.5-Coder-7B-Instruct` | `qwen2.5-coder:7b` |
| `Qwen/Qwen2.5-72B-Instruct` | `qwen2.5:72b` |
| `deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct` | `deepseek-coder-v2:16b` |
| `deepseek-ai/DeepSeek-R1-Distill-Qwen-32B` | `deepseek-r1:32b` |
| `google/gemma-2-9b-it` | `gemma2:9b` |
| `mistralai/Mistral-7B-Instruct-v0.3` | `mistral:7b` |
| `microsoft/Phi-3-mini-4k-instruct` | `phi3:mini` |
| `microsoft/Phi-4-mini-instruct` | `phi4-mini` |
Then update `openclaw.json`:

```json
{
  "models": {
    "providers": {
      "ollama": {
        "models": ["ollama/<ollama-tag>"]
      }
    }
  }
}
```
And optionally set it as the default:

```json
{
  "agents": {
    "defaults": {
      "model": {
        "primary": "ollama/<ollama-tag>"
      }
    }
  }
}
```
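Applying both fragments can be scripted. A minimal sketch that merges them into `openclaw.json`, assuming the file is plain JSON; the `set_local_model` helper and its merge behavior are illustrative, not a prescribed openclaw API:

```python
import json
from pathlib import Path

def set_local_model(config_path: Path, ollama_tag: str, make_default: bool = False) -> dict:
    """Write the ollama provider entry (and optionally the default model)
    into an openclaw.json-style config, preserving other keys."""
    cfg = json.loads(config_path.read_text()) if config_path.exists() else {}
    model_id = f"ollama/{ollama_tag}"
    providers = cfg.setdefault("models", {}).setdefault("providers", {})
    providers.setdefault("ollama", {})["models"] = [model_id]
    if make_default:
        defaults = cfg.setdefault("agents", {}).setdefault("defaults", {})
        defaults.setdefault("model", {})["primary"] = model_id
    config_path.write_text(json.dumps(cfg, indent=2))
    return cfg
```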
For vLLM or LM Studio, use the HuggingFace model name directly as the model identifier with the appropriate provider prefix (`vllm/` or `lmstudio/`).
When a user asks "what local models can I run?":

1. Run `llmfit --json system` to show a hardware summary.
2. Run `llmfit recommend --json --limit 5` to get the top picks.
3. Update `openclaw.json` with the chosen model.

When a user asks for a specific use case like "recommend a coding model", run `llmfit recommend --json --use-case coding --limit 3`.

Keep in mind:

- The `best_quant` field tells you the optimal quantization; higher quants (Q6_K, Q8_0) mean better quality if VRAM allows.
- Speed estimates (`estimated_tps`) are approximate and vary by hardware and quantization.
- Models with `fit_level: "TooTight"` should never be recommended to users.