Select the best generation model, embedding model, inference backend, and quantization level based on memory budget and current leaderboard rankings. Requires memory budget from memory-budget skill. Use after memory budget calculation.
Select optimal models and backend using leaderboard rankings filtered by hardware constraints.
Read results/phase1/memory_budget.json and results/phase1/hardware_profile.json.
Also read config/decisions.json for pre-made preferences.
Check current rankings. Look up the latest open-weight model rankings on the leaderboards (e.g., LMArena and Artificial Analysis).
Filter to feasible models. From the memory budget, determine which models fit; account for the quantization level and, for MoE models, the active-parameter count.
Select the top-ranked feasible model. If LMArena and Artificial Analysis disagree, use average rank to break ties.
Check GGUF availability. The selected model must be available in GGUF format for llama.cpp. Search HuggingFace for GGUF quantized versions. If no GGUF exists, move to the next-ranked model.
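The selection steps above can be sketched as a small filter-and-rank routine. This is a hypothetical illustration, not part of the skill: the candidate list, field names, and rank values are made up, and a real run would pull ranks from the leaderboards and GGUF availability from HuggingFace.

```python
def select_model(candidates, budget_gb):
    """Pick the best-ranked candidate that fits the memory budget and has
    a GGUF build, breaking leaderboard disagreement by average rank."""
    feasible = [c for c in candidates if c["memory_required_gb"] <= budget_gb]
    # Lower average of LMArena + Artificial Analysis rank wins.
    feasible.sort(key=lambda c: (c["rank"]["lmarena"]
                                 + c["rank"]["artificial_analysis"]) / 2)
    for c in feasible:
        if c["has_gguf"]:   # no GGUF build: fall through to next-ranked model
            return c
    return None             # nothing feasible: revisit the memory budget

candidates = [
    {"name": "model-a", "memory_required_gb": 40, "has_gguf": True,
     "rank": {"lmarena": 3, "artificial_analysis": 5}},
    {"name": "model-b", "memory_required_gb": 12, "has_gguf": False,
     "rank": {"lmarena": 6, "artificial_analysis": 4}},
    {"name": "model-c", "memory_required_gb": 10, "has_gguf": True,
     "rank": {"lmarena": 8, "artificial_analysis": 8}},
]
# With a 16 GB budget, model-b outranks model-c but lacks GGUF, so model-c wins.
print(select_model(candidates, budget_gb=16)["name"])  # model-c
```

The GGUF fallback lives inside the ranked loop so the next-ranked model is tried automatically, matching the "move to the next-ranked model" rule.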
Consult config/decisions.json for candidate preferences.

Deterministic rules for the backend:
- llama.cpp with CUDA (preferred for GGUF)
- llama.cpp CPU-only (GGUF format)
- vLLM or Ollama if the model is in a supported format

Save the complete selection to results/phase1/selected_config.json:
{
"generation_model": {
"name": "...",
"parameters": "...",
"quantization": "...",
"format": "GGUF",
"gguf_source": "HuggingFace URL",
"memory_required_gb": N,
"moe": true|false,
"active_parameters": "...",
"leaderboard_rank": { "lmarena": N, "artificial_analysis": N }
},
"embedding_model": {
"name": "...",
"memory_required_gb": N,
"mteb_retrieval_rank": N
},
"backend": {
"name": "llama.cpp",
"cuda": true|false,
"build_flags": "..."
},
"memory_summary": {
"total_gb": N,
"generation_model_gb": N,
"embedding_model_gb": N,
"os_reserve_gb": N,
"headroom_gb": N
}
}
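A minimal sketch of writing the file in the shape above. Every value here is a placeholder (the model names, ranks, sizes, and the build flag are assumptions for illustration, not a real selection):

```python
import json
import pathlib

config = {
    "generation_model": {
        "name": "example-model", "parameters": "8B", "quantization": "Q4_K_M",
        "format": "GGUF", "gguf_source": "https://huggingface.co/...",
        "memory_required_gb": 6, "moe": False, "active_parameters": "8B",
        "leaderboard_rank": {"lmarena": 1, "artificial_analysis": 1},
    },
    "embedding_model": {
        "name": "example-embed", "memory_required_gb": 1,
        "mteb_retrieval_rank": 1,
    },
    "backend": {
        # Flag assumed for a CUDA llama.cpp build; verify against its docs.
        "name": "llama.cpp", "cuda": True, "build_flags": "-DGGML_CUDA=ON",
    },
    "memory_summary": {
        "total_gb": 16, "generation_model_gb": 6, "embedding_model_gb": 1,
        "os_reserve_gb": 4, "headroom_gb": 5,
    },
}

out = pathlib.Path("results/phase1/selected_config.json")
out.parent.mkdir(parents=True, exist_ok=True)  # create results/phase1 if absent
out.write_text(json.dumps(config, indent=2))
```

Writing all four top-level keys in one file keeps the downstream phases to a single read.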
IMPORTANT: After saving, STOP and print the full selection summary. Wait for human approval before downloading models or proceeding to Phase 2.