Skill File

Model Recommender

Name: Model Recommender
Author: projectious-work

Recommends the right AI model for a task by scoring models across six dimensions (Reasoning, Engineering, Speed, Breadth, Reliability, Governance) and displaying a spider-chart profile. Use when the user says "which model should I use", "recommend a model for this", "what LLM is best for X", "compare these models", "help me pick an AI model", "which Claude should I use", or "route these tasks to the right model". Also analyzes a task plan and clusters subtasks by model fit, showing which work to do with which model — like Dispatch matching heroes to missions by overlapping skill polygons.

projectious-work · 0 stars · Apr 10, 2026

Occupation
Categories: Debugging

Skill Content

Intro

This skill scores AI models across six capability dimensions, two gate dimensions (Availability and User Access), and per-token pricing, then helps you pick the right model. It has three workflows: Profile View (a spider chart for one or more models, with optional sub-dimension drill-down), Task Router (cluster a task plan and route each cluster to the best-fitting model), and Roster Refresh (live internet research to update benchmark scores and discover new models). A Setup Questionnaire records which models the user can actually access. Gates always apply before capability scoring: a model that the user cannot access, or that is currently down, is excluded no matter how well it scores.
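
The gate-first ordering described above can be sketched in a few lines. This is a minimal illustration, not the skill's implementation: the `Candidate` class, the single aggregate `capability` score, and the `model-a` to `model-d` names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    capability: int        # illustrative aggregate score, 1-5 (hypothetical)
    user_has_access: bool  # User Access gate
    is_available: bool     # Availability gate


def rank_candidates(candidates):
    """Apply both gates first, then rank the survivors by capability."""
    passed = [c for c in candidates if c.user_has_access and c.is_available]
    return sorted(passed, key=lambda c: c.capability, reverse=True)


roster = [
    Candidate("model-a", capability=5, user_has_access=False, is_available=True),
    Candidate("model-b", capability=4, user_has_access=True, is_available=True),
    Candidate("model-c", capability=5, user_has_access=True, is_available=False),
    Candidate("model-d", capability=3, user_has_access=True, is_available=True),
]

ranked = [c.name for c in rank_candidates(roster)]
print(ranked)  # ['model-b', 'model-d']
```

The two top-scoring candidates never reach the ranking step, because each fails one gate.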

Overview

The six dimensions

Every model and every task is evaluated against the same six axes. Each scores 1–5.

Dim	Symbol	What it measures	Sub-dimensions
Reasoning	R	Hard thinking: math, science, logic, novel problem-solving

Score	Label	Meaning
5	Exceptional	Best-in-class or near-best; make this a primary reason to choose the model
4	Strong	Above average; reliable strength, not a risk
3	Moderate	Adequate; not a differentiator; works for routine tasks
2	Limited	Can do it, but expect trade-offs; consider alternatives
1	Minimal	Poor fit; the model is not designed for this; choose differently

Cost and value

Model	Input / 1M tokens	Output / 1M tokens	Value score	G (Governance)
Gemini Flash 2.0	$0.08	$0.30	~43	2
DeepSeek V3	$0.14	$0.28	~50	1
Claude Haiku 4.5	$0.25	$1.25	~11	5
DeepSeek R1	$0.55	$2.19	~15	1
o4-mini	$1.10	$4.40	~5	2
Mistral Large 3	$2.00	$6.00	~3	4
Claude Sonnet 4.6	$3.00	$15.00	~1.5	5
Gemini 2.5 Pro	$1.25	$10.00	~2	2
GPT-4o	$2.50	$10.00	~1.4	2
o3	$10.00	$40.00	~0.5	2
Claude Opus 4.6	$15.00	$75.00	~0.3	5
Llama 3.3 70B	self-hosted	self-hosted	—	5
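
To turn the per-1M prices in the table above into a rough job estimate, multiply the token counts by the price and divide by 1,000,000. The sketch below assumes a hypothetical bulk job; the function name `estimate_cost` and the per-item token counts are illustrative, and only the two price rows are copied from the table.

```python
# Prices in USD per 1M tokens (input, output), copied from the table above.
PRICES = {
    "Claude Haiku 4.5": (0.25, 1.25),
    "Claude Sonnet 4.6": (3.00, 15.00),
}


def estimate_cost(model, input_tokens, output_tokens):
    """Estimated USD cost: tokens x price per 1M / 1,000,000, summed for in and out."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical job: 10,000 items at roughly 200 input and 100 output tokens each.
items = 10_000
estimates = {m: estimate_cost(m, items * 200, items * 100) for m in PRICES}
for model, cost in estimates.items():
    print(f"{model}: ${cost:.2f}")
# Claude Haiku 4.5: $1.75
# Claude Sonnet 4.6: $21.00
```

Even with rough token counts, the ratio between the two estimates is usually what drives the decision.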

MCP tools

Tool	When to use
`list_models()`	First step when access config is unknown; shows what's usable
`query_models(R=4, G=5)`	"Find models with strong reasoning and full privacy"
`get_profile("claude-sonnet-4.6", scope="E")`	Sub-dimension drill-down on Engineering
`compare_models(["claude-sonnet-4.6", "gemini-2.5-pro"], scope="B")`	Side-by-side Breadth sub-dims
`get_pricing(sort_by="value_score")`	Value-for-money ranking
`check_availability()`	Live status before routing a plan
`get_config()` / `set_config(...)`	Show or update user access list
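
The server-side behaviour of these tools is not shown in this document. As a rough local stand-in for what a call such as `query_models(R=4, G=5)` expresses, the sketch below assumes each keyword is a minimum score on that dimension. The function `query_models_sketch`, the `ROSTER` dict, and the `model-x`/`model-y`/`model-z` entries with their scores are hypothetical placeholders.

```python
# Hypothetical in-memory roster; the scores are illustrative placeholders.
ROSTER = {
    "model-x": {"R": 5, "E": 4, "S": 2, "B": 4, "L": 4, "G": 2},
    "model-y": {"R": 4, "E": 5, "S": 3, "B": 4, "L": 4, "G": 5},
    "model-z": {"R": 2, "E": 3, "S": 5, "B": 3, "L": 3, "G": 5},
}


def query_models_sketch(roster, **minimums):
    """Return names of models whose score meets every requested minimum.

    Assumption: each keyword (R, E, S, B, L, G) is treated as a floor.
    """
    unknown = set(minimums) - set("RESBLG")
    if unknown:
        raise ValueError(f"unknown dimensions: {sorted(unknown)}")
    return [
        name
        for name, scores in roster.items()
        if all(scores[dim] >= floor for dim, floor in minimums.items())
    ]


print(query_models_sketch(ROSTER, R=4, G=5))  # ['model-y']
print(query_models_sketch(ROSTER, S=5))       # ['model-z']
```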

Workflow A: Profile View

Model: Claude Sonnet 4.6
Provider: Anthropic · Tier: Frontier mid-size · Updated: 2026-Q1
────────────────────────────────────────────────
  Reasoning    ▓▓▓▓▓▓▓▓░░  4/5  Strong
  Engineering  ▓▓▓▓▓▓▓▓▓▓  5/5  Exceptional
  Speed        ▓▓▓▓▓▓░░░░  3/5  Moderate
  Breadth      ▓▓▓▓▓▓▓▓░░  4/5  Strong
  Reliability  ▓▓▓▓▓▓▓▓░░  4/5  Strong
  Governance   ▓▓▓▓▓▓▓▓▓▓  5/5  Exceptional
────────────────────────────────────────────────
Best for:   Complex coding, code review, refactoring, agentic workflows,
            tasks touching sensitive data or enterprise privacy requirements
Avoid for:  Extreme cost sensitivity at very high volume, real-time <100ms
            UX, native audio/video processing

Engineering sub-dimensions: Claude Sonnet 4.6 vs GPT-4o
─────────────────────────────────────────────────────
Sub-dimension          Sonnet 4.6   GPT-4o
Function codegen            5          4
Repo-scale tasks            5          4
Tool use / BFCL             5          4
Agentic planning            5          3
Codebase navigation         5          4
─────────────────────────────────────────────────────
Top-level E score           5          4
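
The bar rows in the profile above follow a simple pattern: two filled cells per point on a ten-cell bar, followed by the score and its label from the 1-5 scale. A minimal sketch of that rendering is shown here; the helper name `render_row` and the column widths are assumptions inferred from the sample output, not the skill's own code.

```python
LABELS = {5: "Exceptional", 4: "Strong", 3: "Moderate", 2: "Limited", 1: "Minimal"}


def render_row(dimension, score, width=10):
    """Render one profile line; a 4/5 score fills 8 of 10 cells."""
    if score not in LABELS:
        raise ValueError("score must be an integer from 1 to 5")
    filled = score * width // 5
    bar = "▓" * filled + "░" * (width - filled)
    return f"  {dimension:<12} {bar}  {score}/5  {LABELS[score]}"


# Scores taken from the Claude Sonnet 4.6 profile shown above.
sonnet = {
    "Reasoning": 4, "Engineering": 5, "Speed": 3,
    "Breadth": 4, "Reliability": 4, "Governance": 5,
}
for dimension, score in sonnet.items():
    print(render_row(dimension, score))
```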

Workflow B: Task Router

Parse the task list. Accept any format: markdown bullets, numbered list, WorkItems, or a free-form plan.
Score each task against the six dimensions: which dimensions does this task require? Use the Task Scoring Quick-Reference below.
Cluster tasks by their dominant dimension profile. Tasks with the same top-2 dimensions belong in the same cluster. Typical clusters:
- Deep-Think (R+E dominant) — complex architecture, novel algorithms
- Production-Coder (E dominant) — routine implementation, bug fixes
- High-Volume (S dominant) — repetitive generation, bulk transforms
- Long-Context (B dominant) — large codebase sweeps, doc analysis
- Privacy-First (G dominant) — PII, PHI, regulated data, secrets
Recommend a model per cluster from the user's accessible models. Call query_models(R=..., E=..., G=..., apply_user_filter=True) to find the best available match. State the primary recommendation and one fallback.
Theoretical-best hint. After recommending from the user's accessible models, run the same query with apply_user_filter=False. If the theoretical-best model differs from the recommended one, show the hint:
```
⚡ Theoretical best (not in your access): Claude Opus 4.6
   Gap: R:5 E:5 vs your best R:4 E:5 — 1 point on Reasoning matters for
   this cluster (novel algorithm design). Consider adding Opus access if
   the task justifies it → anthropic.com/api
```
Omit the hint if the user's accessible model already matches the theoretical best, or if the gap is only 1 point on a non-dominant dimension.
Output the routing table:

Task Routing Analysis
══════════════════════════════════════════════

Cluster 1 — Deep-Think (Reasoning + Engineering)   [N tasks]
  Profile:    R:5  E:5  S:2  B:3  L:4  G:4
  Your model: Claude Sonnet 4.6  (fallback: Qwen 2.5 Coder 32B self-hosted)
  ⚡ Theoretical best: Claude Opus 4.6 — R gap: 4→5; worth it for novel algorithms
  Tasks:
    • Design consensus algorithm for distributed cache
    • Refactor auth middleware to zero-trust model

Cluster 2 — Production-Coder (Engineering)          [N tasks]
  Profile:    R:3  E:5  S:3  B:3  L:4  G:4
  Your model: Claude Sonnet 4.6  (no gap — this is the theoretical best)
  Tasks:
    • Implement JWT refresh token rotation
    • Add pagination to /api/v2/users endpoint

Cluster 3 — High-Volume (Speed)                     [N tasks]
  Profile:    R:2  E:3  S:5  B:3  L:3  G:3
  Your model: Claude Haiku 4.5  (fallback: Gemini Flash 2.0)
  ⚡ Theoretical best: Gemini Flash 2.0 — slightly cheaper at this volume;
     only matters if processing >10M tokens/month
  Tasks:
    • Generate unit test stubs for all 200 endpoints
    • Reformat 5,000 changelog entries to new template

Cluster 4 — Privacy-First (Governance)              [N tasks]
  Profile:    R:3  E:3  S:3  B:3  L:4  G:5
  Your model: Llama 3.3 70B self-hosted  (no gap — self-hosted is optimal)
  Tasks:
    • Process patient consent records
    • Summarize HIPAA audit logs

══════════════════════════════════════════════
Total: N tasks across M clusters.
Suggested sequence: Cluster 4 → Cluster 1 → Cluster 2 → Cluster 3.
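
Two rules from the Task Router steps lend themselves to a short sketch: grouping tasks by their top-2 dimensions, and deciding when to show the theoretical-best hint. Everything below is one possible reading under stated assumptions. The helper names, the tie-breaking order, and the per-task scores are hypothetical, and for Opus only the R and E values come from the example hint; the rest are placeholders.

```python
from collections import defaultdict

DIMS = "RESBLG"  # R, E, S, B, L (Reliability), G, as in the profiles above


def profile(r, e, s, b, rel, g):
    return dict(zip(DIMS, (r, e, s, b, rel, g)))


def top2(scores):
    """Return the two highest-scoring dimensions; ties break in DIMS order."""
    ranked = sorted(DIMS, key=lambda d: -scores[d])  # sorted() is stable
    return frozenset(ranked[:2])


def cluster_tasks(tasks):
    """Group task names whose profiles share the same top-2 dimensions."""
    clusters = defaultdict(list)
    for name, scores in tasks.items():
        clusters[top2(scores)].append(name)
    return dict(clusters)


def should_show_hint(recommended, theoretical_best, dominant):
    """Omit the hint when the models match, or when the only gap is a
    single point on a dimension that is not dominant for the cluster."""
    rec_name, rec = recommended
    best_name, best = theoretical_best
    if rec_name == best_name:
        return False
    gaps = {d: best[d] - rec[d] for d in DIMS if best[d] > rec[d]}
    only_minor = len(gaps) == 1 and all(
        gap == 1 and dim not in dominant for dim, gap in gaps.items()
    )
    return not only_minor


# Hypothetical per-task scores for three tasks from the routing table.
tasks = {
    "Design consensus algorithm": profile(5, 5, 2, 3, 4, 4),
    "Refactor auth middleware": profile(4, 5, 2, 3, 3, 3),
    "Reformat changelog entries": profile(2, 3, 5, 3, 3, 3),
}
clusters = cluster_tasks(tasks)

sonnet = ("Claude Sonnet 4.6", profile(4, 5, 3, 4, 4, 5))
opus = ("Claude Opus 4.6", profile(5, 5, 3, 4, 4, 5))  # only R and E are from the source
show = should_show_hint(sonnet, opus, dominant={"R", "E"})
print(show)  # True
```

With a Deep-Think cluster the one-point Reasoning gap is on a dominant dimension, so the hint is shown; for a cluster dominated by E and S the same gap would be suppressed.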

Workflow D: Setup Questionnaire

Show current state. Call `get_config()`. If `available_models` is non-empty, say "You currently have [N] models configured. I'll update your settings." If empty, say "Your model access isn't set up yet. Let me walk you through it."
Ask about provider access. One question, accept a free-form answer:

"Which AI providers do you have active API access to? Options: Anthropic, OpenAI, Google (Gemini), Meta/self-hosted (Llama), Mistral, xAI (Grok), Cohere, Alibaba/Qwen, MiniMax, DeepSeek, Microsoft (Phi), or other."
For each confirmed provider, ask about model tiers:
- Anthropic: "Do you have Haiku only, Haiku + Sonnet, or full access (all three)?"
- OpenAI: "Standard (GPT-4o), reasoning tier (o3/o4-mini), or both?"
- Google: "Gemini 2.5 Pro, Flash, or both?"
- Open-source (Llama/Qwen/Phi/Gemma): "Self-hosted or via a third-party API (Together, Groq, etc.)?" — this determines the effective G score.
- Others: accept the provider-level answer.
Data sensitivity floor. One question:

"What's the highest sensitivity of data you typically work with in this project?
(a) Public only: open-source code, public information
(b) Internal / proprietary: code or business data that isn't public
(c) Personal data: names, emails, addresses, any PII
(d) Regulated: HIPAA (medical), financial, legal, or government data"
Map: (a)→G:0, (b)→G:3, (c)→G:4, (d)→G:5.
Budget preference. One question:

"Budget preference?
(a) Cost-first: use the cheapest model that meets the requirement
(b) Balanced: trade off cost and quality
(c) Quality-first: use the best model regardless of cost"
Map: (a)→low, (b)→medium, (c)→high. Set budget_tier.
Exclusions. One question:

"Any models to always exclude? (e.g., 'no DeepSeek', 'nothing from China', 'only Anthropic')"
Parse the response and add the matching models to blocked_models.

Summarise and confirm. Display a brief summary:

Here's your configuration:
  Accessible models: [list]
  Governance floor:  G:4 (personal data)
  Budget:            balanced
  Blocked:           deepseek-v3, deepseek-r1
Apply this? (yes / adjust)

Apply. On confirmation, call set_config(...) with all fields. Then call list_models() and show the effective roster so the user can verify it looks right.
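
The answer mappings from the questionnaire can be captured in a small helper that builds the proposed configuration before anything is written. This is a sketch under assumptions: `build_config` and the `governance_floor` key are hypothetical names, while `available_models`, `budget_tier`, and `blocked_models` are the field names used in this document. It only builds the proposal; applying it still requires the user's confirmation.

```python
# Mappings taken from the questionnaire above.
SENSITIVITY_TO_G_FLOOR = {"a": 0, "b": 3, "c": 4, "d": 5}
BUDGET_TO_TIER = {"a": "low", "b": "medium", "c": "high"}


def build_config(available_models, sensitivity, budget, blocked_models=()):
    """Turn questionnaire answers into a proposed configuration dict."""
    return {
        "available_models": list(available_models),
        "governance_floor": SENSITIVITY_TO_G_FLOOR[sensitivity],  # hypothetical key
        "budget_tier": BUDGET_TO_TIER[budget],
        "blocked_models": sorted(set(blocked_models)),
    }


proposal = build_config(
    ["claude-haiku-4.5", "claude-sonnet-4.6"],
    sensitivity="c",   # personal data
    budget="b",        # balanced
    blocked_models=["deepseek-v3", "deepseek-r1"],
)
print(proposal["governance_floor"], proposal["budget_tier"])  # 4 medium
```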

Task Scoring Quick-Reference

Task contains…	Raise these dimensions
"design", "architect", "algorithm", "prove", "derive", "optimize (complexity)"	R
"implement", "fix", "debug", "refactor", "review code", "write tests", "agentic"	E
"generate N items", "bulk", "batch", "fast", "cheap", "thousands of"	S
"entire codebase", "all files", "image", "screenshot", "audio", "video", "long doc"	B
"exactly", "format must be", "strict schema", "no hallucination", "cite sources"	L
PII, PHI, credentials, regulated, GDPR, HIPAA, internal/confidential	G
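
The quick-reference table can be read as a keyword lookup. The sketch below is a deliberately naive version that matches whole words or phrases only, so it misses inflected forms such as "architecture". The `KEYWORDS` dict and `dimensions_to_raise` are hypothetical names, and the two pattern-style entries ("generate N items", "optimize (complexity)") are left out because they need more than a literal match.

```python
import re

# Literal keywords from the quick-reference table above (lowercased).
KEYWORDS = {
    "R": ["design", "architect", "algorithm", "prove", "derive"],
    "E": ["implement", "fix", "debug", "refactor", "review code", "write tests", "agentic"],
    "S": ["bulk", "batch", "fast", "cheap", "thousands of"],
    "B": ["entire codebase", "all files", "image", "screenshot", "audio", "video", "long doc"],
    "L": ["exactly", "format must be", "strict schema", "no hallucination", "cite sources"],
    "G": ["pii", "phi", "credentials", "regulated", "gdpr", "hipaa", "internal", "confidential"],
}

# One compiled whole-word pattern per dimension.
PATTERNS = {
    dim: re.compile(r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b")
    for dim, words in KEYWORDS.items()
}


def dimensions_to_raise(task):
    """Return the dimension symbols whose keywords appear in the task text."""
    text = task.lower()
    return [dim for dim, pattern in PATTERNS.items() if pattern.search(text)]


print(dimensions_to_raise("Design consensus algorithm for distributed cache"))  # ['R']
print(dimensions_to_raise("Summarize HIPAA audit logs"))                        # ['G']
print(dimensions_to_raise("Fix the prefix parser; output must match a strict schema"))  # ['E', 'L']
```

Keyword hits are only a first pass; a task with no hits still needs a judgment call on its profile.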

Skipping the gate check before routing. Always check User Access and Availability before scoring capability. A model scoring E:5 G:5 is useless if the user's API key is expired or the provider is in a major outage. Call check_availability() and get_config() at the start of any Task Router session, or ask the user explicitly if MCP is unavailable.
Assuming access from context. If the user mentions "Claude" it does not mean they have Opus, Sonnet, and Haiku — they may only have one tier. Ask or call get_config() before recommending a specific tier.
Recommending a model on which the user is rate-limited. Rate limits and quota exhaustion are per-account states, so they do not appear on provider status pages. If the user says "it keeps failing", check for rate-limit errors before re-routing to the same model, and suggest the next model in the fallback chain.
Conflating low cost with high value. DeepSeek models have the highest raw value score but score G:1, which disqualifies them for most enterprise work. Always surface the governance warning alongside the value score when a low-cost model has a G:1 or G:2 rating. Cost optimization within a G floor is the correct framing, not raw cost minimization.
Not telling the user the actual price. When recommending a model for a bulk task ("generate 10,000 descriptions"), compute an estimate: (estimated tokens × price per 1M / 1,000,000). Even a rough estimate ($0.50 vs. $8.00) changes the decision. Use get_pricing() for current rates.
Treating the roster as current truth. Model scores change with every release. The profiles in references/model-profiles.md have a "validated" date. If the user's model is newer or the date is >6 months old, caveat your recommendation and point to Artificial Analysis or LMSYS Arena for live data.
Recommending a model the user can't access. Before recommending o3 or Opus 4.6, ask (or check context) whether the user has access and budget. A perfect score on paper is useless if the quota is exhausted or the model is not provisioned. Offer a concrete fallback every time.
Scoring the task instead of the task cluster. When routing a plan, assess what the cluster as a whole needs, not individual task quirks. One unusual subtask should not pull an entire cluster to a different model.
Ignoring the Governance dimension for "internal" work. Many teams assume internal code is not sensitive. But source code with credentials, unreleased algorithms, or regulated business logic can be just as sensitive as PII. When in doubt, ask whether the organization has an AI data classification policy before routing to non-sovereign models.
Conflating Speed with Engineering. A fast model (S:5) is not necessarily a good coder (E:5). Haiku and Flash are excellent at high-volume, low-complexity code tasks but will struggle with novel architecture or debugging subtle async races. Always check both dimensions before routing engineering work to a speed-optimized model.
Presenting scores as objective benchmarks. The 1-5 scores in this skill are informed calibrations, not direct benchmark readings. When the user needs precision (e.g., choosing between two models scoring 4 vs. 4 on the same dimension), surface the underlying benchmarks (SWE-bench Verified, GPQA Diamond, BFCL v4) and link to current leaderboards. The skill's scores are a starting point, not the final word.
Over-routing to expensive models. The task router should push work toward the cheapest model that meets the required profile, not the highest-scoring model globally. Opus 4.6 is not the right answer for a task that only needs E:3 S:4; that is Haiku or Sonnet territory.
Skipping the sequence suggestion. After clustering, always add a recommended execution sequence. Some clusters are prerequisites for others (e.g., design decisions made with Opus should precede implementation with Sonnet). The sequence is often more valuable than the per-cluster picks.
Calling set_config without user confirmation. set_config overwrites user_config.json in full — it is marked destructiveHint: true. Always show the proposed configuration to the user and wait for explicit approval before calling set_config. Never chain set_config into a batch of other calls.
Treating check_availability failures as errors. check_availability makes live HTTP requests to provider status pages. Network failures, rate limits, and subscription expiry are not detectable from status pages. Treat any failure or non-operational status as unknown, report it to the user, and do not retry more than once without explicit user awareness.
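
Several of the points above reduce to one rule: minimise cost within a governance floor, and surface any cheaper model that the floor excluded. A minimal sketch follows, using rows copied from the pricing table in this document. The function name and the "input plus output price" proxy are assumptions, and capability requirements are ignored here for brevity.

```python
# (input $/1M, output $/1M, G score), copied from the pricing table in this document.
PRICING = {
    "Gemini Flash 2.0": (0.08, 0.30, 2),
    "DeepSeek V3": (0.14, 0.28, 1),
    "Claude Haiku 4.5": (0.25, 1.25, 5),
    "Mistral Large 3": (2.00, 6.00, 4),
    "Claude Sonnet 4.6": (3.00, 15.00, 5),
}


def blended(prices):
    """Crude cost proxy: assumes equal input and output volume."""
    return prices[0] + prices[1]


def cheapest_within_floor(pricing, g_floor):
    """Return (cheapest model meeting the G floor, cheaper models it excluded)."""
    eligible = {m: p for m, p in pricing.items() if p[2] >= g_floor}
    if not eligible:
        return None, []
    pick = min(eligible, key=lambda m: blended(eligible[m]))
    excluded = [
        m for m, p in pricing.items()
        if p[2] < g_floor and blended(p) < blended(pricing[pick])
    ]
    return pick, excluded


pick, excluded = cheapest_within_floor(PRICING, g_floor=4)
print(pick)      # Claude Haiku 4.5
print(excluded)  # ['Gemini Flash 2.0', 'DeepSeek V3']
```

Returning the excluded list makes it easy to show the governance warning next to the cheaper options instead of silently dropping them.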

Dimension trade-off map

Trade-off	Typical tension	Resolution heuristic
Reasoning vs. Speed	o3 vs. Haiku	Choose by task complexity: reasoning chains >5 steps → o3 tier; routine → Haiku
Breadth vs. Governance	Gemini 2.5 Pro vs. Llama	If context >200K AND data is non-sensitive → Gemini; otherwise → Llama + chunking
Engineering vs. Governance	GPT-4o vs. Mistral	If task is routine coding AND data is non-sensitive → GPT-4o; regulated → Mistral or Llama
Speed vs. Reliability	Flash vs. Sonnet	For customer-facing output → Sonnet; for internal draft generation → Flash
Cost vs. Capability	DeepSeek vs. Claude	If data is non-sensitive AND cost is paramount → DeepSeek; otherwise avoid
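
Three of the heuristics in the table translate directly into conditionals. The sketch below encodes them as written; the function and parameter names are hypothetical, and the return values are just the labels used in the table.

```python
def reasoning_vs_speed(reasoning_steps):
    """Reasoning chains longer than 5 steps go to the o3 tier; routine work to Haiku."""
    return "o3 tier" if reasoning_steps > 5 else "Haiku"


def breadth_vs_governance(context_tokens, data_is_sensitive):
    """Gemini only for >200K context on non-sensitive data; otherwise Llama + chunking."""
    if context_tokens > 200_000 and not data_is_sensitive:
        return "Gemini 2.5 Pro"
    return "Llama + chunking"


def speed_vs_reliability(customer_facing):
    """Customer-facing output goes to Sonnet; internal drafts to Flash."""
    return "Sonnet" if customer_facing else "Flash"


print(reasoning_vs_speed(8))                    # o3 tier
print(breadth_vs_governance(500_000, False))    # Gemini 2.5 Pro
print(breadth_vs_governance(500_000, True))     # Llama + chunking
print(speed_vs_reliability(customer_facing=False))  # Flash
```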
