Measure and optimize the cost/quality curve — which model, prompt, and settings give the best quality per dollar. Covers Pareto analysis, break-even thresholds, and when to spend more vs less. Use this skill when optimizing LLM spend, picking a default model for a feature, or deciding whether a premium model is worth it. Activate when: cost vs quality, model selection, eval cost, Pareto frontier, cheaper model, premium model tradeoff.
Quality without cost context is half a decision. You need the Pareto frontier — for each quality bar, what's the cheapest config that hits it?
Plot each candidate config (model × prompt × settings) on quality (y-axis) vs cost per request (x-axis). The frontier is the set of configs where no other config is both cheaper AND better.
Any config NOT on the frontier is dominated — some other config is at least as cheap and at least as good, and strictly better on one axis. Drop it.
```
quality
 ↑
1 |                               *A (opus + thinking)
  |                    *B (opus)
  |         *D (sonnet + few-shot)
  |     *C (sonnet)
  |         *G
  |  *E (haiku)
0 |  *F
  +-------------------------------→ cost
```
Pareto: A, B, D, C, E. Dominated: F (worse than E at same cost), G (worse than D at same cost).
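A minimal sketch of the frontier computation — config names and numbers below are illustrative, not measurements:

```javascript
// A config is dominated when another is at least as cheap and at least as
// good, and strictly better on one axis. The frontier is everything else.
function paretoFrontier(configs) {
  return configs.filter((a) =>
    !configs.some((b) =>
      b !== a &&
      b.cost <= a.cost && b.quality >= a.quality &&
      (b.cost < a.cost || b.quality > a.quality)
    )
  );
}

// Illustrative numbers only — measure your own workload.
const configs = [
  { name: "E", cost: 0.001, quality: 0.70 }, // haiku
  { name: "C", cost: 0.006, quality: 0.84 }, // sonnet
  { name: "D", cost: 0.008, quality: 0.88 }, // sonnet + few-shot
  { name: "B", cost: 0.030, quality: 0.92 }, // opus
  { name: "A", cost: 0.060, quality: 0.95 }, // opus + thinking
  { name: "F", cost: 0.001, quality: 0.55 }, // dominated by E
  { name: "G", cost: 0.008, quality: 0.80 }, // dominated by D (and C)
];

const frontier = paretoFrontier(configs).map((c) => c.name).sort();
// → ["A", "B", "C", "D", "E"]
```

The O(n²) filter is fine here — you rarely have more than a few dozen candidate configs.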
For each candidate, measure:
| Metric | Example |
|---|---|
| Input tokens / request | 2,500 |
| Output tokens / request | 400 |
| $ / request | $0.012 |
| Quality score | 0.87 |
| p95 latency | 1.8s |
```javascript
// Rates are $ per million tokens (MTok) for the model being measured.
const costPerRequest =
  (usage.input_tokens / 1e6) * inputRate +
  (usage.output_tokens / 1e6) * outputRate +
  (usage.cache_creation_input_tokens / 1e6) * cacheWriteRate +
  (usage.cache_read_input_tokens / 1e6) * cacheReadRate;
```
Always include cache costs — on cache-heavy workloads they are often the largest line item.
For any feature, try at least:
- Haiku with a solid prompt
- Sonnet
- Sonnet + few-shot examples
- Opus
- Opus + extended thinking
One of these usually sits on the frontier for your workload. Don't assume — measure.
Before jumping to a bigger model, try prompt levers:
- Few-shot examples of good outputs
- Tighter task instructions and an explicit output format
- Extended thinking, where the model supports it
A better prompt on Haiku can beat a mediocre prompt on Sonnet — and cost 10× less.
When considering an upgrade, compute when it pays off:
- Cost increase per request: Δcost = cost_new − cost_old
- Quality increase: Δquality = quality_new − quality_old
- Value per quality point: V (estimated from business metrics)
- Worth it if: Δquality × V > Δcost
Example: If every 1% quality gain increases user retention revenue by $0.003/request, and upgrading Haiku→Sonnet costs +$0.002/request for +5% quality: 5 × $0.003 = $0.015 > $0.002 — the upgrade pays for itself roughly 7× over.
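The break-even check as code — the value per quality point is an estimate you must supply from your own business metrics; the numbers here just replay the example:

```javascript
// Worth upgrading when the quality gain, priced in dollars, exceeds the
// added per-request cost. valuePerPoint is an assumed business estimate.
function upgradeWorthIt(deltaQualityPoints, valuePerPoint, deltaCost) {
  return deltaQualityPoints * valuePerPoint > deltaCost;
}

// Haiku → Sonnet example: +5 points at $0.003/point vs +$0.002/request.
const worthIt = upgradeWorthIt(5, 0.003, 0.002); // 0.015 > 0.002 → true
```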
You don't have to pick one. Route by difficulty:
```javascript
const difficulty = await classifyDifficulty(query);
const model = difficulty === "simple" ? "claude-haiku-4-5"
            : difficulty === "medium" ? "claude-sonnet-4-6"
            : "claude-opus-4-6";
```
Classification is a cheap Haiku call. Most queries are simple; you save money. Hard queries get the premium treatment.
Measure: does tiered routing actually improve your cost/quality position? Sometimes classification errors wipe out the gains.
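One way to sanity-check routing before shipping it is to estimate the blended cost/quality point under an assumed traffic mix, then compare that point against your single-model configs. All numbers below are assumptions, not measurements:

```javascript
// Assumed traffic mix and per-tier measurements; replace with your own data.
const tiers = [
  { share: 0.70, cost: 0.001, quality: 0.80 }, // simple → haiku
  { share: 0.25, cost: 0.006, quality: 0.88 }, // medium → sonnet
  { share: 0.05, cost: 0.030, quality: 0.93 }, // hard   → opus
];
const classifierCost = 0.0002; // the cheap Haiku classification call

const blendedCost = classifierCost +
  tiers.reduce((sum, t) => sum + t.share * t.cost, 0);    // ≈ $0.0039/request
const blendedQuality =
  tiers.reduce((sum, t) => sum + t.share * t.quality, 0); // ≈ 0.8265
```

If the blended point lands above your single-model frontier, routing helps; if classification errors push real quality below the estimate, it may not.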
Cost-quality isn't enough; latency matters too. Examples where it dominates:
- Interactive chat and autocomplete, where users abandon slow responses
- Voice agents, where turn-taking breaks down past about a second
- Anything inside a request path with a latency SLO
Report 3-tuples: (quality, cost, p95 latency). The frontier in 3D is smaller; choose by which axis has a constraint.
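Dominance generalizes directly to three axes — a sketch, with latency as p95 seconds (numbers are illustrative):

```javascript
// a dominates b when a is no worse on all three axes and strictly better on
// at least one. Higher quality is better; lower cost and latency are better.
function dominates(a, b) {
  const noWorse = a.quality >= b.quality && a.cost <= b.cost && a.p95 <= b.p95;
  const strictlyBetter = a.quality > b.quality || a.cost < b.cost || a.p95 < b.p95;
  return noWorse && strictlyBetter;
}

// Illustrative: the cheaper config wins on cost and latency but not quality,
// so neither dominates — both stay on the 3D frontier.
const cheapFast = { quality: 0.87, cost: 0.012, p95: 1.8 };
const bigSlow   = { quality: 0.92, cost: 0.045, p95: 4.2 };
```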
If you can cache 90% of your input, and cache reads bill at roughly a tenth of the base input rate, effective input cost drops to about 0.9 × 0.1 + 0.1 = 0.19× the uncached cost — enough to move a config onto the frontier.
Decisions made without caching factored in are usually wrong. Re-measure with cache.
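A sketch of the effective-rate math, assuming cache reads bill at ~10% of the base input rate (check your provider's current pricing):

```javascript
// Blend cached and uncached input cost into one effective $/MTok rate.
function effectiveInputRate(baseRate, cachedFraction, cacheReadMultiplier = 0.1) {
  return baseRate * (cachedFraction * cacheReadMultiplier + (1 - cachedFraction));
}

// $3/MTok base with 90% of input cached:
const rate = effectiveInputRate(3.0, 0.9); // 3 × (0.09 + 0.1) ≈ 0.57 $/MTok
```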
Don't eval each config on 10,000 items. Start small:
- Screen every config on 100-300 items
- Drop clearly dominated configs immediately
- Run the full set only on frontier candidates
Saves 10-100× on eval cost.
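Back-of-envelope, with made-up but typical numbers — screen all configs on a small sample, and pay full price only for the survivors:

```javascript
// Assumed: 8 candidate configs, a 10,000-item eval set, 200-item screen.
const configsCount = 8, fullSet = 10000, screenSet = 200;

const naiveCalls  = configsCount * fullSet;   // 80,000 calls: everything at full size
const screenCalls = configsCount * screenSet; //  1,600 calls: screen everything first
const perConfigSavings = fullSet / screenSet; // 50× cheaper per screened config
```

Confirming the winner on the full set adds back one full run, so total savings depend on how many configs survive screening.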