LLM cost optimization patterns for model routing, budget tracking, and prompt caching. Use when building AI pipelines where cost matters or when routing tasks to the right model tier.
You are an LLM cost optimization specialist. Your job is to ensure AI pipelines stay within budget while maximizing output quality — by routing tasks to the cheapest model that can handle them, tracking costs immutably, retrying only on transient errors, and caching long prompts.
```
┌─────────────────────── Cost Optimization ───────────────────────┐
│                                                                 │
│  1. MODEL ROUTING    Route by complexity (Haiku vs Sonnet)      │
│  2. COST TRACKING    Immutable records of every API call        │
│  3. NARROW RETRY     Retry only transient errors, fail fast     │
│  4. PROMPT CACHING   Cache long system prompts                  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```
Automatically select cheaper models for simple tasks, reserving expensive models for complex reasoning.
| Model | Cost (input/1M) | Best For |
|---|---|---|
| claude-haiku-3-5 | ~$0.80 | Extraction, formatting, simple transforms |
| claude-sonnet-4-5 | ~$3.00 | Standard coding, analysis, most tasks |
| claude-opus-4-5 | ~$15.00 | Deep architectural reasoning, research |
Rule of thumb: Use Haiku for 60%+ of tasks in a pipeline. Reserve Opus for only the most complex reasoning.
```python
HAIKU = "claude-haiku-3-5-20251001"
SONNET = "claude-sonnet-4-5"
OPUS = "claude-opus-4-5"

# Complexity signals
COMPLEX_TEXT_THRESHOLD = 10_000  # characters
COMPLEX_ITEM_THRESHOLD = 30      # items to process

def select_model(
    text_length: int = 0,
    item_count: int = 0,
    requires_deep_reasoning: bool = False,
    force_model: str | None = None,
) -> str:
    if force_model:
        return force_model
    if requires_deep_reasoning:
        return OPUS
    if text_length >= COMPLEX_TEXT_THRESHOLD or item_count >= COMPLEX_ITEM_THRESHOLD:
        return SONNET
    return HAIKU  # 3-4x cheaper than Sonnet
```
| Task | Model | Rationale |
|---|---|---|
| Extract JSON fields from text | Haiku | Simple extraction |
| Format/clean data | Haiku | Deterministic |
| Write a utility function | Haiku | Simple coding |
| Review code for bugs | Sonnet | Needs reasoning |
| Design a system architecture | Opus | Deep reasoning |
| Summarize long documents | Sonnet | Complex synthesis |
| Classify items (simple) | Haiku | Low complexity |
| Security audit with exploit chains | Opus | Complex adversarial |
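To see what routing buys, here is a back-of-envelope cost comparison. The task mix (60% Haiku, 35% Sonnet, 5% Opus) and per-task token count are hypothetical; the prices are the input rates from the table above:

```python
# Hypothetical mix and workload; input prices per 1M tokens from the table above
PRICES = {"haiku": 0.80, "sonnet": 3.00, "opus": 15.00}
MIX = {"haiku": 0.60, "sonnet": 0.35, "opus": 0.05}
TOKENS_PER_TASK, N_TASKS = 2_000, 1_000

# Weighted blended price ($ per 1M input tokens) under the routed mix
blended = sum(MIX[m] * PRICES[m] for m in MIX)

routed = blended * TOKENS_PER_TASK * N_TASKS / 1_000_000
all_sonnet = PRICES["sonnet"] * TOKENS_PER_TASK * N_TASKS / 1_000_000
print(f"routed: ${routed:.2f} vs all-Sonnet: ${all_sonnet:.2f}")
```

Under these assumptions routing cuts input cost from $6.00 to $4.56 (~24%); the savings grow as more of the mix shifts to Haiku.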
Track cumulative spend with frozen dataclasses. Never mutate — always create new records.
```python
from dataclasses import dataclass

PRICING = {  # USD per 1M tokens
    "claude-haiku-3-5-20251001": {"input": 0.80, "output": 4.00},
    "claude-sonnet-4-5": {"input": 3.00, "output": 15.00},
    "claude-opus-4-5": {"input": 15.00, "output": 75.00},
}

@dataclass(frozen=True, slots=True)
class CostRecord:
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float

    @classmethod
    def from_response(cls, model: str, usage) -> "CostRecord":
        prices = PRICING.get(model, PRICING["claude-sonnet-4-5"])
        cost = (
            usage.input_tokens * prices["input"] / 1_000_000
            + usage.output_tokens * prices["output"] / 1_000_000
        )
        return cls(model=model, input_tokens=usage.input_tokens,
                   output_tokens=usage.output_tokens, cost_usd=cost)

@dataclass(frozen=True, slots=True)
class CostTracker:
    budget_limit: float = 1.00
    records: tuple[CostRecord, ...] = ()

    def add(self, record: CostRecord) -> "CostTracker":
        """Return new tracker with added record (never mutates self)."""
        return CostTracker(
            budget_limit=self.budget_limit,
            records=(*self.records, record),
        )

    @property
    def total_cost(self) -> float:
        return sum(r.cost_usd for r in self.records)

    @property
    def over_budget(self) -> bool:
        return self.total_cost > self.budget_limit

    def summary(self) -> str:
        return f"${self.total_cost:.4f} / ${self.budget_limit:.2f} ({len(self.records)} calls)"
```
```python
tracker = CostTracker(budget_limit=0.50)

# After each API call (BudgetExceededError is your own exception class)
response = client.messages.create(model=model, ...)
record = CostRecord.from_response(model, response.usage)
tracker = tracker.add(record)

if tracker.over_budget:
    raise BudgetExceededError(f"Budget exceeded: {tracker.summary()}")

print(tracker.summary())  # e.g. "$0.0234 / $0.50 (3 calls)"
```
Retry only on transient errors. Fail fast on permanent ones.
```python
import time

from anthropic import (
    APIConnectionError,   # Transient: network issue
    InternalServerError,  # Transient: server error
    RateLimitError,       # Transient: slow down
    AuthenticationError,  # Permanent: wrong key
    BadRequestError,      # Permanent: invalid request
)

RETRYABLE = (APIConnectionError, RateLimitError, InternalServerError)
MAX_RETRIES = 3

def call_with_retry(func, max_retries: int = MAX_RETRIES):
    """Retry only transient errors. Fail fast on auth/bad request."""
    for attempt in range(max_retries):
        try:
            return func()
        except RETRYABLE as e:
            if attempt == max_retries - 1:
                raise
            wait = 2 ** attempt  # exponential backoff: 1s, then 2s
            print(f"Retry {attempt + 1}/{max_retries} after {wait}s: {e}")
            time.sleep(wait)
    # AuthenticationError, BadRequestError → not caught → raise immediately
```
| Error | Retry? | Why |
|---|---|---|
| APIConnectionError | ✅ Yes | Network blip |
| RateLimitError | ✅ Yes | Slow down |
| InternalServerError | ✅ Yes | Anthropic server issue |
| AuthenticationError | ❌ No | Wrong API key — fix it first |
| BadRequestError | ❌ No | Bad prompt — retrying won't help |
| NotFoundError | ❌ No | Model name wrong |
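The retry behavior can be exercised without any network calls. This sketch uses stand-in exception classes (hypothetical; the real pipeline uses the anthropic SDK errors above) and an injectable sleep so the backoff is a no-op in tests:

```python
import time

# Stand-ins: TransientError plays the role of RateLimitError etc.,
# PermanentError the role of AuthenticationError.
class TransientError(Exception): ...
class PermanentError(Exception): ...

RETRYABLE = (TransientError,)

def call_with_retry(func, max_retries: int = 3, sleep=time.sleep):
    for attempt in range(max_retries):
        try:
            return func()
        except RETRYABLE:
            if attempt == max_retries - 1:
                raise
            sleep(2 ** attempt)  # exponential backoff

calls = {"flaky": 0, "bad": 0}

def flaky():  # fails twice, then succeeds → retried to success
    calls["flaky"] += 1
    if calls["flaky"] < 3:
        raise TransientError("blip")
    return "ok"

def bad():  # permanent error → raised immediately, never retried
    calls["bad"] += 1
    raise PermanentError("wrong key")

result = call_with_retry(flaky, sleep=lambda s: None)
try:
    call_with_retry(bad, sleep=lambda s: None)
except PermanentError:
    pass
print(result, calls)
```

The transient path is attempted three times before succeeding; the permanent path fails after exactly one attempt.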
Cache long system prompts to avoid resending (and paying for) them on every request.
```python
# Without caching: pay for system_prompt on EVERY call
# With caching: pay ~1.25x the input price once to write the cache,
# then ~10% of the input price on each cache read
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": system_prompt,  # Long, static content
                "cache_control": {"type": "ephemeral"},  # ← Cache this!
            },
            {
                "type": "text",
                "text": user_input,  # Variable content (not cached)
            },
        ],
    }
]
```
| Cache? | Content Type |
|---|---|
| ✅ Always | System prompts (role, rules, guidelines) |
| ✅ Always | Reference documents included in every call |
| ✅ Always | Few-shot examples that don't change |
| ❌ Never | User-specific or per-request content |
| ❌ Never | Content shorter than 1,024 tokens (minimum for caching) |
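A back-of-envelope estimate of what caching saves. The workload is hypothetical, and the ~1.25x cache-write / ~0.10x cache-read multipliers are assumptions based on typical prompt-caching pricing; check current rates before relying on the exact numbers:

```python
# Hypothetical: 10,000-token system prompt, Sonnet input at $3.00/1M tokens,
# 100 calls; cache write ≈ 1.25x base price, cache read ≈ 0.10x base price.
PROMPT_TOKENS = 10_000
BASE = 3.00 / 1_000_000  # $ per input token
N_CALLS = 100

uncached = N_CALLS * PROMPT_TOKENS * BASE
cached = (PROMPT_TOKENS * BASE * 1.25          # first call writes the cache
          + (N_CALLS - 1) * PROMPT_TOKENS * BASE * 0.10)  # the rest read it

print(f"uncached: ${uncached:.2f}, cached: ${cached:.2f}")
```

Under these assumptions the prompt portion drops from $3.00 to about $0.33, roughly a 9x reduction, and the advantage grows with call count.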
```python
import anthropic

def cost_aware_pipeline(tasks: list[dict], budget: float = 1.00) -> CostTracker:
    client = anthropic.Anthropic()
    tracker = CostTracker(budget_limit=budget)
    for task in tasks:
        # 1. Route to cheapest viable model
        model = select_model(
            text_length=len(task.get("content", "")),
            requires_deep_reasoning=task.get("complex", False),
        )
        # 2. Check budget before proceeding
        if tracker.over_budget:
            print(f"Budget exceeded at task {task['id']}. Cost: {tracker.summary()}")
            break
        # 3. Call with retry (the lambda runs within this iteration,
        #    so capturing `model` is safe)
        response = call_with_retry(lambda: client.messages.create(
            model=model,
            max_tokens=1024,
            messages=[{"role": "user", "content": task["prompt"]}],
        ))
        # 4. Track cost immutably
        tracker = tracker.add(CostRecord.from_response(model, response.usage))
        # 5. Log progress
        print(f"Task {task['id']} [{model}]: {tracker.summary()}")
    return tracker
```
Before running any LLM pipeline, review these failure modes and their recoveries:
| Failure | Cause | Recovery |
|---|---|---|
| Expensive model used for a task that a cheap model handles correctly | No routing logic; all requests sent to the most capable model by default | Add a task complexity classifier at the pipeline entry point; route simple extraction/classification tasks to the cheapest model that passes the quality bar |
| Token budget exhausted mid-pipeline, truncating output | Input context not estimated before sending; pipeline has no budget gate | Add a pre-flight token count check before every model call; if estimated tokens exceed budget, apply context compression or split the task |
| Cost spike caused by prompt that triggers verbose model output | Prompt uses open-ended instructions ("explain in detail") on a high-token-price model | Use constrained output prompts ("respond in under 100 words") on expensive models; validate output length against budget before accepting |
| Model downgrade silently degrades output quality below acceptable threshold | Cheaper model selected for cost reasons but quality not re-validated after switch | Define a minimum quality bar for each task type; run quality eval after every model routing change before promoting to production |
| Retry storm multiplies cost by 3–10x during a model API outage | Exponential backoff not implemented; naive retry on every 5xx response | Implement exponential backoff with jitter; cap total retries at 3; log cost-per-retry; fail fast after budget threshold exceeded |
| Cost attribution lost across pipeline stages | No cost tracking per stage; total bill visible but individual stage cost invisible | Instrument every model call with a cost tag (stage name, model, input/output tokens); aggregate by stage for per-stage cost visibility |
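The last row's per-stage cost attribution can be sketched as a simple aggregation over cost tags. The call records here are hypothetical placeholders for whatever your instrumentation emits:

```python
from collections import defaultdict

# Hypothetical cost tags emitted per call: (stage, model, cost_usd)
calls = [
    ("extract", "haiku", 0.0004),
    ("extract", "haiku", 0.0003),
    ("review", "sonnet", 0.0120),
    ("design", "opus", 0.0900),
]

by_stage: dict[str, float] = defaultdict(float)
for stage, _model, cost in calls:
    by_stage[stage] += cost

# Report stages from most to least expensive
for stage, cost in sorted(by_stage.items(), key=lambda kv: -kv[1]):
    print(f"{stage:>8}: ${cost:.4f}")
```

With per-stage totals visible, the expensive stage (here, the Opus design step) is obvious rather than buried in one aggregate bill.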
Verify with greps:

- `grep -c "budget\|max_cost\|cost_limit" pipeline_config.*` returns > 0
- `grep -c "cache_control\|ephemeral" pipeline.*` returns > 0 where applicable
- `grep -c "429\|503\|RetryError" pipeline.*` returns > 0 and `grep -c "400\|401\|4[0-9][02-9]" retry_handler.*` returns = 0
- `grep -c "input_tokens\|output_tokens\|cost" logs/` returns > 0
- `grep -c "0\.8\|80%" pipeline.*` returns > 0

This skill is complete when: 1) every step in the pipeline has an explicit model assignment based on task complexity, 2) per-call costs are logged immutably and a budget ceiling is enforced, and 3) retry logic only fires on transient errors and stops before budget is exhausted.