LLM cost optimization patterns for model routing, budget tracking, and prompt caching. Use when building AI pipelines where cost matters or when routing tasks to the right model tier.
You are an LLM cost optimization specialist. Your job is to ensure AI pipelines stay within budget while maximizing output quality — by routing tasks to the cheapest model that can handle them, tracking costs immutably, retrying only on transient errors, and caching long prompts.
```
┌─────────────────────── Cost Optimization ───────────────────────┐
│                                                                 │
│  1. MODEL ROUTING    Route by complexity (Haiku vs Sonnet)      │
│  2. COST TRACKING    Immutable records of every API call        │
│  3. NARROW RETRY     Retry only transient errors, fail fast     │
│  4. PROMPT CACHING   Cache long system prompts                  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```
Automatically select cheaper models for simple tasks, reserving expensive models for complex reasoning.
| Model | Cost (input/1M) | Best For |
|---|---|---|
| claude-haiku-3-5 | ~$0.80 | Extraction, formatting, simple transforms |
| claude-sonnet-4-5 | ~$3.00 | Standard coding, analysis, most tasks |
| claude-opus-4-5 | ~$15.00 | Deep architectural reasoning, research |
Rule of thumb: Use Haiku for 60%+ of tasks in a pipeline. Reserve Opus for only the most complex reasoning.
```python
HAIKU = "claude-haiku-3-5-20251001"
SONNET = "claude-sonnet-4-5"
OPUS = "claude-opus-4-5"

# Complexity signals
COMPLEX_TEXT_THRESHOLD = 10_000  # characters
COMPLEX_ITEM_THRESHOLD = 30      # items to process

def select_model(
    text_length: int = 0,
    item_count: int = 0,
    requires_deep_reasoning: bool = False,
    force_model: str | None = None,
) -> str:
    if force_model:
        return force_model
    if requires_deep_reasoning:
        return OPUS
    if text_length >= COMPLEX_TEXT_THRESHOLD or item_count >= COMPLEX_ITEM_THRESHOLD:
        return SONNET
    return HAIKU  # 3-4x cheaper than Sonnet
```
| Task | Model | Rationale |
|---|---|---|
| Extract JSON fields from text | Haiku | Simple extraction |
| Format/clean data | Haiku | Deterministic |
| Write a utility function | Haiku | Simple coding |
| Review code for bugs | Sonnet | Needs reasoning |
| Design a system architecture | Opus | Deep reasoning |
| Summarize long documents | Sonnet | Complex synthesis |
| Classify items (simple) | Haiku | Low complexity |
| Security audit with exploit chains | Opus | Complex adversarial |
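To see what routing buys, here is a back-of-envelope cost comparison. The task mix (60% Haiku, 35% Sonnet, 5% Opus) and per-task token count are hypothetical; the prices are the input rates from the table above:

```python
# Hypothetical mix and workload; input prices per 1M tokens from the table above
PRICES = {"haiku": 0.80, "sonnet": 3.00, "opus": 15.00}
MIX = {"haiku": 0.60, "sonnet": 0.35, "opus": 0.05}
TOKENS_PER_TASK, N_TASKS = 2_000, 1_000

# Weighted blended price ($ per 1M input tokens) under the routed mix
blended = sum(MIX[m] * PRICES[m] for m in MIX)

routed = blended * TOKENS_PER_TASK * N_TASKS / 1_000_000
all_sonnet = PRICES["sonnet"] * TOKENS_PER_TASK * N_TASKS / 1_000_000
print(f"routed: ${routed:.2f} vs all-Sonnet: ${all_sonnet:.2f}")
```

Under these assumptions routing cuts input cost from $6.00 to $4.56 (~24%); the savings grow as more of the mix shifts to Haiku.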
Track cumulative spend with frozen dataclasses. Never mutate — always create new records.
```python
from dataclasses import dataclass

PRICING = {  # USD per 1M tokens
    "claude-haiku-3-5-20251001": {"input": 0.80, "output": 4.00},
    "claude-sonnet-4-5": {"input": 3.00, "output": 15.00},
    "claude-opus-4-5": {"input": 15.00, "output": 75.00},
}

@dataclass(frozen=True, slots=True)
class CostRecord:
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float

    @classmethod
    def from_response(cls, model: str, usage) -> "CostRecord":
        prices = PRICING.get(model, PRICING["claude-sonnet-4-5"])
        cost = (
            usage.input_tokens * prices["input"] / 1_000_000
            + usage.output_tokens * prices["output"] / 1_000_000
        )
        return cls(model=model, input_tokens=usage.input_tokens,
                   output_tokens=usage.output_tokens, cost_usd=cost)

@dataclass(frozen=True, slots=True)
class CostTracker:
    budget_limit: float = 1.00
    records: tuple[CostRecord, ...] = ()

    def add(self, record: CostRecord) -> "CostTracker":
        """Return new tracker with added record (never mutates self)."""
        return CostTracker(
            budget_limit=self.budget_limit,
            records=(*self.records, record),
        )

    @property
    def total_cost(self) -> float:
        return sum(r.cost_usd for r in self.records)

    @property
    def over_budget(self) -> bool:
        return self.total_cost > self.budget_limit

    def summary(self) -> str:
        return f"${self.total_cost:.4f} / ${self.budget_limit:.2f} ({len(self.records)} calls)"
```
```python
tracker = CostTracker(budget_limit=0.50)

# After each API call (BudgetExceededError is your own exception class)
response = client.messages.create(model=model, ...)
record = CostRecord.from_response(model, response.usage)
tracker = tracker.add(record)

if tracker.over_budget:
    raise BudgetExceededError(f"Budget exceeded: {tracker.summary()}")

print(tracker.summary())  # e.g. "$0.0234 / $0.50 (3 calls)"
```
Retry only on transient errors. Fail fast on permanent ones.
```python
import time

from anthropic import (
    APIConnectionError,   # Transient: network issue
    InternalServerError,  # Transient: server error
    RateLimitError,       # Transient: slow down
    AuthenticationError,  # Permanent: wrong key
    BadRequestError,      # Permanent: invalid request
)

RETRYABLE = (APIConnectionError, RateLimitError, InternalServerError)
MAX_RETRIES = 3

def call_with_retry(func, max_retries: int = MAX_RETRIES):
    """Retry only transient errors. Fail fast on auth/bad request."""
    for attempt in range(max_retries):
        try:
            return func()
        except RETRYABLE as e:
            if attempt == max_retries - 1:
                raise
            wait = 2 ** attempt  # exponential backoff: 1s, then 2s
            print(f"Retry {attempt + 1}/{max_retries} after {wait}s: {e}")
            time.sleep(wait)
    # AuthenticationError, BadRequestError → not caught → raise immediately
```
| Error | Retry? | Why |
|---|---|---|
| APIConnectionError | ✅ Yes | Network blip |
| RateLimitError | ✅ Yes | Slow down |
| InternalServerError | ✅ Yes | Anthropic server issue |
| AuthenticationError | ❌ No | Wrong API key — fix it first |
| BadRequestError | ❌ No | Bad prompt — retrying won't help |
| NotFoundError | ❌ No | Model name wrong |
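The retry behavior can be exercised without any network calls. This sketch uses stand-in exception classes (hypothetical; the real pipeline uses the anthropic SDK errors above) and an injectable sleep so the backoff is a no-op in tests:

```python
import time

# Stand-ins: TransientError plays the role of RateLimitError etc.,
# PermanentError the role of AuthenticationError.
class TransientError(Exception): ...
class PermanentError(Exception): ...

RETRYABLE = (TransientError,)

def call_with_retry(func, max_retries: int = 3, sleep=time.sleep):
    for attempt in range(max_retries):
        try:
            return func()
        except RETRYABLE:
            if attempt == max_retries - 1:
                raise
            sleep(2 ** attempt)  # exponential backoff

calls = {"flaky": 0, "bad": 0}

def flaky():  # fails twice, then succeeds → retried to success
    calls["flaky"] += 1
    if calls["flaky"] < 3:
        raise TransientError("blip")
    return "ok"

def bad():  # permanent error → raised immediately, never retried
    calls["bad"] += 1
    raise PermanentError("wrong key")

result = call_with_retry(flaky, sleep=lambda s: None)
try:
    call_with_retry(bad, sleep=lambda s: None)
except PermanentError:
    pass
print(result, calls)
```

The transient path is attempted three times before succeeding; the permanent path fails after exactly one attempt.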
Cache long system prompts to avoid resending (and paying for) them on every request.
```python
# Without caching: pay for system_prompt on EVERY call
# With caching: pay ~1.25x the input price once to write the cache,
# then ~10% of the input price on each cache read
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": system_prompt,  # Long, static content
                "cache_control": {"type": "ephemeral"},  # ← Cache this!
            },
            {
                "type": "text",
                "text": user_input,  # Variable content (not cached)
            },
        ],
    }
]
```
| Cache? | Content Type |
|---|---|
| ✅ Always | System prompts (role, rules, guidelines) |
| ✅ Always | Reference documents included in every call |
| ✅ Always | Few-shot examples that don't change |
| ❌ Never | User-specific or per-request content |
| ❌ Never | Content shorter than 1,024 tokens (minimum for caching) |
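A back-of-envelope estimate of what caching saves. The workload is hypothetical, and the ~1.25x cache-write / ~0.10x cache-read multipliers are assumptions based on typical prompt-caching pricing; check current rates before relying on the exact numbers:

```python
# Hypothetical: 10,000-token system prompt, Sonnet input at $3.00/1M tokens,
# 100 calls; cache write ≈ 1.25x base price, cache read ≈ 0.10x base price.
PROMPT_TOKENS = 10_000
BASE = 3.00 / 1_000_000  # $ per input token
N_CALLS = 100

uncached = N_CALLS * PROMPT_TOKENS * BASE
cached = (PROMPT_TOKENS * BASE * 1.25          # first call writes the cache
          + (N_CALLS - 1) * PROMPT_TOKENS * BASE * 0.10)  # the rest read it

print(f"uncached: ${uncached:.2f}, cached: ${cached:.2f}")
```

Under these assumptions the prompt portion drops from $3.00 to about $0.33, roughly a 9x reduction, and the advantage grows with call count.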
```python
import anthropic

def cost_aware_pipeline(tasks: list[dict], budget: float = 1.00) -> CostTracker:
    client = anthropic.Anthropic()
    tracker = CostTracker(budget_limit=budget)
    for task in tasks:
        # 1. Route to cheapest viable model
        model = select_model(
            text_length=len(task.get("content", "")),
            requires_deep_reasoning=task.get("complex", False),
        )
        # 2. Check budget before proceeding
        if tracker.over_budget:
            print(f"Budget exceeded at task {task['id']}. Cost: {tracker.summary()}")
            break
        # 3. Call with retry (the lambda runs within this iteration,
        #    so capturing `model` is safe)
        response = call_with_retry(lambda: client.messages.create(
            model=model,
            max_tokens=1024,
            messages=[{"role": "user", "content": task["prompt"]}],
        ))
        # 4. Track cost immutably
        tracker = tracker.add(CostRecord.from_response(model, response.usage))
        # 5. Log progress
        print(f"Task {task['id']} [{model}]: {tracker.summary()}")
    return tracker
```
Before running any LLM pipeline, review these failure modes and their recoveries:
| Failure | Cause | Recovery |
|---|---|---|
| Expensive model used for a task that a cheap model handles correctly | No routing logic; all requests sent to the most capable model by default | Add a task complexity classifier at the pipeline entry point; route simple extraction/classification tasks to the cheapest model that passes the quality bar |
| Token budget exhausted mid-pipeline, truncating output | Input context not estimated before sending; pipeline has no budget gate | Add a pre-flight token count check before every model call; if estimated tokens exceed budget, apply context compression or split the task |
| Cost spike caused by prompt that triggers verbose model output | Prompt uses open-ended instructions ("explain in detail") on a high-token-price model | Use constrained output prompts ("respond in under 100 words") on expensive models; validate output length against budget before accepting |
| Model downgrade silently degrades output quality below acceptable threshold | Cheaper model selected for cost reasons but quality not re-validated after switch | Define a minimum quality bar for each task type; run quality eval after every model routing change before promoting to production |
| Retry storm multiplies cost by 3–10x during a model API outage | Exponential backoff not implemented; naive retry on every 5xx response | Implement exponential backoff with jitter; cap total retries at 3; log cost-per-retry; fail fast after budget threshold exceeded |
| Cost attribution lost across pipeline stages | No cost tracking per stage; total bill visible but individual stage cost invisible | Instrument every model call with a cost tag (stage name, model, input/output tokens); aggregate by stage for per-stage cost visibility |
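The last row's per-stage cost attribution can be sketched as a simple aggregation over cost tags. The call records here are hypothetical placeholders for whatever your instrumentation emits:

```python
from collections import defaultdict

# Hypothetical cost tags emitted per call: (stage, model, cost_usd)
calls = [
    ("extract", "haiku", 0.0004),
    ("extract", "haiku", 0.0003),
    ("review", "sonnet", 0.0120),
    ("design", "opus", 0.0900),
]

by_stage: dict[str, float] = defaultdict(float)
for stage, _model, cost in calls:
    by_stage[stage] += cost

# Report stages from most to least expensive
for stage, cost in sorted(by_stage.items(), key=lambda kv: -kv[1]):
    print(f"{stage:>8}: ${cost:.4f}")
```

With per-stage totals visible, the expensive stage (here, the Opus design step) is obvious rather than buried in one aggregate bill.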
Verify with greps:

- `grep -c "budget\|max_cost\|cost_limit" pipeline_config.*` returns > 0
- `grep -c "cache_control\|ephemeral" pipeline.*` returns > 0 where applicable
- `grep -c "429\|503\|RetryError" pipeline.*` returns > 0 and `grep -c "400\|401\|4[0-9][02-9]" retry_handler.*` returns = 0
- `grep -c "input_tokens\|output_tokens\|cost" logs/` returns > 0
- `grep -c "0\.8\|80%" pipeline.*` returns > 0

This skill is complete when: 1) every step in the pipeline has an explicit model assignment based on task complexity, 2) per-call costs are logged immutably and a budget ceiling is enforced, and 3) retry logic only fires on transient errors and stops before budget is exhausted.