Cost optimization patterns for IDNA's GPT-4.1/GPT-4.1-mini routing — model selection by task complexity, budget tracking, retry logic.
Patterns for controlling LLM API costs while maintaining teaching quality. IDNA uses GPT-4.1 for teaching and GPT-4.1-mini for classification/evaluation.
| Model | Use Case | Cost (per 1M tokens) | Notes |
|---|---|---|---|
| GPT-4.1 | Teaching responses | ~$2 input / ~$8 output | Full pedagogical quality |
| GPT-4.1-mini | Classification (10 categories) | ~$0.40 input / ~$1.60 output | ~5x cheaper |
| GPT-4.1-mini | Answer evaluation (inline) | ~$0.40 input / ~$1.60 output | Via [CORRECT]/[INCORRECT] prefix |
| GPT-4.1-mini | Smart routing (~40% of turns) | ~$0.40 input / ~$1.60 output | Simple ACK/IDK/meta responses |
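Per-call cost follows directly from the per-1M-token rates in the table. A minimal sketch of that arithmetic (the helper name is illustrative; the rate constants are transcribed from the table above):

```python
# Approximate per-1M-token rates in USD, taken from the table above.
RATES = {
    "gpt-4.1": {"input": 2.00, "output": 8.00},
    "gpt-4.1-mini": {"input": 0.40, "output": 1.60},
}

def estimate_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one call from its token counts."""
    rate = RATES[model]
    return (input_tokens * rate["input"] + output_tokens * rate["output"]) / 1_000_000
```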
Route to GPT-4.1-mini when ALL true:
Route to GPT-4.1 when ANY true:
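Purely as an illustration of the shape of this decision, here is a hedged sketch. It assumes the "simple ACK/IDK/meta" turns mentioned in the table are what gets routed to GPT-4.1-mini; the function and category names are hypothetical, and the authoritative criteria are the two lists above.

```python
# Assumption: category labels come from the classification step; the names
# "ack", "idk", and "meta" mirror the "Simple ACK/IDK/meta responses" note
# in the table and are illustrative only.
SIMPLE_CATEGORIES = {"ack", "idk", "meta"}

def select_model(category: str, requires_teaching: bool) -> str:
    """Route simple conversational turns to GPT-4.1-mini, everything else to GPT-4.1."""
    if category in SIMPLE_CATEGORIES and not requires_teaching:
        return "gpt-4.1-mini"
    return "gpt-4.1"
```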
```python
from dataclasses import dataclass


@dataclass(frozen=True, slots=True)
class LLMCostRecord:
    """One LLM API call: which model, how many tokens, what it cost, and why."""
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float
    purpose: str  # "classify", "evaluate", "teach", "route"


@dataclass(frozen=True, slots=True)
class SessionCostTracker:
    """Immutable per-session cost ledger."""
    records: tuple[LLMCostRecord, ...] = ()

    def add(self, record: LLMCostRecord) -> "SessionCostTracker":
        # Frozen dataclass: return a new tracker rather than mutating in place.
        return SessionCostTracker(records=(*self.records, record))

    @property
    def total_cost(self) -> float:
        return sum(r.cost_usd for r in self.records)
```
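Because both dataclasses are frozen, `add()` returns a new tracker instead of mutating the old one, so a session's cost ledger can be passed between async tasks safely. A usage sketch with hypothetical token counts (the `cost_usd` values follow the table rates):

```python
# Illustrative usage; the token counts below are hypothetical.
tracker = SessionCostTracker()
tracker = tracker.add(LLMCostRecord(
    model="gpt-4.1-mini", input_tokens=350, output_tokens=20,
    cost_usd=0.00017, purpose="classify",
))
tracker = tracker.add(LLMCostRecord(
    model="gpt-4.1", input_tokens=1200, output_tokens=450,
    cost_usd=0.006, purpose="teach",
))
print(f"Session total so far: ${tracker.total_cost:.4f}")
```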
```python
import asyncio

from openai import APIConnectionError, InternalServerError, RateLimitError

# Transient failures worth retrying; anything else should surface immediately.
_RETRYABLE = (APIConnectionError, RateLimitError, InternalServerError)
_MAX_RETRIES = 3


async def call_with_retry(func, max_retries=_MAX_RETRIES):
    """Call an async function, retrying transient errors with exponential backoff (2 ** attempt seconds)."""
    for attempt in range(max_retries):
        try:
            return await func()
        except _RETRYABLE:
            if attempt == max_retries - 1:
                raise  # out of retries: re-raise the last transient error
            await asyncio.sleep(2 ** attempt)
```
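A usage sketch wrapping a classification call in the retry helper; the client setup, prompt, and model choice here are illustrative rather than IDNA's actual call site:

```python
# Illustrative usage with the official openai async client.
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def classify(turn: str) -> str:
    # The lambda defers the API call so call_with_retry can re-invoke it on failure.
    response = await call_with_retry(
        lambda: client.chat.completions.create(
            model="gpt-4.1-mini",
            messages=[{"role": "user", "content": turn}],
        )
    )
    return response.choices[0].message.content
```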
Typical 15-question session:
Adapted from ECC cost-aware-llm-pipeline (MIT license, credit: affaan-m/everything-claude-code)