Use when AI inference costs are growing unexpectedly, when comparing model choices by cost/quality ratio, or when optimizing token usage across a multi-model pipeline — produces an actionable cost reduction plan
Audit AI inference costs and optimize token usage across multi-model pipelines. This is not about cutting capabilities — it is about eliminating waste, right-sizing models, and keeping costs predictable.
Use the most capable model necessary — not the most capable model available.
| Tier | Models | Best for |
|---|---|---|
| Premium | claude-opus-4.6, claude-opus-4.5 | Architecture decisions, complex multi-file reasoning, security audits |
| Standard | claude-sonnet-4.6, claude-sonnet-4.5, gpt-5.2 | Most coding tasks, code review, test generation, documentation |
| Fast / Cheap | claude-haiku-4.5, gpt-5-mini, gpt-4.1 | File edits, boilerplate, classification, triage, simple summaries |
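A minimal sketch of routing tasks to the cheapest adequate tier. The task labels and the tier-to-task mapping are illustrative assumptions, not a fixed taxonomy; only the model names come from the table above.

```python
# Hypothetical tier map: route each task type to the cheapest adequate tier.
# Task labels and the mapping are assumptions; adapt to your pipeline.
MODEL_TIERS = {
    "premium": ["claude-opus-4.6", "claude-opus-4.5"],
    "standard": ["claude-sonnet-4.6", "claude-sonnet-4.5", "gpt-5.2"],
    "fast": ["claude-haiku-4.5", "gpt-5-mini", "gpt-4.1"],
}

TASK_TIER = {
    "architecture_review": "premium",
    "security_audit": "premium",
    "code_review": "standard",
    "test_generation": "standard",
    "file_edit": "fast",
    "boilerplate": "fast",
    "triage": "fast",
}

def pick_model(task_type: str) -> str:
    """Return the first model in the tier assigned to this task type."""
    tier = TASK_TIER.get(task_type, "standard")  # unknown tasks default to mid tier
    return MODEL_TIERS[tier][0]

print(pick_model("boilerplate"))  # claude-haiku-4.5
```

Defaulting unknown task types to the standard tier keeps the router safe: a misclassified task gets a capable model rather than a too-cheap one.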
Scan for:
| Metric | How to measure |
|---|---|
| Total tokens / task | Compare before and after context changes |
| Model mix | Tally which models are called per workflow |
| Prompt size distribution | Log avg/max token counts per call type |
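The prompt size distribution metric can be computed from call logs with a few lines of aggregation. The log record shape (`call_type`, `prompt_tokens`) is an assumption; substitute whatever your telemetry emits.

```python
from collections import defaultdict

# Sketch: aggregate per-call-type token counts to get the avg/max
# prompt size distribution. The log record fields are assumptions.
calls = [
    {"call_type": "pr_review", "prompt_tokens": 52_000},
    {"call_type": "pr_review", "prompt_tokens": 48_000},
    {"call_type": "doc_summary", "prompt_tokens": 4_000},
]

stats = defaultdict(list)
for c in calls:
    stats[c["call_type"]].append(c["prompt_tokens"])

for call_type, tokens in sorted(stats.items()):
    print(f"{call_type}: avg={sum(tokens) / len(tokens):.0f} max={max(tokens)}")
```

A per-call-type breakdown like this is what surfaces outliers such as a 50K-token review prompt hiding behind a low overall average.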
- Model downgrade
- Context pruning
- `view_range` instead of full-file reads
- Prompt deduplication
- Task batching
For each change:
- Change: Replace claude-opus on doc-summary with claude-haiku
- Before: ~4,000 tokens × $0.015/1K = $0.06/call
- After: ~4,000 tokens × $0.00025/1K = $0.001/call
- Savings: ~$0.059/call, ~$590/10K calls
Use approximate public pricing for estimation. Actual prices vary; check your provider dashboard.
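The arithmetic above can be sketched as a small estimator. The rates are the approximate example figures from this section, not current provider pricing:

```python
# Savings estimator for a model downgrade. Rates ($/1K tokens) are the
# approximate example figures above, not authoritative pricing.
def per_call_cost(tokens: int, price_per_1k: float) -> float:
    return tokens / 1000 * price_per_1k

before = per_call_cost(4_000, 0.015)     # opus-class rate
after = per_call_cost(4_000, 0.00025)    # haiku-class rate
per_10k = (before - after) * 10_000

print(f"${before:.3f}/call -> ${after:.4f}/call, ~${per_10k:,.0f} per 10K calls")
```

Keeping the estimator in code rather than in a spreadsheet makes it easy to re-run the audit as pricing or token counts change.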
| Priority | Criterion |
|---|---|
| High | Premium model on a task a fast model handles well |
| High | Context window > 50K tokens when shorter would suffice |
| Medium | Duplicate context passed on every call |
| Medium | Fleet agents with mismatched model tiers |
| Low | Minor prompt size variations |
## Cost Audit Report
### Summary
Estimated waste: ~$X/day at current scale
Top three opportunities: [list]
### Findings
#### [HIGH] Premium model for boilerplate generation
Location: [file or workflow name]
Issue: `claude-opus-4.6` used for all code generation including templates and stubs.
Recommendation: Use `claude-haiku-4.5` for boilerplate; reserve opus for complex tasks.
Estimated savings: ~80% cost reduction on boilerplate tasks.
#### [MEDIUM] Entire codebase passed as context on every PR review
...
| Pattern | Fix |
|---|---|
| Entire conversation history on every call | Summarize old context, keep recent turns |
| Full file reads when only one function matters | Use view_range for targeted reads |
| Premium model for all parallel agents in fleet | Assign tier per task type |
| Same instructions repeated in every prompt | Move to shared system prompt |
| No caching on static reference docs | Check if your API client supports prompt caching |
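The first fix above, summarizing old context while keeping recent turns, can be sketched as follows. The `summarize` helper is a hypothetical stand-in for a fast-model summarization call:

```python
# Sketch of "summarize old context, keep recent turns".
# summarize() is a placeholder; in practice, call a fast/cheap model.
def summarize(messages: list[dict]) -> str:
    return f"{len(messages)} earlier messages elided."

def prune_history(messages: list[dict], keep_recent: int = 6) -> list[dict]:
    """Replace all but the most recent turns with a single summary message."""
    if len(messages) <= keep_recent:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = {
        "role": "system",
        "content": "Summary of earlier conversation: " + summarize(old),
    }
    return [summary] + recent
```

This bounds per-call context growth: a 40-turn conversation costs roughly the same as a 7-turn one, at the price of one cheap summarization call per pruning pass.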
orchestration/templates/orchestrator-template.md — model selection guidance in orchestration context