Work effectively across multiple AI models and model-agnostic platforms, selecting the right model for each task. Use when the user is building systems that use multiple AI models, comparing model capabilities, or working in multi-model environments. Synthesizes best practices from VSCode Agent, Augment Code, Amp, Traycer AI, and Cursor.
You are operating in a multi-model environment. Help users build, maintain, and optimize systems that span multiple AI models and providers.
1. Model Selection Criteria
Choose the right model for each task by evaluating:
Task complexity: Use large frontier models (Claude Sonnet/Opus, GPT-4o, Gemini Ultra) for complex reasoning, multi-step planning, and code generation. Use smaller models (Claude Haiku, GPT-4o mini, Gemini Flash) for classification, summarization, or high-volume triage.
Modality requirements: Match the model to required input/output modalities — text, code, vision, audio, structured data, or tool use.
Context window needs: Use models with larger context windows (200K+ tokens) for full codebase ingestion, long document analysis, or multi-turn conversations with heavy history.
Cost-per-token budget: Estimate token consumption upfront. For high-volume pipelines, model selection is a budget decision as much as a capability decision.
Latency sensitivity: Real-time user-facing tasks require low-latency models. Offline batch pipelines can use larger, slower, cheaper models.
Decision rule: Default to the smallest model that satisfies the task requirements. Escalate to a larger model only when the smaller model demonstrably fails.
Model routing by task complexity (Python example):
MODEL_SONNET = "claude-sonnet-4-6"
MODEL_HAIKU = "claude-haiku-4-5-20251001"

def select_model(text_length: int, item_count: int, force_model: str | None = None) -> str:
    if force_model is not None:
        return force_model
    if text_length >= 10_000 or item_count >= 30:
        return MODEL_SONNET  # Complex task
    return MODEL_HAIKU  # Simple task (3-4x cheaper)
2. Cost-Aware Pipeline Patterns
Immutable cost tracking — track cumulative spend with frozen dataclasses:
from dataclasses import dataclass

@dataclass(frozen=True, slots=True)
class CostTracker:
    budget_limit: float = 1.00
    records: tuple = ()  # each record is expected to expose a cost_usd attribute

    def add(self, record) -> "CostTracker":
        # Return a new tracker instead of mutating this one
        return CostTracker(budget_limit=self.budget_limit, records=(*self.records, record))

    @property
    def total_cost(self) -> float:
        return sum(r.cost_usd for r in self.records)

    @property
    def over_budget(self) -> bool:
        return self.total_cost > self.budget_limit
Narrow retry logic: Retry only on transient errors (rate limits, timeouts, server-side failures). Fail fast on authentication and bad-request errors, since repeating the same call cannot fix them.
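A minimal sketch of narrow retry logic, assuming the provider client's errors have already been mapped onto two hypothetical exception classes, `TransientError` and `FatalError`. Neither name comes from a real SDK; the `sleep` parameter is injectable only so the function can be tested without waiting.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical: rate limits, timeouts, 5xx responses."""


class FatalError(Exception):
    """Hypothetical: authentication failures, malformed requests."""


def call_with_retry(fn, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call fn(); retry only on TransientError, with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            delay = base_delay * 2 ** (attempt - 1)  # 0.5s, 1s, 2s, ...
            sleep(delay + random.uniform(0, delay * 0.1))
        # FatalError and any other exception propagate immediately, with no retry
```

Catching only the transient class keeps the retry loop from masking configuration mistakes such as an invalid key.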
3. Model-Agnostic Prompt Design
Write prompts that work reliably across multiple models without per-model rewrites:
Avoid model-specific syntax: Do not rely on proprietary prompt features. Use neutral Markdown for structure.
Explicit role and context framing: Always open with a clear role statement and task description.
Structured output specification: Define expected output format in the prompt body itself — do not rely on model default behaviors.
Separate system and user content: Keep system-level instructions in the system prompt; task-specific content in the user turn. This is portable across all major APIs (Anthropic, OpenAI, Google, Mistral).
Use delimiters consistently: Wrap distinct content sections in consistent delimiters (XML tags, triple-backtick blocks, or --- separators).
Test prompt portability: After writing a prompt, run it against at least two models. If outputs diverge significantly, the prompt is under-specified.
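The guidelines above can be sketched as a small prompt builder. This is an illustrative helper, not part of any provider SDK: `build_messages` and its parameter names are assumptions, and the returned dictionary still has to be mapped onto each provider's request shape.

```python
def build_messages(role: str, task: str, output_format: str, sections: dict[str, str]) -> dict:
    """Return a provider-neutral pair of system and user content."""
    # The system prompt carries the role statement and the explicit output format
    system = f"{role}\n\nOutput format:\n{output_format}"
    # Every content section is wrapped in the same XML-style delimiter
    wrapped = "\n".join(f"<{name}>\n{body}\n</{name}>" for name, body in sections.items())
    # The user turn carries the task description and the task-specific content
    user = f"{task}\n\n{wrapped}"
    return {"system": system, "user": user}
```

Running the same built prompt against two models and comparing the outputs is then a matter of swapping the client, not the prompt.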
4. Handling Model-Specific Capabilities and Limitations
Tool/function calling schemas: OpenAI, Anthropic, and Google each have different schemas. Abstract tool definitions into a canonical format and write per-provider adapters.
Context window utilization: Monitor output quality as context grows; implement context pruning or summarization before hitting the limit.
Streaming support: Implement a streaming abstraction with a non-streaming fallback.
Maximum output tokens: Chunk long-output tasks into multiple sequential calls if output may exceed the model's generation limit.
Refusal behavior: Each model has different content policies. Define the scope of your agent clearly in the system prompt so refusals are predictable.
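One way to abstract tool definitions is a canonical dictionary plus one adapter function per provider. The sketch below assumes the OpenAI Chat Completions and Anthropic Messages tool formats as commonly documented; verify the field names against current provider documentation before relying on them. The `get_weather` tool and the `ADAPTERS` table are hypothetical, and a Google adapter is omitted.

```python
# Canonical, provider-neutral tool definition (JSON Schema for the arguments)
CANONICAL_TOOL = {
    "name": "get_weather",
    "description": "Look up the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}


def to_openai(tool: dict) -> dict:
    # OpenAI nests the definition under a "function" key
    return {
        "type": "function",
        "function": {
            "name": tool["name"],
            "description": tool["description"],
            "parameters": tool["parameters"],
        },
    }


def to_anthropic(tool: dict) -> dict:
    # Anthropic names the JSON Schema field "input_schema"
    return {
        "name": tool["name"],
        "description": tool["description"],
        "input_schema": tool["parameters"],
    }


# Hypothetical lookup table used by the calling code
ADAPTERS = {"openai": to_openai, "anthropic": to_anthropic}
```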
5. Fallback Strategies When a Model Fails
Build resilient multi-model pipelines that degrade gracefully:
Define a model priority chain: For every task type, define ranked models: primary → secondary → emergency fallback. Example (shorthand family names; use pinned model IDs in configuration): Claude Sonnet → Claude Haiku → GPT-4o mini.
Retry with exponential backoff before fallback: Distinguish transient errors (retry same model) from persistent failures (escalate). Minimum 3 attempts before switching models.
Output normalization after fallback: Ensure outputs from all fallback models are normalized to the same schema before passing to downstream systems.
Log every fallback event: Record the original model, reason for fallback, fallback model used, and outcome.
Circuit breaker pattern: After N consecutive failures on a primary model within a rolling window, open the circuit breaker and route directly to the fallback for a configurable cooldown period.
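The priority chain, fallback logging, and circuit breaker can be combined roughly as follows. This is a simplified sketch: it counts consecutive failures rather than failures within a rolling window, it leaves same-model retries to the `call` function, and `CircuitBreaker`, `run_with_fallback`, and the log record fields are all illustrative names.

```python
import time


class CircuitBreaker:
    def __init__(self, max_failures=3, cooldown_s=60.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        if self.clock() - self.opened_at >= self.cooldown_s:
            # Cooldown elapsed: close the circuit and allow traffic again
            self.opened_at = None
            self.failures = 0
            return False
        return True

    def record(self, ok: bool) -> None:
        if ok:
            self.failures = 0
            return
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()


def run_with_fallback(chain, call, breakers, log):
    """Try each model in priority order, skipping models whose circuit is open."""
    last_error = None
    for model in chain:
        breaker = breakers[model]
        if breaker.is_open():
            log.append({"model": model, "event": "skipped_circuit_open"})
            continue
        try:
            result = call(model)
        except Exception as exc:
            breaker.record(ok=False)
            log.append({"model": model, "event": "failed", "reason": repr(exc)})
            last_error = exc
            continue
        breaker.record(ok=True)
        log.append({"model": model, "event": "succeeded"})
        return result
    raise RuntimeError("all models in the chain failed or were skipped") from last_error
```

Passing `clock` in makes the cooldown testable without real waiting; output normalization would run on `result` before it is returned to downstream code.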
6. Context Window Management Across Models
Always measure before sending: Count tokens before constructing the final prompt. Use the target model's tokenizer. Never estimate by character count.
Define a context budget: Allocate the window into segments: system prompt, retrieved context, conversation history, task instructions, output buffer.
History truncation strategy: Truncate oldest turns first. Preserve the system prompt and most recent exchanges. Optionally replace truncated history with a rolling summary.
RAG for large codebases: Do not stuff entire files into context. Retrieve only the most relevant code chunks via embedding search. Limit retrieved context to 30-40% of total window budget.
Cross-model context hand-off: When moving from one model to another, serialize relevant context into a structured summary rather than passing raw history.
Monitor in production: Log prompt_tokens and completion_tokens from every response. Alert when prompt tokens exceed 80% of the model's limit.
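A minimal sketch of oldest-first history truncation against a context budget, plus the 80% alert check. The `count_tokens` argument is a stand-in for the target model's real tokenizer and must be supplied by the caller; `fit_history` and `near_limit` are hypothetical names.

```python
def fit_history(system_prompt, history, window, output_buffer, count_tokens):
    """Keep the system prompt plus the newest turns that fit in the window."""
    # Reserve space for the system prompt and the expected output
    budget = window - output_buffer - count_tokens(system_prompt)
    kept = []
    # Walk from newest to oldest so the oldest turns are dropped first
    for turn in reversed(history):
        cost = count_tokens(turn)
        if cost > budget:
            break
        kept.append(turn)
        budget -= cost
    return list(reversed(kept))  # restore chronological order


def near_limit(prompt_tokens, window, threshold=0.8):
    """True when the prompt uses more than `threshold` of the model's window."""
    return prompt_tokens > threshold * window
```

A rolling summary of the dropped turns could be prepended to the kept history; that step is left out here.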
7. On-Device Models with Apple FoundationModels
For privacy-sensitive apps that must work offline, use Apple's on-device FoundationModels framework.
Availability check (always required):
import SwiftUI
import FoundationModels

struct GenerativeView: View {
    private var model = SystemLanguageModel.default

    var body: some View {
        // Branch on availability before showing any generative UI
        switch model.availability {
        case .available:
            ContentView()
        case .unavailable(.deviceNotEligible):
            Text("Device not eligible for Apple Intelligence")
        case .unavailable(.appleIntelligenceNotEnabled):
            Text("Please enable Apple Intelligence in Settings")
        case .unavailable(.modelNotReady):
            Text("Model is downloading or not ready")
        case .unavailable(let other):
            Text("Model unavailable: \(other)")
        }
    }
}
Basic session:
let session = LanguageModelSession(instructions: """
You are a cooking assistant. Provide brief, practical suggestions.
""")
let response = try await session.respond(to: "I have chicken and rice")
Structured output with @Generable:
@Generable(description: "A trip idea")
struct TripIdea {
    var destination: String
    // .range guides apply to numeric properties, so this String uses a description only
    @Guide(description: "Why it's a good choice")
    var rationale: String
}
let response = try await session.respond(to: "Suggest a trip", generating: TripIdea.self)
One request per session at a time (isResponding check required)
Access results via .content, not .output
Use @Generable for structured output — stronger guarantees than parsing raw strings
Snapshot streaming (not deltas) for real-time UI via session.streamResponse(...)
8. Consistent Output Format Regardless of Model
Schema-first design: Define the output schema before selecting a model. Every model must produce output conforming to this schema.
Structured output enforcement: Use JSON mode, structured output APIs, or tool-calling response format where available.
Output parsing and validation layer: Implement a validation layer between the model response and your application that parses, validates against schema, and raises errors on malformed output.
Normalization of model-specific quirks: Some models prepend preamble text or wrap JSON in markdown fences. Implement model-specific output normalization handlers.
Fallback parsing: If strict parsing fails, use a lenient fallback parser. Log all fallback parse events.
Version your output schemas: When changing the expected output format, version the schema and maintain backward-compatible parsing.
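The validation layer, quirk normalization, and lenient fallback described above might look like this. The three-field schema in `REQUIRED_FIELDS` is a made-up example, and `parse_model_output` is a hypothetical name; a real system would more likely validate against JSON Schema or a typed model.

```python
import json
import re

# Hypothetical schema: field name -> expected Python type
REQUIRED_FIELDS = {"schema_version": int, "label": str, "confidence": float}

_TICKS = "`" * 3  # a Markdown code-fence marker
_FENCED = re.compile(_TICKS + r"(?:json)?\s*(.*?)" + _TICKS, re.DOTALL)


def _extract_json(raw: str) -> str:
    """Lenient fallback: strip Markdown fences or surrounding preamble text."""
    match = _FENCED.search(raw)
    if match:
        return match.group(1)
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model output")
    return raw[start:end + 1]


def parse_model_output(raw: str):
    """Return (data, used_fallback); raise ValueError on malformed output."""
    used_fallback = False
    try:
        data = json.loads(raw)  # strict parse first
    except json.JSONDecodeError:
        used_fallback = True  # the caller should log this event
        data = json.loads(_extract_json(raw))
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for field, expected in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected):
            raise ValueError(f"field {field!r} is missing or not {expected.__name__}")
    return data, used_fallback
```

Returning the `used_fallback` flag lets the caller log every fallback parse without the parser depending on a logging setup.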
9. Model Versioning and Upgrades
Pin model versions in production: Do not use floating aliases that a provider can repoint to a newer snapshot (e.g., gpt-4o, claude-3-5-sonnet-latest). Pin to a specific dated version (e.g., claude-sonnet-4-20250514).
Upgrade tracking: Maintain a model version registry mapping task types to pinned versions, so that every upgrade is an explicit, reviewable change to that registry.
Canary deployments for model upgrades: Roll out model version changes to 1-5% of traffic first. Monitor output quality metrics before increasing rollout.
Regression testing on upgrade: Run your full prompt test suite against the new version before promoting it.
Deprecation monitoring: Subscribe to provider deprecation notices. Alert when a pinned version is within 30 days of its deprecation date.
Rollback capability: Maintain the ability to roll back to the previous model version within 5 minutes of detecting a production regression.
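A version registry with a deprecation check can be as small as a dictionary and one function. In this sketch the model IDs and retirement dates are placeholders, not real provider data, and `versions_needing_attention` is a hypothetical name.

```python
from datetime import date, timedelta

# Hypothetical registry: task type -> (pinned model ID, announced retirement date)
REGISTRY = {
    "triage": ("vendor-small-20240101", date(2025, 1, 20)),    # placeholder values
    "planning": ("vendor-large-20240601", date(2025, 6, 1)),   # placeholder values
}


def versions_needing_attention(registry, today, warn_days=30):
    """Return task types whose pinned model retires within warn_days (or already has)."""
    warn = timedelta(days=warn_days)
    return sorted(
        task for task, (_model_id, retires_on) in registry.items()
        if retires_on - today <= warn
    )
```

Keeping the previous pinned ID alongside the current one in the same registry would give rollback a single place to read from.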
10. Latency Considerations
First-token latency vs. total latency: For streaming user-facing responses, time to first token (TTFT) is the metric that drives perceived responsiveness. For batch pipelines, total latency matters more.
Parallel model calls: When a task can be decomposed into independent subtasks, issue the calls concurrently, to the same model or to different ones, instead of running them one after another.
Streaming for user-facing paths: Always stream responses in user-facing applications. Implement streaming-compatible output parsers.
Pre-computation and prompt caching: Use provider-side prompt caching (offered by both Anthropic and OpenAI) for repeated system prompt prefixes. This reduces both cost and latency.
Timeout budgets: Set an explicit timeout on every call. Typical production starting points are 5s for time to first token and 30s for total completion on frontier models. Fail fast and trigger fallback rather than waiting indefinitely.
Latency SLA monitoring: Track p50, p95, and p99 latency per model per task type. Alert on p95 degradation.
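Parallel calls with a timeout budget, and the percentile tracking mentioned above, can be sketched with the standard library alone. The `call` coroutine is a stand-in for a real async client, and returning `None` on timeout is just one possible convention for signalling "trigger fallback".

```python
import asyncio
import statistics


async def call_with_timeout(call, model, prompt, timeout_s):
    """Await one model call; return None if it exceeds the timeout budget."""
    try:
        return await asyncio.wait_for(call(model, prompt), timeout=timeout_s)
    except asyncio.TimeoutError:
        return None  # the caller treats None as a signal to fall back


async def run_parallel(call, jobs, timeout_s=30.0):
    """Run independent (model, prompt) jobs concurrently; results keep job order."""
    return await asyncio.gather(
        *(call_with_timeout(call, model, prompt, timeout_s) for model, prompt in jobs)
    )


def latency_percentiles(samples_ms):
    """Compute p50, p95, and p99 from latency samples (needs at least two samples)."""
    cuts = statistics.quantiles(samples_ms, n=100)  # 99 cut points
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```

In practice the samples would be grouped per model and per task type before computing percentiles.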
11. API Key and Credential Management
Never hardcode API keys: Keys must never appear in source code, configuration files, or log output.
Centralized secrets management: Store all API keys in a dedicated secrets manager (AWS Secrets Manager, HashiCorp Vault, Azure Key Vault, GCP Secret Manager).
Per-environment key isolation: Maintain separate keys for development, staging, and production.
Provider key abstraction: Abstract credentials behind a unified resolver: get_credential("anthropic") — not direct environment variable references. Enables central rotation without code changes.
Key inventory tracking: Maintain an inventory of all API keys: provider, environment, creation date, rotation date, owning team. Review and prune unused keys quarterly.
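A unified resolver along the lines of `get_credential("anthropic")` might be sketched as below. The key naming scheme, the `CredentialError` class, and the `redact` helper are assumptions; `source` defaults to the process environment purely for illustration and would be replaced by a secrets-manager client in production.

```python
import os


class CredentialError(RuntimeError):
    """Raised when a provider key cannot be resolved."""


# Hypothetical mapping from provider name to the stored secret's base name
_SECRET_NAMES = {"anthropic": "ANTHROPIC_API_KEY", "openai": "OPENAI_API_KEY"}


def get_credential(provider, environment="production", source=os.environ):
    """Resolve a provider key through one choke point so rotation needs no code changes."""
    base = _SECRET_NAMES.get(provider)
    if base is None:
        raise CredentialError(f"unknown provider: {provider}")
    # Per-environment isolation: each environment has its own secret name
    value = source.get(f"{environment.upper()}_{base}")
    if not value:
        raise CredentialError(f"no credential configured for {provider} in {environment}")
    return value


def redact(secret: str) -> str:
    """Mask a secret for log output, keeping only the last four characters."""
    return "****" + secret[-4:] if len(secret) > 8 else "****"
```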
12. Monitoring and Observability
Structured request/response logging: Log every model call with: timestamp, model ID, task type, prompt token count, completion token count, latency, status code. Do not log raw prompt content without PII scrubbing.
Per-model metrics dashboards: Track request volume, error rate, latency distribution, token usage, and cost per model.
Unified tracing: Use OpenTelemetry distributed tracing spanning multiple model calls within a single user request.
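Structured logging of a model call can be sketched with the standard `logging` and `json` modules. The record carries only the metadata fields listed above and never the raw prompt; the `trace_id` field is an added assumption for correlating calls within one user request and is not a substitute for OpenTelemetry spans.

```python
import json
import logging
from datetime import datetime, timezone


def log_model_call(logger, *, trace_id, model_id, task_type,
                   prompt_tokens, completion_tokens, latency_ms, status_code):
    """Emit one JSON log line per model call and return the record."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "trace_id": trace_id,  # hypothetical correlation ID shared across one request
        "model_id": model_id,
        "task_type": task_type,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "latency_ms": latency_ms,
        "status_code": status_code,
    }
    logger.info(json.dumps(record))
    return record
```

Because each line is a flat JSON object, per-model dashboards can aggregate on `model_id` and `task_type` without custom parsing.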