Build and evolve the multi-tenant inference ingress layer, including request normalization, tenant-aware adapter routing, token-aware admission control, and overload shedding decisions. Use when implementing or refactoring API handlers, rate-limit logic, capacity checks, queue handoff contracts, or gateway-level telemetry for LLM serving systems.
Implement gateway logic in this order:
1. Normalize each incoming request and extract its routing context (tenant_id, adapter_id, request_id, token budget).
2. Return 429 Too Many Requests for rate-limit or pressure rejections.
3. Return 503 Service Unavailable when the internal scheduler is unavailable.
4. Use scripts/token_estimator.py to sanity-check rough token estimates for prompt text.
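The admission flow above can be sketched as follows. This is a minimal illustration, not the repo's implementation: `RequestContext`, `AdmissionError`, `estimate_tokens`, and `admit` are hypothetical names, and the ~4-characters-per-token heuristic is a stand-in for the real `scripts/token_estimator.py` check.

```python
from dataclasses import dataclass


@dataclass
class RequestContext:
    """Routing context extracted during request normalization (illustrative)."""
    tenant_id: str
    adapter_id: str
    request_id: str
    token_budget: int


class AdmissionError(Exception):
    """Carries the HTTP status the gateway should return on rejection."""
    def __init__(self, status: int, reason: str):
        super().__init__(reason)
        self.status = status
        self.reason = reason


def estimate_tokens(prompt: str) -> int:
    # Rough heuristic (~4 chars/token); a real gateway would defer to
    # scripts/token_estimator.py for this sanity check.
    return max(1, len(prompt) // 4)


def admit(ctx: RequestContext, prompt: str,
          tenant_tokens_remaining: int, scheduler_up: bool) -> int:
    """Token-aware admission check; returns the estimated token cost
    or raises AdmissionError with the status the handler should emit."""
    if not scheduler_up:
        # Internal scheduler unavailable -> 503 Service Unavailable.
        raise AdmissionError(503, "scheduler unavailable")
    needed = estimate_tokens(prompt)
    if needed > ctx.token_budget or needed > tenant_tokens_remaining:
        # Rate-limit / pressure rejection -> 429 Too Many Requests.
        raise AdmissionError(429, "token budget exceeded")
    return needed
```

Keeping the rejection decision in one place makes the 429-vs-503 contract easy to test and keeps queue handoff and telemetry code from re-deriving status semantics.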