How to Build a Per-Tenant AI Agent Failover Routing Pipeline That Automatically Switches Between Competing Foundation Model Providers
If you run a multi-tenant LLM platform in 2026, you already know the pain: one provider spikes their token pricing at 2 AM, another throttles your highest-tier tenants during peak hours, and suddenly your SLA dashboard lights up like a Christmas tree. The naive solution is to hard-code a fallback provider. The real solution is a per-tenant, policy-driven failover routing pipeline that treats foundation model providers the way a smart load balancer treats upstream servers.
This tutorial walks you through building exactly that. By the end, you will have a working architecture that monitors provider health, detects capacity constraints and pricing anomalies in real time, and automatically reroutes each tenant's AI agent traffic to the best available provider, all without a single line of tenant-specific glue code in your hot path.
Why Per-Tenant Routing Matters (Not Just Global Failover)
Most teams start with a global fallback chain: try Provider A, then Provider B, then Provider C. This works until you realize that different tenants have radically different constraints:
- Data residency requirements: A European enterprise tenant may only be allowed to route to EU-hosted model endpoints.
- Model fidelity tiers: A free-tier tenant can tolerate a smaller, cheaper model on failover. An enterprise tenant cannot.
- Cost caps: Some tenants have hard monthly token budgets. Switching to a more expensive fallback provider could blow their cap before the billing cycle ends.
- Latency SLAs: Real-time voice agent tenants need sub-400ms first-token latency. Batch analytics tenants do not care.
A global failover chain treats all these tenants identically. A per-tenant routing pipeline respects each tenant's policy envelope and makes the right call for each one independently.
The Architecture at a Glance
Before diving into code, here is the high-level component map:
- Tenant Policy Store: A configuration database (Redis or a document store) holding each tenant's provider preferences, model tiers, cost caps, latency budgets, and data residency rules.
- Provider Health Broker: A lightweight background service that continuously probes each foundation model provider for latency, error rate, and published pricing signals.
- Routing Decision Engine: The core of the system. It takes an incoming agent request, loads the tenant's policy, queries the Health Broker, and selects the optimal provider.
- LLM Gateway Proxy: A thin HTTP proxy that sits between your agent orchestration layer and the upstream providers. It enforces the routing decision and handles retries.
- Observability Pipeline: Streams routing events, cost deltas, and latency measurements into your data warehouse for post-hoc analysis and policy tuning.
Step 1: Define the Tenant Policy Schema
Everything starts with a well-defined policy schema. Store this in Redis as a JSON hash, keyed by tenant_id. Here is a Python dataclass that mirrors the schema:
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ProviderPreference:
    provider_id: str  # e.g. "openai", "anthropic", "mistral", "cohere"
    model_id: str  # e.g. "gpt-5", "claude-4-opus", "mistral-large-3"
    priority: int  # lower = higher priority
    allowed_regions: List[str] = field(default_factory=list)
    max_cost_per_1k_tokens: Optional[float] = None  # USD hard cap

@dataclass
class TenantRoutingPolicy:
    tenant_id: str
    tier: str  # "free", "pro", "enterprise"
    providers: List[ProviderPreference]
    max_monthly_token_budget: Optional[int] = None
    max_first_token_latency_ms: int = 2000
    allow_degraded_model_on_failover: bool = True
    require_data_residency: Optional[str] = None  # ISO country code or region
Load and cache this policy at the start of every request. Use a short TTL (30 to 60 seconds) so policy changes propagate quickly without hammering your store on every request.
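The gateway in Step 4 assumes a RedisPolicyStore with an async get_policy method. Here is a minimal sketch of that read path, with a short-TTL in-process cache and the fetch callable injected in place of a live redis.asyncio connection; a real implementation would deserialize into the TenantRoutingPolicy dataclass, while this sketch returns plain dicts to stay short.

```python
import json
import time
from typing import Awaitable, Callable, Optional

class RedisPolicyStore:
    """Short-TTL in-process cache over an async fetch. The fetch callable
    is injected (in production it would be a redis.asyncio GET on a key
    like policy:{tenant_id}); policies are stored as raw JSON strings."""

    def __init__(self, fetch: Callable[[str], Awaitable[Optional[str]]],
                 ttl_seconds: float = 30.0):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._cache: dict[str, tuple[float, Optional[dict]]] = {}

    async def get_policy(self, tenant_id: str) -> Optional[dict]:
        now = time.monotonic()
        hit = self._cache.get(tenant_id)
        if hit and now - hit[0] < self._ttl:
            return hit[1]  # fresh cache hit, no round trip to the store
        raw = await self._fetch(tenant_id)
        policy = json.loads(raw) if raw else None
        self._cache[tenant_id] = (now, policy)
        return policy

    def invalidate(self, tenant_id: str) -> None:
        # Hook for a pub/sub invalidation channel (see Common Pitfalls)
        self._cache.pop(tenant_id, None)
```

The injected fetch keeps the class testable without a running Redis instance.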
Step 2: Build the Provider Health Broker
The Health Broker runs as a separate async service. It maintains a rolling window of health metrics for each provider and exposes a simple in-memory snapshot that the Routing Engine can query in microseconds.
import asyncio
import time
from collections import deque
from dataclasses import dataclass, field

import httpx

PROBE_INTERVAL_SECONDS = 10
WINDOW_SIZE = 12  # last 2 minutes at 10s intervals

@dataclass
class ProviderHealthSnapshot:
    provider_id: str
    is_available: bool
    p95_latency_ms: float
    error_rate_pct: float
    current_price_per_1k_tokens: float  # pulled from provider pricing API or webhook
    last_updated: float = field(default_factory=time.time)

class ProviderHealthBroker:
    def __init__(self, providers: list[dict]):
        self.providers = providers
        self._snapshots: dict[str, ProviderHealthSnapshot] = {}
        self._latency_windows: dict[str, deque] = {
            p["id"]: deque(maxlen=WINDOW_SIZE) for p in providers
        }

    async def probe_provider(self, provider: dict):
        start = time.monotonic()
        try:
            async with httpx.AsyncClient(timeout=5.0) as client:
                resp = await client.post(
                    provider["health_endpoint"],
                    json={"prompt": "ping", "max_tokens": 1},
                    headers={"Authorization": f"Bearer {provider['api_key']}"},
                )
            latency_ms = (time.monotonic() - start) * 1000
            self._latency_windows[provider["id"]].append(latency_ms)
            is_available = resp.status_code == 200
            sorted_latencies = sorted(self._latency_windows[provider["id"]])
            p95_idx = int(len(sorted_latencies) * 0.95)
            p95 = sorted_latencies[p95_idx] if sorted_latencies else 9999.0
            self._snapshots[provider["id"]] = ProviderHealthSnapshot(
                provider_id=provider["id"],
                is_available=is_available,
                p95_latency_ms=p95,
                error_rate_pct=0.0 if is_available else 100.0,
                current_price_per_1k_tokens=provider.get("last_known_price", 0.01),
            )
        except Exception:
            self._snapshots[provider["id"]] = ProviderHealthSnapshot(
                provider_id=provider["id"],
                is_available=False,
                p95_latency_ms=9999.0,
                error_rate_pct=100.0,
                current_price_per_1k_tokens=9999.0,
            )

    async def run(self):
        while True:
            await asyncio.gather(*[self.probe_provider(p) for p in self.providers])
            await asyncio.sleep(PROBE_INTERVAL_SECONDS)

    def get_snapshot(self, provider_id: str) -> ProviderHealthSnapshot | None:
        return self._snapshots.get(provider_id)

    def invalidate(self, provider_id: str) -> None:
        # Drop the cached snapshot so the next probe or routing decision
        # re-evaluates the provider; used by the pricing webhook in Step 5.
        self._snapshots.pop(provider_id, None)
For pricing signals, integrate with each provider's pricing webhook or poll their pricing API, where one is available. Providers that expose machine-readable pricing endpoints make this straightforward; for the rest, maintain a price table that you update from their published rate cards. Either way, update last_known_price in your provider config whenever a change is detected.
Step 3: Build the Routing Decision Engine
This is the heart of the system. Given a tenant policy and a set of provider health snapshots, it returns an ordered list of candidate providers to try.
from typing import Optional

class RoutingDecisionEngine:
    def __init__(self, health_broker: ProviderHealthBroker):
        self.broker = health_broker

    def rank_providers(
        self,
        policy: TenantRoutingPolicy,
        current_token_spend_usd: float = 0.0,  # used by the budget-aware re-ranking in Step 6
    ) -> list[ProviderPreference]:
        candidates = []
        for pref in sorted(policy.providers, key=lambda p: p.priority):
            snapshot = self.broker.get_snapshot(pref.provider_id)
            if snapshot is None or not snapshot.is_available:
                continue  # provider is down, skip
            # Data residency check
            if policy.require_data_residency:
                if policy.require_data_residency not in pref.allowed_regions:
                    continue
            # Latency SLA check
            if snapshot.p95_latency_ms > policy.max_first_token_latency_ms:
                continue
            # Pricing spike check
            if pref.max_cost_per_1k_tokens is not None:
                if snapshot.current_price_per_1k_tokens > pref.max_cost_per_1k_tokens:
                    continue  # pricing spike detected, skip this provider
            candidates.append(pref)
        return candidates

    def select_provider(
        self,
        policy: TenantRoutingPolicy,
        current_token_spend_usd: float = 0.0,
    ) -> Optional[ProviderPreference]:
        ranked = self.rank_providers(policy, current_token_spend_usd)
        return ranked[0] if ranked else None
Notice the pricing spike check: if a provider's real-time price per 1,000 tokens exceeds the tenant's configured cap, that provider is excluded from consideration entirely. This is what prevents a pricing anomaly at one provider from silently burning through a tenant's budget before anyone notices.
Step 4: Wire Up the LLM Gateway Proxy
Use FastAPI to build the gateway. It receives agent requests, resolves the tenant policy, runs the routing engine, and proxies the call upstream. It also implements sequential failover: if the top-ranked provider fails, the gateway falls through to the next candidate in the ranked list.
import httpx
from fastapi import FastAPI, HTTPException, Request

app = FastAPI()

# Assume these are initialized at startup
health_broker: ProviderHealthBroker = ...
routing_engine: RoutingDecisionEngine = ...
policy_store: RedisPolicyStore = ...
token_spend_tracker: TokenSpendTracker = ...

PROVIDER_ENDPOINTS = {
    "openai": "https://api.openai.com/v1/chat/completions",
    "anthropic": "https://api.anthropic.com/v1/messages",
    "mistral": "https://api.mistral.ai/v1/chat/completions",
    "cohere": "https://api.cohere.ai/v2/chat",
}

@app.post("/v1/agent/complete")
async def agent_complete(request: Request):
    body = await request.json()
    tenant_id = request.headers.get("X-Tenant-ID")
    if not tenant_id:
        raise HTTPException(status_code=400, detail="X-Tenant-ID header required")

    policy = await policy_store.get_policy(tenant_id)
    if not policy:
        raise HTTPException(status_code=404, detail="Tenant policy not found")

    current_spend = await token_spend_tracker.get_monthly_spend_usd(tenant_id)
    ranked_providers = routing_engine.rank_providers(policy, current_spend)
    if not ranked_providers:
        raise HTTPException(status_code=503, detail="No eligible providers available")

    last_error = None
    for provider_pref in ranked_providers:
        try:
            endpoint = PROVIDER_ENDPOINTS[provider_pref.provider_id]
            upstream_body = adapt_request_body(body, provider_pref)
            async with httpx.AsyncClient(timeout=30.0) as client:
                resp = await client.post(
                    endpoint,
                    json=upstream_body,
                    headers=build_auth_headers(provider_pref.provider_id),
                )
            if resp.status_code == 200:
                result = resp.json()
                await token_spend_tracker.record_usage(tenant_id, provider_pref.provider_id, result)
                return result
            last_error = f"{provider_pref.provider_id} returned {resp.status_code}"
        except httpx.TimeoutException:
            last_error = f"{provider_pref.provider_id} timed out"
            continue
        except Exception as e:
            last_error = str(e)
            continue

    raise HTTPException(status_code=503, detail=f"All providers failed. Last error: {last_error}")
The adapt_request_body function is important: each provider has a slightly different request schema. Build a thin adapter for each one that translates your canonical agent request format into the provider-specific format. This is the only place where provider-specific logic lives, keeping your routing engine clean and provider-agnostic.
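As a rough illustration of what such adapters might look like, here is a sketch of adapt_request_body and build_auth_headers. The field names and the environment-variable key scheme are assumptions to verify against each provider's current API reference, not a definitive mapping.

```python
import os

def adapt_request_body(body: dict, pref) -> dict:
    """Translate the canonical agent request into a provider-specific body.
    pref is a ProviderPreference-like object with provider_id and model_id."""
    messages = body.get("messages", [])
    max_tokens = body.get("max_tokens", 1024)
    if pref.provider_id == "anthropic":
        # Anthropic's Messages API takes the system prompt as a top-level
        # field rather than as a "system" role message
        system = [m["content"] for m in messages if m.get("role") == "system"]
        chat = [m for m in messages if m.get("role") != "system"]
        out = {"model": pref.model_id, "messages": chat, "max_tokens": max_tokens}
        if system:
            out["system"] = system[0]
        return out
    # OpenAI-compatible chat-completions shape (openai, mistral, ...)
    return {"model": pref.model_id, "messages": messages, "max_tokens": max_tokens}

def build_auth_headers(provider_id: str) -> dict:
    """Look up the provider's API key from the environment (an assumption;
    a real deployment would pull from a secrets manager)."""
    key = os.environ.get(f"{provider_id.upper()}_API_KEY", "")
    if provider_id == "anthropic":
        # Anthropic authenticates with x-api-key plus a version header
        return {"x-api-key": key, "anthropic-version": "2023-06-01"}
    return {"Authorization": f"Bearer {key}"}
```

Adding a new provider means adding one branch here and one entry in PROVIDER_ENDPOINTS; nothing else in the pipeline changes.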
Step 5: Handle Pricing Spike Detection Proactively
Polling provider health endpoints every 10 seconds is good, but pricing spikes can be faster than your polling interval. Implement a webhook receiver so providers can push pricing change events to you in real time:
@app.post("/webhooks/pricing-update")
async def handle_pricing_update(request: Request):
    payload = await request.json()
    provider_id = payload.get("provider_id")
    new_price = payload.get("price_per_1k_tokens")
    if provider_id and new_price is not None:
        # Update the provider config immediately (provider_config_store is
        # assumed initialized at startup, like the other components above)
        await provider_config_store.update_price(provider_id, new_price)
        # Invalidate the health broker snapshot to force a re-evaluation
        health_broker.invalidate(provider_id)
    return {"status": "acknowledged"}
Not all providers support outbound webhooks yet. For those that do not, implement a price-change poller that checks their pricing API every 60 seconds and compares against the last known value. A delta of more than 15 to 20 percent in either direction should trigger an immediate snapshot invalidation and a re-routing evaluation for all active tenants on that provider.
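A sketch of that poller, with the price fetch and the spike handler injected as callables so the loop stays provider-agnostic. The names fetch_price and on_spike are hypothetical, not part of any provider SDK; in practice on_spike would call health_broker.invalidate and trigger re-routing.

```python
import asyncio

PRICE_DELTA_THRESHOLD_PCT = 15.0  # re-evaluate routing on moves larger than this

def price_delta_pct(old: float, new: float) -> float:
    """Percentage change between the last known and freshly polled price."""
    if old <= 0:
        return float("inf")
    return abs(new - old) / old * 100.0

async def poll_prices(providers: list[dict], fetch_price, on_spike,
                      interval_s: float = 60.0, once: bool = False):
    """Poll each provider's price and fire on_spike(provider_id, new_price)
    whenever the move exceeds the threshold in either direction."""
    while True:
        for p in providers:
            new_price = await fetch_price(p)  # e.g. GET the pricing API
            old_price = p.get("last_known_price", new_price)
            if price_delta_pct(old_price, new_price) > PRICE_DELTA_THRESHOLD_PCT:
                await on_spike(p["id"], new_price)  # invalidate snapshot, re-rank
            p["last_known_price"] = new_price
        if once:  # single pass, useful for testing
            return
        await asyncio.sleep(interval_s)
```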
Step 6: Track Token Spend Per Tenant in Real Time
Without real-time spend tracking, your cost-cap enforcement is blind. Use Redis with atomic increment operations to track token usage as it happens:
from datetime import datetime, timezone

import redis.asyncio as aioredis

class TokenSpendTracker:
    def __init__(self, redis_client: aioredis.Redis, price_per_1k_map: dict):
        self.redis = redis_client
        self.prices = price_per_1k_map

    def _monthly_key(self, tenant_id: str) -> str:
        # datetime.utcnow() is deprecated; use an aware UTC timestamp
        month = datetime.now(timezone.utc).strftime("%Y-%m")
        return f"spend:{tenant_id}:{month}"

    async def record_usage(self, tenant_id: str, provider_id: str, response: dict):
        usage = response.get("usage", {})
        total_tokens = usage.get("total_tokens", 0)
        price = self.prices.get(provider_id, 0.01)
        cost_usd = (total_tokens / 1000) * price
        key = self._monthly_key(tenant_id)
        await self.redis.incrbyfloat(key, cost_usd)
        # Set TTL to 35 days so old keys self-clean
        await self.redis.expire(key, 60 * 60 * 24 * 35)

    async def get_monthly_spend_usd(self, tenant_id: str) -> float:
        key = self._monthly_key(tenant_id)
        val = await self.redis.get(key)
        return float(val) if val else 0.0
Feed this spend figure into the routing engine. If a tenant is within 5 percent of their monthly budget, the engine can automatically deprioritize expensive providers and favor cheaper ones, even if the cheaper provider has slightly higher latency.
Step 7: Build the Observability Pipeline
Every routing decision should emit a structured event. Stream these to your data warehouse (ClickHouse, BigQuery, or Snowflake all work well here) for analysis:
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone

@dataclass
class RoutingEvent:
    tenant_id: str
    request_id: str
    selected_provider: str
    selected_model: str
    skipped_providers: list[str]
    skip_reasons: list[str]
    first_token_latency_ms: float
    total_tokens: int
    cost_usd: float
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

async def emit_routing_event(event: RoutingEvent, kafka_producer):
    await kafka_producer.send(
        topic="llm_routing_events",
        value=json.dumps(asdict(event)).encode("utf-8"),
        key=event.tenant_id.encode("utf-8"),
    )
With this data flowing, you can build dashboards that answer questions like: Which tenants are triggering the most failovers? Which provider is causing the most pricing-related skips? What is the average cost delta when a tenant falls back to their secondary provider? These insights directly inform your policy tuning and contract negotiations with providers.
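In production these questions are SQL queries against the warehouse, but a minimal in-process equivalent shows the shape of the first one, treating any request with a non-empty skipped_providers list as a failover (a simplifying assumption for illustration):

```python
from collections import Counter

def failover_counts(events: list[dict]) -> Counter:
    """Count requests where at least one provider was skipped before
    selection, i.e. the tenant did not get their first-choice provider."""
    return Counter(e["tenant_id"] for e in events if e.get("skipped_providers"))
```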
Step 8: Add a Circuit Breaker Per Provider Per Tenant
A circuit breaker prevents the system from hammering a degraded provider on every request. Implement a simple state machine with three states: Closed (normal), Open (provider excluded), and Half-Open (testing recovery):
import time

class CircuitBreaker:
    FAILURE_THRESHOLD = 5
    RECOVERY_TIMEOUT_SECONDS = 60

    def __init__(self):
        self._failures: dict[str, int] = {}
        self._opened_at: dict[str, float] = {}
        self._state: dict[str, str] = {}  # "closed", "open", "half_open"

    def _key(self, tenant_id: str, provider_id: str) -> str:
        return f"{tenant_id}:{provider_id}"

    def is_allowed(self, tenant_id: str, provider_id: str) -> bool:
        key = self._key(tenant_id, provider_id)
        state = self._state.get(key, "closed")
        if state == "open":
            if time.monotonic() - self._opened_at.get(key, 0) > self.RECOVERY_TIMEOUT_SECONDS:
                self._state[key] = "half_open"  # let a trial request through
                return True
            return False
        return True  # closed and half_open both admit traffic

    def record_success(self, tenant_id: str, provider_id: str):
        key = self._key(tenant_id, provider_id)
        self._failures[key] = 0
        self._state[key] = "closed"

    def record_failure(self, tenant_id: str, provider_id: str):
        key = self._key(tenant_id, provider_id)
        if self._state.get(key) == "half_open":
            # A failed trial request reopens the breaker immediately
            self._state[key] = "open"
            self._opened_at[key] = time.monotonic()
            return
        self._failures[key] = self._failures.get(key, 0) + 1
        if self._failures[key] >= self.FAILURE_THRESHOLD:
            self._state[key] = "open"
            self._opened_at[key] = time.monotonic()
Integrate the circuit breaker check into your routing engine's rank_providers method. If the breaker is open for a given tenant-provider pair, that provider is excluded from the candidate list for that tenant, even if it looks healthy globally.
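One way to do that integration, sketched as a duck-typed wrapper so it composes with any engine exposing rank_providers; the class name is a hypothetical choice rather than part of the architecture above, and you could equally inline the check in rank_providers itself.

```python
class BreakerAwareRanker:
    """Wraps any engine exposing rank_providers(policy, spend) and filters
    its output through the per-tenant circuit breaker from Step 8."""

    def __init__(self, engine, breaker):
        self.engine = engine
        self.breaker = breaker

    def rank_providers(self, policy, current_token_spend_usd: float = 0.0):
        ranked = self.engine.rank_providers(policy, current_token_spend_usd)
        # Exclude providers whose circuit is open for this specific tenant,
        # even if they look healthy globally
        return [
            pref for pref in ranked
            if self.breaker.is_allowed(policy.tenant_id, pref.provider_id)
        ]
```

The wrapper keeps breaker state out of the engine, so the engine stays a pure function of policy plus health snapshots.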
Putting It All Together: The Request Lifecycle
Here is the complete flow for a single agent completion request:
- The AI agent sends a POST to /v1/agent/complete with the X-Tenant-ID header.
- The gateway fetches the tenant's TenantRoutingPolicy from Redis (cached with a 30-second TTL).
- The gateway fetches the tenant's current monthly spend from the TokenSpendTracker.
- The RoutingDecisionEngine ranks all providers in the tenant's policy against live health snapshots, filtering out unavailable, latency-violating, pricing-spiked, and circuit-broken providers.
- The gateway attempts the top-ranked provider. On success, it records usage and emits a routing event.
- On failure (timeout, 429, 5xx), the circuit breaker records a failure, and the gateway tries the next ranked provider.
- If all providers fail, the gateway returns a 503 with a detailed error message and emits a "full failover exhaustion" alert.
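Condensed into one function with injected dependencies, the lifecycle looks like the sketch below. This is a testability-oriented restatement, not a drop-in replacement for the FastAPI handler in Step 4; deps stands in for the concrete policy store, spend tracker, routing engine, circuit breaker, and provider client.

```python
import asyncio

async def handle_request(tenant_id: str, body: dict, deps):
    """Run the full per-tenant routing lifecycle and return (status, payload)."""
    policy = await deps.get_policy(tenant_id)
    if policy is None:
        return 404, "Tenant policy not found"
    spend = await deps.get_monthly_spend_usd(tenant_id)
    ranked = deps.rank_providers(policy, spend)
    if not ranked:
        return 503, "No eligible providers available"
    last_error = None
    for pref in ranked:
        ok, result = await deps.call_provider(pref, body)
        if ok:
            deps.record_success(tenant_id, pref.provider_id)
            return 200, result
        # Feed the breaker and fall through to the next candidate
        deps.record_failure(tenant_id, pref.provider_id)
        last_error = result
    return 503, f"All providers failed. Last error: {last_error}"
```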
Common Pitfalls to Avoid
- Thundering herd on recovery: When a provider comes back online after an outage, do not route all tenants to it simultaneously. Use a jittered half-open window that gradually increases traffic to the recovered provider over 2 to 5 minutes.
- Stale policy cache during incidents: During an active incident, operators need to update tenant policies and have those changes take effect within seconds. Implement a pub/sub invalidation channel in Redis so policy updates are pushed to all gateway instances immediately.
- Ignoring model capability drift on failover: If a tenant's primary model is a 200B-parameter frontier model and their failover is a 7B model, the output quality difference may be unacceptable for their use case. Always validate that your failover model meets the tenant's minimum capability tier before enabling it.
- Not accounting for streaming: If your agents use streaming responses, your proxy must handle chunked transfer encoding correctly and be able to abandon a stream mid-flight and retry on a different provider without corrupting the agent's context.
Conclusion
Building a per-tenant AI agent failover routing pipeline is one of the highest-leverage infrastructure investments you can make in 2026 if you operate a multi-tenant LLM platform. The foundation model provider landscape is more competitive than ever, but that competition comes with volatility: pricing spikes, capacity crunches, and regional outages are facts of life.
The architecture described here gives you a system that is proactive rather than reactive. Instead of waiting for a tenant to file a support ticket because their agent started timing out, your routing engine detects the degradation in real time and silently moves that tenant to a healthy, policy-compliant provider before they ever notice a problem. Your observability pipeline then gives you the data you need to tune policies, renegotiate provider contracts, and continuously improve your platform's resilience.
Start with Steps 1 through 4 to get a working baseline, then layer in pricing spike detection, circuit breakers, and the observability pipeline as your platform scales. The investment compounds quickly: every new provider you onboard multiplies the resilience of every tenant on your platform simultaneously.