How to Build a Per-Tenant AI Agent Graceful Degradation Pipeline for Multi-Tenant LLM Platforms in 2026

It's 2:47 AM. Your on-call phone buzzes. OpenAI, Anthropic, or one of the newer frontier model providers has just gone dark. Your multi-tenant LLM platform serves 3,000 paying customers, and every single one of them is about to hit a wall of 503 errors. Your enterprise clients have SLAs. Your support queue is already filling up. And your churn dashboard is about to become your worst nightmare.

This scenario is no longer hypothetical. In 2026, model provider outages are a routine operational hazard for any team running a serious multi-tenant AI platform. The question is no longer if an upstream provider will go down; it's whether your architecture is ready when it does.

This guide walks you through building a per-tenant graceful degradation pipeline for AI agents, one that keeps your highest-value customers served, maintains trust with mid-tier tenants, and fails gracefully for everyone else, all without a single manual intervention from your ops team.

Why "One-Size-Fits-All" Fallback Is the Wrong Approach

Most engineering teams bolt on a basic fallback strategy as an afterthought: if Provider A fails, route to Provider B. Done. But this blunt instrument creates serious problems in a multi-tenant context:

  • Cost blowout: Routing all tenants to a premium backup provider during an outage can spike your infrastructure bill by 300-400% in a matter of minutes.
  • SLA inequality: Your $5/month hobbyist tier should not consume the same fallback resources as your $50,000/month enterprise tenant.
  • Capability mismatch: A tenant whose agent relies on 128K-context reasoning cannot be silently downgraded to a 4K-context model without breaking their workflows.
  • Compliance violations: Some enterprise tenants have data residency or model-specific contractual requirements. Blindly rerouting them could violate those agreements.

The solution is a per-tenant degradation policy engine that knows exactly what each tenant is allowed to fall back to, in what order, and with what behavioral guardrails. Let's build it.

Step 1: Define Your Tenant Degradation Profile Schema

Every tenant in your system needs a degradation profile stored alongside their configuration. This is the contract your pipeline will honor when things go wrong. Here's a practical schema using JSON:

{
  "tenant_id": "acme-corp-001",
  "tier": "enterprise",
  "primary_model": {
    "provider": "openai",
    "model_id": "gpt-5",
    "context_window": 128000
  },
  "degradation_policy": {
    "max_degradation_level": 3,
    "levels": [
      {
        "level": 1,
        "provider": "anthropic",
        "model_id": "claude-4-sonnet",
        "context_window": 100000,
        "notify_tenant": false,
        "allowed_use_cases": ["all"]
      },
      {
        "level": 2,
        "provider": "google",
        "model_id": "gemini-2-flash",
        "context_window": 32000,
        "notify_tenant": true,
        "allowed_use_cases": ["summarization", "classification", "qa"],
        "blocked_use_cases": ["code_generation", "long_form_writing"]
      },
      {
        "level": 3,
        "provider": "internal",
        "model_id": "llama-3-70b-self-hosted",
        "context_window": 8000,
        "notify_tenant": true,
        "queue_non_critical": true,
        "allowed_use_cases": ["classification", "qa"]
      }
    ],
    "hard_stop_on_no_fallback": false,
    "queue_on_hard_stop": true,
    "max_queue_duration_minutes": 30
  },
  "compliance": {
    "data_residency": "us-east",
    "allowed_providers": ["openai", "anthropic", "google", "internal"],
    "pii_scrubbing_required": true
  }
}

A few key design decisions embedded in this schema are worth highlighting. The max_degradation_level field lets you cap how far a tenant can fall. A free-tier tenant might only have level 1 available, while an enterprise tenant gets the full ladder. The allowed_use_cases array prevents your pipeline from silently routing a code-generation task to a model that handles it poorly. And the compliance block ensures provider selection never violates contractual obligations.
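Validating a profile at write time, before an outage exposes a misconfiguration, is cheap insurance. A minimal sketch against the schema above (`ProfileValidationError` and the function name are illustrative, not part of any framework):

```python
class ProfileValidationError(ValueError):
    """Raised when a tenant degradation profile is internally inconsistent."""


def validate_degradation_profile(profile: dict) -> None:
    policy = profile["degradation_policy"]
    levels = policy["levels"]
    allowed = set(profile["compliance"]["allowed_providers"])

    # Levels must be numbered 1..N in order and fit under the configured cap
    if [lvl["level"] for lvl in levels] != list(range(1, len(levels) + 1)):
        raise ProfileValidationError("levels must be numbered 1..N in order")
    if len(levels) > policy["max_degradation_level"]:
        raise ProfileValidationError("more levels than max_degradation_level")

    # Every fallback provider must be permitted by the compliance block
    for lvl in levels:
        if lvl["provider"] not in allowed:
            raise ProfileValidationError(
                f"level {lvl['level']} uses disallowed provider "
                f"{lvl['provider']}"
            )
```

Run this on every profile write so a plan change or compliance edit can never persist a ladder the router would have to reject at request time.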

Step 2: Build the Health Monitor with a Per-Provider Circuit Breaker

Before your degradation pipeline can act, it needs accurate, low-latency signal about provider health. A naive polling approach introduces too much lag. Instead, combine two mechanisms: a passive circuit breaker that trips on live request failures, and an active synthetic probe that runs lightweight health checks in the background.

The Circuit Breaker State Machine

Implement a standard three-state circuit breaker for each provider: Closed (healthy, requests flow normally), Open (unhealthy, requests are blocked), and Half-Open (recovering, limited probe requests allowed).

import time


class ProviderCircuitBreaker:
    def __init__(self, provider_id, failure_threshold=5,
                 recovery_timeout_seconds=60, probe_count=2):
        self.provider_id = provider_id
        self.state = "CLOSED"
        self.failure_count = 0
        self.success_count = 0
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout_seconds
        self.probe_count = probe_count
        self.last_failure_time = None

    def record_success(self):
        if self.state == "HALF_OPEN":
            self.success_count += 1
            if self.success_count >= self.probe_count:
                self._transition("CLOSED")
        elif self.state == "CLOSED":
            self.failure_count = 0

    def record_failure(self, error_type):
        # Only trip on hard failures, not rate limits or content policy
        if error_type in ["timeout", "connection_error", "server_error_5xx"]:
            self.failure_count += 1
            self.last_failure_time = time.time()
            # A single failure during recovery re-opens the circuit;
            # otherwise wait for the failure threshold
            if (self.state == "HALF_OPEN"
                    or self.failure_count >= self.failure_threshold):
                self._transition("OPEN")

    def is_available(self):
        if self.state == "OPEN":
            if time.time() - self.last_failure_time > self.recovery_timeout:
                self._transition("HALF_OPEN")
                return True
            return False
        return True

    def _transition(self, new_state):
        self.state = new_state
        self.failure_count = 0
        self.success_count = 0
        publish_event("circuit_breaker_transition", {
            "provider": self.provider_id,
            "new_state": new_state
        })

Notice that record_failure intentionally ignores rate limit errors and content policy rejections. These are not provider outages; they are expected operational signals that should be handled separately, not by triggering a degradation cascade for all tenants.
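The `classify_error` helper can be a plain mapping from exception types and status codes onto the categories the breaker cares about. A sketch, assuming your provider clients raise exceptions that may carry a `status_code` attribute (adapt the mapping to your actual SDKs):

```python
def classify_error(exc: Exception) -> str:
    """Map a provider exception onto a coarse error category.

    Only timeout, connection_error, and server_error_5xx trip the
    circuit breaker; rate limits and client errors are surfaced to
    other handlers but never counted as provider outages.
    """
    status = getattr(exc, "status_code", None)
    if isinstance(exc, TimeoutError):
        return "timeout"
    if isinstance(exc, ConnectionError):
        return "connection_error"
    if status is not None:
        if status == 429:
            return "rate_limited"
        if status >= 500:
            return "server_error_5xx"
        if status in (400, 403):
            return "client_error"
    return "unknown"
```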

The Synthetic Health Probe

Run a lightweight background job every 15 seconds that sends a minimal, fixed prompt to each provider. Keep the prompt cheap (a one-token completion works fine) and track P95 latency alongside availability. This gives you early warning before your circuit breakers trip on real customer traffic.

import time


async def synthetic_probe(provider_id: str, client):
    probe_payload = {
        "model": PROBE_MODELS[provider_id],
        "messages": [{"role": "user", "content": "Reply with the word OK."}],
        "max_tokens": 5
    }
    try:
        start = time.monotonic()
        response = await client.complete(probe_payload, timeout=5.0)
        latency_ms = (time.monotonic() - start) * 1000
        await metrics.record("probe_latency_ms", latency_ms,
                             tags={"provider": provider_id})
        circuit_breakers[provider_id].record_success()
    except Exception as e:
        circuit_breakers[provider_id].record_failure(classify_error(e))
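
The probe itself needs a driver. One workable sketch is a long-lived asyncio task per provider (`run_probe_loop` and its parameters are illustrative; wire in the `synthetic_probe` above as `probe_fn`):

```python
import asyncio


async def run_probe_loop(provider_id: str, client, probe_fn,
                         interval_seconds: float = 15.0):
    """Drive a synthetic probe on a fixed cadence, one task per provider."""
    loop = asyncio.get_running_loop()
    # Stagger start so all providers aren't probed at the same instant
    await asyncio.sleep((hash(provider_id) % 1000) / 1000 * interval_seconds)
    while True:
        started = loop.time()
        try:
            await probe_fn(provider_id, client)
        except Exception:
            # The probe records its own failures; never let one kill the loop
            pass
        elapsed = loop.time() - started
        await asyncio.sleep(max(0.0, interval_seconds - elapsed))
```

Spawn one task per provider at startup (`asyncio.create_task(run_probe_loop(...))`) and cancel them on shutdown.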

Step 3: Build the Per-Tenant Degradation Router

This is the heart of the system. The degradation router intercepts every outbound LLM call, checks provider health, and applies the correct tenant policy to select the best available model.

class TenantDegradationRouter:
    def __init__(self, tenant_profile_store, circuit_breakers, metrics_client):
        self.profiles = tenant_profile_store
        self.breakers = circuit_breakers
        self.metrics = metrics_client

    async def route(self, tenant_id: str, request: AgentRequest) -> AgentResponse:
        profile = await self.profiles.get(tenant_id)
        primary = profile["primary_model"]
        policy = profile["degradation_policy"]

        # Try primary provider first
        if self.breakers[primary["provider"]].is_available():
            try:
                response = await self._call_provider(primary, request)
                await self.metrics.increment("requests_served_primary",
                                             tags={"tenant": tenant_id})
                return response
            except Exception as e:
                self.breakers[primary["provider"]].record_failure(
                    classify_error(e)
                )

        # Walk the degradation ladder
        for level in policy["levels"]:
            provider_id = level["provider"]

            if not self.breakers[provider_id].is_available():
                continue

            if not self._use_case_allowed(request.use_case, level):
                continue

            if not self._compliance_check(level["provider"],
                                          profile["compliance"]):
                continue

            # Context window guard: queue, truncate, or skip oversized requests
            if request.estimated_tokens > level["context_window"]:
                if level.get("queue_non_critical") and not request.is_critical:
                    return await self._queue_request(tenant_id, request,
                                                     level["level"])
                try:
                    request = self._truncate_context(request,
                                                     level["context_window"])
                except ContextTooLargeForFallbackError:
                    # Even the preserved core exceeds this window; try the
                    # next level rather than failing the whole request
                    continue

            try:
                response = await self._call_provider(level, request)
                response.degradation_level = level["level"]
                response.degraded = True

                if level.get("notify_tenant"):
                    await self._notify_tenant(tenant_id, level)

                await self.metrics.increment("requests_served_degraded",
                    tags={"tenant": tenant_id, "level": level["level"]})
                return response

            except Exception as e:
                self.breakers[provider_id].record_failure(classify_error(e))
                continue

        # All fallbacks exhausted
        return await self._handle_hard_stop(tenant_id, request, policy)

    def _use_case_allowed(self, use_case: str, level: dict) -> bool:
        if "all" in level.get("allowed_use_cases", []):
            return True
        if use_case in level.get("blocked_use_cases", []):
            return False
        return use_case in level.get("allowed_use_cases", [])

    def _compliance_check(self, provider_id: str, compliance: dict) -> bool:
        return provider_id in compliance.get("allowed_providers", [])

Step 4: Handle the Hard Stop Gracefully

When all fallback levels are exhausted, you have two options: fail fast with a meaningful error, or queue the request for deferred execution. Your tenant profile already encodes which behavior applies. Here's how to implement the queueing path:

async def _handle_hard_stop(self, tenant_id, request, policy):
    if policy.get("queue_on_hard_stop"):
        max_wait = policy.get("max_queue_duration_minutes", 15)
        job_id = await self.queue.enqueue(
            tenant_id=tenant_id,
            request=request,
            ttl_minutes=max_wait,
            priority=self._get_tenant_priority(tenant_id)
        )
        await self._notify_tenant_queued(tenant_id, job_id, max_wait)
        return AgentResponse(
            status="queued",
            job_id=job_id,
            message=f"Your request is queued and will be processed "
                    f"within {max_wait} minutes once service is restored.",
            degraded=True
        )
    else:
        await self.metrics.increment("requests_hard_stopped",
                                     tags={"tenant": tenant_id})
        raise ServiceUnavailableError(
            tenant_id=tenant_id,
            message="All model providers are currently unavailable. "
                    "Please retry in a few minutes.",
            retry_after_seconds=120
        )

The priority queue is critical here. Enterprise tenants should sit at the top of the deferred execution queue so that when providers recover, their backlogged requests are processed first. A simple priority scoring function based on tenant tier and request age works well for most platforms.
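That scoring function can be as small as a tier weight plus an age bonus. A sketch with hypothetical tier weights (tune them to your own plan structure):

```python
import time

# Hypothetical tier weights; tier dominates, age breaks ties within a tier
TIER_WEIGHTS = {"enterprise": 1000.0, "pro": 100.0, "free": 1.0}


def queue_priority(tier, enqueued_at, now=None):
    """Score a queued request: higher = drained sooner on recovery.

    Age accrues one point per minute so old low-tier requests
    are not starved forever.
    """
    now = time.time() if now is None else now
    age_minutes = max(0.0, (now - enqueued_at) / 60.0)
    return TIER_WEIGHTS.get(tier, 1.0) + age_minutes
```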

Step 5: Add Context-Aware Truncation (Not Just Naive Trimming)

When a fallback model has a smaller context window than the primary, you cannot simply chop the prompt at a character limit. For AI agents, this is especially dangerous because you may silently drop tool call history, system instructions, or critical conversation turns.

Implement a structured truncation strategy that respects message roles and agent memory importance scores:

def _truncate_context(self, request: AgentRequest,
                      max_tokens: int) -> AgentRequest:
    # Always preserve: system prompt, latest user message, tool definitions
    preserved = (
        request.system_prompt_tokens
        + request.latest_user_message_tokens
        + request.tool_definition_tokens
    )

    if preserved > max_tokens:
        # Cannot serve this request at this degradation level
        raise ContextTooLargeForFallbackError(
            required=preserved, available=max_tokens
        )

    budget = max_tokens - preserved
    # Fill remaining budget with conversation history,
    # prioritized by recency and importance score
    trimmed_history = self._trim_history(
        request.conversation_history,
        token_budget=budget
    )
    return request.with_history(trimmed_history)

def _trim_history(self, history, token_budget):
    # Sort by importance score descending, then recency descending
    scored = sorted(history,
                    key=lambda m: (m.importance_score, m.turn_index),
                    reverse=True)
    selected, used = [], 0
    for message in scored:
        if used + message.token_count <= token_budget:
            selected.append(message)
            used += message.token_count
    # Re-sort selected messages by turn index for coherent ordering
    return sorted(selected, key=lambda m: m.turn_index)

Step 6: Instrument Everything with Per-Tenant Observability

A degradation pipeline that runs silently is an ops team's nightmare. You need rich, per-tenant telemetry so you can answer questions like: "How many of Acme Corp's requests were degraded in the last hour?" and "Which tenants hit a hard stop during the outage?"

Emit structured events for every routing decision:

{
  "event_type": "routing_decision",
  "timestamp": "2026-03-15T02:47:33Z",
  "tenant_id": "acme-corp-001",
  "tenant_tier": "enterprise",
  "request_id": "req_abc123",
  "use_case": "code_generation",
  "primary_provider": "openai",
  "primary_available": false,
  "degradation_level_selected": 1,
  "fallback_provider": "anthropic",
  "fallback_model": "claude-4-sonnet",
  "context_truncated": false,
  "tenant_notified": false,
  "latency_added_ms": 12,
  "outcome": "served_degraded"
}

Feed these events into your observability stack (Datadog, Grafana, or a purpose-built AI ops platform) and build dashboards that surface the following key metrics:

  • Degradation rate by tenant tier: Are enterprise tenants experiencing degradation disproportionately?
  • Fallback provider utilization: Is your Anthropic backup approaching its rate limits?
  • Hard stop rate: How many requests fell off the bottom of the ladder?
  • Queue depth and age: How long are queued requests waiting for provider recovery?
  • Context truncation frequency: Which tenants are regularly pushing context limits that become problematic during fallback?

Step 7: Wire Up Tenant Notifications Thoughtfully

Transparency during degradation is a trust-building opportunity, not just a compliance checkbox. But notification fatigue is real. Here's a practical notification policy that respects both:

  • Level 1 fallback (near-equivalent model): No notification. The experience is seamless enough that alerting the tenant adds noise without value.
  • Level 2 fallback (reduced capability): In-app banner or API response header indicating degraded mode. Include which use cases are restricted.
  • Level 3 fallback (significant capability reduction): Proactive webhook or email notification to the tenant's technical contact, with estimated recovery time if available.
  • Hard stop with queueing: Immediate notification with a job ID, estimated wait time, and a status page link.
  • Hard stop without queueing: Structured error response with Retry-After header and a link to your status page.
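This policy collapses into a small dispatch table plus a header builder for the in-band case. A sketch; the channel names and header names are illustrative, not any standard:

```python
# Channel per degradation outcome, mirroring the policy above
NOTIFICATION_POLICY = {
    1: None,                  # near-equivalent model: stay silent
    2: "response_header",     # in-band degraded-mode signal
    3: "webhook",             # proactive push to the technical contact
    "queued": "webhook",
    "hard_stop": "error_response",
}


def notification_channel(outcome):
    """Return the channel for a degradation outcome, or None for silence."""
    return NOTIFICATION_POLICY.get(outcome)


def build_degradation_headers(level, restricted_use_cases=None):
    """HTTP response headers for in-band (level 2 style) signaling."""
    headers = {"X-Degradation-Level": str(level)}
    if restricted_use_cases:
        headers["X-Restricted-Use-Cases"] = ",".join(
            sorted(restricted_use_cases))
    return headers
```

Keeping the mapping in data rather than branching logic makes it trivial to audit and to override per tenant later.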

Putting It All Together: The Request Lifecycle

Here is the complete flow when a request enters your platform during an outage:

  1. Request arrives at your AI gateway layer with a tenant identifier.
  2. The gateway loads the tenant's degradation profile from a low-latency cache (Redis or equivalent).
  3. The TenantDegradationRouter checks the primary provider's circuit breaker state.
  4. If the primary is unavailable, the router walks the degradation ladder, applying use case, compliance, and context window guards at each level.
  5. The first viable fallback is selected, context is truncated if necessary, and the request is dispatched.
  6. If all fallbacks are exhausted, the request is queued (for tenants with that policy) or rejected with a structured error.
  7. A routing decision event is emitted to your observability pipeline regardless of outcome.
  8. Tenant notifications are dispatched based on the degradation level reached.

Common Pitfalls to Avoid

If you build and operate systems like this for long enough, a few failure modes come up repeatedly:

  • Thundering herd on recovery: When a provider comes back online and your circuit breaker transitions to Half-Open, don't immediately flush your entire queue. Implement a gradual ramp using a token bucket to avoid hammering the recovering provider and triggering another outage.
  • Stale degradation profiles: Tenant plan changes, new compliance requirements, and model upgrades must propagate to your profile cache immediately. Use cache invalidation on write, not TTL-based expiry, for this data.
  • Ignoring per-tenant rate limits on fallback providers: Your fallback provider likely has its own rate limits. Track per-tenant usage against those limits during degradation so one large tenant doesn't exhaust the fallback capacity for everyone else.
  • Silent context truncation bugs: Always log when truncation occurs and what was dropped. Debugging a broken agent workflow is nearly impossible if you don't know the model never saw the relevant history.
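The gradual ramp from the first pitfall can be a plain token bucket in front of the queue drain. A minimal sketch (`RecoveryTokenBucket` is illustrative; the injectable clock just makes it testable):

```python
import time


class RecoveryTokenBucket:
    """Meters queued requests released toward a recovering provider."""

    def __init__(self, rate_per_second, capacity, clock=time.monotonic):
        self.rate = rate_per_second
        self.capacity = capacity
        self.tokens = capacity
        self.clock = clock
        self.last_refill = clock()

    def try_acquire(self):
        """Return True if one request may be released right now."""
        now = self.clock()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

Drain the queue only while `try_acquire` returns True, and raise `rate_per_second` in steps as the provider stays healthy through Half-Open, rather than flushing everything the moment the breaker closes.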

Conclusion: Resilience Is a Product Feature

In 2026, the reliability of your AI platform is inseparable from the reliability of your upstream model providers. You cannot control when OpenAI, Anthropic, Google, or any other provider has an incident. But you can absolutely control how your platform behaves when they do.

A per-tenant graceful degradation pipeline is not just an engineering best practice. It is a competitive differentiator and a retention mechanism. When your enterprise customers experience a seamless, transparent fallback during an industry-wide outage while your competitor's platform goes dark, that is the moment your SLA becomes a selling point rather than a liability.

Start with the tenant profile schema, get your circuit breakers in place, and build the router incrementally. You don't need to implement every level on day one. Even a single, well-governed fallback provider with proper per-tenant policy enforcement is dramatically better than the naive "route everyone to Provider B" approach that most teams are still running today.

Your paying customers are counting on you. Build the pipeline before the next outage, not during it.