How to Build a Per-Tenant AI Agent Rollback and State Snapshot Pipeline for Multi-Tenant LLM Platforms When Upstream Model Provider Outages Force Emergency Failover
It happened again. At 2:47 AM on a Tuesday, your on-call engineer gets paged. A major upstream model provider is down. Not degraded. Down. And now hundreds of tenant AI agents, mid-conversation, mid-workflow, mid-tool-call, are frozen in place. Some tenants have enterprise SLAs. Some are running autonomous agents that were 80% through a multi-step financial reconciliation task. The clock is ticking.
If your multi-tenant LLM platform does not have a per-tenant state snapshot and rollback pipeline, this scenario is not a question of if but when. In 2026, with the proliferation of agentic AI workloads across OpenAI, Anthropic, Google Gemini, Mistral, and a growing field of regional model providers, upstream outages have become a recurring operational reality. The platforms that survive them gracefully are the ones that treat agent state as a first-class infrastructure concern.
This tutorial walks you through building a production-grade, per-tenant AI agent rollback and state snapshot pipeline from the ground up. We will cover the architecture, the data models, the snapshot triggers, the rollback logic, and the failover routing layer. By the end, you will have a complete mental model and code-level blueprint you can adapt to your stack.
Why "Just Retry" Is Not Enough in 2026
The naive approach to upstream model outages is a simple retry loop with exponential backoff. This worked reasonably well in 2023 when most LLM usage was stateless prompt-response pairs. But the agentic era has changed the calculus entirely.
Modern AI agents maintain rich, layered state that includes:
- Conversation history and memory windows: accumulated context that may span dozens of turns
- Tool call chains: sequences of external API calls, database writes, and file operations already executed
- Planning graphs: partially completed ReAct or chain-of-thought reasoning trees
- Tenant-specific persona and instruction overlays: per-tenant system prompts, guardrails, and fine-tuning references
- Session-scoped variables: intermediate computation results, retrieved documents, and user preferences
A blind retry after a 10-minute outage does not restore any of this. Worse, it may cause duplicate side effects: re-executing tool calls that already wrote to a database, re-sending emails, or re-charging a payment. For multi-tenant platforms, this is not just a UX problem. It is a liability problem.
The solution is a snapshot-and-rollback pipeline that treats every meaningful agent state transition as a durable, addressable checkpoint.
Core Architecture Overview
Before diving into implementation, here is the high-level architecture of the system we are building. It has five primary components:
- State Snapshot Engine (SSE): captures and serializes agent state at defined checkpoints
- Per-Tenant Snapshot Store: isolated, versioned storage for each tenant's agent snapshots
- Provider Health Monitor (PHM): continuously polls upstream model provider health endpoints and emits outage events
- Failover Router: intercepts inference requests during outages and reroutes them to alternative providers
- Rollback Coordinator: orchestrates the restore of agent state from the last clean snapshot before failover execution
These components communicate via an internal event bus (Kafka, Redis Streams, or NATS all work well here). The key design principle is tenant isolation at every layer: one tenant's snapshot operations must never block, corrupt, or leak into another tenant's state.
Step 1: Define Your Agent State Schema
You cannot snapshot what you have not modeled. The first step is defining a canonical, serializable agent state schema. Here is a TypeScript-style interface that captures the essential fields:
interface AgentStateSnapshot {
// Identity
snapshotId: string; // UUID v7 (time-ordered)
tenantId: string; // Tenant namespace
agentId: string; // Agent instance ID
sessionId: string; // User session ID
workflowId?: string; // Optional: parent workflow ID
// Versioning
schemaVersion: string; // e.g., "2.1.0"
sequenceNumber: number; // Monotonically increasing per session
createdAt: string; // ISO 8601 timestamp
// Model context
modelProvider: string; // e.g., "openai", "anthropic", "gemini"
modelId: string; // e.g., "gpt-5", "claude-4-opus"
conversationHistory: Message[];
// Execution state
pendingToolCalls: ToolCall[];
completedToolCalls: ToolCallResult[];
planningGraph?: PlanNode[];
// Tenant configuration
systemPromptHash: string; // Hash of tenant system prompt (not the prompt itself)
tenantConfigVersion: string;
// Integrity
stateHash: string; // SHA-256 of the serialized state payload
previousSnapshotId?: string; // Linked list of snapshots
}
A few design decisions worth highlighting here. First, store the system prompt hash, not the raw system prompt. Tenant system prompts are sensitive configuration data and should not be duplicated across every snapshot. The hash lets you verify that the tenant config at rollback time matches what was active when the snapshot was taken. Second, use UUID v7 for snapshot IDs. The time-ordered nature of UUID v7 makes range queries on your snapshot store dramatically more efficient. Third, the linked list of snapshots via previousSnapshotId gives you a traversable chain for auditing and selective rollback.
Step 2: Build the State Snapshot Engine
The Snapshot Engine is responsible for capturing state at the right moments. There are three categories of snapshot triggers you should implement:
Trigger Category 1: Periodic Checkpoints
For long-running agent sessions, capture a snapshot every N turns or every T seconds of wall-clock time, whichever comes first. A reasonable default is every 5 conversation turns or every 60 seconds.
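This turn-or-time rule is easy to get subtly wrong, so here is a minimal sketch of the check. The SnapshotPolicy name and its defaults are illustrative, not from any framework:

```python
import time
from dataclasses import dataclass, field


@dataclass
class SnapshotPolicy:
    """Periodic checkpoint policy: snapshot every N turns or T seconds,
    whichever comes first."""
    max_turns: int = 5
    max_interval_s: float = 60.0
    turns_since_snapshot: int = 0
    last_snapshot_at: float = field(default_factory=time.monotonic)

    def record_turn(self) -> None:
        self.turns_since_snapshot += 1

    def should_checkpoint(self) -> bool:
        elapsed = time.monotonic() - self.last_snapshot_at
        return (self.turns_since_snapshot >= self.max_turns
                or elapsed >= self.max_interval_s)

    def mark_snapshot(self) -> None:
        # Reset both counters so the two triggers stay in sync
        self.turns_since_snapshot = 0
        self.last_snapshot_at = time.monotonic()
```

One caveat: the time-based half only fires when should_checkpoint is evaluated, so a real deployment would also poll it from a background timer rather than only on conversation turns.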
Trigger Category 2: Pre-Tool-Call Snapshots
This is the most critical trigger. Before every tool call execution, capture a snapshot. This gives you a clean restore point that precedes any side effects. If the provider goes down mid-tool-chain, you can roll back to before the first tool call in the affected sequence and replay cleanly after failover.
Trigger Category 3: Explicit Checkpoints
Allow tenants (via API) and your internal orchestration layer to request explicit snapshots. This is useful for long-horizon planning tasks where a tenant wants to mark a "safe point" before a risky sub-task.
Here is a Python implementation of the core snapshot capture logic:
import hashlib, json, uuid
from datetime import datetime, timezone
from dataclasses import dataclass, asdict
from typing import Optional
class StateSnapshotEngine:
def __init__(self, snapshot_store: "SnapshotStore", event_bus: "EventBus"):
self.store = snapshot_store
self.bus = event_bus
async def capture(
self,
agent_context: "AgentContext",
trigger: str, # "periodic" | "pre_tool_call" | "explicit"
metadata: Optional[dict] = None
) -> str:
"""Capture and persist a snapshot. Returns the snapshotId."""
payload = {
"snapshotId": str(uuid.uuid7()),
"tenantId": agent_context.tenant_id,
"agentId": agent_context.agent_id,
"sessionId": agent_context.session_id,
"workflowId": agent_context.workflow_id,
"schemaVersion": "2.1.0",
"sequenceNumber": agent_context.sequence_number,
"createdAt": datetime.now(timezone.utc).isoformat(),
"modelProvider": agent_context.model_provider,
"modelId": agent_context.model_id,
"conversationHistory": agent_context.conversation_history,
"pendingToolCalls": agent_context.pending_tool_calls,
"completedToolCalls": agent_context.completed_tool_calls,
"planningGraph": agent_context.planning_graph,
"systemPromptHash": agent_context.system_prompt_hash,
"tenantConfigVersion": agent_context.tenant_config_version,
"previousSnapshotId": agent_context.last_snapshot_id,
"triggerType": trigger,
"metadata": metadata or {},
}
# Compute integrity hash (exclude stateHash field itself)
raw = json.dumps(payload, sort_keys=True, default=str)
payload["stateHash"] = hashlib.sha256(raw.encode()).hexdigest()
snapshot_id = payload["snapshotId"]
# Write to tenant-isolated store
await self.store.write(
tenant_id=agent_context.tenant_id,
snapshot_id=snapshot_id,
payload=payload
)
# Update agent context with latest snapshot reference
agent_context.last_snapshot_id = snapshot_id
agent_context.sequence_number += 1
# Emit event for observability
await self.bus.publish("agent.snapshot.created", {
"tenantId": agent_context.tenant_id,
"snapshotId": snapshot_id,
"trigger": trigger,
})
return snapshot_id
Step 3: Design the Per-Tenant Snapshot Store
Tenant isolation in the snapshot store is non-negotiable. A misconfiguration that lets Tenant A read Tenant B's agent state is a catastrophic data breach. Here are the storage patterns that work best in practice:
Option A: Prefixed Key-Value Store (Redis / DynamoDB)
Prefix every key with the tenant ID and enforce access at the application layer. Use DynamoDB's partition key as tenantId#agentId and sort key as sessionId#sequenceNumber. This gives you efficient range queries for "give me the last 10 snapshots for this agent session."
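As a sketch, the composite keys described above might be built like this. Note the zero-padded sequence number: DynamoDB compares string sort keys lexicographically, so without padding, sequence 10 would sort before sequence 2:

```python
def partition_key(tenant_id: str, agent_id: str) -> str:
    """Partition key groups all sessions for one agent under one tenant."""
    return f"{tenant_id}#{agent_id}"


def sort_key(session_id: str, sequence_number: int) -> str:
    """Zero-pad the sequence so lexicographic sort order matches numeric order."""
    return f"{session_id}#{sequence_number:010d}"
```

With this layout, "last 10 snapshots for a session" is a single Query with a begins_with condition on the sort key, descending order, and a limit of 10.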
Option B: Schema-per-Tenant (PostgreSQL)
For platforms already running Postgres, use a separate schema per tenant. Each schema contains an agent_snapshots table. Row-level security policies provide an additional enforcement layer. This approach adds migration complexity but gives you strong isolation guarantees and excellent query flexibility.
Option C: Object Storage with Tenant-Scoped Buckets (S3 / GCS)
For high-volume platforms, serialize snapshots as compressed JSON blobs and write them to tenant-scoped object storage paths: s3://snapshots/{tenantId}/{agentId}/{sessionId}/{snapshotId}.json.gz. Use a metadata index (DynamoDB or Postgres) for fast lookups. This is the most cost-effective option at scale.
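A minimal serialization sketch for this layout (the helper names are illustrative). Using sorted keys keeps the stored JSON byte-identical to the canonical form used for hashing:

```python
import gzip
import json


def snapshot_object_key(tenant_id: str, agent_id: str,
                        session_id: str, snapshot_id: str) -> str:
    """Tenant-scoped object path; tenant_id first enables prefix-scoped IAM policies."""
    return f"{tenant_id}/{agent_id}/{session_id}/{snapshot_id}.json.gz"


def serialize_snapshot(payload: dict) -> bytes:
    """Compress the canonical JSON form (sorted keys, string fallback)."""
    raw = json.dumps(payload, sort_keys=True, default=str).encode()
    return gzip.compress(raw)


def deserialize_snapshot(blob: bytes) -> dict:
    return json.loads(gzip.decompress(blob))
```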
Regardless of which option you choose, implement a retention policy per tenant. A reasonable default is to keep the last 50 snapshots per session and all snapshots from the last 7 days, then archive to cold storage. Tenants with compliance requirements (HIPAA, SOC 2, GDPR) may have specific retention and deletion obligations you must honor.
Step 4: Build the Provider Health Monitor
The Provider Health Monitor (PHM) is your early warning system. It needs to detect outages faster than your tenants do. Here is what a production PHM looks like:
import asyncio, httpx, time
from enum import Enum
from collections import deque
class ProviderStatus(Enum):
HEALTHY = "healthy"
DEGRADED = "degraded"
DOWN = "down"
class ProviderHealthMonitor:
PROVIDERS = {
"openai": "https://status.openai.com/api/v2/status.json",
"anthropic": "https://status.anthropic.com/api/v2/status.json",
"gemini": "https://status.cloud.google.com/incidents.json",
"mistral": "https://status.mistral.ai/api/v2/status.json",
}
def __init__(self, event_bus: "EventBus", poll_interval: int = 15):
self.event_bus = event_bus
self.poll_interval = poll_interval
# Rolling window of last 10 probe results per provider
self.probe_history: dict[str, deque] = {
p: deque(maxlen=10) for p in self.PROVIDERS
}
self.current_status: dict[str, ProviderStatus] = {
p: ProviderStatus.HEALTHY for p in self.PROVIDERS
}
    async def probe_provider(self, name: str, url: str) -> bool:
        try:
            async with httpx.AsyncClient(timeout=5.0) as client:
                resp = await client.get(url)
                data = resp.json()
            # Statuspage-style endpoints (OpenAI, Anthropic, Mistral) return
            # {"status": {"indicator": ...}}. Google's incidents.json instead
            # returns a bare list of incidents, where an ongoing incident has
            # no "end" timestamp yet.
            if isinstance(data, list):
                return not any(i.get("end") is None for i in data)
            indicator = data.get("status", {}).get("indicator", "none")
            return indicator in ("none", "minor")
        except Exception:
            return False
async def run(self):
while True:
tasks = {
name: asyncio.create_task(self.probe_provider(name, url))
for name, url in self.PROVIDERS.items()
}
results = {name: await task for name, task in tasks.items()}
for provider, healthy in results.items():
self.probe_history[provider].append(healthy)
new_status = self._evaluate_status(provider)
if new_status != self.current_status[provider]:
old_status = self.current_status[provider]
self.current_status[provider] = new_status
await self.event_bus.publish("provider.status.changed", {
"provider": provider,
"oldStatus": old_status.value,
"newStatus": new_status.value,
"timestamp": time.time(),
})
await asyncio.sleep(self.poll_interval)
def _evaluate_status(self, provider: str) -> ProviderStatus:
history = list(self.probe_history[provider])
if len(history) < 3:
return ProviderStatus.HEALTHY
recent = history[-3:]
if not any(recent):
return ProviderStatus.DOWN
if sum(recent) < len(recent):
return ProviderStatus.DEGRADED
return ProviderStatus.HEALTHY
The key design here is the rolling window evaluation. A single failed probe does not trigger a status change. Only when 3 consecutive probes fail do you declare a provider DOWN. This prevents flapping on transient network hiccups. Adjust the threshold based on your SLA tolerance.
You should also supplement status page polling with synthetic inference probes: send a minimal, cheap inference request (a single-token completion) to each provider every 30 seconds. Status pages often lag real degradation by several minutes. Your own probes will catch it faster.
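One way to keep such a synthetic probe SDK-agnostic is to inject the completion call. The request shape and latency budget below are illustrative assumptions, not any provider's actual API:

```python
import asyncio
from typing import Awaitable, Callable


async def synthetic_probe(
    complete: Callable[[dict], Awaitable[dict]],
    latency_budget_s: float = 10.0,
) -> bool:
    """Send a minimal one-token completion through an injected provider call.

    `complete` wraps whatever SDK you use for the provider. Anything slower
    than the latency budget, or any error, counts as an unhealthy probe.
    """
    request = {"messages": [{"role": "user", "content": "ping"}], "max_tokens": 1}
    try:
        await asyncio.wait_for(complete(request), timeout=latency_budget_s)
        return True
    except Exception:  # timeout, HTTP error, SDK failure all count as failed probes
        return False
```

Feed these results into the same probe_history deques as the status-page probes, so both signals flow through one evaluation path.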
Step 5: Build the Failover Router
When the PHM emits a provider.status.changed event with newStatus: "down", the Failover Router must spring into action. Its job is twofold: stop routing new inference requests to the downed provider, and coordinate with the Rollback Coordinator to restore in-flight agent sessions.
The routing layer sits as middleware in your inference request path. Here is the core logic:
from typing import Optional

class AllProvidersUnavailableError(Exception):
    """Raised when no healthy provider satisfies a tenant's allowlist."""

class FailoverRouter:
    # Ordered fallback chains per provider
    FALLBACK_CHAINS = {
        "openai": ["anthropic", "gemini", "mistral"],
        "anthropic": ["openai", "gemini", "mistral"],
        "gemini": ["openai", "anthropic", "mistral"],
        "mistral": ["openai", "anthropic", "gemini"],
    }
def __init__(self, phm: ProviderHealthMonitor,
rollback_coordinator: "RollbackCoordinator",
tenant_config_service: "TenantConfigService"):
self.phm = phm
self.rollback = rollback_coordinator
self.tenant_config = tenant_config_service
async def route_inference(
self,
tenant_id: str,
agent_context: "AgentContext",
inference_request: dict
) -> dict:
preferred_provider = agent_context.model_provider
provider_status = self.phm.current_status.get(
preferred_provider, ProviderStatus.HEALTHY
)
if provider_status == ProviderStatus.HEALTHY:
return await self._execute_inference(
preferred_provider, agent_context, inference_request
)
# Provider is degraded or down: select fallback
fallback = await self._select_fallback(
tenant_id, preferred_provider
)
if fallback is None:
raise AllProvidersUnavailableError(
f"No healthy fallback available for tenant {tenant_id}"
)
# Rollback to last clean snapshot before rerouting
restored_context = await self.rollback.restore_for_failover(
tenant_id=tenant_id,
agent_context=agent_context,
target_provider=fallback,
)
return await self._execute_inference(
fallback, restored_context, inference_request
)
async def _select_fallback(
self, tenant_id: str, failed_provider: str
) -> Optional[str]:
# Check tenant-specific provider allowlist first
tenant_cfg = await self.tenant_config.get(tenant_id)
allowed = tenant_cfg.get("allowedProviders", list(self.FALLBACK_CHAINS.keys()))
for candidate in self.FALLBACK_CHAINS.get(failed_provider, []):
if candidate not in allowed:
continue
if self.phm.current_status.get(candidate) == ProviderStatus.HEALTHY:
return candidate
return None
Notice the tenant allowlist check. This is critical for enterprise customers. A healthcare tenant may have a BAA only with specific providers. A European tenant may have data residency requirements that prohibit certain providers. Your failover router must respect these constraints even in an emergency. Blindly rerouting to a non-compliant provider to maintain uptime can create regulatory violations far more costly than a brief outage.
Step 6: Build the Rollback Coordinator
The Rollback Coordinator is the most nuanced piece of the system. Its job is to take an in-flight agent context, find the appropriate snapshot to restore from, adapt that snapshot for the target provider, and return a clean context ready for execution.
import hashlib, json
from datetime import datetime, timezone
from typing import Optional

class SnapshotCorruptionError(Exception):
    """Raised when a snapshot's integrity hash does not match its payload."""

class RollbackCoordinator:
    def __init__(self, snapshot_store: "SnapshotStore",
                 model_adapter_registry: "ModelAdapterRegistry"):
        self.store = snapshot_store
        self.adapters = model_adapter_registry
async def restore_for_failover(
self,
tenant_id: str,
agent_context: "AgentContext",
target_provider: str,
) -> "AgentContext":
# Find the last snapshot that precedes any pending (unconfirmed) tool calls
snapshot = await self._find_safe_rollback_point(
tenant_id, agent_context
)
if snapshot is None:
# No snapshot available: start fresh with conversation history only
return self._create_minimal_context(agent_context, target_provider)
# Validate snapshot integrity
self._verify_snapshot_hash(snapshot)
# Adapt conversation history format for target provider
adapter = self.adapters.get(target_provider)
adapted_history = adapter.adapt_conversation_history(
snapshot["conversationHistory"],
source_provider=snapshot["modelProvider"]
)
# Build restored context
restored = AgentContext(
tenant_id=tenant_id,
agent_id=snapshot["agentId"],
session_id=snapshot["sessionId"],
workflow_id=snapshot["workflowId"],
model_provider=target_provider,
model_id=adapter.select_equivalent_model(snapshot["modelId"]),
conversation_history=adapted_history,
# Only replay tool calls confirmed complete before the snapshot
completed_tool_calls=snapshot["completedToolCalls"],
pending_tool_calls=[], # Clear pending: will be re-executed
planning_graph=snapshot["planningGraph"],
system_prompt_hash=snapshot["systemPromptHash"],
tenant_config_version=snapshot["tenantConfigVersion"],
last_snapshot_id=snapshot["snapshotId"],
sequence_number=snapshot["sequenceNumber"],
failover_metadata={
"restoredFromSnapshot": snapshot["snapshotId"],
"originalProvider": snapshot["modelProvider"],
"failoverProvider": target_provider,
"failoverTimestamp": datetime.now(timezone.utc).isoformat(),
}
)
return restored
async def _find_safe_rollback_point(
self, tenant_id: str, agent_context: "AgentContext"
) -> Optional[dict]:
        # Walk back through the snapshot chain to the most recent safe
        # restore point. Explicit checkpoints qualify too: tenants take
        # them precisely to mark safe points (see Trigger Category 3)
        snapshot_id = agent_context.last_snapshot_id
        while snapshot_id:
            snapshot = await self.store.read(tenant_id, snapshot_id)
            if snapshot["triggerType"] in ("pre_tool_call", "periodic", "explicit"):
                return snapshot
            snapshot_id = snapshot.get("previousSnapshotId")
return None
def _verify_snapshot_hash(self, snapshot: dict):
stored_hash = snapshot.pop("stateHash")
raw = json.dumps(snapshot, sort_keys=True, default=str)
computed = hashlib.sha256(raw.encode()).hexdigest()
snapshot["stateHash"] = stored_hash
if computed != stored_hash:
raise SnapshotCorruptionError(
f"Hash mismatch for snapshot {snapshot['snapshotId']}"
)
Step 7: Handle Model Adapter Translation
One of the trickiest parts of cross-provider failover is that different model providers use different message formats, tool call schemas, and capability sets. You need a Model Adapter Registry that handles translation between providers.
Key translation concerns include:
- Message role names: OpenAI uses "assistant" for model turns, while some providers use "model" or "bot"
- Tool call format: The JSON schema for function/tool definitions varies significantly across providers
- Context window limits: If the failover provider has a smaller context window, you need a truncation strategy that preserves the most recent and most relevant turns
- System prompt placement: Some providers treat the system prompt as a separate field; others expect it as the first message in the history array
- Model equivalence mapping: When failing over from "gpt-5" to "claude-4-opus", you want to select the closest capability tier, not the cheapest available model
Build a static equivalence map and update it as providers release new models. In 2026, most major providers have converged on OpenAI-compatible APIs as a baseline, which reduces (but does not eliminate) translation overhead.
Step 8: Observability, Alerting, and Tenant Notifications
A rollback pipeline that operates silently is a liability. You need full observability at every stage, plus a tenant notification system so customers know what happened and why.
Metrics to Emit
- snapshot.capture.latency_ms (per tenant, per trigger type)
- snapshot.store.size_bytes (per tenant)
- failover.triggered.count (per provider, per tenant tier)
- rollback.restore.latency_ms
- rollback.snapshot_age_seconds (how old was the snapshot we rolled back to?)
- failover.tool_call_replay.count (how many tool calls had to be re-executed?)
Tenant Notification Webhook
Send a structured webhook payload to tenants when failover occurs:
{
"event": "agent.failover.executed",
"tenantId": "tenant_abc123",
"agentId": "agent_xyz789",
"sessionId": "sess_001",
"originalProvider": "openai",
"failoverProvider": "anthropic",
"restoredFromSnapshot": "019x-...",
"snapshotAge": 47, // seconds
"toolCallsReplayed": 2,
"estimatedContextLoss": 0, // turns of conversation lost
"timestamp": "2026-03-18T02:47:33Z"
}
Transparency here builds trust. Enterprise tenants will have their own incident response processes and they need this data to close the loop on their end.
Step 9: Testing Your Failover Pipeline
A failover system you have never tested is a failover system that will fail when you need it most. Build a chaos testing harness that you run in staging weekly:
- Provider kill switch: Inject a mock provider.status.changed event with newStatus "down" for each provider and verify that all active agent sessions trigger rollback within your SLA window (target: under 5 seconds)
- Snapshot corruption test: Deliberately corrupt a snapshot's hash and verify the coordinator falls back to the previous snapshot rather than restoring corrupted state
- Mid-tool-call outage: Simulate an outage that fires exactly between a tool call dispatch and its result return. Verify that the tool call is replayed (not skipped) after failover
- Tenant allowlist enforcement: Configure a tenant with a restricted provider allowlist and verify that failover never routes to a disallowed provider, even if it is the only healthy one
- Snapshot store unavailability: Kill the snapshot store itself and verify the system degrades gracefully (conversation history only, no tool replay) rather than crashing
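The status-transition half of the kill-switch test can even run offline, by replaying probe sequences through the same rolling-window rule the monitor uses (reproduced here so the test is self-contained):

```python
from collections import deque


def evaluate(history) -> str:
    """Mirror of the PHM's rolling-window rule (assumed identical thresholds)."""
    h = list(history)
    if len(h) < 3:
        return "healthy"  # not enough data yet: stay optimistic
    recent = h[-3:]
    if not any(recent):
        return "down"      # 3 consecutive failures
    if sum(recent) < len(recent):
        return "degraded"  # some, but not all, recent probes failed
    return "healthy"


def replay(probes: list[bool]) -> list[str]:
    """Feed a probe sequence through the window and return each resulting status."""
    window = deque(maxlen=10)
    statuses = []
    for ok in probes:
        window.append(ok)
        statuses.append(evaluate(window))
    return statuses
```

Replaying scenarios like "healthy, then three consecutive failures" gives you fast, deterministic regression coverage of the flap-prevention logic before you ever run the full chaos harness.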
Production Checklist Before You Ship
Before deploying this pipeline to production, run through this checklist:
- Snapshot writes are non-blocking relative to the main inference path (use async fire-and-forget with a write confirmation timeout)
- Snapshot store has per-tenant encryption at rest with tenant-managed keys for enterprise tiers
- Rollback operations are idempotent: running restore twice for the same snapshot ID produces the same result
- Failover events are logged to your audit trail with full context for compliance purposes
- Tenant allowlist configs are cached locally in the failover router so a config service outage does not block failover
- Your model equivalence map is maintained as a versioned config, not hardcoded
- Snapshot retention policies are enforced by a background job, not inline with writes
- You have a manual override API so on-call engineers can force a specific tenant to a specific provider without waiting for automation
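The first checklist item — non-blocking snapshot writes with a confirmation timeout — can be sketched with asyncio.shield, which lets the write keep running in the background after the wait times out. The helper name and default timeout are illustrative:

```python
import asyncio


async def fire_and_forget_snapshot(store_write, payload: dict,
                                   confirm_timeout_s: float = 0.25) -> bool:
    """Dispatch a snapshot write without blocking the inference path.

    Wait briefly for durability confirmation; on timeout, the shielded
    task continues in the background and we simply record that the write
    was not confirmed in time (emit a metric there in production).
    """
    task = asyncio.create_task(store_write(payload))
    try:
        await asyncio.wait_for(asyncio.shield(task), timeout=confirm_timeout_s)
        return True   # confirmed durable before the timeout
    except asyncio.TimeoutError:
        return False  # unconfirmed; the background write is still in flight
```

The shield is the important detail: a plain wait_for would cancel the write on timeout, defeating the point of the checkpoint.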
Conclusion
Building a per-tenant AI agent rollback and state snapshot pipeline is not glamorous infrastructure work. It is the kind of system that is invisible when it works and catastrophic when it does not exist. In 2026, as agentic AI workloads take on increasingly consequential tasks across finance, healthcare, legal, and operations, the platforms that earn long-term enterprise trust will be the ones that treat agent state durability with the same rigor as financial transaction logs.
The architecture laid out in this guide gives you the foundation: a well-modeled state schema, trigger-based snapshot capture, tenant-isolated storage, intelligent health monitoring, compliant failover routing, and a rollback coordinator that handles the messy reality of cross-provider model translation. Start with the snapshot engine and the health monitor. Get those into production first. Then layer in the full rollback coordinator once you have real snapshot data to test against.
Your future on-call engineer, paged at 2:47 AM on a Tuesday, will thank you.