How to Design a Backend Circuit Breaker Pattern for AI Model API Failures: A Step-by-Step Guide for Production Multi-Agent Systems
Your multi-agent system is humming along in production when suddenly one of your third-party LLM providers starts returning garbled partial outputs. Within seconds, an orchestrator agent retries the call, a downstream summarization agent stalls waiting for a response, a vector search step times out, and your entire pipeline grinds to a halt. You have just experienced an AI cascade failure, and if you have not designed for it, you will keep experiencing it.
The classic circuit breaker pattern, popularized in microservices architecture, is the right tool for this problem. But applying it naively to LLM APIs misses a critical reality: AI model failures are not binary. A provider might be "up" by every traditional health-check measure and still be poisoning your pipeline with degraded token outputs, truncated completions, or semantically incoherent responses that slip right past a standard HTTP 200 check.
This guide walks you through designing a circuit breaker that is purpose-built for the unique failure modes of third-party LLM providers, including rate limits, partial outputs, and soft degradation, in the context of production multi-agent systems.
Why Standard Circuit Breakers Fall Short for LLM APIs
In a typical microservice, a circuit breaker monitors for HTTP 5xx errors or connection timeouts. If failures cross a threshold, the circuit "opens" and stops forwarding requests for a cooldown period. Simple, effective, and well-understood.
LLM providers introduce a new class of failure that breaks these assumptions:
- Partial outputs: The model returns a response that is syntactically valid JSON or plain text but is truncated mid-sentence, missing required fields, or cut off at the token limit without a proper stop token.
- Semantic degradation: The model is technically responding but producing low-quality, hallucinated, or off-topic content, often a sign of infrastructure issues on the provider's side.
- Rate limit errors (HTTP 429): These are not server errors. They are expected signals that need their own handling logic, including exponential backoff and provider failover, not just a trip to the open state.
- Soft timeouts: The provider starts streaming tokens but then stalls mid-stream for 30 or more seconds, which a naive timeout won't catch if it only measures time-to-first-token.
- Inconsistent schema compliance: Providers under load sometimes drop structured output formatting, returning raw prose when your agent pipeline expects a strict JSON schema.
A circuit breaker for LLM APIs must understand all of these failure modes. Let's build one from the ground up.
Step 1: Define Your Failure Taxonomy
Before writing a single line of code, document every failure type your system can encounter from each LLM provider. Categorize them into three buckets that will drive your circuit breaker's state transitions:
Bucket A: Hard Failures (Trip the Circuit Immediately)
- HTTP 500, 502, 503, 504 responses
- Connection refused or DNS resolution failure
- SSL/TLS handshake errors
- Streaming connection dropped before any token is received
Bucket B: Soft Failures (Increment a Degradation Counter)
- HTTP 429 (rate limit exceeded)
- Response received but fails schema validation
- Response received but is below a minimum token length threshold
- Streaming stall: no new token received within a configurable inter-token timeout window
- Response quality score below threshold (more on this below)
Bucket C: Warnings (Log Only, No State Change)
- Latency above P95 baseline but response is valid
- Response received with a model-reported finish reason of "length" (hit token limit)
- Non-critical field missing in structured output
This taxonomy is the foundation of everything else. Pin it in your team's design doc before you proceed.
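One way to make the taxonomy concrete before touching the breaker itself is a small classifier that maps each observed outcome to a bucket. This is a sketch: `classify_outcome` and its parameters are illustrative names, and the built-in exception types stand in for whatever your provider SDK actually raises.

```python
from enum import Enum
from typing import Optional

class FailureBucket(Enum):
    HARD = "hard"  # Bucket A: trip the circuit immediately
    SOFT = "soft"  # Bucket B: increment a degradation counter
    WARN = "warn"  # Bucket C: log only, no state change
    NONE = "none"  # Healthy response

def classify_outcome(
    status_code: Optional[int] = None,
    exception: Optional[Exception] = None,
    schema_valid: bool = True,
    finish_reason: Optional[str] = None,
) -> FailureBucket:
    """Map one observed provider outcome onto the three-bucket taxonomy."""
    # Bucket A: transport-level errors and 5xx responses
    if isinstance(exception, (ConnectionError, TimeoutError)):
        return FailureBucket.HARD
    if status_code in (500, 502, 503, 504):
        return FailureBucket.HARD
    # Bucket B: rate limits and delivered-but-invalid responses
    if status_code == 429 or not schema_valid:
        return FailureBucket.SOFT
    # Bucket C: valid response that ran into the token limit
    if finish_reason == "length":
        return FailureBucket.WARN
    return FailureBucket.NONE
```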
Step 2: Build the State Machine
A circuit breaker is fundamentally a state machine with three states: Closed, Open, and Half-Open. For LLM APIs, you need a fourth state to handle the rate-limit scenario gracefully.
- Closed: Normal operation. All requests pass through. Failures are counted.
- Open: The circuit has tripped. All requests are immediately rejected or routed to a fallback provider. No calls reach the degraded LLM endpoint.
- Half-Open: The cooldown has expired. A limited number of probe requests are allowed through to test if the provider has recovered.
- Throttled (new state): The provider is responding but is rate-limiting. Requests are queued and released at a controlled rate using a token bucket algorithm. The circuit does not fully open, but throughput is reduced.
Here is a Python skeleton illustrating this state machine:
```python
import time
import threading
from enum import Enum


class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"
    THROTTLED = "throttled"


class LLMCircuitBreaker:
    def __init__(
        self,
        provider_name: str,
        hard_failure_threshold: int = 5,
        soft_failure_threshold: int = 10,
        open_duration_seconds: int = 60,
        half_open_probe_count: int = 3,
        rate_limit_window_seconds: int = 30,
    ):
        self.provider_name = provider_name
        self.state = CircuitState.CLOSED
        self.hard_failures = 0
        self.soft_failures = 0
        self.hard_failure_threshold = hard_failure_threshold
        self.soft_failure_threshold = soft_failure_threshold
        self.open_duration = open_duration_seconds
        self.half_open_probe_count = half_open_probe_count
        self.half_open_successes = 0
        self.opened_at = None
        self.lock = threading.Lock()
        self.rate_limit_window = rate_limit_window_seconds

    def record_hard_failure(self):
        with self.lock:
            self.hard_failures += 1
            if self.hard_failures >= self.hard_failure_threshold:
                self._trip_open()

    def record_soft_failure(self, is_rate_limit: bool = False):
        with self.lock:
            if is_rate_limit:
                self.state = CircuitState.THROTTLED
                return
            self.soft_failures += 1
            if self.soft_failures >= self.soft_failure_threshold:
                self._trip_open()

    def record_success(self):
        with self.lock:
            if self.state == CircuitState.HALF_OPEN:
                self.half_open_successes += 1
                if self.half_open_successes >= self.half_open_probe_count:
                    self._reset()
            else:
                self._reset()

    def _trip_open(self):
        self.state = CircuitState.OPEN
        self.opened_at = time.time()

    def _reset(self):
        self.state = CircuitState.CLOSED
        self.hard_failures = 0
        self.soft_failures = 0
        self.half_open_successes = 0
        self.opened_at = None

    def allow_request(self) -> bool:
        with self.lock:
            if self.state == CircuitState.CLOSED:
                return True
            if self.state == CircuitState.THROTTLED:
                return True  # Release rate is controlled by a token bucket upstream
            if self.state == CircuitState.OPEN:
                if time.time() - self.opened_at > self.open_duration:
                    self.state = CircuitState.HALF_OPEN
                    self.half_open_successes = 0
                    return True
                return False
            if self.state == CircuitState.HALF_OPEN:
                return self.half_open_successes < self.half_open_probe_count
            return False
```
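In the Throttled state, requests are not rejected outright but released at a controlled rate. Here is a minimal token bucket sketch; the capacity and refill rate below are assumed example values, which you should tune against each provider's published rate limits and your observed 429 frequency:

```python
import threading
import time

class TokenBucket:
    """Release requests at a controlled rate while a provider is rate-limiting.

    Assumed example defaults: capacity of 10 requests, refilling at 2 per
    second. Derive real numbers from the provider's documented limits.
    """

    def __init__(self, capacity: float = 10.0, refill_per_second: float = 2.0):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = capacity
        self.last_refill = time.monotonic()
        self.lock = threading.Lock()

    def try_acquire(self) -> bool:
        """Return True if a request may be released right now."""
        with self.lock:
            now = time.monotonic()
            # Refill proportionally to elapsed time, capped at capacity
            self.tokens = min(
                self.capacity,
                self.tokens + (now - self.last_refill) * self.refill_per_second,
            )
            self.last_refill = now
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return True
            return False
```

When the breaker is in `THROTTLED`, the caller checks `try_acquire()` before each request and queues or delays anything that does not get a token.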
Step 3: Implement Response Quality Scoring
This is the step that separates a naive circuit breaker from one that actually protects against LLM-specific failures. You need a lightweight response quality scorer that runs on every LLM output before it is passed downstream to the next agent in your pipeline.
The scorer does not need to be a second LLM call (that would be expensive and circular). Instead, use a rule-based scoring function that checks the properties most critical to your use case. Here is a composable example:
```python
from dataclasses import dataclass
from typing import Optional
import json

@dataclass
class QualityScore:
    passed: bool
    score: float  # 0.0 to 1.0
    failure_reason: Optional[str] = None

def score_llm_response(
    raw_response: str,
    expected_schema: Optional[dict] = None,
    min_token_estimate: int = 20,
    finish_reason: Optional[str] = None,
) -> QualityScore:
    score = 1.0
    reason = None

    # Check 1: Minimum length
    token_estimate = len(raw_response.split())
    if token_estimate < min_token_estimate:
        score -= 0.5
        reason = f"Response too short: {token_estimate} tokens estimated"

    # Check 2: Finish reason
    if finish_reason == "length":
        score -= 0.2
        reason = reason or "Response truncated at token limit"

    # Check 3: Schema validation (if structured output expected)
    if expected_schema:
        try:
            parsed = json.loads(raw_response)
            for required_key in expected_schema.get("required", []):
                if required_key not in parsed:
                    score -= 0.3
                    reason = f"Missing required field: {required_key}"
                    break
        except json.JSONDecodeError:
            score -= 0.6
            reason = "Response is not valid JSON"

    # Check 4: Repetition detection (a common sign of model degradation)
    words = raw_response.lower().split()
    if len(words) > 10:
        unique_ratio = len(set(words)) / len(words)
        if unique_ratio < 0.3:
            score -= 0.4
            reason = reason or "High token repetition detected (possible model degradation)"

    return QualityScore(passed=score >= 0.6, score=max(0.0, score), failure_reason=reason)
```
When `QualityScore.passed` is False, call `circuit_breaker.record_soft_failure()`. This is how your circuit breaker learns about degraded responses that HTTP status codes will never surface.
Step 4: Wire Up Provider Failover
An open circuit is only useful if you have somewhere to route traffic. In production multi-agent systems running in 2026, most teams have access to multiple LLM providers, whether that is OpenAI, Anthropic, Google Gemini, Mistral, or self-hosted open-weight models like Llama or Qwen. Your circuit breaker should sit inside a provider router that implements a priority-ordered failover list.
```python
from typing import List, Callable, Any

# RateLimitError and ServerError are placeholders for your provider SDK's
# exception types; substitute the real classes when wiring up each client.

class LLMProviderRouter:
    def __init__(self, providers: List[dict]):
        # providers: [{"name": str, "call_fn": Callable, "breaker": LLMCircuitBreaker}]
        self.providers = providers

    def call(self, prompt: str, schema: dict = None, **kwargs) -> Any:
        for provider in self.providers:
            breaker: LLMCircuitBreaker = provider["breaker"]
            call_fn: Callable = provider["call_fn"]

            if not breaker.allow_request():
                print(f"[Router] Skipping {provider['name']}: circuit is {breaker.state.value}")
                continue

            try:
                response = call_fn(prompt, **kwargs)
                quality = score_llm_response(
                    raw_response=response.text,
                    expected_schema=schema,
                    finish_reason=response.finish_reason,
                )
                if quality.passed:
                    breaker.record_success()
                    return response
                else:
                    print(f"[Router] Soft failure on {provider['name']}: {quality.failure_reason}")
                    breaker.record_soft_failure()
                    continue  # Try next provider
            except RateLimitError:
                breaker.record_soft_failure(is_rate_limit=True)
                continue
            except (ConnectionError, TimeoutError, ServerError):
                breaker.record_hard_failure()
                continue

        raise RuntimeError("All LLM providers are unavailable or degraded. No response returned.")
```
This router tries each provider in order, respects circuit state, and automatically escalates to the next provider on any failure type. The last line raises a clean exception that your orchestration layer can catch and handle, rather than letting a bad response silently corrupt your agent pipeline.
Step 5: Handle Streaming Responses with an Inter-Token Watchdog
Streaming is where most circuit breaker implementations break down for LLM APIs. A provider can begin streaming tokens and then stall indefinitely. Your timeout set on the initial HTTP connection will not fire because the connection is technically still open.
The solution is an inter-token watchdog timer: a background thread that resets a countdown every time a new token arrives. If no token arrives within the window, it cancels the stream and records a soft failure.
```python
import time
import threading
from typing import Generator

def stream_with_watchdog(
    stream_generator: Generator,
    inter_token_timeout: float = 5.0,
    breaker: LLMCircuitBreaker = None,
):
    last_token_time = [time.time()]
    timed_out = [False]
    stop_event = threading.Event()

    def watchdog():
        while not stop_event.is_set():
            elapsed = time.time() - last_token_time[0]
            if elapsed > inter_token_timeout:
                timed_out[0] = True
                stop_event.set()
                return
            time.sleep(0.1)

    watchdog_thread = threading.Thread(target=watchdog, daemon=True)
    watchdog_thread.start()

    collected_tokens = []
    try:
        # Note: a generator blocked inside the provider SDK cannot be
        # interrupted from here; the break fires when iteration resumes.
        # Pair this with a request-level timeout for full protection.
        for token in stream_generator:
            if stop_event.is_set():
                break
            last_token_time[0] = time.time()
            collected_tokens.append(token)
    finally:
        stop_event.set()
        watchdog_thread.join(timeout=1.0)

    if timed_out[0]:
        if breaker:
            breaker.record_soft_failure()
        raise TimeoutError(
            f"Stream stalled: no token received within {inter_token_timeout}s"
        )
    return "".join(collected_tokens)
```
Set your `inter_token_timeout` based on observed P99 inter-token latency for each provider. A value of 5 to 8 seconds is a reasonable starting point for most hosted LLM APIs in 2026.
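Rather than hard-coding that value, you can derive it from recorded inter-token gaps. The sketch below picks the nearest-rank P99 gap and applies a safety margin; the margin multiplier and floor are assumed starting points to tune:

```python
import math

def calibrate_inter_token_timeout(
    inter_token_gaps_seconds: list,
    margin: float = 1.5,
    floor_seconds: float = 2.0,
) -> float:
    """Pick an inter-token timeout from observed gaps: P99 * margin, floored."""
    if not inter_token_gaps_seconds:
        return floor_seconds
    ordered = sorted(inter_token_gaps_seconds)
    # Nearest-rank P99: index ceil(0.99 * n) - 1
    idx = max(0, math.ceil(0.99 * len(ordered)) - 1)
    p99 = ordered[idx]
    return max(floor_seconds, p99 * margin)
```

Recompute this periodically per provider so the watchdog tracks real latency rather than a stale guess.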
Step 6: Persist Circuit State Across Instances
If you are running your multi-agent system on multiple worker instances (and you almost certainly are), each instance cannot maintain its own in-memory circuit breaker state independently. One worker might have already tripped the circuit while another keeps hammering the degraded provider.
The solution is to back your circuit breaker state with a shared store. Redis is the standard choice here due to its atomic operations and TTL support:
- Use a Redis key like `cb:openai:state` to store the current circuit state as a string.
- Use `INCR` with a TTL-based sliding window for failure counters, using keys like `cb:openai:hard_failures` with a 60-second expiry.
- Use a Redis Lua script to make state transitions atomic, preventing race conditions when multiple workers simultaneously record failures.
- Publish state change events to a Redis pub/sub channel so all workers immediately learn when a circuit opens or resets, rather than waiting for the next request cycle.
This distributed circuit breaker pattern ensures that a failure detected by any one worker immediately protects all workers in your fleet, which is critical for preventing cascade failures under high load.
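As a sketch of the sliding-window counter, the following uses an in-memory `FakeRedis` stand-in (supporting only `incr` and `expire`) so the logic is runnable without a server. In production you would pass a real `redis.Redis` client and wrap the increment-plus-expire pair in a Lua script or pipeline for atomicity.

```python
import time

class FakeRedis:
    """In-memory stand-in for redis.Redis, supporting incr + expire with TTLs."""

    def __init__(self):
        self._data = {}    # key -> value
        self._expiry = {}  # key -> unix timestamp when the key lapses

    def _evict(self, key):
        if key in self._expiry and time.time() >= self._expiry[key]:
            self._data.pop(key, None)
            self._expiry.pop(key, None)

    def incr(self, key):
        self._evict(key)
        self._data[key] = int(self._data.get(key, 0)) + 1
        return self._data[key]

    def expire(self, key, seconds):
        self._expiry[key] = time.time() + seconds

def record_failure(client, provider: str, kind: str, window_seconds: int = 60) -> int:
    """Increment a shared failure counter that resets itself after the window."""
    key = f"cb:{provider}:{kind}_failures"
    count = client.incr(key)
    if count == 1:
        # First failure in this window starts the TTL clock
        client.expire(key, window_seconds)
    return count
```

Any worker can then compare the returned count against its threshold and trip the shared circuit, so a failure seen by one worker is visible to all of them.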
Step 7: Instrument Everything with Observability
A circuit breaker you cannot observe is a circuit breaker you cannot trust. Instrument your implementation with the following metrics and events, emitting them to your observability stack (Datadog, Grafana, OpenTelemetry, or equivalent):
Metrics to Track
- `llm.circuit.state`: A gauge per provider, with value 0 (closed), 1 (open), 2 (half-open), 3 (throttled).
- `llm.circuit.hard_failures.total`: Counter per provider.
- `llm.circuit.soft_failures.total`: Counter per provider, tagged by failure reason.
- `llm.circuit.failover.total`: Counter tracking how often traffic was rerouted to a secondary provider.
- `llm.response.quality_score`: Histogram of quality scores per provider, so you can spot degradation trends before the circuit trips.
- `llm.stream.watchdog_timeout.total`: Counter of stream stalls per provider.
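The state gauge can be a plain lookup over the state names. In this sketch, `emit_fn` stands in for whatever your metrics client exposes (a statsd gauge call, an OpenTelemetry callback, and so on):

```python
# Numeric encoding for the llm.circuit.state gauge, per provider
STATE_GAUGE_VALUES = {"closed": 0, "open": 1, "half_open": 2, "throttled": 3}

def emit_circuit_state(provider: str, state_name: str, emit_fn) -> int:
    """Emit llm.circuit.state as a numeric gauge tagged by provider."""
    value = STATE_GAUGE_VALUES[state_name]
    emit_fn("llm.circuit.state", value, {"provider": provider})
    return value
```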
Alerts to Configure
- Alert when any primary provider circuit enters the Open state.
- Alert when the failover provider circuit also enters the Open state (this is a severity-1 incident).
- Alert when the P50 quality score for any provider drops below 0.75 over a 5-minute window, even if the circuit is still closed. This is your early warning system.
Step 8: Test Your Circuit Breaker Before Production
Do not wait for a real provider outage to find out your circuit breaker has a bug. Build a test harness that injects each failure type systematically:
- Hard failure injection: Mock the provider client to raise `ConnectionError` on every call and verify the circuit opens after hitting the threshold.
- Soft failure injection: Return valid HTTP 200 responses with intentionally malformed or truncated bodies and verify the quality scorer catches them and increments the soft failure counter.
- Rate limit injection: Return HTTP 429 responses and verify the circuit transitions to the Throttled state rather than Open.
- Stream stall injection: Yield a few tokens from the mock stream generator, then block indefinitely, and verify the watchdog fires and records a soft failure.
- Recovery testing: Open the circuit, wait for the cooldown, send probe requests that succeed, and verify the circuit returns to Closed after the required number of successful probes.
- Chaos testing: Run a multi-agent pipeline end-to-end with random failure injection at the provider layer and verify that no cascade failures occur and that all requests are either served by a fallback provider or fail with a clean, catchable exception.
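The first item on that list can be sketched as a standalone harness. `MiniBreaker` below is a condensed copy of the Step 2 breaker's hard-failure path, inlined only so the example runs by itself; in a real test suite you would import `LLMCircuitBreaker` instead.

```python
import time

class MiniBreaker:
    """Condensed hard-failure path of the Step 2 breaker; import the real
    LLMCircuitBreaker in an actual test suite."""

    def __init__(self, hard_failure_threshold=5):
        self.state = "closed"
        self.hard_failures = 0
        self.hard_failure_threshold = hard_failure_threshold
        self.opened_at = None

    def record_hard_failure(self):
        self.hard_failures += 1
        if self.hard_failures >= self.hard_failure_threshold:
            self.state = "open"
            self.opened_at = time.time()

    def allow_request(self):
        return self.state == "closed"

def flaky_provider_call(prompt):
    """Mock provider client that always fails at the transport level."""
    raise ConnectionError("injected failure")

def run_hard_failure_injection(breaker, attempts=10):
    """Drive the breaker with injected failures; return how many calls got through."""
    calls_forwarded = 0
    for _ in range(attempts):
        if not breaker.allow_request():
            continue  # circuit is open: the degraded provider is shielded
        try:
            flaky_provider_call("ping")
        except ConnectionError:
            breaker.record_hard_failure()
        calls_forwarded += 1
    return calls_forwarded
```

With a threshold of 5, exactly five calls should reach the mock before the circuit opens and the remaining attempts are rejected.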
Putting It All Together: The Architecture at a Glance
Here is how all the pieces fit together in a production multi-agent system:
- An agent orchestrator calls the `LLMProviderRouter` with a prompt and an optional schema.
- The router checks each provider's `LLMCircuitBreaker` state before forwarding the request.
- If the provider uses streaming, the response is wrapped in the `stream_with_watchdog` utility.
- Every completed response passes through the response quality scorer before being returned to the orchestrator.
- Successes and failures are recorded to the circuit breaker, which persists state to Redis for cross-instance consistency.
- All state transitions and quality metrics are emitted to your observability platform.
- If all providers are unavailable, the router raises a clean exception that the orchestrator handles with a graceful degradation strategy (such as returning a cached response or queuing the task for retry).
Common Mistakes to Avoid
- Treating 429 the same as 500: Rate limits are recoverable and predictable. Opening the circuit on a 429 will cause unnecessary downtime. Use the Throttled state instead.
- Using a single global circuit breaker: Each provider, and ideally each model endpoint within a provider, should have its own circuit breaker instance. A degraded `gpt-4o` endpoint does not mean `gpt-4o-mini` is also degraded.
- Ignoring the quality of successful responses: An HTTP 200 is not a success signal for an LLM. Always score the response content before recording a success.
- Setting thresholds without data: Calibrate your hard and soft failure thresholds against your actual traffic volume and provider SLAs. A threshold of 5 hard failures might be appropriate for 100 requests per minute but far too sensitive for 10,000 requests per minute.
- Forgetting about the half-open state under load: Under high traffic, multiple workers can simultaneously send probe requests during the half-open state. Use a distributed lock in Redis to ensure only one probe request goes through at a time.
Conclusion
Building reliable multi-agent AI systems in production is not just a machine learning problem; it is a distributed systems engineering problem. Third-party LLM providers will degrade, rate-limit, and partially fail, and when they do, a naive retry loop will turn a localized provider hiccup into a full-system cascade failure within seconds.
The circuit breaker pattern, extended with LLM-specific response quality scoring, a Throttled state for rate limits, a streaming watchdog, and distributed state persistence, gives your backend the resilience it needs to keep agent pipelines running even when the models underneath them are having a bad day.
Start with Step 1 and define your failure taxonomy. Everything else flows from that decision. Once your circuit breaker is in place, you will spend far less time firefighting provider outages and far more time building the features that actually matter.
Have you implemented a circuit breaker for your LLM-powered system? Share your approach and lessons learned in the comments below.