How One B2B SaaS Team's Post-Mortem Uncovered a Single Misconfigured Rate Limiter Behind Their Multi-Agent Pipeline's Cascading Failures
It started with a routine Monday morning alert. The on-call engineer at Velorant AI (a mid-stage B2B SaaS company building AI-powered revenue intelligence tools) woke up to a Slack flood of red. Their flagship multi-agent pipeline, the one that automated prospect research, CRM enrichment, and outbound sequence generation for enterprise sales teams, had silently degraded overnight. Dozens of tool calls were returning empty results. Agents were retrying endlessly. One downstream agent was hallucinating outputs because its upstream data-fetching counterpart had stopped responding with usable payloads.
The incident lasted six hours. It cost the company three enterprise client SLA violations and triggered a full engineering post-mortem. The root cause was not a broken API, not a model regression, not a network partition. It was a single misconfigured rate limiter sitting quietly in their backend gateway, one that nobody had touched in weeks, silently strangling every agent in the pipeline.
This is the story of what went wrong, what the post-mortem revealed, and the throttling architecture Velorant's team built to make sure it never happened again. If you are building or operating multi-agent AI systems in production today, this one is worth reading carefully.
The Architecture Before the Incident
Velorant's pipeline was a classic orchestrator-plus-specialist pattern, a design that has become one of the most common multi-agent topologies in production B2B SaaS systems as of early 2026. The setup looked like this:
- Orchestrator Agent: Received a sales rep's natural language request (e.g., "Build me a prospect list of CFOs at Series B SaaS companies in DACH") and decomposed it into subtasks.
- Research Agent: Called external enrichment APIs (LinkedIn data partners, Crunchbase, proprietary firmographic databases) to gather raw prospect data.
- Scoring Agent: Evaluated and ranked prospects using internal fit-score models and CRM signal data.
- Sequence Agent: Generated personalized outbound email sequences for each scored prospect.
- Write-Back Agent: Pushed finalized data and sequences back into the customer's CRM via webhook.
Each agent communicated through a shared tool-call bus, a lightweight internal message broker that routed function calls between agents and to external APIs. The entire pipeline ran asynchronously, with agents polling a Redis-backed task queue and executing tool calls through a centralized API gateway layer.
That gateway layer is where the problem lived.
What Actually Happened: The Root Cause
Three weeks before the incident, a backend engineer had updated the API gateway's rate limiter configuration to accommodate a new enterprise customer whose contract included a higher API call volume tier. The intent was to raise the global rate limit ceiling. The execution, however, introduced a subtle but catastrophic misconfiguration.
The rate limiter was using a token bucket algorithm with two parameters: a bucket capacity (maximum burst) and a refill rate (tokens added per second). The engineer correctly updated the bucket capacity from 500 to 1,200 tokens. But they set the refill rate in the wrong unit: they entered 10, believing the field was tokens-per-second, when a recent refactor had changed the config to interpret it as tokens-per-minute. The result: the gateway was refilling at 10 tokens per minute instead of 10 per second, a 60x reduction in sustained throughput.
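The arithmetic behind the failure is easy to sketch. In the toy model below (field names are illustrative, not Velorant's actual config), sustained throughput is determined entirely by the refill rate, so a per-second value silently read as per-minute cuts throughput 60x while the burst capacity masks the problem:

```python
from dataclasses import dataclass
from typing import Literal

@dataclass
class RefillConfig:
    rate: float
    per: Literal["second", "minute"]  # the unit the refactor silently changed

def sustained_tokens_per_min(cfg: RefillConfig) -> float:
    # Long-run throughput is bounded by the refill rate alone;
    # burst capacity only absorbs short spikes.
    return cfg.rate * 60 if cfg.per == "second" else cfg.rate

intended = RefillConfig(rate=10, per="second")  # what the engineer meant
shipped = RefillConfig(rate=10, per="minute")   # what the refactored config read

print(sustained_tokens_per_min(intended))  # 600
print(sustained_tokens_per_min(shipped))   # 10
```

With 1,200 tokens of burst, sporadic daytime traffic never drains the bucket, which is exactly why the misconfiguration stayed invisible until a bulk run arrived.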
During low-traffic hours, this went completely unnoticed. The burst capacity of 1,200 tokens was more than enough to handle sporadic daytime requests. But at 2:14 AM, a scheduled batch job kicked off. A large enterprise customer had queued up a bulk pipeline run, enriching 4,000 new prospects imported from a trade show lead list. The orchestrator spun up dozens of concurrent Research Agent instances. Within 90 seconds, the token bucket was exhausted.
Then the cascade began.
The Cascade: How One Limiter Took Down Five Agents
This is the part of the post-mortem that the team found most instructive, and most humbling. The failure did not look like a failure at first. It looked like slowness.
Stage 1: Silent Throttling
When the gateway began rejecting tool calls with 429 Too Many Requests, the Research Agents did exactly what they were designed to do: they applied exponential backoff and retried. The problem was that the retry logic had been written with a global, non-jittered backoff. Every agent backed off for the same interval and retried at the same moment, creating a thundering herd that immediately re-exhausted the already-depleted token bucket on every retry cycle.
Stage 2: Queue Saturation
As Research Agents stalled, their incomplete task results sat in the Redis queue with TTLs that were set generously (to handle legitimate slow enrichment APIs). The Scoring Agent, polling the queue for completed research payloads, began receiving partial or empty data objects because some research sub-tasks had timed out and written null results before the retry logic could recover them. The Scoring Agent had no schema validation on its input; it accepted partial payloads and attempted to score them anyway.
Stage 3: Garbage In, Hallucinations Out
The Scoring Agent passed its malformed scores downstream to the Sequence Agent. The Sequence Agent's prompt template included a block for "top 3 firmographic signals to reference in the opening line." With null firmographic data, the LLM filling that template had nothing to ground on and began confabulating company details: inventing funding rounds, fabricating team sizes, generating plausible-sounding but entirely fictional context. Several of these sequences were written back to the CRM before the on-call alert fired.
Stage 4: Write-Back Amplification
The Write-Back Agent, receiving a flood of completed (but corrupted) sequence payloads, hammered the customer CRM webhooks. Some CRM endpoints began rate-limiting the write-back agent independently. This triggered a second wave of retries, this time against external systems, compounding the internal queue backup.
By the time the on-call engineer was paged, the pipeline had been in a degraded state for over two hours, and the failure signature looked nothing like a rate limiter problem. It looked like a data quality issue, an LLM reliability issue, a CRM integration issue. It took the post-mortem team four hours of log archaeology to trace every symptom back to a single misconfigured integer.
The Post-Mortem: Five Key Findings
Velorant's engineering team ran a structured post-mortem using a blameless RCA (Root Cause Analysis) framework. Their five key findings became the blueprint for their redesign:
- Rate limiter config had no unit validation. The config schema accepted numeric values without enforcing or labeling units. A simple type annotation or enum constraint (e.g., refill_rate_per: "second" | "minute") would have made the misconfiguration immediately visible.
- Retry logic was non-jittered and globally synchronized. All agents shared the same base backoff interval with no randomization, making thundering herd behavior inevitable under any sustained throttling event.
- Agents lacked input contract enforcement. The Scoring Agent accepted partial payloads silently. No schema validation, no dead-letter queue for malformed inputs, no circuit breaker to halt processing when upstream data quality dropped below a threshold.
- The LLM prompt had no null-safety handling. The Sequence Agent's prompt template did not check for missing fields before injecting them into the model context. There was no fallback instruction telling the model to omit or flag missing data rather than infer it.
- Observability was pipeline-unaware. Existing monitoring tracked individual API call latency and error rates but had no concept of pipeline-level health. There was no alert that fired when agent-to-agent data fidelity dropped, only when raw error counts crossed a threshold.
The New Architecture: Backend Throttling Built for Agentic Systems
Over the six weeks following the incident, Velorant's platform team rebuilt their throttling and resilience layer from the ground up. Here is what they shipped:
1. Hierarchical, Agent-Aware Rate Limiting
The old gateway used a single global token bucket. The new architecture implements a three-tier rate limiting hierarchy:
- Global tier: Hard ceiling on total API calls per minute across all tenants, protecting infrastructure.
- Pipeline tier: Each active pipeline run is assigned a dedicated token budget, calculated at orchestration time based on the estimated number of tool calls in the task graph. Pipelines cannot starve each other.
- Agent tier: Each agent type has its own sub-bucket with limits tuned to its expected call patterns. Research Agents, which are the highest-volume callers, have their own bucket with a higher burst allowance but a stricter sustained rate.
Critically, all rate limiter configurations now use a strongly typed schema enforced at deploy time, with explicit unit fields, range validation, and a required code-review checklist item for any changes to throttling parameters.
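What a unit-safe, deploy-time-validated config might look like can be sketched with stdlib dataclasses (the team's actual schema and field names are not public; this is an illustration of the validation idea, not their code):

```python
from dataclasses import dataclass
from typing import Literal

Tier = Literal["global", "pipeline", "agent"]

@dataclass(frozen=True)
class RateLimitConfig:
    """Illustrative typed limiter config with explicit, validated units."""
    tier: Tier
    bucket_capacity: int
    refill_rate: float
    refill_per: Literal["second", "minute"]  # unit must be stated, never implied

    def __post_init__(self):
        # Range and enum checks run at construction, i.e. at deploy time.
        if self.bucket_capacity <= 0:
            raise ValueError("bucket_capacity must be positive")
        if self.refill_rate <= 0:
            raise ValueError("refill_rate must be positive")
        if self.refill_per not in ("second", "minute"):
            raise ValueError("refill_per must be 'second' or 'minute'")

# A valid agent-tier config constructs cleanly...
cfg = RateLimitConfig(tier="agent", bucket_capacity=1200,
                      refill_rate=10, refill_per="second")

# ...while a mislabeled unit fails at deploy time instead of at 2 AM.
try:
    RateLimitConfig(tier="agent", bucket_capacity=1200,
                    refill_rate=10, refill_per="per_min")
except ValueError as e:
    print(f"rejected: {e}")
```

The key property is that the unit is part of the value, so a reviewer (or a CI check) sees "10 per minute" rather than a bare integer.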
2. Jittered Exponential Backoff with Per-Agent Seeds
The retry logic was rewritten to implement full jitter (as opposed to equal jitter or no jitter). Each agent instance seeds its jitter calculation with a hash of its unique agent ID, ensuring that even agents of the same type retry at different intervals. The formula follows the pattern popularized by AWS's retry guidance but adapted for high-concurrency agentic workloads:
sleep = random_between(0, min(cap, base * 2^attempt))
Maximum retry attempts per tool call are capped at five, after which the task is routed to a dead-letter queue for human review rather than silently failing or writing a null result.
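A minimal sketch of this retry scheme, with a per-agent seed derived from the agent ID (the function and parameter names here are hypothetical, not Velorant's internals):

```python
import hashlib
import random

def full_jitter_delay(agent_id: str, attempt: int,
                      base: float = 0.5, cap: float = 30.0) -> float:
    """Full-jitter backoff: sleep = random_between(0, min(cap, base * 2^attempt)).

    Seeding the RNG from (agent_id, attempt) keeps instances of the same
    agent type from retrying in lockstep while staying reproducible.
    """
    ceiling = min(cap, base * 2 ** attempt)
    seed = int(hashlib.sha256(f"{agent_id}:{attempt}".encode()).hexdigest(), 16)
    return random.Random(seed).uniform(0, ceiling)

MAX_ATTEMPTS = 5  # after this, route the task to the dead-letter queue

delays = [full_jitter_delay("research-agent-07", a) for a in range(MAX_ATTEMPTS)]
print(delays)  # five delays, each within [0, min(30, 0.5 * 2^attempt)]
```

Because each delay is drawn uniformly from zero up to the exponential ceiling, two agents that fail at the same instant almost never retry at the same instant, which is what breaks the thundering herd.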
3. Pipeline-Level Circuit Breakers
Inspired by the classic microservices circuit breaker pattern, Velorant implemented pipeline-scoped circuit breakers that monitor the ratio of successful to failed tool calls across a rolling 60-second window. If any agent's failure rate exceeds 40% within a pipeline run, the circuit breaker trips and the entire pipeline is paused, not cancelled. The orchestrator receives a backpressure signal and holds new task dispatches until the failure rate recovers. This prevents the cascade from propagating downstream while the upstream issue resolves.
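The core of such a breaker fits in a few dozen lines. This is an illustrative sketch under the parameters described above (60-second window, 40% threshold), not Velorant's production code:

```python
import time
from collections import deque
from typing import Optional

class PipelineCircuitBreaker:
    """Pipeline-scoped breaker: trips when the failure ratio over a rolling
    window exceeds a threshold, signalling the orchestrator to pause
    dispatch rather than cancel the run."""

    def __init__(self, window_sec: float = 60.0, failure_threshold: float = 0.40):
        self.window_sec = window_sec
        self.failure_threshold = failure_threshold
        self.events: deque = deque()  # (timestamp, succeeded: bool)

    def record(self, succeeded: bool, now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        self.events.append((now, succeeded))
        self._evict(now)

    def _evict(self, now: float) -> None:
        # Drop events that have aged out of the rolling window.
        while self.events and now - self.events[0][0] > self.window_sec:
            self.events.popleft()

    def should_pause(self, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        self._evict(now)
        if not self.events:
            return False
        failures = sum(1 for _, ok in self.events if not ok)
        return failures / len(self.events) > self.failure_threshold

breaker = PipelineCircuitBreaker()
for ok in [True, True, False, False, False]:  # 60% failures in-window
    breaker.record(ok, now=0.0)
print(breaker.should_pause(now=0.0))  # True: orchestrator holds new dispatches
```

Passing `now` explicitly makes the window logic testable; in production the monotonic clock default is used.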
4. Agent Input Schema Validation with Dead-Letter Routing
Every agent now declares a strict Pydantic-enforced input schema as part of its agent definition. Before any agent processes a payload, the schema is validated. Payloads that fail validation are not passed to the agent; they are routed to a dead-letter topic in the message broker, tagged with the pipeline run ID, the originating agent, and the validation error. A separate monitoring dashboard surfaces these events in real time, giving the engineering team visibility into data quality degradation before it reaches the LLM layer.
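The validate-or-dead-letter flow can be sketched as follows. Velorant uses Pydantic models; this portable stdlib version (with hypothetical field names and an in-memory list standing in for the broker's dead-letter topic) shows the same routing logic:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ScoringInput:
    """Required fields for the Scoring Agent (field names are illustrative)."""
    prospect_id: str
    company_name: str
    firmographics: dict

REQUIRED = ("prospect_id", "company_name", "firmographics")
dead_letter: list = []  # stands in for a dead-letter topic in the broker

def validate_or_dead_letter(payload: dict, run_id: str,
                            source_agent: str) -> Optional[ScoringInput]:
    """Validate a handoff payload; route failures to the dead-letter list,
    tagged with the pipeline run ID, origin agent, and the error."""
    missing = [f for f in REQUIRED if not payload.get(f)]
    if missing:
        dead_letter.append({"run_id": run_id, "source": source_agent,
                            "error": f"missing/empty fields: {missing}",
                            "payload": payload})
        return None  # the agent never sees the malformed payload
    return ScoringInput(**{k: payload[k] for k in REQUIRED})

ok = validate_or_dead_letter({"prospect_id": "p1", "company_name": "Acme",
                              "firmographics": {"size": 120}}, "run-42", "research")
bad = validate_or_dead_letter({"prospect_id": "p2", "company_name": None,
                               "firmographics": {}}, "run-42", "research")
print(ok is not None, bad is None, len(dead_letter))  # True True 1
```

Note that empty dicts and nulls both count as missing here, which is precisely the partial-payload case that the Scoring Agent used to accept silently.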
5. Null-Safe Prompt Templates with Explicit Fallback Instructions
The Sequence Agent's prompt template was refactored to include conditional blocks for every data field that could potentially be missing. The template now explicitly instructs the model on what to do when a field is absent:
"If [firmographic_signal] is not provided, do not invent or infer a value. Instead, write a generic opening line that does not reference specific company details, and append the tag [DATA_MISSING] to the output for review."
This single change, adding explicit null-handling instructions to the prompt, eliminated hallucinated outputs in testing across hundreds of deliberately corrupted test payloads.
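In code, a null-safe template boils down to branching on the field before it ever reaches the model context. The rendering helper below is a sketch with illustrative wording, not Velorant's production prompt builder:

```python
from typing import Optional

def firmographic_block(signals: Optional[list]) -> str:
    """Render the 'top firmographic signals' prompt section with an explicit
    fallback, so missing data becomes an instruction rather than a void."""
    if signals:
        bullets = "\n".join(f"- {s}" for s in signals[:3])
        return "Top firmographic signals to reference in the opening line:\n" + bullets
    return ("No firmographic signals are available for this prospect. Do not invent "
            "or infer a value. Write a generic opening line that does not reference "
            "specific company details, and append the tag [DATA_MISSING] to the output.")

print("[DATA_MISSING]" in firmographic_block(None))                        # True
print("[DATA_MISSING]" in firmographic_block(["Series B, raised 2024"]))   # False
```

The point is that the model never receives an empty slot to fill; it receives either grounded data or an explicit instruction about its absence.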
6. Pipeline-Aware Observability with Fidelity Scoring
The team integrated a new observability layer that tracks pipeline fidelity as a first-class metric. For each pipeline run, a fidelity score is computed as the percentage of agent-to-agent data handoffs that passed schema validation with complete, non-null required fields. If a pipeline run's fidelity score drops below 85%, an alert fires before the run completes, giving engineers and (optionally) the customer a chance to review or halt the run. This metric is surfaced in both the internal ops dashboard and the customer-facing pipeline health panel.
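The fidelity metric itself is simple to compute. A minimal sketch, assuming each handoff is reduced to a pass/fail validation result (the aggregation pipeline around it is the hard part, not this function):

```python
def fidelity_score(handoff_results: list) -> float:
    """Fraction of agent-to-agent handoffs in a pipeline run that passed
    schema validation with all required fields present and non-null."""
    if not handoff_results:
        return 1.0  # no handoffs yet: nothing has degraded
    passed = sum(1 for ok in handoff_results if ok)
    return passed / len(handoff_results)

ALERT_THRESHOLD = 0.85

run = [True] * 16 + [False] * 4   # 16 of 20 handoffs clean
score = fidelity_score(run)
print(score, score < ALERT_THRESHOLD)  # 0.8 True -> alert fires mid-run
```

Because the score is evaluated while the run is in flight, a degrading run trips the alert before corrupted payloads reach the write-back stage.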
Results: Three Months Post-Redesign
By early 2026, Velorant had been running the new architecture in production for roughly three months. The numbers told a clear story:
- Zero cascading pipeline failures attributable to rate limiter exhaustion since the redesign went live.
- Dead-letter queue volume dropped 94% after the jittered retry logic was deployed, as thundering herd retries were eliminated.
- Mean time to detection (MTTD) for pipeline degradation fell from 127 minutes to 8 minutes, thanks to the fidelity scoring alerts.
- LLM hallucination rate in Sequence Agent outputs dropped to near zero for runs with complete upstream data, and flagged outputs for incomplete data were caught and quarantined before CRM write-back in 100% of test cases.
- The engineering team recovered two of the three SLA-violating enterprise accounts by presenting the post-mortem findings and redesign roadmap transparently, turning a trust-damaging incident into a demonstration of engineering maturity.
What Every AI Engineering Team Should Take Away From This
The Velorant incident is not an edge case. As multi-agent pipelines become the default architecture for sophisticated AI features in B2B SaaS products, the blast radius of a single misconfigured infrastructure component grows dramatically. A rate limiter that would cause a brief blip in a monolithic API integration can bring down an entire agentic workflow when five interdependent agents are all funneling calls through the same gateway.
Here are the principles worth internalizing from this case study:
- Config changes to shared infrastructure need agentic blast-radius analysis. Before changing a rate limiter, ask: how many agents call through this gateway, and what happens to each of them if this limit is halved?
- Retry logic designed for single-service calls fails at agent scale. Jitter is not optional in multi-agent systems. It is foundational.
- LLMs will fill data voids with plausible fiction. Null-safety in prompts is as important as null-safety in code. Treat missing data as a first-class case in every prompt template.
- Pipeline-level observability is a different discipline from service-level observability. You need metrics that span agent boundaries, not just metrics within them.
- Transparency with customers after incidents builds more trust than silence. Velorant's decision to share the post-mortem summary with affected clients was one of the best decisions they made in the entire episode.
Conclusion
The irony of the Velorant incident is that the root cause was not an AI problem at all. It was a classic distributed systems problem, a misconfigured rate limiter, amplified by the unique characteristics of agentic architectures: high concurrency, deep interdependency between agents, and LLMs that will confidently synthesize outputs from incomplete inputs.
As the industry matures its understanding of multi-agent systems in 2026, the most important lesson may be this: the infrastructure beneath your agents matters as much as the intelligence within them. A brilliant orchestrator agent cannot compensate for a throttling layer that is quietly starving it. Build your backend with the same rigor you bring to your prompts and your models, and your pipelines will be far more resilient for it.
If your team is building multi-agent pipelines and you have not yet stress-tested your rate limiting behavior under sustained concurrent load, do it before your customers do it for you.