How a Healthcare SaaS Team's Multi-Tenant Agentic Platform Collapsed Under Uncontrolled Tool-Call Fan-Out Storms, and the Per-Tenant Concurrency Throttling Architecture That Restored Stability
At 2:47 AM on a Tuesday in January 2026, the on-call engineer at a mid-sized healthcare SaaS company received a cascade of PagerDuty alerts. Within four minutes, three of the platform's largest hospital-network tenants had gone completely dark. The AI-powered clinical workflow automation platform, which had been celebrated internally as the team's flagship product, was returning timeouts across the board. Downstream services were queuing up. Postgres connection pools were exhausted. And the root cause was not a network outage, not a cloud provider blip, not a bad deployment. It was a pattern the team had never explicitly planned for: a tool-call fan-out storm.
This is the story of what happened, why it happened, and how the engineering team rebuilt their agentic infrastructure with a per-tenant concurrency throttling architecture that not only stopped the bleeding but made the entire platform more resilient, fair, and observable than it had ever been before.
The Platform: What They Built and Why It Was Exciting
The company, which we will call ClarityHealth (a composite pseudonym representing a real class of healthcare SaaS platforms operating in 2025 and into 2026), had built a multi-tenant agentic platform designed to automate clinical administrative workflows. Think prior authorization processing, patient intake triage, insurance eligibility verification, and discharge summary generation.
Their architecture was genuinely impressive. Each tenant (a hospital network, a large clinic group, or a specialty practice) got a logically isolated workspace. Workflows were orchestrated by LLM-based agents built on top of a leading frontier model API, with a library of over 40 registered tools. These tools ranged from lightweight lookups (patient demographics, formulary checks) to heavy integrations (HL7 FHIR API calls, EHR write-backs, external payer portals).
The agent runtime was built on a popular open-source orchestration framework, with a custom execution layer the team had written in Python. Each incoming workflow request spun up an agent context, and the agent would reason through the task, deciding which tools to call and in what sequence. The team had invested heavily in prompt engineering and retrieval-augmented context injection. What they had not invested in was any meaningful constraint on how many tool calls a single agent, or a single tenant's fleet of agents, could fire simultaneously.
Understanding the Fan-Out Storm: What Actually Happened
To understand the failure, you need to understand a specific behavior that emerges in agentic systems at scale: tool-call fan-out.
In a well-behaved workflow, an agent reasons step by step, calling one or two tools, processing the results, then proceeding. But in practice, modern LLMs with parallel tool-calling capabilities will often decide to fire multiple tool calls simultaneously when they determine the results are independent. This is a feature, not a bug. It dramatically reduces end-to-end latency for complex tasks. The problem is that "simultaneously" at the agent level becomes "simultaneously" at the infrastructure level, and when you have hundreds of agents running across dozens of tenants, the math gets ugly fast.
Here is what happened on that January morning. A large hospital network tenant, let us call them Meridian Regional, had configured a nightly batch job that kicked off at 2:30 AM. The job processed prior authorization requests that had accumulated during the day, approximately 1,200 of them. Each request spawned an agent. Each agent, when processing a complex prior auth case, would typically fan out to between 4 and 9 tool calls in its first reasoning step: checking patient eligibility, pulling the relevant clinical notes, querying the formulary database, checking payer-specific rules, and so on.
The arithmetic is straightforward and alarming:
- 1,200 agents spawned near-simultaneously from the batch trigger
- Each agent firing an average of 6 parallel tool calls in the first step
- That is roughly 7,200 concurrent tool-call executions within the first 60 seconds
- Many of those tool calls hit the same downstream services: the FHIR API gateway, the payer rules engine, the internal PostgreSQL read replicas
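The fan-out math above can be reproduced as a quick sanity check (the figures come straight from the incident description; since 6 is the per-agent average, the true peak varied from agent to agent):

```python
agents_spawned = 1200      # near-simultaneous spawns from the batch trigger
avg_parallel_calls = 6     # average parallel tool calls in the first reasoning step
fan_out_range = (4, 9)     # observed per-agent fan-out range

# Expected in-flight tool calls in the first minute, at the average
estimated_concurrent = agents_spawned * avg_parallel_calls   # 7,200

# If every agent hit the top of the observed range instead
worst_case = agents_spawned * fan_out_range[1]               # 10,800
```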
The platform had no per-tenant concurrency cap. It had no global tool-call concurrency limit. It had basic rate limiting on the LLM API calls (because the model provider enforced it), but nothing governing the downstream tool execution layer. Within 90 seconds of the batch job starting, the PostgreSQL connection pool was fully exhausted. The FHIR API gateway began returning 503s. Because the agents were designed to retry on transient failures, they did exactly that, amplifying the load further. The retry storm on top of the fan-out storm created a feedback loop that took down not just Meridian Regional's workflows but every other tenant on the platform, including two other hospital networks running urgent overnight workflows.
The Blast Radius: More Than Just Downtime
The immediate impact was four hours of degraded service across the platform. But the blast radius extended well beyond the outage window itself.
Because this was a healthcare platform, the stakes were not just financial. Delayed prior authorization processing meant clinical staff arrived in the morning to a backlog of cases that should have been pre-cleared. Two emergency medication authorizations that had been queued for overnight processing were delayed, requiring manual intervention by on-call pharmacists. No patient harm resulted, but the near-miss was sobering and generated a formal incident report that went to the company's compliance and legal teams.
On the business side, two of the three affected tenants triggered SLA breach notifications. One of them, a large integrated health system that represented roughly 18% of the company's ARR, requested an emergency review meeting and put their contract renewal on hold pending a root cause analysis and remediation plan.
Internally, the post-mortem revealed that the engineering team had known, in a vague sense, that the platform had no fan-out controls. It had been on the backlog. It had been deprioritized in favor of feature development. The incident converted it from a backlog item to a company-level priority overnight.
The Root Cause Analysis: Three Compounding Design Gaps
The post-mortem identified three distinct architectural gaps that combined to produce the failure. Addressing only one or two of them would not have been sufficient.
1. No Per-Tenant Agent Concurrency Limit
The platform had no concept of a per-tenant execution budget. Any tenant could spawn as many concurrent agent contexts as their workload demanded. In a single-tenant system, this is fine. In a multi-tenant system sharing infrastructure, it is a recipe for noisy-neighbor failures. Meridian Regional's batch job consumed the entirety of available execution capacity, leaving nothing for other tenants.
2. No Tool-Call Concurrency Throttling at the Execution Layer
Even within a single agent context, there was no ceiling on how many tool calls could be in-flight simultaneously. The orchestration framework's default behavior was to execute all parallel tool calls as true concurrent coroutines, with no semaphore, no token bucket, no leaky bucket, nothing. The tool execution layer was effectively a free-for-all.
3. Retry Logic Without Backpressure Awareness
The retry logic was naive: exponential backoff with jitter, which is generally good practice. But it had no awareness of system-wide backpressure signals. When the database connection pool was exhausted, the correct behavior was to stop retrying and surface a degraded-service signal. Instead, agents continued retrying, treating pool exhaustion as a transient error, which it technically was, but which retrying actively made worse.
The Architecture That Fixed It: Per-Tenant Concurrency Throttling
The team spent three weeks designing and implementing a remediation architecture. The core of the solution was a Per-Tenant Concurrency Throttling (PTCT) layer inserted between the agent orchestrator and the tool execution runtime. Here is how it works.
The Tenant Execution Budget Model
Every tenant on the platform is now assigned an execution budget, expressed as two parameters:
- Max Concurrent Agents (MCA): The maximum number of agent contexts that can be in an active (non-idle) state simultaneously for that tenant. This is sized based on the tenant's tier, their historical usage patterns, and their SLA class.
- Max Concurrent Tool Calls (MCTC): The maximum number of tool call executions that can be in-flight simultaneously across all active agents for that tenant.
For Meridian Regional, post-incident, the MCA was set to 150 (not 1,200) for their nightly batch tier, with a burst allowance of up to 250 for a maximum of 60 seconds. The MCTC was set to 400. This meant their batch job would still complete, just over a longer window (roughly 25 to 35 minutes instead of 8 to 10), and would not consume the shared infrastructure in a way that affected other tenants.
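A minimal sketch of the budget model in Python (the class and field names here are illustrative, not ClarityHealth's actual schema), populated with Meridian Regional's post-incident batch-tier numbers:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TenantExecutionBudget:
    """Per-tenant execution budget. Field names are illustrative."""
    tenant_id: str
    max_concurrent_agents: int      # MCA: active agent contexts at once
    max_concurrent_tool_calls: int  # MCTC: in-flight tool calls across all agents
    burst_agents: int = 0           # temporary MCA headroom
    burst_window_seconds: int = 0   # how long the burst may be sustained

    def agent_ceiling(self, in_burst: bool) -> int:
        """Effective MCA ceiling, honoring an active burst window."""
        return self.max_concurrent_agents + (self.burst_agents if in_burst else 0)

# Meridian Regional's nightly batch tier, per the numbers above
meridian = TenantExecutionBudget(
    tenant_id="meridian-regional",
    max_concurrent_agents=150,
    max_concurrent_tool_calls=400,
    burst_agents=100,           # 150 + 100 = 250 burst ceiling
    burst_window_seconds=60,
)
```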
The Priority Queue and Workflow Class System
One of the team's key concerns during the redesign was avoiding the opposite problem: throttling so aggressively that time-sensitive workflows (urgent prior authorizations, real-time patient intake) were starved by background batch jobs. Their internal design doc labeled this the "starving low-priority workflows" problem, but the real risk was the inverse: high-priority workflows being starved by low-priority batch work.
The solution was a three-class workflow priority system:
- Class A (Urgent): Real-time clinical workflows, emergency authorizations, active patient intake. These consume from a dedicated, always-reserved concurrency pool. They are never queued behind Class B or C work.
- Class B (Standard): Normal business-hours workflows, routine prior auths, scheduled report generation. These consume from the main tenant pool.
- Class C (Batch): Overnight batch jobs, bulk data processing, non-urgent analytics. These consume from a separate batch pool with lower MCTC ceilings and are explicitly rate-limited to prevent fan-out storms.
Crucially, Class C work can only consume concurrency budget that is not needed by Class A or B. The scheduler enforces this through a weighted fair-queue algorithm that checks priority class before allocating execution slots. A Class A workflow from any tenant will preempt a Class C slot if the system is under pressure.
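The reservation rule above can be sketched as a single-process slot allocator (a simplification: the production scheduler is a distributed weighted fair queue, and the B-over-C weighting is omitted here; all names are illustrative):

```python
from enum import Enum

class WorkflowClass(Enum):
    A = "urgent"
    B = "standard"
    C = "batch"

class SlotAllocator:
    """Class-aware slot check: Class A may dip into a reserve that
    Class B and C work must always leave untouched."""

    def __init__(self, total_slots: int, reserved_for_a: int):
        self.total_slots = total_slots
        self.reserved_for_a = reserved_for_a  # always held back for Class A
        self.in_use = {c: 0 for c in WorkflowClass}

    def try_acquire(self, wf_class: WorkflowClass) -> bool:
        free = self.total_slots - sum(self.in_use.values())
        if wf_class is WorkflowClass.A:
            ok = free > 0                     # Class A may consume the reserve
        else:
            ok = free > self.reserved_for_a   # B and C must not touch the reserve
        if ok:
            self.in_use[wf_class] += 1
        return ok

    def release(self, wf_class: WorkflowClass) -> None:
        self.in_use[wf_class] -= 1
```

Under pressure, batch work is refused slots while urgent work still gets through, which is the behavior the scheduler is meant to guarantee.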
The Semaphore-Based Tool Execution Gate
At the tool execution layer, the team implemented a distributed semaphore using Redis. Before any tool call is dispatched, the agent runtime must acquire a semaphore token from the tenant's MCTC pool. If no tokens are available, the tool call is queued, not dropped. The queue has a configurable max depth (default: 3x the MCTC ceiling) and a configurable wait timeout. If a tool call cannot acquire a token within the timeout window, it surfaces a structured THROTTLED error to the agent, which the agent's system prompt instructs it to handle gracefully by reducing its parallelism in the next reasoning step.
This last detail is subtle but important. The agents themselves are now backpressure-aware. When they receive a THROTTLED signal, they do not retry blindly. They adjust their tool-calling strategy, preferring sequential calls over parallel ones until the throttle signal clears. This required a modest but meaningful change to the agent system prompts and the tool result schema.
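The acquire-queue-timeout flow of the execution gate can be sketched in a single process (the production version is a distributed Redis semaphore shared across workers; `asyncio.Semaphore` stands in here, and the max queue depth described above is omitted for brevity):

```python
import asyncio

class ThrottledError(Exception):
    """Structured THROTTLED signal surfaced to the agent when no token arrives in time."""

class ToolCallGate:
    """Single-process sketch of the tool execution gate."""

    def __init__(self, mctc: int, wait_timeout: float):
        self._sem = asyncio.Semaphore(mctc)   # tenant's MCTC token pool
        self._wait_timeout = wait_timeout

    async def run_tool_call(self, tool_fn, *args):
        try:
            # Queue (wait) for a token rather than dropping the call outright.
            await asyncio.wait_for(self._sem.acquire(), timeout=self._wait_timeout)
        except asyncio.TimeoutError:
            # Surface a structured error the agent's prompt knows how to handle.
            raise ThrottledError("MCTC ceiling reached; reduce parallelism")
        try:
            return await tool_fn(*args)
        finally:
            self._sem.release()
```

With `mctc=2`, a third simultaneous call waits its configured timeout and then receives `ThrottledError` instead of piling onto the downstream dependency.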
Backpressure-Aware Retry Logic
The team replaced the naive retry logic with a backpressure-aware retry policy. The new policy consults a lightweight system health signal (a Redis key updated every 5 seconds by a background health monitor) before deciding whether to retry. If the health signal indicates that a downstream dependency (the FHIR gateway, the database pool, a payer API) is in a degraded state, retries are suspended and the workflow is parked in a "pending-backpressure" state. A separate recovery coordinator polls for parked workflows and requeues them when the health signal recovers. This breaks the retry amplification loop that turned the original fan-out storm into a full outage.
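A minimal sketch of the backpressure-aware policy, assuming a synchronous call site (the health dict stands in for the Redis health key, and all names here are illustrative):

```python
import random
import time

class TransientError(Exception):
    """A retryable downstream failure (timeouts, 503s, pool exhaustion)."""

def health_is_degraded(dependency: str, health: dict) -> bool:
    """Stand-in for reading the health signal the monitor refreshes every 5 seconds."""
    return health.get(dependency, "ok") == "degraded"

def retry_with_backpressure(op, dependency, health, max_attempts=5, base_delay=0.5):
    """Exponential backoff with jitter, suspended when the dependency is degraded.
    Returns ("ok", result), ("parked", None) for the recovery coordinator to
    requeue later, or ("failed", None) after exhausting attempts."""
    for attempt in range(max_attempts):
        if health_is_degraded(dependency, health):
            # Stop retrying; park the workflow instead of amplifying the load.
            return ("parked", None)
        try:
            return ("ok", op())
        except TransientError:
            delay = base_delay * (2 ** attempt)
            time.sleep(delay * random.uniform(0.5, 1.0))  # jittered backoff
    return ("failed", None)
```

The key difference from the original logic is the pre-retry health check: a degraded signal short-circuits the loop entirely, which is what breaks the retry amplification feedback loop.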
Observability: You Cannot Throttle What You Cannot See
A critical enabler of the new architecture was a significant investment in observability. The team built a real-time Tenant Concurrency Dashboard that visualizes, for each tenant:
- Current active agent count vs. MCA ceiling
- Current in-flight tool calls vs. MCTC ceiling
- Tool call queue depth and average wait time
- Throttle event rate (how often tool calls are being queued rather than immediately dispatched)
- Workflow class distribution (what fraction of load is Class A vs. B vs. C)
- Downstream dependency health signals
This dashboard became one of the most-used internal tools almost immediately. Before the incident, the team had essentially zero visibility into per-tenant concurrency behavior. Now, on-call engineers can see a fan-out storm forming in real time and intervene before it cascades. More importantly, the throttling architecture means that even without intervention, the system self-limits before reaching a failure state.
Results: Six Weeks After the Remediation Went Live
By mid-March 2026, the new architecture had been running in production for approximately six weeks. The results were clear:
- Zero cross-tenant impact events since go-live, compared to three in the preceding 90 days
- P99 latency for Class A workflows improved by 34%, primarily because they were no longer competing with batch fan-out for shared resources
- Batch job completion time for Meridian Regional increased by 22 minutes on average, a trade-off the tenant accepted willingly given the context
- Database connection pool utilization peaked at 71% during the next Meridian Regional batch run, compared to 100%+ (with queuing) during the incident
- The large health system that had paused its contract renewal signed a two-year renewal after reviewing the post-mortem and the new architecture documentation
The team also noted an unexpected benefit: the throttle event rate data gave them precise, empirical guidance for right-sizing tenant execution budgets. Tenants whose MCTC was frequently hitting its ceiling got their budgets increased. Tenants with near-zero throttle events were identified as candidates for cost optimization. The throttling layer became, in effect, a continuous capacity planning instrument.
Lessons for Teams Building Multi-Tenant Agentic Platforms
The ClarityHealth incident is not unique. As agentic AI platforms proliferate across healthcare, fintech, legal tech, and enterprise software in 2026, the tool-call fan-out problem is appearing repeatedly. Here are the distilled lessons from this case study:
Lesson 1: Parallel Tool Calling Is a Double-Edged Sword
The ability of modern LLMs to issue parallel tool calls is a genuine performance win. But it must be governed at the infrastructure level. Do not assume that because the model provider rate-limits your LLM API calls, your downstream tool execution is safe. It is not. The fan-out happens between the model and your tools, and that layer is entirely your responsibility.
Lesson 2: Multi-Tenancy Demands Explicit Isolation Budgets
Logical isolation (separate data, separate contexts) is not the same as resource isolation. In a shared-infrastructure multi-tenant system, every tenant needs an explicit execution budget. This is not optional. It is the difference between a noisy-neighbor problem and a platform-wide outage.
Lesson 3: Priority Classes Prevent Batch Work from Eating Real-Time Work
Not all workflows are equal. Building a priority class system is more complex than a single global queue, but it is essential in healthcare and any other domain where some workflows are time-sensitive and others are not. The weighted fair-queue approach ensures that low-priority batch work gets done, just not at the expense of urgent work.
Lesson 4: Make Your Agents Backpressure-Aware
This is the most underappreciated lesson. Throttling at the infrastructure layer is necessary but not sufficient if your agents respond to throttle signals by retrying harder. Agents need to be first-class participants in the backpressure contract. This means designing your tool result schema to include structured throttle signals and updating your agent system prompts to handle them intelligently.
Lesson 5: Observability Is Not Optional, It Is Load-Bearing
You cannot tune, debug, or trust a throttling architecture you cannot observe. Per-tenant concurrency dashboards should be built alongside the throttling layer, not as an afterthought. The data they produce will pay dividends in capacity planning, incident response, and customer conversations.
Conclusion: Stability as a Product Feature
The ClarityHealth incident is a case study in a failure mode that is becoming increasingly common as the industry moves from single-tenant AI prototypes to production-grade, multi-tenant agentic platforms. The fan-out storm was not caused by bad code or bad intentions. It was caused by a gap between how the system was designed (one workflow at a time) and how it was actually used (1,200 workflows simultaneously, each spawning parallel tool calls).
The per-tenant concurrency throttling architecture that emerged from the incident is not glamorous engineering. There are no novel algorithms involved. The Redis semaphore, the weighted fair queue, the priority class system: these are well-understood patterns applied thoughtfully to a new context. What made the difference was the discipline to implement them, the observability to validate them, and the organizational will to accept that a batch job taking 22 extra minutes is an acceptable price for a platform that never goes down.
For any team building a multi-tenant agentic platform in 2026, especially in a regulated domain like healthcare, the message from this incident is unambiguous: fan-out storms are not an edge case. They are an inevitable consequence of scale. Build for them before they find you.