The Silent Compliance Skip: How One SaaS Team Found a Race Condition Hiding in Their AI Agent Onboarding Pipeline
It started with a routine pre-audit review. A senior engineer at a mid-market SaaS company, let's call them Veridian Labs, was cross-referencing tenant provisioning logs against their compliance audit trail two weeks before a major enterprise client's SOC 2 Type II review. The numbers did not add up. Specifically, 11 of their 47 most recently onboarded tenants were missing entries in the compliance check log, even though their accounts were fully active, their AI agents were running, and no errors had ever been surfaced to the operations team.
No alerts. No exceptions. No failed webhooks. Just silently absent compliance records for nearly a quarter of their newest enterprise tenants.
What followed was a four-day forensic engineering sprint that uncovered one of the most insidious classes of bugs in modern distributed SaaS systems: a concurrency-induced race condition that only manifested when three or more tenant provisioning requests arrived within the same narrow time window. This is the full story of how it happened, how they found it, and how they fixed it before an enterprise audit exposed it first.
The Architecture: AI Agent Onboarding at Scale
Veridian Labs builds a B2B SaaS platform that provisions dedicated AI agents for enterprise tenants. Each tenant gets an isolated agent environment, complete with custom tool configurations, data access policies, and a role-based permission model. Their onboarding pipeline, built on a microservices architecture, handled the following steps in sequence for every new tenant:
- Tenant record creation in the primary Postgres database
- AI agent instantiation via their internal orchestration API
- Compliance check execution: data residency validation, PII scope enforcement, and role permission boundary checks
- Compliance log write to an append-only audit store
- Tenant activation and welcome email dispatch
On paper, the pipeline was clean, linear, and well-tested. Their unit and integration test suites passed with 97% coverage. The compliance check step was guarded by a service-level check that verified the tenant record existed before proceeding. What the tests never simulated, however, was what happened when three or more of these pipelines ran simultaneously against shared state.
The Bug: A Race Condition That Only Appeared at Three or More Concurrent Requests
The compliance check service used a shared in-memory provisioning state registry. This registry was a simple dictionary object that tracked which tenant IDs were currently mid-pipeline. The logic was designed to prevent duplicate provisioning runs for the same tenant. It looked roughly like this (simplified):
```python
if tenant_id not in provisioning_registry:
    provisioning_registry[tenant_id] = "in_progress"
    await run_compliance_checks(tenant_id)  # yields to the event loop at each await inside
    provisioning_registry.pop(tenant_id)
```
The intent was reasonable. The problem was that this registry was not protected by any locking primitive. In Python's asyncio event loop, the membership check, the registry write, and the awaited compliance call did not execute atomically: any await point inside the block hands control back to the event loop and lets other coroutines interleave. Under low concurrency (one or two simultaneous requests), cooperative scheduling happened to serialize these operations closely enough that collisions never occurred in practice.
But at three or more concurrent requests, the timing windows widened. Here is what actually happened:
- Tenant A, B, and C all arrive within a 200-millisecond window.
- Tenant A's coroutine checks the registry: empty. It does not yet write its entry because the event loop yields at the next await point inside run_compliance_checks().
- Tenant B's coroutine checks the registry: also empty (A's entry was never written). It enters the compliance block.
- Tenant C's coroutine does the same.
- All three coroutines are now inside run_compliance_checks() simultaneously.
- The compliance check function itself made an external HTTP call to a third-party data residency validation API. This call was slow, averaging 800 milliseconds.
- During that 800ms await, the registry entry writes for A, B, and C were all queued, and the execution order of their pop() calls became non-deterministic.
- In roughly 30% of cases under this load pattern, the compliance check's result-writing coroutine for one tenant would read a stale registry state, assume the tenant was already processed, and skip the audit log write entirely without raising an exception.
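The non-atomic check-then-write at the heart of this failure can be reproduced deterministically in a few lines of plain asyncio. The sketch below is hypothetical (not Veridian's code): two concurrent requests for the same tenant both pass the guard, because an await point sits between the membership check and the registry write.

```python
import asyncio

provisioning_registry = {}  # shared, unlocked state
compliance_runs = []        # records every compliance execution

async def provision(tenant_id: str) -> None:
    if tenant_id not in provisioning_registry:
        # Stand-in for any await reached before the registry entry is
        # committed: the event loop hands control to the next coroutine here.
        await asyncio.sleep(0)
        provisioning_registry[tenant_id] = "in_progress"
        compliance_runs.append(tenant_id)  # run_compliance_checks() stand-in
        provisioning_registry.pop(tenant_id)

async def main() -> None:
    # Two concurrent requests for the same tenant: both pass the guard.
    await asyncio.gather(provision("tenant-a"), provision("tenant-a"))

asyncio.run(main())
print(compliance_runs)  # ['tenant-a', 'tenant-a'] -- the guard was bypassed
```

With only one request in flight, the sleep changes nothing and the guard works, which is exactly why low-concurrency tests never caught it.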
The activation step (step 5) had no dependency on the compliance log write. It only checked whether the compliance check function had been called, not whether the audit log entry had been successfully committed. So tenants were activated regardless, with no visible error to the end user or the operations team.
How They Found It: The Audit Trail Discrepancy
The discovery was almost accidental. Veridian's lead platform engineer, preparing compliance documentation for their upcoming enterprise audit, wrote a one-off SQL query joining the tenant activation table against the compliance audit log on tenant_id. The query returned a left-join gap: activated tenants with no corresponding compliance log entry.
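The gap query itself needs little more than a left join with a NULL filter. The snippet below demonstrates the shape with sqlite3 and toy data; the table and column names (activated_tenants, compliance_audit_log, tenant_id) are illustrative, not Veridian's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE activated_tenants (tenant_id TEXT PRIMARY KEY);
    CREATE TABLE compliance_audit_log (tenant_id TEXT, checked_at TEXT);
    INSERT INTO activated_tenants VALUES ('t-1'), ('t-2'), ('t-3');
    INSERT INTO compliance_audit_log VALUES ('t-1', '2026-01-10'),
                                            ('t-3', '2026-01-11');
""")

# Activated tenants with no corresponding compliance log entry.
gaps = conn.execute("""
    SELECT a.tenant_id
    FROM activated_tenants AS a
    LEFT JOIN compliance_audit_log AS c ON c.tenant_id = a.tenant_id
    WHERE c.tenant_id IS NULL
""").fetchall()

print(gaps)  # [('t-2',)] -- the activated tenant with a missing audit entry
```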
The initial assumption was data loss in the audit store. The team spent the first day ruling that out. The audit store (an append-only event log backed by Amazon DynamoDB with point-in-time recovery enabled) showed no gaps, no throttled writes, and no failed batch operations in its own internal metrics. The records were simply never written.
On day two, they added verbose structured logging to a staging environment reproduction of the pipeline and began load-testing with simulated concurrent provisioning bursts. The pattern emerged immediately: the compliance log skip only occurred when three or more requests hit the pipeline within a 150 to 250 millisecond window. One or two concurrent requests: clean every time. Three or more: reproducible failure rate of approximately 28 to 34%.
They had their culprit.
Why This Is More Common Than Teams Realize in 2026
Veridian's situation is not unique. As AI agent pipelines have become a standard feature of enterprise SaaS platforms in 2026, many engineering teams are adapting patterns originally designed for synchronous, single-tenant workflows into high-concurrency, multi-tenant environments. The gap between those two worlds is where race conditions live.
Several structural factors make this class of bug particularly dangerous:
- Async-first frameworks hide cooperative scheduling hazards. Python's asyncio, JavaScript's event loop, and Go's goroutines all make concurrency easy to write but non-obvious to reason about at the point of shared state mutation.
- Compliance steps are often bolted on, not built in. In many pipelines, compliance checks were added after the core provisioning flow was already designed. They sit at the edge of the pipeline rather than being structurally load-bearing, which means their failure modes are less likely to block downstream steps.
- Test suites rarely simulate realistic concurrency. Most integration tests provision tenants one at a time or in a controlled sequence. Burst concurrency scenarios require deliberate load-test design that many teams deprioritize until something breaks.
- AI agent provisioning introduces new latency variables. The external API calls required to instantiate, configure, and validate AI agents are significantly slower than traditional SaaS provisioning steps. Longer await windows mean wider race condition exposure.
The Fix: Queue-Backed Serialization for Compliance Steps
Veridian's engineering team evaluated three potential fixes before settling on their final approach:
Option 1: Database-Level Advisory Locks
They considered using Postgres advisory locks keyed on tenant ID to serialize the compliance check execution. This would have worked, but it introduced database connection pressure at the exact moment the system was already under provisioning load. It also created a tight coupling between the provisioning service and the database that the team wanted to avoid for scalability reasons. Ruled out.
Option 2: Redis-Based Distributed Lock (Redlock)
A Redis-backed distributed lock using the Redlock algorithm was a strong candidate. It would have provided atomic check-and-set semantics across all service instances. The concern was operational: their Redis cluster was shared infrastructure, and a lock contention spike during a provisioning burst could create cascading latency across unrelated services. Ruled out for the compliance pipeline specifically, though noted as the right tool for other use cases.
Option 3: Queue-Backed Serialization (The Winner)
The solution they implemented was architectural rather than purely mechanical. They introduced a dedicated compliance check queue using Amazon SQS with a FIFO queue configuration and message deduplication enabled. The provisioning pipeline was refactored so that after agent instantiation (step 2), a compliance check job was enqueued rather than executed inline. A separate, single-threaded compliance worker service consumed from this queue and processed one tenant at a time.
The key design decisions were:
- FIFO queue with content-based deduplication: SQS FIFO queues guarantee exactly-once processing within a deduplication window, eliminating the possibility of a tenant being processed twice or skipped due to duplicate message delivery.
- Tenant activation decoupled from compliance completion: Rather than activating tenants immediately after the compliance check, the worker emitted a compliance.completed event to an internal event bus. The activation service subscribed to this event. No activation could occur without a confirmed compliance log write.
- Dead letter queue (DLQ) with alerting: Any compliance job that failed after three retry attempts was routed to a DLQ and triggered a PagerDuty alert. This made compliance failures loud and visible rather than silent.
- Idempotent compliance workers: The worker was designed to be fully idempotent. If a tenant's compliance check had already been logged (checked via a conditional write to DynamoDB using a condition expression), the worker would acknowledge the message and exit cleanly without duplicate writes.
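The idempotency guarantee hinges on an atomic "write only if absent" primitive. The sketch below simulates DynamoDB's attribute_not_exists condition expression with a plain dict; it is a hypothetical local stand-in, where a real worker would use boto3's put_item with a ConditionExpression. A second delivery of the same job is acknowledged cleanly rather than logged twice.

```python
class ConditionalWriteFailed(Exception):
    """Raised when the condition (item absent) does not hold."""

class AuditStore:
    """In-memory stand-in for an audit table with conditional writes."""
    def __init__(self) -> None:
        self._items: dict = {}

    def put_if_absent(self, tenant_id: str, record: dict) -> None:
        # Mirrors put_item(..., ConditionExpression="attribute_not_exists(tenant_id)")
        if tenant_id in self._items:
            raise ConditionalWriteFailed(tenant_id)
        self._items[tenant_id] = record

def process_compliance_job(store: AuditStore, tenant_id: str) -> str:
    try:
        store.put_if_absent(tenant_id, {"status": "passed"})
        return "written"
    except ConditionalWriteFailed:
        # Already logged: ack the message and exit without a duplicate write.
        return "already_logged"

store = AuditStore()
print(process_compliance_job(store, "t-42"))  # written
print(process_compliance_job(store, "t-42"))  # already_logged (duplicate delivery)
```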
The refactored pipeline flow looked like this:
- Tenant record creation (unchanged)
- AI agent instantiation (unchanged)
- Compliance check job enqueued to SQS FIFO queue
- Compliance worker dequeues job, runs checks, writes audit log entry
- Worker emits compliance.completed event
- Activation service receives event, activates tenant, sends welcome email
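The refactored flow can be sketched end to end with an in-process queue standing in for SQS FIFO. All names here are illustrative, and the event-bus hop is collapsed into a direct call for brevity: producers enqueue compliance jobs, a single worker drains them one at a time, and activation happens only after the audit entry is committed.

```python
import asyncio

audit_log: list = []
activated: list = []

async def compliance_worker(queue: asyncio.Queue) -> None:
    # Single consumer: jobs are processed strictly one at a time.
    while True:
        tenant_id = await queue.get()
        if tenant_id is None:          # shutdown sentinel
            queue.task_done()
            break
        audit_log.append(tenant_id)    # audit write happens first...
        activated.append(tenant_id)    # ...activation only after it commits
        queue.task_done()

async def main() -> None:
    queue: asyncio.Queue = asyncio.Queue()
    worker = asyncio.create_task(compliance_worker(queue))
    # Burst of provisioning requests: the queue serializes them all.
    for tenant_id in ("t-1", "t-2", "t-3"):
        await queue.put(tenant_id)
    await queue.put(None)
    await worker

asyncio.run(main())
print(audit_log, activated)  # every activated tenant has an audit entry
```

Because one worker owns the queue, the race condition is eliminated structurally rather than by careful locking inside the pipeline.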
The Results: Before and After
After deploying the fix to production and running it through a two-week validation period with synthetic load testing, the results were unambiguous:
- Compliance log completeness: 100% of provisioned tenants had corresponding audit log entries, including under simulated burst loads of up to 50 concurrent provisioning requests.
- Provisioning latency impact: Average end-to-end tenant activation time increased by approximately 4 seconds due to the async queue hop. This was deemed acceptable given the compliance guarantee it provided.
- Operational visibility: The DLQ alert had fired zero times in the two weeks following deployment, confirming the pipeline's stability.
- Retroactive remediation: For the 11 tenants whose compliance records were missing, Veridian ran a manual compliance check job for each and backfilled the audit log with timestamped annotations noting the retroactive nature of the entry. Their legal and compliance team documented the root cause and remediation for the enterprise audit record.
The Enterprise Audit: What Actually Happened
The SOC 2 Type II audit proceeded as scheduled. Veridian's team disclosed the gap proactively, presenting the root cause analysis, the remediation timeline, and the architectural fix as part of their audit documentation. Rather than treating it as a finding against them, the auditors noted the proactive discovery, the quality of the post-mortem, and the robustness of the new architecture as positive evidence of a mature engineering culture.
The enterprise client, far from being alarmed, asked Veridian to present the case study internally to their own DevSecOps team as an example of responsible vulnerability disclosure and engineering remediation. The relationship deepened rather than fractured.
Key Takeaways for SaaS Engineering Teams
If your platform provisions AI agents or any compliance-sensitive resource in a multi-tenant context, here are the lessons Veridian's experience makes concrete:
- Never use in-memory state to coordinate concurrent async pipelines. In-memory registries, dictionaries, and flags are not safe for concurrency control without explicit locking. If multiple coroutines or threads can touch shared state, treat it as a race condition until proven otherwise.
- Compliance steps must be structurally load-bearing. If a downstream step (like tenant activation) can proceed without a confirmed compliance write, your architecture is implicitly treating compliance as optional. Make it impossible to activate a tenant without a completed compliance record.
- Test for concurrency explicitly. Add burst concurrency scenarios to your integration test suite. Tools like locust, k6, or simple asyncio gather patterns in pytest can simulate the conditions that expose these bugs before production does.
- Make failures loud. Silent failures are the most dangerous kind. Every compliance-adjacent operation should have an observable failure path: a DLQ, an alert, a metric, or a dashboard. If a compliance check fails and nobody is paged, the architecture is broken regardless of whether the code is correct.
- Queues are not just for throughput; they are for correctness. The instinct to reach for a queue is often framed as a scalability decision. In Veridian's case, the queue was a correctness decision. Serializing compliance work through a FIFO queue eliminated an entire class of race conditions by design.
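A burst test of the kind described above needs very little machinery. This hypothetical sketch fires fifty concurrent provisioning coroutines with asyncio.gather and asserts audit-log completeness afterwards; provision_tenant is a toy lock-protected implementation standing in for a real pipeline stage.

```python
import asyncio

async def burst_provisioning_test(n: int = 50) -> set:
    audit_log: set = set()
    lock = asyncio.Lock()  # guards the shared audit-log write

    async def provision_tenant(tenant_id: str) -> None:
        # Toy pipeline stage: the real version would hit external APIs.
        await asyncio.sleep(0)        # deliberate await point mid-pipeline
        async with lock:
            audit_log.add(tenant_id)  # compliance log write

    tenants = [f"tenant-{i}" for i in range(n)]
    # Fire all provisioning requests concurrently, as a single burst.
    await asyncio.gather(*(provision_tenant(t) for t in tenants))
    return audit_log

result = asyncio.run(burst_provisioning_test())
# Every provisioned tenant must have an audit entry: no silent skips.
assert result == {f"tenant-{i}" for i in range(50)}
```

The same harness run against an unprotected implementation is what surfaces skips like Veridian's before an auditor does.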
Conclusion: The Bugs That Don't Crash Are the Most Dangerous
Veridian Labs got lucky. Their engineer ran that SQL query two weeks before the audit, not two days after it. But luck is not an architecture. The deeper lesson of this case study is that in distributed, async, multi-tenant AI pipelines, the bugs most likely to cause regulatory and reputational damage are not the ones that throw exceptions or crash services. They are the ones that silently do less than they are supposed to, leaving no trace except an absence.
Building systems that make absence visible, that make compliance structurally mandatory rather than procedurally assumed, and that treat concurrency as a first-class design concern: that is the engineering discipline that separates teams who find their own bugs from teams whose auditors find them first.
Veridian found theirs first. Make sure your team does too.