How One B2B SaaS Team's AI Observability Stack Became the Bottleneck (And How They Fixed It With Async Telemetry Decoupling)

There is a cruel irony hiding inside many modern AI-powered SaaS platforms: the tools you build to watch your agents can slow them down more than the agents themselves. For the engineering team at Velorant (a composite case study representing a real pattern observed across multiple B2B SaaS platforms in 2026), that irony became a production crisis that took three sprints, two architecture rewrites, and one very uncomfortable all-hands to resolve.

This is the story of how a per-tenant AI agent observability stack silently accumulated latency debt, how the team diagnosed it, and how an async telemetry decoupling architecture brought response times back below the 200ms threshold their enterprise customers demanded, without sacrificing a single trace.

The Setup: Per-Tenant AI Agents at Scale

Velorant is a B2B workflow automation platform. By early 2026, their core product had evolved from a rule-based automation engine into a multi-agent orchestration layer, where each enterprise customer (tenant) received a dedicated pool of AI agents responsible for tasks like document classification, contract summarization, CRM enrichment, and anomaly detection.

The per-tenant isolation model was a deliberate product decision. Enterprise buyers, particularly in finance and healthcare verticals, demanded that their agent workloads never share compute context with other tenants. Fair enough. But this model introduced a fundamental observability challenge: if every tenant has their own agents, every tenant also needs their own trace context, their own metric namespaces, their own log streams, and their own alerting thresholds.

At 40 tenants, this was manageable. At 340 tenants, it became a latency nightmare.

The Symptom: P95 Latency Creeping Past 400ms

The first sign of trouble arrived in January 2026, buried in a weekly SLA report. Velorant's enterprise tier promised sub-200ms response times for synchronous agent invocations. The P50 (median) latency looked fine at around 130ms. But the P95 had crept to 410ms, and the P99 was touching 680ms.

The initial hypothesis was predictable: the LLM inference calls were taking longer. The team checked token throughput on their model provider's dashboard. Normal. They checked GPU queue depths on their self-hosted fine-tuned models. Normal. They checked database query plans. Normal.

It was a junior engineer on the platform reliability team, during a routine flame graph review, who spotted something odd. In the distributed trace waterfall for a typical agent invocation, the LLM call itself took 85ms. But the total span duration was 430ms. The gap was enormous, and it was not coming from where anyone expected.

It was coming from the observability instrumentation itself.

Root Cause Analysis: The Observability Tax

Velorant's observability stack had grown organically over 18 months. It was built on a combination of OpenTelemetry SDK instrumentation, a custom per-tenant span enrichment middleware, a tenant-scoped Prometheus metrics pipeline, and a structured logging layer that tagged every log line with 23 tenant-specific metadata fields.

When the team traced the latency gap, they found four compounding contributors:

1. Synchronous Span Flushing on Every Agent Step

The OpenTelemetry exporter was configured with a SimpleSpanProcessor rather than a BatchSpanProcessor. This meant every agent reasoning step (and there could be 8 to 14 steps per invocation) triggered a synchronous HTTP export to their collector endpoint. Each flush added between 12ms and 35ms of blocking time to the hot path. Across 12 steps, that alone accounted for up to 420ms of pure observability overhead.

2. Per-Tenant Metadata Enrichment in the Critical Path

The custom span enrichment middleware performed a Redis lookup on every span to hydrate tenant-specific attributes: plan tier, feature flags, data residency region, agent version, and compliance tags. The Redis call itself was fast (sub-2ms), but it was called synchronously on every span, and at high concurrency it created lock contention in the connection pool. Under load, this lookup ballooned to 40ms per span.

3. Synchronous Log Aggregation With Tenant Routing

Their structured logging pipeline routed logs to tenant-specific S3 prefixes in real time. The routing logic evaluated a rules engine on every log write, and the rules engine held a read lock on a shared configuration map. Under concurrent agent invocations across hundreds of tenants, that shared lock became a severe contention point: every log write from every tenant queued on the same configuration map.

4. Metric Cardinality Explosion Causing Collector Backpressure

Each tenant had unique label sets on their Prometheus metrics, which had ballooned the team's OpenTelemetry Collector to over 4.2 million active time series. The collector was spending significant CPU time on cardinality management, and the resulting backpressure caused the SDK-side metric export calls to time out and retry, adding unpredictable latency spikes to the agent response path.

The Architecture That Was Killing Performance

The core problem was architectural, not a matter of configuration. The team had built their observability stack on a synchronous, in-band telemetry model: every piece of observability data (spans, metrics, logs) was generated, enriched, and exported as a blocking operation within the same execution context as the agent itself. The agent could not return a response until every telemetry operation had completed.

This made intuitive sense when the system was small. You want your traces to be complete. You want your logs to be consistent. You want your metrics to reflect reality. But at scale, with per-tenant complexity layered on top, the observability stack had become a synchronous tax on every user-facing request.

Conceptually, the timeline of a single request looked like this:

  • Agent receives request (0ms)
  • Span started, metadata enrichment lookup (synchronous, 2-40ms)
  • LLM reasoning step 1 (12ms), span flushed (synchronous, 15-35ms)
  • LLM reasoning step 2 (10ms), span flushed (synchronous, 15-35ms)
  • ... repeated for 8 to 14 steps ...
  • Log written with tenant routing (synchronous, 5-20ms)
  • Metric exported (synchronous, 3-15ms)
  • Response returned to caller (total: 380ms to 680ms)

The LLM work itself represented less than 30% of the total wall-clock time. The observability stack owned the rest.

The Solution: Async Telemetry Decoupling Architecture

The team spent two weeks designing what they called the Async Telemetry Decoupling Layer (ATDL). The guiding principle was simple but non-trivial to implement correctly: no telemetry operation should ever block an agent response. But this had to be achieved without losing trace fidelity, without creating data loss risk, and without breaking the per-tenant isolation guarantees that enterprise customers depended on.

Here is how they built it.

Step 1: Replace SimpleSpanProcessor With a Buffered Async Processor

The first and highest-impact change was switching from SimpleSpanProcessor to a heavily tuned BatchSpanProcessor with an in-process ring buffer. Spans were written to a lock-free ring buffer in memory and exported by a dedicated background goroutine (the team's agents ran on a Go-based orchestration runtime). The ring buffer was sized at 50,000 spans per agent pool instance, with a configurable overflow policy (drop oldest, not newest, to preserve recency of trace data). This change alone reduced hot-path latency by 180ms in staging tests.

Step 2: Pre-Warm Tenant Metadata Into a Local Cache

The per-tenant Redis enrichment lookup was moved entirely off the critical path. Instead of looking up tenant metadata on every span, the team built a local in-process cache (a simple concurrent hash map with a 30-second TTL) that was pre-warmed at agent pool startup and refreshed asynchronously by a background subscription to a Redis keyspace notification channel. Span enrichment now read from local memory with zero network hops. The 2ms to 40ms enrichment latency collapsed to under 100 microseconds.

Step 3: Decouple Log Routing With an Async Fan-Out Queue

The synchronous log routing pipeline was replaced with an in-process async fan-out queue. Log writes from agent code now appended to a per-tenant channel (a buffered Go channel with a capacity of 10,000 entries). A pool of background workers consumed from these channels and handled the tenant routing and S3 write logic entirely out of band. The agent's log write call became a non-blocking channel send, completing in nanoseconds. The configuration map lock contention disappeared entirely because the routing logic now ran only in the background worker goroutines, not in the agent hot path.

Step 4: Resolve Metric Cardinality With Tenant-Scoped Aggregation Before Export

Rather than pushing raw per-tenant metrics with high-cardinality label sets directly to the collector, the team introduced a local pre-aggregation layer. Each agent pool instance maintained in-memory metric accumulators per tenant, using a fixed-cardinality label schema (tenant ID as a single opaque label, with detailed breakdowns deferred to log-based analytics). Pre-aggregated summaries were pushed to the collector every 15 seconds. Active time series in the collector dropped from 4.2 million to under 180,000, eliminating backpressure entirely.
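A sketch of such a fixed-cardinality accumulator (illustrative; a real implementation would also track histogram buckets and push to the collector on a 15-second ticker rather than requiring a manual flush):

```go
package main

import (
	"fmt"
	"sync"
)

// aggKey is the fixed-cardinality label schema: tenant ID as a single
// opaque label plus a coarse metric name. Detailed breakdowns are
// deferred to log-based analytics instead of metric labels.
type aggKey struct {
	TenantID string
	Metric   string
}

type summary struct {
	Count int64
	Sum   float64
}

// preAggregator accumulates per-tenant metrics in memory and exposes
// a Flush that an export loop can call on an interval.
type preAggregator struct {
	mu   sync.Mutex
	accs map[aggKey]summary
}

func newPreAggregator() *preAggregator {
	return &preAggregator{accs: make(map[aggKey]summary)}
}

// Record is the hot-path call: a map update under a short lock,
// with no per-request export and no high-cardinality label sets.
func (p *preAggregator) Record(tenantID, metric string, value float64) {
	p.mu.Lock()
	k := aggKey{TenantID: tenantID, Metric: metric}
	s := p.accs[k]
	s.Count++
	s.Sum += value
	p.accs[k] = s
	p.mu.Unlock()
}

// Flush returns the accumulated summaries and resets the state; a
// background goroutine would push these to the collector every 15s.
func (p *preAggregator) Flush() map[aggKey]summary {
	p.mu.Lock()
	out := p.accs
	p.accs = make(map[aggKey]summary)
	p.mu.Unlock()
	return out
}

func main() {
	p := newPreAggregator()
	p.Record("tenant-a", "agent_latency_ms", 85)
	p.Record("tenant-a", "agent_latency_ms", 91)
	for k, s := range p.Flush() {
		fmt.Println(k.TenantID, k.Metric, s.Count, s.Sum) // tenant-a agent_latency_ms 2 176
	}
}
```

The cardinality win comes from the key type itself: however many requests a tenant makes, it can only ever produce one series per metric name, so active series scale with tenant count times metric count instead of with every label combination.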

Step 5: Introduce a Trace Fidelity Guarantee via a Write-Ahead Telemetry Log

The team's biggest concern with async telemetry was data loss. If an agent pool instance crashed between generating a span and flushing it, those spans would be gone, making incident diagnosis harder. To address this, they introduced a lightweight write-ahead telemetry log (WATL): a local append-only file to which spans and structured log entries were written synchronously, using O_DIRECT sequential appends (bypassing the OS page cache) that kept each write under 500 microseconds. A separate recovery process monitored this file and replayed any unflushed entries to the collector after a crash. This gave them crash-safe trace fidelity without blocking the agent response path.

The Results: Back Below 200ms, With Better Data

The ATDL was rolled out progressively across tenant pools over a two-week canary period in February 2026. The results were dramatic and immediate:

  • P50 latency: 130ms to 98ms (25% improvement)
  • P95 latency: 410ms to 162ms (60% improvement)
  • P99 latency: 680ms to 194ms (71% improvement)
  • Trace completeness: 99.97% (up from 99.81%, because the WATL eliminated the silent drops that had been occurring during collector backpressure events)
  • Collector CPU utilization: Down 67% due to cardinality reduction
  • Observability infrastructure cost: Down 41% due to reduced collector fleet sizing

Critically, the team did not sacrifice any observability capability. Every span was still captured. Every log line was still routed. Every metric was still recorded. The only thing that changed was when and how those operations happened relative to the user-facing response.

The Broader Lesson: Observability Is Not Free, and at Scale It Is Expensive

The Velorant case illustrates a pattern that is becoming increasingly common in 2026 as AI agent workloads mature in production environments. The observability practices that served teams well during the prototype and early-scale phases (synchronous, simple, comprehensive) become liabilities at multi-tenant scale.

Several principles emerge from this case that apply broadly to any team running AI agents in a multi-tenant B2B context:

  • Measure your observability overhead explicitly. Add a dedicated "telemetry cost" span to your traces. If you do not measure it, you will not see it accumulating.
  • Treat the hot path as sacred. Nothing that is not strictly necessary for producing the user-facing response should execute synchronously in the critical path. Telemetry, by definition, is never strictly necessary for the response.
  • Per-tenant isolation does not require per-tenant synchronous overhead. Isolation can be enforced at the data layer (routing, storage, access control) without requiring the isolation logic to execute in the request hot path.
  • Cardinality is a first-class performance concern. In multi-tenant systems, metric cardinality grows with tenant count by default. Left unmanaged, it will eventually create backpressure that propagates into unexpected places.
  • Async does not mean unreliable. With a write-ahead log or equivalent durability mechanism, async telemetry can be as reliable as synchronous telemetry, with a fraction of the latency impact.

What This Means for AI Agent Platform Teams in 2026

As AI agent orchestration becomes the dominant architectural pattern for enterprise SaaS products, the observability challenge will only intensify. Agents are inherently more complex to observe than traditional request-response services: they have multi-step reasoning chains, non-deterministic execution paths, tool call graphs, and memory state that all need to be traced faithfully.

The temptation is to instrument everything, synchronously, to ensure nothing is missed. That temptation is understandable and, at small scale, harmless. But at the scale of hundreds of tenants running thousands of concurrent agent invocations, synchronous observability becomes a performance anti-pattern that directly degrades the product experience you are trying to observe.

The async telemetry decoupling architecture is not a silver bullet, and it introduces genuine engineering complexity: buffer sizing, overflow policies, crash recovery, background worker lifecycle management. But for teams whose observability stack has grown to the point where it is measurably impacting user-facing latency, it is the correct architectural direction.

Conclusion

Velorant's engineering team discovered something counterintuitive and important: you can build a world-class AI agent platform and then accidentally make it slow by watching it too carefully, in the wrong way. The fix was not to observe less. It was to observe smarter, by decoupling the act of collecting telemetry from the act of delivering value to the user.

If your team is running per-tenant AI agents in production and your P95 latency is trending in the wrong direction, the first place to look might not be your models or your infrastructure. It might be your observability stack. Pull up a flame graph. Find your telemetry spans. And ask yourself: is your monitoring watching your agents, or is it holding them back?

The answer might surprise you.