How Backend Engineers Should Redesign Per-Tenant AI Agent Observability Pipelines Over the Next 12 Months
There is a quiet crisis unfolding inside the infrastructure teams of nearly every SaaS company that has shipped an AI-powered product in the last two years. The crisis is not a model accuracy problem. It is not a latency problem, exactly. It is a visibility problem, and it is getting worse fast.
Traditional distributed tracing was designed around a beautifully simple contract: a request enters, it fans out across services, it produces a deterministic tree of spans, and it exits. Trace IDs propagate through HTTP headers. Spans have start times and end times. The story has a beginning and an end.
Agentic AI workflows have shredded that contract entirely.
In 2026, backend engineers operating multi-tenant SaaS platforms are watching their observability stacks buckle under the weight of AI agents that loop, self-delegate, spawn sub-agents, call tools asynchronously, pause for human-in-the-loop approval, and resume hours or even days later. A single tenant's agent run can produce thousands of non-linear spans that no Jaeger UI or Honeycomb waterfall view was ever designed to render meaningfully. And because these runs belong to specific paying customers, the stakes around per-tenant isolation, cost attribution, and audit compliance are extremely high.
This post is a forward-looking engineering guide. Over the next 12 months, here is how backend engineers should deliberately redesign their per-tenant AI agent observability pipelines to match the actual shape of agentic complexity in 2026 and beyond.
Why Traditional Distributed Tracing Assumptions Are Breaking Down
Before prescribing solutions, it is worth being precise about the failure modes. Traditional distributed tracing, as codified in the OpenTelemetry specification and popularized by tools like Jaeger, Zipkin, Tempo, and Honeycomb, rests on several assumptions that agentic systems violate systematically:
- Assumption 1: Traces are bounded in time. A trace has a start and an end within a reasonable window, typically seconds to minutes. Agentic workflows routinely span hours or days, especially when they include approval gates, scheduled retries, or long-running tool executions like browser automation or code compilation.
- Assumption 2: Causality is a tree. Spans have one parent and potentially many children, forming a directed acyclic graph at worst. Agentic systems produce genuine cycles: a planning agent delegates to a sub-agent, which calls back to the planner for clarification, creating feedback loops that break DAG assumptions in most trace storage backends.
- Assumption 3: Spans are cheap and uniform. A span represents a discrete unit of work. But an LLM inference call is not uniform. It carries token counts, prompt hashes, model versions, temperature settings, tool call sequences, and intermediate reasoning traces that are orders of magnitude richer than a typical database query span.
- Assumption 4: Tenant identity is a tag, not a structural concern. In most tracing setups, tenant ID is just another attribute on a span. In a multi-tenant agentic platform, tenant identity must be a first-class structural dimension that governs data routing, retention, cost accounting, and access control at the pipeline level, not as an afterthought in a query filter.
- Assumption 5: Observability is for operators, not customers. Distributed tracing was built for SREs debugging production incidents. But in agentic SaaS products, tenants increasingly demand real-time visibility into what their agents are doing, why they made specific decisions, and how much compute budget they consumed. Operator-only observability is no longer sufficient.
These are not minor gaps to patch with a custom attribute here and a longer trace timeout there. They represent a fundamental architectural mismatch that demands a rethought pipeline.
The Core Architectural Shift: From Trace Trees to Agent Execution Graphs
The single most important conceptual shift backend engineers need to make in 2026 is to stop modeling agent runs as traces and start modeling them as Agent Execution Graphs (AEGs). This is not just a naming change; it is a storage, query, and rendering paradigm change.
An Agent Execution Graph treats each agent run as a labeled directed graph where:
- Nodes represent discrete agent actions: LLM inference calls, tool invocations, memory reads and writes, human-in-the-loop checkpoints, sub-agent spawns, and state transitions.
- Edges represent causal or temporal relationships between actions, including back-edges for loops and retry cycles.
- Node metadata carries rich semantic payload: token usage, model identity, tool input/output hashes, confidence scores, latency breakdowns, and cost estimates.
- Graph-level metadata carries tenant identity, workflow template version, triggering context, and SLA classification.
This model maps naturally to a graph database or a columnar store with adjacency list encoding, neither of which is a traditional trace backend. Teams should plan to introduce a purpose-built agent execution store alongside (not replacing) their existing distributed tracing infrastructure. The two systems serve different audiences: tracing serves SRE incident response, while the AEG store serves agent debugging, tenant reporting, and compliance auditing.
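To make the AEG model concrete, here is a minimal in-memory sketch. The class and field names (`AEGNode`, `AEGEdge`, `AgentExecutionGraph`) are illustrative assumptions, not a standard schema; the key idea is that cycle-closing edges are stored explicitly as back-edges so traversals stay bounded.

```python
# Illustrative AEG data model. Names and fields are assumptions,
# not a published schema.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class AEGNode:
    node_id: str
    kind: str                  # "llm_call", "tool_call", "memory_read", "hitl_checkpoint", ...
    tenant_id: str
    attrs: dict = field(default_factory=dict)  # tokens, model, cost, latency, ...

@dataclass(frozen=True)
class AEGEdge:
    src: str
    dst: str
    kind: str = "causal"       # "causal", "temporal", or "back" for loop edges

@dataclass
class AgentExecutionGraph:
    run_id: str
    tenant_id: str
    nodes: dict = field(default_factory=dict)   # node_id -> AEGNode
    edges: list = field(default_factory=list)

    def add_edge(self, src: str, dst: str, kind: str = "causal") -> AEGEdge:
        # If this edge would close a cycle, record it explicitly as a
        # back-edge so traversal queries can skip it instead of looping.
        if kind == "causal" and self._reaches(dst, src):
            kind = "back"
        edge = AEGEdge(src, dst, kind)
        self.edges.append(edge)
        return edge

    def _reaches(self, start: str, target: str) -> bool:
        # Depth-first search over forward edges only.
        stack, seen = [start], set()
        while stack:
            n = stack.pop()
            if n == target:
                return True
            if n in seen:
                continue
            seen.add(n)
            stack.extend(e.dst for e in self.edges if e.src == n and e.kind != "back")
        return False
```

A planner delegating to a sub-agent that calls back for clarification produces one causal edge and one back-edge, which is exactly the structure a trace tree cannot represent.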
Practical Storage Recommendations for 2026
For teams evaluating storage backends for AEG data, the tradeoffs in 2026 look roughly like this:
- Apache Parquet on object storage (S3/GCS) with DuckDB or Apache Arrow Flight for query: Best for cost-sensitive teams with high write volume and analytical query patterns. Tenant partitioning is straightforward via directory structure. Latency for real-time queries is acceptable if materialized views are maintained.
- ClickHouse with adjacency list tables: Excellent for high-cardinality tenant workloads where you need sub-second query performance on graph traversals. ClickHouse's columnar compression handles the repetitive schema of agent node records extremely efficiently.
- Neo4j or Amazon Neptune for deep graph analytics: Justified only if your platform needs multi-hop graph queries, such as "show me all agent runs across all tenants where a sub-agent spawned by a planning agent triggered a tool call that exceeded $0.50 in cost." For most teams, the operational overhead is not worth it.
- OpenTelemetry-compatible backends (Grafana Tempo, Honeycomb) with semantic conventions extensions: Viable as a bridge strategy while purpose-built infrastructure matures, but expect significant limitations in rendering non-linear execution paths and in enforcing per-tenant data boundaries at the storage layer.
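For the Parquet-on-object-storage option, tenant-first partitioning is the detail that does the heavy lifting. A sketch of a Hive-style key layout (bucket structure and field names are assumptions):

```python
# Hive-style partition key for AEG node records on object storage.
# The "aeg/" prefix and field names are illustrative assumptions.
from datetime import date

def aeg_partition_key(tenant_id: str, run_date: date, run_id: str) -> str:
    # Partitioning by tenant first makes GDPR deletion a prefix delete
    # and lets per-tenant retention map onto bucket lifecycle rules.
    return (
        f"aeg/tenant_id={tenant_id}"
        f"/dt={run_date.isoformat()}"
        f"/run_id={run_id}/nodes.parquet"
    )
```

Query engines such as DuckDB can then read a single tenant's data with a glob over that tenant's prefix, never touching other tenants' partitions.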
Redesigning the Pipeline: A 12-Month Phased Roadmap
Rather than prescribing a single big-bang migration, here is a phased approach that backend engineering teams can realistically execute across the next 12 months while continuing to ship product.
Phase 1 (Months 1 to 3): Instrument Correctly at the Source
Most teams have instrumentation that was bolted on after the fact, capturing only surface-level signals. The first phase is about getting the instrumentation right before worrying about the pipeline behind it.
Adopt the OpenTelemetry GenAI semantic conventions as your baseline. The OTel GenAI working group has published semantic conventions for LLM spans that cover model name, token counts, prompt/completion content, and tool call sequences. These are now mature enough to use as a foundation in 2026. Do not invent proprietary attribute schemas if you can avoid it; standardization will pay dividends when tooling catches up.
Introduce Agent Span Kind as a first-class concept. The standard OTel span kinds (client, server, producer, consumer, internal) are a closed enum, so in practice this means a dedicated agent.step attribute namespace rather than a literal new kind: every agent step span should carry step type (plan, act, observe, reflect), parent step ID, sub-agent ID if applicable, and a stable workflow template ID. This allows downstream consumers to reconstruct the AEG from span data without requiring a schema migration.
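A small helper makes the attribute contract explicit. The `agent.*` keys here are an illustrative in-house convention, not part of the OTel GenAI semantic conventions:

```python
# Build the attribute payload for an agent step span.
# The "agent.*" keys are an assumed in-house convention, not OTel standard.
def agent_step_attributes(step_type, step_id, workflow_template_id,
                          parent_step_id=None, sub_agent_id=None):
    assert step_type in {"plan", "act", "observe", "reflect"}
    attrs = {
        "agent.step.type": step_type,
        "agent.step.id": step_id,
        "agent.workflow.template_id": workflow_template_id,
    }
    if parent_step_id is not None:
        attrs["agent.step.parent_id"] = parent_step_id
    if sub_agent_id is not None:
        attrs["agent.sub_agent.id"] = sub_agent_id
    return attrs
```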
Inject tenant context at the agent runtime level, not the HTTP middleware level. In multi-tenant agentic platforms, tenant context can be lost when agents execute asynchronously, resume from a queue, or spawn sub-processes. Implement a TenantContext propagator that serializes tenant ID into every async job payload, every queue message, and every sub-agent invocation. Treat this as non-negotiable infrastructure, equivalent to how you treat authentication tokens.
Capture cost signals inline. Every LLM inference span should carry a computed cost estimate based on token counts and the model's current pricing. Do not leave cost attribution to a post-processing step. Inline cost signals enable real-time budget enforcement and per-tenant cost rollups without requiring a separate billing pipeline join.
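Inline cost computation is simple enough to sketch directly. The model names and per-1K-token prices below are hypothetical placeholders, not real vendor pricing:

```python
# Compute a cost estimate at span emission time. Prices are hypothetical
# placeholders (USD per 1K tokens), not real vendor pricing.
PRICES_PER_1K = {
    "big-model":   {"input": 0.0025, "output": 0.0100},
    "small-model": {"input": 0.0002, "output": 0.0008},
}

def cost_estimate_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    p = PRICES_PER_1K[model]
    cost = (input_tokens / 1000) * p["input"] + (output_tokens / 1000) * p["output"]
    # Attach the rounded value as a cost.estimate_usd span attribute.
    return round(cost, 6)
```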
Phase 2 (Months 3 to 6): Build the Per-Tenant Routing Layer
Once instrumentation is clean, the next challenge is routing observability data correctly. This is where most teams hit a wall: their existing observability pipeline is a single stream that ingests everything into a shared backend. Per-tenant isolation requires a fundamentally different routing architecture.
Introduce an Observability Router service. This is a lightweight service (or a configuration layer on top of an existing OpenTelemetry Collector deployment) that sits between your agent runtimes and your storage backends. Its job is to inspect the tenant ID on each span or AEG node event and route it to the correct tenant-scoped storage partition, retention policy, and export destination.
The router should support:
- Tenant-scoped sampling policies: Enterprise tenants may require 100% trace retention for compliance. Free-tier tenants may be sampled at 10%. Sampling decisions must be made per-tenant, not globally.
- Tenant-scoped export destinations: Some enterprise customers will contractually require that their agent execution data be exported to their own cloud storage bucket or SIEM. The router must support pluggable export destinations keyed by tenant ID.
- Tenant-scoped retention enforcement: GDPR, SOC 2, and emerging AI-specific data regulations in the EU and US require that tenant data be deletable on request. A routing layer that partitions data by tenant from the start makes deletion a partition drop rather than a full-table scan.
- Backpressure and tenant fairness: A single high-volume tenant should not be able to saturate the observability pipeline and degrade visibility for other tenants. Implement per-tenant rate limiting with graceful degradation (drop low-priority debug spans before dropping cost or error spans).
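The sampling and load-shedding policies above can be sketched as a pair of pure functions. Tier names, rates, and span-kind priorities are illustrative assumptions:

```python
# Per-tenant sampling and backpressure policy for the Observability Router.
# Tier names, percentages, and priority order are illustrative assumptions.
import hashlib

POLICIES = {
    "enterprise": {"sample_pct": 100},  # contractual 100% retention
    "free":       {"sample_pct": 10},
}
# Under backpressure, drop low-priority span kinds first (leftmost first).
DROP_ORDER = ["debug", "info", "cost", "error"]

def keep_span(tenant_tier: str, trace_id: str) -> bool:
    """Deterministic head sampling: a trace gets the same decision everywhere."""
    pct = POLICIES.get(tenant_tier, POLICIES["free"])["sample_pct"]
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16) % 100
    return bucket < pct

def shed(spans: list, budget: int) -> list:
    """Keep at most `budget` spans, shedding the lowest-priority kinds first."""
    spans = sorted(spans, key=lambda s: DROP_ORDER.index(s["kind"]), reverse=True)
    return spans[:budget]
```

Hashing the trace ID rather than flipping a coin keeps sampling consistent across services, so a sampled trace is never half-missing.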
Use OpenTelemetry Collector's routing processor as a starting point, but plan to extend it. The built-in routing processor supports attribute-based routing, which is sufficient for simple tenant partitioning. For more complex policies (tiered retention, conditional exports, cost-based sampling), you will likely need a custom processor or a sidecar service.
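As a starting point, attribute-based tenant routing in the Collector looks roughly like this. The exporter names are placeholders, and the exact config schema varies by collector version, so verify against the routing processor's README before using:

```yaml
# Sketch: route spans by a tenant.id resource attribute.
# Exporter names are placeholders; the schema depends on collector version.
processors:
  routing:
    from_attribute: tenant.id
    attribute_source: resource
    default_exporters: [otlp/shared]
    table:
      - value: tenant-acme        # enterprise tenant with contractual export
        exporters: [otlp/acme-dedicated]
      - value: tenant-globex
        exporters: [otlp/globex-dedicated]

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [routing]
      exporters: [otlp/shared, otlp/acme-dedicated, otlp/globex-dedicated]
```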
Phase 3 (Months 6 to 9): Reconstruct and Query Agent Execution Graphs
With clean instrumentation and a routing layer in place, phase three focuses on making the data useful: reconstructing AEGs from span streams and building the query interfaces that engineering and product teams actually need.
Build an AEG Reconstruction Service. This is a stream processing job (Flink, Spark Streaming, or a lighter-weight option like Bytewax or Pathway if your volumes are modest) that consumes the span stream and materializes Agent Execution Graphs in your chosen storage backend. The reconstruction logic needs to handle:
- Out-of-order span arrival (common in async agent workflows where sub-agents complete before their parent span is closed)
- Long-running graph assembly (a graph for a multi-day agent run needs to be queryable incrementally, not just when the run completes)
- Cycle detection and representation (back-edges must be stored explicitly rather than causing infinite loops in graph traversal queries)
- Orphan span reconciliation (spans that arrive after a graph is considered "closed" due to a timeout must be reattached or flagged)
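The out-of-order and orphan cases above can be handled with a small stateful assembler. The span envelope shape (`span_id`, `parent_id`) is an illustrative assumption:

```python
# Incremental AEG assembly tolerating out-of-order span arrival.
# The span dict shape is an illustrative assumption.
class GraphAssembler:
    def __init__(self):
        self.nodes = {}        # span_id -> span
        self.children = {}     # parent_id -> [child span_ids]
        self.pending = {}      # parent_id not yet seen -> [child span_ids]
        self.closed = False
        self.orphans = []      # spans arriving after the graph closed

    def ingest(self, span: dict):
        if self.closed:
            # Flag for reconciliation instead of silently dropping.
            self.orphans.append(span)
            return
        sid, pid = span["span_id"], span.get("parent_id")
        self.nodes[sid] = span
        if pid is not None:
            if pid in self.nodes:
                self.children.setdefault(pid, []).append(sid)
            else:
                # Parent hasn't arrived yet: park the edge until it does.
                self.pending.setdefault(pid, []).append(sid)
        # Resolve any children that were waiting on this span.
        for child in self.pending.pop(sid, []):
            self.children.setdefault(sid, []).append(child)
```

A real stream job would add watermarks and state checkpointing, but the core invariant is the same: edges are parked, never dropped, until both endpoints exist.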
Define a standard AEG query API for internal consumers. Product teams, ML engineers, and customer success managers all need to query agent execution data, but they have very different query patterns. Define a GraphQL or REST API over your AEG store that exposes common query patterns as first-class operations: fetch the full execution graph for a run, summarize token and cost usage for a tenant over a time range, find all runs where a specific tool call failed, compare two runs of the same workflow template.
Phase 4 (Months 9 to 12): Expose Per-Tenant Observability to Customers
This is the phase most backend teams deprioritize, and they should not. In 2026, per-tenant observability is rapidly becoming a product differentiator, not just an internal engineering concern. Enterprise buyers of AI SaaS products are increasingly asking: "Can I see exactly what your agent did with my data, step by step, with timestamps and cost attribution?" Teams that can answer yes with a self-serve UI are closing deals that teams with only internal dashboards are losing.
Build a Tenant Observability Portal. This does not need to be a full-featured APM product. It needs to answer the questions tenants actually ask:
- What did my agent do in this specific run, and why did it make the decisions it made?
- How much did this run cost, broken down by model and tool usage?
- Did my agent encounter any errors or safety guardrail activations?
- What is my aggregate usage over this billing period?
- Can I export a full audit log of my agent's actions for compliance purposes?
Implement strict tenant data isolation at the API layer. Every query to the Tenant Observability Portal must be scoped to the authenticated tenant's data. This sounds obvious, but the failure mode is subtle: if your AEG query API was built for internal use first, it likely lacks row-level tenant isolation and relies on callers to pass the correct tenant filter. Harden this with mandatory tenant scoping at the query planner level, not as an optional parameter.
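A sketch of the discipline in question: the tenant ID comes from the authenticated session and is applied unconditionally inside the query layer, never accepted as an ordinary filter. The store shape and field names are illustrative:

```python
# Tenant scoping enforced inside the query layer, not by callers.
# Row shape and field names are illustrative assumptions.
class TenantScopedStore:
    def __init__(self, rows):
        self.rows = rows  # each row: {"tenant_id": ..., "run_id": ..., ...}

    def query_runs(self, auth_tenant_id: str, **filters):
        """auth_tenant_id comes from the authenticated session and is
        always applied; it is not an optional caller-supplied filter."""
        if not auth_tenant_id:
            raise PermissionError("unauthenticated query refused")
        return [
            r for r in self.rows
            if r["tenant_id"] == auth_tenant_id
            and all(r.get(k) == v for k, v in filters.items())
        ]
```

In a SQL-backed system the same idea lands as a mandatory predicate injected by the query planner, or as database-level row security keyed on the session's tenant.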
Consider a read-only data sharing model for compliance exports. Enterprise tenants in regulated industries (finance, healthcare, legal) will want to export their agent execution data to their own data warehouses or compliance tools. Build a standardized export format (Parquet with a published schema, or NDJSON with OTel-compatible attributes) and a tenant-triggered export mechanism. This is a table-stakes feature for enterprise sales in 2026.
The Cost Attribution Problem: Why It Deserves Its Own Section
Cost attribution in multi-tenant agentic systems is one of the most underengineered problems in the industry right now, and it is going to become a major source of pain over the next 12 months as LLM inference costs remain significant and tenants become more cost-conscious.
The naive approach is to sum token counts per tenant at the end of a billing period. This breaks down in several ways:
- Shared infrastructure costs are invisible. Vector database queries, embedding generation, tool execution compute, and orchestration overhead are not captured in token counts but are real costs that vary by tenant workload.
- Model mix varies by tenant. A tenant whose workflows preferentially invoke GPT-4-class models costs more to serve than a tenant whose workflows use smaller, cheaper models. Flat per-seat pricing increasingly fails to reflect actual cost structure.
- Retry and error costs are hidden. When an agent retries a failed tool call five times, the retry cost is real. When a guardrail fires and triggers a re-planning cycle, that re-planning cost is real. These costs must be attributed to the tenant and surfaced in observability data.
The solution is to treat cost as a first-class observability signal, not a billing artifact. Every AEG node should carry a cost.estimate_usd attribute computed at emission time. The AEG store should support cost rollup queries with the same granularity as latency and error rate queries. And the Tenant Observability Portal should surface cost breakdowns alongside performance metrics, because for tenants, cost predictability is a performance concern.
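A cost rollup that keeps retry spend visible, rather than folding it into a single success number, can be sketched as follows (node field names are illustrative):

```python
# Per-tenant cost rollup over AEG nodes, keeping retry and re-planning
# costs visible. Node field names are illustrative assumptions.
from collections import defaultdict

def cost_rollup(nodes):
    """nodes: iterable of dicts with tenant_id, model, is_retry, cost_usd."""
    out = defaultdict(lambda: {
        "total": 0.0,
        "retry": 0.0,
        "by_model": defaultdict(float),
    })
    for n in nodes:
        t = out[n["tenant_id"]]
        t["total"] += n["cost_usd"]
        t["by_model"][n["model"]] += n["cost_usd"]
        if n.get("is_retry"):
            t["retry"] += n["cost_usd"]
    return out
```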
Emerging Standards to Watch and Adopt in 2026
The observability tooling ecosystem is moving quickly to address agentic system needs, and backend engineers should track several emerging standards rather than building purely proprietary solutions:
- OpenTelemetry GenAI Semantic Conventions (stable release expected mid-2026): The OTel GenAI working group has been iterating on semantic conventions for LLM and agent spans. Adopting these conventions now, even in their pre-stable form, will reduce migration costs when the stable release lands and third-party tooling aligns to it.
- OpenInference and the Arize AI trace format: Arize's OpenInference specification has gained significant adoption as a schema for LLM observability data. While it predates the OTel GenAI conventions, it covers agent-specific concepts (retrieval spans, tool call spans, agent step spans) that OTel is still catching up to. Evaluate it as a complement to OTel, particularly for the agent step schema.
- W3C Trace Context for async agent handoffs: The W3C Trace Context specification (traceparent and tracestate headers) is the right propagation mechanism for synchronous agent calls. For async handoffs via message queues, propagate trace context in message metadata and use the W3C Baggage specification to carry tenant context and workflow correlation IDs across async boundaries.
- Model Context Protocol (MCP) observability extensions: Anthropic's Model Context Protocol has become the de facto standard for agent-to-tool communication in 2026. Watch for emerging observability extensions to MCP that would allow tool servers to emit structured telemetry that integrates with agent-level tracing. Early implementations are appearing in the ecosystem and could significantly reduce instrumentation burden for tool-heavy workflows.
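The Baggage propagation mentioned above is a simple key-value header with percent-encoded values. A minimal build/parse pair (the entry keys are illustrative):

```python
# Build and parse a W3C Baggage header to carry tenant context across
# async boundaries. Entry keys are illustrative assumptions.
from urllib.parse import quote, unquote

def build_baggage(entries: dict) -> str:
    # Values are percent-encoded per the W3C Baggage spec.
    return ",".join(f"{k}={quote(str(v), safe='')}" for k, v in entries.items())

def parse_baggage(header: str) -> dict:
    out = {}
    for member in header.split(","):
        key, _, value = member.strip().partition("=")
        out[key] = unquote(value)
    return out
```

In practice you would use your OTel SDK's baggage API rather than hand-rolling this, but the wire format is worth understanding when debugging queue payloads.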
The Organizational Dimension: Who Owns Agent Observability?
A final prediction that backend engineers should internalize: over the next 12 months, the ownership of agent observability will become a source of genuine organizational tension at most AI-first companies. Platform engineering teams own the observability infrastructure. ML engineering teams own the model and agent behavior. Product teams own the customer-facing reporting. None of these teams, acting alone, can build what is actually needed.
The teams that get this right will establish a dedicated Agent Reliability Engineering (ARE) function, analogous to Site Reliability Engineering but scoped to the reliability, cost efficiency, and behavioral correctness of AI agent systems. ARE teams sit at the intersection of platform engineering, ML ops, and product, and they own the observability pipeline end to end: from instrumentation standards to tenant-facing dashboards.
This is not a staffing luxury. As agentic workflows become the core value delivery mechanism of AI SaaS products, the ability to understand, debug, and account for agent behavior at the per-tenant level is as foundational as the ability to monitor uptime and latency. Teams that treat it as such will build more reliable products, close more enterprise deals, and debug production incidents in minutes rather than days.
Conclusion: The Next 12 Months Are a Narrow Window
The complexity of agentic workflows in 2026 is already outpacing the observability infrastructure that most teams built in 2024 and 2025. The gap will widen. The teams that act now, moving deliberately through instrumentation reform, per-tenant routing, AEG reconstruction, and customer-facing observability, will find themselves with a durable competitive advantage in both product quality and enterprise sales.
The teams that wait, hoping that existing tracing tools will grow to accommodate agentic complexity, will find themselves debugging five-day agent runs with tools designed for five-second API calls. That is not a debugging problem. It is a business problem.
The next 12 months are the window to get this right. The architectural decisions you make in this period will define your platform's observability posture for the next five years. Design for the agent execution graph. Build for per-tenant isolation from the start. Treat cost as a first-class signal. And build the customer-facing visibility that enterprise buyers are already asking for.
The distributed tracing playbook served us well for a decade. It is time to write the next one.