OpenTelemetry GenAI Conventions Are Now Stable: Why Enterprise Backend Teams Must Redesign Their AI Agent Observability Pipelines Before Cost Allocation Breaks in Production

There is a quiet crisis building inside enterprise AI platforms right now. Most backend teams do not know it yet because it has not exploded in production. But the fuse was lit the moment OpenTelemetry's Semantic Conventions for Generative AI moved from experimental status to stable. If your observability pipeline was instrumented against the experimental spec, you are now running on borrowed time. And if your multi-tenant SaaS product uses LLM token consumption as the basis for cross-tenant cost allocation, the clock is ticking faster than you think.

This is not a minor version bump story. This is a structural reckoning for how enterprise backend teams instrument, collect, attribute, and bill for AI agent workloads. In this deep dive, we will cover exactly what changed in the stable GenAI semantic conventions, why span attribution is the silent killer of accurate cost allocation, and what a production-ready observability pipeline redesign looks like in 2026.

The Backstory: How We Got Here

OpenTelemetry's Semantic Conventions for Generative AI began as an experimental working group effort in 2023, driven by the explosion of LLM integrations across the industry. The initial experimental attributes like llm.vendor, llm.request.model, and llm.usage.prompt_tokens were community-contributed, loosely coordinated, and intentionally unstable. The message from the OTel maintainers was clear: use these at your own risk, they will change.

Most enterprise teams heard that message and ignored it anyway. The business pressure to ship AI features was simply too great to wait for stability. Teams instrumented their LangChain pipelines, their custom agent loops, their OpenAI and Anthropic gateway wrappers, and their vector search middleware using whatever attribute names were available at the time. Observability backends like Datadog, Honeycomb, Dynatrace, and Grafana Cloud built dashboards around those experimental attribute names. Cost allocation queries in ClickHouse or BigQuery were written against those column names.

Then, in late 2025 and carrying into early 2026, the OTel GenAI SIG (Special Interest Group) finalized and promoted the semantic conventions to stable status. The attribute namespace shifted from the loosely structured experimental schema to a formalized, versioned, and breaking-change-protected schema under the gen_ai.* namespace. Attributes were renamed, restructured, and in some cases split into separate spans entirely.

The result: every pipeline built on experimental attributes is now silently emitting spans that either do not match your dashboards, do not join correctly in your analytics warehouse, or worse, attribute token consumption to the wrong tenant entirely.

What Actually Changed in the Stable GenAI Semantic Conventions

To understand the blast radius, you need to understand the specific structural changes the stable spec introduced. Here are the most impactful ones for enterprise backend teams:

1. The Namespace Formalization

The experimental spec used a mixed namespace approach. You would find attributes like llm.request.model sitting alongside ai.completion.tokens depending on which instrumentation library you used. The stable spec enforces a clean, consistent gen_ai.* root namespace. Every attribute now lives under this prefix with no exceptions. This means:

  • llm.request.model is now gen_ai.request.model
  • llm.usage.prompt_tokens is now gen_ai.usage.input_tokens
  • llm.usage.completion_tokens is now gen_ai.usage.output_tokens
  • llm.response.model is now gen_ai.response.model (and critically, this is now a separate required attribute from the request model)

That last point deserves emphasis. The stable spec formally recognizes that the model you request and the model that actually responds can differ. This happens constantly in enterprise deployments that use model routing layers, fallback chains, or provider-level model aliasing. If your cost allocation was based solely on the request model, you have been charging tenants for the wrong compute tier in every routing scenario.
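
The renames above can be captured in a small translation table. The following is an illustrative Python sketch for normalizing legacy spans during a migration window, not an official OTel migration tool; the mapping covers only the attributes discussed here, and you should extend it with whatever legacy names your own audit turns up.

```python
# Illustrative mapping from experimental GenAI attribute names to the
# stable gen_ai.* names. Not exhaustive -- extend with the legacy names
# your audit actually finds.
EXPERIMENTAL_TO_STABLE = {
    "llm.request.model": "gen_ai.request.model",
    "llm.usage.prompt_tokens": "gen_ai.usage.input_tokens",
    "llm.usage.completion_tokens": "gen_ai.usage.output_tokens",
    "llm.response.model": "gen_ai.response.model",
}

def migrate_attributes(attrs: dict) -> dict:
    """Return a copy of span attributes with legacy names rewritten.

    Stable names already present on the span win over migrated legacy
    names, so this is safe to run on mixed-schema spans.
    """
    migrated = {}
    for key, value in attrs.items():
        new_key = EXPERIMENTAL_TO_STABLE.get(key, key)
        # Never clobber a stable attribute the span already carries.
        if new_key not in attrs or new_key == key:
            migrated[new_key] = value
    return migrated
```

Note that the function deliberately refuses to overwrite an existing stable attribute: during a partial migration, the same span can carry both a legacy and a stable name, and the stable one should be treated as authoritative.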

2. Agent Span Decomposition

Perhaps the most significant structural change is how the stable conventions handle multi-step agent execution. The experimental spec treated an entire agent run as a single span with aggregated token counts. The stable spec introduces a hierarchical span model that decomposes agent execution into distinct span kinds:

  • Chat spans (gen_ai.system scoped): Individual LLM inference calls
  • Tool spans: Function/tool invocations made by the agent
  • Pipeline spans: The orchestration wrapper that links steps in a multi-turn or multi-tool agent loop

This decomposition is architecturally correct and long overdue. But it breaks every aggregation query that assumed a flat span model. If your ClickHouse cost allocation query does a SUM(gen_ai.usage.input_tokens) across all spans in a trace without filtering by span kind, you will now double-count tokens in any agent trace that has both a pipeline span and child chat spans, because the stable spec allows both levels to carry token attributes for different purposes.
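
The safe aggregation can be sketched in a few lines. This example assumes spans are available as dicts of stable attributes and sums tokens only from spans whose gen_ai.operation.name marks an actual inference call; the invoke_agent operation name for the pipeline-level span is an assumption for illustration.

```python
# Operation names that represent real inference calls carrying
# per-call token counts. Pipeline/orchestration spans are excluded.
INFERENCE_OPS = {"chat", "text_completion", "embeddings"}

def sum_input_tokens(spans: list[dict]) -> int:
    """Sum input tokens from leaf inference spans only, never from
    pipeline-level aggregates, to avoid double counting."""
    return sum(
        s.get("gen_ai.usage.input_tokens", 0)
        for s in spans
        if s.get("gen_ai.operation.name") in INFERENCE_OPS
    )
```

A naive `sum()` over every span in the trace would count the pipeline aggregate on top of the per-call values; the operation filter is what keeps the two levels apart.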

3. System and Operation Attributes

The stable spec introduces gen_ai.system as a required attribute that identifies the AI provider or framework (for example, openai, anthropic, aws.bedrock, vertex_ai). It also introduces gen_ai.operation.name to distinguish between operations like chat, text_completion, embeddings, and create_image.

For multi-provider enterprise deployments, this is transformative. You can now build observability pipelines that correctly route cost attribution by provider, model, and operation type in a single, standardized query. But only if your instrumentation is actually emitting these attributes correctly, and only if your collector pipeline is not stripping or renaming them.

The Cross-Tenant Cost Allocation Problem Explained

Let us be precise about what "cross-tenant cost allocation" means in this context and why span attribution is the exact point of failure.

In a typical enterprise SaaS platform offering AI features, the architecture looks roughly like this:

  • Tenant A, Tenant B, and Tenant C all call your AI API gateway
  • Your gateway routes requests to one or more LLM providers (OpenAI, Anthropic, Bedrock, etc.)
  • An agent orchestration layer (LangGraph, CrewAI, a custom loop) may execute multiple LLM calls per user-initiated action
  • Token consumption is metered per tenant for billing or showback purposes

The tenant context must propagate through every span in that execution chain. In OpenTelemetry terms, this means the tenant identifier needs to live either in the trace context baggage or as a span attribute at every level of the hierarchy. This is where the experimental-to-stable transition creates a subtle but catastrophic failure mode.
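
To make the propagation requirement concrete, here is a stdlib-only sketch of parsing a W3C Baggage header and stamping selected entries onto a span's attributes. A production pipeline would use the OTel SDK's baggage API rather than parsing headers by hand, and the tenant.id, workspace.id, and subscription.tier key names are this article's conventions, not part of any spec.

```python
from urllib.parse import unquote

def parse_baggage(header: str) -> dict:
    """Parse a W3C Baggage header ("k1=v1,k2=v2;prop") into a dict.

    Member properties after ';' are ignored; percent-encoded values
    are decoded.
    """
    entries = {}
    for member in header.split(","):
        member = member.split(";", 1)[0].strip()  # drop member properties
        if "=" in member:
            key, value = member.split("=", 1)
            entries[key.strip()] = unquote(value.strip())
    return entries

def stamp_tenant_context(span_attrs: dict, baggage_header: str) -> dict:
    """Copy cost-allocation dimensions from baggage onto span attributes."""
    baggage_entries = parse_baggage(baggage_header)
    for key in ("tenant.id", "workspace.id", "subscription.tier"):
        if key in baggage_entries and key not in span_attrs:
            span_attrs[key] = baggage_entries[key]
    return span_attrs
```

The important property is that the stamping happens at every span-creation site, not just at the gateway: baggage rides along with the trace context, so each level of the hierarchy can self-annotate.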

The Silent Attribution Gap

Here is the exact failure scenario playing out in production systems right now:

Your API gateway creates a root span and correctly attaches tenant.id as a span attribute. Your old instrumentation library, still using experimental GenAI conventions, creates a single child span for the entire agent run and propagates the tenant context correctly. Your cost allocation query joins on tenant.id and sums token usage. Everything looks fine.

Now you upgrade your instrumentation library to one that implements the stable GenAI conventions. The agent run is now decomposed into a pipeline span and multiple child chat spans. The pipeline span correctly carries tenant.id from baggage propagation. But the child chat spans, created deep inside the instrumentation library's internal span creation logic, may not carry the tenant.id attribute if your baggage propagation is not configured to automatically annotate all child spans.

Your cost allocation query now misses all token counts that live on child chat spans without tenant.id. You are undercharging tenants. Worse, if your query has any fallback logic that attributes unmatched spans to a default tenant, you are overcharging that default tenant. Neither failure is visible until a tenant disputes an invoice or an audit catches the discrepancy.

Diagnosing Your Current Pipeline: A Practical Checklist

Before redesigning anything, you need to understand the current state of your instrumentation. Here is the diagnostic checklist your team should run:

Step 1: Audit Your Attribute Namespace

Query your observability backend or tracing store for any span attributes that begin with llm. or ai. rather than the stable gen_ai. prefix. The presence of old-namespace attributes means you have instrumentation libraries or manual instrumentation code that has not been updated to the stable spec. In many enterprise environments, this audit reveals a mix of old and new attributes in the same trace because different services upgraded at different times.
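
If you can export a sample of spans (for example from a trace dump) as attribute dicts, this audit reduces to a prefix scan. A rough sketch, with the record shape assumed for illustration:

```python
# Prefixes used by pre-stable GenAI instrumentation libraries.
LEGACY_PREFIXES = ("llm.", "ai.")

def find_legacy_attributes(spans: list[dict]) -> set[str]:
    """Return every attribute key that still uses a pre-stable namespace."""
    legacy = set()
    for span in spans:
        for key in span.get("attributes", {}):
            if key.startswith(LEGACY_PREFIXES):
                legacy.add(key)
    return legacy
```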

Step 2: Validate Span Hierarchy Completeness

For a sample of agent traces, verify that every span in the hierarchy carries your tenant context attribute. You can do this with a query like the following in your tracing backend:

-- BigQuery-style syntax; in ClickHouse, substitute countIf() for COUNTIF()
SELECT trace_id, COUNT(*) AS total_spans,
  COUNTIF(attributes['tenant.id'] IS NOT NULL) AS attributed_spans
FROM traces
WHERE span_kind IN ('CLIENT', 'INTERNAL')
  AND attributes['gen_ai.system'] IS NOT NULL
GROUP BY trace_id
HAVING attributed_spans < total_spans

Any trace where attributed_spans is less than total_spans is a trace with attribution gaps. The ratio of these traces to your total AI traces tells you the severity of your current problem.

Step 3: Check for Double-Counting Risk

In traces that use the new hierarchical span model, verify that your cost aggregation query does not sum token attributes from both pipeline spans and their child chat spans. The correct approach is to sum only from leaf-level chat spans, which carry the actual per-call token counts. Pipeline spans should carry only metadata and propagation context, not token totals.

Step 4: Validate Response Model Attribution

Check whether your spans carry both gen_ai.request.model and gen_ai.response.model, and whether they ever differ. If you use any model routing, aliasing, or fallback logic, they will differ. Your cost allocation must use gen_ai.response.model for pricing lookups, not the request model.
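
A quick way to quantify this check is to compute the fraction of inference spans whose responding model differs from the requested one. A sketch, again assuming spans as dicts of stable attributes:

```python
def model_divergence_rate(spans: list[dict]) -> float:
    """Fraction of spans where the responding model differs from the
    requested model -- the routing, fallback, and aliasing cases."""
    relevant = [
        s for s in spans
        if "gen_ai.request.model" in s and "gen_ai.response.model" in s
    ]
    if not relevant:
        return 0.0
    diverged = sum(
        1 for s in relevant
        if s["gen_ai.request.model"] != s["gen_ai.response.model"]
    )
    return diverged / len(relevant)
```

Any nonzero rate means request-model-based pricing is misbilling some fraction of traffic.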

Redesigning the Observability Pipeline: The Target Architecture

Now for the prescriptive part. Here is what a production-ready AI agent observability pipeline looks like when built correctly against the stable GenAI semantic conventions in 2026.

Layer 1: Instrumentation Layer

Your instrumentation layer must do three things consistently:

  1. Emit stable gen_ai.* attributes exclusively. Audit and remove all experimental attribute names. If you use a framework like LangChain, LlamaIndex, or LangGraph, pin to a version of the OpenTelemetry instrumentation plugin that explicitly documents stable convention support. Do not assume a library is stable-compliant because it uses the gen_ai. prefix; verify the full attribute set against the spec.
  2. Propagate tenant context via W3C Baggage. Your tenant identifier, and any other cost-allocation dimensions like workspace ID, feature flag cohort, or subscription tier, must be injected into W3C Baggage at the API gateway boundary. Every downstream span creation must read from baggage and stamp the relevant attributes onto the new span. Do not rely on span attribute inheritance; OTel does not automatically copy parent span attributes to child spans.
  3. Instrument at the correct span granularity. Follow the stable spec's span kind model. Each discrete LLM inference call gets its own chat span. Tool calls get their own tool spans. The orchestration loop gets a pipeline span. Never aggregate token counts manually into a parent span; let the hierarchy do that work at query time.

Layer 2: Collector Pipeline

The OpenTelemetry Collector is where many enterprise pipelines silently corrupt their data. Common mistakes include:

  • Attribute renaming processors that were written to normalize experimental attribute names and now conflict with stable names
  • Sampling rules that drop child spans based on heuristics that assumed a flat span model, now causing the leaf chat spans carrying actual token counts to be dropped
  • Batch processors configured with timeouts that split a single agent trace across multiple export batches, causing incomplete trace assembly in the backend

Your collector pipeline redesign should include a dedicated GenAI enrichment processor that performs the following operations in order:

  1. Validate the presence of required stable attributes (gen_ai.system, gen_ai.operation.name, gen_ai.request.model) and emit a metric counter for any span missing them
  2. Read tenant context from W3C Baggage headers and stamp it as a span attribute if not already present
  3. Enrich gen_ai.response.model from a model registry lookup if the instrumentation library did not capture it (some provider SDKs do not return the actual model name in streaming responses)
  4. Tag spans with a cost_allocation.eligible boolean attribute based on whether all required dimensions are present, giving your downstream query a clean filter
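
The four steps above can be prototyped as a single enrichment function before being ported to a Collector processor. In this sketch, the gen_ai.* names come from the spec, while cost_allocation.eligible, tenant.id, and the model-registry shape are this article's conventions, not standard attributes:

```python
REQUIRED_ATTRS = ("gen_ai.system", "gen_ai.operation.name", "gen_ai.request.model")
COST_DIMENSIONS = ("tenant.id", "gen_ai.system", "gen_ai.response.model")

def enrich_span(attrs: dict, baggage: dict, model_registry: dict) -> dict:
    """Validate, enrich, and tag one span's attributes (steps 1-4)."""
    # Step 1: detect missing required attributes. A real processor would
    # increment a metric counter here rather than just collecting them.
    missing = [a for a in REQUIRED_ATTRS if a not in attrs]
    # Step 2: stamp tenant context from baggage if not already present.
    if "tenant.id" not in attrs and "tenant.id" in baggage:
        attrs["tenant.id"] = baggage["tenant.id"]
    # Step 3: backfill the response model from a registry lookup when the
    # instrumentation library did not capture it.
    if "gen_ai.response.model" not in attrs and "gen_ai.request.model" in attrs:
        backfill = model_registry.get(attrs["gen_ai.request.model"])
        if backfill:
            attrs["gen_ai.response.model"] = backfill
    # Step 4: tag eligibility so the downstream cost query has a clean filter.
    attrs["cost_allocation.eligible"] = (
        not missing and all(d in attrs for d in COST_DIMENSIONS)
    )
    return attrs
```

In a real deployment this logic would live in a Collector processor (for example via the transform processor's rule language) rather than application code, but the ordering of the steps is the part that matters.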

Layer 3: Analytics and Billing Backend

Your cost allocation queries need to be rewritten from scratch against the stable schema. The key principles:

  • Filter by operation before aggregating. Only sum token counts from spans where gen_ai.operation.name is in the set of inference operations ('chat', 'text_completion', 'embeddings', as appropriate for the workload). Never aggregate across all spans in a trace indiscriminately.
  • Use gen_ai.response.model for pricing lookups. Maintain a model pricing table keyed on the combination of gen_ai.system and gen_ai.response.model, with separate rates for gen_ai.usage.input_tokens and gen_ai.usage.output_tokens.
  • Build a reconciliation job. Daily or hourly, run a query that identifies traces where the sum of child span token counts does not match any pipeline-level aggregate. Flag these for manual review. This reconciliation job is your early warning system for future instrumentation drift.
  • Version your cost allocation schema. Store the OTel semantic conventions version alongside each billing period's aggregated data. When the conventions update again (and they will), you will be able to clearly identify which billing periods used which schema version.
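
The pricing-table principle can be sketched as a lookup keyed on (gen_ai.system, gen_ai.response.model). The rates below are made-up placeholders, not real provider prices; substitute your negotiated rates per million tokens.

```python
# Hypothetical per-million-token USD rates, keyed by (system, response model).
# Placeholder numbers only -- substitute your actual negotiated rates.
PRICING = {
    ("openai", "gpt-4o-2024-08-06"): {"input": 2.50, "output": 10.00},
    ("anthropic", "claude-3-7-sonnet"): {"input": 3.00, "output": 15.00},
}

def span_cost_usd(attrs: dict) -> float:
    """Price one chat span using the *response* model, never the request
    model, so routed and fallback traffic lands on the right tier."""
    rates = PRICING[(attrs["gen_ai.system"], attrs["gen_ai.response.model"])]
    input_cost = attrs.get("gen_ai.usage.input_tokens", 0) / 1_000_000 * rates["input"]
    output_cost = attrs.get("gen_ai.usage.output_tokens", 0) / 1_000_000 * rates["output"]
    return input_cost + output_cost
```

Keying on the (system, model) pair rather than the model name alone matters because different providers can expose similarly named models at different rates.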

The Governance Problem Nobody Is Talking About

There is a dimension to this problem that goes beyond the technical pipeline redesign: instrumentation governance. In most enterprise engineering organizations, the team that owns the observability pipeline is not the same team that owns the AI feature code. Platform engineers maintain the collector infrastructure. Application teams instrument their own services. ML engineers build the agent orchestration logic. Nobody owns the full chain.

This organizational seam is exactly where instrumentation drift happens. An ML engineer upgrades a LangGraph dependency that pulls in a new version of the OTel GenAI plugin. The new plugin emits stable attributes. The platform team's collector is still running an attribute renaming processor that was written to normalize experimental attributes. The renaming processor now corrupts the stable attributes into garbage. Nobody notices until the monthly billing reconciliation fails.

The fix requires a governance layer, not just a technical one:

  • Define a GenAI Observability Contract as an internal API: a versioned document that specifies exactly which attributes must be present on every AI agent span, what their types are, and who is responsible for emitting them versus enriching them at the collector layer.
  • Add instrumentation validation to CI/CD. Use OTel's semantic conventions schema validation tooling to run automated checks against span samples in your staging environment before any AI service deployment reaches production.
  • Establish a cross-team GenAI observability working group that includes platform engineering, ML engineering, and finance (yes, finance). The cost allocation problem is a business problem, not just a technical one, and the people who feel the pain of incorrect billing need a seat at the table when instrumentation decisions are made.

What This Looks Like at Scale: A Reference Scenario

Consider a hypothetical enterprise platform that serves 200 tenants, processes roughly 40 million LLM inference calls per day across three providers (OpenAI GPT-4o, Anthropic Claude 3.7, and Amazon Bedrock Titan), and uses a LangGraph-based agent framework for its core AI workflows. Each tenant is billed monthly based on token consumption, with separate rates for input tokens, output tokens, and embedding tokens.

Before the stable conventions migration, this platform ran a single nightly ClickHouse aggregation job that summed llm.usage.prompt_tokens and llm.usage.completion_tokens across all spans tagged with a given tenant.id. Simple, fast, and seemingly reliable.

After upgrading to stable-compliant instrumentation without updating the pipeline, here is what broke:

  • The attribute renaming processor in the collector was transforming gen_ai.usage.input_tokens back to llm.usage.prompt_tokens for approximately 60% of spans, and silently dropping the attribute for the other 40% where the renaming logic failed due to type mismatches in the new schema.
  • The hierarchical span model meant that 15% of all agent traces had token counts split across pipeline and chat spans. The aggregation query was double-counting those traces.
  • Model routing was active for 8% of requests, meaning those requests were billed at the wrong model tier because the query used gen_ai.request.model instead of gen_ai.response.model.

The combined effect was a billing discrepancy of approximately 12 to 18% across the tenant base. Some tenants were overcharged; others were undercharged. The platform's finance team caught it during a quarterly audit, not through any automated alerting. The remediation required three weeks of engineering time to reprocess historical trace data and issue billing corrections.

This scenario is not hypothetical in its mechanics. It is a direct composite of patterns that are already emerging in enterprise AI platform post-mortems in early 2026.

The Timeline Pressure: Why You Cannot Wait

You might be thinking: "We will get to this in Q3." Here is why that timeline is dangerous.

First, instrumentation libraries are moving fast. The major LLM orchestration frameworks (LangChain, LlamaIndex, LangGraph, AutoGen, CrewAI) are all actively updating their OTel plugins to emit stable attributes. If your application teams are doing routine dependency upgrades, they may already be emitting a mix of experimental and stable attributes in production right now, without anyone having made a deliberate decision to migrate.

Second, observability vendors are deprecating experimental attribute support. Datadog, Dynatrace, and Grafana Cloud have all signaled that their built-in AI observability dashboards and cost analytics features are being rebuilt around the stable gen_ai.* schema. Vendor-provided dashboards that your team currently relies on may stop populating correctly as the vendors sunset experimental attribute support in their backends.

Third, the longer you wait, the more historical billing data becomes tainted with mixed-schema spans. Retroactively reprocessing months of trace data to correct billing records is an expensive, error-prone operation that creates significant customer trust risk if discrepancies are large enough to require invoice corrections.

Conclusion: Stability Is Not a Feature, It Is a Forcing Function

The promotion of OpenTelemetry's GenAI semantic conventions to stable status is genuinely good news for the industry. It means the community has reached consensus on a durable, well-designed schema for AI observability. It means tooling can now be built with confidence. It means the chaos of the experimental era is behind us.

But for enterprise backend teams that built production systems on experimental foundations, stability is a forcing function. It draws a clear line between the old way and the correct way, and it removes the excuse of "the spec is still changing" for not doing the migration work.

The teams that act now, who audit their instrumentation, redesign their collector pipelines, rewrite their cost allocation queries, and put governance structures in place, will have AI observability infrastructure that is genuinely reliable and scalable. They will be able to add new providers, new agent frameworks, and new tenants without rebuilding their billing logic from scratch each time.

The teams that wait will face the billing discrepancy post-mortem. And in a multi-tenant enterprise environment, that post-mortem has a way of becoming a very public, very expensive conversation with customers who did not appreciate being billed incorrectly for AI compute they trusted you to measure accurately.

The stable spec is here. The migration window is now. The cost of waiting is not technical debt; it is real dollars misattributed to real tenants. That is the only deadline that actually matters.