How to Instrument Your First AI Agent Pipeline With OpenTelemetry: A Step-by-Step Guide for Backend Engineers

You've built an AI agent pipeline. It calls an LLM, maybe invokes a few tools, retrieves documents from a vector store, and chains reasoning steps together. It works, mostly. But when it doesn't, you have no idea why. The logs are a wall of JSON. The latency graph is a flat line of confusion. And your on-call rotation is dreading the next incident.

Welcome to the observability gap that almost every backend team hits when they move from traditional microservices to non-deterministic, multi-step AI workflows. The good news: OpenTelemetry is now mature enough to close that gap, and the GenAI semantic conventions that stabilized in late 2025 give you a standardized vocabulary for tracing LLM calls, tool invocations, and agent reasoning loops.

This guide is a hands-on, step-by-step walkthrough for backend engineers who already understand distributed tracing but have never applied it to an AI agent pipeline. By the end, you'll have a fully instrumented Python-based agent with spans that capture token usage, tool calls, retrieval steps, and retry logic, all flowing into your existing observability backend.

Why AI Agent Pipelines Break Traditional Tracing Assumptions

Before writing a single line of instrumentation code, it's worth understanding what makes AI agent pipelines fundamentally different from the REST-service chains you're used to tracing.

  • Non-deterministic branching: A standard microservice calls service B, then service C, in a predictable order. An agent decides at runtime whether to call a search tool, a calculator, or loop back to re-query the LLM. The call graph is not known at deploy time.
  • Variable-length token streams: Latency is not just network latency. It includes time-to-first-token, streaming duration, and retry backoff when rate limits are hit. None of these fit neatly into a single span duration.
  • Nested reasoning loops: ReAct-style agents and multi-agent orchestration frameworks (like those built on top of LangGraph or custom async Python) create deeply nested, sometimes recursive span trees.
  • Semantic meaning in payloads: For a database query, the SQL string is useful metadata. For an LLM call, the prompt, model name, temperature, token counts, and finish reason are all critical diagnostic signals.

The OpenTelemetry GenAI semantic conventions address all of these by defining a standard set of span attributes and span kinds specifically for generative AI workloads. Let's build on top of them.

Prerequisites and Setup

This tutorial uses Python 3.12+. The agent pipeline we'll instrument is a simple ReAct-style agent that answers questions by searching a knowledge base and optionally calling a calculator tool. The principles apply equally to TypeScript/Node.js agents.

Install the required packages:

pip install opentelemetry-api \
            opentelemetry-sdk \
            opentelemetry-exporter-otlp-proto-grpc \
            opentelemetry-instrumentation-httpx \
            openai \
            httpx

If you're using a framework like LangGraph, CrewAI, or a custom async orchestrator, the same SDK applies. The manual instrumentation patterns we cover here translate directly to any framework.

Step 1: Bootstrap the OpenTelemetry SDK

Create a file called telemetry.py. This is your single source of truth for tracer and meter initialization. Keep it separate from your agent logic so it can be imported without side effects.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource, SERVICE_NAME

def configure_telemetry(service_name: str = "ai-agent-pipeline") -> trace.Tracer:
    resource = Resource.create({
        SERVICE_NAME: service_name,
        "deployment.environment": "production",
        "ai.agent.framework": "custom-react",
    })

    exporter = OTLPSpanExporter(endpoint="http://otel-collector:4317")

    provider = TracerProvider(resource=resource)
    provider.add_span_processor(BatchSpanProcessor(exporter))
    trace.set_tracer_provider(provider)

    return trace.get_tracer(service_name, schema_url="https://opentelemetry.io/schemas/1.27.0")

A few things to note here. First, the Resource carries metadata that persists across every span your service emits. Adding ai.agent.framework as a custom resource attribute lets you filter all agent spans in your backend without touching individual span attributes. Second, we pin the schema_url to a specific OTel schema version. This matters when your collector or backend performs schema transformations.

Step 2: Create a Root Span for Each Agent Run

Every agent invocation should produce exactly one root span. This is the "envelope" that wraps the entire reasoning loop, and it's what your SLO dashboards will measure against.

from telemetry import configure_telemetry
from opentelemetry import trace
from opentelemetry.trace import SpanKind, StatusCode

tracer = configure_telemetry()

def run_agent(user_query: str, session_id: str) -> str:
    with tracer.start_as_current_span(
        "agent.run",
        kind=SpanKind.SERVER,
        attributes={
            "gen_ai.operation.name": "agent_run",
            "gen_ai.system": "openai",
            "session.id": session_id,
            "user.query": user_query[:512],  # truncate for safety
        }
    ) as root_span:
        try:
            result = execute_react_loop(user_query)
            root_span.set_status(StatusCode.OK)
            root_span.set_attribute("agent.final_answer_length", len(result))
            return result
        except Exception as e:
            root_span.record_exception(e)
            root_span.set_status(StatusCode.ERROR, description=str(e))
            raise

Notice that we use SpanKind.SERVER for the root span. This signals to your observability backend that this is an inbound operation, making it compatible with standard RED (Rate, Errors, Duration) metric derivation. Child spans for LLM calls will use SpanKind.CLIENT.

Step 3: Instrument LLM Calls With GenAI Semantic Conventions

This is the heart of AI agent observability. Every call to an LLM must be wrapped in a span that captures the inputs, outputs, and token economics. The OpenTelemetry GenAI semantic conventions define the exact attribute names to use.

import time
from openai import OpenAI
from opentelemetry.trace import SpanKind

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

def call_llm(messages: list, model: str = "gpt-4o", temperature: float = 0.2) -> dict:
    with tracer.start_as_current_span(
        "gen_ai.chat",
        kind=SpanKind.CLIENT,
        attributes={
            "gen_ai.operation.name": "chat",
            "gen_ai.system": "openai",
            "gen_ai.request.model": model,
            "gen_ai.request.temperature": temperature,
            "gen_ai.request.max_tokens": 1024,
            "gen_ai.request.message_count": len(messages),
        }
    ) as llm_span:
        start = time.perf_counter()

        response = openai_client.chat.completions.create(
            model=model,
            messages=messages,
            temperature=temperature,
            max_tokens=1024,
        )

        latency_ms = (time.perf_counter() - start) * 1000

        # Capture response metadata using GenAI conventions
        llm_span.set_attribute("gen_ai.response.model", response.model)
        llm_span.set_attribute("gen_ai.response.finish_reason", response.choices[0].finish_reason)
        llm_span.set_attribute("gen_ai.usage.input_tokens", response.usage.prompt_tokens)
        llm_span.set_attribute("gen_ai.usage.output_tokens", response.usage.completion_tokens)
        llm_span.set_attribute("gen_ai.usage.total_tokens", response.usage.total_tokens)
        llm_span.set_attribute("llm.latency_ms", latency_ms)

        return response

A critical note on prompt logging: Do not store raw prompt content in span attributes by default. Prompts frequently contain PII, proprietary context, or user data that must not flow into your observability backend unfiltered. Instead, log a truncated hash or a semantic label. If your organization has a secure, access-controlled telemetry pipeline, you can enable full prompt capture via a feature flag, but make it opt-in.
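One way to apply this advice is to attach a short, non-reversible fingerprint instead of the prompt itself. The sketch below uses a truncated SHA-256 digest; the attribute name gen_ai.prompt.fingerprint is our own invention for illustration, not part of the GenAI semantic conventions:

```python
import hashlib

def prompt_fingerprint(text: str, length: int = 16) -> str:
    """Return a short, non-reversible fingerprint of prompt text.

    Safe to attach as a span attribute: it lets you group identical
    prompts across traces without exporting the raw content.
    """
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:length]

# Inside call_llm, instead of storing raw message content:
# llm_span.set_attribute("gen_ai.prompt.fingerprint", prompt_fingerprint(prompt_text))
```

Because the fingerprint is deterministic, identical prompts produce identical values, which is enough to answer "is this the same prompt template?" during an incident without ever seeing the text.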

Step 4: Instrument Tool Calls as Child Spans

In a ReAct agent, tool calls are the side effects that make agents useful and also the most common source of latency spikes and silent failures. Each tool invocation deserves its own span with the tool name, inputs, and outcome clearly recorded.

def execute_tool(tool_name: str, tool_input: dict) -> str:
    with tracer.start_as_current_span(
        f"agent.tool.{tool_name}",
        kind=SpanKind.INTERNAL,
        attributes={
            "gen_ai.operation.name": "tool_call",
            "tool.name": tool_name,
            "tool.input_keys": list(tool_input.keys()),
        }
    ) as tool_span:
        try:
            if tool_name == "search_knowledge_base":
                result = search_knowledge_base(tool_input["query"])
                tool_span.set_attribute("tool.result_count", len(result))
                tool_span.set_attribute("tool.success", True)
                return str(result)

            elif tool_name == "calculator":
                result = evaluate_expression(tool_input["expression"])
                tool_span.set_attribute("tool.result", str(result))
                tool_span.set_attribute("tool.success", True)
                return str(result)

            else:
                raise ValueError(f"Unknown tool: {tool_name}")

        except Exception as e:
            tool_span.record_exception(e)
            tool_span.set_attribute("tool.success", False)
            raise

Step 5: Trace the Reasoning Loop Itself

The reasoning loop is where non-determinism lives. An agent might complete in one LLM call or ten. Each iteration should be a named span so you can see the loop depth in your trace waterfall and immediately spot runaway agents.

def execute_react_loop(user_query: str, max_iterations: int = 10) -> str:
    messages = build_system_prompt(user_query)
    
    for iteration in range(max_iterations):
        with tracer.start_as_current_span(
            "agent.iteration",
            attributes={
                "agent.iteration_number": iteration,
                "agent.max_iterations": max_iterations,
            }
        ) as iter_span:
            response = call_llm(messages)
            content = response.choices[0].message.content

            # Parse the agent's next action
            action = parse_agent_action(content)
            iter_span.set_attribute("agent.action_type", action["type"])

            if action["type"] == "final_answer":
                iter_span.set_attribute("agent.completed", True)
                return action["answer"]

            elif action["type"] == "tool_call":
                tool_result = execute_tool(action["tool"], action["input"])
                messages.append({"role": "assistant", "content": content})
                messages.append({"role": "tool", "content": tool_result})

    # Max iterations exceeded: record this as a warning event
    current_span = trace.get_current_span()
    current_span.add_event("agent.max_iterations_exceeded", {
        "agent.iteration_count": max_iterations
    })
    return "Agent could not complete the task within the iteration limit."

The add_event call on the root span when max iterations are exceeded is particularly valuable. Span events are timestamped log entries attached to a span, and they show up inline in your trace waterfall. This makes it trivial to correlate "the agent got stuck" with the exact moment in the timeline.

Step 6: Add Baggage for Cross-Service Context Propagation

If your agent pipeline calls downstream microservices (a retrieval API, a reranker service, a database proxy), you need context propagation so that the trace ID flows across service boundaries. OpenTelemetry's W3C Trace Context propagation handles this automatically for HTTP calls, but you need to be deliberate about it.

from opentelemetry.propagate import inject
from opentelemetry import baggage
import httpx

def search_knowledge_base(query: str) -> list:
    # Set business-level baggage first, then inject from that context so
    # the baggage actually travels with the outgoing request
    ctx = baggage.set_baggage("agent.session_id", current_session_id)

    headers = {}
    inject(headers, context=ctx)  # injects traceparent, tracestate, and baggage

    with httpx.Client() as client:
        response = client.post(
            "http://retrieval-service/search",
            json={"query": query, "top_k": 5},
            headers=headers,
        )
        response.raise_for_status()
        return response.json()["results"]

With this in place, your retrieval service's spans will appear as children of the agent.tool.search_knowledge_base span in the same trace. You get end-to-end visibility from the user's question to the vector store lookup and back, all in a single trace waterfall.

Step 7: Emit Token Usage as Metrics (Not Just Span Attributes)

Span attributes are great for debugging individual traces, but for cost monitoring and capacity planning, you need metrics. Token counts should be emitted as OpenTelemetry metrics so they can feed into your dashboards and alerting rules.

from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter

def configure_metrics():
    exporter = OTLPMetricExporter(endpoint="http://otel-collector:4317")
    reader = PeriodicExportingMetricReader(exporter, export_interval_millis=30000)
    provider = MeterProvider(metric_readers=[reader])
    metrics.set_meter_provider(provider)
    return metrics.get_meter("ai-agent-pipeline")

meter = configure_metrics()

token_counter = meter.create_counter(
    "gen_ai.token.usage",
    unit="tokens",
    description="Total tokens consumed by LLM calls",
)

agent_duration = meter.create_histogram(
    "agent.run.duration",
    unit="ms",
    description="End-to-end duration of agent runs",
)

# In your call_llm function, after getting the response:
token_counter.add(
    response.usage.total_tokens,
    attributes={
        "gen_ai.system": "openai",
        "gen_ai.request.model": model,
        "token.type": "total",
    }
)

With this metric flowing into your backend (Prometheus, Grafana Cloud, Datadog, or Honeycomb), you can build alerts like "alert when total tokens per minute exceeds 500,000" or "chart daily LLM cost by model version." This is the instrumentation that your finance team will eventually thank you for.
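If you also want cost (not just tokens) on a dashboard, you can derive a rough per-call estimate at emission time. The prices below are illustrative placeholders, not real rates; load actual per-model pricing from configuration, since it varies by provider and changes over time:

```python
# Illustrative per-1K-token prices (USD); NOT real rates.
PRICE_PER_1K = {
    "gpt-4o": {"input": 0.0025, "output": 0.010},
}

def estimate_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Rough USD cost estimate for one LLM call, for dashboarding only."""
    prices = PRICE_PER_1K.get(model)
    if prices is None:
        return 0.0  # unknown model: emit zero rather than guessing
    return (input_tokens / 1000) * prices["input"] + (output_tokens / 1000) * prices["output"]

# Emitted alongside the token counter, e.g.:
# cost_counter.add(estimate_cost_usd(model, in_tok, out_tok),
#                  attributes={"gen_ai.request.model": model})
```

Treat this as an approximation for trend lines and alerts; your provider's invoice remains the source of truth.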

Step 8: Configure the OpenTelemetry Collector

All the telemetry you've generated needs to flow through an OpenTelemetry Collector before reaching your backend. Here's a minimal but production-ready collector configuration that handles batching, tail-based sampling for long agent traces, and multi-backend export.

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 1024
  
  # Tail-based sampling: keep all error and slow traces, sample 10% of the rest
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors-policy
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-agents
        type: latency
        latency: { threshold_ms: 5000 }
      - name: sample-success
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }

  # Redact any prompt content that slipped through
  attributes:
    actions:
      - key: user.query
        action: hash

exporters:
  otlp/honeycomb:
    endpoint: api.honeycomb.io:443
    headers:
      x-honeycomb-team: ${env:HONEYCOMB_API_KEY}
  
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, tail_sampling, attributes]
      exporters: [otlp/honeycomb]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]

The tail-based sampling policy here is critical for AI agent workloads. Because agent runs can take 30 to 60 seconds and span many LLM calls, head-based sampling (which makes the keep/drop decision at the start of a trace) is unreliable. Tail-based sampling waits until the full trace arrives before deciding, meaning you never accidentally drop a trace that turned into an error halfway through a 15-step reasoning loop.

Step 9: Validate Your Instrumentation Locally

Before deploying, validate that your spans are structured correctly using the OTEL debug exporter. Add this to your telemetry.py during local development:

import os

from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Add this alongside your OTLP exporter in dev environments
if os.getenv("OTEL_DEBUG") == "true":
    provider.add_span_processor(
        SimpleSpanProcessor(ConsoleSpanExporter())
    )

Run your agent with OTEL_DEBUG=true and verify the following in the console output:

  • Every agent.run span has a trace_id and no parent_id (it is the root).
  • Every gen_ai.chat span has a parent_id matching the enclosing agent.iteration span.
  • Token usage attributes (gen_ai.usage.input_tokens, gen_ai.usage.output_tokens) are present on all LLM spans.
  • Tool spans appear as children of iteration spans, not as children of LLM spans.
  • No raw prompt text appears in any attribute value (check your redaction rules).

What Your Trace Waterfall Should Look Like

Once everything is wired up, a successful 3-iteration agent run should produce a trace that looks roughly like this in your UI:

agent.run (2.4s)
├── agent.iteration [0] (800ms)
│   └── gen_ai.chat (780ms)  [tokens: 312 in / 89 out]
├── agent.iteration [1] (1.1s)
│   ├── gen_ai.chat (650ms)  [tokens: 401 in / 55 out]
│   └── agent.tool.search_knowledge_base (440ms)
│       └── [retrieval-service] vector_search (390ms)
└── agent.iteration [2] (500ms)
    └── gen_ai.chat (480ms)  [tokens: 890 in / 210 out, finish_reason: stop]

This waterfall immediately tells you: the agent took 3 iterations, the bottleneck was the vector search in the second iteration (agent.iteration [1]), and the final LLM call consumed the most tokens because the retrieved context was appended to the prompt. You can now debug latency, cost, and correctness from a single pane of glass.

Common Pitfalls to Avoid

Pitfall 1: Creating Spans Outside of Context

If you create a span in a background thread or async task without propagating the parent context, it will appear as a disconnected root span in your backend. Always use context.attach() when crossing thread or async boundaries, or use the opentelemetry-instrumentation-asyncio package to handle this automatically.

Pitfall 2: Blocking the Event Loop With Synchronous Exporters

Use BatchSpanProcessor, never SimpleSpanProcessor, in production. The batch processor exports asynchronously and will not add latency to your agent's critical path. The simple processor exports synchronously and will make every span export block your agent's execution.

Pitfall 3: Over-Attributing Spans

OpenTelemetry has a default attribute count limit of 128 per span. Resist the temptation to dump every message in the conversation history as a span attribute. Use span events for high-cardinality, repeated data, and keep attributes for stable, filterable metadata.

Pitfall 4: Ignoring Streaming Responses

If you use streaming LLM responses (which you should, for latency), your span must remain open until the stream is exhausted. Open the span before calling the streaming API, consume the full stream inside the span context, and set the token count attributes after the final chunk arrives. The span duration then correctly reflects the full time-to-last-token.

Conclusion

Instrumenting an AI agent pipeline with OpenTelemetry is not just a nice-to-have in 2026. As agents handle more consequential tasks, as multi-agent systems grow more complex, and as LLM API costs become a real line item in engineering budgets, observability is the foundation that makes everything else possible: debugging, cost attribution, SLO enforcement, and compliance auditing.

The key takeaways from this guide are straightforward. Use a single root span per agent run with SpanKind.SERVER. Wrap every LLM call with a gen_ai.chat span using the GenAI semantic conventions. Trace every tool call as a child span. Propagate context across service boundaries with W3C Trace Context headers. Emit token counts as metrics for cost dashboards. And use tail-based sampling in your collector so you never lose a trace that matters.

The instrumentation patterns in this guide are framework-agnostic. Whether you're running a custom async Python agent, a LangGraph workflow, or a multi-agent system built on an emerging 2026 orchestration framework, the OpenTelemetry SDK and the GenAI semantic conventions provide the vocabulary and the plumbing to make your pipelines observable. Start with one agent, get the trace waterfall looking right, and then roll it out across your fleet.

Your future on-call engineer (who might be you at 2am) will be grateful.