Everything You've Been Afraid to Ask About Observability in AI-Powered Codebases
You've spent years building systems you understood. You knew what a slow database query looked like. You knew how to read a flame graph. You knew that when your p99 latency spiked, you could open a trace, find the offending span, and fix it before your on-call rotation ended. Then someone decided to wire a large language model into your backend, and suddenly your entire mental model of "what is this system doing" evaporated.
Welcome to observability in AI-powered codebases: a discipline that is equal parts familiar and completely alien. This FAQ is written for backend engineers who are competent, experienced, and deeply confused. No condescension, no buzzword soup. Just plain answers to the questions you've been embarrassed to ask in Slack.
The Basics: What Even Changed?
Q: I already have logs, metrics, and traces. Why isn't that enough anymore?
Because the contract between your code and its output has fundamentally changed. In a traditional backend, the same input reliably produces the same output. A function is deterministic. A SQL query is deterministic. If something breaks, your logs capture the exact state that caused the failure, and you can reproduce it locally in five minutes.
An LLM call is non-deterministic by design. The same prompt sent twice can return two meaningfully different responses, even at temperature zero (due to floating-point variance across hardware). Your existing observability stack was built on the assumption that you can replay events to understand them. With AI components, you often cannot. You need to capture not just what happened, but what the model was thinking, which means capturing inputs, outputs, token counts, latency per token, model version, and system prompt state at the time of the call. None of that is in a standard HTTP trace span.
Q: Is this just an MLOps problem? Why should I care as a backend engineer?
MLOps is about training and deploying models. What you're dealing with is a runtime problem: a live model is embedded in your service mesh, making decisions that affect real users right now. That is squarely a backend engineering concern. The model itself is a black box you didn't write. But the infrastructure around it is your code: the prompt construction, the context injection, the retry logic, the fallback routing, the token budget management. All of it needs to be observable.
Think of it this way: you don't need to understand how PostgreSQL's query planner works internally to observe its behavior. You instrument the calls you make to it. The same principle applies to LLMs. You own the edges; you observe the edges.
Q: What does "observability" actually mean in this context? Is it different from monitoring?
Monitoring tells you when something is wrong based on thresholds you defined in advance. Observability lets you ask arbitrary questions about your system's state, including questions you didn't think to ask when you deployed it. The classic definition, borrowed from control theory, is that a system is observable if you can infer its internal state from its external outputs.
In AI-powered systems, the gap between monitoring and observability is enormous. You can monitor token usage and latency easily. But observing why a model started giving subtly worse answers after a dependency update changed how you format your system prompt requires the kind of rich, queryable telemetry that traditional monitoring dashboards simply weren't designed to surface. Observability in this context means: can you answer questions you haven't thought of yet? For AI systems, the answer to that question is almost always "not without significant work."
Traces, Spans, and the LLM Call Problem
Q: Can I just add an OpenTelemetry span around my LLM API call and call it done?
You can, and you should, but it's the floor, not the ceiling. A bare OTel span around an LLM call gives you latency and whether it succeeded or threw an exception. That's useful. It's also wildly incomplete. Here's what a minimal but actually useful LLM span should capture:
- The full prompt (or a hashed/truncated version for PII-sensitive workloads)
- The full completion
- Token counts: prompt tokens, completion tokens, total tokens
- Model name and version (never assume this is static; providers update models silently)
- Temperature and other sampling parameters
- Latency to first token (TTFT) if you're streaming
- Finish reason: did the model stop naturally, hit a token limit, or get cut off by a content filter?
- Cost estimate based on token counts and current pricing
The OpenTelemetry GenAI semantic conventions working group has been formalizing attribute names for exactly this purpose since late 2024, and by 2026 most major SDKs have at least partial support. Use those conventions rather than inventing your own attribute names, or your traces will be unsearchable across services.
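As a concrete sketch of that checklist, here is a helper that builds a flat attribute dict for an LLM span. The attribute names follow the spirit of the OTel GenAI semantic conventions, but check your semconv version for the exact keys, and note that the `response` payload shape here is a hypothetical provider format, not any specific SDK's:

```python
import hashlib

def llm_span_attributes(prompt: str, response: dict, ttft_ms=None) -> dict:
    """Build span attributes for an LLM call.

    Attribute names approximate the OTel GenAI semantic conventions;
    `response` is a hypothetical provider payload with `model`,
    `usage`, `finish_reason`, and `temperature` fields.
    """
    usage = response.get("usage", {})
    attrs = {
        # Hash the prompt so PII-sensitive deployments can correlate
        # identical prompts without storing raw content.
        "gen_ai.prompt.sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "gen_ai.request.model": response.get("model"),
        "gen_ai.request.temperature": response.get("temperature"),
        "gen_ai.usage.input_tokens": usage.get("prompt_tokens"),
        "gen_ai.usage.output_tokens": usage.get("completion_tokens"),
        "gen_ai.response.finish_reason": response.get("finish_reason"),
    }
    if ttft_ms is not None:
        attrs["gen_ai.server.time_to_first_token_ms"] = ttft_ms
    return attrs
```

In a real service you would pass this dict to `span.set_attributes()` on the OTel span wrapping the call, and add a cost estimate attribute computed from the token counts and your provider's current pricing table.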
Q: What about AI agents that make multiple LLM calls in a loop? How do I even trace that?
This is where things get genuinely hard, and where most teams discover their observability strategy was built for request/response, not for recursive reasoning loops. An agentic system might spawn sub-agents, call tools, re-prompt based on tool results, and retry failed reasoning steps, all within a single user-facing request. A flat list of spans doesn't capture that structure.
The pattern that works is nested span trees with explicit parent context propagation. Every agent invocation should be a span. Every tool call within that agent should be a child span. Every LLM call within that tool execution should be a grandchild span. When you visualize this in a tool like Jaeger, Honeycomb, or Langfuse, you get a tree that mirrors the agent's reasoning structure. You can see exactly where it went wrong, which tool returned bad data, which re-prompt caused the context window to balloon, and where the latency actually lives.
The critical mistake teams make is losing trace context when they cross async boundaries. If your agent dispatches work to a queue or a separate microservice, you must propagate the W3C traceparent header (or equivalent) through that handoff. If you don't, your trace tree fractures into disconnected islands and you're back to guessing.
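To make the handoff concrete, here is a minimal sketch of the W3C traceparent header format (version, 32-hex-char trace ID, 16-hex-char parent span ID, 2-hex-char flags). In practice you should use your OTel SDK's built-in propagator rather than hand-rolling this; the sketch just shows what must survive the async boundary:

```python
import re
import secrets

def make_traceparent(trace_id=None, span_id=None, sampled=True) -> str:
    """Format a W3C traceparent header: version-trace_id-parent_id-flags."""
    trace_id = trace_id or secrets.token_hex(16)  # 32 lowercase hex chars
    span_id = span_id or secrets.token_hex(8)     # 16 lowercase hex chars
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"

_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def parse_traceparent(header: str):
    """Return (trace_id, parent_id, flags) or None if malformed."""
    m = _TRACEPARENT.match(header)
    return m.groups() if m else None
```

The point is that this string rides along in the queue message or downstream request headers, so that spans created on the other side attach to the same trace tree instead of starting a new, disconnected one.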
Q: My LLM calls are streaming. Does that break tracing?
Streaming complicates tracing but doesn't break it. The key insight is that a streaming LLM response is a single logical operation even though it arrives as a stream of events. You should start the span when you initiate the request and end it when the stream closes (either naturally or on error). Capture TTFT as a span event or attribute, and accumulate the full completion text so you can log it when the span closes.
The temptation is to create a new span for each streamed chunk. Don't. That creates thousands of meaningless spans and destroys your trace's readability. One span, one LLM call, regardless of how the response arrives.
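The pattern can be sketched as follows: one logical operation, TTFT captured on the first chunk, full completion accumulated for the span's closing attributes. The `record` callback stands in for whatever closes your span; the streaming interface here is a plain iterable of text deltas, not any specific SDK's event type:

```python
import time

def consume_stream(chunks, record):
    """Consume a streamed completion as one logical operation.

    `chunks` is any iterable of text deltas; `record` is a hypothetical
    callback that receives the span-level attributes when the stream
    closes. One span per call -- never one span per chunk.
    """
    start = time.monotonic()
    ttft_ms = None
    parts = []
    for chunk in chunks:
        if ttft_ms is None:
            # Time to first token: the latency users actually feel.
            ttft_ms = (time.monotonic() - start) * 1000.0
        parts.append(chunk)
    record({
        "gen_ai.completion": "".join(parts),
        "gen_ai.server.time_to_first_token_ms": ttft_ms,
        "duration_ms": (time.monotonic() - start) * 1000.0,
    })
```

Error handling is the same shape: wrap the loop in try/finally so the span still closes, with an error status, if the stream dies mid-response.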
Prompt Engineering Meets Observability
Q: How do I know if a change to my system prompt broke something in production?
This is one of the most underappreciated failure modes in AI-powered systems. A prompt change is a code change, but it's often not treated as one. Engineers edit a string in a config file or a database row, deploy nothing, and suddenly model behavior shifts in ways that don't trigger any existing alerts.
The solution has two parts. First, version your prompts the same way you version your code. Store them in source control. Tag them. Associate a prompt version identifier with every LLM span so you can correlate behavioral changes with prompt changes after the fact. Second, define behavioral metrics that you can track over time: output length distribution, refusal rate (how often the model declines to answer), format compliance rate (if you're asking for JSON, how often do you actually get valid JSON), and user-facing quality signals if you have them.
When those metrics drift, you have a signal. When that drift correlates with a prompt version change in your trace data, you have a diagnosis. Without prompt versioning, you're flying blind.
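One simple way to get a version identifier that can never drift out of sync with the deployed prompt is to derive it from the prompt text itself. This is a sketch of one approach, not the only one; a git SHA or an explicit tag from your prompt store works equally well:

```python
import hashlib

def prompt_version(template: str) -> str:
    """Derive a stable version identifier from the prompt text itself.

    Because it is a content hash, the identifier changes if and only if
    the string actually sent to the model changes. Attach it to every
    LLM span so behavioral drift can be correlated with prompt edits.
    """
    digest = hashlib.sha256(template.encode("utf-8")).hexdigest()
    return "pv-" + digest[:12]  # 12 hex chars is plenty for uniqueness here
```

The identifier goes on the span as an attribute (for example `gen_ai.prompt.version`, or whatever key your semconv version specifies), so a query like "group refusal rate by prompt version" becomes a one-liner in your trace backend.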
Q: What about RAG pipelines? There are so many moving parts. Where do I even start?
Retrieval-Augmented Generation pipelines are observability nightmares if you treat them as a single operation. They're not. A RAG pipeline is at minimum four distinct observable operations:
- Query embedding: converting the user's input into a vector
- Vector search: retrieving relevant chunks from your knowledge base
- Context assembly: selecting, ranking, and formatting the retrieved chunks into the prompt
- LLM generation: the actual model call with the assembled context
Each of these should be its own span with its own attributes. For the vector search span, capture the query vector (or a hash of it), the number of results retrieved, the similarity scores of the top results, and the retrieval latency. For the context assembly span, capture how many tokens of context were injected and which document chunks were selected.
Why does this matter? Because when your RAG system gives a wrong answer, the failure could be at any of these stages. The embedding model might have encoded the query poorly. The vector search might have retrieved irrelevant chunks. The context assembly might have truncated the most relevant passage. The LLM might have ignored the context entirely. Without instrumentation at each stage, you can't tell which of these happened, and you'll waste hours tuning the wrong component.
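The four-stage structure can be mirrored directly in instrumentation. This toy recorder stands in for nested OTel spans; the stage names, attribute keys, and the `embed-small` model name are all illustrative:

```python
import time
from contextlib import contextmanager

class StageRecorder:
    """Toy recorder mirroring the nested-span pattern for a RAG pipeline.

    In production each stage would be a child span of the request span;
    here each stage is recorded as a (name, attributes) pair.
    """
    def __init__(self):
        self.records = []

    @contextmanager
    def stage(self, name: str, **attrs):
        start = time.monotonic()
        try:
            yield attrs  # callers add attributes as the stage runs
        finally:
            attrs["duration_ms"] = (time.monotonic() - start) * 1000.0
            self.records.append((name, attrs))

rec = StageRecorder()
with rec.stage("query_embedding", model="embed-small"):
    pass  # embed the user's query
with rec.stage("vector_search") as attrs:
    attrs["result_count"] = 5       # how many chunks came back
    attrs["top_score"] = 0.87       # similarity of the best hit
with rec.stage("context_assembly") as attrs:
    attrs["context_tokens"] = 1200  # how much context was injected
with rec.stage("llm_generation"):
    pass  # the actual model call
```

With this shape in place, "which stage failed" becomes a query over stage attributes instead of a guessing game: low `top_score` points at retrieval, a truncated `context_tokens` points at assembly, and so on.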
Alerts, Anomalies, and the "What Is Normal?" Problem
Q: What should I actually alert on for an AI-powered service?
This is the question that exposes the gap between traditional and AI observability most sharply. Your standard SLO toolkit (error rate, latency percentiles, availability) still applies and you should keep those. But AI systems have failure modes that those signals miss entirely. Here's a practical alerting checklist to layer on top:
- Token budget exhaustion rate: what percentage of requests are hitting your context window limit? A spike here often means your context injection logic has a bug or your users' inputs have changed character.
- Cost per request anomalies: sudden increases in average token count can indicate prompt injection attacks or runaway agent loops.
- Finish reason distribution shifts: if your "stop" to "length" ratio changes significantly, your model is being cut off more often, which usually means degraded output quality.
- Refusal rate: a spike in content filter refusals can indicate adversarial input patterns or a model update that changed safety thresholds.
- Downstream parse failure rate: if your code expects JSON from the model and suddenly 15% of responses aren't valid JSON, that's a quality regression worth alerting on.
- Latency to first token (TTFT) percentiles: this is the metric that most directly correlates with user-perceived responsiveness in streaming applications, and it behaves very differently from total response latency.
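As one worked example from the checklist, the downstream parse failure rate is cheap to compute and directly actionable. The 5% threshold below is illustrative; derive yours from baseline data:

```python
import json

def parse_failure_rate(responses) -> float:
    """Fraction of model responses that are not valid JSON -- the
    'downstream parse failure rate' from the checklist above."""
    if not responses:
        return 0.0
    failures = 0
    for text in responses:
        try:
            json.loads(text)
        except json.JSONDecodeError:
            failures += 1
    return failures / len(responses)

ALERT_THRESHOLD = 0.05  # illustrative; set from your own baseline

def should_alert(responses) -> bool:
    """Fire when the parse failure rate crosses the chosen threshold."""
    return parse_failure_rate(responses) > ALERT_THRESHOLD
```

The same compute-a-rate-over-a-window shape works for refusal rate and finish-reason distribution shifts; only the per-response predicate changes.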
Q: How do I set baselines when LLM behavior is inherently variable?
You set baselines on the distribution of outputs, not on individual outputs. A single LLM response being "different" is not a signal. The average response length over 1,000 requests shifting by 40% is a signal. This is why you need to think in statistical terms rather than threshold terms for AI metrics.
Practically, this means: collect a week or two of production data before you try to set any alerts on AI-specific metrics. Let the distribution stabilize. Then set anomaly detection thresholds based on that baseline, using rolling windows rather than static values. Most modern observability platforms (Datadog, Honeycomb, Grafana with appropriate plugins) support this kind of dynamic baselining. Use it.
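A minimal sketch of that dynamic-baseline idea, using a rolling window and a z-score cutoff (both values illustrative; tune them against your own data). Hosted platforms implement something far more sophisticated, but the shape is the same:

```python
import statistics
from collections import deque

class RollingBaseline:
    """Flag values that deviate sharply from a rolling window,
    rather than comparing against a static threshold."""
    def __init__(self, window: int = 1000, z_cutoff: float = 4.0):
        self.values = deque(maxlen=window)
        self.z_cutoff = z_cutoff

    def observe(self, value: float) -> bool:
        """Record a value; return True if it looks anomalous
        relative to the current window."""
        anomalous = False
        if len(self.values) >= 30:  # need some history before judging
            mean = statistics.fmean(self.values)
            stdev = statistics.pstdev(self.values)
            if stdev > 0 and abs(value - mean) / stdev > self.z_cutoff:
                anomalous = True
        self.values.append(value)
        return anomalous
```

Feed it per-request values of whatever metric you care about (completion token count, TTFT, cost estimate) and alert on the anomaly flags rather than the raw values.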
Privacy, Cost, and the Uncomfortable Trade-offs
Q: I want to log prompts and completions for debugging, but they contain user data. What do I do?
This is a real tension and there's no perfect answer, but there are good practices. The approach most mature teams use is a tiered logging strategy:
- In development and staging: log everything. Full prompts, full completions, no redaction. You need this for debugging.
- In production: log metadata by default (token counts, latency, model version, finish reason). Log full content only when a request is sampled or flagged for review, and store that content in a separate, access-controlled store with a short retention window.
- For PII-sensitive domains (healthcare, finance, legal): apply PII detection and redaction before logging, or use hashing for prompt fingerprinting so you can correlate without storing raw content.
The key architectural principle is to decouple your observability pipeline from your serving path. Prompt content should flow to your observability store asynchronously, so that logging failures or redaction processing never add latency to user-facing requests.
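The production tier of that strategy can be sketched as a single decision point: metadata always, full content only for sampled or flagged requests. The field names and 1% sample rate are illustrative, and the access-controlled content store is elided:

```python
import hashlib
import random

def build_log_record(prompt: str, completion: str, meta: dict,
                     sample_rate: float = 0.01, flagged: bool = False) -> dict:
    """Tiered production logging: metadata by default, full content
    only for sampled or flagged requests.

    `meta` carries token counts, latency, model version, finish
    reason, and so on. The hash lets you correlate identical prompts
    without storing raw content.
    """
    record = dict(meta)
    record["prompt_sha256"] = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    if flagged or random.random() < sample_rate:
        # Full content would be routed to a separate, access-controlled
        # store with a short retention window (not shown here).
        record["prompt"] = prompt
        record["completion"] = completion
    return record
```

PII redaction, where required, slots in just before the content fields are set, and the whole function runs on the async observability path, never inline with the user request.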
Q: Logging all these tokens sounds expensive. How do I manage the cost of observability itself?
Observability cost is a real concern that many teams discover too late. A high-traffic AI service logging full prompt and completion text can generate observability data volumes that cost more than the model inference itself. The answer is intelligent sampling.
Head-based sampling (randomly sampling X% of requests at the entry point) is simple but loses rare failure cases. Tail-based sampling (making the sampling decision after the request completes, based on whether it was slow, errored, or anomalous) is more expensive to implement but far more valuable. With tail-based sampling, you can capture 100% of error cases and slow requests while sampling only 1-5% of happy-path requests, giving you high fidelity where it matters and low cost everywhere else.
Also: store prompt and completion text separately from your trace metadata. Metadata (latency, token counts, model version) is small and cheap to store long-term. Full text content is large and only needs to be stored for days or weeks in most cases. Separate retention policies for each will cut your observability bill significantly.
Tooling: What Should I Actually Use?
Q: Is there a standard toolchain for AI observability yet, or is it still the Wild West?
It's somewhere between the two, trending toward standardization. As of 2026, here's a reasonable lay of the land:
- OpenTelemetry remains the foundational standard for instrumentation. The GenAI semantic conventions are now stable enough to build on. Start here regardless of what backend you use.
- Langfuse has emerged as a strong open-source option specifically for LLM observability, with good support for prompt versioning, evals, and cost tracking. It integrates with OTel and can be self-hosted.
- Arize Phoenix is strong for RAG pipeline debugging and embedding analysis. If you're doing heavy RAG work, it's worth evaluating.
- Honeycomb remains excellent for high-cardinality trace analysis and is a natural fit for the kind of wide-event telemetry that AI systems produce.
- Datadog and Grafana have both shipped AI/LLM observability features, and if you're already on those platforms, the integration cost is low enough that you should use them rather than introducing a new tool.
The honest advice: don't let perfect be the enemy of good. Pick one backend, instrument your LLM calls with OTel today, and iterate. The tooling will continue to mature, but the biggest gap in most teams' AI observability is not the tool, it's the instrumentation.
Q: What about evaluating output quality? Is that observability?
Yes, and it's the frontier that most teams haven't reached yet. Structural observability (did the call succeed, how long did it take, how many tokens) is table stakes. Semantic observability (was the answer actually good) is the hard problem. Automated evaluation, or "evals," is the current best approach: running your outputs through a separate model or heuristic scorer to assess quality dimensions like accuracy, relevance, tone, and format compliance.
The practical pattern is to run evals asynchronously on a sample of production traffic and feed the scores back into your observability platform as metrics. When your eval scores drop, you have a quality regression signal even if no structural metrics changed. This is how you catch the insidious failures: the model that's still fast, still cheap, still returning valid JSON, but whose answers have quietly gotten worse.
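A heuristic scorer, the cheapest rung on the eval ladder, can be sketched like this. The score names and expected-key check are illustrative; a real setup layers model-graded scores for accuracy and tone on top:

```python
import json

def eval_scores(completion: str, expected_keys: set) -> dict:
    """Heuristic eval sketch: score one sampled completion on format
    compliance and field completeness, to be emitted as metrics."""
    scores = {"valid_json": 0.0, "key_coverage": 0.0}
    try:
        data = json.loads(completion)
    except json.JSONDecodeError:
        return scores  # unparseable output scores zero on both axes
    scores["valid_json"] = 1.0
    if expected_keys and isinstance(data, dict):
        present = expected_keys & set(data)
        scores["key_coverage"] = len(present) / len(expected_keys)
    return scores
```

Run it over a sample of production completions on the async path, emit the averages as gauges, and alert on drops with the same rolling-baseline machinery you use for structural metrics.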
Conclusion: You're Not Behind, You're Early
If you've read this far and feel like your current observability setup is inadequate for your AI-powered systems, you're not alone. The majority of engineering teams that have integrated LLMs into production backends are operating with instrumentation that would have been considered insufficient for a basic microservice five years ago. The discipline of AI observability is genuinely new, and the tooling, standards, and best practices are still solidifying.
The good news is that the foundations you already know (distributed tracing, structured logging, metric collection, alerting on distributions) still apply. You're not starting from zero. You're extending a skill set you already have into a domain that has some new wrinkles: non-determinism, probabilistic outputs, semantic quality, and the challenge of capturing rich context without creating a privacy or cost disaster.
Start small. Add a proper OTel span to your LLM calls this week, with token counts and model version as attributes. Version your prompts. Set up one alert on a metric you don't currently track. Each of those steps moves you from "I have no idea what my AI system is doing" to "I have a fighting chance of finding out." That's all observability has ever promised, and it's still worth the work.