The $180K Wake-Up Call: How One SaaS Team's Post-Mortem Exposed a Single Misconfigured Context Window and Led to a 60% Token Cost Reduction

It started with an invoice. A $180,000 monthly cloud bill for LLM API compute, up from $74,000 just two months prior. No new features had shipped. No significant user growth had occurred. The engineering team at Meridian Analytics (a mid-market B2B SaaS company providing AI-powered data intelligence to logistics firms) had simply upgraded their internal tooling to a multi-agent pipeline architecture, and somehow, their token spend had quietly spiraled into a financial emergency.

What followed was one of the most instructive post-mortems in the company's engineering history. The culprit was not a runaway loop, a broken rate limiter, or even a poorly prompted agent. It was a single, quietly misconfigured context window truncation strategy, buried in a shared utility function that every agent in the pipeline was calling. This is the story of how they found it, fixed it, and cut token spend by 60% without sacrificing a single percentage point of agent accuracy.

The Architecture: A Multi-Agent Pipeline Built for Speed

Meridian's platform allows logistics managers to ask natural language questions about their supply chain data. The backend processes these queries through a multi-agent orchestration layer built on a leading LLM API (at the time, using a 128K-context model). The pipeline looked roughly like this:

  • Agent 1 (Router Agent): Classifies the user query and routes it to the appropriate downstream agent.
  • Agent 2 (Data Retrieval Agent): Pulls relevant records from the data warehouse using tool calls and vector search.
  • Agent 3 (Reasoning Agent): Synthesizes retrieved data, performs multi-step reasoning, and drafts a response.
  • Agent 4 (Critic Agent): Reviews the reasoning agent's output for factual consistency and flags hallucinations.
  • Agent 5 (Formatter Agent): Converts the final answer into the user-facing format (tables, summaries, charts).

Each agent was stateless in isolation but shared context through a centralized conversation state object that was passed sequentially through the pipeline. This state object accumulated the full message history, tool call results, intermediate reasoning traces, and metadata annotations at every step.
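The article does not show the state object's schema, but the accumulation pattern it describes can be sketched as a minimal, hypothetical Python dataclass (field names are illustrative assumptions):

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a shared conversation state object like Meridian's.
# Field names and structure are illustrative; the article does not specify them.
@dataclass
class ConversationState:
    messages: list = field(default_factory=list)          # full message history
    tool_results: list = field(default_factory=list)      # raw tool call outputs
    reasoning_traces: list = field(default_factory=list)  # intermediate reasoning
    annotations: dict = field(default_factory=dict)       # critic/metadata notes

    def append_turn(self, role: str, content: str) -> None:
        # Every agent appends; nothing is ever trimmed, so the object
        # grows monotonically as it moves down the pipeline.
        self.messages.append({"role": role, "content": content})

state = ConversationState()
state.append_turn("user", "Which shipments are delayed this week?")
state.append_turn("assistant", "Routing to the data retrieval agent.")
```

The key property is that the object only ever grows: each hop adds messages, traces, and tool payloads, and no step removes anything.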

The design was intentional. The team wanted every downstream agent to have full visibility into what had happened upstream, reasoning that more context would yield better accuracy. And for a while, it did. The problem was in how that context was being managed as conversations grew longer.

The Misconfiguration: A Truncation Strategy That Was Never Actually Truncating

Early in the project, a senior engineer had written a utility function called prepare_context_window(). Its job was straightforward: before passing the state object to any agent, trim the message history so it fit comfortably within the model's context window, leaving a safe buffer for the agent's output tokens.

The function accepted a parameter called max_context_tokens, which was intended to be set per agent based on that agent's needs. The Router Agent, for example, only needed a small slice of recent history. The Reasoning Agent needed more. The Critic Agent needed the full reasoning trace but not the raw data chunks.

Here is where the misconfiguration lived. The max_context_tokens parameter had a default value of 128000, matching the model's full context window. When the team wired up the agents during the initial build sprint, they passed the state object to prepare_context_window() correctly in every case, but they never overrode the default. Every single agent was receiving the full, untruncated context window on every single call.

In other words, the truncation function existed, it was called on every request, and it did absolutely nothing.
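A hypothetical reconstruction makes the failure mode concrete. The article names prepare_context_window() and max_context_tokens; the function body and the character-based token estimate below are illustrative assumptions, not Meridian's actual code:

```python
def estimate_tokens(message: dict) -> int:
    # Crude stand-in token estimate (~4 characters per token) for the sketch.
    return max(1, len(message["content"]) // 4)

def prepare_context_window(messages: list, max_context_tokens: int = 128_000) -> list:
    """Trim oldest messages until the history fits within the token budget."""
    kept, total = [], 0
    for msg in reversed(messages):  # walk backward to keep the most recent turns
        cost = estimate_tokens(msg)
        if total + cost > max_context_tokens:
            break
        kept.append(msg)
        total += cost
    return list(reversed(kept))

# Every call site looked correct -- but with the 128K default never overridden,
# the function was a no-op for any realistic history:
history = [{"content": "x" * 400} for _ in range(50)]    # ~100 tokens each
assert prepare_context_window(history) == history        # nothing trimmed
assert len(prepare_context_window(history, max_context_tokens=1_000)) == 10
```

The trimming logic itself is sound; the bug is entirely in the default. A per-agent budget passed explicitly at each call site would have made the function do its job.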

For short conversations, this was invisible. The state object was small, and the token count was manageable. But as real enterprise users began running longer, multi-turn sessions with complex queries, the state object ballooned. By the time a query reached the Formatter Agent at the end of the pipeline, it was carrying the full message history, all tool call outputs (including raw JSON payloads from the data warehouse), all intermediate reasoning traces from the Reasoning Agent, and all critic annotations. The Formatter Agent, whose only job was to pretty-print a table, was consuming 80,000 to 110,000 tokens per call just to produce a few hundred tokens of output.

Multiplied across five agents, multiplied across thousands of daily queries, the math became brutal.

The Post-Mortem: Finding the Signal in the Noise

The team's initial assumption was that the cost spike was caused by increased usage volume. A quick check of their analytics dashboard disproved this. Query volume had grown by roughly 18% over the billing period in question, which could not explain a 143% increase in token spend.

Their second hypothesis was prompt bloat, a common issue where system prompts grow over time as engineers add more instructions. A prompt audit found some inefficiencies but nothing close to the scale needed to explain the numbers.

The breakthrough came when a junior engineer on the platform team, Priya Nair, pulled per-agent token consumption logs from their observability stack. The team had been logging total tokens per request but had never broken the numbers down by agent. When Priya ran the breakdown query, the results were striking:

  • Router Agent: Average 4,200 input tokens per call
  • Data Retrieval Agent: Average 6,800 input tokens per call
  • Reasoning Agent: Average 22,400 input tokens per call
  • Critic Agent: Average 58,000 input tokens per call
  • Formatter Agent: Average 94,000 input tokens per call

The pattern was immediately obvious. Token consumption was not just growing across the pipeline; it was compounding. Each agent was receiving everything the previous agents had produced, plus everything before that, plus the original data payloads. The Formatter Agent, the cheapest and simplest agent in the pipeline, was consuming more tokens than all the other agents combined.

A code review of prepare_context_window() confirmed the root cause in under ten minutes. The default parameter was the culprit, and the team had their post-mortem headline: "Truncation function called on every request. Truncation never actually triggered. Default parameter never overridden."

The Refactor: A Principled Context Management Strategy

Rather than simply patching the default value, the engineering team used the incident as an opportunity to build a proper context budgeting system. The refactor had four core components:

1. Per-Agent Context Budgets

Each agent was assigned an explicit, documented context budget based on a careful analysis of what information it actually needed to perform its function. The Router Agent, which only needed to classify the query and check for recent conversation turns, was capped at 6,000 tokens. The Reasoning Agent received up to 32,000 tokens. The Formatter Agent, which only needed the final synthesized answer and the user's original query, was capped at 4,000 tokens. These budgets were enforced at the infrastructure level, not left as optional parameters.
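One way to enforce budgets "at the infrastructure level, not left as optional parameters" is a mandatory registry with no default fallback. This sketch is an assumption about the mechanism; the router, reasoning, and formatter budgets come from the article, while the retrieval and critic values are placeholders:

```python
# Per-agent context budgets in tokens. Router, reasoning, and formatter
# values are from the article; retrieval and critic are placeholders.
AGENT_BUDGETS = {
    "router": 6_000,
    "retrieval": 16_000,
    "reasoning": 32_000,
    "critic": 24_000,
    "formatter": 4_000,
}

def budget_for(agent_name: str) -> int:
    # Raising instead of defaulting makes the budget mandatory: a new
    # agent cannot be wired into the pipeline without an explicit entry.
    try:
        return AGENT_BUDGETS[agent_name]
    except KeyError:
        raise ValueError(f"No context budget registered for agent {agent_name!r}")
```

The design choice mirrors the lesson of the incident: a lookup that fails loudly is safer than a parameter that silently falls back to the model's maximum.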

2. Structured Context Distillation

Instead of passing raw message history and raw tool outputs downstream, the team introduced a distillation step between each agent handoff. After each agent completed its work, a lightweight summarization pass compressed its outputs into a structured, token-efficient handoff payload. Tool call results, which had been the single largest contributor to token bloat (raw warehouse JSON payloads averaging 12,000 tokens each), were replaced with compressed semantic summaries of roughly 400 to 600 tokens. The full raw payloads were stored in a sidecar cache and only retrieved if a downstream agent explicitly requested them via a tool call.
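The handoff-plus-sidecar pattern can be sketched as follows. This is a minimal illustration, not Meridian's implementation: the in-memory dict stands in for a real cache service, and the summary is supplied by the caller where the real system used a summarization pass:

```python
import hashlib
import json

SIDECAR_CACHE: dict[str, str] = {}  # stand-in for a real cache service

def distill_tool_result(raw_payload: dict, summary: str) -> dict:
    """Store the raw payload in the sidecar cache; hand downstream agents
    only a compact summary plus a key for on-demand retrieval."""
    raw = json.dumps(raw_payload, sort_keys=True)
    key = hashlib.sha256(raw.encode()).hexdigest()[:16]
    SIDECAR_CACHE[key] = raw
    return {"summary": summary, "cache_key": key}

def fetch_raw_payload(cache_key: str) -> dict:
    # A downstream agent calls this via a tool call only when it
    # genuinely needs the raw data.
    return json.loads(SIDECAR_CACHE[cache_key])

handoff = distill_tool_result(
    {"rows": [{"shipment": "A-102", "status": "delayed"}] * 500},
    summary="500 warehouse rows; 1 distinct shipment, all delayed.",
)
```

Downstream agents see only the few-hundred-token summary; the 12,000-token payload costs nothing unless an agent explicitly pulls it back.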

3. Relevance-Gated History Injection

The team implemented a lightweight relevance scoring layer that evaluated each historical message turn against the current agent's task before injecting it into the context window. Turns with a relevance score below a calibrated threshold were excluded from the context. This was particularly impactful for the Critic Agent, which had been receiving the full conversation history even when most of it was unrelated to the specific claim being verified.
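The gating step can be sketched with a deliberately simple stand-in scorer. The article does not specify the scoring method; word-overlap (Jaccard) similarity below is an assumption, and a production system would more plausibly use embedding similarity:

```python
def relevance_score(turn: str, task: str) -> float:
    # Stand-in scorer: word overlap between a history turn and the task.
    # A real system would likely use embedding similarity instead.
    a, b = set(turn.lower().split()), set(task.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def gate_history(history: list[str], task: str, threshold: float = 0.1) -> list[str]:
    # Only turns scoring at or above the calibrated threshold reach the context.
    return [turn for turn in history if relevance_score(turn, task) >= threshold]

past_turns = [
    "User asked about delayed shipments in the Midwest region",
    "Agent retrieved 500 warehouse rows for delayed shipments",
    "User thanked the assistant for the weather small talk",
]
task = "Verify the claim that Midwest shipments are delayed"
focused = gate_history(past_turns, task)
```

Here the two shipment-related turns pass the gate while the small-talk turn is dropped, which is exactly the filtering the Critic Agent benefited from.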

4. Token-Aware Observability as a First-Class Metric

Perhaps the most durable change was organizational rather than technical. The team added per-agent token consumption as a tracked engineering metric with alerting thresholds. Any agent whose average input token count exceeded its budget by more than 15% would trigger a Slack alert to the platform team. This ensured that context bloat could never quietly compound again over a billing cycle without being caught.
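The alerting rule is simple enough to sketch directly. The function below is an illustrative assumption (the article describes the 15% threshold and Slack alerting, not the code); the slack_alert callable stands in for a real webhook client:

```python
def check_budget(agent: str, avg_input_tokens: float, budget: int,
                 slack_alert=print, tolerance: float = 0.15) -> bool:
    """Fire an alert when average input tokens exceed the budget by >15%.

    Returns True if an alert was sent, so callers can also count breaches.
    """
    if avg_input_tokens > budget * (1 + tolerance):
        slack_alert(f"[token-alert] {agent}: avg {avg_input_tokens:.0f} input "
                    f"tokens vs budget {budget} (+{tolerance:.0%} tolerance)")
        return True
    return False
```

Wired into the metrics pipeline, this turns context bloat from a quarterly billing surprise into a same-day Slack message.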

The Results: 60% Cost Reduction, Zero Accuracy Regression

The refactored pipeline was deployed to production over a two-week period, with a staged rollout that allowed the team to compare accuracy metrics between the old and new pipelines in parallel. The results exceeded expectations:

  • Token spend: Reduced by 61% in the first full billing cycle post-deployment
  • Agent accuracy (measured by human evaluation on a 500-query benchmark): No statistically significant change. The Reasoning Agent's accuracy on complex multi-hop queries actually improved by 3.2%, likely because the cleaner, more focused context reduced distraction from irrelevant historical turns.
  • Latency: Average end-to-end pipeline latency dropped by 28%, because smaller context windows meant faster model inference times.
  • Monthly LLM API spend: Fell from $180,000 to approximately $70,200, slightly below pre-spike levels despite the 18% growth in query volume.

The Formatter Agent's average input token count dropped from 94,000 to 3,800 tokens per call, a 96% reduction for the agent that needed the context the least.

The Broader Lesson: Context Is Infrastructure

The most important takeaway from Meridian's post-mortem is not technical. It is philosophical. The team had treated context management as a detail, a utility function tucked away in a shared library, something to configure once and forget. In a multi-agent system, that assumption is dangerous.

In a single-agent LLM application, a bloated context window is expensive but bounded. In a multi-agent pipeline, context bloat is multiplicative. Every token of unnecessary context passed to Agent 2 becomes part of the payload that Agent 3 receives, which becomes part of what Agent 4 receives. The inefficiency compounds at every hop. A 10% context inefficiency in a five-agent pipeline does not cost 10% more. Depending on pipeline architecture, it can cost several times more.
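The compounding can be made concrete with a toy model (an illustration, not a claim about Meridian's exact cost curve): in an accumulate-everything pipeline, tokens added to the state at one hop are re-read by that agent and every agent downstream of it.

```python
def total_input_cost(extra_tokens: int, n_agents: int, injected_at: int = 1) -> int:
    # Toy model of an accumulate-everything pipeline: tokens added to the
    # shared state at agent `injected_at` are read as input by that agent
    # and by every downstream agent.
    return extra_tokens * (n_agents - injected_at + 1)

# 1,000 wasted tokens entering at the first of five agents are billed
# as input five times over, not once:
assert total_input_cost(1_000, n_agents=5) == 5_000
```

This is why trimming early in the pipeline pays off disproportionately: waste injected at hop one is the most expensive waste in the system.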

The team also noted a subtler insight: more context does not always mean better accuracy. The assumption that every downstream agent benefits from full visibility into the entire conversation history is intuitive but often wrong. Agents perform better when their context is focused and relevant. Noise is not neutral. Irrelevant tokens do not simply go unnoticed; they compete with signal for the model's attention and can degrade output quality in ways that are difficult to detect without rigorous evaluation.

Practical Takeaways for Teams Building Multi-Agent Pipelines

If your team is building or operating a multi-agent LLM system in 2026, here are the concrete practices Meridian's experience points toward:

  • Audit your context at every agent boundary. Log per-agent input token counts from day one. Do not wait for a billing spike to discover what your pipeline is actually consuming.
  • Never use full context window size as a default. Treat the model's maximum context window as a hard ceiling, not a reasonable default. Default to the minimum your agent needs, and expand deliberately.
  • Distill, do not forward. Raw outputs from one agent should rarely be passed verbatim to the next. Build structured handoff payloads that carry meaning, not bulk.
  • Evaluate accuracy with and without context reduction. Do not assume that reducing context will hurt accuracy. Test it. You may find, as Meridian did, that focused context improves performance.
  • Make context budgets a code review concern. Any pull request that modifies agent context handling should require explicit justification for the token budget assigned. Treat it with the same seriousness as a database schema change.

Conclusion

Meridian Analytics did not have a runaway AI problem. They had a configuration problem that a multi-agent architecture turned into a financial emergency. The fix was not a new model, a new framework, or a new vendor. It was a principled approach to something that should have been designed carefully from the start: how much context each agent actually needs to do its job well.

In the current era of enterprise AI adoption, where multi-agent pipelines are rapidly becoming the default architecture for complex LLM applications, context management is not a backend detail. It is a core engineering discipline. The teams that treat it as such will build systems that are not only cheaper to run, but measurably more accurate, faster, and easier to debug.

The $180,000 invoice was painful. But the post-mortem it forced was, in the words of Meridian's CTO, "the best unplanned architecture review we have ever done."