How to Build a Per-Tenant AI Agent Context Window Budget Enforcement Pipeline That Stops Token Sprawl from Silently Inflating Inference Costs Across Heterogeneous Foundation Models
There is a quiet budget leak running inside thousands of AI-powered SaaS products right now. It does not trigger an alarm. It does not throw an exception. It simply accumulates, line by line, token by token, until your monthly inference bill arrives and the number on it looks like a phone number with too many digits.
The culprit is token sprawl: the gradual, uncontrolled expansion of context windows across concurrent AI agent sessions, multiplied across every tenant on your platform, compounded by the fact that different foundation models price tokens differently and count them differently. In a heterogeneous model environment where one tenant's workflow routes through GPT-5, another through Gemini 2.5 Ultra, and a third through a fine-tuned Llama-4 variant on your own GPU cluster, the problem is not just expensive. It is architecturally invisible unless you build the right enforcement layer.
This is a deep dive into how to build that layer. We will cover the theory, the data structures, the enforcement pipeline stages, and the operational patterns you need to make per-tenant context window budgeting a first-class concern in your multi-tenant SaaS architecture in 2026.
Why Token Sprawl Is a 2026 Problem, Not a 2023 Problem
Three years ago, most production LLM integrations were simple: a user sends a message, the app prepends a system prompt, and the model responds. Context windows were small (4K to 32K tokens), agents were largely stateless, and the cost surface was easy to reason about.
The landscape today is categorically different. Consider what a "normal" AI agent session looks like in a modern SaaS product in 2026:
- Long-horizon memory: Agents maintain rolling conversation summaries, tool call histories, and retrieved document chunks across dozens of turns.
- Multi-step tool use: Each tool call injects its output back into the context, often verbosely. A single web search result or database query response can add 2,000 to 8,000 tokens per invocation.
- Multi-agent orchestration: Orchestrator agents pass full sub-agent transcripts upstream, so context inflation compounds at every orchestration hop.
- Model context windows of 1M+ tokens: Models like Gemini 2.5 Ultra and GPT-5 Turbo support context windows exceeding one million tokens. This is a capability gift that becomes a cost liability when agents are not constrained from using it.
- Heterogeneous model routing: Cost-optimized routing means a single user session may touch three or four different models, each with its own tokenizer, pricing tier, and counting semantics.
The result is that a single runaway agentic session can consume more tokens in one hour than your entire product consumed in a week two years ago. Multiply that by hundreds of tenants with varying subscription tiers, and you have a structural cost risk that no amount of prompt engineering can fix retroactively.
The Core Concept: A Budget Enforcement Pipeline
A per-tenant context window budget enforcement pipeline is a set of interception, measurement, and enforcement components that sit between your application logic and your model inference layer. Its job is to ensure that no tenant's agent sessions consume more tokens than their allocated budget, and that when limits are approached, the system degrades gracefully rather than either crashing or silently overspending.
Think of it like a circuit breaker pattern, but for token consumption rather than service availability. The pipeline has four primary stages:
- Budget Allocation: Define per-tenant token budgets at multiple granularities (session, daily, monthly, per-model).
- Context Metering: Accurately count tokens before and after every inference call, normalized across heterogeneous models.
- Budget Enforcement: Apply hard and soft limits with configurable strategies (truncation, summarization, rejection, tier upgrade prompts).
- Observability and Attribution: Emit granular telemetry so you can audit, alert, and optimize at the tenant level.
Let us go through each stage in detail.
Stage 1: Budget Allocation and the Tenant Budget Model
Before you can enforce anything, you need a data model that represents what each tenant is allowed to spend. This model needs to be hierarchical, because real-world SaaS pricing is hierarchical.
The Budget Hierarchy
A robust tenant budget model operates at four levels:
- Plan-level budget: The total token envelope granted to a subscription tier per billing cycle (e.g., "Pro plan = 50M tokens/month across all models").
- Model-level budget: A sub-allocation per model family, since a token on GPT-5 costs more than a token on a self-hosted Llama-4 variant. You may want to cap expensive model usage specifically.
- Agent-type budget: Different agent types (customer support agent, code generation agent, data analysis agent) have different expected token footprints. Allocating budgets per agent type lets you catch runaway behavior in a specific workflow without penalizing the whole tenant.
- Session-level budget: The maximum tokens a single agent session can consume. This is your primary defense against runaway loops.
Here is an example schema in pseudocode for a tenant budget configuration object:
TenantBudgetConfig {
  tenant_id: string,
  billing_cycle_token_limit: int,   // e.g., 50_000_000
  billing_cycle_tokens_used: int,
  model_budgets: {
    "gpt-5-turbo": { monthly_limit: 10_000_000, session_limit: 50_000 },
    "gemini-2.5-ultra": { monthly_limit: 8_000_000, session_limit: 40_000 },
    "llama-4-finetune": { monthly_limit: 30_000_000, session_limit: 100_000 }
  },
  agent_type_budgets: {
    "support_agent": { session_limit: 20_000 },
    "code_agent": { session_limit: 60_000 },
    "analysis_agent": { session_limit: 80_000 }
  },
  soft_limit_threshold: 0.80,       // warn at 80% consumption
  hard_limit_strategy: "summarize", // or "truncate" | "reject" | "upgrade_prompt"
  reset_cadence: "monthly"
}
Store these configurations in a fast-read store (Redis or a low-latency document DB) because they will be read on every single inference call. The write path (updating tokens_used) should be asynchronous and eventually consistent, using a counter store like Redis INCR to avoid write contention at scale.
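The read-heavy, eventually consistent counter pattern can be sketched as follows. This is illustrative only: a plain dict stands in for Redis, and in production `record_usage` would map to an `INCRBYFLOAT` call with periodic reconciliation against the authoritative store.

```python
# Sketch of the eventually consistent usage counter described above.
# An in-memory dict stands in for Redis; the overage buffer absorbs the
# window of inconsistency between the fast counter and the ledger.

class UsageCounter:
    def __init__(self):
        self._counters = {}  # key -> approximate NTU total

    def record_usage(self, tenant_id: str, cycle: str, ntu: float) -> float:
        # In production: Redis INCRBYFLOAT on this key, fired asynchronously.
        key = f"usage:{tenant_id}:{cycle}"
        self._counters[key] = self._counters.get(key, 0.0) + ntu
        return self._counters[key]

    def would_exceed(self, tenant_id: str, cycle: str, estimated_ntu: float,
                     limit: float, buffer_pct: float = 0.05) -> bool:
        # Allow a small overage buffer since the counter may lag slightly.
        current = self._counters.get(f"usage:{tenant_id}:{cycle}", 0.0)
        return current + estimated_ntu > limit * (1 + buffer_pct)

counter = UsageCounter()
counter.record_usage("acme", "2026-01", 48_000_000)
print(counter.would_exceed("acme", "2026-01", 5_000_000, 50_000_000))  # True
```

The 5 percent buffer here mirrors the overage allowance discussed later under synchronous-write pitfalls; tune it to how stale your counter can get under load.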
Normalized Token Units (NTUs): Solving the Heterogeneous Model Problem
Here is where most teams make their first architectural mistake. They count tokens per model using each model's native tokenizer, then try to compare or aggregate across models. This breaks down because:
- GPT-5 uses a BPE tokenizer (tiktoken cl100k variant) that tokenizes differently from Gemini's SentencePiece tokenizer.
- The same 500-word paragraph may be 680 tokens in one model and 740 in another.
- Pricing is per-token but at different rates, so raw token counts are not a fair cost proxy across models.
The solution is to define a Normalized Token Unit (NTU) as your internal accounting currency. An NTU is a cost-weighted token unit anchored to a reference model price. For example, if your reference is $0.50 per million tokens (roughly a mid-tier model in 2026), then:
- A token on a $2.00/M model = 4 NTUs
- A token on a $0.10/M model = 0.2 NTUs
- A token on your self-hosted model (marginal GPU cost $0.05/M) = 0.1 NTUs
All budget limits and usage counters are expressed in NTUs. This gives you a single unified budget that is economically meaningful regardless of which model a tenant's session happens to use. Update the NTU conversion table whenever your model pricing changes, and store it as a versioned configuration artifact so historical usage records remain accurate.
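A minimal sketch of a versioned NTU conversion table, using the reference price above. The model names, prices, and version key are illustrative assumptions, not real pricing.

```python
# Versioned NTU conversion: 1 NTU = 1 token at the reference price.
# Prices and the version key are illustrative.

REFERENCE_PRICE_PER_M = 0.50  # $/M tokens for the reference model

NTU_TABLES = {
    "2026-01-v3": {
        "gpt-5-turbo": 2.00,       # $/M tokens
        "gemini-2.5-ultra": 1.60,
        "llama-4-finetune": 0.05,  # marginal self-hosted GPU cost
    },
}

def to_ntu(tokens: int, model_id: str, table_version: str) -> float:
    # Record table_version alongside every usage event so history stays auditable.
    price = NTU_TABLES[table_version][model_id]
    return tokens * (price / REFERENCE_PRICE_PER_M)

# A token on a $2.00/M model is worth 4 NTUs:
print(to_ntu(1_000, "gpt-5-turbo", "2026-01-v3"))  # 4000.0
```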
Stage 2: Context Metering at the Inference Proxy Layer
Accurate token counting requires an inference proxy: a middleware component that intercepts every request to every model endpoint before it is dispatched. This proxy is the single most important piece of infrastructure in the entire pipeline.
The Inference Proxy Architecture
The proxy should be implemented as a thin, high-performance service (Go or Rust are good choices for latency-sensitive deployments) with the following responsibilities:
- Pre-flight token estimation: Before sending the request, count the tokens in the outgoing context using the appropriate tokenizer for the target model. This gives you the input token count.
- Budget pre-check: Query the tenant's current NTU balance. If the estimated input tokens (converted to NTUs) would exceed the session or monthly limit, apply the configured enforcement strategy before the request is even sent.
- Response token counting: After receiving the model response, count the output tokens (either from the model's usage metadata or by counting the response directly).
- Ledger update: Asynchronously post the total NTU cost (input + output) to the tenant's usage ledger.
- Telemetry emission: Emit a structured event to your observability pipeline with full attribution metadata.
A critical design note: do not trust model-reported token counts as your sole source of truth. Model APIs do report usage in their responses, but there can be discrepancies between what was billed and what was reported, especially with streaming responses, cached prompt tokens, and speculative decoding artifacts. Count independently and reconcile periodically.
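The proxy's request path (pre-flight estimate, budget pre-check, dispatch, independent output counting, ledger update) can be sketched as below. All names are illustrative stand-ins: `count_tokens` for the tokenizer registry, `dispatch` for the model client, and `post_ledger` for the async ledger writer, which is called synchronously here for simplicity.

```python
# Sketch of the inference proxy's per-request flow, under the assumptions above.

def handle_inference(messages, model_id, *, count_tokens, ntu_rate,
                     remaining_ntu, dispatch, post_ledger):
    # 1. Pre-flight estimate of the outgoing context.
    input_tokens = sum(count_tokens(m, model_id) for m in messages)
    input_ntu = input_tokens * ntu_rate(model_id)

    # 2. Budget pre-check before anything is sent to the model.
    if input_ntu > remaining_ntu():
        return {"status": "budget_exceeded", "input_ntu": input_ntu}

    # 3. Dispatch, then count output tokens independently of the API's report.
    output_text = dispatch(messages, model_id)
    output_ntu = count_tokens(output_text, model_id) * ntu_rate(model_id)

    # 4. Ledger update (asynchronous in production).
    post_ledger(input_ntu + output_ntu)
    return {"status": "ok", "output": output_text,
            "total_ntu": input_ntu + output_ntu}
```

Note that step 3 counts the response itself rather than trusting the provider's usage metadata, in line with the reconciliation guidance above.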
Tokenizer Normalization at Scale
Maintaining accurate tokenizers for every model in your fleet is an operational burden. A practical approach is to use a tokenizer registry service that maps model identifiers to tokenizer implementations, and to pre-tokenize context segments that are reused across requests (system prompts, retrieved documents, tool schemas). Cache these pre-tokenized lengths aggressively. Since system prompts and tool definitions rarely change between requests, you can avoid re-tokenizing them on every call.
For models where you do not have direct tokenizer access (some closed-API models), use a calibration approach: send a set of benchmark texts to the model and compare reported token counts to your estimator's output. Maintain a per-model calibration factor (typically within 2 to 5 percent error) and apply it as a conservative multiplier to your estimates.
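The calibration step reduces to a simple conservative ratio. A minimal sketch, assuming you have collected the provider-reported counts and your estimator's counts over the same benchmark texts:

```python
# Per-model calibration factor: the worst observed under-estimate wins,
# and we never scale estimates downward (conservative by design).

def calibration_factor(reported_counts, estimated_counts) -> float:
    ratios = [r / e for r, e in zip(reported_counts, estimated_counts)]
    return max(1.0, max(ratios))

# Estimator undercounted by up to ~4% on this benchmark set:
print(calibration_factor([1040, 515, 2080], [1000, 500, 2000]))  # 1.04
```

Multiply every pre-flight estimate for that model by this factor; re-run the calibration whenever the provider ships a tokenizer or API change.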
Stage 3: Budget Enforcement Strategies
This is where the pipeline earns its keep. When a session approaches or exceeds its budget, you have several enforcement strategies available. The right choice depends on your product's tolerance for degradation and the nature of the agent task.
Strategy 1: Hard Truncation
The simplest strategy: trim the oldest or least-relevant messages from the context to bring the token count back under the limit. This is fast and deterministic, but it can break agent reasoning if the truncated content contained important earlier instructions or tool results.
Best practice for truncation: always preserve the system prompt and the most recent N turns, and truncate from the middle of the conversation history. Use a relevance scorer (a lightweight embedding similarity check against the current user query) to decide which middle segments to drop first.
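The middle-truncation rule can be sketched as follows. The `relevance` callable is a stand-in for the embedding-similarity scorer; in this sketch a lower score means a segment is dropped sooner.

```python
# Sketch of middle-of-history truncation: keep the system prompt and the most
# recent N turns, drop middle segments in ascending relevance order until the
# context fits under the limit.

def truncate_middle(messages, limit, keep_recent=4, relevance=lambda seg: 0.0):
    """messages: (text, token_count) tuples; messages[0] is the system prompt.

    Assumes len(messages) > keep_recent + 1.
    """
    system, middle, recent = (messages[:1], messages[1:-keep_recent],
                              messages[-keep_recent:])
    total = sum(tokens for _, tokens in messages)
    # Drop the least-relevant middle segments first.
    for seg in sorted(middle, key=relevance):
        if total <= limit:
            break
        middle = [m for m in middle if m is not seg]
        total -= seg[1]
    return system + middle + recent
```

Because the system prompt and recent turns are never candidates for removal, a pathological relevance scorer can degrade quality but cannot strip the agent's standing instructions.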
Strategy 2: Hierarchical Summarization
Rather than truncating, compress older context segments into a summary using a cheap, fast model (a small local model or a low-cost API model). The summary replaces the raw transcript, freeing up budget while preserving semantic continuity.
This is the most user-experience-friendly strategy, but it has a cost: the summarization call itself consumes tokens. Budget for this overhead. A good rule of thumb is that a well-prompted summarization of a 10,000-token conversation segment produces a 500 to 800 token summary, a compression ratio of roughly 12 to 20x. The summarization call itself consumes roughly 10,500 to 10,800 tokens (the full segment as input plus the summary as output), so be sure the freed budget justifies that cost. Apply summarization proactively when a session reaches 70 percent of its budget, not reactively at 100 percent.
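The break-even reasoning can be made explicit in NTUs: the summarization call is a one-time cost on the cheap model, while the saving recurs on every remaining turn of the main model. The rates and turn count below are illustrative assumptions.

```python
# Does summarizing pay for itself? One-time cheap-model cost vs. recurring
# main-model saving over the session's remaining turns.

def summarization_saves_ntu(segment_tokens, summary_tokens, remaining_turns,
                            main_rate_ntu, cheap_rate_ntu):
    # One-time cost: the cheap model reads the segment and emits the summary.
    cost = (segment_tokens + summary_tokens) * cheap_rate_ntu
    # Saving: every later turn re-sends the compressed history to the main model.
    saved = (segment_tokens - summary_tokens) * remaining_turns * main_rate_ntu
    return saved > cost

# 10,000-token segment -> 600-token summary, 5 turns left,
# main model at 4 NTU/token, summarizer at 0.2 NTU/token:
print(summarization_saves_ntu(10_000, 600, 5, 4.0, 0.2))  # True
```

This also explains the 70-percent trigger: summarizing early leaves more turns over which the saving accrues, which is exactly when the inequality is easiest to satisfy.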
Strategy 3: Tool Output Compression
In agentic workflows, tool outputs are often the primary driver of context inflation. A database query that returns a 500-row result set should never be injected into the context verbatim. Instead, apply a tool output compression middleware that:
- Extracts only the rows or fields relevant to the agent's current sub-goal (using a lightweight extraction model or structured query).
- Converts verbose JSON structures into compact representations.
- Caches tool outputs by input hash so identical tool calls in the same session do not re-inject duplicate content.
Tool output compression is often the single highest-ROI optimization in the entire pipeline because tool outputs are both large and highly compressible.
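The caching piece of that middleware is straightforward to sketch. Class and method names here are illustrative; the key idea is hashing a canonical form of the call so argument ordering does not defeat the cache.

```python
# Per-session tool output cache keyed by a hash of the call, so an identical
# tool call in the same session never re-runs or re-injects duplicate content.

import hashlib
import json

class ToolOutputCache:
    def __init__(self):
        self._cache = {}

    @staticmethod
    def _key(tool_name, args):
        # Canonical JSON so argument order does not change the hash.
        payload = json.dumps({"tool": tool_name, "args": args}, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def get_or_run(self, tool_name, args, run):
        k = self._key(tool_name, args)
        if k not in self._cache:
            self._cache[k] = run()  # first call: execute and store
        return self._cache[k]       # repeat calls: reuse the stored output
```

On a cache hit the agent context can reference the earlier output ("see result of query above") instead of injecting the payload a second time.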
Strategy 4: Graceful Rejection with Upgrade Prompting
When a session has genuinely exhausted its budget and compression cannot help further, the agent should stop gracefully and surface a meaningful message to the user. For SaaS products, this is also a monetization touchpoint: a well-crafted "You've reached your AI usage limit for this month. Upgrade to Pro for 5x more capacity" message converts significantly better than an opaque error.
Implement rejection as a first-class response type in your agent framework, not as an exception. The agent should detect the budget-exceeded signal from the proxy and generate a structured rejection response rather than crashing or silently returning an empty result.
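One way to model that first-class response type is a plain dataclass the agent framework returns instead of raising. The field names, message text, and plan names below are illustrative assumptions.

```python
# Rejection as a structured response type rather than an exception.

from dataclasses import dataclass
from typing import Optional

@dataclass
class AgentResponse:
    status: str                        # "ok" | "budget_exceeded"
    content: str
    upgrade_hint: Optional[str] = None # monetization touchpoint, if applicable

def budget_rejection(tenant_plan: str) -> AgentResponse:
    return AgentResponse(
        status="budget_exceeded",
        content="You've reached your AI usage limit for this billing cycle.",
        upgrade_hint=f"Upgrade from {tenant_plan} for more capacity.",
    )
```

Because it is an ordinary return value, the orchestrator, the UI layer, and the telemetry pipeline can all branch on `status` without try/except plumbing.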
Strategy 5: Model Downgrade Routing
A more sophisticated strategy: when a session approaches its NTU budget on an expensive model, automatically route subsequent turns to a cheaper model that can handle the task adequately. For example, if a code generation agent has consumed 80 percent of its GPT-5 session budget, route the remaining turns through a capable but cheaper open-weight model on your own infrastructure.
This requires your agent orchestration layer to support dynamic model switching mid-session, which is a non-trivial engineering investment. But for high-volume tenants, it can reduce per-session costs by 40 to 60 percent without meaningfully degrading output quality for simpler sub-tasks.
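The routing decision itself is a one-line threshold check; the orchestration-layer model switching is where the real engineering effort lives. Model names and the 80 percent threshold below are illustrative.

```python
# Downgrade routing: once a session crosses the threshold of its
# expensive-model budget, later turns route to the cheaper fallback.

def pick_model(session_ntu_used, session_ntu_limit,
               primary="gpt-5-turbo", fallback="llama-4-finetune",
               threshold=0.80):
    if session_ntu_used >= threshold * session_ntu_limit:
        return fallback
    return primary

print(pick_model(42_000, 50_000))  # llama-4-finetune (84% of budget consumed)
```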
Stage 4: Observability, Attribution, and the Cost Intelligence Layer
Enforcement without observability is blind. You need a cost intelligence layer that gives you and your tenants full visibility into token consumption patterns.
The Token Telemetry Event Schema
Every inference call processed by your proxy should emit a structured event with at minimum the following fields:
TokenUsageEvent {
  timestamp: ISO8601,
  tenant_id: string,
  session_id: string,
  agent_type: string,
  model_id: string,
  input_tokens_raw: int,
  output_tokens_raw: int,
  input_ntu: float,
  output_ntu: float,
  total_ntu: float,
  session_ntu_cumulative: float,
  monthly_ntu_cumulative: float,
  session_budget_pct_used: float,
  monthly_budget_pct_used: float,
  enforcement_action_taken: string | null, // "summarized", "truncated", "rejected", null
  tool_calls_in_context: int,
  context_compression_ratio: float | null,
  routing_model_override: string | null
}
Ship these events to your data warehouse (BigQuery, Snowflake, or Redshift) for long-term analysis and to a real-time stream (Kafka or Kinesis) for alerting and dashboarding.
Alerting Patterns
Build alerts for the following conditions:
- Tenant approaching monthly budget: Alert your customer success team when a tenant reaches 80 percent of their monthly NTU budget. This is a churn risk and an upsell opportunity simultaneously.
- Session budget anomaly: Alert when a single session consumes more than 3x the median session NTU for that agent type. This usually indicates a prompt injection attack, an infinite tool loop, or a misconfigured agent.
- Model pricing drift: Alert when the NTU conversion table has not been updated in more than 7 days. Model pricing changes frequently in 2026, and stale conversion factors will corrupt your budget accounting.
- Enforcement action spike: Alert when the rate of enforcement actions (truncations, rejections) for a tenant increases sharply. This often signals a product usage pattern change that warrants a plan upgrade conversation.
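The session-anomaly rule above reduces to a median comparison over recent sessions of the same agent type. A minimal sketch, with the 3x factor from the alert definition:

```python
# Flag a session whose NTU consumption exceeds 3x the median for its agent type.

from statistics import median

def is_anomalous(session_ntu: float, recent_session_ntus: list,
                 factor: float = 3.0) -> bool:
    return session_ntu > factor * median(recent_session_ntus)

# Median of recent sessions is 2,800 NTU; 9,500 blows past the 8,400 threshold:
print(is_anomalous(9_500, [2_000, 3_000, 3_100, 2_800, 2_500]))  # True
```

Median rather than mean keeps one prior runaway session from raising the bar for detecting the next one.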
Tenant-Facing Usage Dashboards
Expose usage data to your tenants directly. Tenants who can see their own token consumption are more likely to optimize their own usage patterns, less likely to be surprised by limit hits, and more likely to proactively upgrade their plans. A good tenant usage dashboard shows:
- Current billing cycle NTU consumption vs. limit (with a progress bar).
- Consumption breakdown by agent type and model.
- Top sessions by NTU cost (so tenant admins can identify runaway workflows).
- 7-day rolling trend with a projected end-of-cycle consumption estimate.
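The projected end-of-cycle estimate in the last bullet is a straightforward extrapolation of the 7-day burn rate. A minimal sketch:

```python
# Project end-of-cycle consumption from the 7-day rolling burn rate.

def projected_cycle_total(used_so_far: float, last_7_days_usage: float,
                          days_remaining: int) -> float:
    daily_rate = last_7_days_usage / 7
    return used_so_far + daily_rate * days_remaining

# 30M NTU used, 7M NTU in the last week, 10 days left in the cycle:
print(projected_cycle_total(30_000_000, 7_000_000, 10))  # 40000000.0
```

If the projection exceeds the limit, the dashboard can surface an early warning well before any enforcement action fires.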
Putting It All Together: The Reference Architecture
Here is how all four stages connect in a complete reference architecture for a multi-tenant SaaS platform:
- Agent Framework (LangGraph, AutoGen 3.x, or custom) constructs the context window and prepares an inference request.
- The request is intercepted by the Inference Proxy, which runs pre-flight token counting using the tokenizer registry.
- The proxy queries the Budget Store (Redis) for the tenant's current NTU balance.
- If the budget check passes, the proxy applies any configured context compression (tool output compression, proactive summarization) and forwards the request to the model endpoint.
- If the budget check fails, the proxy applies the tenant's configured enforcement strategy (truncation, rejection, model downgrade) before forwarding or blocking the request.
- The model response is received. The proxy counts output tokens, converts to NTUs, and posts an async ledger update to Redis.
- A structured TokenUsageEvent is emitted to the telemetry stream.
- The Cost Intelligence Layer consumes the telemetry stream, powers dashboards, and fires alerts.
- A nightly reconciliation job compares proxy-counted NTUs against model API billing records and flags discrepancies above a 2 percent threshold.
Common Pitfalls and How to Avoid Them
Pitfall 1: Counting Tokens Only on Output
Many teams only count output tokens because that is what the model API reports most prominently. Input tokens (your entire context window) often cost as much or more than output tokens, especially with large context models. Always count and budget both directions.
Pitfall 2: Ignoring Cached Prompt Tokens
Several model APIs in 2026 offer prompt caching, where repeated context prefixes are charged at a reduced rate (typically 10 to 25 percent of the full price). Your NTU conversion logic must account for cached vs. uncached token pricing. Failing to do so will cause your cost estimates to be consistently higher than actual billing, which erodes trust in your budget system.
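Accounting for cached prefixes means the input-side NTU calculation needs a second term. A sketch, assuming the 10 percent cached rate from the range above; actual discounts vary by provider:

```python
# Input NTU with cached prompt tokens billed at a fraction of the full rate.
# The 10% default discount is an illustrative assumption.

def input_ntu(uncached_tokens, cached_tokens, ntu_rate, cached_discount=0.10):
    return (uncached_tokens + cached_tokens * cached_discount) * ntu_rate

# 2,000 fresh tokens plus 8,000 cached tokens at 4 NTU/token:
print(input_ntu(2_000, 8_000, 4.0))  # 11200.0
```

Without the discount term this request would be booked at 40,000 NTU, a nearly 4x overstatement of actual cost.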
Pitfall 3: Synchronous Budget Writes Under Load
Writing to the budget ledger synchronously on every inference call is a latency and availability risk. Use an async write pattern with Redis INCR for real-time approximate balances, and reconcile with your authoritative store (PostgreSQL or equivalent) on a short interval (every 30 to 60 seconds). Accept that your budget enforcement is eventually consistent, and build in a small overage buffer (5 to 10 percent) to absorb the window of inconsistency.
Pitfall 4: One-Size-Fits-All Session Limits
A session limit that works for a customer support agent will be far too tight for a long-running data analysis agent and far too loose for a simple Q&A agent. Invest in per-agent-type budget profiles from the start. Retrofitting this later is painful.
Pitfall 5: Not Versioning the NTU Conversion Table
Model prices change. If you update your NTU conversion table without versioning it, historical usage records become meaningless and your billing reconciliation breaks. Every NTU conversion event should record the conversion table version it used.
Conclusion: Budget Enforcement Is a Product Feature, Not an Ops Afterthought
Token sprawl is not a bug in your AI agents. It is the natural consequence of giving capable agents access to large context windows without economic guardrails. In 2026, as agentic AI becomes the default interaction paradigm in SaaS products, the teams that treat per-tenant context window budget enforcement as a first-class architectural concern will have a durable competitive advantage: predictable unit economics, scalable multi-tenancy, and the kind of cost transparency that builds enterprise trust.
The pipeline described here is not simple to build. It requires an inference proxy, a tokenizer registry, a budget store, a telemetry pipeline, and thoughtful enforcement strategies. But every component solves a real problem that will only grow larger as your tenant base and model diversity expand.
Start with the inference proxy and the Redis budget store. Get accurate token counting in place first. Everything else can be layered on incrementally. The alternative, discovering token sprawl after the fact on your cloud invoice, is a significantly more expensive way to learn the same lesson.