How to Build a Per-Tenant AI Agent Cost Attribution Pipeline: A Complete Guide for the 2026 Multi-Model Billing Complexity Crisis

If you are running a SaaS platform powered by AI agents in 2026, you already know the pain. Your platform might be routing requests to GPT-5, Claude Opus 4, Gemini Ultra 2, Mistral Large, and a handful of fine-tuned open-source models, all within a single user session. Each model charges differently. Each tool call has a cost. And somewhere in your finance team's spreadsheet, a very stressed analyst is trying to figure out why your AI inference bill tripled last quarter and which subscription tier is actually profitable.

Welcome to the 2026 Multi-Model Billing Complexity Crisis: the moment when AI-powered SaaS platforms discovered that their cost infrastructure was never designed for agentic, multi-provider, multi-tool workloads. The good news is that this problem is entirely solvable with the right attribution pipeline. This tutorial will walk you through building one from scratch.

Why Traditional Usage Metering Completely Breaks Down for AI Agents

Legacy SaaS billing systems were built around discrete, predictable units: API calls, seats, storage gigabytes. AI agents break every assumption in that model. A single user action can trigger a chain of tool calls, sub-agent invocations, retrieval-augmented generation (RAG) lookups, and model-switching decisions, all happening autonomously and asynchronously. By the time a response reaches your tenant, costs have already been incurred across three providers and a dozen function calls.

Here is what makes per-tenant attribution so difficult in 2026:

  • Model routing is dynamic. Orchestration frameworks like LangGraph, AutoGen, and CrewAI route to different models based on task complexity, latency requirements, or cost thresholds. There is no static "this tenant uses Model X" mapping.
  • Tool calls have variable costs. A web search tool, a code execution sandbox, and a vector database retrieval each carry different cost structures that are separate from token costs.
  • Prompt caching is non-deterministic. Cache hits from providers like Anthropic and OpenAI reduce costs, but only sometimes, and the savings are not automatically reflected in your internal accounting.
  • Context windows are enormous. With 1M+ token context windows now common, a single agentic loop can consume more tokens than an entire day of legacy chatbot traffic.
  • Multi-tenancy is implicit. Many platforms share infrastructure across tenants, meaning cost signals are aggregated at the infrastructure level, not the tenant level.

The solution is a cost attribution pipeline: a dedicated data path that intercepts every inference call, enriches it with tenant and context metadata, prices it in real time, and streams the result to both your billing system and your finance team's dashboards.

The Architecture: A Four-Layer Attribution Pipeline

Before writing any code, it helps to understand the four layers your pipeline needs to implement.

Layer 1: The Instrumentation Layer (Emit Cost Events)

Every AI call your platform makes must emit a structured cost event. This is the foundation. Without it, you are flying blind.

Layer 2: The Enrichment Layer (Tag with Context)

Raw cost events need to be enriched with tenant ID, subscription tier, agent ID, session ID, tool call type, and model provider before they are useful.

Layer 3: The Pricing Engine (Translate Tokens to Dollars)

Token counts and tool call counts must be converted to dollar amounts using a real-time or near-real-time pricing table that accounts for model, tier (input vs. output vs. cached), and provider-specific pricing quirks.

Layer 4: The Reconciliation Layer (Sync with Billing)

Priced, enriched events must be aggregated and pushed to your billing system so finance teams can reconcile AI spend against subscription tier commitments in real time.

Step 1: Instrument Every AI Call with a Structured Cost Event

The first step is to wrap every outbound call to any model provider in a thin instrumentation layer. The goal is to capture a CostEvent object at the moment of each inference call, before the response is returned to the caller.

Here is a Python example using a universal wrapper pattern:


import time
import uuid
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class CostEvent:
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: float = field(default_factory=time.time)
    tenant_id: str = ""
    session_id: str = ""
    agent_id: str = ""
    model_provider: str = ""   # e.g., "openai", "anthropic", "google"
    model_name: str = ""       # e.g., "gpt-5", "claude-opus-4"
    input_tokens: int = 0
    output_tokens: int = 0
    cached_tokens: int = 0
    tool_call_type: Optional[str] = None  # e.g., "web_search", "code_exec"
    tool_call_count: int = 0
    latency_ms: float = 0.0
    raw_cost_usd: float = 0.0  # filled by pricing engine
    subscription_tier: str = ""

Now wrap your LLM client calls. The key is to use a context propagation pattern so that tenant metadata flows from the HTTP request context all the way down into the AI call, even across async boundaries:


import contextvars
import time
from typing import Optional

from your_llm_clients import openai_client, anthropic_client

# Context variable that carries tenant metadata through async call chains
tenant_context: contextvars.ContextVar[dict] = contextvars.ContextVar(
    "tenant_context", default={}
)

async def instrumented_chat_completion(
    provider: str,
    model: str,
    messages: list,
    tools: Optional[list] = None,
    **kwargs
) -> dict:
    ctx = tenant_context.get()
    start = time.time()

    # Route to the correct provider client
    if provider == "openai":
        response = await openai_client.chat.completions.create(
            model=model, messages=messages, tools=tools, **kwargs
        )
        usage = response.usage
        input_tokens = usage.prompt_tokens
        output_tokens = usage.completion_tokens
        # prompt_tokens_details is an object, not a dict, so chain getattr
        details = getattr(usage, "prompt_tokens_details", None)
        cached_tokens = getattr(details, "cached_tokens", 0) or 0
    elif provider == "anthropic":
        response = await anthropic_client.messages.create(
            model=model, messages=messages, tools=tools or [], **kwargs
        )
        usage = response.usage
        input_tokens = usage.input_tokens
        output_tokens = usage.output_tokens
        cached_tokens = getattr(usage, "cache_read_input_tokens", 0) or 0
    else:
        raise ValueError(f"Unsupported provider: {provider}")

    latency = (time.time() - start) * 1000

    event = CostEvent(
        tenant_id=ctx.get("tenant_id", "unknown"),
        session_id=ctx.get("session_id", ""),
        agent_id=ctx.get("agent_id", ""),
        subscription_tier=ctx.get("subscription_tier", "free"),
        model_provider=provider,
        model_name=model,
        input_tokens=input_tokens,
        output_tokens=output_tokens,
        cached_tokens=cached_tokens,
        # Number of tools offered on this call; per-invocation tool costs
        # are tagged separately in the enrichment layer
        tool_call_count=len(tools) if tools else 0,
        latency_ms=latency,
    )

    # Emit to the pipeline asynchronously (fire-and-forget)
    await cost_event_queue.put(event)
    return response

The critical design decision here is using Python's contextvars.ContextVar for tenant propagation. This ensures that even when your agent framework spawns sub-agents or parallel tool calls, the tenant identity is never lost. Set the context variable at your API gateway or request handler layer, and it propagates automatically.
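To make the propagation guarantee concrete, here is a minimal, self-contained sketch (the tenant value and function names are placeholders, not part of the pipeline above): two parallel "model calls" spawned as tasks both see the tenant set once at the request boundary, because each task inherits a copy of the current context.

```python
import asyncio
import contextvars

tenant_context: contextvars.ContextVar[dict] = contextvars.ContextVar(
    "tenant_context", default={}
)

async def fake_model_call() -> str:
    # Deep inside the agent, the tenant set at the gateway is still visible
    return tenant_context.get().get("tenant_id", "unknown")

async def handle_request(tenant_id: str) -> list:
    # Set once at the request boundary (API gateway / request handler)
    tenant_context.set({"tenant_id": tenant_id})
    # Parallel sub-agent tasks each inherit a copy of the current context
    return await asyncio.gather(fake_model_call(), fake_model_call())

results = asyncio.run(handle_request("acme-corp"))
# Both parallel calls resolve to "acme-corp"
```

Note that the inheritance happens at task creation time: tasks copy the context that is current when they are spawned, so set the variable before fanning out.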

Step 2: Build the Enrichment and Tool Call Tagging Layer

Raw cost events need enrichment before they can be priced. The enrichment layer has two jobs: resolving tool call types to cost categories, and validating that the tenant ID maps to a real subscription record.

Define a tool cost taxonomy that maps tool call names to billing categories. This is important because different tools have very different cost profiles:


TOOL_COST_TAXONOMY = {
    # Web & Search
    "web_search": {"category": "retrieval", "base_cost_usd": 0.003},
    "news_search": {"category": "retrieval", "base_cost_usd": 0.003},
    "url_fetch": {"category": "retrieval", "base_cost_usd": 0.001},

    # Code & Compute
    "code_interpreter": {"category": "compute", "base_cost_usd": 0.015},
    "bash_exec": {"category": "compute", "base_cost_usd": 0.010},

    # Data & Retrieval
    "vector_search": {"category": "retrieval", "base_cost_usd": 0.0005},
    "sql_query": {"category": "data", "base_cost_usd": 0.002},

    # External APIs (pass-through pricing)
    "maps_lookup": {"category": "external_api", "base_cost_usd": 0.005},
    "email_send": {"category": "external_api", "base_cost_usd": 0.001},
}

async def enrich_cost_event(event: CostEvent, tool_name: Optional[str] = None) -> CostEvent:
    # Tag tool call type and category
    if tool_name and tool_name in TOOL_COST_TAXONOMY:
        taxonomy = TOOL_COST_TAXONOMY[tool_name]
        event.tool_call_type = taxonomy["category"]
        # Stash the tool's base cost as a transient attribute (not a dataclass
        # field) for the pricing engine to read via getattr
        event._tool_base_cost = taxonomy["base_cost_usd"]

    # Validate and enrich tenant metadata from your tenant store
    tenant_record = await tenant_store.get(event.tenant_id)
    if tenant_record:
        event.subscription_tier = tenant_record.subscription_tier
    else:
        event.subscription_tier = "unknown"
        # Flag for alerting: unattributed spend is a finance risk
        await alert_queue.put(f"Unresolved tenant_id: {event.tenant_id}")

    return event

Step 3: Build the Real-Time Pricing Engine

This is the heart of the pipeline. The pricing engine converts token counts and tool call counts into dollar amounts using a live pricing table that you maintain and version-control. Hard-coding prices is a trap: model providers changed pricing multiple times in the past 18 months, and a stale price table will silently corrupt your financial data.

Store your pricing table in a fast key-value store (Redis works well) with a fallback to a config file. Here is the structure:


# pricing_table.yaml (version-controlled, loaded into Redis on deploy)
pricing:
  openai:
    gpt-5:
      input_per_million_tokens: 15.00
      output_per_million_tokens: 60.00
      cached_input_per_million_tokens: 3.75
    gpt-4o-mini:
      input_per_million_tokens: 0.15
      output_per_million_tokens: 0.60
      cached_input_per_million_tokens: 0.075
  anthropic:
    claude-opus-4:
      input_per_million_tokens: 18.00
      output_per_million_tokens: 90.00
      cached_input_per_million_tokens: 1.80
    claude-sonnet-4:
      input_per_million_tokens: 3.00
      output_per_million_tokens: 15.00
      cached_input_per_million_tokens: 0.30
  google:
    gemini-ultra-2:
      input_per_million_tokens: 10.00
      output_per_million_tokens: 30.00
      cached_input_per_million_tokens: 2.50

Now build the pricing engine itself:


class PricingEngine:
    def __init__(self, redis_client, pricing_table: dict):
        self.redis = redis_client       # live table, refreshed on deploy
        self.pricing = pricing_table    # in-process fallback copy

    def calculate_cost(self, event: CostEvent) -> float:
        provider_pricing = self.pricing.get(event.model_provider, {})
        model_pricing = provider_pricing.get(event.model_name, None)

        if not model_pricing:
            # Fallback: use a conservative default to avoid undercharging
            return self._fallback_cost(event)

        # Separate cached from non-cached input tokens (clamp at zero in
        # case a provider reports cached tokens outside the prompt count)
        non_cached_input = max(event.input_tokens - event.cached_tokens, 0)

        input_cost = (
            non_cached_input / 1_000_000
        ) * model_pricing["input_per_million_tokens"]

        cached_cost = (
            event.cached_tokens / 1_000_000
        ) * model_pricing["cached_input_per_million_tokens"]

        output_cost = (
            event.output_tokens / 1_000_000
        ) * model_pricing["output_per_million_tokens"]

        # Add tool call costs if present
        tool_cost = getattr(event, "_tool_base_cost", 0.0) * event.tool_call_count

        total = input_cost + cached_cost + output_cost + tool_cost
        return round(total, 8)  # Keep 8 decimal places for micro-transactions

    def _fallback_cost(self, event: CostEvent) -> float:
        # Conservative fallback: $0.01 per 1K tokens total
        total_tokens = event.input_tokens + event.output_tokens
        return round((total_tokens / 1000) * 0.01, 8)

One important nuance: always price with 8 decimal places internally. When tenants are on high-volume tiers, sub-cent differences compound into meaningful reconciliation errors at the end of the month.
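As a concrete check of the arithmetic, here is a worked example for a hypothetical claude-sonnet-4 call priced from the table above: 12,000 input tokens of which 8,000 were cache hits, plus 1,500 output tokens. It uses Decimal (constructed from strings) to show what exact, drift-free pricing looks like:

```python
from decimal import Decimal

def price(tokens: int, usd_per_million: str) -> Decimal:
    # Exact per-token pricing: tokens / 1M * rate, with no float rounding
    return Decimal(tokens) / Decimal(1_000_000) * Decimal(usd_per_million)

input_cost  = price(12_000 - 8_000, "3.00")  # non-cached input @ $3.00/M
cached_cost = price(8_000, "0.30")           # cached input @ $0.30/M
output_cost = price(1_500, "15.00")          # output @ $15.00/M

total = input_cost + cached_cost + output_cost
# 0.012 + 0.0024 + 0.0225 = $0.0369 for the whole call
```

Note how the cache hits matter: the same 12,000 input tokens without caching would have cost $0.036 on input alone, three times the $0.0144 actually paid.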

Step 4: Stream Priced Events to Your Data Store and Billing System

With enriched, priced events flowing through your pipeline, you now need to persist them in two places simultaneously: a time-series analytics store for real-time dashboards, and your billing aggregation layer for subscription reconciliation.

Use an event streaming backbone (Apache Kafka or Redpanda work well at scale; a simple Redis Streams setup works for smaller platforms) to fan out each event to multiple consumers:


import asyncio
import json

async def pipeline_worker(event_queue: asyncio.Queue, pricing_engine: PricingEngine):
    """
    Core pipeline worker. Runs continuously, processing cost events.
    """
    while True:
        event: CostEvent = await event_queue.get()

        # Step 1: Enrich
        event = await enrich_cost_event(event)

        # Step 2: Price
        event.raw_cost_usd = pricing_engine.calculate_cost(event)

        # Step 3: Serialize
        payload = {
            "event_id": event.event_id,
            "timestamp": event.timestamp,
            "tenant_id": event.tenant_id,
            "session_id": event.session_id,
            "agent_id": event.agent_id,
            "model_provider": event.model_provider,
            "model_name": event.model_name,
            "subscription_tier": event.subscription_tier,
            "input_tokens": event.input_tokens,
            "output_tokens": event.output_tokens,
            "cached_tokens": event.cached_tokens,
            "tool_call_type": event.tool_call_type,
            "tool_call_count": event.tool_call_count,
            "latency_ms": event.latency_ms,
            "raw_cost_usd": event.raw_cost_usd,
        }

        # Step 4: Fan out to consumers
        await asyncio.gather(
            write_to_timeseries(payload),     # ClickHouse / TimescaleDB
            write_to_billing_aggregator(payload),  # Stripe / internal billing
            write_to_audit_log(payload),      # Immutable audit trail
        )

        event_queue.task_done()

For the time-series store, ClickHouse is the standout choice in 2026 for this workload. Its columnar storage and materialized views let you answer queries like "total AI spend by tenant, broken down by model provider and tool type, for the last 15 minutes" in milliseconds, even at tens of millions of events per day.

Here is the ClickHouse schema you should use:


CREATE TABLE ai_cost_events (
    event_id        UUID,
    timestamp       DateTime64(3),
    tenant_id       LowCardinality(String),
    session_id      String,
    agent_id        LowCardinality(String),
    model_provider  LowCardinality(String),
    model_name      LowCardinality(String),
    subscription_tier LowCardinality(String),
    input_tokens    UInt32,
    output_tokens   UInt32,
    cached_tokens   UInt32,
    tool_call_type  LowCardinality(String),
    tool_call_count UInt16,
    latency_ms      Float32,
    raw_cost_usd    Decimal(18, 8)
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (tenant_id, timestamp, model_provider)
TTL timestamp + INTERVAL 24 MONTH;

The LowCardinality type on repeated string fields like tenant_id, model_provider, and subscription_tier is critical for query performance. It tells ClickHouse to use dictionary encoding, which can reduce storage and query time by 3x to 10x on these columns.

Step 5: Build the Billing Aggregation Layer for Subscription Tier Reconciliation

This is where the pipeline pays off for your finance team. The billing aggregation layer maintains a running spend ledger per tenant and compares it against their subscription tier's included AI credits in real time.

Define your tier structure first:


SUBSCRIPTION_TIERS = {
    "starter": {
        "monthly_ai_credit_usd": 5.00,
        "overage_rate_multiplier": 1.5,  # 1.5x list price for overage
        "included_tool_calls": 500,
    },
    "growth": {
        "monthly_ai_credit_usd": 50.00,
        "overage_rate_multiplier": 1.2,
        "included_tool_calls": 10_000,
    },
    "scale": {
        "monthly_ai_credit_usd": 500.00,
        "overage_rate_multiplier": 1.0,  # No markup on overage at scale tier
        "included_tool_calls": 200_000,
    },
    "enterprise": {
        "monthly_ai_credit_usd": None,   # Custom contract
        "overage_rate_multiplier": 1.0,
        "included_tool_calls": None,
    },
}
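Before wiring up Redis, it helps to pin down the overage arithmetic as a pure function. This hypothetical helper (a standalone sketch, not a method of the aggregator built in this step) computes what a single new cost event should add to a tenant's bill, handling the boundary case where one event straddles the credit line:

```python
def overage_due(tier: dict, previous_spend: float, incremental_cost: float) -> float:
    """Billable overage for one new cost event, given a tier dict shaped
    like SUBSCRIPTION_TIERS and the tenant's spend so far this period."""
    credit = tier["monthly_ai_credit_usd"]
    if credit is None:
        return 0.0  # enterprise: custom contract, no metered overage
    current = previous_spend + incremental_cost
    if current <= credit:
        return 0.0  # still inside the included credit
    # Bill only the portion above the credit line, at the overage rate
    billable = current - max(previous_spend, credit)
    return round(billable * tier["overage_rate_multiplier"], 8)
```

For a starter-tier tenant at $4.90 of spend, a $0.20 event crosses the $5.00 line; only the $0.10 above the line is billed, at 1.5x, yielding $0.15.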

Now build the aggregator that maintains real-time spend counters in Redis and triggers billing events when thresholds are crossed:


class BillingAggregator:
    def __init__(self, redis_client, billing_client):
        self.redis = redis_client
        self.billing = billing_client  # e.g., Stripe Billing or your internal system

    async def record_spend(self, event: dict):
        tenant_id = event["tenant_id"]
        tier = event["subscription_tier"]
        cost = float(event["raw_cost_usd"])
        tool_calls = event["tool_call_count"]
        billing_period = self._current_billing_period()

        # Atomic increment of spend and tool call counters in Redis
        spend_key = f"spend:{tenant_id}:{billing_period}"
        tools_key = f"tools:{tenant_id}:{billing_period}"

        pipe = self.redis.pipeline()
        pipe.incrbyfloat(spend_key, cost)
        pipe.incrby(tools_key, tool_calls)
        pipe.expire(spend_key, 60 * 60 * 24 * 35)  # 35-day TTL
        pipe.expire(tools_key, 60 * 60 * 24 * 35)
        results = await pipe.execute()

        current_spend = float(results[0])
        current_tools = int(results[1])

        # Check against tier limits and emit billing events
        tier_config = SUBSCRIPTION_TIERS.get(tier, {})
        credit_limit = tier_config.get("monthly_ai_credit_usd")

        if credit_limit is not None:
            await self._check_and_emit_overage(
                tenant_id, tier, current_spend, credit_limit, cost
            )

    async def _check_and_emit_overage(
        self, tenant_id, tier, current_spend, credit_limit, incremental_cost
    ):
        previous_spend = current_spend - incremental_cost
        tier_config = SUBSCRIPTION_TIERS[tier]
        multiplier = tier_config["overage_rate_multiplier"]

        # Detect the moment a tenant crosses into overage
        if previous_spend < credit_limit <= current_spend:
            await self.billing.create_overage_event(
                tenant_id=tenant_id,
                event_type="tier_limit_crossed",
                overage_start_usd=current_spend - credit_limit,
                multiplier=multiplier,
            )

        # If already in overage, bill the incremental amount at overage rate
        elif current_spend > credit_limit:
            overage_cost = incremental_cost * multiplier
            await self.billing.record_usage(
                tenant_id=tenant_id,
                amount_usd=overage_cost,
                description=f"AI overage ({tier} tier)",
            )

    def _current_billing_period(self) -> str:
        from datetime import datetime, timezone
        now = datetime.now(timezone.utc)  # utcnow() is deprecated in 3.12+
        return f"{now.year}-{now.month:02d}"

Step 6: Build the Finance Dashboard Query Layer

Your finance team does not want to write SQL. They want a dashboard that answers specific questions instantly. Build a thin query API on top of your ClickHouse store that exposes the four views finance teams actually need:

  • View 1: Spend by tenant, current billing period. Total AI spend per tenant, broken down by model provider.
  • View 2: Spend vs. tier credit utilization. What percentage of each tenant's included AI credit has been consumed, with a projected end-of-month overage.
  • View 3: Cost by tool call type. Which tool categories (compute, retrieval, external API) are driving costs for each tenant.
  • View 4: Anomaly detection. Tenants whose spend velocity in the last hour is more than 3 standard deviations above their 30-day baseline.

Here is the ClickHouse query for View 2, which is the most valuable for subscription reconciliation:


SELECT
    tenant_id,
    subscription_tier,
    sum(raw_cost_usd)                          AS total_spend_usd,
    count()                                    AS total_events,
    sum(tool_call_count)                       AS total_tool_calls,
    groupUniqArray(model_provider)             AS providers_used,
    -- Project end-of-month spend based on current daily burn rate
    sum(raw_cost_usd) / toDayOfMonth(now())
        * toDayOfMonth(toLastDayOfMonth(now())) AS projected_monthly_spend_usd
FROM ai_cost_events
WHERE
    toYYYYMM(timestamp) = toYYYYMM(now())
GROUP BY
    tenant_id,
    subscription_tier
ORDER BY
    total_spend_usd DESC;

Step 7: Add Alerting and Anomaly Detection

A cost attribution pipeline without alerting is just an expensive audit log. Add two critical alert types to complete the system.

Alert Type 1: Spend Velocity Spike

Run this query every 5 minutes and alert if any tenant's spend rate in the last 15 minutes is more than 3x their hourly average for the past 7 days. This catches runaway agent loops, prompt injection attacks that trigger excessive tool calls, and misconfigured agents before they generate four-figure surprise bills.
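The threshold check itself is simple once the two numbers are in hand. Here is a hedged sketch of just the decision logic; the query layer that produces the 15-minute spend figure and the hourly history is assumed, and the 3x factor matches the rule above:

```python
def is_spend_spike(last_15m_usd: float, hourly_history_usd: list,
                   factor: float = 3.0) -> bool:
    """True if the tenant's current spend rate, scaled to an hourly rate,
    exceeds `factor` times their historical hourly average."""
    if not hourly_history_usd:
        return False  # no baseline yet: don't page on brand-new tenants
    avg_hourly = sum(hourly_history_usd) / len(hourly_history_usd)
    rate_per_hour = last_15m_usd * 4  # scale the 15-minute window to an hour
    return rate_per_hour > factor * avg_hourly
```

A tenant averaging $1/hour over the last 7 days who spends $1 in 15 minutes is running at $4/hour and trips the 3x threshold; $0.50 in 15 minutes ($2/hour) does not.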

Alert Type 2: Unattributed Spend

Any cost event where tenant_id = 'unknown' represents money you are spending that you cannot bill back. Set a hard alert threshold: if unattributed spend exceeds 0.5% of total spend in any 1-hour window, page your on-call engineer. In practice, this usually means a new code path was deployed without the tenant context propagation middleware attached.

Common Pitfalls and How to Avoid Them

  • Pitfall: Pricing table drift. Model providers update prices with little notice. Build a simple CI/CD job that fetches pricing from provider APIs (or their pricing pages via a scraper) weekly and opens a pull request to update your pricing table. Treat a stale pricing table as a P1 incident.
  • Pitfall: Losing context in async agent frameworks. If you use frameworks that spawn threads or processes (not just async coroutines), Python's contextvars will not propagate across process boundaries. Use explicit header injection instead, passing X-Tenant-ID as a header on every internal service call.
  • Pitfall: Double-counting retries. If your LLM client retries a failed request, you may emit two cost events for one billable interaction. Add a request_id field to your CostEvent and deduplicate at the ClickHouse insert layer using its ReplacingMergeTree engine variant.
  • Pitfall: Ignoring streaming responses. Streaming completions (SSE) report token usage only at the end of the stream. Make sure your instrumentation wrapper waits for the usage object in the final stream chunk rather than estimating from partial data.
  • Pitfall: Not accounting for batch inference discounts. Some providers offer 50% discounts on asynchronous batch inference jobs. If your platform uses batch APIs for background tasks, your pricing table needs a separate entry for batch vs. real-time pricing per model.
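To make the retry-deduplication pitfall concrete, here is a hypothetical in-process deduplicator with last-write-wins semantics, mirroring what ReplacingMergeTree does at merge time. The request_id field is the assumption: your instrumentation must reuse the same ID across client retries for this to work.

```python
def dedupe_by_request_id(events: list) -> list:
    """Keep only the most recent event per request_id (last-write-wins),
    assuming events arrive ordered oldest-to-newest."""
    latest = {}
    for ev in events:
        latest[ev["request_id"]] = ev  # later events replace earlier ones
    return list(latest.values())
```

In production you would not dedupe in memory like this; the point is the semantics. ReplacingMergeTree applies the same rule lazily during background merges, so queries should still use FINAL or argMax-style aggregation if exactness matters before merges complete.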

Putting It All Together: The Complete Data Flow

Here is the end-to-end data flow your pipeline implements:

  1. A tenant's HTTP request arrives at your API gateway. The gateway sets the tenant_context ContextVar with tenant ID, session ID, and subscription tier.
  2. Your agent framework processes the request, calling instrumented_chat_completion() for every model call. Each call emits a raw CostEvent to an in-memory async queue.
  3. The pipeline worker picks up each event, calls enrich_cost_event() to tag tool types and validate tenant metadata, then calls pricing_engine.calculate_cost() to compute the dollar amount.
  4. The priced event is fanned out to ClickHouse (for analytics), the BillingAggregator (for real-time ledger updates and overage detection), and an immutable audit log (for finance compliance).
  5. The BillingAggregator updates Redis counters atomically and emits billing events to your billing provider when tier thresholds are crossed.
  6. Finance dashboards query ClickHouse directly via the query API, seeing real-time spend breakdowns by tenant, model provider, and tool type, with projected end-of-month figures.

Conclusion: Cost Attribution Is a Product Feature, Not an Ops Afterthought

In 2026, the platforms that win the multi-model AI race will not necessarily be the ones with the best models. They will be the ones that can confidently tell a CFO exactly how much each tenant costs to serve, in real time, broken down by every dimension that matters. That confidence is what enables you to price your subscription tiers correctly, catch runaway costs before they become crises, and build the kind of transparent billing that enterprise customers require.

The pipeline described in this tutorial is not a massive engineering project. The core of it, from instrumentation through billing aggregation, can be built and deployed by a small team in two to three weeks. The investment pays back almost immediately: most teams that implement per-tenant cost attribution discover at least one subscription tier that is structurally unprofitable, usually the one where power users are running deep agentic workflows that the original pricing model never anticipated.

Start with Step 1. Wrap your LLM calls. Get the events flowing. Everything else builds on that foundation. The 2026 billing complexity crisis is real, but it is a solvable engineering problem, and now you have the blueprint.