How to Build a Backend Cost Attribution System for Multi-Agent AI Workflows (So Engineering Teams Can Accurately Chargeback Compute, Token, and Tool-Call Expenses to Individual Product Lines in 2026)
If your organization runs multi-agent AI workflows at any meaningful scale in 2026, you already know the uncomfortable truth: the billing dashboard is a black box. You see a massive monthly invoice from your LLM provider, your cloud compute bill has ballooned, and when the CFO asks which product line is responsible, engineering shrugs. That silence is expensive, in more ways than one.
Cost attribution, or the practice of accurately assigning infrastructure and AI spend to the teams and product lines that generated it, has become one of the most critical (and most neglected) disciplines in AI engineering. Traditional FinOps tooling was built for monolithic services and microservices. Multi-agent AI systems, with their dynamic task spawning, nested tool calls, parallel sub-agent execution, and variable token consumption, break every assumption those tools were built on.
This guide walks you through building a backend cost attribution system from scratch, one that handles compute, token, and tool-call expenses and maps them cleanly to individual product lines. By the end, you will have a practical architecture you can implement this quarter.
Why Existing FinOps Tools Fall Short for Multi-Agent AI
Before diving into the solution, it is worth understanding exactly where the gap is. Standard FinOps approaches rely on a few core primitives: resource tags, cost allocation by service, and static budget envelopes. These work reasonably well when a microservice has a predictable, bounded footprint. Multi-agent systems violate this model in three specific ways:
- Dynamic agent spawning: An orchestrator agent may spawn between 2 and 40 sub-agents depending on task complexity. Each sub-agent has its own token budget, model selection, and tool-call profile. There is no static resource to tag ahead of time.
- Nested and recursive tool calls: A single user request can trigger a chain of tool calls (web search, code execution, database lookup, external API) across multiple agent hops. Each hop has its own cost signature, and the chain is rarely linear.
- Mixed model routing: Modern agentic pipelines in 2026 typically route tasks across multiple models. A GPT-class frontier model handles reasoning, a smaller fine-tuned model handles classification, and a specialized embedding model handles retrieval. Each has a different per-token price, and a single workflow touches all three.
The result is that your cost surface is a directed acyclic graph (DAG), not a flat list of services. Your attribution system needs to understand that graph.
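To make the DAG framing concrete, here is a minimal sketch (with made-up span names and cost figures) of how a run's costs roll up through a tree of agent spans to a single per-run total that can then be attributed to a product line:

```python
from dataclasses import dataclass, field

@dataclass
class CostSpan:
    """One node in a run's cost graph: an agent, tool call, or model call."""
    name: str
    own_cost_usd: float = 0.0
    children: list["CostSpan"] = field(default_factory=list)

def rollup(span: CostSpan) -> float:
    """Total cost of a span including all nested sub-agents and tool calls."""
    return span.own_cost_usd + sum(rollup(child) for child in span.children)

# Illustrative run: an orchestrator that spawned two sub-agents
run = CostSpan("orchestrator", 0.012, [
    CostSpan("retrieval_agent", 0.004, [CostSpan("vector_db_query", 0.0001)]),
    CostSpan("code_executor_agent", 0.020),
])
total = rollup(run)  # the amount attributed to the product line that started the run
```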
The Core Architecture: Trace-First Cost Attribution
The most reliable approach to multi-agent cost attribution is to build it on top of distributed tracing. If you have used OpenTelemetry before, this will feel familiar. The key insight is this: every cost event in an AI workflow is a child span of a root trace that carries a product-line identifier. If you instrument your system correctly, cost rolls up naturally through the trace hierarchy.
Here is the high-level architecture:
- Trace Context Propagation Layer (injects and forwards attribution metadata)
- Cost Event Collector (captures token counts, compute duration, and tool-call events)
- Cost Enrichment Pipeline (attaches real-time pricing and normalizes units)
- Attribution Store (persists enriched cost events with full trace lineage)
- Chargeback Reporting Engine (aggregates and exposes data per product line)
Let's build each layer in detail.
Step 1: Define Your Attribution Taxonomy
Before writing a single line of code, you need a clean taxonomy. Attribution systems fail not because of bad instrumentation but because of ambiguous ownership. Define the following dimensions before you start:
Required Dimensions
- product_line_id: The top-level business unit responsible for the cost (e.g., search_assistant, customer_support_bot, data_pipeline_agent).
- workflow_id: A UUID identifying the specific workflow template or agent graph definition.
- run_id: A UUID for the specific execution instance of that workflow.
- agent_id: The identifier of the specific agent node within the workflow (e.g., planner_agent, retrieval_agent, code_executor_agent).
- environment: production, staging, or development. This is critical; you do not want dev experiments inflating production chargebacks.
Optional But Recommended Dimensions
- user_segment: Free tier vs. paid tier, useful for downstream margin analysis.
- feature_flag: Which A/B variant or model configuration was active during the run.
- priority_tier: Whether the run was a real-time interactive request or a background batch job.
Store this taxonomy in a central configuration registry so every service in your stack can pull it. A simple JSON schema served from a config service works well. Avoid hardcoding these values in individual agents.
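As a sketch of what that registry might serve, here is a hypothetical taxonomy document (the allowed values are illustrative) plus a small validator that services can run against incoming attribution metadata:

```python
import json

# Hypothetical taxonomy as served by a config service; dimension names follow
# the list above, and the allowed values are illustrative examples only.
TAXONOMY = json.loads("""
{
  "product_line_id": ["search_assistant", "customer_support_bot", "data_pipeline_agent"],
  "environment": ["production", "staging", "development"]
}
""")

def validate_attribution(attrs: dict) -> list[str]:
    """Return a list of validation errors; an empty list means attrs conform."""
    errors = []
    for dimension, allowed in TAXONOMY.items():
        if attrs.get(dimension) not in allowed:
            errors.append(f"invalid or missing {dimension}: {attrs.get(dimension)!r}")
    return errors
```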
Step 2: Instrument Trace Context Propagation
Every agent execution must carry the attribution context through its entire call chain. The cleanest way to do this in 2026 is with OpenTelemetry baggage, which propagates key-value pairs across process boundaries automatically via HTTP headers or gRPC metadata.
Here is a Python example using the OpenTelemetry SDK to inject attribution context at the workflow entry point:
from opentelemetry import baggage, context
import uuid

def start_attributed_workflow(product_line_id: str, workflow_id: str) -> dict:
    run_id = str(uuid.uuid4())
    # Inject attribution metadata into OpenTelemetry baggage
    ctx = baggage.set_baggage("product_line_id", product_line_id)
    ctx = baggage.set_baggage("workflow_id", workflow_id, context=ctx)
    ctx = baggage.set_baggage("run_id", run_id, context=ctx)
    ctx = baggage.set_baggage("environment", "production", context=ctx)
    # Attach the context to the current execution so downstream calls inherit it
    token = context.attach(ctx)
    return {"run_id": run_id, "context_token": token}
Any downstream agent or tool call that uses the OpenTelemetry SDK will automatically inherit this baggage. When a sub-agent spawns, it propagates the same product_line_id and run_id without any manual threading. This is the key to avoiding the "who called whom" attribution problem.
For agent frameworks like LangGraph, AutoGen, or CrewAI (all of which support custom callbacks and middleware hooks in their 2026 versions), you inject this context in the framework's pre-execution hook rather than manually in each agent.
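To see why no manual threading is needed, here is a stripped-down illustration of the propagation behavior using stdlib contextvars (the mechanism the OpenTelemetry Python context API is built on); the function names are hypothetical stand-ins for your orchestrator and sub-agent:

```python
import contextvars

# A context variable set once at the workflow entry point is visible in every
# nested call on the same execution context, without being passed as an argument.
_attribution = contextvars.ContextVar("attribution", default={})

def start_workflow(product_line_id: str) -> str:
    _attribution.set({"product_line_id": product_line_id})
    return orchestrator()

def orchestrator() -> str:
    return sub_agent()  # note: no attribution argument is threaded through

def sub_agent() -> str:
    # Inherited implicitly, exactly as OTel baggage is by downstream agents
    return _attribution.get()["product_line_id"]
```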
Step 3: Build the Cost Event Collector
The Cost Event Collector is a lightweight instrumentation layer that intercepts every cost-generating operation and emits a structured cost event. There are three categories of cost events you need to capture:
A. Token Cost Events (LLM API Calls)
Wrap your LLM client to capture token usage on every call. Here is a generic wrapper pattern:
import time
import json
from dataclasses import dataclass, asdict
from opentelemetry import baggage, context as otel_context

@dataclass
class TokenCostEvent:
    event_type: str = "token_cost"
    timestamp: float = 0.0
    run_id: str = ""
    product_line_id: str = ""
    agent_id: str = ""
    model_id: str = ""
    prompt_tokens: int = 0
    completion_tokens: int = 0
    total_tokens: int = 0
    estimated_cost_usd: float = 0.0

# Pricing table (update monthly from provider APIs)
MODEL_PRICING = {
    "gpt-5": {"input": 0.000015, "output": 0.000060},  # per token
    "gpt-4o-mini": {"input": 0.0000002, "output": 0.0000008},
    "claude-4-sonnet": {"input": 0.000012, "output": 0.000048},
    "gemini-2-flash": {"input": 0.0000001, "output": 0.0000004},
}

def emit_token_cost_event(
    agent_id: str,
    model_id: str,
    prompt_tokens: int,
    completion_tokens: int,
    event_sink,  # your Kafka producer, Kinesis client, etc.
):
    ctx = otel_context.get_current()
    pricing = MODEL_PRICING.get(model_id, {"input": 0.0, "output": 0.0})
    cost = (prompt_tokens * pricing["input"]) + (completion_tokens * pricing["output"])
    event = TokenCostEvent(
        timestamp=time.time(),
        run_id=baggage.get_baggage("run_id", ctx) or "unknown",
        product_line_id=baggage.get_baggage("product_line_id", ctx) or "unknown",
        agent_id=agent_id,
        model_id=model_id,
        prompt_tokens=prompt_tokens,
        completion_tokens=completion_tokens,
        total_tokens=prompt_tokens + completion_tokens,
        estimated_cost_usd=cost,
    )
    event_sink.send("cost-events", json.dumps(asdict(event)).encode())
B. Compute Cost Events (Container and GPU Runtime)
For agents running on containerized infrastructure (Kubernetes, ECS, or similar), compute cost attribution requires a sidecar or DaemonSet that samples CPU and memory usage per pod and correlates it with the run_id label on the pod spec. The formula is straightforward:
compute_cost_usd = (cpu_cores_used * cpu_price_per_core_hour * duration_hours)
                 + (memory_gb_used * memory_price_per_gb_hour * duration_hours)
                 + (gpu_hours_used * gpu_price_per_hour)
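The formula translates directly into a small helper; the default unit prices below are illustrative placeholders, so substitute the rates from your cloud provider's price sheet:

```python
def compute_cost_usd(
    cpu_cores_used: float,
    memory_gb_used: float,
    gpu_hours_used: float,
    duration_hours: float,
    cpu_price_per_core_hour: float = 0.04,   # illustrative rate
    memory_price_per_gb_hour: float = 0.005, # illustrative rate
    gpu_price_per_hour: float = 2.50,        # illustrative rate
) -> float:
    """Apply the compute-cost formula above to one pod's sampled usage."""
    return (
        cpu_cores_used * cpu_price_per_core_hour * duration_hours
        + memory_gb_used * memory_price_per_gb_hour * duration_hours
        + gpu_hours_used * gpu_price_per_hour
    )
```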
The critical step is ensuring your Kubernetes pod spec includes attribution labels that match your taxonomy. Add these to your agent deployment manifests:
metadata:
  labels:
    ai.attribution/product-line-id: "search_assistant"
    ai.attribution/workflow-id: "semantic-search-v3"
    ai.attribution/environment: "production"
A Prometheus exporter with custom label selectors can then aggregate compute cost per product line using these pod labels.
C. Tool-Call Cost Events
Tool calls are the most frequently overlooked cost category. In 2026, a typical production agent workflow might invoke external APIs (web search, code sandboxes, vector databases, data enrichment services), each with its own per-call pricing. Wrap each tool executor with a cost emitter:
TOOL_PRICING = {
    "web_search": 0.003,  # per call
    "code_execution_sandbox": 0.008,
    "vector_db_query": 0.0001,
    "image_generation": 0.04,
    "structured_data_extract": 0.005,
}

def tool_call_wrapper(tool_name: str, tool_fn, *args, event_sink, **kwargs):
    start = time.time()
    result = tool_fn(*args, **kwargs)
    duration_ms = (time.time() - start) * 1000
    ctx = otel_context.get_current()
    event = {
        "event_type": "tool_cost",
        "timestamp": time.time(),
        "run_id": baggage.get_baggage("run_id", ctx) or "unknown",
        "product_line_id": baggage.get_baggage("product_line_id", ctx) or "unknown",
        "agent_id": baggage.get_baggage("agent_id", ctx) or "unknown",
        "tool_name": tool_name,
        "duration_ms": duration_ms,
        "estimated_cost_usd": TOOL_PRICING.get(tool_name, 0.0),
    }
    event_sink.send("cost-events", json.dumps(event).encode())
    return result
Step 4: Build the Cost Enrichment Pipeline
Raw cost events from your collector are good but not great. The enrichment pipeline adds the context needed for accurate reporting. This is typically a stream processing job (Apache Flink, Kafka Streams, or AWS Kinesis Data Analytics) that consumes from your cost-events topic and produces to an enriched-cost-events topic.
The enrichment pipeline performs three operations:
- Price normalization: Apply the correct pricing tier based on your negotiated contract rates (which may differ from public list prices). If your organization has volume discounts with an LLM provider, this is where you apply them.
- Taxonomy validation: Reject or quarantine events with missing or invalid product_line_id values. Unattributed costs should be flagged, not silently dropped into an "other" bucket.
- Lineage enrichment: Join each event with the workflow run metadata store to attach parent workflow context, enabling hierarchical rollups. A tool-call event from a sub-agent should be traceable back to the root orchestrator and the originating product line.
Step 5: Design the Attribution Store
The Attribution Store is a time-series-friendly database that holds all enriched cost events with their full attribution context. Your choice of storage depends on your query patterns:
- ClickHouse is the most popular choice in 2026 for this use case. Its columnar storage handles high-cardinality GROUP BY queries (aggregating by product line, model, agent, time window) at exceptional speed. Cost events are append-only and time-ordered, which plays to ClickHouse's strengths.
- Apache Iceberg on S3/GCS works well if you need long-term retention and integration with a data lake for finance and analytics teams.
- TimescaleDB is a solid choice if your team is already on PostgreSQL and wants to avoid introducing a new database technology.
Here is the recommended schema for your core cost events table in ClickHouse:
CREATE TABLE cost_events (
    event_id UUID,
    event_type LowCardinality(String),  -- 'token_cost', 'compute_cost', 'tool_cost'
    timestamp DateTime64(3, 'UTC'),
    product_line_id LowCardinality(String),
    workflow_id LowCardinality(String),
    run_id UUID,
    agent_id LowCardinality(String),
    environment LowCardinality(String),
    model_id LowCardinality(String),
    tool_name LowCardinality(String),
    prompt_tokens UInt32,
    completion_tokens UInt32,
    duration_ms Float32,
    cost_usd Decimal(18, 8),
    metadata String  -- JSON blob for extensibility
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (product_line_id, timestamp, run_id)
TTL timestamp + INTERVAL 24 MONTH;
The ORDER BY key is intentional: most chargeback queries filter first by product_line_id and then by time range, so this ordering dramatically improves query performance.
Step 6: Build the Chargeback Reporting Engine
With clean, enriched data in your Attribution Store, the reporting layer is relatively straightforward. You need three tiers of reporting:
Tier 1: Real-Time Cost Dashboard (Engineering Operational View)
A Grafana dashboard connected directly to ClickHouse. Key panels include: cost per product line per hour, top 10 most expensive workflows by run, token consumption rate by model, and anomaly alerts when a product line exceeds its hourly cost budget.
Tier 2: Weekly Chargeback Report (Engineering Manager View)
A scheduled query that produces a structured report breaking down the full cost by product line, subdivided by cost category (tokens, compute, tool calls). This is what gets sent to engineering managers and product owners for budget review. A simple materialized view in ClickHouse can power this:
SELECT
product_line_id,
environment,
toStartOfWeek(timestamp) AS week,
sumIf(cost_usd, event_type = 'token_cost') AS token_cost_usd,
sumIf(cost_usd, event_type = 'compute_cost') AS compute_cost_usd,
sumIf(cost_usd, event_type = 'tool_cost') AS tool_cost_usd,
sum(cost_usd) AS total_cost_usd,
sum(prompt_tokens + completion_tokens) AS total_tokens
FROM cost_events
WHERE environment = 'production'
GROUP BY product_line_id, environment, week
ORDER BY week DESC, total_cost_usd DESC;
Tier 3: Monthly Finance Chargeback Export (Finance View)
A CSV or Parquet export that maps directly to your internal cost center codes. This is what the finance team uses to allocate AI infrastructure spend to the correct P&L lines. The key addition here is a mapping table that translates product_line_id to your organization's internal cost center identifiers and budget owners.
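A minimal sketch of that mapping step, with hypothetical cost center codes and owner teams (your finance system's actual identifiers will differ):

```python
# Hypothetical mapping table: product_line_id -> internal cost center metadata.
COST_CENTERS = {
    "search_assistant": {"cost_center": "CC-4101", "owner": "search-eng"},
    "customer_support_bot": {"cost_center": "CC-4102", "owner": "support-eng"},
}

def to_finance_rows(monthly_totals: dict[str, float]) -> list[dict]:
    """Translate monthly per-product-line totals into finance export rows."""
    rows = []
    for product_line_id, total in monthly_totals.items():
        mapping = COST_CENTERS.get(
            product_line_id,
            {"cost_center": "UNMAPPED", "owner": "finops"},  # surfaces gaps
        )
        rows.append({
            "cost_center": mapping["cost_center"],
            "owner": mapping["owner"],
            "total_cost_usd": round(total, 2),
        })
    return rows
```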
Step 7: Handle the Hard Cases
A few scenarios will break a naive attribution system. Here is how to handle them:
Shared Infrastructure Costs
Some costs are genuinely shared: a vector database cluster used by five product lines, a shared embedding model endpoint, a common retrieval cache. For these, use a proportional allocation model. Track each product line's share of total requests to the shared resource over a billing period and allocate the shared cost proportionally. Store the allocation weights in a configuration table so they can be audited and adjusted.
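The proportional model is a few lines of arithmetic; this sketch allocates by request count, though the same shape works for any usage metric you choose to weight by:

```python
def allocate_shared_cost(
    total_cost_usd: float, request_counts: dict[str, int]
) -> dict[str, float]:
    """Split a shared resource's bill across product lines in proportion to
    each line's request volume over the billing period."""
    total_requests = sum(request_counts.values())
    if total_requests == 0:
        return {line: 0.0 for line in request_counts}
    return {
        line: total_cost_usd * count / total_requests
        for line, count in request_counts.items()
    }
```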
Cascading Agent Failures with Partial Costs
When a multi-agent workflow fails mid-execution, you still incurred real costs up to the failure point. Your cost events should be emitted at the time of the operation, not at workflow completion. Never batch cost events to the end of a run; emit them as each operation completes. This way, a failed run still generates accurate partial cost attribution.
Async and Background Workflows
Background batch workflows that run overnight present a context propagation challenge: the original HTTP request context is long gone. The solution is to persist the attribution context (product_line_id, workflow_id, etc.) in your workflow job queue payload at enqueue time. When the worker picks up the job, it reconstructs the OTel baggage context from the job payload before executing.
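A sketch of that pattern with a plain in-memory queue standing in for your real job queue; in a production worker, the returned attribution dict would be re-attached as OTel baggage rather than just returned:

```python
import json
import uuid

def enqueue_job(queue: list, product_line_id: str, workflow_id: str, payload: dict):
    """Persist the attribution context inside the job payload at enqueue time."""
    queue.append(json.dumps({
        "attribution": {
            "product_line_id": product_line_id,
            "workflow_id": workflow_id,
            "run_id": str(uuid.uuid4()),
        },
        "payload": payload,
    }))

def run_worker(queue: list) -> dict:
    """Worker reconstructs the attribution context before executing the job."""
    job = json.loads(queue.pop(0))
    # A real worker would now call baggage.set_baggage(...) for each key
    return job["attribution"]
```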
Step 8: Set Up Budget Guardrails and Alerts
A cost attribution system without automated guardrails is a reporting tool, not a control plane. Add the following guardrails to make your system actionable:
- Per-product-line hourly budget caps: When a product line's hourly spend exceeds a configurable threshold, emit a PagerDuty or Slack alert to the owning team. Consider an automatic circuit breaker that throttles new workflow starts for that product line until the owning team acknowledges.
- Per-run cost caps: Set a maximum cost per workflow run. If a single run exceeds the cap (usually a sign of a runaway agent loop), terminate it and emit a cost anomaly event.
- Weekly budget burn rate alerts: If a product line is on pace to exceed its monthly budget by day 10 of the month, alert early rather than at month end.
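The per-run cap is the simplest of these to wire in; here is a minimal check, where the 5 USD cap and the 80 percent warning threshold are illustrative values you would tune per product line:

```python
def check_run_cost_cap(run_cost_usd: float, cap_usd: float = 5.0) -> str:
    """Per-run guardrail: terminate runs over the cap, warn as they approach it."""
    if run_cost_usd > cap_usd:
        return "terminate"  # usually a sign of a runaway agent loop
    if run_cost_usd > 0.8 * cap_usd:
        return "warn"       # emit an early alert to the owning team
    return "ok"
```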
Putting It All Together: A Reference Architecture Summary
Here is the complete data flow in one view:
- A user request arrives at your API gateway. The gateway injects product_line_id and creates a run_id, attaching both to OTel baggage.
- The orchestrator agent spawns sub-agents. Each sub-agent inherits the baggage context automatically.
- Every LLM call, tool call, and compute event emits a structured cost event to a Kafka topic.
- A stream processing job enriches events with pricing data and validates attribution metadata.
- Enriched events land in ClickHouse, partitioned by month and ordered by product line and time.
- Grafana reads from ClickHouse for real-time operational dashboards.
- A weekly scheduled job materializes chargeback reports and delivers them to product owners.
- A monthly export feeds into the finance system for P&L allocation.
Common Pitfalls to Avoid
- Attributing by service name instead of workflow context: Tagging costs to "the inference service" tells you nothing. Always attribute to the workflow and product line that initiated the work.
- Ignoring development and staging costs: These can be 20 to 40 percent of total AI spend in active development cycles. Track them separately, not as noise in production data.
- Using list prices instead of contracted rates: If you have negotiated rates with your LLM provider (increasingly common for enterprise customers in 2026), your estimated costs will be materially wrong if you use public pricing.
- Skipping taxonomy governance: Without a formal process for adding new product_line_id values, engineers will invent their own, and your reports will fragment into dozens of unrecognized identifiers. Treat the taxonomy as a governed artifact, versioned and reviewed like a database schema.
Conclusion
Building a backend cost attribution system for multi-agent AI workflows is not a glamorous project, but it is one of the highest-leverage infrastructure investments your engineering team can make in 2026. As AI spend continues to grow as a percentage of total infrastructure cost, the organizations that can see exactly where that money is going will make dramatically better decisions: which workflows to optimize, which product lines to scale, and which experiments to kill before they drain the budget.
The architecture described here (trace-first context propagation, event-driven cost collection, stream enrichment, and columnar storage) is production-tested and scales from a few thousand workflow runs per day to tens of millions. Start with Step 1 (taxonomy definition) and Step 2 (trace context propagation). Those two steps alone will give you more visibility than most engineering teams have today. The rest of the pipeline can be built incrementally.
The goal is not perfect cost accounting on day one. The goal is to make cost a first-class observable in your AI system, right alongside latency and error rate. When your engineering team treats AI cost with the same rigor as performance, the entire product development culture shifts. And that shift pays for itself very quickly.