How One Fintech SaaS Team Discovered Their Per-Tenant AI Agent Dependency Graph Was Silently Duplicating Tool Execution Costs Across Shared Infrastructure, and the Deduplication Pipeline Architecture That Cut Their March 2026 Inference Bills by 40%
When Meridian Financial's platform engineering team sat down to review their March 2026 inference billing dashboard, the number staring back at them was not just alarming; it was confusing. Their AI-powered compliance assistant, deployed across roughly 340 enterprise tenants, had generated an invoice nearly double what their cost models had projected for the month. Nobody had shipped a major feature. Traffic was flat. Token usage per session looked reasonable. So where was the money going?
The answer, buried deep inside their per-tenant agent dependency graph, would take three weeks to fully untangle. But once they found it, the fix was architectural, elegant, and ultimately shaved 40% off their monthly inference bill starting in the same billing cycle. This is the story of what they found, why it happens more often than teams realize, and how to build the deduplication pipeline that stops it cold.
Background: The Architecture That Seemed Fine
Meridian's core product is a compliance and risk analysis platform built for mid-market financial institutions. Their AI layer, introduced in late 2024 and significantly expanded through 2025, uses a multi-agent architecture where each tenant gets a logically isolated agent runtime. Think of it as a dedicated "compliance brain" per customer: each agent can call a suite of shared tools including regulatory document retrieval, transaction pattern analysis, entity resolution, and external data enrichment via third-party APIs.
On paper, the architecture was sound. Tenants were isolated at the data layer. Shared tools were stateless and horizontally scalable. The dependency graph that wired agents to tools was generated dynamically at tenant provisioning time, based on each customer's subscribed feature set. A basic-tier tenant might have access to four tools; an enterprise tenant might have access to fourteen.
What nobody had audited carefully was how the dependency graph was being resolved at runtime, especially when multiple tenants triggered similar or identical tool calls within the same inference window.
The Silent Duplication: What Was Actually Happening
The root cause came down to a subtle but expensive interaction between three systems: the agent orchestration layer (built on a customized version of an open-source agentic framework), the tool execution registry, and the shared inference router that sat between agents and the underlying LLM API.
Here is the sequence that was silently bleeding money:
- Tenant A's agent receives a user query about a specific regulatory filing requirement. The agent's dependency graph resolves this as requiring Tool X (regulatory document retrieval) followed by Tool Y (entity resolution).
- Tenant B's agent, milliseconds later, receives a nearly identical query from a different user. Its dependency graph also resolves to Tool X followed by Tool Y, with the same underlying parameters.
- Because each tenant's agent runtime was logically isolated, both tool calls were dispatched independently to the execution registry, which treated them as unrelated requests.
- Tool X, in this case, involved a call to an external regulatory database AND a summarization pass through the LLM. Tool Y did the same. Both tenants paid full inference cost for identical operations.
Now multiply that pattern across 340 tenants, across dozens of tools, across thousands of daily queries. The Meridian team ran a retroactive trace on two weeks of logs and found that approximately 31% of all tool executions were semantically identical to another execution that had occurred within a 90-second window. Every single one of those was billed as a fresh inference call.
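A retroactive analysis like Meridian's can be sketched in a few lines. This is an illustrative reconstruction, not their actual tooling: it scans a time-ordered event log and counts executions whose (tool, normalized parameters) key already ran within the preceding window.

```python
def duplicate_fraction(events, window_s=90):
    """Fraction of tool executions whose (tool_id, normalized_params) key
    was already executed within the preceding window_s seconds.
    events: list of (timestamp_s, tool_id, normalized_params) tuples,
    assumed sorted by timestamp."""
    last_seen = {}  # (tool_id, params) -> timestamp of last execution
    duplicates = 0
    for ts, tool, params in events:
        key = (tool, params)
        if key in last_seen and ts - last_seen[key] <= window_s:
            duplicates += 1
        last_seen[key] = ts
    return duplicates / len(events) if events else 0.0

# Toy trace: two tenants issue the same call 5 seconds apart.
trace = [
    (0,   "tool_x", "filing=10-K"),   # Tenant A
    (5,   "tool_x", "filing=10-K"),   # Tenant B: duplicate within window
    (200, "tool_x", "filing=10-K"),   # outside the 90 s window
]
assert abs(duplicate_fraction(trace) - 1 / 3) < 1e-9
```

In production the "normalized_params" key is the hard part (see the fingerprinting discussion below); this sketch assumes normalization has already happened.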
The dependency graph was not broken. It was doing exactly what it was designed to do. The problem was that the design had never accounted for cross-tenant tool call convergence at scale.
Why This Problem Is Invisible Until It Isn't
This class of bug is particularly nasty because it hides behind metrics that all look healthy. Per-session token counts are normal. Per-tenant cost trends are flat or slightly growing in line with usage. No errors, no latency spikes, no customer complaints. The only signal is the aggregate bill, and most teams do not have the tooling to decompose that bill at the tool-execution level with cross-tenant visibility.
The Meridian team only caught it because a senior infrastructure engineer, running a routine cost attribution report, noticed that their tool execution event volume was growing at 2.3x the rate of their user session volume. That ratio should have been closer to 1.4x based on their feature expansion roadmap. That single anomaly opened the investigation.
This is a critical lesson: instrument your tool execution layer independently from your session layer. If you are only measuring inference cost at the session or tenant level, you are flying blind on the most granular and controllable cost driver in your agentic stack.
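One minimal way to implement that independent instrumentation is a pair of counters and a period-over-period ratio check. The class and metric names below are illustrative, not Meridian's schema; the point is that tool executions and sessions are counted separately so their growth rates can be compared.

```python
from collections import Counter

class CostTelemetry:
    """Sketch: tool-level execution telemetry kept separate from
    session-level telemetry, so the two growth rates can be compared."""

    def __init__(self):
        self.counters = Counter()

    def record_session(self, tenant_id):
        self.counters["sessions"] += 1

    def record_tool_execution(self, tenant_id, tool_id):
        self.counters["tool_executions"] += 1
        self.counters[f"tool_executions.{tool_id}"] += 1

    def growth_ratio(self, previous_period):
        """Tool-execution growth divided by session growth versus a prior
        period. A value far above the roadmap-expected ratio (1.4x in
        Meridian's case) is the anomaly signal described above."""
        exec_growth = (self.counters["tool_executions"]
                       / previous_period.counters["tool_executions"])
        session_growth = (self.counters["sessions"]
                          / previous_period.counters["sessions"])
        return exec_growth / session_growth
```

A nightly job comparing the current period against the previous one, with an alert when the ratio exceeds the expected value, would have surfaced Meridian's 2.3x anomaly months earlier.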
The Deduplication Pipeline Architecture
Once the team understood the problem, they designed a cross-tenant tool execution deduplication pipeline. Here is how it works, broken into its four core components.
1. The Canonical Tool Call Fingerprint
Every tool call, before it hits the execution registry, is now passed through a fingerprinting function. The fingerprint is a deterministic hash composed of:
- The tool identifier and version
- A normalized, tenant-agnostic representation of the input parameters (stripping tenant IDs, session tokens, and any PII fields that are irrelevant to the actual computation)
- A semantic hash of the user intent vector, bucketed to a configurable resolution (they use cosine similarity with a 0.96 threshold to group near-identical intents)
- A temporal bucket (the current 90-second rolling window)
This fingerprint is not stored per-tenant. It lives in a shared, ephemeral deduplication cache (Redis with a 120-second TTL) that is intentionally cross-tenant at the tool execution layer, while remaining fully isolated at the data response layer.
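The four fingerprint components can be sketched as follows. This is a simplified reconstruction under stated assumptions: the semantic intent clustering (the cosine-similarity step) is assumed to run upstream and hand over a bucket ID, and the rolling window is approximated with fixed-width temporal buckets; field names in `STRIP_FIELDS` are illustrative.

```python
import hashlib
import json
import time

# Identity fields stripped before hashing; a real pipeline would also
# strip tool-specific PII fields per a schema.
STRIP_FIELDS = {"tenant_id", "session_token"}

def normalize_params(params: dict) -> str:
    """Tenant-agnostic canonical form: drop identity fields, sort keys."""
    cleaned = {k: v for k, v in sorted(params.items())
               if k not in STRIP_FIELDS}
    return json.dumps(cleaned, sort_keys=True)

def fingerprint(tool_id: str, tool_version: str, params: dict,
                intent_bucket: int, now=None, window_s: int = 90) -> str:
    """Deterministic hash over the four components listed above.
    intent_bucket stands in for the upstream cosine-similarity clustering;
    fixed buckets approximate the 90-second rolling window."""
    now = time.time() if now is None else now
    temporal_bucket = int(now // window_s)
    payload = "|".join([tool_id, tool_version, normalize_params(params),
                        str(intent_bucket), str(temporal_bucket)])
    return hashlib.sha256(payload.encode()).hexdigest()
```

Two tenants issuing the same call in the same window produce identical fingerprints even though their raw parameters carry different tenant IDs, which is exactly what lets the shared cache key on it.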
2. The Execution Lock and Promise Registry
When a tool call fingerprint arrives, the pipeline checks the deduplication cache. Three outcomes are possible:
- Cache miss: The call is new. It acquires a distributed lock on that fingerprint, executes the tool, stores the result, and releases the lock. The originating tenant is tagged as the "primary executor" for billing attribution.
- Cache hit with result: The tool has already been executed within the current window. The result is returned immediately from cache. The requesting tenant is tagged as a "deduplication beneficiary" and billed at a significantly reduced cache-retrieval rate rather than full inference cost.
- Cache hit without result (in-flight): Another tenant's agent is currently executing the same tool call. The new request subscribes to a promise (implemented as a pub/sub channel) and waits for the result. When it arrives, both tenants receive the same output. Billing is split between primary executor and subscriber.
The lock-and-promise pattern is critical for preventing the "thundering herd" scenario where 50 tenant agents all simultaneously miss the cache for the same popular tool call and all fire off independent executions before any result is returned.
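The three outcomes and the lock-and-promise pattern can be condensed into a small registry. This in-memory sketch stands in for the Redis-backed version (distributed locks, pub/sub promises, and the 120-second TTL are all simplified away); the role strings map onto the billing tags described above.

```python
import threading

class DedupRegistry:
    """In-memory stand-in for the cross-tenant deduplication cache and
    promise registry. Production would use distributed locks, pub/sub,
    and a TTL; this sketch shows only the control flow."""

    def __init__(self):
        self._lock = threading.Lock()
        self._entries = {}  # fingerprint -> entry dict

    def execute(self, fp, tool_fn, tenant_id):
        with self._lock:
            entry = self._entries.get(fp)
            if entry is None:
                # Cache miss: this tenant becomes the primary executor.
                entry = {"event": threading.Event(), "result": None,
                         "primary": tenant_id, "beneficiaries": []}
                self._entries[fp] = entry
                is_primary = True
            else:
                # Cache hit (warm or in-flight): subscribe to the promise.
                entry["beneficiaries"].append(tenant_id)
                is_primary = False
        if is_primary:
            entry["result"] = tool_fn()   # the expensive call runs once
            entry["event"].set()          # resolve the promise
            return entry["result"], "primary"
        entry["event"].wait()  # returns immediately on a warm cache hit
        return entry["result"], "beneficiary"
```

Because the promise is created under the lock before the tool runs, fifty simultaneous requests for the same fingerprint yield one execution and forty-nine subscribers, which is precisely the thundering-herd protection described above.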
3. Data Isolation at the Response Layer
The most important compliance and security constraint: the shared result cache must never leak tenant-specific data between tenants. Meridian addressed this with a two-layer response architecture.
Tool results are split into a "shared computation artifact" and a "tenant-contextualized response." The shared artifact contains the raw output of the tool (for example, the retrieved regulatory document text, or the resolved entity graph). The tenant-contextualized response is assembled by a lightweight, per-tenant post-processor that injects tenant-specific context, applies data access control filters, and formats the output for that tenant's agent runtime.
Only the shared computation artifact is cached and deduplicated. The post-processing step always runs per-tenant and is computationally cheap (no LLM calls involved). This architecture means that two tenants can share the cost of retrieving and summarizing a regulatory document, while still receiving responses that are correctly scoped to their own data access permissions.
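The split described above can be sketched as a per-tenant post-processor that consumes a cached shared artifact. The field names (`sections`, `allowed_sources`, and so on) are illustrative, not Meridian's actual schema; the structural point is that tenant context enters only after the dedup layer.

```python
def build_response(shared_artifact: dict, tenant_ctx: dict) -> dict:
    """Per-tenant post-processing: only shared_artifact is ever cached;
    everything tenant-specific is injected here, cheaply and without
    any LLM calls."""
    allowed = tenant_ctx["allowed_sources"]
    # Data access control filter: drop sections the tenant may not see.
    filtered = [s for s in shared_artifact["sections"]
                if s["source"] in allowed]
    return {
        "tenant_id": tenant_ctx["tenant_id"],
        "summary": shared_artifact["summary"],
        "sections": filtered,
    }
```

Two tenants with different access permissions can therefore share the cached artifact while receiving differently scoped responses.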
4. The Cost Attribution and Billing Reconciliation Layer
Deduplication creates a billing fairness problem: if Tenant A's query triggers an execution that Tenant B also benefits from, how do you allocate costs? Meridian solved this with a weighted attribution model:
- The primary executor pays 60% of the full execution cost.
- Each deduplication beneficiary pays an equal share of the remaining 40%, divided by the number of beneficiaries within the execution window.
- If a tool call is served entirely from a warm cache (no in-flight execution), the beneficiary pays only the cache retrieval cost, which is a flat fee set at roughly 8% of the average full execution cost for that tool tier.
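The attribution rules above reduce to a few lines of arithmetic. This is a sketch of the model as described, with one assumption made explicit: when a call is never deduplicated, the primary executor simply pays the full cost.

```python
def attribute_costs(full_cost, primary, beneficiaries,
                    primary_share=0.60):
    """Weighted attribution: primary pays 60%, beneficiaries split the
    remaining 40% equally. If nobody deduplicates within the window,
    the primary pays in full (assumed behavior, not stated in the
    source model)."""
    if not beneficiaries:
        return {primary: full_cost}
    charges = {primary: full_cost * primary_share}
    per_beneficiary = full_cost * (1 - primary_share) / len(beneficiaries)
    for tenant in beneficiaries:
        charges[tenant] = per_beneficiary
    return charges

def warm_cache_charge(full_cost, cache_fee_rate=0.08):
    """Flat retrieval fee for a call served entirely from warm cache."""
    return full_cost * cache_fee_rate
```

For a $100 execution with two beneficiaries, the primary pays $60 and each beneficiary $20; a warm-cache hit on the same tool tier costs a beneficiary $8.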
This model is exposed to customers in their usage dashboards as a line item called "Shared Compute Credits," which has actually become a minor selling point in enterprise sales conversations, demonstrating that Meridian's platform is cost-efficient by design.
Implementation Timeline and Rollout
The team did not rebuild everything from scratch. Here is the condensed timeline of how they shipped this in production:
- Week 1: Instrumentation pass. Added tool-level execution telemetry to generate the data needed to validate the duplication hypothesis and size the opportunity.
- Week 2: Fingerprinting and cache layer built and tested in a staging environment with synthetic multi-tenant load. Edge cases around parameter normalization (especially for date-range and threshold parameters) consumed most of this week.
- Week 3: Shadow mode deployment. The deduplication pipeline ran in parallel with the existing execution registry, logging what it would have deduplicated without actually changing behavior. This validated the 31% duplication rate estimate and confirmed the response isolation model was correct.
- Week 4: Phased production rollout, starting with the four highest-volume tools (regulatory retrieval, entity resolution, market data enrichment, and document classification). These four tools alone accounted for 74% of total tool execution volume.
- End of March 2026 billing cycle: First full-month bill under the new architecture. Total inference cost reduction: 40.2% versus the February 2026 baseline.
Key Metrics Before and After
To make the impact concrete, here is a summary of the before-and-after numbers from Meridian's internal engineering post-mortem:
- Tool execution volume: Down 38% (same user session volume, fewer redundant executions)
- Average tool call latency: Down 22% (cache hits are dramatically faster than live execution)
- Inference API spend: Down 40.2% month-over-month
- Cache hit rate: 34% of all tool calls served from deduplication cache in the first full month
- Data isolation incidents: Zero (confirmed by a third-party security audit conducted in parallel)
- Customer-facing response quality degradation: None reported; CSAT scores for the AI assistant feature were unchanged
Lessons for Other Multi-Tenant AI Teams
Meridian's story is not unique. Any team running a multi-tenant agentic platform with shared tool infrastructure is likely experiencing some version of this problem right now. Here are the takeaways that apply broadly:
Audit your tool execution layer independently
Do not wait for your aggregate bill to spike. Build telemetry that lets you measure tool call volume, cost, and deduplication potential as first-class metrics. If your tool execution growth rate is outpacing your session growth rate, start investigating immediately.
Tenant isolation does not require total execution isolation
Logical tenant isolation is a data and security concern. It does not require that every computation be physically isolated. The distinction between "what data a tenant can access" and "what computations are performed on their behalf" is the key architectural insight that unlocks cross-tenant deduplication safely.
Fingerprinting is harder than it looks
The most time-consuming part of Meridian's implementation was getting the parameter normalization right. Tool calls that look different on the surface (different date formats, different casing, slightly different phrasing) are often semantically identical. Invest in robust normalization before you invest in the cache infrastructure.
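To make the normalization problem concrete, here is a hypothetical per-field normalizer covering the two cases called out above, date formats and casing. The field-naming convention and format list are assumptions for illustration; a real pipeline needs per-tool parameter schemas.

```python
from datetime import datetime

def normalize_value(key, value):
    """Illustrative normalizers: canonicalize dates to ISO 8601 and
    collapse string casing/whitespace, so surface-different but
    semantically identical parameters hash to the same fingerprint."""
    if key.endswith("_date") and isinstance(value, str):
        # Accept a few common input formats, emit ISO 8601.
        for fmt in ("%Y-%m-%d", "%m/%d/%Y", "%d %b %Y"):
            try:
                return datetime.strptime(value, fmt).date().isoformat()
            except ValueError:
                continue
        return value  # unrecognized format: leave as-is, forfeit dedup
    if isinstance(value, str):
        return value.strip().lower()
    return value
```

Note the fallback: an unrecognized format is passed through rather than guessed at, which sacrifices a deduplication opportunity but never conflates two genuinely different calls.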
Shadow mode is non-negotiable before production
Running the deduplication pipeline in shadow mode for a full week before activating it in production was the decision that prevented a potential incident. Several edge cases in the promise-and-lock system only surfaced under real multi-tenant load patterns that synthetic testing had not covered.
Make the savings visible to customers
Turning deduplication credits into a visible line item on customer dashboards was a small UX decision with outsized commercial impact. It reframes shared infrastructure as a customer benefit rather than a platform implementation detail.
Conclusion: Your Dependency Graph Is Probably Leaking Money Too
The most sobering part of Meridian's story is how long the problem went undetected. For months, a well-engineered, well-monitored platform was silently burning roughly 40% more inference budget than it needed to, not because of bad code, but because of an architectural assumption (that tenant isolation required execution isolation) that had never been explicitly challenged.
As AI agent platforms mature through 2026, the teams that win on unit economics will not necessarily be the ones with the best models or the most features. They will be the ones who treat their tool execution layer as a first-class cost surface, instrument it with the same rigor they apply to their database query layer, and build the cross-cutting infrastructure to eliminate waste without compromising the security and isolation guarantees their customers depend on.
If you are running a multi-tenant agentic platform and you have not yet audited your per-tenant dependency graph for cross-tenant execution convergence, that audit is probably the highest-ROI engineering task on your backlog right now. The bill is already running. The question is just whether you know it yet.