A Beginner's Guide to Per-Tenant AI Agent Rate Limiting: Token Buckets, Quota Pipelines, and Stopping Noisy Neighbors Before They Starve Your Smallest Tenants

You launched your multi-tenant LLM platform. Onboarding is going great. Then one Tuesday morning, your support queue fills up with tickets from small customers saying the product feels "slow" or "broken." Meanwhile, one of your enterprise tenants is happily running a batch AI agent job that is consuming 94% of your shared token budget. Welcome to the noisy-neighbor problem, one of the most underestimated challenges in building production-grade AI platforms today.

In 2026, as agentic AI workloads have become the norm rather than the exception, this problem has gotten dramatically worse. Agents don't just make one LLM call. They loop, they retry, they fan out into sub-agents, and they can generate thousands of tokens per second across dozens of concurrent tool calls. If your platform treats all tenants the same way at the infrastructure layer, you are essentially running a buffet where one customer has a forklift.

This guide is written for developers who are new to multi-tenant AI infrastructure. You don't need a distributed systems PhD to follow along. By the end, you will understand the core theory behind token bucket rate limiting, how to map it to per-tenant quota enforcement, and how to wire up a basic but production-worthy pipeline that protects your smallest tenants from being starved out.

Why Multi-Tenant LLM Platforms Are Uniquely Dangerous Without Rate Limiting

Traditional SaaS multi-tenancy is hard enough. But LLM platforms introduce a dimension that most rate limiting literature doesn't cover: variable-cost, bursty, stateful consumption. A single API call to a REST endpoint might cost 1 millisecond of CPU. A single agentic LLM call might cost 4,000 input tokens, spawn 3 tool calls, generate 1,200 output tokens, and take 8 seconds of wall-clock time. The cost variance between your cheapest and most expensive operations can be four orders of magnitude.

This creates three specific failure modes you need to design against:

  • Token starvation: Upstream LLM providers (like OpenAI, Anthropic, or Google Gemini) enforce their own rate limits at the API key or organization level. If one tenant burns through your shared TPM (tokens per minute) budget, every other tenant gets throttled or rejected, even if they are making perfectly reasonable requests.
  • Latency inflation: Even before hard limits are hit, a noisy tenant can saturate your internal queuing layer, adding hundreds of milliseconds of latency to requests from unrelated tenants.
  • Cost explosions: Without quota enforcement, a misconfigured agent loop (an infinite retry cycle, a runaway tool-calling chain) can generate a five-figure LLM invoice overnight. Without per-tenant attribution and hard caps, you may not even know which tenant caused it until the bill arrives.

The Core Concept: What Is a Token Bucket?

Before writing a single line of code, you need to understand the token bucket algorithm. It is the most widely used rate limiting primitive in production systems, and it maps beautifully onto LLM quota enforcement.

Imagine a physical bucket that holds tokens (not LLM tokens, just abstract units of capacity). The bucket has two properties:

  • Capacity (burst size): The maximum number of tokens the bucket can hold at any time.
  • Refill rate: The rate at which tokens are added back to the bucket over time (e.g., 1,000 tokens per minute).

When a tenant makes a request, you check their bucket. If there are enough tokens available, you deduct the cost of the request and allow it through. If there are not enough tokens, you either reject the request immediately (hard limiting) or queue it until the bucket refills (soft limiting / backpressure).

Here is the key insight that makes token buckets great for LLM platforms: you can use LLM tokens themselves as the unit of consumption. Instead of counting requests, you count estimated or actual tokens consumed. A request that uses 500 tokens costs 500 units from the bucket. A request that uses 5,000 tokens costs 5,000 units. This gives you a much more accurate model of real resource consumption than simple request counting.
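To make the mechanics concrete, here is a minimal single-process sketch of a token bucket denominated in LLM tokens. The class name and numbers are illustrative, and a real deployment would keep this state in a shared store (as the Redis version later in this guide does):

```python
import time

class InMemoryTokenBucket:
    """Single-process token bucket where cost is measured in LLM tokens."""

    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity          # burst size
        self.refill_rate = refill_rate    # tokens added per second
        self.tokens = capacity            # start full
        self.last = time.monotonic()

    def consume(self, token_cost: float) -> bool:
        now = time.monotonic()
        # Lazy refill: credit tokens for the elapsed time, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= token_cost:
            self.tokens -= token_cost
            return True
        return False

# 10,000-token burst, refilling at 10,000 tokens per minute
bucket = InMemoryTokenBucket(capacity=10_000, refill_rate=10_000 / 60)
bucket.consume(500)     # allowed: fits in the burst
bucket.consume(9_500)   # allowed: drains the rest of the burst
bucket.consume(100)     # rejected: bucket is empty until it refills
```

Note that refill happens lazily on each `consume` call rather than on a timer, which is the same trick the Redis-backed version uses to avoid background jobs.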

Token Bucket vs. Other Algorithms: Why Token Bucket Wins Here

You may have heard of other rate limiting algorithms. Here is a quick comparison so you understand why token bucket is the right default choice for this use case:

  • Fixed window counter: Counts requests in a fixed time window (e.g., 100 requests per minute). Simple but suffers from boundary bursts. A tenant can make 100 requests at 11:59 and 100 more at 12:00, effectively doubling their rate at the window edge. Bad for LLM workloads.
  • Sliding window log: More accurate than fixed window, but requires storing a timestamp for every request. At LLM agent scale, this becomes a memory problem quickly.
  • Leaky bucket: Enforces a strictly constant output rate. Great for smoothing traffic, but it does not allow bursts, which is a problem for legitimate interactive use cases where a user might ask three questions in quick succession.
  • Token bucket: Allows controlled bursting up to the bucket capacity, then enforces a sustained rate via refill. This matches real user and agent behavior: bursty in the short term, bounded in the long term. It is the right fit.
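The boundary-burst weakness of the fixed window counter is easy to demonstrate with a toy implementation (function name and limits are made up for illustration):

```python
# Toy fixed-window counter, just to show the boundary burst
def fixed_window_allow(counts: dict, limit: int, ts: float,
                       window: int = 60) -> bool:
    w = int(ts // window)                 # which window this request lands in
    counts[w] = counts.get(w, 0) + 1
    return counts[w] <= limit

counts = {}
# 100 requests in the last tenth of a second of window 1: all pass
late = sum(fixed_window_allow(counts, 100, 119.9) for _ in range(100))
# 100 more in the first instant of window 2: all pass again
early = sum(fixed_window_allow(counts, 100, 120.0) for _ in range(100))
# Result: 200 requests accepted in 0.1 seconds, double the intended rate
```

A token bucket with the same nominal rate would have allowed at most one bucketful across that boundary.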

Designing Your Per-Tenant Quota Model

Before writing code, you need to decide what you are actually measuring and enforcing. A well-designed quota model for a multi-tenant LLM platform typically has three layers:

Layer 1: Tokens Per Minute (TPM) Bucket

This is your real-time rate limiter. It controls how fast a tenant can consume tokens in any given minute. This is the primary defense against noisy-neighbor problems because it prevents any single tenant from monopolizing your upstream API capacity in a short burst. A reasonable starting point might be:

  • Free tier: 10,000 TPM
  • Starter tier: 60,000 TPM
  • Pro tier: 200,000 TPM
  • Enterprise tier: Custom, negotiated

Layer 2: Tokens Per Day (TPD) Hard Cap

This is your cost protection layer. Even if a tenant stays within their per-minute rate, they should not be able to run continuously for 24 hours and generate an unbounded bill. A daily cap gives you a hard ceiling on per-tenant cost exposure. Think of it as the "budget" and the TPM bucket as the "flow rate."

Layer 3: Concurrent Request Limit

Agentic workloads are highly parallel. An agent might fan out into 20 simultaneous sub-agent calls. Even if each call is small, 20 concurrent connections to your LLM proxy can cause latency problems for other tenants. A concurrency limit (e.g., max 5 simultaneous in-flight requests per tenant on the starter tier) is a simple but powerful guard.
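A concurrency guard can be as simple as a counter that is checked on admission and decremented on completion. This is an in-process sketch; in a real multi-replica deployment the count would live in a shared store (for example an atomic Redis INCR/DECR with a safety TTL) so every proxy instance sees the same number:

```python
class TenantConcurrencyLimit:
    """In-process sketch of a per-tenant in-flight request cap."""

    def __init__(self, max_concurrent: int):
        self.max_concurrent = max_concurrent
        self.in_flight = 0

    def acquire(self) -> bool:
        if self.in_flight >= self.max_concurrent:
            return False          # reject (or queue) instead of admitting
        self.in_flight += 1
        return True

    def release(self) -> None:
        # Never go negative even if release is called twice
        self.in_flight = max(0, self.in_flight - 1)
```

Wrap each proxied LLM call in `acquire()`/`release()`, with `release()` in a `finally` block so a failed request still frees its slot.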

Building Your First Rate Limiting Pipeline: A Practical Walkthrough

Let's build a minimal but real per-tenant rate limiting pipeline. We will use Python for the logic and Redis as our state store, since Redis is the industry standard for distributed rate limiting. This example is intentionally simplified to focus on the core concepts.

Step 1: Store Per-Tenant Bucket State in Redis

Each tenant gets their own bucket state in Redis. We store two values: the current token count and the timestamp of the last refill check. We use a Redis hash for this:


# Key pattern: ratelimit:{tenant_id}:tpm
# Fields: tokens (float), last_refill_ts (float unix timestamp)

Here is a Python class that encapsulates the token bucket logic:


import time
import redis

class TenantTokenBucket:
    def __init__(self, redis_client: redis.Redis, tenant_id: str,
                 capacity: int, refill_rate: float):
        """
        capacity: max tokens the bucket can hold (burst size)
        refill_rate: tokens added per second
        """
        self.redis = redis_client
        self.key = f"ratelimit:{tenant_id}:tpm"
        self.capacity = capacity
        self.refill_rate = refill_rate

    def _refill(self, current_tokens: float, last_ts: float, now: float) -> float:
        elapsed = now - last_ts
        refilled = elapsed * self.refill_rate
        return min(self.capacity, current_tokens + refilled)

    def consume(self, token_cost: int, _retries: int = 3) -> dict:
        now = time.time()
        pipe = self.redis.pipeline()

        try:
            pipe.watch(self.key)  # optimistic locking
            raw = pipe.hgetall(self.key)
            current_tokens = float(raw.get(b"tokens", self.capacity))
            last_ts = float(raw.get(b"last_refill_ts", now))

            # Refill based on elapsed time
            available = self._refill(current_tokens, last_ts, now)

            if available >= token_cost:
                new_tokens = available - token_cost
                allowed = True
            else:
                new_tokens = available
                allowed = False

            pipe.multi()
            pipe.hset(self.key, mapping={
                "tokens": new_tokens,
                "last_refill_ts": now
            })
            pipe.expire(self.key, 3600)  # TTL: clean up inactive tenants
            pipe.execute()

            return {
                "allowed": allowed,
                "tokens_remaining": new_tokens,
                "tokens_requested": token_cost,
                "retry_after_seconds": (
                    (token_cost - available) / self.refill_rate
                    if not allowed else 0
                )
            }
        except redis.WatchError:
            # Another writer touched the key between WATCH and EXEC.
            # Retry a bounded number of times instead of recursing forever.
            if _retries > 0:
                return self.consume(token_cost, _retries - 1)
            raise
        finally:
            pipe.reset()  # release the WATCH even on error paths

Step 2: Estimate Token Cost Before the LLM Call

One of the practical challenges of LLM rate limiting is that you don't always know the exact token cost until after the call completes (because output tokens are variable). You have two strategies:

  • Pre-deduct with estimation: Use a tokenizer (like tiktoken for OpenAI models) to count input tokens before the call, deduct an estimated total (input + a conservative output estimate), then reconcile the actual cost afterward. This is the preferred approach for hard real-time enforcement.
  • Post-deduct: Allow the call, then deduct the actual tokens used. Simpler, but allows a tenant to exceed their limit on any given call. Acceptable for soft enforcement on trusted tenants.

A simple pre-estimation wrapper looks like this:


import tiktoken

def estimate_tokens(messages: list, model: str = "gpt-4o",
                    output_buffer: int = 512) -> int:
    """
    Estimate total token cost for a chat completion request.
    Adds an output_buffer for expected response tokens.
    """
    enc = tiktoken.encoding_for_model(model)
    total = 0
    for msg in messages:
        total += 4  # per-message overhead
        total += len(enc.encode(msg.get("content", "")))
    total += output_buffer
    return total

Step 3: Wire It Into Your LLM Proxy Layer

The rate limiter should sit as middleware in your LLM proxy, the service that sits between your application and the upstream LLM provider. Here is a simplified FastAPI middleware example:


from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
import redis

app = FastAPI()
redis_client = redis.Redis(host="localhost", port=6379, db=0)

TENANT_CONFIGS = {
    "free":    {"capacity": 10_000,  "refill_rate": 10_000 / 60},
    "starter": {"capacity": 60_000,  "refill_rate": 60_000 / 60},
    "pro":     {"capacity": 200_000, "refill_rate": 200_000 / 60},
}

@app.middleware("http")
async def rate_limit_middleware(request: Request, call_next):
    tenant_id = request.headers.get("X-Tenant-ID")
    tenant_tier = request.headers.get("X-Tenant-Tier", "free")

    if not tenant_id:
        # An HTTPException raised inside middleware bypasses FastAPI's
        # exception handlers, so return a response object directly.
        return JSONResponse(status_code=400,
                            content={"error": "missing_tenant_id"})

    # Parse request body to estimate tokens. Caveat: reading the body in
    # middleware consumes the stream, so a production deployment must
    # re-inject it for downstream handlers (or do this check in a
    # dependency instead).
    body = await request.json()
    messages = body.get("messages", [])
    estimated_cost = estimate_tokens(messages)

    config = TENANT_CONFIGS.get(tenant_tier, TENANT_CONFIGS["free"])
    bucket = TenantTokenBucket(
        redis_client, tenant_id,
        capacity=config["capacity"],
        refill_rate=config["refill_rate"]
    )

    result = bucket.consume(estimated_cost)

    if not result["allowed"]:
        return JSONResponse(
            status_code=429,
            content={
                "error": "rate_limit_exceeded",
                "retry_after_seconds": result["retry_after_seconds"],
                "tokens_remaining": result["tokens_remaining"]
            },
            headers={"Retry-After": str(int(result["retry_after_seconds"]) + 1)}
        )

    return await call_next(request)

Handling the Reconciliation Problem: Actual vs. Estimated Tokens

Your pre-deduction estimate will rarely be exactly right. Output tokens are inherently unpredictable. You need a reconciliation step that runs after each successful LLM call to adjust the bucket based on actual usage. Most LLM provider APIs return token counts in the response object (e.g., usage.total_tokens in the OpenAI response format).

The reconciliation flow looks like this:

  1. Pre-deduct the estimated token cost before the LLM call.
  2. Make the LLM call.
  3. Read the actual token count from the response's usage field.
  4. Calculate the delta: actual_cost - estimated_cost.
  5. If actual was higher, deduct the additional tokens from the bucket. If actual was lower, refund the unused tokens back to the bucket (up to the capacity ceiling).

This keeps your accounting accurate without blocking the hot path with a second round-trip. You can also use this data to improve your estimation models over time by logging the estimation error per model and prompt type.
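The refund-or-deduct arithmetic from the steps above can be isolated into two small pure functions (hypothetical names; in production the adjustment would be applied atomically to the Redis hash, for example with HINCRBYFLOAT). One design choice worth making explicit: an over-run can push the bucket below zero, carrying the debt into the next refill window:

```python
def reconciliation_delta(estimated_cost: int, actual_cost: int) -> int:
    """Positive: the call cost more than pre-deducted; negative: refund due."""
    return actual_cost - estimated_cost

def apply_reconciliation(tokens: float, capacity: float, delta: int) -> float:
    """Adjust the bucket level after the call. Refunds are capped at
    capacity; an over-run may push the bucket negative (a debt the next
    refill pays down first)."""
    return min(capacity, tokens - delta)

# Estimated 1,200 tokens, actually used 1,000: refund 200 to the bucket
delta = reconciliation_delta(1_200, 1_000)    # -200
apply_reconciliation(500.0, 10_000.0, delta)  # bucket rises from 500 to 700
```

Logging each `delta` value per model gives you exactly the estimation-error data mentioned above.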

Adding the Daily Hard Cap Layer

The TPM bucket handles real-time flow control, but you also need a daily spending cap. This is simpler than the token bucket: it is just a counter with a daily TTL in Redis.


import datetime

class TenantDailyQuota:
    def __init__(self, redis_client: redis.Redis, tenant_id: str,
                 daily_limit: int):
        today = datetime.date.today().isoformat()
        self.key = f"quota:{tenant_id}:daily:{today}"
        self.redis = redis_client
        self.daily_limit = daily_limit

    def check_and_increment(self, token_cost: int) -> dict:
        pipe = self.redis.pipeline()
        pipe.incrby(self.key, token_cost)
        pipe.expire(self.key, 86400)  # 24-hour TTL (key is also date-scoped)
        results = pipe.execute()
        new_total = results[0]

        allowed = new_total <= self.daily_limit
        if not allowed:
            # Roll back so rejected requests do not inflate the counter
            self.redis.decrby(self.key, token_cost)
            new_total -= token_cost

        return {
            "allowed": allowed,
            "daily_used": new_total,
            "daily_limit": self.daily_limit,
            "daily_remaining": max(0, self.daily_limit - new_total)
        }

In your middleware, run the daily quota check first (it is cheaper) and the token bucket check second. If either rejects the request, return a 429 with an informative error body that tells the tenant whether they hit a rate limit (temporary) or a quota cap (resets tomorrow).
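That ordering and the two distinct error bodies can be sketched as a small decision function that consumes the result dicts from the two checks (field names follow the examples above; the reason strings are illustrative):

```python
def admission_decision(daily_result: dict, bucket_result: dict) -> dict:
    """Combine the two checks. The daily quota check ran first because it
    is cheaper; the error body tells the tenant which limit they hit."""
    if not daily_result["allowed"]:
        return {
            "allowed": False,
            "error": "daily_quota_exceeded",   # hard cap, resets tomorrow
            "daily_remaining": daily_result.get("daily_remaining", 0),
        }
    if not bucket_result["allowed"]:
        return {
            "allowed": False,
            "error": "rate_limit_exceeded",    # temporary, retry shortly
            "retry_after_seconds": bucket_result.get("retry_after_seconds", 1),
        }
    return {"allowed": True}
```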

Observability: You Can't Protect What You Can't See

Rate limiting without observability is like installing a circuit breaker in a dark room. You need to be able to see, in real time, which tenants are consuming how much capacity. At minimum, instrument the following metrics:

  • tokens_consumed_total (counter, labeled by tenant_id and tier): The total tokens consumed since platform launch, per tenant.
  • rate_limit_rejections_total (counter, labeled by tenant_id and rejection_type): How many requests were rejected, and whether it was a TPM limit or a daily quota cap.
  • bucket_fill_ratio (gauge, labeled by tenant_id): The current fill level of each tenant's token bucket as a percentage of capacity. A tenant consistently at 0-5% fill is being rate-limited heavily and may need a tier upgrade.
  • estimated_vs_actual_token_delta (histogram): Tracks your estimation accuracy. A large consistent delta means your estimation model needs tuning.

Emit these metrics to your observability stack (Prometheus, Datadog, or whatever you use) and build a dashboard that shows you the top 10 token consumers in real time. This is the single most valuable operational tool you can build for a multi-tenant LLM platform.
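As a stand-in for a real exporter (a Prometheus client library, a Datadog agent), here is a tiny in-process recorder mirroring the four metrics above; all names are illustrative:

```python
from collections import defaultdict

class TenantMetrics:
    """Minimal in-process recorder for per-tenant consumption metrics."""

    def __init__(self):
        self.tokens_consumed_total = defaultdict(int)  # (tenant, tier) -> counter
        self.rejections_total = defaultdict(int)       # (tenant, type) -> counter
        self.bucket_fill_ratio = {}                    # tenant -> gauge, 0.0..1.0
        self.estimation_deltas = []                    # histogram samples

    def record_call(self, tenant: str, tier: str, tokens: int,
                    fill_ratio: float, est_delta: int) -> None:
        self.tokens_consumed_total[(tenant, tier)] += tokens
        self.bucket_fill_ratio[tenant] = fill_ratio
        self.estimation_deltas.append(est_delta)

    def record_rejection(self, tenant: str, rejection_type: str) -> None:
        self.rejections_total[(tenant, rejection_type)] += 1

    def top_consumers(self, n: int = 10):
        # The "top 10 token consumers" dashboard query, in miniature
        return sorted(self.tokens_consumed_total.items(),
                      key=lambda kv: kv[1], reverse=True)[:n]
```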

Common Beginner Mistakes to Avoid

Here are the pitfalls that trip up most teams building this for the first time:

  • Using a single global rate limiter: If you have one Redis key for your entire platform's token budget and subtract from it per-request, you have a global rate limiter, not a per-tenant one. Make sure every bucket key is namespaced by tenant ID.
  • Forgetting streaming responses: If you use streaming (Server-Sent Events or chunked responses), the token count is not available until the stream ends. You need to either buffer the stream to count tokens before forwarding, or use a post-deduct model for streaming calls specifically.
  • Setting burst capacity too low: If your bucket capacity equals your per-minute refill rate, you have effectively eliminated burst allowance. Agents legitimately need to burst. A good rule of thumb is to set burst capacity at 2 to 3 times the per-minute refill rate.
  • Not communicating limits to tenants: Your 429 responses should always include Retry-After, X-RateLimit-Remaining, and X-RateLimit-Reset headers. Tenants and their developers need this information to implement proper backoff. An opaque 429 is a support ticket waiting to happen.
  • Ignoring agent-specific patterns: Agentic loops often implement their own retry logic. If your rate limiter rejects a request, the agent may immediately retry, creating a thundering herd against your limiter. Always return a retry_after value and encourage tenants to implement exponential backoff with jitter.
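On the client side, the backoff advice from the last bullet looks something like this sketch (parameter defaults are arbitrary):

```python
import random
from typing import Optional

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  retry_after: Optional[float] = None) -> float:
    """Exponential backoff with full jitter. When the 429 body carried a
    retry_after hint, honor it, plus jitter to break up synchronized herds."""
    if retry_after is not None:
        return retry_after + random.uniform(0, base)
    # Full jitter: pick uniformly from [0, min(cap, base * 2^attempt)]
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```

Agent frameworks that retry automatically should call this between attempts rather than hammering the limiter immediately.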

What Comes Next: Beyond the Basics

Once you have the basic per-tenant token bucket and daily quota pipeline working, there are several natural next steps to explore as your platform matures:

  • Priority queuing: Instead of rejecting requests when a bucket is empty, queue them with a priority based on tenant tier. Enterprise tenants get their queued requests processed before free-tier tenants.
  • Adaptive limits: Use historical consumption data to dynamically adjust bucket sizes. A tenant who consistently uses only 20% of their quota might get a temporary boost during off-peak hours.
  • Per-model limits: Different models have vastly different costs. GPT-4o costs significantly more per token than a smaller model. Consider separate buckets per model family, not just per tenant.
  • Tenant-level circuit breakers: If a tenant's agent is stuck in an infinite retry loop, a circuit breaker can open after a threshold of consecutive failures and block that tenant's requests entirely for a cooldown period, protecting both the tenant from runaway costs and the platform from the load.
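A tenant-level circuit breaker can be sketched in a few lines; the thresholds are illustrative, and in a multi-replica deployment the state would live in Redis rather than in process memory:

```python
import time
from typing import Optional

class TenantCircuitBreaker:
    """Opens after N consecutive failures, rejects the tenant's requests
    for a cooldown period, then lets traffic retry."""

    def __init__(self, failure_threshold: int = 5,
                 cooldown_seconds: float = 60.0):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.consecutive_failures = 0
        self.opened_at: Optional[float] = None

    def allow(self, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        if self.opened_at is None:
            return True
        if now - self.opened_at >= self.cooldown_seconds:
            # Cooldown over: close the breaker and let traffic retry
            self.opened_at = None
            self.consecutive_failures = 0
            return True
        return False

    def record(self, success: bool, now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        if success:
            self.consecutive_failures = 0
        elif self.opened_at is None:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.opened_at = now  # trip the breaker
```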

Conclusion: Rate Limiting Is an Act of Fairness

It is tempting to frame per-tenant rate limiting as a defensive, infrastructure-level concern. But there is a more human way to think about it: rate limiting is how you make a promise to your smallest customers that they will always get a fair share of the platform they are paying for.

In 2026, as agentic AI workloads become more autonomous, more parallel, and more capable of generating enormous token volumes without human oversight, the gap between a well-protected multi-tenant platform and an unprotected one will only grow wider. The noisy-neighbor problem is not going away. If anything, it is getting noisier.

The good news is that the core primitives (token buckets, Redis counters, estimation pipelines, and observability metrics) are well understood, battle-tested, and absolutely within reach for a developer who is just getting started. You do not need to build a perfect system on day one. You need to build a system that is better than nothing, instrument it well so you can see what is happening, and iterate from there.

Start with a single token bucket per tenant. Add a daily cap. Emit two or three metrics. Deploy it. You will be surprised how much protection even this minimal setup provides, and you will have a solid foundation to build everything else on top of.