How to Build a Per-Tenant AI Agent Secret and API Credential Rotation Pipeline That Automatically Reissues Foundation Model Provider Keys Across Active Agentic Workflows Without Dropping In-Flight Tasks

Scott Miller

Apr 2, 2026

In 2026, agentic AI systems are no longer a novelty. They are the operational backbone of SaaS platforms, enterprise automation suites, and developer tooling. Thousands of concurrent AI agents, each acting on behalf of a specific tenant, are calling foundation model providers like OpenAI, Anthropic, Google Gemini, and Mistral around the clock. With that scale comes a brutal operational reality: API credentials expire, get compromised, or need rotation, and your agents cannot afford to stop working when that happens.

The challenge is deceptively complex. A naive "swap the key and restart the service" approach might work for a monolithic app, but in a multi-tenant agentic system, you are dealing with dozens of long-running task graphs, stateful tool-call chains, and tenant-isolated credential scopes all executing simultaneously. Kill those in-flight workflows and you break trust with customers, corrupt task state, and potentially trigger cascading failures.

This guide walks you through building a production-grade, per-tenant credential rotation pipeline that gracefully reissues foundation model provider keys across active agentic workflows, with zero task drops and full auditability. We will cover the architecture, the state machine design, the secret store integration, and the graceful handoff protocol step by step.

Why Standard Secret Rotation Patterns Break for Agentic Workloads

Most secret rotation guides assume a stateless consumer: a web server that picks up a new database password from Vault and reconnects on the next request. Agentic workloads break that assumption in several ways:

Long-lived execution contexts: A single agent task may run for minutes or hours, holding an in-memory reference to a credential object established at task start.
Nested tool calls: An agent orchestrating sub-agents or tool chains may have credentials injected several layers deep in a call stack that cannot be trivially refreshed mid-execution.
Tenant isolation requirements: In a multi-tenant system, rotating one tenant's key must never touch another tenant's runtime context.
Provider-side key invalidation windows: Providers differ in how long an old key remains valid after a new one is issued, and some leave revocation entirely to the caller. Your pipeline must encode each provider's behavior and respect it precisely.
Audit and compliance trails: Regulated industries require a full log of who held which credential, when it was rotated, and which tasks executed under which key version.

The solution is not to patch a generic rotation script. It requires a purpose-built pipeline with four distinct layers: a secret lifecycle manager, a credential context broker, a task checkpoint and resume system, and a rotation event bus.

The Architecture Overview

Before diving into implementation, here is the high-level architecture you are building:

Secret Store: HashiCorp Vault (or AWS Secrets Manager / Azure Key Vault) with per-tenant secret paths and versioning enabled.
Rotation Scheduler: A time-based or policy-driven service that initiates rotation events per tenant.
Credential Context Broker (CCB): A lightweight sidecar or service that injects credentials into agent runtimes and handles live swap notifications.
Agent Task State Machine: Each agent task tracks its own credential version and knows how to checkpoint and resume across a key swap.
Rotation Event Bus: An internal message bus (Kafka, Redis Streams, or NATS) that broadcasts rotation lifecycle events to all active agent workers.
Provider Key Reissue Adapter: A provider-specific module that calls the foundation model provider's management API to issue a new key and schedule old key revocation.

Step 1: Design Your Per-Tenant Secret Path Schema

Every tenant must have an isolated secret namespace. In Vault, this looks like a KV v2 mount per environment, with paths structured as follows:

secret/tenants/{tenant_id}/providers/{provider_name}/api_key

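Versioning on these paths is non-negotiable. KV v2 keeps prior versions of every path, so confirm that the mount's max-versions setting is high enough that an outgoing version is never pruned mid-rotation. Versioning is what allows your agents to hold a reference to version: 4 of a credential while the rotation pipeline writes version: 5 and waits for the grace period before destroying version 4.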

Store the following metadata alongside each credential version:

issued_at: ISO 8601 timestamp of when this key was issued.
expires_at: Hard expiry enforced by your rotation policy (not necessarily by the provider).
provider_key_id: The provider-side identifier for this key (used when calling the revocation API).
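rotation_status: One of active, draining, or revoked. KV v2 versions are immutable once written, so a later status change cannot be written into an existing version; record status changes in the rotation state store or in path-level custom metadata keyed by version number.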
tenant_id: Redundant but useful for audit queries.

Here is a Vault CLI example to write an initial credential with metadata:

vault kv put secret/tenants/tenant-abc/providers/openai/api_key \
  value="sk-live-xxxx" \
  issued_at="2026-03-01T00:00:00Z" \
  expires_at="2026-03-31T00:00:00Z" \
  provider_key_id="key_abc123" \
  rotation_status="active" \
  tenant_id="tenant-abc"
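
If the path and field names are assembled by hand in several services, a typo or an unvalidated tenant identifier can point a write at the wrong namespace. A minimal sketch of a shared helper follows. The character-set rule, the function name secret_path, and the CredentialRecord class are assumptions for illustration, not part of Vault or any client library.

```python
import re
from dataclasses import asdict, dataclass

# Assumption: identifiers are limited to a conservative character set so a
# tenant_id or provider name can never introduce extra path segments.
_SAFE_SEGMENT = re.compile(r"^[a-z0-9][a-z0-9-]{0,62}$")

_STATUSES = {"active", "draining", "revoked"}


def secret_path(tenant_id: str, provider: str) -> str:
    """Build the per-tenant secret path from the schema shown earlier."""
    for name, value in (("tenant_id", tenant_id), ("provider", provider)):
        if not _SAFE_SEGMENT.fullmatch(value):
            raise ValueError(f"unsafe {name}: {value!r}")
    return f"secret/tenants/{tenant_id}/providers/{provider}/api_key"


@dataclass(frozen=True)
class CredentialRecord:
    """The fields stored with each credential version."""

    value: str
    issued_at: str
    expires_at: str
    provider_key_id: str
    rotation_status: str
    tenant_id: str

    def __post_init__(self) -> None:
        if self.rotation_status not in _STATUSES:
            raise ValueError(f"unknown rotation_status: {self.rotation_status!r}")

    def to_fields(self) -> dict[str, str]:
        """Return the key/value pairs to hand to the secret store client."""
        return asdict(self)
```

Calling secret_path("tenant-abc", "openai") yields the same path used in the CLI example, while an identifier such as "../other" is rejected before it reaches the secret store.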

Step 2: Build the Provider Key Reissue Adapter

Each foundation model provider has a different management API for creating and revoking API keys. You need a thin adapter layer that normalizes this into a consistent interface. Define a base interface (shown here in Python, but the pattern applies to any language):

from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class IssuedCredential:
    key_value: str
    provider_key_id: str
    issued_at: str
    expires_at: str

class ProviderKeyAdapter(ABC):

    @abstractmethod
    def issue_new_key(self, tenant_id: str, label: str) -> IssuedCredential:
        """Call the provider API to create a new API key."""
        pass

    @abstractmethod
    def revoke_key(self, provider_key_id: str) -> bool:
        """Revoke a specific key by its provider-side ID."""
        pass

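Implement a concrete adapter for each provider. For example, an OpenAI adapter would call an organization-level key management endpoint such as POST /organization/api_keys to issue a key and DELETE /organization/api_keys/{key_id} to revoke it. Treat these paths as illustrative and confirm them against the provider's current API reference, since key management endpoints vary by provider and change over time. For providers that do not expose a programmatic key management API (some still do not in 2026), your adapter falls back to a human-in-the-loop approval step, which emits a Slack or webhook notification and pauses the rotation pipeline until a human confirms the new key has been entered. Register the concrete adapters in a lookup table keyed by provider name: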

PROVIDER_ADAPTERS = {
    "openai": OpenAIKeyAdapter(),
    "anthropic": AnthropicKeyAdapter(),
    "gemini": GeminiKeyAdapter(),
    "mistral": MistralKeyAdapter(),
}
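
Before wiring real provider calls, it helps to have an adapter that satisfies the same interface without touching the network, so the rest of the pipeline can be exercised in tests. The sketch below is one way to do that. InMemoryKeyAdapter, its is_live helper, and the sk-test- key format are hypothetical, and the interface is repeated only so the block runs on its own.

```python
import secrets
from abc import ABC, abstractmethod
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class IssuedCredential:
    key_value: str
    provider_key_id: str
    issued_at: str
    expires_at: str


class ProviderKeyAdapter(ABC):
    @abstractmethod
    def issue_new_key(self, tenant_id: str, label: str) -> IssuedCredential: ...

    @abstractmethod
    def revoke_key(self, provider_key_id: str) -> bool: ...


class InMemoryKeyAdapter(ProviderKeyAdapter):
    """Hypothetical test double: keys live in a dict, no provider is called."""

    def __init__(self, ttl_days: int = 30) -> None:
        self._ttl = timedelta(days=ttl_days)
        self._live: dict[str, str] = {}  # provider_key_id -> key_value

    def issue_new_key(self, tenant_id: str, label: str) -> IssuedCredential:
        # tenant_id and label are accepted for interface parity; a real
        # adapter would pass them to the provider as the key's name or tag.
        now = datetime.now(timezone.utc)
        key_id = f"key_{secrets.token_hex(8)}"
        key_value = f"sk-test-{secrets.token_urlsafe(24)}"
        self._live[key_id] = key_value
        return IssuedCredential(
            key_value=key_value,
            provider_key_id=key_id,
            issued_at=now.isoformat(),
            expires_at=(now + self._ttl).isoformat(),
        )

    def revoke_key(self, provider_key_id: str) -> bool:
        # False for unknown or already-revoked keys, so a retried
        # revocation is harmless.
        return self._live.pop(provider_key_id, None) is not None

    def is_live(self, provider_key_id: str) -> bool:
        return provider_key_id in self._live
```

Making revoke_key safe to call twice matters because the rotation pipeline may retry the REVOKING step after a crash.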

Step 3: Implement the Rotation Scheduler

The rotation scheduler is responsible for deciding when to rotate a credential. It should support three trigger modes:

Time-based rotation: Rotate every N days regardless of usage (e.g., every 30 days per your security policy).
Event-triggered rotation: Rotate immediately on a security incident signal, a compromised key alert from a provider, or a SIEM alert.
Usage-threshold rotation: Rotate after a credential has been used for a certain number of API calls or a certain spend threshold, useful for budget isolation per tenant.

A minimal scheduler loop in pseudocode looks like this:

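import asyncio
import logging

logger = logging.getLogger(__name__)

async def rotation_scheduler_loop():
    while True:
        tenants = await fetch_all_active_tenants()
        for tenant in tenants:
            for provider in tenant.configured_providers:
                try:
                    credential = await secret_store.get_current_credential(
                        tenant.id, provider
                    )
                    if should_rotate(credential):
                        await rotation_pipeline.initiate(tenant.id, provider)
                except Exception:
                    # One tenant's failure must not stall rotation for the rest.
                    logger.exception("rotation check failed for %s/%s", tenant.id, provider)
        await asyncio.sleep(SCHEDULER_INTERVAL_SECONDS)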

The should_rotate function checks the expires_at field against the current time (with a configurable lead time, e.g., rotate 48 hours before expiry), and also checks for any pending security events in your event bus.
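
A minimal sketch of that check follows. It takes the expires_at string and a security-event flag directly rather than a credential object, which is a simplification for illustration, and the 48-hour default simply mirrors the example lead time.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumption: 48 hours mirrors the example lead time mentioned in the text.
ROTATION_LEAD_TIME = timedelta(hours=48)


def should_rotate(
    expires_at: str,
    has_pending_security_event: bool = False,
    now: Optional[datetime] = None,
    lead_time: timedelta = ROTATION_LEAD_TIME,
) -> bool:
    """Return True when a credential is due for rotation."""
    # A security event always wins, regardless of how fresh the key is.
    if has_pending_security_event:
        return True
    current = now or datetime.now(timezone.utc)
    # Older Python versions reject a trailing "Z" in fromisoformat(),
    # so normalize it to an explicit UTC offset first.
    expiry = datetime.fromisoformat(expires_at.replace("Z", "+00:00"))
    return current >= expiry - lead_time
```

Passing now explicitly keeps the function deterministic in tests; in production the scheduler would leave it unset.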

Step 4: Design the Rotation State Machine

This is the most critical piece of the entire pipeline. Each rotation event for a given (tenant_id, provider) pair moves through a well-defined state machine. The states are:

PENDING: Rotation has been triggered; no action taken yet.
ISSUING: The pipeline is calling the provider adapter to create a new key.
DRAINING: The new key has been written to the secret store. Existing in-flight tasks are still using the old key. New tasks will receive the new key. A drain timer is running.
CHECKPOINTING: The pipeline is signaling active tasks to checkpoint their state before the old key is revoked.
REVOKING: The old key is being revoked at the provider level.
COMPLETE: Rotation is done. Old key version is marked revoked in the secret store.
FAILED: An error occurred. The pipeline halts and alerts are sent. The old key remains active.

Persist this state machine in a durable store (Postgres, DynamoDB, or Redis with AOF persistence) so that a pipeline crash does not lose rotation progress. Each state transition is an atomic write with an idempotency key of {tenant_id}:{provider}:{rotation_id}:{target_state}, so a retried transition is recognized as a replay rather than applied twice.
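
The states can be encoded with an explicit table of allowed transitions, so an out-of-order event is rejected instead of silently advancing a rotation. The sketch below is one reading of the list above: the transition table, including the direct DRAINING to REVOKING edge for rotations whose holders finish on their own, is an assumption, and RotationStore is an in-memory stand-in for the durable store.

```python
from enum import Enum


class RotationState(str, Enum):
    PENDING = "PENDING"
    ISSUING = "ISSUING"
    DRAINING = "DRAINING"
    CHECKPOINTING = "CHECKPOINTING"
    REVOKING = "REVOKING"
    COMPLETE = "COMPLETE"
    FAILED = "FAILED"


S = RotationState

# Assumed transition table; FAILED is reachable from every non-terminal state.
ALLOWED = {
    S.PENDING: {S.ISSUING, S.FAILED},
    S.ISSUING: {S.DRAINING, S.FAILED},
    S.DRAINING: {S.CHECKPOINTING, S.REVOKING, S.FAILED},
    S.CHECKPOINTING: {S.REVOKING, S.FAILED},
    S.REVOKING: {S.COMPLETE, S.FAILED},
    S.COMPLETE: set(),
    S.FAILED: set(),
}


class InvalidTransition(Exception):
    """Raised when a requested transition is not in the ALLOWED table."""


class RotationStore:
    """Hypothetical in-memory stand-in for Postgres, DynamoDB, or Redis."""

    def __init__(self) -> None:
        self._state: dict[str, RotationState] = {}

    def create(self, rotation_key: str) -> None:
        # setdefault makes a repeated create() a no-op.
        self._state.setdefault(rotation_key, S.PENDING)

    def state(self, rotation_key: str) -> RotationState:
        return self._state[rotation_key]

    def transition(self, rotation_key: str, target: RotationState) -> bool:
        """Apply a transition; return False if it was already applied."""
        current = self._state[rotation_key]
        if current == target:
            return False  # replay of an already-applied transition
        if target not in ALLOWED[current]:
            raise InvalidTransition(f"{current.value} -> {target.value}")
        self._state[rotation_key] = target
        return True
```

In a real store the read and write inside transition would need to be a single conditional update (for example, an UPDATE guarded by the expected current state) so that two pipeline workers cannot both apply it.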

Step 5: Build the Credential Context Broker

The Credential Context Broker (CCB) is the component that sits between your secret store and your agent runtime. It serves two functions:

Credential injection at task start: When an agent task is initialized for a given tenant, the CCB fetches the current active credential version and injects it into the task's execution context. It also registers the task as a "holder" of that credential version in a live registry.
Live swap notification: When the rotation pipeline enters the DRAINING state, the CCB receives a rotation event from the event bus and knows which tasks are holding the old credential version. It can then signal those tasks to checkpoint and resume with the new credential.

The live holder registry is a simple in-memory or Redis hash map:

# Key: (tenant_id, provider, credential_version)
# Value: set of task_ids currently holding this credential version

CREDENTIAL_HOLDERS = {
  ("tenant-abc", "openai", 4): {"task-001", "task-007", "task-023"},
  ("tenant-abc", "openai", 5): {"task-031"},  # new tasks already on v5
}

When a task completes or errors out, it deregisters itself from the registry. The rotation pipeline monitors this registry during the DRAINING state and only proceeds to REVOKING when the old version's holder set is empty (or the drain timeout has elapsed, forcing a checkpoint).
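
The registry operations described above can be sketched as a small class. This version is in-memory and single-process, and the class and method names are hypothetical; a shared deployment would keep the same shape on top of Redis sets.

```python
from collections import defaultdict

# (tenant_id, provider, credential_version)
HolderKey = tuple[str, str, int]


class HolderRegistry:
    """Tracks which tasks currently hold which credential version."""

    def __init__(self) -> None:
        self._holders: dict[HolderKey, set[str]] = defaultdict(set)

    def register(self, tenant_id: str, provider: str, version: int, task_id: str) -> None:
        self._holders[(tenant_id, provider, version)].add(task_id)

    def deregister(self, tenant_id: str, provider: str, version: int, task_id: str) -> None:
        key = (tenant_id, provider, version)
        holders = self._holders.get(key)
        if holders is None:
            return  # unknown key: deregistering twice is harmless
        holders.discard(task_id)
        if not holders:
            del self._holders[key]  # drop empty sets so the map stays small

    def holders(self, tenant_id: str, provider: str, version: int) -> frozenset[str]:
        return frozenset(self._holders.get((tenant_id, provider, version), ()))

    def is_drained(self, tenant_id: str, provider: str, version: int) -> bool:
        """True when no task holds this credential version any more."""
        return not self._holders.get((tenant_id, provider, version))
```

The rotation pipeline would poll is_drained for the outgoing version during DRAINING. Tasks that crash without deregistering are the reason the drain timeout exists, so the registry alone is not sufficient.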

Step 6: Implement Task Checkpointing and Resume

This is where in-flight task safety is actually guaranteed. Every agent task in your system must implement a checkpoint protocol. The checkpoint captures the full task state: the current step in the task graph, all tool call results accumulated so far, the conversation history, any intermediate outputs, and the credential version in use.

A checkpoint payload looks like this:

{
  "task_id": "task-007",
  "tenant_id": "tenant-abc",
  "provider": "openai",
  "credential_version": 4,
  "checkpoint_reason": "credential_rotation",
  "task_graph_step": 3,
  "accumulated_tool_results": [...],
  "conversation_history": [...],
  "intermediate_outputs": {...},
  "checkpointed_at": "2026-03-15T14:32:00Z"
}

Store checkpoints in a durable, tenant-isolated store (S3 with per-tenant prefixes, or a partitioned Postgres table). When the CCB signals a task to checkpoint, the task:

Finishes its current atomic operation (a single LLM call or tool call) without interruption.
Writes the checkpoint payload to the durable store.
Deregisters itself from the credential holder registry.
Signals the CCB that it has checkpointed successfully.
Fetches the new credential version from the CCB, which registers the task as a holder of that version.
Resumes execution from the checkpointed step using the new credential.

The key design rule here is: never interrupt a task mid-LLM-call. The checkpoint signal is queued and only acted upon at the next safe boundary, which is between tool calls or between reasoning steps. This is why your agent task loop must explicitly check for a pending checkpoint signal at each step boundary:

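async def agent_task_loop(task_context):
    while not task_context.is_complete():
        # Safe boundary: no LLM or tool call is in flight here, so a queued
        # rotation signal can be honored without interrupting work.
        if await ccb.has_pending_rotation_signal(task_context.task_id):
            # checkpoint() is expected to persist state, deregister the task
            # from the old version's holder set, and acknowledge the signal
            # so it is not seen again on the next iteration.
            await task_context.checkpoint()
            new_cred = await ccb.get_current_credential(
                task_context.tenant_id,
                task_context.provider
            )
            task_context.update_credential(new_cred)

        result = await execute_next_step(task_context)
        task_context.record_step_result(result)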
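
To see the safe-boundary rule in action, the loop can be run against a stand-in broker. The simulation below is a sketch only: FakeBroker, run_task, and the rotate_before_step parameter are hypothetical, the "LLM call" is a no-op, and checkpoints are kept in a list rather than a durable store.

```python
import asyncio


class FakeBroker:
    """Hypothetical stand-in for the CCB, just enough to drive one task."""

    def __init__(self) -> None:
        self.current_version = 4
        self.pending: set[str] = set()

    def rotate(self, task_ids: set[str]) -> None:
        # A new version becomes current and the listed tasks are signaled.
        self.current_version += 1
        self.pending |= task_ids

    async def has_pending_rotation_signal(self, task_id: str) -> bool:
        return task_id in self.pending

    async def acknowledge(self, task_id: str) -> int:
        # Clearing the signal prevents a checkpoint on every later step.
        self.pending.discard(task_id)
        return self.current_version


async def run_task(task_id: str, steps: int, broker: FakeBroker, rotate_before_step: int):
    version = broker.current_version
    checkpoints: list[dict] = []
    versions_used: list[int] = []
    for step in range(steps):
        if step == rotate_before_step:
            broker.rotate({task_id})  # simulate a rotation arriving mid-task
        # Step boundary: the only place the signal is examined.
        if await broker.has_pending_rotation_signal(task_id):
            checkpoints.append(
                {"task_id": task_id, "task_graph_step": step, "credential_version": version}
            )
            version = await broker.acknowledge(task_id)
        await asyncio.sleep(0)  # placeholder for one LLM or tool call
        versions_used.append(version)
    return versions_used, checkpoints


if __name__ == "__main__":
    used, saved = asyncio.run(run_task("task-007", 4, FakeBroker(), rotate_before_step=2))
    print(used)   # [4, 4, 5, 5]
    print(saved)  # one checkpoint, taken at step 2 under version 4
```

Every step runs under exactly one credential version, and the switch happens only between steps, which is the property the checkpoint protocol is meant to guarantee.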

Step 7: Wire Up the Rotation Event Bus

The event bus is the nervous system of the entire pipeline. Use a topic-per-tenant pattern to ensure tenant isolation at the messaging layer. With Kafka, this means topics named rotation.events.{tenant_id}. With NATS JetStream, use subject hierarchies like rotation.{tenant_id}.{provider}.

Publish the following event types during a rotation lifecycle:

rotation.initiated: Sent when the scheduler triggers a rotation. Payload includes tenant_id, provider, rotation_id.
rotation.new_key_ready: Sent when the new key has been written to the secret store. This transitions the state machine to DRAINING and signals the CCB to stop issuing the old key to new tasks.
rotation.checkpoint_requested: Broadcast to all agent workers holding the old key version, asking them to checkpoint at the next safe boundary.
rotation.drain_complete: Sent by the CCB when the old credential version's holder set is empty. Triggers the REVOKING transition.
rotation.complete: Final event. Consumed by audit logging, monitoring, and alerting systems.
rotation.failed: Sent on any unrecoverable error. Triggers paging and halts revocation.
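
A typed envelope keeps these events consistent across producers. The sketch below is bus-agnostic: the event_id and emitted_at fields and the topic_for helper are assumptions added for deduplication and ordering, and the serialized payload would be handed to whichever Kafka, Redis Streams, or NATS client you use.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone

EVENT_TYPES = frozenset({
    "rotation.initiated",
    "rotation.new_key_ready",
    "rotation.checkpoint_requested",
    "rotation.drain_complete",
    "rotation.complete",
    "rotation.failed",
})


def topic_for(tenant_id: str) -> str:
    """Tenant-scoped topic name, following the Kafka naming shown earlier."""
    return f"rotation.events.{tenant_id}"


@dataclass(frozen=True)
class RotationEvent:
    event_type: str
    tenant_id: str
    provider: str
    rotation_id: str
    # Assumed extras: a unique id for consumer-side deduplication and a
    # UTC timestamp for audit ordering.
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    emitted_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def __post_init__(self) -> None:
        if self.event_type not in EVENT_TYPES:
            raise ValueError(f"unknown event type: {self.event_type!r}")

    def to_bytes(self) -> bytes:
        """Serialize for the bus. Key material is never part of the payload."""
        return json.dumps(asdict(self), sort_keys=True).encode("utf-8")
```

Events carry identifiers only. Consumers that need the credential itself fetch it through the CCB, so a misrouted or logged event never exposes a key.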

Step 8: Handle the Grace Period and Revocation Window

Different providers behave differently when you issue a new key. Some invalidate the old key immediately upon issuing the new one. Others support a configurable overlap window. Your pipeline must encode this per provider. The values below are illustrative placeholders; verify each provider's actual behavior before relying on them:

PROVIDER_REVOCATION_POLICIES = {
    "openai":    {"overlap_seconds": 300,  "supports_delayed_revoke": True},
    "anthropic": {"overlap_seconds": 600,  "supports_delayed_revoke": True},
    "gemini":    {"overlap_seconds": 0,    "supports_delayed_revoke": False},
    "mistral":   {"overlap_seconds": 120,  "supports_delayed_revoke": True},
}

For providers with supports_delayed_revoke: True, your adapter schedules the revocation call after the overlap window has passed and after the drain is confirmed complete. For providers with supports_delayed_revoke: False (where issuing a new key immediately voids the old one), your pipeline must be more aggressive: it enters CHECKPOINTING before issuing the new key, forces all tasks to checkpoint and pause within a tight window (e.g., 30 seconds), and only then issues the new key, after which the paused tasks resume on it. This is the most disruptive scenario, and you should flag these providers in your tenant onboarding flow.
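
For the delayed-revoke case, the two conditions (overlap elapsed and drain complete) reduce to taking the later of two timestamps. A minimal sketch, with the function name and parameters chosen for illustration:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def earliest_revocation(
    new_key_issued_at: datetime,
    overlap_seconds: int,
    drain_completed_at: Optional[datetime],
) -> Optional[datetime]:
    """Earliest moment the old key may be revoked.

    Returns None while the drain is still in progress, because revocation
    must wait for both the overlap window and the drain.
    """
    if drain_completed_at is None:
        return None
    overlap_ends = new_key_issued_at + timedelta(seconds=overlap_seconds)
    # Whichever condition is satisfied last determines the revocation time.
    return max(overlap_ends, drain_completed_at)
```

With a 300-second overlap, a drain that finishes after 90 seconds still waits until the 300-second mark, while a drain that takes 420 seconds pushes revocation out to 420 seconds.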

Step 9: Enforce Tenant Isolation at Every Layer

In a multi-tenant system, a rotation pipeline bug that leaks a credential across tenant boundaries is a critical security incident. Enforce isolation with these controls:

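Secret store policies: Each service account used by the rotation pipeline has a Vault policy that restricts read/write access to only secret/tenants/{its_own_tenant_id}/*. On a KV v2 mount, the policy must be written against the API paths secret/data/tenants/{its_own_tenant_id}/* and secret/metadata/tenants/{its_own_tenant_id}/*.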
Event bus partitioning: Use tenant-scoped topics and authenticate producers and consumers with per-tenant service credentials.
CCB request validation: The CCB validates that every credential fetch request includes a signed tenant JWT, and the tenant_id in the JWT matches the requested secret path.
Audit log tagging: Every secret read, write, and rotation event is tagged with tenant_id and written to an immutable audit log (CloudTrail, Vault Audit Device, or a WORM-compliant log sink).
Checkpoint store isolation: Checkpoint payloads are stored under per-tenant S3 prefixes with bucket policies that deny cross-tenant access at the IAM level.
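
The CCB's tenant match can be expressed as a small, strict check on the requested path. The sketch below assumes the JWT's signature, expiry, and audience were already verified upstream and that the claims dict carries a tenant_id field; authorize_fetch and TenantMismatch are hypothetical names.

```python
import re

# Only paths that follow the schema exactly are accepted; anything with
# extra segments (including "..") fails to match.
_PATH = re.compile(
    r"^secret/tenants/(?P<tenant>[^/]+)/providers/(?P<provider>[^/]+)/api_key$"
)


class TenantMismatch(PermissionError):
    """The caller's token does not entitle it to the requested path."""


def authorize_fetch(verified_claims: dict, requested_path: str) -> str:
    """Return the provider name if the caller may read requested_path.

    Assumption: verified_claims comes from a JWT that has already passed
    signature and expiry validation; this function only does the tenant match.
    """
    match = _PATH.fullmatch(requested_path)
    if match is None:
        raise TenantMismatch("path does not follow the tenant schema")
    claimed = verified_claims.get("tenant_id")
    if not claimed or claimed != match["tenant"]:
        raise TenantMismatch("token tenant does not match requested path")
    return match["provider"]
```

Rejecting by default, rather than trying to normalize an odd-looking path, keeps the check easy to reason about. The Vault policy remains the second line of defense if this check is ever bypassed.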

Step 10: Observability, Alerting, and Runbooks

A rotation pipeline that fails silently is worse than no pipeline at all. Instrument every state transition with metrics and traces:

Metrics to emit: rotation_duration_seconds (histogram per provider), tasks_checkpointed_per_rotation (gauge), rotation_failures_total (counter per tenant and provider), credential_holder_count (gauge, useful for detecting stuck tasks that never drain).
Alerts to configure: Page on any rotation stuck in DRAINING for more than 10 minutes (indicates a task is not responding to checkpoint signals). Alert on rotation.failed events immediately. Warn if a credential is within 24 hours of expiry and no rotation has been initiated.
Distributed tracing: Attach a rotation_id as a trace context attribute so you can correlate a rotation event with every task checkpoint, secret write, and provider API call in a single trace in your observability platform (Honeycomb, Datadog, or Grafana Tempo).
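
The stuck-drain alert can be prototyped as a pure function over rotation records before it is translated into your monitoring system's rule language. The RotationSnapshot shape below is an assumption about what the state store exposes.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Threshold taken from the alert rule above.
DRAIN_ALERT_THRESHOLD = timedelta(minutes=10)


@dataclass(frozen=True)
class RotationSnapshot:
    """Hypothetical read model of one row in the rotation state store."""

    rotation_id: str
    state: str
    entered_state_at: datetime


def stuck_drains(
    snapshots: list[RotationSnapshot],
    now: datetime,
    threshold: timedelta = DRAIN_ALERT_THRESHOLD,
) -> list[str]:
    """Return ids of rotations that have sat in DRAINING longer than threshold."""
    return [
        s.rotation_id
        for s in snapshots
        if s.state == "DRAINING" and now - s.entered_state_at > threshold
    ]
```

Recording entered_state_at on every transition is what makes this check possible, and the same timestamp feeds the rotation_duration_seconds histogram.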

Maintain a runbook for each failure mode: what to do when a provider's key issuance API is down, how to manually force-complete a stuck drain, and how to roll back a rotation if the new key turns out to be invalid.

Putting It All Together: The Rotation Flow End to End

Here is the complete happy-path flow for a single tenant credential rotation, from trigger to completion:

The rotation scheduler detects that tenant-abc's OpenAI key expires in 48 hours and emits rotation.initiated.
The rotation pipeline transitions to ISSUING and calls the OpenAI adapter to create a new key. The new key (version: 5) is written to Vault with rotation_status: active. The old key (version: 4) is updated to rotation_status: draining.
The pipeline emits rotation.new_key_ready. The CCB stops issuing version 4 to new tasks. All new tasks for tenant-abc now receive version 5.
The pipeline emits rotation.checkpoint_requested. The CCB notifies the three active tasks holding version 4.
Each task finishes its current LLM call, checkpoints its state, deregisters from the holder registry, fetches version 5 from the CCB, and resumes. From the task's perspective, there is only a brief pause between steps, roughly the time needed to write one checkpoint and fetch one credential.
The CCB detects that the version 4 holder set is empty and emits rotation.drain_complete.
After the 300-second overlap window (the value configured for OpenAI in PROVIDER_REVOCATION_POLICIES), the pipeline calls the OpenAI adapter to revoke the version 4 key. It then marks version 4 as rotation_status: revoked in Vault.
The pipeline emits rotation.complete. The audit log records the full rotation lifecycle. Metrics are updated.

Total elapsed time for a typical rotation with three active tasks: under 10 minutes. Zero tasks dropped. Zero LLM calls interrupted.

Conclusion

Building a per-tenant AI agent credential rotation pipeline is one of those infrastructure investments that feels over-engineered until the day a key gets compromised and your platform handles it silently in the background while every tenant's agents keep running without a hiccup. That is the moment it pays for itself entirely.

The architecture described here, combining a versioned secret store, a stateful rotation state machine, a credential context broker, task-level checkpointing, and a rotation event bus, gives you the building blocks for a system that is both operationally safe and security-compliant. It respects the realities of agentic workloads: long-running tasks, nested tool calls, and strict tenant isolation requirements that simpler rotation patterns simply cannot accommodate.

As foundation model providers continue to mature their key management APIs throughout 2026 and beyond, the adapter layer will become easier to implement. But the core architecture, the state machine, the drain protocol, and the checkpoint system, will remain relevant as long as AI agents are doing real work in production. Build it once, build it right, and your platform will be ready for whatever comes next.