How a Legal Tech Startup's Multi-Tenant Agentic Platform Collapsed Under Runaway Cross-Agent Loops, and the Cycle Detection and Circuit-Breaking Architecture That Fixed It

In early 2026, a promising legal technology startup we'll call LexOrbit was riding the wave of agentic AI adoption. Their flagship product: a multi-tenant SaaS platform that deployed autonomous AI agent pipelines for law firms, automating contract review, case research, document summarization, compliance checks, and client intake processing. Each tenant (a law firm) got its own isolated logical workspace, but all pipelines ran on a shared orchestration layer powered by a proprietary agent graph engine built on top of a popular open-source multi-agent framework.

For eight months, the platform worked beautifully. Then, on a Tuesday morning in February 2026, everything fell apart, simultaneously, for every tenant on the system.

This is the story of what went wrong, why it was architecturally inevitable, and how LexOrbit's engineering team rebuilt their orchestration layer with cycle detection and per-tenant circuit-breaking to prevent it from ever happening again.

The Platform: What LexOrbit Built

LexOrbit's architecture was genuinely impressive for its time. Each law firm tenant onboarded with a configurable set of agent pipelines, where individual AI agents were nodes in a directed graph. A typical pipeline for a mid-size litigation firm might look like this:

  • IntakeAgent: Parses incoming client documents and classifies case type.
  • ResearchAgent: Queries legal databases and synthesizes relevant precedents.
  • DraftAgent: Generates initial motion drafts or contract redlines.
  • ReviewAgent: Performs a compliance and risk review of DraftAgent output.
  • SummaryAgent: Produces a final client-facing summary.

Tenants could customize agent behavior through a no-code configuration UI. They could add agents, reorder them, set conditional branching rules (e.g., "if ReviewAgent flags a risk score above 0.7, re-route to ResearchAgent for additional precedent lookup"), and define escalation paths to human reviewers.

The orchestration engine maintained a shared task queue across tenants, with logical tenant isolation enforced through namespace tagging. Agent-to-agent communication happened via an internal message bus, and each agent could both consume tasks and emit new task events , which other agents could then pick up.

This event-driven, graph-based design was flexible and powerful. It was also, as the team would discover, a ticking time bomb.

The Incident: A Tuesday Morning in February 2026

The failure began not with a catastrophic bug but with a seemingly harmless tenant configuration change. A large corporate law firm (Tenant ID: TEN-0047) updated their pipeline the evening before. Their IT administrator added a new feedback rule: if the SummaryAgent's output confidence score fell below a threshold, the pipeline should loop back to the ResearchAgent to gather more supporting material before re-drafting.

On paper, this was a reasonable quality-control loop. In practice, it created a directed cycle in the agent graph:

IntakeAgent → ResearchAgent → DraftAgent → ReviewAgent → SummaryAgent
                    ↑                                          |
                    └──────────── (confidence < 0.65) ────────┘

The problem? The SummaryAgent, operating on a particularly ambiguous multi-jurisdictional contract, consistently produced confidence scores of 0.61 to 0.63, never quite reaching the 0.65 threshold. The loop triggered. ResearchAgent ran again. DraftAgent ran again. ReviewAgent ran again. SummaryAgent ran again. Score: 0.62. Loop triggered again.

Within four minutes, TEN-0047's pipeline had spawned 847 concurrent agent task instances. The shared task queue began saturating. Because the orchestration engine used a shared thread pool and a global message bus with no per-tenant rate limiting, the cascading task flood began consuming resources allocated to every other tenant on the platform. By 9:14 AM, 23 other law firm tenants reported frozen pipelines, stalled document reviews, and unresponsive dashboards. By 9:31 AM, the orchestration engine itself had crashed under memory exhaustion.

Total downtime: 4 hours and 22 minutes. Estimated impact: over 300 stalled legal workflows across 24 tenant organizations, several of which had time-sensitive court filing deadlines that day.

The Root Cause Analysis: Three Compounding Failures

LexOrbit's post-mortem identified not one but three distinct architectural failures that combined to produce the incident. Understanding each one is critical, because fixing only one or two would have left the system vulnerable.

1. No Graph Cycle Detection at Configuration Time

The most fundamental failure was that the pipeline configuration system had no static analysis layer. When Tenant TEN-0047's administrator created the feedback loop, the UI accepted the configuration without any validation. The orchestration engine ingested it as a valid directed graph without checking whether it contained cycles.

In computer science terms, this is a solved problem. A simple Depth-First Search (DFS) traversal with a "visited" and "recursion stack" set can detect cycles in a directed graph in O(V + E) time, where V is the number of agent nodes and E is the number of directed edges (routing rules). For a typical LexOrbit pipeline with 5 to 15 agents, this check would complete in microseconds.

The team had simply never implemented it. The assumption was that tenants would configure "reasonable" pipelines. That assumption was wrong.

Importantly, not all cycles are inherently bad. Intentional retry loops are legitimate. But the system had no way to distinguish between a bounded cycle (one with a clear exit condition and a maximum iteration count) and an unbounded cycle (one that could spin indefinitely). Both looked identical to the orchestration engine.

2. No Per-Tenant Resource Isolation in the Orchestration Layer

The second failure was the shared task queue and thread pool. LexOrbit's multi-tenancy was logical, not physical. Tenant namespaces were tags in a database, not separate execution environments. This meant one tenant's runaway pipeline could consume resources earmarked for others.

In a well-designed multi-tenant agentic system, each tenant's pipeline should operate within a resource envelope: a maximum number of concurrent agent executions, a maximum task queue depth, and a maximum execution time budget per pipeline run. LexOrbit had none of these controls in place.

3. No Circuit Breaker on the Agent Invocation Path

The third failure was the absence of any circuit-breaking mechanism on the agent invocation path. Even after the loop began spinning, there was nothing in the system to detect the abnormal invocation rate, halt execution, and alert operators. The orchestration engine simply kept accepting and dispatching tasks until it ran out of memory.

A circuit breaker, borrowed from electrical engineering and popularized in distributed systems by Michael Nygard's "Release It!" patterns, would have detected the surge in failure-adjacent behavior (repeated identical task patterns, queue depth spikes, execution time anomalies) and opened the circuit for the offending tenant's pipeline, stopping the cascade before it affected others.

The Fix: A Three-Layer Resilience Architecture

Over the following six weeks, LexOrbit's engineering team designed and shipped a comprehensive resilience layer. They called it internally the GUARD stack: Graph validation, Usage envelopes, Agent rate limiting, Runtime cycle detection, and Dead-man circuit breakers. Here is how each component works.

Layer 1: Static Cycle Detection at Configuration Save Time

The first fix was the simplest and the most impactful. The pipeline configuration service was updated to run a directed graph cycle detection algorithm every time a tenant saved a pipeline configuration. The implementation used an iterative DFS with an explicit recursion stack to avoid call stack overflow on deeply nested pipelines:

def detect_cycle(graph: dict[str, list[str]]) -> tuple[bool, list[str]]:
    """Iterative DFS cycle detection. Returns (cycle_found, cycle_path)."""
    visited = set()      # nodes whose subtrees are fully explored
    rec_stack = set()    # nodes on the current DFS path

    for start in graph:
        if start in visited:
            continue
        path = [start]
        rec_stack.add(start)
        # Each stack frame pairs a node with an iterator over its neighbors,
        # so traversal resumes where it left off after exploring a child.
        stack = [(start, iter(graph.get(start, [])))]

        while stack:
            node, neighbors = stack[-1]
            advanced = False
            for neighbor in neighbors:
                if neighbor in rec_stack:
                    # Back-edge found: slice the cycle out of the current path.
                    cycle_start = path.index(neighbor)
                    return True, path[cycle_start:] + [neighbor]  # close the cycle
                if neighbor not in visited:
                    rec_stack.add(neighbor)
                    path.append(neighbor)
                    stack.append((neighbor, iter(graph.get(neighbor, []))))
                    advanced = True
                    break
            if not advanced:
                # All neighbors explored: retire this node from the path.
                visited.add(node)
                rec_stack.discard(node)
                path.pop()
                stack.pop()

    return False, []

Critically, the system was updated to distinguish between bounded and unbounded cycles. A cycle is permitted if and only if every back-edge in the graph is annotated with both a termination condition and a maximum iteration count (capped at a platform-enforced ceiling of 5 re-entries per pipeline run). Cycles without these annotations are rejected at save time with a clear error message to the tenant administrator, explaining the exact cycle path detected.
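The bounded-cycle rule described above can be sketched as a validation pass over the back-edges found by cycle detection. The `back_edges` structure and its field names here are illustrative assumptions, not LexOrbit's actual configuration schema:

```python
# Illustrative sketch: reject any back-edge that lacks a termination
# condition or a capped iteration count. Field names are assumptions.
MAX_REENTRY_CEILING = 5  # platform-enforced ceiling per pipeline run

def validate_back_edges(back_edges: list[dict]) -> list[str]:
    """Return human-readable errors; an empty list means the save is allowed."""
    errors = []
    for edge in back_edges:
        label = f"{edge['source']} -> {edge['target']}"
        if not edge.get("termination_condition"):
            errors.append(f"Back-edge {label} has no termination condition.")
        max_iter = edge.get("max_iterations")
        if max_iter is None:
            errors.append(f"Back-edge {label} has no maximum iteration count.")
        elif max_iter > MAX_REENTRY_CEILING:
            errors.append(
                f"Back-edge {label} exceeds the platform ceiling of "
                f"{MAX_REENTRY_CEILING} re-entries (got {max_iter})."
            )
    return errors
```

Returning all violations at once, rather than failing on the first, lets the UI show the administrator everything wrong with a feedback loop in a single save attempt.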

This single change would have prevented the February incident entirely. TEN-0047's administrator would have received an error: "Cycle detected: SummaryAgent → ResearchAgent → DraftAgent → ReviewAgent → SummaryAgent. Please add a maximum iteration limit to this feedback loop."

Layer 2: Per-Tenant Resource Envelopes

The second layer introduced hard per-tenant resource constraints enforced at the orchestration engine level. Each tenant is now assigned a resource envelope based on their subscription tier:

  • Max concurrent agent executions: 10 (standard) to 50 (enterprise) per pipeline run.
  • Max task queue depth: 200 pending tasks per tenant at any time.
  • Max pipeline wall-clock time: 15 minutes per end-to-end pipeline execution before forced termination.
  • Max agent re-invocation count: Any single agent node may not be invoked more than N times within a single pipeline run (N configurable per tenant, default 3).
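The envelope limits above reduce to a simple admission test on each task emission. This is a minimal sketch with assumed standard-tier defaults and parameter names; the production enforcement lives in a dedicated governor service:

```python
from dataclasses import dataclass

@dataclass
class ResourceEnvelope:
    # Illustrative defaults for a standard-tier tenant.
    max_concurrent_executions: int = 10
    max_queue_depth: int = 200
    max_wall_clock_seconds: int = 15 * 60
    max_agent_reinvocations: int = 3

def admit_task(envelope: ResourceEnvelope, running: int, queued: int,
               elapsed_s: float, invocations: dict, agent_id: str):
    """Return (allowed, reason). A False result routes the task to the
    tenant's overflow buffer instead of the shared global queue."""
    if running >= envelope.max_concurrent_executions:
        return False, "concurrent execution limit reached"
    if queued >= envelope.max_queue_depth:
        return False, "task queue depth limit reached"
    if elapsed_s >= envelope.max_wall_clock_seconds:
        return False, "pipeline wall-clock budget exhausted"
    if invocations.get(agent_id, 0) >= envelope.max_agent_reinvocations:
        return False, f"agent {agent_id} re-invocation limit reached"
    return True, "ok"
```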

These envelopes are enforced by a new TenantResourceGovernor service that sits between the message bus and the agent executor pool. When a tenant's envelope is exhausted, new task emissions are queued in a tenant-specific overflow buffer rather than the shared global queue, preventing cross-tenant contamination.

The resource governor also emits structured telemetry events to the observability stack (LexOrbit uses an OpenTelemetry-based pipeline feeding into a Grafana dashboard), so on-call engineers receive real-time alerts when any tenant approaches 80% of their envelope limits.

Layer 3: Runtime Cycle Detection with Agent Invocation Fingerprinting

Static analysis at save time handles known cycles in the configured graph. But what about emergent cycles? These are loops that arise not from explicit back-edges in the configuration but from agents dynamically emitting task events that happen to be picked up by upstream agents through indirect routing rules. These are far harder to detect statically because they depend on runtime data and agent output content.

To address this, LexOrbit implemented runtime cycle detection via invocation fingerprinting. Every agent invocation within a pipeline run is assigned a fingerprint composed of:

  • The agent's node ID.
  • A hash of the input payload (normalized to remove timestamps and run-specific IDs).
  • The current pipeline run's execution depth counter.

These fingerprints are stored in a per-pipeline-run bloom filter backed by Redis, with a TTL matching the maximum pipeline wall-clock time. Before executing any agent task, the orchestration engine checks whether an identical fingerprint has been seen in the current run. If it has, the engine refuses to execute the task, logs a runtime cycle detection event, and routes the pipeline to a human review escalation path instead.

The bloom filter approach was chosen deliberately: it is probabilistic (with a tunable false positive rate, set to 0.1% in production) and extremely fast, adding less than 2 milliseconds of overhead per agent invocation. The small false positive rate is acceptable because a false positive results in an unnecessary escalation to human review, not data corruption or system failure.
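A simplified, in-process sketch of the fingerprint check follows, using a plain Python set in place of the Redis-backed bloom filter and an assumed list of volatile payload keys:

```python
import hashlib
import json

# Keys stripped before hashing so that re-runs of the *same* work collide
# while genuinely new work does not. The exact key list is an assumption.
VOLATILE_KEYS = {"timestamp", "run_id", "task_id"}

def fingerprint(agent_id: str, payload: dict, depth: int) -> str:
    """Node ID + normalized payload hash + execution depth counter."""
    normalized = {k: v for k, v in sorted(payload.items())
                  if k not in VOLATILE_KEYS}
    digest = hashlib.sha256(json.dumps(normalized).encode()).hexdigest()
    return f"{agent_id}:{digest}:{depth}"

def should_execute(seen: set, agent_id: str, payload: dict, depth: int) -> bool:
    """Return False when this exact invocation has already run in the
    current pipeline run; the caller then escalates to human review."""
    fp = fingerprint(agent_id, payload, depth)
    if fp in seen:
        return False
    seen.add(fp)
    return True
```

Swapping the set for a bloom filter trades exactness for constant memory per pipeline run; as the text notes, a rare false positive only costs an unnecessary human-review escalation.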

Layer 4: Per-Tenant Circuit Breakers

The final layer is the circuit breaker, implemented as a per-tenant state machine with three states: Closed (normal operation), Open (pipeline execution halted, all new tasks rejected), and Half-Open (limited probe executions allowed to test recovery).

The circuit breaker trips to Open state when any of the following conditions are met within a configurable rolling time window (default: 60 seconds):

  • A tenant's concurrent agent execution count exceeds 90% of their envelope for more than 10 consecutive seconds.
  • The runtime cycle detector fires more than 3 times for the same pipeline run.
  • The average agent execution latency for the tenant exceeds 5x the rolling baseline for more than 30 seconds.
  • The tenant's task queue depth grows monotonically for more than 20 seconds without any task completions.

When the circuit opens, the affected tenant's pipeline is suspended, their users see a clear status page message ("Your pipeline has been temporarily paused due to an anomalous execution pattern. Our team has been notified."), and an automated incident ticket is created in LexOrbit's internal system. Crucially, no other tenant is affected. The circuit breaker is scoped entirely to the offending tenant's execution namespace.

After a configurable cool-down period (default: 5 minutes), the circuit transitions to Half-Open and allows a single probe pipeline execution. If it completes successfully within normal parameters, the circuit closes. If it triggers any anomaly condition again, the circuit reopens and the cool-down timer resets.
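The state machine above can be sketched in a few dozen lines. This is a minimal illustration of the Closed/Open/Half-Open transitions, not LexOrbit's production breaker; the injectable clock exists only to make the cool-down testable:

```python
import time

class TenantCircuitBreaker:
    """Per-tenant breaker: Closed -> Open on anomaly, Open -> Half-Open
    after a cool-down, Half-Open -> Closed on a clean probe run.
    The 300-second default mirrors the 5-minute cool-down described above."""

    def __init__(self, cooldown_s: float = 300.0, clock=time.monotonic):
        self.state = "closed"
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.opened_at = None

    def allow_execution(self) -> bool:
        if self.state == "closed":
            return True
        if self.state == "open":
            if self.clock() - self.opened_at >= self.cooldown_s:
                self.state = "half_open"  # permit a single probe run
                return True
            return False
        return True  # half_open: probe execution in flight

    def record_anomaly(self) -> None:
        self.state = "open"
        self.opened_at = self.clock()  # reopening resets the cool-down timer

    def record_success(self) -> None:
        if self.state == "half_open":
            self.state = "closed"
```

Because each tenant owns its own breaker instance, an open circuit for TEN-0047 has no effect on any other tenant's `allow_execution` path, which is exactly the blast-radius scoping the incident demanded.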

Results: Six Weeks Post-Deployment

LexOrbit deployed the GUARD stack in phases across March and April 2026. The results were measurable and significant:

  • Zero cross-tenant cascade incidents in the six weeks following deployment, compared to three in the six weeks prior (the February incident was the largest but not the only one).
  • 14 misconfigured pipeline saves blocked by static cycle detection, with tenant administrators guided to correct their configurations before any runtime impact occurred.
  • 7 runtime cycle detection events triggered by emergent loops in dynamic routing scenarios, all successfully contained to the originating tenant without any cross-tenant impact.
  • Circuit breaker activations: 3 total, all in the Half-Open probe phase, all resolved within one cool-down cycle without manual intervention.
  • P99 pipeline execution latency across all tenants improved by 18%, attributed primarily to the elimination of noisy-neighbor resource contention from the shared queue.

Perhaps most importantly, the static cycle detection UI change received positive feedback from tenant administrators. Several reported that the error messages helped them understand their pipeline configurations more deeply and prompted them to redesign feedback loops more intentionally, adding explicit human checkpoints rather than fully automated re-entry loops.

Broader Lessons for Agentic Platform Engineers

LexOrbit's incident is not unique. As agentic AI platforms proliferate across industries in 2026, the pattern of flexible, event-driven, multi-agent pipelines running on shared infrastructure is becoming the norm. The architectural lessons from this case study apply broadly.

Treat Agent Graphs as First-Class Data Structures

If your platform allows users to configure agent pipelines, you are in the business of managing directed graphs. Apply the full toolkit of graph theory: cycle detection, topological sorting, reachability analysis, and strongly connected component identification. These are not academic exercises; they are production safety tools.

Multi-Tenancy Requires Execution Isolation, Not Just Data Isolation

Most teams building multi-tenant AI platforms invest heavily in data isolation (separate databases, encrypted namespaces, row-level security). Far fewer invest in execution isolation. In an agentic system where one tenant's pipeline can generate unbounded compute load, execution isolation is equally critical. Resource envelopes, per-tenant queues, and fair-share schedulers are not optional features; they are table stakes for production reliability.

Static Analysis and Runtime Detection Are Complements, Not Substitutes

Static cycle detection at configuration time is fast and cheap, but it cannot catch emergent runtime loops. Runtime detection is comprehensive but adds overhead. You need both layers. The cost of the runtime bloom filter check (2ms per invocation) is trivially small compared to the cost of a 4-hour platform outage.

Circuit Breakers Must Be Scoped to the Blast Radius You Want to Contain

A global circuit breaker that shuts down the entire platform when any tenant misbehaves is not a circuit breaker; it is a controlled outage. Circuit breakers in multi-tenant systems must be scoped to the tenant, the pipeline, or the agent type, depending on the granularity of isolation your architecture provides. Design your circuit breakers to match your desired blast radius.

Design for Intentional Loops, Not Looplessness

The instinct after an incident like this is to ban all cycles in agent graphs. Resist that instinct. Feedback loops, quality-control re-entry patterns, and iterative refinement cycles are genuinely valuable in agentic workflows. The goal is not to eliminate cycles but to make them explicit, bounded, and observable. A cycle with a maximum iteration count, a clear exit condition, and telemetry instrumentation is a feature. An unbounded cycle discovered at runtime is a bug.

Conclusion

LexOrbit's February 2026 incident was painful, costly, and reputationally damaging. It was also entirely preventable with architecture that the industry already knows how to build. The GUARD stack they deployed afterward was not novel research; it was the disciplined application of well-understood patterns from distributed systems engineering, graph theory, and resilience design to the specific challenges of multi-tenant agentic AI platforms.

As agentic AI moves from prototype to production infrastructure across legal, financial, healthcare, and enterprise software sectors, the engineering community needs to treat orchestration reliability with the same rigor historically applied to database transactions, API rate limiting, and microservice fault tolerance. The agents are getting smarter. The platforms that run them need to get more resilient at the same pace.

The loop will always find a way to form. Your job as a platform engineer is to make sure it cannot spin forever.