How a Mid-Size SaaS DevOps Team's AI Agent Deployment Collapsed When Unvalidated Tool-Call Outputs Silently Corrupted Their Driver Packaging Pipeline (And the Architecture That Fixed It)
When teams talk about AI agent failures, they usually picture a chatbot giving a wrong answer or an autonomous task runner getting stuck in a loop. What they rarely picture is a silent, months-long corruption of a production software packaging pipeline that ships signed drivers to enterprise customers. That is exactly what happened to Velorik Systems, a mid-size SaaS company providing device management and telemetry software to manufacturing clients. And the root cause was not a rogue model, a hallucinated API call, or a misconfigured cloud permission. It was something far more mundane and far more dangerous: unvalidated tool-call outputs from an AI agent that nobody thought to treat as untrusted data.
This is the full case study of how that failure unfolded, what it cost, and the output schema enforcement and immutable artifact signing architecture that rebuilt their secure software distribution workflow from the ground up.
The Setup: A Promising AI-Augmented DevOps Pipeline
By early 2025, Velorik's DevOps team of eleven engineers had integrated a multi-step AI agent into their internal tooling. The agent, built on a popular LLM orchestration framework, was responsible for automating a surprisingly complex workflow: parsing hardware compatibility matrices, cross-referencing firmware changelogs, generating driver package manifests, and triggering downstream packaging and signing jobs in their CI/CD system.
On paper, the setup was elegant. The agent used a series of registered tools to:
- Query an internal hardware compatibility database via a REST API tool
- Pull structured firmware metadata from a versioned S3-backed store
- Generate a JSON driver manifest describing version strings, target architectures, dependency hashes, and distribution flags
- Pass that manifest to a packaging service that bundled, compressed, and staged artifacts for code signing
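To make the handoff concrete, here is a sketch of what such a driver manifest might have looked like. Velorik's real schema is not public, so every field name and value below is illustrative:

```python
# Hypothetical driver package manifest of the kind the agent generated.
# All field names and values are illustrative, not Velorik's real schema.
manifest = {
    "package_name": "velorik-cnc-driver",         # hypothetical package id
    "driver_version": "4.12.1",
    "min_os_version": "10.0.19041",               # minimum supported OS build
    "target_architectures": ["x86_64", "arm64"],  # constrained enum in practice
    "dependency_hashes": {
        # SHA-256 digests of bundled dependencies (placeholder value)
        "libvtel.so": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
    },
    "distribution_flags": {"channel": "stable", "staged": True},
}
```

Each of these fields feeds a downstream decision: the compatibility metadata drives update targeting, and the dependency hashes anchor the signing step, which is why silent corruption of any one of them propagates so far.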
The team had spent real effort on the agent's prompting, tool definitions, and retry logic. What they had not spent effort on was what happened after a tool returned a value and before that value was consumed by the next step. That gap became the fault line everything fell into.
The Failure: Silent Corruption at the Seam Between Agent and Pipeline
The corruption began in October 2025 and was not discovered until January 2026, a window of roughly fourteen weeks. During that period, Velorik shipped forty-three driver package releases to enterprise customers. Of those, nineteen contained subtly incorrect metadata: wrong minimum OS version strings, mismatched architecture flags, and in six cases, dependency hash fields that had been silently overwritten with empty strings.
None of these packages caused immediate crashes. The driver installers still ran. But the metadata corruption meant that Velorik's customer-side device management platform was making incorrect compatibility decisions, silently skipping updates on machines that needed them and, in two documented cases, pushing driver updates to hardware profiles they were not validated for.
What Actually Went Wrong at the Technical Level
The failure chain was traced to a single tool in the agent's toolkit: the firmware metadata fetcher. This tool queried an internal API and was expected to return a structured JSON object conforming to a well-known internal schema. However, over time, the internal API had been quietly updated by a separate infrastructure team. Two fields were renamed, one field changed its type from a string to an array, and a new optional field was added.
The tool call still succeeded. It returned a 200 OK. The agent received a JSON blob. And here is the critical failure: the agent treated a successful tool call as implicitly valid output. There was no schema validation layer between the tool response and the agent's reasoning context. The LLM parsed the renamed and restructured fields as best it could, filling gaps with plausible-looking values drawn from its context window, prior tool outputs, and in some cases, what amounted to confident hallucination of metadata strings.
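The failure mode is easy to reproduce in miniature. A minimal sketch, with hypothetical field names standing in for Velorik's internal API, shows how a renamed field slips through defaulted lookups without ever raising an error:

```python
# What the tool's consumers assumed the firmware metadata API returned
# (all field names are illustrative, not Velorik's real schema):
old_response = {
    "min_os_version": "10.0.19041",
    "dependency_hash": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
}

# After the unannounced API update: two fields renamed, one retyped
# from string to array, one new optional field added.
new_response = {
    "minimum_os": ["10.0.19041"],
    "dep_hashes": ["e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"],
    "release_channel": "stable",  # new optional field
}

# Naive consumption of a 200 OK response: defaulted lookups mean the
# renamed fields silently come back empty instead of raising an error.
min_os = new_response.get("min_os_version", "")
dep_hash = new_response.get("dependency_hash", "")
```

Nothing in this path throws. The empty strings flow onward, and in an agent pipeline the LLM is free to paper over them with plausible values.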
The packaging service downstream had no validation layer either. It accepted whatever JSON the agent passed it, trusting that the orchestration layer had already handled correctness. The signing service, the final step, signed whatever the packaging service produced. Cryptographic signatures were valid. The artifacts were not.
"We had a fully signed, cryptographically verified artifact that was semantically wrong. The signature told customers the package hadn't been tampered with. It said nothing about whether the package was correct to begin with."
Lead DevOps Engineer, Velorik Systems
The Discovery: A Customer Anomaly Report That Pulled the Thread
The failure surfaced not through internal monitoring but through a support ticket. A manufacturing client noticed that a fleet of CNC controller units had stopped receiving driver updates for two consecutive release cycles. Their device management dashboard showed the machines as "up to date" even though a newer validated driver was available. After escalating to Velorik's support team, engineers began tracing the update logic and found that the compatibility metadata in the distributed packages was filtering out the affected hardware profile entirely.
A deeper audit followed. Engineers wrote a validation script against the last sixty driver packages and ran it against the canonical schema. The results were alarming. Nineteen packages had at least one malformed or missing field. The team cross-referenced git history, CI logs, and agent run traces and reconstructed the full failure chain within seventy-two hours.
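An audit check of the kind the engineers ran can be sketched in a few lines. This is a stdlib-only illustration against a hypothetical schema, not Velorik's actual script:

```python
import re

SHA256_RE = re.compile(r"^[0-9a-f]{64}$")

# Required fields and their expected types (hypothetical schema)
REQUIRED = {
    "min_os_version": str,
    "target_architectures": list,
    "dependency_hash": str,
}

def audit(manifest: dict) -> list[str]:
    """Return human-readable violations; an empty list means the manifest is clean."""
    problems = []
    for field, expected_type in REQUIRED.items():
        value = manifest.get(field)
        if value is None or value == "":
            problems.append(f"missing or empty field: {field}")
        elif not isinstance(value, expected_type):
            problems.append(f"wrong type for {field}: got {type(value).__name__}")
    digest = manifest.get("dependency_hash")
    if isinstance(digest, str) and digest and not SHA256_RE.fullmatch(digest):
        problems.append("dependency_hash is not a SHA-256 hex digest")
    return problems

# One of the corrupted packages: the hash field silently overwritten with ""
corrupted = {
    "min_os_version": "10.0.19041",
    "target_architectures": ["x86_64"],
    "dependency_hash": "",
}
```

Run across the last sixty packages, a check like this surfaces every malformed field in seconds, which is exactly why the post-incident audit converged so quickly.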
The remediation was immediate: all affected packages were pulled from the distribution CDN, customer accounts were notified, and corrected packages were manually prepared and re-signed. But the reputational and operational cost was real. Three enterprise clients opened formal incident reviews. One paused their contract renewal pending a security audit. The internal post-mortem ran for three weeks.
The Architecture Rebuild: Three Layers of Defense
Velorik's post-mortem produced a clear architectural mandate: no data produced by or passed through an AI agent should be consumed by a downstream system without explicit, machine-enforced schema validation and tamper-evident artifact provenance. The rebuild was organized around three layers.
Layer 1: Tool-Call Output Schema Enforcement
The first and most immediate change was the introduction of a Tool Output Validation Gateway (TOVG), a thin middleware service sitting between every registered agent tool and the agent's reasoning context. Every tool in the agent's toolkit was assigned a strict JSON Schema definition. When a tool returned a response, the TOVG intercepted it before the agent could consume it and ran full schema validation including:
- Field presence checks: required fields must be present with non-null values
- Type enforcement: no implicit type coercion; a field declared as a string must be a string
- Enum validation: architecture flags, OS version identifiers, and distribution channel codes were constrained to known-good enum sets
- Hash format validation: dependency hash fields were validated against a SHA-256 regex before passing through
- Schema version pinning: every tool response was required to include a schema_version field; mismatches triggered an immediate halt and alert rather than a silent fallback
Critically, validation failures did not cause the agent to retry with a different prompt or attempt to "repair" the output. They caused a hard stop, a structured error event written to the audit log, and a human-review queue entry. The team made a deliberate decision: a paused pipeline is recoverable; a corrupted artifact in production is not.
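The gateway's behavior can be sketched as a single validation function that either returns the payload untouched or hard-stops. Field names, enum sets, and the audit sink below are all hypothetical stand-ins for Velorik's internals:

```python
import re
import time

# Sketch of a Tool Output Validation Gateway (TOVG) check, assuming a
# hypothetical firmware-metadata schema. Validation failures halt hard;
# they are never "repaired" or retried.

PINNED_SCHEMA_VERSION = "firmware-metadata/v3"
VALID_ARCHES = {"x86_64", "arm64"}
SHA256_RE = re.compile(r"^[0-9a-f]{64}$")

class ToolOutputRejected(Exception):
    """Hard stop: the run halts and a human-review queue entry is created."""

audit_log = []  # stand-in for the real structured audit sink

def validate_tool_output(tool_name: str, payload: dict) -> dict:
    def reject(reason: str):
        audit_log.append({"tool": tool_name, "reason": reason, "ts": time.time()})
        raise ToolOutputRejected(f"{tool_name}: {reason}")

    # Schema version pinning: mismatch halts rather than falling back.
    if payload.get("schema_version") != PINNED_SCHEMA_VERSION:
        reject(f"schema_version mismatch: {payload.get('schema_version')!r}")
    # Field presence and type enforcement (no implicit coercion).
    if not isinstance(payload.get("min_os_version"), str) or not payload["min_os_version"]:
        reject("min_os_version must be a non-empty string")
    # Enum validation against known-good sets.
    arches = payload.get("target_architectures")
    if not isinstance(arches, list) or not set(arches) <= VALID_ARCHES:
        reject(f"unknown architecture flags: {arches!r}")
    # Hash format validation.
    digest = payload.get("dependency_hash")
    if not isinstance(digest, str) or not SHA256_RE.fullmatch(digest):
        reject("dependency_hash is not a SHA-256 hex digest")
    return payload  # only validated payloads reach the agent's context
```

The key design choice is visible in the shape of the function: there is no fallback branch. Every failure path ends in an exception and an audit event, never a default value.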
Layer 2: Agent-Boundary Output Attestation
The second layer addressed the handoff between the agent's final output and the packaging service. Previously, the agent simply wrote a JSON manifest to an internal message queue and the packaging service consumed it. After the rebuild, every manifest leaving the agent boundary was wrapped in a signed attestation envelope.
The attestation envelope included:
- The manifest payload (validated by the TOVG)
- A SHA-256 hash of the payload
- A timestamp and agent run ID
- A digital signature produced by a short-lived signing key issued to the agent runtime via their internal secrets management system
- A reference to the specific schema version the payload was validated against
The packaging service was updated to verify this envelope before processing any manifest. If the signature was invalid, the schema version was unrecognized, or the payload hash did not match, the job was rejected at intake. The packaging service became, in effect, a zero-trust consumer: it trusted nothing that could not prove its provenance.
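The envelope and its intake check can be sketched end to end. One hedge: the source describes a short-lived asymmetric signing key issued by a secrets manager; the stdlib-only sketch below substitutes an HMAC with a shared key so it stays self-contained, which changes the key model but not the envelope shape:

```python
import hashlib
import hmac
import json
import time
import uuid

# Placeholder for the short-lived key the agent runtime would receive
# from the secrets manager; a shared HMAC key stands in for the real
# asymmetric signature to keep the sketch runnable.
SIGNING_KEY = b"short-lived-key-from-secrets-manager"

def wrap(manifest: dict, schema_version: str) -> dict:
    """Wrap a TOVG-validated manifest in a signed attestation envelope."""
    canonical = json.dumps(manifest, sort_keys=True).encode()  # canonical form
    digest = hashlib.sha256(canonical).hexdigest()
    return {
        "payload": manifest,
        "payload_sha256": digest,
        "agent_run_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "schema_version": schema_version,
        "signature": hmac.new(SIGNING_KEY, digest.encode(), "sha256").hexdigest(),
    }

def verify(envelope: dict, known_schemas: set) -> bool:
    """Zero-trust intake check run by the packaging service."""
    canonical = json.dumps(envelope["payload"], sort_keys=True).encode()
    if hashlib.sha256(canonical).hexdigest() != envelope["payload_sha256"]:
        return False  # payload was altered after signing
    if envelope["schema_version"] not in known_schemas:
        return False  # unrecognized schema version is rejected at intake
    expected = hmac.new(SIGNING_KEY, envelope["payload_sha256"].encode(), "sha256").hexdigest()
    return hmac.compare_digest(expected, envelope["signature"])
```

Any mutation of the payload between agent and packaging service, however small, flips the hash check and rejects the job at intake.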
Layer 3: Immutable Artifact Signing With Build Provenance Metadata
The third layer addressed the deepest flaw the incident had exposed: the fact that a cryptographic signature on an artifact said nothing about the correctness of the artifact's contents or the integrity of the process that produced it. Velorik rebuilt their artifact signing workflow around the SLSA (Supply-chain Levels for Software Artifacts) framework, targeting SLSA Level 3 compliance for all driver packages.
Key changes included:
- Hermetic builds: packaging jobs were moved to isolated, ephemeral build environments with no network access during the build step, eliminating the possibility of runtime dependency substitution
- Build provenance attestations: every artifact was accompanied by a signed SLSA provenance document describing the exact inputs, the build environment, the agent run ID that produced the manifest, and the schema version used for validation
- Immutable artifact storage: packaged artifacts were written to a content-addressed, append-only artifact store; no artifact could be overwritten or deleted, only superseded by a new versioned entry
- Transparency log integration: artifact hashes and provenance attestations were submitted to an internal transparency log (modeled on the Sigstore Rekor pattern) before any artifact was promoted to the distribution CDN
- Customer-side verification tooling: Velorik shipped a lightweight verification CLI to enterprise customers that could check any downloaded package against the transparency log, confirming both cryptographic integrity and build provenance
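The append-only storage property above follows directly from content addressing, which a minimal in-memory sketch (class and method names are illustrative) makes concrete:

```python
import hashlib

class ImmutableArtifactStore:
    """Sketch of a content-addressed, append-only artifact store.

    Artifacts are keyed by the SHA-256 of their bytes, so an existing
    entry can never be overwritten in place, only superseded by a new
    entry under a new address.
    """

    def __init__(self):
        self._blobs = {}  # digest -> bytes

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        # Identical content maps to the same address, so re-putting the
        # same bytes is a no-op; different content cannot collide on a key.
        self._blobs.setdefault(digest, data)
        return digest

    def get(self, digest: str) -> bytes:
        data = self._blobs[digest]
        # Verify on read: the address itself proves integrity.
        assert hashlib.sha256(data).hexdigest() == digest
        return data

    def delete(self, digest: str):
        raise PermissionError("append-only store: artifacts cannot be deleted")
```

The same digest that keys the store is what gets submitted to the transparency log, so the storage address, the log entry, and the customer-side verification check all refer to one immutable identity.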
The Results: Six Months After the Rebuild
By the time the new architecture was fully deployed in March 2026, the results were measurable and significant. Here is what the team reported at their internal six-month review:
- Zero undetected schema violations in agent-produced manifests since TOVG deployment. Four validation failures were caught and halted before reaching packaging, all traced to legitimate upstream API changes that were then properly versioned and communicated.
- Mean time to detect pipeline anomalies dropped from 14 weeks to under 4 hours, driven by structured error events and real-time alerting on validation failures.
- All 43 driver packages shipped since the rebuild passed independent provenance verification by customer security teams, including the three clients who had opened formal incident reviews.
- Contract renewals were recovered from all three at-risk enterprise accounts, with two explicitly citing the new transparency log and customer-side verification tooling as a differentiating security feature.
- The engineering team reported that the TOVG had a secondary benefit nobody anticipated: it forced every tool owner to write and maintain an explicit schema contract, which dramatically improved internal API documentation and reduced integration bugs across the entire platform.
The Broader Lesson: AI Agents Are Not Trusted Internal Services
The most important insight from the Velorik incident is not technical. It is conceptual. The team had unconsciously extended the same trust to their AI agent that they would extend to a well-tested internal microservice. When a microservice returns a 200 OK, you generally trust the response. You built the service. You tested it. You own it.
An AI agent is not that. An AI agent is a probabilistic reasoning system operating over tool outputs that may change, drift, or fail in ways that do not surface as errors. It is, in a security framing, an untrusted intermediary. Every output it produces, whether from its own reasoning or from a tool it called, should be treated with the same skepticism you would apply to data arriving from a third-party API you do not control.
This reframing has practical consequences for any team building AI-augmented pipelines:
- Define schemas before you define prompts. The schema is the contract. The prompt is the implementation detail.
- Validate at every boundary, not just at the edge. Tool output to agent context is a boundary. Agent context to downstream service is a boundary. Treat them both as untrusted handoffs.
- Make failures loud, not graceful. An agent that silently recovers from a malformed tool response is an agent that will silently corrupt your data. Fail hard, fail early, fail visibly.
- Sign for correctness, not just integrity. Cryptographic signing proves an artifact has not been tampered with. Build provenance attestation proves it was built correctly. You need both.
Conclusion: The Pipeline You Trust Is the Pipeline You Have Not Audited Yet
Velorik's story is not unique. As AI agents move deeper into engineering workflows in 2026, the attack surface for this class of failure is growing rapidly. The tools get more capable, the pipelines get more automated, and the assumption that "the AI handled it" quietly replaces the discipline of explicit validation.
The good news is that the remediation is not exotic. Output schema enforcement, attestation envelopes, immutable artifact stores, and transparency logs are all mature, well-documented patterns. The hard part is not technical availability but organizational habit: treating AI agent outputs as untrusted data from day one, before the fourteen-week silent corruption window, before the customer incident report, before the contract review meeting.
Build the validation layer before you need it. Because by the time you know you need it, your signed artifacts are already in the wild.