How One Fintech Team's 6-Hour Outage Exposed the Hidden Cost of Non-Deterministic Builds (And What They Did About It)

At 2:47 AM on a Tuesday in January 2026, the on-call engineer at a mid-size payments fintech we'll call Vaultline got the alert no one wants: transaction processing was down across three regions. Not degraded. Not slow. Down. By the time the incident was resolved, six hours and eleven minutes had passed, roughly $340,000 in delayed settlements had accumulated, and two senior engineers had filed formal post-mortems blaming not a bug in the application code, not a database failure, and not a cloud provider outage. They blamed a build.

More specifically, they blamed a non-reproducible, non-deterministic build that had silently shipped a subtly different binary than the one that passed staging. The kind of failure that most engineering organizations are still completely unprepared for in 2026, even though the tooling to prevent it has existed for years.

This is the story of what happened, how Vaultline's team diagnosed it, and how they rebuilt their entire CI/CD pipeline around deterministic builds, hermetic toolchains, and build provenance. It is also a quiet indictment of the operational habits that backend engineers across the industry are still ignoring.

The Incident: When "It Passed Staging" Stopped Meaning Anything

Vaultline runs a Go-based microservices backend with roughly 28 services, deployed via Kubernetes on a major cloud provider. Their CI/CD pipeline at the time of the incident was a fairly standard setup: GitHub Actions for orchestration, Docker for containerization, a private container registry, and ArgoCD for GitOps-style deployments. Nothing exotic. In fact, their setup looked like a textbook modern fintech stack.

The deployment that caused the outage was a routine release of their payment-router service. The PR had been reviewed, tests had passed in CI, and the staging deployment had been running cleanly for 18 hours before the team promoted it to production. What they did not know was that the container image running in staging and the container image deployed to production were not the same binary, even though they shared the same Git SHA tag.

How? The build had been triggered twice: once when the PR was merged (which produced the image tested in staging), and once when the deployment pipeline re-triggered a build before the production rollout (a common pattern to "ensure freshness"). The Go module dependencies themselves were identical across both builds, since go.sum pins them by hash. But a C library pulled in via CGO was installed from the base image's package manager during the Docker build, with no version lock, and a patch release of that library had been published between the two builds. Each build therefore pulled a different version. The difference was enough to change the behavior of a low-level TLS session resumption path under high connection concurrency, which only manifested at production traffic volumes.
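A minimal Dockerfile sketch of this pattern, reconstructed for illustration (the C library name is a placeholder, not Vaultline's actual dependency), shows both drift points in one file:

```dockerfile
# ANTI-PATTERN (illustrative reconstruction): two sources of drift.
FROM golang:1.23-bookworm              # mutable tag: may resolve to a new image tomorrow
RUN apt-get update && \
    apt-get install -y libfoo-dev      # placeholder C library, resolved to "whatever is current"
COPY . /src
WORKDIR /src
RUN CGO_ENABLED=1 go build -o /payment-router ./cmd/payment-router
```

Run twice, days apart, a Dockerfile like this can produce two different binaries from the same commit.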

The result: a subtle memory corruption pattern that caused connection pool exhaustion after approximately 90 minutes of production load. Staging never saw it because staging traffic is a fraction of production.

The Diagnosis: Why It Took Four Hours Just to Find the Cause

The first two hours of the incident were consumed by the wrong hypothesis. Observability tooling showed connection pool exhaustion, which pointed the team toward the connection pool configuration, the database layer, and the Kubernetes resource limits. All red herrings. It was not until an engineer on the team, almost as an afterthought, ran a docker inspect comparison between the staging and production image digests that the discrepancy surfaced.
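The check that finally surfaced the problem is reproducible in a couple of commands. Against a live registry it amounts to reading each environment's image digest and comparing them; the image name and digest values below are placeholders for illustration:

```shell
# Against a real cluster, each environment's digest comes from something like:
#   docker inspect --format '{{index .RepoDigests 0}}' payment-router:<git-sha>
# The comparison itself is just string equality:
staging_digest="sha256:aaaa0000"   # placeholder, not a real digest
prod_digest="sha256:bbbb1111"      # placeholder, not a real digest
if [ "$staging_digest" != "$prod_digest" ]; then
  echo "MISMATCH: same tag, different artifact"
fi
```

Two images that share a tag but differ in digest are, by definition, different artifacts.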

The images had different SHA256 digests despite being tagged identically. That single discovery reframed the entire incident. But it also revealed a deeper problem: the team had no build provenance records. They could not answer basic questions about either build:

  • What exact versions of every dependency, including system libraries, were included?
  • What was the state of the build environment at the time of each build?
  • Which build runner executed the job, and what was installed on it?
  • Had any external network calls been made during the build that could have resolved to different content?

Without provenance, the team was essentially doing archaeology. They eventually reconstructed enough to identify the CGO library version mismatch, but it took four hours of manual diff work across build logs that were partially expired from their CI provider's retention window. The rollback itself took only 22 minutes. The diagnosis took the rest of the six hours.

The Root Cause: Three Compounding Non-Determinism Vectors

When the post-mortem was complete, the team identified not one but three independent sources of non-determinism in their build pipeline, any one of which could cause a similar incident in the future:

1. Mutable Base Images

Their Dockerfile referenced base images by tag (e.g., golang:1.23-bookworm) rather than by digest. Tags are mutable. A tag can point to a different image layer set on Tuesday than it did on Monday, and Docker will happily pull the new one without any warning. Every build that referenced this tag was potentially building on a different foundation.
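The fix is mechanical: reference the base image by digest so the same bytes are pulled every time. The digest below is a placeholder; in practice you would substitute the real digest of the image you tested:

```dockerfile
# Pinned by digest: this line can only ever resolve to one exact image.
# The sha256 value is a placeholder -- use the actual digest for your tag.
FROM golang:1.23-bookworm@sha256:<digest-of-the-image-you-tested>
```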

2. Network-Resolved System Dependencies

The CGO build step ran apt-get install inside the Docker build context to pull in a C library. This call reached out to Debian's package mirrors at build time and resolved to whatever was current. There was no pinning of the package version, no offline mirror, and no verification of the package hash against a known-good value.
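A pinned alternative might look like the following sketch (package name and version are placeholders). Exact-version installs fail loudly when the mirror has moved on, which turns silent drift into a visible build error:

```dockerfile
# Illustrative: pin the package version so a mirror update breaks the build
# instead of silently changing the binary.
RUN apt-get update && \
    apt-get install -y --no-install-recommends libfoo-dev=1.2.3-1 && \
    rm -rf /var/lib/apt/lists/*
# Stricter still: download the .deb once, commit its hash, and verify, e.g.:
#   echo "<known-good-sha256>  libfoo-dev_1.2.3-1_amd64.deb" | sha256sum -c -
```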

3. Re-Triggered Builds Without Artifact Promotion

The pipeline architecture allowed, and in some cases required, rebuilding from source rather than promoting a previously validated artifact. This meant that the artifact that passed testing was not necessarily the artifact that shipped to production. This is one of the most common and most dangerous anti-patterns in CI/CD, and it is still widespread in 2026.

The Rebuild: Designing for Determinism From the Ground Up

Over the following eight weeks, Vaultline's platform engineering team, a group of four engineers, rebuilt their pipeline around three core principles: hermetic builds, artifact promotion, and build provenance. Here is what that looked like in practice.

Hermetic Toolchains with Bazel

The team migrated their most critical services, starting with payment-router, to Bazel as their build system. Bazel's hermetic build model is designed specifically to eliminate environmental non-determinism. Dependencies, including toolchains, compilers, and system libraries, are declared explicitly in build files and fetched from content-addressed storage using cryptographic hashes. Given the same inputs, a Bazel build on a developer's laptop and a Bazel build in CI can produce byte-for-byte identical outputs, because Bazel controls and isolates the entire build environment; in practice this also means keeping impurities like embedded timestamps out of the build rules.
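In Bazel terms, "declared explicitly and fetched by hash" looks roughly like this WORKSPACE fragment. The repository name, URL, and hash are placeholders, not Vaultline's actual configuration:

```starlark
load("@bazel_tools//tools/build_defs/repo:http.bzl", "http_archive")

# Hypothetical external C dependency, declared as a hermetic input.
# If the archive at the URL ever changes, the sha256 check fails the
# build instead of silently changing its output.
http_archive(
    name = "libfoo",
    urls = ["https://mirror.internal/libfoo-1.2.3.tar.gz"],  # placeholder internal mirror
    sha256 = "0000000000000000000000000000000000000000000000000000000000000000",  # placeholder
)
```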

For services where a full Bazel migration was not immediately practical, the team implemented a lighter-weight approach: all Docker base images are now pinned by digest rather than by tag, and all apt-get calls during builds have been eliminated in favor of pre-built, versioned base images maintained internally. The internal base images are rebuilt on a controlled schedule, tested, and promoted through their own pipeline before being made available to service builds.

Artifact Promotion, Never Rebuild

This was the most operationally impactful change and, frankly, the simplest. The team established a hard rule: the artifact that passes testing is the artifact that ships. No exceptions. The pipeline now works as follows:

  1. A build is triggered once, on merge to the main branch.
  2. The resulting container image is pushed to the registry and recorded by its immutable, content-addressed digest.
  3. All subsequent pipeline stages (integration tests, staging deployment, production deployment) reference that specific digest, not a tag.
  4. Re-triggering a build produces a new artifact with a new digest, which must pass all validation stages before it can be promoted. There is no mechanism to skip stages for a new build.
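Sketched as a GitHub Actions workflow, the build-once rule reduces to capturing the digest at build time and threading it through every later stage. Job names, registry host, and paths below are illustrative, not Vaultline's actual pipeline:

```yaml
# Illustrative workflow fragment -- build once, promote by digest.
on:
  push:
    branches: [main]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build and push exactly once
        run: |
          docker build -t registry.internal/payment-router:${GITHUB_SHA} .
          docker push registry.internal/payment-router:${GITHUB_SHA}
      - name: Capture the immutable digest for promotion
        run: |
          docker inspect --format '{{index .RepoDigests 0}}' \
            registry.internal/payment-router:${GITHUB_SHA} > image-digest.txt
      # Later stages (staging, production) consume image-digest.txt and
      # deploy by digest -- they never rebuild and never re-resolve a tag.
```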

Build Provenance with SLSA Level 3

The team implemented build provenance generation conforming to SLSA (Supply-chain Levels for Software Artifacts) Level 3. For every build, the pipeline now automatically generates and signs a provenance attestation that records:

  • The exact Git commit SHA that triggered the build
  • The identity and configuration of the build runner
  • A cryptographic hash of every input to the build, including dependency manifests
  • The identity of the signing key used to attest the provenance
  • A timestamp and build environment snapshot

These attestations are stored alongside the container images in their registry using the Sigstore/Cosign toolchain and are verified as a gate in the deployment pipeline. If an image arrives at the deployment stage without a valid, verifiable provenance attestation, the deployment is automatically rejected. This is not just a security measure; it is an operational one. During any future incident, the team can now answer every provenance question in under 60 seconds.
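A deployment-gate sketch with Cosign might look like the following. The cosign invocation is shown as a comment because it needs a registry and keys; the image reference and key path are illustrative, and a stand-in variable models its exit status. The gate logic itself is simply "no valid attestation, no deploy":

```shell
IMAGE="registry.internal/payment-router@sha256:abc123"   # illustrative reference
# With keys and a registry available, verification is roughly:
#   cosign verify-attestation --key cosign.pub --type slsaprovenance "$IMAGE"
# Stand in for its exit status here to show the shape of the gate:
attestation_valid=false
if [ "$attestation_valid" != "true" ]; then
  echo "deploy rejected: no verifiable provenance for $IMAGE"
fi
```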

The Operational Results: Six Months Later

By the time this case study was written in early 2026, Vaultline's platform team had been running the rebuilt pipeline for approximately five months. The numbers are worth sharing:

  • Zero environment-related build discrepancies detected since the pipeline rebuild. The provenance verification gate has caught two cases where a developer attempted to manually push an image that bypassed the build system, both flagged immediately.
  • Mean time to diagnose (MTTD) for deployment-related incidents dropped from an estimated 3.5 hours (based on historical incidents) to under 20 minutes, because provenance data answers the "what changed" question instantly.
  • Build times increased by approximately 14% due to Bazel's more rigorous dependency resolution. The team considers this an acceptable trade-off.
  • Developer friction was minimal for the services that stayed on Docker-based builds with pinned digests. The Bazel migration required a meaningful upfront investment but has been well-received by the engineers who work on those services daily.

What the Rest of the Industry Is Still Getting Wrong

Vaultline's story is not unusual. It is, in fact, representative of a systemic gap in how most engineering organizations think about their build pipelines. Here are the patterns that are still dangerously common in 2026:

Treating the Build as a Formality, Not a Security and Reliability Boundary

Most teams think of the build step as a box to check on the way to deployment. The code is the artifact of interest; the build is just the process that packages it. This mental model is wrong and increasingly dangerous. The build environment is an attack surface and a reliability variable. What goes into the build, including compilers, system libraries, package manager resolutions, and environment variables, is as important as the source code itself.

Tag-Based Image References in Production Pipelines

Despite years of community guidance on this, a significant portion of production Kubernetes deployments in 2026 still use mutable image tags in their deployment manifests. image: myapp:latest or even image: myapp:v2.3.1 are not guarantees of a specific artifact. Only a digest reference (image: myapp@sha256:abc123...) is.

No Separation Between Build Artifacts and Build Processes

The pattern of rebuilding from source at promotion time, rather than promoting a validated artifact, is extremely common. It feels safe because it seems like "starting fresh," but it introduces exactly the kind of environmental non-determinism that caused Vaultline's outage. The artifact that passed tests should be the artifact that ships. Always.

Provenance as an Afterthought

SLSA has been a well-defined framework for several years. Sigstore and Cosign are mature, well-supported tools. Yet adoption outside of large technology companies and security-focused organizations remains low. Most mid-size engineering teams have never generated a build provenance attestation, and many engineers have never heard of SLSA. In a world where software supply chain attacks are a documented, recurring threat vector, this is an increasingly difficult position to defend.

Practical Starting Points for Teams Who Are Not Vaultline

You do not need to rebuild your entire pipeline in eight weeks to meaningfully reduce your exposure. Here is a pragmatic, prioritized list of changes any backend team can start making today:

  1. Pin all base images by digest immediately. This is a one-line change per Dockerfile and eliminates one of the most common sources of build non-determinism. Do this today.
  2. Implement artifact promotion in your pipeline. Stop rebuilding at promotion time. Build once, promote the digest. Your CI platform almost certainly supports this pattern.
  3. Add build log retention. You cannot diagnose what you cannot see. Ensure your CI build logs are retained for at least 90 days. Many providers default to 30 days or less.
  4. Evaluate Sigstore/Cosign for image signing. Even basic image signing without full SLSA provenance is a meaningful improvement over no attestation. The tooling is mature and integrates with most CI platforms.
  5. Audit your external network calls during builds. Any curl, apt-get, pip install, or npm install that runs during a build and resolves to "whatever is current" is a non-determinism vector. Either pin it, cache it, or move it to a pre-built base image.
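Item 1 is also easy to audit mechanically. A rough check for tag-pinned (rather than digest-pinned) base images, shown here against a throwaway sample Dockerfile for demonstration:

```shell
# Create a sample Dockerfile to scan (demonstration only).
cat > Dockerfile.sample <<'EOF'
FROM golang:1.23-bookworm
EOF
# Flag any FROM line that lacks a @sha256: digest suffix.
unpinned=$(grep '^FROM' Dockerfile.sample | grep -vc '@sha256:')
if [ "$unpinned" -gt 0 ]; then
  echo "found $unpinned unpinned base image(s)"
fi
```

Pointed at a real repository instead of the sample file, the same grep makes a serviceable CI lint.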

Conclusion: The Build Is the Product

The most important reframe that came out of Vaultline's incident is deceptively simple: the build is not a step in the process; the build is part of the product. The reproducibility, verifiability, and provenance of your build artifacts are engineering properties that deserve the same rigor as your application code, your database schema, and your API contracts.

Six hours of downtime and $340,000 in delayed settlements is a steep tuition bill. But the lesson it bought, that non-deterministic builds are not an abstract concern but a concrete operational risk, is one that every engineering team can learn from without paying it themselves.

The tooling exists. The frameworks exist. The patterns are documented and battle-tested. The only thing missing, at most organizations, is the decision to treat build determinism as a first-class engineering concern rather than a future nice-to-have. In 2026, that decision is overdue.