Deterministic Replay for Code Debugging AI: Making Bugs Reproducible, Fixes Verifiable
Software teams want an AI that can take a failing CI job, reproduce it locally, generate a patch, and send back a verifiably correct pull request. The brutal reality is that most failures are not cleanly reproducible: they depend on the exact environment, external I/O, nondeterministic concurrency scheduling, the clock, or long-tail system state. Without reproducibility, debugging (human or AI) devolves into guesswork.
Deterministic replay makes reproducibility a feature, not a hope. If you can capture a complete execution trace, a debugging agent can re-run the same code path deterministically, explore hypotheses, propose diffs, and validate fixes with strong guarantees. The path from flaky, environment-sensitive tests to deterministic AI-driven repair goes through hermetic builds, syscall/I/O recording, sandboxing, and verifiable CI orchestration.
This article provides a practical, opinionated blueprint for building such a pipeline on Linux today, with nods to Windows/macOS. We’ll cover:
- Why reproducibility is the gating factor for autonomous debugging
- What to record to get deterministic replays (and what not to)
- Hermetic builds using Nix/Bazel and reproducible builds practices
- Syscall/I/O recording techniques: user-space interception, ptrace, eBPF, and VM-level record/replay
- Sandboxing via user namespaces, gVisor, Firecracker microVMs
- A CI workflow that turns failures into replayable artifacts, and patches into verifiable PRs
- Reference tools, code snippets, and a minimal implementation you can ship this quarter
The audience is engineers who ship code at scale: CI owners, infra/platform teams, developer productivity orgs, and anyone trying to make AI debugging real.
Executive Summary
- Deterministic replay is the only reliable way to give a debugging AI a ground-truth reproduction of a failure.
- Record the minimum sufficient set: process tree, syscalls, file reads, environment, time/randomness, concurrency schedule, network I/O. Use content-addressed snapshots for files.
- Make builds hermetic: pin toolchains and inputs; normalize locale/timezone; lock the graph. Nix/Guix or Bazel/Pants are your allies.
- Sandbox everything: containers are table stakes; gVisor or Firecracker yields strong isolation and consistent kernel behavior; use seccomp and user namespaces.
- In CI, treat traces as first-class artifacts. On failure, automatically publish a replay bundle. Require that fixes pass both in replay and in a fresh, hermetic live run.
- Start small: add rr or strace/eBPF capture to flakiest test jobs; roll out hermetic runners; add trace-redaction for privacy; grow toward full record/replay.
Why Reproducibility is the Gating Factor for AI Debugging
Traditional debugging relies on a human reproducing the bug. If the failure depends on ephemeral state (clock rollovers, DNS flaps, shared caches, kernel version quirks), you will waste hours chasing ghosts. An AI agent suffers the same fate unless you arm it with a faithful reproduction of the failing execution.
Deterministic replay offers:
- Reproducible failures: press play and get the same behavior every time, including races.
- Controlled perturbation: change one variable (a patch) and hold everything else constant to measure causal impact.
- Forensics: step backward and forward in time; inspect memory and I/O as they happened.
- Verifiable PRs: attach a proof-of-reproduction and proof-of-fix artifact to every AI-generated patch.
It also slays flakiness. A flake is definitionally nondeterministic; record/replay pins it down, enabling root-cause analysis rather than suppression.
Sources of Nondeterminism You Must Control
To replay deterministically, you need to neutralize or record the following:
- Time: wall clock, monotonic clock, timezones, locales, leap seconds
- Randomness: PRNG seeds, /dev/urandom, getrandom(), language RNGs
- Concurrency: thread interleavings, futex wake-ups, timer signals, CPU counts
- Environment: env vars, process IDs, working directory, umask, PATH, locale
- Filesystem: file contents, permissions, symlink resolution, directory listings
- Network: DNS, TCP stream boundaries, packet timing, TLS randomness
- Kernel version and behavior: syscalls semantics, scheduler, cgroup limits
- Hardware: CPU features, vectorization flags, instruction-set differences
- External services: APIs, databases, caches, S3 buckets, message queues
You don’t have to record everything at cycle-level fidelity. The guiding principle: record inputs at boundaries where your application consumes them, and ensure all nondeterministic sources are either fixed or captured.
What to Record: The Minimum Sufficient Set
Aim for a replay bundle that is both small and sufficient. A practical trace usually contains:
- Build manifest: commit hash, dependency lockfiles, container image digests
- Runtime config: CLI args, environment variables, feature flags
- Process tree: fork/exec graph, command lines, exit codes, working directories
- Syscall log: a timestamped stream of syscalls with arguments and return values
- File snapshots: content-addressed blobs for files read during execution
- Random/time feeds: captured values returned to the process for time/randomness
- Network I/O: DNS responses and inbound/outbound byte streams (or a mockable boundary)
- Scheduling decisions: enough ordering info to replay concurrency deterministically (e.g., rr’s approach)
For many services, capturing syscalls, file reads, and network I/O plus normalized time/randomness is sufficient. For highly concurrent, timing-sensitive C/C++ code, user-space record/replay (e.g., rr) that controls scheduling is ideal.
Approaches to Deterministic Record/Replay
You have a spectrum of options with trade-offs in overhead, fidelity, portability, and implementation complexity.
- Language-level recording: intercept standard libraries (e.g., Python’s time/random, Java’s Thread scheduling via determinizers). Low overhead, but misses native dependencies and syscalls.
- User-space syscall interposition: LD_PRELOAD wrappers, libsyscall interposers. Easy to deploy; imperfect coverage; good for time/random/files.
- ptrace-based tracing (strace/rr): capture all syscalls and results. rr adds deterministic thread scheduling. Overhead is moderate; Linux-only (x86-64, with newer AArch64 support).
- eBPF-based tracing: kprobes/uprobes to gather syscall and I/O data with lower overhead. Requires careful engineering and kernel features.
- Virtualization-level record/replay: QEMU/PANDA, VMware/Hyper-V, or Intel Processor Trace (PT) with kernel integration. Highest fidelity; higher overhead; best for kernel-sensitive bugs.
- Black-box system snapshotting: snapshot filesystem, container image, and test inputs; rely on hermeticity to reproduce. Lowest overhead; least precise for concurrency bugs.
Rule of thumb:
- For CLI tools, tests, and most services: strace/eBPF + hermetic builds + file/network capture.
- For nasty data races: rr or VM-level replay.
- For language-heavy stacks: add language shims (time/random/dns) to reduce trace size.
Hermetic Builds: Pin the World
If your build is not hermetic, recording runtime I/O won’t save you. The debugging AI must be able to build the same bits.
Key practices:
- Content-addressable inputs: pin base images and toolchains by digest; mirror artifacts.
- Reproducible builds: avoid timestamps in artifacts, fix locale and timezone, set SOURCE_DATE_EPOCH, strip nondeterministic sections.
- Dependency locking: use lockfiles (npm, pip-tools, Cargo, Go modules) and vendor if possible.
- Build systems: Nix/Guix for full-system hermeticity or Bazel/Pants/Buck with remote execution and sandboxing.
- Environment normalization: fixed PATH, HOME, umask, TZ, LANG, CPU count; forbid network during builds.
A minimal, practical choice is Bazel with sandboxed actions and a pinned docker toolchain image, or Nix flakes with a locked flake.nix/flake.lock. This lets your AI agent rehydrate the exact toolchain inside a container or VM and build byte-identical outputs.
Example: Reproducible Build Setup
- For Bazel: enable sandboxfs or Linux sandboxing; set stamp = False unless you’re explicitly stamping; disable embedded timestamps; pin toolchains.
- For Nix: use flakes; niv or builtins.fetchTarball with sha256; set sandbox = true in nix.conf; cache in a binary cache (Cachix).
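As a concrete sketch of what the build step can look like under these settings (targets, flake outputs, and flags are illustrative, not a prescribed configuration):

```bash
# Hermetic build sketch; targets and names are placeholders.
set -euo pipefail

# Nix: build from a locked flake inside the Nix sandbox.
export SOURCE_DATE_EPOCH=315532800            # fixed timestamp for tools that honor it
nix build .#my-service --option sandbox true  # flake.lock pins all inputs

# Bazel: sandboxed actions, no stamping, normalized action environment.
bazel build //services/my-service:all \
  --spawn_strategy=linux-sandbox \
  --nostamp \
  --action_env=TZ=UTC --action_env=LANG=C.UTF-8
```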
Recording Syscalls and I/O
You need syscall and I/O capture to provide ground truth about what your process saw and did.
Approaches:
- strace: simplest; portable across Linux distros. Use -ff to follow forks, -yy to print paths for file descriptors, -s to print long strings, -ttT for timestamps and per-syscall timing. A good starting point.
- rr: records user-space execution with deterministic scheduling and supports reverse debugging. Ideal for C/C++ races and heisenbugs.
- eBPF: custom tracers using bpftrace or libbpf to capture open/read/write/connect/accept/send/recv with buffers. Lower overhead than ptrace; higher engineering cost.
- LD_PRELOAD wrappers: intercept fopen/read/write/getaddrinfo/clock_gettime/getrandom. Good to normalize or mock time/randomness, but does not see every path.
Minimal Linux Prototype using strace + bubblewrap + LD_PRELOAD
This prototype captures enough to replay many failures:
- Containerized sandbox (bubblewrap) to isolate FS/network
- strace to record syscalls per process
- LD_PRELOAD to control time and randomness
- OverlayFS or a snapshot to freeze the filesystem view
Bash harness:
```bash
#!/usr/bin/env bash
set -euo pipefail

WORKDIR=$(mktemp -d)
ROOTFS=/var/lib/my-ci/basefs   # a read-only base image
UPPER=$WORKDIR/upper
WORK=$WORKDIR/work
MERGED=$WORKDIR/merged
mkdir -p "$UPPER" "$WORK" "$MERGED"
# Requires root (or fuse-overlayfs for rootless setups)
mount -t overlay overlay -o "lowerdir=$ROOTFS,upperdir=$UPPER,workdir=$WORK" "$MERGED"

# Fake lib to normalize time/randomness
cat > "$WORKDIR/shim.c" <<'EOF'
#define _GNU_SOURCE
#include <dlfcn.h>
#include <time.h>
#include <sys/random.h>
#include <stdint.h>
#include <string.h>

static const time_t FIXED_EPOCH = 1700000000; // 2023-11-14

int clock_gettime(clockid_t clk_id, struct timespec *tp) {
    static int (*real)(clockid_t, struct timespec *) = NULL;
    if (!real) real = dlsym(RTLD_NEXT, "clock_gettime");
    int r = real(clk_id, tp);
    if (r == 0 && clk_id == CLOCK_REALTIME) {
        tp->tv_sec = FIXED_EPOCH;
        tp->tv_nsec = 0;
    }
    return r;
}

ssize_t getrandom(void *buf, size_t buflen, unsigned int flags) {
    // deterministic pseudo-random stream (xorshift64)
    (void)flags;
    static uint64_t s = 0xdeadbeefcafebabeULL;
    unsigned char *p = buf;
    for (size_t i = 0; i < buflen; i++) {
        s ^= s << 13; s ^= s >> 7; s ^= s << 17;
        p[i] = (unsigned char)(s & 0xff);
    }
    return (ssize_t)buflen;
}
EOF
gcc -shared -fPIC -o "$WORKDIR/libshim.so" "$WORKDIR/shim.c" -ldl

# Create a minimal sandbox with bubblewrap
bwrap \
  --unshare-all \
  --bind "$MERGED" / \
  --proc /proc \
  --dev /dev \
  --tmpfs /tmp \
  --die-with-parent \
  --unshare-net \
  --setenv TZ UTC \
  --setenv LANG C.UTF-8 \
  --setenv PATH /usr/bin:/bin \
  --ro-bind "$WORKDIR/libshim.so" /libshim.so \
  --bind "$PWD" /work \
  --chdir /work \
  sh -lc 'LD_PRELOAD=/libshim.so strace -ff -yy -ttT -s 8192 -o /work/trace ./your-test-binary --flag'

# Unmount overlay
umount "$MERGED" || true
```
This yields:
- A stable filesystem view via OverlayFS
- Deterministic time/randomness via LD_PRELOAD
- strace logs per PID in ./trace.* files capturing syscalls and I/O
Replay can be implemented by feeding recorded results to a runner that intercepts the same syscalls and returns recorded values. For concurrency-sensitive C/C++ bugs, switch to rr:
```bash
rr record ./your-test-binary --flag
# later
rr replay
```
Sandboxing: Make Replay Portable, Safe, and Consistent
Sandboxing is non-negotiable. You need safety (don’t exfiltrate secrets), consistency (kernel behavior), and portability (replay on a developer laptop or an AI worker).
Options:
- Containers with user namespaces: rootless Podman or bubblewrap. Good isolation; shares the host kernel.
- gVisor: user-space kernel implementing a large subset of Linux syscalls. Strong isolation; consistent semantics; small overhead for most workloads.
- Firecracker microVMs: lightweight VMs with KVM; strong isolation and consistent CPU features; a sweet spot for multi-tenant AI agents.
- Kata Containers: hybrid VM/container isolation.
For an AI debugging fleet, Firecracker-backed sandboxes provide the best security and determinism. For CI jobs within a trusted network, gVisor or rootless containers are often sufficient.
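For illustration, if runsc (gVisor) is registered as an OCI runtime on the host, a capture run can be pinned to it with the network disabled; the image digest and flags below are placeholders, not a required setup:

```bash
# Run the failing test under gVisor with no network access.
# Assumes runsc is installed and registered as a Docker/Podman runtime named "runsc".
docker run --rm \
  --runtime=runsc \
  --network=none \
  -e TZ=UTC -e LANG=C.UTF-8 \
  -v "$PWD:/work" -w /work \
  ghcr.io/org/toolchain@sha256:... \
  ./your-test-binary --flag
```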
Normalizing Kernel and CPU Features
Even with containerization, hosts differ. Set:
- CPU feature masks (disable AVX-512 if some hosts lack it)
- Cgroup limits (CPU count, memory) to stabilize scheduling
- Kernel version compatibility (pin base AMI or use gVisor/VM)
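A sketch of this normalization for a single capture run; the GLIBC_TUNABLES hwcaps mask is an assumption that requires a reasonably recent glibc, and the limits are placeholders:

```bash
# Pin the CPU set and memory via a transient systemd scope, and mask optional
# ISA extensions so IFUNC-selected code paths match across hosts.
export TZ=UTC
export GLIBC_TUNABLES=glibc.cpu.hwcaps=-AVX512F   # assumption: recent glibc honors this mask
systemd-run --scope -p AllowedCPUs=0-1 -p MemoryMax=2G -- \
  ./your-test-binary --flag
```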
Handling Network I/O: Cassette-Style Recording
External APIs and networks are the noisiest source of nondeterminism. Two patterns:
- Record-and-replay (“VCR cassettes”): capture DNS + TCP streams during failure; replay byte-for-byte. Best for tests and stateless calls (see the sketch after this list).
- Contracted mocks: define a boundary (e.g., HTTP client) and intercept at that layer, capturing requests/responses. Lower fidelity but easier to reason about.
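A minimal cassette sketch using mitmproxy as a stand-in recorder; the flags reflect current mitmproxy releases, so verify them against your version, and note that HTTPS clients must trust the proxy's CA:

```bash
# Record: route the test's HTTP(S) traffic through a recording proxy.
mitmdump -p 8080 -w net.cassette &
PROXY_PID=$!
HTTPS_PROXY=http://127.0.0.1:8080 HTTP_PROXY=http://127.0.0.1:8080 \
  ./your-test-binary --flag
kill "$PROXY_PID"

# Replay: serve recorded responses instead of hitting the real services.
mitmdump -p 8080 --server-replay net.cassette &
```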
Security considerations:
- Redact PII and secrets in-flight; never store tokens in traces. Use a redaction policy applied to HTTP headers and JSON bodies.
- Split payloads into content-addressed blobs; encrypt at rest; restrict access via signed URLs.
Example: eBPF-based network capture (concept)
- Attach kprobes to sys_enter_sendto and sys_exit_recvfrom (and TCP variants)
- For each task, capture bytes and metadata into a ring buffer
- Emit content hashes and store payloads separately
Tools like sysdig, Cilium Tetragon, or custom bpftrace scripts can get you >80% there without ptrace overhead.
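A tiny bpftrace sketch of the idea, capturing metadata only; a production capturer would copy the buffers into a ring buffer and hash them for content-addressed storage, which needs a libbpf program rather than a one-liner:

```bash
# Log send/recv sizes per process via syscall tracepoints.
sudo bpftrace -e '
tracepoint:syscalls:sys_enter_sendto  { printf("%d %s sendto len=%lu\n", pid, comm, args->len); }
tracepoint:syscalls:sys_exit_recvfrom { printf("%d %s recvfrom ret=%ld\n", pid, comm, args->ret); }
'
```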
Concurrency: Replaying Races
If your bug depends on thread interleaving, only rr/TTD-style record/replay or full VM replay is truly reliable. rr records sources of nondeterminism (syscalls, signals) and enforces the same ordering on replay, allowing reverse execution and precise inspection.
- rr works on Linux (x86-64, with newer AArch64 support), shines for native code, and typically runs at 1.2–2x overhead for single-threaded programs; because it serializes all threads onto one core, heavily multithreaded code slows down more.
- For managed runtimes (JVM, .NET), language-level determinizers exist but are incomplete. Consider restricting concurrency (e.g., GOMAXPROCS=1) during failing test capture to increase determinism, then fix the race with static/dynamic analysis once captured.
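A sketch of the rr workflow for a race, using chaos mode to perturb scheduling until the failure records; it assumes rr record propagates the test's exit status, so adapt the loop if your harness signals failure differently:

```bash
# Re-run the flaky test under rr until a failing recording is captured.
# --chaos randomizes scheduling to make rare interleavings more likely.
for i in $(seq 1 50); do
  if ! rr record --chaos ./your-test-binary --flag; then
    echo "captured failing recording on attempt $i"
    break
  fi
done

# Deterministically replay the captured failure; reverse execution is
# available inside the gdb session that rr replay opens.
rr replay   # then: continue, reverse-continue, watch, ...
```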
The CI Workflow: From Failure to Verifiable PR
Here’s a reference workflow that integrates deterministic replay with an AI agent.
- Build hermetically.
- Build using Nix/Bazel in a pinned container or microVM.
- Produce build manifest: digests, lockfiles, compiler versions.
- Run tests in a sandbox with capture enabled.
- Inject LD_PRELOAD shim for time/randomness.
- Start strace/eBPF capture; enable network cassette mode.
- On failure, stop and package a “replay bundle.”
- Publish the replay bundle to object storage.
- Contents: build manifest, container/microVM image ID, process tree, syscall logs, file snapshots (CAS), network cassettes, env/args, and minimal reproduction command.
- Attach a signed manifest referencing content hashes for deduplication.
- AI agent picks up the bundle and replays locally.
- Hydrate the same sandbox (Firecracker/gVisor) with the base image.
- Mount file snapshots read-only; inject shim; feed syscall and network results.
- Run the failing command; confirm reproduction deterministically.
- Agent proposes a patch and validates it.
- Apply patch; rebuild hermetically within sandbox.
- Run the recorded replay to assert the failure disappears; optionally run a divergent replay that allows network semantics to vary within protocol bounds.
- Run a fresh live test in the same hermetic sandbox with network off or directed at a staging mock.
- Agent opens a PR with proofs.
- PR includes: patch diff, reproduction manifest, replay logs, before/after trace summaries, and a short validation report.
- CI gate requires: pass replayed test; pass fresh hermetic run; pass normal CI suite; no new golden trace diffs unless approved.
- Human review aided by artifacts.
- Reviewers can fetch the bundle and run a single command to reproduce locally.
- Artifacts expire per policy, stored encrypted.
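A condensed sketch of the failure path in a CI step; the capture wrapper, bundle contents, and bucket layout are illustrative stand-ins for whatever your runner actually produces:

```bash
# Step 1: normal hermetic run. Step 2: on failure, re-run under capture and
# publish a replay bundle keyed by content hash.
if ! ./run_tests.sh; then
  ./run_tests_under_capture.sh || true          # strace/eBPF + shim + cassette (illustrative wrapper)
  tar --zstd -cf bundle.tar.zst manifest.yaml trace.* files.cas net.cassette proc.json
  BUNDLE_ID=$(sha256sum bundle.tar.zst | cut -d' ' -f1)
  aws s3 cp bundle.tar.zst "s3://traces/${BUNDLE_ID}/bundle.tar.zst"
  echo "replay bundle: ${BUNDLE_ID}"            # surfaced to the AI agent and the PR
  exit 1
fi
```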
Example: Bundle Manifest (YAML)
```yaml
version: 1
project: my-service
commit: 3a9f0d1
build:
  system: bazel
  image: ghcr.io/org/toolchain@sha256:...
  deps_lock: bazel_maven_install.json
runtime:
  sandbox: firecracker
  cpu_mask: "x86_64-v2"
  env:
    TZ: UTC
    LANG: C.UTF-8
  args:
    - ./bin/test
    - --seed=123
trace:
  syscalls: s3://traces/abc/syscalls.tar.zst
  files_cas: s3://traces/abc/files.cas
  network: s3://traces/abc/net.cassette
  process_tree: s3://traces/abc/proc.json
security:
  redactions:
    - http.header: Authorization
    - json.path: $.user.ssn
```
Diff-Based Repair in CI: Making Fixes Verifiable
A debugging AI must be judged on diffs and evidence, not vibes. Define acceptance rules:
- Reproduction invariant: The failing test deterministically fails under replay before the patch, and deterministically passes after the patch.
- No regression invariant: A representative subset of the suite passes under replay post-patch, and the full suite passes in a fresh hermetic live run.
- Trace delta policy: Allowed differences are only those implied by the code change (e.g., log line content, number of reads from a modified file). Unexpected syscall/network diffs are flagged.
- Test flakiness budget: If the same replay occasionally diverges, the run is invalid. Fix the capture setup or use rr.
The gate should block on unexpected trace diffs. This is stricter than conventional CI, but it’s what makes AI-generated patches reliably safe to merge—especially in large monorepos with brittle tests.
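One cheap way to start enforcing the trace-delta policy is to compare per-syscall histograms from the pre- and post-patch replays and fail on anything outside an allowlist. The sketch below assumes strace -ff -ttT logs in replay_before/ and replay_after/; directory names and the allowlist are illustrative:

```bash
#!/usr/bin/env bash
# Compare per-syscall histograms between the pre- and post-patch replays.
set -euo pipefail
summarize() { sed -E 's/^[0-9:. ]+//; s/\(.*$//' "$1"/trace.* | sort | uniq -c > "$2"; }
summarize replay_before before.hist
summarize replay_after  after.hist
diff before.hist after.hist | grep -E '^[<>]' > trace.delta || true
# Fail unless every changed syscall is on the (illustrative) allowlist.
if grep -vE 'read|write|openat' trace.delta; then
  echo "unexpected trace delta (see above)"; exit 1
fi
echo "trace delta within policy"
```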
Case Study: A Realistic Failure and Repair
Scenario: A Go service test intermittently fails with “token expired” despite a 5-minute TTL. Real-world cause: time skew between the test container and the signing library’s mix of monotonic and realtime clocks; garbage-collection pauses sometimes delay token issuance.
Capture setup:
- Sandbox: gVisor-backed container with TZ=UTC
- LD_PRELOAD shim won’t affect Go’s time.Now() directly; instead, run within rr or intercept gettimeofday/clock_gettime at the syscall layer (Go calls VDSO/clock_gettime).
- strace shows nanosleep, clock_gettime, futex interactions; captured file reads include a JSON config with ttl_seconds: 300.
Replay:
- Replaying the syscall stream reproduces the failure exactly within 2 seconds.
- Inspection shows token issuance uses wall clock while expiration compares monotonic time; under CPU contention, issuance + skew crosses a threshold.
Fix:
- Change the code to use time.Since(start), which relies on the monotonic clock reading that time.Time has carried since Go 1.9 (time.Now().Sub(origin) uses it automatically).
- Add a test that freezes time using a clock interface and asserts expiration logic.
Verification:
- Replay pre-patch: fails deterministically.
- Apply patch; rebuild; replay: passes.
- Trace delta: clock_gettime calls unchanged; arithmetic in user-space differs. No additional syscalls observed.
- Fresh live run under sandbox: passes; network is disabled, so isolated.
- PR includes replay bundle and a new deterministic unit test using an injected clock.
Result: Merge with confidence; flake eliminated.
Data Management: Storing Traces at Scale
Traces can be large. Practical strategies:
- Content-addressed storage (CAS): store file snapshots and network payloads by hash. Many runs share inputs; dedupe wins are massive.
- Compression: zstd at level 6 is a good default; long strings compress well in syscall logs.
- Sharding: split bundles by process; store metadata in an index (SQLite/Parquet) for search.
- Retention: short-term (7–30 days) for most; keep exemplars of flaky tests longer.
- PII minimization: redact at capture time; allow opt-in verbose capture with encryption and stricter ACLs.
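A minimal sketch of the CAS write path; the bucket name, compression level, and layout are placeholders:

```bash
# Store each captured file under its content hash; identical inputs across
# runs dedupe to a single object.
cas_put() {
  local f="$1" h
  h=$(sha256sum "$f" | cut -d' ' -f1)
  if ! aws s3 ls "s3://traces-cas/${h}.zst" >/dev/null 2>&1; then
    zstd -6 -q -c "$f" | aws s3 cp - "s3://traces-cas/${h}.zst"
  fi
  echo "$h"   # recorded in the bundle manifest in place of the raw path
}
```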
Performance Overheads and Trade-offs
Recording has a cost:
- strace: 20–200% overhead depending on I/O patterns; fork-heavy workloads hurt.
- eBPF: often <10–30% overhead for targeted syscall capture; engineering complexity is higher.
- rr: 1.2–2x for CPU-bound single-threaded; more for high I/O or many threads.
- VM-level replay: can be 2–5x, acceptable for post-failure capture, not for always-on.
Pragmatic approach:
- Capture only on failure paths (retry tests under capture if they fail without capture).
- Sample: always capture on flaky tests; sample stable tests.
- Targeted probes: capture only open/read/write/connect/accept/clock/random; elevate to full capture for suspect modules.
Security and Compliance
- Secrets hygiene: never persist tokens; redact/replace with placeholders; re-inject synthetic tokens for replay if needed.
- Multi-tenant isolation: use microVMs (Firecracker/Kata) for AI workers; assume untrusted code.
- Licensing: be mindful when snapshotting proprietary SDKs; store only what’s necessary for replay under your license terms.
- Data protection: encrypt traces at rest; maintain an auditable access log; set TTLs.
Cross-Platform Notes
- Linux: best-in-class tooling (rr, eBPF, gVisor, Firecracker). Prefer Linux for capture even if target platform is portable.
- Windows: WinDbg Time Travel Debugging (TTD) supports record/replay for native apps; Visual Studio has historical debugging features. Container-based hermeticity is improving with Windows containers and WSL2.
- macOS: system call tracing via dtrace is curtailed by System Integrity Protection; consider VM-based capture or build/test on Linux.
Tooling Menu
Open-source tools worth evaluating:
- rr (Mozilla): user-space record/replay, reverse debugging
- Pernosco: cloud time-travel debugging built on rr
- ReproZip: packs experiments with their dependencies for reproducibility
- bpftrace/libbpf: eBPF tooling for tracing syscalls and user functions
- strace: syscall tracing workhorse
- gVisor: user-space kernel sandbox
- Firecracker: microVMs for secure, minimal VMs
- Bazel/Pants/Buck: hermetic build systems with sandboxing
- Nix/Guix: fully declarative, reproducible system builds
- reprotest and reproducible-builds.org tools: detect nondeterminism in builds
A Minimal Reference Architecture
- Source control: monorepo with hermetic build (Bazel or Nix)
- CI runners: Linux hosts with Firecracker and gVisor available
- Test execution: default in container; on failure, re-run under capture
- Capture stack: LD_PRELOAD shim (time/random), strace or eBPF for syscalls, cassette network layer; rr for selected tests
- Artifact store: S3-compatible CAS for files and payloads; index in Postgres/Parquet
- AI worker: microVM per task; pulls bundle; replays; patches; validates
- Policy engine: enforces replay pass + trace-delta policy on PRs
Implementation Checklist
- Hermeticity
- Pin base images by digest
- Adopt Nix or Bazel with sandboxing
- Remove timestamps and locale drift from builds
- Capture
- Add failure-driven capture reruns
- Implement LD_PRELOAD for time/random
- Choose strace or eBPF capture for syscalls/I/O
- Add network cassette with redaction
- Sandbox
- Standardize on gVisor or Firecracker for replays
- Normalize CPU features and cgroup limits
- Storage
- CAS for file snapshots and payloads
- zstd compression; retention policy; encryption
- CI Integration
- Publish replay bundles on failures
- AI agent integration with replay + build
- PR gate: replay pass + live hermetic pass + trace-delta policy
- Observability
- Index traces; enable local developer reproduction command
- Flake dashboard driven by trace frequency
Opinionated Guidance
- Make traces first-class artifacts. If you don’t store it, you didn’t capture it.
- Prefer deterministic replay to mocks. Mocks are useful but lie by default; use them at explicit boundaries.
- Don’t chase 100% always-on capture. Start with capture-on-failure and expand to hot spots.
- Invest in redaction early. Privacy debt becomes security incidents.
- Treat reproducible builds as non-negotiable. Without hermeticity, debugging AI is handicapped.
Frequently Asked Practical Questions
- Isn’t this too heavy for all tests? No. You only need full capture on failures and flaky tests. For the rest, hermeticity plus targeted logging suffices.
- What about databases? Snapshot local DB state or run DBs in the same sandbox with deterministic persistence; capture network I/O at the boundary or run replay against a snapshot.
- How do we keep bundle sizes reasonable? Use CAS, dedupe, compression, and scope capture to inputs actually read. 50–500 MB per interesting failure is common and acceptable.
- Can we deploy rr in CI at scale? Yes, but selectively. rr is best for elusive native code bugs; strace/eBPF is cheaper and sufficient for most failures.
References and Further Reading
- rr: Lightweight record and replay for debugging (Mozilla). https://rr-project.org
- OOPSLA paper on rr: The Design and Implementation of rr. https://dl.acm.org/doi/10.1145/3133910
- Pernosco time-travel debugging. https://pernos.co
- ReVirt: Enabling Intrusion Analysis Through Virtual-Machine Logging and Replay (Dunlap et al., OSDI 2002). https://www.usenix.org/legacy/event/osdi03/tech/full_papers/king/king.pdf
- PANDA: Platform for Architecture-Neutral Dynamic Analysis. https://panda.re
- Reproducible Builds project. https://reproducible-builds.org
- Nix Pills and NixOS manual. https://nixos.org
- Bazel sandboxing and remote execution docs. https://bazel.build
- gVisor: https://gvisor.dev
- Firecracker: https://firecracker-microvm.github.io
- WinDbg Time Travel Debugging (TTD). https://learn.microsoft.com/windows-hardware/drivers/debugger/time-travel-debugging-overview
Conclusion
Deterministic replay is the enabling technology that turns “AI for debugging” from a demo into a production capability. By capturing the right execution data and running it inside a hermetic, sandboxed environment, you give an AI agent the same power elite debuggers rely on: faithful reproduction, precise forensics, and verifiable fixes.
The investment pays back quickly. Flaky tests become stable, postmortems become concrete, and AI-generated patches come with their own proofs. Start by making your builds hermetic, add capture-on-failure, store replay bundles, and wire an agent to produce diffs that your CI can verify. Within a few sprints, you’ll wonder how you ever merged red builds based on hope rather than evidence.
