Time-Travel CI: Deterministic Replay Meets Debug AI to Kill Heisenbugs
Heisenbugs are the bugs that vanish when you add logging, pass locally but fail in CI, or only appear under full moon load on ephemeral hardware. They are the consequence of nondeterminism: time, scheduling, IO, address space layout, hash seeds, flaky networks, and undefined behavior lurking in the seams. Conventional log-heavy forensics often makes them worse. The antidote is time travel.
This article proposes an end-to-end, production-grade pipeline that pairs deterministic record and replay (rr and optionally Pernosco) with a debug-focused AI to systematically isolate the root causes of flakes, rank races, auto-minimize repros, and suggest fixes. The twist: we do it without breaking hermetic builds, and we integrate with real-world CI thanks to careful sandboxing, containerization, and governance.
The target audience is technical: staff engineers, productivity teams, SREs, and toolsmiths who want to turn Heisenbugs from multi-week hunts into same-day triage and actionable pull requests.
Why deterministic replay belongs in CI
- Deterministic replay gives you the exact faulting execution, replayable as often as you need. No more chasing seeds. rr records a trace of syscalls, context switches, and other nondeterministic events, then drives the program down the same path during replay so that stepping, watchpoints, and reverse-continue actually transport you to the cause instead of the symptom.
- Pernosco layers a searchable, time-travel UI and rich analysis on top of rr traces. It is the closest thing to a black box flight recorder for debugging.
- Pairing replay with a specialized debug AI yields a smarter triage loop: extracting happens-before graphs, ranking races, detecting transient memory ownership anomalies, automatically minimizing a repro capsule, and proposing fix strategies.
Empirically, large orgs report 1 to 5 percent of test cases showing some degree of flakiness at scale, with outsized cost on developer attention and CI capacity. Replay changes the economics: capture once, replay many times offline, and let tools reason about the entire system timeline, not just the last stack trace.
Sources of nondeterminism we must tame
- Scheduling and interleavings: thread preemption, IO completion, signals, timer interrupts
- Time: clock reads, timers, backoff loops, JIT warmup
- Randomness: getrandom, urandom, ad hoc PRNG seeds
- Environment: PIDs, temp filenames, ports, filesystem ordering
- Network and remote services
- UB and data races: especially in C and C++, but also via FFI and native addons in higher-level languages
- JIT and GC: shape-dependent inlining, safepoints, allocation timing
A credible pipeline needs to capture or neutralize these without destabilizing hermetic builds or CI resource budgets.
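Some of these sources can be neutralized in the test harness itself before any replay tooling gets involved. A minimal sketch, assuming a pytest-based suite; the fixture below and its seed value are illustrative, not part of the pipeline:

```python
# conftest.py -- pin the cheap-to-control nondeterminism sources per test.
import random

import pytest


@pytest.fixture(autouse=True)
def deterministic_test_env(monkeypatch, tmp_path):
    # Fixed PRNG seed: failures depend on inputs, not on draw order.
    random.seed(1234)
    # Per-test temp directory instead of shared /tmp names.
    monkeypatch.setenv("TMPDIR", str(tmp_path))
    # Reproducible hash-based iteration order for subprocesses; the parent
    # interpreter needs PYTHONHASHSEED exported before it starts to benefit.
    monkeypatch.setenv("PYTHONHASHSEED", "0")
    yield
```

Scheduling, UB, and races cannot be pinned this way, which is exactly where recording comes in.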
Architecture at a glance
- Flake detector and trigger
- Watches CI results, computes flakiness scores per test and per target, and decides when to enter record mode. Optional chaos scheduling to increase race manifestation probability.
- Deterministic recorder
- Re-runs the suspect test under rr when triggers fire, captures a trace for the failing run, and packages a repro capsule.
- Trace storage and indexing
- Content-addressed storage with retention, compression, and PII guardrails; optional upload to Pernosco or to an on-prem deployment.
- Debug AI analysis
- Consumes rr and symbol info to build an event timeline, lock graph, and memory access summaries; ranks race candidates and crash roots; proposes fixes with confidence scores.
- Repro minimizer
- Delta-minimizes inputs, flags, and environment, and produces a minimal container plus script that deterministically reproduces the failure via rr replay.
- Review loop
- Files actionable issues, attaches traces, and posts PR comments; optionally synthesizes small patches for review.
- Governance, hermeticity, and policy
- Sandbox allowances for rr without breaking hermetic guarantees; resource budgets; secrets policy; opt-in pathways.
Deterministic recording with rr
rr is a user-space record and replay debugger that relies on Linux perf events and ptrace. In record mode, it captures exactly the nondeterministic effects a process could observe: syscalls, signals, context switches, and selected hardware counters. In replay, rr controls the program and replays those events with instruction-level fidelity, enabling reverse stepping.
Key properties for CI integration:
- Performance overhead typically 1.2x to 5x, spiking higher with IO-heavy tests. For flake capture, we do not need to record every test all the time; we can record only on suspected flakes and only for a bounded number of attempts.
- Works best on Linux on bare metal, or in VMs that virtualize the required performance counters, with the right perf_event settings. Containers and Kubernetes are workable with the privileges noted below.
- Plays well with native code across languages: C or C++, Rust, Python extensions, Node native modules. It can record JVM or Go processes too, though JIT code and runtime signals will be visible and may require tuning.
Pernosco for human-in-the-loop debugging
Pernosco is a time-travel debugger built on rr. It provides a web UI for searching state over time, visualizing threads, stacks, data, and symbolized source. For CI, the value is:
- Quickly triage the faulting run without pulling a 20 GB trace locally
- Share a link with the exact timeline and breakpoints
- Use auto-slicing features to focus on the data flow to a corrupted value
For hermetic environments or air-gapped networks, you can deploy on-prem alternatives or keep using rr locally with gdb.
The end-to-end pipeline in detail
1. Flake detection and triggering
We need a cheap signal that a test is flaky:
- Empirical flake score across recent runs: probability of failure not explained by code changes (e.g., pass-fail-pass pattern on same commit across reruns)
- Heuristics: a failure followed by a passing retry, error messages mentioning timeouts or broken pipes, signals like SIGSEGV that appear only under high parallelism
- Optional chaos mode: add scheduler noise to increase the chance of reproducing races
When the score crosses a threshold, re-run the failing test under rr with a capped number of attempts and a time budget.
Pseudo policy (a code sketch follows the list):
- If test failed and its 30-day flake score is above 0.01, record with rr up to 3 times or 15 minutes total
- If a race detector (when enabled) reports a data race, always record one run for later analysis
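A minimal sketch of that policy in Python, assuming you can query recent pass/fail history for a test on the same commit; the function and field names here are illustrative:

```python
# flake_trigger.py -- decide whether a failing test should be re-run under rr.
from dataclasses import dataclass


@dataclass
class RecordDecision:
    record: bool
    max_attempts: int = 0
    time_budget_s: int = 0


def flake_score(recent_results: list[bool]) -> float:
    """Fraction of failing runs over recent runs on unchanged code."""
    if not recent_results:
        return 0.0
    failures = sum(1 for passed in recent_results if not passed)
    return failures / len(recent_results)


def should_record(recent_results: list[bool],
                  failed_now: bool,
                  race_detector_fired: bool) -> RecordDecision:
    # Always capture one run when a race detector flagged something.
    if race_detector_fired:
        return RecordDecision(record=True, max_attempts=1, time_budget_s=900)
    # Otherwise record only for failures with a meaningful flake history.
    if failed_now and flake_score(recent_results) > 0.01:
        return RecordDecision(record=True, max_attempts=3, time_budget_s=900)
    return RecordDecision(record=False)
```

The thresholds mirror the pseudo policy above: a flake score above 0.01 triggers up to 3 recorded attempts within a 15-minute budget.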
A simple GitHub Actions job can re-run flaking tests in a self-hosted runner that supports the required kernel features.
2. Recording under rr with chaos scheduling
rr can be invoked as a wrapper around any test command:
```bash
#!/usr/bin/env bash
set -euo pipefail
# rr_wrapper.sh
# Usage: rr_wrapper.sh <test-command> [args...]

RR_TRACE_DIR=${RR_TRACE_DIR:-$PWD/rr-traces}
mkdir -p "$RR_TRACE_DIR"

# Use rr chaos mode to perturb scheduling, and keep the trace only when the
# run fails. mktemp creates the parent directory; rr creates the trace
# directory itself, so point it at a fresh subdirectory.
TRACE_PARENT=$(mktemp -d "$RR_TRACE_DIR/trace.XXXXXX")
TRACE_DIR="$TRACE_PARENT/trace"
RR_LOG="$TRACE_PARENT/rr.log"

set +e
rr record --chaos --output-trace-dir "$TRACE_DIR" "$@" 2>"$RR_LOG"
EC=$?
set -e

if [ $EC -ne 0 ]; then
  echo "Recorded failing run in $TRACE_DIR" >&2
  echo "$TRACE_DIR" > "$TRACE_PARENT/TRACE_PATH"
  exit $EC
else
  # Passing run; delete the trace to save space
  rm -rf "$TRACE_PARENT"
  exit 0
fi
```
Notes:
- The --chaos option increases the chance that a race manifests by randomizing context switches.
- --output-trace-dir avoids writing to default locations in home directories, which helps hermetic builds and sandboxing.
- If the test passes, we drop the trace to conserve space.
For Bazel, tag flaky tests to run under a local strategy and a custom test wrapper so ptrace is allowed:
```python
# BUILD snippet
sh_test(
    name = 'my_test_rr',
    srcs = ['run_under_rr.sh'],
    data = [':my_test_binary'],
    tags = ['no-sandbox', 'local', 'flaky-instrumented'],
)
```
And the wrapper simply execs the binary via rr_wrapper.sh from above.
3. Making rr work in hermetic-ish CI
Hermetic builds aim to guarantee reproducibility and no undeclared dependencies. rr requires some host capabilities. The trick is to scope exceptions precisely for the record step, not the build step.
- Use separate stages: build artifacts hermetically inside a container; then run rr in a privileged test runner that mounts the built artifacts read-only. The build remains hermetic; the record step is allowed the minimal kernel features.
- On Kubernetes, enable ptrace and perf events for the rr job:
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: rr-recorder
spec:
  template:
    spec:
      securityContext:
        runAsUser: 1000
      containers:
        - name: recorder
          image: your-registry/ci-runner@sha256:...
          command: ['bash', '-lc', 'bash rr_wrapper.sh bazel test //pkg:flaky_test']
          securityContext:
            privileged: true
            capabilities:
              add: ['SYS_PTRACE']
          resources:
            limits:
              cpu: '4'
              memory: 8Gi
          volumeMounts:
            - name: workspace
              mountPath: /workspace
      restartPolicy: Never
      volumes:
        - name: workspace
          persistentVolumeClaim:
            claimName: ci-workspace
```
If you cannot grant privileged, you can still try adding the SYS_PTRACE capability and ensuring perf_event_paranoid is set low enough on the node (typically kernel.perf_event_paranoid <= 1). rr's documentation lists the current kernel requirements.
- Use a content-addressed artifact layout so the rr trace is not considered part of the hermetic build graph. It is a debugging artifact attached to a test job.
4. Storing traces and building a repro capsule
A typical trace can be hundreds of MB to a few GB. The pipeline should:
- Use rr pack to make traces self-contained and portable, then compress them for long-term storage if needed
- Store in object storage with content addressing by hash of commit plus test plus a hash of the trace directory contents
- Attach associated metadata: commit sha, build id, test target, command line, env diff, container image digest, CPU model, rr version, and optional chaos seed (see the sketch after this list)
- Generate a minimal repro shell script that replays the trace to the crash and drops the developer into gdb on the exact faulting instruction
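A sketch of how the capsule key and metadata might be assembled, under the assumption that hashing the whole trace directory is acceptable at your trace sizes; the field names and key layout are illustrative:

```python
# capsule.py -- assemble a content-addressed key and metadata for one rr trace.
import hashlib
import json
import platform
import subprocess
from pathlib import Path


def trace_digest(trace_dir: Path) -> str:
    """Hash the trace directory contents so identical traces dedupe.
    (A real pipeline would stream file contents instead of read_bytes.)"""
    h = hashlib.sha256()
    for path in sorted(trace_dir.rglob("*")):
        if path.is_file():
            h.update(path.relative_to(trace_dir).as_posix().encode())
            h.update(path.read_bytes())
    return h.hexdigest()


def build_capsule(trace_dir: Path, commit: str, test_target: str,
                  image_digest: str) -> dict:
    rr_ver = subprocess.run(["rr", "--version"], capture_output=True, text=True)
    meta = {
        "commit": commit,
        "test_target": test_target,
        "container_image": image_digest,
        "cpu_model": platform.processor() or platform.machine(),
        "rr_version": (rr_ver.stdout or rr_ver.stderr).strip(),
        "trace_digest": trace_digest(trace_dir),
    }
    key = f"{commit}/{test_target}/{meta['trace_digest']}"
    return {"key": key, "metadata_json": json.dumps(meta, indent=2)}
```

In practice you would also record the command line, env diff, and chaos seed exactly as captured by the wrapper.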
Example generated repro script:
```bash
#!/usr/bin/env bash
set -euo pipefail

trace_dir="${1:-}"
if [ -z "$trace_dir" ]; then
  echo 'Usage: repro.sh <trace_dir>' >&2
  exit 2
fi

# Replay the trace and drop into gdb at the end of the recording,
# i.e. at the crash or exit point, with reverse execution available.
rr replay -e "$trace_dir"
```
The -e option replays to the end of the recording, which for a crashing test is the faulting point, and from there reverse execution is available. You can add gdb commands via a .gdbinit that sets breakpoints or watchpoints automatically.
5. Debug AI: extracting causality from the trace
An rr trace gives you the ground truth: every context switch, every syscall, every signal, and memory snapshots at watchpoints. The debug AI is an agent that automates the first several hours of a seasoned engineer’s spelunking:
- Build a timeline of key events: exceptions, assertion failures, interesting syscalls, lock acquisitions, thread starts and joins
- Compute a happens-before graph using synchronization primitives visible in the trace: mutexes, futexes, condition variables, atomics via libc calls, and syscalls like futex or pselect (see the sketch after this list)
- Extract memory access patterns around the fault: last writers to the crashing address, cross-thread reads and writes, lockset intersections
- Identify nondeterminism sources involved: time reads shortly before the failure, epoll timeouts, racy temp file usage, hash iterations over unordered maps
- Rank likely race pairs and root causes
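To make the happens-before step concrete, here is a toy sketch over an already-extracted event list; the event shape is an assumption of this example, not rr's trace format:

```python
# hb_graph.py -- toy happens-before check over extracted synchronization events.
from collections import defaultdict


def build_hb_edges(events):
    """events: globally ordered dicts like
    {'seq': n, 'tid': 1, 'kind': 'futex_wake', 'obj': 0xADDR}.
    Program order plus release->acquire pairs on the same object become edges."""
    edges = defaultdict(set)
    last_in_thread = {}
    last_release = {}  # obj -> seq of most recent wake/unlock
    for ev in events:
        seq, tid, obj = ev["seq"], ev["tid"], ev.get("obj")
        if tid in last_in_thread:                      # program order
            edges[last_in_thread[tid]].add(seq)
        last_in_thread[tid] = seq
        if ev["kind"] in ("futex_wake", "mutex_unlock"):
            last_release[obj] = seq
        elif ev["kind"] in ("futex_wait_return", "mutex_lock") and obj in last_release:
            edges[last_release[obj]].add(seq)          # synchronizes-with
    return edges


def ordered(edges, a_seq, b_seq):
    """True if event a happens-before event b (simple DFS reachability)."""
    stack, seen = [a_seq], set()
    while stack:
        cur = stack.pop()
        if cur == b_seq:
            return True
        if cur in seen:
            continue
        seen.add(cur)
        stack.extend(edges.get(cur, ()))
    return False
```

Two conflicting accesses form a race candidate when neither ordered(a, b) nor ordered(b, a) holds.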
A concrete way to implement the agent is via gdb’s Python API during rr replay. The agent attaches to rr replay, registers callbacks for thread events, and walks backtraces at points of interest.
Sketch of a Python entrypoint using gdb scripting:
```python
# rr_ai.py
import gdb


class StopHandler(gdb.Command):
    def __init__(self):
        super(StopHandler, self).__init__('rr_ai', gdb.COMMAND_USER)

    def invoke(self, arg, from_tty):
        frame = gdb.newest_frame()
        bt = []
        while frame is not None:
            sal = frame.find_sal()
            bt.append({
                'func': frame.name(),
                'file': sal.symtab.filename if sal.symtab else None,
                'line': sal.line,
            })
            frame = frame.older()
        gdb.write('Captured backtrace with {} frames\n'.format(len(bt)))
        # Persist to a JSON file or stdout; mark thread id, signal, and event time.


StopHandler()
```
You would drive rr replay, which attaches gdb to rr's debug server, step the execution to the exception point, then run rr_ai to dump context. For production, extend this to watch for memory reads of uninitialized data patterns, futex handoffs, and lock ordering inversions. For languages with rich runtimes, integrate symbol demangling and source maps.
6. Ranking races and deduping signals
We define a race candidate as a pair of conflicting memory accesses not ordered by happens-before: at least one write, and either an empty lockset intersection or a suspicious lock-order inversion. The agent assigns each candidate a priority score:
- Repro stability: fraction of recorded attempts in which the candidate manifests near the failure
- Proximity: number of instructions between the second access and the crash site
- Criticality: whether the access touches product code vs only test code
- Impact: whether the race affects heap or user-visible state, or trips memory safety
- Novelty: dedupe cluster distance from previously known issues
Simple formula:
score = w1 * repro_stability + w2 * proximity_inv + w3 * criticality + w4 * impact - w5 * duplication
where proximity_inv is a sigmoid of inverse distance to crash point. Start with equal weights and tune using historical labels.
Deduping: cluster by top stack frames of both conflicting accesses and by canonicalized source locations. This helps avoid spamming devs with many traces of the same root cause.
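A sketch of the scoring and dedupe key in Python; the weights, the sigmoid shaping, and the Candidate fields are starting-point assumptions to tune against historical labels:

```python
# race_rank.py -- score and dedupe race candidates.
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    repro_stability: float    # 0..1, fraction of recordings showing the pair
    distance_to_crash: int    # instructions between second access and crash
    in_product_code: bool
    corrupts_user_state: bool
    duplicate_distance: float # 0 = identical to a known issue, 1 = novel
    top_frames: tuple         # canonicalized source locations of both accesses


W = dict(stability=0.35, proximity=0.25, criticality=0.15, impact=0.2, dup=0.25)


def score(c: Candidate) -> float:
    # Sigmoid of inverse distance: ~1 right at the crash, falling off with
    # distance; reshape the constant to taste.
    proximity_inv = 1.0 / (1.0 + math.exp(-1000.0 / (1.0 + c.distance_to_crash)))
    return (W["stability"] * c.repro_stability
            + W["proximity"] * proximity_inv
            + W["criticality"] * float(c.in_product_code)
            + W["impact"] * float(c.corrupts_user_state)
            - W["dup"] * (1.0 - c.duplicate_distance))


def dedupe_key(c: Candidate) -> str:
    # Cluster by the canonicalized locations of both conflicting accesses.
    return "|".join(sorted(c.top_frames))
```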
7. Auto-minimize repros
It is not enough to have a large rr trace. Dev velocity hinges on a tiny repro. We apply delta debugging principles:
- Input deltas: shrink command-line flags, test data files, and relevant environment variables via ddmin. Only keep deltas that still reproduce the failure when executed under rr record.
- Schedule deltas: reduce thread count, disable unrelated features, minimize concurrency by pinning to 2 cores, then reintroduce chaos to preserve manifestation while cutting noise
- Source deltas: optional, use git bisect to narrow to the minimal commit region if the failure is clearly a regression over recent history
Your minimizer runs a loop:
```bash
#!/usr/bin/env bash
set -euo pipefail

# attempt <dir>: run the candidate command once under rr record with a
# wall-clock cap (coreutils timeout); return 0 only if the run fails and
# leaves a new trace behind. CMD holds the (possibly reduced) test command.
attempt() {
  local parent="$1"
  if timeout 120 rr record --chaos --output-trace-dir "$parent/trace" $CMD &>/dev/null; then
    rm -rf "$parent/trace"   # passed: this delta does not reproduce the failure
    return 1
  fi
  return 0                   # failed: the failure still reproduces
}

# Pseudocode of ddmin over ENV and ARGS
```
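The ddmin loop left as pseudocode above could be fleshed out like this; `still_fails` stands in for one recorded attempt, and its wiring to the attempt helper is an assumption of this sketch:

```python
# ddmin.py -- minimal delta debugging over a list of deltas (flags, env pairs).
# Simplified ddmin: tests complements only, the common reduction in practice.
from typing import Callable, Sequence


def ddmin(deltas: Sequence, still_fails: Callable[[Sequence], bool]) -> list:
    """Return a 1-minimal subset of `deltas` for which still_fails() holds.
    still_fails(subset) should run the test once under rr record with only
    that subset applied and report whether the failure reproduced."""
    current = list(deltas)
    n = 2
    while len(current) >= 2:
        chunk = max(1, len(current) // n)
        subsets = [current[i:i + chunk] for i in range(0, len(current), chunk)]
        reduced = False
        for subset in subsets:
            complement = [d for d in current if d not in subset]
            if complement and still_fails(complement):
                current, n = complement, max(n - 1, 2)
                reduced = True
                break
        if not reduced:
            if n >= len(current):
                break
            n = min(len(current), n * 2)
    return current
```

Call it once over environment deltas (for example `sorted(os.environ.items())`) and again over command-line flags.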
Output: a tiny shell script and data directory that deterministically reproduces the crash when recorded and then replayed, no external network, pinned container image digest, explicit environment. This becomes the artifact attached to the bug.
8. Suggesting fixes with a concurrency-savvy AI
We do not auto-commit fixes. We propose them with justification and guardrails.
The agent can detect common patterns:
- Missing memory barriers around publish-subscribe of a pointer or flag
- Double-checked locking in C or C++ without proper atomics
- Reading and writing shared containers without a lock or with coarsened locking only in some paths
- Releasing a mutex before last use of a protected object due to early return
- Lifetime issues: use-after-free caused by refcount underflow or early pool return
For each class, the agent drafts one or two patches:
- Introduce or widen a lock guard in the critical function
- Replace a non-atomic flag with std::atomic and correct memory order (e.g., release on writer, acquire on reader)
- Convert ad hoc PRNG to a seeded deterministic one in tests
- Swap out unordered_map iteration in tests for a sorted view to avoid nondeterministic order comparisons
Validation loop:
- Synthesize the patch in a branch using an internal code-aware model that has access only to repository code and symbols.
- Build and run the minimal repro and the target test under rr record to verify the failure vanishes without introducing new faults (smoke tests only).
- Present the patch via a PR with the analysis report, repro capsule, and replay commands.
If the patch fails validation, keep the analysis report and proposed next steps but do not open a PR.
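A sketch of the rr-backed validation check from the loop above; the capsule entry-point path and the attempt count are assumptions:

```python
# validate_patch.py -- check the minimal repro stops failing on a patched branch.
import subprocess


def run_repro(workdir: str, attempts: int = 20) -> bool:
    """Return True if the minimal repro no longer fails under rr record.
    Any failing or hanging attempt means the patch did not fix the flake
    (or fixed only one of several interleavings)."""
    for _ in range(attempts):
        try:
            proc = subprocess.run(
                ["./repro/run_under_rr.sh"],  # the capsule's entry point (assumed path)
                cwd=workdir,
                capture_output=True,
                timeout=300,
            )
        except subprocess.TimeoutExpired:
            return False
        if proc.returncode != 0:
            return False
    return True
```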
Integrations
GitHub Actions
Use a self-hosted runner with proper kernel settings. The job logic:
- Run the test matrix normally
- On flake detection, dispatch a reusable workflow to a runner pool that supports rr
- Upload traces to object storage and comment on the PR with links and repro script
Skeleton:
```yaml
name: rr-capture
on:
  workflow_call:
    inputs:
      test_target:
        required: true
        type: string
jobs:
  record:
    runs-on: [self-hosted, linux, rr]
    steps:
      - uses: actions/checkout@v4
      - name: Build hermetically
        run: |
          bazel build //... --config=ci
      - name: Run under rr
        run: |
          ./ci/rr_wrapper.sh bazel test ${{ inputs.test_target }}
      - name: Upload trace
        if: failure()
        run: |
          tar -czf trace.tgz rr-traces/*
          aws s3 cp trace.tgz s3://traces/${{ github.sha }}/${{ inputs.test_target }}/
```
Bazel
- Keep builds hermetic. Only rr record steps escape sandboxing.
- Tag rr-instrumented tests with no-sandbox and local; use toolchains pinned by digest; keep RR_TRACE_DIR inside test temp outputs to preserve caching semantics.
Kubernetes
- Dedicated node pool for rr with perf_event_paranoid configured
- SecurityContext: privileged or SYS_PTRACE cap
- Use runtimeClass with tuned kernel flags, if supported
Practical constraints and budgets
- Overhead: Recording with rr can be 2x to 10x slower on IO-heavy tests; cap each capture at something like 15 minutes and budget a fixed number of traces per day across the org.
- Storage: Plan for 1 to 3 GB per captured trace before pack. Retain for 14 days unless linked to an open issue.
- Compute: AI analysis runs offline after capture; budget a fixed number of parallel analyses and queue the rest.
- Success metrics: Mean time to root cause for flake issues; number of auto-minimized repros; percentage of proposed patches accepted after review.
Security and privacy
rr traces can contain memory images with secrets. Set a policy:
- Sanitizers: scan traces for patterns matching tokens and redact or quarantine
- Encryption: server-side encryption at rest; per-project KMS keys
- Access control: only project owners and security reviewers can fetch traces
- On-prem only: if data policy forbids external upload, use on-prem Pernosco or stick to rr and gdb locally
Additionally, scrub environment variables and command history from artifacts unless explicitly whitelisted.
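A minimal sketch of that sanitizer pass over capsule metadata and the recorded environment; the token patterns and allowlist below are placeholders to tune per organization:

```python
# redact.py -- scrub obvious secrets from capsule metadata before upload.
import re

TOKEN_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),              # AWS access key id shape
    re.compile(r"ghp_[A-Za-z0-9]{36}"),           # GitHub token shape
    re.compile(r"(?i)(secret|token|password)=\S+"),
]
ENV_ALLOWLIST = {"PATH", "HOME", "LANG", "TZ", "RR_TRACE_DIR"}


def redact_text(text: str) -> str:
    for pat in TOKEN_PATTERNS:
        text = pat.sub("[REDACTED]", text)
    return text


def scrub_env(env: dict) -> dict:
    """Keep only allowlisted variables; redact values of everything else."""
    return {k: (v if k in ENV_ALLOWLIST else "[REDACTED]") for k, v in env.items()}
```

Trace memory images themselves cannot be scrubbed this way, which is why access control and retention limits matter more than pattern matching.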
Language considerations
- C, C++, and Rust: rr excels here; combine with ASan and UBSan in standard builds. If you chase races with TSan, enable it only for those recording runs, since the combined overhead is high; TSan cannot be applied during rr replay alone, so it must be present in the recorded run.
- Go: rr can record but goroutine scheduling is in the runtime and may differ; however, rr still captures syscalls and signals. Go race detector is a strong complement; use it in non-rr runs to flag race sites, then record without race detector to capture the crash.
- JVM: rr observes native syscalls and signals, JIT code is visible; consider tuning safepoint timers to reduce noise during record.
- Python and Node: fine if the failure crosses into native extensions or if your issue is around syscalls and IO; pair with high-level logging and deterministic seeds for tests.
Case study: a double-checked locking bug
Symptom: Intermittent crashes in production test harness only when running with high parallelism. The flake repro rate is 1 percent of runs.
Pipeline:
- Flake detector flags the test after 3 failures in 200 runs
- rr record with chaos captures a failing run on attempt 2
- Debug AI identifies a pattern: a shared singleton pointer published without an acquire-release pair. Reader thread sees a non-null pointer, but fields are not initialized yet.
- Ranked as top candidate with high proximity to SIGSEGV on dereference.
- Proposed fix: convert shared pointer to std::atomic with release on writer after full initialization, acquire on reader before dereference. Verify with minimal repro; crash disappears.
- PR opened with patch and report including rr replay commands and a Pernosco link.
Outcome: From flake detection to approved fix in two days, mainly human review of the patch and validating downstream effects.
Failure modes and mitigations
- rr unavailable on hosted runners: deploy self-hosted runners or a separate capture cluster
- Kernel perf settings too strict: coordinate with infra; provide a fallback queue instead of silently skipping capture
- Heavy traces: enforce timeouts and narrow what gets recorded (e.g., wrap only the target test process rather than the whole harness)
- JIT heavy apps: accept larger traces and add filters for irrelevant processes
- False patch suggestions: keep human-in-the-loop and a high bar for confidence scores; never auto-merge
Rollout plan
- Pilot on the top 50 flaky tests in one repository. Measure capture rate, trace sizes, and analysis quality.
- Integrate the repro capsule publishing into your bug tracker workflow.
- Expand to cross-repo shared libraries where flakiness clusters often arise.
- Introduce the fix-synthesis step only after the analysis reports are consistently helpful.
- Gradually enable chaos mode by default for suspected flakes to boost capture rate.
References and further reading
- rr project: https://rr-project.org
- Pernosco: https://pernos.co
- Delta debugging: Zeller and Hildebrandt, Simplifying and Isolating Failure-Inducing Input, IEEE TSE, 2002
- ThreadSanitizer and AddressSanitizer documentation
- Empirical studies on flaky tests in large codebases (industry conference talks and academic literature)
Opinionated guidance
- Always separate the hermetic build from the rr record step. Let hermeticity guard your artifact reproducibility; let rr guard your debugging reproducibility.
- Do not try to record everything. Trigger on flakes and known trouble spots. You want signal, not a data lake.
- If you can run Pernosco, do it. Developer time is the scarcest resource, and searchable time travel is worth the operational effort.
- Invest in a small, focused debug AI. You do not need a general chatbot; you need a concurrency analyst that speaks your codebase dialect and outputs precise, verifiable artifacts: repro scripts, ranked races, and small diffs.
- Treat rr traces like crash dumps: sensitive, controlled, and ephemeral.
Conclusion
Heisenbugs thrive in nondeterminism. The combination of deterministic replay and a purpose-built debug AI reverses the asymmetry: you capture the failure once and reason about it offline with full fidelity. By deliberately scoping rr into the testing stage and preserving hermetic builds for compilation, you get the best of both worlds: reproducibility in artifacts and reproducibility in debugging.
The payoff is measurable: fewer developer-days lost to flake hunts, higher confidence in CI signals, and a faster path from flaky symptom to minimal repro to reviewed patch. Time-travel CI is not magic; it is good engineering applied at the right choke points. Build it once, and watch the worst bugs run out of places to hide.
