Deterministic Replay for Code Debugging AI: Time‑Traveling Production Bugs Without Live Access
Modern systems are too complex to debug by staring at logs and trying to guess what really happened. Services run behind layers of proxies, deploy in containers, and depend on external APIs, queues, caches, and databases. When a production bug emerges, you often face a paradox: you need production context to reproduce it, but you cannot (and should not) debug directly in production. Even with feature flags and canaries, the hardest bugs are often timing sensitive, data dependent, or triggered by rare interleavings that are nearly impossible to simulate in staging.
This is exactly where deterministic replay earns its keep. By recording what mattered at the boundary of your process when the bug occurred and then deterministically replaying it in a sandbox, you can hand a faithful reproduction to a code debugging AI. No production shells. No live database access. No guesswork. Instead, the AI inspects a trace‑anchored re‑execution and proposes a patch that is validated by repeating the same replay. If it passes, you get a verifiable fix and a dramatically shortened path to RCA (root cause analysis).
This article proposes a complete, opinionated pipeline for record‑replay built on eBPF for low‑overhead tracing and on sandboxed re‑execution for safety. It is intentionally detailed: the target audience is developers, SREs, and platform engineers who want to operationalize such a system, and tool builders who aim to integrate debugging agents with privacy and compliance requirements in mind.
TL;DR
- Capture what matters with eBPF at system call and network boundaries, with cgroup and container filters and minimal overhead.
- Redact or tokenize PII on the way in, retaining determinism through stable hashing or format‑preserving transformations.
- Package an immutable replay bundle: runtime binaries, environment deltas, trace of non‑deterministic inputs, and a manifest.
- Re‑execute in a hardened sandbox that virtualizes time, randomness, network, and filesystem reads from the recorded inputs.
- Drive an AI debugging agent against the deterministic run to generate and validate fixes automatically.
- Produce a cryptographic witness of the replay result so human reviewers can trust that the fix actually addresses the production failure.
Why production bugs are hard
- Non‑determinism: Thread scheduling, network timing, and time‑dependent logic create heisenbugs that disappear on re‑run.
- Environmental drift: Container images, feature flags, deploy scripts, and secrets differ between staging and prod.
- Observability gaps: Logs and metrics compress reality; they rarely provide full request and syscall‑level history.
- Restricted access: Rightly, most orgs forbid interactive shells and live data access for debugging.
Traditional debugging workflows force you to choose two out of three: safety, fidelity, and speed. Deterministic replay gives you all three if you architect it carefully.
What deterministic replay must capture
A deterministic replay harness only needs to recreate the nondeterministic inputs at the process boundary and enough environment to make the code run the same paths:
- Syscall stream: openat, read, write, recvmsg, sendmsg, connect, accept, mmap, futex, getrandom, clock_gettime, stat, ioctl, and friends.
- Network payloads and metadata: socket addresses, TLS keys or decrypted payloads, TCP stream ordering, packet boundaries where relevant.
- File reads and metadata: content digests for files read, inode attrs when they influence behavior.
- Time sources: coarse and monotonic clocks, timers, sleeps.
- Randomness: getrandom, /dev/urandom reads, UUID generation.
- Thread and process lifecycle: fork, exec, clone, thread rendezvous (futex waits/wakes) to enforce recorded happens‑before.
What you do not need to capture:
- Pure CPU instructions or register state between syscalls.
- Full network traffic unrelated to the traced process or request.
- Writes back to the world (unless you want to verify external side effects), because in replay you will typically stub or discard them.
Architecture at a glance
A practical record‑replay pipeline comprises four stages:
- Record minimally invasive runtime traces at the process boundary with eBPF.
- Transform, redact, and bundle artifacts into a portable replay package.
- Replay deterministically inside a hardened sandbox with syscall interception.
- Orchestrate verification and patch generation by a debugging agent.
Conceptual flow:
[Prod Process] --eBPF--> [Trace Stream] --PII Guard--> [Replay Bundle]
                                                             |
                                                             v
                                                     [Sandbox Runner]
                                                             |
                                                             v
                                                  [AI Debugging Agent]
Stage 1: Recording with eBPF
eBPF lets you attach lightweight programs to kernel hooks without patching the kernel or the application. With CO‑RE (Compile Once, Run Everywhere), you can ship a single BPF object across kernels.
Key ideas:
- Use kprobes/tracepoints on sys_enter/sys_exit for relevant syscalls, plus sock layer hooks for payloads.
- Filter aggressively by cgroup id and container/pid namespace to capture only the target service.
- Correlate events per request using a propagated trace id (W3C traceparent or similar).
- Emit to a ring buffer; aggregate and stream to a processor over a local Unix socket.
Example event structure (conceptual):
struct event {
    u64 tsc;              // timestamp
    u32 tgid;             // process id
    u32 tid;              // thread id
    u16 evt;              // event type enum
    u16 cpu;
    u64 req_key;          // stable hash of request id
    u64 arg0, arg1, arg2; // syscall args or metadata
    u32 len;              // payload length if any
    u8  data[];           // payload snapshot (flexible array member)
};
You can build this with libbpf. If you prefer a higher‑level prototype, bpftrace can capture essentials with near‑zero code:
# Capture openat and record path and flags for the target cgroup
bpftrace -e '
tracepoint:syscalls:sys_enter_openat /cgroup == 12345/ {
printf("%lld %d openat %s %d\n", nsecs, tid, str(args->filename), args->flags);
}
'
In production, prefer libbpf and ring buffers to avoid printf overhead and to perform PII filtering in kernel or immediately in user space.
Minimal set of hooks
- Syscalls: sys_enter/sys_exit for read, write, openat, close, stat, fstat, mmap, munmap, mprotect, clock_gettime, nanosleep, getrandom, clone, execve, futex, connect, accept, sendmsg, recvmsg, getsockopt, setsockopt, ioctl.
- Network: kprobes on tcp_sendmsg/tcp_recvmsg or tracepoints at net/socket layers; for TLS, capture at kTLS or instrument user‑space crypto via uprobes when feasible.
- User‑space events: uprobes on known libraries (e.g., OpenSSL, BoringSSL, Go net/http) to tag request boundaries and trace ids.
Overhead budgeting
- Event rate: a typical service handling thousands of requests/sec may emit tens of thousands of events/sec.
- Target overhead: under 5% p99 CPU for the traced service under normal load, achieved by filtering early and sampling payloads.
- Tricks: store payloads in per‑cpu ring buffers, avoid copying large buffers unless the process has subscribed to a trace context, and compress batches with zstd at a low compression level.
Stage 2: PII‑aware transformation and bundling
A record‑replay pipeline must be privacy‑first. The rule is simple: never ship raw production data off the cluster unless policy allows it with clear audit.
Practical PII strategies:
- Structural redaction: drop or mask fields matching patterns (email, phone, SSN, credit card PAN) in captured payloads.
- Tokenization: replace sensitive tokens with stable hashes or format‑preserving substitutions so logic that depends on structure still behaves deterministically.
- Scope minimization: capture only the slice needed to replay a specific request or transaction chain; discard unrelated flows.
- Policy engine: express redaction as code plus centrally managed rules. Validate with unit tests and canary lifecycles.
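A minimal sketch of stable tokenization in Go, assuming a per‑incident key managed by the policy engine (the function and domain names are illustrative, not a fixed API):

package pii

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
)

// TokenizeEmail replaces an email address with a stable keyed token that
// keeps the local@domain shape, so parsing and routing logic behaves the
// same way in replay. The key is per incident: tokens stay consistent
// within one bundle and are useless outside it.
func TokenizeEmail(key []byte, email string) string {
	mac := hmac.New(sha256.New, key)
	mac.Write([]byte(email))
	return hex.EncodeToString(mac.Sum(nil))[:12] + "@redacted.example"
}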
A replay bundle should include:
- Binary and library identifiers: container image digest, build id, symbol files.
- Environment deltas: selected environment variables and flags used by the process.
- Filesystem snapshot delta: content of any files read that are not part of the container image, plus content digests to verify.
- Syscall log: the ordered list of boundary events, with timing metadata.
- Payload blobs: network and file read data needed to satisfy syscalls during replay.
- Manifest: versioned metadata, hashes of all pieces, and a policy descriptor describing any redactions.
Example manifest stub (YAML‑like for readability):
version: 1
image: ghcr.io/acme/payments@sha256:abc...
build_id: 0x7f31...
policy:
  pii: tokenized
trace:
  events: events.zst
  payloads: payloads.zst
fs:
  base: image
  delta:
    - path: /etc/service/config.yaml
      sha256: 2c26b...
    - path: /var/data/feature_flags.json
      sha256: 9b1d4...
keys:
  tls: omitted
  rng: included
Note that the manifest avoids including secrets by default. If the service uses client‑side TLS, you can either record decrypted payloads at the socket boundary or capture session keys under strict policy. Many orgs choose the former.
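Before a replay starts, the runner should verify every digest in the manifest and refuse to proceed on mismatch. A minimal Go sketch (the struct mirrors the fs delta above; the bundle layout is otherwise an assumption):

package bundle

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"os"
	"path/filepath"
)

type FileDelta struct {
	Path   string
	SHA256 string
}

// VerifyDelta checks every file in the bundle's fs delta against the digest
// recorded in the manifest before the sandbox is allowed to serve it.
func VerifyDelta(root string, deltas []FileDelta) error {
	for _, d := range deltas {
		data, err := os.ReadFile(filepath.Join(root, d.Path))
		if err != nil {
			return fmt.Errorf("read %s: %w", d.Path, err)
		}
		sum := sha256.Sum256(data)
		if hex.EncodeToString(sum[:]) != d.SHA256 {
			return fmt.Errorf("digest mismatch for %s", d.Path)
		}
	}
	return nil
}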
Stage 3: Sandboxed deterministic replay
Replay requires forcibly replacing nondeterminism with recorded inputs. You can do this by intercepting syscalls and library calls and feeding the recorded data.
Core elements:
- Sandbox: run in a container with user, pid, and mount namespaces, no network access by default, seccomp to restrict syscalls, and read‑only rootfs plus an ephemeral scratch mount.
- Syscall interception: one of ptrace, seccomp user‑notifier, or a preload shim. Ptrace is flexible but slower; seccomp user‑notifier offers lower overhead and control; preload is easy for libc calls but misses direct syscalls and Go/Rust statically linked cases.
- Virtualization:
- time: override clock_gettime, gettimeofday, timers to advance according to recorded deltas.
- randomness: satisfy getrandom and /dev/urandom reads using recorded bytes.
- network: for connect/send/recv, deliver recorded payloads and statuses; simulate kernel responses where needed.
- filesystem: when the process reads a file, serve the recorded bytes, verified by content hash.
- Scheduling discipline: enforce the ordering observed in futex waits/wakes and join points to avoid divergent interleavings; for the hardest cases, serialize threads across syscalls (see the sequencer sketch below).
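The discipline can be a simple sequencer the supervisor consults before letting a thread resume from a trapped futex call. A minimal Go sketch (the recorded wake order and the trap points are assumptions):

package replay

import "sync"

// Sequencer forces futex wakeups to occur in the recorded order: a thread
// returning from a trapped futex call must wait for its recorded turn.
type Sequencer struct {
	mu     sync.Mutex
	cond   *sync.Cond
	order  []uint32 // recorded sequence of thread ids
	cursor int
}

func NewSequencer(order []uint32) *Sequencer {
	s := &Sequencer{order: order}
	s.cond = sync.NewCond(&s.mu)
	return s
}

// Wait blocks until tid is next in the recorded sequence, then advances it.
// Threads beyond the recorded sequence fall through rather than deadlocking.
func (s *Sequencer) Wait(tid uint32) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for s.cursor < len(s.order) && s.order[s.cursor] != tid {
		s.cond.Wait()
	}
	if s.cursor < len(s.order) {
		s.cursor++
	}
	s.cond.Broadcast()
}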
An outline using seccomp user‑notifier:
- Start a supervisor process that installs a seccomp filter sending relevant syscalls from the target to a user‑space handler.
- Supervisor loads the trace manifest and events, then launches the target binary in the sandbox.
- For each trapped syscall, the supervisor:
  - matches it against the next event in the log for the calling thread;
  - for syscalls that read bytes (read/recv), copies the recorded bytes into target memory and sets the return value accordingly;
  - for syscalls that return metadata (stat), fills user buffers with the recorded structures;
  - for writes, either lets them proceed (redirected to the scratch FS) or short‑circuits them when the side effects are unnecessary for correctness;
  - serves time and RNG reads by returning recorded values.
- The run completes when the target exits; mismatches trigger a divergence report.
A toy pseudo‑handler loop:
while true:
req = seccomp_user_notif_receive()
ev = trace.next_for_thread(req.tid)
assert ev.syscall == req.syscall
if req.syscall in [READ, RECVMSG]:
write_into_target(req.mem, ev.bytes)
respond_return(len(ev.bytes))
elif req.syscall in [OPENAT, STAT]:
fill_struct(req.mem, ev.struct)
respond_return(ev.rc)
elif req.syscall == CLOCK_GETTIME:
write_clock(req.mem, ev.ts)
respond_return(0)
elif req.syscall == GETRANDOM:
write_into_target(req.mem, ev.bytes)
respond_return(len(ev.bytes))
else:
respond_return(ev.rc)
For some languages (notably Go), virtualizing time takes extra care because clock_gettime is typically served from the vDSO without entering the kernel. If your code calls the vDSO directly, use a preload shim or patch the GOT to redirect calls to your time shim, or trap at sys_exit when the kernel serves the fallback path.
Harden the sandbox
- Run in an unprivileged user namespace; map the sandbox root to a non‑root host uid.
- seccomp default‑deny; only allow the syscalls needed by the replay supervisor.
- No outbound network unless strictly required to fetch artifacts; even then, do so before the target begins execution.
- Mount a tmpfs for writable paths; remap paths to avoid touching host files.
- Audit logging and deterministic seeds for the sandbox itself so replays are reproducible.
Stage 4: Orchestrating an AI debugging agent
Once you can reliably replay a production failure from a bundle, you can let an AI agent do the heavy lifting:
- Observability inside the sandbox: run perf sampling, capture flamegraphs, collect stack traces, and export structured events the agent can analyze.
- Hypothesis generation: the agent inspects failing call stacks, diffs against recently changed code, and analyzes concurrency patterns.
- Patch proposal: the agent edits code in a feature branch, runs unit tests and the replay to verify the fix preserves the observed behavior but eliminates the failure.
- Witness: record hashes of the input bundle and the new run outputs. Ensure that the same bundle deterministically passes after the patch and deterministically fails before it. Store both proofs for review.
This workflow allows human reviewers to focus on code quality and broader reasoning while trusting that the fix is grounded in a faithful reproduction of the production event.
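A witness can be as small as the hashes of the bundle and the run outputs, signed so reviewers can verify provenance. A sketch using Go's standard library (the two‑artifact layout and field names are assumptions):

package witness

import (
	"crypto/ed25519"
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"os"
)

type Witness struct {
	BundleSHA256 string `json:"bundle_sha256"`
	OutputSHA256 string `json:"output_sha256"`
	Passed       bool   `json:"passed"`
	Signature    []byte `json:"signature"`
}

func fileDigest(path string) (string, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:]), nil
}

// Sign ties a replay verdict to the exact bundle and output artifacts,
// producing a tamper-evident record reviewers can check independently.
func Sign(priv ed25519.PrivateKey, bundlePath, outputPath string, passed bool) (Witness, error) {
	var w Witness
	var err error
	if w.BundleSHA256, err = fileDigest(bundlePath); err != nil {
		return w, err
	}
	if w.OutputSHA256, err = fileDigest(outputPath); err != nil {
		return w, err
	}
	w.Passed = passed
	msg, _ := json.Marshal([3]any{w.BundleSHA256, w.OutputSHA256, w.Passed})
	w.Signature = ed25519.Sign(priv, msg)
	return w, nil
}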
A concrete example: a racy Go handler
Consider a Go microservice that handles payments. Under concurrent load, a shared map used as a request‑level cache is mutated without a lock, leading to intermittent panics from concurrent map writes. You never see it locally because your load is too low and the timing is different.
Recording:
- Attach eBPF uprobes to Go net/http request entry and exit to tag a request key computed from the incoming traceparent header.
- Use kprobes for futex waits/wakes to capture thread rendezvous; attach tracepoints to syscalls and tcp recv/send to capture payloads only for the tagged request.
- Overhead remains under 3% because you filter by cgroup and request key.
Bundling:
- Redact PAN in JSON payloads via streaming tokenization using a format‑preserving mapping that preserves Luhn checksum behavior, so validation logic still passes in replay (sketched below).
- Snapshot the feature flag file and a small env delta; include binary build id and container image digest.
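The Luhn‑preserving mapping can be done by replacing the PAN body with keyed pseudo‑digits and recomputing the check digit. A Go sketch (assumes an ASCII‑digit PAN of at least eight digits; helper names are illustrative):

package pii

import (
	"crypto/hmac"
	"crypto/sha256"
)

// luhnCheckDigit computes the digit that makes body+digit pass Luhn validation.
func luhnCheckDigit(body string) byte {
	sum, double := 0, true // the rightmost body digit is doubled
	for i := len(body) - 1; i >= 0; i-- {
		d := int(body[i] - '0')
		if double {
			d *= 2
			if d > 9 {
				d -= 9
			}
		}
		sum += d
		double = !double
	}
	return byte('0' + (10-sum%10)%10)
}

// TokenizePAN keeps the 6-digit BIN and overall length, replaces the middle
// digits with keyed pseudo-digits, and recomputes the check digit so Luhn
// validation still passes during replay.
func TokenizePAN(key []byte, pan string) string {
	mac := hmac.New(sha256.New, key)
	mac.Write([]byte(pan))
	sum := mac.Sum(nil)
	body := []byte(pan[:6])
	for i := 6; i < len(pan)-1; i++ {
		body = append(body, '0'+sum[i%len(sum)]%10)
	}
	return string(body) + string(luhnCheckDigit(string(body)))
}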
Replay:
- Launch the service inside a sandbox with seccomp user‑notifier.
- Feed the recorded TCP recv payload; control virtual time so a timer‑wheel tick occurs at the same logical moment.
- Enforce futex order so two goroutines interleave the same way; the panic reproduces deterministically.
Agent fix:
- The AI recognizes a shared map written by multiple goroutines without synchronization.
- It proposes guarding the map with a sync.RWMutex or switching to sync.Map, updates a couple of call sites, and adds a unit test that simulates concurrent access (see the sketch below).
- It reruns the bundle: the panic no longer occurs; CPU cost rises marginally. It produces a witness with hashes of the bundle and passing replay artifacts.
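For illustration, a minimal sketch of the guarded cache the agent might produce (the type and field names are hypothetical):

package payments

import "sync"

type entry struct{ value []byte }

// requestCache was previously a bare map[string]entry shared across
// goroutines; concurrent writes panicked under production interleavings.
type requestCache struct {
	mu sync.RWMutex
	m  map[string]entry
}

func (c *requestCache) Get(k string) (entry, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	e, ok := c.m[k]
	return e, ok
}

func (c *requestCache) Put(k string, e entry) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.m == nil {
		c.m = make(map[string]entry)
	}
	c.m[k] = e
}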
Human review approves the fix, merges, and ships with confidence.
Implementation guide: from zero to usable
The following steps outline a pragmatic path to a working system that balances fidelity, complexity, and safety.
- Identify MVE (minimum viable events):
- Start with read/write/openat/connect/recvmsg/sendmsg/clock_gettime/getrandom/futex.
- Build a per‑cgroup filter to restrict to your service container.
- Correlate per request via a header‑based key (e.g., traceparent) using a uprobe at the HTTP framework boundary (see the sketch below).
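A minimal Go sketch of that request‑key derivation, assuming an HTTP middleware hook and a per‑service key (the names are illustrative):

package recorder

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/binary"
	"net/http"
)

// RequestKey derives the stable u64 key that tags every recorded event for
// this request. Keyed hashing keeps the key stable across record and replay
// without leaking the raw traceparent into the bundle.
func RequestKey(key []byte, r *http.Request) uint64 {
	mac := hmac.New(sha256.New, key)
	mac.Write([]byte(r.Header.Get("traceparent")))
	return binary.LittleEndian.Uint64(mac.Sum(nil)[:8])
}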
- Build the recorder:
- Use libbpf CO‑RE; create a ring buffer emitting fixed‑size headers plus optional payload slices.
- Implement a user‑space collector that batches, redacts, and compresses with zstd, as sketched below.
- Measure overhead at p50/p95/p99 CPU and latency under load tests.
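A sketch of the collector's hot path in Go, assuming events arrive on a channel from the ring‑buffer reader and using the github.com/klauspost/compress/zstd encoder (the channel, batch size, and redact hook are assumptions):

package collector

import (
	"bytes"
	"os"

	"github.com/klauspost/compress/zstd"
)

// Drain batches raw events, applies redaction, and writes zstd frames to disk.
func Drain(events <-chan []byte, redact func([]byte) []byte, out *os.File) error {
	enc, err := zstd.NewWriter(out, zstd.WithEncoderLevel(zstd.SpeedFastest))
	if err != nil {
		return err
	}
	defer enc.Close()

	var batch bytes.Buffer
	for ev := range events {
		batch.Write(redact(ev)) // tokenize PII before anything touches disk
		if batch.Len() >= 1<<20 { // flush roughly every 1 MiB
			if _, err := enc.Write(batch.Bytes()); err != nil {
				return err
			}
			batch.Reset()
		}
	}
	_, err = enc.Write(batch.Bytes())
	return err
}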
- Design the bundle format:
- Choose a content‑addressed layout in an object store (S3, GCS) with per‑org encryption keys.
- Define a manifest with hashes and a redaction policy descriptor.
- Create a CLI to fetch bundles by incident id or trace id.
- Stand up the sandbox and interceptor:
- Start with ptrace if you need rapid prototyping, but move to seccomp user‑notifier for scale.
- Implement handlers for read/recv/open/stat/clock/rng. Initially, serialize threads at syscall boundaries.
- Add file delta loader and path remapping into a tmpfs overlay to satisfy file reads.
- Make it deterministic:
- Advance virtual time only when the original run advanced it (see the clock sketch below).
- Serve RNG from recorded buffer; guard against underflow with clear divergence errors.
- Enforce futex order; in simplest form, do not let a thread resume from a futex until the recorded wake event fires.
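A minimal sketch of such a virtual clock in Go, with divergence‑on‑exhaustion rather than invented timestamps (the sample layout is an assumption):

package replay

import "errors"

// VirtualClock serves clock_gettime results from the recorded trace. Time
// advances only when the original run observed it advancing, so timer-driven
// logic fires at the same logical moments.
type VirtualClock struct {
	samples []int64 // recorded nanosecond readings, in trace order
	next    int
}

var ErrClockDivergence = errors.New("replay: more clock reads than recorded")

func (c *VirtualClock) Now() (int64, error) {
	if c.next >= len(c.samples) {
		// The replayed code read the clock more often than the original run:
		// report divergence instead of inventing a timestamp.
		return 0, ErrClockDivergence
	}
	ns := c.samples[c.next]
	c.next++
	return ns, nil
}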
- Integrate the debugging agent:
- Expose a simple gRPC or CLI interface: run(bundle), collect_artifacts(), apply_patch(diff), rerun() (see the interface sketch below).
- Instrument coverage to guide the agent toward code paths executed during the replay.
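A narrow driver interface keeps the agent sandbox‑bound. A Go sketch mirroring those verbs (all types and fields are illustrative):

package agent

import "context"

// Runner is the narrow surface the debugging agent is allowed to drive.
// Everything runs inside the sandbox, so the agent never touches
// production systems directly.
type Runner interface {
	Run(ctx context.Context, bundleID string) (ReplayResult, error)
	CollectArtifacts(ctx context.Context) ([]Artifact, error)
	ApplyPatch(ctx context.Context, diff []byte) error
	Rerun(ctx context.Context) (ReplayResult, error)
}

type ReplayResult struct {
	Passed    bool   // did the failure reproduce (before) or disappear (after)?
	Witness   []byte // signed witness artifact for reviewers
	Diverged  bool
	DivergeAt string // first mismatched syscall, if any
}

type Artifact struct {
	Name string // e.g. "flamegraph.svg", "stacks.json"
	Data []byte
}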
- Governance and safety:
- Enforce PII policies with continuous scanning and simulated audits.
- Log who accessed which bundle, when, and for what incident.
- Provide deletion and retention controls; default short retention for payloads, longer for manifests without PII.
Performance and sizing expectations
Numbers vary by workload, but reasonable starting points:
- Recorder CPU overhead: 1‑3% for lightly instrumented syscalls, 3‑7% when also capturing payload slices and futex rendezvous; aim for p99 under 5% on critical paths.
- Event throughput: 100k‑500k events/sec per host with per‑cpu ring buffers and minimal copying.
- Bundle sizes: 1‑20 MB per request‑scoped replay for HTTP services; 50‑200 MB for complex, chatty RPC flows or when including large file reads.
- Replay speed: 0.5x to 2x of wall‑clock time depending on how faithfully you emulate sleeps and timers; you can accelerate time if the bug does not depend on real delays.
PII and compliance details
- Data minimization: do not capture outbound payloads unless necessary; focus on inbound request and file reads.
- Streaming redaction: apply tokenization before writing to disk; keep original content only in kernel memory, drop after transformation.
- Token stability: use keyed hashing or deterministic format‑preserving encryption so that multiple replays of the same incident remain consistent, but the tokens are useless outside the replay environment.
- Role‑based access: gate bundle fetch behind incident tickets and least‑privilege roles; integrate with audit logging.
- Cryptographic attestations: sign manifests and bundles; include a signature chain so the replay witness is tamper evident.
What is hard (and how to work around it)
- TLS decryption: if you cannot access kTLS or session keys, capture payloads at user‑space library boundaries via uprobes. For Go, hook crypto/tls read/write; for OpenSSL, hook SSL_read/SSL_write.
- GPU and JIT nondeterminism: replay for GPU kernels and JITs is out of scope for most teams; rely on higher‑level boundaries and pin versions so codegen is stable.
- External dependencies: calls to cloud services or databases should be served from recorded responses. If the service depends on time‑sensitive remote state, you need to capture enough to make idempotent replays.
- Highly concurrent code: a fully faithful schedule replay can be complex. As a pragmatic solution, serialize syscalls and enforce recorded futex sequences. If your service uses lock‑free structures reliant on subtle timing, increase the fidelity: record per‑thread wall time offsets and enforce preemption points.
- vDSO time: intercept via preload or force fallbacks by manipulating auxv at exec to disable certain vDSO paths in replay.
Comparison to other approaches
- rr (Mozilla): records user‑space execution at instruction granularity for single‑threaded or controlled multi‑threaded programs; great for local C/C++ debugging, heavy for production microservices.
- Pernosco/Replay.io: high‑fidelity record‑replay with deep time‑travel debugging for certain stacks; powerful but often requires tight integration and specific environments.
- Deterministic simulation (e.g., FoundationDB): fantastic if you can design for determinism from day one; not feasible to retrofit across a heterogeneous fleet.
Our approach is deliberately less intrusive: trace only the boundary of your process, then re‑inject inputs. You get most of the benefit at a fraction of the cost and with better compatibility across languages.
Security posture
- Principle of least privilege: the tracer runs with CAP_BPF and related caps only on tracing hosts; no shells, no db creds.
- eBPF verifier: restrict complexity and ensure your programs are safe; monitor verifier logs and maintain kernel compatibility via CO‑RE.
- Host hardening: isolate the recorder and sandbox runners; use SELinux/AppArmor profiles; do not allow outgoing network from the sandbox.
- Supply chain: pin container digests in manifests; verify build ids and symbol files; sign bundles and replays.
Measuring success
- Time to reproduce drops from days to minutes.
- Fraction of incidents with replay bundles attached increases steadily.
- p99 overhead remains under agreed SLOs during recording windows.
- Patch verification without hitting prod is the new default.
- Engineers trust the witness artifacts and accept AI‑generated patches more readily.
Frequently asked questions
- Does this require kernel changes? No. Modern kernels support eBPF with the necessary tracepoints and kprobes. Use CO‑RE for compatibility.
- What about languages with static linking like Go or Rust? You still intercept syscalls at the kernel boundary. For user‑space hooks (e.g., crypto libraries), rely on uprobes or accept capturing at the kernel socket layer.
- Can we accelerate time? Yes. If behavior does not depend on real delays, you can collapse sleeps in replay while preserving ordering.
- What happens on divergence? The sandbox halts and emits a diff: which syscall mismatched, what the process expected, and what the trace recorded. This often reveals non‑captured nondeterminism.
- How do we handle multi‑service flows? Record per service with the same request id; produce a multi‑bundle replay orchestrator that replays each hop independently while maintaining consistent data. In many cases, replaying only the failing hop is enough for the RCA.
Practical checklist
- Do: start small, trace just enough syscalls to reproduce your top category of failures.
- Do: correlate per request so you can avoid capturing everything.
- Do: tokenize PII in‑stream, before it reaches disk.
- Do: sandbox tightly and default to no network.
- Don’t: rely on prod shells.
- Don’t: ship raw payloads to third‑party AI tools; run the agent inside your controlled environment.
References and further reading
- eBPF and libbpf CO‑RE: docs and examples from the Linux community.
- Seccomp user‑notifier: kernel documentation and example implementations from container runtimes.
- OpenTelemetry: for propagating trace ids used to correlate record events per request.
- Deterministic replay ideas: rr and academic literature on record‑replay for concurrent programs.
- gVisor and nsjail: sandboxing systems for safe re‑execution.
Conclusion
Deterministic replay built on eBPF and a sandboxed syscall interceptor creates a safe, private, and fast path from production failure to verified fix. By elevating the unit of debugging from logs to a faithful replay bundle and integrating an AI debugging agent that proposes and validates patches, teams can time‑travel stubborn bugs without touching production. The system emphasizes minimal capture, privacy by design, and verifiable outcomes. The result is not just fewer pager nights and faster RCAs, but a new default workflow: when prod fails, you capture, replay, and fix with confidence.