Deterministic Replays for Debug AI: Building Record and Replay Pipelines that Make Bugs Reproducible
Debugging with AI is intoxicating when it works and infuriating when it doesn’t. The difference is almost always reproducibility. If the AI sees the same thing you saw — the same kernel events, syscalls, network packets, clock ticks, and memory layout — it can reason from the data instead of hallucinating a plausible-sounding fix.
That’s the purpose of deterministic replay: capture enough about execution and inputs to drive a faithful re-run, then evaluate hypotheses mechanically. In this article, I’ll lay out a practical, opinionated design for a record-and-replay pipeline using eBPF for collection, OpenTelemetry for correlation and transport, and containerized snapshots (CRIU/containerd) for frozen-in-time environments. The goal is simple: create privacy-safe, minimal, and consistent reproductions that your debug AI can run, inspect, and verify.
This is not a theoretical exercise. The techniques here borrow from time-tested systems—rr, Undo, Pernosco, FoundationDB’s deterministic simulation—and from modern cloud-native primitives like eBPF, OpenTelemetry, and container runtimes. We’ll discuss design trade-offs, provide code snippets, and wrap with a concrete end-to-end example.
Why Deterministic Replay Is the Missing Ingredient for Debug AI
Classic bug reports are a soup of non-determinism: one-off kernel schedules, racy I/O, JIT behavior, and time-dependent cache effects. Humans compensate with intuition and trial-and-error. LLMs do not; they require grounded, consistent context. When a debug AI has to operate on noisy, non-reproducible input, it correlates patterns loosely (best case) or fabricates a narrative (worst case). Deterministic replay fixes the substrate:
- Consistent inputs: Network packets, file reads, syscalls, random numbers, and clocks return the same values in the same order on replay.
- Frozen environment: The container image, library versions, env vars, feature flags, kernel ABI, and cgroups match the original.
- Thread order defined: A scheduler (recorded or enforced) determines when threads progress; races become repeatable sequences.
- Privacy-preserving: Sensitive payloads are tokenized or synthesized, but structural properties and timing are preserved.
- Testable hypotheses: Proposed fixes can be evaluated automatically against the same scenario; AI gets immediate feedback.
With that, a debug AI can act like a methodical engineer: find the minimal reproducer, point to the root cause, propose a patch, and show a passing replay.
Sources of Non-Determinism You Must Tame
You won’t get deterministic replay unless you control these:
- Time: CLOCK_REALTIME, CLOCK_MONOTONIC, TSC drift, timeouts, timers.
- Randomness: /dev/urandom, getrandom(), language-level RNGs.
- Concurrency: Thread scheduling, futex wake order, interrupt timing, async I/O.
- External I/O: Network responses, DNS, outgoing API jitter, kernel TCP behavior.
- Filesystem state: Changing files, temp directories, inode allocation.
- JIT and GC: Heuristics, tiered compilation, background sweeps.
- Kernel and hardware: Different kernel versions, CPU features, ASLR.
You don’t need to control all of them all the time. A pragmatic approach records those that matter (I/O, syscalls, time, random) and snapshots the rest (process state, file system) at the moment of interest.
Architecture Overview
The pipeline has four major layers:
- Record (in production or staging):
- eBPF programs attached to syscalls and kernel tracepoints.
- Userspace hooks (LD_PRELOAD) for time/random as needed.
- OpenTelemetry correlation between application spans and kernel events.
- Triggering on anomalies (errors, SLO breaches, crashes), then capturing the preceding N seconds and the next M seconds.
- Snapshot:
- Container runtime checkpoint via CRIU (Docker/Podman/containerd) to capture process state, memory, file descriptors, namespaces, and TCP state when feasible.
- Persistent volumes snapshot or synthetic replay of DB interactions.
- Image digests, kernel version, sysctls, cgroup settings.
- Privacy Transform:
- Tokenize sensitive payloads while preserving structure.
- Hash or encrypt identifiers with format-preserving techniques.
- Data minimization: retain only fields necessary for replay.
- Replay Sandbox:
- Isolated environment (Firecracker/gVisor/Kata, or a hardened VM).
- Clocks and randomness stubbed to recorded sequence.
- Network replay from pcap or application-level I/O log; no egress.
- Deterministic scheduler (best effort) or rr where applicable.
- OpenTelemetry export for the replay run to compare with record.
Finally, the bundle — a Repro Manifest plus artifacts — is handed to the debug AI. The AI has everything it needs to reproduce and verify.
The Record Layer with eBPF
eBPF is the right default for low-overhead, production-safe collection of kernel-level facts. It lets you attach to syscalls, tracepoints, and kprobes without instrumenting application code. You can scope to a container via cgroup IDs, apply filters, and stream data via ring buffers.
Recommended events to record (sampled or filtered):
- Syscalls: openat, read, write, sendto, recvfrom, connect, accept, close, rename, unlink, mmap, munmap, futex, clock_gettime, getrandom.
- Network: TCP events, ingress/egress metadata, pcap if needed (beware privacy and volume).
- Scheduling and synchronization: futex wait/wake, context switches (sched:sched_switch).
- Signals and exits: process exit, signals delivered.
Scope captures by cgroup to avoid cross-container noise. Use BPF CO-RE to build once and run across kernel versions. A ringbuf consumer in userspace maps these kernel events to a compact protobuf or JSON schema and forwards to your OpenTelemetry Collector.
Here’s a trimmed eBPF program capturing openat and read/write lengths tied to a cgroup (for illustration; production code needs error checks and perf optimization):
```c
// file: io_trace.bpf.c
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>

struct event_t {
    u64 ts;          // monotonic timestamp
    u32 pid, tid;
    u64 cgroup_id;
    u32 type;        // 1=openat, 2=read, 3=write
    s64 fd;          // fd for read/write
    s64 ret;         // return value
    char path[128];  // truncated path for openat
    u64 count;       // nbytes for read/write
};

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 1 << 24);
} events SEC(".maps");

static __always_inline bool in_target_cgroup() {
    u64 id = bpf_get_current_cgroup_id();
    // TODO: optionally filter by a known cgroup id stored in a map
    (void)id;
    return true;
}

SEC("tracepoint/syscalls/sys_enter_openat")
int on_enter_openat(struct trace_event_raw_sys_enter* ctx) {
    if (!in_target_cgroup()) return 0;
    struct event_t *e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
    if (!e) return 0;
    e->ts = bpf_ktime_get_ns();
    e->pid = bpf_get_current_pid_tgid() >> 32;
    e->tid = (u32)bpf_get_current_pid_tgid();
    e->cgroup_id = bpf_get_current_cgroup_id();
    e->type = 1;  // openat enter
    e->fd = -1;
    e->ret = 0;
    e->count = 0;
    const char* pathname = (const char*)ctx->args[1];
    bpf_core_read_user_str(&e->path, sizeof(e->path), pathname);
    bpf_ringbuf_submit(e, 0);
    return 0;
}

SEC("tracepoint/syscalls/sys_exit_openat")
int on_exit_openat(struct trace_event_raw_sys_exit* ctx) {
    if (!in_target_cgroup()) return 0;
    struct event_t *e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
    if (!e) return 0;
    e->ts = bpf_ktime_get_ns();
    e->pid = bpf_get_current_pid_tgid() >> 32;
    e->tid = (u32)bpf_get_current_pid_tgid();
    e->cgroup_id = bpf_get_current_cgroup_id();
    e->type = 1;  // openat exit
    e->fd = 0;
    e->ret = ctx->ret;
    e->count = 0;
    e->path[0] = '\0';
    bpf_ringbuf_submit(e, 0);
    return 0;
}

SEC("tracepoint/syscalls/sys_enter_read")
int on_enter_read(struct trace_event_raw_sys_enter* ctx) {
    if (!in_target_cgroup()) return 0;
    struct event_t *e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
    if (!e) return 0;
    e->ts = bpf_ktime_get_ns();
    e->pid = bpf_get_current_pid_tgid() >> 32;
    e->tid = (u32)bpf_get_current_pid_tgid();
    e->cgroup_id = bpf_get_current_cgroup_id();
    e->type = 2;  // read enter
    e->fd = (s64)ctx->args[0];
    e->ret = 0;
    e->count = (u64)ctx->args[2];
    e->path[0] = '\0';
    bpf_ringbuf_submit(e, 0);
    return 0;
}

SEC("tracepoint/syscalls/sys_exit_read")
int on_exit_read(struct trace_event_raw_sys_exit* ctx) {
    if (!in_target_cgroup()) return 0;
    struct event_t *e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
    if (!e) return 0;
    e->ts = bpf_ktime_get_ns();
    e->pid = bpf_get_current_pid_tgid() >> 32;
    e->tid = (u32)bpf_get_current_pid_tgid();
    e->cgroup_id = bpf_get_current_cgroup_id();
    e->type = 2;  // read exit
    e->fd = -1;
    e->ret = ctx->ret;
    e->count = 0;
    e->path[0] = '\0';
    bpf_ringbuf_submit(e, 0);
    return 0;
}

SEC("tracepoint/syscalls/sys_enter_write")
int on_enter_write(struct trace_event_raw_sys_enter* ctx) {
    if (!in_target_cgroup()) return 0;
    struct event_t *e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
    if (!e) return 0;
    e->ts = bpf_ktime_get_ns();
    e->pid = bpf_get_current_pid_tgid() >> 32;
    e->tid = (u32)bpf_get_current_pid_tgid();
    e->cgroup_id = bpf_get_current_cgroup_id();
    e->type = 3;  // write enter
    e->fd = (s64)ctx->args[0];
    e->ret = 0;
    e->count = (u64)ctx->args[2];
    e->path[0] = '\0';
    bpf_ringbuf_submit(e, 0);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```
And a minimal Go userspace that consumes the ring buffer and emits OpenTelemetry spans with attributes linked to application traces via cgroup and PID:
```go
// file: main.go
package main

import (
	"bytes"
	"context"
	"encoding/binary"
	"log"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	"go.opentelemetry.io/otel/trace"
	// Assume we have a ringbuf consumer and OTLP exporter setup.
)

// Event mirrors struct event_t in io_trace.bpf.c.
// NOTE: in real code, mind C struct padding (or decode field by field).
type Event struct {
	Ts       uint64
	Pid      uint32
	Tid      uint32
	CgroupID uint64
	Type     uint32
	Fd       int64
	Ret      int64
	Path     [128]byte
	Count    uint64
}

// Ringbuf is a placeholder for the BPF ring buffer reader.
type Ringbuf struct{ /* ... */ }

func (r *Ringbuf) Read() []byte { /* ... */ return nil }

func main() {
	// Initialize OTLP exporter and tracer provider.
	tp := initOTLPTracer()
	defer func() { _ = tp.Shutdown(context.Background()) }()
	tracer := otel.Tracer("replay/collector")

	rb := openRingbuf("/sys/fs/bpf/events")
	for {
		rec := rb.Read()
		if rec == nil {
			time.Sleep(10 * time.Millisecond)
			continue
		}
		var e Event
		if err := binary.Read(bytes.NewReader(rec), binary.LittleEndian, &e); err != nil {
			log.Printf("decode event: %v", err)
			continue
		}
		_, span := tracer.Start(context.Background(), "syscall",
			trace.WithTimestamp(time.Unix(0, int64(e.Ts))))
		span.SetAttributes(
			attribute.Int("pid", int(e.Pid)),
			attribute.Int("tid", int(e.Tid)),
			attribute.Int64("cgroup.id", int64(e.CgroupID)),
			attribute.Int("type", int(e.Type)),
			attribute.Int64("fd", e.Fd),
			attribute.Int64("ret", e.Ret),
			attribute.Int64("count", int64(e.Count)),
			attribute.String("path", cString(e.Path[:])),
		)
		span.End(trace.WithTimestamp(time.Unix(0, int64(e.Ts))))
	}
}

func initOTLPTracer() *sdktrace.TracerProvider { /* ... */ return nil }
func openRingbuf(path string) *Ringbuf         { /* ... */ return nil }

func cString(b []byte) string {
	i := 0
	for i < len(b) && b[i] != 0 {
		i++
	}
	return string(b[:i])
}
```
In practice, you’ll want richer schemas (e.g., byte-level hashes for buffers, correlation IDs from socket cookies, syscall args for futex), and you’ll need careful sampling/filters to keep overhead <2–3% CPU and <1% tail latency. Use per-cgroup ring buffers to avoid head-of-line blocking across noisy neighbors.
Correlating with OpenTelemetry
OpenTelemetry (OTel) gives you a vendor-neutral way to move structured data and correlate it with application traces.
- Span context propagation: You can attach kernel events to spans via attributes like cgroup ID, PID/TID, and socket tuple (src/dst addr, port). In languages with async tracing, capture the active span ID in userspace and associate kernel events post hoc via PID/TID mapping.
- OTel Collector processors: Implement a processor that joins kernel events to app spans based on time windows and PID/TID maps, then emits enriched spans/logs (a minimal join sketch follows this list).
- Storage: Send to Tempo, Jaeger, or any OTel-compatible backend; keep trace retention for the capture window.
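For illustration, such a join can be a simple time-window match on PID/TID. The sketch below assumes kernel timestamps have already been converted to wall-clock time and that an index of active spans per thread is maintained from the app SDK; KernelEvent, SpanRef, and SpanIndex are hypothetical types, not part of the Collector API.

```go
// Illustrative join of kernel events to application spans by PID/TID and
// time window. KernelEvent, SpanRef, and SpanIndex are hypothetical types.
package join

import "time"

type KernelEvent struct {
	Pid, Tid uint32
	Ts       time.Time // bpf_ktime_get_ns converted to wall-clock time
	Name     string    // e.g. "sys_enter_read"
}

type SpanRef struct {
	TraceID, SpanID string
	Start, End      time.Time
}

// SpanIndex maps (pid, tid) to the spans that were active on that thread.
type SpanIndex map[[2]uint32][]SpanRef

type Joined struct {
	Event KernelEvent
	Span  SpanRef
}

// Attach pairs each kernel event with the span (if any) whose time window
// contains the event on the same pid/tid.
func Attach(events []KernelEvent, idx SpanIndex) []Joined {
	var out []Joined
	for _, ev := range events {
		for _, sp := range idx[[2]uint32{ev.Pid, ev.Tid}] {
			if !ev.Ts.Before(sp.Start) && !ev.Ts.After(sp.End) {
				out = append(out, Joined{Event: ev, Span: sp})
				break
			}
		}
	}
	return out
}
```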
The win: the AI (and humans) can see a narrative linking an inbound HTTP span → DB span → kernel writes → futex contention → syscall error → crash, all within one timeline.
Snapshotting the Environment with CRIU and Container Runtimes
When a trigger fires, you need to freeze the world. On Linux, CRIU (Checkpoint/Restore In Userspace) is the workhorse used by Docker, Podman, and containerd for container checkpointing.
- What it captures: process tree, memory pages, registers, open file descriptors, sockets (with conditions), namespaces, and cgroup state. It serializes to image files that can be restored later on a compatible kernel.
- Runtime integration: Docker and Podman expose checkpoint/restore commands; containerd has a checkpoint API and snapshotters. Kubernetes can orchestrate via CRIU-enabled runtimes (alpha features exist; production readiness varies).
- Limitations: Kernel feature compatibility, TCP connection states (e.g., in-flight connections may be problematic), and GPU/accelerator contexts are tricky. For network-heavy apps, it’s often better to tear down sockets and replay I/O at the application layer during restore.
Example: checkpoint a container with Podman
```bash
# Enable CRIU and checkpoint
podman container checkpoint myapp \
  --export=/var/repro/myapp-checkpoint.tar.gz \
  --leave-running=false

# Later, restore into a sandbox host
podman container restore \
  --import=/var/repro/myapp-checkpoint.tar.gz \
  --name myapp-replay \
  --publish 127.0.0.1:18080:8080 \
  --env REPLAY_MODE=1
```
If checkpointing is not viable, snapshot the image + volume state instead and rely on a deterministic wrapper to control time, randomness, and I/O during replay.
For Kubernetes, you can pre-provision snapshot-capable storage (CSI snapshots) for PVs and use a job to create a “repro bundle” (a bundling sketch follows the list) from:
- Container image digest(s)
- CRIU checkpoint or equivalent process state dump
- Volume snapshots or a filtered export (e.g., only the rows referenced by the failing request)
- A pcap or higher-level I/O log of external requests
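A sketch of the bundling step itself, assuming the artifacts already sit in a local directory and the manifest has been written; the paths are placeholders:

```go
// Illustrative repro-bundle assembly: tar.gz the manifest plus every file
// the capture job collected. Paths and layout are placeholders.
package bundle

import (
	"archive/tar"
	"compress/gzip"
	"io"
	"os"
	"path/filepath"
)

// Write creates bundlePath containing manifest.json plus every file in
// artifactDir, preserved under artifacts/ with their relative paths.
func Write(bundlePath, manifestPath, artifactDir string) error {
	out, err := os.Create(bundlePath)
	if err != nil {
		return err
	}
	defer out.Close()
	gz := gzip.NewWriter(out)
	defer gz.Close()
	tw := tar.NewWriter(gz)
	defer tw.Close()

	addFile := func(name, path string) error {
		f, err := os.Open(path)
		if err != nil {
			return err
		}
		defer f.Close()
		info, err := f.Stat()
		if err != nil {
			return err
		}
		hdr, err := tar.FileInfoHeader(info, "")
		if err != nil {
			return err
		}
		hdr.Name = name
		if err := tw.WriteHeader(hdr); err != nil {
			return err
		}
		_, err = io.Copy(tw, f)
		return err
	}

	if err := addFile("manifest.json", manifestPath); err != nil {
		return err
	}
	return filepath.Walk(artifactDir, func(path string, info os.FileInfo, err error) error {
		if err != nil || info.IsDir() {
			return err
		}
		rel, err := filepath.Rel(artifactDir, path)
		if err != nil {
			return err
		}
		return addFile(filepath.Join("artifacts", rel), path)
	})
}
```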
Privacy-Safe Reproductions by Design
Sending raw production data into an AI sandbox is a non-starter for many teams. You can preserve replay fidelity without revealing sensitive content.
Principles:
- Data minimization: Capture only the bytes required for the failing path. If you identify the triggering request, don’t include the entire traffic window.
- Tokenization: Replace PII in payloads with reversible tokens using an HSM-backed key. During replay, detokenize inside the sandbox; outside, only tokens are visible.
- Format-preserving encryption (FPE): For fields with strict formats (credit cards, SSNs), use FPE so downstream validators still pass.
- Structural masking: Keep JSON/XML/protobuf structure and lengths; mask sensitive values. Maintain the same length where the protocol depends on it.
- Hashing for joins: Hash identifiers when only equality is needed (e.g., matching user_id across services).
- Governance and audit: RBAC guardrails, purpose-limited processing, and immutable audit logs of who accessed what.
OpenTelemetry helps here too: Apply attribute processors in the Collector that redact or tokenize before storage and before packaging into the repro bundle.
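As a minimal sketch of structural masking and hashing (an illustration of the principles above, not a ready-made Collector processor), the idea is to walk a decoded JSON payload, hash identifier fields so equality joins still work, and replace other sensitive string leaves with same-length masks; the field names treated as sensitive or identifying here are assumptions:

```go
// Illustrative structural masking: keeps JSON shape and value lengths,
// masks sensitive strings, and hashes identifiers for equality-only joins.
// The sensitive/identifier field names are assumptions for this example.
package privacy

import (
	"crypto/sha256"
	"encoding/hex"
	"strings"
)

var sensitive = map[string]bool{"password": true, "authorization": true, "ssn": true}
var identifiers = map[string]bool{"user_id": true, "account_id": true}

// Mask walks a decoded JSON value (as produced by encoding/json) and returns
// a copy with sensitive leaves masked and identifier leaves hashed.
func Mask(v interface{}) interface{} {
	switch t := v.(type) {
	case map[string]interface{}:
		out := make(map[string]interface{}, len(t))
		for k, val := range t {
			key := strings.ToLower(k)
			if s, ok := val.(string); ok {
				switch {
				case identifiers[key]:
					sum := sha256.Sum256([]byte(s))
					out[k] = hex.EncodeToString(sum[:]) // equality preserved
				case sensitive[key]:
					out[k] = strings.Repeat("*", len(s)) // length preserved
				default:
					out[k] = s
				}
				continue
			}
			out[k] = Mask(val)
		}
		return out
	case []interface{}:
		out := make([]interface{}, len(t))
		for i, val := range t {
			out[i] = Mask(val)
		}
		return out
	default:
		return v
	}
}
```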
The Replay Rig: Make It Run the Same Way Every Time
The replay environment has two goals: safety (isolate from prod) and determinism (control non-deterministic sources).
Recommended setup:
- Isolation: Run replays inside Firecracker microVMs or gVisor to create a tight boundary. No outbound network, only a controlled ingress for replayed traffic. Mount filesystems read-only except for ephemeral scratch space.
- Clock control: LD_PRELOAD a shim that intercepts clock_gettime, gettimeofday, timerfd, and nanosleep. During replay, return recorded timestamps and inject timer expirations on schedule.
- Randomness control: Intercept getrandom and /dev/urandom to feed a recorded byte stream. For language-level RNGs, seed them deterministically at process start (e.g., setting GODEBUG or JAVA_TOOL_OPTIONS where applicable).
- Network replay: Feed recorded request/response pairs via a local proxy instead of the real internet. For TLS, terminate and re-encrypt locally; keep SNI/ALPN consistent.
- Scheduling: Ideally use a deterministic scheduler (rr is best-in-class on x86_64 with perf counters). Otherwise, serialize critical points or pin threads to cores with controlled priorities. A hybrid approach records futex events and enforces the same wake order.
- Kernel alignment: Restore under a compatible kernel (same major/minor) and ABI. Keep a fleet of replay hosts matching prod kernels.
A simple LD_PRELOAD shim for time and randomness:
```c
// file: libdet.c (build as libdet.so)
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>
#include <sys/random.h>
#include <pthread.h>

static int (*real_clock_gettime)(clockid_t, struct timespec*);
static ssize_t (*real_getrandom)(void*, size_t, unsigned int);
static FILE* ts_file;
static FILE* rnd_file;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

__attribute__((constructor))
static void init() {
    real_clock_gettime = (int (*)(clockid_t, struct timespec*))dlsym(RTLD_NEXT, "clock_gettime");
    real_getrandom = (ssize_t (*)(void*, size_t, unsigned int))dlsym(RTLD_NEXT, "getrandom");
    const char* ts_path = getenv("REPLAY_TS");
    const char* rnd_path = getenv("REPLAY_RND");
    if (ts_path) ts_file = fopen(ts_path, "rb");
    if (rnd_path) rnd_file = fopen(rnd_path, "rb");
}

int clock_gettime(clockid_t clk_id, struct timespec* tp) {
    if (ts_file) {
        pthread_mutex_lock(&lock);
        struct timespec rec;
        size_t n = fread(&rec, sizeof(rec), 1, ts_file);
        pthread_mutex_unlock(&lock);
        if (n == 1) { *tp = rec; return 0; }
    }
    return real_clock_gettime(clk_id, tp);
}

ssize_t getrandom(void* buf, size_t buflen, unsigned int flags) {
    if (rnd_file) {
        pthread_mutex_lock(&lock);
        size_t n = fread(buf, 1, buflen, rnd_file);
        pthread_mutex_unlock(&lock);
        if (n == buflen) return (ssize_t)buflen;
    }
    return real_getrandom(buf, buflen, flags);
}
```
Use it at replay time:
```bash
LD_PRELOAD=/opt/lib/libdet.so \
REPLAY_TS=/repro/recorded_times.bin \
REPLAY_RND=/repro/recorded_random.bin \
./myapp --config /repro/config.yaml
```
For languages with their own time APIs, add shims or configure runtime flags to ensure consistent time. Note that Go binaries typically read clock_gettime through the vDSO rather than libc, so an LD_PRELOAD shim may not intercept them; the JVM reaches the clock through its own native layer, and JIT optimizations can fold time-dependent constants.
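Where LD_PRELOAD cannot reach (for example, a static Go binary that reads the clock through the vDSO), an application-level clock seam is one workaround: the app takes time from an injected Clock, which in replay mode returns recorded timestamps. A minimal sketch, assuming a recording format of one UnixNano int64 per reading:

```go
// Illustrative application-level clock seam for processes that bypass libc.
// The recording format (little-endian int64 UnixNano per reading) is an
// assumption for this sketch.
package det

import (
	"encoding/binary"
	"os"
	"sync"
	"time"
)

type Clock interface{ Now() time.Time }

type SystemClock struct{}

func (SystemClock) Now() time.Time { return time.Now() }

type ReplayClock struct {
	mu sync.Mutex
	f  *os.File
}

func OpenReplayClock(path string) (*ReplayClock, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	return &ReplayClock{f: f}, nil
}

// Now returns the next recorded timestamp, falling back to the real clock
// once the recording is exhausted.
func (c *ReplayClock) Now() time.Time {
	c.mu.Lock()
	defer c.mu.Unlock()
	var nanos int64
	if err := binary.Read(c.f, binary.LittleEndian, &nanos); err != nil {
		return time.Now()
	}
	return time.Unix(0, nanos)
}
```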
If your service is CPU-bound with complex concurrency, consider rr for Linux/x86_64. rr records user-space execution with hardware performance counters to deterministically replay schedules. It’s brilliant but has constraints: it targets x86_64, requires access to precise performance counters, slows down syscall-heavy workloads, and doesn’t cover all kernel behavior. When it fits, it’s the gold standard for determinism.
Putting It Together: The Repro Manifest
A reproducible bundle needs a manifest that tells humans and machines what to do. Here’s a suggested schema:
json{ "version": "1.0", "id": "repro-2025-10-31-abc123", "created_at": "2025-10-31T22:04:05Z", "trigger": { "type": "http_500_rate_spike", "span_id": "7a1b...", "trace_id": "5f3c..." }, "environment": { "kernel": "5.15.0-1023-aws", "container_image": "registry.example.com/team/myapp@sha256:...", "cgroup_id": "213987654321", "sysctls": { "net.core.somaxconn": "1024" }, "env": { "FEATURE_X": "true", "GOMEMLIMIT": "1GiB" } }, "artifacts": { "criu_checkpoint": "artifacts/checkpoint.tar.gz", "pcap": "artifacts/inbound.pcap", "syscalls": "artifacts/syscalls.binpb", "times": "artifacts/timestamps.bin", "random": "artifacts/random.bin", "otel_trace": "artifacts/trace.jsonl", "volume_snapshot": "artifacts/volumes/db.delta.tar", "symbols": "artifacts/symbols.tar" }, "privacy": { "tokenization": "fpe:v1:kms-key-arn", "pii_ruleset": "v3.2", "logs_redacted": true }, "replay": { "runtime": "firecracker", "network_mode": "offline", "clock": "recorded", "random": "recorded", "entrypoint": ["/usr/local/bin/myapp", "--config=/repro/config.yaml"], "preload": "/opt/lib/libdet.so", "env_overrides": { "REPLAY_MODE": "1" } } }
Your debug AI should know how to parse this manifest, spin up the sandbox according to the instructions, run the replay, and compare the OTel traces from the replay to the original.
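A minimal sketch of such a consumer, assuming the manifest schema above; runSandbox is a hypothetical hook standing in for your Firecracker/gVisor automation:

```go
// Illustrative manifest consumer: parse the repro manifest, then hand the
// replay settings to sandbox automation. Only the fields used here are
// modeled; runSandbox is a hypothetical hook into your VM tooling.
package replay

import (
	"encoding/json"
	"fmt"
	"os"
)

type Manifest struct {
	ID     string `json:"id"`
	Replay struct {
		Runtime     string            `json:"runtime"`
		NetworkMode string            `json:"network_mode"`
		Entrypoint  []string          `json:"entrypoint"`
		Preload     string            `json:"preload"`
		EnvOverride map[string]string `json:"env_overrides"`
	} `json:"replay"`
	Artifacts map[string]string `json:"artifacts"`
}

func Run(manifestPath string) error {
	raw, err := os.ReadFile(manifestPath)
	if err != nil {
		return err
	}
	var m Manifest
	if err := json.Unmarshal(raw, &m); err != nil {
		return fmt.Errorf("parse manifest: %w", err)
	}
	env := map[string]string{
		"LD_PRELOAD": m.Replay.Preload,
		"REPLAY_TS":  m.Artifacts["times"],
		"REPLAY_RND": m.Artifacts["random"],
	}
	for k, v := range m.Replay.EnvOverride {
		env[k] = v
	}
	// runSandbox would start the microVM offline, restore the checkpoint,
	// and exec the entrypoint with the deterministic shims configured.
	return runSandbox(m.Replay.Runtime, m.Replay.Entrypoint, env)
}

func runSandbox(runtime string, entrypoint []string, env map[string]string) error {
	/* hypothetical: drive the Firecracker/gVisor automation here */
	return nil
}
```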
Measuring Success: Determinism and Overhead SLOs
You can’t improve what you don’t measure. Track:
- Reproduction Success Rate (RSR): fraction of triggered captures that replay the failure.
- Determinism Score: similarity metric between record and replay (event timing deltas, syscall sequences, span structure); a simple scoring sketch follows this list. Aim >0.95 for bug-of-interest windows.
- Overhead Budget: CPU/memory and tail latency impact of the record layer. Keep under 3% CPU and 1% p99 latency where possible; sample more aggressively for hot paths.
- Storage Cost: average size per repro bundle; target <50–200 MB by minimizing payloads and keeping only relevant windows.
- Privacy Incidents: number of rejected bundles for PII policy violations; should trend to zero.
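One simple way to compute a determinism score (an illustration, not a standard metric) is the longest-common-subsequence ratio between the recorded and replayed syscall sequences:

```go
// Illustrative determinism score: LCS length of the recorded vs. replayed
// syscall sequences, normalized by the recorded length. A score of 1.0
// means the replay issued the same calls in the same order.
package metrics

func DeterminismScore(recorded, replayed []string) float64 {
	if len(recorded) == 0 {
		return 1.0
	}
	// Classic O(n*m) LCS dynamic program with two rolling rows.
	prev := make([]int, len(replayed)+1)
	curr := make([]int, len(replayed)+1)
	for i := 1; i <= len(recorded); i++ {
		for j := 1; j <= len(replayed); j++ {
			switch {
			case recorded[i-1] == replayed[j-1]:
				curr[j] = prev[j-1] + 1
			case prev[j] >= curr[j-1]:
				curr[j] = prev[j]
			default:
				curr[j] = curr[j-1]
			}
		}
		prev, curr = curr, prev
	}
	return float64(prev[len(replayed)]) / float64(len(recorded))
}
```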
Workflow Integration: From Trigger to Verified Fix
A practical end-to-end flow:
- Trigger: Alerts from OTel metrics/logs (e.g., 500 rate, p99 spikes, panic/crash, regression tests failing in CI) start a capture.
- Pre/Post Window: Buffer N seconds before trigger and M seconds after; flush eBPF ring buffers and write OTel traces.
- Snapshot: CRIU checkpoint or volume snapshot is taken; ephemeral volumes captured.
- Privacy Transform: Run redaction/tokenization; validate with policy engine.
- Bundle: Create the repro manifest + artifacts; store in an encrypted object store with TTL.
- Replay: Debug AI (or a human) downloads the bundle, starts a sandbox, injects libdet, replays inputs, and verifies failure reproduction.
- Analyze: The AI compares traces, syscalls, and logs; proposes a candidate patch.
- Validate: Build a patched image, rerun the replay. If root cause fixed, replay passes; AI provides a minimal test reproducer and a diff.
- Land and Guard: Merge the fix; add a replay-based regression test in CI using the bundle (sanitized) to prevent reintroduction.
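The replay-based regression test in the last step can be an ordinary test that runs the sanitized bundle against the patched image. The sketch below assumes a hypothetical runBundle helper wrapping the sandbox automation, and a result type invented for this example:

```go
// Illustrative replay-based regression test: run the sanitized repro bundle
// against the patched image and fail if the original failure still occurs.
// runBundle and bundleResult are assumptions for this sketch.
package replay_test

import "testing"

type bundleResult struct {
	FailureReproduced bool    // did the recorded failure signature reappear?
	DeterminismScore  float64 // similarity between record and replay
}

// runBundle would start the sandbox, replay inputs, and compare traces.
func runBundle(t *testing.T, manifestPath, imageDigest string) bundleResult {
	t.Helper()
	/* hypothetical: drive the replay rig here */
	return bundleResult{}
}

func TestRepro20251031DoesNotRegress(t *testing.T) {
	res := runBundle(t, "testdata/repro-2025-10-31-abc123/manifest.json",
		"registry.example.com/team/myapp@sha256:patched")
	if res.DeterminismScore < 0.95 {
		t.Fatalf("replay drifted from recording: score=%.2f", res.DeterminismScore)
	}
	if res.FailureReproduced {
		t.Fatal("original failure still reproduces with the patched image")
	}
}
```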
Example: Hunting a Go net/http Race with Deterministic Replay
Scenario: A Go microservice occasionally returns 502 under load with a cryptic "context canceled" error and spikes in CPU. It’s intermittent in staging and almost unobservable in prod.
Record:
- eBPF captures read/write, futex events, and sched_switch for the myapp cgroup.
- OTel traces show spikes in the HTTP server span duration and canceled downstream gRPC calls.
- Trigger on 502 rate >1% for 60s; capture a 30s pre-window and 10s post-window.
- Podman checkpoints the container at t=trigger+2s; volumes snapshot includes the on-disk cache.
- The recorder stores a pcap of inbound requests (for just the last 10s), random bytes consumed, and timestamps for timers.
Privacy:
- Replace Authorization headers and cookies with tokens; maintain identical lengths.
- Hash user IDs; keep only equality semantics for joins.
- Redact gRPC payload bodies, but keep method names and message sizes.
Replay:
- Firecracker VM with the same kernel version; image pulled by digest.
- Start myapp with LD_PRELOAD=libdet.so; provide recorded time and random streams.
- Inject inbound HTTP requests from pcap via a local proxy, preserving arrival times.
- Outbound calls are loopbacked to a stub server that replies using recorded responses.
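The loopback stub for outbound calls can be very small for HTTP. The sketch below keys recorded responses by method and path; loading the recorded map from the bundle is assumed, and real gRPC replay would key on method and message hash instead:

```go
// Illustrative outbound stub: answers the replayed service's outbound HTTP
// calls with recorded responses, keyed by "METHOD /path". Loading of the
// recorded map from the repro bundle is assumed, not shown.
package stub

import "net/http"

type Recorded struct {
	Status int
	Header http.Header
	Body   []byte
}

// Serve starts a local stub on addr. Unmatched requests get 502 so drift
// from the recording is immediately visible in the replay.
func Serve(addr string, responses map[string]Recorded) error {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		rec, ok := responses[r.Method+" "+r.URL.Path]
		if !ok {
			http.Error(w, "no recorded response", http.StatusBadGateway)
			return
		}
		for k, vs := range rec.Header {
			for _, v := range vs {
				w.Header().Add(k, v)
			}
		}
		w.WriteHeader(rec.Status)
		_, _ = w.Write(rec.Body)
	})
	return http.ListenAndServe(addr, mux)
}
```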
Outcome:
- The bug reproduces with a determinism score of 0.97: same sequence of futex waits/wakes and the same goroutine stacks. The AI inspects goroutine profiles (continuous profiler via eBPF) and traces and identifies a deadlock-prone cache invalidation path: a write lock held while doing a synchronous gRPC call under backpressure.
- Suggested fix: narrow the lock scope to only update the cache map, do the gRPC call without holding the write lock, then re-check cache state with an atomic compare-and-swap. The AI provides a patch and a deterministic test using the repro bundle (sans PII) that proves the 502s disappear.
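The shape of that fix, sketched with placeholder names (Cache, Entry, fetchFromBackend, and the generation counter are illustrative, not the actual service code):

```go
// Illustrative shape of the fix: do the slow gRPC call without holding the
// write lock, then take the lock briefly and re-check a generation counter
// before publishing (a compare-and-swap style re-check).
package cache

import (
	"context"
	"sync"
)

type Entry struct{ Value []byte }

type Cache struct {
	mu   sync.RWMutex
	gen  map[string]uint64 // bumped on every invalidation
	data map[string]Entry
}

// Before (deadlock-prone): the write lock is held across a synchronous RPC,
// so under backpressure every reader and writer queues behind it.
func (c *Cache) refreshHoldingLock(ctx context.Context, key string) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, err := fetchFromBackend(ctx, key)
	if err != nil {
		return err
	}
	c.data[key] = e
	return nil
}

// After: fetch with no lock held, then publish only if the entry was not
// invalidated again in the meantime.
func (c *Cache) Refresh(ctx context.Context, key string) error {
	c.mu.RLock()
	start := c.gen[key]
	c.mu.RUnlock()

	e, err := fetchFromBackend(ctx, key) // slow call, no lock held
	if err != nil {
		return err
	}

	c.mu.Lock()
	if c.gen[key] == start {
		c.data[key] = e
	}
	c.mu.Unlock()
	return nil
}

func fetchFromBackend(ctx context.Context, key string) (Entry, error) {
	/* placeholder for the downstream gRPC call */
	return Entry{}, nil
}
```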
This is the ideal loop: reproduce, explain, propose, verify.
Opinionated Guidance and Pitfalls
What to do:
- Start with eBPF for syscalls and key tracepoints, OTel for correlation, and CRIU snapshots where feasible. This covers 80% of cases with manageable overhead.
- Instrument time and randomness deterministically first. Many failures depend on timeouts and retry backoff behavior; fixing those alone boosts RSR dramatically.
- Prefer application-level I/O replay to raw pcap unless protocols are opaque. It’s easier to keep privacy and determinism at the message level.
- Store debug symbols with the bundle. Symbolization is essential for the AI to make sense of stacks.
- Use microVMs for replay isolation. It reduces risk and keeps kernel/CPU drift under control.
What to avoid:
- Capturing everything indiscriminately. You’ll drown in data and leak privacy. Be surgical: a 30–60s window, filtered by cgroup and PID, suffices for most bugs.
- Relying solely on application logs. They’re valuable but insufficient for determinism.
- Assuming the same fix applies across kernels. Kernel timing and epoll behavior can differ; validate on a matched kernel.
Common pitfalls:
- ASLR and JIT variance can perturb instruction addresses across runs. Use symbolization and avoid comparing raw addresses unless normalized.
- CRIU and sockets: Restoring live TCP connections between hosts often fails. Prefer closing sockets at checkpoint time and replaying I/O locally.
- eBPF kernel drift: Use CO-RE and keep a matrix of kernel versions; run CI tests against each.
- Storage leakage: Forgetting to scrub one field (e.g., a nested JWT) can violate policy. Use a linter and automated scanners.
References and Tools
- rr: Lightweight record and replay for debugging (Mozilla) — deterministic user-space debugging on Linux/x86_64.
- Undo (commercial) and Pernosco — time-travel debugging; excellent case studies on replay.
- CRIU — Checkpoint/Restore In Userspace docs; Docker/Podman integration.
- OpenTelemetry — Spec and Collector; processors for redaction.
- Pixie, Parca/Pyroscope, Cilium Tetragon, Inspektor Gadget — eBPF-based observability and security tooling.
- FoundationDB simulation — exemplary deterministic testing culture.
- Papers: Dthreads (deterministic multithreading), Kendo, and various record/replay systems.
A Minimal Implementation Plan
Phase 1 (2–4 weeks):
- Deploy eBPF collectors for syscalls read/write/openat, connect/accept, futex, sched_switch. Filter by target services’ cgroups.
- Add OTel Collector pipeline with a processor to attach kernel events to app spans.
- Capture minimal windows on triggers; persist artifacts and traces.
- Add LD_PRELOAD shim for time and randomness; test locally.
Phase 2 (4–8 weeks):
- Integrate CRIU checkpoint for containerized services; build compatibility matrix per kernel.
- Introduce privacy transform pipeline with tokenization and FPE.
- Build replay sandbox (Firecracker) and automation to run bundles.
- Define manifest schema; export symbol bundles.
Phase 3 (8–12 weeks):
- Add application-level request/response replay for critical protocols (HTTP/gRPC, Kafka, Postgres).
- Implement determinism scoring and CI jobs to run replays on PRs.
- Pilot debug AI agent that consumes bundles, runs replays, proposes patches, and validates.
Closing Thoughts
Deterministic replay is not about capturing everything; it’s about capturing the right things and making them repeatable. Combined with a debug AI that refuses to guess without running the evidence, you get a workflow that feels like cheating: bugs that used to take days resolve in hours, with confidence.
Use eBPF to observe, OpenTelemetry to narrate, and containerized snapshots to freeze. Add privacy controls that make InfoSec smile. Then let your AI do what it’s best at: searching vast hypothesis spaces and writing patches — with a replay that proves they work.
