Make Your Code Debug AI Deterministic with Record‑Replay Traces
You can’t ask an LLM to reason effectively about a bug that isn’t reproducible. The hardest incidents are the ones that happen once in production, never under a debugger, and defy log-driven forensics: data races, clock skew, request interleavings, network heisenbugs, GC pauses with unlucky object graphs. Developers know the trend line: more concurrency, more distributed state, more ephemeral environments, less determinism.
The fix is not more clever prompting; it’s determinism. Make the world repeatable and the analysis becomes a bounded problem. Record–replay systems do exactly that: capture the inputs and scheduling of a failing execution, then replay it bit-for-bit so you can time‑travel and ask precise questions. Combine that with a code‑focused LLM and you get a Debug AI with superpowers: grounded, cite‑able, and fast.
In this article, we build an opinionated pipeline for time‑travel debugging with LLMs using rr (low‑level user‑space record/replay), JFR (Java Flight Recorder), and eBPF (kernel‑level event capture). We cover:
- What determinism means in practice and which tools deliver it
- How to capture traces with acceptable overhead in prod and CI
- Privacy‑first redaction and data governance for traces
- Prompt grounding patterns that keep the model from hallucinating
- CI gating with replay bundles and golden trace diffs
- Why replay slashes the cycle time on flaky and prod‑only bug hunts
The target audience is engineers who ship production services and want the shortest path from "we saw it once" to a minimal patch and a regression test.
Determinism 101: Replaying Reality
Nondeterminism has a few major sources:
- Time: clocks, timers, scheduling quanta
- Randomness: PRNG seeds, entropy sources
- Concurrency: thread interleavings, races, lock timings
- I/O: network reordering, kernel buffering, filesystem latency
- Environment: CPU features, ASLR, JIT compilation, container topology
Record‑replay techniques make execution repeat by observing and later re‑feeding a minimal set of inputs:
- System call streams and timing hints
- Thread scheduling and signals
- Network packets and file contents
- Runtime events (allocation, GC, safepoints)
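To make the "re‑feeding inputs" idea concrete, here is a minimal, hypothetical sketch (not any of the tools below): nondeterministic values are written to a cassette on the first run and read back verbatim on replay, so the code under test sees identical time and randomness. The cassette path and class are illustrative only.

```python
# Minimal sketch: record nondeterministic inputs on the first run,
# re-feed them verbatim on replay.
import json, random, time
from pathlib import Path

CASSETTE = Path("inputs/time_random.ndjson")  # hypothetical path

class InputCassette:
    def __init__(self, record: bool):
        self.record = record
        self.cursor = 0
        self.events = [] if record else [
            json.loads(line) for line in CASSETTE.read_text().splitlines()
        ]

    def _next(self, kind, produce):
        if self.record:
            value = produce()
            self.events.append({"kind": kind, "value": value})
            return value
        event = self.events[self.cursor]
        self.cursor += 1
        assert event["kind"] == kind, "replay diverged from the recording"
        return event["value"]

    def now(self) -> float:            # use instead of time.time()
        return self._next("time", time.time)

    def rand(self) -> float:           # use instead of random.random()
        return self._next("random", random.random)

    def save(self):
        CASSETTE.parent.mkdir(parents=True, exist_ok=True)
        CASSETTE.write_text("\n".join(json.dumps(e) for e in self.events))
```

Instruction‑accurate recorders such as rr do the same thing at the syscall and scheduling level, so application code doesn't have to change.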
Three tiers of tooling matter in practice:
- Instruction‑accurate: rr, Undo LiveRecorder, Pernosco (service on top of rr). You get step‑backwards debugging and watchpoints that work.
- Runtime/event‑level: JFR (JVM), .NET IntelliTrace/dotnet-trace, Python profilers, Go execution traces, JavaScript V8 traces. These don’t reproduce the exact CPU state but give precise event timelines with stack traces.
- System‑level: eBPF, LTTng, ftrace, perf, Intel PT. These cover syscalls, scheduling, and kernel events, and can be joined to application traces.
No single tool covers everything. So the pragmatic approach is to capture enough to:
- Reproduce failure locally or in CI (rr or process‑level replayers)
- Explain high‑level behavior to a model with event sequences (JFR/eBPF/OTel)
- Provide ground truth for the model’s citations (trace IDs and stack frames)
Tool Landscape and Tradeoffs
- rr (Linux x86‑64): Records user‑space execution, scheduler decisions, and syscalls using performance counters. Replay is deterministic with single‑core scheduling and the same kernel version family. Overhead is typically 1.2–3×, depending on workload. Pros: time travel, reverse watchpoints, works with gdb and IDEs via Pernosco. Cons: needs Linux/x86‑64, JIT code can be tricky, single‑core replay semantics.
- Undo LiveRecorder: Commercial alternative for C/C++ with enterprise integrations and low overhead. Especially useful for long‑running services and core dumps.
- JFR (Java Flight Recorder): Built into the JVM. Event‑level capture with low overhead (<1–2% in continuous mode for common configurations). Good for method samples, allocations, GC, lock contention. Does not deterministically replay instruction execution, but when paired with input cassettes (recorded network/database responses), it gets close to reproduction.
- .NET: dotnet-trace/dotnet-counters, IntelliTrace (historical). Similar to JFR conceptually.
- eBPF: Kernel‑resident programs (CO‑RE) that hook syscalls, tracepoints, uprobes, and kprobes. Very low overhead when used with care; excellent for system‑level forensics and multi‑language shops. Great for stitching together process and network behavior without modifying app code.
- OpenTelemetry: Not replay, but critical for correlating spans to traces. We’ll join OTel span IDs to replay bundles so the LLM can jump from a production incident to the exact execution timeline.
- Bonus signals: perf (CPU samples), Intel PT (branch tracing), LTTng/ftrace (kernel trace streams). These can be attached for deep performance incidents.
Opinion: if you can use rr for the hot path where the failure manifests, you should. It’s the smallest hammer that nails heisenbugs. In polyglot microservices, supplement with JFR/eBPF to cover everything else and to run continuously at low overhead.
Architecture: A Deterministic Debugging Pipeline
We’ll assemble a pipeline with explicit, auditable boundaries. The guiding principles:
- Capture only when needed, and only what you need
- Redact at source, not downstream
- Build a self‑contained replay bundle with environment and inputs
- Let the LLM ask for more context via tools; don’t pre‑stuff it with entire traces
Components:
- Capture Agent: rr/JFR/eBPF collectors running in prod, staging, and CI
- Scrubber: In‑kernel and user‑space redaction that enforces PII policy
- Bundle Builder: Packages trace + environment descriptors + input cassettes
- Replay Runner: Hermetic runtime (container/VM) to replay with rr or re‑feed cassettes
- Trace Indexer: Builds searchable summaries, event windows, and joins to code maps
- LLM Orchestrator: Tool‑enabled agent that queries the index, proposes fixes, and cites events
- Gatekeeper: CI rules that require replayable artifacts and trace diffs before merge
A minimal target is a "replay bundle" that can run under rr or re‑feed recorded inputs for higher‑level runtimes, plus a JSON index the LLM can query.
Replay Bundle Manifest
Use a simple manifest that a human can inspect and a machine can enforce:
```yaml
# trace_bundle.yaml
version: 1
bundle_id: 2cbe4f7c-b52f-4b60-9b19-fb2c1a4b8e4b
created_at: 2025-01-03T12:34:56Z
origin:
  env: production
  region: us-east-1
  service: payments
  revision: git:3c1ab73
  oci_image: ghcr.io/acme/payments@sha256:8a5...
  otel_trace_id: 8f2d67e7a6...
recorders:
  - kind: rr
    kernel: 6.6.7-200.fc39.x86_64
    rr_version: 5.7.0
    command: ["/app/bin/payments", "--serve"]
    rr_trace_dir: traces/rr/trace-0001
  - kind: jfr
    jfr_file: traces/jfr/incident.jfr
    jvm: openjdk-21
  - kind: ebpf
    bpf_profile: traces/ebpf/profile.parquet
inputs:
  time_stream: inputs/time.ndjson
  random_seed: 12903812098
  network_cassette: inputs/http_vcr/   # HAR/WARC files
  db_cassette: inputs/sql_vcr.sqlite
env:
  TZ: UTC
  FEATURE_FLAGS: checkout_v2=false
privacy:
  schema: v1
  pii_hash_salt_kdf: argon2id:saltref:secrets/kms_salt
  redaction:
    - path: inputs/http_vcr/**
      ruleset: gdpr_default
attestations:
  - type: slsa_provenance
    file: attestations/slsa.json
  - type: sbom
    file: attestations/sbom.spdx.json
```
The manifest ties together the trace and the environment. The rr section points to a trace directory; the JFR file captures JVM events; eBPF entries cover syscalls and network flow. Input cassettes record time, random seed, network responses, and database results needed to make replay consistent.
Storage and Cost
Raw traces can be large. Expect representative ballparks:
- rr: 50–500 MB per minute under moderate I/O; zstd -19 helps a lot
- JFR: tens of MB/min under low‑overhead profiles; more in diagnostic mode
- eBPF: depends entirely on event selection and sampling; aim for <5 MB/min per host when idling
You don’t capture continuously at full fidelity. Trigger on anomalies:
- On 5xx spikes or SLO burn rate
- On panic/signal crashes
- On anomaly detectors (outlier latencies, memory growth without release)
- During CI when tests fail or when coverage is below a threshold
Use ring buffers with spill‑to‑disk on trigger. Retain minimal windows (e.g., last 60 seconds around incident) and keep only the sessions with a confirmed bug ID.
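As a sketch of trigger‑based capture, the loop below polls an error‑rate signal and dumps the running JFR recording when the threshold is breached. `get_error_rate()`, the target PID, and the recording name "prod" are assumptions for illustration (the continuous recording is configured in the JFR section below).

```python
import subprocess, time

TARGET_PID = 12345              # hypothetical JVM pid
ERROR_RATE_THRESHOLD = 0.05     # trigger on a 5% 5xx rate

def get_error_rate() -> float:
    """Placeholder: read the 5xx rate from your metrics endpoint."""
    raise NotImplementedError

def dump_incident_window():
    # Dump the buffers of an already-running continuous JFR recording named "prod"
    subprocess.check_call([
        "jcmd", str(TARGET_PID), "JFR.dump", "name=prod",
        f"filename=/var/log/incident-{int(time.time())}.jfr",
    ])

while True:
    if get_error_rate() > ERROR_RATE_THRESHOLD:
        dump_incident_window()
        time.sleep(300)         # back off so one incident yields one bundle
    time.sleep(5)
```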
Capturing with rr: Determinism for Native and Many Polyglot Runtimes
rr is the most reliable way to get step‑backwards debugging on Linux/x86‑64. It serializes execution by controlling scheduling, recording syscalls, and using performance counters for progress. For Node, Python, Rust, C/C++, and Go (when not cgo‑heavy), it often just works.
Basic workflow:
```bash
# Install rr on Debian/Ubuntu
sudo apt-get install rr

# Record your service until it crashes or a condition is met
rr record --output-trace-dir ./traces/rr/trace-0001 \
  env TZ=UTC \
  ./bin/service --serve --port 8080

# Note: rr cannot attach to an already-running process. To capture anomalies on
# demand, launch the suspect instance under rr (e.g., in a canary pool) and keep
# the trace only when the anomaly fires.
```
Replay and inspect:
```bash
rr replay ./traces/rr/trace-0001
# Inside the rr gdb session:
# - reverse-continue (rc), reverse-step (rs) work as expected
# - watch -l *ptr sets watchpoints that also work backwards
```
Python glue to gate rr collection on errors:
```python
import os, subprocess

# Run the service under rr; on a non-zero exit, keep the trace for bundling.
proc = subprocess.Popen(["rr", "record", "./bin/service"],
                        env={**os.environ, "TZ": "UTC"})
try:
    proc.wait()
finally:
    if proc.returncode not in (0,):
        # Promote trace to bundle; "rr pack" makes it self-contained and portable
        subprocess.check_call(["rr", "pack"])
```
Notes and gotchas:
- rr prefers to run the target on a single hardware thread during replay. That’s fine for correctness triage; for perf, use perf/Intel PT separately.
- JIT‑heavy code (e.g., JVM, V8 with aggressive JIT) can require flags to disable or pin JIT behavior. Node often works; the JVM is better served by JFR + input cassettes.
- Kernel and microarchitecture matter. Use the same kernel family for replay as record. Package a replay VM image to eliminate drift.
JVM: JFR for Event‑Level Ground Truth
JFR captures detailed runtime events: method samples, allocations, GC pauses, monitor enter/exit, thread parks, I/O, and even custom user events. While it’s not instruction‑level replay, in practice you can reconstruct the timeline well enough to reproduce with the right input cassettes.
Continuous low‑overhead config:
```bash
JAVA_TOOL_OPTIONS="-XX:StartFlightRecording=name=prod,settings=profile,dumponexit=true,filename=/var/log/incident.jfr"
```
You can also control it at runtime:
```bash
jcmd <pid> JFR.start name=incident settings=profile delay=0s filename=/tmp/incident.jfr
# ...wait for anomaly...
jcmd <pid> JFR.dump name=incident filename=/tmp/incident.jfr
jcmd <pid> JFR.stop name=incident
```
JFR pairs well with:
- HTTP VCR: record raw HTTP requests/responses as HAR/WARC; re‑feed them in tests
- SQL VCR: intercept JDBC and log parameterized queries with deterministic responses
- Clock control: abstract time via an injectable clock so you can rewind or freeze it under test
With these, you can write a reproducible unit/integration test that mimics the production failure while the JFR file gives the model/verifier the event citations.
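To illustrate the cassette pattern, here is a sketch using Python's vcrpy library as a stand‑in for whatever HAR/WARC‑based HTTP recorder you use on the JVM; the endpoint, cassette file, and response field are hypothetical.

```python
import requests
import vcr

# record_mode="once": hit the real service on the first run, replay the cassette after
my_vcr = vcr.VCR(cassette_library_dir="inputs/http_vcr",
                 record_mode="once",
                 filter_headers=["authorization"])   # redact secrets at capture time

def test_checkout_reproduces_incident():
    with my_vcr.use_cassette("checkout_incident.yaml"):
        resp = requests.post("https://payments.internal/checkout",   # hypothetical endpoint
                             json={"order_id": "o-123", "amount_cents": 4999})
        assert resp.status_code == 200
        assert resp.json()["charged_once"] is True                   # hypothetical field
```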
eBPF: System‑Level Context without App Changes
eBPF lets you attach to syscalls and tracepoints with minimal overhead. Use libbpf CO‑RE to compile once and run everywhere (with BTF metadata). Capture only what you need and redact at the source.
Example: capture connect attempts and DNS answers to build the network cassette.
Kernel program (C, CO‑RE):
```c
// connect.bpf.c
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>
#include <bpf/bpf_endian.h>

struct conn_event {
    __u64 ts_ns;
    __u32 pid;
    __u32 saddr_v4;
    __u32 daddr_v4;
    __u16 dport;
    char comm[16];
};

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 1 << 24);
} rb SEC(".maps");

SEC("kprobe/tcp_v4_connect")
int BPF_KPROBE(kp_tcp_v4_connect, struct sock *sk)
{
    struct conn_event *e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0);
    if (!e)
        return 0;

    e->ts_ns = bpf_ktime_get_ns();
    e->pid = bpf_get_current_pid_tgid() >> 32;
    BPF_CORE_READ_INTO(&e->saddr_v4, sk, __sk_common.skc_rcv_saddr);
    BPF_CORE_READ_INTO(&e->daddr_v4, sk, __sk_common.skc_daddr);

    __u16 dport_be;
    BPF_CORE_READ_INTO(&dport_be, sk, __sk_common.skc_dport);
    e->dport = __bpf_ntohs(dport_be);

    bpf_get_current_comm(&e->comm, sizeof(e->comm));
    bpf_ringbuf_submit(e, 0);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```
User‑space loader (Go + cilium/ebpf or Rust + libbpf‑rs) subscribes to the ring buffer and writes Parquet/NDJSON with PII redaction.
Why eBPF here?
- You get system context (who connected to what, when) without changing every service
- You can correlate PIDs to containers, pods, and OTel trace IDs via process‑to‑span mapping
- You can build minimal network/FS cassettes for replay
Be disciplined: limit probes, prefer tracepoints over kprobes when available, and sample aggressively under load.
Privacy and Governance: Redact at Source, Attest Everywhere
Traces can contain PII/PHI. Don’t ship raw payloads to your LLM index. Adopt these guardrails:
- Classify fields at the source. Maintain a schema that lists which fields can contain PII (headers, query params, JSON keys). Pair with a rule engine.
- Redact in kernel/user space before disk. For eBPF, don’t capture payload by default. For app‑level cassettes, tokenize with a reversible vault only when absolutely necessary.
- Pseudonymize identifiers with salted, keyed hashing (e.g., HMAC‑SHA256, or argon2id for high‑value identifiers). Store salts in KMS and record only references in the bundle, never the key.
- Encrypt at rest with tenant keys and envelope encryption; limit retention (e.g., 14–30 days).
- Add SLSA provenance and SBOMs to bundles to trace supply chain and tooling build provenance.
- RBAC for replay: not all engineers need all bundles; gate by service ownership.
Privacy isn’t optional; you won’t get sign‑off from security otherwise.
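A minimal sketch of at‑source pseudonymization along these lines; the HMAC key would be resolved from KMS at runtime and the field classification would come from your schema (both are placeholders here).

```python
import hashlib, hmac, json

HMAC_KEY = b"resolved-from-kms-at-runtime"            # placeholder; never hard-code keys
PII_KEYS = {"email", "card_number", "phone", "ssn"}   # from your field classification schema

def pseudonymize(value: str) -> str:
    return hmac.new(HMAC_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]

def redact_event(event: dict) -> dict:
    """Replace classified fields before the event is written to disk."""
    out = {}
    for key, val in event.items():
        if isinstance(val, dict):
            out[key] = redact_event(val)
        elif key.lower() in PII_KEYS and isinstance(val, str):
            out[key] = f"pii:{pseudonymize(val)}"
        else:
            out[key] = val
    return out

raw = {"method": "POST", "path": "/checkout", "body": {"email": "a@example.com", "amount": 4999}}
print(json.dumps(redact_event(raw)))   # the email is replaced by a keyed hash
```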
Prompt Grounding: How to Make an LLM a Good Debugger
Most “AI debugging” failures come from handing a model a pile of logs and hoping for insight. Instead, give it tools, context windows, and rules.
- Represent traces as structured events with stable IDs. Store code maps (file:line to symbol) and build an embedded index.
- Let the model call tools to fetch windows around events, stack traces, heap stats, or diff two traces. Avoid dumping everything into the prompt.
- Require citations: every claim must reference event IDs and code locations. The orchestrator validates references.
- Constrain outputs: patches must compile, tests must pass locally under replay.
Event schema example:
json{ "event_id": "jfr:645821", "ts": "2025-01-03T12:34:56.540Z", "type": "JavaMonitorEnter", "attrs": { "class": "com.acme.payment.Checkout", "monitor": "java.lang.Object@0x7f483...", "thread": "http-nio-8443-exec-17", "stack": [ {"fn": "Checkout.reserveInventory", "file": "Checkout.java", "line": 214}, {"fn": "Inventory.lock", "file": "Inventory.java", "line": 78} ] }, "span_id": "3c41..." }
Tool interfaces (function calls) the LLM can use:
json[ { "name": "get_events", "description": "Fetch events by time or ID range with filters", "parameters": { "type": "object", "properties": { "filter": {"type": "string"}, "limit": {"type": "integer"} }, "required": ["filter"] } }, { "name": "get_stack", "description": "Get resolved stack for an event", "parameters": {"type": "object", "properties": {"event_id": {"type": "string"}}, "required": ["event_id"]} }, { "name": "diff_traces", "description": "Compare two bundles and list deltas in event frequency, latency, and resource usage", "parameters": {"type": "object", "properties": {"left": {"type": "string"}, "right": {"type": "string"}}, "required": ["left", "right"]} }, { "name": "open_file", "description": "Read file content at a specific revision", "parameters": {"type": "object", "properties": {"repo": {"type": "string"}, "rev": {"type": "string"}, "path": {"type": "string"}}, "required": ["rev", "path"]} }, { "name": "run_replay_test", "description": "Run rr or cassette‑backed test and return exit code, logs, and key metrics", "parameters": {"type": "object", "properties": {"bundle_id": {"type": "string"}, "test": {"type": "string"}}, "required": ["bundle_id"]} } ]
Prompt scaffold for the orchestrator (conceptual):
- Task: Diagnose incident X using bundle Y. Don’t guess; fetch events.
- Rules: Cite event_id and file:line for claims. Don’t invent paths.
- Output: 1) Root cause hypothesis with citations; 2) Minimal patch; 3) Replay test name; 4) Confidence; 5) Open questions
This keeps the model’s reasoning grounded and auditable without requiring verbose chain‑of‑thought output.
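A sketch of the orchestrator loop, with a hypothetical call_llm client and stubbed tools: the point is the shape of the loop, with bounded tool use and citation validation against the trace index before an answer is accepted. None of the names here are from a specific vendor SDK.

```python
import json, re

# Stub tools mirroring the tool definitions above; real implementations query the indexer.
def get_events(filter: str, limit: int = 50): ...
def get_stack(event_id: str): ...
def diff_traces(left: str, right: str): ...
def open_file(rev: str, path: str, repo: str = ""): ...
def run_replay_test(bundle_id: str, test: str = ""): ...

TOOLS = {f.__name__: f for f in (get_events, get_stack, diff_traces, open_file, run_replay_test)}

def call_llm(messages, tools):
    """Hypothetical model client: returns {"tool": name, "args": {...}} or {"final": text}."""
    raise NotImplementedError

def diagnose(bundle_id: str, incident: str, known_event_ids: set) -> str:
    messages = [
        {"role": "system", "content": "Diagnose using the bundle. Cite event_id and file:line. Do not guess."},
        {"role": "user", "content": f"Incident: {incident}. Bundle: {bundle_id}."},
    ]
    for _ in range(20):                                    # bounded tool-use loop
        reply = call_llm(messages, TOOLS)
        if "tool" in reply:
            result = TOOLS[reply["tool"]](**reply["args"])
            messages.append({"role": "tool", "content": json.dumps(result)})
            continue
        answer = reply["final"]
        cited = set(re.findall(r"(?:jfr|rr|ebpf):\d+", answer))
        if cited and cited <= known_event_ids:             # every citation must resolve in the index
            return answer
        messages.append({"role": "user", "content": "Citations missing or unresolved; cite real event IDs."})
    raise RuntimeError("no grounded diagnosis produced")
```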
CI Gating with Replay
Stop merging changes that fail replay. Add two gates:
- Failing test → Mandatory replay bundle
  - If a unit/integration test fails in CI, save the rr trace (for native) or the input cassette + JFR (for JVM) and attach to the job artifacts.
  - A PR cannot merge unless the replay test passes or the bundle is attached with a linked issue.
- Golden trace diffs
  - Maintain baseline traces for critical flows (checkout, auth) using synthetic traffic.
  - On PR, run the flows, capture minimal traces, and diff against baseline. Flag regressions in:
    - Syscall counts and types
    - Locks held and contention
    - Allocation volume per request
    - Network endpoints contacted
    - Latency distributions
Example GitHub Actions step:
```yaml
- name: Run flaky test under rr
  run: |
    set -e
    rr record --output-trace-dir traces/rr/ci-${{ github.run_id }} \
      ./bazel-bin/tests/order_test --gtest_filter=FlakyCase
    rr pack traces/rr/ci-${{ github.run_id }}
- name: Build replay bundle
  run: |
    python tools/build_bundle.py --rr traces/rr/ci-${{ github.run_id }} --out dist/bundle.tar.gz
- name: Verify replay
  run: |
    python tools/replay.py dist/bundle.tar.gz --test tests/order_test:FlakyCase
```
Trace diffs can be enforced with thresholds. For example, a growth of more than 20% in allocations, or any newly contacted outbound domain, requires manual approval.
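A sketch of such a gate, assuming each bundle's indexer emits a small JSON rollup; the field names here are hypothetical.

```python
import json, sys

def load(path):
    with open(path) as f:
        return json.load(f)

baseline, candidate = load(sys.argv[1]), load(sys.argv[2])

failures = []
if candidate["allocations_per_req"] > 1.2 * baseline["allocations_per_req"]:
    failures.append("allocations per request grew by more than 20%")
new_domains = set(candidate["outbound_domains"]) - set(baseline["outbound_domains"])
if new_domains:
    failures.append(f"new outbound domains contacted: {sorted(new_domains)}")

if failures:
    print("Golden trace diff failed:\n- " + "\n- ".join(failures))
    sys.exit(1)
print("Golden trace diff within thresholds")
```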
How Replay Slashes Flaky, Prod‑Only Bug Hunts
Replay shortens the feedback loop from hours/days to minutes:
- Flaky tests: run them under rr in a loop until failure, then keep the trace (see the sketch after this list). Replay deterministically to bisect code changes and add a minimal sleep/reschedule to make the failure inevitable.
- Data races: set reverse watchpoints on the corrupted memory. rr shows the last write that changed a value, even if it happened seconds earlier and on another thread.
- Prod‑only ordering bugs: replicate the exact order of messages with a network cassette. If needed, use a scheduler shim that serializes event delivery to match the trace.
- GC/lock pathologies: JFR shows the contentious monitors, safepoints, and object churn. Add a test that constructs the same object graph and configuration; your cassette ensures the inputs match.
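The flaky‑test loop from the first bullet takes only a few lines; this sketch reuses the Bazel test target from the GitHub Actions example above, which is illustrative rather than prescriptive.

```python
import itertools, shutil, subprocess

# Re-run the flaky test under rr until it fails; keep only the failing trace.
for attempt in itertools.count(1):
    trace_dir = f"traces/rr/flaky-{attempt}"
    result = subprocess.run(
        ["rr", "record", "--output-trace-dir", trace_dir,
         "./bazel-bin/tests/order_test", "--gtest_filter=FlakyCase"])
    if result.returncode != 0:
        subprocess.check_call(["rr", "pack", trace_dir])   # make the failing trace portable
        print(f"Captured failing run after {attempt} attempts: {trace_dir}")
        break
    shutil.rmtree(trace_dir, ignore_errors=True)           # discard passing runs
```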
A composite example:
- An order occasionally double‑charges in production. OTel shows the trace ID; SLO burn triggers a capture.
- eBPF records outbound calls to payment gateway and DB queries; JFR records lock contentions in Checkout.reserveInventory.
- Bundle builder creates a cassette for the exact HTTP calls and SQL queries. Secrets and PII are hashed.
- Locally, you run the replay test. The LLM requests event windows and notices two threads entering the same monitor in alternating order, with a non‑atomic increment on a shared counter.
- The LLM proposes a minimal patch: replace the AtomicInteger get()+set pair with getAndIncrement(), add a lock‑order guard, and add a regression test that reproduces the interleaving.
- Gatekeeper runs replay; passes. Merge.
No heroics, no “can you repro?” ping‑pong.
Implementation: Minimal but End‑to‑End
Bundle builder (Python) sketch:
```python
# tools/build_bundle.py
import argparse, json, shutil, tarfile, time, uuid
from pathlib import Path

parser = argparse.ArgumentParser()
parser.add_argument('--rr', help='Path to rr trace dir')
parser.add_argument('--jfr', help='Path to JFR file')
parser.add_argument('--ebpf', help='Path to eBPF parquet')
parser.add_argument('--network', help='Path to HTTP cassette dir')
parser.add_argument('--db', help='Path to SQL cassette')
parser.add_argument('--out', required=True)
args = parser.parse_args()

bundle_dir = Path('dist/bundle_' + str(int(time.time())))
bundle_dir.mkdir(parents=True, exist_ok=True)

manifest = {
    "version": 1,
    "bundle_id": str(uuid.uuid4()),
    "created_at": time.strftime('%Y-%m-%dT%H:%M:%SZ', time.gmtime()),
    "recorders": [],
    "inputs": {},
    "privacy": {"schema": "v1"},
}

# Copy each recorder artifact into the bundle and register it in the manifest
if args.rr:
    shutil.copytree(args.rr, bundle_dir / 'traces/rr', dirs_exist_ok=True)
    manifest["recorders"].append({"kind": "rr", "rr_trace_dir": "traces/rr"})
if args.jfr:
    (bundle_dir / 'traces/jfr').mkdir(parents=True, exist_ok=True)
    shutil.copy2(args.jfr, bundle_dir / 'traces/jfr/incident.jfr')
    manifest["recorders"].append({"kind": "jfr", "jfr_file": "traces/jfr/incident.jfr"})
if args.ebpf:
    (bundle_dir / 'traces/ebpf').mkdir(parents=True, exist_ok=True)
    shutil.copy2(args.ebpf, bundle_dir / 'traces/ebpf/profile.parquet')
    manifest["recorders"].append({"kind": "ebpf", "bpf_profile": "traces/ebpf/profile.parquet"})

# Input cassettes make higher-level (non-rr) replays deterministic
if args.network:
    shutil.copytree(args.network, bundle_dir / 'inputs/http_vcr', dirs_exist_ok=True)
    manifest["inputs"]["network_cassette"] = "inputs/http_vcr/"
if args.db:
    (bundle_dir / 'inputs').mkdir(parents=True, exist_ok=True)
    shutil.copy2(args.db, bundle_dir / 'inputs/sql_vcr.sqlite')
    manifest["inputs"]["db_cassette"] = "inputs/sql_vcr.sqlite"

# JSON is a subset of YAML, so the manifest stays both machine- and human-readable
(bundle_dir / 'trace_bundle.yaml').write_text(json.dumps(manifest, indent=2))

with tarfile.open(args.out, 'w:gz') as tar:
    tar.add(bundle_dir, arcname='bundle')
print('Wrote', args.out)
```
Replay runner sketch:
```python
# tools/replay.py
import argparse, json, os, subprocess, tarfile, tempfile
from pathlib import Path

parser = argparse.ArgumentParser()
parser.add_argument('bundle')
parser.add_argument('--test', help='Optional test target name')
args = parser.parse_args()

work = Path(tempfile.mkdtemp())
with tarfile.open(args.bundle, 'r:gz') as tar:
    tar.extractall(work)
root = work / 'bundle'
manifest = json.loads((root / 'trace_bundle.yaml').read_text())

rr = next((r for r in manifest['recorders'] if r['kind'] == 'rr'), None)
if rr:
    trace_dir = root / rr['rr_trace_dir']
    print('Replaying rr trace at', trace_dir)
    # This launches a gdb session; for automation, use "rr replay -a" with a scripted gdb
    subprocess.check_call(['rr', 'replay', str(trace_dir)])
else:
    print('No rr trace; running cassette-backed test')
    # Load cassette env vars and run the test command if provided
    if args.test:
        env = os.environ.copy()
        env['HTTP_CASSETTE_DIR'] = str(root / 'inputs/http_vcr')
        env['SQL_CASSETTE'] = str(root / 'inputs/sql_vcr.sqlite')
        subprocess.check_call([args.test], env=env)
```
Trace indexing and summarization: build a compact NDJSON of key events with token‑friendly summaries. For JFR, use jfr2flame or Mission Control APIs; for eBPF, store Parquet and build rollups (counts per syscall, latency histograms, endpoints touched). Expose a simple query language inspired by Tempo’s TraceQL or Parca’s PQL.
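A sketch of the rollup step, assuming indexed events use the NDJSON schema shown earlier with hypothetical "syscall" and "connect" event types; the output feeds diff_traces and the LLM's query tools.

```python
import json
from collections import Counter

def build_rollup(ndjson_path: str) -> dict:
    """Compact per-bundle summary consumed by diff_traces and the trace index."""
    syscall_counts, endpoints = Counter(), set()
    with open(ndjson_path) as f:
        for line in f:
            event = json.loads(line)
            if event["type"] == "syscall":
                syscall_counts[event["attrs"]["name"]] += 1
            elif event["type"] == "connect":
                endpoints.add(event["attrs"]["daddr"])
    return {"syscalls": dict(syscall_counts), "outbound_endpoints": sorted(endpoints)}
```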
Performance: Keeping Overhead and Cost in Check
- rr: Don’t run 24/7 in prod unless you accept the overhead; instead, attach on demand or run in high‑fidelity staging. In CI and during canary rollouts, rr is completely viable.
- JFR: Continuous low‑overhead profile is fine; switch to diagnostic when an incident triggers.
- eBPF: Design for low overhead—prefer tracepoints, use BPF ring buffers, and apply filters early in BPF. On busy hosts, sample or restrict to target processes.
- Storage: Use zstd compression and chunking. Clean up aggressively by retaining only the last N minutes around incidents.
Empirical expectations (always validate for your workload):
- rr: 1.2–3× overhead in CPU‑bound scenarios; I/O heavy apps can see more due to syscall interposition.
- JFR: sub‑percent overhead in profile mode; 2–5% in diagnostic mode with many events.
- eBPF: sub‑percent if you only capture lightweight events, 1–2% if you add more probes or do heavy map operations.
Gotchas and Practical Advice
- Kernel pinning for rr: Replaying on a very different kernel/microarchitecture can fail; ship a replay VM image (e.g., Firecracker with pinned kernel) with the bundle.
- JIT and rr: Consider lowering JIT tiers or disabling tiered compilation during capture; document how to reproduce flags in the bundle.
- Time abstraction: Introduce an injectable clock interface in your codebase. It’s the single most valuable change for deterministic tests.
- Network determinization: Put a proxy (e.g., Toxiproxy) between your service and external endpoints. It simplifies building cassettes by centralizing capture.
- DB determinization: Use read‑only snapshots or a statement‑level VCR; avoid depending on wall‑clock timestamps in queries.
- Token budgets: Don’t feed huge traces to the model. Use retrieval: the model asks for the window around the suspicious event, not the entire session.
- Security posture: Don’t let the LLM’s tools reach the internet during replay; the whole point is to keep inputs fixed.
Example: End‑to‑End from Incident to Patch
- Alert: P99 latency spike. Gatekeeper triggers capture: JFR switches to diagnostic; eBPF starts syscall capture; a suspect instance in the canary pool is restarted under rr.
- Incident hits: latency correlates with inventory lock contention. JFR shows JavaMonitorEnter spikes; eBPF shows increased fsync and a change in retry policy contacting Redis.
- Bundle built: includes JFR file, eBPF Parquet, and network/DB cassettes. Secrets redacted.
- LLM session: the agent requests event windows where lock hold time > 100 ms, correlates to a code path added in the PR introducing a non‑fair ReentrantLock around a cache warmup path.
- Patch: guard warmup behind tryLock with timeout and move it off the request path; add test that replays the cassette and asserts no lock holds > 10 ms. CI gate passes.
Total engineer time: an hour of reading citations rather than days of fishing in logs.
Why This Matters: The Debug AI You Actually Want
If you want a model to be helpful, you can’t hand it nondeterminism. A good Debug AI is:
- Deterministic: every suggestion can be replayed and verified
- Grounded: every claim cites trace event IDs and code lines
- Actionable: outputs patches that compile and tests that reproduce
Record‑replay traces—rr for deep time travel, JFR for runtime truth, eBPF for system context—are the backbone. They make your model stop guessing and start proving.
Quick Checklist
- Add an injectable clock to your code
- Set up JFR continuous profile and an incident preset
- Build a small eBPF collector for syscalls and network connects with at‑source redaction
- Add rr to CI for flaky tests and to staging for high‑risk rollouts
- Define a replay bundle manifest and a builder script
- Build a minimal LLM orchestration layer with tools: get_events, get_stack, diff_traces, run_replay_test
- Gate merges on replay and golden trace diffs
Ship it, and stop chasing ghosts. With determinism and grounded prompts, your Debug AI won’t hallucinate—it will explain, reproduce, and fix.
