Make Your Code Debug AI Deterministic with Record‑Replay Traces
You can’t ask an LLM to reason effectively about a bug that isn’t reproducible. The hardest incidents are the ones that happen once in production, never under a debugger, and defy log-driven forensics: data races, clock skew, request interleavings, network heisenbugs, GC pauses with unlucky object graphs. Developers know the trend line: more concurrency, more distributed state, more ephemeral environments, less determinism.
The fix is not more clever prompting; it’s determinism. Make the world repeatable and the analysis becomes a bounded problem. Record–replay systems do exactly that: capture the inputs and scheduling of a failing execution, then replay it bit-for-bit so you can time‑travel and ask precise questions. Combine that with a code‑focused LLM and you get a Debug AI with superpowers: grounded, cite‑able, and fast.
In this article, we build an opinionated pipeline for time‑travel debugging with LLMs using rr (low‑level user‑space record/replay), JFR (Java Flight Recorder), and eBPF (kernel‑level event capture). We cover:
- What determinism means in practice and which tools deliver it
- How to capture traces with acceptable overhead in prod and CI
- Privacy‑first redaction and data governance for traces
- Prompt grounding patterns that keep the model from hallucinating
- CI gating with replay bundles and golden trace diffs
- Why replay slashes the cycle time on flaky and prod‑only bug hunts
The target audience is engineers who ship production services and want the shortest path from "we saw it once" to a minimal patch and a regression test.
Determinism 101: Replaying Reality
Nondeterminism has a few major sources:
- Time: clocks, timers, scheduling quanta
- Randomness: PRNG seeds, entropy sources
- Concurrency: thread interleavings, races, lock timings
- I/O: network reordering, kernel buffering, filesystem latency
- Environment: CPU features, ASLR, JIT compilation, container topology
Record‑replay techniques make execution repeat by observing and later re‑feeding a minimal set of inputs:
- System call streams and timing hints
- Thread scheduling and signals
- Network packets and file contents
- Runtime events (allocation, GC, safepoints)
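To make the "re‑feeding inputs" idea concrete, here is a minimal, hypothetical sketch (not any of the tools below): nondeterministic values are written to a cassette on the first run and read back verbatim on replay, so the code under test sees identical time and randomness. The cassette path and class are illustrative only.

```python
# Minimal sketch: record nondeterministic inputs on the first run,
# re-feed them verbatim on replay.
import json, random, time
from pathlib import Path

CASSETTE = Path("inputs/time_random.ndjson")  # hypothetical path

class InputCassette:
    def __init__(self, record: bool):
        self.record = record
        self.cursor = 0
        self.events = [] if record else [
            json.loads(line) for line in CASSETTE.read_text().splitlines()
        ]

    def _next(self, kind, produce):
        if self.record:
            value = produce()
            self.events.append({"kind": kind, "value": value})
            return value
        event = self.events[self.cursor]
        self.cursor += 1
        assert event["kind"] == kind, "replay diverged from the recording"
        return event["value"]

    def now(self) -> float:            # use instead of time.time()
        return self._next("time", time.time)

    def rand(self) -> float:           # use instead of random.random()
        return self._next("random", random.random)

    def save(self):
        CASSETTE.parent.mkdir(parents=True, exist_ok=True)
        CASSETTE.write_text("\n".join(json.dumps(e) for e in self.events))
```

Instruction‑accurate recorders such as rr do the same thing at the syscall and scheduling level, so application code doesn't have to change.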
Three tiers of tooling matter in practice:
- Instruction‑accurate: rr, Undo LiveRecorder, Pernosco (service on top of rr). You get step‑backwards debugging and watchpoints that work.
- Runtime/event‑level: JFR (JVM), .NET IntelliTrace/dotnet-trace, Python profilers, Go execution traces, JavaScript V8 traces. These don’t reproduce the exact CPU state but give precise event timelines with stack traces.
- System‑level: eBPF, LTTng, ftrace, perf, Intel PT. These cover syscalls, scheduling, and kernel events, and can be joined to application traces.
No single tool covers everything. So the pragmatic approach is to capture enough to:
- Reproduce failure locally or in CI (rr or process‑level replayers)
- Explain high‑level behavior to a model with event sequences (JFR/eBPF/OTel)
- Provide ground truth for the model’s citations (trace IDs and stack frames)
Tool Landscape and Tradeoffs
- rr (Linux x86‑64): Records user‑space execution, scheduler decisions, and syscalls using performance counters. Replay is deterministic with single‑core scheduling and the same kernel version family. Overhead is typically 1.2–3×, depending on workload. Pros: time travel, reverse watchpoints, works with gdb and IDEs via Pernosco. Cons: needs Linux/x86‑64, JIT code can be tricky, single‑core replay semantics.
- Undo LiveRecorder: Commercial alternative for C/C++ with enterprise integrations and low overhead. Especially useful for long‑running services and core dumps.
- JFR (Java Flight Recorder): Built into the JVM. Event‑level capture with low overhead (<1–2% in continuous mode for common configurations). Good for method samples, allocations, GC, lock contention. Does not deterministically replay instruction execution, but when paired with input cassettes (recorded network/database responses), it gets close to reproduction.
- .NET: dotnet-trace/dotnet-counters, IntelliTrace (historical). Similar to JFR conceptually.
- eBPF: Kernel‑resident programs (CO‑RE) that hook syscalls, tracepoints, uprobes, and kprobes. Very low overhead when used with care; excellent for system‑level forensics and multi‑language shops. Great for stitching together process and network behavior without modifying app code.
- OpenTelemetry: Not replay, but critical for correlating spans to traces. We’ll join OTel span IDs to replay bundles so the LLM can jump from a production incident to the exact execution timeline.
- Bonus signals: perf (CPU samples), Intel PT (branch tracing), LTTng/ftrace (kernel trace streams). These can be attached for deep performance incidents.
Opinion: if you can use rr for the hot path where the failure manifests, you should. It’s the smallest hammer that nails heisenbugs. In polyglot microservices, supplement with JFR/eBPF to cover everything else and to run continuously at low overhead.
Architecture: A Deterministic Debugging Pipeline
We’ll assemble a pipeline with explicit, auditable boundaries. The guiding principles:
- Capture only when needed, and only what you need
- Redact at source, not downstream
- Build a self‑contained replay bundle with environment and inputs
- Let the LLM ask for more context via tools; don’t pre‑stuff it with entire traces
Components:
- Capture Agent: rr/JFR/eBPF collectors running in prod, staging, and CI
- Scrubber: In‑kernel and user‑space redaction that enforces PII policy
- Bundle Builder: Packages trace + environment descriptors + input cassettes
- Replay Runner: Hermetic runtime (container/VM) to replay with rr or re‑feed cassettes
- Trace Indexer: Builds searchable summaries, event windows, and joins to code maps
- LLM Orchestrator: Tool‑enabled agent that queries the index, proposes fixes, and cites events
- Gatekeeper: CI rules that require replayable artifacts and trace diffs before merge
A minimal target is a "replay bundle" that can run under rr or re‑feed recorded inputs for higher‑level runtimes, plus a JSON index the LLM can query.
Replay Bundle Manifest
Use a simple manifest that a human can inspect and a machine can enforce:
```yaml
# trace_bundle.yaml
version: 1
bundle_id: 2cbe4f7c-b52f-4b60-9b19-fb2c1a4b8e4b
created_at: 2025-01-03T12:34:56Z
origin:
  env: production
  region: us-east-1
  service: payments
  revision: git:3c1ab73
  oci_image: ghcr.io/acme/payments@sha256:8a5...
  otel_trace_id: 8f2d67e7a6...
recorders:
  - kind: rr
    kernel: 6.6.7-200.fc39.x86_64
    rr_version: 5.7.0
    command: ["/app/bin/payments", "--serve"]
    rr_trace_dir: traces/rr/trace-0001
  - kind: jfr
    jfr_file: traces/jfr/incident.jfr
    jvm: openjdk-21
  - kind: ebpf
    bpf_profile: traces/ebpf/profile.parquet
inputs:
  time_stream: inputs/time.ndjson
  random_seed: 12903812098
  network_cassette: inputs/http_vcr/   # HAR/WARC files
  db_cassette: inputs/sql_vcr.sqlite
env:
  TZ: UTC
  FEATURE_FLAGS: checkout_v2=false
privacy:
  schema: v1
  pii_hash_salt_kdf: argon2id:saltref:secrets/kms_salt
  redaction:
    - path: inputs/http_vcr/**
      ruleset: gdpr_default
attestations:
  - type: slsa_provenance
    file: attestations/slsa.json
  - type: sbom
    file: attestations/sbom.spdx.json
```
The manifest ties together the trace and the environment. The rr section points to a trace directory; the JFR file captures JVM events; eBPF entries cover syscalls and network flow. Input cassettes record time, random seed, network responses, and database results needed to make replay consistent.
Storage and Cost
Raw traces can be large. Expect representative ballparks:
- rr: 50–500 MB per minute under moderate I/O; zstd -19 helps a lot
- JFR: tens of MB/min under low‑overhead profiles; more in diagnostic mode
- eBPF: depends entirely on event selection and sampling; aim for <5 MB/min per host when idling
You don’t capture continuously at full fidelity. Trigger on anomalies:
- On 5xx spikes or SLO burn rate
- On panic/signal crashes
- On anomaly detectors (outlier latencies, memory growth without release)
- During CI when tests fail or when coverage is below a threshold
Use ring buffers with spill‑to‑disk on trigger. Retain minimal windows (e.g., last 60 seconds around incident) and keep only the sessions with a confirmed bug ID.
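As a sketch of trigger‑based capture, the loop below polls an error‑rate signal and dumps the running JFR recording when the threshold is breached. `get_error_rate()`, the target PID, and the recording name "prod" are assumptions for illustration (the continuous recording is configured in the JFR section below).

```python
import subprocess, time

TARGET_PID = 12345              # hypothetical JVM pid
ERROR_RATE_THRESHOLD = 0.05     # trigger on a 5% 5xx rate

def get_error_rate() -> float:
    """Placeholder: read the 5xx rate from your metrics endpoint."""
    raise NotImplementedError

def dump_incident_window():
    # Dump the buffers of an already-running continuous JFR recording named "prod"
    subprocess.check_call([
        "jcmd", str(TARGET_PID), "JFR.dump", "name=prod",
        f"filename=/var/log/incident-{int(time.time())}.jfr",
    ])

while True:
    if get_error_rate() > ERROR_RATE_THRESHOLD:
        dump_incident_window()
        time.sleep(300)         # back off so one incident yields one bundle
    time.sleep(5)
```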
Capturing with rr: Determinism for Native and Many Polyglot Runtimes
rr is the most reliable way to get step‑backwards debugging on Linux/x86‑64. It serializes execution by controlling scheduling, recording syscalls, and using performance counters for progress. For Node, Python, Rust, C/C++, and Go (when not cgo‑heavy), it often just works.
Basic workflow:
```bash
# Install rr on Debian/Ubuntu
sudo apt-get install rr

# Record your service until it crashes or a condition is met
rr record --output-trace-dir ./traces/rr/trace-0001 \
  env TZ=UTC \
  ./bin/service --serve --port 8080

# Note: rr cannot attach to an already-running process. To capture anomalies on
# demand, launch the suspect instance under rr (e.g., in a canary pool) and keep
# the trace only when the anomaly fires.
```
Replay and inspect:
```bash
rr replay ./traces/rr/trace-0001
# Inside the rr gdb session:
# - reverse-continue (rc), reverse-step (rs) work as expected
# - watch -l *ptr sets watchpoints that also work backwards
```
Python glue to gate rr collection on errors:
```python
import os, subprocess

# Run the service under rr; on a non-zero exit, keep the trace for bundling.
proc = subprocess.Popen(["rr", "record", "./bin/service"],
                        env={**os.environ, "TZ": "UTC"})
try:
    proc.wait()
finally:
    if proc.returncode not in (0,):
        # Promote trace to bundle; "rr pack" makes it self-contained and portable
        subprocess.check_call(["rr", "pack"])
```
Notes and gotchas:
- rr prefers to run the target on a single hardware thread during replay. That’s fine for correctness triage; for perf, use perf/Intel PT separately.
- JIT‑heavy code (e.g., JVM, V8 with aggressive JIT) can require flags to disable or pin JIT behavior. Node often works; the JVM is better served by JFR + input cassettes.
- Kernel and microarchitecture matter. Use the same kernel family for replay as record. Package a replay VM image to eliminate drift.
JVM: JFR for Event‑Level Ground Truth
JFR captures detailed runtime events: method samples, allocations, GC pauses, monitor enter/exit, thread parks, I/O, and even custom user events. While it’s not instruction‑level replay, in practice you can reconstruct the timeline well enough to reproduce with the right input cassettes.
Continuous low‑overhead config:
```bash
JAVA_TOOL_OPTIONS="-XX:StartFlightRecording=name=prod,settings=profile,dumponexit=true,filename=/var/log/incident.jfr"
```
You can also control it at runtime:
```bash
jcmd <pid> JFR.start name=incident settings=profile delay=0s filename=/tmp/incident.jfr
# ...wait for anomaly...
jcmd <pid> JFR.dump name=incident filename=/tmp/incident.jfr
jcmd <pid> JFR.stop name=incident
```
JFR pairs well with:
- HTTP VCR: record raw HTTP requests/responses as HAR/WARC; re‑feed them in tests
- SQL VCR: intercept JDBC and log parameterized queries with deterministic responses
- Clock control: abstract time via an injectable clock so you can rewind or freeze it under test
With these, you can write a reproducible unit/integration test that mimics the production failure while the JFR file gives the model/verifier the event citations.
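To illustrate the cassette pattern, here is a sketch using Python's vcrpy library as a stand‑in for whatever HAR/WARC‑based HTTP recorder you use on the JVM; the endpoint, cassette file, and response field are hypothetical.

```python
import requests
import vcr

# record_mode="once": hit the real service on the first run, replay the cassette after
my_vcr = vcr.VCR(cassette_library_dir="inputs/http_vcr",
                 record_mode="once",
                 filter_headers=["authorization"])   # redact secrets at capture time

def test_checkout_reproduces_incident():
    with my_vcr.use_cassette("checkout_incident.yaml"):
        resp = requests.post("https://payments.internal/checkout",   # hypothetical endpoint
                             json={"order_id": "o-123", "amount_cents": 4999})
        assert resp.status_code == 200
        assert resp.json()["charged_once"] is True                   # hypothetical field
```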
eBPF: System‑Level Context without App Changes
eBPF lets you attach to syscalls and tracepoints with minimal overhead. Use libbpf CO‑RE to compile once and run everywhere (with BTF metadata). Capture only what you need and redact at the source.
Example: capture connect attempts and DNS answers to build the network cassette.
Kernel program (C, CO‑RE):
```c
// connect.bpf.c
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>
#include <bpf/bpf_endian.h>

struct conn_event {
    __u64 ts_ns;
    __u32 pid;
    __u32 saddr_v4;
    __u32 daddr_v4;
    __u16 dport;
    char comm[16];
};

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 1 << 24);
} rb SEC(".maps");

SEC("kprobe/tcp_v4_connect")
int BPF_KPROBE(kp_tcp_v4_connect, struct sock *sk)
{
    struct conn_event *e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0);
    if (!e)
        return 0;

    e->ts_ns = bpf_ktime_get_ns();
    e->pid = bpf_get_current_pid_tgid() >> 32;
    BPF_CORE_READ_INTO(&e->saddr_v4, sk, __sk_common.skc_rcv_saddr);
    BPF_CORE_READ_INTO(&e->daddr_v4, sk, __sk_common.skc_daddr);

    __u16 dport_be;
    BPF_CORE_READ_INTO(&dport_be, sk, __sk_common.skc_dport);
    e->dport = __bpf_ntohs(dport_be);

    bpf_get_current_comm(&e->comm, sizeof(e->comm));
    bpf_ringbuf_submit(e, 0);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```
User‑space loader (Go + cilium/ebpf or Rust + libbpf‑rs) subscribes to the ring buffer and writes Parquet/NDJSON with PII redaction.
Why eBPF here?
- You get system context (who connected to what, when) without changing every service
- You can correlate PIDs to containers, pods, and OTel trace IDs via process‑to‑span mapping
- You can build minimal network/FS cassettes for replay
Be disciplined: limit probes, prefer tracepoints over kprobes when available, and sample aggressively under load.
Privacy and Governance: Redact at Source, Attest Everywhere
Traces can contain PII/PHI. Don’t ship raw payloads to your LLM index. Adopt these guardrails:
- Classify fields at the source. Maintain a schema that lists which fields can contain PII (headers, query params, JSON keys). Pair with a rule engine.
- Redact in kernel/user space before disk. For eBPF, don’t capture payload by default. For app‑level cassettes, tokenize with a reversible vault only when absolutely necessary.
- Pseudonymize identifiers with salted, keyed hashing (e.g., HMAC‑SHA256, or argon2id for high‑value identifiers). Store salts in KMS and record only references in the bundle, never the key.
- Encrypt at rest with tenant keys and envelope encryption; limit retention (e.g., 14–30 days).
- Add SLSA provenance and SBOMs to bundles to trace supply chain and tooling build provenance.
- RBAC for replay: not all engineers need all bundles; gate by service ownership.
Privacy isn’t optional; you won’t get sign‑off from security otherwise.
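A minimal sketch of at‑source pseudonymization along these lines; the HMAC key would be resolved from KMS at runtime and the field classification would come from your schema (both are placeholders here).

```python
import hashlib, hmac, json

HMAC_KEY = b"resolved-from-kms-at-runtime"            # placeholder; never hard-code keys
PII_KEYS = {"email", "card_number", "phone", "ssn"}   # from your field classification schema

def pseudonymize(value: str) -> str:
    return hmac.new(HMAC_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]

def redact_event(event: dict) -> dict:
    """Replace classified fields before the event is written to disk."""
    out = {}
    for key, val in event.items():
        if isinstance(val, dict):
            out[key] = redact_event(val)
        elif key.lower() in PII_KEYS and isinstance(val, str):
            out[key] = f"pii:{pseudonymize(val)}"
        else:
            out[key] = val
    return out

raw = {"method": "POST", "path": "/checkout", "body": {"email": "a@example.com", "amount": 4999}}
print(json.dumps(redact_event(raw)))   # the email is replaced by a keyed hash
```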
Prompt Grounding: How to Make an LLM a Good Debugger
Most “AI debugging” failures come from handing a model a pile of logs and hoping for insight. Instead, give it tools, context windows, and rules.
- Represent traces as structured events with stable IDs. Store code maps (file:line to symbol) and build an embedded index.
- Let the model call tools to fetch windows around events, stack traces, heap stats, or diff two traces. Avoid dumping everything into the prompt.
- Require citations: every claim must reference event IDs and code locations. The orchestrator validates references.
- Constrain outputs: patches must compile, tests must pass locally under replay.
Event schema example:
json{ "event_id": "jfr:645821", "ts": "2025-01-03T12:34:56.540Z", "type": "JavaMonitorEnter", "attrs": { "class": "com.acme.payment.Checkout", "monitor": "java.lang.Object@0x7f483...", "thread": "http-nio-8443-exec-17", "stack": [ {"fn": "Checkout.reserveInventory", "file": "Checkout.java", "line": 214}, {"fn": "Inventory.lock", "file": "Inventory.java", "line": 78} ] }, "span_id": "3c41..." }
Tool interfaces (function calls) the LLM can use:
json[ { "name": "get_events", "description": "Fetch events by time or ID range with filters", "parameters": { "type": "object", "properties": { "filter": {"type": "string"}, "limit": {"type": "integer"} }, "required": ["filter"] } }, { "name": "get_stack", "description": "Get resolved stack for an event", "parameters": {"type": "object", "properties": {"event_id": {"type": "string"}}, "required": ["event_id"]} }, { "name": "diff_traces", "description": "Compare two bundles and list deltas in event frequency, latency, and resource usage", "parameters": {"type": "object", "properties": {"left": {"type": "string"}, "right": {"type": "string"}}, "required": ["left", "right"]} }, { "name": "open_file", "description": "Read file content at a specific revision", "parameters": {"type": "object", "properties": {"repo": {"type": "string"}, "rev": {"type": "string"}, "path": {"type": "string"}}, "required": ["rev", "path"]} }, { "name": "run_replay_test", "description": "Run rr or cassette‑backed test and return exit code, logs, and key metrics", "parameters": {"type": "object", "properties": {"bundle_id": {"type": "string"}, "test": {"type": "string"}}, "required": ["bundle_id"]} } ]
Prompt scaffold for the orchestrator (conceptual):
- Task: Diagnose incident X using bundle Y. Don’t guess; fetch events.
- Rules: Cite event_id and file:line for claims. Don’t invent paths.
- Output: 1) Root cause hypothesis with citations; 2) Minimal patch; 3) Replay test name; 4) Confidence; 5) Open questions
This keeps the model’s reasoning grounded and auditable without requiring verbose chain‑of‑thought output.
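A sketch of the orchestrator loop, with a hypothetical call_llm client and stubbed tools: the point is the shape of the loop, with bounded tool use and citation validation against the trace index before an answer is accepted. None of the names here are from a specific vendor SDK.

```python
import json, re

# Stub tools mirroring the tool definitions above; real implementations query the indexer.
def get_events(filter: str, limit: int = 50): ...
def get_stack(event_id: str): ...
def diff_traces(left: str, right: str): ...
def open_file(rev: str, path: str, repo: str = ""): ...
def run_replay_test(bundle_id: str, test: str = ""): ...

TOOLS = {f.__name__: f for f in (get_events, get_stack, diff_traces, open_file, run_replay_test)}

def call_llm(messages, tools):
    """Hypothetical model client: returns {"tool": name, "args": {...}} or {"final": text}."""
    raise NotImplementedError

def diagnose(bundle_id: str, incident: str, known_event_ids: set) -> str:
    messages = [
        {"role": "system", "content": "Diagnose using the bundle. Cite event_id and file:line. Do not guess."},
        {"role": "user", "content": f"Incident: {incident}. Bundle: {bundle_id}."},
    ]
    for _ in range(20):                                    # bounded tool-use loop
        reply = call_llm(messages, TOOLS)
        if "tool" in reply:
            result = TOOLS[reply["tool"]](**reply["args"])
            messages.append({"role": "tool", "content": json.dumps(result)})
            continue
        answer = reply["final"]
        cited = set(re.findall(r"(?:jfr|rr|ebpf):\d+", answer))
        if cited and cited <= known_event_ids:             # every citation must resolve in the index
            return answer
        messages.append({"role": "user", "content": "Citations missing or unresolved; cite real event IDs."})
    raise RuntimeError("no grounded diagnosis produced")
```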
CI Gating with Replay
Stop merging changes that fail replay. Add two gates:
- Failing test → Mandatory replay bundle
  - If a unit/integration test fails in CI, save the rr trace (for native) or the input cassette + JFR (for JVM) and attach to the job artifacts.
  - A PR cannot merge unless the replay test passes or the bundle is attached with a linked issue.
- Golden trace diffs
  - Maintain baseline traces for critical flows (checkout, auth) using synthetic traffic.
  - On PR, run the flows, capture minimal traces, and diff against baseline. Flag regressions in:
    - Syscall counts and types
    - Locks held and contention
    - Allocation volume per request
    - Network endpoints contacted
    - Latency distributions
Example GitHub Actions step:
```yaml
- name: Run flaky test under rr
  run: |
    set -e
    rr record --output-trace-dir traces/rr/ci-${{ github.run_id }} \
      ./bazel-bin/tests/order_test --gtest_filter=FlakyCase
    rr pack traces/rr/ci-${{ github.run_id }}
- name: Build replay bundle
  run: |
    python tools/build_bundle.py --rr traces/rr/ci-${{ github.run_id }} --out dist/bundle.tar.gz
- name: Verify replay
  run: |
    python tools/replay.py dist/bundle.tar.gz --test tests/order_test:FlakyCase
```
Trace diffs can be enforced with thresholds. For example, a growth of more than 20% in allocations, or any newly contacted outbound domain, requires manual approval.
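A sketch of such a gate, assuming each bundle's indexer emits a small JSON rollup; the field names here are hypothetical.

```python
import json, sys

def load(path):
    with open(path) as f:
        return json.load(f)

baseline, candidate = load(sys.argv[1]), load(sys.argv[2])

failures = []
if candidate["allocations_per_req"] > 1.2 * baseline["allocations_per_req"]:
    failures.append("allocations per request grew by more than 20%")
new_domains = set(candidate["outbound_domains"]) - set(baseline["outbound_domains"])
if new_domains:
    failures.append(f"new outbound domains contacted: {sorted(new_domains)}")

if failures:
    print("Golden trace diff failed:\n- " + "\n- ".join(failures))
    sys.exit(1)
print("Golden trace diff within thresholds")
```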
How Replay Slashes Flaky, Prod‑Only Bug Hunts
Replay shortens the feedback loop from hours/days to minutes:
- Flaky tests: run them under rr in a loop until failure, then keep the trace (see the sketch after this list). Replay deterministically to bisect code changes and add a minimal sleep/reschedule to make the failure inevitable.
- Data races: set reverse watchpoints on the corrupted memory. rr shows the last write that changed a value, even if it happened seconds earlier and on another thread.
- Prod‑only ordering bugs: replicate the exact order of messages with a network cassette. If needed, use a scheduler shim that serializes event delivery to match the trace.
- GC/lock pathologies: JFR shows the contentious monitors, safepoints, and object churn. Add a test that constructs the same object graph and configuration; your cassette ensures the inputs match.
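The flaky‑test loop from the first bullet takes only a few lines; this sketch reuses the Bazel test target from the GitHub Actions example above, which is illustrative rather than prescriptive.

```python
import itertools, shutil, subprocess

# Re-run the flaky test under rr until it fails; keep only the failing trace.
for attempt in itertools.count(1):
    trace_dir = f"traces/rr/flaky-{attempt}"
    result = subprocess.run(
        ["rr", "record", "--output-trace-dir", trace_dir,
         "./bazel-bin/tests/order_test", "--gtest_filter=FlakyCase"])
    if result.returncode != 0:
        subprocess.check_call(["rr", "pack", trace_dir])   # make the failing trace portable
        print(f"Captured failing run after {attempt} attempts: {trace_dir}")
        break
    shutil.rmtree(trace_dir, ignore_errors=True)           # discard passing runs
```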
A composite example:
- An order occasionally double‑charges in production. OTel shows the trace ID; SLO burn triggers a capture.
- eBPF records outbound calls to payment gateway and DB queries; JFR records lock contentions in Checkout.reserveInventory.
- Bundle builder creates a cassette for the exact HTTP calls and SQL queries. Secrets and PII are hashed.
- Locally, you run the replay test. The LLM requests event windows and notices two threads entering the same monitor in alternating order, with a non‑atomic increment on a shared counter.
- The LLM proposes a minimal patch: replace the AtomicInteger get()+set pair with getAndIncrement(), add a lock‑order guard, and add a regression test that reproduces the interleaving.
- Gatekeeper runs replay; passes. Merge.
No heroics, no “can you repro?” ping‑pong.
Implementation: Minimal but End‑to‑End
Bundle builder (Python) sketch:
```python
# tools/build_bundle.py
import argparse, json, shutil, tarfile, time, uuid
from pathlib import Path

parser = argparse.ArgumentParser()
parser.add_argument('--rr', help='Path to rr trace dir')
parser.add_argument('--jfr', help='Path to JFR file')
parser.add_argument('--ebpf', help='Path to eBPF parquet')
parser.add_argument('--network', help='Path to HTTP cassette dir')
parser.add_argument('--db', help='Path to SQL cassette')
parser.add_argument('--out', required=True)
args = parser.parse_args()

bundle_dir = Path('dist/bundle_' + str(int(time.time())))
bundle_dir.mkdir(parents=True, exist_ok=True)

manifest = {
    "version": 1,
    "bundle_id": str(uuid.uuid4()),
    "created_at": time.strftime('%Y-%m-%dT%H:%M:%SZ', time.gmtime()),
    "recorders": [],
    "inputs": {},
    "privacy": {"schema": "v1"},
}

# Copy each recorder artifact into the bundle and register it in the manifest
if args.rr:
    shutil.copytree(args.rr, bundle_dir / 'traces/rr', dirs_exist_ok=True)
    manifest["recorders"].append({"kind": "rr", "rr_trace_dir": "traces/rr"})
if args.jfr:
    (bundle_dir / 'traces/jfr').mkdir(parents=True, exist_ok=True)
    shutil.copy2(args.jfr, bundle_dir / 'traces/jfr/incident.jfr')
    manifest["recorders"].append({"kind": "jfr", "jfr_file": "traces/jfr/incident.jfr"})
if args.ebpf:
    (bundle_dir / 'traces/ebpf').mkdir(parents=True, exist_ok=True)
    shutil.copy2(args.ebpf, bundle_dir / 'traces/ebpf/profile.parquet')
    manifest["recorders"].append({"kind": "ebpf", "bpf_profile": "traces/ebpf/profile.parquet"})

# Input cassettes make higher-level (non-rr) replays deterministic
if args.network:
    shutil.copytree(args.network, bundle_dir / 'inputs/http_vcr', dirs_exist_ok=True)
    manifest["inputs"]["network_cassette"] = "inputs/http_vcr/"
if args.db:
    (bundle_dir / 'inputs').mkdir(parents=True, exist_ok=True)
    shutil.copy2(args.db, bundle_dir / 'inputs/sql_vcr.sqlite')
    manifest["inputs"]["db_cassette"] = "inputs/sql_vcr.sqlite"

# JSON is a subset of YAML, so the manifest stays both machine- and human-readable
(bundle_dir / 'trace_bundle.yaml').write_text(json.dumps(manifest, indent=2))

with tarfile.open(args.out, 'w:gz') as tar:
    tar.add(bundle_dir, arcname='bundle')
print('Wrote', args.out)
```
Replay runner sketch:
```python
# tools/replay.py
import argparse, json, os, subprocess, tarfile, tempfile
from pathlib import Path

parser = argparse.ArgumentParser()
parser.add_argument('bundle')
parser.add_argument('--test', help='Optional test target name')
args = parser.parse_args()

work = Path(tempfile.mkdtemp())
with tarfile.open(args.bundle, 'r:gz') as tar:
    tar.extractall(work)
root = work / 'bundle'
manifest = json.loads((root / 'trace_bundle.yaml').read_text())

rr = next((r for r in manifest['recorders'] if r['kind'] == 'rr'), None)
if rr:
    trace_dir = root / rr['rr_trace_dir']
    print('Replaying rr trace at', trace_dir)
    # This launches a gdb session; for automation, use "rr replay -a" with a scripted gdb
    subprocess.check_call(['rr', 'replay', str(trace_dir)])
else:
    print('No rr trace; running cassette-backed test')
    # Load cassette env vars and run the test command if provided
    if args.test:
        env = os.environ.copy()
        env['HTTP_CASSETTE_DIR'] = str(root / 'inputs/http_vcr')
        env['SQL_CASSETTE'] = str(root / 'inputs/sql_vcr.sqlite')
        subprocess.check_call([args.test], env=env)
```
Trace indexing and summarization: build a compact NDJSON of key events with token‑friendly summaries. For JFR, use jfr2flame or Mission Control APIs; for eBPF, store Parquet and build rollups (counts per syscall, latency histograms, endpoints touched). Expose a simple query language inspired by Tempo’s TraceQL or Parca’s PQL.
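A sketch of the rollup step, assuming indexed events use the NDJSON schema shown earlier with hypothetical "syscall" and "connect" event types; the output feeds diff_traces and the LLM's query tools.

```python
import json
from collections import Counter

def build_rollup(ndjson_path: str) -> dict:
    """Compact per-bundle summary consumed by diff_traces and the trace index."""
    syscall_counts, endpoints = Counter(), set()
    with open(ndjson_path) as f:
        for line in f:
            event = json.loads(line)
            if event["type"] == "syscall":
                syscall_counts[event["attrs"]["name"]] += 1
            elif event["type"] == "connect":
                endpoints.add(event["attrs"]["daddr"])
    return {"syscalls": dict(syscall_counts), "outbound_endpoints": sorted(endpoints)}
```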
Performance: Keeping Overhead and Cost in Check
- rr: Don’t run 24/7 in prod unless you accept the overhead; instead, attach on demand or run in high‑fidelity staging. In CI and during canary rollouts, rr is completely viable.
- JFR: Continuous low‑overhead profile is fine; switch to diagnostic when an incident triggers.
- eBPF: Design for low overhead—prefer tracepoints, use BPF ring buffers, and apply filters early in BPF. On busy hosts, sample or restrict to target processes.
- Storage: Use zstd compression and chunking. Clean up aggressively by retaining only the last N minutes around incidents.
Empirical expectations (always validate for your workload):
- rr: 1.2–3× overhead in CPU‑bound scenarios; I/O heavy apps can see more due to syscall interposition.
- JFR: sub‑percent overhead in profile mode; 2–5% in diagnostic mode with many events.
- eBPF: sub‑percent if you only capture lightweight events, 1–2% if you add more probes or do heavy map operations.
Gotchas and Practical Advice
- Kernel pinning for rr: Replaying on a very different kernel/microarchitecture can fail; ship a replay VM image (e.g., Firecracker with pinned kernel) with the bundle.
- JIT and rr: Consider lowering JIT tiers or disabling tiered compilation during capture; document how to reproduce flags in the bundle.
- Time abstraction: Introduce an injectable clock interface in your codebase. It’s the single most valuable change for deterministic tests.
- Network determinization: Put a proxy (e.g., Toxiproxy) between your service and external endpoints. It simplifies building cassettes by centralizing capture.
- DB determinization: Use read‑only snapshots or a statement‑level VCR; avoid depending on wall‑clock timestamps in queries.
- Token budgets: Don’t feed huge traces to the model. Use retrieval: the model asks for the window around the suspicious event, not the entire session.
- Security posture: Don’t let the LLM’s tools reach the internet during replay; the whole point is to keep inputs fixed.
Example: End‑to‑End from Incident to Patch
- Alert: P99 latency spike. Gatekeeper triggers capture: JFR switches to diagnostic; eBPF starts syscall capture; a suspect instance in the canary pool is restarted under rr.
- Incident hits: latency correlates with inventory lock contention. JFR shows JavaMonitorEnter spikes; eBPF shows increased fsync and a change in retry policy contacting Redis.
- Bundle built: includes JFR file, eBPF Parquet, and network/DB cassettes. Secrets redacted.
- LLM session: the agent requests event windows where lock hold time > 100 ms, correlates to a code path added in the PR introducing a non‑fair ReentrantLock around a cache warmup path.
- Patch: guard warmup behind tryLock with timeout and move it off the request path; add test that replays the cassette and asserts no lock holds > 10 ms. CI gate passes.
Total engineer time: an hour of reading citations rather than days of fishing in logs.
Why This Matters: The Debug AI You Actually Want
If you want a model to be helpful, you can’t hand it nondeterminism. A good Debug AI is:
- Deterministic: every suggestion can be replayed and verified
- Grounded: every claim cites trace event IDs and code lines
- Actionable: outputs patches that compile and tests that reproduce
Record‑replay traces—rr for deep time travel, JFR for runtime truth, eBPF for system context—are the backbone. They make your model stop guessing and start proving.
Quick Checklist
- Add an injectable clock to your code
- Set up JFR continuous profile and an incident preset
- Build a small eBPF collector for syscalls and network connects with at‑source redaction
- Add rr to CI for flaky tests and to staging for high‑risk rollouts
- Define a replay bundle manifest and a builder script
- Build a minimal LLM orchestration layer with tools: get_events, get_stack, diff_traces, run_replay_test
- Gate merges on replay and golden trace diffs
Ship it, and stop chasing ghosts. With determinism and grounded prompts, your Debug AI won’t hallucinate—it will explain, reproduce, and fix.
