Stop Prompting, Start Instrumenting: Building a Code Debugging AI That Actually Runs Your Program
Software breaks in ways that static linting, static prompts, and polite suggestions cannot fully anticipate. A credible debugging AI must run your program, gather evidence, and operate the tools that engineers reach for when uptime matters—probes, record/replay, fuzzers, and deterministic reproducers. This article proposes a concrete, opinionated design for a code debugging AI that orchestrates eBPF probes, time-travel debugging, coverage-guided tests, and MCP tools to reproduce, isolate, and fix bugs at scale. It’s a blueprint for moving beyond chat and into instrumentation.
TL;DR
- Prompting alone won’t fix production bugs. Instrument your software. Run it. Gather traces.
- Combine eBPF (for low-overhead runtime signals), time-travel debugging (for precise causality), and coverage-guided testing (for repro and regression). Glue them with an MCP tool layer.
- The debugging AI should:
- Reproduce: bootstrap a minimal, deterministic failure.
- Instrument: attach eBPF probes and record execution with rr or equivalent.
- Isolate: reduce inputs and code paths (ddmin + coverage deltas).
- Fix: propose patches informed by execution evidence (not just code text), then validate.
- Verify: run regression and property-based tests; track coverage changes and flake risk.
- Expect measured overhead: eBPF negligible to low, rr 1.2–5x typical, fuzzers variable. Use budgets and policies.
Why Prompt-Only Debugging Falls Short
LLMs are excellent at explaining code and suggesting plausible fixes, but production debugging demands:
- Ground truth: real inputs, real environment, real race conditions, real I/O delays.
- Determinism: a failure you can reliably reproduce and rerun.
- Forensics: low-level signals (syscalls, page faults, lock contention) that don’t show up in static text.
- Causality: the ability to move backward in time to the moment a bad value appeared.
A purely chat-based agent cannot invent this evidence. It must run the code.
Core Principle: Observability-First Debugging
The design premise is simple: an AI that debugs like a senior engineer will instrument the system before theorizing. It will:
- Capture runtime data cheaply and safely (eBPF).
- Construct deterministic reproducers (containers, rr recordings, minimized inputs).
- Observe coverage and drive input generation to traverse untested paths.
- Maintain provenance: exact commit, config, kernel, container image, and tool versions.
The AI becomes less of a chatbot and more of a conductor of tools.
High-Level Architecture
The proposed system comprises these components:
- Orchestrator/Planner (LLM Agent)
  - Decides which tools to run based on signals, logs, and developer prompts.
  - Explains what it did and why, but decisions are grounded in data.
- Sandbox Runner
  - Creates ephemeral, isolated environments (e.g., Firecracker microVMs, Kata Containers, or Docker with seccomp and cgroup policies).
  - Replays production-like workloads safely.
- Probe Manager (eBPF)
  - Attaches kprobes/uprobes/tracepoints/USDT to capture syscalls, TCP events, latency, allocator churn, page faults, CPU migration, and user-level function timings.
  - Streams samples to a ring buffer for the AI to summarize and query.
- Recorder (Time-Travel Debugging)
  - Records execution with rr (Linux x86-64) or alternatives (UndoDB, Pernosco SaaS, or QEMU/KVM snapshotting) to support reverse execution.
- Coverage & Fuzzing Driver
  - Leverages LLVM sancov, libFuzzer/AFL++/Honggfuzz/cargo-fuzz/Go fuzz, or Hypothesis for property-based exploration.
  - Maintains the corpus, deduplicates crashes by stack/PC/coverage signature, and auto-minimizes inputs.
- MCP Tool Layer
  - Model Context Protocol (MCP) enables the LLM to call specific tools with typed arguments and to receive structured results.
  - Supports capability discovery, sandboxed execution, and output schemas for determinism.
- Knowledge Store
  - Stores traces, recordings, coverage reports, crash signatures, minimized reproducers, and patch evaluations.
  - Indexable by commit SHA, input hash, environment fingerprint, and test ID.
- Patch Generator & Verifier
  - Suggests minimal diffs with references to observed failing lines and variables.
  - Validates via repro, regression, mutation tests, and additional fuzzing passes.
The system is useful because each piece exists today. The work is in orchestrating them reliably and safely.
Lifecycle: Reproduce → Instrument → Isolate → Fix → Verify
1. Intake/Triage
   - Signals: incident ticket, SLO breach, panic log, user bug report with input(s).
   - The AI asks: do we have a deterministic reproducer? If not, it synthesizes one: capture input from prod logs or API gateway mirrors, or derive it from crash breadcrumbs.
2. Environment & Policy Setup
   - Build the target commit and dependencies in a sandbox image.
   - Apply resource limits (CPU, memory), network shaping, and secrets-redaction policies.
3. Instrumentation & Recording
   - Attach eBPF probes in kernel and user space to capture relevant metrics.
   - Execute the program under rr for record/replay when possible.
   - Collect core dumps and symbolicate them.
4. Failure Confirmation
   - Verify that the failure reproduces with the given input and environment.
   - If flaky, capture multiple runs and cluster outcomes (ddmin on environment factors).
5. Isolation
   - Reduce the input (delta debugging) and measure coverage deltas.
   - Use binary search in commit history (git bisect) with the reproducer to find the introducing change.
6. Root Cause Hypothesis with Time-Travel
   - Step backward from the point of crash or assertion.
   - Trace variable origins and identify the last writer of each bad value. Validate invariants.
7. Patch Proposal
   - Propose a minimal diff referencing evidence (stack, variables, coverage edges affected).
   - Include tests (unit/property/fuzz) that replicate the issue and guard the fix.
8. Verification
   - Re-run the repro, the full test suite, and fuzzing with a time budget.
   - Compare performance and resource overhead. Watch for flakiness and new crash classes.
9. Reporting
   - Emit a structured bug dossier: repro steps, environment hash, probes activated, rr trace ID, coverage diff, patch, and risk notes.
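Tied together, this lifecycle is essentially a loop of MCP tool calls. The sketch below shows one way an orchestrator might drive it; `mcp.call`, the returned fields, and the `ddmin` helper (sketched later in this article) are assumptions, not a fixed API.

```python
# Sketch of the lifecycle as an orchestrator loop. `mcp.call` is a
# hypothetical MCP client; result fields and the probe-pack path are
# illustrative. `ddmin` is sketched later in this article.
from dataclasses import dataclass, field

@dataclass
class Dossier:
    bug_id: str
    artifacts: dict = field(default_factory=dict)

def debug_pipeline(mcp, bug_id: str, commit: str, failing_input: bytes) -> Dossier:
    dossier = Dossier(bug_id)

    # Environment: build the exact commit into an ephemeral sandbox.
    image = mcp.call("build_image", commit=commit)
    sandbox = mcp.call("run_in_sandbox", image=image)

    # Reproduce: confirm the failure is deterministic before theorizing.
    repro = mcp.call("reproduce_failure", sandbox_id=sandbox["id"], input=failing_input)
    if not repro["failed"]:
        raise RuntimeError("no deterministic repro; fall back to flake clustering")

    # Instrument: eBPF probes for signals, rr for causality.
    dossier.artifacts["probes"] = mcp.call(
        "deploy_bpf_probe", sandbox_id=sandbox["id"], script="probe_packs/crash.bt")
    dossier.artifacts["trace"] = mcp.call(
        "start_rr_record", sandbox_id=sandbox["id"], command=repro["command"])

    # Isolate: minimize the failing input, keeping only what still fails.
    dossier.artifacts["repro_input"] = ddmin(
        failing_input,
        lambda candidate: mcp.call("reproduce_failure",
                                   sandbox_id=sandbox["id"],
                                   input=candidate)["failed"])

    # Fix and verify: evidence-grounded patch, then regression + fuzz budget.
    patch = mcp.call("propose_patch", evidence=dossier.artifacts)
    dossier.artifacts["verdict"] = mcp.call(
        "run_regression_suite", sandbox_id=sandbox["id"], patch=patch)
    return dossier
```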
eBPF: The AI’s Stethoscope
eBPF enables low-overhead, safe, dynamic instrumentation of the kernel and user space. You can answer concrete questions:
- Which syscalls are hot before the crash?
- Are we hitting TCP retransmits, SYN backlog limits, or accept queue saturation?
- Did the process OOM? What cgroup memory limit was enforced?
- Which allocator call paths correlate with anomalies?
- Which user-space function returned error codes preceding failure?
Key elements:
- kprobes/tracepoints: probe kernel functions and events (e.g., sched_switch, tcp_retransmit_skb).
- uprobes/USDT: user-level function entry/exit and static tracepoints.
- CO-RE (Compile Once – Run Everywhere): portable BPF bytecode via BTF.
- Perf ring buffers/maps: efficient data channels.
Example 1: bpftrace to aggregate symbolized user-space stack traces on user page faults for a given PID, printing the histogram every 10 seconds:

```bash
# Pass the target PID as positional parameter $1.
sudo bpftrace -e '
tracepoint:exceptions:page_fault_user
/pid == $1/
{
  @[ustack] = count();
}
interval:s:10
{
  print(@);
  clear(@);
}
' "$PID"
```
Example 2: A minimal CO-RE eBPF program (C) to trace TCP retransmits and aggregate by PID:
```c
// tcp_retx.bpf.c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __type(key, u32);
    __type(value, u64);
    __uint(max_entries, 10240);
} retx_per_pid SEC(".maps");

SEC("tracepoint/tcp/tcp_retransmit_skb")
int on_tcp_retx(void *ctx)
{
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    u64 init = 0, *cnt;

    cnt = bpf_map_lookup_elem(&retx_per_pid, &pid);
    if (!cnt) {
        bpf_map_update_elem(&retx_per_pid, &pid, &init, BPF_ANY);
        cnt = bpf_map_lookup_elem(&retx_per_pid, &pid);
        if (!cnt)
            return 0;
    }
    __sync_fetch_and_add(cnt, 1);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```
A user-space loader (C/Go/Rust) reads the map periodically and correlates PIDs to service names using cgroup metadata.
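For illustration, a minimal Python stand-in for that loader could poll the map via bpftool (assumed installed, with the object above already loaded) and resolve each PID's cgroup via /proc; note that the JSON layout of `bpftool -j` varies by version and BTF availability.

```python
# Minimal polling loader sketch: dump the retx_per_pid map with bpftool and
# attribute counts to cgroups via /proc. Error handling mostly elided.
import json
import subprocess
import time
from pathlib import Path

def cgroup_of(pid: int) -> str:
    try:
        # The last field of a /proc/<pid>/cgroup line is the cgroup path.
        line = Path(f"/proc/{pid}/cgroup").read_text().splitlines()[0]
        return line.rsplit(":", 1)[-1]
    except (OSError, IndexError):
        return "<exited>"

def poll_retransmits(interval_sec: float = 10.0) -> None:
    while True:
        out = subprocess.check_output(
            ["bpftool", "-j", "map", "dump", "name", "retx_per_pid"])
        for entry in json.loads(out):
            # JSON layout differs across bpftool versions/BTF; adjust as needed.
            kv = entry.get("formatted", entry)
            pid, count = kv["key"], kv["value"]
            print(f"pid={pid} cgroup={cgroup_of(pid)} tcp_retx={count}")
        time.sleep(interval_sec)
```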
The AI’s orchestration layer should decide which probes to attach based on symptom heuristics:
- Crash or SIGSEGV: trace page faults, uprobe the hot-path functions leading to the fault, collect ustack/kstack samples.
- Latency spikes: sched_switch, block IO latency, TCP retrans, userspace USDT timers.
- Memory bloat: kmalloc/kfree, userspace malloc/free uprobes with stack collapsing.
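One way to encode these heuristics is a static table of probe packs with enforcement limits attached; the pack names, script names, and limit values below are illustrative.

```python
# Sketch of a symptom-driven probe policy. Each pack would map to bpftrace
# scripts or CO-RE objects plus the budgets the orchestrator enforces.
from dataclasses import dataclass

@dataclass(frozen=True)
class ProbePack:
    scripts: tuple[str, ...]   # bpftrace/CO-RE artifacts to attach
    sample_hz: int             # sampling rate cap
    max_events: int            # hard bound on event volume
    duration_sec: int          # auto-detach deadline

PROBE_PACKS = {
    "crash":   ProbePack(("page_fault_user.bt", "hotpath_ustacks.bt"), 99, 100_000, 300),
    "latency": ProbePack(("sched_switch.bt", "biolatency.bt", "tcp_retx.bt"), 49, 50_000, 600),
    "memory":  ProbePack(("kmalloc_kfree.bt", "malloc_stacks.bt"), 29, 200_000, 600),
}

def select_pack(symptom: str) -> ProbePack:
    # Unknown symptoms get a cheap, general pack rather than nothing.
    return PROBE_PACKS.get(symptom, PROBE_PACKS["latency"])
```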
Crucially, eBPF is safe by design: programs are verified before load, memory-safe, and rate-limited by maps and perf buffers. Still, the AI must enforce policies: CPU quotas for BPF programs, rate-sampling, and bounded event volumes.
References:
- eBPF (https://ebpf.io)
- bpftrace (https://github.com/iovisor/bpftrace)
- BPF CO-RE and libbpf (https://github.com/libbpf/libbpf)
Time-Travel Debugging: rr and Friends
When a crash occurs, engineers ask: where did this bad value come from? rr (https://rr-project.org/) provides deterministic record and replay on Linux x86-64. It records syscalls, signals, and sources of non-determinism, enabling reverse execution under gdb.
- Typical overhead: 1.2×–5× depending on workload; I/O heavy and high-syscall apps may incur more.
- Works best in a single-process or small-process tree; can trace forks/execs.
- Requires ptrace, access to hardware performance counters (perf events), and a CPU/kernel combination that rr supports.
Example workflow:
```bash
# Record a failing run
rr record ./service --input repro.bin

# Replay with gdb
rr replay
(gdb) b function_where_it_crashes
(gdb) c
(gdb) reverse-continue
(gdb) reverse-step
(gdb) p problematic_var
(gdb) watch *ptr
```
Alternatives:
- UndoDB/LiveRecorder (commercial) with reverse debugging.
- Pernosco (SaaS for rr traces) with timeline and dataflow UI.
- QEMU/KVM snapshots combined with VM checkpoints for coarse-grained time travel.
The AI should choose rr when:
- The failure reproduces on Linux x86-64 with ptrace allowed.
- Overhead is acceptable (batch job, offline reproducer).
When rr is infeasible (e.g., kernel modules, GPU-heavy workloads, or restricted ptrace), fall back to one of the following; a small selection sketch appears after the list:
- High-fidelity logging + eBPF + core dumps.
- Application-level snapshots (e.g., CRIU for process snapshots where applicable).
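A minimal decision function capturing these criteria might look like this; the `target` descriptor and its fields are illustrative.

```python
# Decision sketch for choosing a recorder, following the criteria above.
def choose_recorder(target: dict) -> str:
    linux_x86 = target.get("os") == "linux" and target.get("arch") == "x86_64"
    if linux_x86 and target.get("ptrace_allowed") and target.get("overhead_ok"):
        return "rr"                 # deterministic record/replay
    if target.get("checkpointable"):
        return "criu-snapshots"     # coarser-grained process snapshots
    return "ebpf+logs+cores"        # high-fidelity logging fallback
```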
Coverage-Guided Testing and Input Synthesis
Static reproducers are fragile. Coverage-guided fuzzing systematically explores code paths:
- LLVM sancov + libFuzzer/AFL++: for C/C++.
- cargo-fuzz (libFuzzer) for Rust.
- go test -fuzz for Go.
- Hypothesis/Property testing for Python.
The AI should:
- Auto-detect existing fuzz targets or generate harnesses.
- Build with coverage flags.
- Seed with the failing input; drive mutations while tracking new edges.
- Cluster crashes by coverage + top of stack to deduplicate (a signature sketch follows this list).
- Auto-minimize crashing inputs.
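As an example of the deduplication step, one can hash the top symbolized frames together with a digest of the coverage bitmap and cluster crashes whose signatures collide. The frame list and bitmap are assumed to come from the symbolizer and the sancov run.

```python
# Sketch of crash deduplication by stack-prefix + coverage signature.
import hashlib

def crash_signature(frames: list[str], coverage_bitmap: bytes, top_n: int = 3) -> str:
    h = hashlib.sha256()
    for frame in frames[:top_n]:          # top-of-stack frames dominate identity
        h.update(frame.encode())
    h.update(hashlib.sha256(coverage_bitmap).digest())
    return h.hexdigest()[:16]

# Two crashes cluster together iff their signatures match:
#   clusters.setdefault(crash_signature(frames, bitmap), []).append(crash)
```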
Example: Rust harness with cargo-fuzz
```toml
# fuzz/Cargo.toml
[package]
name = "myproj-fuzz"
version = "0.0.1"
edition = "2021"

[dependencies]
libfuzzer-sys = "0.4"
myproj = { path = ".." }

[[bin]]
name = "fuzz_decode"
path = "fuzz_targets/fuzz_decode.rs"
```
```rust
// fuzz_targets/fuzz_decode.rs
#![no_main]
use libfuzzer_sys::fuzz_target;

fuzz_target!(|data: &[u8]| {
    let _ = myproj::decode_message(data); // Should never panic or hit UB
});
```
Run:
```bash
cargo +nightly fuzz run fuzz_decode -- -max_total_time=120
```
Example: Python Hypothesis property test derived from a user crash
```python
from hypothesis import given, strategies as st
from mypkg import parse

@given(st.binary())
def test_parse_never_crashes(data):
    try:
        parse(data)
    except Exception as e:
        # Hypothesis will shrink to a minimal counterexample
        assert False, f"Crash with {e}"
```
Coverage-based techniques help the AI generate tests that not only reproduce the bug but guard against regressions by covering the once-missed path.
References:
- libFuzzer (https://llvm.org/docs/LibFuzzer.html)
- AFL++ (https://github.com/AFLplusplus/AFLplusplus)
- cargo-fuzz (https://github.com/rust-fuzz/cargo-fuzz)
- Hypothesis (https://hypothesis.works/)
The MCP Tool Layer: How the AI Operates Tools Safely
Model Context Protocol (MCP) provides a standardized way for an LLM to discover and call tools with typed inputs/outputs, with clear boundaries. In this system, each capability is a tool:
- build_image
- run_in_sandbox
- deploy_bpf_probe
- start_rr_record
- reproduce_failure
- run_fuzzer
- collect_core_dump
- symbolize_stack
- run_coverage_report
- propose_patch
- run_regression_suite
- open_pull_request
Example MCP tool definitions (conceptual JSON schema):
json{ "tools": [ { "name": "deploy_bpf_probe", "description": "Attach an eBPF program or bpftrace script to a sandboxed process", "input_schema": { "type": "object", "properties": { "sandbox_id": { "type": "string" }, "script": { "type": "string" }, "pid": { "type": "integer" }, "duration_sec": { "type": "integer" } }, "required": ["sandbox_id", "script"] }, "output_schema": { "type": "object", "properties": { "events_captured": { "type": "integer" }, "artifact_uri": { "type": "string" } } } }, { "name": "start_rr_record", "description": "Record a reproducible run with rr", "input_schema": { "type": "object", "properties": { "sandbox_id": { "type": "string" }, "command": { "type": "string" }, "env": { "type": "object", "additionalProperties": { "type": "string" } }, "input_artifact": { "type": "string" } }, "required": ["sandbox_id", "command"] }, "output_schema": { "type": "object", "properties": { "trace_id": { "type": "string" }, "exit_code": { "type": "integer" } } } } ] }
A thin Python orchestrator can host MCP servers that implement these tools, abstracting away the messy details of Docker, rr, and libbpf.
Example: skeleton of a Python MCP tool implementation for rr recording
```python
import os
import subprocess
from pathlib import Path

class RRTool:
    def start_rr_record(self, sandbox_id: str, command: str,
                        env: dict | None = None,
                        input_artifact: str | None = None) -> dict:
        sandroot = Path(f"/sandboxes/{sandbox_id}")
        rr_dir = sandroot / "rr_traces"
        rr_dir.mkdir(parents=True, exist_ok=True)

        run_env = os.environ.copy()
        run_env["_RR_TRACE_DIR"] = str(rr_dir)  # keep traces inside the sandbox
        if env:
            run_env.update(env)
        if input_artifact:
            run_env["REPRO_INPUT"] = str(input_artifact)

        record_cmd = ["rr", "record"] + command.split()
        p = subprocess.run(record_cmd, cwd=sandroot, env=run_env)

        # rr maintains a `latest-trace` symlink inside the trace directory;
        # resolve it to discover the trace ID of the run we just recorded.
        trace_id = (rr_dir / "latest-trace").resolve().name
        return {"trace_id": trace_id, "exit_code": p.returncode}
```
Security policies:
- Tools run inside confined sandboxes with least privilege.
- eBPF restricted to the target namespace/cgroup where supported (cgroup-bpf).
- No outbound network without explicit allowance.
- All artifacts are content-addressed and access-controlled.
Data Artifacts and Provenance
A debugging AI earns trust by keeping a durable, queryable record:
- Sandbox fingerprint: base image digest, kernel version, CPU flags, libraries.
- Probe pack: list of eBPF programs, schemas, sampling rates, and event counts.
- rr trace ID and pointer to storage.
- Coverage report: per-file/function deltas.
- Crash signature: top-of-stack function, PC address, and hashed coverage bitmap.
- Reproducer: command line, environment, minimized inputs.
- Patch: diff, linked issue, tests added, and risk scoring.
Example bug dossier (YAML):
```yaml
bug_id: BUG-2025-02-0142
commit: 9a1b7c2
sandbox_image: ghcr.io/org/service@sha256:abc123
kernel: 6.8.7-zen1-x86_64
reproducer:
  cmd: ./service --load repro.bin
  env:
    FEATURE_X: "on"
  input_sha256: 72b0...ae
rr_trace: rr://traces/BUG-2025-02-0142/trace-7
coverage:
  before_edges: 17234
  after_edges: 17302
  new_edges:
    - src/decoder.rs:412
    - src/decoder.rs:433
crash_signature:
  top_frame: decoder::read_varint
  pc: 0x5559b2fe
  hash: 7cde2a90
probes:
  - name: tcp_retx
    events: 0
  - name: page_fault_user
    events: 2
patch_artifact: s3://patches/BUG-2025-02-0142.diff
regression:
  tests_passed: 812
  fuzz_time_sec: 600
  unique_crashes: 0
```
Input and Scenario Minimization (ddmin)
Delta debugging reduces a failing input or environment until it’s minimal. Incorporate Zeller’s ddmin algorithm to shave hours from triage:
Pseudo-code:
```text
ddmin(S):
  n = 2
  while len(S) >= 2:
    subsets = partition(S, n)
    some_reduced = False
    for subset in subsets:
      if fails(subset):
        S = subset
        n = max(n - 1, 2)
        some_reduced = True
        break
    if not some_reduced:
      if n == len(S):
        break
      n = min(len(S), 2 * n)
  return S
```
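A runnable Python version of the same algorithm (like the pseudo-code, it omits the complement-testing refinement of Zeller's full ddmin); `fails` can be any predicate, e.g. a wrapper around the reproduce_failure MCP tool.

```python
# Simplified ddmin: repeatedly split the failing input into chunks and keep
# any chunk that still fails, refining the granularity otherwise.
def partition(s: list, n: int) -> list[list]:
    size = max(1, len(s) // n)
    return [s[i:i + size] for i in range(0, len(s), size)]

def ddmin(s: list, fails) -> list:
    n = 2
    while len(s) >= 2:
        reduced = False
        for subset in partition(s, n):
            if fails(subset):
                s, n, reduced = subset, max(n - 1, 2), True
                break
        if not reduced:
            if n >= len(s):
                break
            n = min(len(s), 2 * n)
    return s

# Example: minimize a crashing byte string, treating each byte as a unit.
#   minimal = bytes(ddmin(list(crashing_input), lambda c: reproduces(bytes(c))))
```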
Apply ddmin to:
- Command-line flags
- Environment variables
- Input files (split by chunk/line/JSON elements)
- Network interactions (sequence minimization)
The AI can automatically propose the next ddmin partition, then call reproduce_failure via MCP to test each hypothesis.
Reference: “Simplifying and Isolating Failure-Inducing Input” (Zeller & Hildebrandt, 2002).
End-to-End Example: Fixing a Go Service Panic
Scenario: A Go microservice intermittently panics with “runtime error: invalid memory address or nil pointer dereference” when decoding user-uploaded binary payloads.
1. Intake
   - The AI ingests a pager event and a log snippet containing a request ID and payload size. It fetches the raw payload from the API gateway replay buffer.
2. Sandbox
   - Builds commit 9a1b7c2 and creates an ephemeral sandbox with Go 1.22, glibc 2.38, and the exact feature flags from prod.
3. Instrument
   - Deploys eBPF tracepoints for page_fault_user and uprobes for the decode function via USDT or symbols (building with -gcflags=all="-N -l" to disable optimizations and inlining for easier symbolization in non-prod).
   - Starts rr recording and runs the service with the captured request payload.
4. Reproduce
   - The panic happens reliably under rr record. A core dump is produced; the AI symbolicates the stack to decoder.ReadVarint.
5. Time Travel
   - During rr replay, reverse-stepping shows that a slice index used to read a length prefix wasn't validated; the field read from the stream exceeded the buffer's remaining length.
6. Coverage-Guided Exploration
   - The AI generates a fuzz harness around the decoder using go test -fuzz, seeding with the failing payload.
   - Within 60 seconds, it finds a smaller counterexample that triggers the same panic in the decoder.
7. Patch Proposal
   - The AI suggests:
     - Bounds checks before advancing the reader.
     - Early return with an error instead of a panic.
     - A property test asserting decode(encode(x)) round-trips for random inputs.
Example patch (simplified):
```diff
--- a/decoder/decoder.go
+++ b/decoder/decoder.go
@@ func (d *Decoder) ReadVarint() (int64, error) {
-	b := d.buf[d.pos]
-	// ... increments pos
+	if d.pos >= len(d.buf) {
+		return 0, io.ErrUnexpectedEOF
+	}
+	b := d.buf[d.pos]
+	// ... bounds check before each increment
+	if d.pos+1 > len(d.buf) {
+		return 0, io.ErrUnexpectedEOF
+	}
+	d.pos++
 	// ... existing varint logic with additional bounds checks
 }
```
8. Verify
   - Re-run the rr reproducer: no crash; the function returns an error.
   - Run fuzzing for 10 minutes: no new unique panics; coverage increases on the decoder.
   - Run the full test suite and SLO-impact smoke tests.
9. Report
   - The AI posts a PR with:
     - The patch
     - The minimal reproducer input (as a test fixture)
     - The new fuzz target and property tests
     - Artifacts: rr trace, coverage diff, and probe summary
Outcome: Fix based on runtime evidence, not guesswork.
Polyglot Support: Language-Specific Hooks
- C/C++: build with -fsanitize=address,undefined and -fsanitize-coverage=trace-pc-guard. Integrate libFuzzer or AFL++. Symbolize with llvm-symbolizer.
- Rust: cargo-fuzz; enable -Zsanitizer=address for ASan; use backtrace + panic=abort for sanitizer friendliness.
- Go: go test -race for data races; go test -fuzz for fuzzing; use GODEBUG settings to alter GC or memory for repro.
- JVM: Flight Recorder (JFR) plus eBPF for syscalls; rr not applicable—fall back to snapshotting or JVM-specific recorders.
- Python/Node: property-based testing (Hypothesis/fast-check), native crash dumps, and BPF for system calls.
CI/CD Integration
- Pre-merge: run fuzzing and sanitizer builds on critical code paths with time budgets.
- Post-incident: automatically spawn a debugging job with the incident’s reproducer.
- Gates: block merges that decrease line/function coverage in critical modules unless waived.
- Artifact retention: store rr traces for P0 incidents for 30 days with encryption and access audit.
Cost, Performance, and Policy Budgets
- eBPF: typically negligible to low overhead. Use sampling and bounded maps.
- rr: 1.2×–5× overhead; apply to reproduction-only runs, not steady-state.
- Fuzzing: time-bounded; scale horizontally; schedule during low-traffic windows.
- Storage: rr traces can be GBs; dedupe by content and compress aggressively.
- Policy: organization-level budgets (e.g., 20 CPU-hours/day for fuzzing per service) and prioritization queues.
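A budget check of this kind can stay tiny; the sketch below assumes usage accounting is backed by the artifact store, and the constant mirrors the example figure above.

```python
# Sketch of an org-level fuzzing budget check (e.g., 20 CPU-hours/day per
# service). `usage_today` would be queried from the artifact store.
FUZZ_BUDGET_CPU_HOURS = 20.0

def may_schedule_fuzzing(service: str, requested_cpu_hours: float,
                         usage_today: dict[str, float]) -> bool:
    used = usage_today.get(service, 0.0)
    return used + requested_cpu_hours <= FUZZ_BUDGET_CPU_HOURS
```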
Security, Privacy, and Compliance
- Redaction: strip PII from traces and inputs via deterministic filters before storage.
- Secrets: ensure sandbox env has only synthetic or scoped credentials.
- Attestation: sign artifacts (SLSA provenance) linking code, tools, and environment digests.
- Access control: least-privilege roles for AI operators; immutable logs.
Limitations and Workarounds
- rr is Linux-only and most mature on x86-64. For macOS/Windows targets, rely on language-specific recorders or VM snapshotting.
- GPU/drivers: rr cannot record GPU kernels; use domain-specific recorders or deterministic seeds and logs.
- eBPF on non-Linux: eBPF for Windows exists but does not yet have feature parity; tailor probes per OS.
- High-frequency trading or ultra-low-latency workloads cannot tolerate instrumentation in prod; stage reproductions in shadow or synthetic environments.
Developer Experience: Make It Feel Native
- Editor integrations: “Reproduce and Instrument” button in VS Code/JetBrains that runs the pipeline on the current test or input.
- GitHub/GitLab bot: comment with reproducer, artifacts, and patch preview.
- Dashboards: timeline of probes and rr frames; click-to-open reverse watchpoints.
- Chat as an explanation layer, not the control plane; logs and artifacts remain the source of truth.
Minimal Contracts and APIs
Clients should never guess what the AI did. Define stable APIs:
- POST /reproduce: returns repro_id
- POST /instrument: attach probes to repro_id
- POST /record: start rr; returns trace_id
- GET /artifacts/{id}: fetch coverage, traces, cores
- POST /patch: propose and validate patch; returns PR link
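A hypothetical client for these endpoints might chain them like this; the host and payload fields are illustrative, only the routes come from the contract above.

```python
# Sketch of a client driving the stable API: reproduce, instrument, record,
# then request a validated patch. Error handling elided.
import requests

BASE = "https://debug-ai.internal"  # illustrative host

def reproduce_and_patch(incident_id: str, input_uri: str) -> str:
    repro = requests.post(f"{BASE}/reproduce",
                          json={"incident": incident_id, "input": input_uri}).json()
    requests.post(f"{BASE}/instrument", json={"repro_id": repro["repro_id"]})
    trace = requests.post(f"{BASE}/record", json={"repro_id": repro["repro_id"]}).json()
    patch = requests.post(f"{BASE}/patch",
                          json={"repro_id": repro["repro_id"],
                                "trace_id": trace["trace_id"]}).json()
    return patch["pr_url"]  # PR link, per the contract above
```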
What “Good” Looks Like: Metrics
- MTTF (Mean Time To Fix) delta vs. baseline.
- Reproducer rate: % of incidents with deterministic repro within 30 minutes.
- Minimality score: median size of minimized inputs vs. originals.
- Regression escape rate: bugs reintroduced after patch.
- Coverage delta: edges/functions added by new tests over time.
- Developer adoption and trust: usage, PR acceptance rate, artifact views.
Opinionated Takeaways
- If your debugging AI never attaches a probe, records a run, or generates a test, it’s a chatbot with a nice coat of paint.
- eBPF + rr + coverage-guided tests cover the vast majority of real-world debugging scenarios for server-side software.
- Determinism is a feature: invest in reproducible builds, explicit configs, and deterministic seeds; your AI gets smarter when your system is predictable.
- Build guardrails first: sandboxing, redaction, and provenance are not optional in production shops.
Quickstart Blueprint
- Phase 1: MVP
  - Sandbox runner with Docker/Firecracker
  - MCP tools: build, run, eBPF via bpftrace, rr record, core collection
  - Minimal coverage with gcov/llvm-cov; cargo-fuzz for Rust or libFuzzer for C++
  - Basic ddmin for inputs
- Phase 2: Scale
  - CO-RE BPF programs with libbpf; per-symptom probe packs
  - Artifact store and dossier schema; search by crash signature
  - Time-budgeted fuzzing across services; corpus sharing
  - Editor and CI integrations; PR bots
- Phase 3: Enterprise
  - Policy engine for budgets and PII redaction
  - Attestation/provenance (SLSA), signed patches
  - Cross-language harness generators; flaky-test detector
  - Optional SaaS rr analysis (Pernosco) for heavy traces
Closing
Stop prompting, start instrumenting. A code debugging AI that runs your program—attaching eBPF probes, recording with rr, guiding coverage to reproduce and isolate failures—will ship fewer regressions, resolve incidents faster, and earn engineers’ trust. The technology exists. The difference is orchestration and a relentless focus on evidence.
Further reading:
- rr (https://rr-project.org/)
- Pernosco (https://pernos.co/)
- eBPF (https://ebpf.io)
- libbpf (https://github.com/libbpf/libbpf)
- LLVM SanitizerCoverage (https://clang.llvm.org/docs/SanitizerCoverage.html)
- AFL++ (https://github.com/AFLplusplus/AFLplusplus)
- Hypothesis (https://hypothesis.works/)
- Delta Debugging (Zeller & Hildebrandt) (https://www.st.cs.uni-saarland.de/dd/)
