Stop Guessing: Turn Your Code Debugging AI into a Real Debugger with GDB, LLDB, and eBPF
Most "AI code debuggers" don’t debug. They speculate. They explain hypothetical root causes based on pattern-matching error messages or stack traces, but they rarely set a breakpoint, inspect a register, or deterministically replay a flaky crash. If you want more than educated guesses, you must wire your LLM to the tools practitioners actually use: GDB/LLDB, DWARF/PDB symbols, system tracers, and record–replay.
This article is a practical blueprint for turning an LLM into a genuine, instrumented debugging agent that:
- Sets breakpoints, steps, reads registers, and inspects memory
- Resolves symbols from DWARF/PDB/dSYM without shipping your source code
- Traces syscalls, user-level probes, and kernel events with eBPF
- Reproduces flaky bugs on demand with record–replay (rr, TTD)
- Operates under strict data minimization and redaction policies so source and secrets don’t leak
We’ll cover the architecture, the protocols, the build flags, the hard edges (optimized code, ASLR, capabilities), and end with concrete code snippets for GDB/MI, LLDB’s SB API, bpftrace/BCC, and rr.
The mandate: Don’t guess—instrument
Developers do not need another guess engine. They need an autonomous operator that can:
- Plan a debugging session: define hypotheses, choose tools, and set exit criteria
- Execute structured tool calls and parse machine-readable results
- Iterate based on evidence (registers, stack, memory, traces), not vibes
- Produce a deterministic replay artifact when the bug is intermittent
- Keep sensitive artifacts local (source code, core dumps, secrets) unless explicitly allowed
That requires an architecture that puts the LLM behind a strict "tool gateway". Natural language in; structured actions out; instrumented facts back in.
High-level architecture
Think of five cooperating services:
- Orchestrator (LLM Agent): Plans and reasons. Converts user intent into a sequence of tool calls.
- Debug Gateway: A stateless microservice exposing a small set of safe, auditable operations (breakpoint, step, memory read, trace, record/replay). It talks to GDB/LLDB, eBPF, rr/TTD, etc.
- Symbol Service: Resolves addresses and stack frames to human-meaningful names using DWARF/PDB/dSYM. It never exposes source text by default—only symbol metadata.
- Redaction/Policy Engine: Enforces data minimization (no source lines, no raw strings from memory) unless explicitly permitted.
- Storage/Trace Catalog: Stores rr/TTD traces, eBPF ring-buffer extracts, debugger transcripts, and anonymized session summaries.
Data flow:
- User: "Flaky crash in payment worker. Make it deterministic and show the race."
- LLM: Plan → "record run under rr", "replay and break on SIGSEGV", "inspect threads and shared state", "trace syscalls to correlate timing".
- Gateway: Executes tool actions; returns structured results (JSON) with symbolized stack traces and register/memory excerpts (redacted where needed).
- LLM: Updates hypothesis; performs targeted watchpoints; produces a minimal, deterministic repro script and a report.
Key principle: The LLM never gets raw source code unless the user opts in. Tools operate on binaries + symbols and runtime state.
Prepare your builds: symbols without source
Native debuggers are only as good as the debug info you ship with them. You can support a source-privacy-first workflow by shipping symbol files separately and keeping source lines private by default.
Recommended settings by ecosystem:
- Clang/GCC (C/C++, and Rust via LLVM):
  - Compile with debug info: `-g3 -fno-omit-frame-pointer`
  - Consider `-gsplit-dwarf` to keep debug info separate from binaries
  - For optimized-but-debuggable builds: `-g -O2 -fno-omit-frame-pointer` (plus `-fvar-tracking-assignments` on GCC)
  - Link with `-Wl,--build-id` and use `.gnu_debuglink` so symbol servers can map binaries to separate DWARF
- Rust:
  - Set `debug = true` in your Cargo profile and control placement with `split-debuginfo = "unpacked"` (or `"packed"`); on recent toolchains the flag form is `-C split-debuginfo=...`
- Go:
  - Build with `-gcflags=all='-N -l'` for debugging (disables inlining/optimizations)
  - Add `-ldflags='-compressdwarf=false'` if you need easier DWARF consumption
- Swift/Objective-C (Apple):
  - Produce dSYM bundles via `dsymutil`; distribute dSYMs separately from the binary
- Windows (MSVC/Clang-cl):
  - Compile with `/Zi` or `/Z7` and link with `/DEBUG:FULL` to produce PDBs
  - Configure a symbol server: `setx _NT_SYMBOL_PATH "SRV*c:\symcache*https://msdl.microsoft.com/download/symbols"`
The key is a symbol store that maps build-id or PDB GUID to debug info. Your LLM agent only needs function/variable names, addresses, types, and line mappings—no source text. Keep source on-prem and only expose it via a guarded capability.
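On Linux, the build-id mapping is simple enough to sketch. The following assumes pyelftools is installed and mirrors the conventional `/usr/lib/debug/.build-id/xx/rest.debug` layout; treat it as a starting point, not a complete symbol server client.

```python
# A minimal sketch of build-id based debug-info lookup, assuming pyelftools.
from pathlib import Path
from elftools.elf.elffile import ELFFile

def read_build_id(binary_path: str) -> str:
    """Extract the GNU build-id note from an ELF binary as a hex string."""
    with open(binary_path, 'rb') as f:
        elf = ELFFile(f)
        for section in elf.iter_sections():
            if not section.name.startswith('.note'):
                continue
            for note in section.iter_notes():
                if note['n_type'] == 'NT_GNU_BUILD_ID':
                    return note['n_desc']
    raise LookupError(f'no build-id note in {binary_path}')

def debug_file_for(binary_path: str,
                   debug_root: str = '/usr/lib/debug/.build-id') -> Path:
    """Map a binary to its detached DWARF file via the .build-id layout."""
    build_id = read_build_id(binary_path)
    # Layout: <root>/<first two hex chars>/<remaining chars>.debug
    path = Path(debug_root) / build_id[:2] / f'{build_id[2:]}.debug'
    if not path.exists():
        raise FileNotFoundError(f'no debug info for build-id {build_id}')
    return path
```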
A common command schema for tools
You need a narrow, stable interface that an LLM can reliably call. Use JSON with a strict schema and deterministic responses. The Debug Adapter Protocol (DAP) is a good basis for stepping/breakpoints, but you’ll likely need extensions for low-level memory/registers, eBPF, and record–replay.
Example minimal schema for a tool call:
json{ "tool": "gdb-mi|lldb|ebpf|rr|ttd", "action": "set_breakpoint|continue|step|eval|read_memory|backtrace|threads|trace_syscalls|record|replay", "args": { "target": {"pid": 1234}, "location": {"function": "foo::bar", "file": "foo.cc", "line": 128}, "expression": "myVar + 1", "address": "0x7fff...", "length": 64, "filters": {"syscalls": ["openat", "connect"], "duration_us_gt": 1000} }, "policy": { "redact_strings": true, "max_bytes": 32, "allow_source_lines": false } }
Responses should include:
- A normalized outcome (`ok|error|timeout`)
- Minimal structured data: stack frames with function names and inlining markers; registers as name/value pairs; memory as hex plus an entropy estimate; traces as event tuples
- Explicit redaction markers and a count of omitted bytes/lines (modeled in the sketch below)
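To keep the executor honest, it helps to pin these shapes down as types. A sketch in Python, mirroring the schema above (the names are illustrative, not a fixed wire format):

```python
# Illustrative typed shapes for normalized tool responses.
from typing import Literal, Optional, TypedDict

class Frame(TypedDict):
    function: str
    address: str            # hex string, e.g. "0x7f..."
    module: str
    inlined: bool

class MemoryExcerpt(TypedDict):
    hex: Optional[str]      # None when fully redacted
    length: int
    entropy_bits_per_byte: float
    omitted_bytes: int      # explicit redaction accounting

class ToolResponse(TypedDict):
    outcome: Literal["ok", "error", "timeout"]
    frames: list[Frame]
    registers: dict[str, str]
    memory: Optional[MemoryExcerpt]
```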
GDB via MI: battle-tested and scriptable
GDB’s Machine Interface (MI2) is stable, documented, and ideal for programmatic control. Run GDB as a subprocess with --interpreter=mi2 and drive it through an existing wrapper such as pygdbmi, or write your own minimal parser.
Python example using pygdbmi to set a breakpoint, run, and inspect state:
```python
from pygdbmi.gdbcontroller import GdbController

# Launch gdb in MI mode
gdbmi = GdbController(command=['gdb', '--interpreter=mi2'])

def send(cmd):
    return gdbmi.write(cmd, timeout_sec=5)

# Load the target
send('-file-exec-and-symbols ./bin/payment_worker')

# Optional: attach to a running PID
# send('-target-attach 12345')

# Set a breakpoint by function name
send('-break-insert foo::ProcessPayment')

# Run until breakpoint
send('-exec-run')

# When stopped, get backtrace
bt = send('-stack-list-frames')
print('backtrace:', bt)

# Read registers in the current frame
regs = send('-data-list-register-values x')
print('registers:', regs)

# Evaluate an expression safely
val = send('-data-evaluate-expression myStruct.field')
print('value:', val)

# Read memory (bounded length)
mem = send('-data-read-memory-bytes 0x7fffffffe000 64')
print('mem bytes:', mem)

# Continue execution
send('-exec-continue')
```
Notes:
- Use `-gdb-set pagination off` and `-gdb-set print elements 200` for predictable outputs.
- Map MI async records (`*stopped`, `=thread-created`) to structured events in your gateway; a sketch follows below.
- To avoid source leakage, don’t call `-symbol-list-lines` or fetch source files unless the user opts in. Symbolize frames from DWARF without reading file contents.
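A sketch of that mapping, using the record shape pygdbmi already produces (`type`, `message`, `payload`); the event dictionaries themselves are our own convention:

```python
# Normalize pygdbmi's parsed MI records into gateway events.
def normalize_mi_records(records):
    events = []
    for rec in records:
        if rec['type'] == 'notify' and rec['message'] == 'stopped':
            payload = rec.get('payload') or {}
            frame = payload.get('frame', {})
            events.append({
                'event': 'stopped',
                'reason': payload.get('reason'),   # e.g. 'breakpoint-hit'
                'function': frame.get('func'),
                'address': frame.get('addr'),
                'thread': payload.get('thread-id'),
            })
        elif rec['type'] == 'notify' and rec['message'] == 'thread-created':
            events.append({'event': 'thread-created',
                           'thread': (rec.get('payload') or {}).get('id')})
    return events

# Usage: events = normalize_mi_records(send('-exec-continue'))
```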
LLDB via SB API: powerful and Pythonic
LLDB’s SB API exposes finer-grained control than its command interpreter and is pleasant in Python. Use it for macOS and for projects where LLVM/Clang’s debug info shines (Swift/Objective-C/C++).
Example: set breakpoint, continue, inspect a variable’s type and value.
```python
import lldb

# Create a debugger without distorting stdout
lldb.SBDebugger.Initialize()
dbg = lldb.SBDebugger.Create()
dbg.SetAsync(False)

target = dbg.CreateTargetWithFileAndArch('./bin/payment_worker', None)
if not target:
    raise RuntimeError('Failed to create target')

bp = target.BreakpointCreateByName('foo::ProcessPayment')

process = target.LaunchSimple(None, None, '.')
if process.GetState() != lldb.eStateStopped:
    raise RuntimeError('Did not stop at breakpoint')

thread = process.GetSelectedThread()
frame = thread.GetSelectedFrame()

# Backtrace frames (symbolized, no source needed)
for f in thread:
    print(f.GetFunctionName(), hex(f.GetPC()))

# Registers
regs = frame.GetRegisters()
for regset in regs:
    for reg in regset:
        print(reg.GetName(), reg.GetValue())

# Evaluate expression without running code (DWARF-only eval if possible)
opts = lldb.SBExpressionOptions()
opts.SetLanguage(lldb.eLanguageTypeC_plus_plus)
opts.SetCoerceResultToId(False)
val = frame.EvaluateExpression('myStruct.field', opts)
print(val.GetTypeName(), val.GetValue())

process.Continue()
```
LLDB nuances:
- For symbol-only workflows on macOS, ship dSYM bundles. LLDB can symbolize without source.
- `SBExpressionOptions.SetTryAllThreads` and `SetAutoApplyFixIts` can change behavior; keep defaults predictable.
- Restrict `expression` evaluation if you need to prevent JIT/execution for security; prefer DWARF-based evaluation.
Windows: PDBs, cdb/WinDbg, and Time Travel Debugging (TTD)
On Windows, you can script cdb/windbg through dbgeng or use the VS Code vsdbg adapter (DAP). For record–replay, Time Travel Debugging (TTD) provides deterministic playback of native apps.
Tips:
- Use a symbol server and cache (PDBs can be large): `set _NT_SYMBOL_PATH=SRV*c:\symcache*https://msdl.microsoft.com/download/symbols` plus your private SRV path.
- The dbgeng API is C-based; wrappers exist in Python/PowerShell. If you need simpler automation, drive `cdb` with command scripts and parse `.printf`/`.echo` outputs.
- TTD CLI usage (simplified example):
```powershell
# Record
ttd.exe -out trace_run.ttd -attach 12345
# Or launch
ttd.exe -out trace_run.ttd -launch .\bin\payment_worker.exe --args "..."

# Replay inside WinDbg
windbg.exe -ttd -z trace_run.ttd
# Then use the TTD extension commands to time-travel and inspect state
```
If you want a unified flow across platforms, prefer the Debug Adapter Protocol for basic stepping and layer in Windows-specific modules for TTD and PDB symbolization.
eBPF: observe without pausing the world
Breakpoints are great, but the world is asynchronous. eBPF lets you observe syscalls, user-level functions (uprobes), and kernel scheduling without stopping the process. It’s ideal for answering questions like: "Which requests spend >5 ms in SQLite?" or "Which file descriptor is leaking?"
Two practical approaches:
- bpftrace for quick one-liners and histograms
- BCC/libbpf for production-grade, programmatic tracing
Example: count openat activity for a target PID with bpftrace:
```bash
sudo bpftrace -e '
tracepoint:syscalls:sys_enter_openat /pid == 12345/ {
  @files[str(args->filename)] = count();
}
interval:s:5 { print(@files); clear(@files); }'
```
Example: probe function entry/exit to measure latency with bpftrace uprobes:
```bash
# Attach to user function symbols by name (requires symbols)
sudo bpftrace -e '
uprobe:/usr/local/bin/payment_worker:foo::ProcessPayment {
  @ts[tid] = nsecs;
}
uretprobe:/usr/local/bin/payment_worker:foo::ProcessPayment /@ts[tid]/ {
  @lat = hist((nsecs - @ts[tid]) / 1000);
  delete(@ts[tid]);
}'
```
Programmatic BCC in Python to flag slow connect syscalls for a target PID:
```python
from bcc import BPF
import ctypes as ct

TARGET_PID = 12345  # filter to the worker process

prog = r'''
#include <uapi/linux/ptrace.h>

BPF_HASH(start, u32, u64);
BPF_PERF_OUTPUT(events);

int trace_enter(struct pt_regs *ctx) {
    u64 id = bpf_get_current_pid_tgid();
    u32 tgid = id >> 32;
    if (tgid != TARGET_PID)
        return 0;
    u32 tid = (u32)id;
    u64 ts = bpf_ktime_get_ns();
    start.update(&tid, &ts);
    return 0;
}

int trace_exit(struct pt_regs *ctx) {
    u32 tid = (u32)bpf_get_current_pid_tgid();
    u64 *tsp = start.lookup(&tid);
    if (!tsp)
        return 0;
    u64 delta = bpf_ktime_get_ns() - *tsp;
    start.delete(&tid);
    if (delta > 1000000) {  // >1ms
        events.perf_submit(ctx, &delta, sizeof(delta));
    }
    return 0;
}
'''

b = BPF(text=prog.replace('TARGET_PID', str(TARGET_PID)))
# get_syscall_fnname resolves the kernel's name for the syscall entry point
b.attach_kprobe(event=b.get_syscall_fnname("connect"), fn_name="trace_enter")
b.attach_kretprobe(event=b.get_syscall_fnname("connect"), fn_name="trace_exit")

def print_event(cpu, data, size):
    delta = ct.cast(data, ct.POINTER(ct.c_ulonglong)).contents.value
    print("slow connect us:", delta // 1000)

b["events"].open_perf_buffer(print_event)
while True:
    b.perf_buffer_poll()
```
Security and portability notes:
- Kernel must allow eBPF features; on some distros you need CAP_BPF/CAP_SYS_ADMIN or to run inside a privileged container.
- Use CO-RE (libbpf) for portable programs across kernel versions.
- Redaction: prefer event metadata (durations, FDs, PIDs) over raw buffers/strings. Never dump arbitrary process memory from eBPF to the LLM.
Deterministic repro with record–replay
Some bugs bite only under weird timing. You need time travel.
- Linux: rr records user-space execution with hardware performance counters and replays deterministically. It supports stepping, reverse-continue, reverse-next, and integrates with gdb.
- Windows: Time Travel Debugging (TTD) provides similar time-travel inside WinDbg.
rr workflow:
```bash
# Record a run (traces are stored under ~/.local/share/rr by default)
rr record -- ./bin/payment_worker --seed 123 --job-id 42

# List traces
rr ls

# Replay with gdb
rr replay
# Now in a gdb wrapped by rr; reverse commands are available:
# (gdb) reverse-continue
# (gdb) reverse-next
# (gdb) watch myVar
```
Automate with a gateway that spawns rr record, captures the trace id, and exposes replay as a virtual target (a sketch follows the list below). When the LLM says "reproduce the crash and stop at the faulting instruction," your gateway:
- Launches under rr until exit or signal
- Extracts exit status and signal
- Replays and stops at the signal
- Returns a backtrace, registers, and disassembly around the PC
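A minimal sketch of that flow, assuming rr and gdb are on PATH and that rr propagates the recorded process's exit/signal status; paths and parsing are illustrative:

```python
import subprocess

def record_until_signal(cmd, max_runs=3):
    """Run the target under rr until a run dies on a signal."""
    for _ in range(max_runs):
        result = subprocess.run(['rr', 'record', '--'] + cmd)
        if result.returncode < 0:  # killed by signal N => returncode -N
            return -result.returncode
    return None

def replay_backtrace():
    """Replay the latest trace; batch gdb commands run to the fatal signal."""
    out = subprocess.run(
        ['rr', 'replay', '--',          # args after -- go to the debugger
         '-batch',
         '-ex', 'continue',             # run forward to the signal
         '-ex', 'backtrace',
         '-ex', 'info registers'],
        capture_output=True, text=True)
    return out.stdout

sig = record_until_signal(['./bin/payment_worker', '--seed', '123'])
if sig is not None:
    print(f'captured signal {sig}; replaying...')
    print(replay_backtrace())
```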
Windows TTD parallels this but requires WinDbg automation. Keep traces local; they may contain secrets.
Glue it together: an agent that plans and executes
Your LLM needs to be more than a chatty interface. Give it a tool-using backbone; a minimal executor sketch follows this list:
- Planner: Converts goals into a sequence of tool calls with pre/postconditions. E.g., "If process crashes → record with rr; if hangs → collect eBPF scheduler traces; else → attach gdb and set breakpoints."
- Schema-aware Executor: Validates JSON tool calls, enforces timeouts, and retries idempotently. No free-form shell.
- State Summarizer: After each step, distill results into a compact, privacy-preserving summary (e.g., top 3 frames with function names and offsets; histograms, not raw samples).
- Memory of Facts: A lightweight vector index of symbols, types, and recent observations. No source. Function signatures and docstrings (optional) are okay.
- Policy Gate: Enforces "no source, no raw strings" unless the user enables a capability during the session.
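A sketch of the executor core, assuming the `jsonschema` package and a dispatch table keyed by (tool, action); both are illustrative choices, not a fixed design:

```python
import jsonschema

# Schema fragment mirroring the tool-call format shown earlier
TOOL_CALL_SCHEMA = {
    "type": "object",
    "required": ["tool", "action", "args", "policy"],
    "properties": {
        "tool": {"enum": ["gdb-mi", "lldb", "ebpf", "rr", "ttd"]},
        "action": {"type": "string"},
        "args": {"type": "object"},
        "policy": {"type": "object"},
    },
    "additionalProperties": False,
}

def execute(call, dispatch, timeout_s=15):
    """Validate and run one tool call; never pass raw strings to a shell."""
    try:
        jsonschema.validate(call, TOOL_CALL_SCHEMA)
    except jsonschema.ValidationError as e:
        return {"outcome": "error", "error": f"schema: {e.message}"}
    handler = dispatch.get((call["tool"], call["action"]))
    if handler is None:
        return {"outcome": "error", "error": "unknown tool/action"}
    try:
        return {"outcome": "ok", "data": handler(call["args"], timeout=timeout_s)}
    except TimeoutError:
        return {"outcome": "timeout"}
```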
A typical run for "flaky crash in payment worker" might be:
- Plan: try rr record up to N attempts to capture the crash.
- Once captured, rr replay → stop at signal → collect backtrace and thread list.
- Look for data races: if multiple threads in shared code, set watchpoints on suspect addresses; reverse-step to first write.
- Use eBPF to correlate syscalls and scheduling pressure, or enable rr’s event logs to show interleaving.
- Produce a minimal script plus deterministic trace hash that reproduces the crash for CI triage.
Privacy-first design: debug without leaking source
You can get 80–90% of debugging value without ever sending source to the model. Tactics:
- Symbol-only RAG: Index function names, types, address ranges, and linkage (who calls whom) from DWARF/PDB. Don’t ingest file contents or comments.
- Redacted memory reads: Return lengths, entropy estimates, and hashes instead of raw strings. If necessary, show constant-size previews with ASCII control-character masking (see the sketch after this list).
- Source-on-demand: Offer a capability flag like `allow_source_lines: true` with a user prompt that shows exactly which file/line will be revealed and why.
- Local-only heavy artifacts: Keep core dumps, rr/TTD traces, and raw eBPF buffers on the server. The LLM gets summaries and indexes.
- Sensitive symbol filtering: Redact known secret-bearing globals (e.g., `kSecret`, `api_key`) via allowlists/denylists.
- Logging and attestation: Every tool call is logged with purpose and redaction mode. Provide a session transcript to the user.
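A sketch of the memory-read redactor described above; the response fields mirror the redaction accounting from the tool schema, but their names are illustrative:

```python
import hashlib
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Bits per byte; ~8.0 looks like ciphertext/compressed, ~4.5 like text."""
    if not data:
        return 0.0
    counts = Counter(data)
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def redact_memory(data: bytes, preview_bytes: int = 0) -> dict:
    """Summarize a memory read without exposing raw contents."""
    out = {
        "length": len(data),
        "entropy_bits_per_byte": round(shannon_entropy(data), 2),
        "sha256": hashlib.sha256(data).hexdigest(),
    }
    if preview_bytes:
        # Mask control characters and non-ASCII in a fixed-size preview
        preview = bytes(b if 0x20 <= b < 0x7F else 0x2E  # '.'
                        for b in data[:preview_bytes])
        out["preview"] = preview.decode("ascii")
        out["omitted_bytes"] = max(0, len(data) - preview_bytes)
    return out
```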
If you host the model on-prem or inside the developer’s machine, you can loosen some restrictions. Default locked-down is still a good baseline.
Concrete end-to-end scenarios
1) Intermittent crash in a multithreaded service
- Agent plan:
  - Build with debug symbols; ensure separate DWARF is available
  - Run under `rr record` until the crash is captured (max 3 runs)
  - `rr replay` and stop at SIGSEGV
  - Collect backtraces for all threads and inspect registers
  - Identify the null deref’s origin via a reverse watchpoint on the pointer
  - Export a minimal reproducer script and the rr trace id
- Key commands:
  - rr record/replay; gdb MI: `-exec-continue`, `-data-read-memory-bytes`, `-stack-list-frames`
- Report:
  - Deterministic steps, function names, offsets, and the race explanation
  - Link to the local trace artifact
2) Performance regression in I/O path
- Agent plan:
  - Attach eBPF uprobes to `read`, `write`, and `fsync` wrappers in the binary
  - Attach tracepoints to `block:block_rq_issue`/`block:block_rq_complete`
  - Aggregate latencies and show heatmaps over 60 s
  - If hot spots are in user code, set LLDB/GDB breakpoints around suspect regions and measure wall time per request
- Output:
  - Histograms of latencies and top stack traces for slow calls (symbolized, no source)
3) File descriptor leak suspected
- Agent plan:
  - eBPF-trace `sys_enter_openat`/`sys_exit_close` with per-PID counters
  - If count(open) - count(close) grows, resolve culprits by uprobes on `open` wrappers; stack-sample to the owning callsite
  - Optionally set a conditional breakpoint on ulimit breach in gdb/MI
- Output:
  - Leak rate over time with culprit function symbols (a counter sketch follows)
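A sketch of the per-PID counter from the plan above, using BCC tracepoints; the stack-sampling and uprobe follow-ups are omitted for brevity:

```python
import time
from bcc import BPF

prog = r'''
BPF_HASH(opens, u32, u64);
BPF_HASH(closes, u32, u64);

TRACEPOINT_PROBE(syscalls, sys_enter_openat) {
    u32 tgid = bpf_get_current_pid_tgid() >> 32;
    opens.increment(tgid);
    return 0;
}

TRACEPOINT_PROBE(syscalls, sys_exit_close) {
    u32 tgid = bpf_get_current_pid_tgid() >> 32;
    closes.increment(tgid);
    return 0;
}
'''

b = BPF(text=prog)
while True:
    time.sleep(5)
    # Diff open vs. close counts per process in user space
    closes = {k.value: v.value for k, v in b["closes"].items()}
    for k, v in b["opens"].items():
        delta = v.value - closes.get(k.value, 0)
        if delta > 0:
            print(f"pid {k.value}: opens exceed closes by {delta}")
```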
Handling hard edges
- ASLR and PIE: Debuggers handle this; if you use uprobes, resolve symbol addresses per mapping in `/proc/<pid>/maps` (see the sketch after this list).
- Stripped binaries: Provide `.gnu_debuglink` to external DWARF or a PDB symbol server.
- Optimized code: Variable locations may be optimized away. Prefer watchpoints on addresses and reason with disassembly. Build a special debug flavor if needed.
- Security/capabilities:
  - Linux: `ptrace_scope` must allow attach; containers may need the `SYS_PTRACE` capability
  - eBPF: kernel lockdown modes can prohibit kprobes; use dedicated debug hosts or carefully scoped privileges
- Expression evaluation hazards: Disable JIT-based evaluation by default. Use DWARF-only expression evaluation where possible.
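For the ASLR/PIE item above, a sketch of the rebase arithmetic; note that production uprobe tooling attaches by file offset, so this is illustration, not a drop-in:

```python
def module_base(pid: int, module_path: str) -> int:
    """Lowest address at which module_path is mapped into the process."""
    base = None
    with open(f"/proc/{pid}/maps") as maps:
        for line in maps:
            fields = line.split()
            if len(fields) >= 6 and fields[-1] == module_path:
                start = int(fields[0].split("-")[0], 16)
                base = start if base is None else min(base, start)
    if base is None:
        raise LookupError(f"{module_path} is not mapped in pid {pid}")
    return base

def runtime_address(pid: int, module_path: str, symbol_st_value: int) -> int:
    """Rebase a PIE symbol: runtime address = mapping base + link-time value
    (st_value from the ELF symbol table, e.g. via `nm`)."""
    return module_base(pid, module_path) + symbol_st_value
```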
Choosing your control plane: DAP vs direct MI/SB/DBGENG
- DAP (Debug Adapter Protocol):
  - Pros: Widely adopted; JSON-based; good for basic stepping and breakpoints
  - Cons: Limited low-level memory/register control; no eBPF or record–replay semantics
- Direct GDB/MI and LLDB SB:
  - Pros: Full control; deterministic; easier to extend
  - Cons: You must parse/normalize multiple protocols
- Pragmatic hybrid: Use DAP for general stepping and per-language adapters (JVMTI, .NET SOS, Node) and augment with direct MI/SB for native deep-dives and your own eBPF/rr actions.
Implementation blueprint: the Debug Gateway
Design a single process that:
- Spawns `gdb --interpreter=mi2` or links LLDB’s SB API in-process
- Manages a pool of eBPF programs (bpftrace on demand or precompiled libbpf CO-RE probes)
- Wraps rr and TTD as first-class actions
- Exposes JSON-RPC over a Unix socket with these endpoints:
  - `start_session`, `attach`, `set_breakpoint`, `continue`, `step`, `eval`, `read_memory`, `backtrace`, `threads`, `disassemble`, `watchpoint`
  - `trace_syscalls`, `trace_uprobes`, `trace_sched`
  - `record_start`, `record_stop`, `replay_start`, `replay_cmd`
- Injects a policy layer that strips source lines and redacts memory
- Emits structured events and spans (OpenTelemetry is fine) for observability
Example JSON call to set a breakpoint and continue:
json{ "tool": "gdb-mi", "action": "set_breakpoint", "args": {"location": {"function": "foo::ProcessPayment"}}, "policy": {"allow_source_lines": false} }
Response (normalized):
json{ "outcome": "ok", "breakpoint": { "id": 1, "verified": true, "locations": [{"function": "foo::ProcessPayment", "file": null, "line": null, "address": "0x7f..."}] } }
Then:
json{ "tool": "gdb-mi", "action": "continue", "args": {}, "policy": {"max_runtime_ms": 15000} }
With a stop event:
json{ "outcome": "ok", "stopped": { "reason": "breakpoint-hit", "frame": { "function": "foo::ProcessPayment", "address": "0x7f...", "module": "payment_worker", "inlined": false }, "thread": 7 } }
CI/CD and developer workflow integration
- Pre-build symbol artifacts; publish to a private symbol server (S3/GCS + index)
- Bundle rr/TTD traces as CI artifacts when failures occur; gate access
- Provide a "Reproduce in Container" button that pulls the exact binary + symbol + rr trace; no source transfer required
- VS Code extension: forward gateway endpoints to the editor; show structured results rather than raw terminals
Testing your agent like a systems engineer
- Golden transcripts: For each tool action, record a canonical request/response pair and assert stability across versions of gdb/lldb/bpftrace (see the test sketch after this list).
- Chaos inputs: Run rr record under CPU contention and verify determinism on replay.
- Policy fuzzing: Attempt to coerce the agent into returning source lines or raw buffers; ensure the gate holds.
- Performance SLOs: Tracing overhead < 1% in normal mode, < 5% in diagnostic mode. eBPF programs must pass verifier quickly and unload cleanly.
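A sketch of a golden-transcript test in pytest form; the `gateway` fixture and the fixture-file layout are assumptions:

```python
import json
import pathlib

GOLDEN = pathlib.Path("tests/golden")

def scrub(msg):
    """Drop fields that legitimately vary between runs (addresses)."""
    for loc in msg.get("breakpoint", {}).get("locations", []):
        loc.pop("address", None)
    return msg

def test_set_breakpoint_transcript(gateway):
    request = json.loads((GOLDEN / "set_breakpoint.request.json").read_text())
    expected = json.loads((GOLDEN / "set_breakpoint.response.json").read_text())
    assert scrub(gateway.send_tool_call(request)) == scrub(expected)
```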
References and further reading
- rr: https://rr-project.org/ and Pernosco (hosted rr traces): https://pernos.co/
- GDB/MI: https://sourceware.org/gdb/onlinedocs/gdb/GDB_002fMI.html
- LLDB SB API: https://lldb.llvm.org/python_reference/
- Debug Adapter Protocol: https://microsoft.github.io/debug-adapter-protocol/
- eBPF and bpftrace: https://bpftrace.org/ and BCC: https://github.com/iovisor/bcc
- Windows Time Travel Debugging: https://learn.microsoft.com/windows-hardware/drivers/debugger/time-travel-debugging-overview
- DWARF: http://dwarfstd.org/
Roadmap: beyond native
- Managed runtimes: hook JVMTI (Java), .NET SOS/CLRMD, Python’s `faulthandler` and `py-spy`, and Node’s inspector for a unified view.
- GPUs: integrate `cuda-gdb` and ROCm tools for kernels.
- Distributed repro: capture per-service rr traces plus a network I/O harness to replay a multi-process incident.
- Security attestation: run the gateway in a sandbox (gVisor/seccomp) with auditable policies.
Conclusion
Stop guessing. Your AI can be a real debugger if you let it do real work. By wiring LLMs to GDB/LLDB, symbol servers, eBPF tracers, and record–replay systems, you can produce explanations grounded in memory, registers, and event timelines. You can reproduce flaky bugs deterministically. You can respect source privacy by default.
You don’t need omniscience—just good instruments and a disciplined agent. The blueprint above is enough to go from a chatty assistant to a dependable debugging copilot that speaks the language of debuggers, not just Stack Overflow.
