Stop Guessing: Turn Your Code Debugging AI into a Real Debugger with GDB, LLDB, and eBPF
Most "AI code debuggers" don’t debug. They speculate. They explain hypothetical root causes based on pattern-matching error messages or stack traces, but they rarely set a breakpoint, inspect a register, or deterministically replay a flaky crash. If you want more than educated guesses, you must wire your LLM to the tools practitioners actually use: GDB/LLDB, DWARF/PDB symbols, system tracers, and record–replay.
This article is a practical blueprint for turning an LLM into a genuine, instrumented debugging agent that:
- Sets breakpoints, steps, reads registers, and inspects memory
- Resolves symbols from DWARF/PDB/dSYM without shipping your source code
- Traces syscalls, user-level probes, and kernel events with eBPF
- Reproduces flaky bugs on demand with record–replay (rr, TTD)
- Operates under strict data minimization and redaction policies so source and secrets don’t leak
We’ll cover the architecture, the protocols, the build flags, the hard edges (optimized code, ASLR, capabilities), and end with concrete code snippets for GDB/MI, LLDB’s SB API, bpftrace/BCC, and rr.
The mandate: Don’t guess—instrument
Developers do not need another guess engine. They need an autonomous operator that can:
- Plan a debugging session: define hypotheses, choose tools, and set exit criteria
- Execute structured tool calls and parse machine-readable results
- Iterate based on evidence (registers, stack, memory, traces), not vibes
- Produce a deterministic replay artifact when the bug is intermittent
- Keep sensitive artifacts local (source code, core dumps, secrets) unless explicitly allowed
That requires an architecture that puts the LLM behind a strict "tool gateway". Natural language in; structured actions out; instrumented facts back in.
High-level architecture
Think of five cooperating services:
- Orchestrator (LLM Agent): Plans and reasons. Converts user intent into a sequence of tool calls.
- Debug Gateway: A stateless microservice exposing a small set of safe, auditable operations (breakpoint, step, memory read, trace, record/replay). It talks to GDB/LLDB, eBPF, rr/TTD, etc.
- Symbol Service: Resolves addresses and stack frames to human-meaningful names using DWARF/PDB/dSYM. It never exposes source text by default—only symbol metadata.
- Redaction/Policy Engine: Enforces data minimization (no source lines, no raw strings from memory) unless explicitly permitted.
- Storage/Trace Catalog: Stores rr/TTD traces, eBPF ring-buffer extracts, debugger transcripts, and anonymized session summaries.
Data flow:
- User: "Flaky crash in payment worker. Make it deterministic and show the race."
- LLM: Plan → "record run under rr", "replay and break on SIGSEGV", "inspect threads and shared state", "trace syscalls to correlate timing".
- Gateway: Executes tool actions; returns structured results (JSON) with symbolized stack traces and register/memory excerpts (redacted where needed).
- LLM: Updates hypothesis; performs targeted watchpoints; produces a minimal, deterministic repro script and a report.
Key principle: The LLM never gets raw source code unless the user opts in. Tools operate on binaries + symbols and runtime state.
Prepare your builds: symbols without source
Native debuggers are only as good as the debug info you ship with them. You can support a source-privacy-first workflow by shipping symbol files separately and keeping source lines private by default.
Recommended settings by ecosystem:
- Clang/GCC (C/C++, and Rust via LLVM):
  - Compile with debug info: `-g3 -fno-omit-frame-pointer`
  - Consider `-gsplit-dwarf` to keep debug info separate from binaries
  - For optimized-but-debuggable builds: `-g -O2 -fno-omit-frame-pointer` (plus `-fvar-tracking-assignments` on GCC)
  - Link with `-Wl,--build-id` and use `.gnu_debuglink` so symbol servers can map binaries to separate DWARF
- Rust:
  - Set `debug = true` in your Cargo profile and control placement with `split-debuginfo = "unpacked"` (or `"packed"`); on recent toolchains the flag form is `-C split-debuginfo=...`
- Go:
  - Build with `-gcflags=all='-N -l'` for debugging (disables inlining/optimizations)
  - Add `-ldflags='-compressdwarf=false'` if you need easier DWARF consumption
- Swift/Objective-C (Apple):
  - Produce dSYM bundles via `dsymutil`; distribute dSYMs separately from the binary
- Windows (MSVC/Clang-cl):
  - Compile with `/Zi` or `/Z7` and link with `/DEBUG:FULL` to produce PDBs
  - Configure a symbol server: `setx _NT_SYMBOL_PATH "SRV*c:\symcache*https://msdl.microsoft.com/download/symbols"`
The key is a symbol store that maps build-id or PDB GUID to debug info. Your LLM agent only needs function/variable names, addresses, types, and line mappings—no source text. Keep source on-prem and only expose it via a guarded capability.
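On Linux, the build-id mapping is simple enough to sketch. The following assumes pyelftools is installed and mirrors the conventional `/usr/lib/debug/.build-id/xx/rest.debug` layout; treat it as a starting point, not a complete symbol server client.

```python
# A minimal sketch of build-id based debug-info lookup, assuming pyelftools.
from pathlib import Path
from elftools.elf.elffile import ELFFile

def read_build_id(binary_path: str) -> str:
    """Extract the GNU build-id note from an ELF binary as a hex string."""
    with open(binary_path, 'rb') as f:
        elf = ELFFile(f)
        for section in elf.iter_sections():
            if not section.name.startswith('.note'):
                continue
            for note in section.iter_notes():
                if note['n_type'] == 'NT_GNU_BUILD_ID':
                    return note['n_desc']
    raise LookupError(f'no build-id note in {binary_path}')

def debug_file_for(binary_path: str,
                   debug_root: str = '/usr/lib/debug/.build-id') -> Path:
    """Map a binary to its detached DWARF file via the .build-id layout."""
    build_id = read_build_id(binary_path)
    # Layout: <root>/<first two hex chars>/<remaining chars>.debug
    path = Path(debug_root) / build_id[:2] / f'{build_id[2:]}.debug'
    if not path.exists():
        raise FileNotFoundError(f'no debug info for build-id {build_id}')
    return path
```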
A common command schema for tools
You need a narrow, stable interface that an LLM can reliably call. Use JSON with a strict schema and deterministic responses. The Debug Adapter Protocol (DAP) is a good basis for stepping/breakpoints, but you’ll likely need extensions for low-level memory/registers, eBPF, and record–replay.
Example minimal schema for a tool call:
json{ "tool": "gdb-mi|lldb|ebpf|rr|ttd", "action": "set_breakpoint|continue|step|eval|read_memory|backtrace|threads|trace_syscalls|record|replay", "args": { "target": {"pid": 1234}, "location": {"function": "foo::bar", "file": "foo.cc", "line": 128}, "expression": "myVar + 1", "address": "0x7fff...", "length": 64, "filters": {"syscalls": ["openat", "connect"], "duration_us_gt": 1000} }, "policy": { "redact_strings": true, "max_bytes": 32, "allow_source_lines": false } }
Responses should include:
- A normalized outcome (`ok|error|timeout`)
- Minimal structured data: stack frames with function names and inlining markers; registers as name/value pairs; memory as hex plus an entropy estimate; traces as event tuples
- Explicit redaction markers and a count of omitted bytes/lines (modeled in the sketch below)
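To keep the executor honest, it helps to pin these shapes down as types. A sketch in Python, mirroring the schema above (the names are illustrative, not a fixed wire format):

```python
# Illustrative typed shapes for normalized tool responses.
from typing import Literal, Optional, TypedDict

class Frame(TypedDict):
    function: str
    address: str            # hex string, e.g. "0x7f..."
    module: str
    inlined: bool

class MemoryExcerpt(TypedDict):
    hex: Optional[str]      # None when fully redacted
    length: int
    entropy_bits_per_byte: float
    omitted_bytes: int      # explicit redaction accounting

class ToolResponse(TypedDict):
    outcome: Literal["ok", "error", "timeout"]
    frames: list[Frame]
    registers: dict[str, str]
    memory: Optional[MemoryExcerpt]
```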
GDB via MI: battle-tested and scriptable
GDB’s Machine Interface (MI2) is stable, documented, and ideal for programmatic control. Run GDB as a subprocess with --interpreter=mi2 and drive it through an existing wrapper such as pygdbmi, or write your own minimal parser.
Python example using pygdbmi to set a breakpoint, run, and inspect state:
```python
from pygdbmi.gdbcontroller import GdbController

# Launch gdb in MI mode
gdbmi = GdbController(command=['gdb', '--interpreter=mi2'])

def send(cmd):
    return gdbmi.write(cmd, timeout_sec=5)

# Load the target
send('-file-exec-and-symbols ./bin/payment_worker')

# Optional: attach to a running PID
# send('-target-attach 12345')

# Set a breakpoint by function name
send('-break-insert foo::ProcessPayment')

# Run until breakpoint
send('-exec-run')

# When stopped, get backtrace
bt = send('-stack-list-frames')
print('backtrace:', bt)

# Read registers in the current frame
regs = send('-data-list-register-values x')
print('registers:', regs)

# Evaluate an expression safely
val = send('-data-evaluate-expression myStruct.field')
print('value:', val)

# Read memory (bounded length)
mem = send('-data-read-memory-bytes 0x7fffffffe000 64')
print('mem bytes:', mem)

# Continue execution
send('-exec-continue')
```
Notes:
- Use `-gdb-set pagination off` and `-gdb-set print elements 200` for predictable outputs.
- Map MI async records (`*stopped`, `=thread-created`) to structured events in your gateway; a sketch follows below.
- To avoid source leakage, don’t call `-symbol-list-lines` or fetch source files unless the user opts in. Symbolize frames from DWARF without reading file contents.
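A sketch of that mapping, using the record shape pygdbmi already produces (`type`, `message`, `payload`); the event dictionaries themselves are our own convention:

```python
# Normalize pygdbmi's parsed MI records into gateway events.
def normalize_mi_records(records):
    events = []
    for rec in records:
        if rec['type'] == 'notify' and rec['message'] == 'stopped':
            payload = rec.get('payload') or {}
            frame = payload.get('frame', {})
            events.append({
                'event': 'stopped',
                'reason': payload.get('reason'),   # e.g. 'breakpoint-hit'
                'function': frame.get('func'),
                'address': frame.get('addr'),
                'thread': payload.get('thread-id'),
            })
        elif rec['type'] == 'notify' and rec['message'] == 'thread-created':
            events.append({'event': 'thread-created',
                           'thread': (rec.get('payload') or {}).get('id')})
    return events

# Usage: events = normalize_mi_records(send('-exec-continue'))
```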
LLDB via SB API: powerful and Pythonic
LLDB’s SB API exposes finer-grained control than its command interpreter and is pleasant in Python. Use it for macOS and for projects where LLVM/Clang’s debug info shines (Swift/Objective-C/C++).
Example: set breakpoint, continue, inspect a variable’s type and value.
```python
import lldb

# Create a debugger without distorting stdout
lldb.SBDebugger.Initialize()
dbg = lldb.SBDebugger.Create()
dbg.SetAsync(False)

target = dbg.CreateTargetWithFileAndArch('./bin/payment_worker', None)
if not target:
    raise RuntimeError('Failed to create target')

bp = target.BreakpointCreateByName('foo::ProcessPayment')

process = target.LaunchSimple(None, None, '.')
if process.GetState() != lldb.eStateStopped:
    raise RuntimeError('Did not stop at breakpoint')

thread = process.GetSelectedThread()
frame = thread.GetSelectedFrame()

# Backtrace frames (symbolized, no source needed)
for f in thread:
    print(f.GetFunctionName(), hex(f.GetPC()))

# Registers
regs = frame.GetRegisters()
for regset in regs:
    for reg in regset:
        print(reg.GetName(), reg.GetValue())

# Evaluate expression without running code (DWARF-only eval if possible)
opts = lldb.SBExpressionOptions()
opts.SetLanguage(lldb.eLanguageTypeC_plus_plus)
opts.SetCoerceResultToId(False)
val = frame.EvaluateExpression('myStruct.field', opts)
print(val.GetTypeName(), val.GetValue())

process.Continue()
```
LLDB nuances:
- For symbol-only workflows on macOS, ship dSYM bundles. LLDB can symbolize without source.
- `SBExpressionOptions.SetTryAllThreads` and `SetAutoApplyFixIts` can change behavior; keep defaults predictable.
- Restrict `expression` evaluation if you need to prevent JIT/execution for security; prefer DWARF-based evaluation.
Windows: PDBs, cdb/WinDbg, and Time Travel Debugging (TTD)
On Windows, you can script cdb/windbg through dbgeng or use the VS Code vsdbg adapter (DAP). For record–replay, Time Travel Debugging (TTD) provides deterministic playback of native apps.
Tips:
- Use a symbol server and cache (PDBs can be large): `set _NT_SYMBOL_PATH=SRV*c:\symcache*https://msdl.microsoft.com/download/symbols` plus your private SRV path.
- The dbgeng API is C-based; wrappers exist in Python/PowerShell. If you need simpler automation, drive `cdb` with command scripts and parse `.printf`/`.echo` outputs.
- TTD CLI usage (simplified example):
```powershell
# Record
ttd.exe -out trace_run.ttd -attach 12345
# Or launch
ttd.exe -out trace_run.ttd -launch .\bin\payment_worker.exe --args "..."

# Replay inside WinDbg
windbg.exe -ttd -z trace_run.ttd
# Then use the TTD extension commands to time-travel and inspect state
```
If you want a unified flow across platforms, prefer the Debug Adapter Protocol for basic stepping and layer in Windows-specific modules for TTD and PDB symbolization.
eBPF: observe without pausing the world
Breakpoints are great, but the world is asynchronous. eBPF lets you observe syscalls, user-level functions (uprobes), and kernel scheduling without stopping the process. It’s ideal for answering questions like: "Which requests spend >5 ms in SQLite?" or "Which file descriptor is leaking?"
Two practical approaches:
- bpftrace for quick one-liners and histograms
- BCC/libbpf for production-grade, programmatic tracing
Example: count openat activity for a target PID with bpftrace:
```bash
sudo bpftrace -e '
tracepoint:syscalls:sys_enter_openat /pid == 12345/ {
  @files[str(args->filename)] = count();
}
interval:s:5 { print(@files); clear(@files); }'
```
Example: probe function entry/exit to measure latency with bpftrace uprobes:
```bash
# Attach to user function symbols by name (requires symbols)
sudo bpftrace -e '
uprobe:/usr/local/bin/payment_worker:foo::ProcessPayment {
  @ts[tid] = nsecs;
}
uretprobe:/usr/local/bin/payment_worker:foo::ProcessPayment /@ts[tid]/ {
  @lat = hist((nsecs - @ts[tid]) / 1000);
  delete(@ts[tid]);
}'
```
Programmatic BCC in Python to flag slow connect syscalls for a target PID:
```python
from bcc import BPF
import ctypes as ct

TARGET_PID = 12345  # filter to the worker process

prog = r'''
#include <uapi/linux/ptrace.h>

BPF_HASH(start, u32, u64);
BPF_PERF_OUTPUT(events);

int trace_enter(struct pt_regs *ctx) {
    u64 id = bpf_get_current_pid_tgid();
    u32 tgid = id >> 32;
    if (tgid != TARGET_PID)
        return 0;
    u32 tid = (u32)id;
    u64 ts = bpf_ktime_get_ns();
    start.update(&tid, &ts);
    return 0;
}

int trace_exit(struct pt_regs *ctx) {
    u32 tid = (u32)bpf_get_current_pid_tgid();
    u64 *tsp = start.lookup(&tid);
    if (!tsp)
        return 0;
    u64 delta = bpf_ktime_get_ns() - *tsp;
    start.delete(&tid);
    if (delta > 1000000) {  // >1ms
        events.perf_submit(ctx, &delta, sizeof(delta));
    }
    return 0;
}
'''

b = BPF(text=prog.replace('TARGET_PID', str(TARGET_PID)))
# get_syscall_fnname resolves the kernel's name for the syscall entry point
b.attach_kprobe(event=b.get_syscall_fnname("connect"), fn_name="trace_enter")
b.attach_kretprobe(event=b.get_syscall_fnname("connect"), fn_name="trace_exit")

def print_event(cpu, data, size):
    delta = ct.cast(data, ct.POINTER(ct.c_ulonglong)).contents.value
    print("slow connect us:", delta // 1000)

b["events"].open_perf_buffer(print_event)
while True:
    b.perf_buffer_poll()
```
Security and portability notes:
- Kernel must allow eBPF features; on some distros you need CAP_BPF/CAP_SYS_ADMIN or to run inside a privileged container.
- Use CO-RE (libbpf) for portable programs across kernel versions.
- Redaction: prefer event metadata (durations, FDs, PIDs) over raw buffers/strings. Never dump arbitrary process memory from eBPF to the LLM.
Deterministic repro with record–replay
Some bugs bite only under weird timing. You need time travel.
- Linux: rr records user-space execution with hardware performance counters and replays deterministically. It supports stepping, reverse-continue, reverse-next, and integrates with gdb.
- Windows: Time Travel Debugging (TTD) provides similar time-travel inside WinDbg.
rr workflow:
```bash
# Record a run (traces are stored under ~/.local/share/rr by default)
rr record -- ./bin/payment_worker --seed 123 --job-id 42

# List traces
rr ls

# Replay with gdb
rr replay
# Now in a gdb wrapped by rr; reverse commands are available:
# (gdb) reverse-continue
# (gdb) reverse-next
# (gdb) watch myVar
```
Automate with a gateway that spawns rr record, captures the trace id, and exposes replay as a virtual target (a sketch follows the list below). When the LLM says "reproduce the crash and stop at the faulting instruction," your gateway:
- Launches under rr until exit or signal
- Extracts exit status and signal
- Replays and stops at the signal
- Returns a backtrace, registers, and disassembly around the PC
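A minimal sketch of that flow, assuming rr and gdb are on PATH and that rr propagates the recorded process's exit/signal status; paths and parsing are illustrative:

```python
import subprocess

def record_until_signal(cmd, max_runs=3):
    """Run the target under rr until a run dies on a signal."""
    for _ in range(max_runs):
        result = subprocess.run(['rr', 'record', '--'] + cmd)
        if result.returncode < 0:  # killed by signal N => returncode -N
            return -result.returncode
    return None

def replay_backtrace():
    """Replay the latest trace; batch gdb commands run to the fatal signal."""
    out = subprocess.run(
        ['rr', 'replay', '--',          # args after -- go to the debugger
         '-batch',
         '-ex', 'continue',             # run forward to the signal
         '-ex', 'backtrace',
         '-ex', 'info registers'],
        capture_output=True, text=True)
    return out.stdout

sig = record_until_signal(['./bin/payment_worker', '--seed', '123'])
if sig is not None:
    print(f'captured signal {sig}; replaying...')
    print(replay_backtrace())
```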
Windows TTD parallels this but requires WinDbg automation. Keep traces local; they may contain secrets.
Glue it together: an agent that plans and executes
Your LLM needs to be more than a chatty interface. Give it a tool-using backbone; a minimal executor sketch follows this list:
- Planner: Converts goals into a sequence of tool calls with pre/postconditions. E.g., "If process crashes → record with rr; if hangs → collect eBPF scheduler traces; else → attach gdb and set breakpoints."
- Schema-aware Executor: Validates JSON tool calls, enforces timeouts, and retries idempotently. No free-form shell.
- State Summarizer: After each step, distill results into a compact, privacy-preserving summary (e.g., top 3 frames with function names and offsets; histograms, not raw samples).
- Memory of Facts: A lightweight vector index of symbols, types, and recent observations. No source. Function signatures and docstrings (optional) are okay.
- Policy Gate: Enforces "no source, no raw strings" unless the user enables a capability during the session.
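A sketch of the executor core, assuming the `jsonschema` package and a dispatch table keyed by (tool, action); both are illustrative choices, not a fixed design:

```python
import jsonschema

# Schema fragment mirroring the tool-call format shown earlier
TOOL_CALL_SCHEMA = {
    "type": "object",
    "required": ["tool", "action", "args", "policy"],
    "properties": {
        "tool": {"enum": ["gdb-mi", "lldb", "ebpf", "rr", "ttd"]},
        "action": {"type": "string"},
        "args": {"type": "object"},
        "policy": {"type": "object"},
    },
    "additionalProperties": False,
}

def execute(call, dispatch, timeout_s=15):
    """Validate and run one tool call; never pass raw strings to a shell."""
    try:
        jsonschema.validate(call, TOOL_CALL_SCHEMA)
    except jsonschema.ValidationError as e:
        return {"outcome": "error", "error": f"schema: {e.message}"}
    handler = dispatch.get((call["tool"], call["action"]))
    if handler is None:
        return {"outcome": "error", "error": "unknown tool/action"}
    try:
        return {"outcome": "ok", "data": handler(call["args"], timeout=timeout_s)}
    except TimeoutError:
        return {"outcome": "timeout"}
```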
A typical run for "flaky crash in payment worker" might be:
- Plan: try rr record up to N attempts to capture the crash.
- Once captured, rr replay → stop at signal → collect backtrace and thread list.
- Look for data races: if multiple threads in shared code, set watchpoints on suspect addresses; reverse-step to first write.
- Use eBPF to correlate syscalls and scheduling pressure, or enable rr’s event logs to show interleaving.
- Produce a minimal script plus deterministic trace hash that reproduces the crash for CI triage.
Privacy-first design: debug without leaking source
You can get 80–90% of debugging value without ever sending source to the model. Tactics:
- Symbol-only RAG: Index function names, types, address ranges, and linkage (who calls whom) from DWARF/PDB. Don’t ingest file contents or comments.
- Redacted memory reads: Return lengths, entropy estimates, and hashes instead of raw strings. If necessary, show constant-size previews with ASCII control-character masking (see the sketch after this list).
- Source-on-demand: Offer a capability flag like `allow_source_lines: true` with a user prompt that shows exactly which file/line will be revealed and why.
- Local-only heavy artifacts: Keep core dumps, rr/TTD traces, and raw eBPF buffers on the server. The LLM gets summaries and indexes.
- Sensitive symbol filtering: Redact known secret-bearing globals (e.g., `kSecret`, `api_key`) via allowlists/denylists.
- Logging and attestation: Every tool call is logged with purpose and redaction mode. Provide a session transcript to the user.
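A sketch of the memory-read redactor described above; the response fields mirror the redaction accounting from the tool schema, but their names are illustrative:

```python
import hashlib
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Bits per byte; ~8.0 looks like ciphertext/compressed, ~4.5 like text."""
    if not data:
        return 0.0
    counts = Counter(data)
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def redact_memory(data: bytes, preview_bytes: int = 0) -> dict:
    """Summarize a memory read without exposing raw contents."""
    out = {
        "length": len(data),
        "entropy_bits_per_byte": round(shannon_entropy(data), 2),
        "sha256": hashlib.sha256(data).hexdigest(),
    }
    if preview_bytes:
        # Mask control characters and non-ASCII in a fixed-size preview
        preview = bytes(b if 0x20 <= b < 0x7F else 0x2E  # '.'
                        for b in data[:preview_bytes])
        out["preview"] = preview.decode("ascii")
        out["omitted_bytes"] = max(0, len(data) - preview_bytes)
    return out
```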
If you host the model on-prem or inside the developer’s machine, you can loosen some restrictions. Default locked-down is still a good baseline.
Concrete end-to-end scenarios
1) Intermittent crash in a multithreaded service
- Agent plan:
  - Build with debug symbols; ensure separate DWARF is available
  - Run under `rr record` until the crash is captured (max 3 runs)
  - `rr replay` and stop at SIGSEGV
  - Collect backtraces for all threads and inspect registers
  - Identify the null deref’s origin via a reverse watchpoint on the pointer
  - Export a minimal reproducer script and the rr trace id
- Key commands:
  - rr record/replay; gdb MI: `-exec-continue`, `-data-read-memory-bytes`, `-stack-list-frames`
- Report:
  - Deterministic steps, function names, offsets, and the race explanation
  - Link to the local trace artifact
2) Performance regression in I/O path
- Agent plan:
  - Attach eBPF uprobes to `read`, `write`, and `fsync` wrappers in the binary
  - Attach tracepoints to `block:block_rq_issue`/`block:block_rq_complete`
  - Aggregate latencies and show heatmaps over 60 s
  - If hot spots are in user code, set LLDB/GDB breakpoints around suspect regions and measure wall time per request
- Output:
  - Histograms of latencies and top stack traces for slow calls (symbolized, no source)
3) File descriptor leak suspected
- Agent plan:
  - eBPF-trace `sys_enter_openat`/`sys_exit_close` with per-PID counters
  - If count(open) - count(close) grows, resolve culprits by uprobes on `open` wrappers; stack-sample to the owning callsite
  - Optionally set a conditional breakpoint on ulimit breach in gdb/MI
- Output:
  - Leak rate over time with culprit function symbols (a counter sketch follows)
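A sketch of the per-PID counter from the plan above, using BCC tracepoints; the stack-sampling and uprobe follow-ups are omitted for brevity:

```python
import time
from bcc import BPF

prog = r'''
BPF_HASH(opens, u32, u64);
BPF_HASH(closes, u32, u64);

TRACEPOINT_PROBE(syscalls, sys_enter_openat) {
    u32 tgid = bpf_get_current_pid_tgid() >> 32;
    opens.increment(tgid);
    return 0;
}

TRACEPOINT_PROBE(syscalls, sys_exit_close) {
    u32 tgid = bpf_get_current_pid_tgid() >> 32;
    closes.increment(tgid);
    return 0;
}
'''

b = BPF(text=prog)
while True:
    time.sleep(5)
    # Diff open vs. close counts per process in user space
    closes = {k.value: v.value for k, v in b["closes"].items()}
    for k, v in b["opens"].items():
        delta = v.value - closes.get(k.value, 0)
        if delta > 0:
            print(f"pid {k.value}: opens exceed closes by {delta}")
```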
Handling hard edges
- ASLR and PIE: Debuggers handle this; if you use uprobes, resolve symbol addresses per mapping in `/proc/<pid>/maps` (see the sketch after this list).
- Stripped binaries: Provide `.gnu_debuglink` to external DWARF or a PDB symbol server.
- Optimized code: Variable locations may be optimized away. Prefer watchpoints on addresses and reason with disassembly. Build a special debug flavor if needed.
- Security/capabilities:
  - Linux: `ptrace_scope` must allow attach; containers may need the `SYS_PTRACE` capability
  - eBPF: kernel lockdown modes can prohibit kprobes; use dedicated debug hosts or carefully scoped privileges
- Expression evaluation hazards: Disable JIT-based evaluation by default. Use DWARF-only expression evaluation where possible.
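For the ASLR/PIE item above, a sketch of the rebase arithmetic; note that production uprobe tooling attaches by file offset, so this is illustration, not a drop-in:

```python
def module_base(pid: int, module_path: str) -> int:
    """Lowest address at which module_path is mapped into the process."""
    base = None
    with open(f"/proc/{pid}/maps") as maps:
        for line in maps:
            fields = line.split()
            if len(fields) >= 6 and fields[-1] == module_path:
                start = int(fields[0].split("-")[0], 16)
                base = start if base is None else min(base, start)
    if base is None:
        raise LookupError(f"{module_path} is not mapped in pid {pid}")
    return base

def runtime_address(pid: int, module_path: str, symbol_st_value: int) -> int:
    """Rebase a PIE symbol: runtime address = mapping base + link-time value
    (st_value from the ELF symbol table, e.g. via `nm`)."""
    return module_base(pid, module_path) + symbol_st_value
```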
Choosing your control plane: DAP vs direct MI/SB/DBGENG
- DAP (Debug Adapter Protocol):
  - Pros: Widely adopted; JSON-based; good for basic stepping and breakpoints
  - Cons: Limited low-level memory/register control; no eBPF or record–replay semantics
- Direct GDB/MI and LLDB SB:
  - Pros: Full control; deterministic; easier to extend
  - Cons: You must parse/normalize multiple protocols
- Pragmatic hybrid: Use DAP for general stepping and per-language adapters (JVMTI, .NET SOS, Node) and augment with direct MI/SB for native deep-dives and your own eBPF/rr actions.
Implementation blueprint: the Debug Gateway
Design a single process that:
- Spawns `gdb --interpreter=mi2` or links LLDB’s SB API in-process
- Manages a pool of eBPF programs (bpftrace on demand or precompiled libbpf CO-RE probes)
- Wraps rr and TTD as first-class actions
- Exposes JSON-RPC over a Unix socket with these endpoints:
  - `start_session`, `attach`, `set_breakpoint`, `continue`, `step`, `eval`, `read_memory`, `backtrace`, `threads`, `disassemble`, `watchpoint`
  - `trace_syscalls`, `trace_uprobes`, `trace_sched`
  - `record_start`, `record_stop`, `replay_start`, `replay_cmd`
- Injects a policy layer that strips source lines and redacts memory
- Emits structured events and spans (OpenTelemetry is fine) for observability
Example JSON call to set a breakpoint and continue:
json{ "tool": "gdb-mi", "action": "set_breakpoint", "args": {"location": {"function": "foo::ProcessPayment"}}, "policy": {"allow_source_lines": false} }
Response (normalized):
json{ "outcome": "ok", "breakpoint": { "id": 1, "verified": true, "locations": [{"function": "foo::ProcessPayment", "file": null, "line": null, "address": "0x7f..."}] } }
Then:
json{ "tool": "gdb-mi", "action": "continue", "args": {}, "policy": {"max_runtime_ms": 15000} }
With a stop event:
json{ "outcome": "ok", "stopped": { "reason": "breakpoint-hit", "frame": { "function": "foo::ProcessPayment", "address": "0x7f...", "module": "payment_worker", "inlined": false }, "thread": 7 } }
CI/CD and developer workflow integration
- Pre-build symbol artifacts; publish to a private symbol server (S3/GCS + index)
- Bundle rr/TTD traces as CI artifacts when failures occur; gate access
- Provide a "Reproduce in Container" button that pulls the exact binary + symbol + rr trace; no source transfer required
- VS Code extension: forward gateway endpoints to the editor; show structured results rather than raw terminals
Testing your agent like a systems engineer
- Golden transcripts: For each tool action, record a canonical request/response pair and assert stability across versions of gdb/lldb/bpftrace (see the test sketch after this list).
- Chaos inputs: Run rr record under CPU contention and verify determinism on replay.
- Policy fuzzing: Attempt to coerce the agent into returning source lines or raw buffers; ensure the gate holds.
- Performance SLOs: Tracing overhead < 1% in normal mode, < 5% in diagnostic mode. eBPF programs must pass verifier quickly and unload cleanly.
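A sketch of a golden-transcript test in pytest form; the `gateway` fixture and the fixture-file layout are assumptions:

```python
import json
import pathlib

GOLDEN = pathlib.Path("tests/golden")

def scrub(msg):
    """Drop fields that legitimately vary between runs (addresses)."""
    for loc in msg.get("breakpoint", {}).get("locations", []):
        loc.pop("address", None)
    return msg

def test_set_breakpoint_transcript(gateway):
    request = json.loads((GOLDEN / "set_breakpoint.request.json").read_text())
    expected = json.loads((GOLDEN / "set_breakpoint.response.json").read_text())
    assert scrub(gateway.send_tool_call(request)) == scrub(expected)
```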
References and further reading
- rr: https://rr-project.org/ and Pernosco (hosted rr traces): https://pernos.co/
- GDB/MI: https://sourceware.org/gdb/onlinedocs/gdb/GDB_002fMI.html
- LLDB SB API: https://lldb.llvm.org/python_reference/
- Debug Adapter Protocol: https://microsoft.github.io/debug-adapter-protocol/
- eBPF and bpftrace: https://bpftrace.org/ and BCC: https://github.com/iovisor/bcc
- Windows Time Travel Debugging: https://learn.microsoft.com/windows-hardware/drivers/debugger/time-travel-debugging-overview
- DWARF: http://dwarfstd.org/
Roadmap: beyond native
- Managed runtimes: hook JVMTI (Java), .NET SOS/CLRMD, Python’s `faulthandler` and `py-spy`, and Node’s inspector for a unified view.
- GPUs: integrate `cuda-gdb` and ROCm tools for kernels.
- Distributed repro: capture per-service rr traces plus a network I/O harness to replay a multi-process incident.
- Security attestation: run the gateway in a sandbox (gVisor/seccomp) with auditable policies.
Conclusion
Stop guessing. Your AI can be a real debugger if you let it do real work. By wiring LLMs to GDB/LLDB, symbol servers, eBPF tracers, and record–replay systems, you can produce explanations grounded in memory, registers, and event timelines. You can reproduce flaky bugs deterministically. You can respect source privacy by default.
You don’t need omniscience—just good instruments and a disciplined agent. The blueprint above is enough to go from a chatty assistant to a dependable debugging copilot that speaks the language of debuggers, not just Stack Overflow.
