Don’t Attach a Debugger—Attach eBPF: How to Feed a Debug AI Without Breaking Prod
Modern production systems are allergic to debuggers. Pausing a process to set breakpoints or stepping through code introduces latency spikes, perturbs scheduling, and risks deadlocks in code paths that were never designed to be stopped. That may be acceptable in staging. In prod, it’s malpractice.
The better pattern is surgical, always-on telemetry that captures just enough context to reconstruct failure and performance anomalies off the hot path. Extended Berkeley Packet Filter (eBPF) gives us exactly that: kernel-verified, JIT-compiled programs that run in-kernel or attach to user-space binaries to observe, filter, and export events with minimal overhead. Combine eBPF with a record/replay snapshot pipeline, least-privilege sandboxes, and robust redaction, and you can feed a debug AI rich runtime context without jeopardizing latency SLOs or exposing PII.
This article lays out a production-safe blueprint for building such a system. It’s opinionated, concrete, and focused on what works at scale.
- The premise: Stop attaching debuggers; attach eBPF probes.
- The architecture: Event capture, redaction, snapshot, replay, and AI analysis in a sandbox.
- The payoff: High-fidelity root cause analysis and actionable fixes with low single-digit CPU overhead and zero PII egress.
Why Debuggers Don’t Belong in Prod
Attaching a debugger to a live process changes the world in ways that invalidate the very observation you’re trying to make:
- Stop-the-world pauses. Breakpoints halt threads mid-syscall or in critical sections. This inflates p99s and can cascade retries.
- Probe effect. A debugger alters timing, memory layout, and scheduling. Heisenbugs either disappear or morph.
- Operational risk. Privileged capabilities and ptrace interfere with security hardening; ephemeral containers complicate attachment; transient failures disappear before you can attach.
- Compliance. Pulling live memory or stack traces can exfiltrate secrets and PII.
Production-safe debugging must be asynchronous, non-blocking, selective, and policy-controlled. eBPF was built for this.
A Pragmatic eBPF Primer for Debugging
eBPF allows you to load small, sandboxed programs into the kernel and attach them to events:
- kprobes/kretprobes: Attach to kernel function entry and exit (e.g., syscall implementations).
- tracepoints: Stable ABI for kernel events (e.g., sched:sched_switch, net:net_dev_queue, tcp:tcp_retransmit_skb).
- uprobes/uretprobes: Attach to user-space functions in ELF binaries or to USDT probes (DTrace-style static probes) emitted by runtimes like the JVM, Node.js, and Python.
- Maps: Shared kernel/user-space data structures (hash maps, LRU maps, per-CPU arrays, and ring buffers).
- BTF and CO-RE: BPF Type Format and Compile Once–Run Everywhere allow compiling against type info and running across kernels without rebuilds.
Key properties that make eBPF production-friendly:
- The verifier ensures safety: bounded loops, bounds-checked memory access, no recursion. If a program loads, it cannot crash or hang the kernel.
- It’s fast: JIT to native, per-CPU maps, and zero-copy ring buffers keep overhead low.
- Attach/detach at runtime: No reboot or kernel module hassle; reversible.
The Production-Safe Debug AI Blueprint
The goal is to deliver high-signal runtime context to an LLM-based debug assistant without adding latency or violating privacy. Here’s the architecture:
- Event Capture with eBPF
- Selectively attach to syscalls, tracepoints, and user-level probes for the target workload.
- Filter early (PID namespace, cgroup ID, container labels) to avoid noise.
- Emit structured events to a ring buffer with bounded event sizes.
- Sidecar Ingestion and Policy Engine
- A per-node agent reads ring buffers, enriches with symbols and container metadata, and applies redaction/minimization.
- It batches events and ships them out-of-band to a durable store (Kafka, NATS JetStream, S3) with backpressure handling.
- Redaction and Minimization
- Enforce allow-lists: don’t ship payloads, only metadata unless explicitly needed.
- Hash or tokenize identifiers with salted HMAC/FPE for correlation without disclosure.
- Strip PII via high-speed DLP filters and policy rules (OPA/Rego).
- Snapshot for Record/Replay
- For high-severity anomalies, atomically capture a minimal snapshot: container image digest, overlayfs diff, environment, cgroup limits, and a bounded window of syscall I/O and network transcript.
- Store in content-addressed storage. Defer heavy lifting off the critical path.
- Sandbox Replay
- Rehydrate into a gVisor sandbox or Firecracker microVM, or into a seccomp/AppArmor-limited container with no egress.
- Reproduce the failing scenario deterministically as much as possible.
- LLM Analysis with RAG
- Build a structured "context pack": event timeline, stack summaries, resource metrics, config deltas, and relevant code slices.
- Run the LLM locally or in a controlled environment. Queries are policy-gated and audited.
- Actionable Output
- Summaries: suspected root cause and confidence.
- Hypotheses: competing explanations with evidence.
- Reproduction: self-contained script/harness and replay instructions.
- Patches or configuration changes: targeted diffs.
All of this runs without stopping the world, without scraping secrets, and without causing a customer-visible blip.
Instrumentation: What to Capture and How
You don’t need everything. You need the right things.
Recommended starting probes:
- Syscalls: openat/openat2, connect, accept, recvmsg/sendmsg, read/write, rename/unlink, fsync, futex, nanosleep, clone, execve, mmap/munmap, madvise.
- Tracepoints: sched:sched_switch (blocking hotspots), sched:sched_wakeup (which thread is waking whom), tcp:tcp_retransmit_skb (network health), net:netif_rx (packet ingress), block:block_rq_issue/block_rq_complete (disk latency), and OOM kills (oom:mark_victim, or a kprobe on oom_kill_process).
- Language/runtime USDT:
- JVM: safepoint, gc, jit, class load.
- Node.js: async hooks, UV loop idle/busy, HTTP server events.
- Python: gc, function call counts via uprobe on C extension boundaries.
Filtering:
- Target only the pod/cgroup or container image of interest; this reduces overhead and risk.
- Keep event payloads bounded: truncate large buffers, sample rare payloads.
Symbolization:
- Kernel symbols come from kallsyms and BTF.
- User-space symbols come from ELF and optionally DWARF for line numbers. When DWARF isn’t present, function-level symbols are often sufficient.
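For kernel addresses, here is a minimal resolver sketch that parses /proc/kallsyms and caches the result; reading real addresses requires root or kernel.kptr_restrict=0, and user-space symbolization would instead walk ELF symbol tables.

```python
import bisect
import functools

@functools.lru_cache(maxsize=1)
def _kallsyms():
    """Load /proc/kallsyms once into parallel, address-sorted lists."""
    addrs, names = [], []
    with open("/proc/kallsyms") as f:
        for line in f:
            addr, _type, name = line.split()[:3]
            addrs.append(int(addr, 16))
            names.append(name)
    order = sorted(range(len(addrs)), key=addrs.__getitem__)
    return [addrs[i] for i in order], [names[i] for i in order]

def ksym(addr: int) -> str:
    """Resolve a kernel address to the nearest preceding symbol plus offset."""
    addrs, names = _kallsyms()
    i = bisect.bisect_right(addrs, addr) - 1
    return f"{names[i]}+0x{addr - addrs[i]:x}" if i >= 0 else hex(addr)
```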
Example: bpftrace Probe To Catch Slow Syscalls
A quick exploratory probe with bpftrace to log read/write over 10 ms for a given process name:
```bash
sudo bpftrace -e '
tracepoint:syscalls:sys_enter_read,
tracepoint:syscalls:sys_enter_write
/comm == "myservice"/
{
  @ts[tid] = nsecs;
}

tracepoint:syscalls:sys_exit_read,
tracepoint:syscalls:sys_exit_write
/comm == "myservice" && @ts[tid]/
{
  $d = (nsecs - @ts[tid]) / 1000000;
  if ($d > 10) {
    // args->ret is the syscall return value (bytes read/written)
    printf("%s %s %d ms ret=%d\n", comm, probe, $d, args->ret);
  }
  delete(@ts[tid]);
}'
```
This isn’t the production agent, but it shows how to target just the events you care about and apply thresholds.
Example: CO-RE eBPF Program with Ring Buffer
A minimal CO-RE eBPF program in C that attaches to the connect() entry and exit tracepoints to capture slow outbound connections. It writes a compact event to a ring buffer.
```c
// SPDX-License-Identifier: GPL-2.0
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

struct event_t {
    u64 ts_ns;
    u32 tgid;
    u32 pid;
    u32 saddr_v4;
    u32 daddr_v4;
    u16 dport;
    s16 ret;      // connect() return code
    u32 dur_ms;
};

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 1 << 24); // 16 MiB
} events SEC(".maps");

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __type(key, u32);
    __type(value, u64);
    __uint(max_entries, 16384);
} start SEC(".maps");

SEC("tracepoint/syscalls/sys_enter_connect")
int on_enter_connect(struct trace_event_raw_sys_enter *ctx)
{
    u32 tid = (u32)bpf_get_current_pid_tgid();
    u64 ts = bpf_ktime_get_ns();

    bpf_map_update_elem(&start, &tid, &ts, BPF_ANY);
    return 0;
}

SEC("tracepoint/syscalls/sys_exit_connect")
int on_exit_connect(struct trace_event_raw_sys_exit *ctx)
{
    u32 tid = (u32)bpf_get_current_pid_tgid();
    u64 *tsp = bpf_map_lookup_elem(&start, &tid);
    if (!tsp)
        return 0;

    u64 dur_ns = bpf_ktime_get_ns() - *tsp;
    bpf_map_delete_elem(&start, &tid);

    u64 dur_ms = dur_ns / 1000000ULL;
    if (dur_ms < 20)
        return 0; // filter out fast connects in-kernel

    struct event_t *e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
    if (!e)
        return 0;

    u64 pidtgid = bpf_get_current_pid_tgid();
    e->ts_ns = bpf_ktime_get_ns();
    e->pid = (u32)pidtgid;
    e->tgid = pidtgid >> 32;
    e->ret = (s16)ctx->ret;
    e->dur_ms = (u32)dur_ms;
    // Optional: stash the sockaddr pointer from sys_enter (ctx->args[1]) in a map
    // and read daddr/dport here with bpf_probe_read_user().
    e->saddr_v4 = 0;
    e->daddr_v4 = 0;
    e->dport = 0;

    bpf_ringbuf_submit(e, 0);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```
A user-space agent using libbpf reads the ring buffer and enriches the event with container metadata and symbolized stacks.
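As an illustration of the user-space side, here is a Python sketch of the decode-and-enrich step, assuming the raw 32-byte events have already been read from the ring buffer (the libbpf consumer loop itself is omitted); the cgroup-based container lookup is a best-effort assumption.

```python
import struct

# Mirrors struct event_t above: u64 ts_ns; u32 tgid, pid, saddr_v4, daddr_v4;
# u16 dport; s16 ret; u32 dur_ms -- 32 bytes, little-endian, no padding.
EVENT = struct.Struct("<QIIIIHhI")

def decode_event(raw: bytes) -> dict:
    ts_ns, tgid, pid, _saddr, _daddr, dport, ret, dur_ms = EVENT.unpack(raw[:EVENT.size])
    return {"ts_ns": ts_ns, "tgid": tgid, "pid": pid,
            "dport": dport, "ret": ret, "dur_ms": dur_ms}

def container_id(tgid: int) -> str:
    """Best-effort container ID from the process's cgroup path (cgroup v2)."""
    try:
        with open(f"/proc/{tgid}/cgroup") as f:
            return f.read().strip().rsplit("/", 1)[-1]
    except OSError:
        return "unknown"

def enrich(raw: bytes) -> dict:
    ev = decode_event(raw)
    ev["container_id"] = container_id(ev["tgid"])
    return ev
```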
Practical Notes on Overhead
- Favor tracepoints over kprobes when a stable tracepoint exists; they are cheaper and ABI-stable.
- Use the ring buffer (BPF_MAP_TYPE_RINGBUF) instead of perf buffers for higher throughput with fewer syscalls.
- Filter early in-kernel. Avoid copying large buffers to user space unless needed.
- Sample on hot paths. For high-frequency events like read/write, sample 1/N and bias for anomalies (e.g., only emit when dur > threshold).
- Beware of stack unwinding. Kernel stack traces are cheap with bpf_get_stack; user-space stacks can be costly if DWARF is needed. Cache symbolization results.
Empirically, carefully designed eBPF instrumentation adds low single-digit CPU overhead on busy nodes. Projects like Cilium, Pixie, and Parca demonstrate sustained production use with acceptable costs. Always measure in your environment.
Redaction and Minimization: Ship Less, Not More
Privacy and compliance aren’t bolt-ons. They must be baked into the pipeline.
Principles:
- Default deny: The event schema should include metadata fields; payload fields are opt-in by policy.
- Tokenize early: Replace user IDs, emails, and tokens with salted HMACs or format-preserving hashes to retain joinability without disclosure.
- Drop body content: HTTP bodies, SQL result sets, and file contents are excluded by default. Collect only headers and lengths unless an incident policy authorizes a targeted capture.
- Pseudonymize strings: When strings are necessary (e.g., SQL statements), redact literals and keep structure.
- Structured policy enforcement: Use a policy engine to define what’s considered sensitive and who can request exceptions.
Fast path implementation:
- High-speed regex with Hyperscan for common PII (email, SSN-like patterns, credit cards with Luhn check).
- Contextual filters: Mask Authorization headers, cookies, JWTs, OAuth tokens.
- Dictionary-based secrets: Hash known API keys and client IDs; maintain Bloom filters to recognize known sensitive IDs at line rate.
- Structured grammar-based redaction for SQL: Strip constant literals from queries.
Example: Replacing sensitive identifiers with salted HMAC while preserving format class.
```python
import hmac, hashlib

SALT = b"rotate-me-regularly"

def token(value: bytes) -> str:
    digest = hmac.new(SALT, value, hashlib.sha256).hexdigest()
    return digest[:16]  # short token for joins

# e.g., user@example.com -> email:ab12c3d4e5f6a7b8
```
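The same minimization idea applies to SQL statements: keep the query shape, drop the literals. A minimal regex-based sketch (a real implementation would use a SQL tokenizer to handle escaping and dialects):

```python
import re

_STRING = re.compile(r"'(?:[^'\\]|\\.)*'")   # single-quoted string literals
_NUMBER = re.compile(r"\b\d+(?:\.\d+)?\b")   # integer/decimal literals

def redact_sql(query: str) -> str:
    """Strip constant literals from a SQL statement, preserving its structure."""
    return _NUMBER.sub("?", _STRING.sub("?", query))

# redact_sql("SELECT * FROM users WHERE email = 'a@b.com' AND id = 42")
# -> "SELECT * FROM users WHERE email = ? AND id = ?"
```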
Policy example with OPA/Rego to ensure only metadata leaves the node:
```rego
package debugai.redaction

import future.keywords.in

# By default, block payloads
default allow_payload := false

# Allow payloads only for incident severity CRITICAL and a narrow scope
allow_payload {
    input.incident_severity == "CRITICAL"
    input.scope.container_label in {"payroll-indexer"}
    input.approver in data.approvers
}

# Fields to strip unconditionally
strip_fields := {
    "http.headers.authorization",
    "http.cookies",
    "sql.binds",
    "env.SECRET_*",
}
```
Encrypt data at rest; log redaction decisions for audit; and periodically test redaction with synthetic PII to ensure policies work.
Record/Replay: Reproducing Failures Without Prod Access
A debug AI is most effective when it can actually run the code under conditions close to the failure. That doesn’t mean copy prod. It means snapshot just enough state to deterministically reproduce the behavior.
Two practical approaches:
- Syscall Record/Replay Light
- Capture a bounded window (e.g., 5–30 s) of relevant syscalls with parameters and minimal I/O payloads.
- For file reads/writes, store content-addressed blocks only for paths under a configurable allow-list (e.g., /etc/service, /srv/config, application data).
- For network, capture TCP streams and DNS queries/answers as transcripts; avoid payload capture unless whitelisted.
- Reconstruct an event-driven harness that replays the timing and data deterministically.
- Container Snapshot + Transcript
- For containerized services, store: base image digest, overlayfs diff of writable layer, environment variables (after redaction), cgroup limits, and a pcap/QUIC trace of external traffic for a short window.
- Use CRIU for process checkpoint/restore where possible, or rely on clean cold-start with replayed stimuli.
Rehydration target:
- gVisor (user-space kernel) or a Firecracker microVM for stronger isolation and syscall mediation.
- Alternatively, a locked-down container with seccomp allow-list, AppArmor/SELinux profile, no network egress, read-only root, and a tmpfs working dir.
Determinism tips:
- Seed RNG: record the bytes the process reads from /dev/urandom at startup and replay them where feasible; otherwise, stub randomness in the harness.
- Freeze time: LD_PRELOAD time() and clock_gettime() stubs in replay to match captured timestamps.
- Normalize nondeterminism: patch concurrency with fewer worker threads to reduce scheduling variance; replay queue inputs in the captured order.
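To make these tips concrete, here is a minimal replay-harness sketch. Assumptions: the capture pipeline emits a transcript.json with a seed, a start timestamp, and an ordered list of HTTP requests; time freezing is delegated to libfaketime's FAKETIME variable inside the sandbox; send_request and the target command are illustrative placeholders, not a fixed interface.

```python
import json, os, random, subprocess, time, urllib.request

def send_request(req: dict) -> None:
    """Illustrative stimulus injector: replays one captured HTTP request against loopback."""
    body = req.get("body")
    r = urllib.request.Request(
        url=f"http://127.0.0.1:8080{req['path']}",
        data=body.encode() if body else None,
        method=req["method"],
        headers=req.get("headers", {}),
    )
    urllib.request.urlopen(r, timeout=5)

def replay(transcript_path: str, target_cmd: list) -> None:
    with open(transcript_path) as f:
        t = json.load(f)
    random.seed(t["seed"])                    # pin harness-side randomness
    env = {
        **os.environ,
        "FAKETIME": t["start_ts"],            # freeze time via libfaketime inside the sandbox
        "UV_THREADPOOL_SIZE": "1",            # shrink worker pools to reduce scheduling variance
    }
    proc = subprocess.Popen(target_cmd, env=env)
    time.sleep(2)                             # crude wait for the service to accept connections
    try:
        for req in t["requests"]:             # feed captured inputs in the recorded order
            send_request(req)
    finally:
        proc.terminate()
```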
eBPF Assist for Record/Replay
Using kprobes/tracepoints, one can capture syscall metadata efficiently. For example, capture openat() paths and results:
```c
SEC("tracepoint/syscalls/sys_enter_openat")
int on_enter_openat(struct trace_event_raw_sys_enter *ctx)
{
    const char *filename = (const char *)ctx->args[1];

    // Read the first N bytes of the filename safely from user memory
    char path[128];
    bpf_probe_read_user_str(path, sizeof(path), filename);

    // Apply allow-list matching in-kernel (e.g., prefix match on /srv/app);
    // if allowed, enqueue an event to the ring buffer, else ignore
    return 0;
}

SEC("tracepoint/syscalls/sys_exit_openat")
int on_exit_openat(struct trace_event_raw_sys_exit *ctx)
{
    // Correlate with the enter event via tid; record ctx->ret (the fd) if needed
    return 0;
}
```
Use content-addressed storage for blocks you capture, deduplicated by SHA-256. Keep a per-incident manifest describing what to restore.
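A minimal sketch of the content-addressed store and per-incident manifest follows; the /var/lib/debugai/cas path and the manifest layout are illustrative assumptions, not a fixed format.

```python
import hashlib, json, pathlib

STORE = pathlib.Path("/var/lib/debugai/cas")   # assumed local CAS root

def put_block(data: bytes) -> str:
    """Store a block under its SHA-256 digest; identical blocks dedupe for free."""
    digest = hashlib.sha256(data).hexdigest()
    path = STORE / digest[:2] / digest
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)
    return digest

def write_manifest(incident_id: str, entries: dict) -> None:
    """entries maps original path -> block digest; the manifest drives restore."""
    manifest = {"incident": incident_id, "blocks": entries}
    (STORE / f"{incident_id}.manifest.json").write_text(json.dumps(manifest, indent=2))
```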
Least-Privilege Sandboxes: Where the AI Does Work
The analysis environment must be safer than prod, not looser.
Minimum hardening:
- Container runtime: gVisor (runsc) for strong syscall mediation, or Firecracker/Kata for microVM isolation.
- Seccomp: allow-list profile tailored to the runtime; no ptrace, no perf_event_open.
- AppArmor/SELinux: confine to per-incident paths; no host mounts.
- Network: egress disabled; strict loopback only. If internet access is needed for CVE metadata, proxy through a broker that strips URLs and caches sanitized content.
- Filesystem: read-only base image; mounted replay snapshot under a controlled path; tmpfs scratch.
- Capabilities: drop all; if eBPF is used in the sandbox, grant CAP_BPF only in the sandbox, never CAP_SYS_ADMIN.
Observability:
- Instrument the sandbox itself: resource caps, timeouts, and kill switches.
- Full audit trail: who requested analysis, what data was loaded, hashes of content packs, and exact LLM prompts and outputs.
Feeding the LLM: Structure and Guardrails
Large Language Models produce better results with structured, concise context. Don’t dump raw logs. Build a "context pack" that looks like this:
- Meta: service name, version/commit, container image, env (sanitized key names only), regional/cluster info.
- Timeline: ordered events (syscalls, tracepoints, network) with millisecond durations and correlation IDs.
- Stack summaries: top-N stacks by time blocked, aggregated with symbols.
- Resource metrics: CPU throttling events, RSS growth, GC pause durations, file descriptor counts.
- Config deltas: diff from baseline known-good config.
- Repro harness: command-line, inputs, seeds.
- Constraints: policies and what the AI is allowed to suggest or do (e.g., no code changes, only config proposals).
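To make this concrete, here is a sketch of how the agent might assemble the pack; the field names and caps are illustrative assumptions, not a fixed schema, and every input is assumed to have already passed the redaction policy.

```python
import json

def build_context_pack(incident, events, stacks, metrics, config_diff, harness):
    """Assemble the bounded, pre-redacted context pack handed to the LLM."""
    pack = {
        "meta": {
            "service": incident["service"],
            "version": incident["git_sha"],
            "image": incident["image_digest"],
            "cluster": incident["cluster"],
        },
        "timeline": events[:500],          # hard cap to keep the pack small
        "stack_hotspots": stacks[:20],     # top-N stacks by time blocked
        "resources": metrics,
        "config_diff": config_diff,
        "repro_harness": harness,
        "constraints": {"pii": "pseudonymized", "max_output_words": 1500},
    }
    return json.dumps(pack, indent=2)
```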
Prompt skeleton:
```text
You are a production-safe debugging assistant.

Context:
- Service: {name}
- Version: {git_sha}
- Incident: {id}
- SLO symptom: {metric} exceeded {threshold}

Artifacts:
1) Timeline (bounded): {timeline_table}
2) Stack hot spots: {stack_summaries}
3) Resource anomalies: {resource_notes}
4) Config diffs: {config_diff}

Task:
- Propose the most likely root cause(s) with ranked confidence and evidence citations pointing to event IDs.
- Provide a minimal reproduction script using the provided harness.
- Suggest the smallest configuration or code change likely to fix the issue.
- Enumerate at least one alternative hypothesis and what data would disambiguate it.

Constraints:
- Do not request or rely on PII or secrets. All tokens are pseudonymized.
- Produce outputs under 1,500 words.
```
Run this LLM inside the sandbox. Cache embeddings for code and docs to support retrieval-augmented generation (RAG) without internet.
Latency-Safe by Design
This pipeline is asynchronous and designed for backpressure:
- In-kernel buffer: ring buffer size is bounded; if readers lag, events are dropped. Maintain a counter of drops to alert.
- Agent batching: flush on size/time, compress with zstd, and push to durable streams (see the sketch after this list).
- Sampling and prioritization: degrade gracefully under load; only emit high-severity signals when queues are long.
- Zero hot-path locks: avoid contended kernel structures. Use per-CPU maps.
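A sketch of the agent's batching step, assuming events arrive as dicts from the ring-buffer reader and that the sink callable wraps a Kafka/NATS producer; the size and age thresholds are illustrative.

```python
import json, time
import zstandard as zstd   # pip install zstandard

MAX_BYTES = 1 << 20   # flush at ~1 MiB of buffered events
MAX_AGE_S = 2.0       # ...or after 2 seconds, whichever comes first

class Batcher:
    def __init__(self, sink):
        self.sink = sink                     # e.g., a producer callable (assumed)
        self.buf, self.size, self.first_ts = [], 0, None

    def add(self, event: dict) -> None:
        line = json.dumps(event, separators=(",", ":"))
        if self.first_ts is None:
            self.first_ts = time.monotonic()
        self.buf.append(line)
        self.size += len(line)
        if self.size >= MAX_BYTES or time.monotonic() - self.first_ts >= MAX_AGE_S:
            self.flush()

    def flush(self) -> None:
        if not self.buf:
            return
        payload = zstd.ZstdCompressor(level=3).compress("\n".join(self.buf).encode())
        self.sink(payload)                   # downstream handles retries/backpressure
        self.buf, self.size, self.first_ts = [], 0, None
```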
Operational thresholds:
- Start conservative: <1% CPU budget per node; <128 MB memory for agent; <1000 events/sec sustained.
- Expand as you prove safety: enable more probes or deeper payloads only when justified.
Security and Compliance: Capabilities, RBAC, and Audit
Capabilities needed depend on kernel version:
- Modern kernels (>= 5.8): CAP_BPF plus CAP_PERFMON (and sometimes CAP_SYS_RESOURCE) often suffice.
- Older kernels: CAP_SYS_ADMIN is unfortunately required in many distros; mitigate with hardened nodes and restricted DaemonSets.
Kubernetes deployment model:
- Node-level DaemonSet: one privileged agent per node attaches to cgroups/pids of target pods.
- Namespace scoping: allow only specific namespaces/labels; configuration via ConfigMaps and admission controls.
RBAC and governance:
- Who can request deeper capture or payload exceptions? Encode this in OPA policies.
- Audit every exception with approver identity and time-bound TTL.
- Keep a manifest of all data that leaves the node with hashes and retention policies.
Compatibility and Reliability
- CO-RE + BTF: ship a single eBPF binary that adapts to kernel structs at load time.
- Fallback plans: if BTF is missing, use BTFHub or prebuilt per-kernel binaries; bpftrace for ad-hoc probes.
- Verifier tricks: avoid complex loops; break logic into tail calls; keep program size within limits.
- Stability: prefer tracepoints to kprobes on unstable functions; pin maps across reloads to avoid data loss.
Case Study: Fixing p99 Spikes in a Node.js API Without Touching Prod
Symptoms: p99 latency spikes from 50 ms to 900 ms for a Node.js service under moderate load, no corresponding DB slow queries.
Step 1: Attach eBPF probes
- Tracepoints: sys_enter/exit_connect, tcp_retransmit_skb, sched_switch.
- USDT uprobes for Node: libuv loop idle/busy durations, HTTP server events.
- Filters: only the "api-gateway" pod cgroup.
Step 2: Observe
- Within minutes, the timeline shows repeated connect() calls taking 120–400 ms, primarily to upstream internal hostnames. tcp_retransmit_skb shows no retransmits; network is healthy.
- Stack samples indicate stalling in getaddrinfo() paths.
Step 3: Redaction & snapshot
- Payloads are not captured; only hostnames and durations are retained. Hostnames are tokenized with format-preserving encryption so authorized responders can re-identify them offline.
- Snapshot collected: container image digest, overlayfs diff, environment, and a pcap of DNS traffic for 20 seconds.
Step 4: Sandbox replay
- Rehydrate the container in gVisor with egress blocked; replay DNS transcript.
- LLM analysis receives context pack: timeline, stacks, and DNS times.
Findings
- The upstream DNS server occasionally returns slow responses; Node’s default dns.lookup() delegates to libc getaddrinfo() on the small libuv thread pool, so slow lookups queue behind a handful of worker threads.
- Suggest preferring Node’s native c-ares async resolver (the dns.resolve* APIs, or a custom lookup such as undici’s DNS options) and adding short-TTL caching at the application or sidecar level. Alternatively, configure the OS to use multiple resolvers with EDNS0, or deploy a local dnsmasq/CoreDNS cache on the node.
Outcome
- After rolling out a small config change to enable async DNS resolution with local caching, p99 dropped to 60 ms with no further spikes. eBPF data confirmed that connect() latency distributions had normalized.
At no point did we attach a debugger, pause threads, or exfiltrate payloads.
Pitfalls and How to Avoid Them
- Oversharing: The biggest risk is shipping too much data. Start with metadata. Require explicit policy to capture payloads.
- Hot-path kprobes: Attaching to very hot kernel functions without sampling can add measurable overhead. Prefer tracepoints or sample aggressively.
- Unwinding user stacks: DWARF-less binaries or JIT’d runtimes complicate symbolization. Use USDT probes or runtime-provided events where possible.
- Kernel variance: Exotic kernels or vendor patches can break BTF expectations. Maintain a canary node group to validate new kernels before fleet rollout.
- Verifier rejections: Break complex programs into smaller ones with tail calls; pre-compute constants; lean on libbpf CO-RE helpers.
- Delivery backpressure: Size ring buffers generously and monitor drop counters; implement backoff in agents.
30-Day Rollout Plan
Phase 0 (Week 1): Safety and Scope
- Pick 1–2 services and 1 cluster.
- Deploy a node-level agent with only net and syscall tracepoints, no payload capture.
- Validate zero or negligible latency overhead, watch event drop metrics.
Phase 1 (Week 2): Redaction and Policy
- Implement tokenization, PII regex, and OPA-based allow-lists.
- Add user-space symbolization for selected binaries.
- Start building context packs and store in a secure bucket.
Phase 2 (Week 3): Snapshot and Sandbox
- Integrate overlayfs diff capture and short pcap for incidents.
- Stand up gVisor/Firecracker sandboxes with seccomp and egress off.
- Replay one or two recorded incidents end-to-end.
Phase 3 (Week 4): LLM Integration
- Add a local LLM (or an offline gateway) with RAG over code and runbooks.
- Define prompt templates and guardrails.
- Run shadow analyses; compare AI guidance with human root causes.
Exit criteria: No measurable SLO degradation, policy violations near zero, and at least one real incident resolved faster with the pipeline than without.
References and Tools
- eBPF & libbpf: https://ebpf.io and https://github.com/libbpf/libbpf
- BPF Performance Tools (Brendan Gregg): https://www.brendangregg.com/bpf-performance-tools-book.html
- Cilium (eBPF networking/security): https://cilium.io
- Pixie (K8s observability with eBPF): https://px.dev
- Parca/Parca Agent (continuous profiling): https://www.parca.dev
- CRIU: https://criu.org
- gVisor: https://gvisor.dev
- Firecracker: https://firecracker-microvm.github.io/
- OPA/Rego: https://www.openpolicyagent.org/
- Hyperscan: https://github.com/intel/hyperscan
Conclusion
If your instinct when prod misbehaves is to attach a debugger, retrain that reflex. eBPF gives you a scalpel where a debugger is a sledgehammer. With selective probes, strict redaction, and a replayable snapshot, you can give a debug AI all the context it needs to form accurate, actionable hypotheses—without stopping the world or leaking secrets.
The blueprint here is not hypothetical. Teams at scale already run eBPF-based observability with tight budgets and strong controls. The delta is integrating those signals into a principled, sandboxed, AI-assisted workflow that turns raw events into fixes. Do that, and your mean time to reason—not just to recovery—drops, while your risk profile stays where it belongs: boring.
