From Stack Traces to System Traces: How Debug AI Should Use eBPF and OpenTelemetry to Diagnose Prod Incidents Safely
Most production incident workflows still orbit around two artifacts: logs and stack traces. They are invaluable, but they are also incomplete. Stack traces tell you where a thread was, not why the surrounding system forced it there. The kernel, the network, the scheduler, the filesystem, the NIC offloads, and the runtime environment exert hidden pressure that never surfaces in app-level traces. When your p99 explodes or your API returns sporadic 500s, a text stack trace won’t tell you that a TCP retransmit storm or a thundering herd in the page cache is the real culprit.
This article proposes a pragmatic and safe architecture: marry eBPF-powered system tracing with OpenTelemetry’s semantic model and sampling controls, then put a debug AI on top to automate triage. The core tenets are:
- Minimal data capture by default, with policy-driven escalation and redaction
- Deterministic (or near-deterministic) replay of the failing path without SSH access
- End-to-end correlation between application spans and kernel events
- A closed-loop AI workflow that hypothesizes, narrows the search, and validates findings
The audience for this design is SREs, platform engineers, and observability teams who need deep system visibility without violating safety, privacy, and compliance promises. The goal is not to replace skilled operators; it is to move from whack-a-mole, SSH-based root cause hunts to a repeatable, auditable, and fast diagnostic system.
Why Stack Traces Alone Fall Short
Stack traces answer what a thread was doing at a moment. They do not capture:
- Kernel scheduling delay: Is the run queue saturated? Was the thread preempted at the wrong time?
- I/O stalls: Was an fsync delayed by writeback throttling? Did the cgroup hit an IO limit?
- Network pathologies: TCP SYN backlog overflow, retransmits, or NIC offload quirks
- Contention outside your process: Socket buffer pressure, page cache evictions, kernel lock contention
- Container and cgroup constraints: CPU throttling, memory pressure zones, NUMA effects
The cost of missing these signals is high: flapping incidents that only appear to resolve after ad-hoc debugging, risky SSH sessions into production hosts, and guesswork that breeds incident churn.
eBPF 101: The Safest System Microscope in Production
eBPF (extended Berkeley Packet Filter) lets you safely attach small programs to kernel hooks (kprobes, tracepoints, uprobes for user space, network hooks, LSM). The kernel’s verifier enforces strict safety constraints, and programs are JIT-compiled and run in a restricted sandbox. With CO-RE (compile once, run everywhere) and BTF type information, you can ship one eBPF binary across kernel versions.
Key primitives relevant to prod-safe debugging:
- kprobes and tracepoints to observe syscalls, scheduler, TCP events, block I/O
- uprobes and USDT to observe user-space functions without code changes
- BPF maps for state: hash maps, LRU maps, per-CPU arrays, ring buffers for event streaming
- Ring buffer API for low-overhead data transfer
- Verifier and tail calls for safety and modularity
Cost: well-designed eBPF programs can stay under a few percent of CPU overhead for targeted instrumentation. The principal risk is not performance but data spillage, so we must enforce minimal capture and robust redaction.
OpenTelemetry 101: The Control Plane for Signal Semantics
OpenTelemetry (OTel) gives us a shared grammar for metrics, traces, and logs:
- Traces and spans with context propagation across services
- Attributes with semantic conventions (network peer, HTTP status, database system, etc.)
- Metrics pipelines with exemplars that link back to traces
- Logs as structured events
- OpenTelemetry Collector: a vendor-neutral router with processors (redaction, sampling, tail-based decisions)
OTel is the control plane for how we name, route, sample, and enrich signals. eBPF is the data plane for where to tap the system safely. Combining them enables correlation and policy-driven capture escalations.
Design Principles: Safe Production Diagnosis
- Default deny on data volume: capture metadata-first, payload-later, time-bounded, policy-gated.
- Deterministic replay: capture the minimal boundary conditions needed to reproduce the fault in a sandbox (syscalls, file reads, selective network I/O, time and randomness sources).
- Correlation, not guesswork: link system events to spans, pods, and services via OTel IDs and k8s identity.
- Zero SSH: all control flows over audited APIs, with mTLS and RBAC; actions are idempotent and reversible.
- Human-in-the-loop with AI augmentation: the AI proposes, you approve; or enforce policy guardrails for autonomous modes.
Reference Architecture
A high-level view:
- eBPF Agent (DaemonSet in Kubernetes): attaches kprobes/tracepoints/uprobes and streams ring buffer events.
- OpenTelemetry Collector: receives, enriches with k8s metadata, performs tail-based sampling, redaction, and routing.
- Debug AI Control Plane: embeds a policy engine, issues capture-level changes, queries backends, runs causal analysis.
- Backends: trace store (e.g., Tempo), metrics (Prometheus or OTLP -> TSDB), logs (Loki or OTLP log store), system events (ClickHouse or similar column store).
- Replay Sandbox: ephemeral worker that replays captured boundary conditions for determinism.
Data flow:
- Normal mode: app emits OTel traces; eBPF agent emits low-cardinality system events (metadata only). Collector enriches, correlates, stores.
- Incident trigger: an SLO violation fires or the tail-based sampler flags anomalous traces. The AI requests a temporary capture escalation via the control plane.
- Elevated capture: eBPF agent records additional fields (e.g., packet headers or select payload slices, file read hashes) for involved pods only, with TTL and byte caps.
- Replay: the AI spins up a sandbox with the captured inputs to confirm the root cause and test candidate patches.
- De-escalation: capture returns to default minimal mode; incident narrative and artifacts are preserved.
Minimal but Sufficient Data: A Layered Capture Strategy
Start metadata-only, then escalate narrowly.
- Always-on (low risk, low volume):
  - syscall latencies (aggregate distributions), TCP retransmit counters, scheduler run queue samples
  - kernel error events (ICMP unreachable, timeouts), k8s cgroup IDs, pod and container IDs
  - HTTP method, route template (via eBPF-aware HTTP classifiers like Beyla or Pixie; or app telemetry)
  - no payloads, no PII
- On trigger (tail-based sampler picks anomalous traces):
  - record per-event syscall timings for the suspect process group
  - capture network 5-tuples and top N bytes of payload for matching endpoints only, with on-host redaction rules
  - hash of file reads and configs involved in the request
- Manual escalation (time-boxed, break-glass):
  - full packet capture for a single pod or node for a few minutes
  - binary snapshots of epoll interest lists or TCP states for forensics
The eBPF agent enforces byte budgets and drops beyond cap. Redaction happens in-kernel where possible (for example, zeroing all digits in payload, or substring removal for known tokens).
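To make the budget enforcement concrete, here is a minimal user-space sketch of per-cgroup byte budgeting with a TTL, assuming a hypothetical forwarder that decides per event whether a payload slice may leave the node; the class names and limits are illustrative, not part of any existing agent.

```python
# budget.py: per-cgroup payload byte budget with TTL (illustrative sketch)
import time
from dataclasses import dataclass

@dataclass
class CaptureBudget:
    max_bytes: int        # e.g. 50 MiB per escalation window
    expires_at: float     # unix timestamp when the escalation ends
    used_bytes: int = 0

class BudgetEnforcer:
    """Drops payload slices (but keeps metadata) once a cgroup's budget is spent."""

    def __init__(self) -> None:
        self._budgets: dict[int, CaptureBudget] = {}   # cgroup_id -> budget

    def escalate(self, cgroup_id: int, max_bytes: int, ttl_s: int) -> None:
        self._budgets[cgroup_id] = CaptureBudget(max_bytes, time.time() + ttl_s)

    def admit_payload(self, cgroup_id: int, payload_len: int) -> bool:
        budget = self._budgets.get(cgroup_id)
        if budget is None or time.time() > budget.expires_at:
            self._budgets.pop(cgroup_id, None)   # expired: back to metadata-only
            return False
        if budget.used_bytes + payload_len > budget.max_bytes:
            return False                         # over cap: drop the payload slice
        budget.used_bytes += payload_len
        return True
```

The forwarder strips the payload field whenever admit_payload returns False; a per-cgroup counter map can mirror the same logic in-kernel as a first line of defense.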
eBPF: Capturing Events With Policy
A minimal CO-RE eBPF sketch that captures syscall entries and attaches cgroup identity:
```c
// bpf_syscalls.c (CO-RE sketch; illustrative)
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

// Syscall numbers are not in vmlinux.h; define (or generate) them per architecture.
#ifndef __NR_write
#define __NR_write 1    /* x86_64 */
#endif
#ifndef __NR_sendto
#define __NR_sendto 44  /* x86_64 */
#endif

struct event_t {
    u64 ts;
    u32 pid;
    u32 tid;
    u64 cgroup_id;
    int syscall_nr;
    int policy_level;   // 0=meta-only, 1=timing, 2=payload slice
    u32 data_len;
    char data[80];      // payload slice when policy allows
};

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 1 << 24);   // 16 MiB ring buffer
} events SEC(".maps");

// Simple policy map keyed by cgroup id -> level
struct {
    __uint(type, BPF_MAP_TYPE_LRU_HASH);
    __type(key, u64);
    __type(value, int);
    __uint(max_entries, 1024);
} policy SEC(".maps");

static __always_inline int get_policy_level(u64 cgid)
{
    int *lvl = bpf_map_lookup_elem(&policy, &cgid);
    return lvl ? *lvl : 0;
}

SEC("tp/raw_syscalls/sys_enter")
int on_sys_enter(struct trace_event_raw_sys_enter *ctx)
{
    u64 cgid = bpf_get_current_cgroup_id();
    int lvl = get_policy_level(cgid);

    struct event_t *e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
    if (!e)
        return 0;

    e->ts = bpf_ktime_get_ns();
    e->pid = bpf_get_current_pid_tgid() >> 32;
    e->tid = (u32)bpf_get_current_pid_tgid();
    e->cgroup_id = cgid;
    e->syscall_nr = ctx->id;
    e->policy_level = lvl;
    e->data_len = 0;

    // Optionally copy a small payload slice for, e.g., write or sendto
    if (lvl >= 2 && (ctx->id == __NR_write || ctx->id == __NR_sendto)) {
        const char *buf = (const char *)ctx->args[1];
        if (buf) {
            int copied = bpf_probe_read_user_str(e->data, sizeof(e->data), buf);
            if (copied > 0)
                e->data_len = (u32)copied;
        }
        // Redact in-kernel where possible (here: zero all digits);
        // the loop is bounded by sizeof(e->data) to satisfy the verifier.
        for (u32 i = 0; i < sizeof(e->data) && i < e->data_len; i++) {
            if (e->data[i] >= '0' && e->data[i] <= '9')
                e->data[i] = '0';
        }
    }

    bpf_ringbuf_submit(e, 0);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```
Notes:
- This uses a global policy map keyed by cgroup. The control plane updates it to escalate capture for specific pods.
- It copies at most 80 bytes of payload for select syscalls when policy level permits.
- In-kernel redaction here just zeros digits. Real systems should use structured tokenization rules or simple in-kernel state machines, and push anything resembling regex to the collector after slicing a safe subset.
User space consumes this ring buffer, enriches with Kubernetes metadata, and forwards as OTLP logs with attributes or as events attached to relevant spans.
A minimal user-space consumer sketch (Python-flavored pseudocode; the module names are placeholders):
```python
# events_consumer.py (pseudocode)
from otlp import LogsExporter
from k8smeta import CgroupIndex
from ringbuf import BpfRing

rb = BpfRing('/sys/fs/bpf/events')
exporter = LogsExporter(endpoint='otlp-collector:4317')
index = CgroupIndex()

for ev in rb:
    pod, ns, container = index.lookup(ev.cgroup_id)
    attrs = {
        'linux.pid': ev.pid,
        'linux.tid': ev.tid,
        'k8s.pod': pod,
        'k8s.namespace': ns,
        'k8s.container': container,
        'syscall.nr': ev.syscall_nr,
        'policy.level': ev.policy_level,
    }
    if ev.data_len > 0:
        attrs['payload.slice'] = ev.data[:ev.data_len]
    exporter.export_log(
        timestamp_ns=ev.ts,
        body='sys_enter',
        attributes=attrs,
    )
```
Correlating Kernel Events With Spans
Correlation is the crux. Three practical strategies:
- Time and identity join (works everywhere): join kernel events and spans by time window and k8s identity (pod, container, PID), plus TCP 5-tuple if available. This is robust but approximate at millisecond scale.
- Span exemplars: from eBPF, emit metrics with exemplars that include trace_id and span_id. The collector links these to the trace UI. If the app emits these IDs as thread-local attributes (e.g., exposing the current context via USDT or custom uprobe), store them in a BPF map keyed by TID and attach to events.
- Direct span linkage via uprobes (advanced): attach uprobes to the application runtime to capture the active span ID. For example, a uprobe on user-space functions that start or activate spans can extract the trace_id and set a TID->trace mapping in a BPF map. Later, syscall events look up that mapping and include the IDs.
Illustrative uprobe sketch (conceptual; actual offsets depend on language/runtime):
```c
// Map TID -> { trace id (128-bit), span id (64-bit) }
struct trace_ctx {
    u64 trace_hi;
    u64 trace_lo;
    u64 span_id;
};

struct {
    __uint(type, BPF_MAP_TYPE_LRU_HASH);
    __type(key, u32);
    __type(value, struct trace_ctx);
    __uint(max_entries, 16384);
} tid2trace SEC(".maps");

// Suppose the runtime exposes a USDT/uprobe hook that passes a pointer to the
// 16-byte trace id and a pointer to the 8-byte span id of the activated span.
SEC("uprobe/runtime_trace_activated")
int on_trace_activated(struct pt_regs *ctx)
{
    u32 tid = (u32)bpf_get_current_pid_tgid();
    const void *trace_ptr = (const void *)PT_REGS_PARM1(ctx);
    const void *span_ptr = (const void *)PT_REGS_PARM2(ctx);

    struct trace_ctx tc = {};
    bpf_probe_read_user(&tc, 16, trace_ptr);        // 128-bit trace id
    bpf_probe_read_user(&tc.span_id, 8, span_ptr);  // 64-bit span id
    bpf_map_update_elem(&tid2trace, &tid, &tc, BPF_ANY);
    return 0;
}
```
When emitting a syscall event, look up tid2trace and attach trace IDs. Where direct runtime hooks are infeasible, prefer the time-and-identity join.
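As an illustration of the first strategy, here is a small sketch of the time-and-identity join, assuming spans and kernel events have already been fetched from their stores; the dictionary fields (pod, start_ns, end_ns, ts_ns, span_id) describe an assumed shape, not a fixed schema.

```python
# join.py: approximate time-and-identity join of kernel events onto spans (sketch)
from bisect import bisect_left
from collections import defaultdict

def join_events_to_spans(spans, events, slack_ns=2_000_000):
    """Attach kernel events to spans that share a pod and overlap in time.

    spans:  dicts with 'span_id', 'pod', 'start_ns', 'end_ns'
    events: dicts with 'pod', 'ts_ns'
    slack_ns widens span windows to absorb clock skew between the two sources.
    """
    events_by_pod = defaultdict(list)
    for ev in events:
        events_by_pod[ev['pod']].append(ev)
    for pod_events in events_by_pod.values():
        pod_events.sort(key=lambda e: e['ts_ns'])

    matches = defaultdict(list)                  # span_id -> [event, ...]
    for span in spans:
        pod_events = events_by_pod.get(span['pod'], [])
        timestamps = [e['ts_ns'] for e in pod_events]
        start = bisect_left(timestamps, span['start_ns'] - slack_ns)
        for ev in pod_events[start:]:
            if ev['ts_ns'] > span['end_ns'] + slack_ns:
                break
            matches[span['span_id']].append(ev)
    return matches
```

The slack window is the knob that trades precision for recall; the tighter your clock synchronization, the smaller it can be.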
Projects to study for inspiration:
- Pixie (CNCF): eBPF data plane with app-level correlation and SQL-like queries
- Grafana Beyla: eBPF auto-instrumentation for HTTP and gRPC, OTel-friendly
- Parca agent: continuous profiling via eBPF
- Cilium Tetragon: security observability with policy and eBPF
Deterministic Replay Without SSH
To reproduce elusive failures, capture boundary conditions deterministically:
- Sources of nondeterminism: getrandom, time, scheduler, network, file system state
- Strategy: record and later inject deterministic values and inputs
- Syscalls: record parameters and return codes for a focused set contributing to the incident path
- Time: intercept time syscalls and store monotonic deltas
- Randomness: capture getrandom buffers for the request scope
- Network: capture request-response payload slices for the implicated flow only
- Files: capture content hashes plus a content snapshot for small files
In a replay sandbox:
- Reconstruct the environment: same container image, env variables, flags
- Use seccomp or LD_PRELOAD interposition to serve recorded syscall responses
- Feed captured network payloads and time/random streams
A small bpftrace example to capture time and randomness syscalls (illustrative):
```bash
# Capture getrandom and clock_gettime activity for a specific PID (illustrative)
bpftrace -e '
tracepoint:syscalls:sys_enter_getrandom /pid == 12345/ {
  @len[tid] = args->count;
}
tracepoint:syscalls:sys_exit_getrandom /pid == 12345/ {
  printf("getrandom requested=%d returned=%d\n", @len[tid], args->ret);
  delete(@len[tid]);
}
tracepoint:syscalls:sys_enter_clock_gettime /pid == 12345/ {
  @clk[tid] = args->which_clock;
}
tracepoint:syscalls:sys_exit_clock_gettime /pid == 12345/ {
  printf("clock id=%d ret=%d\n", @clk[tid], args->ret);
  delete(@clk[tid]);
}'
```
For real deployments, prefer CO-RE programs with ring buffers and a user-space recorder. You do not need full RR-style instruction-level replay; a boundary replay suffices for most production incidents.
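Below is a minimal sketch of a boundary replay driver, assuming the recorder writes a JSON manifest containing recorded clock values, getrandom buffers, and request payload slices; the manifest layout and file path are assumptions, not a defined format.

```python
# replay.py: serve recorded boundary conditions back to a sandboxed service (sketch)
import json

class BoundaryReplay:
    """Replays recorded time, randomness, and network inputs in capture order."""

    def __init__(self, manifest_path: str) -> None:
        with open(manifest_path) as f:
            manifest = json.load(f)
        self._clock = iter(manifest['clock_ns'])   # recorded clock_gettime results
        self._random = iter(manifest['random'])    # recorded getrandom buffers (hex)
        self._requests = manifest['requests']      # recorded request payload slices

    def now_ns(self) -> int:
        return next(self._clock)

    def getrandom(self, n: int) -> bytes:
        return bytes.fromhex(next(self._random))[:n]

    def requests(self):
        for req in self._requests:
            yield bytes.fromhex(req['payload'])

# usage (illustrative): an LD_PRELOAD or seccomp shim in the sandbox asks this
# process for now_ns()/getrandom() answers, while the driver replays each request
# against the service and compares responses with the recorded incident behavior.
```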
OpenTelemetry Collector: Policy, Sampling, and Redaction
Use the collector to enforce governance:
- Tail-based sampling: retain anomalous traces, drop the rest
- Attribute processors: drop PII, hash identifiers, normalize route templates
- Metrics to traces: export exemplars for p99 latencies to jump to traces
- Routing: keep data on-prem or in-region
Example collector pipeline (YAML, illustrative):
```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:
  journald: {}
  # A custom eBPF receiver or a generic filelog receiver can ingest agent events
  filelog:
    include: [/var/log/ebpf-events/*.json]

processors:
  k8sattributes:
    extract:
      metadata: [k8s.pod.name, k8s.namespace.name, k8s.container.name]
  attributes:
    actions:
      - key: http.request.header.authorization
        action: delete
      - key: payload.slice
        action: hash
  tail_sampling:
    decision_wait: 5s
    policies:
      - name: server_errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: high_latency
        type: latency
        latency:
          threshold_ms: 500
  transform:
    error_mode: ignore
    trace_statements:
      - context: span
        statements:
          - keep_keys(attributes, ["k8s.pod.name", "service.name", "net.peer.ip"])

exporters:
  otlphttp:
    endpoint: http://tempo:4318
  prometheus:
    endpoint: 0.0.0.0:8889
  clickhouse:
    endpoint: tcp://clickhouse:9000

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [k8sattributes, tail_sampling, attributes, transform]
      exporters: [otlphttp]
    logs:
      receivers: [otlp, filelog, journald]
      processors: [k8sattributes, attributes]
      exporters: [clickhouse]
    metrics:
      receivers: [otlp]
      processors: []
      exporters: [prometheus]
```
The Debug AI Loop
The AI’s job is to minimize time-to-first-explanation and reduce operator toil, not to guess wildly. A typical loop (a code sketch follows the list):
- Detect: SLO breaches or anomaly detectors fire.
- Gather: Query OTel backends for top N anomalous traces, correlated eBPF events, and recent changes (deploys, config, kernel counters).
- Hypothesize: Build a causal graph of spans with local delays, overlaid with kernel metrics (TCP retransmits, disk stalls) and cgroup throttling.
- Test: Issue a capture escalation for the implicated pods only (policy TTL, budgeted). Re-run queries.
- Explain: Produce a reproducible incident narrative with supporting artifacts and pointers to the code that needs fixing.
- Validate: Launch a replay sandbox using the captured boundary conditions and verify the fix.
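Sketched as control-plane pseudocode, one pass of that loop might look like the following; the client objects (traces, events, policies, sandbox) and the rank_hypotheses helper are assumptions standing in for whatever query, policy, and replay APIs your platform exposes.

```python
# triage_loop.py: one pass of the debug AI loop (pseudocode; clients are hypothetical)
def triage(incident, traces, events, policies, sandbox):
    # Gather: worst traces plus correlated kernel events for the affected service
    worst = traces.query_anomalous(service=incident.service, limit=20)
    kernel = events.query(pods={t.pod for t in worst}, window=incident.window)

    # Hypothesize: score candidate causes (network vs IO vs scheduler vs GC)
    hypotheses = rank_hypotheses(worst, kernel)   # placeholder scoring function

    # Test: escalate capture narrowly for the top hypothesis, then re-query
    top = hypotheses[0]
    policies.escalate(pods=top.pods, level=2, ttl='10m', byte_budget='50MiB')
    evidence = events.query(pods=top.pods, window='now-10m', detail='elevated')

    # Explain and validate: build the narrative, then confirm it in a replay sandbox
    narrative = top.explain(evidence)
    confirmed = sandbox.replay(incident.capture_manifest, expect=incident.symptom)
    return narrative, confirmed
```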
A small policy DSL for escalation (illustrative):
```yaml
when:
  traces:
    where:
      service.name: payments-api
      span.kind: server
      latency_ms: "> 500"
then:
  capture_level:
    cgroup: from_span
    level: 2            # enable small payload slices
    ttl: 10m
    byte_budget: 50MiB
  metrics:
    record_histogram: syscall.durations
  notify:
    channel: oncall-sre
```
The control plane compiles this into map updates for the eBPF agent and corresponding collector processors. All actions are logged.
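One plausible shape for that compilation step, assuming the agent pins its cgroup->level policy map at /sys/fs/bpf/policy and the per-node agent can shell out to bpftool (both assumptions; resolving the pod named in the DSL to its cgroup ids is elided):

```python
# policy_compile.py: apply a DSL escalation as BPF policy map updates (sketch)
import struct
import subprocess

POLICY_MAP_PIN = '/sys/fs/bpf/policy'   # assumed pin path for the agent's policy map

def as_byte_args(data: bytes) -> list[str]:
    """bpftool expects key/value bytes as space-separated numbers."""
    return [f'0x{b:02x}' for b in data]

def set_capture_level(cgroup_id: int, level: int) -> None:
    """Update the cgroup->level map (u64 key, int value, host endianness)."""
    key = struct.pack('=Q', cgroup_id)
    value = struct.pack('=i', level)
    subprocess.run(
        ['bpftool', 'map', 'update', 'pinned', POLICY_MAP_PIN,
         'key', *as_byte_args(key), 'value', *as_byte_args(value)],
        check=True,
    )

# The control plane would call set_capture_level(cgid, 2) for every cgroup backing
# the selected pod, schedule set_capture_level(cgid, 0) at TTL expiry, and log both.
```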
Security, Privacy, and Compliance
- mTLS everywhere: agent-to-collector, collector-to-backend, control-plane-to-agent
- RBAC: least privilege for capture escalation; break-glass mode requires two-person approval
- On-host redaction default: do not ship raw payloads until policy escalates; even then, redact patterns
- Data minimization: short TTLs, byte budgets, and circuit breakers when volumes spike
- Audit: every policy change, data flow, and replay is an immutable log event
- No SSH: control plane only; ephemeral sandboxes are created via orchestrator APIs
Performance and Safety Considerations
- Keep eBPF programs small and bounded; prefer the BPF ring buffer over per-CPU perf buffers for lower overhead
- Use per-CPU maps and preallocated buffers to avoid contention
- Benchmark overhead: enable latency histograms on a subset of syscalls; measure delta under load
- Backpressure: if the ring buffer fills, prefer dropping payload slices first, then metadata
- CO-RE portability: rely on BTF; ship one binary; validate on kernel upgrades in staging
Step-by-Step Adoption Plan
Day 1: Baseline OTel
- Instrument services with OTel SDKs or auto-instrumentation
- Deploy OTel Collector with k8s processors and tail-based sampling
- Store traces and metrics; set SLOs and exemplars
Day 2: eBPF Agent in Minimal Mode
- Deploy an eBPF agent DaemonSet that collects metadata-only syscall and TCP events
- Index events by cgroup, pid, and k8s identity; forward as OTLP logs
- Validate overhead under load tests
Day 3: Correlation and Dashboards
- Join spans and eBPF events by time/pod; add exemplar links from metrics to traces
- Build SRE dashboards showing p99 overlays with TCP retransmits and syscall outliers
Day 4: Policy Engine and Escalations
- Implement control-plane APIs to update the agent policy map by cgroup
- Configure collector processors for redaction and tail-based sampling
- Dry-run escalation on canary namespaces
Day 5: Debug AI Integration
- Teach the AI retrieval layer where to query: traces store, metrics, eBPF event warehouse
- Implement hypothesis templates: GC vs IO vs network vs scheduler (see the sketch after this list)
- Wire the policy DSL and approval workflow; test replay sandbox on synthetic faults
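A minimal sketch of hypothesis templates expressed as data; the signal names and next steps are illustrative and should map onto whatever metric and event attributes your stores actually expose.

```python
# hypotheses.py: declarative hypothesis templates for the triage step (sketch)
HYPOTHESES = [
    {
        'name': 'network_path',
        'evidence': ['tcp.retransmits.rate', 'tcp.syn_backlog.drops'],
        'counter_evidence': ['runtime.gc.pause.p99'],
        'next_step': 'escalate packet-header capture for the implicated pods',
    },
    {
        'name': 'io_stall',
        'evidence': ['syscall.fsync.p99', 'blkio.throttle.count'],
        'counter_evidence': ['tcp.retransmits.rate'],
        'next_step': 'record per-event syscall timings for the suspect cgroup',
    },
    {
        'name': 'scheduler_pressure',
        'evidence': ['sched.runq.latency.p99', 'cgroup.cpu.throttled.periods'],
        'counter_evidence': [],
        'next_step': 'sample run queue depth and CPU throttling on the node',
    },
]

def score(hypothesis, signals):
    """Naive scoring: supporting signals add, contradicting signals subtract."""
    support = sum(signals.get(s, 0.0) for s in hypothesis['evidence'])
    contradiction = sum(signals.get(s, 0.0) for s in hypothesis['counter_evidence'])
    return support - contradiction
```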
Case Study: The Mystery of Spiky 500s
Symptoms: the checkout service shows intermittent 500s, clustered in bursts. Stack traces point to timeouts on an upstream call. No errors in the primary service logs.
What happened with the proposed architecture:
- Detection: tail-based sampler kept error traces. Metrics exemplars linked to the worst traces.
- Correlation: overlaying eBPF TCP events showed spikes in TCP retransmits on the node hosting the upstream service. Syscall-level histograms showed sendto latency spikes during the bursts.
- Hypothesis: probable network path issue, not app bug. AI recommends capture escalation of packet headers for the upstream pod and enabling SYN backlog stats.
- Escalation: policy raises capture level for the specific pod for 10 minutes with a 20 MiB cap.
- Capture: headers show large MSS with TSO; NIC driver updated recently. dmesg logs (forwarded via collector) show checksum offload errors.
- Replay: with recorded payload slices, a sandbox replicates the handshake path; disabling offload removes retransmits.
- Fix: roll back NIC driver; attach a config that disables problematic offloads for the node pool. Incident closed in under an hour.
No SSH was needed. Payload capture was limited to headers; no PII left the node. All actions were audited.
Limitations and Real-World Frictions
- Language runtime visibility: extracting active span IDs via uprobes is runtime-specific. Prefer time-and-identity joins unless you can rely on USDT or well-known symbols.
- Payload redaction complexity: naive regex in kernel is hard; use structured tokenization or push advanced redaction to the collector after slicing safe subsets.
- Kernel portability: CO-RE helps, but keep a staging matrix of your kernels. Avoid deep struct poking; stick to stable tracepoints where possible.
- Overhead budgeting: continuous profiling, tracing, and eBPF together can add up. Use adaptive sampling and turn on expensive probes only under policy.
- Replay fidelity: boundary capture fixes most issues, but not all. For race conditions deep in the kernel or hardware bugs, you may need targeted lab repros.
Opinionated Takeaways
- Stop SSHing into prod: build controlled, audited, zero-SSH workflows; you will be faster and more compliant.
- Default to metadata-only capture; earn the right to payloads with tight policies and TTLs.
- Use OTel as the policy and semantics layer; keep eBPF focused on safe, fast, minimal capture.
- Invest in correlation: exemplars, identity joins, and when feasible, runtime hooks for active span IDs.
- Treat deterministic replay as a first-class capability; it short-circuits endless blame-game loops.
Conclusion
Debugging production incidents is a systems problem, not just an application problem. Pairing eBPF with OpenTelemetry gives you system traces that explain the why behind your stack traces. With a debug AI orchestrating policy-driven capture and correlation, you can diagnose incidents faster, safer, and without shell access. The blueprint here favors minimal capture, deterministic replay, and trace-driven escalation. It is opinionated because production demands it: respect safety by default, and instrument the system with precision when it matters.
If you implement one thing this quarter, make it the minimal-mode eBPF agent feeding OTel. The rest — correlation, policy escalations, AI triage, and replay — can layer on. The payoff is a production environment you can interrogate confidently, even on its worst days.
