From Stack Traces to System Traces: How Debug AI Should Use eBPF and OpenTelemetry to Diagnose Prod Incidents Safely
Most production incident workflows still orbit around two artifacts: logs and stack traces. They are invaluable, but they are also incomplete. Stack traces tell you where a thread was, not why the surrounding system forced it there. The kernel, the network, the scheduler, the filesystem, the NIC offloads, and the runtime environment exert hidden pressure that never surfaces in app-level traces. When your p99 explodes or your API returns sporadic 500s, a text stack trace won’t tell you that a TCP retransmit storm or a thundering herd in the page cache is the real culprit.
This article proposes a pragmatic and safe architecture: marry eBPF-powered system tracing with OpenTelemetry’s semantic model and sampling controls, then put a debug AI on top to automate triage. The core tenets are:
- Minimal data capture by default, with policy-driven escalation and redaction
- Deterministic (or near-deterministic) replay of the failing path without SSH access
- End-to-end correlation between application spans and kernel events
- A closed-loop AI workflow that hypothesizes, narrows the search, and validates findings
The audience for this design is SREs, platform engineers, and observability teams who need deep system visibility without violating safety, privacy, and compliance promises. The goal is not to replace skilled operators; it is to move from whack-a-mole, SSH-based root cause hunts to a repeatable, auditable, and fast diagnostic system.
Why Stack Traces Alone Fall Short
Stack traces answer what a thread was doing at a moment. They do not capture:
- Kernel scheduling delay: Is the run queue saturated? Was the thread preempted at the wrong time?
- I/O stalls: Was an fsync delayed by writeback throttling? Did the cgroup hit an IO limit?
- Network pathologies: TCP SYN backlog overflow, retransmits, or NIC offload quirks
- Contention outside your process: Socket buffer pressure, page cache evictions, kernel lock contention
- Container and cgroup constraints: CPU throttling, memory pressure zones, NUMA effects
The cost of missing these signals is high: flapping incidents that only appear to resolve after ad-hoc debugging, risky SSH sessions into production hosts, and guesswork that breeds incident churn.
eBPF 101: The Safest System Microscope in Production
eBPF (extended Berkeley Packet Filter) lets you safely attach small programs to kernel hooks (kprobes, tracepoints, uprobes for user space, network hooks, LSM). The kernel’s verifier enforces strict safety constraints, and programs are JIT-compiled and run in a restricted sandbox. With CO-RE (compile once, run everywhere) and BTF type information, you can ship one eBPF binary across kernel versions.
Key primitives relevant to prod-safe debugging:
- kprobes and tracepoints to observe syscalls, scheduler, TCP events, block I/O
- uprobes and USDT to observe user-space functions without code changes
- BPF maps for state: hash maps, LRU maps, per-CPU arrays, ring buffers for event streaming
- Ring buffer API for low-overhead data transfer
- Verifier and tail calls for safety and modularity
Cost: well-designed eBPF programs can stay under a few percent of CPU overhead for targeted instrumentation. The principal risk is not performance but data spillage, so we must enforce minimal capture and robust redaction.
OpenTelemetry 101: The Control Plane for Signal Semantics
OpenTelemetry (OTel) gives us a shared grammar for metrics, traces, and logs:
- Traces and spans with context propagation across services
- Attributes with semantic conventions (network peer, HTTP status, database system, etc.)
- Metrics pipelines with exemplars that link back to traces
- Logs as structured events
- OpenTelemetry Collector: a vendor-neutral router with processors (redaction, sampling, tail-based decisions)
OTel is the control plane for how we name, route, sample, and enrich signals. eBPF is the data plane for where to tap the system safely. Combining them enables correlation and policy-driven capture escalations.
Design Principles: Safe Production Diagnosis
- Default deny on data volume: capture metadata-first, payload-later, time-bounded, policy-gated.
- Deterministic replay: capture the minimal boundary conditions needed to reproduce the fault in a sandbox (syscalls, file reads, selective network I/O, time and randomness sources).
- Correlation, not guesswork: link system events to spans, pods, and services via OTel IDs and k8s identity.
- Zero SSH: all control flows over audited APIs, with mTLS and RBAC; actions are idempotent and reversible.
- Human-in-the-loop with AI augmentation: the AI proposes, you approve; or enforce policy guardrails for autonomous modes.
Reference Architecture
A high-level view:
- eBPF Agent (DaemonSet in Kubernetes): attaches kprobes/tracepoints/uprobes and streams ring buffer events.
- OpenTelemetry Collector: receives, enriches with k8s metadata, performs tail-based sampling, redaction, and routing.
- Debug AI Control Plane: embeds a policy engine, issues capture-level changes, queries backends, runs causal analysis.
- Backends: trace store (e.g., Tempo), metrics (Prometheus or OTLP -> TSDB), logs (Loki or OTLP log store), system events (ClickHouse or similar column store).
- Replay Sandbox: ephemeral worker that replays captured boundary conditions for determinism.
Data flow:
- Normal mode: app emits OTel traces; eBPF agent emits low-cardinality system events (metadata only). Collector enriches, correlates, stores.
- Incident trigger: an SLO violation fires or the tail-based sampler flags anomalous traces. The AI requests a temporary capture escalation via the control plane.
- Elevated capture: eBPF agent records additional fields (e.g., packet headers or select payload slices, file read hashes) for involved pods only, with TTL and byte caps.
- Replay: the AI spins up a sandbox with the captured inputs to confirm the root cause and test candidate patches.
- De-escalation: capture returns to default minimal mode; incident narrative and artifacts are preserved.
Minimal but Sufficient Data: A Layered Capture Strategy
Start metadata-only, then escalate narrowly.
- Always-on (low risk, low volume):
  - syscall latencies (aggregate distributions), TCP retransmit counters, scheduler run queue samples
  - kernel error events (ICMP unreachable, timeouts), k8s cgroup IDs, pod and container IDs
  - HTTP method, route template (via eBPF-aware HTTP classifiers like Beyla or Pixie; or app telemetry)
  - no payloads, no PII
- On trigger (tail-based sampler picks anomalous traces):
  - record per-event syscall timings for the suspect process group
  - capture network 5-tuples and top N bytes of payload for matching endpoints only, with on-host redaction rules
  - hash of file reads and configs involved in the request
- Manual escalation (time-boxed, break-glass):
  - full packet capture for a single pod or node for a few minutes
  - binary snapshots of epoll interest lists or TCP states for forensics
The eBPF agent enforces byte budgets and drops beyond cap. Redaction happens in-kernel where possible (for example, zeroing all digits in payload, or substring removal for known tokens).
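To make the budget enforcement concrete, here is a minimal user-space sketch of per-cgroup byte budgeting with a TTL, assuming a hypothetical forwarder that decides per event whether a payload slice may leave the node; the class names and limits are illustrative, not part of any existing agent.

```python
# budget.py: per-cgroup payload byte budget with TTL (illustrative sketch)
import time
from dataclasses import dataclass

@dataclass
class CaptureBudget:
    max_bytes: int        # e.g. 50 MiB per escalation window
    expires_at: float     # unix timestamp when the escalation ends
    used_bytes: int = 0

class BudgetEnforcer:
    """Drops payload slices (but keeps metadata) once a cgroup's budget is spent."""

    def __init__(self) -> None:
        self._budgets: dict[int, CaptureBudget] = {}   # cgroup_id -> budget

    def escalate(self, cgroup_id: int, max_bytes: int, ttl_s: int) -> None:
        self._budgets[cgroup_id] = CaptureBudget(max_bytes, time.time() + ttl_s)

    def admit_payload(self, cgroup_id: int, payload_len: int) -> bool:
        budget = self._budgets.get(cgroup_id)
        if budget is None or time.time() > budget.expires_at:
            self._budgets.pop(cgroup_id, None)   # expired: back to metadata-only
            return False
        if budget.used_bytes + payload_len > budget.max_bytes:
            return False                         # over cap: drop the payload slice
        budget.used_bytes += payload_len
        return True
```

The forwarder strips the payload field whenever admit_payload returns False; a per-cgroup counter map can mirror the same logic in-kernel as a first line of defense.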
eBPF: Capturing Events With Policy
A minimal CO-RE eBPF sketch that captures syscall entries and attaches cgroup identity:
```c
// bpf_syscalls.c (CO-RE sketch; illustrative)
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

// Syscall numbers are not in vmlinux.h; define (or generate) them per architecture.
#ifndef __NR_write
#define __NR_write 1    /* x86_64 */
#endif
#ifndef __NR_sendto
#define __NR_sendto 44  /* x86_64 */
#endif

struct event_t {
    u64 ts;
    u32 pid;
    u32 tid;
    u64 cgroup_id;
    int syscall_nr;
    int policy_level;   // 0=meta-only, 1=timing, 2=payload slice
    u32 data_len;
    char data[80];      // payload slice when policy allows
};

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 1 << 24);   // 16 MiB ring buffer
} events SEC(".maps");

// Simple policy map keyed by cgroup id -> level
struct {
    __uint(type, BPF_MAP_TYPE_LRU_HASH);
    __type(key, u64);
    __type(value, int);
    __uint(max_entries, 1024);
} policy SEC(".maps");

static __always_inline int get_policy_level(u64 cgid)
{
    int *lvl = bpf_map_lookup_elem(&policy, &cgid);
    return lvl ? *lvl : 0;
}

SEC("tp/raw_syscalls/sys_enter")
int on_sys_enter(struct trace_event_raw_sys_enter *ctx)
{
    u64 cgid = bpf_get_current_cgroup_id();
    int lvl = get_policy_level(cgid);

    struct event_t *e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
    if (!e)
        return 0;

    e->ts = bpf_ktime_get_ns();
    e->pid = bpf_get_current_pid_tgid() >> 32;
    e->tid = (u32)bpf_get_current_pid_tgid();
    e->cgroup_id = cgid;
    e->syscall_nr = ctx->id;
    e->policy_level = lvl;
    e->data_len = 0;

    // Optionally copy a small payload slice for, e.g., write or sendto
    if (lvl >= 2 && (ctx->id == __NR_write || ctx->id == __NR_sendto)) {
        const char *buf = (const char *)ctx->args[1];
        if (buf) {
            int copied = bpf_probe_read_user_str(e->data, sizeof(e->data), buf);
            if (copied > 0)
                e->data_len = (u32)copied;
        }
        // Redact in-kernel where possible (here: zero all digits);
        // the loop is bounded by sizeof(e->data) to satisfy the verifier.
        for (u32 i = 0; i < sizeof(e->data) && i < e->data_len; i++) {
            if (e->data[i] >= '0' && e->data[i] <= '9')
                e->data[i] = '0';
        }
    }

    bpf_ringbuf_submit(e, 0);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```
Notes:
- This uses a global policy map keyed by cgroup. The control plane updates it to escalate capture for specific pods.
- It copies at most 80 bytes of payload for select syscalls when policy level permits.
- In-kernel redaction here just zeros digits. Real systems should use structured tokenization rules or simple in-kernel state machines, and push anything resembling regex to the collector after slicing a safe subset.
User space consumes this ring buffer, enriches with Kubernetes metadata, and forwards as OTLP logs with attributes or as events attached to relevant spans.
A minimal user-space consumer sketch (Python-flavored pseudocode; the module names are placeholders):
```python
# events_consumer.py (pseudocode)
from otlp import LogsExporter
from k8smeta import CgroupIndex
from ringbuf import BpfRing

rb = BpfRing('/sys/fs/bpf/events')
exporter = LogsExporter(endpoint='otlp-collector:4317')
index = CgroupIndex()

for ev in rb:
    pod, ns, container = index.lookup(ev.cgroup_id)
    attrs = {
        'linux.pid': ev.pid,
        'linux.tid': ev.tid,
        'k8s.pod': pod,
        'k8s.namespace': ns,
        'k8s.container': container,
        'syscall.nr': ev.syscall_nr,
        'policy.level': ev.policy_level,
    }
    if ev.data_len > 0:
        attrs['payload.slice'] = ev.data[:ev.data_len]
    exporter.export_log(
        timestamp_ns=ev.ts,
        body='sys_enter',
        attributes=attrs,
    )
```
Correlating Kernel Events With Spans
Correlation is the crux. Three practical strategies:
- Time and identity join (works everywhere): join kernel events and spans by time window and k8s identity (pod, container, PID), plus TCP 5-tuple if available. This is robust but approximate at millisecond scale.
- Span exemplars: from eBPF, emit metrics with exemplars that include trace_id and span_id. The collector links these to the trace UI. If the app emits these IDs as thread-local attributes (e.g., exposing the current context via USDT or custom uprobe), store them in a BPF map keyed by TID and attach to events.
- Direct span linkage via uprobes (advanced): attach uprobes to the application runtime to capture the active span ID. For example, a uprobe on user-space functions that start or activate spans can extract the trace_id and set a TID->trace mapping in a BPF map. Later, syscall events look up that mapping and include the IDs.
Illustrative uprobe sketch (conceptual; actual offsets depend on language/runtime):
```c
// Map TID -> { trace id (128-bit), span id (64-bit) }
struct trace_ctx {
    u64 trace_hi;
    u64 trace_lo;
    u64 span_id;
};

struct {
    __uint(type, BPF_MAP_TYPE_LRU_HASH);
    __type(key, u32);
    __type(value, struct trace_ctx);
    __uint(max_entries, 16384);
} tid2trace SEC(".maps");

// Suppose the runtime exposes a USDT/uprobe hook that passes a pointer to the
// 16-byte trace id and a pointer to the 8-byte span id of the activated span.
SEC("uprobe/runtime_trace_activated")
int on_trace_activated(struct pt_regs *ctx)
{
    u32 tid = (u32)bpf_get_current_pid_tgid();
    const void *trace_ptr = (const void *)PT_REGS_PARM1(ctx);
    const void *span_ptr = (const void *)PT_REGS_PARM2(ctx);

    struct trace_ctx tc = {};
    bpf_probe_read_user(&tc, 16, trace_ptr);        // 128-bit trace id
    bpf_probe_read_user(&tc.span_id, 8, span_ptr);  // 64-bit span id
    bpf_map_update_elem(&tid2trace, &tid, &tc, BPF_ANY);
    return 0;
}
```
When emitting a syscall event, look up tid2trace and attach trace IDs. Where direct runtime hooks are infeasible, prefer the time-and-identity join.
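As an illustration of the first strategy, here is a small sketch of the time-and-identity join, assuming spans and kernel events have already been fetched from their stores; the dictionary fields (pod, start_ns, end_ns, ts_ns, span_id) describe an assumed shape, not a fixed schema.

```python
# join.py: approximate time-and-identity join of kernel events onto spans (sketch)
from bisect import bisect_left
from collections import defaultdict

def join_events_to_spans(spans, events, slack_ns=2_000_000):
    """Attach kernel events to spans that share a pod and overlap in time.

    spans:  dicts with 'span_id', 'pod', 'start_ns', 'end_ns'
    events: dicts with 'pod', 'ts_ns'
    slack_ns widens span windows to absorb clock skew between the two sources.
    """
    events_by_pod = defaultdict(list)
    for ev in events:
        events_by_pod[ev['pod']].append(ev)
    for pod_events in events_by_pod.values():
        pod_events.sort(key=lambda e: e['ts_ns'])

    matches = defaultdict(list)                  # span_id -> [event, ...]
    for span in spans:
        pod_events = events_by_pod.get(span['pod'], [])
        timestamps = [e['ts_ns'] for e in pod_events]
        start = bisect_left(timestamps, span['start_ns'] - slack_ns)
        for ev in pod_events[start:]:
            if ev['ts_ns'] > span['end_ns'] + slack_ns:
                break
            matches[span['span_id']].append(ev)
    return matches
```

The slack window is the knob that trades precision for recall; the tighter your clock synchronization, the smaller it can be.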
Projects to study for inspiration:
- Pixie (CNCF): eBPF data plane with app-level correlation and SQL-like queries
- Grafana Beyla: eBPF auto-instrumentation for HTTP and gRPC, OTel-friendly
- Parca agent: continuous profiling via eBPF
- Cilium Tetragon: security observability with policy and eBPF
Deterministic Replay Without SSH
To reproduce elusive failures, capture boundary conditions deterministically:
- Sources of nondeterminism: getrandom, time, scheduler, network, file system state
- Strategy: record and later inject deterministic values and inputs
- Syscalls: record parameters and return codes for a focused set contributing to the incident path
- Time: intercept time syscalls and store monotonic deltas
- Randomness: capture getrandom buffers for the request scope
- Network: capture request-response payload slices for the implicated flow only
- Files: capture content hashes plus a content snapshot for small files
In a replay sandbox:
- Reconstruct the environment: same container image, env variables, flags
- Use seccomp or LD_PRELOAD interposition to serve recorded syscall responses
- Feed captured network payloads and time/random streams
A small bpftrace example to capture time and randomness syscalls (illustrative):
```bash
# Capture getrandom and clock_gettime activity for a specific PID (illustrative)
bpftrace -e '
tracepoint:syscalls:sys_enter_getrandom /pid == 12345/ {
  @len[tid] = args->count;
}
tracepoint:syscalls:sys_exit_getrandom /pid == 12345/ {
  printf("getrandom requested=%d returned=%d\n", @len[tid], args->ret);
  delete(@len[tid]);
}
tracepoint:syscalls:sys_enter_clock_gettime /pid == 12345/ {
  @clk[tid] = args->which_clock;
}
tracepoint:syscalls:sys_exit_clock_gettime /pid == 12345/ {
  printf("clock id=%d ret=%d\n", @clk[tid], args->ret);
  delete(@clk[tid]);
}'
```
For real deployments, prefer CO-RE programs with ring buffers and a user-space recorder. You do not need full RR-style instruction-level replay; a boundary replay suffices for most production incidents.
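Below is a minimal sketch of a boundary replay driver, assuming the recorder writes a JSON manifest containing recorded clock values, getrandom buffers, and request payload slices; the manifest layout and file path are assumptions, not a defined format.

```python
# replay.py: serve recorded boundary conditions back to a sandboxed service (sketch)
import json

class BoundaryReplay:
    """Replays recorded time, randomness, and network inputs in capture order."""

    def __init__(self, manifest_path: str) -> None:
        with open(manifest_path) as f:
            manifest = json.load(f)
        self._clock = iter(manifest['clock_ns'])   # recorded clock_gettime results
        self._random = iter(manifest['random'])    # recorded getrandom buffers (hex)
        self._requests = manifest['requests']      # recorded request payload slices

    def now_ns(self) -> int:
        return next(self._clock)

    def getrandom(self, n: int) -> bytes:
        return bytes.fromhex(next(self._random))[:n]

    def requests(self):
        for req in self._requests:
            yield bytes.fromhex(req['payload'])

# usage (illustrative): an LD_PRELOAD or seccomp shim in the sandbox asks this
# process for now_ns()/getrandom() answers, while the driver replays each request
# against the service and compares responses with the recorded incident behavior.
```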
OpenTelemetry Collector: Policy, Sampling, and Redaction
Use the collector to enforce governance:
- Tail-based sampling: retain anomalous traces, drop the rest
- Attribute processors: drop PII, hash identifiers, normalize route templates
- Metrics to traces: export exemplars for p99 latencies to jump to traces
- Routing: keep data on-prem or in-region
Example collector pipeline (YAML, illustrative):
```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:
  journald: {}
  # A custom eBPF receiver or a generic filelog receiver can ingest agent events
  filelog:
    include: [/var/log/ebpf-events/*.json]

processors:
  k8sattributes:
    extract:
      metadata: [k8s.pod.name, k8s.namespace.name, k8s.container.name]
  attributes:
    actions:
      - key: http.request.header.authorization
        action: delete
      - key: payload.slice
        action: hash
  tail_sampling:
    decision_wait: 5s
    policies:
      - name: server_errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: high_latency
        type: latency
        latency:
          threshold_ms: 500
  transform:
    error_mode: ignore
    trace_statements:
      - context: span
        statements:
          - keep_keys(attributes, ["k8s.pod.name", "service.name", "net.peer.ip"])

exporters:
  otlphttp:
    endpoint: http://tempo:4318
  prometheus:
    endpoint: 0.0.0.0:8889
  clickhouse:
    endpoint: tcp://clickhouse:9000

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [k8sattributes, tail_sampling, attributes, transform]
      exporters: [otlphttp]
    logs:
      receivers: [otlp, filelog, journald]
      processors: [k8sattributes, attributes]
      exporters: [clickhouse]
    metrics:
      receivers: [otlp]
      processors: []
      exporters: [prometheus]
```
The Debug AI Loop
The AI’s job is to minimize time-to-first-explanation and reduce operator toil, not to guess wildly. A typical loop (a code sketch follows the list):
- Detect: SLO breaches or anomaly detectors fire.
- Gather: Query OTel backends for top N anomalous traces, correlated eBPF events, and recent changes (deploys, config, kernel counters).
- Hypothesize: Build a causal graph of spans with local delays, overlaid with kernel metrics (TCP retransmits, disk stalls) and cgroup throttling.
- Test: Issue a capture escalation for the implicated pods only (policy TTL, budgeted). Re-run queries.
- Explain: Produce a reproducible incident narrative with supporting artifacts and pointers to the code that needs fixing.
- Validate: Launch a replay sandbox using the captured boundary conditions and verify the fix.
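Sketched as control-plane pseudocode, one pass of that loop might look like the following; the client objects (traces, events, policies, sandbox) and the rank_hypotheses helper are assumptions standing in for whatever query, policy, and replay APIs your platform exposes.

```python
# triage_loop.py: one pass of the debug AI loop (pseudocode; clients are hypothetical)
def triage(incident, traces, events, policies, sandbox):
    # Gather: worst traces plus correlated kernel events for the affected service
    worst = traces.query_anomalous(service=incident.service, limit=20)
    kernel = events.query(pods={t.pod for t in worst}, window=incident.window)

    # Hypothesize: score candidate causes (network vs IO vs scheduler vs GC)
    hypotheses = rank_hypotheses(worst, kernel)   # placeholder scoring function

    # Test: escalate capture narrowly for the top hypothesis, then re-query
    top = hypotheses[0]
    policies.escalate(pods=top.pods, level=2, ttl='10m', byte_budget='50MiB')
    evidence = events.query(pods=top.pods, window='now-10m', detail='elevated')

    # Explain and validate: build the narrative, then confirm it in a replay sandbox
    narrative = top.explain(evidence)
    confirmed = sandbox.replay(incident.capture_manifest, expect=incident.symptom)
    return narrative, confirmed
```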
A small policy DSL for escalation (illustrative):
```yaml
when:
  traces:
    where:
      service.name: payments-api
      span.kind: server
      latency_ms: "> 500"
then:
  capture_level:
    cgroup: from_span
    level: 2            # enable small payload slices
    ttl: 10m
    byte_budget: 50MiB
  metrics:
    record_histogram: syscall.durations
  notify:
    channel: oncall-sre
```
The control plane compiles this into map updates for the eBPF agent and corresponding collector processors. All actions are logged.
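One plausible shape for that compilation step, assuming the agent pins its cgroup->level policy map at /sys/fs/bpf/policy and the per-node agent can shell out to bpftool (both assumptions; resolving the pod named in the DSL to its cgroup ids is elided):

```python
# policy_compile.py: apply a DSL escalation as BPF policy map updates (sketch)
import struct
import subprocess

POLICY_MAP_PIN = '/sys/fs/bpf/policy'   # assumed pin path for the agent's policy map

def as_byte_args(data: bytes) -> list[str]:
    """bpftool expects key/value bytes as space-separated numbers."""
    return [f'0x{b:02x}' for b in data]

def set_capture_level(cgroup_id: int, level: int) -> None:
    """Update the cgroup->level map (u64 key, int value, host endianness)."""
    key = struct.pack('=Q', cgroup_id)
    value = struct.pack('=i', level)
    subprocess.run(
        ['bpftool', 'map', 'update', 'pinned', POLICY_MAP_PIN,
         'key', *as_byte_args(key), 'value', *as_byte_args(value)],
        check=True,
    )

# The control plane would call set_capture_level(cgid, 2) for every cgroup backing
# the selected pod, schedule set_capture_level(cgid, 0) at TTL expiry, and log both.
```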
Security, Privacy, and Compliance
- mTLS everywhere: agent-to-collector, collector-to-backend, control-plane-to-agent
- RBAC: least privilege for capture escalation; break-glass mode requires two-person approval
- On-host redaction default: do not ship raw payloads until policy escalates; even then, redact patterns
- Data minimization: short TTLs, byte budgets, and circuit breakers when volumes spike
- Audit: every policy change, data flow, and replay is an immutable log event
- No SSH: control plane only; ephemeral sandboxes are created via orchestrator APIs
Performance and Safety Considerations
- Keep eBPF programs small and bounded; prefer the BPF ring buffer over per-CPU perf buffers for lower overhead
- Use per-CPU maps and preallocated buffers to avoid contention
- Benchmark overhead: enable latency histograms on a subset of syscalls; measure delta under load
- Backpressure: if the ring buffer fills, prefer dropping payload slices first, then metadata
- CO-RE portability: rely on BTF; ship one binary; validate on kernel upgrades in staging
Step-by-Step Adoption Plan
Day 1: Baseline OTel
- Instrument services with OTel SDKs or auto-instrumentation
- Deploy OTel Collector with k8s processors and tail-based sampling
- Store traces and metrics; set SLOs and exemplars
Day 2: eBPF Agent in Minimal Mode
- Deploy an eBPF agent DaemonSet that collects metadata-only syscall and TCP events
- Index events by cgroup, pid, and k8s identity; forward as OTLP logs
- Validate overhead under load tests
Day 3: Correlation and Dashboards
- Join spans and eBPF events by time/pod; add exemplar links from metrics to traces
- Build SRE dashboards showing p99 overlays with TCP retransmits and syscall outliers
Day 4: Policy Engine and Escalations
- Implement control-plane APIs to update the agent policy map by cgroup
- Configure collector processors for redaction and tail-based sampling
- Dry-run escalation on canary namespaces
Day 5: Debug AI Integration
- Teach the AI retrieval layer where to query: traces store, metrics, eBPF event warehouse
- Implement hypothesis templates: GC vs IO vs network vs scheduler (see the sketch after this list)
- Wire the policy DSL and approval workflow; test replay sandbox on synthetic faults
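A minimal sketch of hypothesis templates expressed as data; the signal names and next steps are illustrative and should map onto whatever metric and event attributes your stores actually expose.

```python
# hypotheses.py: declarative hypothesis templates for the triage step (sketch)
HYPOTHESES = [
    {
        'name': 'network_path',
        'evidence': ['tcp.retransmits.rate', 'tcp.syn_backlog.drops'],
        'counter_evidence': ['runtime.gc.pause.p99'],
        'next_step': 'escalate packet-header capture for the implicated pods',
    },
    {
        'name': 'io_stall',
        'evidence': ['syscall.fsync.p99', 'blkio.throttle.count'],
        'counter_evidence': ['tcp.retransmits.rate'],
        'next_step': 'record per-event syscall timings for the suspect cgroup',
    },
    {
        'name': 'scheduler_pressure',
        'evidence': ['sched.runq.latency.p99', 'cgroup.cpu.throttled.periods'],
        'counter_evidence': [],
        'next_step': 'sample run queue depth and CPU throttling on the node',
    },
]

def score(hypothesis, signals):
    """Naive scoring: supporting signals add, contradicting signals subtract."""
    support = sum(signals.get(s, 0.0) for s in hypothesis['evidence'])
    contradiction = sum(signals.get(s, 0.0) for s in hypothesis['counter_evidence'])
    return support - contradiction
```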
Case Study: The Mystery of Spiky 500s
Symptoms: the checkout service shows intermittent 500s, clustered in bursts. Stack traces point to timeouts on an upstream call. No errors in the primary service logs.
What happened with the proposed architecture:
- Detection: tail-based sampler kept error traces. Metrics exemplars linked to the worst traces.
- Correlation: overlaying eBPF TCP events showed spikes in TCP retransmits on the node hosting the upstream service. Syscall-level histograms showed sendto latency spikes during the bursts.
- Hypothesis: probable network path issue, not app bug. AI recommends capture escalation of packet headers for the upstream pod and enabling SYN backlog stats.
- Escalation: policy raises capture level for the specific pod for 10 minutes with a 20 MiB cap.
- Capture: headers show large MSS with TSO; NIC driver updated recently. dmesg logs (forwarded via collector) show checksum offload errors.
- Replay: with recorded payload slices, a sandbox replicates the handshake path; disabling offload removes retransmits.
- Fix: roll back NIC driver; attach a config that disables problematic offloads for the node pool. Incident closed in under an hour.
No SSH was needed. Payload capture was limited to headers; no PII left the node. All actions were audited.
Limitations and Real-World Frictions
- Language runtime visibility: extracting active span IDs via uprobes is runtime-specific. Prefer time-and-identity joins unless you can rely on USDT or well-known symbols.
- Payload redaction complexity: naive regex in kernel is hard; use structured tokenization or push advanced redaction to the collector after slicing safe subsets.
- Kernel portability: CO-RE helps, but keep a staging matrix of your kernels. Avoid deep struct poking; stick to stable tracepoints where possible.
- Overhead budgeting: continuous profiling, tracing, and eBPF together can add up. Use adaptive sampling and turn on expensive probes only under policy.
- Replay fidelity: boundary capture fixes most issues, but not all. For race conditions deep in the kernel or hardware bugs, you may need targeted lab repros.
Opinionated Takeaways
- Stop SSHing into prod: build controlled, audited, zero-SSH workflows; you will be faster and more compliant.
- Default to metadata-only capture; earn the right to payloads with tight policies and TTLs.
- Use OTel as the policy and semantics layer; keep eBPF focused on safe, fast, minimal capture.
- Invest in correlation: exemplars, identity joins, and when feasible, runtime hooks for active span IDs.
- Treat deterministic replay as a first-class capability; it short-circuits endless blame-game loops.
Conclusion
Debugging production incidents is a systems problem, not just an application problem. Pairing eBPF with OpenTelemetry gives you system traces that explain the why behind your stack traces. With a debug AI orchestrating policy-driven capture and correlation, you can diagnose incidents faster, safer, and without shell access. The blueprint here favors minimal capture, deterministic replay, and trace-driven escalation. It is opinionated because production demands it: respect safety by default, and instrument the system with precision when it matters.
If you implement one thing this quarter, make it the minimal-mode eBPF agent feeding OTel. The rest — correlation, policy escalations, AI triage, and replay — can layer on. The payoff is a production environment you can interrogate confidently, even on its worst days.
