Trace-Aware Code Debugging AI: Wiring OpenTelemetry and eBPF into Automatic Root Cause Analysis
Modern systems are too distributed, too dynamic, and too fast-moving for brittle, rule-based debugging playbooks. We instrument everything, then drown in it; the result is heuristic soup. What we want instead is an explicit, causal view of program behavior that ties errors to the code and commits that made them inevitable, then synthesizes a minimal reproduction and a verified patch.
This article presents a practical architecture to get there using two sources of truth:
- Application-level trace context via OpenTelemetry (OTel) spans and logs.
- System-level ground truth via eBPF kernel and user-space probes.
By stitching these streams into a single causal graph, then shaping the data into model-ready features, we can drive a code-debugging AI that:
- Pinpoints the faulty commit(s) and function(s) with high precision.
- Reproduces crashes deterministically, in CI.
- Proposes patches and tests, with automatic verification loops.
No magic heuristics. Just causality, program structure, and evidence. Along the way, we will cover privacy, performance, and what to build first.
Executive summary
- Use OTel to capture spans, events, and error signals with build and commit metadata embedded at source.
- Use eBPF to capture syscall, memory, and network facts, including stack traces, with low overhead.
- Correlate both streams via process and thread IDs, timestamp windows, and explicit span context propagation.
- Normalize and shape the data: enforce cardinality limits, scrub PII, quantize high-entropy fields, map symbols to code and commits, and build a causal graph.
- Drive an AI loop that performs dynamic slicing over the graph, localizes the defect to a change set, synthesizes a reproduction harness, proposes a patch, and validates it in CI.
- Operate with clear performance budgets and privacy controls.
The result: a trace-aware, code-first debugging assistant that improves MTTR, reduces noisy guesswork, and builds a growing corpus of verified fixes.
Why now: telemetry is finally rich enough to do this
- Distributed tracing is ubiquitous thanks to OpenTelemetry. You can get parent-child causal links, time intervals, resource attributes, and error events across services. OTel gives us high-level intent.
- eBPF provides safe, low-overhead kernel and user-space probing on Linux. We can capture syscalls, stack traces, latency histograms, and specific function calls without modifying binaries (uprobes attach at runtime; CO-RE keeps the probes portable across kernel versions). eBPF gives us ground truth.
- Code LLMs matured enough to reason about diffs, tests, and patches. We still must verify everything, but they can draft useful reproductions and fixes quickly.
- Build systems and artifact metadata (SBOMs, git metadata, build IDs) can connect runtime behavior to exact commits reliably.
Architecture overview
At a high level, the system looks like this:
- Ingestion
- OTel SDKs emit spans, logs, and metrics to the OTel Collector with build info (git SHA, version, module path).
- eBPF agents run on hosts or nodes to capture kernel tracepoints, kprobes, uprobes, and stack traces, streaming to a collector.
- Correlation and normalization
- A correlator aligns events by process ID (PID), thread ID (TID), container, and time; it enriches OTel spans with kernel facts that fall inside the span interval.
- A normalizer sanitizes fields, de-duplicates events, limits cardinality, and attaches symbolicated stacks and code ownership.
- Data shaping
- Transform the mixed telemetry into a causal graph: spans are nodes; kernel events are annotations; edges encode happens-before and data dependencies.
- Compute features per node and edge (error tags, timeouts, retry patterns, resource contention, lock events, anomalous histograms).
- Join with repo metadata (blame, diffs, tests, coverage) and build artifacts (DWARF, BTF).
- Root cause analysis loop
- Dynamic slicing over the causal graph to isolate minimal sets of operations that make the failure inevitable.
- Rank candidate source files and commits by evidence: stack frequency, first-point-of-divergence vs prior good runs, temporal locality to error, novelty of change.
- Generate a reproduction harness from the trace (inputs, environment, ordering constraints), run it in CI, and bisect if needed.
- Ask a code model to propose a patch and tests; verify against the harness and property checks; iterate.
- Feedback and governance
- Persist verdicts, reproductions, and patches as training examples (opt-in) and operational runbooks.
- Enforce privacy via redaction, field-level RBAC, and data retention policies.
- Manage overhead via sampling, tail-based selection, dynamic eBPF enablement, and stack depth controls.
OpenTelemetry: spans and logs with code lineage
The OTel side gives us structured spans, events, and exceptions. To maximize usefulness for debugging AI, enrich spans with precise code and build metadata.
Recommendations:
- Propagate trace context across processes and messaging (W3C traceparent). Use SDK autoinstrumentation to capture HTTP, gRPC, queue, DB, and custom operations.
- Add resource attributes at process start: git SHA, branch, module version, build ID, compiler flags, container image digest, and feature flags.
- Emit exception events with stack traces and error types; include contextual parameters (sanitized).
- Use span links for retries or fan-in scenarios.
Example: Go HTTP handler instrumented with OTel that captures an error and build metadata.
```go
package api

import (
	"context"
	"errors"
	"net/http"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
	"go.opentelemetry.io/otel/trace"
)

var tracer = otel.Tracer("checkout-service")

func initBuildResource() {
	// Typically done during SDK setup using resource.WithAttributes.
	// Example attributes: service.name, service.version, vcs.revision, build.id
}

func CheckoutHandler(w http.ResponseWriter, r *http.Request) {
	ctx, span := tracer.Start(r.Context(), "Checkout")
	defer span.End()

	orderID := r.URL.Query().Get("orderId")
	if orderID == "" {
		err := errors.New("missing orderId")
		span.RecordError(err)
		span.SetStatus(codes.Error, "bad request")
		span.SetAttributes(attribute.String("http.route", "/checkout"))
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}

	if err := chargeCard(ctx, orderID); err != nil {
		span.RecordError(err)
		span.SetStatus(codes.Error, "payment failure")
		http.Error(w, "payment failed", http.StatusPaymentRequired)
		return
	}

	w.WriteHeader(http.StatusOK)
	w.Write([]byte("ok"))
}

func chargeCard(ctx context.Context, orderID string) error {
	_, span := tracer.Start(ctx, "ChargeCard", trace.WithSpanKind(trace.SpanKindClient))
	defer span.End()
	// call payment provider ...
	return nil
}
```
OpenTelemetry Collector configuration to batch, sanitize, and export spans and logs into the correlation pipeline:
```yaml
receivers:
  otlp:
    protocols:
      http:
      grpc:

processors:
  batch:
    timeout: 2s
    send_batch_size: 2048
  memory_limiter:
    check_interval: 1s
    limit_mib: 1024
    spike_limit_mib: 256
  attributes:
    actions:
      - key: http.request.header.authorization
        action: delete
      - key: db.statement
        action: hash  # reduce cardinality and scrub sensitive SQL
  transform:
    error_mode: ignore
    log_statements:
      - context: resource
        statements:
          - set(attributes["vcs.revision"], Concat([attributes["service.version"], attributes["vcs.revision"]], ":")) where attributes["vcs.revision"] != nil

exporters:
  otlp:
    endpoint: correlator:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, attributes, batch]
      exporters: [otlp]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, attributes, batch]
      exporters: [otlp]
```
Key idea: bake code lineage (commit SHA, build ID, module path) into resource attributes at process init so every span and log carries exact code provenance.
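For services not written in Go, the same pattern applies at SDK setup. A minimal sketch with the OpenTelemetry Python SDK, assuming the build pipeline exports the commit and build identifiers as environment variables (the variable names GIT_SHA, BUILD_ID, and IMAGE_DIGEST are illustrative):

```python
import os

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Code lineage baked into resource attributes at process init.
resource = Resource.create({
    "service.name": "checkout-service",
    "service.version": os.getenv("SERVICE_VERSION", "unknown"),
    "vcs.revision": os.getenv("GIT_SHA", "unknown"),        # assumed to be injected by CI
    "build.id": os.getenv("BUILD_ID", "unknown"),
    "container.image.digest": os.getenv("IMAGE_DIGEST", "unknown"),
})

provider = TracerProvider(resource=resource)
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")
```

Every span exported by this provider now carries the exact commit and build that produced it, which is what the correlator and the ranking stage rely on later.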
eBPF: ground truth from kernel and user space
OTel spans capture intent and logical causality. eBPF adds facts: the syscalls actually made, the CPU time burned, locks contended, network bytes sent, and the precise stacks running during hot spots.
Common probes to enable:
- Tracepoints: sys_enter_execve, sched_switch, tcp retransmits, block I/O latency, cgroup attach.
- kprobes and kretprobes for specific syscalls (openat, connect, sendto, read, write) and kernel functions of interest.
- uprobes and uretprobes on libc and application functions to capture function args and latencies without changing code.
- Stack traces via BPF stack maps and BTF for symbolization.
Example bpftrace script to watch file open errors and tag them with pid and tid:
```bash
bpftrace -e 'kretprobe:do_sys_openat2 /retval < 0/ { printf("openat error pid=%d tid=%d ret=%d\n", pid, tid, retval); }'
```
For production, prefer libbpf CO-RE (Compile Once, Run Everywhere) with ring buffer delivery and per-CPU maps. A minimal C eBPF program to capture connect errors:
```c
// connect_err.bpf.c
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

struct event {
	__u32 pid;
	__u32 tid;
	int ret;
};

struct {
	__uint(type, BPF_MAP_TYPE_RINGBUF);
	__uint(max_entries, 1 << 24);
} events SEC(".maps");

SEC("kretprobe/__sys_connect")
int BPF_KRETPROBE(handle_connect_exit, int ret)
{
	if (ret >= 0)
		return 0;

	struct event *e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
	if (!e)
		return 0;

	e->pid = bpf_get_current_pid_tgid() >> 32;
	e->tid = bpf_get_current_pid_tgid() & 0xffffffff;
	e->ret = ret;
	bpf_ringbuf_submit(e, 0);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";
```
User-space Go to load it and correlate with process metadata:
```go
// loader.go: user-space side. bpfObjects, loadBpfObjects, and the field names
// below are assumed to be generated by bpf2go from connect_err.bpf.c.
package main

import (
	"bytes"
	"encoding/binary"
	"log"

	"github.com/cilium/ebpf/link"
	"github.com/cilium/ebpf/ringbuf"
	"github.com/cilium/ebpf/rlimit"
)

// event mirrors struct event in connect_err.bpf.c.
type event struct {
	Pid uint32
	Tid uint32
	Ret int32
}

func main() {
	if err := rlimit.RemoveMemlock(); err != nil {
		log.Fatal(err)
	}

	objs := bpfObjects{}
	if err := loadBpfObjects(&objs, nil); err != nil {
		log.Fatal(err)
	}
	defer objs.Close()

	// Attach the kretprobe defined in the BPF object so events start flowing.
	kp, err := link.Kretprobe("__sys_connect", objs.HandleConnectExit, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer kp.Close()

	rd, err := ringbuf.NewReader(objs.Events)
	if err != nil {
		log.Fatal(err)
	}
	defer rd.Close()

	for {
		rec, err := rd.Read()
		if err != nil {
			break
		}
		var e event
		if err := binary.Read(bytes.NewReader(rec.RawSample), binary.LittleEndian, &e); err == nil {
			// send to correlator; enrich with container id and build id from pid
			// e.Pid, e.Tid, e.Ret
		}
	}
}
```
Correlating eBPF events with spans
- Maintain a user-space map from pid:tid to the current span context using OTel SDK hooks: when a span starts, record its span ID under the thread; when it ends, remove the entry. Some runtimes expose thread-local storage or callback hooks for this.
- Alternatively, use uprobes on runtime functions that set thread-local tracing state (for example, intercept __tls_get_addr patterns) or inject USDT probes in the app to expose trace context.
- In distributed systems where threads are reused, join by time windows: if an eBPF event occurs within a span interval on the same pid:tid, or within a short window after a known context switch, link it probabilistically. For HTTP or RPC, the correlation is often crisp.
The correlator does the heavy lifting: merge timelines from OTel and eBPF, enforce monotonicity, and flag inconsistencies. The result is a per-trace enriched view with ground truth events aligned to spans.
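A minimal sketch of the join logic, assuming shaped span records carry pid, tid, and start/end timestamps, and eBPF events carry pid, tid, and a nanosecond timestamp (all field names here are illustrative):

```python
from bisect import bisect_right
from collections import defaultdict


def index_spans(spans):
    """Group spans by (pid, tid) and sort by start time for fast lookup."""
    by_thread = defaultdict(list)
    for s in spans:
        by_thread[(s['pid'], s['tid'])].append(s)
    for key in by_thread:
        by_thread[key].sort(key=lambda s: s['start_ns'])
    return by_thread


def correlate_event(event, by_thread, slack_ns=1_000_000):
    """Return the span active on the same pid:tid at the event's timestamp.

    A small slack window tolerates clock skew between the OTel exporter and
    the eBPF agent; returns None if nothing matches.
    """
    candidates = by_thread.get((event['pid'], event['tid']), [])
    starts = [s['start_ns'] for s in candidates]
    i = bisect_right(starts, event['ts_ns']) - 1
    if i < 0:
        return None
    span = candidates[i]
    if event['ts_ns'] <= span['end_ns'] + slack_ns:
        return span
    return None
```

The probabilistic fallback for reused threads would relax the slack window and attach a confidence score rather than a hard link; that refinement is omitted here.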
Data shaping: from telemetry soup to model-ready features
Raw telemetry will overwhelm an LLM. We need structured, compact, de-duplicated features that preserve causality. The shaping pipeline usually includes:
- Cardinality control: hash or bucket high-entropy fields (SQL text, object IDs) after extracting structural tokens (operation verbs, table names). Keep a top-K set per service.
- Value normalization: normalize paths, IPs, ports, tenant IDs, user agents. Convert to types (filesystem, socket, etc.).
- Time quantization: discretize latencies and durations into log buckets; track histogram deviations from the baseline per edge.
- Stack aggregation: fold identical stacks with counts; keep top N; attach code ownership and repo paths via debug symbols and BTF.
- Exception taxonomy: map raw errors to canonical families (timeout, refused, not found, constraint violation).
- Span causality: store parent-child, links, and overlaps; compute critical path for latency.
- Privacy scrubbing: redact PII fields, apply pattern-based scrubbing, mask payloads.
- Feature joins: enrich with code coverage, tests touched, blame lines, commit recency, and feature flags.
Pseudo-code for shaping a trace into a compact graph record:
```python
def shape_trace(trace):
    # trace.spans includes start, end, attributes, events
    nodes = []
    edges = []
    for span in trace.spans:
        node = {
            'id': span.id,
            'service': span.service,
            'name': bucket_op(span.name),
            'duration_ms': quantize(span.duration_ms),
            'status': span.status,
            'exceptions': classify_exceptions(span.events),
            'stacks': topk_stacks(span, k=5),
            'build': span.resource.get('vcs.revision'),
            'owner': owner_from_symbols(span.stacks),
        }
        nodes.append(node)
        for child in span.children:
            edges.append({'from': span.id, 'to': child.id, 'type': 'child'})

    for ev in trace.ebpf_events:
        span_id = correlate(ev)
        if not span_id:
            continue
        annotate(nodes, span_id, ev)

    graph = {'nodes': nodes, 'edges': edges}
    return enforce_limits(graph, max_nodes=200, max_edges=400)
```
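For completeness, minimal sketches of the quantize and classify_exceptions helpers assumed above. The bucket boundaries, the error taxonomy, and the flat event field names are illustrative, not a fixed schema:

```python
import math
import re


def quantize(duration_ms):
    """Discretize a duration into a log2 bucket label, e.g. '8-16ms'."""
    if duration_ms <= 0:
        return '0ms'
    lo = 2 ** int(math.floor(math.log2(duration_ms)))
    return f'{lo}-{lo * 2}ms'


# Canonical error families keyed by regex over exception type and message.
_ERROR_FAMILIES = [
    (re.compile(r'timeout|deadline', re.I), 'timeout'),
    (re.compile(r'refused|unreachable', re.I), 'connection_refused'),
    (re.compile(r'not\s*found|404', re.I), 'not_found'),
    (re.compile(r'constraint|duplicate key', re.I), 'constraint_violation'),
]


def classify_exceptions(events):
    """Map raw exception events to the canonical taxonomy; keep unknowns as 'other'."""
    families = set()
    for ev in events:
        if ev.get('name') != 'exception':
            continue
        text = f"{ev.get('exception.type', '')} {ev.get('exception.message', '')}"
        for pattern, family in _ERROR_FAMILIES:
            if pattern.search(text):
                families.add(family)
                break
        else:
            families.add('other')
    return sorted(families)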
The output is compact and structured, suitable for both deterministic analysis (graph algorithms) and as input tokens for an LLM (with summaries rather than raw payloads).
Building an explicit causal graph, not heuristics
The central claim is: avoid ad-hoc rules by constructing an explicit causal model and performing dynamic slicing. Concretely:
- Use happens-before from span parent-child and time intervals.
- Add synchronous edges from eBPF events to the span active at that time for the same pid:tid.
- Detect resource conflicts (locks, file descriptors, ports) by matching arguments across events and spans.
- Mark failure nodes (exception spans, non-zero exit codes, negative syscall returns).
- Compute minimal sets of nodes and edges whose removal eliminates the failure (dynamic slice). Prioritize the earliest divergence from known good traces.
This produces a small frontier of candidate code locations and data dependencies that likely caused the failure. From there, we map to source files and commits.
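A minimal sketch of the backward slice over the shaped graph: start from the failure nodes and walk happens-before edges in reverse, collecting every node that can influence the failure. It assumes the annotate step from shape_trace stores kernel facts under a kernel_events field; edge semantics are simplified to the 'child' edges defined earlier:

```python
from collections import defaultdict, deque


def failure_nodes(graph):
    """Nodes marked as failed: error status, exceptions, or negative syscall returns."""
    return [n['id'] for n in graph['nodes']
            if n.get('status') == 'ERROR'
            or n.get('exceptions')
            or any(ev.get('ret', 0) < 0 for ev in n.get('kernel_events', []))]


def backward_slice(graph, seeds):
    """Collect every node that happens-before (and can therefore influence) the seeds."""
    reverse = defaultdict(list)
    for e in graph['edges']:
        reverse[e['to']].append(e['from'])
    keep, queue = set(seeds), deque(seeds)
    while queue:
        node = queue.popleft()
        for parent in reverse[node]:
            if parent not in keep:
                keep.add(parent)
                queue.append(parent)
    return keep


# sliced = backward_slice(graph, failure_nodes(graph))
```

A true minimal slice additionally prunes nodes whose removal does not change the failure; in practice the backward reachability set is already small enough to hand to the ranking stage.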
Mapping telemetry to code and commits
Symbolication and code lineage glue runtime to source:
- Symbolicate stacks using DWARF or BTF plus symbol servers; resolve to file:line, function, and inlined frames.
- Map resolved files to git repo and branch using build metadata and container image SBOM.
- Use blame to determine which commit last touched the lines implicated by stacks or error points.
- Join with test coverage to know which parts of the code are exercised by the failing traces.
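A sketch of the blame step from the list above, shelling out to git blame in porcelain mode to map an implicated file:line back to the last commit that touched it (the repository layout and frame format are assumptions):

```python
import subprocess


def blame_line(repo_dir, rel_path, line_no):
    """Return the commit SHA that last touched rel_path:line_no in repo_dir."""
    out = subprocess.run(
        ['git', 'blame', '-L', f'{line_no},{line_no}', '--porcelain', '--', rel_path],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    ).stdout
    # Porcelain output begins with: "<sha> <orig_line> <final_line> <num_lines>"
    return out.split()[0]


def implicated_commits(repo_dir, frames):
    """Aggregate blame hits over symbolicated stack frames [(path, line), ...]."""
    commits = {}
    for path, line in frames:
        try:
            sha = blame_line(repo_dir, path, line)
        except subprocess.CalledProcessError:
            continue  # file not tracked at this revision; skip
        commits[sha] = commits.get(sha, 0) + 1
    return sorted(commits.items(), key=lambda kv: kv[1], reverse=True)
```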
A simple commit scoring function can combine:
- Stack frequency score: how often lines from a commit appear in failure stacks.
- Novelty score: how recent and large the change was.
- Divergence score: difference from last good build on the same trace path (first point of divergence).
- Ownership and module boundaries: commits that touch code on the critical path get a boost.
Example scoring formula:
score(commit) = w1 * stack_hits(commit) + w2 * novelty(commit) + w3 * divergence(commit) + w4 * critical_path_overlap(commit)
Keep weights interpretable and measurable. Calibrate on historical incidents.
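A direct transcription of the formula, with the weights kept explicit so they can be calibrated on historical incidents. The individual signals are assumed to be computed upstream and normalized to [0, 1]; the weight values shown are placeholders:

```python
from dataclasses import dataclass


@dataclass
class CommitEvidence:
    """Per-commit signals, each normalized to [0, 1] upstream."""
    stack_hits: float             # share of failure stacks touching this commit's lines
    novelty: float                # recency and size of the change
    divergence: float             # overlap with the first point of divergence vs last good build
    critical_path_overlap: float  # how much of the change sits on the critical path


# Placeholder weights; calibrate on historical incidents.
WEIGHTS = {'stack_hits': 0.4, 'novelty': 0.2, 'divergence': 0.3, 'critical_path_overlap': 0.1}


def score(ev: CommitEvidence, w=WEIGHTS) -> float:
    return (w['stack_hits'] * ev.stack_hits
            + w['novelty'] * ev.novelty
            + w['divergence'] * ev.divergence
            + w['critical_path_overlap'] * ev.critical_path_overlap)


def rank_commits(evidence_by_sha):
    """evidence_by_sha: dict mapping commit SHA -> CommitEvidence."""
    return sorted(evidence_by_sha.items(), key=lambda kv: score(kv[1]), reverse=True)
```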
Reproduction harness synthesis
A credible fix requires a deterministic reproduction. We use the trace and kernel events to synthesize a minimal harness:
- Inputs: request bodies, parameters, user IDs (sanitized), environment variables, feature flags. For privacy, replace values with structurally similar but fake data unless allowed.
- Ordering constraints: reproduce ordering of calls across services that matter for the failure. For concurrency issues, enforce the interleaving with barriers or fault injection points.
- System calls and side effects: stub or record-replay filesystem and network interactions; fake external dependencies with captured responses.
- Environment: same build, container image, and feature gates; pinned clock if relevant.
Two pragmatic approaches:
- Application-level harness. Generate a test case within the service code that calls public APIs with the captured inputs and mocks downstream calls using replay fixtures.
- Process-level harness. Use recorded syscalls to replay via an LD_PRELOAD shim or ptrace to emulate responses (more brittle but language-agnostic). Tools like rr or ReproZip can help with determinism on Linux for single-process captures.
Example: Python pytest skeleton synthesized from a trace for a failing endpoint.
```python
import json
import os

import pytest

from myservice.app import app


@pytest.fixture(autouse=True)
def set_env():
    os.environ['FEATURE_FLAG_X'] = '1'
    os.environ['SERVICE_VERSION'] = '1.2.3+abcd123'


class FakePaymentClient:
    def charge(self, user_id, amount):
        # replay from trace fixture
        return {'status': 'declined', 'code': 'insufficient_funds'}


@pytest.fixture
def client(monkeypatch):
    app.payment_client = FakePaymentClient()
    return app.test_client()


def test_checkout_repro(client):
    payload = {
        'orderId': 'O-12345',
        'userId': 'U-999',
        'amount': 42.50,
    }
    resp = client.post('/checkout', data=json.dumps(payload),
                       headers={'X-Trace-Repro': '1'})
    assert resp.status_code == 402
    # assert on log or span attributes if available via in-proc exporter
```
The debugging AI can generate this harness template, then refine it iteratively until it triggers the same exception path and eBPF signature. Once reproduced, we can bisect to the offending commit if multiple changes landed.
Root cause ranking without noisy heuristics
We still need a ranker, but it should be evidence-driven and structural rather than hand-wavy:
- Use the dynamic slice as the candidate set.
- For each candidate function or line, check whether removing or mutating it in a speculative build reduces failure probability under the same harness (mutation testing with guardrails).
- Compare failure traces with a corpus of healthy traces on the same route to find the first divergence in spans or syscalls. The code at that divergence is highly suspect.
- Incorporate concurrency signals from eBPF (locks, futex waits, run-queue pressure visible via sched_switch) to pinpoint races and deadlocks.
The model’s job is to propose minimal code edits consistent with the evidence, not to guess root cause in a vacuum.
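A minimal sketch of the first-divergence check: align the ordered operations of a failing trace against a healthy trace on the same route and report the first position where they differ. It assumes shaped nodes carry a raw start timestamp for ordering (start_ns here is illustrative):

```python
def op_sequence(graph):
    """Ordered (service, operation, status) tuples along the trace's causal order."""
    ordered = sorted(graph['nodes'], key=lambda n: n.get('start_ns', 0))
    return [(n['service'], n['name'], n.get('status', 'OK')) for n in ordered]


def first_divergence(failing_graph, healthy_graph):
    """Return (index, failing_op, healthy_op) at the first mismatch, or None if identical."""
    bad, good = op_sequence(failing_graph), op_sequence(healthy_graph)
    for i, (b, g) in enumerate(zip(bad, good)):
        if b != g:
            return i, b, g
    if len(bad) != len(good):
        i = min(len(bad), len(good))
        return i, (bad[i] if i < len(bad) else None), (good[i] if i < len(good) else None)
    return None
```

The code executing at the divergence point, together with the commits that last touched it, is what feeds the candidate list handed to the model.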
Patch suggestion and verification loop
Once we have a high-confidence location and reproduction, the AI can iterate on a patch:
- Draft a minimal code change and a unit test that reproduces the bug on the old code and passes on the new.
- Build and run the harness and the new unit test under eBPF and OTel; ensure the failure signature disappears and no new anomalies are introduced.
- If the patch fails, use the diff and runtime deltas to refine the next attempt.
Prompt structure for the code model:
System: You are a senior engineer. You will propose minimal, safe patches consistent with the evidence. Do not invent behavior. Use the provided reproduction.
Context:
- Failure signature: <summary of spans and eBPF events>
- Dynamic slice: <functions, files, lines, edges>
- First divergence: <stack and syscall diff>
- Commit candidates: <list with scores>
- Coding standards: <lint rules, error handling style>
Task:
- Propose a patch that fixes the defect.
- Provide a unit test that fails on the old code and passes on the patched code.
- Explain the risk and any follow-up telemetry to add.
Verification provides the guardrail. Never ship unverified patches. Keep a strict evaluation:
- The reproduction harness must pass deterministically across multiple runs.
- The patch must eliminate the original error spans and eBPF failure events.
- Performance budgets must remain within thresholds (latency, CPU, allocations), validated by telemetry diff.
- Run static analysis, type checks, and linters.
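One way to encode that gate is a small CI check that runs the reproduction harness several times on the patched build and fails unless every run passes and the failure signature is gone. Everything here is a placeholder: the command, the signature string, and the latency check are all assumptions about your pipeline:

```python
import subprocess


def run_harness(cmd, runs=5):
    """Run the reproduction harness repeatedly; return per-run (exit_code, output)."""
    results = []
    for _ in range(runs):
        p = subprocess.run(cmd, capture_output=True, text=True)
        results.append((p.returncode, p.stdout + p.stderr))
    return results


def verify_patch(harness_cmd, failure_signature, latency_check):
    """Gate: deterministic pass, no residual failure signature, budget respected."""
    results = run_harness(harness_cmd)
    all_pass = all(code == 0 for code, _ in results)
    signature_gone = all(failure_signature not in out for _, out in results)
    return all_pass and signature_gone and latency_check()


# Example invocation (placeholders):
# ok = verify_patch(['pytest', 'tests/test_checkout_repro.py', '-q'],
#                   failure_signature='payment failure',
#                   latency_check=lambda: True)
```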
Privacy and governance
We will process rich telemetry, including arguments and stack traces. Privacy is non-negotiable.
Controls to implement:
- Redaction at source: scrub PII in OTel processors and eBPF user-space before storage. Apply field-level allowlists per service.
- Structured anonymization: tokenize identifiers consistently across a trace so causality is preserved while identities are hidden (see the sketch after this list).
- On-box summarization: where feasible, compute histograms and summaries locally and ship aggregates instead of raw payloads.
- RBAC and data scoping: restrict raw telemetry access to on-call engineers; expose only summaries to the model unless explicit approvals are present.
- Retention and deletion: keep raw repro artifacts for the shortest viable window; auto-purge per policy.
- Model isolation: if using a hosted LLM, proxy via a data loss prevention gateway; consider on-prem models for sensitive repos.
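A sketch of that structured anonymization: tokenize identifiers with a keyed hash so the same value maps to the same token within a trace, preserving joins and causality without exposing the raw value. Key management and rotation are assumed to live outside this snippet:

```python
import hashlib
import hmac


def make_tokenizer(secret_key: bytes, trace_id: str):
    """Return a function that pseudonymizes values consistently within one trace."""
    def tokenize(field: str, value: str) -> str:
        digest = hmac.new(secret_key,
                          f'{trace_id}:{field}:{value}'.encode(),
                          hashlib.sha256).hexdigest()
        return f'{field}:{digest[:12]}'
    return tokenize


# tok = make_tokenizer(secret_key=b'...', trace_id='4bf92f3577b34da6')
# tok('user_id', 'U-999') -> 'user_id:<12 hex chars>', identical everywhere in this trace
```

Binding the token to the trace ID keeps identities unlinkable across traces; drop the trace ID from the HMAC input if cross-trace joins are required and permitted.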
Performance trade-offs and tuning
Telemetry and probes have a cost. For production viability, set budgets and measure:
- OTel overhead: with batching and asynchronous exporters, span overhead is often small for moderate volumes. Avoid high-cardinality attributes and heavy payloads. Tune send_batch_size and timeout per service.
- eBPF overhead: ring buffers and per-CPU maps keep it low, often a few percent for simple probes. Stack trace depth and event frequency are the heavy hitters. Prefer tracepoints to kprobes where possible. Avoid printing from BPF; stream binary events and process them in user space.
- Sampling strategies:
- Head-based sampling to limit volume for low-value traces.
- Tail-based sampling to retain error or high-latency traces. Promote any trace with exceptions or kernel errors (see the sketch after this list).
- Dynamic sampling during incident windows.
- Probe gating: enable expensive uprobes only on services and revisions under investigation; keep a minimal baseline always on.
- Storage and compute:
- Keep raw eBPF streams ephemeral; persist shaped summaries.
- Compress stacks and histograms; deduplicate by content hash.
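A sketch of the tail-based promotion rule: after a trace completes, keep it if any span errored, any correlated kernel event failed, or the latency budget was blown; otherwise sample at a low baseline rate. The field names, thresholds, and rates are illustrative:

```python
import random


def keep_trace(graph, baseline_rate=0.01, latency_slo_ms=500):
    """Tail-based keep/drop decision over a shaped trace graph."""
    has_error = any(n.get('status') == 'ERROR' or n.get('exceptions')
                    for n in graph['nodes'])
    has_kernel_failure = any(ev.get('ret', 0) < 0
                             for n in graph['nodes']
                             for ev in n.get('kernel_events', []))
    slow = any(n.get('duration_ms_raw', 0) > latency_slo_ms for n in graph['nodes'])
    if has_error or has_kernel_failure or slow:
        return True  # always promote suspicious traces
    return random.random() < baseline_rate
```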
Always validate overhead in staging and canaries. Track the telemetry cost as a first-class metric.
Concrete example: crash triage end-to-end
Suppose checkout-service in v1.2.3 is returning sporadic 500s under load.
- OTel spans show Checkout calling ChargeCard then InventoryReserve. Error events occur in ChargeCard.
- eBPF shows spikes in connect failures with ECONNREFUSED and high tcp_retransmit_skb. Stacks implicate libnet wrappers and a new timeout configuration function.
- Dynamic slice isolates a path where ChargeCard constructs a client with a 50 ms timeout, against the expected 500 ms. This path did not appear in last week's traces.
- Mapping to commits shows a recent change introducing a new config default in client_config.go, authored yesterday.
- Reproduction harness sends the same payload and injects a 100 ms server delay in the fake payment service; the 500 reproduces with the same signature.
- The AI proposes a patch: increase default timeout to 1 s, and treat timeouts as retryable. It adds a unit test that simulates a 100 ms delay and asserts success.
- Verification shows the error spans disappear and eBPF connect errors drop; latency histograms look normal. Patch merged.
This is not heuristic luck: the chain of evidence is explicit and checked.
Integrating with git bisect and CI
When multiple commits landed between the last good and current bad build, the system can run git bisect run with the reproduction harness. Steps:
- Build each candidate revision in CI with the same flags and environment.
- Run the harness and telemetry capture; label good or bad based on signature match.
- Stop when the first bad commit is identified.
Combine this with the prior scoring to speed up convergence by testing the highest-scoring commits first (generalized bisect).
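A sketch of driving git bisect run with the harness as the verdict script; git converges on the first bad commit on its own once the harness exits 0 for good revisions and non-zero for bad ones (the repository path, revision labels, and harness command are placeholders):

```python
import subprocess


def git(repo, *args):
    return subprocess.run(['git', '-C', repo, *args],
                          capture_output=True, text=True, check=True).stdout


def bisect_first_bad(repo, good_rev, bad_rev, harness_cmd):
    """Let git bisect drive builds and tests; harness must exit 0 on good, non-zero on bad."""
    git(repo, 'bisect', 'start', bad_rev, good_rev)
    try:
        out = git(repo, 'bisect', 'run', *harness_cmd)
        # git prints "<sha> is the first bad commit" when it converges
        for line in out.splitlines():
            if line.endswith('is the first bad commit'):
                return line.split()[0]
        return None
    finally:
        git(repo, 'bisect', 'reset')


# first_bad = bisect_first_bad('/src/checkout-service', 'v1.2.2', 'v1.2.3',
#                              ['./ci/build_and_run_repro.sh'])
```

For the generalized variant, test the highest-scoring commits directly before falling back to the full bisect.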
Adding domain-specific probes
Generic syscalls are useful, but many bugs hide behind library abstractions. Add targeted uprobes:
- TLS handshake functions to detect cert or SNI issues.
- ORM query functions to catch N+1 or missing transactions.
- Cache client get and set with keys hashed to protect privacy.
- Locking primitives in language runtimes to detect contention and deadlocks.
Each uprobe attaches explicit semantic meaning that upgrades the causal model without code changes. Use CO-RE to keep maintenance manageable across kernels.
Where to use the model, and where not to
Use a code LLM for:
- Summarizing shaped traces into human-readable incident narratives.
- Proposing patch candidates and tests guided by explicit constraints.
- Explaining risks and suggesting additional telemetry to harden against regressions.
Do not use an LLM to:
- Guess root cause from raw logs.
- Replace deterministic slicing and verification.
- Decide sampling or privacy policies.
Keep the model inside a tight, verifiable loop with explicit tools and constraints.
Incremental adoption plan
0 to 30 days
- Instrument top latency and error-producing services with OTel. Enrich spans with commit SHA and build ID.
- Deploy OTel Collector with tail-based sampling and basic scrubbing.
- Roll out a minimal eBPF agent capturing connect, accept, read, write, and their errors plus stack traces at low depth.
- Build the correlator: join spans and eBPF events by pid:tid and time. Store shaped graphs.
30 to 60 days
- Add symbolication and code ownership mapping. Attach blame and commit metadata.
- Implement the dynamic slicing and ranking pipeline. Start producing candidate files and commits for incidents.
- Synthesize reproduction harnesses for top endpoints. Close the loop manually (engineers run them).
60 to 90 days
- Integrate the code model to propose patches and tests. Add automated verification in CI.
- Add targeted uprobes for high-value libraries. Enable dynamic probe gating.
- Establish privacy and RBAC policies. Decide what data is allowed to reach the model.
Beyond
- Build a corpus of verified reproductions and patches. Use them as examples to fine-tune smaller on-prem models if needed.
- Add language-aware static analyzers (Infer, Go vet, Rust clippy) as tools in the loop.
- Experiment with anomaly detection on shaped graphs to detect regressions before user impact.
Practical tips and pitfalls
- Control cardinality aggressively. Cardinality explosions make everything worse: cost, latency, and model context limits.
- Keep code provenance accurate. If builds or hotfixes occur outside CI, your attribution will drift.
- Beware of overfitting patches. A patch that fixes the harness but breaks other flows is a liability; run broader tests and property checks.
- Measure overhead and correctness continuously. Telemetry that alters behavior is self-defeating.
- Avoid huge raw logs in prompts. Summarize first; keep the model focused on deltas and constraints.
References and tools
- OpenTelemetry project (SDKs and Collector)
- eBPF ecosystem: libbpf, BCC, bpftrace; CO-RE techniques
- Observability projects with eBPF focus: Pixie, Parca, Cilium Tetragon
- Record and replay: rr (record and replay framework), ReproZip
- Git bisect documentation
- Dynamic slicing literature (software fault localization)
These are well-documented; choose the tools that match your stack and kernel baseline.
Conclusion
A robust debugging assistant should be trace-aware and code-centric. By wiring OpenTelemetry spans with eBPF probes, we get intent and ground truth in the same frame. Data shaping turns this into a compact causal graph. Dynamic slicing and code lineage map failures to the likely change sets. Then, and only then, a code model proposes a patch and a test, which we verify under the same evidence loop that accused it.
This is not magic, and it does not rely on brittle heuristics. It is engineering: explicit causality, careful data work, guarded model use, and rigorous verification. The payoff is faster incident resolution, higher confidence in fixes, and a growing knowledge base of reproducible postmortems. With privacy controls and performance budgets baked in, you can adopt it incrementally and keep your on-call sane.
