Trace-Aware Debug AI: OpenTelemetry + eBPF for Production-Only Bugs
Production-only bugs are the worst kind: they surface under real load, with real data, across distributed boundaries, and rarely reproduce on demand. They hide in the interleavings—between goroutines, threads, cgroups, NIC queues, GC cycles, and RPC retries. Traditional debugging workflows fall short because they depend on pausing, repro steps, or high-fidelity logs that are too expensive (and too sensitive) to keep in production.
This article lays out a practical, opinionated blueprint for building a Trace-Aware Debug AI: a system that ingests distributed traces (OpenTelemetry) and kernel events (via eBPF) to detect and explain regressions and race conditions in live systems. Critically, it does so without capturing PII, without pausing traffic, and without relying on flaky repro steps.
The approach is informed by proven ideas—Dapper-style tracing, eBPF observability, invariant mining, causal graph analysis, lock-order checks, and tail-based sampling—assembled into an end-to-end pipeline that can operate continuously in production with a predictable overhead budget.
TL;DR
- Use OpenTelemetry for distributed traces and span context propagation.
- Use eBPF to collect low-level kernel signals (futex waits, TCP retransmits, page faults, sched switch, disk I/O latencies) with negligible overhead.
- Correlate kernel events to spans by TID/CGROUP/time alignment and optional USDT/uprobes that propagate trace_id to kernel space.
- Run an AI analysis layer that builds a causal graph per incident, detects regressions via changepoint tests, identifies race patterns (lock inversions, unbounded contention, starvation), and generates a minimal, testable hypothesis.
- Enforce strict privacy by design: record only metadata and hashed identifiers, never payloads, and capture on trigger (tail sampling) rather than all events.
Why Production-Only Bugs Persist
- Data and concurrency shape: real production workloads exercise code paths that never appear in staging—edge cardinalities, resource limits, and microburst traffic patterns.
- Distributed systems do not reproduce well: cross-service latencies, backpressure, and jitter are emergent and history-dependent.
- Existing tooling forces a trade-off among three properties: high fidelity (but high cost and PII exposure), low overhead (but poor signal), or pause-the-world debugging (unacceptable in production).
A workable path needs full-fidelity causality when anomalies occur (not always), low default overhead, and strong privacy constraints.
Design Goals
- No PII, no payload capture. Only metadata, hashed identifiers, and structural timing.
- No traffic pauses; overhead must be configurable and bounded (<1–2% CPU on target nodes during quiescent periods; transient spikes during incidents acceptable with caps).
- Accurate causal correlation between spans and kernel events.
- Explainability: create a human-auditable incident report with evidence, not just a score.
- Language/runtime agnostic: Go, Java, Rust, Python must work; optional hints unlock deeper correlation.
System Architecture Overview
- Data ingestion:
- Distributed traces via OpenTelemetry SDKs and Collector, with tail-based sampling policies tuned for anomalies.
- Kernel signals via eBPF programs attached to tracepoints/kprobes/uprobes (CO-RE). Focus on futex waits, sched transitions, TCP retransmissions, disk I/O latencies, major page faults.
- Correlation layer:
- Join spans and kernel events by (pid, tid, cgroup, timestamp) and optional USDT/uprobes that tag the current TID with trace_id/span_id.
- Normalize time (NTP/PTP), account for clock drift, and align via shared monotonic clocks when available.
- Event processing:
- Feature extraction: per-span wait breakdown, critical path analysis, queueing signals (runqueue, iowait), lock graphs, network error markers.
- Privacy filters: redact strings, tokenize identifiers via keyed HMAC, drop payloads.
- Analysis/AI:
- Regression detection via segmented regression and changepoint tests (e.g., PELT, BOCPD, KS tests on latency distributions).
- Race detection heuristics: lock inversions, convoying, starvation, ABA-like patterns.
- Causal graph construction across spans and kernel events; rank root-cause candidates.
- Explanation synthesis: reproduce minimal steps using existing traces and propose code/config changes.
- Output:
- Incident report with: root cause hypothesis, evidence timeline, affected versions/endpoints, risk level, suggested remediation, links to relevant code/commit diffs.
- Machine-readable artifacts for ticketing systems and CI gate integration.
OpenTelemetry: The User-Space Backbone
OpenTelemetry (OTel) provides semantics for distributed spans and context propagation. For this system:
- Instrument your services with OTel SDKs (Go, Java, Rust, Python) at inbound RPC boundaries, database calls, and major internal async operations.
- Use span events to record local milestones (e.g., cache miss, retry backoff) without payloads.
- Propagate W3C traceparent across services. Enrich spans with:
- deployment.version, service.name, k8s.pod.uid
- git.commit, feature flags, region/zone
- hashed principal and hashed tenant ids (HMAC with rotation), never raw PII.
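A minimal sketch of the identifier tokenization step, assuming a rotating key distributed out of band; the package, function, and truncation choice are illustrative, not an established convention:

```go
package privacy

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
)

// TokenizeID returns a non-reversible token for an identifier (tenant id,
// user id) so spans never carry the raw value. The key should rotate; export
// the key version alongside the token so the analyzer can group tokens
// produced under the same key.
func TokenizeID(key []byte, raw string) string {
	mac := hmac.New(sha256.New, key)
	mac.Write([]byte(raw))
	return hex.EncodeToString(mac.Sum(nil))[:32] // truncated to keep attributes small
}
```

Spans then carry the token (plus the key version) instead of the raw identifier.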
Tail-based sampling is crucial: capture full traces when anomalies occur; otherwise keep summaries. An example OTel Collector config:
```yaml
receivers:
  otlp:
    protocols:
      http:
      grpc:

processors:
  batch:
    timeout: 2s
    send_batch_size: 8192
  probabilistic_sampler:
    sampling_percentage: 5  # default head sampling for background traffic
  tail_sampling:
    decision_wait: 5s
    num_traces: 50000
    policies:
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: high-latency
        type: latency
        latency:
          threshold_ms: 500
      - name: rare-endpoints
        type: string_attribute
        string_attribute:
          key: http.target
          values: ["/billing/close", "/export"]
          enabled_regex_matching: true
      - name: kernel-anomaly-signal
        type: boolean_attribute
        boolean_attribute:
          key: kernel.anomaly
          value: true

exporters:
  otlphttp:
    endpoint: http://trace-analyzer:4318

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, probabilistic_sampler, tail_sampling]
      exporters: [otlphttp]
```
This pipeline retains full problematic traces and discards most benign ones, keeping cost and privacy in check.
eBPF: Kernel-Resident Truth
eBPF brings low-overhead visibility into kernel paths that cause tail latency and concurrency pathologies. We focus on a small, high-value set of probes:
- sched:sched_switch to understand runnable-to-running transitions and starvation.
- sys_enter_futex/sys_exit_futex to measure lock wait hotspots (covers many user-space mutexes).
- tcp:tcp_retransmit_skb and net:net_dev_queue for network reliability and congestion hints.
- block:block_rq_issue/block:block_rq_complete for storage latency, queued I/O depth.
- major page faults (e.g., via a kprobe on handle_mm_fault) for cold pages under load.
Attach these with CO-RE (Compile Once – Run Everywhere) and filter by cgroup to limit scope to your workload.
Example eBPF C (CO-RE) capturing futex waits and TCP retransmits, publishing to a ring buffer:
```c
// SPDX-License-Identifier: GPL-2.0
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>

struct event {
    u64 ts_ns;
    u32 pid;
    u32 tid;
    u32 type;      // 1=futex_wait, 2=futex_wake, 3=tcp_retransmit
    u64 arg0;      // futex uaddr or skb len
    u64 arg1;      // wait ns or retransmit count
    u64 cgroup_id;
};

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 1 << 24); // 16 MiB
} events SEC(".maps");

// Filter to our workload via cgroup v2 id (filled from user space) to reduce overhead.
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 1);
    __type(key, u32);
    __type(value, u64);
} target_cgrp SEC(".maps");

struct key { u32 tid; };
struct val { u64 start_ns; };

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 65536);
    __type(key, struct key);
    __type(value, struct val);
} futex_start SEC(".maps");

static __always_inline int submit_event(struct event *e)
{
    struct event *out = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
    if (!out)
        return 0;
    __builtin_memcpy(out, e, sizeof(*e));
    bpf_ringbuf_submit(out, 0);
    return 0;
}

static __always_inline bool in_target_cgrp(void)
{
    u32 k = 0;
    u64 *cg = bpf_map_lookup_elem(&target_cgrp, &k);
    if (!cg)
        return false;
    return bpf_get_current_cgroup_id() == *cg;
}

SEC("tp/sched/sched_switch")
int handle_sched_switch(struct trace_event_raw_sched_switch *ctx)
{
    // Could capture runqueue delays if desired; omitted here for brevity.
    return 0;
}

SEC("tp/syscalls/sys_enter_futex")
int on_futex_enter(struct trace_event_raw_sys_enter *ctx)
{
    if (!in_target_cgrp())
        return 0;
    // Mark the wait start; duration is computed in the exit handler below.
    struct key k = { .tid = (u32)bpf_get_current_pid_tgid() }; // low 32 bits is tid
    struct val v = { .start_ns = bpf_ktime_get_ns() };
    bpf_map_update_elem(&futex_start, &k, &v, BPF_ANY);
    return 0;
}

SEC("tp/syscalls/sys_exit_futex")
int on_futex_exit(struct trace_event_raw_sys_exit *ctx)
{
    if (!in_target_cgrp())
        return 0;
    struct key k = { .tid = (u32)bpf_get_current_pid_tgid() };
    struct val *v = bpf_map_lookup_elem(&futex_start, &k);
    if (!v)
        return 0;
    u64 dur = bpf_ktime_get_ns() - v->start_ns;
    bpf_map_delete_elem(&futex_start, &k);

    struct event e = {};
    u64 tgid_tid = bpf_get_current_pid_tgid();
    e.ts_ns = bpf_ktime_get_ns();
    e.pid = tgid_tid >> 32;
    e.tid = (u32)tgid_tid;
    e.type = 1; // futex_wait
    e.arg0 = 0;
    e.arg1 = dur;
    e.cgroup_id = bpf_get_current_cgroup_id();
    return submit_event(&e);
}

SEC("tp/tcp/tcp_retransmit_skb")
int on_tcp_retransmit(struct trace_event_raw_tcp_event_sk_skb *ctx)
{
    if (!in_target_cgrp())
        return 0;

    struct event e = {};
    u64 tgid_tid = bpf_get_current_pid_tgid();
    e.ts_ns = bpf_ktime_get_ns();
    e.pid = tgid_tid >> 32;
    e.tid = (u32)tgid_tid;
    e.type = 3; // tcp_retransmit
    e.arg0 = 0;
    e.arg1 = 1; // one retransmit observed
    e.cgroup_id = bpf_get_current_cgroup_id();
    return submit_event(&e);
}

char LICENSE[] SEC("license") = "GPL";
```
A tiny Go user-space loader can attach these, set the target cgroup, and forward ring buffer events to the analysis pipeline via OTLP logs/metrics or directly over gRPC.
```go
package main

import (
	"bytes"
	"context"
	"encoding/binary"
	"log"

	"github.com/cilium/ebpf/link"
	"github.com/cilium/ebpf/ringbuf"
)

// event mirrors the C struct emitted by the eBPF program.
type event struct {
	TsNs     uint64
	Pid      uint32
	Tid      uint32
	Type     uint32
	_        uint32 // padding to match the C struct's 8-byte alignment
	Arg0     uint64
	Arg1     uint64
	CgroupID uint64
}

func main() {
	// bpfObjects and its program/map fields are generated by bpf2go from the
	// CO-RE object above; loading is omitted here (see note below).
	var obj bpfObjects

	// Scope collection to the monitored pod by setting the target cgroup id.
	cgid := mustGetTargetCgroupID()
	key := uint32(0)
	if err := obj.TargetCgrp.Put(key, cgid); err != nil {
		log.Fatal(err)
	}

	// Attach the tracepoints.
	l1, _ := link.Tracepoint("syscalls", "sys_enter_futex", obj.OnFutexEnter, nil)
	defer l1.Close()
	l2, _ := link.Tracepoint("syscalls", "sys_exit_futex", obj.OnFutexExit, nil)
	defer l2.Close()
	l3, _ := link.Tracepoint("tcp", "tcp_retransmit_skb", obj.OnTcpRetransmit, nil)
	defer l3.Close()

	// Drain the ring buffer and forward decoded events to the analyzer.
	rb, _ := ringbuf.NewReader(obj.Events)
	defer rb.Close()

	ctx := context.Background()
	for {
		rec, err := rb.Read()
		if err != nil {
			continue
		}
		var e event
		if err := binary.Read(bytes.NewReader(rec.RawSample), binary.LittleEndian, &e); err != nil {
			continue
		}
		// Forward as an OTLP log or over custom gRPC to the analyzer.
		forwardKernelEvent(ctx, e)
	}
}
```
Note: the code above omits error handling and the bpf2go scaffolding for brevity.
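The loader also calls mustGetTargetCgroupID(), which is left as a placeholder. One possible implementation, assuming cgroup v2 (where the id returned by bpf_get_current_cgroup_id() equals the inode number of the cgroup directory) and the loader's package with "os" and "syscall" added to its imports; the path below is illustrative:

```go
// mustGetTargetCgroupID resolves the cgroup v2 id of the monitored workload
// by stat-ing its cgroup directory: on cgroup v2 the id is the directory's inode.
func mustGetTargetCgroupID() uint64 {
	// Illustrative path; in practice derive it from the pod UID via the kubelet layout.
	const cgroupPath = "/sys/fs/cgroup/kubepods.slice/kubepods-pod<uid>.slice"

	fi, err := os.Stat(cgroupPath)
	if err != nil {
		log.Fatalf("stat cgroup: %v", err)
	}
	st, ok := fi.Sys().(*syscall.Stat_t)
	if !ok {
		log.Fatal("unexpected stat type")
	}
	return st.Ino
}
```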
Correlating Kernel Events to Spans
We need to join a kernel event to the active span. There are four strategies, in order of robustness:
1. Time + cgroup + tid windowing (baseline):
- For each kernel event (tid, ts), find the overlapping span in the same process/container with matching tid during [span.start, span.end]. This already explains most blocking events for synchronous handlers.
2. USDT probes for explicit span context:
- Add a small library that fires a USDT probe at span start/end with trace_id|span_id, keyed by TID. An eBPF uprobe handler updates a BPF map tid->(trace_id, span_id). Then any kernel event for that tid carries the active trace_id.
- Example (Rust):
```rust
// Exported no-op functions that act as stable uprobe attach points.
// An eBPF uprobe attached to debugai_span_enter/exit maintains a per-TID map
// of the active (trace_id, span_id), so kernel events on that thread can be
// tagged with trace context. (A true USDT provider, e.g. via the `usdt` crate,
// is an alternative way to expose the same probe points.)

#[no_mangle]
#[inline(never)]
pub extern "C" fn debugai_span_enter(trace_id_hi: u64, trace_id_lo: u64, span_id: u64) {
    // Intentionally empty: the uprobe reads the arguments from registers.
    let _ = (trace_id_hi, trace_id_lo, span_id);
}

#[no_mangle]
#[inline(never)]
pub extern "C" fn debugai_span_exit() {}

fn start_span(trace_id: [u8; 16], span_id: [u8; 8]) {
    let hi = u64::from_be_bytes(trace_id[0..8].try_into().unwrap());
    let lo = u64::from_be_bytes(trace_id[8..16].try_into().unwrap());
    let sid = u64::from_be_bytes(span_id);
    debugai_span_enter(hi, lo, sid);
}

fn end_span() {
    debugai_span_exit();
}
```
Attach eBPF uprobes to debugai_span_enter/exit to maintain a per-TID context map.
3. Language/runtime-specific hooks (optional):
- For Go, instrument your HTTP/RPC handlers to call debugai_span_enter/exit at the edges. Minimal code, no PII.
4. Fallback correlation via causal edges:
- If a thread pool or async runtime decouples TIDs from spans, correlate via network syscalls timing (recv/send) and span parent-child relationships.
The system should work with (1) alone; (2) and (3) improve precision under async runtimes.
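A minimal sketch of strategy (1), the baseline time-plus-tid join; the types and the linear scan are illustrative (a real correlator would index spans by (pid, tid) and use interval lookups):

```go
package correlate

// Span and KernelEvent are simplified stand-ins for the analyzer's internal
// types; all timestamps are nanoseconds on a shared clock.
type Span struct {
	TraceID, SpanID string
	Pid, Tid        uint32
	StartNs, EndNs  uint64
}

type KernelEvent struct {
	Pid, Tid uint32
	TsNs     uint64
}

// MatchSpan returns the innermost span on the same pid/tid whose interval
// contains the event timestamp; for synchronous handlers this explains most
// blocking events.
func MatchSpan(spans []Span, ev KernelEvent) *Span {
	var best *Span
	for i := range spans {
		s := &spans[i]
		if s.Pid != ev.Pid || s.Tid != ev.Tid {
			continue
		}
		if ev.TsNs < s.StartNs || ev.TsNs > s.EndNs {
			continue
		}
		if best == nil || s.EndNs-s.StartNs < best.EndNs-best.StartNs {
			best = s // prefer the narrowest (innermost) enclosing span
		}
	}
	return best
}
```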
Privacy and Safety by Design
- Never capture payload buffers. In eBPF, avoid reading user pointers; collect only sizes, error codes, and durations.
- Hash identifiers with a rotating HMAC key (e.g., tenants, user ids) before exporting—non-reversible outside the analyzer.
- Drop high-cardinality strings unless whitelisted; use tokenization or dictionary keys for SQL statements (e.g., via pg_stat_statements normalized queryid, not raw SQL).
- Tail sample only upon anomaly; aggregate otherwise.
- Maintain a hard budget for ring buffer and CPU. If backpressure occurs, drop events and set kernel.anomaly=true to trigger full traces opportunistically.
From Signals to Root Cause: The Analysis Layer
The AI need not be a black box. Combine statistical detectors with domain-specific heuristics and a constrained reasoning layer.
- Causal graph construction:
- Nodes: spans, kernel events (futex wait, TCP retransmit, page fault), and resource queues.
- Edges: parent-child, happens-before via timestamps, TID linkage, network boundaries, and lock dependencies.
- Compute the critical path per trace using span durations minus child time, plus kernel wait overlays (a self-time sketch follows this list).
- Regression detection:
- Maintain per-endpoint distributions per version. Run changepoint detection (e.g., PELT) on P50/P95 and critical-path components (CPU, lock, network, I/O).
- Run two-sample KS tests for distribution shifts; Benjamini–Hochberg control across many endpoints.
- When a shift is detected, rank likely culprits by diff-in-diff features across versions (e.g., lock wait share jumped from 5% to 35% only in v1.24.7, only on nodes with kernel 5.15).
- Race and concurrency bug patterns:
- Lock inversions: detect opposite acquisition orders on different threads; surface minimal cycle.
- Convoying/contended mutex: long futex waits with one dominant holder identified via thread scheduling; correlates with GC or long syscalls in holder.
- Starvation: runnable threads not scheduled (sched_switch indicates long waits) while CPU is busy; often priority inversion or cgroup CPU shares misconfigured.
- Spurious wakeups/ABA hints: rapid acquire-release with high retry counts and no progress.
- Explanation synthesis:
- Summarize anomaly: e.g., “p95 latency +320ms after v1.24.7; 70% on critical path due to lock wait at pkg/cache.(*Shard).Lock; holder blocked on synchronous RPC to config service with 2x TCP retransmits in us-east-1.”
- Provide concrete evidence: minimal trace, spans-involved timeline, kernel overlay charts.
- Suggest remediation: lock scope reduction, use RWMutex, set deadline/timeout, bump TCP keepalive, or isolate blocking I/O from hot locks.
- Guardrails:
- Never claim code certainty without at least two independent corroborations (e.g., lock wait + holder busy + release matches); otherwise label as hypothesis with confidence.
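To make the critical-path attribution referenced above concrete, here is a minimal self-time computation: a span's exclusive time is its duration minus the time covered by its children, and the joined kernel waits are then split out of that exclusive time. Types are illustrative, and children are assumed not to overlap (true for synchronous call trees):

```go
package analysis

// SpanNode is an illustrative trace-tree node; KernelWaitNs is the sum of
// kernel waits (futex, runqueue, block I/O) already joined onto this span.
type SpanNode struct {
	Name         string
	DurationNs   uint64
	KernelWaitNs uint64
	Children     []*SpanNode
}

// SelfTime is the span's exclusive time: its duration minus the summed
// durations of its children.
func SelfTime(s *SpanNode) uint64 {
	var child uint64
	for _, c := range s.Children {
		child += c.DurationNs
	}
	if child > s.DurationNs {
		return 0
	}
	return s.DurationNs - child
}

// Breakdown splits exclusive time into kernel wait vs. everything else
// (CPU plus unattributed), which is what the critical-path report ranks.
func Breakdown(s *SpanNode) (waitNs, otherNs uint64) {
	self := SelfTime(s)
	if s.KernelWaitNs > self {
		return self, 0
	}
	return s.KernelWaitNs, self - s.KernelWaitNs
}
```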
Worked Example: Go Service Deadlock Under Load
Symptoms:
- p95 for POST /orders jumped from 180ms to 1.2s after v1.18.2.
- CPU steady; no obvious errors.
Signals captured:
- Tail-sampled traces show long spans in order.ApplyDiscount.
- eBPF futex_exit durations spiking to 800–1000ms for TIDs running order.ApplyDiscount.
- Lock graph shows acquisition order: A -> B in worker, B -> A in a retry path introduced in v1.18.2.
- sched_switch events show threads sitting runnable but not getting scheduled while lock holders sleep on network I/O.
Root cause hypothesis:
- A classic lock inversion introduced by refactoring. The retry path inadvertently grabbed B before A when handling a cache-miss fallback.
Remediation:
- Normalize acquisition order; add trylock with backoff; break the critical section around network I/O.
Evidence snippet (textual timeline):
- 12:00:31.118 span S1 (/orders#123) enters ApplyDiscount (span_id=0xabc), tid=31984
- 12:00:31.118 kernel futex_enter (tid=31984)
- 12:00:31.119 holder tid=31722 holds A, then blocks on RPC (tcp_retransmit +2)
- 12:00:32.017 S1 still waiting (arg1 dur=898ms)
- 12:00:32.017 holder releases A after RPC latency resolves; S1 acquires, resumes
AI report excerpt:
- Confidence: 0.86, Pattern: lock inversion with network-blocking critical section
- Suggested diff: move RPC call outside lock A; ensure B acquired before A; add timeout to RPC client.
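A hedged sketch of that remediation in Go; the function, lock, and helper names are illustrative, not the service's actual code. The RPC moves outside the critical section and every path acquires locks in the canonical order A then B:

```go
package orders

import (
	"context"
	"sync"
	"time"
)

// applyDiscount illustrates the fix: do network I/O before taking any lock,
// bound it with a deadline, then acquire locks in one canonical order (A, B)
// for the shortest possible window.
func applyDiscount(ctx context.Context, a, b *sync.Mutex, fetchDiscount func(context.Context) (int, error)) error {
	ctx, cancel := context.WithTimeout(ctx, 200*time.Millisecond)
	defer cancel()

	// RPC outside any lock, so a slow config service cannot stall lock holders.
	discount, err := fetchDiscount(ctx)
	if err != nil {
		return err
	}

	// Canonical acquisition order on every path: A, then B.
	a.Lock()
	defer a.Unlock()
	b.Lock()
	defer b.Unlock()

	_ = discount // apply the discount to shared state here
	return nil
}
```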
Another Example: Latency Regression With Kernel Upgrade
Symptoms:
- p95 +90ms only on nodes upgraded from 5.10 to 5.15.
Signals:
- tcp_retransmit_skb increased 3x on those nodes.
- No code changes; only node image changed.
- Affected endpoints traverse a service mesh sidecar.
Root cause hypothesis:
- Interaction between TCP pacing defaults in 5.15 and mesh sidecar’s small write coalescing, causing higher retransmit probability under microbursts.
Remediation:
- Tune net.ipv4.tcp_min_tso_segs or enable BBRv2; or adjust sidecar write flush size.
Integrating Kernel Events Into Traces
Represent kernel events as span events or as a parallel stream of “resource spans.” Minimal OTel semantic convention proposal:
- For each span, aggregate kernel overlays (a small aggregation sketch appears at the end of this section):
- debugai.wait.futex.ns_total
- debugai.wait.scheduler.runqueue_ns
- debugai.net.tcp.retransmits
- debugai.mem.page_faults.major
- debugai.io.block.ns_total
- Emit fine-grained span events only when thresholds are crossed (e.g., futex_wait > 10ms):
json{ "name": "kernel.futex_wait", "time": "2025-08-19T12:00:32.017Z", "attributes": { "tid": 31984, "duration.ns": 898234123, "lock.symbol": "pkg/cache.(*Shard).Lock", "cgroup.id": "...", "debugai.piisafe": true } }
This produces rich traces only when they matter, with millisecond-scale correlation accuracy.
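A small sketch of the overlay aggregation referenced above, assuming kernel events have already been joined to the span; the Overlay struct and event type are illustrative, with fields mapping one-to-one onto the proposed debugai.* attributes:

```go
package overlay

// KernelEvent mirrors the agent's decoded ring-buffer record; Type values
// follow the eBPF program earlier in the article (1=futex_wait, 3=tcp_retransmit).
type KernelEvent struct {
	Type   uint32
	WaitNs uint64
}

// Overlay accumulates per-span kernel aggregates.
type Overlay struct {
	FutexWaitNs    uint64 // debugai.wait.futex.ns_total
	RunqueueNs     uint64 // debugai.wait.scheduler.runqueue_ns
	TCPRetransmits uint64 // debugai.net.tcp.retransmits
	MajorFaults    uint64 // debugai.mem.page_faults.major
	BlockIONs      uint64 // debugai.io.block.ns_total
}

// Add folds one kernel event (already joined to this span) into the overlay;
// the correlator then writes the non-zero fields as span attributes.
func (o *Overlay) Add(e KernelEvent) {
	switch e.Type {
	case 1:
		o.FutexWaitNs += e.WaitNs
	case 3:
		o.TCPRetransmits++
	}
}
```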
Detecting Lock Inversions and Contention
- Build a lock graph: nodes are lock identities (prefer symbolic names from uprobe sym lookups; otherwise, hash of futex uaddr); edges are observed acquisition order.
- If a cycle is detected (A->B and B->A), raise an inversion alert. Record a minimal counterexample from live traces (no synthetic repro); a small cycle-check sketch follows this list.
- Quantify impact: percentage of critical path time attributable to waits on the cycle.
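A minimal sketch of that cycle check: the lock graph is a directed graph of observed acquisition orders, and any cycle is a potential inversion. Lock identities are the symbolic names or hashed futex addresses described above; a plain depth-first search is sufficient at the scale of distinct locks in one service:

```go
package lockgraph

// Graph records observed acquisition order: an edge A -> B means some
// thread acquired B while already holding A.
type Graph struct {
	edges map[string]map[string]bool
}

func New() *Graph { return &Graph{edges: map[string]map[string]bool{}} }

// AddOrder records that lock `held` was held when `acquired` was taken.
func (g *Graph) AddOrder(held, acquired string) {
	if g.edges[held] == nil {
		g.edges[held] = map[string]bool{}
	}
	g.edges[held][acquired] = true
}

// FindCycle returns a lock cycle (potential inversion) such as [A B A]
// if one exists, nil otherwise.
func (g *Graph) FindCycle() []string {
	const (
		unvisited = 0
		inStack   = 1
		done      = 2
	)
	state := map[string]int{}
	var path, cycle []string

	var dfs func(n string) bool
	dfs = func(n string) bool {
		state[n] = inStack
		path = append(path, n)
		for m := range g.edges[n] {
			if state[m] == inStack {
				// Back edge: extract the cycle from the current path.
				for i, p := range path {
					if p == m {
						cycle = append(append([]string{}, path[i:]...), m)
						return true
					}
				}
			}
			if state[m] == unvisited && dfs(m) {
				return true
			}
		}
		state[n] = done
		path = path[:len(path)-1]
		return false
	}

	for n := range g.edges {
		if state[n] == unvisited && dfs(n) {
			return cycle
		}
	}
	return nil
}
```

For the worked example earlier, AddOrder("A", "B") from the worker path plus AddOrder("B", "A") from the retry path yields the cycle [A B A], which the report pairs with the live traces as the minimal counterexample.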
A quick bpftrace prototype for futex hotspots:
```bash
bpftrace -e '
tracepoint:syscalls:sys_enter_futex /pid == TARGET/ { @start[tid] = nsecs; }
tracepoint:syscalls:sys_exit_futex  /@start[tid]/ {
  printf("futex wait by %d lasted %d ms\n", tid, (nsecs - @start[tid]) / 1000000);
  delete(@start[tid]);
}'
```
For production, prefer a compiled CO-RE program with cgroup filters and ring buffer.
Regression Detection Across Deploys
- Maintain per-version feature baselines for endpoints: total latency, CPU share, futex ns share, retransmits per KB, page fault rate.
- After deploy, run:
- Two-sample KS test on latency distributions (pre vs post) at fixed traffic windows (a minimal sketch follows this list).
- Segmented regression to fit new trend lines; flag level shifts.
- Attribution: Shapley-style contribution from each component change to total latency delta.
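A minimal two-sample Kolmogorov–Smirnov check for the first item above, assuming pre- and post-deploy latency samples drawn from comparable traffic windows; the rejection rule uses the standard large-sample critical value c(α)·sqrt((n+m)/(n·m)), and multiple-testing control (Benjamini–Hochberg across endpoints) is left out of the sketch:

```go
package regress

import (
	"math"
	"sort"
)

// KSReject reports whether pre/post latency samples differ significantly at
// level alpha, using the two-sample Kolmogorov–Smirnov statistic with the
// large-sample critical-value approximation.
func KSReject(pre, post []float64, alpha float64) (d float64, reject bool) {
	if len(pre) == 0 || len(post) == 0 {
		return 0, false
	}
	a := append([]float64(nil), pre...)
	b := append([]float64(nil), post...)
	sort.Float64s(a)
	sort.Float64s(b)

	// Walk both sorted samples and track the maximum gap between empirical CDFs.
	i, j := 0, 0
	n, m := float64(len(a)), float64(len(b))
	for i < len(a) && j < len(b) {
		if a[i] <= b[j] {
			i++
		} else {
			j++
		}
		if gap := math.Abs(float64(i)/n - float64(j)/m); gap > d {
			d = gap
		}
	}

	// c(alpha) = sqrt(-ln(alpha/2) / 2); reject if D exceeds c(alpha)*sqrt((n+m)/(n*m)).
	crit := math.Sqrt(-math.Log(alpha/2)/2) * math.Sqrt((n+m)/(n*m))
	return d, d > crit
}
```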
When a regression is detected, the AI generates a diff-aware explanation:
- “In v1.24.7, futex time share increased +28% specifically in span cache.Repopulate; holder thread was waiting on filesystem I/O with 12ms median block latency due to ext4 journal mode change in node image.”
This focuses teams on the smallest code/config surface that explains the delta.
Putting It Together: Minimal Viable Implementation
- Deploy OTel SDKs and Collector with tail sampling.
- Ship a node-level eBPF agent:
- Attach to sched_switch, futex, tcp_retransmit, and optionally block I/O and page faults.
- Filter by k8s cgroup ids of your services.
- Export events to your trace analyzer with service/pod metadata via k8s API.
- Implement a correlator:
- Join kernel events to spans by tid and time; optionally attach a small USDT shim in services to set tid->trace_id.
- Materialize per-span kernel overlays and a per-trace critical path with component attributions.
- Build the analysis engine:
- Changepoint detection per endpoint/version.
- Lock graph and inversion check.
- Heuristics for starvation/retransmit-induced tail latency.
- Explanation generator producing a Markdown incident report.
- Governance:
- Enforce privacy filters and budgets; maintain audit logs of captured fields.
- Automated Jira/GitHub issue filing with evidence and suggested owners based on code ownership maps.
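A sketch of the machine-readable artifact that could accompany the Markdown report for ticket filing and CI gates; the field names are illustrative rather than a fixed schema:

```go
package report

import "time"

// Evidence is one item in the incident timeline: a span, a kernel event,
// or a derived observation such as a lock-cycle edge.
type Evidence struct {
	At          time.Time `json:"at"`
	Kind        string    `json:"kind"` // "span" | "kernel_event" | "derived"
	Description string    `json:"description"`
	TraceID     string    `json:"trace_id,omitempty"`
}

// Incident is the machine-readable counterpart of the Markdown report.
type Incident struct {
	Title           string     `json:"title"`
	Hypothesis      string     `json:"hypothesis"`
	Confidence      float64    `json:"confidence"` // 0..1, per the guardrails above
	Pattern         string     `json:"pattern"`    // e.g., "lock_inversion"
	Versions        []string   `json:"versions"`   // affected deployment.version values
	Endpoints       []string   `json:"endpoints"`
	RiskLevel       string     `json:"risk_level"`
	Remediation     []string   `json:"remediation"`
	Evidence        []Evidence `json:"evidence"`
	SuggestedOwners []string   `json:"suggested_owners,omitempty"`
}
```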
Example: OTLP Log of a Kernel Anomaly Trigger
When a kernel anomaly occurs (e.g., retransmits > threshold), the agent can set an attribute on the current span or create a lightweight log to kick tail sampling:
json{ "resource": { "attributes": { "service.name": "checkout", "k8s.pod.uid": "6b9f...", "deployment.version": "v1.24.7" } }, "log": { "time": "2025-08-19T12:00:31.119Z", "severity": "WARN", "body": "kernel anomaly: tcp retransmit spike", "attributes": { "kernel.anomaly": true, "tcp.retransmits": 3, "tid": 31722 } } }
The tail-sampling policy keyed on the boolean_attribute kernel.anomaly then retains the full trace for deep analysis.
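If the anomaly signal can be delivered to the process handling the request, an in-process shim can set the trigger directly with the OTel Go SDK instead of emitting a separate log; a minimal sketch:

```go
package trigger

import (
	"context"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
)

// MarkKernelAnomaly flags the active span so the tail-sampling policy keyed
// on kernel.anomaly retains the full trace.
func MarkKernelAnomaly(ctx context.Context, retransmits int) {
	span := trace.SpanFromContext(ctx)
	span.SetAttributes(
		attribute.Bool("kernel.anomaly", true),
		attribute.Int("tcp.retransmits", retransmits),
	)
	span.AddEvent("kernel anomaly: tcp retransmit spike")
}
```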
Overhead, Safety, and Rollout Plan
- Start with minimal probes and scope: futex and tcp_retransmit, filtered by target cgroup.
- Measure overhead with perf stat and bpftool prog profile. Expect low microsecond-scale per-event cost.
- Configure ring buffer size and drop policy; ensure backpressure does not stall application threads.
- Roll out per-namespace, then fleet-wide. Monitor a control dashboard: events/sec, drops/sec, and per-node CPU.
- Keep a feature flag to disable probes instantly if any risk is detected.
Limits and Non-goals
- This system is not a memory forensics tool; it does not read payloads or heap objects.
- It cannot always perfectly correlate async spans without optional hints; we prefer correctness over guesswork.
- It does not replace unit/integration testing; it complements them by covering concurrency and production-specific edges.
Future Work
- Automatic validation of race hypotheses with symbolic execution or partial order reduction on minimal traces.
- Integrate code-aware retrieval: map symbolized lock sites back to code owners and PRs using a code index.
- Fine-grained network path inference (TC eBPF) to separate NIC vs network fabric latency contributions.
- Portable USDT shims for common runtimes (Go, Java) to improve TID->trace context fidelity without invasive changes.
References and Further Reading
- OpenTelemetry Specification: https://opentelemetry.io/docs/specs/
- Google Dapper: https://research.google/pubs/pub36356/
- Pivot Tracing: https://dl.acm.org/doi/10.1145/2815400.2815415
- eBPF CO-RE guide: https://nakryiko.com/posts/bpf-portability-and-co-re/
- BPF ring buffer: https://www.kernel.org/doc/html/latest/bpf/ringbuf.html
- Lock-order checking: Eraser (lockset) and FastTrack ideas (adapted from dynamic race detection), see: https://dl.acm.org/doi/10.1145/1453101.1453112
- Tail-based sampling in OTel Collector: https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/tailsamplingprocessor
- TCP retransmit tracepoints: Linux kernel tracepoint documentation
- Queueing theory for latency decomposition: “The Tail at Scale” (Dean & Barroso)
Closing Opinion
Debugging production-only bugs requires humility about what we can instrument safely and confidence in a layered approach. Distributed traces explain intent. eBPF explains reality. The Trace-Aware Debug AI fuses them with statistical and structural reasoning to produce actionable, precise reports without violating privacy or SLOs. Start small—futex + retransmit + tail sampling—and let the evidence guide where to go deeper.
