Stop Shipping Logs? Trace-First Observability with OpenTelemetry Pipelines, Tail-Based Sampling, and eBPF in 2025
If you are paying too much for observability, there is a high chance your biggest line item is logs. In many organizations, logs consume over half of the bill yet contribute the least to incident resolution once distributed tracing is in place. In 2025, the question is not just whether to ship fewer logs; it is whether you should make traces your primary signal and treat logs as a constrained, compliance-driven dataset.
This article lays out an opinionated, technical path for a trace-first operating model using OpenTelemetry (OTel) and eBPF. We will cover:
- Why traces have better cost-to-signal economics than logs
- How to design OTel Collector pipelines for trace-first data flows
- Tail-based sampling policies that keep the right 5–20% and 100% of the critical 1%
- Replacing high-volume logs with span events and span metrics
- Log-to-trace enrichment and correlation when you still need some logs
- eBPF-based sources for low-friction, low-overhead signal capture
- Data governance and cost controls that scale
- A pragmatic, step-by-step migration playbook
The target reader: senior engineers, SREs, observability platform owners, and cost-minded engineering leaders.
The economic case for trace-first
Logs are cheap to write and easy to overproduce. They are also the noisiest, most unstructured, and most expensive to index and query at scale. When teams move to microservices, log volume scales with request fan-out and concurrency, not just with user traffic. Repetitive line logs and high-cardinality labels make matters worse. If you have ever seen a single release double your log bill, you know the pain.
Traces, by contrast, are structured by design. They encode causal context and timing across the entire request path, and they carry typed attributes under consistent semantic conventions. Critically, traces give you precise control over sampling without destroying utility:
- Keep 100% of error and slow traces
- Keep a small, representative sample of healthy throughput (say 1–10%)
- Elevate priority tenants, rare endpoints, and SLO-violating events
Metrics complete the picture: you can derive RED metrics (rate, errors, duration) from spans before sampling, preserving accurate service level indicators even when you aggressively reduce stored traces. With span events and span links, you can further compress many logging use cases into small, structured annotations within a trace.
The combination of tail-based sampling plus spanmetrics gives you most of the observability value of 100% ingest, at a fraction of the cost. As a rough illustration, keeping every error and slow trace (often around 1% of traffic) plus a 5% baseline stores only a few percent of spans, while the derived RED metrics still describe every request.
Architecture overview: the OTel Collector as the control plane
The OpenTelemetry Collector is the backbone of a trace-first architecture. It decouples signal producers from backends, applies policy (filtering, sampling, PII scrubbing), and fans out data to different stores.
A production layout in 2025 typically has:
- Agent collectors running as sidecars or DaemonSets, receiving OTLP from SDKs and from auto-instrumentation (including eBPF exporters)
- A regional or cluster-level gateway collector performing tail-based sampling, attribute normalization, and routing
- A metrics aggregator pipeline fed by connectors such as spanmetrics and servicegraph
- Multiple exporters: a columnar trace store (Tempo, Jaeger, vendor APM), cheap log store (object storage or Loki), and time-series metrics (Prometheus remote write, vendor metrics)
Key design principles:
- Centralize policy in the gateway collector where you can compute on full traces
- Ensure consistent resource attributes across signals (service.name, deployment, version, k8s cluster)
- Generate metrics from traces in the pipeline before trace sampling occurs
- Route the slim subset of logs you still need to cheaper storage and short retention
A reference OTel Collector configuration
Below is a minimal but realistic gateway configuration demonstrating trace-first processing. Adjust for your environment and collector version. The syntax shown maps to common, stable processors available in the opentelemetry-collector-contrib distribution.
```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:
  filelog:
    include: [ /var/log/app/*.log ]
    start_at: beginning
    multiline:
      line_start_pattern: '^[0-9TZ-]+'

processors:
  memory_limiter:
    check_interval: 2s
    limit_mib: 4096
  batch:
    send_batch_size: 8192
    timeout: 2s
  k8sattributes:
    extract:
      metadata: [ k8s.namespace.name, k8s.pod.name, k8s.node.name, k8s.deployment.name ]
    filter:
      node_from_env_var: KUBE_NODE_NAME
  resource:
    attributes:
      - key: deployment.environment
        value: prod
        action: upsert
  # hint for the Loki exporter: promote these resource attributes to Loki labels
  resource/loki_labels:
    attributes:
      - key: loki.resource.labels
        value: k8s.namespace.name, k8s.pod.name, service.name
        action: insert
  attributes/redact:
    actions:
      - key: http.request.header.authorization
        action: delete
      - key: user.email
        action: delete
  transform/logs:
    error_mode: ignore
    log_statements:
      - context: log
        statements:
          - replace_pattern(body, '(?i)password=[^&\s]+', 'password=REDACTED')
  # drop DEBUG and lower from production log streams
  filter/logs_drop_chatter:
    error_mode: ignore
    logs:
      log_record:
        - 'severity_number < SEVERITY_NUMBER_INFO'
  groupbytrace:
    wait_duration: 5s
    num_traces: 200000
  tail_sampling:
    decision_wait: 5s
    num_traces: 200000
    expected_new_traces_per_sec: 4000
    policies:
      - name: errors
        type: status_code
        status_code:
          status_codes: [ ERROR ]
      - name: slow
        type: latency
        latency:
          threshold_ms: 1000
      - name: high_value_tenants
        type: string_attribute
        string_attribute:
          key: tenant.id
          values: [ enterprise-.*, vip-.* ]
          enabled_regex_matching: true
      - name: rare_endpoints
        type: string_attribute
        string_attribute:
          key: http.target
          values: [ /checkout, /payment, /transfer ]
      - name: keep_baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 5

connectors:
  # service.name, span.name, span.kind and status.code are emitted by default;
  # the dimensions below are added on top
  spanmetrics:
    aggregation_temporality: AGGREGATION_TEMPORALITY_CUMULATIVE
    dimensions:
      - name: http.method
      - name: http.route
      - name: http.status_code
    histogram:
      explicit:
        buckets: [ 5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 2s, 5s ]
  servicegraph:
    latency_histogram_buckets: [ 5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 2s, 5s ]

exporters:
  otlp:
    endpoint: traces.vendor.example:4317
    tls:
      insecure_skip_verify: false
  otlp/metrics:
    endpoint: metrics.vendor.example:4317
  prometheus:
    endpoint: "0.0.0.0:9464"
  loki:
    endpoint: http://loki.gw.svc:3100/loki/api/v1/push
  debug:
    verbosity: basic

service:
  telemetry:
    logs:
      level: info
  pipelines:
    # sampled traces go to the trace store
    traces:
      receivers: [ otlp ]
      processors: [ memory_limiter, k8sattributes, resource, attributes/redact, groupbytrace, tail_sampling, batch ]
      exporters: [ otlp, debug ]
    # unsampled copy feeds the metrics connectors so RED metrics stay unbiased
    traces/metrics:
      receivers: [ otlp ]
      processors: [ memory_limiter, k8sattributes, resource, attributes/redact, batch ]
      exporters: [ spanmetrics, servicegraph ]
    metrics:
      receivers: [ otlp, spanmetrics, servicegraph ]
      processors: [ batch ]
      exporters: [ otlp/metrics, prometheus ]
    logs:
      receivers: [ otlp, filelog ]
      processors: [ memory_limiter, k8sattributes, resource, resource/loki_labels, transform/logs, filter/logs_drop_chatter, batch ]
      exporters: [ loki ]
```
Notes:
- groupbytrace ensures the gateway sees all spans in a trace for correct tail sampling. In a horizontally scaled gateway tier, route by trace ID, for example with the loadbalancing exporter on the agents (a sketch follows these notes) or a load balancer that supports stream affinity.
- spanmetrics and servicegraph are connectors: they consume spans and emit metrics into the metrics pipeline. Feed them from a pipeline that has not been tail-sampled (the traces/metrics pipeline above) if you want unbiased metrics; otherwise your RED metrics will reflect only the sampled subset.
- For PII, use a combination of attributes and transform processors to delete or mask sensitive fields.
- If you export metrics to Prometheus, you can scrape the collector or remote write elsewhere.
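For the trace-ID affinity mentioned in the first note, one option is the loadbalancing exporter on the agent tier, which consistently routes all spans of a trace to the same gateway instance. A minimal sketch, assuming a headless Kubernetes Service named otel-gateway in an observability namespace (both names are illustrative):

```yaml
# Agent-tier collector: fan spans out to gateway instances keyed by trace ID so
# each gateway sees whole traces and can tail-sample correctly.
exporters:
  loadbalancing:
    routing_key: traceID
    protocol:
      otlp:
        tls:
          insecure: true                        # assumes a trusted network or service mesh
    resolver:
      k8s:
        service: otel-gateway.observability     # headless Service backing the gateway pods

service:
  pipelines:
    traces:
      receivers: [ otlp ]
      processors: [ memory_limiter, batch ]
      exporters: [ loadbalancing ]
```

Keep the agent tier thin: enrichment and sampling policy stay in the gateway so there is a single place to change behavior.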
Tail-based sampling that keeps the right data
Head-based sampling (in SDKs) randomly drops traces before you know their outcome. It is simple and still useful at the edge, but you will inevitably drop the 1% you care about most: errors, slowdowns, and rare paths.
Tail-based sampling waits to see the whole trace, then applies policies based on status code, latency, attributes, or rate-limits. It gives you a precise spend dial while preserving the incidents and anomalies that actually matter.
Common policies that work well in practice:
- Always keep error traces (status code ERROR)
- Keep slow traces above a latency threshold (e.g., p95 SLO + 20%)
- Keep 100% for high-value tenants or priority endpoints
- Keep a small baseline probabilistic sample to preserve visibility on the happy path
- Rate-limit noisy categories like health checks or batch jobs without losing coverage entirely
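For the last point, the tail_sampling processor's and plus rate_limiting policy types can keep a capped trickle of health-check traces. A sketch with placeholder policy names, paths, and budget:

```yaml
tail_sampling:
  policies:
    # keep health-check traces, but only up to a small budget, so they stay
    # visible without dominating storage
    - name: health_checks_capped
      type: and
      and:
        and_sub_policy:
          - name: is_health_check
            type: string_attribute
            string_attribute:
              key: http.target
              values: [ /healthz, /readyz, /livez ]
          - name: health_check_budget
            type: rate_limiting
            rate_limiting:
              spans_per_second: 50
```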
Sizing the tail sampler:
- Memory is proportional to expected_new_traces_per_sec × decision_wait × avg_spans_per_trace × bytes_per_span
- Rule of thumb: 4–8 KB per span in-memory overhead depending on attributes and collector build
- Example: if you see 3k new traces per second, decision_wait is 5s, and an average trace has 20 spans, memory ≈ 3,000 × 5 × 20 × 4 KB = 1.2 GB. Set memory_limiter higher than that plus headroom and scale horizontally as needed.
Operational best practices:
- Measure end-to-end tail sampling latency; keep decision_wait small (3–7s) so spans are exported promptly
- Drop low-value spans before sampling only if you are certain they are noise (e.g., verbose internal calls); otherwise, you might affect sampling decisions
- Emit spanmetrics before sampling for unbiased RED metrics; you can still export traces after sampling
- Keep baseline sampling >1% so statistical aggregates from traces remain directionally useful during normal operation
Replace high-volume logs with span events
A large fraction of logs are simply annotations of events that happen during a request: cache misses, retries, validation errors, rate limit decisions. These are better expressed as span events with structured attributes. They are cheaper (they compress with the trace), easier to query, and automatically correlated to context like user, tenant, route, and resource.
Examples follow for Go and Python. The intent is to replace line logs like `cache miss for key=X ttl=0` with structured span events.
Go:
goimport ( "context" "go.opentelemetry.io/otel" "go.opentelemetry.io/otel/attribute" "go.opentelemetry.io/otel/codes" ) func Handler(ctx context.Context) error { tracer := otel.Tracer("checkout-service") ctx, span := tracer.Start(ctx, "apply-discount") defer span.End() // replace: log.Printf("cache miss user=%s", userID) span.AddEvent("cache.miss", trace.WithAttributes( attribute.String("cache", "discounts"), attribute.String("user.id", userID), )) if err := doWork(); err != nil { span.RecordError(err) span.SetStatus(codes.Error, "discount application failed") return err } return nil }
Python:
```python
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode


def handler():
    tracer = trace.get_tracer("payments")
    with tracer.start_as_current_span("authorize") as span:
        span.add_event(
            "policy.decision",
            {"policy": "kyc-check", "result": "pass", "score": 0.82},
        )
        try:
            charge()
        except Exception as exc:
            span.record_exception(exc)
            span.set_status(Status(StatusCode.ERROR, str(exc)))
            raise
```
Guidelines:
- Emit events only when they add decision or diagnostic value; avoid per-loop chatter
- Use canonical attribute keys per OTel semantic conventions (http.*, db.*, exception.*) and your domain conventions (tenant.id, user.id)
- Prefer SetStatus and RecordError rather than writing error logs; it makes error traces easy to find and sample
- For periodic logs (heartbeat), use metrics rather than span events
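For the last guideline, a counter is usually the better home for heartbeat-style signals. A minimal sketch using the OTel metrics API in Python; the meter name, counter name, and attributes are illustrative:

```python
from opentelemetry import metrics

# one counter instead of a heartbeat log line per tick
meter = metrics.get_meter("scheduler")
heartbeats = meter.create_counter(
    "scheduler.heartbeats",
    description="liveness ticks emitted by the scheduler loop",
)


def tick(shard: str) -> None:
    # keep attributes low-cardinality: shard, not instance ID or timestamp
    heartbeats.add(1, {"shard": shard})
```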
Log-to-trace enrichment and correlation
You will still need some logs: audit trails, security events, cron output, and occasionally investigative dumps. Keep them, but make them joinable to traces.
Key practices:
- Propagate W3C TraceContext across services and into logging. Many language loggers can automatically inject trace_id and span_id as fields in each log record when a span is active.
- Use OTel logging bridge or appender in your language runtime (e.g., Java Logback appender for OpenTelemetry, Python logging instrumentation). This ensures logs arrive with resource attributes like service.name and version.
- In the collector, normalize log fields and scrub PII. Add or rename attributes so your log store label set stays bounded.
Example of enriching log records in the collector when the application does not yet inject trace context: parse a request ID out of the log body into an attribute you can later join on:
```yaml
processors:
  transform/logs:
    error_mode: ignore
    log_statements:
      - context: log
        statements:
          # copy the request ID named capture group into log attributes
          - merge_maps(attributes, ExtractPatterns(body, "request_id=(?P<request_id>[a-f0-9-]{16,36})"), "upsert")
      - context: resource
        statements:
          - set(attributes["service.namespace"], "retail")
```
Where feasible, prefer application-level injection of trace_id to logs, because reconstructing correlations post-hoc is probabilistic and fragile.
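As an example of application-level injection with the Python SDK, the opentelemetry-instrumentation-logging package patches the standard logging module so every record carries the active trace context; a minimal sketch:

```python
import logging

from opentelemetry.instrumentation.logging import LoggingInstrumentor

# set_logging_format=True rewrites the root logger format to include the
# otelTraceID, otelSpanID, and otelServiceName fields on every record,
# making each log line joinable to its trace.
LoggingInstrumentor().instrument(set_logging_format=True)

logging.getLogger(__name__).warning("payment authorization retried")
```

Most other runtimes have an equivalent appender or bridge, as noted above.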
Routing logs to cheaper storage:
- Keep only audit, security, and compliance logs for long retention (90–365 days)
- Shorten retention for everything else (7–14 days) and route to a columnar or compressed store (a routing sketch follows this list)
- Drop DEBUG level logs in production; keep them in dev and CI only
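One collector-side way to implement this split is the routing connector, which can steer compliance-relevant logs into a long-retention pipeline and everything else into a cheap, short-TTL sink. A sketch that assumes compliance services set a resource attribute such as compliance.audit, and that loki/compliance and loki/short_ttl are two exporter instances pointing at different retention tiers; exact connector keys vary by collector version:

```yaml
connectors:
  routing/logs:
    default_pipelines: [ logs/short_retention ]
    error_mode: ignore
    table:
      - statement: route() where attributes["compliance.audit"] == "true"
        pipelines: [ logs/long_retention ]

service:
  pipelines:
    logs/in:
      receivers: [ otlp, filelog ]
      processors: [ memory_limiter, k8sattributes, resource, transform/logs, batch ]
      exporters: [ routing/logs ]
    logs/long_retention:
      receivers: [ routing/logs ]
      exporters: [ loki/compliance ]   # 90-365 day retention tier
    logs/short_retention:
      receivers: [ routing/logs ]
      exporters: [ loki/short_ttl ]    # 7-14 day retention tier
```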
Instrument more with less friction using eBPF
Manual instrumentation is costly and incomplete. Auto-instrumentation via language agents helps, but you still miss kernel, network, and some database layers. eBPF fills these gaps with low-overhead probes that observe syscalls and kernel events, reconstructing L7 metadata for common protocols.
Viable options in 2025:
- Grafana Beyla: eBPF-based auto-instrumentation that creates spans and metrics for HTTP, gRPC, and common runtimes; exports OTLP
- Pixie: deep eBPF observability for Kubernetes; supports OTLP export and on-cluster analytics
- Parca Agent or Pyroscope: continuous profiling using eBPF; integrates with OTel metrics/logs
- Cilium Hubble for network flow telemetry (export via OTLP adapters or bridge)
An example Beyla configuration that ships traces and metrics to your OTel gateway:
```bash
# Beyla is configured mostly through environment variables; names can differ
# between Beyla releases, so check the docs for your version.
export OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-gateway:4317   # the OTel gateway collector
export BEYLA_OPEN_PORT=6060                                   # instrument the process listening on this port
export BEYLA_KUBE_METADATA_ENABLE=true                        # decorate telemetry with Kubernetes metadata
sudo -E beyla                                                  # needs privileges to load eBPF programs
```
Benefits of eBPF in a trace-first approach:
- Captures service edges and latency even when app code is not instrumented, improving tail-based sampling decisions
- Provides fallback spans for hot paths so you avoid reintroducing logs just to understand who called whom
- Overhead is typically in the low single digits of CPU percent at moderate throughput, and capture can be scoped by namespace or container labels
Caveats:
- eBPF does not replace domain-specific attributes; you still want application-level context for good sampling policies
- Kernel and distro differences matter; test in staging and pin compatible agent versions
Data governance and cost guardrails
Without policy, trace-first can still become expensive. Governance needs to be built into your pipelines.
- Attribute budgets: enforce a bounded set of attribute keys and drop unexpected ones at the collector. High-cardinality attributes like user.id and raw UUIDs should be used sparingly and, if needed, hashed (an allowlist sketch follows the example below).
- PII minimization: delete secrets, tokens, and personal data at the edge. Use transform and attributes processors for masking at ingest time.
- Routing by environment and team: dev and CI go to cheap sinks with short TTL; prod goes to primary APM store with tail sampling
- Rate limiting: use rate_limiting policies in tail sampling and bound exporter sending queues to prevent budget overruns during incidents
- Quotas per tenant: implement sampling rules that keep fairness under traffic spikes, and alert when sampling percentages auto-tighten
- Schema versioning: adhere to OTel semantic conventions and internal data contracts so dashboards and queries do not break as teams evolve attributes
Example transform rules to delete unsafe attributes and cap cardinality:
```yaml
processors:
  transform/traces:
    error_mode: ignore
    trace_statements:
      - context: span
        statements:
          - delete_key(attributes, "db.statement")
          - set(attributes["user.id_hash"], SHA256(attributes["user.id"])) where attributes["user.id"] != nil
          - delete_key(attributes, "user.id")
```
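To enforce the attribute budget mentioned above, OTTL's keep_keys editor can drop every span attribute outside an approved set; the allowlist below is illustrative:

```yaml
processors:
  transform/attribute_allowlist:
    error_mode: ignore
    trace_statements:
      - context: span
        statements:
          # anything not in this list is removed from span attributes
          - keep_keys(attributes, ["http.method", "http.route", "http.status_code", "tenant.id", "user.id_hash"])
```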
A small loss of fidelity prevents large compliance and cost problems later.
A 90-day migration playbook
This is a practical roadmap for teams moving from log-heavy to trace-first without breaking incident response. Adapt timing to your scale.
Phase 0: set goals and baseline (1–2 weeks)
- Define target outcomes: reduce observability spend by 30–60% while improving median time to detect by 20% and preserving 100% of critical incidents
- Baseline your current data volume by signal and service; identify top 10 log producers and their purpose
- Identify key SLOs and error budgets; these will anchor your sampling policies
Phase 1: instrument traces and ensure context in logs (2–3 weeks)
- Deploy OTel SDKs or auto-instrumentation agents in 2–3 critical services; propagate TraceContext end-to-end
- Add span events for high-volume logs: cache outcomes, retry decisions, rate limiters, validation
- Enable automatic log correlation: configure log appenders to inject trace_id and span_id into every record
- Stand up an OTel Collector gateway in staging with memory_limiter, batch, k8sattributes, resource, and attributes/redact processors; no sampling yet
Phase 2: introduce tail-based sampling in staging (2 weeks)
- Start with conservative policies: keep errors and slow traces, 10–20% baseline
- Add spanmetrics connector to generate RED metrics from traces pre-sampling; validate against existing metrics
- Validate eBPF auto-instrumentation in one cluster (Beyla or Pixie) to fill gaps
- Run load tests and confirm memory sizing and decision latency are acceptable
Phase 3: ship to production with guardrails (2–3 weeks)
- Roll out the gateway to prod with tail sampling and metrics connectors
- Route logs to a cheaper store with short TTL by default; keep only compliance logs long-term
- Turn down verbose logs in prod; keep debug-level logging in dev and CI for developer UX
- Monitor KPIs: cost per 1k requests, trace coverage (percentage of requests with a kept trace), time-to-first-signal during incidents
Phase 4: optimize and expand (ongoing)
- Tighten sampling percentages gradually while confirming incident review quality remains high
- Add SLO-aware sampling: increase sampling when error rate or latency SLOs are violated to get more traces during bad periods
- Define team-level data contracts: allowed attributes, redaction patterns, budgets
- Migrate remaining services onto OTel; integrate client-side spans for end-to-end journeys
Exit criteria:
- Most investigative tasks and postmortems are trace-driven using span events
- You can reconstruct RED metrics from traces without relying on application counters
- Log volume reduced by at least a third, ideally by half, with no degradation in incident response
SLO-aware and dynamic sampling patterns
Static percentages are a good start, but production systems are dynamic. You can adjust sampling at runtime using feature flags or configuration reloads when SLOs breach.
Approaches:
- Latency-aware: when p95 latency of a service exceeds its SLO, increase baseline sampling from 5% to 20% and keep all traces over the threshold. This provides visibility during brownouts.
- Error-aware: when error rate crosses the budget, keep 100% of error traces and raise baseline for 10 minutes. Combine with rate-limiting to avoid meltdown.
- Tenant-aware: keep more for enterprise tenants or users under active investigation by support.
These policies are implemented with tail sampling rules and occasional config updates. For more adaptive behavior, some teams feed metrics back to dynamic samplers via a control loop.
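As a concrete example, a temporary policy pushed via config reload can raise the baseline for a single service during a brownout; the service name and percentage below are placeholders:

```yaml
tail_sampling:
  policies:
    # temporary "brownout boost": keep 25% of checkout traffic instead of the
    # global 5% baseline while its latency SLO is breached
    - name: checkout_brownout_boost
      type: and
      and:
        and_sub_policy:
          - name: checkout_only
            type: string_attribute
            string_attribute:
              key: service.name
              values: [ checkout ]
          - name: boosted_baseline
            type: probabilistic
            probabilistic:
              sampling_percentage: 25
```

Remove or dial the policy back once the SLO recovers; keeping it permanently defeats the cost model.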
Pitfalls and how to avoid them
- Sampling before enriching: If you sample before k8sattributes and resource population, you will lose the ability to write attribute-based policies. Always enrich early.
- Disabling errors in code: Developers sometimes reduce error severity to avoid noisy alerts. Your sampling relies on accurate status codes; enforce coding standards and lint for correct span status.
- Overusing high-cardinality attributes: Attributes like raw IDs or user agents will explode storage cardinality. Hash or bucket them, and never use full query strings.
- Span events as logs dumping ground: Keep events small and meaningful. Large payloads negate the cost advantage and can be redaction risks.
- Missing client and edge spans: Traces that start at ingress or client browser provide the most insight into user impact. Add instrumentation at these edges so tail sampling can prioritize real user pain.
What to measure: KPIs for a trace-first program
- Cost per 1k requests by signal (target a declining trend for logs; steady or modest for traces)
- Trace coverage (percentage of requests with at least one kept trace) and error-trace coverage (should be 100%)
- Time to first actionable signal during incidents (span events and error traces should appear within seconds)
- Query performance for common workflows (find error spikes by route or tenant in under a few seconds)
- Pipeline health: collector CPU, memory, decision latency, exporter backpressure, and dropped data counts
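The collector reports on its own pipelines. A minimal sketch of turning up self-telemetry so these KPIs are observable; the exact telemetry keys and metric names (for example otelcol_exporter_queue_size) vary by collector version:

```yaml
service:
  telemetry:
    metrics:
      level: detailed            # expose processor and exporter internals, e.g. queue sizes and dropped items
      address: "0.0.0.0:8888"    # Prometheus scrape endpoint for the collector's own metrics
```

Alert on exporter queue growth, dropped data, and tail-sampling decision latency the same way you alert on application SLOs.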
Frequently asked questions
Do we really stop shipping logs?
- You reduce them heavily and you ship them intentionally. Keep audit and security logs; keep application logs that do not map to the request lifecycle; drop or convert the rest to span events.
Will we miss a breadcrumb that only existed in logs?
- Possibly at first. That is why you start with both signals and transition gradually. Once teams adopt span events and structured attributes, you will find that traces tell a more complete story with better context.
Do we lose the ability to compute long-tail counters and trends?
- No. Derive metrics from traces using spanmetrics and servicegraph. Keep baseline samples for healthy traffic so you maintain representative distributions.
Is eBPF safe and portable enough?
- For mainstream kernels and managed Kubernetes distributions, eBPF-based agents are stable. You still need a compatibility matrix and staged rollouts.
How do we handle compliance and GDPR?
- Delete PII at the edge using collector processors. Prefer hashed IDs and avoid payloads in attributes. Maintain an internal attribute allowlist to enforce the policy consistently.
References and further reading
- Google Dapper paper: Large-scale distributed systems tracing infrastructure
- OpenTelemetry project documentation and semantic conventions
- OpenTelemetry Collector tail sampling and transform processors
- Grafana Tempo and the spanmetrics connector design notes
- eBPF reference material and Beyla, Pixie project docs
Conclusion: lead with traces, demote logs
In 2025, the most effective, cost-conscious observability stacks are trace-first. Traces encode the causal structure of your system, support precise tail-based sampling, and can generate the metrics you rely on for SLOs. Span events replace a surprising amount of log noise. eBPF fills in the blind spots at low overhead. The OTel Collector gives you the control plane to enforce policy, governance, and budgets.
You do not have to go all-in at once: begin with a gateway, introduce tail-based sampling, convert the noisiest logs to span events, and route the remaining logs to a cheaper store. Within a quarter, most teams can cut spend materially while improving the speed and quality of incident response. Trace-first is not a fad; it is the practical shape of modern observability.