Stop Shipping Logs? Trace-First Observability with OpenTelemetry Pipelines, Tail-Based Sampling, and eBPF in 2025
If you are paying too much for observability, there is a high chance your biggest line item is logs. In many organizations, logs consume over half of the bill yet contribute the least to incident resolution once distributed tracing is in place. In 2025, the question is not just whether to ship fewer logs; it is whether you should make traces your primary signal and treat logs as a constrained, compliance-driven dataset.
This article lays out an opinionated, technical path for a trace-first operating model using OpenTelemetry (OTel) and eBPF. We will cover:
- Why traces have better cost-to-signal economics than logs
- How to design OTel Collector pipelines for trace-first data flows
- Tail-based sampling policies that keep the right 5–20% and 100% of the critical 1%
- Replacing high-volume logs with span events and span metrics
- Log-to-trace enrichment and correlation when you still need some logs
- eBPF-based sources for low-friction, low-overhead signal capture
- Data governance and cost controls that scale
- A pragmatic, step-by-step migration playbook
The target reader: senior engineers, SREs, observability platform owners, and cost-minded engineering leaders.
The economic case for trace-first
Logs are cheap to write and easy to overproduce. They are also the noisiest, most unstructured, and most expensive to index and query at scale. When teams move to microservices, log volume scales with request fan-out and concurrency, not just with user traffic. Repetitive line logs and high-cardinality labels make matters worse. If you have ever seen a single release double your log bill, you know the pain.
Traces, by contrast, are structured by design. They encode causal context and timing across the entire request path, and they carry typed attributes under consistent semantic conventions. Critically, traces give you precise control over sampling without destroying utility:
- Keep 100% of error and slow traces
- Keep a small, representative sample of healthy throughput (say 1–10%)
- Elevate priority tenants, rare endpoints, and SLO-violating events
Metrics complete the picture: you can derive RED metrics (rate, errors, duration) from spans before sampling, preserving accurate service level indicators even when you aggressively reduce stored traces. With span events and span links, you can further compress many logging use cases into small, structured annotations within a trace.
The combination of tail-based sampling plus spanmetrics gives you most of the observability value of 100% ingest, at a fraction of the cost. As a rough illustration, keeping every error and slow trace (often around 1% of traffic) plus a 5% baseline stores only a few percent of spans, while the derived RED metrics still describe every request.
Architecture overview: the OTel Collector as the control plane
The OpenTelemetry Collector is the backbone of a trace-first architecture. It decouples signal producers from backends, applies policy (filtering, sampling, PII scrubbing), and fans out data to different stores.
A production layout in 2025 typically has:
- Agent collectors running as sidecars or DaemonSets, receiving OTLP from SDKs and from auto-instrumentation (including eBPF exporters)
- A regional or cluster-level gateway collector performing tail-based sampling, attribute normalization, and routing
- A metrics aggregator pipeline fed by connectors such as spanmetrics and servicegraph
- Multiple exporters: a columnar trace store (Tempo, Jaeger, vendor APM), cheap log store (object storage or Loki), and time-series metrics (Prometheus remote write, vendor metrics)
Key design principles:
- Centralize policy in the gateway collector where you can compute on full traces
- Ensure consistent resource attributes across signals (service.name, deployment, version, k8s cluster)
- Generate metrics from traces in the pipeline before trace sampling occurs
- Route the slim subset of logs you still need to cheaper storage and short retention
A reference OTel Collector configuration
Below is a minimal but realistic gateway configuration demonstrating trace-first processing. Adjust for your environment and collector version. The syntax shown maps to common, stable processors available in the opentelemetry-collector-contrib distribution.
```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:
  filelog:
    include: [ /var/log/app/*.log ]
    start_at: beginning
    multiline:
      line_start_pattern: '^[0-9TZ-]+'

processors:
  memory_limiter:
    check_interval: 2s
    limit_mib: 4096
  batch:
    send_batch_size: 8192
    timeout: 2s
  k8sattributes:
    extract:
      metadata: [ k8s.namespace.name, k8s.pod.name, k8s.node.name, k8s.deployment.name ]
    filter:
      node_from_env_var: KUBE_NODE_NAME
  resource:
    attributes:
      - key: deployment.environment
        value: prod
        action: upsert
  # hint for the Loki exporter: promote these resource attributes to Loki labels
  resource/loki_labels:
    attributes:
      - key: loki.resource.labels
        value: k8s.namespace.name, k8s.pod.name, service.name
        action: insert
  attributes/redact:
    actions:
      - key: http.request.header.authorization
        action: delete
      - key: user.email
        action: delete
  transform/logs:
    error_mode: ignore
    log_statements:
      - context: log
        statements:
          - replace_pattern(body, '(?i)password=[^&\s]+', 'password=REDACTED')
  # drop DEBUG and lower from production log streams
  filter/logs_drop_chatter:
    error_mode: ignore
    logs:
      log_record:
        - 'severity_number < SEVERITY_NUMBER_INFO'
  groupbytrace:
    wait_duration: 5s
    num_traces: 200000
  tail_sampling:
    decision_wait: 5s
    num_traces: 200000
    expected_new_traces_per_sec: 4000
    policies:
      - name: errors
        type: status_code
        status_code:
          status_codes: [ ERROR ]
      - name: slow
        type: latency
        latency:
          threshold_ms: 1000
      - name: high_value_tenants
        type: string_attribute
        string_attribute:
          key: tenant.id
          values: [ enterprise-.*, vip-.* ]
          enabled_regex_matching: true
      - name: rare_endpoints
        type: string_attribute
        string_attribute:
          key: http.target
          values: [ /checkout, /payment, /transfer ]
      - name: keep_baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 5

connectors:
  # service.name, span.name, span.kind and status.code are emitted by default;
  # the dimensions below are added on top
  spanmetrics:
    aggregation_temporality: AGGREGATION_TEMPORALITY_CUMULATIVE
    dimensions:
      - name: http.method
      - name: http.route
      - name: http.status_code
    histogram:
      explicit:
        buckets: [ 5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 2s, 5s ]
  servicegraph:
    latency_histogram_buckets: [ 5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 2s, 5s ]

exporters:
  otlp:
    endpoint: traces.vendor.example:4317
    tls:
      insecure_skip_verify: false
  otlp/metrics:
    endpoint: metrics.vendor.example:4317
  prometheus:
    endpoint: "0.0.0.0:9464"
  loki:
    endpoint: http://loki.gw.svc:3100/loki/api/v1/push
  debug:
    verbosity: basic

service:
  telemetry:
    logs:
      level: info
  pipelines:
    # sampled traces go to the trace store
    traces:
      receivers: [ otlp ]
      processors: [ memory_limiter, k8sattributes, resource, attributes/redact, groupbytrace, tail_sampling, batch ]
      exporters: [ otlp, debug ]
    # unsampled copy feeds the metrics connectors so RED metrics stay unbiased
    traces/metrics:
      receivers: [ otlp ]
      processors: [ memory_limiter, k8sattributes, resource, attributes/redact, batch ]
      exporters: [ spanmetrics, servicegraph ]
    metrics:
      receivers: [ otlp, spanmetrics, servicegraph ]
      processors: [ batch ]
      exporters: [ otlp/metrics, prometheus ]
    logs:
      receivers: [ otlp, filelog ]
      processors: [ memory_limiter, k8sattributes, resource, resource/loki_labels, transform/logs, filter/logs_drop_chatter, batch ]
      exporters: [ loki ]
```
Notes:
- groupbytrace ensures the gateway sees all spans in a trace for correct tail sampling. In a horizontally scaled gateway tier, route by trace ID, for example with the loadbalancing exporter on the agents (a sketch follows these notes) or a load balancer that supports stream affinity.
- spanmetrics and servicegraph are connectors: they consume spans and emit metrics into the metrics pipeline. Feed them from a pipeline that has not been tail-sampled (the traces/metrics pipeline above) if you want unbiased metrics; otherwise your RED metrics will reflect only the sampled subset.
- For PII, use a combination of attributes and transform processors to delete or mask sensitive fields.
- If you export metrics to Prometheus, you can scrape the collector or remote write elsewhere.
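For the trace-ID affinity mentioned in the first note, one option is the loadbalancing exporter on the agent tier, which consistently routes all spans of a trace to the same gateway instance. A minimal sketch, assuming a headless Kubernetes Service named otel-gateway in an observability namespace (both names are illustrative):

```yaml
# Agent-tier collector: fan spans out to gateway instances keyed by trace ID so
# each gateway sees whole traces and can tail-sample correctly.
exporters:
  loadbalancing:
    routing_key: traceID
    protocol:
      otlp:
        tls:
          insecure: true                        # assumes a trusted network or service mesh
    resolver:
      k8s:
        service: otel-gateway.observability     # headless Service backing the gateway pods

service:
  pipelines:
    traces:
      receivers: [ otlp ]
      processors: [ memory_limiter, batch ]
      exporters: [ loadbalancing ]
```

Keep the agent tier thin: enrichment and sampling policy stay in the gateway so there is a single place to change behavior.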
Tail-based sampling that keeps the right data
Head-based sampling (in SDKs) randomly drops traces before you know their outcome. It is simple and still useful at the edge, but you will inevitably drop the 1% you care about most: errors, slowdowns, and rare paths.
Tail-based sampling waits to see the whole trace, then applies policies based on status code, latency, attributes, or rate-limits. It gives you a precise spend dial while preserving the incidents and anomalies that actually matter.
Common policies that work well in practice:
- Always keep error traces (status code ERROR)
- Keep slow traces above a latency threshold (e.g., p95 SLO + 20%)
- Keep 100% for high-value tenants or priority endpoints
- Keep a small baseline probabilistic sample to preserve visibility on the happy path
- Rate-limit noisy categories like health checks or batch jobs without losing coverage entirely
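For the last point, the tail_sampling processor's and plus rate_limiting policy types can keep a capped trickle of health-check traces. A sketch with placeholder policy names, paths, and budget:

```yaml
tail_sampling:
  policies:
    # keep health-check traces, but only up to a small budget, so they stay
    # visible without dominating storage
    - name: health_checks_capped
      type: and
      and:
        and_sub_policy:
          - name: is_health_check
            type: string_attribute
            string_attribute:
              key: http.target
              values: [ /healthz, /readyz, /livez ]
          - name: health_check_budget
            type: rate_limiting
            rate_limiting:
              spans_per_second: 50
```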
Sizing the tail sampler:
- Memory is proportional to expected_new_traces_per_sec × decision_wait × avg_spans_per_trace × bytes_per_span
- Rule of thumb: 4–8 KB per span in-memory overhead depending on attributes and collector build
- Example: if you see 3k new traces per second, decision_wait is 5s, and an average trace has 20 spans, memory ≈ 3,000 × 5 × 20 × 4 KB = 1.2 GB. Set memory_limiter higher than that plus headroom and scale horizontally as needed.
Operational best practices:
- Measure end-to-end tail sampling latency; keep decision_wait small (3–7s) so spans are exported promptly
- Drop low-value spans before sampling only if you are certain they are noise (e.g., verbose internal calls); otherwise, you might affect sampling decisions
- Emit spanmetrics before sampling for unbiased RED metrics; you can still export traces after sampling
- Keep baseline sampling >1% so statistical aggregates from traces remain directionally useful during normal operation
Replace high-volume logs with span events
A large fraction of logs are simply annotations of events that happen during a request: cache misses, retries, validation errors, rate limit decisions. These are better expressed as span events with structured attributes. They are cheaper (they compress with the trace), easier to query, and automatically correlated to context like user, tenant, route, and resource.
Examples follow for Go and Python. The intent is to replace line logs like `cache miss for key=X ttl=0` with structured span events.
Go:
goimport ( "context" "go.opentelemetry.io/otel" "go.opentelemetry.io/otel/attribute" "go.opentelemetry.io/otel/codes" ) func Handler(ctx context.Context) error { tracer := otel.Tracer("checkout-service") ctx, span := tracer.Start(ctx, "apply-discount") defer span.End() // replace: log.Printf("cache miss user=%s", userID) span.AddEvent("cache.miss", trace.WithAttributes( attribute.String("cache", "discounts"), attribute.String("user.id", userID), )) if err := doWork(); err != nil { span.RecordError(err) span.SetStatus(codes.Error, "discount application failed") return err } return nil }
Python:
```python
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode


def handler():
    tracer = trace.get_tracer("payments")
    with tracer.start_as_current_span("authorize") as span:
        span.add_event(
            "policy.decision",
            {"policy": "kyc-check", "result": "pass", "score": 0.82},
        )
        try:
            charge()
        except Exception as exc:
            span.record_exception(exc)
            span.set_status(Status(StatusCode.ERROR, str(exc)))
            raise
```
Guidelines:
- Emit events only when they add decision or diagnostic value; avoid per-loop chatter
- Use canonical attribute keys per OTel semantic conventions (http.*, db.*, exception.*) and your domain conventions (tenant.id, user.id)
- Prefer SetStatus and RecordError rather than writing error logs; it makes error traces easy to find and sample
- For periodic logs (heartbeat), use metrics rather than span events
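For the last guideline, a counter is usually the better home for heartbeat-style signals. A minimal sketch using the OTel metrics API in Python; the meter name, counter name, and attributes are illustrative:

```python
from opentelemetry import metrics

# one counter instead of a heartbeat log line per tick
meter = metrics.get_meter("scheduler")
heartbeats = meter.create_counter(
    "scheduler.heartbeats",
    description="liveness ticks emitted by the scheduler loop",
)


def tick(shard: str) -> None:
    # keep attributes low-cardinality: shard, not instance ID or timestamp
    heartbeats.add(1, {"shard": shard})
```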
Log-to-trace enrichment and correlation
You will still need some logs: audit trails, security events, cron output, and occasionally investigative dumps. Keep them, but make them joinable to traces.
Key practices:
- Propagate W3C TraceContext across services and into logging. Many language loggers can automatically inject trace_id and span_id as fields in each log record when a span is active.
- Use OTel logging bridge or appender in your language runtime (e.g., Java Logback appender for OpenTelemetry, Python logging instrumentation). This ensures logs arrive with resource attributes like service.name and version.
- In the collector, normalize log fields and scrub PII. Add or rename attributes so your log store label set stays bounded.
Example of enriching log records in the collector when the application does not yet inject trace context: parse a request ID out of the log body into an attribute you can later join on:
```yaml
processors:
  transform/logs:
    error_mode: ignore
    log_statements:
      - context: log
        statements:
          # copy the request ID named capture group into log attributes
          - merge_maps(attributes, ExtractPatterns(body, "request_id=(?P<request_id>[a-f0-9-]{16,36})"), "upsert")
      - context: resource
        statements:
          - set(attributes["service.namespace"], "retail")
```
Where feasible, prefer application-level injection of trace_id to logs, because reconstructing correlations post-hoc is probabilistic and fragile.
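As an example of application-level injection with the Python SDK, the opentelemetry-instrumentation-logging package patches the standard logging module so every record carries the active trace context; a minimal sketch:

```python
import logging

from opentelemetry.instrumentation.logging import LoggingInstrumentor

# set_logging_format=True rewrites the root logger format to include the
# otelTraceID, otelSpanID, and otelServiceName fields on every record,
# making each log line joinable to its trace.
LoggingInstrumentor().instrument(set_logging_format=True)

logging.getLogger(__name__).warning("payment authorization retried")
```

Most other runtimes have an equivalent appender or bridge, as noted above.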
Routing logs to cheaper storage:
- Keep only audit, security, and compliance logs for long retention (90–365 days)
- Shorten retention for everything else (7–14 days) and route to a columnar or compressed store (a routing sketch follows this list)
- Drop DEBUG level logs in production; keep them in dev and CI only
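One collector-side way to implement this split is the routing connector, which can steer compliance-relevant logs into a long-retention pipeline and everything else into a cheap, short-TTL sink. A sketch that assumes compliance services set a resource attribute such as compliance.audit, and that loki/compliance and loki/short_ttl are two exporter instances pointing at different retention tiers; exact connector keys vary by collector version:

```yaml
connectors:
  routing/logs:
    default_pipelines: [ logs/short_retention ]
    error_mode: ignore
    table:
      - statement: route() where attributes["compliance.audit"] == "true"
        pipelines: [ logs/long_retention ]

service:
  pipelines:
    logs/in:
      receivers: [ otlp, filelog ]
      processors: [ memory_limiter, k8sattributes, resource, transform/logs, batch ]
      exporters: [ routing/logs ]
    logs/long_retention:
      receivers: [ routing/logs ]
      exporters: [ loki/compliance ]   # 90-365 day retention tier
    logs/short_retention:
      receivers: [ routing/logs ]
      exporters: [ loki/short_ttl ]    # 7-14 day retention tier
```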
Instrument more with less friction using eBPF
Manual instrumentation is costly and incomplete. Auto-instrumentation via language agents helps, but you still miss kernel, network, and some database layers. eBPF fills these gaps with low-overhead probes that observe syscalls and kernel events, reconstructing L7 metadata for common protocols.
Viable options in 2025:
- Grafana Beyla: eBPF-based auto-instrumentation that creates spans and metrics for HTTP, gRPC, and common runtimes; exports OTLP
- Pixie: deep eBPF observability for Kubernetes; supports OTLP export and on-cluster analytics
- Parca Agent or Pyroscope: continuous profiling using eBPF; integrates with OTel metrics/logs
- Cilium Hubble for network flow telemetry (export via OTLP adapters or bridge)
An example Beyla configuration that ships traces and metrics to your OTel gateway:
```bash
# Beyla is configured mostly through environment variables; names can differ
# between Beyla releases, so check the docs for your version.
export OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-gateway:4317   # the OTel gateway collector
export BEYLA_OPEN_PORT=6060                                   # instrument the process listening on this port
export BEYLA_KUBE_METADATA_ENABLE=true                        # decorate telemetry with Kubernetes metadata
sudo -E beyla                                                  # needs privileges to load eBPF programs
```
Benefits of eBPF in a trace-first approach:
- Captures service edges and latency even when app code is not instrumented, improving tail-based sampling decisions
- Provides fallback spans for hot paths so you avoid reintroducing logs just to understand who called whom
- Overhead is typically in the low single digits of CPU percent at moderate throughput, and capture can be scoped by namespace or container labels
Caveats:
- eBPF does not replace domain-specific attributes; you still want application-level context for good sampling policies
- Kernel and distro differences matter; test in staging and pin compatible agent versions
Data governance and cost guardrails
Without policy, trace-first can still become expensive. Governance needs to be built into your pipelines.
- Attribute budgets: enforce a bounded set of attribute keys and drop unexpected ones at the collector. High-cardinality attributes like user.id and raw UUIDs should be used sparingly and, if needed, hashed (an allowlist sketch follows the example below).
- PII minimization: delete secrets, tokens, and personal data at the edge. Use transform and attributes processors for masking at ingest time.
- Routing by environment and team: dev and CI go to cheap sinks with short TTL; prod goes to primary APM store with tail sampling
- Rate limiting: use rate_limiting policies in tail sampling and bound exporter sending queues to prevent budget overruns during incidents
- Quotas per tenant: implement sampling rules that keep fairness under traffic spikes, and alert when sampling percentages auto-tighten
- Schema versioning: adhere to OTel semantic conventions and internal data contracts so dashboards and queries do not break as teams evolve attributes
Example transform rules to delete unsafe attributes and cap cardinality:
```yaml
processors:
  transform/traces:
    error_mode: ignore
    trace_statements:
      - context: span
        statements:
          - delete_key(attributes, "db.statement")
          - set(attributes["user.id_hash"], SHA256(attributes["user.id"])) where attributes["user.id"] != nil
          - delete_key(attributes, "user.id")
```
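To enforce the attribute budget mentioned above, OTTL's keep_keys editor can drop every span attribute outside an approved set; the allowlist below is illustrative:

```yaml
processors:
  transform/attribute_allowlist:
    error_mode: ignore
    trace_statements:
      - context: span
        statements:
          # anything not in this list is removed from span attributes
          - keep_keys(attributes, ["http.method", "http.route", "http.status_code", "tenant.id", "user.id_hash"])
```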
A small loss of fidelity prevents large compliance and cost problems later.
A 90-day migration playbook
This is a practical roadmap for teams moving from log-heavy to trace-first without breaking incident response. Adapt timing to your scale.
Phase 0: set goals and baseline (1–2 weeks)
- Define target outcomes: reduce observability spend by 30–60% while improving median time to detect by 20% and preserving 100% of critical incidents
- Baseline your current data volume by signal and service; identify top 10 log producers and their purpose
- Identify key SLOs and error budgets; these will anchor your sampling policies
Phase 1: instrument traces and ensure context in logs (2–3 weeks)
- Deploy OTel SDKs or auto-instrumentation agents in 2–3 critical services; propagate TraceContext end-to-end
- Add span events for high-volume logs: cache outcomes, retry decisions, rate limiters, validation
- Enable automatic log correlation: configure log appenders to inject trace_id and span_id into every record
- Stand up an OTel Collector gateway in staging with memory_limiter, batch, k8sattributes, resource, and attributes/redact processors; no sampling yet
Phase 2: introduce tail-based sampling in staging (2 weeks)
- Start with conservative policies: keep errors and slow traces, 10–20% baseline
- Add spanmetrics connector to generate RED metrics from traces pre-sampling; validate against existing metrics
- Validate eBPF auto-instrumentation in one cluster (Beyla or Pixie) to fill gaps
- Run load tests and confirm memory sizing and decision latency are acceptable
Phase 3: ship to production with guardrails (2–3 weeks)
- Roll out the gateway to prod with tail sampling and metrics connectors
- Route logs to a cheaper store with short TTL by default; keep only compliance logs long-term
- Turn down verbose logs in prod; keep debug-level logging in dev and CI for developer UX
- Monitor KPIs: cost per 1k requests, trace coverage (percentage of requests with a kept trace), time-to-first-signal during incidents
Phase 4: optimize and expand (ongoing)
- Tighten sampling percentages gradually while confirming incident review quality remains high
- Add SLO-aware sampling: increase sampling when error rate or latency SLOs are violated to get more traces during bad periods
- Define team-level data contracts: allowed attributes, redaction patterns, budgets
- Migrate remaining services onto OTel; integrate client-side spans for end-to-end journeys
Exit criteria:
- Most investigative tasks and postmortems are trace-driven using span events
- You can reconstruct RED metrics from traces without relying on application counters
- Log volume reduced by at least a third, ideally by half, with no degradation in incident response
SLO-aware and dynamic sampling patterns
Static percentages are a good start, but production systems are dynamic. You can adjust sampling at runtime using feature flags or configuration reloads when SLOs breach.
Approaches:
- Latency-aware: when p95 latency of a service exceeds its SLO, increase baseline sampling from 5% to 20% and keep all traces over the threshold. This provides visibility during brownouts.
- Error-aware: when error rate crosses the budget, keep 100% of error traces and raise baseline for 10 minutes. Combine with rate-limiting to avoid meltdown.
- Tenant-aware: keep more for enterprise tenants or users under active investigation by support.
These policies are implemented with tail sampling rules and occasional config updates. For more adaptive behavior, some teams feed metrics back to dynamic samplers via a control loop.
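As a concrete example, a temporary policy pushed via config reload can raise the baseline for a single service during a brownout; the service name and percentage below are placeholders:

```yaml
tail_sampling:
  policies:
    # temporary "brownout boost": keep 25% of checkout traffic instead of the
    # global 5% baseline while its latency SLO is breached
    - name: checkout_brownout_boost
      type: and
      and:
        and_sub_policy:
          - name: checkout_only
            type: string_attribute
            string_attribute:
              key: service.name
              values: [ checkout ]
          - name: boosted_baseline
            type: probabilistic
            probabilistic:
              sampling_percentage: 25
```

Remove or dial the policy back once the SLO recovers; keeping it permanently defeats the cost model.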
Pitfalls and how to avoid them
- Sampling before enriching: If you sample before k8sattributes and resource population, you will lose the ability to write attribute-based policies. Always enrich early.
- Disabling errors in code: Developers sometimes reduce error severity to avoid noisy alerts. Your sampling relies on accurate status codes; enforce coding standards and lint for correct span status.
- Overusing high-cardinality attributes: Attributes like raw IDs or user agents will explode storage cardinality. Hash or bucket them, and never use full query strings.
- Span events as logs dumping ground: Keep events small and meaningful. Large payloads negate the cost advantage and can be redaction risks.
- Missing client and edge spans: Traces that start at ingress or client browser provide the most insight into user impact. Add instrumentation at these edges so tail sampling can prioritize real user pain.
What to measure: KPIs for a trace-first program
- Cost per 1k requests by signal (target a declining trend for logs; steady or modest for traces)
- Trace coverage (percentage of requests with at least one kept trace) and error-trace coverage (should be 100%)
- Time to first actionable signal during incidents (span events and error traces should appear within seconds)
- Query performance for common workflows (find error spikes by route or tenant in under a few seconds)
- Pipeline health: collector CPU, memory, decision latency, exporter backpressure, and dropped data counts
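The collector reports on its own pipelines. A minimal sketch of turning up self-telemetry so these KPIs are observable; the exact telemetry keys and metric names (for example otelcol_exporter_queue_size) vary by collector version:

```yaml
service:
  telemetry:
    metrics:
      level: detailed            # expose processor and exporter internals, e.g. queue sizes and dropped items
      address: "0.0.0.0:8888"    # Prometheus scrape endpoint for the collector's own metrics
```

Alert on exporter queue growth, dropped data, and tail-sampling decision latency the same way you alert on application SLOs.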
Frequently asked questions
Do we really stop shipping logs?
- You reduce them heavily and you ship them intentionally. Keep audit and security logs; keep application logs that do not map to the request lifecycle; drop or convert the rest to span events.
Will we miss a breadcrumb that only existed in logs?
- Possibly at first. That is why you start with both signals and transition gradually. Once teams adopt span events and structured attributes, you will find that traces tell a more complete story with better context.
Do we lose the ability to compute long-tail counters and trends?
- No. Derive metrics from traces using spanmetrics and servicegraph. Keep baseline samples for healthy traffic so you maintain representative distributions.
Is eBPF safe and portable enough?
- For mainstream kernels and managed Kubernetes distributions, eBPF-based agents are stable. You still need a compatibility matrix and staged rollouts.
How do we handle compliance and GDPR?
- Delete PII at the edge using collector processors. Prefer hashed IDs and avoid payloads in attributes. Maintain an internal attribute allowlist to enforce the policy consistently.
References and further reading
- Google Dapper paper: Large-scale distributed systems tracing infrastructure
- OpenTelemetry project documentation and semantic conventions
- OpenTelemetry Collector tail sampling and transform processors
- Grafana Tempo and the spanmetrics connector design notes
- eBPF reference material and Beyla, Pixie project docs
Conclusion: lead with traces, demote logs
In 2025, the most effective, cost-conscious observability stacks are trace-first. Traces encode the causal structure of your system, support precise tail-based sampling, and can generate the metrics you rely on for SLOs. Span events replace a surprising amount of log noise. eBPF fills in the blind spots at low overhead. The OTel Collector gives you the control plane to enforce policy, governance, and budgets.
You do not have to go all-in at once: begin with a gateway, introduce tail-based sampling, convert the noisiest logs to span events, and route the remaining logs to a cheaper store. Within a quarter, most teams can cut spend materially while improving the speed and quality of incident response. Trace-first is not a fad; it is the practical shape of modern observability.