Backpressure by Design in 2025: Concurrency Limits, Admission Control, and Queueing Patterns for Reliable Services
Modern distributed systems most often fail not because a single component is down, but because the system is overloaded. Overload is subtle: median latency looks fine, but the 99th percentile climbs; CPU is not pegged, yet queues grow; retries amplify the problem until the whole stack collapses.
In 2025, reliable services must be engineered with overload as a first-class concern. This article is a practical playbook rooted in queueing theory, production patterns, and the latest tooling—Envoy, gRPC, and Kubernetes—to deliver end-to-end backpressure: from admission control at the edge to adaptive concurrency at each hop, with queue time SLOs, hedged requests, and bulkheads preventing cascading failure.
We’ll cover:
- Why overload happens and how to observe it
- Concurrency limiters vs token buckets vs queue length caps
- Queue time SLOs and adaptive shedding
- Hedged requests that don’t melt your servers
- Bulkheads at the code, process, and cluster layers
- End-to-end flow control with Envoy, gRPC, and Kubernetes
- Step-by-step deployment plans and test strategies
Opinionated take: If you implement only one thing, implement queue time SLOs with end-to-end cancellation. It’s the single most effective habit to avoid serving work you already cannot complete in time.
1) Overload: the real failure mode
Nearly all distributed outages share a theme: demand exceeds effective capacity (for some slice of traffic), queues grow, latency tails explode, clients retry and hedge improperly, and the thundering herd finishes the job. Preventing this requires a predictable relationship between arrival rate and service capacity.
Core principles:
- Little’s Law: L = λW. The average number of items in the system equals arrival rate times average time in the system. If you let W (latency) grow via queues, L grows, which further increases W, and so on. Bound queues to bound latency.
- Tail latency compounds across hops. A single slow hop dominates end-to-end latency, and queueing at each stage fattens the tail.
- Retries shift demand in time; hedges increase demand. Without budgets and idempotency, they worsen overload.
- Capacity is not just CPU/memory. Locks, DB connections, in-flight RPC limits, and cloud IO are bottlenecks.
The corollary: your service needs explicit mechanisms that limit how much work it accepts, how long it waits in queues, how it cancels work when budgets are exceeded, and how it sheds gracefully.
2) Signals and SLOs that matter
You can’t control what you don’t measure. Instrument at least:
- In-flight concurrent requests (by route and priority)
- Queue length and queue time distribution
- End-to-end latency distribution (p50, p90, p99, p99.9)
- CPU, memory, GC/heap pressure, and thread pool saturation
- Retries, hedges, and their success rate
- Deadline/cancellation propagation (rate of server-side canceled work)
- Admission decisions: accepted vs shed, and reasons
Two SLOs drive backpressure design:
- End-to-end latency SLO: e.g., 95% of requests complete under 200 ms
- Queue time SLO: e.g., drop requests that have already spent > 50 ms waiting anywhere before your handler
Queue time SLOs are proactive. Once the budget is blown, finishing the work rarely helps the user; drop early and save capacity for fresh requests that can still succeed.
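To make these signals concrete, here is a minimal instrumentation sketch using the Prometheus Go client; the metric and label names (http_inflight_requests, admission_decisions_total, and so on) are illustrative choices, not a standard. The in-flight gauge also doubles as the scaling signal for the custom-metrics HPA shown later.

```go
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var (
	// In-flight requests, labeled by route and priority.
	Inflight = promauto.NewGaugeVec(prometheus.GaugeOpts{
		Name: "http_inflight_requests",
		Help: "Requests currently being handled.",
	}, []string{"route", "priority"})

	// Time spent queued before the handler ran.
	QueueTime = promauto.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "http_queue_time_seconds",
		Help:    "Time spent waiting before execution.",
		Buckets: prometheus.ExponentialBuckets(0.001, 2, 12), // 1 ms .. ~4 s
	}, []string{"route"})

	// Admission decisions and the reason for shedding.
	Admission = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "admission_decisions_total",
		Help: "Accepted vs shed requests, with reason.",
	}, []string{"decision", "reason"})
)
```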
3) Building blocks
3.1 Token buckets and leaky buckets
Token buckets are simple and effective rate limiters: tokens arrive at a steady rate up to a max burst; each request consumes a token. If the bucket is empty, reject (or queue briefly).
Advantages:
- Good for edges and shared resources to cap overall arrival rate
- Burst tolerance via bucket size
Limitations:
- Not aware of dynamic service time; can accept requests even when concurrency is saturated
Minimal Go token bucket:
```go
package ratelimit

import (
	"sync"
	"time"
)

type TokenBucket struct {
	mu      sync.Mutex
	tokens  int
	cap     int
	rate    int // tokens per second
	lastRef time.Time
}

func NewTokenBucket(cap, rate int) *TokenBucket {
	return &TokenBucket{cap: cap, rate: rate, tokens: cap, lastRef: time.Now()}
}

func (b *TokenBucket) Allow() bool {
	b.mu.Lock()
	defer b.mu.Unlock()

	// Refill based on elapsed time, capped at the bucket size.
	now := time.Now()
	elapsed := now.Sub(b.lastRef)
	refill := int(elapsed.Seconds() * float64(b.rate))
	if refill > 0 {
		b.tokens = min(b.cap, b.tokens+refill)
		b.lastRef = now
	}

	if b.tokens > 0 {
		b.tokens--
		return true
	}
	return false
}

func min(a, b int) int {
	if a < b {
		return a
	}
	return b
}
```
Use at ingress to flatten spikes; combine with concurrency limits internally.
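As a usage sketch, an ingress middleware can wrap the TokenBucket above and reject with 429 plus Retry-After when the bucket is empty (net/http assumed; the one-second Retry-After is illustrative):

```go
// Wrap an ingress handler with the token bucket; shed with 429 when empty.
func RateLimitMiddleware(next http.Handler, bucket *ratelimit.TokenBucket) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !bucket.Allow() {
			w.Header().Set("Retry-After", "1") // seconds; tune to your fill rate
			http.Error(w, "rate limited", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r)
	})
}
```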
3.2 Concurrency limiters
Concurrency limiters cap in-flight work, directly controlling queueing. A static limit is a start; adaptive limiters are better. Netflix’s “concurrency-limits” algorithms (e.g., Gradient2, Vegas) adjust allowed concurrency based on latency and queue signals. Envoy’s adaptive concurrency filter implements a similar gradient controller.
Benefits:
- Prevents unbounded queueing; bounds memory and latency
- Naturally adapts to slower or faster backends
A simple weighted semaphore pattern (Go):
```go
package limiter

import (
	"context"
	"errors"
	"time"

	"golang.org/x/sync/semaphore"
)

type ConcurrencyLimiter struct {
	sem      *semaphore.Weighted
	qTimeout time.Duration // queue time budget
}

func New(limit int64, qTimeout time.Duration) *ConcurrencyLimiter {
	return &ConcurrencyLimiter{
		sem:      semaphore.NewWeighted(limit),
		qTimeout: qTimeout,
	}
}

var ErrQueueTimeout = errors.New("queue time budget exceeded")

type queueTimeKey struct{}

func (c *ConcurrencyLimiter) Run(ctx context.Context, fn func(context.Context) error) error {
	start := time.Now()

	// Wait for a slot, but only as long as the queue time budget allows.
	qctx, cancel := context.WithTimeout(ctx, c.qTimeout)
	defer cancel()
	if err := c.sem.Acquire(qctx, 1); err != nil {
		if errors.Is(err, context.DeadlineExceeded) {
			return ErrQueueTimeout
		}
		return err
	}
	defer c.sem.Release(1)

	// Terminate early if little deadline budget is left for the handler.
	if deadline, ok := ctx.Deadline(); ok {
		left := time.Until(deadline)
		if left < 5*time.Millisecond {
			return context.DeadlineExceeded
		}
	}

	// Attach queue time for logging/metrics.
	ctx = context.WithValue(ctx, queueTimeKey{}, time.Since(start).Milliseconds())
	return fn(ctx)
}
```
Start with a static limit per instance (e.g., 2x vCPU for CPU-bound, or 4x for IO-bound) and iterate with data. Graduating to adaptive controllers (Envoy, concurrency-limits libraries) yields better utilization under variable latency.
3.3 Queue length caps and bounded buffers
If you must queue, bound it. In practice, cap queue length at something proportional to concurrency, like 2–4x the concurrency limit. Anything beyond that is rarely salvageable under an SLO.
Pitfall: default executors (e.g., unbounded thread pools) hide overload until GC or OOM does the shedding. Always bound.
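One way to bound a hand-rolled queue in Go is a buffered channel sized relative to the worker count, with a non-blocking send so "queue full" becomes an explicit, observable rejection. A minimal sketch (the Job type and sizes are illustrative):

```go
package queue

import (
	"context"
	"errors"
)

type Job func(context.Context)

// BoundedQueue: capacity ~2-4x the worker count, never unbounded.
type BoundedQueue struct {
	jobs chan Job
}

func NewBoundedQueue(workers, queueCap int) *BoundedQueue {
	q := &BoundedQueue{jobs: make(chan Job, queueCap)}
	for i := 0; i < workers; i++ {
		go func() {
			for job := range q.jobs {
				job(context.Background())
			}
		}()
	}
	return q
}

var ErrQueueFull = errors.New("queue full")

// Submit never blocks: if the buffer is full, the caller sheds instead of waiting.
func (q *BoundedQueue) Submit(job Job) error {
	select {
	case q.jobs <- job:
		return nil
	default:
		return ErrQueueFull
	}
}
```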
4) Queue time SLOs: shed early, succeed more
Queue time SLOs enforce a hard budget for time spent waiting before execution. You can compute queue time end-to-end by propagating a start timestamp header from the edge and comparing it before you accept work in each hop.
Common headers:
- x-request-start or x-queue-start (set by edge proxies)
- grpc-timeout (deadline-based budgets)
- x-envoy-start-time (Envoy start time; use to compute queue time)
Server-side guard (Go HTTP middleware example):
```go
func QueueBudgetMiddleware(next http.Handler, budget time.Duration) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		// Shed if the request has already spent its queue budget upstream.
		if hdr := r.Header.Get("x-request-start"); hdr != "" {
			if t, err := time.Parse(time.RFC3339Nano, hdr); err == nil {
				if time.Since(t) > budget {
					http.Error(w, "queue time budget exceeded", http.StatusServiceUnavailable)
					return
				}
			}
		}
		// Add our start time for downstreams
		r.Header.Set("x-request-start", start.UTC().Format(time.RFC3339Nano))
		next.ServeHTTP(w, r)
	})
}
```
With gRPC, prefer deadlines:
- Clients set context deadline (or per-RPC timeout via grpc-timeout header)
- Servers check context.Done() frequently and stop work immediately
- Proxies should honor and clamp timeouts, not extend them
Reasonable initial budgets:
- Edge to origin: 50–100 ms queue time budget
- Internal hop: 10–30 ms budget per hop
Tune budgets based on observed queue histograms and SLOs.
5) Admission control: the front door
Admission control decides whether a request may enter the system. Good admission is multi-signal and probabilistic near limits.
Signals to consider:
- Token bucket availability (rate cap)
- Concurrency occupancy (in-flight / limit)
- Queue time budget remaining
- Critical resource pressure (CPU, memory, DB connections)
- Priority/class of traffic (e.g., interactive vs batch)
Simple composite policy:
- Drop if queue time exceeded
- Drop if concurrency limit saturated and queue depth beyond threshold
- Rate-limit by token bucket (burst control)
- Preferentially admit high-priority traffic (bulkhead)
Pseudocode:
```python
def should_admit(req, state):
    if req.queue_time_ms > req.queue_budget_ms:
        return False, "queue_budget"
    if state.inflight >= state.max_concurrency and state.queue_depth >= state.max_queue:
        return False, "queue_full"
    if not state.token_bucket.allow():
        return False, "rate_limit"
    return True, "ok"
```
This is a small, ordered decision policy. Observe and export the reason for every drop.
6) Adaptive shedding: degrade gracefully
Instead of hard on/off, gradually shed under strain. Probabilistic shed rates reduce oscillations.
Idea: when p99 latency exceeds SLO or CPU exceeds 80%, compute a drop probability p in [0, 1], increasing with pressure. Sample per request.
Example policy:
- Measure error rate E and tail latency exceedance T = max(0, p99/SLO − 1)
- Shed probability p = clamp(αT + βE, 0, 0.9)
Adaptive controllers must avoid synchronized oscillations:
- Smooth metrics with EWMA
- Update p at low frequency (e.g., 1–2 Hz)
- Apply jitter
Envoy implements this as the admission_control filter (see config below).
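When a hop isn't fronted by Envoy, the same idea can be approximated in-process. A minimal sketch, assuming the EWMA-smoothed p99 and error rate come from your metrics pipeline and Update is called at 1-2 Hz with jitter:

```go
package shed

import (
	"math/rand"
	"sync/atomic"
	"time"
)

// Shedder recomputes a drop probability at low frequency from smoothed
// signals and samples it per request, following p = clamp(alpha*T + beta*E, 0, 0.9).
type Shedder struct {
	prob        atomic.Value // current drop probability (float64)
	SLO         time.Duration
	Alpha, Beta float64
}

// Update recomputes the drop probability from smoothed p99 and error rate.
func (s *Shedder) Update(p99 time.Duration, errRate float64) {
	t := p99.Seconds()/s.SLO.Seconds() - 1 // tail exceedance T
	if t < 0 {
		t = 0
	}
	p := s.Alpha*t + s.Beta*errRate
	if p < 0 {
		p = 0
	}
	if p > 0.9 {
		p = 0.9 // always keep admitting some traffic
	}
	s.prob.Store(p)
}

// ShouldShed samples the current probability for one request.
func (s *Shedder) ShouldShed() bool {
	p, _ := s.prob.Load().(float64)
	return p > 0 && rand.Float64() < p
}
```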
7) Hedged requests: cut tail latency responsibly
Hedging sends a duplicate request after a delay to cut tail latency. This can be extremely effective for idempotent reads, but catastrophic under overload if misused.
Rules:
- Hedge only idempotent operations
- Use budgets: never exceed overall concurrent hedge limits
- Cancel slower copy immediately when one response arrives
- Disable hedges when shed probability is non-zero (system is overloaded)
gRPC client service config (language support varies; note that retryPolicy and hedgingPolicy are mutually exclusive per method, so hedge idempotent read methods and configure plain retries elsewhere):
json{ "methodConfig": [{ "name": [{"service": "catalog.Search"}], "timeout": "300ms", "retryPolicy": { "maxAttempts": 3, "initialBackoff": "20ms", "maxBackoff": "200ms", "backoffMultiplier": 2, "retryableStatusCodes": ["UNAVAILABLE", "DEADLINE_EXCEEDED"] }, "hedgingPolicy": { "maxAttempts": 2, "hedgingDelay": "30ms", "nonFatalStatusCodes": ["RESOURCE_EXHAUSTED", "ABORTED"] } }] }
Start conservatively (one extra attempt with a 20–50 ms delay) and measure amplification factors.
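As a rough rule of thumb, setting the hedging delay near the observed p95 means only about 5% of requests ever fire a hedge, so expected amplification is roughly 1.05x. If amplification approaches the 1.2x guardrail used in the testing section, the delay is too aggressive or the tail has shifted.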
8) Bulkheads: isolate and prioritize
Bulkheading isolates resources so a failure or surge in one area doesn’t sink the whole ship.
Implement bulkheads at multiple layers:
- Code: separate thread pools/executors per dependency (e.g., DB vs cache), with per-pool concurrency limits (see the sketch after this list)
- Process: dedicated worker pools or pods per traffic class (interactive vs batch)
- Network/proxy: per-route circuit breakers, priority queues
- Cluster: Kubernetes PriorityClasses and ResourceQuotas; PodDisruptionBudgets per service
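At the code layer, per-dependency bulkheads can be as simple as one bounded pool per downstream. A sketch using weighted semaphores (the limits are illustrative):

```go
package bulkhead

import (
	"context"

	"golang.org/x/sync/semaphore"
)

// One bounded pool per dependency: a stalled database cannot consume
// the capacity reserved for cache lookups, and vice versa.
type Pools struct {
	db    *semaphore.Weighted
	cache *semaphore.Weighted
}

func NewPools() *Pools {
	return &Pools{
		db:    semaphore.NewWeighted(32),  // illustrative limits
		cache: semaphore.NewWeighted(256),
	}
}

func (p *Pools) WithDB(ctx context.Context, fn func(context.Context) error) error {
	if err := p.db.Acquire(ctx, 1); err != nil {
		return err // ctx deadline bounds queue time for the slot
	}
	defer p.db.Release(1)
	return fn(ctx)
}

func (p *Pools) WithCache(ctx context.Context, fn func(context.Context) error) error {
	if err := p.cache.Acquire(ctx, 1); err != nil {
		return err
	}
	defer p.cache.Release(1)
	return fn(ctx)
}
```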
Kubernetes PriorityClass example:
```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: interactive-high
value: 100000
preemptionPolicy: PreemptLowerPriority
globalDefault: false
description: "Interactive latency-sensitive traffic"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-low
value: 1000
preemptionPolicy: Never
globalDefault: false
description: "Batch and background jobs"
```
Two Deployments with separate HPAs and resource limits:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-interactive
spec:
  replicas: 3
  selector:
    matchLabels: {app: api, tier: interactive}
  template:
    metadata:
      labels: {app: api, tier: interactive}
    spec:
      priorityClassName: interactive-high
      containers:
        - name: api
          image: ghcr.io/acme/api:2025-09
          resources:
            requests: {cpu: "500m", memory: "512Mi"}
            limits: {cpu: "1", memory: "1Gi"}
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-interactive
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-interactive
  minReplicas: 3
  maxReplicas: 15
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
    scaleDown:
      stabilizationWindowSeconds: 300
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
```
Deploy a second, separately scaled Deployment for batch traffic (lower priority, tighter limits). This way overload in batch doesn’t steal capacity from interactive.
9) Envoy: front-line backpressure and flow control
Envoy is an excellent place to enforce backpressure policies consistently.
Key features:
- local_ratelimit filter: per-route token bucket
- admission_control filter: probabilistic shedding based on upstream success rate and concurrency
- adaptive_concurrency filter: auto-tune concurrency per route based on latency sampling
- circuit breakers on clusters: max connections, pending, requests, retries
- overload_manager: define triggers (e.g., memory) and actions (e.g., stop accepting new requests)
Example Envoy HTTP filter chain with rate limit, admission control, adaptive concurrency, and router:
```yaml
static_resources:
  listeners:
    - name: http
      address:
        socket_address: { address: 0.0.0.0, port_value: 8080 }
      filter_chains:
        - filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                stat_prefix: ingress_http
                route_config:
                  name: local_route
                  virtual_hosts:
                    - name: app
                      domains: ["*"]
                      routes:
                        - match: { prefix: "/" }
                          route: { cluster: app_service }
                http_filters:
                  - name: envoy.filters.http.adaptive_concurrency
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.adaptive_concurrency.v3.AdaptiveConcurrency
                      gradient_controller_config:
                        sample_aggregate_percentile: { value: 95 }
                        concurrency_limit_params:
                          max_concurrency_limit: 1000
                          concurrency_update_interval: 0.1s
                        min_rtt_calc_params:
                          interval: 60s
                          request_count: 50
                  - name: envoy.filters.http.admission_control
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.admission_control.v3.AdmissionControl
                      enabled:
                        default_value: true
                        runtime_key: admission_control.enabled
                      sampling_window: 10s
                      aggression:
                        default_value: 2.0
                        runtime_key: admission_control.aggression
                      rps_threshold:
                        default_value: 20
                        runtime_key: admission_control.rps_threshold
                      success_criteria:          # required: define what counts as success
                        http_criteria:
                          http_success_status:
                            - start: 200
                              end: 400           # 2xx and 3xx count as success
                  - name: envoy.filters.http.local_ratelimit
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.local_ratelimit.v3.LocalRateLimit
                      stat_prefix: local_rate_limit
                      token_bucket: { max_tokens: 1000, tokens_per_fill: 100, fill_interval: 1s }
                      filter_enabled:
                        runtime_key: local_rate_limit_enabled
                        default_value: { numerator: 100, denominator: HUNDRED }
                      filter_enforced:
                        runtime_key: local_rate_limit_enforced
                        default_value: { numerator: 100, denominator: HUNDRED }
                  - name: envoy.filters.http.router
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
  clusters:
    - name: app_service
      connect_timeout: 0.25s
      type: STRICT_DNS
      lb_policy: ROUND_ROBIN
      load_assignment:
        cluster_name: app_service
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address: { address: app, port_value: 8080 }
      circuit_breakers:
        thresholds:
          - max_connections: 1024
            max_pending_requests: 512
            max_requests: 2048
            max_retries: 3
overload_manager:
  refresh_interval: 0.25s
  resource_monitors:
    - name: envoy.resource_monitors.fixed_heap
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.resource_monitors.fixed_heap.v3.FixedHeapConfig
        max_heap_size_bytes: 2147483648  # 2 GiB
  actions:
    - name: envoy.overload_actions.shrink_heap
      triggers:
        - name: envoy.resource_monitors.fixed_heap
          threshold: { value: 0.90 }
    - name: envoy.overload_actions.stop_accepting_requests
      triggers:
        - name: envoy.resource_monitors.fixed_heap
          threshold: { value: 0.98 }
```
Notes:
- admission_control ramps drop probability when upstream signals poor success/tail latency
- adaptive_concurrency learns a concurrency limit to optimize throughput vs latency
- circuit breakers protect upstreams from unbounded pending
- overload_manager guards against heap exhaustion
To propagate queue time: use Envoy’s x-request-start or x-envoy-start-time and a small Lua filter to compute and shed if over budget:
```yaml
- name: envoy.filters.http.lua
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.lua.v3.Lua
    inline_code: |
      function envoy_on_request(handle)
        local start = handle:headers():get("x-request-start")
        if start == nil then
          -- First hop: stamp the stream start time for downstream hops.
          start = handle:streamInfo():startTime():rfc3339(true)
          handle:headers():add("x-request-start", start)
        end
        -- Illustrative only: real code should parse the RFC3339 start time and
        -- compute elapsed milliseconds against the current clock.
        local budget_ms = tonumber(handle:headers():get("x-queue-budget-ms")) or 50
        local elapsed_ms = handle:streamInfo():requestComplete():value() or 0
        if elapsed_ms > budget_ms then
          handle:respond({[":status"] = "503"}, "queue time budget exceeded")
        end
      end
```
This is illustrative; in practice parse start_time properly and compute elapsed precisely via streamInfo.
10) gRPC: deadlines, cancellations, and per-RPC budgets
gRPC’s built-in deadlines are the cleanest way to propagate budgets. Use them everywhere.
Client best practices:
- Set an explicit deadline on every RPC
- Size deadlines to end-to-end SLO minus client-side budgets
- Configure retries/hedges only for idempotent methods
- Clamp max attempts globally
Server best practices:
- Check ctx.Done() in handlers and abort promptly
- Avoid unbounded per-stream buffers
- Bound concurrency (semaphore pattern) and queue time
Go gRPC server example with concurrency limit and queue time guard:
```go
type server struct {
	lim *limiter.ConcurrencyLimiter
}

func (s *server) Get(ctx context.Context, req *pb.GetRequest) (*pb.GetResponse, error) {
	err := s.lim.Run(ctx, func(ctx context.Context) error {
		// Simulate work while respecting ctx cancellation.
		select {
		case <-time.After(20 * time.Millisecond):
			// do work
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	})
	if err != nil {
		if errors.Is(err, limiter.ErrQueueTimeout) {
			return nil, status.Error(codes.ResourceExhausted, "queue budget")
		}
		if errors.Is(err, context.DeadlineExceeded) || errors.Is(err, context.Canceled) {
			return nil, status.Error(codes.DeadlineExceeded, "deadline")
		}
		return nil, status.Error(codes.Unavailable, err.Error())
	}
	return &pb.GetResponse{Ok: true}, nil
}
```
Client with a deadline and retry policy (Go); as noted above, hedging is configured separately on idempotent methods:
```go
conn, err := grpc.Dial(
	target,
	grpc.WithTransportCredentials(insecure.NewCredentials()),
	grpc.WithDefaultServiceConfig(`{
	  "methodConfig": [{
	    "name": [{"service": "catalog.Search"}],
	    "timeout": "0.3s",
	    "retryPolicy": {
	      "maxAttempts": 3,
	      "initialBackoff": "0.02s",
	      "maxBackoff": "0.2s",
	      "backoffMultiplier": 2,
	      "retryableStatusCodes": ["UNAVAILABLE", "DEADLINE_EXCEEDED"]
	    }
	  }]}`),
)
if err != nil {
	log.Fatalf("dial: %v", err)
}

ctx, cancel := context.WithTimeout(context.Background(), 300*time.Millisecond)
defer cancel()
resp, err := pb.NewCatalogClient(conn).Search(ctx, &pb.SearchRequest{Q: "foo"})
```
Validate support for hedging in your language/runtime; otherwise implement client-side manually with contexts and timers.
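Where runtime support is missing, a hand-rolled hedge for an idempotent read can look like this sketch, continuing the client example above (the generated CatalogClient is assumed; the 30 ms delay is illustrative):

```go
// Hedge an idempotent read: fire a second attempt after a short delay,
// take whichever answer arrives first, and cancel the loser.
func hedgedSearch(ctx context.Context, c pb.CatalogClient, req *pb.SearchRequest) (*pb.SearchResponse, error) {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // cancels the slower attempt once we return

	type result struct {
		resp *pb.SearchResponse
		err  error
	}
	results := make(chan result, 2)
	attempt := func() {
		resp, err := c.Search(ctx, req)
		results <- result{resp, err}
	}

	go attempt()
	select {
	case r := <-results:
		return r.resp, r.err // primary finished before the hedge delay
	case <-time.After(30 * time.Millisecond):
		go attempt() // fire the hedge
	case <-ctx.Done():
		return nil, ctx.Err()
	}

	// Return the first success; fall back to the last error.
	var lastErr error
	for i := 0; i < 2; i++ {
		select {
		case r := <-results:
			if r.err == nil {
				return r.resp, nil
			}
			lastErr = r.err
		case <-ctx.Done():
			return nil, ctx.Err()
		}
	}
	return nil, lastErr
}
```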
11) Kubernetes: capacity is policy
Kubernetes gives you levers to ensure backpressure policies stick under autoscaling and failures.
- Requests/limits: set realistic CPU/memory requests; avoid overcommit on critical services
- HPAs: slow scale-down, smoothed scale-up to avoid oscillations; scale on RPS or in-flight gauges via custom metrics, not just CPU
- PodDisruptionBudgets: preserve minimum replicas during maintenance
- TopologySpreadConstraints and anti-affinity: avoid co-locating all replicas in one node/AZ
- PriorityClasses: steer the scheduler during contention
- NetworkPolicy: isolate services to reduce noisy neighbors
Sample HPA based on a custom in-flight gauge (via Prometheus Adapter):
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-inflight
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-interactive
  minReplicas: 3
  maxReplicas: 30
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 20
    scaleDown:
      stabilizationWindowSeconds: 300
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_inflight_requests
        target:
          type: AverageValue
          averageValue: "50"  # target ~50 in-flight per pod
```
Ensure the metric is exported consistently and that stale series from terminated pods are cleaned up, so the average isn't skewed by pods that no longer exist.
12) End-to-end flow control: stitch it together
Backpressure works when the entire path speaks the same language about budgets.
Put it together:
- At ingress: token bucket to flatten bursts, admission_control with adaptive shedding; inject x-request-start and queue budget header
- Service-to-service: propagate deadlines (gRPC), x-request-start; enforce per-hop queue time SLO and concurrency limiters
- Prioritize with bulkheads: separate pools/deployments per priority; Envoy route-level circuit breakers
- Cancellation everywhere: on drop, cancel outstanding downstream work
Headers and context to propagate:
- grpc-timeout (or explicit deadline)
- x-request-id for trace correlation
- x-request-start for queue time computation
- x-priority or routing metadata for bulkheads
Finally, push back to the user appropriately: use 429 (Too Many Requests) or 503 (Service Unavailable) with Retry-After when shedding at the edge; internally, prefer gRPC codes ResourceExhausted or Unavailable.
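A small helper keeps those edge responses consistent (a sketch; the reason strings match the admission pseudocode earlier):

```go
// Map a shed reason to a consistent edge response: 429 for rate limiting,
// 503 for overload, both with Retry-After.
func writeShedResponse(w http.ResponseWriter, reason string) {
	w.Header().Set("Retry-After", "1") // seconds; tune per policy
	switch reason {
	case "rate_limit":
		http.Error(w, "rate limited", http.StatusTooManyRequests)
	default: // "queue_budget", "queue_full", adaptive shedding, ...
		http.Error(w, "overloaded, retry later", http.StatusServiceUnavailable)
	}
}
```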
13) Testing and proving it works
Test the overload path before production does it for you.
Load generators for tail latency:
- wrk2 or fortio for constant-QPS testing (better tail analysis)
- vegeta or k6 for programmable scenarios
- ghz for gRPC load
Experiments:
- Step QPS beyond capacity; verify that latency flattens at SLO and shed rate increases instead of latency exploding
- Induce a backend slowdown (add 20 ms) and observe adaptive concurrency reduce allowed in-flight
- Kill N pods; verify bulkheads keep interactive SLOs while batch sheds
- Clamp deadlines smaller than service time; ensure cancellations propagate and work halts
Metrics to check:
- In-flight gauge plateaus at limit; queue depth stays bounded
- p99 latency stable under overload; shed rate rises smoothly (no oscillation)
- Retry/hedge amplification factor stays < 1.2x under tail events
- CPU doesn’t peg; GC/heap steady; no OOMKills
Chaos drills:
- Add artificial tail latency injection (e.g., 1% requests sleep 500 ms) and confirm hedging reduces p99 without triggering a meltdown
- Disable a DB shard; confirm bulkhead prevents cache/other paths from starving
14) Common pitfalls and anti-patterns
- Unbounded queues and thread pools: the fastest path to meltdown
- Retries without budgets: multiplicative demand under failure
- Hedging writes or non-idempotent operations
- Ignoring client-side timeouts: without deadlines, no end-to-end backpressure
- Autoscaling on CPU alone: IO-bound services need RPS/concurrency-based signals
- Over-trusting averages: p50 is a liar; design for p99
- Shedding at the deepest layer: drop at the edge when possible
15) A pragmatic rollout plan
1. Instrumentation
   - Add in-flight gauges, queue depth/time histograms, and deadline propagation metrics
   - Emit admission decisions (accepted/shed) with reason
2. Deadlines everywhere
   - Set client timeouts for all RPCs
   - Enforce server-side ctx cancellation
3. Bound concurrency and queues
   - Add a static concurrency limiter per service
   - Bound executor/connection pool queues
4. Queue time SLO
   - Add x-request-start at ingress
   - Enforce a per-hop queue time budget
5. Edge policies
   - Envoy local_ratelimit + admission_control
   - Circuit breakers on clusters
6. Bulkheads
   - Split deployments by priority
   - Per-route concurrency limits
7. Adaptive controllers and hedging
   - Enable Envoy adaptive_concurrency on hot paths
   - Introduce hedged reads with small delays and strict idempotency
8. Test overload
   - Run wrk2/ghz scenarios; iterate policies based on data
16) Reference snippets and checklists
Envoy per-route retry policies (complementing the cluster-level circuit breakers shown earlier)
```yaml
route_config:
  virtual_hosts:
    - name: app
      domains: ["*"]
      routes:
        - match: { prefix: "/search" }
          route:
            cluster: search_service
            retry_policy:
              retry_on: 5xx,reset,connect-failure
              num_retries: 2
            max_stream_duration:
              grpc_timeout_header_max: 300ms
        - match: { prefix: "/write" }
          route:
            cluster: write_service
            retry_policy: { retry_on: "" }  # disable retries for non-idempotent writes
```
Go concurrency limiter with priority
```go
type PriorityLimiter struct {
	hi *semaphore.Weighted
	lo *semaphore.Weighted
}

func (p *PriorityLimiter) Run(ctx context.Context, pri string, fn func(context.Context) error) error {
	// Route the request to its priority pool.
	sem := p.lo
	if pri == "high" {
		sem = p.hi
	}
	if err := sem.Acquire(ctx, 1); err != nil {
		return err
	}
	defer sem.Release(1)
	return fn(ctx)
}
```
Assign higher capacity to interactive traffic; ensure a minimum reserved for critical flows.
Kubernetes anti-affinity and spread
```yaml
spec:
  template:
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels: {app: api, tier: interactive}
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 50
              podAffinityTerm:
                labelSelector:
                  matchLabels: {app: api, tier: interactive}
                topologyKey: kubernetes.io/hostname
```
17) How much capacity should I admit?
A practical heuristic to set initial limits:
- Measure steady-state service time S under typical load
- Pick target utilization U (e.g., 60–70% for latency-sensitive services)
- For each pod, allowed concurrency C ≈ U × cores × K for CPU-bound tasks, where K is a small overlap multiplier (the ~2x-per-vCPU rule of thumb from section 3.2), or based on active IO slots for IO-bound work
- If you know arrival rate λ and target latency W, Little’s Law suggests L ≈ λW as the number of in-flight slots across the tier; divide by replicas for per-pod C
Then let adaptive controllers refine around that.
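A worked example of the heuristic above (numbers are illustrative): suppose the tier sees λ = 2,000 requests/s and you target W = 100 ms end to end. Little's Law gives L ≈ 2,000 × 0.1 = 200 in-flight slots across the tier; with 10 replicas, that is roughly 20 concurrent requests per pod, so a static per-pod limit in the 20-25 range is a sane starting point for the adaptive controller to refine.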
18) What’s new or improved by 2025
- Envoy’s adaptive concurrency and admission control are mature enough for production on busy routes; pair with circuit breakers to avoid queue blowups
- gRPC hedging is available in some major implementations (Java, for example), and deadline/retry configs have matured; still validate support in your stack before relying on it
- eBPF-based observability tools (e.g., for per-socket queueing and tail latency attribution) make it practical to see where queueing occurs without code changes
- Kubernetes scheduling and autoscaling are better at respecting PriorityClasses and custom metrics; use them for bulkheads and scale signals
19) Closing thoughts
Overload is inevitable; collapse is optional. The difference is whether your system has explicit, measured, and enforced backpressure. Put a budget on queues. Cap concurrency. Shed early and adapt smoothly. Hedge prudently. Isolate with bulkheads. And make deadlines the lingua franca from edge to leaf and back.
If you approach reliability as backpressure by design, you won’t have to “fix” outages nearly as often—your system will bend, not break.
Suggested further reading and tools:
- Little’s Law and queueing theory primers
- Netflix concurrency-limits (Gradient2, Vegas) papers and libraries
- Envoy documentation: admission_control, adaptive_concurrency, overload_manager, circuit breakers
- gRPC service config for retries and hedging; grpc-timeout semantics
- wrk2, fortio, ghz for tail-aware load testing