Backpressure by Design in 2025: Concurrency Limits, Admission Control, and Queueing Patterns for Reliable Services
Modern distributed systems most often fail not because a single component is down, but because the system is overloaded. Overload is subtle: median latency looks fine, but the 99th percentile climbs; CPU is not pegged, yet queues grow; retries amplify the problem until the whole stack collapses.
In 2025, reliable services must be engineered with overload as a first-class concern. This article is a practical playbook rooted in queueing theory, production patterns, and the latest tooling—Envoy, gRPC, and Kubernetes—to deliver end-to-end backpressure: from admission control at the edge to adaptive concurrency at each hop, with queue time SLOs, hedged requests, and bulkheads preventing cascading failure.
We’ll cover:
- Why overload happens and how to observe it
- Concurrency limiters vs token buckets vs queue length caps
- Queue time SLOs and adaptive shedding
- Hedged requests that don’t melt your servers
- Bulkheads at the code, process, and cluster layers
- End-to-end flow control with Envoy, gRPC, and Kubernetes
- Step-by-step deployment plans and test strategies
Opinionated take: If you implement only one thing, implement queue time SLOs with end-to-end cancellation. It’s the single most effective habit to avoid serving work you already cannot complete in time.
1) Overload: the real failure mode
Nearly all distributed outages share a theme: demand exceeds effective capacity (for some slice of traffic), queues grow, latency tails explode, clients retry and hedge improperly, and the thundering herd finishes the job. Preventing this requires a predictable relationship between arrival rate and service capacity.
Core principles:
- Little’s Law: L = λW. The average number of items in the system equals arrival rate times average time in the system. If you let W (latency) grow via queues, L grows, which further increases W, and so on. Bound queues to bound latency.
- Tail latency compounds across hops. A single slow hop dominates end-to-end latency, and queueing at each stage fattens the tail.
- Retries shift demand in time; hedges increase demand. Without budgets and idempotency, they worsen overload.
- Capacity is not just CPU/memory. Locks, DB connections, in-flight RPC limits, and cloud IO are bottlenecks.
The corollary: your service needs explicit mechanisms that limit how much work it accepts, how long it waits in queues, how it cancels work when budgets are exceeded, and how it sheds gracefully.
2) Signals and SLOs that matter
You can’t control what you don’t measure. Instrument at least:
- In-flight concurrent requests (by route and priority)
- Queue length and queue time distribution
- End-to-end latency distribution (p50, p90, p99, p99.9)
- CPU, memory, GC/heap pressure, and thread pool saturation
- Retries, hedges, and their success rate
- Deadline/cancellation propagation (rate of server-side canceled work)
- Admission decisions: accepted vs shed, and reasons
Two SLOs drive backpressure design:
- End-to-end latency SLO: e.g., 95% of requests complete under 200 ms
- Queue time SLO: e.g., drop requests that have already spent > 50 ms waiting anywhere before your handler
Queue time SLOs are proactive. Once the budget is blown, finishing the work rarely helps the user; drop early and save capacity for fresh requests that can still succeed.
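To make these signals concrete, here is a minimal instrumentation sketch using the Prometheus Go client; the metric and label names (http_inflight_requests, admission_decisions_total, and so on) are illustrative choices, not a standard. The in-flight gauge also doubles as the scaling signal for the custom-metrics HPA shown later.

```go
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var (
	// In-flight requests, labeled by route and priority.
	Inflight = promauto.NewGaugeVec(prometheus.GaugeOpts{
		Name: "http_inflight_requests",
		Help: "Requests currently being handled.",
	}, []string{"route", "priority"})

	// Time spent queued before the handler ran.
	QueueTime = promauto.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "http_queue_time_seconds",
		Help:    "Time spent waiting before execution.",
		Buckets: prometheus.ExponentialBuckets(0.001, 2, 12), // 1 ms .. ~4 s
	}, []string{"route"})

	// Admission decisions and the reason for shedding.
	Admission = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "admission_decisions_total",
		Help: "Accepted vs shed requests, with reason.",
	}, []string{"decision", "reason"})
)
```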
3) Building blocks
3.1 Token buckets and leaky buckets
Token buckets are simple and effective rate limiters: tokens arrive at a steady rate up to a max burst; each request consumes a token. If the bucket is empty, reject (or queue briefly).
Advantages:
- Good for edges and shared resources to cap overall arrival rate
- Burst tolerance via bucket size
Limitations:
- Not aware of dynamic service time; can accept requests even when concurrency is saturated
Minimal Go token bucket:
```go
package ratelimit

import (
	"sync"
	"time"
)

type TokenBucket struct {
	mu      sync.Mutex
	tokens  int
	cap     int
	rate    int // tokens per second
	lastRef time.Time
}

func NewTokenBucket(cap, rate int) *TokenBucket {
	return &TokenBucket{cap: cap, rate: rate, tokens: cap, lastRef: time.Now()}
}

func (b *TokenBucket) Allow() bool {
	b.mu.Lock()
	defer b.mu.Unlock()

	// Refill based on elapsed time, capped at the bucket size.
	now := time.Now()
	elapsed := now.Sub(b.lastRef)
	refill := int(elapsed.Seconds() * float64(b.rate))
	if refill > 0 {
		b.tokens = min(b.cap, b.tokens+refill)
		b.lastRef = now
	}

	if b.tokens > 0 {
		b.tokens--
		return true
	}
	return false
}

func min(a, b int) int {
	if a < b {
		return a
	}
	return b
}
```
Use at ingress to flatten spikes; combine with concurrency limits internally.
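As a usage sketch, an ingress middleware can wrap the TokenBucket above and reject with 429 plus Retry-After when the bucket is empty (net/http assumed; the one-second Retry-After is illustrative):

```go
// Wrap an ingress handler with the token bucket; shed with 429 when empty.
func RateLimitMiddleware(next http.Handler, bucket *ratelimit.TokenBucket) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !bucket.Allow() {
			w.Header().Set("Retry-After", "1") // seconds; tune to your fill rate
			http.Error(w, "rate limited", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r)
	})
}
```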
3.2 Concurrency limiters
Concurrency limiters cap in-flight work, directly controlling queueing. A static limit is a start; adaptive limiters are better. Netflix’s “concurrency-limits” algorithms (e.g., Gradient2, Vegas) adjust allowed concurrency based on latency and queue signals. Envoy’s adaptive concurrency filter implements a similar gradient controller.
Benefits:
- Prevents unbounded queueing; bounds memory and latency
- Naturally adapts to slower or faster backends
A simple weighted semaphore pattern (Go):
```go
package limiter

import (
	"context"
	"errors"
	"time"

	"golang.org/x/sync/semaphore"
)

type ConcurrencyLimiter struct {
	sem      *semaphore.Weighted
	qTimeout time.Duration // queue time budget
}

func New(limit int64, qTimeout time.Duration) *ConcurrencyLimiter {
	return &ConcurrencyLimiter{
		sem:      semaphore.NewWeighted(limit),
		qTimeout: qTimeout,
	}
}

var ErrQueueTimeout = errors.New("queue time budget exceeded")

type queueTimeKey struct{}

func (c *ConcurrencyLimiter) Run(ctx context.Context, fn func(context.Context) error) error {
	start := time.Now()

	// Wait for a slot, but only as long as the queue time budget allows.
	qctx, cancel := context.WithTimeout(ctx, c.qTimeout)
	defer cancel()
	if err := c.sem.Acquire(qctx, 1); err != nil {
		if errors.Is(err, context.DeadlineExceeded) {
			return ErrQueueTimeout
		}
		return err
	}
	defer c.sem.Release(1)

	// Terminate early if little deadline budget is left for the handler.
	if deadline, ok := ctx.Deadline(); ok {
		left := time.Until(deadline)
		if left < 5*time.Millisecond {
			return context.DeadlineExceeded
		}
	}

	// Attach queue time for logging/metrics.
	ctx = context.WithValue(ctx, queueTimeKey{}, time.Since(start).Milliseconds())
	return fn(ctx)
}
```
Start with a static limit per instance (e.g., 2x vCPU for CPU-bound, or 4x for IO-bound) and iterate with data. Graduating to adaptive controllers (Envoy, concurrency-limits libraries) yields better utilization under variable latency.
3.3 Queue length caps and bounded buffers
If you must queue, bound it. In practice, cap queue length at something proportional to concurrency, like 2–4x the concurrency limit. Anything beyond that is rarely salvageable under an SLO.
Pitfall: default executors (e.g., unbounded thread pools) hide overload until GC or OOM does the shedding. Always bound.
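One way to bound a hand-rolled queue in Go is a buffered channel sized relative to the worker count, with a non-blocking send so "queue full" becomes an explicit, observable rejection. A minimal sketch (the Job type and sizes are illustrative):

```go
package queue

import (
	"context"
	"errors"
)

type Job func(context.Context)

// BoundedQueue: capacity ~2-4x the worker count, never unbounded.
type BoundedQueue struct {
	jobs chan Job
}

func NewBoundedQueue(workers, queueCap int) *BoundedQueue {
	q := &BoundedQueue{jobs: make(chan Job, queueCap)}
	for i := 0; i < workers; i++ {
		go func() {
			for job := range q.jobs {
				job(context.Background())
			}
		}()
	}
	return q
}

var ErrQueueFull = errors.New("queue full")

// Submit never blocks: if the buffer is full, the caller sheds instead of waiting.
func (q *BoundedQueue) Submit(job Job) error {
	select {
	case q.jobs <- job:
		return nil
	default:
		return ErrQueueFull
	}
}
```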
4) Queue time SLOs: shed early, succeed more
Queue time SLOs enforce a hard budget for time spent waiting before execution. You can compute queue time end-to-end by propagating a start timestamp header from the edge and comparing it before you accept work in each hop.
Common headers:
- x-request-start or x-queue-start (set by edge proxies)
- grpc-timeout (deadline-based budgets)
- x-envoy-start-time (Envoy start time; use to compute queue time)
Server-side guard (Go HTTP middleware example):
```go
func QueueBudgetMiddleware(next http.Handler, budget time.Duration) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		// Shed if the request has already spent its queue budget upstream.
		if hdr := r.Header.Get("x-request-start"); hdr != "" {
			if t, err := time.Parse(time.RFC3339Nano, hdr); err == nil {
				if time.Since(t) > budget {
					http.Error(w, "queue time budget exceeded", http.StatusServiceUnavailable)
					return
				}
			}
		}
		// Add our start time for downstreams
		r.Header.Set("x-request-start", start.UTC().Format(time.RFC3339Nano))
		next.ServeHTTP(w, r)
	})
}
```
With gRPC, prefer deadlines:
- Clients set context deadline (or per-RPC timeout via grpc-timeout header)
- Servers check context.Done() frequently and stop work immediately
- Proxies should honor and clamp timeouts, not extend them
Reasonable initial budgets:
- Edge to origin: 50–100 ms queue time budget
- Internal hop: 10–30 ms budget per hop
Tune budgets based on observed queue histograms and SLOs.
5) Admission control: the front door
Admission control decides whether a request may enter the system. Good admission is multi-signal and probabilistic near limits.
Signals to consider:
- Token bucket availability (rate cap)
- Concurrency occupancy (in-flight / limit)
- Queue time budget remaining
- Critical resource pressure (CPU, memory, DB connections)
- Priority/class of traffic (e.g., interactive vs batch)
Simple composite policy:
- Drop if queue time exceeded
- Drop if concurrency limit saturated and queue depth beyond threshold
- Rate-limit by token bucket (burst control)
- Preferentially admit high-priority traffic (bulkhead)
Pseudocode:
```python
def should_admit(req, state):
    if req.queue_time_ms > req.queue_budget_ms:
        return False, "queue_budget"
    if state.inflight >= state.max_concurrency and state.queue_depth >= state.max_queue:
        return False, "queue_full"
    if not state.token_bucket.allow():
        return False, "rate_limit"
    return True, "ok"
```
This is a small, ordered decision policy. Observe and export the reason for every drop.
6) Adaptive shedding: degrade gracefully
Instead of hard on/off, gradually shed under strain. Probabilistic shed rates reduce oscillations.
Idea: when p99 latency exceeds SLO or CPU exceeds 80%, compute a drop probability p in [0, 1], increasing with pressure. Sample per request.
Example policy:
- Measure error rate E and tail latency exceedance T = max(0, p99/SLO − 1)
- Shed probability p = clamp(αT + βE, 0, 0.9)
Adaptive controllers must avoid synchronized oscillations:
- Smooth metrics with EWMA
- Update p at low frequency (e.g., 1–2 Hz)
- Apply jitter
Envoy implements this as the admission_control filter (see config below).
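When a hop isn't fronted by Envoy, the same idea can be approximated in-process. A minimal sketch, assuming the EWMA-smoothed p99 and error rate come from your metrics pipeline and Update is called at 1-2 Hz with jitter:

```go
package shed

import (
	"math/rand"
	"sync/atomic"
	"time"
)

// Shedder recomputes a drop probability at low frequency from smoothed
// signals and samples it per request, following p = clamp(alpha*T + beta*E, 0, 0.9).
type Shedder struct {
	prob        atomic.Value // current drop probability (float64)
	SLO         time.Duration
	Alpha, Beta float64
}

// Update recomputes the drop probability from smoothed p99 and error rate.
func (s *Shedder) Update(p99 time.Duration, errRate float64) {
	t := p99.Seconds()/s.SLO.Seconds() - 1 // tail exceedance T
	if t < 0 {
		t = 0
	}
	p := s.Alpha*t + s.Beta*errRate
	if p < 0 {
		p = 0
	}
	if p > 0.9 {
		p = 0.9 // always keep admitting some traffic
	}
	s.prob.Store(p)
}

// ShouldShed samples the current probability for one request.
func (s *Shedder) ShouldShed() bool {
	p, _ := s.prob.Load().(float64)
	return p > 0 && rand.Float64() < p
}
```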
7) Hedged requests: cut tail latency responsibly
Hedging sends a duplicate request after a delay to cut tail latency. This can be extremely effective for idempotent reads, but catastrophic under overload if misused.
Rules:
- Hedge only idempotent operations
- Use budgets: never exceed overall concurrent hedge limits
- Cancel slower copy immediately when one response arrives
- Disable hedges when shed probability is non-zero (system is overloaded)
gRPC client service config (language support varies; note that retryPolicy and hedgingPolicy are mutually exclusive per method, so hedge idempotent read methods and configure plain retries elsewhere):
json{ "methodConfig": [{ "name": [{"service": "catalog.Search"}], "timeout": "300ms", "retryPolicy": { "maxAttempts": 3, "initialBackoff": "20ms", "maxBackoff": "200ms", "backoffMultiplier": 2, "retryableStatusCodes": ["UNAVAILABLE", "DEADLINE_EXCEEDED"] }, "hedgingPolicy": { "maxAttempts": 2, "hedgingDelay": "30ms", "nonFatalStatusCodes": ["RESOURCE_EXHAUSTED", "ABORTED"] } }] }
Start conservatively (one extra attempt with a 20–50 ms delay) and measure amplification factors.
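As a rough rule of thumb, setting the hedging delay near the observed p95 means only about 5% of requests ever fire a hedge, so expected amplification is roughly 1.05x. If amplification approaches the 1.2x guardrail used in the testing section, the delay is too aggressive or the tail has shifted.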
8) Bulkheads: isolate and prioritize
Bulkheading isolates resources so a failure or surge in one area doesn’t sink the whole ship.
Implement bulkheads at multiple layers:
- Code: separate thread pools/executors per dependency (e.g., DB vs cache), with per-pool concurrency limits (see the sketch after this list)
- Process: dedicated worker pools or pods per traffic class (interactive vs batch)
- Network/proxy: per-route circuit breakers, priority queues
- Cluster: Kubernetes PriorityClasses and ResourceQuotas; PodDisruptionBudgets per service
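At the code layer, per-dependency bulkheads can be as simple as one bounded pool per downstream. A sketch using weighted semaphores (the limits are illustrative):

```go
package bulkhead

import (
	"context"

	"golang.org/x/sync/semaphore"
)

// One bounded pool per dependency: a stalled database cannot consume
// the capacity reserved for cache lookups, and vice versa.
type Pools struct {
	db    *semaphore.Weighted
	cache *semaphore.Weighted
}

func NewPools() *Pools {
	return &Pools{
		db:    semaphore.NewWeighted(32),  // illustrative limits
		cache: semaphore.NewWeighted(256),
	}
}

func (p *Pools) WithDB(ctx context.Context, fn func(context.Context) error) error {
	if err := p.db.Acquire(ctx, 1); err != nil {
		return err // ctx deadline bounds queue time for the slot
	}
	defer p.db.Release(1)
	return fn(ctx)
}

func (p *Pools) WithCache(ctx context.Context, fn func(context.Context) error) error {
	if err := p.cache.Acquire(ctx, 1); err != nil {
		return err
	}
	defer p.cache.Release(1)
	return fn(ctx)
}
```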
Kubernetes PriorityClass example:
```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: interactive-high
value: 100000
preemptionPolicy: PreemptLowerPriority
globalDefault: false
description: "Interactive latency-sensitive traffic"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-low
value: 1000
preemptionPolicy: Never
globalDefault: false
description: "Batch and background jobs"
```
Two Deployments with separate HPAs and resource limits:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-interactive
spec:
  replicas: 3
  selector:
    matchLabels: {app: api, tier: interactive}
  template:
    metadata:
      labels: {app: api, tier: interactive}
    spec:
      priorityClassName: interactive-high
      containers:
        - name: api
          image: ghcr.io/acme/api:2025-09
          resources:
            requests: {cpu: "500m", memory: "512Mi"}
            limits: {cpu: "1", memory: "1Gi"}
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-interactive
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-interactive
  minReplicas: 3
  maxReplicas: 15
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
    scaleDown:
      stabilizationWindowSeconds: 300
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
```
Deploy a second, separately scaled Deployment for batch traffic (lower priority, tighter limits). This way overload in batch doesn’t steal capacity from interactive.
9) Envoy: front-line backpressure and flow control
Envoy is an excellent place to enforce backpressure policies consistently.
Key features:
- local_ratelimit filter: per-route token bucket
- admission_control filter: probabilistic shedding based on upstream success rate and concurrency
- adaptive_concurrency filter: auto-tune concurrency per route based on latency sampling
- circuit breakers on clusters: max connections, pending, requests, retries
- overload_manager: define triggers (e.g., memory) and actions (e.g., stop accepting new requests)
Example Envoy HTTP filter chain with rate limit, admission control, adaptive concurrency, and router:
```yaml
static_resources:
  listeners:
    - name: http
      address:
        socket_address: { address: 0.0.0.0, port_value: 8080 }
      filter_chains:
        - filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                stat_prefix: ingress_http
                route_config:
                  name: local_route
                  virtual_hosts:
                    - name: app
                      domains: ["*"]
                      routes:
                        - match: { prefix: "/" }
                          route: { cluster: app_service }
                http_filters:
                  - name: envoy.filters.http.adaptive_concurrency
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.adaptive_concurrency.v3.AdaptiveConcurrency
                      gradient_controller_config:
                        sample_aggregate_percentile: { value: 95 }
                        concurrency_limit_params:
                          max_concurrency_limit: 1000
                          concurrency_update_interval: 0.1s
                        min_rtt_calc_params:
                          interval: 60s
                          request_count: 50
                  - name: envoy.filters.http.admission_control
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.admission_control.v3.AdmissionControl
                      enabled:
                        default_value: true
                        runtime_key: admission_control.enabled
                      sampling_window: 10s
                      aggression:
                        default_value: 2.0
                        runtime_key: admission_control.aggression
                      rps_threshold:
                        default_value: 20
                        runtime_key: admission_control.rps_threshold
                      success_criteria:          # required: define what counts as success
                        http_criteria:
                          http_success_status:
                            - start: 200
                              end: 400           # 2xx and 3xx count as success
                  - name: envoy.filters.http.local_ratelimit
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.local_ratelimit.v3.LocalRateLimit
                      stat_prefix: local_rate_limit
                      token_bucket: { max_tokens: 1000, tokens_per_fill: 100, fill_interval: 1s }
                      filter_enabled:
                        runtime_key: local_rate_limit_enabled
                        default_value: { numerator: 100, denominator: HUNDRED }
                      filter_enforced:
                        runtime_key: local_rate_limit_enforced
                        default_value: { numerator: 100, denominator: HUNDRED }
                  - name: envoy.filters.http.router
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
  clusters:
    - name: app_service
      connect_timeout: 0.25s
      type: STRICT_DNS
      lb_policy: ROUND_ROBIN
      load_assignment:
        cluster_name: app_service
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address: { address: app, port_value: 8080 }
      circuit_breakers:
        thresholds:
          - max_connections: 1024
            max_pending_requests: 512
            max_requests: 2048
            max_retries: 3
overload_manager:
  refresh_interval: 0.25s
  resource_monitors:
    - name: envoy.resource_monitors.fixed_heap
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.resource_monitors.fixed_heap.v3.FixedHeapConfig
        max_heap_size_bytes: 2147483648  # 2 GiB
  actions:
    - name: envoy.overload_actions.shrink_heap
      triggers:
        - name: envoy.resource_monitors.fixed_heap
          threshold: { value: 0.90 }
    - name: envoy.overload_actions.stop_accepting_requests
      triggers:
        - name: envoy.resource_monitors.fixed_heap
          threshold: { value: 0.98 }
```
Notes:
- admission_control ramps drop probability when upstream signals poor success/tail latency
- adaptive_concurrency learns a concurrency limit to optimize throughput vs latency
- circuit breakers protect upstreams from unbounded pending
- overload_manager guards against heap exhaustion
To propagate queue time: use Envoy’s x-request-start or x-envoy-start-time and a small Lua filter to compute and shed if over budget:
```yaml
- name: envoy.filters.http.lua
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.lua.v3.Lua
    inline_code: |
      function envoy_on_request(handle)
        local start = handle:headers():get("x-request-start")
        if start == nil then
          -- First hop: stamp the stream start time for downstream hops.
          start = handle:streamInfo():startTime():rfc3339(true)
          handle:headers():add("x-request-start", start)
        end
        -- Illustrative only: real code should parse the RFC3339 start time and
        -- compute elapsed milliseconds against the current clock.
        local budget_ms = tonumber(handle:headers():get("x-queue-budget-ms")) or 50
        local elapsed_ms = handle:streamInfo():requestComplete():value() or 0
        if elapsed_ms > budget_ms then
          handle:respond({[":status"] = "503"}, "queue time budget exceeded")
        end
      end
```
This is illustrative; in practice parse start_time properly and compute elapsed precisely via streamInfo.
10) gRPC: deadlines, cancellations, and per-RPC budgets
gRPC’s built-in deadlines are the cleanest way to propagate budgets. Use them everywhere.
Client best practices:
- Set an explicit deadline on every RPC
- Size deadlines to end-to-end SLO minus client-side budgets
- Configure retries/hedges only for idempotent methods
- Clamp max attempts globally
Server best practices:
- Check ctx.Done() in handlers and abort promptly
- Avoid unbounded per-stream buffers
- Bound concurrency (semaphore pattern) and queue time
Go gRPC server example with concurrency limit and queue time guard:
```go
type server struct {
	lim *limiter.ConcurrencyLimiter
}

func (s *server) Get(ctx context.Context, req *pb.GetRequest) (*pb.GetResponse, error) {
	err := s.lim.Run(ctx, func(ctx context.Context) error {
		// Simulate work while respecting ctx cancellation.
		select {
		case <-time.After(20 * time.Millisecond):
			// do work
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	})
	if err != nil {
		if errors.Is(err, limiter.ErrQueueTimeout) {
			return nil, status.Error(codes.ResourceExhausted, "queue budget")
		}
		if errors.Is(err, context.DeadlineExceeded) || errors.Is(err, context.Canceled) {
			return nil, status.Error(codes.DeadlineExceeded, "deadline")
		}
		return nil, status.Error(codes.Unavailable, err.Error())
	}
	return &pb.GetResponse{Ok: true}, nil
}
```
Client with a deadline and retry policy (Go); as noted above, hedging is configured separately on idempotent methods:
```go
conn, err := grpc.Dial(
	target,
	grpc.WithTransportCredentials(insecure.NewCredentials()),
	grpc.WithDefaultServiceConfig(`{
	  "methodConfig": [{
	    "name": [{"service": "catalog.Search"}],
	    "timeout": "0.3s",
	    "retryPolicy": {
	      "maxAttempts": 3,
	      "initialBackoff": "0.02s",
	      "maxBackoff": "0.2s",
	      "backoffMultiplier": 2,
	      "retryableStatusCodes": ["UNAVAILABLE", "DEADLINE_EXCEEDED"]
	    }
	  }]}`),
)
if err != nil {
	log.Fatalf("dial: %v", err)
}

ctx, cancel := context.WithTimeout(context.Background(), 300*time.Millisecond)
defer cancel()
resp, err := pb.NewCatalogClient(conn).Search(ctx, &pb.SearchRequest{Q: "foo"})
```
Validate support for hedging in your language/runtime; otherwise implement client-side manually with contexts and timers.
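Where runtime support is missing, a hand-rolled hedge for an idempotent read can look like this sketch, continuing the client example above (the generated CatalogClient is assumed; the 30 ms delay is illustrative):

```go
// Hedge an idempotent read: fire a second attempt after a short delay,
// take whichever answer arrives first, and cancel the loser.
func hedgedSearch(ctx context.Context, c pb.CatalogClient, req *pb.SearchRequest) (*pb.SearchResponse, error) {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // cancels the slower attempt once we return

	type result struct {
		resp *pb.SearchResponse
		err  error
	}
	results := make(chan result, 2)
	attempt := func() {
		resp, err := c.Search(ctx, req)
		results <- result{resp, err}
	}

	go attempt()
	select {
	case r := <-results:
		return r.resp, r.err // primary finished before the hedge delay
	case <-time.After(30 * time.Millisecond):
		go attempt() // fire the hedge
	case <-ctx.Done():
		return nil, ctx.Err()
	}

	// Return the first success; fall back to the last error.
	var lastErr error
	for i := 0; i < 2; i++ {
		select {
		case r := <-results:
			if r.err == nil {
				return r.resp, nil
			}
			lastErr = r.err
		case <-ctx.Done():
			return nil, ctx.Err()
		}
	}
	return nil, lastErr
}
```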
11) Kubernetes: capacity is policy
Kubernetes gives you levers to ensure backpressure policies stick under autoscaling and failures.
- Requests/limits: set realistic CPU/memory requests; avoid overcommit on critical services
- HPAs: slow scale-down, smoothed scale-up to avoid oscillations; scale on RPS or in-flight gauges via custom metrics, not just CPU
- PodDisruptionBudgets: preserve minimum replicas during maintenance
- TopologySpreadConstraints and anti-affinity: avoid co-locating all replicas in one node/AZ
- PriorityClasses: steer the scheduler during contention
- NetworkPolicy: isolate services to reduce noisy neighbors
Sample HPA based on a custom in-flight gauge (via Prometheus Adapter):
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-inflight
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-interactive
  minReplicas: 3
  maxReplicas: 30
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 20
    scaleDown:
      stabilizationWindowSeconds: 300
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_inflight_requests
        target:
          type: AverageValue
          averageValue: "50"  # target ~50 in-flight per pod
```
Ensure the metric is exported consistently and that stale series from terminated pods are cleaned up, so the average isn't skewed by pods that no longer exist.
12) End-to-end flow control: stitch it together
Backpressure works when the entire path speaks the same language about budgets.
Put it together:
- At ingress: token bucket to flatten bursts, admission_control with adaptive shedding; inject x-request-start and queue budget header
- Service-to-service: propagate deadlines (gRPC), x-request-start; enforce per-hop queue time SLO and concurrency limiters
- Prioritize with bulkheads: separate pools/deployments per priority; Envoy route-level circuit breakers
- Cancellation everywhere: on drop, cancel outstanding downstream work
Headers and context to propagate:
- grpc-timeout (or explicit deadline)
- x-request-id for trace correlation
- x-request-start for queue time computation
- x-priority or routing metadata for bulkheads
Finally, push back to the user appropriately: use 429 (Too Many Requests) or 503 (Service Unavailable) with Retry-After when shedding at the edge; internally, prefer gRPC codes ResourceExhausted or Unavailable.
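A small helper keeps those edge responses consistent (a sketch; the reason strings match the admission pseudocode earlier):

```go
// Map a shed reason to a consistent edge response: 429 for rate limiting,
// 503 for overload, both with Retry-After.
func writeShedResponse(w http.ResponseWriter, reason string) {
	w.Header().Set("Retry-After", "1") // seconds; tune per policy
	switch reason {
	case "rate_limit":
		http.Error(w, "rate limited", http.StatusTooManyRequests)
	default: // "queue_budget", "queue_full", adaptive shedding, ...
		http.Error(w, "overloaded, retry later", http.StatusServiceUnavailable)
	}
}
```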
13) Testing and proving it works
Test the overload path before production does it for you.
Load generators for tail latency:
- wrk2 or fortio for constant-QPS testing (better tail analysis)
- vegeta or k6 for programmable scenarios
- ghz for gRPC load
Experiments:
- Step QPS beyond capacity; verify that latency flattens at SLO and shed rate increases instead of latency exploding
- Induce a backend slowdown (add 20 ms) and observe adaptive concurrency reduce allowed in-flight
- Kill N pods; verify bulkheads keep interactive SLOs while batch sheds
- Clamp deadlines smaller than service time; ensure cancellations propagate and work halts
Metrics to check:
- In-flight gauge plateaus at limit; queue depth stays bounded
- p99 latency stable under overload; shed rate rises smoothly (no oscillation)
- Retry/hedge amplification factor stays < 1.2x under tail events
- CPU doesn’t peg; GC/heap steady; no OOMKills
Chaos drills:
- Add artificial tail latency injection (e.g., 1% requests sleep 500 ms) and confirm hedging reduces p99 without triggering a meltdown
- Disable a DB shard; confirm bulkhead prevents cache/other paths from starving
14) Common pitfalls and anti-patterns
- Unbounded queues and thread pools: the fastest path to meltdown
- Retries without budgets: multiplicative demand under failure
- Hedging writes or non-idempotent operations
- Ignoring client-side timeouts: without deadlines, no end-to-end backpressure
- Autoscaling on CPU alone: IO-bound services need RPS/concurrency-based signals
- Over-trusting averages: p50 is a liar; design for p99
- Shedding at the deepest layer: drop at the edge when possible
15) A pragmatic rollout plan
1. Instrumentation
   - Add in-flight gauges, queue depth/time histograms, and deadline propagation metrics
   - Emit admission decisions (accepted/shed) with reason
2. Deadlines everywhere
   - Set client timeouts for all RPCs
   - Enforce server-side ctx cancellation
3. Bound concurrency and queues
   - Add a static concurrency limiter per service
   - Bound executor/connection pool queues
4. Queue time SLO
   - Add x-request-start at ingress
   - Enforce a per-hop queue time budget
5. Edge policies
   - Envoy local_ratelimit + admission_control
   - Circuit breakers on clusters
6. Bulkheads
   - Split deployments by priority
   - Per-route concurrency limits
7. Adaptive controllers and hedging
   - Enable Envoy adaptive_concurrency on hot paths
   - Introduce hedged reads with small delays and strict idempotency
8. Test overload
   - Run wrk2/ghz scenarios; iterate policies based on data
16) Reference snippets and checklists
Envoy per-route retry policies (complementing the cluster-level circuit breakers shown earlier)
```yaml
route_config:
  virtual_hosts:
    - name: app
      domains: ["*"]
      routes:
        - match: { prefix: "/search" }
          route:
            cluster: search_service
            retry_policy:
              retry_on: 5xx,reset,connect-failure
              num_retries: 2
            max_stream_duration:
              grpc_timeout_header_max: 300ms
        - match: { prefix: "/write" }
          route:
            cluster: write_service
            retry_policy: { retry_on: "" }  # disable retries for non-idempotent writes
```
Go concurrency limiter with priority
```go
type PriorityLimiter struct {
	hi *semaphore.Weighted
	lo *semaphore.Weighted
}

func (p *PriorityLimiter) Run(ctx context.Context, pri string, fn func(context.Context) error) error {
	// Route the request to its priority pool.
	sem := p.lo
	if pri == "high" {
		sem = p.hi
	}
	if err := sem.Acquire(ctx, 1); err != nil {
		return err
	}
	defer sem.Release(1)
	return fn(ctx)
}
```
Assign higher capacity to interactive traffic; ensure a minimum reserved for critical flows.
Kubernetes anti-affinity and spread
```yaml
spec:
  template:
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels: {app: api, tier: interactive}
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 50
              podAffinityTerm:
                labelSelector:
                  matchLabels: {app: api, tier: interactive}
                topologyKey: kubernetes.io/hostname
```
17) How much capacity should I admit?
A practical heuristic to set initial limits:
- Measure steady-state service time S under typical load
- Pick target utilization U (e.g., 60–70% for latency-sensitive services)
- For each pod, allowed concurrency C ≈ U × cores × K for CPU-bound tasks, where K is a small overlap multiplier (the ~2x-per-vCPU rule of thumb from section 3.2), or based on active IO slots for IO-bound work
- If you know arrival rate λ and target latency W, Little’s Law suggests L ≈ λW as the number of in-flight slots across the tier; divide by replicas for per-pod C
Then let adaptive controllers refine around that.
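A worked example of the heuristic above (numbers are illustrative): suppose the tier sees λ = 2,000 requests/s and you target W = 100 ms end to end. Little's Law gives L ≈ 2,000 × 0.1 = 200 in-flight slots across the tier; with 10 replicas, that is roughly 20 concurrent requests per pod, so a static per-pod limit in the 20-25 range is a sane starting point for the adaptive controller to refine.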
18) What’s new or improved by 2025
- Envoy’s adaptive concurrency and admission control are mature enough for production on busy routes; pair with circuit breakers to avoid queue blowups
- gRPC hedging is available in some major implementations (Java, for example), and deadline/retry configs have matured; still validate support in your stack before relying on it
- eBPF-based observability tools (e.g., for per-socket queueing and tail latency attribution) make it practical to see where queueing occurs without code changes
- Kubernetes scheduling and autoscaling are better at respecting PriorityClasses and custom metrics; use them for bulkheads and scale signals
19) Closing thoughts
Overload is inevitable; collapse is optional. The difference is whether your system has explicit, measured, and enforced backpressure. Put a budget on queues. Cap concurrency. Shed early and adapt smoothly. Hedge prudently. Isolate with bulkheads. And make deadlines the lingua franca from edge to leaf and back.
If you approach reliability as backpressure by design, you won’t have to “fix” outages nearly as often—your system will bend, not break.
Suggested further reading and tools:
- Little’s Law and queueing theory primers
- Netflix concurrency-limits (Gradient2, Vegas) papers and libraries
- Envoy documentation: admission_control, adaptive_concurrency, overload_manager, circuit breakers
- gRPC service config for retries and hedging; grpc-timeout semantics
- wrk2, fortio, ghz for tail-aware load testing