LLM systems don’t behave like typical web services. They’re bursty, heavy-tailed, and their resource consumption is driven by tokens, not requests. In 2025, treating a chat endpoint like a classic HTTP microservice will cost you in both latency and money.
This article makes a case for token-aware autoscaling, dives into KV cache warmth and GPU multiplexing, and then gets hands-on with Kubernetes patterns using vLLM and TensorRT-LLM. If you’ve ever watched p95 latency spike as your model shifts from short prompts to long context loads, or if your GPU sits half idle due to poor batching, this is for you.
Contents
- Why RPS-based autoscaling fails for LLMs
- Token-aware metrics: the only autoscaling signal that makes sense
- Splitting prefill vs decode: separate queues, budgets, and fairness
- KV cache warmth: paging, prefix caches, and avoiding cold starts
- Batching and GPU multiplexing: continuous batching, MPS, and fairness
- Right-sizing GPUs with MIG/MPS and matching to model classes
- Kubernetes patterns with vLLM and TensorRT-LLM
- Capacity planning and SLO-driven autoscaling formulas
- Guardrails, admission control, and cost governance
- A practical checklist
Why RPS-based autoscaling fails for LLMs
- Token cost dominates: Two requests can have identical RPS but wildly different token footprints. A 200-token prompt with a 64-token completion is nothing like a 100k-token retrieval-augmented prompt with a 1k-token completion. RPS doesn’t capture this.
- Long-tailed generation: Generation lengths and prompt sizes follow heavy-tailed distributions. p95/p99 latencies are driven by the fat tail, not the mean. Scale based on RPS and you’ll under-provision exactly when you need capacity.
- Phase shift within one request: LLM inference has distinct phases. Prefill (processing the prompt) is compute- and bandwidth-heavy in attention layers, while decode (token-by-token generation) is lighter per step but long-lived. Autoscaling must consider both phases to avoid head-of-line blocking.
- Streaming masks real load: Streaming responses deliver partial tokens early, but the GPU is occupied for the entire generation. RPS-based scale-out can keep time-to-first-token looking healthy while the decode phase is still starved.
- KV cache as the bottleneck: Many modern servers (e.g., vLLM with PagedAttention) are KV-cache-bound rather than pure FLOP-bound. RPS doesn’t reflect when the KV cache is nearing exhaustion, causing evictions and sharp latency cliffs.
The correct unit for LLM capacity is tokens/sec (both in and out). You need phase-aware token budgets, not just requests/sec.
Token-aware metrics: the only autoscaling signal that makes sense
Define clear metrics. At a minimum (a minimal exporter sketch follows the list):
- tokens_in_total: Cumulative number of prompt (prefill) tokens processed.
- tokens_out_total: Cumulative number of decode (generated) tokens produced.
- active_sequences: Number of sequences currently resident on the GPU.
- prefill_queue_len, decode_queue_len: Queue depths by phase.
- kv_cache_bytes_used, kv_cache_bytes_capacity: KV footprint and headroom.
- kv_cache_evictions_total: Evictions—an early warning for impending latency cliffs.
- batch_merge_efficiency: Percent of batch slots filled when stepping.
- latency_p95_decode_step_ms, latency_p95_prefill_ms: Phase-specific latency.
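If your server doesn't already export these (vLLM and Triton expose many of them natively, under framework-specific names), a thin in-process or sidecar exporter is enough. A minimal sketch using prometheus_client, assuming you can hook your engine's admission and step callbacks; the record_* helpers are hypothetical:

```python
from prometheus_client import Counter, Gauge, start_http_server

# Counters for cumulative token throughput, split by phase.
TOKENS_IN = Counter("tokens_in_total", "Prompt (prefill) tokens processed")
TOKENS_OUT = Counter("tokens_out_total", "Generated (decode) tokens produced")

# Gauges for instantaneous state the autoscaler cares about.
ACTIVE_SEQUENCES = Gauge("active_sequences", "Sequences resident on the GPU")
PREFILL_QUEUE = Gauge("prefill_queue_len", "Requests waiting for prefill")
DECODE_QUEUE = Gauge("decode_queue_len", "Sequences waiting for a decode step")
KV_USED = Gauge("kv_cache_bytes_used", "KV cache bytes in use")
KV_CAPACITY = Gauge("kv_cache_bytes_capacity", "KV cache bytes available")
KV_EVICTIONS = Counter("kv_cache_evictions_total", "KV cache evictions")


def record_prefill(num_prompt_tokens: int) -> None:
    """Call once per admitted request, after tokenization."""
    TOKENS_IN.inc(num_prompt_tokens)


def record_decode_step(num_sequences_advanced: int) -> None:
    """Call once per engine step; each advanced sequence emits one token."""
    TOKENS_OUT.inc(num_sequences_advanced)


if __name__ == "__main__":
    start_http_server(9400)  # scrape target for Prometheus
```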
From these counters you can derive rate metrics:
- token_in_rate = rate(tokens_in_total[1m])
- token_out_rate = rate(tokens_out_total[1m])
- kv_utilization = kv_cache_bytes_used / kv_cache_bytes_capacity
Capacity is pod- and model-dependent. Empirically benchmark tokens/sec per GPU and per instance configuration. For example, you might observe on a single H100:
- Llama-3 8B FP8, vLLM: ~30–40k tok/s prefill burst, ~7–10k tok/s steady decode (numbers vary widely with batch size, sequence length, and quantization).
- Mixtral 8x7B INT4, TensorRT-LLM: lower prefill throughput (MoE routing overhead), higher decode throughput under good batching.
Once you have per-pod capacity targets, autoscaling becomes straightforward:
- desired_pod_count_prefill = ceil(token_in_rate / target_tok_in_per_pod)
- desired_pod_count_decode = ceil(token_out_rate / target_tok_out_per_pod)
Aggregate both with headroom and stability windows. Aim to keep utilization at 60–80% so the scheduler can absorb traffic spikes without tail-latency blowups.
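For a quick offline sanity check of these formulas (the HPA/KEDA configs later do the same thing in-cluster), a minimal sketch with illustrative per-pod targets:

```python
import math


def desired_replicas(token_in_rate: float, token_out_rate: float,
                     target_tok_in_per_pod: float, target_tok_out_per_pod: float,
                     utilization_target: float = 0.7,
                     min_replicas: int = 1, max_replicas: int = 40) -> int:
    """Replica count implied by phase-specific token rates, with headroom."""
    need_prefill = math.ceil(token_in_rate / (target_tok_in_per_pod * utilization_target))
    need_decode = math.ceil(token_out_rate / (target_tok_out_per_pod * utilization_target))
    return max(min_replicas, min(max_replicas, max(need_prefill, need_decode)))


# Example: 250k prefill tok/s and 40k decode tok/s against illustrative
# per-pod targets of 35k in / 8k out at 70% utilization -> 11 replicas.
print(desired_replicas(250_000, 40_000, 35_000, 8_000))
```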
Splitting prefill vs decode: separate queues, budgets, and fairness
Prefill and decode compete for the same GPU but demand different resource shapes. If you let long prefill jobs (e.g., huge contexts) monopolize the device, short decodes stall and tail latencies explode for everyone streaming replies.
The fix is to split scheduling into at least two queues:
- Prefill queue: admission limited by a cap on prompt tokens in-flight.
- Decode queue: budgeted to run at a consistently high step rate with bounded per-step latency.
Key ideas (a step-scheduling sketch follows this list):
- Chunked prefill: Break long prompts into chunks so the scheduler can interleave decode steps between prefill chunks. This reduces head-of-line blocking from single giant prompts.
- Separate budgets: Pre-allocate a fraction of tokens/sec to prefill and the rest to decode. Example: 30% prefill, 70% decode under steady-state chat workloads. Dynamically shift share based on queue backlogs.
- Priority-based scheduling: Decode steps should have higher priority once a stream is in flight; otherwise you degrade the interactive experience. You can also prioritize short prompts to reduce mean waiting time without starving long ones (shortest-expected-processing-time within fairness bounds).
- Max per-request tokens: Admission control based on maximum prompt tokens and maximum generated tokens. This protects the KV cache and prevents pathological jobs from collapsing throughput.
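To make the queue split concrete, here is a toy step scheduler. It is not vLLM's or TensorRT-LLM's actual algorithm, just a sketch of decode priority plus chunked prefill under a shared per-step token budget; all constants are illustrative:

```python
from collections import deque
from dataclasses import dataclass

STEP_TOKEN_BUDGET = 8192   # tokens one engine step can absorb (illustrative)
PREFILL_SHARE = 0.30       # fraction of each step reserved for prefill
PREFILL_CHUNK = 512        # chunk size so long prompts are interleaved, not monolithic


@dataclass
class Sequence:
    prompt_tokens: int
    prefill_done: int = 0


prefill_q: deque = deque()   # sequences still processing their prompt
decode_q: deque = deque()    # sequences generating tokens


def plan_step() -> list:
    """Build the work list for one engine step: decode first, then chunked prefill."""
    plan = []
    budget = STEP_TOKEN_BUDGET
    decode_budget = int(budget * (1.0 - PREFILL_SHARE))

    # Decode has priority: each in-flight stream advances one token per step.
    for seq in list(decode_q):
        if decode_budget <= 0:
            break
        plan.append(("decode", seq, 1))
        decode_budget -= 1
        budget -= 1

    # Spend what's left on prefill chunks so no single prompt hogs the step.
    while prefill_q and budget > 0:
        seq = prefill_q[0]
        chunk = min(PREFILL_CHUNK, seq.prompt_tokens - seq.prefill_done, budget)
        plan.append(("prefill", seq, chunk))
        seq.prefill_done += chunk
        budget -= chunk
        if seq.prefill_done >= seq.prompt_tokens:
            prefill_q.popleft()
            decode_q.append(seq)   # prompt finished; sequence starts decoding next step
    return plan
```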
Consider this simple budget controller that runs periodically (sketched in code after the list):
- Measure R_in = token_in_rate, R_out = token_out_rate.
- If prefill_queue_len grows faster than decode_queue_len, increase prefill budget within a cap; otherwise, allocate more to decode.
- Keep kv_utilization < 0.85; if breached, throttle new prefill admissions or shorten max_new_tokens.
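A minimal sketch of that controller, assuming the three signals are queryable from Prometheus at the address used in the KEDA example later; apply_budgets is a placeholder for whatever knob your serving framework actually exposes:

```python
import time

import requests

PROM_URL = "http://prometheus-operated.monitoring:9090"  # assumed; match your Prometheus service
KV_HIGH_WATERMARK = 0.85
PREFILL_SHARE_MIN, PREFILL_SHARE_MAX = 0.15, 0.50


def prom_instant(query: str) -> float:
    """Run an instant query and return a single scalar (0.0 if no series)."""
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query}, timeout=5)
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0


def apply_budgets(prefill_share: float, decode_share: float, admit_prefill: bool) -> None:
    """Placeholder: push budgets to the engine via whatever knob it exposes."""
    print(f"prefill={prefill_share:.2f} decode={decode_share:.2f} admit_prefill={admit_prefill}")


def control_loop(period_s: float = 5.0) -> None:
    prefill_share = 0.30
    prev_prefill_q = prev_decode_q = 0.0
    while True:
        prefill_q = prom_instant("sum(prefill_queue_len)")
        decode_q = prom_instant("sum(decode_queue_len)")
        kv_util = prom_instant("sum(kv_cache_bytes_used) / sum(kv_cache_bytes_capacity)")

        # Shift budget toward whichever queue is growing faster.
        if (prefill_q - prev_prefill_q) > (decode_q - prev_decode_q):
            prefill_share = min(PREFILL_SHARE_MAX, prefill_share + 0.05)
        else:
            prefill_share = max(PREFILL_SHARE_MIN, prefill_share - 0.05)

        # Protect the KV cache: pause new prefill admissions when it runs hot.
        apply_budgets(prefill_share, 1.0 - prefill_share, admit_prefill=kv_util < KV_HIGH_WATERMARK)

        prev_prefill_q, prev_decode_q = prefill_q, decode_q
        time.sleep(period_s)
```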
Framework notes
- vLLM: Implements continuous batching and a scheduler that can interleave prefill and decode. Newer versions support chunked prefill to avoid long stalls. Also exposes metrics that map well to token-rate autoscaling.
- TensorRT-LLM (with Triton): Provides dynamic batching and paged KV cache. You can run separate instance groups or models for different traffic classes if needed.
KV cache warmth: paging, prefix caches, and avoiding cold starts
The KV cache is the working set of attention keys and values for each active sequence. Efficient KV management is the difference between a server that hums and one that thrashes.
What matters in 2025
- Paged KV cache: Avoids large contiguous allocations by paging KV blocks, enabling higher concurrency and reduced fragmentation. vLLM pioneered this with PagedAttention.
- Prefix cache: When many requests share the same system prompt or common prefix (e.g., RAG templates), caching the prefix KV blocks prevents recomputation and shortens prefill. Warm these prefixes proactively at startup.
- Avoid cold process starts: Loading weights is expensive (GBs). Prefer long-lived pods, pre-warm them with synthetic prompts, and stagger rollouts.
- Eviction policy: Favor evicting least-progress decode sequences or low-priority prefill chunks when the cache is under pressure. If your framework lets you configure this, align it with your SLOs.
A simple warm-up routine after a pod becomes Ready:
- Load the model and adapters.
- Run a small suite of representative prompts that cover typical context lengths and templates.
- Insert frequent system prompts into the prefix cache.
- Reserve a small KV slice as a warm buffer to stabilize latency when traffic arrives.
Example conceptual Python for warmup (adjust for your server framework):
```python
import time

import requests

# Endpoint and payload shape depend on your serving framework; adjust accordingly.
SERVER_URL = "http://127.0.0.1:8000/generate"

WARM_PROMPTS = [
    {"input": "You are a helpful assistant.\n\nUser: Hello\nAssistant:", "max_tokens": 8},
    {"input": "System: You are a coding assistant.\nUser: Explain quicksort in Python.", "max_tokens": 16},
]

# One pass over representative prompts to load kernels and populate the prefix cache.
for p in WARM_PROMPTS:
    try:
        requests.post(SERVER_URL, json=p, timeout=10)
    except Exception:
        pass  # server may still be initializing

# Periodic pings to keep the common prefix hot.
while True:
    try:
        requests.post(SERVER_URL, json=WARM_PROMPTS[0], timeout=5)
    except Exception:
        pass
    time.sleep(30)
```
Batching and GPU multiplexing: continuous batching, MPS, and fairness
Batching is mandatory for throughput, but naïve batching can destroy tail latency. The state of the art is continuous batching: interleave many sequences so that each decode step advances multiple requests together.
Best practices
- Cap batch queueing delay: For interactivity, cap the time a request waits before entering the batch (e.g., 2–5 ms). This keeps p95 stable while still gaining from batching (see the dispatch sketch after this list).
- Use micro-batching for prefill: Combine a small number of prefill chunks to keep tensor cores busy without starving decode steps.
- Multiplex carefully: Multiple processes can share a GPU via MPS, but internal schedulers (vLLM/TensorRT-LLM) already multiplex sequences efficiently. If you use MPS, treat it as a way to colocate light models or isolate tenants, not as a path to higher single-model peak throughput.
- Monitor batch efficiency: Track how full your step batches are. If batch_merge_efficiency drops, consider allowing slightly more queue delay or increasing concurrency.
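The queue-delay cap is the one batching knob worth internalizing. Serving engines implement it for you, but the logic is simple enough to sketch; the constants are illustrative:

```python
import time
from queue import Empty, Queue

MAX_BATCH_SIZE = 32
MAX_QUEUE_DELAY_S = 0.003   # ~3 ms cap on how long a request waits for batching


def collect_batch(request_q: Queue) -> list:
    """Block for the first request, then top up the batch until either the
    size cap or the queue-delay cap is hit, whichever comes first."""
    batch = [request_q.get()]          # wait for at least one request
    deadline = time.monotonic() + MAX_QUEUE_DELAY_S
    while len(batch) < MAX_BATCH_SIZE:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(request_q.get(timeout=remaining))
        except Empty:
            break
    return batch
```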
MPS vs MIG
- MPS (Multi-Process Service): Shares a single GPU among processes with fine-grained scheduling. Pros: flexible, easy to enable, good for mixing tiny services. Cons: noisy-neighbor risk and shared global memory/KV pressure; can make latency less predictable.
- MIG (Multi-Instance GPU): Hardware partitioning on A100/H100, creating isolated GPU instances with dedicated memory and compute slices. Pros: stronger isolation, predictable latency, better for multi-tenant. Cons: less flexible partition sizes; model must fit within the MIG slice.
Right-sizing GPUs with MIG/MPS and matching to model classes
Map models and workloads to the right GPU profile:
- Small models (e.g., 3–8B) with 4–8-bit weights: Fit on smaller MIG slices like 1g.10gb or 2g.20gb with modest concurrency. Good for ultra-low-latency endpoints and cost-efficient A/B tests.
- Medium models (e.g., 13–34B): Often need full-GPU or larger MIG profiles (4g.40gb+). Decode throughput benefits more from batching; consider one instance per full GPU.
- MoE models (e.g., 8x7B): Memory footprint can fit, but router overhead and active expert KV sets complicate caching; measure carefully. Often benefit from TensorRT-LLM optimizations.
Rules of thumb (validate with benchmarks):
- Ensure model weights + a runtime overhead margin (activations, workspace, CUDA graphs) + expected KV for all active tokens < available GPU memory (or MIG slice). Per-token KV scales roughly with number of layers × KV heads × head dim × 2 (K and V) × bytes per element (hidden size for models without grouped-query attention), and total KV grows with sequence length and batch/beam size (see the sizing sketch after this list).
- Favor MIG when you need isolation or run heterogeneous tenants. Favor full-GPU without MIG for maximum single-model throughput.
- If you need to pack multiple lightweight services on one GPU, consider MPS with strict admission control and per-process token budgets.
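A back-of-the-envelope sizing sketch; the model shape below matches a Llama-3-8B-class configuration (32 layers, 8 KV heads, head dim 128) with FP16 KV, but treat the numbers as assumptions to replace with your model's config:

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """Per-token KV footprint: K and V for every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def fits(weight_bytes: int, overhead_bytes: int, max_resident_tokens: int,
         per_token_kv: int, device_bytes: int) -> bool:
    """Rough go/no-go: weights + runtime overhead + worst-case KV must fit."""
    return weight_bytes + overhead_bytes + max_resident_tokens * per_token_kv <= device_bytes


# Llama-3-8B-class shape: 32 layers, 8 KV heads, head dim 128, FP16 KV.
per_token = kv_bytes_per_token(32, 8, 128)   # 131072 bytes ≈ 128 KiB per token
print(per_token / 1024, "KiB/token")
# Multiply by your maximum concurrent resident tokens (sum of active sequence
# lengths) to budget the KV pool for a given GPU or MIG slice.
```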
Kubernetes patterns with vLLM and TensorRT-LLM
We’ll cover metrics collection, HPA/KEDA autoscaling, GPU selection with MIG, and routing by traffic class. Treat the following as templates; adapt flags and resource names to your environment and framework versions.
Expose token-aware metrics
- Scrape server metrics via Prometheus. Ensure your LLM server exports:
  - tokens_in_total, tokens_out_total
  - prefill_queue_len, decode_queue_len
  - kv_cache_bytes_used, kv_cache_bytes_capacity
  - kv_cache_evictions_total
Prometheus recording rules (example):
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: llm-token-rates
  namespace: llm
spec:
  groups:
  - name: llm
    rules:
    - record: llm:token_in_rate:1m
      expr: sum(rate(tokens_in_total[1m])) by (deployment)
    - record: llm:token_out_rate:1m
      expr: sum(rate(tokens_out_total[1m])) by (deployment)
    - record: llm:kv_utilization
      expr: sum(kv_cache_bytes_used) by (deployment) / sum(kv_cache_bytes_capacity) by (deployment)
```
Horizontal Pod Autoscaler (HPA) with custom metrics
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-decode-hpa
  namespace: llm
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-decode
  minReplicas: 1
  maxReplicas: 40
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Percent
        value: 200
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
  metrics:
  - type: Pods
    pods:
      metric:
        name: llm_token_out_rate
      target:
        type: AverageValue
        averageValue: "8000"  # tokens/sec per pod target
  - type: Pods
    pods:
      metric:
        name: llm_kv_utilization
      target:
        type: AverageValue
        averageValue: "0.75"  # keep KV around 75%
```
If your cluster doesn’t expose custom metrics via the Kubernetes API, KEDA is simpler.
KEDA ScaledObject with Prometheus
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llm-decode-scaledobject
  namespace: llm
spec:
  scaleTargetRef:
    name: llm-decode
  minReplicaCount: 1
  maxReplicaCount: 40
  cooldownPeriod: 300
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus-operated.monitoring:9090
      query: sum(rate(tokens_out_total[30s])) by (deployment)
      threshold: "8000"  # tokens/sec per pod target
  - type: prometheus
    metadata:
      serverAddress: http://prometheus-operated.monitoring:9090
      query: sum(kv_cache_bytes_used) by (deployment) / sum(kv_cache_bytes_capacity) by (deployment)
      threshold: "0.85"  # throttle scale down if KV is hot
```
Deployments: separate decode vs prefill budgets inside one server
Prefer one server per GPU that internally splits prefill/decode queues with budgets. If your framework exposes knobs, set them explicitly.
Example vLLM deployment with MIG
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-decode
  namespace: llm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llm-decode
  template:
    metadata:
      labels:
        app: llm-decode
    spec:
      nodeSelector:
        nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB
        nvidia.com/mig.strategy: single
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      containers:
      - name: server
        image: yourrepo/vllm:stable
        args:
        - serve
        - /models/llama-3-8b
        - --port=8000
        - --tensor-parallel-size=1
        - --gpu-memory-utilization=0.9
        - --max-num-seqs=512
        env:  # conceptual knobs; names vary by version
        - name: VLLM_PREFILL_TOKEN_BUDGET
          value: "0.3"  # 30% to prefill
        - name: VLLM_DECODE_TOKEN_BUDGET
          value: "0.7"
        - name: VLLM_CHUNKED_PREFILL
          value: "true"
        - name: VLLM_MAX_QUEUE_DELAY_MS
          value: "3"
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/mig-2g.20gb: "1"
            cpu: "4"
            memory: 16Gi
          requests:
            nvidia.com/mig-2g.20gb: "1"
            cpu: "2"
            memory: 8Gi
        volumeMounts:
        - name: model
          mountPath: /models
      volumes:
      - name: model
        persistentVolumeClaim:
          claimName: models-pvc
```
TensorRT-LLM with Triton (dynamic batching and instance groups)
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: trtllm
  namespace: llm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: trtllm
  template:
    metadata:
      labels:
        app: trtllm
    spec:
      containers:
      - name: triton
        image: nvcr.io/nvidia/tritonserver:24.06-py3
        args: ["tritonserver", "--model-repository=/models", "--http-port=8000", "--metrics-port=8002"]
        ports:
        - containerPort: 8000
        - containerPort: 8002
        resources:
          limits:
            nvidia.com/gpu: "1"
            cpu: "8"
            memory: 32Gi
        volumeMounts:
        - name: model
          mountPath: /models
      volumes:
      - name: model
        persistentVolumeClaim:
          claimName: triton-models
```
Triton model configuration (model repository):
```text
# models/llama/config.pbtxt
name: "llama"
platform: "tensorrt_llm"
max_batch_size: 128
instance_group {
  kind: KIND_GPU
  count: 1
}
dynamic_batching {
  preferred_batch_size: [ 1, 2, 4, 8, 16, 32 ]
  max_queue_delay_microseconds: 3000
}
# You would also configure scheduler priorities or separate models
# for prefill-focused and decode-focused traffic if applicable.
```
Routing by traffic class
It’s often beneficial to route short prompts and interactive chat to one pool and long-context batch jobs to another. Use two Deployments with identical weights but different policies.
- llm-interactive: Higher decode budget, smaller max queue delay, tighter p95 SLOs.
- llm-batch: Larger prefill allowance, larger max queue delay, higher throughput.
Example Envoy/Ingress concept: route by header X-Traffic-Class: interactive or batch.
GPU node pools and warm pools
- Create separate node groups for full GPUs and for MIG-enabled nodes.
- Keep a warm pool of pre-provisioned nodes (e.g., 1–2 per group) to avoid 2–6 minute cold node bring-up.
- Use PodDisruptionBudgets and RollingUpdate with maxSurge to avoid taking all capacity down during deploys.
Example PDB:
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: llm-pdb
  namespace: llm
spec:
  minAvailable: 80%
  selector:
    matchLabels:
      app: llm-decode
```
Admission control and backpressure
- Reject or defer requests when kv_utilization > 0.9.
- Clamp max_new_tokens dynamically when prefill_queue_len is large.
- Soft-throttle with Retry-After so clients can back off and retry without overloading the service (a minimal admission sketch follows).
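A minimal admission sketch combining these three rules; the thresholds and clamp value are illustrative, and the status codes assume your gateway translates them into Retry-After responses:

```python
MAX_PROMPT_TOKENS = 32_000
KV_REJECT_THRESHOLD = 0.90
PREFILL_BACKLOG_THRESHOLD = 64


def admit(prompt_tokens: int, requested_new_tokens: int,
          kv_utilization: float, prefill_queue_len: int) -> tuple:
    """Return (http_status, allowed_new_tokens). 200 = admit, 429 = retry later."""
    if prompt_tokens > MAX_PROMPT_TOKENS:
        return 413, 0                      # request can never fit the KV budget
    if kv_utilization > KV_REJECT_THRESHOLD:
        return 429, 0                      # soft-throttle; client should honor Retry-After
    if prefill_queue_len > PREFILL_BACKLOG_THRESHOLD:
        # Under backlog, clamp generation length instead of rejecting outright.
        return 200, min(requested_new_tokens, 256)
    return 200, requested_new_tokens
```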
Capacity planning and SLO-driven autoscaling formulas
You need two numbers per deployment:
- C_in: sustainable prefill token capacity per pod (tokens/sec) at your p95 target.
- C_out: sustainable decode token capacity per pod (tokens/sec) at your p95 target.
Measure them under realistic mixes and sequence lengths. Then compute:
- N_in = ceil(R_in / (C_in × U_target))
- N_out = ceil(R_out / (C_out × U_target))
- N = max(N_in, N_out)
Where U_target = 0.7–0.8 to preserve headroom. Add:
- Stabilization windows: 3–5 minutes for scale-down to avoid flapping.
- Burst scale-ups: allow 2–3x per minute growth.
- Safety clamp: Never scale down if kv_utilization > 0.75 or prefill_queue_len is nontrivial.
SLO coupling
- If p95 latency > budget and queues are non-empty, scale out regardless of token-rate targets.
- If p95 is healthy but utilization is high, consider accepting more queue delay for efficiency. A decision-function sketch combining these rules follows.
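A sketch of the full decision function, folding the capacity formula, the safety clamps, and the SLO override into one place; the burst cap and thresholds are illustrative:

```python
import math


def scale_decision(r_in: float, r_out: float, c_in: float, c_out: float,
                   current_replicas: int, kv_utilization: float, queue_len: int,
                   p95_ms: float, p95_budget_ms: float,
                   u_target: float = 0.75) -> int:
    """Combine the token-rate formula, safety clamps, and SLO coupling."""
    n_in = math.ceil(r_in / (c_in * u_target))
    n_out = math.ceil(r_out / (c_out * u_target))
    desired = max(n_in, n_out, 1)

    # SLO coupling: if p95 is over budget and work is queued, scale out regardless.
    if p95_ms > p95_budget_ms and queue_len > 0:
        desired = max(desired, current_replicas + 1)

    # Safety clamp: never scale down while the KV cache is hot or queues are non-empty.
    if desired < current_replicas and (kv_utilization > 0.75 or queue_len > 0):
        desired = current_replicas

    # Burst limit: allow at most ~3x growth per evaluation period.
    return max(1, min(desired, current_replicas * 3))
```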
Guardrails, admission control, and cost governance
- Token budgets per tenant: Enforce max tokens per minute per API key. This caps cost and protects capacity (a token-bucket sketch follows this list).
- Prompt truncation policy: Truncate or refuse prompts that would exceed the KV budget under current load. Communicate limits explicitly to clients.
- Expected token estimator: Before admission, estimate max prefill and decode tokens for a request. If the estimate breaches budgets, push to a batch queue or reject.
- Timeouts and partial results: Stream partial tokens early, but cancel sequences that violate hard time budgets.
- Observability: Dashboards for token rates, KV utilization, batching efficiency, and p95 latencies. Alert on kv_cache_evictions_total increasing or batch efficiency dropping.
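Per-tenant enforcement is usually a token bucket denominated in LLM tokens rather than requests. A minimal sketch, with the per-key limits as placeholder values:

```python
import time


class TokenBucket:
    """Per-tenant budget expressed in LLM tokens per minute (illustrative)."""

    def __init__(self, tokens_per_minute: float, burst_capacity: float):
        self.rate = tokens_per_minute / 60.0     # tokens per second
        self.capacity = burst_capacity
        self.level = burst_capacity              # start full
        self.last_refill = time.monotonic()

    def allow(self, requested_tokens: float) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, up to the bucket capacity.
        self.level = min(self.capacity, self.level + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if requested_tokens <= self.level:
            self.level -= requested_tokens
            return True
        return False                             # caller should return 429 + Retry-After


# One bucket per API key; estimate tokens as prompt tokens + max_new_tokens.
buckets: dict = {}


def check_tenant(api_key: str, estimated_tokens: int) -> bool:
    bucket = buckets.setdefault(api_key, TokenBucket(tokens_per_minute=60_000, burst_capacity=20_000))
    return bucket.allow(estimated_tokens)
```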
Practical checklist
- Metrics
- Export tokens_in_total, tokens_out_total, queue lengths, KV usage, evictions, batch efficiency.
- Record token_in_rate and token_out_rate in Prometheus.
- Scheduling
- Separate prefill and decode queues with distinct budgets.
- Enable chunked prefill to reduce head-of-line blocking.
- Cap max queue delay for interactivity.
- Autoscaling
- HPA/KEDA based on token rates and KV utilization.
- Stabilization windows and asymmetric scale up/down policies.
- Guardrails: don’t scale down if KV hot or queues non-empty.
- KV cache
- Use paged KV cache; monitor evictions.
- Warm common prefixes at startup; periodic pings to keep warm.
- Admission control when KV usage > 85–90%.
- GPU multiplexing
- Batch aggressively with small queue delay.
- Use MIG for isolation; MPS for light colocation.
- Validate fairness; avoid starving decode.
- Kubernetes
- Separate node pools for MIG/full GPUs; keep warm nodes.
- PDBs and rolling updates with surge.
- Routing by traffic class (interactive vs batch).
- Governance
- Per-tenant token-rate limits.
- Max prompt and completion tokens; truncation policy.
- Cost per token dashboards.
Notes on frameworks and references
- vLLM introduced PagedAttention and continuous batching; it exposes token-centric metrics and supports chunked prefill in recent releases. It’s a solid default for Pythonic stacks.
- TensorRT-LLM combines engine-optimized kernels with Triton’s dynamic batching and is strong on H100/A100 for production. Use its paged KV cache features and tune dynamic_batching.
- FlashAttention reduces attention memory traffic and is widely adopted under the hood; you benefit indirectly via frameworks that integrate it.
Final take
Autoscaling LLMs by RPS is like provisioning a data warehouse by “queries per second.” It looks neat on a dashboard and fails you in production. In 2025, the winning strategy is token-aware, phase-aware, and cache-aware:
- Scale on token rates, not requests.
- Protect decode latency with separate budgets and chunked prefill.
- Keep the KV cache warm and below critical utilization.
- Batch and multiplex, but with fairness and queue-delay caps.
- Right-size GPUs using MIG/MPS to match model and tenant profiles.
- Encode all of it in Kubernetes with HPA/KEDA using Prometheus metrics, plus sane routing and admission control.
Do that, and your p95s will hold steady, your GPUs will stay busy for the right reasons, and your cost curve won’t surprise you the next time your users paste in a 90k-token context.