LLM systems don’t behave like typical web services. They’re bursty, heavy-tailed, and their resource consumption is driven by tokens, not requests. In 2025, treating a chat endpoint like a classic HTTP microservice will cost you in both latency and money.
This article makes a case for token-aware autoscaling, dives into KV cache warmth and GPU multiplexing, and then gets hands-on with Kubernetes patterns using vLLM and TensorRT-LLM. If you’ve ever watched p95 latency spike as your model shifts from short prompts to long context loads, or if your GPU sits half idle due to poor batching, this is for you.
Contents
- Why RPS-based autoscaling fails for LLMs
- Token-aware metrics: the only autoscaling signal that makes sense
- Splitting prefill vs decode: separate queues, budgets, and fairness
- KV cache warmth: paging, prefix caches, and avoiding cold starts
- Batching and GPU multiplexing: continuous batching, MPS, and fairness
- Right-sizing GPUs with MIG/MPS and matching to model classes
- Kubernetes patterns with vLLM and TensorRT-LLM
- Capacity planning and SLO-driven autoscaling formulas
- Guardrails, admission control, and cost governance
- A practical checklist
Why RPS-based autoscaling fails for LLMs
- Token cost dominates: Two requests can have identical RPS but wildly different token footprints. A 200-token prompt with a 64-token completion is nothing like a 100k-token retrieval-augmented prompt with a 1k-token completion. RPS doesn’t capture this.
- Long-tailed generation: Generation lengths and prompt sizes follow heavy-tailed distributions. p95/p99 latencies are driven by the fat tail, not the mean. Scale based on RPS and you’ll under-provision exactly when you need capacity.
- Phase shift within one request: LLM inference has distinct phases. Prefill (processing the prompt) is compute- and bandwidth-heavy in attention layers, while decode (token-by-token generation) is lighter per step but long-lived. Autoscaling must consider both phases to avoid head-of-line blocking.
- Streaming masks real load: Streaming responses deliver partial tokens early, but the GPU is occupied for the entire generation. RPS-based scale-out can keep time-to-first-token looking healthy while the decode phase is still starved.
- KV cache as the bottleneck: Many modern servers (e.g., vLLM with PagedAttention) are KV-cache-bound rather than pure FLOP-bound. RPS doesn’t reflect when the KV cache is nearing exhaustion, causing evictions and sharp latency cliffs.
The correct unit for LLM capacity is tokens/sec (both in and out). You need phase-aware token budgets, not just requests/sec.
Token-aware metrics: the only autoscaling signal that makes sense
Define clear metrics. At a minimum (a minimal exporter sketch follows the list):
- tokens_in_total: Cumulative number of prompt (prefill) tokens processed.
- tokens_out_total: Cumulative number of decode (generated) tokens produced.
- active_sequences: Number of sequences currently resident on the GPU.
- prefill_queue_len, decode_queue_len: Queue depths by phase.
- kv_cache_bytes_used, kv_cache_bytes_capacity: KV footprint and headroom.
- kv_cache_evictions_total: Evictions—an early warning for impending latency cliffs.
- batch_merge_efficiency: Percent of batch slots filled when stepping.
- latency_p95_decode_step_ms, latency_p95_prefill_ms: Phase-specific latency.
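If your server doesn't already export these (vLLM and Triton expose many of them natively, under framework-specific names), a thin in-process or sidecar exporter is enough. A minimal sketch using prometheus_client, assuming you can hook your engine's admission and step callbacks; the record_* helpers are hypothetical:

```python
from prometheus_client import Counter, Gauge, start_http_server

# Counters for cumulative token throughput, split by phase.
TOKENS_IN = Counter("tokens_in_total", "Prompt (prefill) tokens processed")
TOKENS_OUT = Counter("tokens_out_total", "Generated (decode) tokens produced")

# Gauges for instantaneous state the autoscaler cares about.
ACTIVE_SEQUENCES = Gauge("active_sequences", "Sequences resident on the GPU")
PREFILL_QUEUE = Gauge("prefill_queue_len", "Requests waiting for prefill")
DECODE_QUEUE = Gauge("decode_queue_len", "Sequences waiting for a decode step")
KV_USED = Gauge("kv_cache_bytes_used", "KV cache bytes in use")
KV_CAPACITY = Gauge("kv_cache_bytes_capacity", "KV cache bytes available")
KV_EVICTIONS = Counter("kv_cache_evictions_total", "KV cache evictions")


def record_prefill(num_prompt_tokens: int) -> None:
    """Call once per admitted request, after tokenization."""
    TOKENS_IN.inc(num_prompt_tokens)


def record_decode_step(num_sequences_advanced: int) -> None:
    """Call once per engine step; each advanced sequence emits one token."""
    TOKENS_OUT.inc(num_sequences_advanced)


if __name__ == "__main__":
    start_http_server(9400)  # scrape target for Prometheus
```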
From these counters you can derive rate metrics:
- token_in_rate = rate(tokens_in_total[1m])
- token_out_rate = rate(tokens_out_total[1m])
- kv_utilization = kv_cache_bytes_used / kv_cache_bytes_capacity
Capacity is pod- and model-dependent. Empirically benchmark tokens/sec per GPU and per instance configuration. For example, you might observe on a single H100:
- Llama-3 8B FP8, vLLM: ~30–40k tok/s prefill burst, ~7–10k tok/s steady decode (numbers vary widely with batch size, sequence length, and quantization).
- Mixtral 8x7B INT4, TensorRT-LLM: lower prefill throughput (MoE routing overhead), higher decode throughput under good batching.
Once you have per-pod capacity targets, autoscaling becomes straightforward:
- desired_pod_count_prefill = ceil(token_in_rate / target_tok_in_per_pod)
- desired_pod_count_decode = ceil(token_out_rate / target_tok_out_per_pod)
Aggregate both with headroom and stability windows. Aim to keep utilization at 60–80% so the scheduler can absorb traffic spikes without tail-latency blowups.
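For a quick offline sanity check of these formulas (the HPA/KEDA configs later do the same thing in-cluster), a minimal sketch with illustrative per-pod targets:

```python
import math


def desired_replicas(token_in_rate: float, token_out_rate: float,
                     target_tok_in_per_pod: float, target_tok_out_per_pod: float,
                     utilization_target: float = 0.7,
                     min_replicas: int = 1, max_replicas: int = 40) -> int:
    """Replica count implied by phase-specific token rates, with headroom."""
    need_prefill = math.ceil(token_in_rate / (target_tok_in_per_pod * utilization_target))
    need_decode = math.ceil(token_out_rate / (target_tok_out_per_pod * utilization_target))
    return max(min_replicas, min(max_replicas, max(need_prefill, need_decode)))


# Example: 250k prefill tok/s and 40k decode tok/s against illustrative
# per-pod targets of 35k in / 8k out at 70% utilization -> 11 replicas.
print(desired_replicas(250_000, 40_000, 35_000, 8_000))
```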
Splitting prefill vs decode: separate queues, budgets, and fairness
Prefill and decode compete for the same GPU but demand different resource shapes. If you let long prefill jobs (e.g., huge contexts) monopolize the device, short decodes stall and tail latencies explode for everyone streaming replies.
The fix is to split scheduling into at least two queues:
- Prefill queue: admission limited by a cap on prompt tokens in-flight.
- Decode queue: budgeted to run at a consistently high step rate with bounded per-step latency.
Key ideas (a step-scheduling sketch follows this list):
- Chunked prefill: Break long prompts into chunks so the scheduler can interleave decode steps between prefill chunks. This reduces head-of-line blocking from single giant prompts.
- Separate budgets: Pre-allocate a fraction of tokens/sec to prefill and the rest to decode. Example: 30% prefill, 70% decode under steady-state chat workloads. Dynamically shift share based on queue backlogs.
- Priority-based scheduling: Decode steps should have higher priority once a stream is in flight; otherwise you degrade the interactive experience. You can also prioritize short prompts to reduce mean waiting time without starving long ones (shortest-expected-processing-time within fairness bounds).
- Max per-request tokens: Admission control based on maximum prompt tokens and maximum generated tokens. This protects the KV cache and prevents pathological jobs from collapsing throughput.
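To make the queue split concrete, here is a toy step scheduler. It is not vLLM's or TensorRT-LLM's actual algorithm, just a sketch of decode priority plus chunked prefill under a shared per-step token budget; all constants are illustrative:

```python
from collections import deque
from dataclasses import dataclass

STEP_TOKEN_BUDGET = 8192   # tokens one engine step can absorb (illustrative)
PREFILL_SHARE = 0.30       # fraction of each step reserved for prefill
PREFILL_CHUNK = 512        # chunk size so long prompts are interleaved, not monolithic


@dataclass
class Sequence:
    prompt_tokens: int
    prefill_done: int = 0


prefill_q: deque = deque()   # sequences still processing their prompt
decode_q: deque = deque()    # sequences generating tokens


def plan_step() -> list:
    """Build the work list for one engine step: decode first, then chunked prefill."""
    plan = []
    budget = STEP_TOKEN_BUDGET
    decode_budget = int(budget * (1.0 - PREFILL_SHARE))

    # Decode has priority: each in-flight stream advances one token per step.
    for seq in list(decode_q):
        if decode_budget <= 0:
            break
        plan.append(("decode", seq, 1))
        decode_budget -= 1
        budget -= 1

    # Spend what's left on prefill chunks so no single prompt hogs the step.
    while prefill_q and budget > 0:
        seq = prefill_q[0]
        chunk = min(PREFILL_CHUNK, seq.prompt_tokens - seq.prefill_done, budget)
        plan.append(("prefill", seq, chunk))
        seq.prefill_done += chunk
        budget -= chunk
        if seq.prefill_done >= seq.prompt_tokens:
            prefill_q.popleft()
            decode_q.append(seq)   # prompt finished; sequence starts decoding next step
    return plan
```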
Consider this simple budget controller that runs periodically (sketched in code after the list):
- Measure R_in = token_in_rate, R_out = token_out_rate.
- If prefill_queue_len grows faster than decode_queue_len, increase prefill budget within a cap; otherwise, allocate more to decode.
- Keep kv_utilization < 0.85; if breached, throttle new prefill admissions or shorten max_new_tokens.
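A minimal sketch of that controller, assuming the three signals are queryable from Prometheus at the address used in the KEDA example later; apply_budgets is a placeholder for whatever knob your serving framework actually exposes:

```python
import time

import requests

PROM_URL = "http://prometheus-operated.monitoring:9090"  # assumed; match your Prometheus service
KV_HIGH_WATERMARK = 0.85
PREFILL_SHARE_MIN, PREFILL_SHARE_MAX = 0.15, 0.50


def prom_instant(query: str) -> float:
    """Run an instant query and return a single scalar (0.0 if no series)."""
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query}, timeout=5)
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0


def apply_budgets(prefill_share: float, decode_share: float, admit_prefill: bool) -> None:
    """Placeholder: push budgets to the engine via whatever knob it exposes."""
    print(f"prefill={prefill_share:.2f} decode={decode_share:.2f} admit_prefill={admit_prefill}")


def control_loop(period_s: float = 5.0) -> None:
    prefill_share = 0.30
    prev_prefill_q = prev_decode_q = 0.0
    while True:
        prefill_q = prom_instant("sum(prefill_queue_len)")
        decode_q = prom_instant("sum(decode_queue_len)")
        kv_util = prom_instant("sum(kv_cache_bytes_used) / sum(kv_cache_bytes_capacity)")

        # Shift budget toward whichever queue is growing faster.
        if (prefill_q - prev_prefill_q) > (decode_q - prev_decode_q):
            prefill_share = min(PREFILL_SHARE_MAX, prefill_share + 0.05)
        else:
            prefill_share = max(PREFILL_SHARE_MIN, prefill_share - 0.05)

        # Protect the KV cache: pause new prefill admissions when it runs hot.
        apply_budgets(prefill_share, 1.0 - prefill_share, admit_prefill=kv_util < KV_HIGH_WATERMARK)

        prev_prefill_q, prev_decode_q = prefill_q, decode_q
        time.sleep(period_s)
```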
Framework notes
- vLLM: Implements continuous batching and a scheduler that can interleave prefill and decode. Newer versions support chunked prefill to avoid long stalls. Also exposes metrics that map well to token-rate autoscaling.
- TensorRT-LLM (with Triton): Provides dynamic batching and paged KV cache. You can run separate instance groups or models for different traffic classes if needed.
KV cache warmth: paging, prefix caches, and avoiding cold starts
The KV cache is the working set of attention keys and values for each active sequence. Efficient KV management is the difference between a server that hums and one that thrashes.
What matters in 2025
- Paged KV cache: Avoids large contiguous allocations by paging KV blocks, enabling higher concurrency and reduced fragmentation. vLLM pioneered this with PagedAttention.
- Prefix cache: When many requests share the same system prompt or common prefix (e.g., RAG templates), caching the prefix KV blocks prevents recomputation and shortens prefill. Warm these prefixes proactively at startup.
- Avoid cold process starts: Loading weights is expensive (GBs). Prefer long-lived pods, pre-warm them with synthetic prompts, and stagger rollouts.
- Eviction policy: Favor evicting least-progress decode sequences or low-priority prefill chunks when the cache is under pressure. If your framework lets you configure this, align it with your SLOs.
A simple warm-up routine after a pod becomes Ready:
- Load the model and adapters.
- Run a small suite of representative prompts that cover typical context lengths and templates.
- Insert frequent system prompts into the prefix cache.
- Reserve a small KV slice as a warm buffer to stabilize latency when traffic arrives.
Example conceptual Python for warmup (adjust for your server framework):
```python
import time

import requests

# Endpoint and payload shape depend on your serving framework; adjust accordingly.
SERVER_URL = "http://127.0.0.1:8000/generate"

WARM_PROMPTS = [
    {"input": "You are a helpful assistant.\n\nUser: Hello\nAssistant:", "max_tokens": 8},
    {"input": "System: You are a coding assistant.\nUser: Explain quicksort in Python.", "max_tokens": 16},
]

# One pass over representative prompts to load kernels and populate the prefix cache.
for p in WARM_PROMPTS:
    try:
        requests.post(SERVER_URL, json=p, timeout=10)
    except Exception:
        pass  # server may still be initializing

# Periodic pings to keep the common prefix hot.
while True:
    try:
        requests.post(SERVER_URL, json=WARM_PROMPTS[0], timeout=5)
    except Exception:
        pass
    time.sleep(30)
```
Batching and GPU multiplexing: continuous batching, MPS, and fairness
Batching is mandatory for throughput, but naïve batching can destroy tail latency. The state of the art is continuous batching: interleave many sequences so that each decode step advances multiple requests together.
Best practices
- Cap batch queueing delay: For interactivity, cap the time a request waits before entering the batch (e.g., 2–5 ms). This keeps p95 stable while still gaining from batching (see the dispatch sketch after this list).
- Use micro-batching for prefill: Combine a small number of prefill chunks to keep tensor cores busy without starving decode steps.
- Multiplex carefully: Multiple processes can share a GPU via MPS, but internal schedulers (vLLM/TensorRT-LLM) already multiplex sequences efficiently. If you use MPS, treat it as a way to colocate light models or isolate tenants, not as a path to higher single-model peak throughput.
- Monitor batch efficiency: Track how full your step batches are. If batch_merge_efficiency drops, consider allowing slightly more queue delay or increasing concurrency.
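The queue-delay cap is the one batching knob worth internalizing. Serving engines implement it for you, but the logic is simple enough to sketch; the constants are illustrative:

```python
import time
from queue import Empty, Queue

MAX_BATCH_SIZE = 32
MAX_QUEUE_DELAY_S = 0.003   # ~3 ms cap on how long a request waits for batching


def collect_batch(request_q: Queue) -> list:
    """Block for the first request, then top up the batch until either the
    size cap or the queue-delay cap is hit, whichever comes first."""
    batch = [request_q.get()]          # wait for at least one request
    deadline = time.monotonic() + MAX_QUEUE_DELAY_S
    while len(batch) < MAX_BATCH_SIZE:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(request_q.get(timeout=remaining))
        except Empty:
            break
    return batch
```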
MPS vs MIG
- MPS (Multi-Process Service): Shares a single GPU among processes with fine-grained scheduling. Pros: flexible, easy to enable, good for mixing tiny services. Cons: noisy-neighbor risk and shared global memory/KV pressure; can make latency less predictable.
- MIG (Multi-Instance GPU): Hardware partitioning on A100/H100, creating isolated GPU instances with dedicated memory and compute slices. Pros: stronger isolation, predictable latency, better for multi-tenant. Cons: less flexible partition sizes; model must fit within the MIG slice.
Right-sizing GPUs with MIG/MPS and matching to model classes
Map models and workloads to the right GPU profile:
- Small models (e.g., 3–8B) with 4–8-bit weights: Fit on smaller MIG slices like 1g.10gb or 2g.20gb with modest concurrency. Good for ultra-low-latency endpoints and cost-efficient A/B tests.
- Medium models (e.g., 13–34B): Often need full-GPU or larger MIG profiles (4g.40gb+). Decode throughput benefits more from batching; consider one instance per full GPU.
- MoE models (e.g., 8x7B): Memory footprint can fit, but router overhead and active expert KV sets complicate caching; measure carefully. Often benefit from TensorRT-LLM optimizations.
Rules of thumb (validate with benchmarks):
- Ensure model weights + a runtime overhead margin (activations, workspace, CUDA graphs) + expected KV for all active tokens < available GPU memory (or MIG slice). Per-token KV scales roughly with number of layers × KV heads × head dim × 2 (K and V) × bytes per element (hidden size for models without grouped-query attention), and total KV grows with sequence length and batch/beam size (see the sizing sketch after this list).
- Favor MIG when you need isolation or run heterogeneous tenants. Favor full-GPU without MIG for maximum single-model throughput.
- If you need to pack multiple lightweight services on one GPU, consider MPS with strict admission control and per-process token budgets.
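A back-of-the-envelope sizing sketch; the model shape below matches a Llama-3-8B-class configuration (32 layers, 8 KV heads, head dim 128) with FP16 KV, but treat the numbers as assumptions to replace with your model's config:

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """Per-token KV footprint: K and V for every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def fits(weight_bytes: int, overhead_bytes: int, max_resident_tokens: int,
         per_token_kv: int, device_bytes: int) -> bool:
    """Rough go/no-go: weights + runtime overhead + worst-case KV must fit."""
    return weight_bytes + overhead_bytes + max_resident_tokens * per_token_kv <= device_bytes


# Llama-3-8B-class shape: 32 layers, 8 KV heads, head dim 128, FP16 KV.
per_token = kv_bytes_per_token(32, 8, 128)   # 131072 bytes ≈ 128 KiB per token
print(per_token / 1024, "KiB/token")
# Multiply by your maximum concurrent resident tokens (sum of active sequence
# lengths) to budget the KV pool for a given GPU or MIG slice.
```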
Kubernetes patterns with vLLM and TensorRT-LLM
We’ll cover metrics collection, HPA/KEDA autoscaling, GPU selection with MIG, and routing by traffic class. Treat the following as templates; adapt flags and resource names to your environment and framework versions.
Expose token-aware metrics
- Scrape server metrics via Prometheus. Ensure your LLM server exports:
  - tokens_in_total, tokens_out_total
  - prefill_queue_len, decode_queue_len
  - kv_cache_bytes_used, kv_cache_bytes_capacity
  - kv_cache_evictions_total
Prometheus recording rules (example):
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: llm-token-rates
  namespace: llm
spec:
  groups:
  - name: llm
    rules:
    - record: llm:token_in_rate:1m
      expr: sum(rate(tokens_in_total[1m])) by (deployment)
    - record: llm:token_out_rate:1m
      expr: sum(rate(tokens_out_total[1m])) by (deployment)
    - record: llm:kv_utilization
      expr: sum(kv_cache_bytes_used) by (deployment) / sum(kv_cache_bytes_capacity) by (deployment)
```
Horizontal Pod Autoscaler (HPA) with custom metrics
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-decode-hpa
  namespace: llm
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-decode
  minReplicas: 1
  maxReplicas: 40
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Percent
        value: 200
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
  metrics:
  - type: Pods
    pods:
      metric:
        name: llm_token_out_rate
      target:
        type: AverageValue
        averageValue: "8000"  # tokens/sec per pod target
  - type: Pods
    pods:
      metric:
        name: llm_kv_utilization
      target:
        type: AverageValue
        averageValue: "0.75"  # keep KV around 75%
```
If your cluster doesn’t expose custom metrics via the Kubernetes API, KEDA is simpler.
KEDA ScaledObject with Prometheus
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llm-decode-scaledobject
  namespace: llm
spec:
  scaleTargetRef:
    name: llm-decode
  minReplicaCount: 1
  maxReplicaCount: 40
  cooldownPeriod: 300
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus-operated.monitoring:9090
      query: sum(rate(tokens_out_total[30s])) by (deployment)
      threshold: "8000"  # tokens/sec per pod target
  - type: prometheus
    metadata:
      serverAddress: http://prometheus-operated.monitoring:9090
      query: sum(kv_cache_bytes_used) by (deployment) / sum(kv_cache_bytes_capacity) by (deployment)
      threshold: "0.85"  # throttle scale down if KV is hot
```
Deployments: separate decode vs prefill budgets inside one server
Prefer one server per GPU that internally splits prefill/decode queues with budgets. If your framework exposes knobs, set them explicitly.
Example vLLM deployment with MIG
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-decode
  namespace: llm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llm-decode
  template:
    metadata:
      labels:
        app: llm-decode
    spec:
      nodeSelector:
        nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB
        nvidia.com/mig.strategy: single
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      containers:
      - name: server
        image: yourrepo/vllm:stable
        args:
        - serve
        - /models/llama-3-8b
        - --port=8000
        - --tensor-parallel-size=1
        - --gpu-memory-utilization=0.9
        - --max-num-seqs=512
        env:  # conceptual knobs; names vary by version
        - name: VLLM_PREFILL_TOKEN_BUDGET
          value: "0.3"  # 30% to prefill
        - name: VLLM_DECODE_TOKEN_BUDGET
          value: "0.7"
        - name: VLLM_CHUNKED_PREFILL
          value: "true"
        - name: VLLM_MAX_QUEUE_DELAY_MS
          value: "3"
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/mig-2g.20gb: "1"
            cpu: "4"
            memory: 16Gi
          requests:
            nvidia.com/mig-2g.20gb: "1"
            cpu: "2"
            memory: 8Gi
        volumeMounts:
        - name: model
          mountPath: /models
      volumes:
      - name: model
        persistentVolumeClaim:
          claimName: models-pvc
```
TensorRT-LLM with Triton (dynamic batching and instance groups)
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: trtllm
  namespace: llm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: trtllm
  template:
    metadata:
      labels:
        app: trtllm
    spec:
      containers:
      - name: triton
        image: nvcr.io/nvidia/tritonserver:24.06-py3
        args: ["tritonserver", "--model-repository=/models", "--http-port=8000", "--metrics-port=8002"]
        ports:
        - containerPort: 8000
        - containerPort: 8002
        resources:
          limits:
            nvidia.com/gpu: "1"
            cpu: "8"
            memory: 32Gi
        volumeMounts:
        - name: model
          mountPath: /models
      volumes:
      - name: model
        persistentVolumeClaim:
          claimName: triton-models
```
Triton model configuration (model repository):
```text
# models/llama/config.pbtxt
name: "llama"
platform: "tensorrt_llm"
max_batch_size: 128
instance_group {
  kind: KIND_GPU
  count: 1
}
dynamic_batching {
  preferred_batch_size: [ 1, 2, 4, 8, 16, 32 ]
  max_queue_delay_microseconds: 3000
}
# You would also configure scheduler priorities or separate models
# for prefill-focused and decode-focused traffic if applicable.
```
Routing by traffic class
It’s often beneficial to route short prompts and interactive chat to one pool and long-context batch jobs to another. Use two Deployments with identical weights but different policies.
- llm-interactive: Higher decode budget, smaller max queue delay, tighter p95 SLOs.
- llm-batch: Larger prefill allowance, larger max queue delay, higher throughput.
Example Envoy/Ingress concept: route by header X-Traffic-Class: interactive or batch.
GPU node pools and warm pools
- Create separate node groups for full GPUs and for MIG-enabled nodes.
- Keep a warm pool of pre-provisioned nodes (e.g., 1–2 per group) to avoid 2–6 minute cold node bring-up.
- Use PodDisruptionBudgets and RollingUpdate with maxSurge to avoid taking all capacity down during deploys.
Example PDB:
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: llm-pdb
  namespace: llm
spec:
  minAvailable: 80%
  selector:
    matchLabels:
      app: llm-decode
```
Admission control and backpressure
- Reject or defer requests when kv_utilization > 0.9.
- Clamp max_new_tokens dynamically when prefill_queue_len is large.
- Soft-throttle with Retry-After so clients can back off and retry without overloading the service (a minimal admission sketch follows).
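A minimal admission sketch combining these three rules; the thresholds and clamp value are illustrative, and the status codes assume your gateway translates them into Retry-After responses:

```python
MAX_PROMPT_TOKENS = 32_000
KV_REJECT_THRESHOLD = 0.90
PREFILL_BACKLOG_THRESHOLD = 64


def admit(prompt_tokens: int, requested_new_tokens: int,
          kv_utilization: float, prefill_queue_len: int) -> tuple:
    """Return (http_status, allowed_new_tokens). 200 = admit, 429 = retry later."""
    if prompt_tokens > MAX_PROMPT_TOKENS:
        return 413, 0                      # request can never fit the KV budget
    if kv_utilization > KV_REJECT_THRESHOLD:
        return 429, 0                      # soft-throttle; client should honor Retry-After
    if prefill_queue_len > PREFILL_BACKLOG_THRESHOLD:
        # Under backlog, clamp generation length instead of rejecting outright.
        return 200, min(requested_new_tokens, 256)
    return 200, requested_new_tokens
```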
Capacity planning and SLO-driven autoscaling formulas
You need two numbers per deployment:
- C_in: sustainable prefill token capacity per pod (tokens/sec) at your p95 target.
- C_out: sustainable decode token capacity per pod (tokens/sec) at your p95 target.
Measure them under realistic mixes and sequence lengths. Then compute:
- N_in = ceil(R_in / (C_in × U_target))
- N_out = ceil(R_out / (C_out × U_target))
- N = max(N_in, N_out)
Where U_target = 0.7–0.8 to preserve headroom. Add:
- Stabilization windows: 3–5 minutes for scale-down to avoid flapping.
- Burst scale-ups: allow 2–3x per minute growth.
- Safety clamp: Never scale down if kv_utilization > 0.75 or prefill_queue_len is nontrivial.
SLO coupling
- If p95 latency > budget and queues are non-empty, scale out regardless of token-rate targets.
- If p95 is healthy but utilization is high, consider accepting more queue delay for efficiency. A decision-function sketch combining these rules follows.
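A sketch of the full decision function, folding the capacity formula, the safety clamps, and the SLO override into one place; the burst cap and thresholds are illustrative:

```python
import math


def scale_decision(r_in: float, r_out: float, c_in: float, c_out: float,
                   current_replicas: int, kv_utilization: float, queue_len: int,
                   p95_ms: float, p95_budget_ms: float,
                   u_target: float = 0.75) -> int:
    """Combine the token-rate formula, safety clamps, and SLO coupling."""
    n_in = math.ceil(r_in / (c_in * u_target))
    n_out = math.ceil(r_out / (c_out * u_target))
    desired = max(n_in, n_out, 1)

    # SLO coupling: if p95 is over budget and work is queued, scale out regardless.
    if p95_ms > p95_budget_ms and queue_len > 0:
        desired = max(desired, current_replicas + 1)

    # Safety clamp: never scale down while the KV cache is hot or queues are non-empty.
    if desired < current_replicas and (kv_utilization > 0.75 or queue_len > 0):
        desired = current_replicas

    # Burst limit: allow at most ~3x growth per evaluation period.
    return max(1, min(desired, current_replicas * 3))
```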
Guardrails, admission control, and cost governance
- Token budgets per tenant: Enforce max tokens per minute per API key. This caps cost and protects capacity (a token-bucket sketch follows this list).
- Prompt truncation policy: Truncate or refuse prompts that would exceed the KV budget under current load. Communicate limits explicitly to clients.
- Expected token estimator: Before admission, estimate max prefill and decode tokens for a request. If the estimate breaches budgets, push to a batch queue or reject.
- Timeouts and partial results: Stream partial tokens early, but cancel sequences that violate hard time budgets.
- Observability: Dashboards for token rates, KV utilization, batching efficiency, and p95 latencies. Alert on kv_cache_evictions_total increasing or batch efficiency dropping.
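Per-tenant enforcement is usually a token bucket denominated in LLM tokens rather than requests. A minimal sketch, with the per-key limits as placeholder values:

```python
import time


class TokenBucket:
    """Per-tenant budget expressed in LLM tokens per minute (illustrative)."""

    def __init__(self, tokens_per_minute: float, burst_capacity: float):
        self.rate = tokens_per_minute / 60.0     # tokens per second
        self.capacity = burst_capacity
        self.level = burst_capacity              # start full
        self.last_refill = time.monotonic()

    def allow(self, requested_tokens: float) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, up to the bucket capacity.
        self.level = min(self.capacity, self.level + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if requested_tokens <= self.level:
            self.level -= requested_tokens
            return True
        return False                             # caller should return 429 + Retry-After


# One bucket per API key; estimate tokens as prompt tokens + max_new_tokens.
buckets: dict = {}


def check_tenant(api_key: str, estimated_tokens: int) -> bool:
    bucket = buckets.setdefault(api_key, TokenBucket(tokens_per_minute=60_000, burst_capacity=20_000))
    return bucket.allow(estimated_tokens)
```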
Practical checklist
- Metrics
- Export tokens_in_total, tokens_out_total, queue lengths, KV usage, evictions, batch efficiency.
- Record token_in_rate and token_out_rate in Prometheus.
- Scheduling
- Separate prefill and decode queues with distinct budgets.
- Enable chunked prefill to reduce head-of-line blocking.
- Cap max queue delay for interactivity.
- Autoscaling
- HPA/KEDA based on token rates and KV utilization.
- Stabilization windows and asymmetric scale up/down policies.
- Guardrails: don’t scale down if KV hot or queues non-empty.
- KV cache
- Use paged KV cache; monitor evictions.
- Warm common prefixes at startup; periodic pings to keep warm.
- Admission control when KV usage > 85–90%.
- GPU multiplexing
- Batch aggressively with small queue delay.
- Use MIG for isolation; MPS for light colocation.
- Validate fairness; avoid starving decode.
- Kubernetes
- Separate node pools for MIG/full GPUs; keep warm nodes.
- PDBs and rolling updates with surge.
- Routing by traffic class (interactive vs batch).
- Governance
- Per-tenant token-rate limits.
- Max prompt and completion tokens; truncation policy.
- Cost per token dashboards.
Notes on frameworks and references
- vLLM introduced PagedAttention and continuous batching; it exposes token-centric metrics and supports chunked prefill in recent releases. It’s a solid default for Pythonic stacks.
- TensorRT-LLM combines engine-optimized kernels with Triton’s dynamic batching and is strong on H100/A100 for production. Use its paged KV cache features and tune dynamic_batching.
- FlashAttention reduces attention memory traffic and is widely adopted under the hood; you benefit indirectly via frameworks that integrate it.
Final take
Autoscaling LLMs by RPS is like provisioning a data warehouse by “queries per second.” It looks neat on a dashboard and fails you in production. In 2025, the winning strategy is token-aware, phase-aware, and cache-aware:
- Scale on token rates, not requests.
- Protect decode latency with separate budgets and chunked prefill.
- Keep the KV cache warm and below critical utilization.
- Batch and multiplex, but with fairness and queue-delay caps.
- Right-size GPUs using MIG/MPS to match model and tenant profiles.
- Encode all of it in Kubernetes with HPA/KEDA using Prometheus metrics, plus sane routing and admission control.
Do that, and your p95s will hold steady, your GPUs will stay busy for the right reasons, and your cost curve won’t surprise you the next time your users paste in a 90k-token context.