Kubernetes GPU Scheduling in 2025: Practical Patterns for AI Workloads with Kueue, Volcano, and MIG
AI infrastructure is GPU-bound, not pod-bound. In 2025, the clusters that ship models to production and keep R&D humming are those that schedule GPUs as a first-class resource across multi-tenant, bursty, heterogeneous fleets. This article is a pragmatic, opinionated playbook for GPU-aware Kubernetes: how to combine device plugins, gang scheduling, queues and preemption, MIG/MPS partitioning, bin-packing, spot fallback, and orchestration via Kueue, Volcano, and Ray.
If you want the nutshell: use the NVIDIA GPU Operator for device management, Kueue or Volcano (or both) for batch admission control and gang semantics, MIG for stable isolation and bin-packing small jobs, MPS for throughput on latency-tolerant inference, and a queue-first policy model to keep GPU utilization high without blowing up your tenants.
Executive summary
- Kueue gives you cluster-wide queues, quotas, cohort borrowing, and preemption that fit enterprise multi-tenant AI. Combined with the default scheduler and the NVIDIA device plugin, it provides gang-admission semantics for Jobs and common ML operators.
- Volcano is still the most mature gang scheduler with rich batch plugins (PodGroup, priorities, preemption, rescheduling). It shines for classic HPC-like AI training and complex co-scheduling.
- MIG (Multi-Instance GPU) turns a single GPU into multiple hardware-sliced instances; use it to pack small training and inference safely. MPS (Multi-Process Service) time-slices compute to boost throughput when latency and isolation constraints allow.
- Bin-packing GPUs is a business decision as much as a technical one. Aim to protect contiguous full-GPU islands for large training and fill the rest with MIG or MPS.
- Spot fallback is viable if you checkpoint aggressively and model your queue policies around interruption. Integrate with Kueue or Volcano preemption rules, taints/tolerations, and an autoscaler (CA or Karpenter).
- Ray integrates well with both Kueue and Volcano, but mind the gang semantics: the cluster must reserve the whole actor set or you invite deadlock.
Why GPU-bound beats pod-bound
Kubernetes has a pod-centric scheduler. AI workloads care about the count, type, and topology of GPUs. On their own, the default scheduler's scoring and fit predicates can't coordinate multi-GPU and multi-node placement, network topology, and tenant quotas.
You need to:
- Admit or reject entire jobs atomically (gang semantics) so a training job doesn’t get stuck with half of its workers running.
- Enforce tenant- and queue-level quotas across clusters, not just per node.
- Shape GPU instances (full, MIG, or MPS) to match the distribution of your workload footprints.
- Preempt cleanly and fairly to avoid starvation and preserve SLAs.
- Avoid GPU fragmentation so that tomorrow’s 8xH100 training run isn’t blocked by today’s 1g.10gb MIG shards scattered across your best nodes.
Building blocks: GPU device management on Kubernetes
NVIDIA GPU Operator and Device Plugin
In 2025, the NVIDIA GPU Operator remains the default for production. It installs the driver, nvidia-container-toolkit, DCGM, and the device plugin.
- Device plugin exposes GPUs as extended resources (e.g., nvidia.com/gpu) and manages allocation.
- Supports MIG: advertises profile-specific resources (e.g., nvidia.com/mig-1g.10gb) depending on strategy.
- Supports time-slicing and MPS modes to allow multiple pods per GPU with controlled concurrency.
Basic deployment hint:
```bash
helm repo add nvidia https://nvidia.github.io/gpu-operator
helm upgrade --install gpu-operator nvidia/gpu-operator -n gpu-operator --create-namespace \
  --set driver.enabled=true \
  --set toolkit.enabled=true \
  --set mig.strategy=single   # or mixed
```
Key device plugin configuration patterns:
- MIG strategies:
- none: expose full GPUs only.
- single: all GPUs on a node share one MIG profile, and instances are advertised as a uniform nvidia.com/gpu resource (each unit is one MIG slice); simple and stable (recommended for shared clusters).
- mixed: different MIG profiles can coexist per node and are advertised as profile-specific resources (e.g., nvidia.com/mig-1g.10gb); flexible but can increase fragmentation.
- Time-slicing:
- Enable for inference/experimentation to drive utilization.
- Combine with MPS to reduce context-switch overhead.
Example device plugin config via ConfigMap:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    flags:
      migStrategy: single
      nvidiaDriverRoot: /run/nvidia/driver
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4   # up to 4 pods share a single GPU (per-GPU time-slice)
```
Pods requesting GPUs:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: trainer
spec:
  containers:
    - name: main
      image: nvcr.io/nvidia/pytorch:24.08-py3
      resources:
        limits:
          nvidia.com/gpu: 4
```
With MIG configured under the mixed strategy, request profile-specific resources:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: small-trainer
spec:
  nodeSelector:
    nvidia.com/mig.capable: "true"
  containers:
    - name: main
      image: nvcr.io/nvidia/pytorch:24.08-py3
      resources:
        limits:
          nvidia.com/mig-1g.10gb: 1   # H100 80GB example profile
```
Opinion: choose a small number of cluster-wide MIG layouts and stick to them. Stability beats theoretical flexibility, because reconfiguring MIG requires draining the node and can disrupt running jobs.
Topology and CPU pinning
GPU allocation quality matters:
- Topology Manager (kubelet) can align CPU and device NUMA locality.
- NVIDIA device plugin supports preferred allocation for multi-GPU requests (e.g., same NVLink island).
Recommended kubelet settings for GPU nodes:
```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static
topologyManagerPolicy: single-numa-node
```
Also consider Node Feature Discovery (NFD) to label nodes with GPU model, NVLink/NVSwitch presence, and MIG capability, enabling Flavors in Kueue and placement constraints in Volcano.
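For reference, GPU Feature Discovery (deployed by the GPU Operator) and NFD publish node labels along these lines; exact values vary by driver version and GPU model, so treat this as an illustrative excerpt rather than an exhaustive list:

```yaml
# Illustrative excerpt of node labels applied by NFD/GFD (values vary by fleet)
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-01
  labels:
    nvidia.com/gpu.present: "true"
    nvidia.com/gpu.product: NVIDIA-H100-80GB-HBM3   # GPU model, used by ResourceFlavors below
    nvidia.com/gpu.count: "8"
    nvidia.com/gpu.memory: "81920"                  # per-GPU memory
    nvidia.com/mig.capable: "true"
```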
Gang semantics: Kueue vs. Volcano
Both Kueue and Volcano provide job-wide semantics beyond what the default scheduler offers, but they take different approaches.
Kueue: queue-based admission for batch
Kueue inserts an admission control layer before pods are scheduled:
- Workloads (e.g., Kubernetes Job, MPIJob, PyTorchJob, RayJob) are enqueued into LocalQueues.
- LocalQueues bind to ClusterQueues with configured quotas by ResourceGroup and Flavors (e.g., A100 vs. H100 classes).
- Kueue checks whether a workload's aggregate requests fit the ClusterQueue's quota per flavor and either admits the entire workload (gang admission) or keeps it pending.
- Cohorts allow queues to borrow idle quota from peers with weighted fairness.
- Preemption policies determine what can be displaced to make room for higher-priority work.
This model plays well with the default kube-scheduler and the NVIDIA device plugin. It preserves standard scheduling extensions and lets you use upstream semantics and autoscalers.
A minimal Kueue setup:
```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: h100
spec:
  nodeLabels:
    nvidia.com/gpu.product: NVIDIA-H100
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: a100
spec:
  nodeLabels:
    nvidia.com/gpu.product: NVIDIA-A100
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: prod-ai
spec:
  cohort: global
  namespaceSelector: {}
  queueingStrategy: StrictFIFO
  resourceGroups:
    - coveredResources: [cpu, memory, nvidia.com/gpu]
      flavors:
        - name: h100
          resources:
            - name: cpu
              nominalQuota: 1024
            - name: memory
              nominalQuota: 6Ti
            - name: nvidia.com/gpu
              nominalQuota: 128   # 16 nodes x 8 GPUs
        - name: a100
          resources:
            - name: cpu
              nominalQuota: 512   # example quota; each flavor must cover all resources in the group
            - name: memory
              nominalQuota: 3Ti   # example quota
            - name: nvidia.com/gpu
              nominalQuota: 64
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: team-vision
  namespace: team-vision
spec:
  clusterQueue: prod-ai
```
Submitting a job labeled for Kueue admission:
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: resnet-train
  namespace: team-vision
  labels:
    kueue.x-k8s.io/queue-name: team-vision
spec:
  completions: 1
  parallelism: 1
  template:
    metadata:
      labels:
        kueue.x-k8s.io/queue-name: team-vision
    spec:
      restartPolicy: Never
      containers:
        - name: train
          image: nvcr.io/nvidia/pytorch:24.08-py3
          resources:
            requests:
              cpu: "8"
              memory: 64Gi
              nvidia.com/gpu: 8
            limits:
              nvidia.com/gpu: 8
```
Pros of Kueue:
- Works with native controllers; no need to replace kube-scheduler.
- Strong multi-tenant controls (ClusterQueues, cohorts, borrowing).
- Natural extension point for batch autoscaling and preemption policies.
Caveats:
- True gang scheduling across multiple pods relies on the job controller integration with Kueue. Use supported job types or CRDs that integrate with Kueue admission.
- Network topology awareness is delegated to underlying plugin heuristics; Kueue does not directly model NVLink/NVSwitch.
Volcano: battle-tested batch scheduler with PodGroup
Volcano replaces kube-scheduler for targeted workloads and adds batch plugins:
- The PodGroup CRD gives gang semantics; pods in the group are scheduled only once minMember of them can be placed together.
- Rich preemption, rescheduling, and queue priorities.
- Integrations with MPI, TensorFlow, PyTorch operators, and classic HPC stacks.
Example Volcano PodGroup with a gang Job:
```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: research
spec:
  weight: 1
---
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: gpt-train
  namespace: research
spec:
  minMember: 16
  minResources:
    nvidia.com/gpu: 128
  queue: research
---
apiVersion: batch/v1
kind: Job
metadata:
  name: gpt-train
  namespace: research
spec:
  completions: 16
  parallelism: 16
  template:
    metadata:
      annotations:
        scheduling.k8s.io/group-name: gpt-train   # binds the pods to the PodGroup above
    spec:
      schedulerName: volcano
      restartPolicy: Never
      containers:
        - name: worker
          image: nvcr.io/nvidia/pytorch:24.08-py3
          resources:
            limits:
              nvidia.com/gpu: 8
```
Pros of Volcano:
- Mature gang scheduling, preemption, and batch policies.
- Tunable scoring plugins helpful for bin-packing and fragmentation control.
Caveats:
- You are running an alternative scheduler; ensure compatibility with other controllers and policies.
- Multi-scheduler setups require care to avoid policy conflicts.
Choosing Kueue vs. Volcano (or both)
- Start with Kueue if you need enterprise multi-tenant controls, fair sharing, and easy integration with Job-like controllers without replacing kube-scheduler.
- Choose Volcano if your workloads resemble HPC batch training with strict gang semantics and you want deeper control of scheduling internals.
- Combine them for certain patterns: Kueue for queueing and quota, Volcano as the scheduler for admitted workloads. This is advanced and requires careful integration testing; many teams succeed with Kueue + default scheduler plus the coscheduling plugin where needed.
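For the coscheduling route mentioned above, a minimal sketch using the scheduler-plugins PodGroup API; this assumes the upstream coscheduling plugin is installed and enabled in a scheduler profile, and the API group and pod label below are scheduler-plugins conventions (not Kueue or Volcano objects):

```yaml
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: sweep-42
  namespace: research
spec:
  minMember: 4                 # schedule all 4 workers or none of them
  scheduleTimeoutSeconds: 120
---
# Worker pods join the group via this label
apiVersion: v1
kind: Pod
metadata:
  name: sweep-42-worker-0
  namespace: research
  labels:
    scheduling.x-k8s.io/pod-group: sweep-42
spec:
  containers:
    - name: worker
      image: yourrepo/sweep:latest   # illustrative image
      resources:
        limits:
          nvidia.com/gpu: 1
```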
MIG and MPS: partitioning GPUs in anger
MIG profiles: hard isolation and predictable performance
MIG partitions Ampere/Hopper GPUs into independent instances with dedicated memory, cache, and compute slices. Common profiles:
- A100 80GB: 1g.10gb, 2g.20gb, 3g.40gb, 7g.80gb
- H100 80GB: 1g.10gb, 2g.20gb, 3g.40gb, 7g.80gb
Patterns:
- Stable shared clusters: pick 2–3 layouts per GPU family, e.g.,
- A100: either 7x 1g.10gb (for inference/experimentation) or 1x 7g.80gb (for big training).
- H100: a common compromise is 2x 3g.40gb for medium training plus 1x 1g.10gb leftover per GPU; but be wary of fragmentation.
- Keep a subset of nodes MIG-disabled to host large training. Label these nodes and target them via Flavors.
Operational realities:
- Switching MIG layouts requires cordoning and draining nodes; plan change windows (a sketch follows after this list).
- Kueue ResourceFlavors can encode MIG presence via node labels to steer workloads.
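A change-window sketch, assuming the GPU Operator's MIG Manager is running and watching the nvidia.com/mig.config node label (profile names come from its mig-parted ConfigMap; all-1g.10gb and all-disabled are among the defaults, and the node name here is illustrative):

```bash
# Drain the node inside a change window, switch the MIG layout, then return it to service
kubectl cordon gpu-node-07
kubectl drain gpu-node-07 --ignore-daemonsets --delete-emptydir-data

# MIG Manager picks up the label change and reconfigures the GPUs
kubectl label node gpu-node-07 nvidia.com/mig.config=all-1g.10gb --overwrite

# Wait until MIG Manager reports nvidia.com/mig.config.state=success on the node, then:
kubectl uncordon gpu-node-07
```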
Example MIG-aware ResourceFlavor:
```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: h100-mig
spec:
  nodeLabels:
    nvidia.com/mig.capable: "true"
    nvidia.com/gpu.product: NVIDIA-H100
```
Then request MIG resources in jobs:
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: micro-batch-infer
  labels:
    kueue.x-k8s.io/queue-name: team-infer
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: serve
          image: nvcr.io/nvidia/tritonserver:24.08-py3
          resources:
            limits:
              nvidia.com/mig-1g.10gb: 1
```
MPS and time-slicing: high throughput, softer isolation
MPS allows multiple CUDA contexts to share a GPU with improved concurrency over raw time-slicing. It boosts throughput for:
- Latency-tolerant inference
- Hyperparameter sweeps
- Lightweight fine-tuning/LoRA tasks
Trade-offs:
- Interference: jobs can compete for SMs and memory bandwidth.
- Accounting: measuring per-job utilization is fuzzier; use DCGM and per-pod GPU metrics.
Kubernetes pattern:
- Enable device plugin time-slicing.
- For NVIDIA MPS, enable the device plugin's MPS sharing mode (or manage an MPS control daemon at the node level) to run multiple pods per GPU with per-client limits on active threads and memory (sketched below).
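A hedged sketch of the device plugin's MPS sharing stanza; this assumes a recent k8s-device-plugin (roughly v0.15+), and field names may differ slightly across versions, so check the chart you actually deploy:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-mps
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    sharing:
      mps:
        resources:
          - name: nvidia.com/gpu
            replicas: 4   # up to 4 pods share each GPU through the MPS control daemon
```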
Example pod requesting a shared GPU on a node configured for time-slicing (the replica count and sharing behavior come from the device plugin ConfigMap, not the pod spec):
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: batch-infer
  labels:
    nvidia.com/gpu.workload.config: timeslice   # informational marker; sharing is configured per node
spec:
  containers:
    - name: infer
      image: yourrepo/infer:latest
      resources:
        limits:
          nvidia.com/gpu: 1
```
Opinion: prefer MIG for SLO-bound inference or when noisy neighbors are unacceptable. Use MPS/time-slicing when your objective is raw throughput and jobs are resilient to interference.
Gang scheduling patterns for AI
Multi-GPU training requires that all workers and parameter servers or collective participants start together. Patterns:
- Single-node, multi-GPU: request N GPUs in a single pod or use a Job with parallelism=1 and nvidia.com/gpu: N. The device plugin will try to allocate GPUs on the same NVLink island.
- Multi-node, multi-GPU: use a job controller that understands distributed workers (MPIJob, PyTorchJob, Ray). Ensure gang admission either with Kueue (admit workload only when all pods can be created) or Volcano PodGroup with minMember.
- Network topology: for NCCL efficiency, prefer nodes connected via NVSwitch or fat-tree fabric. Label nodes and encode this via Flavors or node selectors.
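For the topology point above, a sketch of pinning a job's workers to a single fabric domain; the example.com/nvswitch-domain label is hypothetical, so substitute whatever label your NFD rules or provisioning pipeline actually apply:

```yaml
# Pod template fragment: keep all workers of one job inside a single (hypothetical) fabric domain
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: example.com/nvswitch-domain   # hypothetical label set by your own NFD rule
                operator: In
                values: ["fabric-a"]
  containers:
    - name: worker
      image: yourrepo/pytorch-train:latest
      resources:
        limits:
          nvidia.com/gpu: 8
```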
Example Kueue-admitted PyTorchJob (via the Training Operator's Kueue integration):
```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: imagenet-train
  namespace: research
  labels:
    kueue.x-k8s.io/queue-name: research
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
            - name: pytorch
              image: yourrepo/pytorch-train:latest
              resources:
                limits:
                  nvidia.com/gpu: 8
    Worker:
      replicas: 15
      template:
        spec:
          containers:
            - name: pytorch
              image: yourrepo/pytorch-train:latest
              resources:
                limits:
                  nvidia.com/gpu: 8
```
Opinion: avoid ad hoc scripts that spin up pods manually. Use a controller with native retry and topology hints, and wrap it with Kueue or Volcano gang semantics to prevent zombie partial allocations.
Queues, priorities, and preemption that reflect your business
Your GPU policy should be declared in queues, not in pager duty. Kueue’s ClusterQueues and Volcano’s Queues provide the leverage.
Kueue quota, cohorts, and borrowing
- Quotas: set nominal quotas per ResourceFlavor (e.g., H100 vs. A100, MIG vs. full GPU).
- Cohorts: group ClusterQueues into a borrowing pool; idle capacity is lent to peers with configured weights.
- WorkloadPriorityClass: define business priorities that inform fair share and preemption.
Example WorkloadPriorityClasses (preemption behavior itself is configured on the ClusterQueue; see below):
```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: critical
value: 1000
description: "Business-critical training and serving; may displace lower priorities"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: batch
value: 100
description: "Backfill work; first to be displaced"
```
Then tag Jobs:
```yaml
metadata:
  labels:
    kueue.x-k8s.io/queue-name: prod
    kueue.x-k8s.io/priority-class: critical
```
Preemption guidance:
- Prefer queue-level preemption first: evict within the same queue when priorities differ.
- Allow cross-queue preemption sparingly and with grace periods.
- For spot workloads, model them with lower priority, taints/tolerations, and checkpointing so they are the first to be evicted.
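In Kueue, these knobs live on the ClusterQueue; a minimal sketch of the preemption stanza (field names per the kueue.x-k8s.io/v1beta1 API; verify the enum values against your Kueue version):

```yaml
# ClusterQueue fragment: preemption policy knobs
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: prod-ai
spec:
  cohort: global
  preemption:
    withinClusterQueue: LowerPriority   # evict lower-priority workloads in the same queue first
    reclaimWithinCohort: Any            # reclaim capacity lent to cohort peers when needed
  # resourceGroups as defined earlier
```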
Volcano priorities and preemption
Volcano’s preempt action (working with the priority and gang plugins) can:
- Preempt lower-priority PodGroups to satisfy a higher-priority group.
- Consider gang integrity: preempt enough pods to free resources for the entire high-priority group.
- Support rescheduling plugins to backfill.
Example Volcano priority class:
```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: volcano-critical
value: 100000
globalDefault: false
preemptionPolicy: PreemptLowerPriority
```
GPU bin-packing: minimize fragmentation, maximize throughput
Bin-packing GPUs is about preserving large contiguous allocations and backfilling intelligently.
Tactics:
- Separate node pools by GPU family and MIG policy. Expose them as distinct Flavors.
- Score nodes with a bin-packing bias (e.g., NodeResourcesFit with MostAllocated) for jobs requesting full GPUs, to pack them tightly and free whole nodes for future big jobs.
- For MIG nodes, choose a small set of layouts that correlate with request patterns (e.g., lots of 1g.10gb and a reasonable share of 3g.40gb). Avoid exotic profiles.
- Use anti-affinity judiciously. Spreading workers across too many nodes increases cross-node traffic; prefer packing within NVSwitch domains.
Kube-scheduler plugins to consider:
- NodeResourcesFit with ScoringStrategy Type=MostAllocated.
- PodTopologySpread for resilience of services, but turn it off for tightly-coupled training.
- Coscheduling plugin for soft gang semantics if not using Kueue/Volcano gangs.
Example SchedulerConfiguration tuned for bin-packing bias (when you manage your own scheduler profile):
```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated
            resources:
              - name: nvidia.com/gpu
                weight: 10
              - name: cpu
                weight: 1
              - name: memory
                weight: 1
```
Opinion: many teams over-index on spread for perceived fairness. For GPUs, co-location beats spread unless you have a strong failure-domain requirement.
Spot fallback: cheaper GPUs without chaos
Spot instances can cut costs 50–80% but demand resilience.
Control-plane patterns:
- Isolate spot nodes with taints like spot=true:NoSchedule.
- Add tolerations only to restartable jobs.
- Use PriorityClasses so spot jobs are preempted first.
- Configure Cluster Autoscaler node groups or Karpenter NodePools separately for spot and on-demand, with explicit labels (e.g., capacity-type: spot).
Example Karpenter NodePools (v1beta1 API; the cloud-specific nodeClassRef is omitted for brevity):
```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: gpu-spot
spec:
  template:
    metadata:
      labels:
        capacity-type: spot
        nvidia.com/gpu.present: "true"
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
      taints:
        - key: spot
          value: "true"
          effect: NoSchedule
      # nodeClassRef (cloud-specific) omitted
  disruption:
    consolidationPolicy: WhenUnderutilized
  limits:
    cpu: 2000
---
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: gpu-ondemand
spec:
  template:
    metadata:
      labels:
        capacity-type: on-demand
        nvidia.com/gpu.present: "true"
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
      # nodeClassRef (cloud-specific) omitted
```
Workload patterns:
- Checkpoint frequently (every few minutes) to object storage. For PyTorch, use torch.save or torch.distributed.checkpoint, or a library such as torchsnapshot.
- Use preStop hooks and a generous termination grace period to save progress on SIGTERM (see the sketch after this list).
- For Ray, enable the autoscaler so interrupted workers are replaced and actors/tasks can be rescheduled.
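A sketch of the SIGTERM path, assuming your training loop polls for a sentinel file and flushes a checkpoint when it appears; the paths, image, and grace period are illustrative:

```yaml
# Pod template fragment for a restartable spot trainer
spec:
  terminationGracePeriodSeconds: 300          # give the trainer time to flush its checkpoint
  containers:
    - name: trainer
      image: yourrepo/llama-train:latest      # illustrative image
      lifecycle:
        preStop:
          exec:
            # Illustrative: drop a sentinel the training loop watches, then wait for the flush
            command: ["/bin/sh", "-c", "touch /tmp/checkpoint-now && sleep 240"]
```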
Pod tolerations and node selectors:
```yaml
spec:
  tolerations:
    - key: spot
      operator: Equal
      value: "true"
      effect: NoSchedule
  nodeSelector:
    capacity-type: spot
```
Queue policy:
- Define spot queues with lower borrowing priority. Allow them to consume idle capacity but be preempted immediately when on-demand work arrives.
Orchestrating with Ray: clusters, jobs, and queues
Ray is a popular substrate for distributed training, tuning, and batch inference. On Kubernetes you have two main modes:
- RayCluster CRD: a long-lived cluster with head and worker groups; workloads are submitted as Ray Jobs.
- Per-job ephemeral clusters: each Job spins up a cluster, runs, then tears down.
GPU-aware patterns:
- Define per-group Pod templates that request GPUs (full or MIG) and pin to flavors.
- Use Kueue admission for RayJobs so the entire cluster is admitted as a gang.
- For Volcano, put Ray pods into a PodGroup.
Example RayCluster with MIG workers and Kueue queueing:
```yaml
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: ray-mig
  namespace: ml
  labels:
    kueue.x-k8s.io/queue-name: ml-shared
spec:
  headGroupSpec:
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:2.32.0
            resources:
              requests:
                cpu: "8"
                memory: 32Gi
  workerGroupSpecs:
    - groupName: mig-workers
      replicas: 4
      template:
        spec:
          containers:
            - name: ray-worker
              image: rayproject/ray:2.32.0
              resources:
                limits:
                  nvidia.com/mig-1g.10gb: 1
```
Submitting a RayJob that should be admitted atomically:
```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: ray-train
  namespace: ml
  labels:
    kueue.x-k8s.io/queue-name: ml-shared
spec:
  entrypoint: python train.py --epochs 5
  rayClusterSelector:
    rayClusterName: ray-mig
```
Opinion: ephemeral Ray clusters per Job work well with Kueue because admission can model the cluster’s entire footprint. Long-lived clusters are fine for interactive use but can suffer from internal fragmentation; implement internal Ray resource reservations to mitigate.
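A sketch of the ephemeral pattern using KubeRay's RayJob with an embedded rayClusterSpec and shutdownAfterJobFinishes; images, sizes, and the entrypoint are illustrative:

```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: sweep-ephemeral
  namespace: ml
  labels:
    kueue.x-k8s.io/queue-name: ml-shared
spec:
  entrypoint: python sweep.py
  shutdownAfterJobFinishes: true      # tear the cluster down when the job completes
  rayClusterSpec:
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.32.0
              resources:
                requests:
                  cpu: "4"
                  memory: 16Gi
    workerGroupSpecs:
      - groupName: gpu-workers
        replicas: 2
        rayStartParams: {}
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray:2.32.0
                resources:
                  limits:
                    nvidia.com/gpu: 1
```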
A reference architecture that works in practice
- Node pools:
- pool-h100-full: H100 nodes with MIG disabled for large training.
- pool-h100-mig: H100 nodes with a fixed layout (e.g., 7x 1g.10gb) for inference and small jobs.
- pool-a100-mig: Backfill and lower-priority training.
- pool-gpu-spot: Mixed GPU types for low-priority work.
- Kueue configuration:
- ResourceFlavors for h100, h100-mig, a100-mig, gpu-spot.
- ClusterQueues: prod, research, batch. prod has highest priority and can borrow, research is capped but can borrow from batch, batch is lowest and can use spot.
- WorkloadPriorityClasses: critical, high, normal, spot.
- Default scheduler with bin-packing bias and Topology Manager.
- NVIDIA GPU Operator with one fixed MIG layout per MIG pool, exposed via the mixed strategy so jobs can request profile-named resources (e.g., nvidia.com/mig-1g.10gb); time-slicing enabled only on batch and spot pools.
- Observability: DCGM exporter, Prometheus, per-queue dashboards; alert on queue wait time SLOs and GPU idle rates.
Sample Kueue ClusterQueues with cohorts:
```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: prod
spec:
  cohort: global
  resourceGroups:
    - coveredResources: [nvidia.com/gpu]
      flavors:
        - name: h100
          resources:
            - name: nvidia.com/gpu
              nominalQuota: 256
    - coveredResources: [nvidia.com/mig-1g.10gb]
      flavors:
        - name: h100-mig
          resources:
            - name: nvidia.com/mig-1g.10gb
              nominalQuota: 128
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: research
spec:
  cohort: global
  resourceGroups:
    - coveredResources: [nvidia.com/gpu]
      flavors:
        - name: h100
          resources:
            - name: nvidia.com/gpu
              nominalQuota: 64
    - coveredResources: [nvidia.com/mig-1g.10gb]
      flavors:
        - name: a100-mig
          resources:
            - name: nvidia.com/mig-1g.10gb
              nominalQuota: 256
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: batch
spec:
  cohort: global
  resourceGroups:
    - coveredResources: [nvidia.com/gpu]
      flavors:
        - name: gpu-spot
          resources:
            - name: nvidia.com/gpu
              nominalQuota: 512
```
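The jobs below reference queue names prod and batch from their own namespaces, so each tenant namespace needs a LocalQueue bound to the matching ClusterQueue, for example:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: prod
  namespace: prod-ml
spec:
  clusterQueue: prod
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: batch
  namespace: research
spec:
  clusterQueue: batch
```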
Now, a high-priority training job targeting full H100s:
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: llama-train
  namespace: prod-ml
  labels:
    kueue.x-k8s.io/queue-name: prod
    kueue.x-k8s.io/priority-class: critical
spec:
  parallelism: 32
  completions: 32
  template:
    spec:
      restartPolicy: Never
      nodeSelector:
        nvidia.com/gpu.product: NVIDIA-H100
      containers:
        - name: trainer
          image: yourrepo/llama-train:latest
          resources:
            limits:
              nvidia.com/gpu: 8
```
A backfill batch inference job on spot capacity (time-sliced GPUs on the spot pool, matching the batch queue's gpu-spot quota):
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: nightly-infer
  namespace: research
  labels:
    kueue.x-k8s.io/queue-name: batch
    kueue.x-k8s.io/priority-class: spot
spec:
  parallelism: 100
  template:
    spec:
      restartPolicy: Never
      tolerations:
        - key: spot
          operator: Equal
          value: "true"
          effect: NoSchedule
      nodeSelector:
        capacity-type: spot
      containers:
        - name: infer
          image: yourrepo/infer:latest
          resources:
            limits:
              nvidia.com/gpu: 1
```
Observability and SLOs for GPU scheduling
KPIs that matter:
- GPU utilization: target 65–85% average at the node and fleet levels. Below 50% indicates fragmentation or policy misfit.
- Queue wait time: 50th/95th percentiles by queue and priority. Keep p95 under your agreed SLO (e.g., 2 hours for research, 10 minutes for prod-critical).
- Preemption rate and lost work: track time lost to preemption; if it exceeds 5–10% for a queue, invest in checkpointing and better preemption windows.
- Fragmentation metrics: percent of GPUs available as full vs. only MIG shards; stranded capacity by profile.
Tools:
- DCGM Exporter for GPU metrics; scrape with Prometheus and export to your APM.
- Kueue metrics: admitted vs. pending workloads, borrowing, preemption counts.
- Volcano metrics: queue lengths, PodGroup states, preemptions.
- Custom dashboards by queue and flavor.
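As one example of alerting on these KPIs, a sketch of a Prometheus rule on dcgm-exporter's utilization metric (this assumes the Prometheus Operator's PrometheusRule CRD; DCGM_FI_DEV_GPU_UTIL is the exporter's standard utilization gauge, while the threshold and window are yours to tune):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-scheduling-slos
  namespace: monitoring
spec:
  groups:
    - name: gpu-utilization
      rules:
        - alert: GPUFleetUnderutilized
          expr: avg(DCGM_FI_DEV_GPU_UTIL) < 50   # fleet-wide average below 50%
          for: 2h
          labels:
            severity: warning
          annotations:
            summary: "Fleet GPU utilization under 50% for 2h; check fragmentation and queue policy"
```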
Operational playbooks:
- Weekly planning: adjust MIG layouts and quotas based on observed footprint distribution.
- Hot fixes: if a large training is starved, temporarily drain MIG nodes to revert them to full GPUs; communicate change windows.
- Auto-tuning: feed observed request histograms back to templates so developers choose profile sizes that fit your layout.
Common pitfalls and how to avoid them
- Too many MIG layouts: operational thrash and stranded capacity. Limit layouts to those you actually need.
- Pretending MPS is isolation: it is not. Use MPS only when SLOs allow interference.
- Ignoring CPU and memory: oversubscribed CPUs can throttle your GPUs. Allocate sufficient CPU/memory with static CPU manager where needed.
- No gang semantics for Ray: partial startup frequently deadlocks actor sets. Use Kueue or Volcano to admit the whole cluster.
- Preemption without checkpoints: you will waste GPU hours. Make checkpointing a policy, not a suggestion.
- Spreading training pods across racks without a reason: network becomes your bottleneck. Prefer NVSwitch/NVLink locality.
- One giant prod queue: it looks fair until one team backfills forever. Use per-tenant LocalQueues bound to ClusterQueues with borrowing.
A short decision guide
- Need tight multi-tenant quotas and admission with minimal disruption to upstream controllers? Use Kueue + default scheduler.
- Need HPC-like gang scheduling and intricate preemption behaviors? Use Volcano.
- Mix of both with strong governance? Use Kueue for queues/admission and carefully integrate Volcano for scheduling select namespaces.
- Serving and small training: consider MIG or MPS. Big training: reserve full GPUs.
- On a budget with resilience: spot fallback with checkpointing, taints/tolerations, and preemptible queues.
Closing thoughts
The organizations that win with AI at scale in 2025 made a cultural shift: they treat GPUs as a shared, policy-driven substrate governed by queues, not as pets hand-assigned to projects. Kubernetes can be a great fit if you embrace GPU-aware primitives: device plugins, MIG and MPS, gang admission, quotas and cohorts, and preemption that aligns with your business.
Start small: pick two GPU families, define three queues, set a handful of MIG layouts, and turn on Kueue. Watch your queue metrics and GPU utilization for a month. Then, add Volcano where you need stricter gang semantics. Keep iterating on your layouts and priorities. The goal isn’t to perfectly pack every GPU every minute; it’s to keep your teams productive, your SLAs honest, and your cloud bill boring.
Further reading
- NVIDIA GPU Operator and device plugin: https://github.com/NVIDIA/gpu-operator
- Kueue docs: https://kueue.sigs.k8s.io/
- Volcano scheduler: https://volcano.sh/
- Kubernetes Topology Manager: https://kubernetes.io/docs/tasks/administer-cluster/topology-manager/
- Node Feature Discovery: https://github.com/kubernetes-sigs/node-feature-discovery
- Ray on K8s: https://docs.ray.io/en/latest/cluster/kubernetes/index.html
- DCGM Exporter: https://github.com/NVIDIA/dcgm-exporter