Kubernetes GPU Scheduling in 2025: Practical Patterns for AI Workloads with Kueue, Volcano, and MIG
Opinionated, hands-on guide to GPU-aware Kubernetes in 2025: device plugins, gang scheduling, queues/preemption, MIG/MPS partitioning, bin-packing, spot fallback, and orchestration via Kueue, Volcano, and Ray.

AI infrastructure is GPU-bound, not pod-bound. In 2025, the clusters that ship models to production and keep R&D humming are those that schedule GPUs as a first-class resource across multi-tenant, bursty, heterogeneous fleets. This article is a pragmatic, opinionated playbook for GPU-aware Kubernetes: how to combine device plugins, gang scheduling, queues and preemption, MIG/MPS partitioning, bin-packing, spot fallback, and orchestration via Kueue, Volcano, and Ray.
If you want the nutshell: use the NVIDIA GPU Operator for device management, Kueue or Volcano (or both) for batch admission control and gang semantics, MIG for stable isolation and bin-packing small jobs, MPS for throughput on latency-tolerant inference, and a queue-first policy model to keep GPU utilization high without blowing up your tenants.
Executive summary
Why GPU-bound beats pod-bound
Kubernetes has a pod-centric scheduler, while AI workloads care about the count, type, and topology of GPUs. The default scoring and fit predicates cannot coordinate multi-GPU jobs, multi-node gangs, network topology, and tenant quotas on their own.
You need to:
- Manage devices declaratively (GPU Operator, device plugin, NFD/GFD labels).
- Admit work as gangs behind queues, quotas, and preemption (Kueue, Volcano).
- Partition GPUs deliberately (MIG for isolation, MPS/time-slicing for throughput).
- Bias placement toward bin-packing and plan spot fallback for interruptible jobs.
Building blocks: GPU device management on Kubernetes
NVIDIA GPU Operator and Device Plugin
In 2025, the NVIDIA GPU Operator remains the default for production. It installs the driver, nvidia-container-toolkit, DCGM, and the device plugin.
Basic deployment hint:
`bash
helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update
helm upgrade --install gpu-operator nvidia/gpu-operator -n gpu-operator --create-namespace \
  --set driver.enabled=true \
  --set toolkit.enabled=true \
  --set mig.strategy=single   # or "mixed" for per-profile resource names
`
Key device plugin configuration patterns:
- migStrategy: single for one uniform layout, mixed when nodes expose several profile-specific resources.
- Time-slicing replicas to oversubscribe a GPU for interference-tolerant pods.
- Per-node config classes so different node pools get different sharing behavior.
Example device plugin config via ConfigMap:
`yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    flags:
      migStrategy: single
      nvidiaDriverRoot: /run/nvidia/driver
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4   # up to 4 pods share a single GPU (per-GPU time-slice)
`
Pods requesting GPUs:
`yaml
apiVersion: v1
kind: Pod
metadata:
  name: trainer
spec:
  containers:
    - name: main
      image: nvcr.io/nvidia/pytorch:24.08-py3
      resources:
        limits:
          nvidia.com/gpu: 4
`
With MIG enabled and configured, request profile-specific resources:
`yaml
apiVersion: v1
kind: Pod
metadata:
  name: small-trainer
spec:
  nodeSelector:
    nvidia.com/mig.capable: 'true'
  containers:
    - name: main
      image: nvcr.io/nvidia/pytorch:24.08-py3
      resources:
        limits:
          nvidia.com/mig-1g.10gb: 1   # H100 80GB example profile
`
Opinion: choose a small number of cluster-wide MIG layouts and stick to them. Stability beats theoretical flexibility, because reconfiguring MIG requires draining the node and can disrupt running jobs.
Topology and CPU pinning
GPU allocation quality matters:
- NCCL bandwidth depends on PCIe/NVLink topology, so multi-GPU pods should land on GPUs that share a fabric.
- Keep GPUs, CPUs, and NICs on the same NUMA node to avoid cross-socket traffic.
- Static CPU pinning stops data-loader jitter from starving the threads that feed the GPUs.
Recommended kubelet settings for GPU nodes:
`yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# TopologyManager is GA and enabled by default on current Kubernetes releases
cpuManagerPolicy: static
topologyManagerPolicy: single-numa-node
`
Also consider Node Feature Discovery (NFD) and GPU Feature Discovery (GFD) to label nodes with GPU model, NVLink/NVSwitch presence, and MIG capability; those labels drive ResourceFlavor selection in Kueue and placement constraints in Volcano. A node-affinity sketch using them follows.
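As a sketch of how those labels get used (the label keys assume GPU Feature Discovery defaults and the pod itself is a hypothetical example), a node affinity that pins a whole-GPU trainer to H100 nodes:
`yaml
# Sketch: pin a whole-GPU pod to H100 nodes using NFD/GFD labels.
# Label keys and values depend on your GFD version; verify with kubectl get node -o yaml.
apiVersion: v1
kind: Pod
metadata:
  name: topology-aware-trainer   # hypothetical example
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: nvidia.com/gpu.product
                operator: In
                values: ['NVIDIA-H100']
              - key: nvidia.com/mig.capable   # optionally keep whole-GPU jobs off MIG nodes
                operator: NotIn
                values: ['true']
  containers:
    - name: main
      image: nvcr.io/nvidia/pytorch:24.08-py3
      resources:
        limits:
          nvidia.com/gpu: 8
`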
Gang semantics: Kueue vs. Volcano
Both Kueue and Volcano provide job-wide semantics beyond what the default scheduler offers, but they take different approaches.
Kueue: queue-based admission for batch
Kueue inserts an admission control layer before pods are scheduled:
This model plays well with the default kube-scheduler and the NVIDIA device plugin. It preserves standard scheduling extensions and lets you use upstream semantics and autoscalers.
A minimal Kueue setup:
`yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: h100
spec:
  nodeLabels:
    nvidia.com/gpu.product: NVIDIA-H100
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: a100
spec:
  nodeLabels:
    nvidia.com/gpu.product: NVIDIA-A100
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: prod-ai
spec:
  cohort: global
  namespaceSelector: {}
  queueingStrategy: StrictFIFO
  resourceGroups:
    - coveredResources: [cpu, memory, nvidia.com/gpu]
      flavors:
        - name: h100
          resources:
            - name: cpu
              nominalQuota: 1024
            - name: memory
              nominalQuota: 6Ti
            - name: nvidia.com/gpu
              nominalQuota: 128   # 16 nodes x 8 GPUs
        - name: a100
          resources:
            - name: cpu            # every flavor must cover all resources in its group
              nominalQuota: 512
            - name: memory
              nominalQuota: 3Ti
            - name: nvidia.com/gpu
              nominalQuota: 64
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: team-vision
  namespace: team-vision
spec:
  clusterQueue: prod-ai
`
Submitting a job with Kueue annotations:
`yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: resnet-train
  namespace: team-vision
  labels:
    kueue.x-k8s.io/queue-name: team-vision
spec:
  completions: 1
  parallelism: 1
  template:
    metadata:
      labels:
        kueue.x-k8s.io/queue-name: team-vision
    spec:
      restartPolicy: Never
      containers:
        - name: train
          image: nvcr.io/nvidia/pytorch:24.08-py3
          resources:
            requests:
              cpu: '8'
              memory: 64Gi
              nvidia.com/gpu: 8
            limits:
              nvidia.com/gpu: 8
`
Pros of Kueue:
- Works with the default kube-scheduler, cluster autoscalers, and the standard Job API, so you keep upstream semantics.
- Quota, cohorts, and borrowing give multi-tenant policy without operating a second scheduler.
- Admission happens before pods are created, so rejected work never thrashes the scheduler.
Caveats:
- Gang semantics are enforced at admission time, not placement time; node-level races are still possible under churn.
- CRD workloads need an integration (Jobs, JobSet, the Kubeflow training operators, and RayJob are covered; bespoke CRDs are not).
Volcano: battle-tested batch scheduler with PodGroup
Volcano replaces kube-scheduler for targeted workloads (anything with schedulerName: volcano) and adds batch plugins:
- gang: all-or-nothing placement driven by PodGroup minMember.
- binpack: score nodes to pack GPUs rather than spread them.
- proportion, priority, preempt, and reclaim: queue shares, priorities, and eviction across queues.
Example Volcano PodGroup with a gang Job:
`yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: research
spec:
  weight: 1
---
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: gpt-train
  namespace: research
spec:
  minMember: 16
  minResources:
    nvidia.com/gpu: 128
  queue: research
---
apiVersion: batch/v1
kind: Job
metadata:
  name: gpt-train
  namespace: research
spec:
  completions: 16
  parallelism: 16
  template:
    metadata:
      labels:
        pod-group.scheduling.volcano.sh/name: gpt-train
    spec:
      schedulerName: volcano
      restartPolicy: Never
      containers:
        - name: worker
          image: nvcr.io/nvidia/pytorch:24.08-py3
          resources:
            limits:
              nvidia.com/gpu: 8
`
Pros of Volcano:
- True gang scheduling at placement time, plus mature queue, fair-share, and preemption machinery.
- Batch-oriented scoring (bin-packing, backfill) works out of the box.
Caveats:
- You operate a second scheduler, and mixing schedulers over the same nodes can race on capacity.
- Queue, PodGroup, and VolcanoJob are additional APIs your users have to learn.
Choosing Kueue vs. Volcano (or both)
Default to Kueue if you want quota-first multi-tenancy on top of the standard scheduler and autoscalers; reach for Volcano when you need strict placement-time gang scheduling and batch-tuned scoring. Running both is workable if you split them across workload types or node pools so two schedulers never fight over the same capacity.
MIG and MPS: partitioning GPUs in anger
MIG profiles: hard isolation and predictable performance
MIG partitions Ampere/Hopper GPUs into independent instances with dedicated memory, cache, and compute slices. Common profiles on 80GB parts:
- 1g.10gb: the smallest slice, up to 7 per GPU; good for inference and notebooks.
- 2g.20gb and 3g.40gb: medium slices for fine-tuning and small training runs.
- 7g.80gb: the whole GPU exposed as a single MIG instance.
Patterns:
- A100: either 7x 1g.10gb (for inference/experimentation) or 1x 7g.80gb (for big training).
- H100: mixed clusters often choose 2x 3g.40gb for medium training plus a 1x 1g.10gb leftover slice; be wary of fragmentation.
Operational realities:
- Changing a layout means draining GPU workloads from the node first; the GPU Operator's MIG manager automates the drain and reconfigure when you change the node's nvidia.com/mig.config label (see the sketch below).
- Keep distinct layouts to a handful; each extra layout multiplies quota bookkeeping and fragmentation modes.
- MIG slices surface as their own resource names (for example nvidia.com/mig-1g.10gb), so quotas and queues must account for them separately from nvidia.com/gpu.
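As a sketch of the relabel-to-reconfigure flow, assuming the GPU Operator's MIG manager and its default mig-parted profiles (names vary by GPU model); in practice you apply the label with kubectl rather than a full Node manifest:
`yaml
# Sketch: ask the MIG manager to apply a layout by labeling the node.
# 'all-1g.10gb' comes from the default mig-parted config for 80GB GPUs; adjust for your hardware.
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-01                      # hypothetical node name
  labels:
    nvidia.com/mig.config: all-1g.10gb   # MIG manager drains GPU pods, then reconfigures
`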
Example MIG-aware ResourceFlavor:
`yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: h100-mig
spec:
  nodeLabels:
    nvidia.com/mig.capable: 'true'
    nvidia.com/gpu.product: NVIDIA-H100
`
Then request MIG resources in jobs:
`yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: micro-batch-infer
  labels:
    kueue.x-k8s.io/queue-name: team-infer
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: serve
          image: nvcr.io/nvidia/tritonserver:24.08-py3
          resources:
            limits:
              nvidia.com/mig-1g.10gb: 1
`
MPS and time-slicing: high throughput, softer isolation
MPS allows multiple CUDA contexts to share a GPU with better concurrency than raw time-slicing. It boosts throughput for:
- Many small, latency-tolerant inference workers that individually underutilize a GPU.
- Hyperparameter sweeps and evaluations over small models.
- Bursty batch scoring where aggregate throughput matters more than tail latency.
Trade-offs:
- Isolation is softer than MIG: clients share memory bandwidth, and a misbehaving process can degrade or take down its neighbors.
- There are no hard per-client guarantees, so it is a poor fit for strict latency SLOs.
Kubernetes pattern: expose shared GPUs through the device plugin's sharing config (time-slicing is broadly available; MPS where your plugin version supports it), label and taint those nodes, and route only interference-tolerant workloads to them.
Example pod label to request time-slicing via device plugin config classes:
`yaml
apiVersion: v1
kind: Pod
metadata:
  name: batch-infer
  labels:
    nvidia.com/gpu.workload.config: timeslice
spec:
  containers:
    - name: infer
      image: yourrepo/infer:latest
      resources:
        limits:
          nvidia.com/gpu: 1
`
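Recent device plugin releases also support MPS sharing; assuming a plugin version with that support, the config mirrors the time-slicing schema. A minimal sketch to slot into the config.yaml shown earlier:
`yaml
# Sketch: MPS sharing in the device plugin config (requires a device plugin
# release with MPS support; each client gets an equal memory share by default).
version: v1
sharing:
  mps:
    resources:
      - name: nvidia.com/gpu
        replicas: 4   # up to 4 MPS clients per physical GPU
`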
Opinion: prefer MIG for SLO-bound inference or when noisy neighbors are unacceptable. Use MPS/time-slicing when your objective is raw throughput and jobs are resilient to interference.
Gang scheduling patterns for AI
Multi-GPU training needs every worker, parameter server, or collective participant to start together. Patterns:
- Run training through an operator (PyTorchJob, MPIJob, JobSet, RayJob) so a controller owns replica lifecycle and retries.
- Gate the whole replica set behind Kueue admission, or attach a Volcano PodGroup with minMember equal to the replica count.
- If the gang cannot be satisfied, keep it queued rather than holding partial GPU allocations.
Example Kueue-admitted PyTorchJob (via the Kubeflow Training Operator integration):
`yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: imagenet-train
  namespace: research
  labels:
    kueue.x-k8s.io/queue-name: research
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
            - name: pytorch
              image: yourrepo/pytorch-train:latest
              resources:
                limits:
                  nvidia.com/gpu: 8
    Worker:
      replicas: 15
      template:
        spec:
          containers:
            - name: pytorch
              image: yourrepo/pytorch-train:latest
              resources:
                limits:
                  nvidia.com/gpu: 8
`
Opinion: avoid ad hoc scripts that spin up pods manually. Use a controller with native retry and topology hints, and wrap it with Kueue or Volcano gang semantics to prevent zombie partial allocations.
Queues, priorities, and preemption that reflect your business
Your GPU policy should be declared in queues, not enforced by paging humans. Kueue's ClusterQueues and Volcano's Queues provide the leverage.
Kueue quota, cohorts, and borrowing
ClusterQueues that share a cohort can borrow one another's unused nominal quota, and preemption settings decide how that capacity is reclaimed when the lender needs it back. Example WorkloadPriorityClasses:
`yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: critical
value: 1000
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: batch
value: 100
# WorkloadPriorityClass only carries a value; preemption behavior is configured
# on the ClusterQueue via spec.preemption (see the sketch after the guidance below).
`
Then tag Jobs:
`yaml
metadata:
  labels:
    kueue.x-k8s.io/queue-name: prod
    kueue.x-k8s.io/priority-class: critical   # Kueue reads the priority class from this label
`
Preemption guidance:
- Reclaim borrowed cohort capacity before evicting in-quota workloads.
- Only let preemption target jobs that checkpoint; pair preemptible queues with checkpoint/resume conventions.
- Keep the number of priority tiers small (two or three); more tiers mostly create arguments.
The ClusterQueue stanza these choices map onto is sketched below.
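A minimal sketch of that stanza (values are illustrative choices, not defaults; other spec fields are omitted):
`yaml
# Sketch: preemption knobs on the prod-ai ClusterQueue defined earlier.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: prod-ai
spec:
  preemption:
    withinClusterQueue: LowerPriority   # evict lower-priority workloads in the same queue
    reclaimWithinCohort: Any            # take back capacity lent to other queues in the cohort
    borrowWithinCohort:
      policy: LowerPriority             # may preempt lower-priority borrowers when borrowing
`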
Volcano priorities and preemption
Volcano’s preempt and reclaim machinery can:
- Evict lower-priority pods within a queue to admit a higher-priority PodGroup.
- Reclaim capacity across queues according to queue weight and share.
- Respect gang constraints so a victim job is never left half-alive.
Example Volcano priority class:
`yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: volcano-critical
value: 100000
globalDefault: false
preemptionPolicy: PreemptLowerPriority
`
GPU bin-packing: minimize fragmentation, maximize throughput
Bin-packing GPUs is about preserving large contiguous allocations and backfilling intelligently.
Tactics:
- Keep whole-GPU training on dedicated, non-MIG nodes so 8-GPU shapes stay available.
- Funnel small jobs onto MIG slices or shared GPUs instead of letting them nibble full GPUs.
- Prefer most-allocated scoring on GPU nodes and backfill short jobs behind large ones.
Kube-scheduler plugins to consider:
- NodeResourcesFit with the MostAllocated scoring strategy (or RequestedToCapacityRatio for a custom packing curve), weighted toward nvidia.com/gpu.
- Down-weight NodeResourcesBalancedAllocation for GPU profiles, since it biases toward spread.
Example SchedulerConfiguration tuned for bin-packing bias (when you manage your own scheduler profile):
`yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: gpu-binpack   # example profile name
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated
            resources:
              - name: nvidia.com/gpu
                weight: 10
`
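Workloads opt into that profile by scheduler name (gpu-binpack is the example profile name used above, not a built-in):
`yaml
spec:
  schedulerName: gpu-binpack
`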
Opinion: many teams over-index on spread for perceived fairness. For GPUs, co-location beats spread unless you have a strong failure-domain requirement.
Spot fallback: cheaper GPUs without chaos
Spot instances can cut costs 50–80% but demand resilience.
Control-plane patterns:
- Separate node pools for spot and on-demand GPUs, with a taint on spot nodes so only opted-in workloads land there.
- Aggressive consolidation of underutilized spot nodes; on-demand kept as the fallback pool for SLO-bound work.
Example Karpenter NodePools (the v1beta1 successor to Provisioners):
`yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: gpu-spot
spec:
  template:
    metadata:
      labels:
        capacity-type: spot
        nvidia.com/gpu.present: 'true'
    spec:
      # nodeClassRef to your cloud-specific NodeClass omitted for brevity
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['spot']
      taints:
        - key: spot
          value: 'true'
          effect: NoSchedule
  disruption:
    consolidationPolicy: WhenUnderutilized
  limits:
    cpu: 2000
---
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: gpu-ondemand
spec:
  template:
    metadata:
      labels:
        capacity-type: on-demand
        nvidia.com/gpu.present: 'true'
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['on-demand']
`
Workload patterns:
- Checkpoint to object storage on a fixed step cadence so a reclaimed node costs minutes, not hours.
- Handle the cloud's interruption notice (typically 30 seconds to 2 minutes): trap SIGTERM, flush a final checkpoint, exit cleanly.
- Lean on Job backoffLimit or operator retries to resume on fresh capacity, and keep SLO-bound serving off spot entirely.
Pod tolerations and node selectors:
`yaml
spec:
  tolerations:
    - key: spot
      operator: Equal
      value: 'true'
      effect: NoSchedule
  nodeSelector:
    capacity-type: spot
`
Queue policy: route interruptible batch work to a spot-backed flavor first and let it fall back to on-demand, while SLO-bound queues get on-demand quota only; a Kueue sketch follows.
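A minimal sketch of that policy in Kueue, assuming gpu-spot and gpu-ondemand ResourceFlavors exist: flavors are tried in the order listed, so spot quota is consumed first and on-demand acts as the fallback.
`yaml
# Sketch: spot-first admission with on-demand fallback (names and quotas are illustrative).
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: batch-spot-first
spec:
  cohort: global
  resourceGroups:
    - coveredResources: [nvidia.com/gpu]
      flavors:
        - name: gpu-spot          # tried first
          resources:
            - name: nvidia.com/gpu
              nominalQuota: 512
        - name: gpu-ondemand      # fallback once spot quota is exhausted
          resources:
            - name: nvidia.com/gpu
              nominalQuota: 64
`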
Orchestrating with Ray: clusters, jobs, and queues
Ray is a popular substrate for distributed training, tuning, and batch inference. On Kubernetes you have two main modes:
- Long-lived RayClusters that teams share interactively.
- Ephemeral per-job clusters created and torn down by RayJob.
GPU-aware patterns:
- Keep Ray's logical GPU counts in lockstep with the pods' Kubernetes GPU limits so Ray never schedules onto capacity the kubelet did not grant.
- Use MIG-backed worker groups for swarms of small tasks and whole-GPU groups for training.
- Label the RayCluster or RayJob with a Kueue queue so the entire footprint is admitted (or queued) as one unit.
Example RayCluster with MIG workers and Kueue queueing:
`yaml
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: ray-mig
  namespace: ml
  labels:
    kueue.x-k8s.io/queue-name: ml-shared
spec:
  headGroupSpec:
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:2.32.0
            resources:
              requests:
                cpu: '8'
                memory: 32Gi
  workerGroupSpecs:
    - groupName: mig-workers
      replicas: 4
      template:
        spec:
          containers:
            - name: ray-worker
              image: rayproject/ray:2.32.0
              resources:
                limits:
                  nvidia.com/mig-1g.10gb: 1
`
Submitting a RayJob that should be admitted atomically:
`yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: ray-train
  namespace: ml
  labels:
    kueue.x-k8s.io/queue-name: ml-shared
spec:
  entrypoint: python train.py --epochs 5
  rayClusterSelector:
    rayClusterName: ray-mig
`
Opinion: ephemeral Ray clusters per Job work well with Kueue because admission can model the cluster’s entire footprint. Long-lived clusters are fine for interactive use but can suffer from internal fragmentation; implement internal Ray resource reservations to mitigate.
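A sketch of the ephemeral pattern, assuming the KubeRay RayJob API with an embedded cluster spec (worker counts, images, and names here are illustrative):
`yaml
# Sketch: per-job Ray cluster; Kueue admits the whole footprint and KubeRay tears it down afterwards.
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: ray-train-ephemeral          # hypothetical
  namespace: ml
  labels:
    kueue.x-k8s.io/queue-name: ml-shared
spec:
  entrypoint: python train.py --epochs 5
  shutdownAfterJobFinishes: true     # delete the cluster when the job ends
  rayClusterSpec:
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.32.0
    workerGroupSpecs:
      - groupName: gpu-workers
        replicas: 2
        rayStartParams: {}
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray:2.32.0
                resources:
                  limits:
                    nvidia.com/gpu: 8
`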
A reference architecture that works in practice
Sample Kueue ClusterQueues with cohorts:
`yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: prod
spec:
  cohort: global
  resourceGroups:
    - coveredResources: [nvidia.com/gpu]
      flavors:
        - name: h100
          resources:
            - name: nvidia.com/gpu
              nominalQuota: 256
    - coveredResources: [nvidia.com/mig-1g.10gb]   # MIG slices are tracked as their own resource
      flavors:
        - name: h100-mig
          resources:
            - name: nvidia.com/mig-1g.10gb
              nominalQuota: 128
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: research
spec:
  cohort: global
  resourceGroups:
    - coveredResources: [nvidia.com/gpu]
      flavors:
        - name: h100
          resources:
            - name: nvidia.com/gpu
              nominalQuota: 64
    - coveredResources: [nvidia.com/mig-1g.10gb]
      flavors:
        - name: a100-mig
          resources:
            - name: nvidia.com/mig-1g.10gb
              nominalQuota: 256
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: batch
spec:
  cohort: global
  resourceGroups:
    - coveredResources: [nvidia.com/gpu]
      flavors:
        - name: gpu-spot
          resources:
            - name: nvidia.com/gpu
              nominalQuota: 512
`
Now, a high-priority training job targeting full H100s:
`yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: llama-train
  namespace: prod-ml
  labels:
    kueue.x-k8s.io/queue-name: prod
    kueue.x-k8s.io/priority-class: critical
spec:
  parallelism: 32
  completions: 32
  template:
    spec:
      restartPolicy: Never
      nodeSelector:
        nvidia.com/gpu.product: NVIDIA-H100
      containers:
        - name: trainer
          image: yourrepo/llama-train:latest
          resources:
            limits:
              nvidia.com/gpu: 8
`
A backfill batch inference job on MIG:
`yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: nightly-infer
  namespace: research
  labels:
    kueue.x-k8s.io/queue-name: batch
    kueue.x-k8s.io/priority-class: batch
spec:
  parallelism: 100
  template:
    spec:
      restartPolicy: Never
      tolerations:
        - key: spot
          operator: Equal
          value: 'true'
          effect: NoSchedule
      nodeSelector:
        capacity-type: spot
      containers:
        - name: infer
          image: yourrepo/infer:latest
          resources:
            limits:
              nvidia.com/mig-1g.10gb: 1
`
Observability and SLOs for GPU scheduling
KPIs that matter:
- GPU allocation rate (requested vs. available) and real utilization (SM and memory activity from DCGM).
- Queue wait time per ClusterQueue/Queue, admission rate, and preemption or eviction counts.
- Fragmentation: how many full-GPU shapes you can still place versus GPUs stranded behind partial allocations.
Tools:
- dcgm-exporter with Prometheus and Grafana for device metrics; Kueue and Volcano both expose queue metrics.
- kube-state-metrics plus autoscaler/Karpenter metrics for capacity and scale-up latency.
Operational playbooks:
- Alert when GPUs sit idle while queues are backlogged; that is a scheduling or quota problem, not a capacity problem (a sketch follows).
- Review MIG layouts and quota splits on a regular cadence against observed queue wait times.
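As a sketch of that first playbook item, assuming dcgm-exporter and Kueue metrics are scraped by a Prometheus Operator setup (metric names and thresholds are illustrative):
`yaml
# Sketch: fire when GPUs are mostly idle but workloads are still waiting for admission.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-scheduling-slo
spec:
  groups:
    - name: gpu-scheduling
      rules:
        - alert: GPUsIdleWhileQueueBacklogged
          expr: |
            avg(DCGM_FI_DEV_GPU_UTIL) < 30
            and
            sum(kueue_pending_workloads) > 0
          for: 30m
          labels:
            severity: warning
          annotations:
            summary: GPUs under 30% utilization while workloads are pending admission
`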
Common pitfalls and how to avoid them
A short decision guide
Closing thoughts
The organizations that win with AI at scale in 2025 made a cultural shift: they treat GPUs as a shared, policy-driven substrate governed by queues, not as pets hand-assigned to projects. Kubernetes can be a great fit if you embrace GPU-aware primitives: device plugins, MIG and MPS, gang admission, quotas and cohorts, and preemption that aligns with your business.
Start small: pick two GPU families, define three queues, set a handful of MIG layouts, and turn on Kueue. Watch your queue metrics and GPU utilization for a month. Then, add Volcano where you need stricter gang semantics. Keep iterating on your layouts and priorities. The goal isn’t to perfectly pack every GPU every minute; it’s to keep your teams productive, your SLAs honest, and your cloud bill boring.