Kubernetes GPU Scheduling in 2025: Practical Patterns for AI Workloads with Kueue, Volcano, and MIG
AI infrastructure is GPU-bound, not pod-bound. In 2025, the clusters that ship models to production and keep R&D humming are those that schedule GPUs as a first-class resource across multi-tenant, bursty, heterogeneous fleets. This article is a pragmatic, opinionated playbook for GPU-aware Kubernetes: how to combine device plugins, gang scheduling, queues and preemption, MIG/MPS partitioning, bin-packing, spot fallback, and orchestration via Kueue, Volcano, and Ray.
If you want the nutshell: use the NVIDIA GPU Operator for device management, Kueue or Volcano (or both) for batch admission control and gang semantics, MIG for stable isolation and bin-packing small jobs, MPS for throughput on latency-tolerant inference, and a queue-first policy model to keep GPU utilization high without blowing up your tenants.
Executive summary
- Kueue gives you cluster-wide queues, quotas, cohort borrowing, and preemption that fit enterprise multi-tenant AI. Combined with the default scheduler and the NVIDIA device plugin, it provides gang-admission semantics for Jobs and common ML operators.
- Volcano is still the most mature gang scheduler with rich batch plugins (PodGroup, priorities, preemption, rescheduling). It shines for classic HPC-like AI training and complex co-scheduling.
- MIG (Multi-Instance GPU) turns a single GPU into multiple hardware-sliced instances; use it to pack small training and inference safely. MPS (Multi-Process Service) time-slices compute to boost throughput when latency and isolation constraints allow.
- Bin-packing GPUs is a business decision as much as a technical one. Aim to protect contiguous full-GPU islands for large training and fill the rest with MIG or MPS.
- Spot fallback is viable if you checkpoint aggressively and model your queue policies around interruption. Integrate with Kueue or Volcano preemption rules, taints/tolerations, and an autoscaler (CA or Karpenter).
- Ray integrates well with both Kueue and Volcano, but mind the gang semantics: the cluster must reserve the whole actor set or you invite deadlock.
Why GPU-bound beats pod-bound
Kubernetes has a pod-centric scheduler. AI workloads care about the count, type, and topology of GPUs. On their own, the default scheduler's scoring and fit predicates can't coordinate multi-GPU and multi-node placement, network topology, and tenant quotas.
You need to:
- Admit or reject entire jobs atomically (gang semantics) so a training job doesn’t get stuck with half of its workers running.
- Enforce tenant- and queue-level quotas across clusters, not just per node.
- Shape GPU instances (full, MIG, or MPS) to match the distribution of your workload footprints.
- Preempt cleanly and fairly to avoid starvation and preserve SLAs.
- Avoid GPU fragmentation so that tomorrow’s 8xH100 training run isn’t blocked by today’s 1g.10gb MIG shards scattered across your best nodes.
Building blocks: GPU device management on Kubernetes
NVIDIA GPU Operator and Device Plugin
In 2025, the NVIDIA GPU Operator remains the default for production. It installs the driver, nvidia-container-toolkit, DCGM, and the device plugin.
- Device plugin exposes GPUs as extended resources (e.g., nvidia.com/gpu) and manages allocation.
- Supports MIG: advertises profile-specific resources (e.g., nvidia.com/mig-1g.10gb) depending on strategy.
- Supports time-slicing and MPS modes to allow multiple pods per GPU with controlled concurrency.
Basic deployment hint:
```bash
helm repo add nvidia https://nvidia.github.io/gpu-operator
helm upgrade --install gpu-operator nvidia/gpu-operator -n gpu-operator --create-namespace \
  --set driver.enabled=true \
  --set toolkit.enabled=true \
  --set mig.strategy=single   # or mixed
```
Key device plugin configuration patterns:
- MIG strategies:
- none: expose full GPUs only.
- single: all GPUs on a node share one MIG profile, and instances are advertised as a uniform nvidia.com/gpu resource (each unit is one MIG slice); simple and stable (recommended for shared clusters).
- mixed: different MIG profiles can coexist per node and are advertised as profile-specific resources (e.g., nvidia.com/mig-1g.10gb); flexible but can increase fragmentation.
- Time-slicing:
- Enable for inference/experimentation to drive utilization.
- Combine with MPS to reduce context-switch overhead.
Example device plugin config via ConfigMap:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    flags:
      migStrategy: single
      nvidiaDriverRoot: /run/nvidia/driver
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4   # up to 4 pods share a single GPU (per-GPU time-slice)
```
Pods requesting GPUs:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: trainer
spec:
  containers:
    - name: main
      image: nvcr.io/nvidia/pytorch:24.08-py3
      resources:
        limits:
          nvidia.com/gpu: 4
```
With MIG configured under the mixed strategy, request profile-specific resources:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: small-trainer
spec:
  nodeSelector:
    nvidia.com/mig.capable: "true"
  containers:
    - name: main
      image: nvcr.io/nvidia/pytorch:24.08-py3
      resources:
        limits:
          nvidia.com/mig-1g.10gb: 1   # H100 80GB example profile
```
Opinion: choose a small number of cluster-wide MIG layouts and stick to them. Stability beats theoretical flexibility, because reconfiguring MIG requires draining the node and can disrupt running jobs.
Topology and CPU pinning
GPU allocation quality matters:
- Topology Manager (kubelet) can align CPU and device NUMA locality.
- NVIDIA device plugin supports preferred allocation for multi-GPU requests (e.g., same NVLink island).
Recommended kubelet settings for GPU nodes:
```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static
topologyManagerPolicy: single-numa-node
```
Also consider Node Feature Discovery (NFD) to label nodes with GPU model, NVLink/NVSwitch presence, and MIG capability, enabling Flavors in Kueue and placement constraints in Volcano.
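For reference, GPU Feature Discovery (deployed by the GPU Operator) and NFD publish node labels along these lines; exact values vary by driver version and GPU model, so treat this as an illustrative excerpt rather than an exhaustive list:

```yaml
# Illustrative excerpt of node labels applied by NFD/GFD (values vary by fleet)
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-01
  labels:
    nvidia.com/gpu.present: "true"
    nvidia.com/gpu.product: NVIDIA-H100-80GB-HBM3   # GPU model, used by ResourceFlavors below
    nvidia.com/gpu.count: "8"
    nvidia.com/gpu.memory: "81920"                  # per-GPU memory
    nvidia.com/mig.capable: "true"
```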
Gang semantics: Kueue vs. Volcano
Both Kueue and Volcano provide job-wide semantics beyond what the default scheduler offers, but they take different approaches.
Kueue: queue-based admission for batch
Kueue inserts an admission control layer before pods are scheduled:
- Workloads (e.g., Kubernetes Job, MPIJob, PyTorchJob, RayJob) are enqueued into LocalQueues.
- LocalQueues bind to ClusterQueues with configured quotas by ResourceGroup and Flavors (e.g., A100 vs. H100 classes).
- Kueue checks whether a workload's aggregate requests fit the ClusterQueue's quota per flavor and either admits the entire workload (gang admission) or keeps it pending.
- Cohorts allow queues to borrow idle quota from peers with weighted fairness.
- Preemption policies determine what can be displaced to make room for higher-priority work.
This model plays well with the default kube-scheduler and the NVIDIA device plugin. It preserves standard scheduling extensions and lets you use upstream semantics and autoscalers.
A minimal Kueue setup:
```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: h100
spec:
  nodeLabels:
    nvidia.com/gpu.product: NVIDIA-H100
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: a100
spec:
  nodeLabels:
    nvidia.com/gpu.product: NVIDIA-A100
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: prod-ai
spec:
  cohort: global
  namespaceSelector: {}
  queueingStrategy: StrictFIFO
  resourceGroups:
    - coveredResources: [cpu, memory, nvidia.com/gpu]
      flavors:
        - name: h100
          resources:
            - name: cpu
              nominalQuota: 1024
            - name: memory
              nominalQuota: 6Ti
            - name: nvidia.com/gpu
              nominalQuota: 128   # 16 nodes x 8 GPUs
        - name: a100
          resources:
            - name: cpu
              nominalQuota: 512   # example quota; each flavor must cover all resources in the group
            - name: memory
              nominalQuota: 3Ti   # example quota
            - name: nvidia.com/gpu
              nominalQuota: 64
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: team-vision
  namespace: team-vision
spec:
  clusterQueue: prod-ai
```
Submitting a job labeled for Kueue admission:
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: resnet-train
  namespace: team-vision
  labels:
    kueue.x-k8s.io/queue-name: team-vision
spec:
  completions: 1
  parallelism: 1
  template:
    metadata:
      labels:
        kueue.x-k8s.io/queue-name: team-vision
    spec:
      restartPolicy: Never
      containers:
        - name: train
          image: nvcr.io/nvidia/pytorch:24.08-py3
          resources:
            requests:
              cpu: "8"
              memory: 64Gi
              nvidia.com/gpu: 8
            limits:
              nvidia.com/gpu: 8
```
Pros of Kueue:
- Works with native controllers; no need to replace kube-scheduler.
- Strong multi-tenant controls (ClusterQueues, cohorts, borrowing).
- Natural extension point for batch autoscaling and preemption policies.
Caveats:
- True gang scheduling across multiple pods relies on the job controller integration with Kueue. Use supported job types or CRDs that integrate with Kueue admission.
- Network topology awareness is delegated to underlying plugin heuristics; Kueue does not directly model NVLink/NVSwitch.
Volcano: battle-tested batch scheduler with PodGroup
Volcano replaces kube-scheduler for targeted workloads and adds batch plugins:
- The PodGroup CRD gives gang semantics; pods in the group are scheduled only once minMember of them can be placed together.
- Rich preemption, rescheduling, and queue priorities.
- Integrations with MPI, TensorFlow, PyTorch operators, and classic HPC stacks.
Example Volcano PodGroup with a gang Job:
```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: research
spec:
  weight: 1
---
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: gpt-train
  namespace: research
spec:
  minMember: 16
  minResources:
    nvidia.com/gpu: 128
  queue: research
---
apiVersion: batch/v1
kind: Job
metadata:
  name: gpt-train
  namespace: research
spec:
  completions: 16
  parallelism: 16
  template:
    metadata:
      annotations:
        scheduling.k8s.io/group-name: gpt-train   # binds the pods to the PodGroup above
    spec:
      schedulerName: volcano
      restartPolicy: Never
      containers:
        - name: worker
          image: nvcr.io/nvidia/pytorch:24.08-py3
          resources:
            limits:
              nvidia.com/gpu: 8
```
Pros of Volcano:
- Mature gang scheduling, preemption, and batch policies.
- Tunable scoring plugins helpful for bin-packing and fragmentation control.
Caveats:
- You are running an alternative scheduler; ensure compatibility with other controllers and policies.
- Multi-scheduler setups require care to avoid policy conflicts.
Choosing Kueue vs. Volcano (or both)
- Start with Kueue if you need enterprise multi-tenant controls, fair sharing, and easy integration with Job-like controllers without replacing kube-scheduler.
- Choose Volcano if your workloads resemble HPC batch training with strict gang semantics and you want deeper control of scheduling internals.
- Combine them for certain patterns: Kueue for queueing and quota, Volcano as the scheduler for admitted workloads. This is advanced and requires careful integration testing; many teams succeed with Kueue + default scheduler plus the coscheduling plugin where needed.
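For the coscheduling route mentioned above, a minimal sketch using the scheduler-plugins PodGroup API; this assumes the upstream coscheduling plugin is installed and enabled in a scheduler profile, and the API group and pod label below are scheduler-plugins conventions (not Kueue or Volcano objects):

```yaml
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: sweep-42
  namespace: research
spec:
  minMember: 4                 # schedule all 4 workers or none of them
  scheduleTimeoutSeconds: 120
---
# Worker pods join the group via this label
apiVersion: v1
kind: Pod
metadata:
  name: sweep-42-worker-0
  namespace: research
  labels:
    scheduling.x-k8s.io/pod-group: sweep-42
spec:
  containers:
    - name: worker
      image: yourrepo/sweep:latest   # illustrative image
      resources:
        limits:
          nvidia.com/gpu: 1
```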
MIG and MPS: partitioning GPUs in anger
MIG profiles: hard isolation and predictable performance
MIG partitions Ampere/Hopper GPUs into independent instances with dedicated memory, cache, and compute slices. Common profiles:
- A100 80GB: 1g.10gb, 2g.20gb, 3g.40gb, 7g.80gb
- H100 80GB: 1g.10gb, 2g.20gb, 3g.40gb, 7g.80gb
Patterns:
- Stable shared clusters: pick 2–3 layouts per GPU family, e.g.,
- A100: either 7x 1g.10gb (for inference/experimentation) or 1x 7g.80gb (for big training).
- H100: a common compromise is 2x 3g.40gb for medium training plus 1x 1g.10gb leftover per GPU; but be wary of fragmentation.
- Keep a subset of nodes MIG-disabled to host large training. Label these nodes and target them via Flavors.
Operational realities:
- Switching MIG layouts requires cordoning and draining nodes; plan change windows (a sketch follows after this list).
- Kueue ResourceFlavors can encode MIG presence via node labels to steer workloads.
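A change-window sketch, assuming the GPU Operator's MIG Manager is running and watching the nvidia.com/mig.config node label (profile names come from its mig-parted ConfigMap; all-1g.10gb and all-disabled are among the defaults, and the node name here is illustrative):

```bash
# Drain the node inside a change window, switch the MIG layout, then return it to service
kubectl cordon gpu-node-07
kubectl drain gpu-node-07 --ignore-daemonsets --delete-emptydir-data

# MIG Manager picks up the label change and reconfigures the GPUs
kubectl label node gpu-node-07 nvidia.com/mig.config=all-1g.10gb --overwrite

# Wait until MIG Manager reports nvidia.com/mig.config.state=success on the node, then:
kubectl uncordon gpu-node-07
```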
Example MIG-aware ResourceFlavor:
```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: h100-mig
spec:
  nodeLabels:
    nvidia.com/mig.capable: "true"
    nvidia.com/gpu.product: NVIDIA-H100
```
Then request MIG resources in jobs:
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: micro-batch-infer
  labels:
    kueue.x-k8s.io/queue-name: team-infer
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: serve
          image: nvcr.io/nvidia/tritonserver:24.08-py3
          resources:
            limits:
              nvidia.com/mig-1g.10gb: 1
```
MPS and time-slicing: high throughput, softer isolation
MPS allows multiple CUDA contexts to share a GPU with improved concurrency over raw time-slicing. It boosts throughput for:
- Latency-tolerant inference
- Hyperparameter sweeps
- Lightweight fine-tuning/LoRA tasks
Trade-offs:
- Interference: jobs can compete for SMs and memory bandwidth.
- Accounting: measuring per-job utilization is fuzzier; use DCGM and per-pod GPU metrics.
Kubernetes pattern:
- Enable device plugin time-slicing.
- For NVIDIA MPS, enable the device plugin's MPS sharing mode (or manage an MPS control daemon at the node level) to run multiple pods per GPU with per-client limits on active threads and memory (sketched below).
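A hedged sketch of the device plugin's MPS sharing stanza; this assumes a recent k8s-device-plugin (roughly v0.15+), and field names may differ slightly across versions, so check the chart you actually deploy:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-mps
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    sharing:
      mps:
        resources:
          - name: nvidia.com/gpu
            replicas: 4   # up to 4 pods share each GPU through the MPS control daemon
```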
Example pod requesting a shared GPU on a node configured for time-slicing (the replica count and sharing behavior come from the device plugin ConfigMap, not the pod spec):
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: batch-infer
  labels:
    nvidia.com/gpu.workload.config: timeslice   # informational marker; sharing is configured per node
spec:
  containers:
    - name: infer
      image: yourrepo/infer:latest
      resources:
        limits:
          nvidia.com/gpu: 1
```
Opinion: prefer MIG for SLO-bound inference or when noisy neighbors are unacceptable. Use MPS/time-slicing when your objective is raw throughput and jobs are resilient to interference.
Gang scheduling patterns for AI
Multi-GPU training requires that all workers and parameter servers or collective participants start together. Patterns:
- Single-node, multi-GPU: request N GPUs in a single pod or use a Job with parallelism=1 and nvidia.com/gpu: N. The device plugin will try to allocate GPUs on the same NVLink island.
- Multi-node, multi-GPU: use a job controller that understands distributed workers (MPIJob, PyTorchJob, Ray). Ensure gang admission either with Kueue (admit workload only when all pods can be created) or Volcano PodGroup with minMember.
- Network topology: for NCCL efficiency, prefer nodes connected via NVSwitch or fat-tree fabric. Label nodes and encode this via Flavors or node selectors.
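For the topology point above, a sketch of pinning a job's workers to a single fabric domain; the example.com/nvswitch-domain label is hypothetical, so substitute whatever label your NFD rules or provisioning pipeline actually apply:

```yaml
# Pod template fragment: keep all workers of one job inside a single (hypothetical) fabric domain
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: example.com/nvswitch-domain   # hypothetical label set by your own NFD rule
                operator: In
                values: ["fabric-a"]
  containers:
    - name: worker
      image: yourrepo/pytorch-train:latest
      resources:
        limits:
          nvidia.com/gpu: 8
```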
Example Kueue-admitted PyTorchJob (via the Training Operator's Kueue integration):
```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: imagenet-train
  namespace: research
  labels:
    kueue.x-k8s.io/queue-name: research
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
            - name: pytorch
              image: yourrepo/pytorch-train:latest
              resources:
                limits:
                  nvidia.com/gpu: 8
    Worker:
      replicas: 15
      template:
        spec:
          containers:
            - name: pytorch
              image: yourrepo/pytorch-train:latest
              resources:
                limits:
                  nvidia.com/gpu: 8
```
Opinion: avoid ad hoc scripts that spin up pods manually. Use a controller with native retry and topology hints, and wrap it with Kueue or Volcano gang semantics to prevent zombie partial allocations.
Queues, priorities, and preemption that reflect your business
Your GPU policy should be declared in queues, not in pager duty. Kueue’s ClusterQueues and Volcano’s Queues provide the leverage.
Kueue quota, cohorts, and borrowing
- Quotas: set nominal quotas per ResourceFlavor (e.g., H100 vs. A100, MIG vs. full GPU).
- Cohorts: group ClusterQueues into a borrowing pool; idle capacity is lent to peers with configured weights.
- WorkloadPriorityClass: define business priorities that inform fair share and preemption.
Example WorkloadPriorityClasses (preemption behavior itself is configured on the ClusterQueue; see below):
```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: critical
value: 1000
description: "Business-critical training and serving; may displace lower priorities"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: batch
value: 100
description: "Backfill work; first to be displaced"
```
Then tag Jobs:
```yaml
metadata:
  labels:
    kueue.x-k8s.io/queue-name: prod
    kueue.x-k8s.io/priority-class: critical
```
Preemption guidance:
- Prefer queue-level preemption first: evict within the same queue when priorities differ.
- Allow cross-queue preemption sparingly and with grace periods.
- For spot workloads, model them with lower priority, taints/tolerations, and checkpointing so they are the first to be evicted.
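In Kueue, these knobs live on the ClusterQueue; a minimal sketch of the preemption stanza (field names per the kueue.x-k8s.io/v1beta1 API; verify the enum values against your Kueue version):

```yaml
# ClusterQueue fragment: preemption policy knobs
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: prod-ai
spec:
  cohort: global
  preemption:
    withinClusterQueue: LowerPriority   # evict lower-priority workloads in the same queue first
    reclaimWithinCohort: Any            # reclaim capacity lent to cohort peers when needed
  # resourceGroups as defined earlier
```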
Volcano priorities and preemption
Volcano’s preempt action (working with the priority and gang plugins) can:
- Preempt lower-priority PodGroups to satisfy a higher-priority group.
- Consider gang integrity: preempt enough pods to free resources for the entire high-priority group.
- Support rescheduling plugins to backfill.
Example Volcano priority class:
```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: volcano-critical
value: 100000
globalDefault: false
preemptionPolicy: PreemptLowerPriority
```
GPU bin-packing: minimize fragmentation, maximize throughput
Bin-packing GPUs is about preserving large contiguous allocations and backfilling intelligently.
Tactics:
- Separate node pools by GPU family and MIG policy. Expose them as distinct Flavors.
- Score nodes with a bin-packing bias (e.g., NodeResourcesFit with MostAllocated) for jobs requesting full GPUs, to pack them tightly and free whole nodes for future big jobs.
- For MIG nodes, choose a small set of layouts that correlate with request patterns (e.g., lots of 1g.10gb and a reasonable share of 3g.40gb). Avoid exotic profiles.
- Use anti-affinity judiciously. Spreading workers across too many nodes increases cross-node traffic; prefer packing within NVSwitch domains.
Kube-scheduler plugins to consider:
- NodeResourcesFit with ScoringStrategy Type=MostAllocated.
- PodTopologySpread for resilience of services, but turn it off for tightly-coupled training.
- Coscheduling plugin for soft gang semantics if not using Kueue/Volcano gangs.
Example SchedulerConfiguration tuned for bin-packing bias (when you manage your own scheduler profile):
```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated
            resources:
              - name: nvidia.com/gpu
                weight: 10
              - name: cpu
                weight: 1
              - name: memory
                weight: 1
```
Opinion: many teams over-index on spread for perceived fairness. For GPUs, co-location beats spread unless you have a strong failure-domain requirement.
Spot fallback: cheaper GPUs without chaos
Spot instances can cut costs 50–80% but demand resilience.
Control-plane patterns:
- Isolate spot nodes with taints like spot=true:NoSchedule.
- Add tolerations only to restartable jobs.
- Use PriorityClasses so spot jobs are preempted first.
- Configure Cluster Autoscaler node groups or Karpenter NodePools separately for spot and on-demand, with explicit labels (e.g., capacity-type: spot).
Example Karpenter NodePools (v1beta1 API; the cloud-specific nodeClassRef is omitted for brevity):
```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: gpu-spot
spec:
  template:
    metadata:
      labels:
        capacity-type: spot
        nvidia.com/gpu.present: "true"
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
      taints:
        - key: spot
          value: "true"
          effect: NoSchedule
      # nodeClassRef (cloud-specific) omitted
  disruption:
    consolidationPolicy: WhenUnderutilized
  limits:
    cpu: 2000
---
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: gpu-ondemand
spec:
  template:
    metadata:
      labels:
        capacity-type: on-demand
        nvidia.com/gpu.present: "true"
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
      # nodeClassRef (cloud-specific) omitted
```
Workload patterns:
- Checkpoint frequently (every few minutes) to object storage. For PyTorch, use torch.save or torch.distributed.checkpoint, or a library such as torchsnapshot.
- Use preStop hooks and a generous termination grace period to save progress on SIGTERM (see the sketch after this list).
- For Ray, enable the autoscaler so interrupted workers are replaced and actors/tasks can be rescheduled.
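A sketch of the SIGTERM path, assuming your training loop polls for a sentinel file and flushes a checkpoint when it appears; the paths, image, and grace period are illustrative:

```yaml
# Pod template fragment for a restartable spot trainer
spec:
  terminationGracePeriodSeconds: 300          # give the trainer time to flush its checkpoint
  containers:
    - name: trainer
      image: yourrepo/llama-train:latest      # illustrative image
      lifecycle:
        preStop:
          exec:
            # Illustrative: drop a sentinel the training loop watches, then wait for the flush
            command: ["/bin/sh", "-c", "touch /tmp/checkpoint-now && sleep 240"]
```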
Pod tolerations and node selectors:
```yaml
spec:
  tolerations:
    - key: spot
      operator: Equal
      value: "true"
      effect: NoSchedule
  nodeSelector:
    capacity-type: spot
```
Queue policy:
- Define spot queues with lower borrowing priority. Allow them to consume idle capacity but be preempted immediately when on-demand work arrives.
Orchestrating with Ray: clusters, jobs, and queues
Ray is a popular substrate for distributed training, tuning, and batch inference. On Kubernetes you have two main modes:
- RayCluster CRD: a long-lived cluster with head and worker groups; workloads are submitted as Ray Jobs.
- Per-job ephemeral clusters: each Job spins up a cluster, runs, then tears down.
GPU-aware patterns:
- Define per-group Pod templates that request GPUs (full or MIG) and pin to flavors.
- Use Kueue admission for RayJobs so the entire cluster is admitted as a gang.
- For Volcano, put Ray pods into a PodGroup.
Example RayCluster with MIG workers and Kueue queueing:
```yaml
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: ray-mig
  namespace: ml
  labels:
    kueue.x-k8s.io/queue-name: ml-shared
spec:
  headGroupSpec:
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:2.32.0
            resources:
              requests:
                cpu: "8"
                memory: 32Gi
  workerGroupSpecs:
    - groupName: mig-workers
      replicas: 4
      template:
        spec:
          containers:
            - name: ray-worker
              image: rayproject/ray:2.32.0
              resources:
                limits:
                  nvidia.com/mig-1g.10gb: 1
```
Submitting a RayJob that should be admitted atomically:
```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: ray-train
  namespace: ml
  labels:
    kueue.x-k8s.io/queue-name: ml-shared
spec:
  entrypoint: python train.py --epochs 5
  rayClusterSelector:
    rayClusterName: ray-mig
```
Opinion: ephemeral Ray clusters per Job work well with Kueue because admission can model the cluster’s entire footprint. Long-lived clusters are fine for interactive use but can suffer from internal fragmentation; implement internal Ray resource reservations to mitigate.
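A sketch of the ephemeral pattern using KubeRay's RayJob with an embedded rayClusterSpec and shutdownAfterJobFinishes; images, sizes, and the entrypoint are illustrative:

```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: sweep-ephemeral
  namespace: ml
  labels:
    kueue.x-k8s.io/queue-name: ml-shared
spec:
  entrypoint: python sweep.py
  shutdownAfterJobFinishes: true      # tear the cluster down when the job completes
  rayClusterSpec:
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.32.0
              resources:
                requests:
                  cpu: "4"
                  memory: 16Gi
    workerGroupSpecs:
      - groupName: gpu-workers
        replicas: 2
        rayStartParams: {}
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray:2.32.0
                resources:
                  limits:
                    nvidia.com/gpu: 1
```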
A reference architecture that works in practice
- Node pools:
- pool-h100-full: H100 nodes with MIG disabled for large training.
- pool-h100-mig: H100 nodes with a fixed layout (e.g., 7x 1g.10gb) for inference and small jobs.
- pool-a100-mig: Backfill and lower-priority training.
- pool-gpu-spot: Mixed GPU types for low-priority work.
- Kueue configuration:
- ResourceFlavors for h100, h100-mig, a100-mig, gpu-spot.
- ClusterQueues: prod, research, batch. prod has highest priority and can borrow, research is capped but can borrow from batch, batch is lowest and can use spot.
- WorkloadPriorityClasses: critical, high, normal, spot.
- Default scheduler with bin-packing bias and Topology Manager.
- NVIDIA GPU Operator with one fixed MIG layout per MIG pool, exposed via the mixed strategy so jobs can request profile-named resources (e.g., nvidia.com/mig-1g.10gb); time-slicing enabled only on batch and spot pools.
- Observability: DCGM exporter, Prometheus, per-queue dashboards; alert on queue wait time SLOs and GPU idle rates.
Sample Kueue ClusterQueues with cohorts:
```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: prod
spec:
  cohort: global
  resourceGroups:
    - coveredResources: [nvidia.com/gpu]
      flavors:
        - name: h100
          resources:
            - name: nvidia.com/gpu
              nominalQuota: 256
    - coveredResources: [nvidia.com/mig-1g.10gb]
      flavors:
        - name: h100-mig
          resources:
            - name: nvidia.com/mig-1g.10gb
              nominalQuota: 128
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: research
spec:
  cohort: global
  resourceGroups:
    - coveredResources: [nvidia.com/gpu]
      flavors:
        - name: h100
          resources:
            - name: nvidia.com/gpu
              nominalQuota: 64
    - coveredResources: [nvidia.com/mig-1g.10gb]
      flavors:
        - name: a100-mig
          resources:
            - name: nvidia.com/mig-1g.10gb
              nominalQuota: 256
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: batch
spec:
  cohort: global
  resourceGroups:
    - coveredResources: [nvidia.com/gpu]
      flavors:
        - name: gpu-spot
          resources:
            - name: nvidia.com/gpu
              nominalQuota: 512
```
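The jobs below reference queue names prod and batch from their own namespaces, so each tenant namespace needs a LocalQueue bound to the matching ClusterQueue, for example:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: prod
  namespace: prod-ml
spec:
  clusterQueue: prod
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: batch
  namespace: research
spec:
  clusterQueue: batch
```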
Now, a high-priority training job targeting full H100s:
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: llama-train
  namespace: prod-ml
  labels:
    kueue.x-k8s.io/queue-name: prod
    kueue.x-k8s.io/priority-class: critical
spec:
  parallelism: 32
  completions: 32
  template:
    spec:
      restartPolicy: Never
      nodeSelector:
        nvidia.com/gpu.product: NVIDIA-H100
      containers:
        - name: trainer
          image: yourrepo/llama-train:latest
          resources:
            limits:
              nvidia.com/gpu: 8
```
A backfill batch inference job on spot capacity (time-sliced GPUs on the spot pool, matching the batch queue's gpu-spot quota):
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: nightly-infer
  namespace: research
  labels:
    kueue.x-k8s.io/queue-name: batch
    kueue.x-k8s.io/priority-class: spot
spec:
  parallelism: 100
  template:
    spec:
      restartPolicy: Never
      tolerations:
        - key: spot
          operator: Equal
          value: "true"
          effect: NoSchedule
      nodeSelector:
        capacity-type: spot
      containers:
        - name: infer
          image: yourrepo/infer:latest
          resources:
            limits:
              nvidia.com/gpu: 1
```
Observability and SLOs for GPU scheduling
KPIs that matter:
- GPU utilization: target 65–85% average at the node and fleet levels. Below 50% indicates fragmentation or policy misfit.
- Queue wait time: 50th/95th percentiles by queue and priority. Keep p95 under your agreed SLO (e.g., 2 hours for research, 10 minutes for prod-critical).
- Preemption rate and lost work: track time lost to preemption; if it exceeds 5–10% for a queue, invest in checkpointing and better preemption windows.
- Fragmentation metrics: percent of GPUs available as full vs. only MIG shards; stranded capacity by profile.
Tools:
- DCGM Exporter for GPU metrics; scrape with Prometheus and export to your APM.
- Kueue metrics: admitted vs. pending workloads, borrowing, preemption counts.
- Volcano metrics: queue lengths, PodGroup states, preemptions.
- Custom dashboards by queue and flavor.
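As one example of alerting on these KPIs, a sketch of a Prometheus rule on dcgm-exporter's utilization metric (this assumes the Prometheus Operator's PrometheusRule CRD; DCGM_FI_DEV_GPU_UTIL is the exporter's standard utilization gauge, while the threshold and window are yours to tune):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-scheduling-slos
  namespace: monitoring
spec:
  groups:
    - name: gpu-utilization
      rules:
        - alert: GPUFleetUnderutilized
          expr: avg(DCGM_FI_DEV_GPU_UTIL) < 50   # fleet-wide average below 50%
          for: 2h
          labels:
            severity: warning
          annotations:
            summary: "Fleet GPU utilization under 50% for 2h; check fragmentation and queue policy"
```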
Operational playbooks:
- Weekly planning: adjust MIG layouts and quotas based on observed footprint distribution.
- Hot fixes: if a large training is starved, temporarily drain MIG nodes to revert them to full GPUs; communicate change windows.
- Auto-tuning: feed observed request histograms back to templates so developers choose profile sizes that fit your layout.
Common pitfalls and how to avoid them
- Too many MIG layouts: operational thrash and stranded capacity. Limit layouts to those you actually need.
- Pretending MPS is isolation: it is not. Use MPS only when SLOs allow interference.
- Ignoring CPU and memory: oversubscribed CPUs can throttle your GPUs. Allocate sufficient CPU/memory with static CPU manager where needed.
- No gang semantics for Ray: partial startup frequently deadlocks actor sets. Use Kueue or Volcano to admit the whole cluster.
- Preemption without checkpoints: you will waste GPU hours. Make checkpointing a policy, not a suggestion.
- Spreading training pods across racks without a reason: network becomes your bottleneck. Prefer NVSwitch/NVLink locality.
- One giant prod queue: it looks fair until one team backfills forever. Use per-tenant LocalQueues bound to ClusterQueues with borrowing.
A short decision guide
- Need tight multi-tenant quotas and admission with minimal disruption to upstream controllers? Use Kueue + default scheduler.
- Need HPC-like gang scheduling and intricate preemption behaviors? Use Volcano.
- Mix of both with strong governance? Use Kueue for queues/admission and carefully integrate Volcano for scheduling select namespaces.
- Serving and small training: consider MIG or MPS. Big training: reserve full GPUs.
- On a budget with resilience: spot fallback with checkpointing, taints/tolerations, and preemptible queues.
Closing thoughts
The organizations that win with AI at scale in 2025 made a cultural shift: they treat GPUs as a shared, policy-driven substrate governed by queues, not as pets hand-assigned to projects. Kubernetes can be a great fit if you embrace GPU-aware primitives: device plugins, MIG and MPS, gang admission, quotas and cohorts, and preemption that aligns with your business.
Start small: pick two GPU families, define three queues, set a handful of MIG layouts, and turn on Kueue. Watch your queue metrics and GPU utilization for a month. Then, add Volcano where you need stricter gang semantics. Keep iterating on your layouts and priorities. The goal isn’t to perfectly pack every GPU every minute; it’s to keep your teams productive, your SLAs honest, and your cloud bill boring.
Further reading
- NVIDIA GPU Operator and device plugin: https://github.com/NVIDIA/gpu-operator
- Kueue docs: https://kueue.sigs.k8s.io/
- Volcano scheduler: https://volcano.sh/
- Kubernetes Topology Manager: https://kubernetes.io/docs/tasks/administer-cluster/topology-manager/
- Node Feature Discovery: https://github.com/kubernetes-sigs/node-feature-discovery
- Ray on K8s: https://docs.ray.io/en/latest/cluster/kubernetes/index.html
- DCGM Exporter: https://github.com/NVIDIA/dcgm-exporter