Guide

GPU Sharing in Kubernetes with vCluster

From DIY GPU Workarounds to Scalable, Isolated Multi-Tenant GPU Infrastructure

1. Introduction

GPUs have become standard equipment in Kubernetes environments. The explosion of AI and ML workloads has made them essential, but GPUs are expensive, and most organizations don't use them anywhere close to capacity. An 8xH100 server can cost $250,000. If that hardware sits idle between training jobs, the economics of bringing AI in-house start to fall apart.

The core problem: Kubernetes can only assign whole GPUs to individual pods. There's no native way to share a single GPU across multiple workloads. This forces teams into a trade-off: either dedicate expensive hardware to single tenants and accept low utilization, or try to share GPUs using a patchwork of tools that weren't designed for multi-tenant containers.

NVIDIA offers technologies like time-slicing and MIG (Multi-Instance GPU) to address this, but each comes with real limitations - weak isolation, hardware rigidity, or both. DIY workarounds exist, but they tend to be fragile and hard to maintain. What's missing is a solution at a higher level of abstraction: one that provides strong tenant isolation and efficient GPU sharing without being tied to specific hardware.

This guide covers the full picture. We start with the GPU sharing problem in Kubernetes, walk through every major DIY approach and its trade-offs, explain why those approaches fall short for production multi-tenancy, and show how vCluster solves the problem at the Kubernetes orchestration layer. The final section is a hands-on tutorial: you'll deploy a virtual cluster on GPU-enabled Kubernetes and run a real PyTorch workload inside it.

2. The GPU Sharing Problem in Kubernetes

Kubernetes supports GPU allocation through device plugins provided by your GPU vendor - NVIDIA, AMD, or Intel. With a device plugin installed, you can assign GPUs to pods using the standard resource requests and limits system. For example, a pod can claim an NVIDIA GPU by specifying nvidia.com/gpu: 1 in its resource limits.

The mechanism is simple and predictable, but it has one major drawback: you can only assign whole GPU instances. Fractional requests like nvidia.com/gpu: 0.5 aren't supported. Every pod that needs GPU access must consume an entire physical GPU, even if it only uses a fraction of its capacity.
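For illustration, here is what a whole-GPU request looks like in practice - a minimal sketch, with the pod name chosen for this example and the CUDA base image reused from the tutorial later in this guide:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-consumer
spec:
  restartPolicy: Never
  containers:
  - name: app
    image: nvidia/cuda:12.2.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1   # whole GPUs only - fractional values are rejected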

This is wasteful by default. AI development is characterized by bursts of intense activity followed by periods of inactivity. A data scientist might run a training job that consumes multiple GPUs for twelve hours, and then those GPUs sit completely idle. If a GPU that costs tens of thousands of dollars is dedicated to a single team, utilization can easily drop below 20–30 percent.

This financial pressure forces platform engineers to treat the GPU fleet as a shared pool. The goal is to dynamically assign workloads and keep hardware consistently busy. But Kubernetes doesn't make this easy out of the box.

3. DIY Approaches to GPU Sharing

Because Kubernetes lacks native GPU sharing, every approach requires manual configuration. Each has different implications for isolation, flexibility, and operational complexity. Here's what's available.

3.1 Time-Slicing with the NVIDIA GPU Operator

GPU time-slicing is available through the NVIDIA GPU Operator. It shares a GPU between multiple pods by creating logical replicas of the device. Activity on any replica runs on the same physical GPU, and the available compute time is divided evenly between running processes.

To set it up, install the NVIDIA GPU Operator via Helm, then create a ConfigMap that defines how many replicas each GPU should expose. For example, setting replicas: 4 splits each GPU into four time-sliced replicas. After applying the ConfigMap and patching the cluster policy, your nodes will report four allocatable GPU instances per physical GPU.
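As a sketch of that configuration (the resource name and structure follow the GPU Operator's documented time-slicing format; the ConfigMap name, data key, and gpu-operator namespace are conventions you can change):

apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4    # each physical GPU is advertised as 4 schedulable replicas

Applying it and pointing the device plugin at the config looks roughly like this:

kubectl apply -f time-slicing-config.yaml
kubectl patch clusterpolicies.nvidia.com/cluster-policy -n gpu-operator --type merge \
  -p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config", "default": "any"}}}}'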

From there, pods can request GPU access as usual with nvidia.com/gpu: 1 - up to the configured replica count.

Pros

  • Simple to configure - You need only the GPU Operator and a ConfigMap.
  • Supports large replica counts - A single GPU can be sliced into hundreds of replicas for small or infrequent workloads.
  • Flexible - You can configure different sharing settings per node or per GPU.

Cons

  • No private memory - Memory is shared across all replicas. There's no way to restrict per-pod memory usage, which leads to resource contention and noisy-neighbor problems.
  • No fault isolation - GPU replicas aren't isolated at the hardware level. A fault in one task can affect others on the same GPU.
  • Compute time isn't proportional - A pod requesting two replicas doesn't get twice the compute time. Time is shared among all processes on the GPU, not divided between pods.
  • Context-switching overhead - The GPU must continually switch between tasks, which can add small but hard-to-diagnose performance costs.

3.2 NVIDIA MIG (Multi-Instance GPU)

MIG (Multi-Instance GPU) is a hardware-level partitioning feature available on newer NVIDIA data-center GPUs (A100, H100, and later). It splits a single GPU into up to seven self-contained instances, each with its own private memory, dedicated compute paths, and full fault isolation. Each MIG partition behaves like a separate physical GPU.

To enable MIG, install the GPU Operator with a MIG strategy configured, then label your nodes with the desired MIG profile (e.g., all-1g.5gb for seven partitions on an A100). The GPU Operator handles the partitioning. Pods request GPU access the same way - nvidia.com/gpu: 1 - but each request now gets exclusive access to a private, isolated partition.
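As a sketch of those steps (the node name is a placeholder, and this assumes NVIDIA's Helm repo is already added; with the single MIG strategy the partitions are still advertised as nvidia.com/gpu):

# Install the GPU Operator with a MIG strategy
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --set mig.strategy=single

# Ask the MIG Manager to partition every GPU on the node into 1g.5gb slices
kubectl label nodes <gpu-node-name> nvidia.com/mig.config=all-1g.5gb --overwrite

# The node should now report seven allocatable GPU instances per physical GPU
kubectl describe node <gpu-node-name> | grep nvidia.com/gpu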

Pros

  • Strong isolation - Each partition has private memory and hardware-enforced fault isolation. No noisy-neighbor risk.
  • Hardware-enforced boundaries - Each instance gets its own path through the GPU's memory system - caches, controllers, and compute units.
  • Full observability - MIG is compatible with NVIDIA's DCGM-Exporter, so you can monitor utilization per partition via Prometheus.

Cons

  • Requires specific hardware - MIG only works on modern high-end NVIDIA GPUs (Ampere generation and later). Older or consumer GPUs are not supported.
  • Fixed partition sizes - You choose from predefined profiles. If your workloads don't match the available slice sizes, you waste capacity. For example, partitioning a 96GB H100 into three slices gives each 24GB - leaving 24GB unallocated.
  • Maximum seven partitions - You can't create arbitrary numbers of instances.
  • Profile changes are disruptive - Switching MIG profiles changes available partitions, which can break existing workload scheduling.

3.3 Combining MIG and Time-Slicing

MIG and time-slicing are independent mechanisms, and you can combine them. First, partition the GPU using MIG to create isolated instances. Then, apply time-slicing to each MIG partition to create multiple replicas per partition.

For example, with seven MIG partitions and four time-sliced replicas each, a single physical GPU can serve up to twenty-eight GPU resource requests. This is useful when you need isolation between workloads (via MIG) but also want multiple pods per workload (via time-slicing within each partition).

The combination partially addresses the weaknesses of each approach: MIG provides the isolation that time-slicing lacks, while time-slicing adds the density that MIG's seven-partition limit restricts.
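A hedged sketch of the combined configuration, assuming the mixed MIG strategy so that each partition is advertised under its own resource name (such as nvidia.com/mig-1g.5gb), which the time-slicing config can then target:

apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  mig-shared: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/mig-1g.5gb
          replicas: 4

Pods would then request nvidia.com/mig-1g.5gb: 1 and share one of the seven isolated partitions with up to three other pods.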

3.4 Alternative Workarounds

When MIG and time-slicing don't fit - maybe you lack MIG-compatible hardware, need stronger isolation than time-slicing, or run non-NVIDIA GPUs - there are several DIY alternatives. Each comes with significant trade-offs.

Node Labeling and Taints - You can apply taints to GPU nodes so only approved pods schedule there, and use node affinity to target specific GPU types. This doesn't share GPUs between workloads, but it controls which pods get access. Pair it with a custom admission controller for enforcement.
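A sketch of that pattern (node name and label are illustrative):

kubectl taint nodes gpu-node-1 nvidia.com/gpu=present:NoSchedule
kubectl label nodes gpu-node-1 gpu-type=nvidia-l4

A pod that should land on that node then carries the matching toleration and selector:

apiVersion: v1
kind: Pod
metadata:
  name: approved-gpu-pod
spec:
  nodeSelector:
    gpu-type: nvidia-l4        # target a specific GPU type
  tolerations:
  - key: nvidia.com/gpu        # only pods with this toleration can schedule here
    operator: Exists
    effect: NoSchedule
  containers:
  - name: app
    image: nvidia/cuda:12.2.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1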

Resource Quotas - Kubernetes ResourceQuotas can limit the number of nvidia.com/gpu resources available per namespace. This lets you fairly distribute GPU access across teams — useful with or without other sharing mechanisms.
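For example, a quota that caps a hypothetical team-a namespace at four GPUs (note that extended resources can only be quota-limited via the requests. prefix):

apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: team-a
spec:
  hard:
    requests.nvidia.com/gpu: "4"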

User-Space Multiplexing - Tools like CUDA MPS, GVirtuS, and LibVF.IO virtualize GPU access at the OS level. MPS dispatches CUDA commands from multiple processes to a single GPU, but runs everything in a shared memory space - no isolation. GVirtuS is experimental. LibVF.IO supports NVIDIA, AMD, and Intel but hasn't been actively maintained in over two years.

Container-Level cgroups and Runtime Hacks - You can use cgroup hooks to control which GPU devices are visible to specific containers. This offers basic access control but requires custom runtime configuration and doesn't provide real sharing or isolation.

All four of these approaches produce fragile systems that are difficult to configure, maintain, and debug. The operational overhead often exceeds the cost of just buying better hardware.

4. Why DIY GPU Sharing Falls Short

The root problem with every DIY approach is architectural: GPUs were designed for a fundamentally different purpose than multi-tenant Kubernetes workloads. GPUs were built to render pixels on a screen, optimized to run one intensive process at a time as fast as possible. The concept of multiple isolated tenants running concurrent workloads wasn't part of the original design.

This mismatch shows up clearly in the available tools. Time-slicing and MPS provide the appearance of sharing but no real isolation — all processes share the same memory space. MIG solves the isolation problem with hardware-enforced partitions, but only works on specific expensive GPUs and locks you into rigid partition sizes. And the hardware keeps moving: a year from now there will be a new GPU generation with different capabilities and constraints, while the small fortune you spent on the current fleet is already committed.

The alternative workarounds - node labeling, resource quotas, cgroup hacks - are even more limited. They don't provide real sharing or isolation. They produce brittle configurations that lack monitoring, debugging tools, and operational maturity.

What's needed is a solution at a higher level of abstraction: one that provides strong tenant isolation without depending on specific GPU hardware. That's where virtual clusters come in.

5. The Virtual Cluster Solution

Instead of attacking GPU sharing at the hardware or driver layer, vCluster solves it at the Kubernetes orchestration layer. The architecture is straightforward:

  • A host cluster - a single physical Kubernetes cluster connected to the entire pool of physical GPUs, regardless of make or MIG capability. This cluster owns all the physical resources.
  • Virtual clusters - lightweight, fully functional Kubernetes control planes running as pods inside the host cluster. Each virtual cluster has its own API server, controller manager, RBAC, CRDs, and Helm charts. From the tenant's perspective, it looks and behaves like a dedicated Kubernetes cluster. But it provisions in seconds, not minutes.
  • Centralized scheduling - a single scheduler on the host cluster handles workloads from all virtual clusters. It sees resource requests across every tenant and places pods onto the available physical GPUs. This keeps the hardware consistently utilized while maintaining clean tenant boundaries.

The result is two things at once: tenant autonomy and centralized efficiency. Each tenant gets their own cluster with full control — "You can have your own API server, so you can install your own Helm charts and your own versions of your Helm charts and have all the different CRDs that you would prefer to have."

Meanwhile, the underlying GPU fleet is treated as one shared resource pool. The host scheduler dynamically allocates GPUs to workloads from any virtual cluster, "keeping them at a relatively high utilization, like 50 percent, 90 percent."

vCluster doesn't replace MIG or time-slicing; it's fully compatible with both. It also works with GPU workflow platforms like Run:ai, Volcano, and Kubeflow. The difference is that vCluster handles the multi-tenancy layer that these tools don't address. You get isolated environments for each team or customer, centralized GPU scheduling, and full Kubernetes-native resource management, all on the same physical hardware.

6. Use Cases: Enterprise Teams and GPU-as-a-Service

The combination of tenant isolation and shared GPU efficiency makes vCluster particularly valuable in two scenarios.

Internal Enterprise Use

Large companies can provide self-service, isolated Kubernetes environments to dozens or hundreds of development teams. Each team gets its own virtual cluster with full administrative control — their own CRDs, operators, RBAC, and Helm charts. But all teams share the same central GPU infrastructure. The host scheduler keeps GPUs busy across all tenants, turning a fleet of expensive hardware into a shared, high-utilization resource pool.

GPU-as-a-Service

Cloud providers and managed service providers can offer secure, multi-tenant GPU-enabled Kubernetes clusters to their customers. Each customer runs in their own isolated virtual cluster, with no visibility into other tenants. The provider maintains a single GPU fleet and a single host cluster, avoiding the operational complexity of managing one physical cluster per customer.

Real-World Impact: Aussie Broadband

Aussie Broadband, an Australian internet service provider, adopted vCluster to consolidate its development and testing environments. The results:

  • $180,000 in annual cost savings - from reducing the number of physical Kubernetes clusters needed.
  • 2,400 developer hours saved/year - vCluster provisions environments 99 percent faster than spinning up traditional physical clusters.
  • Reduced licensing costs - by moving away from virtual machines for tenancy.

7. Tutorial: Deploying vCluster on GPU-Enabled Kubernetes

With the "why" and "what" covered, let's move to the "how." This section walks through deploying a virtual cluster on a GPU-enabled Kubernetes cluster and running a real deep learning workload inside it.

7.1 Prerequisites

Before you start, make sure you have:

  • Access to a GPU-enabled Kubernetes cluster - managed services like GKE, EKS, or AKS, or a local minikube cluster with GPU passthrough.
  • kubectl - the Kubernetes CLI for interacting with your cluster.
  • Helm - the Kubernetes package manager.
  • vCluster CLI - for virtual cluster operations. See the quick start guide.
  • NVIDIA GPU Operator (or equivalent) - installed and verified, unless your managed provider handles GPU drivers natively (a minimal install sketch follows this list).
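If you do need to install the GPU Operator yourself, a typical Helm installation looks roughly like this (default values shown; adjust to your environment):

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace --wait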

The examples below use Google Kubernetes Engine (GKE) with GPU-enabled nodes. GKE automatically manages NVIDIA GPU driver installation, which simplifies setup and eliminates driver version mismatches.

7.2 Provisioning the GKE Environment

Create a GKE cluster with a single GPU-enabled node:

gcloud container clusters create vcluster-gpu-gke \
 --zone us-east1-b \
 --num-nodes 1 \
 --machine-type g2-standard-4 \
 --accelerator type=nvidia-l4,count=1 \
 --preemptible \
 --scopes=https://www.googleapis.com/auth/cloud-platform \
 --cluster-version latest \
 --enable-ip-alias

Key configuration details:

  • --machine-type g2-standard-4 - a VM designed for GPU workloads with a balanced CPU-to-GPU ratio.
  • --accelerator type=nvidia-l4,count=1 - attaches a single NVIDIA L4 GPU, well suited for AI inference and training.
  • --preemptible - reduces costs for test environments. Avoid this in production since Google can reclaim the instance at any time.
  • --enable-ip-alias - activates VPC-native routing for better networking scalability in multi-tenant setups.

This single-node setup is designed to demonstrate the workflow at the lowest cost. For production, use standard (on-demand) nodes with a multi-node cluster.

Note: GKE provisioning may fail if your Google Cloud project doesn't have sufficient GPU quota (for example, GPUS_ALL_REGIONS or the per-region NVIDIA L4 GPU quota) covering the target zone's region. Check your quota at the Google Cloud Console Quotas page and request an increase if needed.

7.3 Validating GPU Access on the Host

Virtual clusters inherit GPU resources from the host cluster. Any issues with GPU drivers or device plugins on the host will directly affect GPU availability inside your virtual clusters. Validate first.

Check that the NVIDIA device plugin is running:

kubectl get daemonset -n kube-system | grep nvidia

You should see one or more nvidia-gpu-device-plugin DaemonSets with pods scheduled and running.
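It's also worth confirming that the node advertises the GPU as an allocatable resource:

kubectl describe nodes | grep -i "nvidia.com/gpu"

On this single-GPU node, you should see nvidia.com/gpu with a count of 1 under both Capacity and Allocatable.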

Next, verify end-to-end GPU access by running a test pod. Create nvidia-cuda-test.yaml:

apiVersion: v1
kind: Pod
metadata:
  name: nvidia-cuda-test
spec:
  containers:
  - name: cuda-container
    image: nvidia/cuda:12.2.0-base-ubuntu22.04
    resources:
      limits:
        nvidia.com/gpu: 1
    command: ["nvidia-smi"]
    args: ["-L"]
  restartPolicy: Never

Apply it and check the logs:

kubectl apply -f nvidia-cuda-test.yaml
kubectl logs nvidia-cuda-test

Expected output:

GPU 0: NVIDIA L4 (UUID: GPU-xxx-xxxxx-xxxx-xxxx-xxxxxxxxxxx)

This confirms your node's NVIDIA drivers, device plugin, and container runtime are correctly configured. You're ready to create a virtual cluster.

7.4 Setting Up a GPU-Enabled vCluster

vCluster creates isolated Kubernetes environments that inherit access to the host's resources, including GPUs. Virtual clusters have visibility into node-level GPU resources as long as they're available and correctly advertised in the physical cluster.

Start by creating a dedicated namespace for GPU workloads:

kubectl create ns gpu-01

Then deploy a virtual cluster:

vcluster create vcluster-01 --namespace gpu-01

The CLI establishes a port-forwarded connection to the virtual cluster's API server. Keep that terminal open. In a second terminal, connect to the virtual cluster:

vcluster connect vcluster-01 -n gpu-01

Your virtual cluster is now running and ready for GPU workloads.
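As a quick sanity check, you can confirm that GPU capacity is visible from inside the virtual cluster (run this in the terminal connected to vcluster-01; exact node details depend on your vCluster sync settings):

kubectl describe nodes | grep -i "nvidia.com/gpu"

A nvidia.com/gpu count of 1 confirms the host's GPU resource is advertised to the virtual cluster.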

7.5 Testing vCluster GPU Access with a PyTorch Workload

A surface-level nvidia-smi check can miss configuration gaps that only show up under real ML workloads. To rigorously validate GPU passthrough, run a live PyTorch job inside the virtual cluster.

Create pytorch-gpu-check.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: pytorch-gpu-check
spec:
  replicas: 1
  selector:
    matchLabels:
      app: pytorch-gpu-check
  template:
    metadata:
      labels:
        app: pytorch-gpu-check
    spec:
      containers:
      - name: pytorch
        image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
        command: ["python", "-c"]
        args:
          - |
            import torch
            if torch.cuda.is_available():
                print("CUDA is available: ", torch.cuda.get_device_name(0))
            else:
                print("CUDA NOT available")
        resources:
          limits:
            nvidia.com/gpu: 1
      restartPolicy: Always

This container runs a Python script that checks CUDA availability and prints the GPU name. Unlike the single-shot nvidia-smi test, it runs as a long-lived Deployment that claims and retains the GPU - a stricter, more realistic test.

Apply and verify:

kubectl apply -f pytorch-gpu-check.yaml
kubectl get pods
kubectl logs deployment/pytorch-gpu-check

Expected output:

CUDA is available:  NVIDIA L4

That confirms end-to-end GPU functionality: your virtual cluster can identify, claim, and actively use the GPU for machine learning. From here, you can scale to multiple virtual clusters, each dynamically allocating GPUs based on workload requirements.
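Additional tenants follow the same pattern. For example, a second isolated environment sharing the same GPU pool (names are illustrative):

kubectl create ns gpu-02
vcluster create vcluster-02 --namespace gpu-02

Both virtual clusters submit their GPU workloads to the same host scheduler, which places them on whatever physical capacity is free.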

8. Conclusion

GPU sharing in Kubernetes is stuck between two bad options. Software-based approaches like time-slicing give you density but no isolation. Hardware-based approaches like MIG give you isolation but limit flexibility and lock you into specific GPU models. The DIY workarounds in between are fragile and hard to operate.

vCluster breaks this trade-off by solving GPU multi-tenancy at the Kubernetes orchestration layer, not the hardware or driver layer. Each tenant gets a fully isolated virtual cluster with its own API server, RBAC, CRDs, and Helm charts. The underlying GPU fleet stays shared, with a centralized scheduler that keeps utilization high across all tenants. It works with MIG, time-slicing, Run:ai, and the rest of the GPU ecosystem. No vendor lock-in, no rigid hardware requirements.

The result: you turn an underutilized cluster of expensive GPUs into a shared, high-efficiency engine for your AI teams, without sacrificing the isolation and control that production workloads demand.

For a deeper look at GPU infrastructure patterns, download the free eBook: GPU-Enabled Platforms on Kubernetes.

Ready to take vCluster for a spin?

Deploy your first virtual cluster today.