GPUs are now commonplace in Kubernetes environments, thanks to the rise of AI and ML workloads. Yet GPUs are expensive and not always used to capacity, so implementing an effective sharing strategy that enables multiple workloads to run on one GPU is also increasingly important. With individual GPUs typically costing thousands of dollars, it's crucial to make the most of the hardware you have available.
Natively, Kubernetes allows GPUs to be assigned to only a single pod at a time. This is highly inefficient as pods must be allocated an entire GPU even if they only need a fraction of its capacity. GPU-sharing techniques address this limitation, allowing you to improve GPU utilization and reduce your operating costs. Sharing enables you to safely use a single GPU for several apps, teams, or customers instead of needing a distinct device to be assigned to each deployment.
This article examines the problems involved in Kubernetes GPU sharing. We'll then discuss the key ways to implement GPU sharing using standard and DIY approaches, along with their benefits and pitfalls.
Understanding the Kubernetes GPU-Sharing Problem
Kubernetes has built-in support for allocating GPUs to your workloads. It relies on device plugins provided by your GPU manufacturer—NVIDIA, AMD, or Intel.
With a device plugin installed in your cluster, you can assign GPUs to your pods using the Kubernetes resource requests and limits system. The following example demonstrates how a pod can claim an NVIDIA GPU via the `nvidia.com/gpu` resource. The resource type is provided by the NVIDIA Kubernetes device plugin:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-with-gpu
spec:
  containers:
    - name: my-container
      image: my-image:latest
      resources:
        limits:
          # Request a single NVIDIA GPU instance
          nvidia.com/gpu: 1
```
This Kubernetes-native mechanism is simple and predictable, but it has a big drawback: you can only assign whole GPU instances to your pods. Fractional requests, such as `nvidia.com/gpu: 0.5`, aren't supported, so each pod that requests a GPU must consume an entire physical instance. This prevents efficient GPU utilization when operating smaller workloads that only need a fraction of an instance each.
GPU sharing solves this problem. It enables a single GPU to be allocated to more than one Kubernetes pod. This improves operating efficiency by reducing the number of GPUs you need to procure. Depending on the sharing mechanism you use, you can also ensure different workloads remain isolated at the GPU level.
The following sections explore the main strategies that allow you to share GPUs between multiple pods in your Kubernetes cluster. Because of the absence of GPU-sharing capabilities in native Kubernetes, each of these options requires a degree of manual configuration. They also have differing implications for workload isolation, so you should check the notes for each solution before you get started.
Approach 1: Time-Slicing with the NVIDIA GPU Operator
GPU time-slicing is an NVIDIA GPU-sharing feature available when you have the NVIDIA GPU Operator installed in your Kubernetes cluster. The GPU Operator automates in-cluster management of the NVIDIA device plugin, drivers, and container tools.
Time-slicing allows you to share a GPU between multiple pods by creating logical replicas of the device. Activity on any of the replicas runs on the same physical GPU. The GPU's available compute time is sliced evenly between the running processes.
Time-slicing provides a relatively thin layer of abstraction over the physical GPU. It does not provide private memory, so you can't restrict the memory consumption of individual pods. It also doesn't provide fault isolation, meaning a problem caused by one pod could potentially impact other pods that are sharing the same GPU.
To configure time-slicing, you must first install the NVIDIA GPU Operator using Helm:
```bash
$ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
$ helm repo update
$ helm install \
    --wait \
    --generate-name \
    -n gpu-operator --create-namespace \
    --version=v25.3.2 \
    nvidia/gpu-operator
```
Once you've installed the GPU Operator, you can enable time-slicing for your GPUs by creating a ConfigMap in the GPU Operator's namespace (`gpu-operator` in the example above). The following example specifies that each of your GPUs will be split into four time-sliced replicas:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-configmap
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4
```
To activate time-slicing, you must first apply the ConfigMap to your cluster:
```bash
$ kubectl apply -f time-slicing-configmap.yaml
```
Next, patch the NVIDIA device plugin's default Cluster Policy to reference the ConfigMap:
```bash
$ kubectl patch clusterpolicies.nvidia.com/cluster-policy \
    -n gpu-operator \
    --type merge \
    -p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-configmap", "default": "any"}}}}'
```
Inspecting the `nvidia.com/gpu` labels and capacity on your nodes should now report the configured number of time-sliced GPU replicas:
```bash
$ kubectl describe node | grep nvidia.com
nvidia.com/gpu.count=1
nvidia.com/gpu.family=pascal
nvidia.com/gpu.memory=11264
nvidia.com/gpu.present=true
nvidia.com/gpu.product=NVIDIA-GeForce-GTX-1080-Ti-SHARED
nvidia.com/gpu.replicas=4
nvidia.com/gpu.sharing-strategy=time-slicing
nvidia.com/gpu: 4
```
While the output above has been condensed, it demonstrates four key details:
- `nvidia.com/gpu.count=1`: There is a single physical GPU attached to the node.
- `nvidia.com/gpu.replicas=4`: Four time-sliced replicas are being exposed per physical GPU.
- `nvidia.com/gpu.sharing-strategy=time-slicing`: The NVIDIA device plugin is configured to share GPUs using time-slicing.
- `nvidia.com/gpu: 4`: There are four GPU instances available for pods to request.
You can now go ahead and deploy multiple pods that request GPU access, up to a maximum of the four GPU replicas configured. Here's a demo that runs NVIDIA's CUDA sample image, as detailed in the documentation:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-deployment
spec:
  # The 3 replicas will create successfully
  # Each of the 3 Pods will be able to claim a time-sliced GPU replica
  replicas: 3
  selector:
    matchLabels:
      app: demo-deployment
  template:
    metadata:
      labels:
        app: demo-deployment
    spec:
      containers:
        - name: nvidia-cuda-sample
          image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04
          command: ["/bin/bash", "-c", "--"]
          args:
            - while true; do /cuda-samples/vectorAdd; done
          resources:
            limits:
              nvidia.com/gpu: 1
```
Now that we've seen how to implement GPU time-slicing, let's break down the pros and cons of this approach. As outlined above, time-slicing is versatile and easy to configure, but it lacks robust isolation, so it's not suitable for every use case.
Time-Slicing Pros
- Simplicity: Time-slicing's biggest advantage is its ease of configuration. You only need the NVIDIA GPU Operator and a simple ConfigMap to slice your GPU into multiple replicas.
- Supports very large replica counts: Time-slicing allows you to create as many GPU replicas as you need. A single GPU could be sliced into hundreds of replicas to serve many small or infrequently accessed apps.
- Flexible configuration: Despite its simplicity, time-slicing can be customized for advanced scenarios. You can specify different sharing configurations on a per-node or per-GPU basis, for example.
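For example, here's a minimal sketch of per-node configuration. It assumes the GPU Operator's device plugin supports selecting a named ConfigMap entry via the `nvidia.com/device-plugin.config` node label, which recent GPU Operator releases document; the config names below are illustrative, so verify the mechanism against the documentation for your version:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-configmap
  namespace: gpu-operator
data:
  # Nodes labeled nvidia.com/device-plugin.config=heavy-sharing expose 8 replicas per GPU
  heavy-sharing: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 8
  # Nodes labeled nvidia.com/device-plugin.config=light-sharing expose 2 replicas per GPU
  light-sharing: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 2
```

You'd then label each node with the configuration it should use, e.g., `kubectl label node my-gpu-node nvidia.com/device-plugin.config=heavy-sharing --overwrite`.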
Time-Slicing Cons
- No private memory: Memory is shared between each time-sliced GPU replica. The inability to set pod-level memory usage constraints can lead to resource contention and noisy-neighbor problems.
- No fault isolation or scheduling guarantees: Time-slicing provides a bare minimum level of abstraction. GPU replicas aren't isolated at the hardware level, so faults in one task may have knock-on impacts on others. Further, the GPU time available to each pod isn't guaranteed to be proportional to the number of GPU replicas it requests. This limitation can prevent you from precisely optimizing workload performance.
- Potential performance issues: Time-slicing can potentially add small performance overheads that are hard to diagnose. The GPU must continually context-switch between tasks assigned to its replicas.
- Allocated GPU time isn't proportional to the number of GPU replicas requested: Pods may request two or more time-sliced replicas (e.g., `nvidia.com/gpu: 2`) to run multiple processes, but this does not guarantee the pod will receive proportionally more compute time than other pods sharing the same GPU. The GPU's time is shared equally among all the _processes_ using the GPU, across all replicas and pods—it's not divided between the pods themselves.
Approach 2: NVIDIA MIG (Multi-Instance GPU)
MIG (Multi-Instance GPU) is a newer GPU-sharing capability available within the NVIDIA GPU Operator. It lets you partition GPUs into up to seven self-contained instances. Each instance acts as though it's a new physical GPU attached to the host node. You can then assign the partitioned instances to your Kubernetes pods using a standard `nvidia.com/gpu: <instance-count>` resource request.
MIG partitions operate independently of each other, with full hardware-level fault isolation. Each partition is allocated its own private share of the GPU's memory. Partitions are also assigned dedicated paths to GPU resources, such as caches, memory controllers, and compute components. This enhances security and prevents high utilization in one partition from impacting the others.
MIG is the most powerful GPU-sharing strategy when you need strong isolation guarantees. However, it's a relatively inflexible tool that only works in certain configurations. The partitioning options available depend on your hardware and may not always allow you to use your GPU's entire capacity.
For instance, if you partition a 96GB H100 card into three partitions, each partition is assigned 24GB of memory, leaving 24GB unallocated. You also can't assign arbitrary amounts of compute or memory to individual partitions; only the predefined profile combinations are supported. Moreover, MIG is only supported on select NVIDIA GPUs, starting from the Ampere generation introduced in 2020.
Assuming MIG works with your GPUs, you can activate the feature in your Kubernetes cluster by installing the NVIDIA GPU Operator with a MIG strategy configured. The following command enables MIG for all GPUs on your nodes:
```bash
$ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
$ helm repo update
$ helm install \
    --wait \
    --generate-name \
    -n gpu-operator --create-namespace \
    --set mig.strategy=single \
    --version=v25.3.2 \
    nvidia/gpu-operator
```
You can optionally set the `mig.strategy` option to `mixed` instead of `single` to selectively activate MIG for specific GPUs.
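When the `mixed` strategy is active, MIG devices are exposed as profile-specific resource types rather than the generic `nvidia.com/gpu` resource. Here's a minimal sketch of what a pod request looks like in that mode, assuming a `1g.5gb` profile is enabled; adjust the resource name to match the profiles you actually configure:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mig-mixed-pod
spec:
  containers:
    - name: my-container
      image: my-image:latest
      resources:
        limits:
          # With mig.strategy=mixed, each MIG profile becomes its own resource type
          nvidia.com/mig-1g.5gb: 1
```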
Once the GPU Operator's installed, you should find your node's `nvidia.com/mig.capable` label has been set to `true`. You can also check the `nvidia.com/mig.strategy` and `nvidia.com/mig.config.state` labels to ensure the MIG configuration has applied successfully:
```bash
$ kubectl describe node | grep nvidia.com/gpu
nvidia.com/gpu.present=true
nvidia.com/gpu.count=1
nvidia.com/gpu.replicas=1
nvidia.com/mig.capable=true
nvidia.com/mig.config=all-disabled
nvidia.com/mig.config.state=success
nvidia.com/mig.strategy=single
```
Next, you must choose a MIG profile to partition your GPU with. The available profiles for each GPU model are documented in NVIDIA's MIG user guide. As an example, you can use the `all-1g.5gb` profile if you have an A100 GPU that you'd like to partition into the maximum of seven MIG instances.
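If you have shell access to a GPU node, you can also list the instance profiles the hardware supports directly with `nvidia-smi`. Note that MIG mode generally needs to be enabled on the GPU before profiles are reported, and the GPU Operator's `all-*` config names map onto these raw profile names:

```bash
# List the GPU instance profiles this GPU supports,
# including their memory sizes and maximum instance counts
$ nvidia-smi mig -lgip
```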
Once you've selected a MIG profile to use, update your Kubernetes node's `nvidia.com/mig.config` label to reference the profile's name:
```bash
$ kubectl label nodes my-node nvidia.com/mig.config=all-1g.5gb --overwrite
```
The NVIDIA GPU Operator will then partition the GPU. Wait a few moments, then check the node's labels again. The `nvidia.com/gpu.count` label should now state the expected number of partitioned replicas, while `nvidia.com/mig.config` will show the profile you've configured.
```bash
$ kubectl describe node | grep nvidia.com/gpu
nvidia.com/gpu.count=7
nvidia.com/mig.capable=true
nvidia.com/mig.config=all-1g.5gb
nvidia.com/mig.config.state=success
nvidia.com/mig.strategy=single
```
There are now effectively seven distinct GPUs available in your cluster. You can use the standard `nvidia.com/gpu` resource to assign the partitioned GPUs to your Kubernetes workloads:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-deployment
spec:
  # The 3 replicas will create successfully
  # Each of the 3 Pods will be able to claim a partitioned GPU instance
  replicas: 3
  selector:
    matchLabels:
      app: demo-deployment
  template:
    metadata:
      labels:
        app: demo-deployment
    spec:
      containers:
        - name: nvidia-cuda-sample
          image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04
          command: ["/bin/bash", "-c", "--"]
          args:
            - while true; do /cuda-samples/vectorAdd; done
          resources:
            limits:
              nvidia.com/gpu: 1
```
You'll notice that this example is the same as the one shown above for time-slicing. Whether you use time-slicing or MIG, you configure your Kubernetes pods in the same way. The important difference is that whereas time-slicing shares physical GPU access between pods with `nvidia.com/gpu` requests, MIG assigns each request exclusive access to a private partitioned instance. If a pod's assigned a MIG partition, then its GPU access is fully isolated from other pods using the same physical GPU. Pods that are assigned a time-sliced replica don't benefit from this isolation.
Although using MIG to share GPUs gives the strongest possible isolation, it's also a relatively inflexible solution. Let's take a closer look at its pros and cons.
NVIDIA MIG Pros
- Improved isolation: Compared with time-slicing, MIG offers greatly enhanced isolation. Each workload using MIG gets its own private memory and fault isolation, preventing noisy-neighbor issues and security threats.
- Hardware-enforced partitioning: MIG partitioning is enforced at the hardware level, with each instance having its own path through the GPU's entire memory system.
- Enhanced observability: MIG is fully compatible with NVIDIA's DCGM-Exporter, a tool that makes GPU utilization data available as Prometheus metrics. You can individually monitor utilization stats for each of your MIG partitions, allowing you to see precisely which apps and teams are occupying your GPUs.
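As a sketch of what this enables in practice, once DCGM-Exporter metrics are scraped into Prometheus you can break GPU usage down by MIG instance. The Prometheus endpoint below is a placeholder, and the metric and label names vary by DCGM-Exporter version, so verify them against your deployment:

```bash
# Query per-MIG-instance framebuffer memory usage from Prometheus
# (prometheus.example.com, the metric, and the labels are illustrative)
$ curl -s "http://prometheus.example.com/api/v1/query" \
    --data-urlencode 'query=sum by (GPU_I_ID, GPU_I_PROFILE) (DCGM_FI_DEV_FB_USED)'
```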
NVIDIA MIG Cons
- Requires specific hardware: MIG only works with modern high-end NVIDIA GPUs, so it won't necessarily work with your current infrastructure. Upgrading to MIG-compatible hardware could be a significant upfront investment.
- Partitioning options are preconfigured: MIG only supports up to seven partitions per GPU. The compute time and memory available to each partition are fixed, with no ability to arbitrarily assign resources. You must choose from one of the available profiles for your GPU.
- Reconfiguring profiles can be operationally complex: You can switch between MIG profiles by simply updating the `nvidia.com/mig.config` label on the affected Kubernetes node. However, as this will change the number of partitions available, activating a new profile is often operationally complex. You must check that your workloads can still schedule and run performantly with the new configuration applied.
Approach 3: MIG and Time-Slicing
MIG and time-slicing are independent mechanisms that each implement a form of GPU sharing. You can use either approach individually, but combining MIG and time-slicing can help partially address the weaknesses of both.
To enable this strategy, you must first set up MIG by following the steps shown above. Once MIG is enabled, you can then configure time-slicing to provide shared access to the partitioned GPU instances created by MIG. The following time-slicing ConfigMap demonstrates how to time-slice the MIG partitions created in the previous example so there are four replicas of each partition:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-configmap
  namespace: gpu-operator
data:
  any: |-
    version: v1
    flags:
      migStrategy: single
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4
```
With this configuration enabled, up to twenty-eight GPU resource requests (seven MIG partitions times four time-sliced replicas) can be granted to pods—all from a single physical GPU. For advanced use cases, you can also configure MIG time-slicing in `mixed` mode (`flags.migStrategy: mixed`) to configure a different number of time-sliced replicas for each active MIG profile.
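Here's an illustrative sketch of that `mixed` variant, in which each MIG resource type gets its own entry in the ConfigMap (the profile names below are examples; use the profiles you've actually enabled on your nodes):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-configmap
  namespace: gpu-operator
data:
  any: |-
    version: v1
    flags:
      migStrategy: mixed
    sharing:
      timeSlicing:
        resources:
          # Smaller partitions are sliced into more replicas...
          - name: nvidia.com/mig-1g.5gb
            replicas: 4
          # ...while larger partitions are shared less aggressively
          - name: nvidia.com/mig-3g.20gb
            replicas: 2
```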
Combining MIG and time-slicing is useful when you want to share GPU access across multiple isolated workloads, with multiple replicas running for each workload. By combining MIG and time-slicing, you can run up to seven distinct workloads per GPU—each with its own isolated GPU partition—but allow multiple pods per workload by time-slicing the MIG partitions into replicas.
Alternative Work-Arounds and Custom Solutions
MIG and time-slicing are the leading technologies for Kubernetes GPU sharing. However, it's also possible to build DIY workarounds if these solutions don't fit your use case. Perhaps you don't have MIG-compatible hardware but need stronger isolation than time-slicing alone, or you're using non-NVIDIA GPUs. Here are a few strategies you can choose from.
Manually Assigning GPUs Using Node Labeling and Taints
You can match Kubernetes workloads to GPUs using node labels and taints. While this doesn't let you share GPUs between multiple workloads, it can help ensure only approved pods get access to a GPU.
For example, you could apply a `has-gpu` taint to nodes with available GPUs:
```bash
$ kubectl taint nodes my-node-with-gpu has-gpu=true:NoSchedule
```
This prevents pods from scheduling to the node unless they tolerate the `has-gpu` taint. A pod manifest like the following can then be used to schedule a pod to a GPU-enabled node while matching on a specific GPU type using the `nvidia.com/gpu.product` node label:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: demo-deployment
  template:
    metadata:
      labels:
        app: demo-deployment
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: nvidia.com/gpu.product
                    operator: In
                    values:
                      - A100-SXM4-40GB
      tolerations:
        - key: has-gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: nvidia-cuda-sample
          image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04
          command: ["/bin/bash", "-c", "--"]
          args:
            - while true; do /cuda-samples/vectorAdd; done
          resources:
            limits:
              nvidia.com/gpu: 1
```
Now, the node is effectively reserved for pods that need the GPU—expressed via the `has-gpu` toleration—while the pod is guaranteed to run on a node with an NVIDIA A100-SXM4-40GB device. You could expand this strategy by creating a custom Kubernetes admission controller to reject pods that aren't authorized to use GPUs.
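To sketch what that might look like without writing a webhook, here's a hedged example using a Kubernetes ValidatingAdmissionPolicy (available in recent Kubernetes releases). The policy name and label convention are hypothetical, and the CEL expression only inspects regular containers, so treat it as a starting point rather than a complete solution:

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: restrict-gpu-requests
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["pods"]
  validations:
    # Reject any pod whose containers set an nvidia.com/gpu limit
    - expression: >-
        object.spec.containers.all(c,
          !has(c.resources) || !has(c.resources.limits) ||
          !('nvidia.com/gpu' in c.resources.limits))
      message: "This namespace is not authorized to request GPUs"
```

A ValidatingAdmissionPolicyBinding (not shown) then scopes the policy to the namespaces that shouldn't have GPU access, for instance by selecting every namespace that lacks a hypothetical `gpu-access=allowed` label.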
Enforcing GPU Quotas via Resource Requests and Quotas
Kubernetes resource quotas allow you to control the maximum amount of a resource available to individual Kubernetes namespaces. You can use them to specify the maximum number of GPU instances available to individual tenants in your cluster.
Resource quotas allow you to fairly share physical GPUs—when no other GPU-sharing mechanism is available—or divide time-sliced or partitioned GPU replicas between your cluster's namespaces. The following example specifies that pods in the `my-namespace` namespace can request up to four `nvidia.com/gpu` resources:
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-resource-quota
  namespace: my-namespace
spec:
  hard:
    requests.nvidia.com/gpu: 4
```
User-Space GPU-Multiplexing Tools
User-space GPU-multiplexing tools—such as GVirtuS and CUDA MPS—virtualize access to GPUs from user space, at the operating-system level rather than in hardware. Each of these tools provides a mechanism to share a GPU between multiple processes, but using them with Kubernetes requires custom configuration.
GVirtuS is an experimental research project, while CUDA MPS (Multi-Process Service) is an NVIDIA software feature. LibVF.IO implements vendor-agnostic multiplexing for NVIDIA, AMD, and Intel devices, but the project's last activity was over two years ago, and it's unclear whether it's still maintained or compatible with current hardware.
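As a rough illustration of what that custom configuration involves, here's a minimal node-level sketch for CUDA MPS. It assumes the CUDA toolkit is installed on the host and that every workload sharing the GPU can see the same pipe directory; in Kubernetes, that typically means a hostPath volume and matching environment variables in each pod (the directory paths below are illustrative):

```bash
# Start the MPS control daemon on the GPU node
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/var/log/nvidia-mps
nvidia-cuda-mps-control -d

# Any CUDA process launched with the same CUDA_MPS_PIPE_DIRECTORY value
# is routed through the shared MPS server on this GPU
```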
Container-Level GPU Cgroups and Runtime Hacks
You can use cgroups and container runtime configuration changes to manually implement an approximation of GPU sharing. If you've got multiple physical GPUs attached to your host, then cgroup hooks let you customize which devices are visible to the containers running within the cgroup. This lets you enforce that only specific containers can use the GPU, based on the cgroup they're assigned to.
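For example, here's a minimal sketch of the container-runtime side of this approach. It relies on the NVIDIA container runtime honoring the `NVIDIA_VISIBLE_DEVICES` environment variable, assumes that runtime is the node's default, and bypasses the device plugin's scheduling and accounting entirely, so treat it as a last resort:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pinned-pod
spec:
  containers:
    - name: my-container
      image: my-image:latest
      env:
        # Only the first physical GPU (index 0) is exposed to this container;
        # other GPUs on the node remain invisible to it
        - name: NVIDIA_VISIBLE_DEVICES
          value: "0"
```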
Why These Strategies Fall Short
Using any of these four strategies is likely to produce a fragile system that's difficult to configure, maintain, and secure. The operational overheads involved may surpass the cost of purchasing new hardware that supports stable features like MIG and time-slicing. You'll also lack debugging and monitoring capabilities, making it hard to investigate failures.
How vCluster Helps Simplify GPU Sharing
Running multitenant AI/ML workloads with shared GPU access involves more than just the GPUs themselves. Kubernetes includes basic built-in multitenancy features via namespaces and RBAC, but these don't go far enough to ensure full isolation.
vCluster solves this problem by enabling you to create fully functioning virtual clusters inside your physical Kubernetes environment. Each virtual cluster is fully isolated from its neighbors and operates independently. It has its own control plane with full RBAC and resource quota management.
vCluster doesn't natively implement GPU sharing, but it's fully compatible with existing technologies like NVIDIA MIG and time-slicing. It allows you to easily manage the tenanted workloads that access your GPU instances. vCluster also works with leading GPU-equipped workflow orchestration platforms, including Run:ai, Volcano, and Kubeflow, enabling you to fully isolate your AI/ML workloads and environments.
We'll take a closer look at using vCluster for GPU-equipped workflows in the next article in this series.
Conclusion
GPU sharing maximizes GPU utilization in your Kubernetes clusters. You can reduce operating costs and improve efficiency by allowing multiple AI, ML, and big data workloads to access the same physical GPU.
NVIDIA's MIG and GPU time-slicing mechanisms provide the basics needed to successfully share your GPUs. With MIG, you can partition your hardware into up to seven isolated instances, whereas time-slicing lets you create replicas that evenly divide the GPU's compute time between different processes. Alternative DIY approaches are also available, but they're generally complex to configure and brittle in use.
GPU sharing doesn't end with these technologies, though. Virtual clusters are the crucial final piece for building full multitenancy in Kubernetes. By combining vCluster, MIG, and time-slicing, you can safely use a small number of GPUs to serve many independent workloads. Your virtual clusters separate the different tenanted resources in your physical Kubernetes environment, while MIG guarantees hardware-level GPU isolation. This reduces GPU infrastructure requirements for AI/ML deployments, increasing your return on investment.
If you're building robust GPU-enabled platforms, we've created a comprehensive resource to help. Download our free ebook, "GPU-Enabled Platforms on Kubernetes," which explains how Kubernetes abstracts GPU resources, why traditional isolation fails, and what architectural patterns enable multi-tenant GPU platforms. This guide covers everything from how GPUs meet Kubernetes and why GPU multi-tenancy is hard, to orchestrating GPU sharing, hardware isolation and enforcement, and architecting GPU infrastructure with vCluster for optimal isolation and efficiency. Download the eBook here.
Deploy your first virtual cluster today.

