The first part of this series explored why enterprises are increasingly prioritizing in-house GPU infrastructure over cloud-only strategies, driven by cost efficiency, data control, and guaranteed capacity. However, acquiring GPU hardware is only the first step; the real challenge lies in maximizing the value of these expensive resources through effective sharing.
GPU multitenancy refers to using a single physical GPU to operate several independent workloads. It allows you to serve multiple apps, teams, or customers using one pool of shared GPU infrastructure. This helps improve resource efficiency and reduce operating costs as you build your enterprise AI strategy.
Multitenant GPU access in Kubernetes lets you run several AI/ML deployments in one cluster. While you can implement basic Kubernetes multitenancy using built-in mechanisms such as namespaces, resource quotas, and RBAC, GPU-dependent workloads require special consideration. Unlike other types of hardware resources, GPUs don't support native sharing between multiple isolated processes. You normally have to assign an entire physical GPU to each Kubernetes pod, making it impossible to scale your AI/ML apps without incurring huge costs.
This article explores the problems with GPU multitenancy and discusses four different approaches for safely and efficiently running multitenant GPU apps at scale.
What Is Multitenancy in GPU Infrastructure?
Multitenant GPU infrastructure operates multiple workloads from a single pool of GPUs. It lets several apps and users share GPU hardware, leading to improvements in GPU utilization and operating efficiency.
GPU multitenancy mechanisms enable different GPU-enabled workloads to coexist without interfering with each other. Each workload acquires its own isolated slice of the GPU, allowing fair allocation of resources without adverse impacts on neighbors. GPU-level isolation helps maintain interprocess security and ensure clear accountability when several different tenants are using the GPU.
Multitenancy is also a key step towards reducing the cost of running AI/ML workloads at scale. GPUs are expensive, hard to source, and often more powerful than any single workload needs. To get the best return on investment, you must ensure your GPUs and workloads are optimally paired. Sharing GPUs between multiple deployments enables you to increase utilization rates and improve your operating efficiency.
Types of GPU Multitenancy
GPU multitenancy has three main use cases:
1. Team-level multitenancy: This is where multiple teams within your organization use the same pool of GPUs, such as ML, GenAI, and data analytics developers. Sharing GPUs can significantly reduce development costs, but it's crucial to ensure each team gets a fair share of the available resources.
2. Workload-level multitenancy: Workload-oriented multitenancy refers to sharing GPUs between several distinct apps or tasks, such as running both training and inference workloads on the same GPU.
3. Customer-level multitenancy: Some SaaS AI platforms may deploy a new instance for each customer. In this scenario, each customer instance must be granted safe access to GPU capacity.
In all cases, the basic requirement stays the same: GPU multitenancy should allow several isolated deployments to share a single GPU resource. But historically, it's been challenging to achieve this in practice.
Why GPU Multitenancy Is Hard to Implement
GPU multitenancy is problematic because GPUs aren't designed for partial allocation. While CPUs can be easily divided into fractional shares and then assigned to multiple isolated processes, GPUs are usually scheduled as a whole device. Standard Kubernetes device plugins enforce this, so you can't request a fractional GPU share:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo
spec:
  containers:
    - name: demo
      image: hello-world:latest
      resources:
        limits:
          # This doesn't work
          nvidia.com/gpu: 0.5
```
In a perfect world, the example shown above would allow a Kubernetes pod to gain access to half of an available GPU. Another pod could then claim the remaining capacity. But because the request must be a whole number, this example doesn't work. The first pod has to exclusively claim the entire GPU (`nvidia.com/gpu: 1`) even though it doesn't actually need the GPU's full capacity. Another GPU would be needed to run another workload, incurring substantial extra costs.
GPU sharing in a multitenant context also raises security concerns. For instance, having multiple processes target the GPU could enable unauthorized shared memory access or unintentionally expose data between processes if there are driver-level bugs. To stay safe, GPU multitenancy systems need to deliver hardware-level isolation to help mitigate these risks.
Multitenancy impacts GPU observability processes too. Simply monitoring GPU-level usage stats isn't enough to give you the whole picture of what's happening in your workloads. Accurately tracking who's using different portions of the GPU requires specialist tools that understand tenanted access, increasing configuration complexity. Having this data allows you to optimize resource allocation to achieve maximum efficiency, but Kubernetes can't provide it by default.
Key GPU Multitenancy Challenges
In addition to the general issues described above, teams building multitenant GPU environments often experience the following challenges:
- Contention: Running multiple workloads on one GPU can lead to high resource contention. A single demanding workload could consume all the available GPU capacity, causing performance problems for neighboring deployments. Dedicated GPU scheduling and throttling mechanisms are needed to prevent excess contention.
- Isolation: Without proper barriers, workloads using the GPU could interfere with each other or expose sensitive information at the kernel or device level. For instance, memory leaks from one process may impact other workloads using the GPU.
- Quota management: GPU resource quotas must be precisely managed so teams can fairly share GPU capacity. Standard Kubernetes GPU device plugins only allow you to match whole GPUs to your workloads, with no fine-grained control available.
- Cost and usage visibility: Allowing multiple apps or teams to use a GPU makes it harder to monitor usage. Without visibility into each workload's activity, you can't attribute usage to the cost center that triggered it. This makes it challenging to implement accurate charge-backs.
A robust GPU multitenancy strategy should address these problems by giving you granular control over GPU assignments. Because Kubernetes includes only limited built-in GPU allocation features, extra tools and technologies are needed to operate multitenant workloads reliably.
4 Steps to Effective GPU Multitenancy
The following sections outline four key steps towards successful multitenant GPU operations in Kubernetes. They combine built-in Kubernetes features, GPU driver capabilities, and advanced virtual cluster tools to achieve multitenancy at the workload, cluster, and GPU level.
1. Kubernetes Namespaces and Resource Quotas
Kubernetes namespaces are the bedrock of in-cluster multitenancy. They're a built-in mechanism for isolating groups of objects within your cluster. When combined with RBAC policies, you can use them to safely separate objects belonging to different apps or teams.
Namespaces also work with resource quotas, which can limit the namespace's resource consumption by setting the amount of CPU, memory, storage, and GPU instances it can use. This allows you to enforce which tenants are able to access GPUs.
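As a rough sketch, the pairing looks like this: a namespace-scoped Role and RoleBinding (with illustrative names) confine a team's permissions to its own namespace, and you can combine them with a ResourceQuota like the one shown later in this article:

```yaml
# Hypothetical example: members of the "team-a-developers" group may manage
# pods only inside the "team-a" namespace. All names are illustrative.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: team-a-workloads
  namespace: team-a
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list", "watch", "create", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-a-workloads
  namespace: team-a
subjects:
  - kind: Group
    name: team-a-developers
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: team-a-workloads
  apiGroup: rbac.authorization.k8s.io
```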
Unfortunately, this is also where Kubernetes's native multitenancy features end. The system doesn't include any capabilities for GPU-level tenancy. Namespaces and resource quotas alone don't facilitate GPU sharing, so you can't specify that workloads from two tenants should use one GPU or remain properly isolated when being accessed. You'll see how you can solve this in the next step.
2. GPU Scheduling Extensions
GPU scheduling extensions implement robust GPU multitenancy at the driver level. NVIDIA's Multi-Instance GPU (MIG) technology lets you split a single physical GPU into up to seven isolated instances. Each instance is assigned its own share of the GPU's compute and memory resources.
Once configured, the NVIDIA driver presents the partitioned GPU instances as independent GPUs attached to the node. You can then assign each instance to individual Kubernetes pods. This enables one GPU to securely serve up to seven different workloads in your cluster, drastically expanding your capacity to operate tenanted deployments.
If you need to share a GPU among more than seven workloads, time-slicing is an alternative to MIG. Natively available within the NVIDIA Kubernetes device plugin, it allows Kubernetes pods to oversubscribe to the available GPUs.
Time-slicing creates a set of GPU replicas that pods can request, with each then receiving a proportional slice of the GPU's available compute time. However, unlike MIG, time-slicing does not implement memory or fault isolation, so issues in one workload could potentially impact others.
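To illustrate, the device plugin's time-slicing behavior is driven by a small sharing configuration. The sketch below follows NVIDIA's documented config format and advertises each physical GPU as four schedulable replicas; how you deliver the config depends on whether you install the plugin directly or through the GPU Operator, so treat the wiring as an assumption and check NVIDIA's documentation for your setup:

```yaml
# Sketch of a time-slicing config for the NVIDIA Kubernetes device plugin.
# Each physical GPU is advertised as four replicas, so up to four pods can
# share it -- without memory or fault isolation between them.
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4
```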
Kubernetes node affinity rules and taints can also impact GPU scheduling for multitenant workloads. You can use taints to ensure GPU-dependent workloads don't schedule to nodes without a suitable GPU, for instance, or configure affinity rules to prefer nodes that can supply GPUs of a particular class.
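For example, a tenant workload might tolerate the taint applied to your GPU nodes and prefer nodes carrying a particular GPU class label. The taint key and the `gpu.class` label below are assumptions; substitute whatever conventions your cluster applies to its GPU node pools:

```yaml
# Illustrative pod spec combining a toleration for tainted GPU nodes with a
# node affinity preference for a specific GPU class. Taint key, label, and
# image are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-inference
spec:
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          preference:
            matchExpressions:
              - key: gpu.class
                operator: In
                values: ["a100"]
  containers:
    - name: inference
      image: example.com/inference:latest
      resources:
        limits:
          nvidia.com/gpu: 1
```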
3. Logical Isolation with Virtual Clusters
vCluster enables you to create fully isolated Kubernetes environments, known as virtual clusters, within a single physical cluster. Each virtual cluster looks and behaves just like a real cluster but operates independently of its neighbors. Virtual clusters have their own virtualized control planes, so they can host unique CRDs, cluster-level RBAC policies, and resource quotas.
Virtual clusters are lightweight, fast, and capable of sleeping when they're unused. Compared with plain Kubernetes namespaces, they offer more granular control and improved security. Assigning tenants their own virtual cluster allows them to take control of their Kubernetes environment without the risk of other tenants being affected.
You can use virtual clusters in conjunction with NVIDIA MIG and GPU time-slicing to achieve full multitenancy for AI/ML workloads. Creating a virtual cluster for each tenant and then assigning a partitioned GPU share gives tenants isolated access to the GPU. You can precisely manage GPU allocation to prevent resource contention while sharing GPU nodes between multiple tenants to improve utilization. GPU access runs at native speeds, without any hypervisor overheads.
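One way to wire this together, sketched below, is to apply a ResourceQuota to the host namespace that backs each tenant's virtual cluster, capping how many MIG instances its synced pods can claim. The namespace name and the MIG resource name are assumptions; the latter depends on your GPU model, MIG profile, and device plugin strategy:

```yaml
# Sketch: limit one tenant's virtual cluster to two MIG instances by applying
# a quota to its host namespace. Namespace and resource names are assumptions.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-gpu-quota
  namespace: vcluster-team-a
spec:
  hard:
    requests.nvidia.com/mig-1g.5gb: 2
```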
4. Custom GPU Allocation Strategies
In the most demanding multitenant environments, custom GPU allocation strategies can help address scheduling challenges. You can use native Kubernetes features like preemption policies and priority classes to configure how new GPU-enabled pods are matched to suitable nodes. Evicting a low-priority job in favor of a more urgent one can help ensure stable performance for critical workloads, for instance.
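As a minimal sketch (with illustrative names and values), you can define a PriorityClass for latency-sensitive inference and reference it from the pods that must not be starved of GPU capacity:

```yaml
# Illustrative PriorityClass: pods that reference it can preempt
# lower-priority GPU jobs when capacity runs short.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: gpu-inference-critical
value: 100000
globalDefault: false
description: "High priority for latency-sensitive GPU inference workloads."
---
apiVersion: v1
kind: Pod
metadata:
  name: inference-api
spec:
  priorityClassName: gpu-inference-critical
  containers:
    - name: api
      image: example.com/inference-api:latest
      resources:
        limits:
          nvidia.com/gpu: 1
```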
Some workloads may benefit from bin-pack scheduling. This prioritizes packing deployments onto nodes until they're full, letting you make the most of available resources before provisioning new capacity. It's well suited to bursty deployments that are frequently created and destroyed, as it leaves other GPUs free to host stable long-running workloads. Scheduling plugins, such as Volcano, and dedicated batch job runners, like Argo Workflows, can also help ensure AI/ML tasks are scheduled to run as efficiently as possible.
In addition, solutions like vCluster Auto Nodes (powered by Karpenter) can automate much of this scheduling complexity. Auto Nodes dynamically provisions GPU-capable nodes based on workload demand, ensuring the right GPUs are available when needed and consolidating workloads efficiently to reduce idle capacity. This helps realize bin-pack efficiency without manual tuning, while also scaling GPU infrastructure up or down in response to tenant workloads.
Best Practices for Safe GPU Multitenancy
Now that we've covered the main ways to achieve GPU multitenancy, here are five best practices that help ensure success.
Use NVIDIA MIG (if Your Hardware Supports It)
NVIDIA MIG is one of the critical components to include in a multitenant GPU implementation. As discussed above, MIG allows you to split a single physical GPU into up to seven separate partitions. It lets you deploy multiple tenanted workloads with hardware-level memory isolation, minimizing the risk of security or performance problems occurring.
With MIG enabled, multiple GPU devices will be presented to Kubernetes for each physical unit connected to your node. You can then allocate GPUs to pods using standard `nvidia.com/gpu: <gpu-count>` Kubernetes resource requests. However, MIG isn't universally available; it's only supported on high-end Ampere generation and newer devices, while the partitioning options you can use depend on your specific GPU.
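For instance, a pod can claim a single MIG instance as sketched below. Under the device plugin's "single" strategy the partitions appear as ordinary `nvidia.com/gpu` resources, while the "mixed" strategy exposes profile-specific names; the `nvidia.com/mig-1g.5gb` profile shown here is an assumption that depends on your GPU model and partitioning layout:

```yaml
# Illustrative pod requesting one MIG slice via a profile-specific resource
# name (mixed strategy). The profile and image are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: mig-demo
spec:
  containers:
    - name: trainer
      image: example.com/trainer:latest
      resources:
        limits:
          nvidia.com/mig-1g.5gb: 1
```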
Apply Quotas and Scheduling Policies at the Namespace or Virtual Cluster Level
Enforcing resource quotas at the namespace or virtual cluster level allows you to fairly allocate GPU instances to your tenants. This prevents one tenant from consuming all the available resources. The following example demonstrates a resource quota that limits the `team-a` namespace to five NVIDIA GPU instances:
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: team-a
spec:
  hard:
    requests.nvidia.com/gpu: 5
```
Continually Monitor GPU Usage with the Prometheus DCGM Exporter
Tracking GPU activity allows you to identify the causes of performance bottlenecks. NVIDIA's DCGM-Exporter tool provides detailed GPU metrics that you can scrape using Prometheus. It's fully compatible with MIG, so you can individually monitor each of your partitioned GPU instances, such as to compare GPU utilization between different tenants. This can help you spot potential issues, such as tenants that are demonstrating consistently high utilization and that may need more resources.
Even if you're not using MIG, DCGM provides vital insights into the NVIDIA GPU activity in your cluster. Standard Kubernetes monitoring components like metrics-server and kube-state-metrics don't cover GPUs, so you need DCGM to keep tabs on your hardware. An official Helm chart simplifies DCGM-Exporter installation on Kubernetes.
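As a minimal sketch, Prometheus just needs a scrape target pointing at the exporter's metrics endpoint. The service name and namespace below are assumptions, and if you run the Prometheus Operator, a ServiceMonitor is the more common approach:

```yaml
# Minimal Prometheus scrape config, assuming DCGM-Exporter is reachable as a
# Service named "dcgm-exporter" in the "monitoring" namespace on its default
# port (9400). Adjust to match your installation.
scrape_configs:
  - job_name: "dcgm-exporter"
    static_configs:
      - targets: ["dcgm-exporter.monitoring.svc:9400"]
```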
Separate Long-Running and Bursty GPU Workloads
Different types of GPU workload can have drastically different performance characteristics. For instance, an AI training process may run for multiple hours or days, consistently occupying a set amount of GPU capacity. However, workloads that use AI inference to serve user requests in real time are more likely to experience sporadic bursts of utilization.
Separating these workloads so they run on different GPU nodes can help optimize your infrastructure. Assigning specific GPUs to long-running workloads ensures capacity will always be available for them. Burstable apps can generally share a pool of GPUs as none of the workloads will occupy the GPU for long.
Use RBAC and Admission Controllers to Secure GPU Access
GPUs are expensive specialist devices that should be reserved for the workloads that use them. Allowing other teams to utilize GPUs or inspect their workloads increases operating costs, affects performance, and could lead to security issues. You can avoid these problems by securing GPU access using Kubernetes RBAC rules and admission controllers at the namespace or virtual cluster level.
RBAC allows you to define which resources different cluster users can access and which actions they can perform on them. When used alongside namespace resource quotas, RBAC rules let you prevent unauthorized users from creating pods in namespaces that have GPUs assigned.
Similarly, admission controllers let you reject new pods that try to request GPU access, unless they meet specific criteria. For instance, you could use a validating admission policy to enforce that pods can only request GPU access if they run a certain image and have approved labels attached.
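Here's a hedged sketch of that idea using the built-in ValidatingAdmissionPolicy API, which is generally available in recent Kubernetes releases. The label convention, match rules, and cluster-wide binding are assumptions you'd adapt to your own approval workflow:

```yaml
# Sketch: only admit GPU-requesting pods that carry an approval label.
# The label key/value and the policy scope are illustrative assumptions.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: require-gpu-approval
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["pods"]
  validations:
    - expression: >-
        object.spec.containers.all(c,
          !has(c.resources) || !has(c.resources.limits) ||
          !('nvidia.com/gpu' in c.resources.limits) ||
          (has(object.metadata.labels) &&
           'gpu-access' in object.metadata.labels &&
           object.metadata.labels['gpu-access'] == 'approved'))
      message: "Pods requesting GPUs must carry the label gpu-access=approved."
---
# The policy only takes effect once it is bound; this binding applies it
# cluster-wide, but you could scope it to GPU-enabled namespaces instead.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: require-gpu-approval-binding
spec:
  policyName: require-gpu-approval
  validationActions: ["Deny"]
```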
Conclusion: You Can Solve GPU Multitenancy Challenges
GPU multitenancy improves AI and ML operating efficiency by serving multiple deployments from a single GPU. This increases hardware utilization rates but also creates management complexity. Neither GPUs nor Kubernetes are natively designed for fully isolated multitenant scenarios, so dedicated tooling is crucial to prevent resource contention, enhance performance, and protect data security.
Combining vCluster with solutions like NVIDIA MIG lets you build effective multitenant GPU infrastructure at scale. vCluster allows you to create independent virtual clusters within your physical Kubernetes environment, giving each team, app, or customer their own private space. You can then use MIG to partition your GPUs and share them among your virtual clusters.
Because AI development depends on more than just GPUs, it's also important that the rest of your infrastructure is properly optimized. In the next part of this series, we'll show you how to architect an entire private AI cloud. Using private resources gives you total data governance and guaranteed hardware exclusivity, enabling safe and performant AI/ML innovation.
This guide covered the practical strategies for GPU multitenancy. If you want to understand the architectural fundamentals that make these approaches necessary (why CUDA contexts can't be preempted like CPU processes, how GPU memory differs from traditional RAM, and what MIG actually does at the silicon level), we published "The GPU Challenge at Scale." It's a technical deep-dive into the GPU and Kubernetes primitives that shape how you build AI infrastructure.