Solving GPU-Sharing Challenges with Virtual Clusters

Cliff Malmborg | Jan 13, 2026 | 4 min read

In the first part of this series, we explored why building an in-house GPU infrastructure is becoming a strategic necessity if you're serious about AI. Owning the hardware provides significant long-term advantages in cost, security, and resource control, but acquiring a rack of powerful GPUs is only the beginning. For instance, an organization might invest $250,000 in an 8xH100 server, only to find that sharing it effectively across multiple teams, projects, and workloads is a complex technical challenge.

GPUs are notoriously difficult to share because they were designed for a fundamentally different purpose: rendering graphics for a single user. Their architecture was optimized for one intensive workload at a time, not the multitenant, containerized, and often ephemeral workloads that define modern enterprise IT. This mismatch stalls many AI strategies: it creates operational bottlenecks, frustrates developers, and leaves expensive hardware sitting idle, a recipe for wasted capital.

To understand the technical hurdles of GPU sharing and how to overcome them, we spoke with Scott McAllister of vCluster. This article breaks down the common approaches to GPU sharing, their limitations, and how a virtual cluster architecture provides a more elegant, flexible, and efficient solution.

The High Cost of Idle GPUs

Enterprises pursuing in-house GPU infrastructure face a simple financial problem: The hardware is incredibly expensive. "They're way expensive," says McAllister. "So when you do get them, you want to utilize them as much as you can. If you can't share a single processor or a single processing unit, then you're not getting your money's worth."

This problem is magnified by the nature of AI workloads. Unlike a web server that handles a relatively steady stream of traffic, AI development is characterized by bursts of intense activity followed by periods of inactivity. As McAllister notes, "They'll have a burst of processing that happens on it, and then it's going to go away." A data scientist might run a model-training job that consumes multiple GPUs for twelve hours straight, and after that, the GPUs sit completely idle. This bursty pattern means the hardware is either fully loaded or not used at all.

If a GPU that costs tens of thousands of dollars is dedicated to a single team, its utilization rate can easily fall below 20–30 percent, undermining the economic argument for bringing infrastructure in-house. You've traded high cloud OpEx for a massive CapEx, only to see that capital asset sit unused. "Those times you're not using it feel like wasted money," McAllister adds. This intense financial pressure forces platform engineers and ML infrastructure teams to build scheduling and tenancy models that treat the entire GPU fleet as a shared pool, dynamically assigning workloads to keep the hardware consistently busy and maximize the return on investment.

Technical Hurdles of Traditional Sharing

NVIDIA has introduced various technologies to facilitate GPU sharing and address these challenges. But these solutions often feel like workarounds; their history reveals why. As McAllister says, "Graphics cards have been around forever, but using them for processing didn't really get popular until we started processing all these AI workloads." GPUs were built to render pixels on a screen, with their evolution driven by video games and visual effects. The concept of multiple isolated tenants running concurrent workloads simply wasn't part of the original design. This context is key to understanding the inherent limitations of modern sharing solutions.

Multi-Process Service (MPS) and the Lack of Isolation

MPS is a software-based solution that allows multiple processes from different applications to share a single GPU. It acts as an intermediary, collecting CUDA commands and dispatching them to the GPU. While MPS creates the appearance of concurrency, it doesn't provide the isolation necessary for true multitenancy. "The issue is that it's a shared memory space," McAllister explains. "You're not really isolated."

All processes funneled through MPS run through the same memory and compute context. This creates two major problems. First, it's a security and compliance nonstarter for most multitenant environments. "If you have, let's say, two customers like Pepsi and Coke, right? They don't want to be sharing memory; they want to be completely isolated," McAllister illustrates. The risk of data leakage or interference is too high. Second, it creates a "noisy neighbor" problem, where a resource-intensive process can degrade the performance of all other processes sharing the GPU.
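To make that shared-memory point concrete, here is a minimal sketch, assuming the nvidia-ml-py (pynvml) bindings and an NVIDIA driver are installed, that lists every process currently running on GPU 0 along with how much of the device's single memory pool each one occupies. MPS multiplexes these contexts onto the GPU, but it draws no hard boundary between them.

```python
# Minimal sketch: enumerate the processes sharing GPU 0 and their memory use.
# Assumes the nvidia-ml-py package (pynvml) and an NVIDIA driver are installed.
import pynvml

pynvml.nvmlInit()
try:
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU 0 total memory: {mem.total // (1024**2)} MiB")

    # Every compute process on the device draws from the same memory pool;
    # MPS interleaves their CUDA work but does not wall them off.
    for proc in pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
        used = proc.usedGpuMemory
        used_mib = "unknown" if used is None else f"{used // (1024**2)} MiB"
        print(f"  pid {proc.pid}: {used_mib}")
finally:
    pynvml.nvmlShutdown()
```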

Multi-Instance GPU (MIG) and Hardware Rigidity

MIG (Multi-Instance GPU) offers a more secure approach by enabling sharing at the hardware level. It physically partitions a single GPU into multiple, smaller, fully isolated instances. "MIG is basically taking an approach where it's slicing up the hardware. The GPU has separate instances on that piece of hardware, so it can be scheduled independently," says McAllister. Each MIG instance has its own dedicated compute units and memory, providing the true, hardware-enforced isolation that MPS lacks.

But MIG's reliance on hardware creates significant limitations. Its most critical drawback is its dependency on specific, high-end NVIDIA data-center GPUs, like the A100 or H100. This is a major issue for enterprises with existing investments in older hardware or mixed-GPU environments, as it offers no sharing solution for the vast majority of GPUs already deployed.

It also locks organizations into a costly and rapid hardware-upgrade cycle. "A year from now, there's going to be something else, and then you can't do this. And you just spent a small fortune on this set of GPUs," McAllister warns. This inflexibility extends to configuration; you can only partition the GPU into a few predefined slice sizes. If your workload needs don't perfectly match one of those sizes, you're left with wasted resources. MIG provides isolation, but the price is lost flexibility, vendor lock-in, and significant capital expenditure.
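To illustrate that hardware dependency, the sketch below, again assuming pynvml is installed, asks GPU 0 whether MIG is available and, if it is enabled, walks the fixed-size instances carved out of the card. On the majority of GPUs already deployed, the very first query fails with a "not supported" error, which is exactly the limitation described above.

```python
# Minimal sketch: check whether GPU 0 supports MIG and list its instances.
# Assumes pynvml (nvidia-ml-py) and an NVIDIA driver are installed.
import pynvml

pynvml.nvmlInit()
try:
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    try:
        current_mode, pending_mode = pynvml.nvmlDeviceGetMigMode(handle)
    except pynvml.NVMLError_NotSupported:
        # Most deployed GPUs land here: MIG is limited to specific
        # data-center parts such as the A100 and H100.
        print("GPU 0 does not support MIG")
    else:
        if current_mode == pynvml.NVML_DEVICE_MIG_ENABLE:
            # Walk the fixed, hardware-defined slices of this GPU.
            count = pynvml.nvmlDeviceGetMaxMigDeviceCount(handle)
            for i in range(count):
                try:
                    mig = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(handle, i)
                except pynvml.NVMLError_NotFound:
                    continue  # slot not configured
                mem = pynvml.nvmlDeviceGetMemoryInfo(mig)
                print(f"MIG instance {i}: {mem.total // (1024**2)} MiB")
        else:
            print("GPU 0 supports MIG, but it is currently disabled")
finally:
    pynvml.nvmlShutdown()
```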

The Virtual Cluster Solution

The limitations of both software- and hardware-based sharing highlight the need for a solution at a higher level of abstraction: one that provides strong isolation without being tied to specific hardware. This is the problem that the virtual cluster architecture is designed to solve.

Instead of coming at the sharing problem from the low-level hardware or driver layer, vCluster solves it at the Kubernetes orchestration layer. As McAllister explains, the architecture allows each tenant to have their own cluster, and "you can have virtual clusters inside of that host cluster that give you separation from each other. But then, you can still share the same GPU nodes." The model is simple yet powerful:

1. A host cluster: A single physical Kubernetes cluster is connected to the entire pool of physical GPUs, regardless of their make or MIG capability. This cluster owns all the physical resources.

2. Virtual clusters: Inside this host cluster, multiple virtual clusters are created. A virtual cluster is not a virtual machine; it's an incredibly lightweight, fully functional Kubernetes control plane (its own API server, controller manager, etc.) that runs as a simple pod on the host. That's why it can be provisioned in seconds.

3. Centralized scheduling: A single, powerful scheduler on the host cluster is responsible for all workloads. It sees the resource requests (e.g., "I need 1 NVIDIA GPU") from pods created in all the different virtual clusters and intelligently places them onto the available physical GPUs in the host's resource pool (see the sketch after this list).
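The following sketch shows what step 3 looks like from a tenant's side, assuming the official `kubernetes` Python client and a kubeconfig pointing at the virtual cluster (for example, one obtained with `vcluster connect`); the container image name is a placeholder. The pod spec is ordinary Kubernetes: the tenant simply asks for one `nvidia.com/gpu`, and the host cluster's scheduler decides which physical GPU node actually runs it.

```python
# Minimal sketch: request one NVIDIA GPU from inside a virtual cluster.
# Assumes the official `kubernetes` Python client and a kubeconfig that
# points at the virtual cluster (e.g. obtained via `vcluster connect`).
# The image name is a placeholder.
from kubernetes import client, config

config.load_kube_config()  # kubeconfig for the *virtual* cluster

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="training-job"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="trainer",
                image="my-registry/train:latest",  # placeholder image
                # The request the host scheduler sees: "I need 1 NVIDIA GPU."
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "1"}
                ),
            )
        ],
    ),
)

# Created against the virtual cluster's API server; vCluster syncs the pod
# to the host cluster, whose scheduler picks a physical GPU node for it.
client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```

From the tenant's perspective, nothing here differs from working in a dedicated cluster; the sharing happens entirely in how the host schedules the synced pod.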

This model delivers two key benefits simultaneously: tenant autonomy and centralized efficiency. Each tenant, whether an internal development team or external customer, gets a cluster that functions as their own dedicated, private Kubernetes cluster. As McAllister notes, "You can have your own API server, so you can install your own Helm charts and your own versions of your Helm charts and have all the different CRDs that you would prefer to have."

Meanwhile, from the organization's perspective, the underlying hardware is treated as one giant interchangeable resource pool. The host scheduler dynamically allocates fractional or whole GPUs to workloads from any virtual cluster, ensuring the expensive hardware is always in use. "They're going to be distributing that work throughout those GPUs, keeping them at a relatively high utilization, like 50 percent, 90 percent," as McAllister puts it.

This combination of tenant isolation and resource efficiency makes the approach particularly valuable in two key scenarios:

  • Internal enterprise use: Large companies can provide self-service, isolated Kubernetes environments to their dozens or hundreds of development teams, allowing them to innovate quickly without stepping on each other's toes, all while sharing the same central GPU infrastructure.

  • GPU as a service: Cloud providers or MSPs can offer secure, cost-effective, multitenant GPU-enabled clusters to their customers, with each customer running in their own isolated virtual cluster on the shared hardware.

Real-World Impact: The Aussie Broadband Case Study

The benefits of this virtualized model are clearly demonstrated by the success of Aussie Broadband, an internet service provider that adopted vCluster to streamline its development and testing environments. The results include:

  • $180,000 in annual cost savings: This came directly from reducing the number of physical Kubernetes clusters they needed to provision and pay for, consolidating workloads onto fewer, more efficient clusters.

  • 2,400 developer hours saved per year: This time saving, worth more than a senior engineer's annual salary, came from vCluster's ability to provision environments 99 percent faster than the minutes-long process of spinning up a traditional physical cluster.

  • Reduced licensing costs: Aussie Broadband also saw significant savings by moving away from virtual machines for tenancy. As McAllister notes, "A lot of folks use virtual machines as a way to do tenancy … They were able to reduce the number of virtual machines they had to use." Reducing or eliminating enterprise VM licensing costs, which can run into thousands of dollars per host, is another significant saving.

Conclusion

The fundamental challenges of GPU sharing stem from a technology designed for one purpose being retrofitted for another. Direct, low-level sharing methods like MPS and MIG force you into an undesirable choice between true isolation and hardware flexibility.

Virtual clusters provide a better abstraction layer that resolves this conflict. By decoupling the tenant's administrative environment from the underlying physical hardware, vCluster lets you achieve both the strong isolation needed for secure multitenancy and the dynamic resource sharing required to maximize the return on your GPU investment. If you're building a scalable and cost-effective AI platform, this virtualized approach is what turns an underutilized cluster of GPUs into a highly efficient engine for your teams.

Now that we've covered the "why" and "what" of using virtual clusters for GPU sharing, let's move to the "how." In the [final part of this series](URL), we'll walk through actually deploying a virtual cluster on a GPU-enabled Kubernetes cluster to see these principles in action.

Ready to take vCluster for a spin?

Deploy your first virtual cluster today.