
Deploying vClusters on a GPU-Enabled Kubernetes Cluster

Damaso Sanoja
Jan 13, 2026 | 7 min read

This is the final article in our three-part series exploring GPU sharing and multitenancy in Kubernetes environments. In the first article, we examined traditional open source solutions like time-slicing and NVIDIA MIG, revealing how these "DIY" approaches often fall short in production-grade MLOps environments due to limitations in manageability, security, and operational clarity as organizations scale. The second article introduced Kubernetes virtual clusters (vClusters) as a transformative solution, showcasing how vCluster technology delivers robust multitenancy, enhanced governance, and streamlined GPU management.

Now, it's time to move beyond theory. If you're tired of wrestling with unreliable GPU sharing or juggling complex tenant isolation in Kubernetes, this hands-on tutorial will show you exactly how to deploy virtual clusters on a GPU-enabled Kubernetes cluster.

Step by step, you'll learn how to set up the prerequisites, install a virtual cluster, expose GPU resources, and run a real deep learning workload in an isolated, multitenant environment, with each step explaining both the "how" and the critical "why." Whether you're a platform engineer or an MLOps architect, this tutorial equips you to build scalable, secure, and developer-friendly GPU infrastructure with Loft's vCluster.

Implementing vCluster on a GPU-Enabled Cluster

The successful deployment of vCluster for GPU workloads begins with a properly configured Kubernetes environment, purpose-built for GPU virtualization and secure multitenancy. The prerequisites below establish the technical baseline necessary to install vCluster, expose GPU resources safely, and ensure robust workload isolation and sharing. Each Kubernetes platform implements GPU enablement differently, affecting both how you configure nodes and how GPU drivers are managed. Because of this variability, the deployment steps and tools may differ between providers.

To keep your environment aligned with the steps that follow, this tutorial walks you through deploying and configuring a specific GPU-enabled Kubernetes setup. The examples use Google Kubernetes Engine (GKE) with GPU-enabled nodes. Google's managed service is ideal for demonstration purposes since GKE automatically manages NVIDIA GPU driver installation, eliminating the need for manual setup with the NVIDIA GPU Operator. This abstraction not only speeds up deployment but also reduces operational risk from driver and version mismatches.

Prerequisites

Before continuing with the tutorial, check that you have the following:

  • A Google Cloud project with billing enabled and enough GPU quota to provision an NVIDIA L4 GPU

  • The gcloud CLI installed and authenticated against that project

  • kubectl installed and configured on your workstation

  • The vCluster CLI (`vcluster`) installed
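
To confirm the tooling is in place, you can quickly print the CLI versions (a sanity check only; depending on your CLI versions, the exact version subcommands may differ slightly):

gcloud version
kubectl version --client
vcluster version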

Provisioning the GKE Environment

To kick off the deployment, provision a GKE cluster with a single GPU-enabled node. This provides the dedicated hardware required for GPU workloads.

Use the following command to create the cluster:

gcloud container clusters create vcluster-gpu-gke \
  --zone us-east1-b \
  --num-nodes 1 \
  --machine-type g2-standard-4 \
  --accelerator type=nvidia-l4,count=1 \
  --preemptible \
  --scopes=https://www.googleapis.com/auth/cloud-platform \
  --cluster-version latest \
  --enable-ip-alias

Let's break down the configuration:

  • `--zone us-east1-b`: Specifies the physical location where your cluster will run. Choosing a region close to users or data can reduce latency and improve throughput; here, *us-east1-b* is selected for broad accessibility.

  • `--num-nodes 1`: Provisions only one node in the cluster's default pool, minimizing costs and complexity. For production, multiple nodes are typically required for redundancy and scalability.

  • `--machine-type g2-standard-4`: Allocates a virtual machine designed for GPU workloads with a balanced CPU-to-GPU ratio. This instance type offers sufficient memory and compute for demanding ML inference or training jobs but is still cost-conscious.

  • `--accelerator type=nvidia-l4,count=1`: Attaches a single NVIDIA L4 GPU to the node, ensuring that Kubernetes can schedule GPU workloads requiring CUDA support. L4s are well suited for AI inference and training scenarios.

  • `--preemptible`: Marks the VM as preemptible, significantly reducing costs by allowing Google to reclaim the instance at any time. This is optimal for test environments or tutorials where uptime reliability is less critical.

  • `--scopes=https://www.googleapis.com/auth/cloud-platform`: Grants the node full access to Google Cloud APIs, enabling broader functionality (e.g., storage and monitoring), which simplifies integration for experimentation.

  • `--cluster-version latest`: Ensures the cluster uses the latest available GKE version, reducing risk of security vulnerabilities or incompatibility issues.

  • `--enable-ip-alias`: Activates VPC-native routing, which improves cluster networking scalability and security in enterprise and multitenant setups.

This example uses the `g2-standard-4` machine type with one preemptible node to demonstrate the workflow at the lowest reasonable cost. Keep in mind that preemptible nodes are great for testing or learning, but you'll want to avoid them in production since they can be shut down without warning.

Likewise, opting for a single-node, single-pool configuration is driven by cost-saving considerations; it helps you replicate the steps without incurring high cloud expenses. For real-world production deployments, use standard (on-demand) nodes and a multinode cluster to ensure reliability, high availability, and performance.
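
For reference, a production-leaning variant of the same command might look like the following, dropping `--preemptible` and provisioning three on-demand nodes (the node count and machine type here are illustrative and should be sized to your actual workloads):

gcloud container clusters create vcluster-gpu-gke \
  --zone us-east1-b \
  --num-nodes 3 \
  --machine-type g2-standard-4 \
  --accelerator type=nvidia-l4,count=1 \
  --scopes=https://www.googleapis.com/auth/cloud-platform \
  --cluster-version latest \
  --enable-ip-alias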

Lastly, remember that GKE provisioning may fail if your Google Cloud project does not have sufficient GPU quota, both the project-wide `GPUS_ALL_REGIONS` quota and the per-region quota for the GPU model you request. Before deploying, check your current GPU quotas and, if needed, request an increase via the Google Cloud console Quotas page. Quota adjustments may take time, so plan ahead to avoid unnecessary delays in your cluster setup.
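
One way to check those quotas from the command line is sketched below; `GPUS_ALL_REGIONS` is the project-wide metric mentioned above, while the regional `NVIDIA_L4` metric name is an assumption based on common GCP quota naming, so verify it in the Quotas page if the grep comes back empty:

# Project-wide GPU quota
gcloud compute project-info describe --format=json | grep -B 2 -A 2 "GPUS_ALL_REGIONS"

# Regional quota for the zone's parent region (look for the L4 entry)
gcloud compute regions describe us-east1 --format=json | grep -B 2 -A 2 "NVIDIA_L4"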

Validating GPU Access on the Host

Since virtual clusters inherit GPU resources from the host cluster, any issues with GPU drivers or device plugins on the host will directly impact GPU availability within your virtual clusters. That's why it's good practice to validate that the NVIDIA driver and device plugin have been installed and are running correctly in your GKE cluster.

Run:

kubectl get daemonset -n kube-system | grep nvidia

You should see output similar to:

nvidia-gpu-device-plugin-large-cos        0         0         0       0            0           <none>                                               63s
nvidia-gpu-device-plugin-large-ubuntu     0         0         0       0            0           <none>                                               62s
nvidia-gpu-device-plugin-medium-cos       0         0         0       0            0           <none>                                               63s
nvidia-gpu-device-plugin-medium-ubuntu    0         0         0       0            0           <none>                                               62s
nvidia-gpu-device-plugin-small-cos        1         1         0       1            0           <none>                                               63s
nvidia-gpu-device-plugin-small-ubuntu     0         0         0       0            0           <none>                                               63s

This output confirms that GKE has deployed its NVIDIA GPU device plugin `DaemonSet` variants and that one of them (here, `nvidia-gpu-device-plugin-small-cos`) has been scheduled on the GPU-enabled node. It may take a minute or two for the pod to report as ready.

But before moving on, you should still verify that your host Kubernetes cluster can access and utilize the GPU before building any higher-level abstractions. If the physical/VM setup or driver installation is misconfigured, virtual clusters will silently fail or hang, resulting in lengthy troubleshooting cycles.

To check end-to-end GPU access at the node level, create a simple validation pod manifest called `nvidia-cuda-test.yaml` and paste in the following code:

apiVersion: v1
kind: Pod
metadata:
  name: nvidia-cuda-test
spec:
  containers:
  - name: cuda-container
    image: nvidia/cuda:12.2.0-base-ubuntu22.04
    resources:
      limits:
        nvidia.com/gpu: 1
    command: ["nvidia-smi"]
    args: ["-L"]
  restartPolicy: Never

Apply the manifest:

kubectl apply -f nvidia-cuda-test.yaml

After the pod has been scheduled and has run, retrieve its logs to confirm GPU detection:

kubectl logs nvidia-cuda-test

The expected result is similar to:

GPU 0: NVIDIA L4 (UUID: GPU-xxx-xxxxx-xxxx-xxxx-xxxxxxxxxxx)

A successful log message confirms that your node's NVIDIA drivers, device plugin, and runtime are correctly configured to expose GPU resources within Kubernetes workloads.
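
Because the pod uses `restartPolicy: Never` and exits after printing the GPU list, you can optionally delete it now to keep the host cluster tidy:

kubectl delete pod nvidia-cuda-test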

With the GPU layer now validated, you're ready to create a virtual cluster on your newly provisioned environment.

Setting Up a GPU-Enabled vCluster

You can use vCluster to create isolated Kubernetes environments that also inherit access to the host's resources. In other words, virtual clusters deployed with vCluster have visibility into node-level resources, including GPUs, provided they are available in the physical cluster and correctly advertised.

For better resource separation and access control, start by creating a dedicated namespace for your GPU workloads. Isolating the virtual cluster in its own namespace ensures cleaner management, finer-grained RBAC, and easier auditability:

kubectl create ns gpu-01

With the namespace ready, deploy a new virtual cluster named `vcluster-01` within the namespace `gpu-01`:

vcluster create vcluster-01 --namespace gpu-01

Sample output:

06:58:11 info Chart not embedded: "open chart/vcluster-0.27.0.tgz: file does not exist", pulling from helm repository.
06:58:11 info Create vcluster vcluster-01...
...
06:59:11 done vCluster is up and running
Forwarding from 127.0.0.1:10391 -> 8443
Handling connection for 10391
06:59:12 done Switched active kube context to vcluster_vcluster-01_gpu-01_...
...
06:59:12 warn Since you are using port-forwarding to connect, you will need to leave this terminal open
- Use CTRL+C to return to your previous kube context
- Use `kubectl get namespaces` in another terminal to access the vcluster

When running the vCluster CLI, keep in mind that a port-forwarded connection is established so your `kubectl` commands point to the virtual cluster's API server. To stay connected, leave the original terminal open.
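
If you prefer not to keep a dedicated terminal open, the vCluster CLI also supports passing a one-off command after `--` (check `vcluster connect --help` for your version), for example:

vcluster connect vcluster-01 -n gpu-01 -- kubectl get namespaces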

Open another terminal session, and run the following command to connect and interact with the virtual cluster in parallel:

vcluster connect vcluster-01 -n gpu-01

At this stage, your virtual cluster environment is deployed and ready for validation. The next step is to confirm that this virtual cluster can properly discover and schedule real GPU workloads.
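
As an optional sanity check before scheduling a real workload, you can look at the node information the virtual cluster exposes and confirm that a `nvidia.com/gpu` capacity entry is present (how nodes appear inside a vCluster depends on its node-syncing configuration, so treat this as a quick smoke test rather than a definitive answer):

kubectl get nodes -o yaml | grep "nvidia.com/gpu"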

Testing vCluster GPU Access Using a PyTorch Workload

Previously, you verified GPU functionality on the host by running a simple pod with `nvidia-smi -L`, ensuring the hardware and driver stack are intact. However, surface-level checks can mask configuration gaps that only appear under real-world machine learning workloads.

To rigorously validate GPU passthrough, you'll now run a live PyTorch job inside the virtual cluster. This approach represents a true "slice of reality" for MLOps scenarios since it will run an actual GPU workload.

First, create the deployment manifest `pytorch-gpu-check.yaml`, which launches a container running a CUDA-enabled PyTorch image:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: pytorch-gpu-check
spec:
  replicas: 1
  selector:
    matchLabels:
      app: pytorch-gpu-check
  template:
    metadata:
      labels:
        app: pytorch-gpu-check
    spec:
      containers:
      - name: pytorch
        image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
        command: ["python", "-c"]
        args:
          - |
            import torch
            if torch.cuda.is_available():
                print("CUDA is available: ", torch.cuda.get_device_name(0))
            else:
                print("CUDA NOT available")
        resources:
          limits:
            nvidia.com/gpu: 1
      restartPolicy: Always

The container executes a short Python script. If the CUDA runtime and GPU are accessible, it prints the GPU name; otherwise, it logs an error.

Unlike the previous single-shot check, this workload runs as a long-lived Deployment, so the pod claims and retains the allocated GPU until it is scaled down or deleted, making it a stricter and more realistic test of cluster capability and scheduling.

Apply the workload using the command:

kubectl apply -f pytorch-gpu-check.yaml

Give the pod a few moments to pull the image and start (initial startup may take a couple of minutes due to image size and node resource allocation).
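
If you'd rather not poll by hand, you can also wait for the rollout to complete:

kubectl rollout status deployment/pytorch-gpu-check --timeout=5m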

Check the pod status:

kubectl get pods

For a healthy deployment, you should see:

NAME                         READY   STATUS    RESTARTS   AGE
pytorch-gpu-check-xxx        1/1     Running   0          2m

To confirm GPU access, review the deployment logs:

kubectl logs deployment/pytorch-gpu-check

Here's the expected output:

CUDA is available:  NVIDIA L4

This result demonstrates end-to-end functionality: Your virtual cluster can identify, claim, and actively use the GPU for machine learning, a critical milestone for multitenant and production-grade MLOps workloads.

With the GPU validation complete, you have a fully functional foundation for GPU-enabled virtual clusters. From here, you can continue to configure vCluster for advanced isolation and safe GPU sharing at scale based on your specific production requirements.
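
As a starting point for that hardening work, here's a minimal, hypothetical `vcluster.yaml` sketch that caps how many GPUs a single virtual cluster can request and enables basic tenant isolation; the exact keys depend on your vCluster version, so verify them against the vCluster configuration reference before relying on it:

# vcluster.yaml -- illustrative only; confirm key names for your vCluster version
policies:
  resourceQuota:
    enabled: true
    quota:
      requests.nvidia.com/gpu: "1"   # cap this virtual cluster at one GPU
  limitRange:
    enabled: true
  networkPolicy:
    enabled: true                    # limit cross-tenant traffic on the host

You would then pass the file when creating a virtual cluster, for example with `vcluster create <name> -n <namespace> -f vcluster.yaml`.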

Conclusion

You've successfully deployed a virtual cluster atop a GPU-enabled GKE node, verified end-to-end GPU access, and launched a real PyTorch workload, thus establishing the vital foundation for multitenant, production-ready MLOps in Kubernetes.

This tutorial used a minimal single-node setup to demonstrate core concepts and workflows, but scaling out is straightforward: A multinode cluster unlocks true multitenancy, allowing you to create multiple virtual clusters, each capable of dynamically allocating one or more GPUs with `nvidia.com/gpu: 1` or greater, tailored to workload requirements. This level of control enables teams to run isolated machine learning jobs, tune resource limits, and flexibly share or reserve GPU resources for diverse projects.
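
As a sketch of what that looks like in practice, each additional tenant simply gets its own namespace and virtual cluster by repeating the earlier commands with new names (the names below are placeholders):

kubectl create ns gpu-02
vcluster create vcluster-02 --namespace gpu-02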

Experiment with more complex ML or data engineering pipelines, adjust resource limits to suit multi-GPU jobs, and begin integrating virtual clusters with your broader MLOps tooling. With vCluster, you're equipped to eliminate resource contention and operational overhead, allowing your teams to move faster, experiment safely, and deliver results at scale. This tutorial wraps up our three-part journey—from exploring the pain points of traditional GPU sharing to discovering the power of virtual clusters, and now actually getting it all working in practice.

📚 Explore the Full Series: GPU Multitenancy with vCluster

This tutorial is the final installment of our three-part series on optimizing GPU resources within Kubernetes. If you missed the earlier chapters, catch up below to understand the full journey from "DIY" sharing to advanced virtualization:

Part 1: DIY GPU Sharing in Kubernetes

Part 2: Solving GPU Sharing Challenges with Virtual Clusters

Ready to take vCluster for a spin?

Deploy your first virtual cluster today.