AI INFRA ON NVIDIA

Get the Public Cloud Experience on Your DGX Systems

Run elastic, multi-tenant Kubernetes clusters directly on NVIDIA DGX hardware. vCluster brings the elasticity, automation, and developer experience of the public cloud to your on-prem GPU infrastructure.

What’s Holding Back Your NVIDIA GPU Investment?

Before vCluster, DGX deployments were either too rigid or too fragmented—limiting utilization and slowing teams down. With vCluster, NVIDIA GPU infrastructure becomes cloud-native, efficient, and fully automated.

Before

Fragmented, Rigid, and Hard to Scale

Most DGX systems today are deployed as either one giant shared cluster or many small ones; both approaches come with trade-offs that limit performance and efficiency.

  • Manual provisioning and upgrades across DGX nodes
  • Separate clusters per team for isolation
  • Idle GPUs and wasted capacity
  • Fragmented workflows outside standard DevOps pipelines
  • Complex compliance and security management

After

Elastic, Kubernetes-Native AI Infrastructure

vCluster transforms DGX environments into cloud-like Kubernetes platforms—secure, dynamic, and automated—with unified control over GPU resources.

  • Automated provisioning and lifecycle management via BCM + vCluster
  • Isolated virtual clusters without VM overhead
  • GPUs scale dynamically with Auto Nodes
  • Seamless GitOps and Terraform integration (a minimal GitOps sketch follows this list)
  • Strong governance and high utilization
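
As a rough sketch of what the GitOps path can look like, the Argo CD Application below deploys a virtual cluster from the public vCluster Helm chart into a tenant namespace on the DGX host cluster. The application name, tenant namespace, and pinned chart version are illustrative assumptions; adapt them to your environment.

  # Illustrative GitOps definition (Argo CD Application); names and versions are placeholders.
  apiVersion: argoproj.io/v1alpha1
  kind: Application
  metadata:
    name: team-a-vcluster
    namespace: argocd
  spec:
    project: default
    source:
      repoURL: https://charts.loft.sh   # public vCluster Helm chart repository
      chart: vcluster
      targetRevision: 0.20.0            # pin to the chart version you have validated
    destination:
      server: https://kubernetes.default.svc
      namespace: team-a                 # host-cluster namespace backing this tenant
    syncPolicy:
      automated:
        prune: true
        selfHeal: true
      syncOptions:
        - CreateNamespace=true

Because the tenant's virtual cluster is just another manifest in Git, updating or deleting it drives the corresponding change on the DGX host cluster, keeping environments reproducible and auditable.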

NVIDIA DGX Reference Architecture

A blueprint for bringing cloud-grade elasticity and automation to NVIDIA DGX systems, showing how BCM and vCluster work together to deliver multi-tenancy, autoscaling, and security across GPU clusters.

Download Guide

The Building Blocks of Cloud-Grade DGX Operations

Each capability represents a step in modernizing DGX infrastructure, from automation and elasticity to security and hybrid connectivity.

  • 1. Automate GPU Lifecycle Management
    Integrate vCluster with NVIDIA Base Command Manager (BCM) to automate provisioning, scaling, and lifecycle operations for DGX and SuperPOD systems. Manage GPU capacity through a Kubernetes-native control plane.
  • 2. Isolate Tenants Without VMs
    Combine Private Nodes with the vNode Runtime to deliver secure, isolated environments for each team. Prevent container breakouts and ensure GPU isolation without reverting to virtual machines.
  • 3. Scale GPU Workloads on Demand
    Use Auto Nodes to dynamically scale GPU and CPU resources across clouds, data centers, and bare metal. Automatically add or remove DGX capacity as workloads fluctuate, maximizing utilization and reducing idle time (a minimal GPU workload request is sketched after this list).
  • 4. Connect Hybrid Environments Securely
    Enable vCluster VPN to create private, encrypted communication between control planes and worker nodes, ideal for hybrid or burst-to-cloud GPU deployments.
  • 5. Simplify Network Management for Tenants
    Use the Netris Integration to automate network isolation and lifecycle management. Each tenant receives its own secure network path with policies and firewalls managed declaratively.
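
To illustrate how tenant workloads consume DGX capacity once a virtual cluster exists, the sketch below requests a single GPU through the standard nvidia.com/gpu resource exposed by the NVIDIA device plugin. The pod name and CUDA image tag are assumptions, and Auto Nodes (where enabled) is what would ensure backing GPU capacity is available.

  # Illustrative GPU smoke test submitted to a tenant's virtual cluster; image tag is a placeholder.
  apiVersion: v1
  kind: Pod
  metadata:
    name: gpu-smoke-test
  spec:
    restartPolicy: Never
    containers:
      - name: cuda
        image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
        command: ["nvidia-smi"]          # prints the GPUs visible to the container
        resources:
          limits:
            nvidia.com/gpu: 1            # scheduled onto a DGX node with a free GPU

Because the request goes through the tenant's own Kubernetes API, quotas, policies, and per-team utilization reporting can all be applied at the virtual-cluster level.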

Empowering Every AI Team on DGX

vCluster enables teams across data science, platform engineering, and operations to collaborate efficiently on shared GPU infrastructure without sacrificing speed, cost efficiency, or security.

Data Science & Research

Launch isolated development environments in minutes, then put them to sleep when idle to reclaim GPU resources.

AI Training & Inference

Scale DGX workloads dynamically across hybrid clusters to meet real-time compute demands.

Platform Engineering

Unify provisioning, monitoring, and upgrades across all DGX systems with one control plane.

FinOps Optimization

Track GPU usage per tenant, automate scaling, and eliminate idle cost waste.

Proven Gains in Efficiency and Scale

Organizations running AI infrastructure on NVIDIA DGX with vCluster achieve measurable improvements in utilization, velocity, and simplicity.

Faster Provisioning

  • Cut environment setup from days to minutes through declarative automation.

Higher GPU Utilization

  • 60–85% GPU usage sustained across production clusters.

Reduced Idle Time

  • Idle GPU hours lowered by up to 70% with Auto Nodes and Sleep Mode.

Simplified Operations

  • 50–70% fewer manual management tasks with vCluster Platform.

Empower Your AI Teams on DGX

Talk with our team about turning your DGX clusters into a scalable, self-service platform for AI and ML workloads.