AI Cloud: Managed Kubernetes Service
Run a managed Kubernetes service on your GPU infrastructure. Each customer gets an isolated tenant cluster with dedicated GPU nodes. Your product is what customers interact with. Platform is the operations layer your team runs behind it.
Typical stack: Standalone (HA) as the Control Plane Cluster. Private nodes per customer cluster. vMetal for bare metal GPU lifecycle. vNode for workload runtime isolation.

What makes this path different: Customers never touch Platform. They interact with your product. Platform RBAC restricts direct access to your platform engineering team.
Day 0: Design decisions
| Decision | Read next | Outcome |
|---|---|---|
| Choose the control plane deployment model | Standalone deployment, Architecture | Control planes as pods on an existing Kubernetes cluster, or Standalone on dedicated CPU nodes. Standalone is the common choice when no prior Kubernetes substrate exists. |
| Plan bare metal GPU provisioning | vMetal docs, Metal3 node provider, bare metal overview | Decide whether vMetal manages the full machine lifecycle (PXE, OS imaging, BMC, reclaim) or nodes are joined manually or via another provisioner. |
| Define per-customer node isolation | Private Nodes, node requirements | Each customer's tenant cluster gets its own dedicated GPU node pool with a separate CNI/CSI, eliminating interference between customers. |
| Plan network isolation | VPN, Netris integration | Tenant clusters connect to their private nodes over an encrypted VPN tunnel. Netris integration adds switch-level VLAN/VXLAN isolation per tenant. |
| Choose runtime isolation model | vNode docs, Virtual Nodes | vNode provides kernel-level container isolation without VM overhead. Recommended when customers run privileged workloads, execute dynamic code, or need GPU access via CDI. |
| Define cluster templates and AI stacks | Templates, Certified Stacks | Each customer cluster template includes GPU Operator, a scheduler (Run:ai, Kueue, Volcano), and optionally a developer environment. Certified Stacks provide pre-validated configurations. |
| Plan the customer-facing provisioning API | Projects, Quotas, Platform API | Your product API calls Platform to provision tenant clusters. Define the project structure, quota model, and automation hooks that back your customer-facing workflows. |
| Plan durability | Backing store, container control plane HA, Standalone HA, Platform HA | Choose the data store and replica model for Platform and per-customer control planes. |
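
To make the bare metal provisioning decision more concrete, here is a rough sketch of how a host is typically registered in upstream Metal3: a `BareMetalHost` resource that references a Secret holding BMC credentials. Whether vMetal exposes these resources directly or wraps them is an assumption here, so check the vMetal docs before relying on it. All names, addresses, and credentials are hypothetical placeholders.

```yaml
# Hypothetical BMC credentials for one GPU host (placeholder values).
apiVersion: v1
kind: Secret
metadata:
  name: gpu-node-01-bmc-secret
type: Opaque
stringData:
  username: admin
  password: change-me
---
# Upstream Metal3 host definition; vMetal may manage this for you.
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: gpu-node-01
spec:
  online: true
  bootMACAddress: "00:11:22:33:44:55"   # MAC of the PXE-booting NIC
  bmc:
    address: redfish://192.0.2.10/redfish/v1/Systems/1
    credentialsName: gpu-node-01-bmc-secret
```

In upstream Metal3, a host that has been registered and inspected reports the `available` provisioning state, which is the state the Day 1 steps ask you to verify.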
Day 1: Stand up the first production customer cluster
Steps 3 and 4 configure Platform for your platform engineering team, not for your customers. Customers provision clusters through your product. Platform access should be restricted to your ops team.
1. Install vCluster Platform. If building from bare metal, deploy vCluster Standalone first, then move to Standalone HA before production traffic.
2. Configure backing store and Platform HA.
3. Configure SSO and permissions for your platform engineering team.
4. Create projects, templates, quotas, and Auto Nodes to back your customer provisioning workflows.
5. Set up vMetal and the Metal3 node provider: register BMC credentials, configure PXE networking, define OS images, and verify bare metal hosts reach `available`.
6. Configure per-customer network isolation with VPN and, if using Netris, the Netris integration.
7. Install vNode on eligible GPU nodes. Configure `sync.toHost.pods.runtimeClassName: vnode` in the cluster template.
8. Deploy the first customer template using Certified Stacks as the starting point for GPU Operator, scheduler, and AI tooling.
9. Validate tenant isolation from inside the tenant cluster: confirm the customer cannot see the Control Plane Cluster, other tenants, or platform internals.
10. Wire your product API to Platform's provisioning endpoints and test the end-to-end customer onboarding flow.
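
The vNode step above gives the runtime class setting as a dotted path. Written out as nested YAML, it would look roughly like the fragment below. This is a minimal sketch: the key path and value come from the step itself, while where the fragment lives (the vCluster values embedded in your cluster template) is an assumption that depends on how your templates are structured.

```yaml
# Fragment of the tenant cluster configuration inside the cluster template.
# Pods synced from the tenant cluster to the host get this runtime class,
# so they run under vNode on nodes where vNode is installed.
sync:
  toHost:
    pods:
      runtimeClassName: vnode
```

Because the setting applies to pods synced to the host, it presumably needs vNode to be installed on the target GPU nodes first, which matches the order of the step.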
Day 2: Operate
| Operation | Read next |
|---|---|
| Manage bare metal capacity and machine lifecycle | Bare metal overview, Metal3 node provider, vMetal docs |
| Monitor platform and tenant workloads | Monitoring overview, fleet monitoring |
| Upgrade Platform and tenant clusters | Upgrade vCluster, upgrade Platform |
| Back up and restore tenant clusters and Platform | Snapshots, restore, Platform backup |
| Manage vNode compatibility during upgrades | vNode limitations, vNode configuration |
| Scale the Control Plane Cluster | Platform HA, multi-region Platform |