Build for Production
vCluster lets you provision isolated tenant clusters on your existing infrastructure without a separate physical cluster per tenant. This section maps the common production architectures to concrete implementation paths. Choose the one that matches what you are building and follow it from initial design to a running, operated platform.
Start with a quick start to prove the deployment model in your environment, then return here to plan production.
What are you building?
| You are building | Where to start |
|---|---|
| A managed Kubernetes service for paying customers on GPU infrastructure | AI Cloud: Managed Kubernetes Service |
| An internal AI platform for your organization's R&D and engineering teams | Enterprise AI Factory |
| A unified GPU operations layer across multiple compute sources or suppliers | Distributed Compute Aggregation |
| A shared Kubernetes platform for internal engineering teams and services | Internal Kubernetes Platform |
| Ephemeral isolated clusters for CI/CD pipelines with automatic cleanup | CI/CD Platform |
| A dedicated cluster stack per enterprise customer | Single-Tenant Per Customer |
| Tenant workloads at distributed edge sites from a central control plane | Edge Distribution |
If you are not sure which path fits, start with Architecture and Building a GPU cloud platform.
What production-ready means
A production vCluster platform delivers:
- Tenant isolation: every customer or team sees only their own cluster, nodes, and workloads
- Repeatable provisioning: new tenant clusters deploy from a defined template, not from manual steps
- A defined worker node model: shared nodes, dedicated node pools, private nodes, or Standalone, matched to your security, performance, and cost requirements
- Governed access: who can create, access, and administer tenant clusters, enforced through Platform policies
- Durable control planes: HA, data store, and backup procedures defined before tenants depend on the system
- Operational readiness: monitoring, upgrade, restore, and incident response procedures documented and tested
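One way to track the criteria above is as a simple readiness checklist. The following is a minimal sketch, not part of vCluster or the Platform: the short criterion keys and the helper names `missing_criteria` and `is_production_ready` are hypothetical labels introduced only for illustration.

```python
# Hypothetical short keys for the six production-readiness criteria listed above.
CRITERIA = (
    "tenant_isolation",
    "repeatable_provisioning",
    "worker_node_model",
    "governed_access",
    "durable_control_planes",
    "operational_readiness",
)


def missing_criteria(status: dict) -> list:
    """Return the criteria not yet marked True, in the order listed above."""
    return [name for name in CRITERIA if not status.get(name, False)]


def is_production_ready(status: dict) -> bool:
    """A platform counts as ready only when every criterion is satisfied."""
    return not missing_criteria(status)


if __name__ == "__main__":
    # Example: everything is in place except the day 2 procedures.
    status = {name: True for name in CRITERIA}
    status["operational_readiness"] = False
    print(missing_criteria(status))
    print(is_production_ready(status))
```

Treating an unlisted criterion as not done keeps the check conservative: a platform is reported ready only when every item has been explicitly confirmed.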
Coming from a quick start?
Each quick start validates a specific deployment model. Use this table to connect what you proved to the production path that extends it.
| If you completed | You have proven | Production paths to consider |
|---|---|---|
| Docker (vind) | Local or CI cluster behavior | Use vind for CI only. Choose a path above based on your production use case. |
| Shared Nodes | Tenant clusters on an existing Kubernetes cluster | Internal Kubernetes Platform, CI/CD Platform, or Enterprise AI Factory (shared tier) |
| Private Nodes | Tenant clusters with dedicated worker nodes | AI Cloud, Enterprise AI Factory (production tier), Single-Tenant Per Customer |
| Standalone | Control Plane Cluster on bare metal or VMs | AI Cloud, Distributed Compute Aggregation, Enterprise AI Factory (on-premises) |
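The table above can also be read in reverse, to find which quick starts lead toward a given production path. The sketch below encodes the table as a plain dictionary for that purpose. The variable and function names are hypothetical, and the empty list for Docker (vind) reflects the table's guidance that vind is for CI only.

```python
# Quick start -> production paths, transcribed from the table above.
PRODUCTION_PATHS = {
    "Docker (vind)": [],  # CI only; choose a production path by use case
    "Shared Nodes": [
        "Internal Kubernetes Platform",
        "CI/CD Platform",
        "Enterprise AI Factory (shared tier)",
    ],
    "Private Nodes": [
        "AI Cloud",
        "Enterprise AI Factory (production tier)",
        "Single-Tenant Per Customer",
    ],
    "Standalone": [
        "AI Cloud",
        "Distributed Compute Aggregation",
        "Enterprise AI Factory (on-premises)",
    ],
}


def quick_starts_for(path_prefix: str) -> list:
    """Return the quick starts whose listed paths start with path_prefix."""
    return [
        quick_start
        for quick_start, paths in PRODUCTION_PATHS.items()
        if any(path.startswith(path_prefix) for path in paths)
    ]


if __name__ == "__main__":
    print(quick_starts_for("AI Cloud"))
    print(quick_starts_for("Enterprise AI Factory"))
```

Matching on a prefix lets a single query such as "Enterprise AI Factory" cover the shared, production, and on-premises tiers together.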
Day 2 operations reference
The following operations apply across all paths.
| Operation | Read next |
|---|---|
| Monitor Platform and tenant workloads | Monitoring overview, fleet monitoring |
| Back up and restore tenant clusters | Snapshots, restore, Velero |
| Back up and restore Platform | Backup and restore Platform, Platform database |
| Upgrade Platform and tenant clusters | Upgrade vCluster, upgrade Platform, lifecycle policy |
| Rotate certificates | Certificate rotation |
| Manage private worker nodes | Manage private nodes, Auto Nodes |
| Scale and recover the platform | Platform HA, multi-region Platform |
| Troubleshoot incidents | vCluster troubleshoot, debug commands, Platform troubleshooting |