Virtual clusters are Kubernetes clusters that run on top of other Kubernetes clusters. Compared to fully separate "real" clusters, virtual clusters do not have their own node pools. Instead, they are scheduling workloads inside the underlying cluster while having their control plane.
By default, vclusters run as a single pod (scheduled by a StatefulSet) that consists of 2 containers:
- Control Plane: This container contains API server, controller manager and a connection (or mount) of the data store. By default, vclusters use sqlite as data store and run the API server and controller manager of k3s, which is a certified Kubernetes distribution and CNCF sandbox project. You can also use a different data store, such as etcd, mysql or postgresql.
- Syncer: What makes a vcluster virtual is the fact that it does not have a scheduler. Instead, it uses a so-called syncer which copies the pods that need to be scheduled from the vcluster to the underlying host cluster. Then, the host vcluster will schedule the pod and the vcluster will keep the vcluster pod and host cluster pod in sync.
Each vcluster has its own control plane consisting of:
- Kubernetes API server (point your kubectl requests to this vcluster API server)
- Data store (where the API stores all resources, real clusters run with etcd)
- Controller Manager (creates pods objects in the data store according to replica number in ReplicaSets etc.)
A vcluster is missing one key part of the Kubernetes control plane: the scheduler. Instead of having a scheduler, the vcluster uses a so-called syncer which copies the pods that need to be scheduled from the vcluster to the underlying host cluster. Then, the host vcluster will schedule the pod and the vcluster will keep the vcluster pod and host cluster pod in sync.
Every vcluster runs on top of another Kubernetes cluster, called host cluster. Each vcluster runs as a regular StatefulSet inside a namespace of the host cluster. This namespace is called host namespace. Everything that you create inside the vcluster lives either inside the vcluster itself or inside the host namespace.
It is possible to run multiple vclusters inside the same namespace and you can even run vclusters inside another vcluster (vcluster nesting).
The core idea of virtual clusters is to provision isolated Kubernetes control planes (e.g. API servers) that run on top of "real" Kubernetes clusters. When working with the virtual cluster's API server, resources first only exist in the virtual cluster. However, some low-level Kubernetes resources need to be synchronized to the underlying cluster.
Generally, all Kubernetes resource objects that you create using the vcluster API server are stored in the data store of the vcluster (sqlite by default, see external datastore for more options). That applies in particular to all higher level Kubernetes resources, such as Deployments, StatefulSets, CRDs, etc. These objects only exist inside the virtual cluster and never reach the API server or data store (etcd) of the underlying host cluster.
To be able to actually start containers, the vcluster synchronizes certain low-level resources (e.g. Pods, ConfigMaps mounted in Pods) to the underlying host namespace, so that the scheduler of the underlying host cluster can schedule these pods.
vcluster has been designed following these principles:
vclusters should be as lightweight as possible to minimize resource overhead inside the underlying host cluster.
Implementation: This is mainly achieved by bundling the vcluster inside a single Pod using k3s as a control plane.
Workloads running inside a vcluster (even inside nested vclusters) should run with the same performance as workloads which are running directly on the underlying host cluster. The computing power, the access to underlying persistent storage as well as the network performance should not be degraded at all.
Implementation: This is mainly achieved by synchonizing pods which means that the pods are actually being scheduled and started just like regular pods of the underlying host cluster, i.e. if you run a pod inside the vcluster and you run the same pod directly on the host cluster will be exactly the same in terms of computing power, storage access and networking.
vclusters should greatly reduce the number of requests to the Kubernetes API server of the underlying host cluster by ensuring that all high-level resources remain in the virtual cluster only without ever reaching the underlying host cluster.
Implementation: This is mainly achieved by using a separate API server which handles all requests to the vcluster and a separate data store which stores all objects inside the vcluster. Only the syncer synchronizes very few low-level resources to the underlying cluster which requires very few API server requests. All of this happens in an asynchronous, non-blocking fashion (as pretty much everything in Kubernetes is desgined to be).
vcluster should not make any assumptions about how it is being provisioned. Users should be able to create vclusters on top of any Kubernetes cluster without requiring the installation of any server-side component to provision the vclusters, i.e. provisioning should be possible with any client-only deployment tool (vcluster CLI, helm, kubectl, kustomize, ...). An operator or CRDs may be added to manage vclusters (e.g. using Argo to provision vclusters) but a server-side management plane should never be required for spinning up a vcluster.
Implementation: This is mainly achieved by making vcluster basically run as a simple StatefulSet + Service (see kubectl deployment method for details) which can be deployed using any Kubernetes tool.
To provision a vcluster, a user should never be required to have any cluster-wide permissions. If a user has the RBAC permissions to deploy a simple web application to a namespace, they should also be able to deploy vclusters to this namespace.
Implementation: This is mainly achieved by making vcluster basically run as a simple StatefulSet + Service (see kubectl deployment method for details) which typically every user has the privilege to run if they have any Kubernetes access at all.
Each vcluster and all the workloads and data inside the vcluster should be encapsulated into a single namespace. Even if the vcluster has hundreds of namespaces, in the underlying host cluster, everything will be encapsulated into a single host namespace.
Implementation: This is mainly achieved by using a separate API server and data store and by the design of the syncer which synchronizes everything to a single underlying host namespace while renaming resources during the sync to prevent naming conflicts when mapping from multiple namespaces inside the vcluster to a single namespace in the host cluster.
vclusters should not have any hard wiring with the underlying cluster. Deleting a vcluster or merely deleting the vcluster's host namespace should always be possible without any negative impacts on the underlying cluster (no namespaces stuck in terminating state or anything comparable) and should always guarantee that all vcluster-related resources are being deleted cleanly and immediately without leaving any orphan resources behind.
Implementation: This is mainly achieved by not adding any control plane or server-side elements to the provisioning of vclusters. A vcluster is just a StatefulSet and few other Kubernetes resources. All synchronized resources in the host namespace have an appropriate owner reference, that means if you delete the vcluster itself, everything that belongs to the vcluster will be automatically deleted by Kubernetes as well (this is a similar mechanism as Deployments and StatefulSets use to clean up their Pods).