Tech Blog by vCluster

vMetal Deep Dive: How AI Clouds Turn Bare Metal GPUs Into a Programmable Platform

May 28, 2026

vMetal's official pitch in one line: "Run your GPU data center like a hyperscaler."

I made a long-form video walking through what's actually behind that claim. The architecture, the YAML, the network model, the demo. This is the written companion.

📺 Watch the full video walkthrough: vMetal Deep Dive on YouTube

I went into the actual working repos while writing this. The loft-sh/vcluster-bare-metal-with-kubevirt repo gives you a fully self-contained local demo using KubeVirt VMs as fake bare metal, and loft-sh/vcluster-docs has the source of truth for the docs. Everything in this post is grounded in real YAML you can apply, not slideware.

Plain-English Glossary (skip if you live in this stuff)

Some terms come up a lot in this post. If any of them feel unfamiliar, here they are in everyday language.

Term	What it actually means
BMC (Baseboard Management Controller)	A tiny separate computer inside every server with its own network port. It can power the server on/off and install an OS even when the main machine is off. Think of it as the server's remote control.
Redfish / IPMI	The two common languages BMCs speak. Redfish is the modern one, IPMI is the legacy one.
PXE boot	A way for a server to get its operating system over the network instead of from local disk. The server says "I'm new here, give me an OS," and a server on the network hands it one.
Cloud-init	A tiny script that runs the first time a freshly installed server boots. Sets the hostname, configures the network, joins clusters, etc.
Metal3 / Ironic	Open-source projects vMetal is built on. Metal3 represents physical servers as Kubernetes objects; Ironic is the engine that talks to BMCs and PXE-boots them.
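CR / CRD	A CR ("Custom Resource") is a YAML object Kubernetes manages alongside Pods, Deployments, etc. A CRD ("Custom Resource Definition") is the schema that teaches Kubernetes about a new kind of CR. A BareMetalHost is a CR that represents one physical server.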
Control Plane Cluster	The Kubernetes cluster where vMetal itself runs. It's the brain.
Tenant Cluster	What vMetal hands to each customer or team. Their own isolated Kubernetes cluster with their own GPU nodes.
VLAN / VXLAN	Two ways to slice one physical network into many isolated virtual ones. Think floor plans for the same building.
Multus	A Kubernetes plugin that lets a pod be on more than one network at the same time. vMetal uses it so its DHCP pod can sit on the bare metal network.
NVLink / InfiniBand	Super-fast cables between GPUs (NVLink, inside one server) and between servers (InfiniBand). What makes large training runs go fast.

The Platform Problem

Buying GPUs is the easy part. Then you need provisioning, OS lifecycle, tenant isolation, networking, DNS, and scheduling. And you need it to be self-service so customers or internal teams don't file tickets for every notebook.

Building this internally typically takes 6 to 12 months and a serious team. Meanwhile the GPUs depreciate. vMetal's pitch: turn the racks into a compute platform without writing the platform.

What vMetal Actually Is

In the simplest possible terms: vMetal lets you treat a rack of physical GPU servers like a cloud. You point it at your hardware once, and from then on you can hand a fresh server to a customer or team in seconds and reclaim it when they're done. All from Kubernetes-native YAML or a UI.

The docs put it more formally:

"vMetal is the bare metal layer of the vCluster Platform. It builds on Metal3 and Ironic to handle BMC communication, PXE boot, OS installation, and server cleaning."

There's no hypervisor in the way. Workloads get direct access to GPUs, NVLink fabric, and InfiniBand. The hardware behaves the way the manufacturer intended. vMetal manages the physical machines themselves: registering them, installing an OS over the network, joining them to a Tenant Cluster, and cleaning them up when the tenant is done.

Built on:

Metal3. Exposes each physical server as a BareMetalHost custom resource.
Ironic. The engine that talks to BMCs (Redfish or IPMI), drives power, PXE, and image writing.

How vMetal "Detects" Servers

A common question: does vMetal scan the network and auto-discover servers? Today, the model is declarative. You register each server once, and from then on the platform handles everything (registering, inspecting, claiming, provisioning, deprovisioning, reuse). More automated discovery is on the roadmap, so you won't have to declare individual hosts in future versions.

The trigger for a server to enter the system is a BareMetalHost CR pointing at the BMC. You create:

A Secret with BMC username/password
A BareMetalHost CR with the BMC URL plus the boot MAC address

Once those exist, Metal3 and Ironic do the rest automatically. The host goes through Registering (verify BMC creds), then Inspecting (auto-collect hardware inventory: CPU, RAM, NICs, disks, firmware, GPUs, PCIe), and ends up Available.

Real example from the working kubevirt demo repo:

apiVersion: v1
kind: Secret
metadata:
  name: server-01-bmc
  namespace: metal3-system
type: Opaque
stringData:
  username: admin
  password: <BMC-PASSWORD>
---
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: server-01
  namespace: metal3-system
  labels:
    role: compute
spec:
  bmc:
    address: redfish://192.168.1.100
    credentialsName: server-01-bmc
  bootMACAddress: "aa:bb:cc:dd:ee:01"
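Typos in the BMC address or boot MAC are the easiest way to get a host stuck before it ever reaches Available, so a quick pre-flight check can help. Here's a minimal sketch in Python. The function name and the accepted scheme list are my own assumptions (I only allow the two protocols this post mentions), not anything vMetal or Metal3 ships.

```python
import re
from urllib.parse import urlparse

# Assumption: only the two BMC protocols named in this post are accepted here.
ASSUMED_BMC_SCHEMES = {"redfish", "ipmi"}
MAC_RE = re.compile(r"^([0-9a-f]{2}:){5}[0-9a-f]{2}$", re.IGNORECASE)


def check_host_entry(bmc_address: str, boot_mac: str) -> list:
    """Return a list of human-readable problems; an empty list means it looks sane."""
    problems = []
    parsed = urlparse(bmc_address)
    if parsed.scheme not in ASSUMED_BMC_SCHEMES:
        problems.append(f"unexpected BMC scheme: {parsed.scheme or '(none)'}")
    if not parsed.hostname:
        problems.append("BMC address has no host part")
    if not MAC_RE.match(boot_mac):
        problems.append(f"boot MAC is not six colon-separated hex bytes: {boot_mac}")
    return problems


# The values from the example above pass; a bare IP and a short MAC do not.
print(check_host_entry("redfish://192.168.1.100", "aa:bb:cc:dd:ee:01"))
print(check_host_entry("192.168.1.100", "aa:bb:cc:dd:ee"))
```

This only checks the shape of the values. Whether the BMC actually answers with those credentials is what the Registering state is for.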

Adding lots of servers at once

If you're racking a fleet, not just one machine, there's a bulk registration path. You concatenate BareMetalHost and Secret resources into a single YAML file (one document per server, separated by ---) and apply it. The docs show this exact pattern:

---
apiVersion: v1
kind: Secret
metadata:
  name: server-01-bmc
  namespace: metal3-system
stringData:
  username: admin
  password: <BMC-PASSWORD>
---
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: server-01
  namespace: metal3-system
  labels:
    role: compute
    rack: rack-a
spec:
  bmc:
    address: redfish://192.168.1.100
    credentialsName: server-01-bmc
  bootMACAddress: "aa:bb:cc:dd:ee:01"
---
# server-02, server-03, ... in the same file

kubectl apply -f servers.yaml

You get parallel registration plus inspection across the whole batch. Combine this with rack/role labels and your hardware inventory becomes a single GitOps-managed manifest.
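If the fleet lives in a spreadsheet or a CMDB export, you probably don't want to hand-write that file. Below is a rough sketch of generating it from an inventory list with plain string templating. The inventory entries (including everything about server-02) and the function name are hypothetical, and the template copies the field layout from the bulk example above.

```python
# Template mirrors the Secret + BareMetalHost pair shown in the bulk example.
SECRET_AND_HOST = """\
---
apiVersion: v1
kind: Secret
metadata:
  name: {name}-bmc
  namespace: metal3-system
stringData:
  username: admin
  password: <BMC-PASSWORD>
---
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: {name}
  namespace: metal3-system
  labels:
    role: compute
    rack: {rack}
spec:
  bmc:
    address: redfish://{bmc_ip}
    credentialsName: {name}-bmc
  bootMACAddress: "{mac}"
"""

# Hypothetical inventory; in practice this might come from a CSV export.
INVENTORY = [
    {"name": "server-01", "rack": "rack-a", "bmc_ip": "192.168.1.100", "mac": "aa:bb:cc:dd:ee:01"},
    {"name": "server-02", "rack": "rack-a", "bmc_ip": "192.168.1.101", "mac": "aa:bb:cc:dd:ee:02"},
]


def render_manifest(inventory):
    """Concatenate one Secret + BareMetalHost pair per server into one YAML stream."""
    return "".join(SECRET_AND_HOST.format(**server) for server in inventory)


manifest = render_manifest(INVENTORY)
print(manifest)
```

Redirect the output into servers.yaml and the kubectl apply step stays exactly the same. The password placeholder is left as-is on purpose; real credentials should come from your secret store, not from a script.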

What if my team doesn't know YAML?

The Platform UI has a form under Bare Metal Servers. You click Add, fill in the BMC address, credentials, and boot MAC. The platform writes the CR for you. In practice this is operator-side work anyway. The data scientists who consume the GPUs never see this layer.

The Network Model: the Other Question Everyone Asks

"Do bare metal servers need to be on the same network as the Control Plane Cluster?"

No. Bare metal servers can sit on completely different networks than the Control Plane Cluster. What you do need is two things wired up. First, the Control Plane Cluster has to be able to reach the BMCs (so it can power servers on and trigger installs). Second, the cluster needs one network attachment into the bare metal provisioning network, either a bridge or a macvlan interface (so the install actually happens). The next two subsections explain each in plain terms.

Path 1: Ironic to BMC (control plane traffic)

Ironic runs inside the Control Plane Cluster. It must have IP reachability to each BMC (the Redfish/IPMI endpoint). Same L2 is not required, same IP range is not required. The docs are explicit:

"Ironic must have network access to the BMC addresses of the bare metal servers."

If your BMCs are on 10.10.0.0/24 and your Control Plane Cluster pods are on 10.244.0.0/16, that's fine, as long as routing exists.

Path 2: DHCP/PXE to bare metal NICs (provisioning traffic)

This is the part that needs explicit wiring. The DHCP/PXE proxy pod runs in the Control Plane Cluster and is attached via Multus to the provisioning network. The docs describe two modes:

Bridge mode. Control Plane Cluster nodes have a bridge (e.g. br0) attached to the provisioning network. The DHCP pod attaches through that bridge.

deploy:
  dhcp:
    enabled: true
    helmValues: |
      networkAttachmentDefinition:
        vip: 192.168.100.2/24
        config: |
          {
            "cniVersion": "0.3.1",
            "type": "bridge",
            "bridge": "br0",
            "isDefaultGateway": false
          }

Macvlan mode. Used when "the bare metal servers are on the same network as the Control Plane Cluster nodes." The DHCP pod gets a macvlan interface on eth0.

deploy:
  dhcp:
    enabled: true
    helmValues: |
      networkAttachmentDefinition:
        vip: 10.0.0.2/24
        config: |
          {
            "cniVersion": "0.3.1",
            "type": "macvlan",
            "master": "eth0",
            "mode": "bridge"
          }
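To make the choice between the two modes concrete, here's a small sketch that picks a mode and renders the matching CNI JSON. The selection rule (macvlan when a cluster node's IP is already inside the provisioning subnet, bridge otherwise) is my simplification of the docs' wording, and both function names are hypothetical. The JSON fields are copied from the two snippets above.

```python
import ipaddress
import json


def pick_mode(node_ip: str, provisioning_cidr: str) -> str:
    """Heuristic (assumption): macvlan if the node already sits on the provisioning network."""
    network = ipaddress.ip_network(provisioning_cidr)
    return "macvlan" if ipaddress.ip_address(node_ip) in network else "bridge"


def cni_config(mode: str, interface: str) -> str:
    """Render the CNI JSON for the chosen mode, using the fields from the two snippets."""
    if mode == "bridge":
        config = {
            "cniVersion": "0.3.1",
            "type": "bridge",
            "bridge": interface,
            "isDefaultGateway": False,
        }
    elif mode == "macvlan":
        config = {
            "cniVersion": "0.3.1",
            "type": "macvlan",
            "master": interface,
            "mode": "bridge",
        }
    else:
        raise ValueError(f"unknown mode: {mode}")
    return json.dumps(config, indent=2)


# A node at 10.0.0.15 shares the 10.0.0.0/24 network, so macvlan on eth0 fits.
mode = pick_mode("10.0.0.15", "10.0.0.0/24")
print(mode)
print(cni_config(mode, "eth0"))
```

Treat the heuristic as a starting point for a conversation with your network team, not a rule. Plenty of real topologies (tagged VLANs, bonded NICs) won't reduce to a single subnet check.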

In the kubevirt demo, the entire provisioning network is 192.168.100.0/24 on a bridge br0 set up by a DaemonSet. The KubeVirt "fake" bare metal VMs live in 192.168.100.10–20, the bridge IP is 192.168.100.1, and the DHCP pod gets 192.168.100.4. Bridge mode, end to end.

So: bare metal servers don't need to share IP range with the Control Plane Cluster. What they need is a wire from the cluster nodes into their provisioning network (bridge or shared L2 via macvlan), plus IP routability from Ironic to the BMCs.
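Address plans for the provisioning network are easy to get subtly wrong (a DHCP VIP inside the host range, a bridge IP in the wrong subnet). Here's a small sanity check using Python's ipaddress module, fed with the kubevirt demo values described above. The function name and the specific checks are my own, not something the platform runs.

```python
import ipaddress


def check_plan(cidr, bridge_ip, dhcp_vip, range_start, range_end):
    """Return a list of problems with a provisioning-network address plan."""
    net = ipaddress.ip_network(cidr)
    bridge = ipaddress.ip_address(bridge_ip)
    vip = ipaddress.ip_interface(dhcp_vip)  # e.g. "192.168.100.4/24"
    start = ipaddress.ip_address(range_start)
    end = ipaddress.ip_address(range_end)

    problems = []
    named = (("bridge", bridge), ("DHCP VIP", vip.ip),
             ("range start", start), ("range end", end))
    for label, addr in named:
        if addr not in net:
            problems.append(f"{label} {addr} is outside {net}")
    if vip.network != net:
        problems.append(f"DHCP VIP prefix {vip.network} does not match {net}")
    if start > end:
        problems.append("host range start is after its end")
    for label, addr in (("bridge", bridge), ("DHCP VIP", vip.ip)):
        if start <= addr <= end:
            problems.append(f"{label} {addr} collides with the host range")
    return problems


# Values from the kubevirt demo: bridge .1, DHCP pod .4, VMs in .10 to .20.
print(check_plan("192.168.100.0/24", "192.168.100.1",
                 "192.168.100.4/24", "192.168.100.10", "192.168.100.20"))
```

An empty list means the plan is internally consistent. It says nothing about whether the bridge actually exists on the nodes.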

What vMetal Deploys for You

A NodeProvider of type Metal3 can deploy three components into the Control Plane Cluster, each individually toggleable. From the actual node-provider.yaml in the kubevirt demo:

apiVersion: storage.loft.sh/v1
kind: NodeProvider
metadata:
  name: metal3
spec:
  displayName: "Metal3 Bare Metal Hosts"
  metal3:
    clusterRef:
      cluster: loft-cluster
      namespace: default
    deploy:
      multus:
        enabled: true
      metal3:
        enabled: true
      dhcp:
        enabled: true
        helmValues: |
          networkAttachmentDefinition:
            vip: 192.168.100.4/24
    nodeTypes:
      - name: vm

The three components:

Component	Role
Metal3 + Ironic	Reconciles BareMetalHost CRs. Talks to BMCs. Drives power, PXE, OS image writes.
DHCP Proxy	Handles PXE boot. Acts as a proxy between bare metal servers and Ironic when they're on different networks.
Multus CNI	Lets the DHCP pod attach to the provisioning network (separate from the cluster pod network).

If you already run any of these (you have a Metal3 install, your own DHCP, your own Multus) you disable the corresponding deploy.*.enabled field and bring your own.

The configuration surface itself comes down to three Kubernetes resources working together:

NodeProvider. Points at the Control Plane Cluster and toggles what gets deployed.
BareMetalHost plus Secret. Each pair represents one physical server.
NodeType. Defines a hardware profile (CPU, memory, GPU count) and a label selector that matches BareMetalHost resources.

When a workload needs a GPU server, vMetal finds an available host with matching labels. There's also a built-in cost calculation that picks the cheapest matching node type when multiple could fulfill a request.
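To show how label matching and cheapest-first selection fit together, here's a toy version in Python. I haven't seen how vMetal actually computes cost, so the cost numbers, class names, GPU counts, and label keys below are all hypothetical. The sketch only illustrates the idea: filter by selector and availability, then take the cheapest node type that satisfies the request.

```python
from dataclasses import dataclass


@dataclass
class NodeType:
    name: str
    selector: dict  # labels a BareMetalHost must carry to match
    gpus: int
    cost: float     # hypothetical relative cost


@dataclass
class Host:
    name: str
    labels: dict
    state: str


def matches(selector: dict, labels: dict) -> bool:
    """A host matches when it carries every key/value pair in the selector."""
    return all(labels.get(key) == value for key, value in selector.items())


def pick(node_types, hosts, gpus_needed):
    """Return (node type name, host name) for the cheapest fit, or None."""
    candidates = []
    for node_type in node_types:
        if node_type.gpus < gpus_needed:
            continue
        for host in hosts:
            if host.state == "Available" and matches(node_type.selector, host.labels):
                candidates.append((node_type.cost, node_type.name, host.name))
                break  # one available host per node type is enough
    if not candidates:
        return None
    _, type_name, host_name = min(candidates)
    return type_name, host_name


NODE_TYPES = [
    NodeType("dgx-8gpu", {"role": "compute", "gpu": "h100"}, gpus=8, cost=8.0),
    NodeType("l40-4gpu", {"gpu": "l40"}, gpus=4, cost=3.0),
]
HOSTS = [
    Host("server-01", {"role": "compute", "gpu": "h100"}, "Available"),
    Host("server-02", {"gpu": "l40"}, "Provisioned"),
    Host("server-03", {"gpu": "l40"}, "Available"),
]

print(pick(NODE_TYPES, HOSTS, gpus_needed=4))
print(pick(NODE_TYPES, HOSTS, gpus_needed=8))
```

A 4-GPU request lands on the cheaper node type even though the 8-GPU host would also fit, which is the behavior the cost calculation is meant to give you.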

The Lifecycle

Every BareMetalHost moves through this state machine:

State	What's happening
Registering	Verify BMC creds. Can the system actually talk to this server?
Inspecting	Auto-collect hardware inventory. CPU, RAM, NICs, disks, firmware, GPUs, PCIe.
Available	In the pool. Waiting to be claimed.
Provisioning	OS image writing via PXE; cloud-init staged.
Provisioned	OS running. If targeting a Tenant Cluster, already joined.
Deprovisioning	Cleanup. Returned to the pool.
Error	Any state can transition here. Debug like any Kubernetes operator.

When a Machine (the platform's claim) is deleted, vMetal restores the BareMetalHost to its original state and it becomes Available again. Same server, next tenant.
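If it helps to see the lifecycle as code, here's the table above written as a tiny state machine. This is my simplified reading of the table, not Metal3's actual implementation. In particular, how a host leaves Error isn't covered in this post, so the sketch leaves that transition out.

```python
from enum import Enum


class State(Enum):
    REGISTERING = "Registering"
    INSPECTING = "Inspecting"
    AVAILABLE = "Available"
    PROVISIONING = "Provisioning"
    PROVISIONED = "Provisioned"
    DEPROVISIONING = "Deprovisioning"
    ERROR = "Error"


# Forward path taken from the table; Deprovisioning loops back to Available.
HAPPY_PATH = {
    State.REGISTERING: {State.INSPECTING},
    State.INSPECTING: {State.AVAILABLE},
    State.AVAILABLE: {State.PROVISIONING},
    State.PROVISIONING: {State.PROVISIONED},
    State.PROVISIONED: {State.DEPROVISIONING},
    State.DEPROVISIONING: {State.AVAILABLE},
}


def can_transition(current: State, target: State) -> bool:
    """Any non-error state may fail into Error; otherwise follow the happy path."""
    if target is State.ERROR:
        return current is not State.ERROR
    return target in HAPPY_PATH.get(current, set())


# Walk one full tenant cycle: claim, use, release, back in the pool.
cycle = [State.AVAILABLE, State.PROVISIONING, State.PROVISIONED,
         State.DEPROVISIONING, State.AVAILABLE]
print(all(can_transition(a, b) for a, b in zip(cycle, cycle[1:])))
```

The loop from Deprovisioning back to Available is the important part: it's what makes "same server, next tenant" work.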

End-to-End Path: From Tenant Request to Running Pod

When a Tenant Cluster requests a bare metal node:

Selection. The provider picks an Available BareMetalHost matching the node type's label selector and resources.
Configuration. Cloud-init user data is generated, stored as a Secret on the Control Plane Cluster.
Setup. The BMH is patched with image reference plus userData Secret reference. This is the declarative trigger.
Installation. Ironic powers the server on via BMC and sets boot to PXE. IPA (Ironic Python Agent) then writes the OS to disk.
Boot. The server reboots from disk into the new OS, cloud-init runs.
Integration. For vCluster private nodes, cloud-init includes the join command. The node automatically registers with the Tenant Cluster.

No manual kubeadm join. No manual switch port flipping. No manual DNS update.
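The Configuration and Setup steps are the declarative heart of this flow, so here's a sketch of what they could produce. The spec.image, spec.userData, and spec.online fields are standard Metal3 BareMetalHost fields, but whether vMetal sets exactly these is my assumption. The image URL, Secret name, and join command placeholder are hypothetical.

```python
import json


def build_user_data(join_command: str) -> str:
    """Minimal cloud-config that runs one command on first boot."""
    # json.dumps gives a double-quoted string, which is also a valid YAML scalar.
    return "#cloud-config\nruncmd:\n  - " + json.dumps(join_command) + "\n"


def build_bmh_patch(image_url: str, checksum_url: str,
                    secret_name: str, namespace: str) -> dict:
    """Merge-patch body pointing a BareMetalHost at an image and a userData Secret."""
    return {
        "spec": {
            "online": True,
            "image": {"url": image_url, "checksum": checksum_url},
            "userData": {"name": secret_name, "namespace": namespace},
        }
    }


# Hypothetical values for illustration only.
user_data = build_user_data("<JOIN-COMMAND>")
patch = build_bmh_patch(
    image_url="http://images.example.internal/ubuntu-24.04.qcow2",
    checksum_url="http://images.example.internal/ubuntu-24.04.qcow2.sha256sum",
    secret_name="server-01-user-data",
    namespace="metal3-system",
)

print(user_data)
print(json.dumps(patch, indent=2))
```

Nothing here touches the server directly. The patch only changes desired state, and Ironic reacts to it, which is why the post calls this step the declarative trigger.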

You Can Run This Locally Today

The thing that surprised me most: there's a fully working local replica using KubeVirt VMs as fake bare metal servers. The repo loft-sh/vcluster-bare-metal-with-kubevirt has a Makefile that walks you through the whole thing without owning any DGX hardware.

Prerequisites: Docker, vcluster CLI, kubectl, helm, a host with KVM (~16GB RAM, 4+ CPU).

# Create a vcluster-in-docker host cluster
make vind-up

# Install everything (cert-manager, KubeVirt, br0 bridge,
# vCluster Platform, Metal3 NodeProvider, DHCP, Multus)
make install

# Boot KubeVirt VMs that pretend to be bare metal servers
# Each VM has a Redfish BMC shim (virtbmc) the platform can talk to
make create-vms

# Now create a vCluster that auto-claims those "BMHs" as private nodes
make create-vcluster

Behind the scenes:

A Linux bridge (br0, 192.168.100.0/24) on the host acts as the shared provisioning network
A NodeProvider deploys Multus, Metal3 plus Ironic, and the DHCP server into the host cluster
A NodeEnvironment provides the IP range (192.168.100.10–20), gateway, and DNS for the network
An Ubuntu 24.04 OSImage and a static SSHKey are referenced
BareMetalHost resources point at each VM's virtbmc Redfish endpoint

make create-vcluster then creates a VirtualClusterInstance that requests a node from the metal3 provider. A NodeClaim is created, a BMH is selected and provisioned (Ironic writes the image), the VM boots, cloud-init joins the Tenant Cluster, and kubectl get nodes against the Tenant Cluster shows the new node.
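While the demo runs, it's handy to see how many hosts sit in each state. The sketch below summarizes the JSON you'd get from something like kubectl get baremetalhosts -n metal3-system -o json. The status.provisioning.state field is part of the Metal3 BareMetalHost status; the namespace is taken from the YAML earlier in this post and may differ in the demo; the sample payload is made up for illustration.

```python
import json
from collections import Counter


def summarize_states(kubectl_json: str) -> Counter:
    """Count BareMetalHosts per provisioning state from a kubectl JSON list."""
    items = json.loads(kubectl_json).get("items", [])
    return Counter(
        item.get("status", {}).get("provisioning", {}).get("state") or "unknown"
        for item in items
    )


# Made-up sample in the shape of a kubectl list response.
SAMPLE = json.dumps({
    "items": [
        {"metadata": {"name": "vm-01"}, "status": {"provisioning": {"state": "provisioned"}}},
        {"metadata": {"name": "vm-02"}, "status": {"provisioning": {"state": "available"}}},
        {"metadata": {"name": "vm-03"}, "status": {}},
    ]
})

for state, count in sorted(summarize_states(SAMPLE).items()):
    print(f"{state}: {count}")
```

In a real run you'd pipe the kubectl output into the function instead of using SAMPLE. Hosts that haven't reported a state yet show up as unknown.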

This is the cheapest way I've seen to learn how this stack actually behaves end-to-end.

Tenant Isolation: Three Layers, Real Boundaries

The isolation model has three distinct layers and they all matter:

Network isolation. Each tenant gets its own VLAN/VXLAN. When a node is claimed, vMetal coordinates with the network controller (Netris in the GTC demo, but the integration is pluggable) and the server manager (BCM, NVIDIA Base Command Manager) to physically move the node into the tenant's network. Switch ports get reconfigured. NVLink fabric and InfiniBand get reconfigured. DNS gets updated.
Cluster isolation. Each tenant runs in their own Tenant Cluster (via vCluster) with its own Virtual Control Plane (API server, scheduler, controller manager, resource view).
Runtime isolation. vNode adds boundary enforcement when workloads share physical nodes (less relevant for dedicated bare metal, more relevant for shared-host scenarios).

The hot standby trick: PXE boots are slow. So vMetal keeps DGX nodes pre-provisioned in a "management" pool. Claim time is then a network move plus cluster join, not a full reinstall. Seconds, not minutes.
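The hot standby idea boils down to a preference order at claim time. Here's a toy model of that logic. The class, the node names, and the two-pool split are hypothetical, and I've left out timings on purpose since they depend entirely on your hardware.

```python
from collections import deque


class StandbyPool:
    """Toy model: prefer pre-provisioned (warm) nodes, fall back to cold ones."""

    def __init__(self, warm, cold):
        self.warm = deque(warm)  # already imaged, waiting in the management pool
        self.cold = deque(cold)  # registered but not yet provisioned

    def claim(self):
        """Return (node, path), where path names the work needed at claim time."""
        if self.warm:
            return self.warm.popleft(), "network move + cluster join"
        if self.cold:
            return self.cold.popleft(), "full PXE provision"
        return None, "no capacity"


pool = StandbyPool(warm=["dgx-01"], cold=["dgx-02"])
print(pool.claim())
print(pool.claim())
print(pool.claim())
```

The operational question this raises is how large to keep the warm pool. Too small and tenants hit the slow path; too large and you're holding provisioned capacity that nobody is using.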

The Demo, Compressed

In the video I walk through the GTC demo. Here's the punch line.

A data scientist opens Run:ai, targets Tenant Cluster #1, and clicks Create Jupyter Notebook. That's the user-side action.

Under the hood:

Run:ai schedules the workload. Pending, no GPU node available in the tenant's cluster.
The dynamic node pool kicks in; vMetal sees the demand.
vMetal picks an Available DGX from the management pool.
vMetal coordinates with Netris and moves the node's switch ports into Tenant 1's VLAN.
vMetal coordinates with BCM and updates the node's network assignment.
The node joins Tenant Cluster #1.
Run:ai sees the new node, schedules the Jupyter pod, containers start.

You can watch this happen live in the Netris UI. Three DGX nodes start in management. After the click, DGX-01 visibly moves into Tenant 1's network. BCM confirms the same. When the tenant releases the node, vMetal reverses everything.

Things to Know Going In

A few practical notes so you can plan your rollout:

Servers need a BMC. Redfish or IPMI. Pretty much standard on any modern server-class hardware (Dell iDRAC, HPE iLO, Supermicro IPMI, NVIDIA DGX, etc.).
The first PXE boot of a fresh server takes minutes, not seconds. That's just how PXE works. vMetal's hot standby model handles this elegantly. Keep a warm pool of pre-provisioned nodes and tenant claims become near-instant network moves.
Network plumbing is upfront work. A bridge or macvlan into the provisioning network needs to be set up once. Your network team likely already does this for any bare metal automation; vMetal just needs a leg into it.
OS images must be HTTP-accessible. Local or authenticated image sources aren't supported directly, so plan to host your images on a reachable HTTP endpoint.
You need a vCluster Platform license that includes vMetal. The Control Plane Cluster must be connected to the platform.
Most managed Kubernetes services work fine as Control Plane Clusters. GKE Standard, EKS, and AKS managed node groups are all supported. But GKE Autopilot, EKS Auto Mode, and EKS Fargate are not supported because of restrictions on privileged workloads. If you go managed, also make sure you can route from the cluster's VPC to the BMC network (VPC peering or VPN, typically).

Who This Is Actually For

AI Clouds. The math is brutal. Every month spent building this internally is a month of GPU depreciation with no revenue. vMetal compresses time-to-launch.
Enterprise AI factories. Same self-service experience as cloud, on hardware you fully own.
Sovereign cloud providers. Full data residency, no public cloud dependency.

If you're a single team with one rack and one workload, you don't need this. If you have multiple tenants, multiple workload types, or multiple teams sharing hardware, this is the platform layer that solves it.

Wrap-Up

vMetal turns physical GPU servers into a programmable, tenant-isolated, cloud-like platform without a hypervisor. Built on Metal3 plus Ironic. Tenant Clusters via vCluster. Hot standby keeps claim times in seconds. Everything is Kubernetes-native YAML, GitOps-friendly, and there's a UI for the YAML-averse.

The thing that pushed me from "interesting" to "actually convinced" was the kubevirt repo. You can run the entire stack locally on a beefy laptop, watch a BareMetalHost go from Registering to Provisioned, and see a Tenant Cluster auto-claim it as a private node. If you're evaluating vMetal, start there before scheduling a vendor call.

📺 Watch the full video walkthrough: vMetal Deep Dive on YouTube

Sources & Working Code

vMetal product page: https://vmetal.ai/
vMetal docs: https://vmetal.ai/docs/
Metal3 NodeProvider docs (bridge vs macvlan, Helm values): https://www.vcluster.com/docs/platform/administer/node-providers/metal3
Bare Metal overview docs (UI path, BMH fields, lifecycle): https://www.vcluster.com/docs/platform/administer/bare-metal/overview
vMetal Limitations: https://vmetal.ai/docs/limitations/
Working local demo with KubeVirt fake bare metal: https://github.com/loft-sh/vcluster-bare-metal-with-kubevirt
vCluster docs source repo: https://github.com/loft-sh/vcluster-docs
Companion video walkthrough: https://www.youtube.com/watch?v=qVNNYHfc8jY
