Tech Blog by vCluster

5 Ways to Run Slurm on Kubernetes for HPC and AI Workloads

Jun 18, 2026

Summary

Combining Slurm's advanced job scheduling with Kubernetes' dynamic resource management is essential for modern AI, but the integration is complex and presents significant operational challenges.
This article compares five distinct patterns for running Slurm on Kubernetes—from production-ready stacks to DIY operators—highlighting their trade-offs in setup time, isolation, and operational burden.
A critical differentiator is tenant isolation. Most patterns rely on basic namespace separation, which is inadequate for multi-tenant AI clouds that must keep one tenant from interfering with a shared control plane.
For organizations needing production-ready, secure tenant isolation without months of engineering, the best approach combines a pre-validated Slurm stack with strong virtual control plane isolation, a core feature of vCluster.

"Why do I need Slurm on K8s? It's like using orchestration on orchestration." If you've spent any time in HPC forums recently, you've seen this sentiment — and honestly, it's a fair question to ask. Traditional HPC teams have run Slurm reliably for years. Adding Kubernetes into the mix feels like solving a problem you didn't know you had.

But the reality of modern AI infrastructure is shifting that calculus fast. Large-scale GPU training runs, AI clouds with tenant isolation, and the explosion of vLLM-based inference workloads are pushing organizations to want both: Slurm's battle-tested job queue and priority scheduling and Kubernetes' dynamic resource management, containerization, and scalability. The challenge is that bridging these two worlds is genuinely hard. Engineers have reported "many sleepless nights" debugging GPU operator memory leaks and Network operator leasing issues at scale — and that's before you get to the tenant isolation problem.

There's also a lingering concern worth addressing upfront: NVIDIA's acquisition of SchedMD spooked parts of the HPC community. The good news is that NVIDIA has publicly stated that Slinky and Slurm will remain open source under their original licenses, and they've historically been strong contributors to these ecosystems.

So: if you're going to run Slurm on Kubernetes, how should you actually do it? Here are five distinct patterns — from GitOps-driven production deployments to DIY builds — with honest trade-offs for each.

1. Bare-Metal-First with vMetal and the Slinky Certified Stack

Best for: AI cloud providers and enterprises that need production-ready Slurm-on-Kubernetes with tenant isolation now, without months of engineering.

This pattern starts from the bottom up: bare metal servers running vCluster Standalone, a certified Kubernetes distribution that ships as a single binary and runs directly on Linux, with no k3s, no kubeadm, and no host Kubernetes layer required. On top of that foundation, you deploy a vCluster Certified Stack, which includes a pre-integrated, pre-validated Slurm-on-Kubernetes environment via Slinky.

What makes this approach uniquely powerful is what happens at the tenant layer. Rather than isolating workloads with namespaces (which share a blast radius on the control plane), vCluster provisions fully isolated virtual control planes per tenant — each with its own API server, etcd, and RBAC. Tenants get cluster-admin privileges inside their environment without any risk of interfering with neighboring workloads or the host infrastructure. It's the only approach in this list that combines a pre-validated Slurm environment with genuine per-tenant control plane isolation.

For MPI jobs and GPU training workloads, performance matters as much as isolation. Because vCluster Standalone runs directly on the metal with no hypervisor overhead, GPU throughput is uncompromised. RDMA support and high-speed interconnects behave exactly as they would on a vanilla bare-metal cluster. You can extend the stack further with vMetal for zero-touch provisioning (PXE boot, OS install, automated node registration), turning hardware delivery into a repeatable, automated pipeline.

vCluster powers more than 100K GPU nodes in production across 50+ GPU clouds and Fortune 500 customers. Lintasarta, for example, launched Indonesia's leading GPU cloud in 90 days with 170+ tenant clusters running on this foundation.

Trade-off	Rating
Setup Time	Low (software) / Moderate (hardware)
Isolation Strength	★★★★★ High — virtual control planes per tenant
GPU Performance Overhead	Negligible — direct bare-metal access
Operational Burden	Low — pre-validated stack, central UI/CLI/API

2. SUNK with CoreWeave-Style GitOps

Best for: Teams that want cloud-native automation and are comfortable operating at the intersection of Slurm and Kubernetes tooling.

SUNK (Slurm on Kubernetes) treats Slurm as a first-class Kubernetes scheduler rather than a bolt-on. Popularized by neoclouds like CoreWeave, this pattern deploys Slurm components via Helm charts and manages the entire stack declaratively through GitOps pipelines — think Argo CD driving cluster state from version-controlled manifests.

The appeal is real: you get fast scheduling, containerized Slurm daemons portable across environments, and a unified management plane for HPC and cloud-native workloads side by side. Dynamic scaling of GPU nodes becomes a GitOps operation rather than a manual sysadmin task.
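To make the GitOps idea concrete, here is a minimal sketch that builds an Argo CD Application manifest as a plain Python dictionary and prints it as JSON, which kubectl also accepts. The repository URL, chart path, and application name are hypothetical placeholders, not values from SUNK or CoreWeave. Only the field layout follows the public argoproj.io/v1alpha1 Application schema.

```python
import json


def slurm_argo_application(repo_url: str, chart_path: str, namespace: str) -> dict:
    """Build an Argo CD Application that syncs a Slurm Helm chart from Git.

    All values are supplied by the caller; nothing here is specific to SUNK.
    """
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        "metadata": {"name": "slurm", "namespace": "argocd"},
        "spec": {
            "project": "default",
            "source": {
                "repoURL": repo_url,
                "path": chart_path,
                "targetRevision": "main",
            },
            "destination": {
                "server": "https://kubernetes.default.svc",
                "namespace": namespace,
            },
            # Automated sync: remove resources deleted from Git and revert
            # manual changes made directly on the cluster.
            "syncPolicy": {"automated": {"prune": True, "selfHeal": True}},
        },
    }


app = slurm_argo_application(
    repo_url="https://git.example.com/platform/slurm-gitops.git",  # hypothetical
    chart_path="charts/slurm",  # hypothetical
    namespace="slurm",
)
print(json.dumps(app, indent=2))
```

With a manifest like this committed to Git, scaling GPU nodes becomes a reviewed change to chart values rather than a manual step on the cluster.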

The trade-off is isolation. SUNK relies on standard Kubernetes namespaces, which provide logical separation but share a single control plane. In an AI cloud context that requires tenant isolation, that's a meaningful limitation — a misconfigured CRD or a noisy neighbor at the API server level affects everyone. This pattern shines brightest for single-tenant or internally-trusted environments where isolation requirements are relaxed.
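The shared blast radius is easiest to see with a toy model. The sketch below is not real Kubernetes or vCluster code. It only assumes that CRDs are cluster-scoped, so every tenant on one control plane sees the same definition. The tenant names and the CRD name are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ControlPlane:
    """Toy stand-in for an API server: CRDs are cluster-scoped, not namespaced."""
    crds: dict = field(default_factory=dict)

    def install_crd(self, name: str, version: str) -> None:
        # Installing a CRD replaces the definition for every user of this control plane.
        self.crds[name] = version


def crd_versions_seen(tenants: dict, crd: str) -> dict:
    """Return which version of `crd` each tenant sees through its control plane."""
    return {tenant: cp.crds.get(crd) for tenant, cp in tenants.items()}


# Namespace isolation: both tenants talk to the same control plane.
shared = ControlPlane()
namespaced = {"team-a": shared, "team-b": shared}
shared.install_crd("widgets.example.com", "v1")
namespaced["team-a"].install_crd("widgets.example.com", "v2-broken")

# Virtual control planes: each tenant has its own CRD registry.
virtual = {"team-a": ControlPlane(), "team-b": ControlPlane()}
for cp in virtual.values():
    cp.install_crd("widgets.example.com", "v1")
virtual["team-a"].install_crd("widgets.example.com", "v2-broken")

print(crd_versions_seen(namespaced, "widgets.example.com"))
# {'team-a': 'v2-broken', 'team-b': 'v2-broken'}
print(crd_versions_seen(virtual, "widgets.example.com"))
# {'team-a': 'v2-broken', 'team-b': 'v1'}
```

In the namespaced case, team-a's bad CRD reaches team-b. With separate control planes, the damage stays with the tenant that caused it.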

Trade-off	Rating
Setup Time	Low to Moderate — fast once GitOps pipelines exist
Isolation Strength	★★ Basic — namespace-level only
GPU Performance Overhead	Minimal
Operational Burden	Low — GitOps automation reduces drift and manual ops

3. Slinky/slurm-operator with Custom CRDs

Best for: Teams that want declarative, operator-driven Slurm lifecycle management with fine-grained Kubernetes-native control.

The Slinky slurm-operator introduces Custom Resource Definitions that let you define a Slurm cluster as a native Kubernetes object. You declare a SlurmCluster, define NodeSets and LoginSets for worker and login nodes, and the operator handles provisioning, upgrades, and failure recovery automatically, including recreating the Slurm controller pod if it crashes.
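The operator pattern behind this is a reconcile loop: compare what was declared with what is running, then act on the difference. The sketch below is a generic illustration of that loop, not the slurm-operator's actual code. The component names and action strings are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DesiredState:
    controller_replicas: int
    worker_replicas: int


def reconcile(desired: DesiredState, observed: dict) -> list:
    """Compare desired and observed pod counts and return the actions needed.

    `observed` maps a component name ("controller", "worker") to the number
    of healthy pods currently running.
    """
    actions = []
    wanted = {
        "controller": desired.controller_replicas,
        "worker": desired.worker_replicas,
    }
    for component, target in wanted.items():
        running = observed.get(component, 0)
        if running < target:
            actions.append(f"create {target - running} {component} pod(s)")
        elif running > target:
            actions.append(f"delete {running - target} {component} pod(s)")
    return actions


desired = DesiredState(controller_replicas=1, worker_replicas=4)

# The controller pod has crashed and one worker is missing.
print(reconcile(desired, {"controller": 0, "worker": 3}))
# ['create 1 controller pod(s)', 'create 1 worker pod(s)']

# Nothing to do once the cluster matches the declaration.
print(reconcile(desired, {"controller": 1, "worker": 4}))
# []
```

A real operator runs this comparison continuously against the Kubernetes API, which is why a crashed controller pod comes back without anyone intervening.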

For HPC teams migrating from bare metal, one practical win here is shared storage. Lustre and other parallel filesystems can be mounted directly into slurmd pods, just as they would be on physical nodes. As community members have noted: "The difference is Kubernetes is the infra orchestration layer, so you can simply mount Lustre in the slurmd pods." That continuity matters a lot for teams with existing Lustre investments.

The operational requirement, however, is non-trivial. Your engineers need to be proficient in both Slurm administration and Kubernetes CRD management — two distinct skill sets that don't always overlap in the same person. And like SUNK, isolation is limited to what native Kubernetes namespacing provides.

Trade-off	Rating
Setup Time	Moderate — CRD customization requires initial investment
Isolation Strength	★★★ Moderate — configurable resource isolation, shared control plane
GPU Performance Overhead	Low
Operational Burden	Moderate — dual Slurm + K8s expertise required

4. slurm-bridge for Hybrid Scheduling

Best for: Organizations that want to incrementally burst Kubernetes workloads into an existing Slurm environment without a full migration.

The slurm-bridge pattern keeps your existing Slurm cluster as the source of truth and brokers individual Kubernetes pod scheduling through it. The architecture works like this: a Pod is submitted to a designated slurm-bridge namespace in Kubernetes; the bridge component creates a "placeholder job" in the external Slurm cluster; once Slurm grants the allocation, the kubelet starts the Pod on the Kubernetes node.
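The flow above can be sketched as a small simulation. This is a toy model under stated assumptions, not slurm-bridge's implementation: ToySlurm, submit_placeholder, and bridge_schedule are hypothetical names, and the "allocation" is a simple first-fit on free CPUs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Pod:
    name: str
    cpus: int
    node: Optional[str] = None  # set once the pod is bound to a node


class ToySlurm:
    """Stand-in for the external Slurm cluster: tracks free CPUs per node."""

    def __init__(self, free_cpus: dict):
        self.free_cpus = dict(free_cpus)

    def submit_placeholder(self, cpus: int) -> Optional[str]:
        # Grant the allocation on the first node with enough free CPUs.
        for node, free in self.free_cpus.items():
            if free >= cpus:
                self.free_cpus[node] = free - cpus
                return node
        return None  # the placeholder job stays pending


def bridge_schedule(pod: Pod, slurm: ToySlurm) -> bool:
    """Mirror the pod as a placeholder job; bind it only if Slurm allocates."""
    node = slurm.submit_placeholder(pod.cpus)
    if node is None:
        return False
    pod.node = node
    return True


slurm = ToySlurm({"node-1": 4, "node-2": 16})
pods = [Pod("small", 2), Pod("large", 8), Pod("huge", 32)]
for pod in pods:
    placed = bridge_schedule(pod, slurm)
    print(pod.name, pod.node if placed else "pending")
# small node-1
# large node-2
# huge pending
```

The point of the design is that Slurm alone decides placement. Kubernetes only runs the pod where Slurm has already accounted for the resources.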

This is an elegant migration strategy — your Slurm operators keep managing what they know, and Kubernetes consumers get resource access without needing to understand Slurm internals. But the seams show under production load.

One critical gotcha flagged by the HPC community: slurm-bridge has historically leaned on alpha Kubernetes features, which creates compatibility headaches on managed Kubernetes offerings from major cloud providers and neoclouds. The practical guidance from practitioners: verify your deployment relies only on beta or stable feature gates before going to production. The project is moving toward stable APIs, but it's worth auditing carefully.
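One way to run that audit is to compare the feature gates a deployment needs against their maturity stage on the target Kubernetes version. The sketch below shows the check itself. The gate names and stages are invented for illustration; a real audit would take them from the Kubernetes feature gate reference for your version and from the slurm-bridge release notes.

```python
ALLOWED_STAGES = {"beta", "stable"}


def audit_feature_gates(required: list, stage_by_gate: dict) -> list:
    """Return the gates that are not known to be beta or stable.

    `required` lists the gates a deployment depends on. `stage_by_gate` maps
    gate name to maturity stage for the target Kubernetes version. Unknown
    gates are reported too, since their stage cannot be confirmed.
    """
    problems = []
    for gate in required:
        stage = stage_by_gate.get(gate, "unknown")
        if stage not in ALLOWED_STAGES:
            problems.append((gate, stage))
    return problems


# Hypothetical gate names and stages, for illustration only.
stages = {"ExampleGateA": "stable", "ExampleGateB": "beta", "ExampleGateC": "alpha"}
required = ["ExampleGateA", "ExampleGateC", "ExampleGateD"]

print(audit_feature_gates(required, stages))
# [('ExampleGateC', 'alpha'), ('ExampleGateD', 'unknown')]
```

An empty result is the condition to gate a production rollout on, especially on managed platforms where alpha gates usually cannot be enabled.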

Beyond compatibility, the dual-system operational model is genuinely complex. You're monitoring two schedulers, two sets of job queues, and two failure domains simultaneously.

Trade-off	Rating
Setup Time	Moderate to High — integrating two schedulers requires careful tuning
Isolation Strength	★★★ Moderate — depends on both Slurm and K8s configurations
GPU Performance Overhead	Moderate — cross-system scheduling adds latency
Operational Burden	High — two systems to monitor and maintain

5. DIY Operator Builds

Best for: Organizations with large, dedicated platform engineering teams and requirements so specific that no existing tool fits.

Building a custom Kubernetes operator to manage Slurm from scratch gives you total control over the integration — every scheduling decision, every resource mapping, every failure mode handled exactly how your organization's workflows demand. If you have genuinely unique requirements that off-the-shelf solutions can't accommodate, this path exists for a reason.

The cost is steep, and it compounds over time. A DIY operator typically takes months (sometimes over a year) to reach production stability. Your team owns every bug fix, every Kubernetes API deprecation, every Slurm version upgrade, and every security patch indefinitely. The isolation strength and GPU overhead are entirely a function of how well the system was designed — there's no vendor to call when something breaks at 2am before a training run deadline.

The HPC community has seen a sobering pattern here: many organizations that went DIY eventually migrated to managed or open-source operator solutions once the maintenance burden became untenable. Unless you have very specific reasons to build from scratch, the DIY path tends to be one that teams regret in retrospect rather than one they chose with clear intent.

Trade-off	Rating
Setup Time	⚠️ Very High — months to years of engineering
Isolation Strength	Variable — entirely dependent on implementation quality
GPU Performance Overhead	Variable — ranges from minimal to significant
Operational Burden	⚠️ Very High — full lifetime ownership of the codebase

Choosing the Right Path

Here's the honest summary of where each pattern fits:

Approach	Best Fit	Avoid If
vMetal + Slinky Certified Stack	AI clouds with tenant isolation, neoclouds, fast time-to-production	You're running a single-tenant research cluster with no isolation needs
SUNK + GitOps	Single-tenant or trusted-internal GPU clusters	You need strong tenant isolation
Slinky slurm-operator	Teams migrating from bare-metal Slurm wanting Kubernetes-native ops	You need strong isolation or have no K8s expertise
slurm-bridge	Incremental migration, burst-to-K8s from existing Slurm	You need low-latency scheduling or are on a managed K8s platform
DIY Operator	Highly specific requirements with deep platform eng resources	You need to ship in months, not years

The right tool depends on your constraints. But for organizations that need to deliver secure HPC and AI services with tenant isolation at GPU scale without spending quarters on custom engineering, the vMetal + Slinky Certified Stack approach is the only pattern in this list that ships pre-validated and production-ready, with genuine per-tenant control plane isolation built in.

Frequently Asked Questions

What is the main benefit of running Slurm on Kubernetes?

The primary benefit is combining Slurm's advanced, priority-based job scheduling for HPC and AI workloads with Kubernetes' dynamic, container-based resource management and scalability. This hybrid approach allows organizations to manage large-scale GPU resources with the fine-grained job control of a traditional HPC scheduler alongside the flexibility and automation of a cloud-native platform.

Why not just use the default Kubernetes scheduler for AI workloads?

The default Kubernetes scheduler lacks the sophisticated features needed for large-scale HPC and AI, such as advanced job queueing, fair-share priority, and backfilling. While excellent for general-purpose services, it isn't optimized for managing long-running, resource-intensive batch jobs like GPU training. Slurm ensures expensive GPU resources are utilized efficiently and fairly across multiple users and teams.
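Backfilling is the least familiar of these features, so here is a toy version of it. The sketch implements a simplified EASY-style backfill rule, not Slurm's actual scheduler: the first job that cannot start gets a reservation, and lower-priority jobs may start only if they fit now and finish before that reservation begins. The job names and sizes are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    nodes: int
    walltime: int  # requested runtime in minutes


def backfill(free_nodes: int, running: list, queue: list) -> list:
    """Pick the jobs to start now.

    `running` holds (nodes, minutes_remaining) for jobs already executing.
    `queue` is ordered by priority, highest first.
    """
    started = []
    running = list(running)
    reservation_start = None
    for job in queue:
        if reservation_start is not None:
            # Backfill candidate: must fit now and end before the reservation.
            if job.nodes <= free_nodes and job.walltime <= reservation_start:
                free_nodes -= job.nodes
                started.append(job.name)
            continue
        if job.nodes <= free_nodes:
            free_nodes -= job.nodes
            running.append((job.nodes, job.walltime))
            started.append(job.name)
            continue
        # Highest-priority blocked job: reserve the earliest time it can start.
        # A job larger than the whole cluster gets no reservation and is skipped.
        available = free_nodes
        for nodes, remaining in sorted(running, key=lambda r: r[1]):
            available += nodes
            if available >= job.nodes:
                reservation_start = remaining
                break
    return started


# 8-node cluster: 5 nodes busy for another 60 minutes, 3 nodes idle.
queue = [Job("A", 8, 120), Job("B", 2, 30), Job("C", 1, 90), Job("D", 1, 45)]
print(backfill(free_nodes=3, running=[(5, 60)], queue=queue))
# ['B', 'D']
```

Job A needs the whole cluster and must wait 60 minutes. B and D run in the idle gap without delaying A, while C is held back because its 90-minute request would overrun A's start. Without backfill, those three idle nodes would sit empty for an hour.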

How does running Slurm on Kubernetes impact GPU performance?

When properly configured on bare metal, running Slurm on Kubernetes has a negligible impact on GPU performance. Solutions like vCluster Standalone run directly on the host OS without a hypervisor, giving containers direct hardware access. This ensures GPU throughput and high-speed interconnects (like RDMA) perform at near bare-metal speeds.

What is Slinky and how does it relate to Slurm on Kubernetes?

Slinky is an open-source project that provides the essential tools and components to integrate Slurm with a Kubernetes cluster. It acts as the bridge between the two systems, offering components like the slurm-operator to manage Slurm declaratively and slurm-bridge to schedule Kubernetes pods through an external Slurm cluster.

What's the difference between namespace isolation and virtual control plane isolation?

Namespace isolation logically separates workloads within a single, shared Kubernetes control plane, which is a form of soft multi-tenancy. Virtual control plane isolation, provided by technologies like vCluster, gives each tenant their own dedicated and fully sandboxed control plane (API server, etcd). This hard multi-tenancy is more secure, as it eliminates the "noisy neighbor" problem and prevents one tenant's activities from impacting others.

Which pattern is best for building an AI cloud with tenant isolation?

For a production-ready AI cloud with tenant isolation, vMetal with a Slinky Certified Stack is the recommended pattern. It is the only approach discussed that combines a pre-validated, production-grade Slurm environment with the strong, per-tenant virtual control plane isolation necessary for a secure and stable shared platform.

When should I consider building a DIY Slurm operator?

You should only consider a DIY operator if your organization has highly specific requirements that cannot be met by existing solutions and you have a large, dedicated platform engineering team to build and maintain it. This path involves a significant, long-term investment in development and ongoing maintenance, making it impractical for most organizations.

Don't spend months building what you can deploy in minutes. Explore vCluster's AI cloud solutions →
