Architecting a Private Cloud for AI Workloads

Artem Oppermann
9 Minute Read

The first part of this series explored why GPU infrastructure has become foundational to enterprise AI strategy. The second part covered GPU multitenancy strategies for safely sharing resources across teams and workloads. This final installment turns to architecting an entire private cloud infrastructure that can support your AI workloads at scale. The emergence of powerful LLMs and multimodal generative AI has driven a dramatic increase in enterprise adoption of AI.

But ever-increasing AI use also brings exponentially higher resource demands, particularly for massively parallel processing.

GPUs have become the de facto accelerators for training and running modern AI models because their thousands of cores deliver far higher throughput than CPUs for parallel workloads. A single modern GPU can deliver, in some cases, more than ten times the performance of a CPU for AI workloads. Therefore, GPU infrastructure decisions on accelerator choice, scale and topology, interconnects, data paths, workload scheduling, and isolation directly shape a company's ability to innovate and compete.

Due to cost considerations, data sovereignty requirements, and the need for customized infrastructure, many companies are exploring alternatives, such as private-cloud or hybrid approaches, for their AI workloads. This article focuses on architecting a private cloud for AI workloads. It explains why enterprises choose a private cloud and shows how to design and operate one from end to end.

Why a Private Cloud for AI?

A public cloud is a good starting point for AI workloads. However, challenges may arise during production, such as unpredictable costs for continuous GPU usage, egress fees for large data sets, sharing and compliance risks, and limited hardware control. A private cloud addresses these issues with predictable costs, better data residency and management, tailored performance through custom hardware, and hard isolation when needed. The following sections cover the key drivers behind the private-cloud decision, namely cost and ROI, data protection and compliance, performance and customization, security, isolation, and vendor independence.

Cost Control and ROI

A private cloud offers better cost control and, with continuous GPU usage, a higher return on investment than a public cloud: You purchase GPUs once and spread the up-front cost over their three- to five-year lifetime, thus avoiding variable on-demand pricing.

While GPU instances in the public cloud are highly convenient, they quickly become expensive for sustained workloads. GPU pricing in public clouds varies widely by provider, region, and usage model. For example, costs for public cloud GPUs can range from $0.14–$6.25 per hour for previous-generation GPUs to $1.90–$12.29 per hour for current-generation H100 GPUs. For 24/7 usage, this results in costs of tens or even hundreds of thousands of dollars annually. And as models grow and move from proof of concept to production, GPU demand grows with them.

Privately managed clouds, on the other hand, offer the advantage of a predictable total cost of ownership and no fees for data transfer between services or from your environment. You also benefit from stable power and colocation costs. Together, these factors reduce the effective cost per GPU hour and improve the ROI for teams that continuously train and deploy models.

Data Protection and Compliance

Training AI models often requires access to sensitive customer or company data. A private cloud gives you more refined control over data residency and security, which can help facilitate compliance with SOC 2, HIPAA, GDPR, financial regulations (PCI DSS), industry-specific requirements (like FDA 21 CFR Part 11 for pharmaceutical companies), and internal governance requirements. That makes a private cloud a good choice for companies subject to strict regulations or working with proprietary models.

Performance and Customization

AI workloads in practice can vary greatly. Large-scale training jobs require clusters of high-end GPUs connected to high-bandwidth networks and storage. Real-time inference services, on the other hand, require many smaller GPU instances. Private clouds can provide customized GPU hardware and dedicated bandwidth to optimize performance for specific workloads. With a private cloud, you can flexibly customize the hardware, network topology, and storage design to your needs without being limited by a single vendor's product catalog.

Security, Isolation, and Vendor Independence

Running workloads in a public, shared cloud can be tricky. In highly competitive industries, protecting intellectual property is especially important, and there are often concerns about whether the separation between different customers is really sufficient. On the one hand, public cloud providers implement strong tenant-isolation mechanisms, such as AWS Nitro, and offer dedicated-tenancy options. On the other hand, some teams and companies may still prefer private environments for tighter IP control and governance. Private clouds can provide physically and administratively isolated environments in which companies retain control over their entire security stack, from hardware to applications.

Another advantage is reduced vendor lock-in. If you own your hardware, you can move or upgrade GPUs as needed. You're not completely free of dependencies, though: you remain tied to hardware manufacturers, data center operators, and software stacks. A further advantage of a private cloud is its independence from cloud APIs, which makes it easier to move workloads between environments. Storing large data sets locally also saves money: if model checkpoints and data sets don't have to be transferred between services or out of your environment, expensive data-egress fees are eliminated. With multiple terabytes of data and frequent model iterations, these fees can easily run into the hundreds of thousands of dollars per year.

Core Pillars of a Private AI Cloud

Building a private AI cloud isn't just about purchasing hardware; it involves coordinating several layers, including compute infrastructure and orchestration, isolation, storage, and networking.

The following sections describe these building blocks and their interrelationships.

GPU-Enabled Compute Infrastructure

Choosing the right hardware is the foundation. The following are common options:

  • NVIDIA A100 (Ampere): Introduced in 2020, the A100 offers either 40 or 80 GB of HBM2e memory and supports the first generation of Multi-Instance GPU (MIG), which allows a single GPU to be partitioned into up to seven isolated instances. This GPU is well suited to your use case if you want solid price performance and availability for moderate-scale training and inference.

  • NVIDIA H100 (Hopper): Launched in 2022, the H100 features 80 GB of HBM3 memory and enhanced MIG capabilities with more flexible partitioning options. The H100 provides roughly two to four times the performance of the A100 for LLM training. This GPU is one of the best suited for large-scale LLM training and long-context inference.

  • NVIDIA H200 (Hopper): The latest Hopper iteration, with 141 GB of HBM3E memory, offering roughly 1.8x the memory capacity and 1.4x the memory bandwidth of the H100. This GPU is typically chosen if models are memory-bound or if you need longer sequences or larger batch sizes.

  • NVIDIA L40S (Ada Lovelace): This general-purpose GPU with 48 GB GDDR6 memory focuses on generative AI, graphics, and video workloads. This GPU is recommended for high-throughput inference, diffusion/vision, and mixed graphics workloads. However, compared to the H100/H200, this GPU is not ideal for large-scale distributed training.

Consumer GPUs like the RTX 4090, with 24 GB of memory, can be useful for R&D experiments, small-scale fine-tuning, and CI testing. However, they typically lack ECC memory, data center form factors, and high-bandwidth multi-GPU interconnects, and they're ill-suited to multitenant clusters or large distributed training environments. Infrastructure architects should match their choice of GPU to workload size, precision requirements, availability, and budget.

A private cloud can deploy GPUs directly on bare-metal servers or in virtualized environments. Bare-metal servers offer the highest performance and minimize overhead for throughput-critical training and latency-sensitive inference. Virtualization, on the other hand, enables sharing and isolation but adds some overhead.

MIG allows hardware-level partitioning of a single GPU. The resulting instances can be exposed directly on bare metal, passed through to virtual machines, or combined with NVIDIA's virtual GPU (vGPU) software for more flexible sharing across VMs and containers.

Kubernetes as an Orchestration Layer

Once you establish your hardware foundation and resource-sharing strategies, the next question is how to efficiently orchestrate these resources. This is where Kubernetes comes into play. Kubernetes has established itself as the standard control plane for AI workloads. Kubernetes can abstract the underlying hardware and enable automation, reproducibility, and scalability. Another key advantage is that resources do not need to be manually provisioned. Instead, users declare the desired state, and Kubernetes schedules the pods accordingly. This allows individual AI jobs, such as data preprocessing, training, analysis, and deployment, to be started in their own containers and scaled independently.

For GPU-based nodes, Kubernetes uses device plugins: each node advertises its GPU resources via a device plugin, allowing pods to request GPUs and receive predictable performance. By default, Kubernetes schedules entire GPUs. A pod requesting `nvidia.com/gpu: 1` therefore gets the whole card; GPUs are not oversubscribed, and workloads cannot request fractions of a GPU. Advanced features such as MIG, vGPU, and time slicing address this limitation by splitting or sharing GPUs, as the sketch below illustrates.
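
To make this concrete, here is a minimal pod spec that claims one whole GPU via the NVIDIA device plugin. The image and names are placeholders, and the commented-out line shows how a MIG slice would be requested instead, assuming the GPU Operator is running with the mixed MIG strategy on an 80 GB-class card.

```yaml
# Sketch: a pod that requests one entire GPU through the device plugin.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test                        # illustrative name
spec:
  restartPolicy: Never
  containers:
  - name: cuda-check
    image: nvcr.io/nvidia/pytorch:24.02-py3   # example image; use your own
    command: ["python", "-c", "import torch; print(torch.cuda.get_device_name(0))"]
    resources:
      limits:
        nvidia.com/gpu: 1            # one whole card; fractional requests are not possible
        # nvidia.com/mig-1g.10gb: 1  # alternative: one MIG slice (mixed MIG strategy)
```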

Multitenant Isolation and Access Control

An orchestrated cluster alone does not guarantee clean separation between teams or projects. When using a private cloud, you need to isolate teams or applications while still sharing the infrastructure. The simplest model uses Kubernetes namespaces combined with role-based access control (RBAC), resource quotas, and network policies. Namespaces isolate objects within the API, RBAC controls who can read or edit them, and quotas set limits on CPU, memory, and GPU usage.
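
A minimal sketch of this namespace-plus-quota model, assuming the NVIDIA device plugin's `nvidia.com/gpu` resource name; the namespace and limits are illustrative.

```yaml
# Sketch: one namespace per team, capped by a ResourceQuota on CPU, memory, and GPUs.
apiVersion: v1
kind: Namespace
metadata:
  name: team-nlp                      # illustrative team namespace
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-nlp-quota
  namespace: team-nlp
spec:
  hard:
    requests.cpu: "200"
    requests.memory: 1Ti
    requests.nvidia.com/gpu: "8"      # at most eight GPUs allocated at any time
    limits.nvidia.com/gpu: "8"
```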

Virtual clusters offer even greater isolation. Virtual clusters create an independent control plane within a host cluster, and each virtual cluster has its own API server and can run on shared or dedicated infrastructure. Virtual clusters also enable self-service. Developers can create their own virtual Kubernetes environments without deploying entire clusters. Combined with single sign-on (SSO) and identity management, virtual clusters enforce strong boundaries while the platform team maintains governance.

Storage for AI Workflows

The storage and movement of large amounts of data also require careful planning. Shared file systems, such as Lustre, BeeGFS, and CephFS, offer high throughput and parallel access for distributed training. They are designed to deliver enough read/write bandwidth to keep GPUs fed with data.

For data sets exceeding tens of terabytes in size, object storage systems, such as MinIO or other S3-compatible solutions, offer cost-effective scalability. In practice, parallel file systems and object storage are often combined: the high-performance file system holds latency-sensitive training data, while object storage holds archives, data sets, and model checkpoints. This tiering gives GPUs low-latency access to hot data while moving bulk or infrequently accessed data to a more cost-effective, highly durable tier. That lowers the cost per terabyte and improves reliability through erasure coding and versioning.
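
As a rough sketch of this tiering in Kubernetes, the two PersistentVolumeClaims below put hot training data on a parallel file system and checkpoints or archives on S3-compatible object storage. The StorageClass names are hypothetical and depend entirely on the CSI drivers you deploy.

```yaml
# Sketch: two storage tiers expressed as PVCs with hypothetical StorageClasses.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-data-hot
  namespace: team-nlp
spec:
  accessModes: ["ReadWriteMany"]       # shared, parallel access for distributed training
  storageClassName: parallel-fs        # hypothetical class backed by Lustre/BeeGFS/CephFS
  resources:
    requests:
      storage: 20Ti
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: checkpoints-archive
  namespace: team-nlp
spec:
  accessModes: ["ReadWriteMany"]
  storageClassName: object-tier        # hypothetical class backed by S3-compatible object storage
  resources:
    requests:
      storage: 100Ti
```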

Network and Data Movement

Distributed training and multi-GPU inference move large amounts of data between nodes for gradient synchronization, input pipelines, and checkpoint streaming. If the network is slow or congested, GPUs wait on communication instead of computing. High-throughput, low-latency networking is therefore critical.

High-bandwidth interconnects, such as InfiniBand and RDMA over Converged Ethernet (RoCE), can ensure low latency between GPUs. The choice of Container Network Interface (CNI) plugin can also influence performance; AI clusters often use plugins that support jumbo frames and multiqueue networking. You also need to consider inbound and outbound data traffic: moving large data sets into the cluster and exporting model artifacts can strain network links. Colocating storage and compute or using GPUDirect Storage can reduce the overhead of data movement.

Software-Stack Considerations

When building a private AI cloud, your software-stack choices directly impact GPU efficiency, tenant security, and operational complexity. A typical private AI cloud software stack includes the following:

  • NVIDIA GPU drivers and operators: The GPU Operator installs the required drivers, the container runtime components, monitoring agents, and other management pieces. It supports configuring MIG and (where applicable) time slicing, and it abstracts the differences between bare-metal and cloud nodes. Use the GPU Operator for consistent installations, faster rollouts, and easier upgrades across clusters; manage drivers manually only in tightly locked-down or highly customized environments. (A hedged time-slicing configuration sketch follows this list.)

  • ML frameworks and model serving: PyTorch, TensorFlow, and Keras are the frameworks that underpin training workloads and the AI applications themselves. NVIDIA Triton Inference Server, KServe, and Ray Serve provide high-performance model endpoints with automatic scaling. Use Triton if you need multiframework backends and high-throughput dynamic batching. Choose KServe if you want native Kubernetes routing, canary rollouts, and autoscaling. Choose Ray Serve if you prefer a Python-centric serving layer and DAG-style composition. For large language models, consider vLLM for efficient serving with paged attention, and TorchServe for simple, pure-PyTorch deployments.

  • Pipeline tools: Workflows must be coordinated from training to deployment, and MLflow, Argo Workflows, and Kubeflow handle this. Through their integration with Kubernetes, these tools enable metadata tracking and reproducible experiments. Choose MLflow if experiment tracking and a lightweight model registry are the priorities, Argo if you want Kubernetes-native workflows that fit GitOps, and Kubeflow if you want an opinionated ML platform with notebooks, pipelines, and a centralized UX.

  • Monitoring and observability: Transparency regarding job status, memory usage, GPU utilization, and performance metrics is critical for running AI in production. This is where Prometheus and Grafana come into play: Prometheus collects and stores the metrics, Grafana visualizes them, and the Data Center GPU Manager (DCGM) Exporter supplies the GPU-specific telemetry. Standardize on Prometheus plus DCGM for metrics, Grafana for dashboards, and OpenTelemetry for traces.
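
Picking up the GPU Operator bullet above, the following is a minimal sketch of a time-slicing configuration in the format the NVIDIA device plugin consumes. The ConfigMap name, its namespace, and the way the GPU Operator references it vary by operator version, so treat this as illustrative rather than definitive.

```yaml
# Sketch: time-slicing config for the NVIDIA device plugin, typically referenced
# from the GPU Operator's ClusterPolicy. With replicas: 4, four pods can share one
# physical GPU; time slicing provides no memory or fault isolation (unlike MIG).
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config   # illustrative name
  namespace: gpu-operator     # assumes the operator's namespace
data:
  default: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4
```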

Operational Challenges

Operating a private AI cloud is challenging, even with the right hardware and software. GPUs are expensive and run frequent, stateful AI jobs; demand peaks during experiments and settles between training cycles, so load is hard to balance and simple capacity plans fall short. Scaling and lifecycle work therefore require tight choreography across drivers, CUDA, firmware, kernels, and node images. It's best to stage new driver and CUDA builds on a small pool of nodes first. Pods should be drained before reimaging, and GPU resets should be sequenced so that long-running jobs can checkpoint and resume. Getting this choreography wrong risks lost jobs or downtime.

Utilization and capacity planning depend on avoiding GPU fragmentation, choosing the right instance sizes, and planning for long lead times. Fragmentation is problematic when the scheduler can only hand out whole devices: a small 8 GB inference service can consume an 80 GB H100. Use better bin packing, fixed instance sizes, and rightsized requests so small services can be packed together, and leave room for larger training tasks. Plan capacity around multimonth procurement lead times and model growth. Keep an eye on GPU hours, memory reserves, and interconnect saturation, and plan buffers for maintenance, requeueing, and supply chain delays.

Inefficiencies in Bin Packing and GPU Fragmentation

AI clusters are often underutilized because small services consume entire GPUs, and Kubernetes doesn't overcommit GPUs by default. For example, an online inference service that uses about 8 GB of memory and modest compute still occupies an entire H100, leaving most of the card unused. The result is stranded capacity that blocks more urgent tasks.

Capacity utilization can be improved with a few standard GPU shapes and simple request templates. Enforce rightsizing with CI checks and admission policies. Teams should be encouraged to specify memory and bandwidth requirements in addition to the number of GPUs to ensure smarter placement. Also, place small inference runs on dedicated nodes and large training runs in separate pools. The waste can be visualized with dashboards to track allocated and used memory, active SMs, and throughput over time. Set clear thresholds and trigger reviews when waste persists. Reduce the churn that increases fragmentation by prioritizing stable pods for inference and consolidating bursty work into defined time windows. Over time, these habits increase utilization without new hardware and without impacting latency.
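
One way to make the "clear thresholds" above concrete is an alert on sustained low utilization, built on the DCGM exporter's `DCGM_FI_DEV_GPU_UTIL` metric. This is a hedged sketch: it assumes the exporter attaches pod and namespace labels, and the threshold and duration are illustrative.

```yaml
# Sketch: fire a warning when an allocated GPU stays under 15% utilization for six hours.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-waste-alerts              # illustrative name
  namespace: monitoring
spec:
  groups:
  - name: gpu-utilization
    rules:
    - alert: GPUMostlyIdle
      expr: avg by (namespace, pod) (DCGM_FI_DEV_GPU_UTIL) < 15
      for: 6h
      labels:
        severity: warning
      annotations:
        summary: "GPU held by {{ $labels.namespace }}/{{ $labels.pod }} has been under 15% utilization for 6h"
```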

Capacity Planning and Utilization Monitoring

Companies need to predict how many GPUs they will need, especially during development cycles when training peaks occur. Underprovisioning slows down projects, and overprovisioning ties up capital. In practice, teams can estimate the training baseline from observed GPU hours and size inference capacity by measuring how many requests per second a single GPU can handle at the required latency, then add headroom for packing inefficiencies, maintenance, and growth. For example, if monitoring shows around 2,400 GPU hours per day and a serving target of 1,000 requests per second, with each GPU handling about 180 requests per second, the cluster will need about 230 GPUs after factoring in safety margins. Platform teams can also set quotas and priority classes to make sure that important workloads always have enough resources.

GPU Node Autoscaling

Automatic scaling of CPU nodes is easy to set up, but GPU nodes are more expensive and take longer to provision. In a private cluster, autoscaling has to be coupled with hardware provisioning rather than an elastic cloud API. MIG profiles may also need to be reconfigured dynamically to match workload sizes. For example, a GPU may be split into smaller instances for inference during the day and reconfigured at night for training. This reconfiguration requires orchestration to avoid interrupting running jobs, but it can usually be performed without rebooting the node. Plan for a baseline capacity with additional headroom for peaks, and consider burst-capacity agreements with colocation or cloud providers for hybrid scenarios.

Patching and Driver Version Consistency

AI software updates frequently change driver requirements and library compatibility, constantly shifting the boundaries of supported versions. This pace requires maintaining a tested, consistent set of drivers, CUDA libraries, and frameworks. Otherwise, kernel modules may fail to load, library symbols may go missing, NCCL communication may break, or performance may silently degrade. Use the GPU Operator to lock in known-good combinations, and roll out updates to Kubernetes nodes in a controlled manner (see the version-pinning sketch below). Coordinate patches at regular intervals to reduce drift and downtime, and use canary nodes with automated rollback so that running jobs aren't interrupted.
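
One way to lock in a known-good combination is to pin component versions in the GPU Operator's Helm values. This is a hedged sketch; the exact keys and version strings depend on the chart release you deploy, so verify them against the operator documentation.

```yaml
# Sketch: Helm values pinning driver and toolkit versions for the NVIDIA GPU Operator.
# Version strings are illustrative; pin the combination you have actually validated.
driver:
  enabled: true
  version: "535.129.03"              # example data center driver branch
toolkit:
  enabled: true
  version: "v1.14.6-ubuntu20.04"     # example container-toolkit tag
devicePlugin:
  enabled: true
```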

Lifecycle of AI Workloads

AI workloads include both short-lived jobs, like training and batch inference, which run for hours or days, and persistent services, like online inference, which run continuously. It's important to manage these different lifecycles appropriately. Short-lived jobs require robust checkpoints, retry logic, and cleanup to avoid wasted GPU time. Persistent services, on the other hand, require strong SLOs, autoscaling policies, and safe rollout strategies. The handoff between the two, the promotion of a trained model to production, should follow a standardized path via a model registry and CI/CD.
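
For the short-lived side of this split, a training run is typically expressed as a Kubernetes Job with limited retries, automatic cleanup, and a checkpoint volume so a restarted pod resumes rather than recomputes. The image, arguments, and PVC name below are placeholders.

```yaml
# Sketch: a training Job that retries transient failures and resumes from checkpoints.
apiVersion: batch/v1
kind: Job
metadata:
  name: finetune-run-042                        # illustrative name
  namespace: team-nlp
spec:
  backoffLimit: 3                               # retry a few times, then fail the Job
  ttlSecondsAfterFinished: 86400                # clean up finished pods to free GPU nodes
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - name: trainer
        image: nvcr.io/nvidia/pytorch:24.02-py3 # example image
        args: ["python", "train.py", "--resume-from", "/ckpts/latest"]
        resources:
          limits:
            nvidia.com/gpu: 4
        volumeMounts:
        - name: checkpoints
          mountPath: /ckpts
      volumes:
      - name: checkpoints
        persistentVolumeClaim:
          claimName: checkpoints-archive        # hypothetical PVC for checkpoints
```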

Cost and Governance

Operational practices such as scheduling, usage monitoring, and workload isolation determine both how efficiently GPUs are used (cost management) and how fairly and transparently they are allocated to teams (governance). Efficient private-cloud operations require transparency and cost control.

You can use the following strategies to help achieve efficient private-cloud operations with clear accountability and predictable costs:

  • GPU usage accounting: GPU time should be tracked per team or per project. Tools such as Kubernetes resource usage metrics, DCGM telemetry, or specialized platforms like Run:ai or Determined AI help here. Metrics to track include GPU hours, memory usage percentage, and actual compute utilization. This enables accurate cost allocation and the identification of heavy users.

  • Quotas and budgets: Quotas for GPUs, CPUs, memory, and storage should be set per tenant. In combination with virtual clusters or namespaces, limits can be enforced and resource exhaustion prevented. Consider implementing ResourceQuotas, LimitRanges, and PriorityClasses in Kubernetes (see the sketch after this list). Additionally, set hard limits to prevent overuse and soft limits that trigger notifications when usage spikes.

  • Rightsizing workloads: Developers should be encouraged to request only the resources they truly need. By using MIG profiles or time slicing, unused capacity can be reclaimed and utilization improved.

  • ROI analysis: The costs of using private infrastructure should be regularly compared with those of the public cloud. This comparison should consider investments in hardware, depreciation (typically three to five years for GPUs), power, cooling, network equipment, storage systems, and personnel costs for operations. Since AI models are constantly evolving, hardware selection should also be regularly reviewed. This ensures that GPUs continue to meet workload requirements. Track metrics such as performance per watt and performance per dollar for different workload types.
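
As a sketch of the quota mechanics referenced in the list above, the following defines a PriorityClass so production inference can preempt experiments, plus a LimitRange capping per-container GPU requests in a team namespace. Names and values are illustrative, and LimitRange behavior for extended resources should be verified against your Kubernetes version.

```yaml
# Sketch: production inference outranks experiments; no container may claim more than two GPUs.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: prod-inference                # illustrative name
value: 100000
globalDefault: false
description: "Latency-sensitive inference services; preempts experimental jobs."
---
apiVersion: v1
kind: LimitRange
metadata:
  name: gpu-per-pod-cap
  namespace: team-nlp
spec:
  limits:
  - type: Container
    max:
      nvidia.com/gpu: "2"             # cap per-container GPU claims
```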

Security and Compliance

Container isolation is a good foundation, but in shared AI platforms, teams often share the same physical hardware, such as GPUs, and problems can spill across container boundaries. For example, when an AI job completes, residual data (e.g., tensors or model weights) may remain in GPU memory if the runtime or hardware fails to reliably clear it, and a subsequent job could see traces of it. For this reason, the hardware itself should be treated as part of the security boundary, not just the containers.

Resources must be more strictly segregated. This means dedicated GPUs or hard partitioning using technologies such as NVIDIA MIG or AMD MI300X partitioning, where available. Furthermore, all tenant data left on shared devices must be securely deleted (e.g., by zeroing GPU memory between jobs), access to shared devices should be strictly restricted, and unusual behavior should be monitored continuously.

Ensure the platform's root of trust is simple and reliable. Machines should only boot with known good software (secure boot). Device firmware should also be kept up-to-date, and configurations should be locked to prevent unauthorized changes. Clear, enforceable rules should be established regarding who can deploy what, where, and with which permissions. These activities should also be secured with comprehensive audit logs. Good record-keeping ("Who did what and when?") can accelerate compliance checks for standards like SOC 2, HIPAA, or GDPR and help teams resolve issues before they become incidents.

For highly confidential models or data, prefer hard tenancy. This means giving each team its own isolated environment, such as a virtual Kubernetes cluster (vCluster) with complete control-plane isolation, backed by dedicated nodes or even dedicated GPUs. Use network segmentation and encrypt data in transit and at rest. Where available, hardware partitioning (e.g., MIG) can further reduce the risk of cross-tenant leakage.
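
Network segmentation between tenants usually starts with a default-deny policy per namespace, after which explicit allow rules are added for the traffic each workload actually needs. A minimal sketch, with an illustrative namespace:

```yaml
# Sketch: deny all ingress and egress for every pod in a tenant namespace by default.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: team-nlp
spec:
  podSelector: {}                     # selects every pod in the namespace
  policyTypes:
  - Ingress
  - Egress
```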

Decision Framework

Before deciding on a private cloud for AI workloads, you need to carefully consider whether you're ready for this approach and which implementation strategy makes the most sense. This decision involves understanding the requirements, assessing your team's capabilities, and weighing the pros and cons of different deployment models.

When implementing a private cloud, you have several options:

  • Building your own private AI cloud: This means assuming full responsibility for hardware procurement, data center operations, power and cooling, and general maintenance. While this approach offers maximum control and customization, it also requires significant up-front investment and ongoing operational expertise.

  • Purchasing managed private cloud services or colocation offerings: Doing this allows you to maintain data sovereignty and compliance benefits and delegate infrastructure management to specialized providers. These managed solutions typically require ongoing fees rather than large capital expenditures. The providers handle hardware maintenance, driver updates, and infrastructure operations.

  • Adopting a hybrid private cloud approach: This could involve building a centralized training infrastructure in-house and using managed services for development environments or overflow capacity.

The choice between these approaches largely depends on the trade-off between control and complexity. If you need maximum control over hardware configurations, network topologies, and software stacks, you're also going to face correspondingly greater operational complexity. A fully self-managed private cloud enables the deployment of customized operating systems, specialized security policies, and custom schedulers. However, this requires a dedicated MLOps team with deep expertise in Kubernetes, CUDA programming, GPU management, and distributed systems. Managed private-cloud services can significantly reduce this operational overhead through out-of-the-box scalability, professional support, and automated updates. However, this limits hardware selection and customization options.

To find the right approach, answer these questions:

Do we need strict tenant separation?

This is required if you process highly sensitive data or operate in a highly competitive environment, which may require dedicated control planes, isolated hardware, or even physically separated infrastructure.

How sensitive are our models and data?

Legal requirements—such as HIPAA, GDPR, or industry-specific compliance regulations—may mandate local processing and storage, making a private-cloud infrastructure essential rather than optional.

Are our teams ready to operate the GPU infrastructure?

Successfully operating a private AI cloud requires specialized expertise in GPU cluster management, CUDA optimization, Kubernetes operations, and distributed training workflows.

What is our long-term AI strategy?

The sustainability of GPU investments depends on workload evolution, model architecture trends, and performance requirements over the typical hardware lifecycle of three to five years.

You should consider a self-built private cloud if you require dedicated hardware isolation for highly sensitive data and have strong internal capabilities. Managed private-cloud services offer an attractive middle ground when you're handling sensitive data but lack the operational expertise to manage infrastructure independently. Public-cloud solutions remain your better choice when you don't need stringent security requirements or when your long-term AI strategy is still evolving; this allows you to defer major infrastructure commitments until your needs become more concrete.

Conclusion

This article explored how to design a private cloud for AI workloads, why teams choose it, and where it provides the greatest benefit. We examined the core components of a private cloud, such as GPU-accelerated computing, Kubernetes as a control plane, multitenant isolation, and high-throughput storage and networking. We also outlined when a private cloud is likely to pay off: for example, with continuous 24/7 training and inference, strict data residency and compliance needs, strict tenant and IP protection requirements, or a need to minimize egress traffic.

Building on Lukas Gentele's insights about the foundational importance of GPU infrastructure strategy from part one and the multitenancy approaches covered in part two, this comprehensive approach to private cloud architecture enables you to maximize your AI investments while maintaining the control, security, and cost efficiency that make in-house capabilities attractive.

vCluster simplifies Kubernetes multi-tenancy. Instead of running separate physical clusters for every team, vCluster provisions a fully isolated control plane, complete with its own API server, inside a shared host cluster. This approach delivers strong tenant boundaries, fast and reproducible environments, and true self-service without the overhead of managing dozens of clusters. When paired with RBAC, resource quotas, LimitRanges, network policies, and GPU features like MIG and vGPU, vCluster helps organizations achieve stronger governance, clearer cost and usage tracking, and safer, low-risk upgrades.

Want to Go Deeper on GPU Infrastructure?

If you're building robust GPU-enabled platforms, we've created a comprehensive resource to help. Download our free ebook, "GPU-Enabled Platforms on Kubernetes," which explains how Kubernetes abstracts GPU resources, why traditional isolation fails, and what architectural patterns enable multi-tenant GPU platforms. This guide covers everything from how GPUs meet Kubernetes and why GPU multi-tenancy is hard, to orchestrating GPU sharing, hardware isolation and enforcement, and architecting GPU infrastructure with vCluster for optimal isolation and efficiency.

Download the eBook here.
