Run Ray Serve as an NVCF Function on vind (vCluster in Docker)

Jun 11, 2026

NVIDIA Cloud Functions (NVCF) is the function platform behind NVIDIA's GPU cloud, and it is now open source. For AI Clouds and platform teams building GPU infrastructure, it is one of the more interesting pieces of the stack to study: it turns a GPU cluster into a serverless inference target.

NVCF lets you create functions in two ways: as a custom container, or as a Helm chart. The Helm chart path is interesting because it gives you control over the Kubernetes resources that run inside the function boundary.

I built and tested a Ray Serve chart, sent it upstream as PR #22, and it has now landed in NVIDIA/nvcf main through NVIDIA's upstream merge flow as commit 31497f3, with me credited as co-author.

For this walkthrough, the demo environment is vind (vCluster in Docker). vind is vCluster's standalone mode: it runs the entire cluster directly in Docker containers. That means the whole demo runs on a laptop with nothing but Docker installed: you get a real Kubernetes API server in seconds, and the chart deploys into it exactly as it would into any GPU cluster.

A 60-Second Ray Serve Primer

If you have never used Ray, here is the mental model in plain English.

Ray is a distributed Python runtime. Ray Serve is the serving layer built on top of it. You define a Python class, decorate it with @serve.deployment, and Ray Serve handles HTTP routing, replica management, and scaling.

For this sample, we run one Kubernetes pod with one Ray head node. Ray still starts multiple internal processes and Serve actors, but you do not need the KubeRay operator or any Ray CRDs.

That last part matters for NVCF. NVCF Helm functions deploy a Helm chart into a namespace and then route invocations to the Kubernetes Service you name in the function definition. In this sample, that Service is named entrypoint. If Ray Serve binds to 0.0.0.0:8000 and you expose it through a ClusterIP Service called entrypoint, NVCF can treat it like any other Helm function endpoint.

What We Are Building

A single Kubernetes pod running a Ray head node with Ray Serve deployed on top. NVCF routes inference requests to the Service configured with --helm-chart-service; in this sample, that Service is named entrypoint and exposes port 8000. The serve app handles POST /infer and GET /health. No KubeRay operator. No CRDs. Just a Deployment, a ConfigMap, and a Service.

Ray Serve as an NVCF Helm function: client request flows through the NVCF control plane to the entrypoint Service and into the Ray head pod

When NVCF deploys a Helm function, it uses the helmChartServiceName value from the function definition as the target for invocation routing. In the CLI, that is the --helm-chart-service flag. For this chart, we pass --helm-chart-service entrypoint, so every inference request that arrives at the NVCF API is routed to entrypoint:8000 inside the cluster. Ray Serve listens there and dispatches to your deployment class.
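As a rough illustration of what entrypoint:8000 resolves to inside the cluster, the sketch below builds the cluster-internal URL for a Service. The helper name service_url and the namespace ray-test-vcluster are assumptions for the example, and cluster.local is only the Kubernetes default cluster domain, so your cluster may differ. NVCF does this routing itself; nothing here is part of the NVCF API.

```python
def service_url(service: str, namespace: str, port: int, path: str = "/",
                cluster_domain: str = "cluster.local") -> str:
    """Build the cluster-internal HTTP URL for a Kubernetes Service.

    Hypothetical helper for illustration only. It follows the standard
    <service>.<namespace>.svc.<cluster-domain> DNS naming scheme.
    """
    if not path.startswith("/"):
        path = "/" + path
    return f"http://{service}.{namespace}.svc.{cluster_domain}:{port}{path}"


# Example namespace (an assumption; NVCF chooses the namespace it deploys into).
print(service_url("entrypoint", "ray-test-vcluster", 8000, "/infer"))
```

From a pod in the same namespace, the short form http://entrypoint:8000/infer resolves to the same Service.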

Prerequisites

Docker running locally
vcluster CLI >= 0.34
helm >= 3.12
For GPU inference: real nvidia.com/gpu extended resources. Fake GPU resources are useful for chart scheduling tests only; they do not make model inference run on a GPU.
For NVCF deployment: a self-managed NVCF control plane

On Apple Silicon (ARM64), use --set image.tag=2.40.0-py310-aarch64. The default 2.40.0-py310-gpu tag is AMD64-only.

Create the vind Cluster

The vCluster Docker driver creates a standalone cluster directly in Docker, no existing Kubernetes cluster required:

    $ vcluster create ray-nvcf --driver docker
    info Ensuring environment for vCluster ray-nvcf...
    done Created network vcluster.ray-nvcf
    info Starting vCluster standalone ray-nvcf
    info Waiting for vCluster standalone node to be joined...
    done vCluster standalone node joined successfully
    done Successfully created virtual cluster ray-nvcf
    done vCluster is ready
    done Switched active kube context to vcluster-docker_ray-nvcf

About 35 seconds from nothing to a ready Kubernetes v1.35 API:

    $ kubectl get nodes
    NAME       STATUS   ROLES                  AGE   VERSION
    ray-nvcf   Ready    control-plane,master   15s   v1.35.0

The CLI switches your kube context to the vind cluster automatically, so every helm and kubectl command below talks to it like any other cluster.

The Chart

The chart lives at examples/function-samples/helmchart-samples/ray-serve-sample/ in the NVCF repo. Five files:

    ray-serve/
      Chart.yaml            # version 0.1.0
      values.yaml           # image, GPU count, resource requests
      templates/
        deployment.yaml     # Ray head pod + startup sequence
        configmap.yaml      # serve_app.py mounted at /app
        service.yaml        # entrypoint Service on port 8000

The startup sequence

The key design decision is in deployment.yaml:

    ray start --head --port=6379 --dashboard-host=0.0.0.0 --block &
    until ray health-check 2>/dev/null; do sleep 2; done
    python /app/serve_app.py

ray start --block & starts the head node in the background and keeps it alive. ray health-check polls until Ray's GCS control service is ready before running the serve app. This matters: if you launch serve_app.py before Ray is ready, the ray.init(address="auto") call can fail.
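The same wait-until-ready pattern can be written in Python. This is a minimal sketch, not what the chart runs: wait_until_ready and ray_is_healthy are hypothetical helper names, and the timeout is an addition, since the shell loop in the chart polls indefinitely. It assumes the ray CLI is on PATH.

```python
import subprocess
import time
from typing import Callable


def wait_until_ready(check: Callable[[], bool], interval: float = 2.0,
                     timeout: float = 300.0) -> bool:
    """Poll check() until it returns True or the timeout elapses.

    Returns True once check() succeeds, False if the deadline passes first.
    """
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def ray_is_healthy() -> bool:
    """Return True when `ray health-check` exits with status 0."""
    try:
        result = subprocess.run(
            ["ray", "health-check"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
    except FileNotFoundError:
        # The ray CLI is not installed in this environment.
        return False
    return result.returncode == 0


# Usage (not executed here): gate the serve app on Ray being ready.
#   if not wait_until_ready(ray_is_healthy):
#       raise SystemExit("Ray did not become healthy in time")
```

Keeping the check as a plain callable means the polling logic can be exercised without a running Ray head node.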

The serve app

configmap.yaml mounts a Python file at /app/serve_app.py:

    import time

    import ray
    from ray import serve
    from ray.serve.config import HTTPOptions
    from fastapi import FastAPI, Request
    from fastapi.responses import JSONResponse

    ray.init(address="auto", ignore_reinit_error=True)

    # serve.start() must be called explicitly to bind to 0.0.0.0;
    # serve.run() alone defaults the HTTP proxy to 127.0.0.1,
    # which is unreachable from outside the pod.
    serve.start(http_options=HTTPOptions(host="0.0.0.0", port=8000))

    app = FastAPI()


    @serve.deployment(num_replicas=1, ray_actor_options={"num_gpus": 0})
    @serve.ingress(app)
    class InferenceDeployment:
        def __init__(self):
            pass

        @app.post("/infer")
        async def infer(self, request: Request) -> JSONResponse:
            body = await request.json()
            return JSONResponse({"echo": body})

        @app.get("/health")
        async def health(self) -> JSONResponse:
            return JSONResponse({"status": "ok"})


    serve.run(InferenceDeployment.bind())

    while True:
        time.sleep(3600)

Two things worth noting:

serve.start(http_options=HTTPOptions(...)) must come before serve.run(). In current Ray versions, serve.run() does not accept host and port arguments. If you pass them to serve.run() directly you get TypeError: run() got an unexpected keyword argument 'host'. The explicit serve.start() sets the bind address before the proxy starts.
The while True: time.sleep(3600) keeps the Python process alive after serve.run() returns. On Python 3.10 aarch64, time.sleep(float("inf")) can raise OverflowError, so the loop is safer.
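The keep-alive idea in the second note can be shown without Ray. The sketch below is an alternative, not the chart's code: it also waits in finite chunks (so no infinite float is ever passed to a wait call), but it can be stopped cleanly through a threading.Event. The helper name block_until is hypothetical.

```python
import threading


def block_until(stop: threading.Event, chunk_seconds: float = 3600.0) -> int:
    """Block in finite chunks until stop is set.

    Returns the number of full chunks that elapsed before stop was set.
    """
    chunks = 0
    # Event.wait() returns False on timeout and True once the event is set.
    while not stop.wait(timeout=chunk_seconds):
        chunks += 1
    return chunks


# Usage (not executed here): call stop.set() from a signal handler or another
# thread to let the process exit instead of sleeping forever.
#   stop = threading.Event()
#   block_until(stop)
```

For a container whose only job is to stay alive until Kubernetes terminates it, the plain sleep loop in the chart is just as effective.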

Deploy the Chart

Clone the repo and install the chart in CPU mode into the vind cluster. I tested the commands below from current NVIDIA/nvcf main, inside the ray-nvcf vind cluster on an ARM64 Mac.

    git clone https://github.com/NVIDIA/nvcf.git
    cd nvcf

    # AMD64
    helm upgrade --install ray-serve-vcluster \
      examples/function-samples/helmchart-samples/ray-serve-sample/ray-serve/ \
      --set gpu.count=0 \
      --set image.tag=2.40.0-py310 \
      --namespace ray-test-vcluster \
      --create-namespace \
      --wait --timeout 12m

    # Apple Silicon ARM64
    helm upgrade --install ray-serve-vcluster \
      examples/function-samples/helmchart-samples/ray-serve-sample/ray-serve/ \
      --set gpu.count=0 \
      --set image.tag=2.40.0-py310-aarch64 \
      --set resources.requests.memory=2Gi \
      --set resources.limits.memory=4Gi \
      --namespace ray-test-vcluster \
      --create-namespace \
      --wait --timeout 12m

The --wait flag blocks until the readiness probe on /health passes. The first run can take a few minutes because the Ray image is large and Ray Serve needs time to start. The final release status is what matters:

    STATUS: deployed
    REVISION: 1
    DESCRIPTION: Install complete

Verify

Check the pod:

    $ kubectl get pods -n ray-test-vcluster
    NAME                                  READY   STATUS    RESTARTS   AGE
    ray-serve-vcluster-746d79d55c-dfcj5   1/1     Running   0          2m25s

Check the service:

    $ kubectl get svc -n ray-test-vcluster
    NAME         TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
    entrypoint   ClusterIP   10.106.209.40   <none>        8000/TCP   2m25s

Check that the service has a backend endpoint:

    $ kubectl get endpoints -n ray-test-vcluster entrypoint
    NAME         ENDPOINTS         AGE
    entrypoint   10.244.0.4:8000   2m25s

(Kubernetes v1.33+ prints a deprecation warning pointing you to EndpointSlices; the v1 Endpoints view still works and is the shortest way to confirm the Service has a backend.)

Watch the logs to confirm Ray Serve deployed:

    $ kubectl logs -n ray-test-vcluster -l app.kubernetes.io/name=ray-serve-vcluster | grep -E "Application 'default'|Deployed app"
    INFO 2026-06-10 00:22:54,759 serve 1 -- Application 'default' is ready at http://0.0.0.0:8000/.
    INFO 2026-06-10 00:22:54,760 serve 1 -- Deployed app 'default' successfully.

Test the endpoints:

    $ kubectl port-forward -n ray-test-vcluster svc/entrypoint 8000:8000 &
    $ curl -sS -w '\nHTTP %{http_code} in %{time_total}s\n' http://localhost:8000/health
    {"status":"ok"}
    HTTP 200 in 0.018517s
    $ curl -sS -w '\nHTTP %{http_code} in %{time_total}s\n' \
        -X POST http://localhost:8000/infer \
        -H 'Content-Type: application/json' \
        -d '{"prompt": "Hello Ray Serve on vCluster"}'
    {"echo":{"prompt":"Hello Ray Serve on vCluster"}}
    HTTP 200 in 0.023307s
    $ kill %1

Ray Serve logs confirm both requests:

    (ServeReplica:default:InferenceDeployment pid=404) GET /health 200 3.6ms
    (ServeReplica:default:InferenceDeployment pid=404) POST /infer 200 9.9ms
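The POST check can also be scripted from Python with only the standard library. This is a minimal sketch, assuming a port-forward to svc/entrypoint is running on localhost:8000; build_infer_request and call_infer are hypothetical helper names, not part of the chart.

```python
import json
import urllib.request


def build_infer_request(base_url: str, payload: dict) -> urllib.request.Request:
    """Build the POST /infer request without sending it."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/infer",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call_infer(base_url: str, payload: dict, timeout: float = 10.0) -> dict:
    """Send the request and decode the JSON response body."""
    request = build_infer_request(base_url, payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (needs the port-forward to be active, so it is not executed here):
#   call_infer("http://localhost:8000", {"prompt": "Hello Ray Serve on vCluster"})
```

Separating request construction from sending keeps the payload and headers easy to inspect before anything touches the network.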

The pod reached 1/1 Running with zero restarts in this run. If you do see restarts on a slow machine, check kubectl describe pod; a busy laptop can trip the Kubernetes probe timeout while Ray Serve is still warming up, and the pod recovers on its own.

When you are done, the whole environment disappears with one command:

    vcluster delete ray-nvcf --driver docker

Extend for a Real Model

Replace the InferenceDeployment body in configmap.yaml with your model logic. For a Hugging Face text generation model:

    @serve.deployment(num_replicas=1, ray_actor_options={"num_gpus": 1})
    @serve.ingress(app)
    class InferenceDeployment:
        def __init__(self):
            from transformers import pipeline

            self.model = pipeline(
                "text-generation",
                model="meta-llama/Llama-3.2-1B",
                device=0,
            )

        @app.post("/infer")
        async def infer(self, request: Request) -> JSONResponse:
            body = await request.json()
            result = self.model(body.get("prompt", ""), max_new_tokens=256)
            return JSONResponse({"generated_text": result[0]["generated_text"]})

        @app.get("/health")
        async def health(self) -> JSONResponse:
            return JSONResponse({"status": "ok"})

Set num_gpus: 1 in the Ray actor options and deploy with --set gpu.count=1. The pod will request nvidia.com/gpu: 1 from Kubernetes.

For a real model, you will usually also build a custom image with transformers, accelerate, model-specific libraries, and any required Hugging Face authentication or model cache setup. The default rayproject/ray image is enough for the echo sample, not a full production model stack.

Deploy on Self-Managed NVCF

Package and push the chart to an OCI registry:

    helm package examples/function-samples/helmchart-samples/ray-serve-sample/ray-serve/
    helm push ray-serve-0.1.0.tgz oci://<your-registry>/<namespace>
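Helm derives both the package file name and the pushed OCI reference from the chart's name and version, which is why the function create command further down points at .../ray-serve:0.1.0 rather than at the .tgz file. A small sketch of that naming rule follows; the helper names and the registry host registry.example.com/team are assumptions for illustration.

```python
def package_filename(chart_name: str, version: str) -> str:
    """File name that `helm package` writes for a chart name and version."""
    return f"{chart_name}-{version}.tgz"


def oci_chart_ref(registry_base: str, chart_name: str, version: str) -> str:
    """Reference a chart ends up under after `helm push <file> <registry_base>`.

    Helm appends the chart name to the base path and uses the chart
    version as the tag.
    """
    base = registry_base.rstrip("/")
    if not base.startswith("oci://"):
        base = "oci://" + base
    return f"{base}/{chart_name}:{version}"


# Example registry path (an assumption, substitute your own).
print(package_filename("ray-serve", "0.1.0"))
print(oci_chart_ref("oci://registry.example.com/team", "ray-serve", "0.1.0"))
```

If you bump the version in Chart.yaml, both the file name and the tag change with it, so the --helm-chart value must be updated to match.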

Register credentials and create the function:

    nvcf-cli registry-credential add \
      --hostname <your-registry> \
      --username <user> \
      --password <pass> \
      --artifact-type HELM \
      --artifact-type CONTAINER

    nvcf-cli function create \
      --name ray-serve-sample \
      --helm-chart oci://<your-registry>/<namespace>/ray-serve:0.1.0 \
      --helm-chart-service entrypoint \
      --inference-url /infer \
      --inference-port 8000 \
      --health-uri /health \
      --health-port 8000

    nvcf-cli function deploy create \
      --function-id <function-id> \
      --version-id <version-id> \
      --gpu H100 \
      --instance-type NCP.GPU.H100_1x \
      --min-instances 1 \
      --max-instances 1

The --helm-chart-service entrypoint tells NVCF which Kubernetes Service to route invocations to. The --health-uri and --health-port tell NVCF where to check readiness before sending traffic. If your OCI registry only stores the Helm chart and your container image is public, --artifact-type HELM may be enough. Add --artifact-type CONTAINER when the same private registry also hosts images that NVCF must pull.

Notes from Testing

During the original development of the chart, getting to a clean 1/1 Running required fixing three issues: the startup loop used the wrong health-check condition, current Ray versions do not accept host/port on serve.run(), and time.sleep(float("inf")) can raise OverflowError on Python 3.10 aarch64. All three are fixed in the chart that landed in main via commit 31497f3.

The run you see in this post was done on 2026-06-10 against current NVIDIA/nvcf main, inside a standalone vind cluster created with vCluster 0.34 (Docker driver) running Kubernetes v1.35.0, on an ARM64 Mac.

One honest caveat on the NVCF side: my self-managed NVCF control plane was not healthy enough to run a real function create / function deploy test (several control-plane pods were in CrashLoopBackOff or ImagePullBackOff). So the Helm chart path is tested end to end; the NVCF registration commands are validated against the current CLI, but not executed against a healthy NVCF control plane in this run.

Links

Chart source: NVIDIA/nvcf - ray-serve-sample
Upstream merge commit: 31497f3
Original PR: NVIDIA/nvcf#22
Ray Serve docs: docs.ray.io/en/latest/serve/index.html
NVCF repo: github.com/NVIDIA/nvcf
vCluster docs: vcluster.com/docs
