(kuberay-raycluster-quickstart)=
This guide shows you how to manage and interact with Ray clusters on Kubernetes.
This step creates a local Kubernetes cluster using Kind. If you already have a Kubernetes cluster, you can skip this step.
kind create cluster --image=kindest/node:v1.26.0
Creating cluster "kind" ...
 ✓ Ensuring node image (kindest/node:v1.26.0) 🖼
 ✓ Preparing nodes 📦
 ✓ Writing configuration 📜
 ✓ Starting control-plane 🕹️
 ✓ Installing CNI 🔌
 ✓ Installing StorageClass 💾
Set kubectl context to "kind-kind"
You can now use your cluster with:

kubectl cluster-info --context kind-kind

Have a nice day! 👋
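Optionally, confirm that the new cluster is reachable before continuing:
# Verify that kubectl can talk to the Kind cluster.
kubectl cluster-info --context kind-kind
kubectl get nodes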
Follow this document to install the latest stable KubeRay operator from the Helm repository.
../scripts/doctest-utils.sh install_kuberay_operator
NAME: kuberay-operator
LAST DEPLOYED: Tue Mar 18 16:26:18 2025
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
deployment.apps/kuberay-operator condition met
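You can also confirm that the operator pod itself is running. The label selector below assumes the chart's default app.kubernetes.io/name label:
# Check the KubeRay operator pod.
kubectl get pods -l app.kubernetes.io/name=kuberay-operator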
(raycluster-deploy)=
Once the KubeRay operator is running, you're ready to deploy a RayCluster. Create a RayCluster Custom Resource (CR) in the default
namespace.
helm install raycluster kuberay/ray-cluster --version 1.3.0
NAME: raycluster
LAST DEPLOYED: Tue Mar 18 16:26:52 2025
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
Use helm install raycluster kuberay/ray-cluster --version 1.3.0 --set 'image.tag=2.41.0-aarch64'
instead if you are using ARM64 (Apple Silicon) machines.
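The chart exposes other settings through Helm values as well. As a hedged example, the following sketch assumes the chart's worker.replicas value controls the worker group size:
# Illustrative override, assuming the chart default values layout: start with two worker Pods.
helm install raycluster kuberay/ray-cluster --version 1.3.0 --set worker.replicas=2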
kubectl wait --for=condition=RayClusterProvisioned raycluster/raycluster-kuberay --timeout=500s
raycluster.ray.io/raycluster-kuberay condition met
Once the RayCluster CR has been created, you can view it by running:
kubectl get rayclusters
NAME                 DESIRED WORKERS   AVAILABLE WORKERS   CPUS   MEMORY   GPUS   STATUS   AGE
raycluster-kuberay   1                 1                   2      3G       0      ready    55s
The KubeRay operator detects the RayCluster object and starts your Ray cluster by creating head and worker pods. To view the Ray cluster's pods, run the following command:
# View the pods in the RayCluster named "raycluster-kuberay"
kubectl get pods --selector=ray.io/cluster=raycluster-kuberay
NAME                                          READY   STATUS    RESTARTS   AGE
raycluster-kuberay-head-k7rlq                 1/1     Running   0          56s
raycluster-kuberay-workergroup-worker-65zl8   1/1     Running   0          56s
Wait for the pods to reach the Running
state. This may take a few minutes; most of this time is spent downloading the Ray images. If your pods are stuck in the Pending
state, check for errors with kubectl describe pod raycluster-kuberay-xxxx-xxxxx
and ensure that your Docker resource limits meet the requirements.
Now, interact with the deployed RayCluster.
The most straightforward way to experiment with your RayCluster (Method 1) is to exec directly into the head pod. First, identify your RayCluster's head pod:
export HEAD_POD=$(kubectl get pods --selector=ray.io/node-type=head -o custom-columns=POD:metadata.name --no-headers)
echo $HEAD_POD
raycluster-kuberay-head-k7rlq
# Print the cluster resources.
kubectl exec -it $HEAD_POD -- python -c "import pprint; import ray; ray.init(); pprint.pprint(ray.cluster_resources(), sort_dicts=True)"
2025-03-18 01:27:48,692 INFO worker.py:1514 -- Using address 127.0.0.1:6379 set in the environment variable RAY_ADDRESS
2025-03-18 01:27:48,692 INFO worker.py:1654 -- Connecting to existing Ray cluster at address: 10.244.0.6:6379...
2025-03-18 01:27:48,699 INFO worker.py:1832 -- Connected to Ray cluster. View the dashboard at 10.244.0.6:8265
{'CPU': 2.0,
'memory': 3000000000.0,
'node:10.244.0.6': 1.0,
'node:10.244.0.7': 1.0,
'node:__internal_head__': 1.0,
'object_store_memory': 749467238.0}
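Beyond inspecting resources, you can run arbitrary Ray code the same way. The following is a minimal sketch that submits a trivial remote task from the head pod; it assumes only the HEAD_POD variable set above:
# Run a trivial Ray remote task from the head pod.
kubectl exec -it $HEAD_POD -- python -c "
import ray
ray.init()

# A remote function runs as a Ray task on any node with free CPU.
@ray.remote
def square(x):
    return x * x

print(ray.get([square.remote(i) for i in range(5)]))
"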
Unlike Method 1, this method doesn't require you to execute commands in the Ray head pod. Instead, you can use the Ray job submission SDK to submit Ray jobs to the RayCluster through the Ray Dashboard port, where Ray listens for job requests. The KubeRay operator configures a Kubernetes service that targets the Ray head Pod.
kubectl get service raycluster-kuberay-head-svc
NAME                          TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                                         AGE
raycluster-kuberay-head-svc   ClusterIP   None         <none>        10001/TCP,8265/TCP,6379/TCP,8080/TCP,8000/TCP   57s
Now that you have the service name, use port-forwarding to access the Ray Dashboard port, which is 8265 by default.
# Execute this in a separate shell.
kubectl port-forward service/raycluster-kuberay-head-svc 8265:8265 > /dev/null &
[1] 290315
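To make sure the port-forward is working, you can hit the Dashboard over HTTP; the /api/version endpoint used here is part of Ray's job submission REST API:
# Expect a small JSON payload containing the Ray version.
curl -s http://localhost:8265/api/version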
Now that the Dashboard port is accessible, submit jobs to the RayCluster:
# The following job's logs will show the Ray cluster's total resource capacity, including 2 CPUs.
ray job submit --address http://localhost:8265 -- python -c "import pprint; import ray; ray.init(); pprint.pprint(ray.cluster_resources(), sort_dicts=True)"
Job submission server address: http://localhost:8265

-------------------------------------------------------
Job 'raysubmit_8vJ7dKqYrWKbd17i' submitted successfully
-------------------------------------------------------

Next steps
  Query the logs of the job:
    ray job logs raysubmit_8vJ7dKqYrWKbd17i
  Query the status of the job:
    ray job status raysubmit_8vJ7dKqYrWKbd17i
  Request the job to be stopped:
    ray job stop raysubmit_8vJ7dKqYrWKbd17i

Tailing logs until the job exits (disable with --no-wait):
2025-03-18 01:27:51,014 INFO job_manager.py:530 -- Runtime env is setting up.
2025-03-18 01:27:51,744 INFO worker.py:1514 -- Using address 10.244.0.6:6379 set in the environment variable RAY_ADDRESS
2025-03-18 01:27:51,744 INFO worker.py:1654 -- Connecting to existing Ray cluster at address: 10.244.0.6:6379...
2025-03-18 01:27:51,750 INFO worker.py:1832 -- Connected to Ray cluster. View the dashboard at 10.244.0.6:8265
{'CPU': 2.0,
 'memory': 3000000000.0,
 'node:10.244.0.6': 1.0,
 'node:10.244.0.7': 1.0,
 'node:__internal_head__': 1.0,
 'object_store_memory': 749467238.0}

------------------------------------------
Job 'raysubmit_8vJ7dKqYrWKbd17i' succeeded
------------------------------------------
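The CLI above wraps the Python job submission SDK. If you prefer to submit jobs programmatically, a minimal sketch using ray.job_submission.JobSubmissionClient looks like this, assuming the port-forward from the previous step is still active:
# Minimal sketch: submit the same job through the Python SDK.
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://localhost:8265")
job_id = client.submit_job(
    entrypoint='python -c "import ray; ray.init(); print(ray.cluster_resources())"'
)
print(job_id, client.get_job_status(job_id))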
# Kill the `kubectl port-forward` background job in the earlier step
killall kubectl
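If you want to remove the RayCluster and the operator without deleting the whole Kubernetes cluster, uninstall the Helm releases instead:
# Uninstall the RayCluster and KubeRay operator Helm releases.
helm uninstall raycluster
helm uninstall kuberay-operator
Otherwise, delete the Kind cluster entirely: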
kind delete cluster
Deleting cluster "kind" ...
Deleted nodes: ["kind-control-plane"]
[1]+  Terminated              kubectl port-forward service/raycluster-kuberay-head-svc 8265:8265 > /dev/null