(kuberay-rayjob-quickstart)=

# RayJob Quickstart

## What's a RayJob?
A RayJob manages two aspects:

* **RayCluster**: The Ray cluster to run the Ray job on.
* **Job**: A Kubernetes Job that runs `ray job submit` to submit a Ray job to the RayCluster.

## What does the RayJob provide?

With RayJob, KubeRay automatically creates a RayCluster and submits a job when the cluster is ready. You can also configure RayJob to automatically delete the RayCluster once the Ray job finishes.
## RayJob Configuration

To understand the following content better, you should understand the difference between:

* RayJob: A Kubernetes custom resource provided by KubeRay.
* Ray job: A Ray application that you submit to a Ray cluster for execution.
* Submitter: A Kubernetes Job that runs `ray job submit` to submit a Ray job to the RayCluster.

The RayJob spec covers the following groups of fields; a sketch after the list shows how several of them fit together in a single manifest.

* RayCluster configuration
  * `rayClusterSpec` - Defines the RayCluster custom resource to run the Ray job on.
  * `clusterSelector` - Uses existing RayCluster custom resources to run the Ray job instead of creating a new one. See `ray-job.use-existing-raycluster.yaml` for example configurations.
* Ray job configuration
  * `entrypoint` - The submitter runs `ray job submit --address ... --submission-id ... -- $entrypoint` to submit a Ray job to the RayCluster.
  * `runtimeEnvYAML` (Optional): A runtime environment that describes the dependencies the Ray job needs to run, including files, packages, environment variables, and more. Provide the configuration as a multi-line YAML string. Example:
    ```yaml
    spec:
      runtimeEnvYAML: |
        pip:
          - requests==2.26.0
          - pendulum==2.1.2
        env_vars:
          KEY: "VALUE"
    ```

    See {ref}`Runtime Environments <runtime-environments>` for more details. (New in KubeRay version 1.0.0)
  * `jobId` (Optional): Defines the submission ID for the Ray job. If not provided, KubeRay generates one automatically. See {ref}`Ray Jobs CLI API Reference <ray-job-submission-cli-ref>` for more details about the submission ID.
  * `metadata` (Optional): See {ref}`Ray Jobs CLI API Reference <ray-job-submission-cli-ref>` for more details about the `--metadata-json` option.
  * `entrypointNumCpus` / `entrypointNumGpus` / `entrypointResources` (Optional): See {ref}`Ray Jobs CLI API Reference <ray-job-submission-cli-ref>` for more details.
  * `backoffLimit` (Optional, added in version 1.2.0): Specifies the number of retries before marking this RayJob failed. Each retry creates a new RayCluster. The default value is 0.
* Submission configuration
  * `submissionMode` (Optional): `submissionMode` specifies how RayJob submits the Ray job to the RayCluster. In "K8sJobMode", the KubeRay operator creates a submitter Kubernetes Job to submit the Ray job. In "HTTPMode", the KubeRay operator sends a request to the RayCluster to create a Ray job. The default value is "K8sJobMode".
  * `submitterPodTemplate` (Optional): Defines the Pod template for the submitter Kubernetes Job. This field is only effective when `submissionMode` is "K8sJobMode".
    * `RAY_DASHBOARD_ADDRESS` - The KubeRay operator injects this environment variable into the submitter Pod. The value is `$HEAD_SERVICE:$DASHBOARD_PORT`.
    * `RAY_JOB_SUBMISSION_ID` - The KubeRay operator injects this environment variable into the submitter Pod. The value is the `RayJob.Status.JobId` of the RayJob. Example: `ray job submit --address=http://$RAY_DASHBOARD_ADDRESS --submission-id=$RAY_JOB_SUBMISSION_ID ...`
  * `submitterConfig` (Optional): Additional configurations for the submitter Kubernetes Job.
    * `backoffLimit` (Optional, added in version 1.2.0): The number of retries before marking the submitter Job as failed. The default value is 2.
* Automatic resource cleanup
  * `shutdownAfterJobFinishes` (Optional): Determines whether to recycle the RayCluster after the Ray job finishes. The default value is false.
  * `ttlSecondsAfterFinished` (Optional): Only works if `shutdownAfterJobFinishes` is true. The KubeRay operator deletes the RayCluster and the submitter `ttlSecondsAfterFinished` seconds after the Ray job finishes. The default value is 0.
  * `activeDeadlineSeconds` (Optional): If the RayJob doesn't transition the `JobDeploymentStatus` to `Complete` or `Failed` within `activeDeadlineSeconds`, the KubeRay operator transitions the `JobDeploymentStatus` to `Failed`, citing `DeadlineExceeded` as the reason.
  * `DELETE_RAYJOB_CR_AFTER_JOB_FINISHES` (Optional, added in version 1.2.0): Set this environment variable for the KubeRay operator, not the RayJob resource. If you set this environment variable to true, KubeRay also deletes the RayJob custom resource itself, provided you set `shutdownAfterJobFinishes` to true. Note that KubeRay deletes all resources created by the RayJob, including the Kubernetes Job.
* Others
  * `suspend` (Optional): If `suspend` is true, KubeRay deletes both the RayCluster and the submitter. Note that Kueue also implements scheduling strategies by mutating this field. Avoid manually updating this field if you use Kueue to schedule RayJob.
  * `deletionPolicy` (Optional, alpha in v1.3.0): Indicates which resources of the RayJob KubeRay deletes upon job completion. Valid values are `DeleteCluster`, `DeleteWorkers`, `DeleteSelf`, or `DeleteNone`. If unset, the deletion policy is based on `spec.shutdownAfterJobFinishes`. This field requires the `RayJobDeletionPolicy` feature gate to be enabled.
    * `DeleteCluster` - Deletes the RayCluster custom resource, including its Pods, on job completion.
    * `DeleteWorkers` - Deletes only the worker Pods on job completion.
    * `DeleteSelf` - Deletes the RayJob custom resource and all associated resources on job completion.
    * `DeleteNone` - Deletes no resources on job completion.
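To see how several of these fields fit together, here's a minimal, illustrative RayJob manifest. It's a sketch, not the sample manifest this guide applies below: the resource name, image tags, and worker group shape are assumptions.

```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-sketch  # hypothetical name for this sketch
spec:
  # Ray job configuration: the submitter runs
  # `ray job submit --address ... --submission-id ... -- $entrypoint`.
  entrypoint: python /home/ray/samples/sample_code.py
  runtimeEnvYAML: |
    pip:
      - requests==2.26.0
    env_vars:
      counter_name: "test_counter"
  # Submission configuration: "K8sJobMode" is the default.
  submissionMode: K8sJobMode
  # Automatic resource cleanup: delete the RayCluster and the submitter
  # 10 seconds after the Ray job finishes.
  shutdownAfterJobFinishes: true
  ttlSecondsAfterFinished: 10
  # RayCluster configuration: the cluster to run the Ray job on.
  rayClusterSpec:
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.46.0  # assumed image tag
    workerGroupSpecs:
      - groupName: small-group
        replicas: 1
        rayStartParams: {}
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray:2.46.0  # assumed image tag
```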
## Step 1: Create a Kubernetes cluster with kind

```sh
kind create cluster --image=kindest/node:v1.26.0
```
```text
Creating cluster "kind" ...
 ✓ Ensuring node image (kindest/node:v1.26.0)
 ✓ Preparing nodes
 ✓ Writing configuration
 ✓ Starting control-plane
 ✓ Installing CNI
 ✓ Installing StorageClass
Set kubectl context to "kind-kind"
You can now use your cluster with:

kubectl cluster-info --context kind-kind

Not sure what to do next? Check out https://kind.sigs.k8s.io/docs/user/quick-start/
```
## Step 2: Install the KubeRay operator

Follow the RayCluster Quickstart to install the latest stable KubeRay operator from the Helm repository.

```sh
../scripts/doctest-utils.sh install_kuberay_operator
```

```text
NAME: kuberay-operator
LAST DEPLOYED: Wed Apr  9 18:28:39 2025
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
deployment.apps/kuberay-operator condition met
```
## Step 3: Install a RayJob

```sh
kubectl apply -f https://raw.githubusercontent.com/ray-project/kuberay/v1.3.0/ray-operator/config/samples/ray-job.sample.yaml
```

```text
rayjob.ray.io/rayjob-sample created
configmap/ray-job-code-sample created
```

```sh
# Wait until the RayCluster created by the RayJob is provisioned.
kubectl wait --for=condition=RayClusterProvisioned raycluster/$(kubectl get rayjob rayjob-sample -o jsonpath='{.status.rayClusterName}') --timeout=500s
```

```text
raycluster.ray.io/rayjob-sample-raycluster-7965c condition met
```

```sh
# Wait until the submitter Pod is ready.
kubectl wait --for=condition=ready pod -l job-name=rayjob-sample --timeout=500s
```

```text
pod/rayjob-sample-74pmj condition met
```
## Step 4: Verify the Kubernetes cluster status

```sh
# Step 4.1: List all RayJob custom resources in the `default` namespace.
kubectl get rayjob
```

```text
NAME            JOB STATUS   DEPLOYMENT STATUS   RAY CLUSTER NAME                 START TIME             END TIME   AGE
rayjob-sample                Running             rayjob-sample-raycluster-7965c   2025-04-09T10:29:17Z              117s
```

```sh
# Step 4.2: List all RayCluster custom resources in the `default` namespace.
kubectl get raycluster
```

```text
NAME                             DESIRED WORKERS   AVAILABLE WORKERS   CPUS   MEMORY   GPUS   STATUS   AGE
rayjob-sample-raycluster-7965c   1                 1                   400m   0        0      ready    117s
```

```sh
# Step 4.3: List all Pods in the `default` namespace.
# The Pod created by the Kubernetes Job will be terminated after the Kubernetes Job finishes.
kubectl get pods --sort-by='.metadata.creationTimestamp'
```

```text
NAME                                                      READY   STATUS    RESTARTS   AGE
kuberay-operator-6bc45dd644-tlsfn                         1/1     Running   0          2m26s
rayjob-sample-raycluster-7965c-head-n6nj8                 1/1     Running   0          117s
rayjob-sample-raycluster-7965c-small-group-worker-nlzwx   1/1     Running   0          117s
rayjob-sample-74pmj                                       1/1     Running   0          2s
```
```sh
kubectl wait --for=condition=complete job/rayjob-sample --timeout=500s
```

```text
job.batch/rayjob-sample condition met
```

```sh
# Step 4.4: Check the status of the RayJob.
# The field `jobStatus` in the RayJob custom resource is updated to `SUCCEEDED` and `jobDeploymentStatus`
# should be `Complete` once the job finishes.
kubectl get rayjobs.ray.io rayjob-sample -o jsonpath='{.status.jobStatus}'
```

```text
SUCCEEDED
```

```sh
kubectl get rayjobs.ray.io rayjob-sample -o jsonpath='{.status.jobDeploymentStatus}'
```

```text
Complete
```
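Putting these fields together, the `status` of a finished RayJob looks roughly like the following sketch. It's abridged: the values match the command outputs in this quickstart, and other status fields are omitted.

```yaml
status:
  jobId: rayjob-sample-jrrd2                      # Injected into the submitter as RAY_JOB_SUBMISSION_ID.
  jobStatus: SUCCEEDED                            # Status of the Ray job itself.
  jobDeploymentStatus: Complete                   # Status of the RayJob lifecycle.
  rayClusterName: rayjob-sample-raycluster-7965c  # The RayCluster created from rayClusterSpec.
```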
The KubeRay operator creates a RayCluster custom resource based on the `rayClusterSpec` and a submitter Kubernetes Job to submit a Ray job to the RayCluster. In this example, the `entrypoint` is `python /home/ray/samples/sample_code.py`, and `sample_code.py` is a Python script stored in a Kubernetes ConfigMap mounted to the head Pod of the RayCluster. Because the default value of `shutdownAfterJobFinishes` is false, the KubeRay operator doesn't delete the RayCluster or the submitter when the Ray job finishes.
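The sample manifest handles the mounting; as a rough sketch of the pattern (the ConfigMap name comes from the `kubectl apply` output above, and the mount path from the `entrypoint`), the head group inside `spec.rayClusterSpec` mounts the ConfigMap as a volume:

```yaml
# Sketch: head group of spec.rayClusterSpec in the RayJob.
headGroupSpec:
  rayStartParams: {}
  template:
    spec:
      containers:
        - name: ray-head
          image: rayproject/ray:2.46.0  # assumed image tag
          volumeMounts:
            # The entrypoint runs /home/ray/samples/sample_code.py from this mount.
            - mountPath: /home/ray/samples
              name: code-sample
      volumes:
        - name: code-sample
          configMap:
            name: ray-job-code-sample
            items:
              - key: sample_code.py
                path: sample_code.py
```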
## Step 5: Check the output of the Ray job

```sh
kubectl logs -l=job-name=rayjob-sample
```

```text
2025-04-09 03:31:23,810 INFO worker.py:1654 -- Connecting to existing Ray cluster at address: 10.244.0.6:6379...
2025-04-09 03:31:23,818 INFO worker.py:1832 -- Connected to Ray cluster. View the dashboard at 10.244.0.6:8265
test_counter got 1
test_counter got 2
test_counter got 3
test_counter got 4
test_counter got 5
2025-04-09 03:31:32,204 SUCC cli.py:63 -- -----------------------------------
2025-04-09 03:31:32,204 SUCC cli.py:64 -- Job 'rayjob-sample-jrrd2' succeeded
2025-04-09 03:31:32,204 SUCC cli.py:65 -- -----------------------------------
```
The Python script `sample_code.py` used by `entrypoint` is a simple Ray script that executes a counter's increment function 5 times.
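This guide doesn't reproduce the full script, but a minimal counter with the same observable behavior, embedded in the ConfigMap the way the sample stores it, might look like the following sketch. The class and variable names are assumptions; only the printed output is taken from the logs above.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: ray-job-code-sample
data:
  sample_code.py: |
    import os

    import ray

    ray.init()  # Connects to the existing RayCluster when run through `ray job submit`.

    @ray.remote
    class Counter:
        def __init__(self):
            # `counter_name` can come from `env_vars` in `runtimeEnvYAML`.
            self.name = os.getenv("counter_name", "test_counter")
            self.count = 0

        def inc(self):
            self.count += 1
            return f"{self.name} got {self.count}"

    counter = Counter.remote()
    for _ in range(5):
        # Prints "test_counter got 1" through "test_counter got 5", matching the logs above.
        print(ray.get(counter.inc.remote()))
```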
## Step 6: Delete the RayJob

```sh
kubectl delete -f https://raw.githubusercontent.com/ray-project/kuberay/v1.3.0/ray-operator/config/samples/ray-job.sample.yaml
```

```text
rayjob.ray.io "rayjob-sample" deleted
configmap "ray-job-code-sample" deleted
```
## Step 7: Create a RayJob with `shutdownAfterJobFinishes` set to true

```sh
kubectl apply -f https://raw.githubusercontent.com/ray-project/kuberay/v1.3.0/ray-operator/config/samples/ray-job.shutdown.yaml
```

```text
rayjob.ray.io/rayjob-sample-shutdown created
configmap/ray-job-code-sample created
```

The `ray-job.shutdown.yaml` manifest defines a RayJob custom resource with `shutdownAfterJobFinishes: true` and `ttlSecondsAfterFinished: 10`. Hence, the KubeRay operator deletes the RayCluster 10 seconds after the Ray job finishes. Note that the submitter Job isn't deleted, because it contains the Ray job logs and doesn't use any cluster resources once completed. The submitter Job is cleaned up when the RayJob is eventually deleted, due to its owner reference back to the RayJob.
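For reference, the fields that drive this behavior are just two lines in the RayJob spec. This is a sketch of those fields, not the full `ray-job.shutdown.yaml`:

```yaml
spec:
  # Recycle the RayCluster once the Ray job finishes ...
  shutdownAfterJobFinishes: true
  # ... after waiting this many seconds.
  ttlSecondsAfterFinished: 10
```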
```sh
kubectl wait --for=condition=RayClusterProvisioned raycluster/$(kubectl get rayjob rayjob-sample-shutdown -o jsonpath='{.status.rayClusterName}') --timeout=500s
```

```text
raycluster.ray.io/rayjob-sample-shutdown-raycluster-pfqsf condition met
```

```sh
kubectl wait --for=condition=complete job/rayjob-sample-shutdown --timeout=500s
```

```text
job.batch/rayjob-sample-shutdown condition met
```

```sh
# Wait until `jobStatus` is `SUCCEEDED` and `jobDeploymentStatus` is `Complete`.
kubectl get rayjobs.ray.io rayjob-sample-shutdown -o jsonpath='{.status.jobDeploymentStatus}'
```

```text
Complete
```

```sh
kubectl get rayjobs.ray.io rayjob-sample-shutdown -o jsonpath='{.status.jobStatus}'
```

```text
SUCCEEDED
```
```sh
# List the RayCluster custom resources in the `default` namespace. The RayCluster
# associated with the RayJob `rayjob-sample-shutdown` should be deleted.
kubectl get raycluster
```

```text
NAME                                      DESIRED WORKERS   AVAILABLE WORKERS   CPUS   MEMORY   GPUS   STATUS   AGE
rayjob-sample-shutdown-raycluster-pfqsf   1                 1                   400m   0        0      ready    45s
```
## Step 8: Clean up

```sh
kind delete cluster
```

```text
Deleting cluster "kind" ...
Deleted nodes: ["kind-control-plane"]
```