A primary challenge in managing cluster resources is balancing a high volume of jobs with limited resources. To address this, organizations must prioritize resource allocation for critical teams or individuals while maintaining the flexibility to make adjustments on the fly. This guide demonstrates how to improve cluster resource utilization by using a unified job management platform to automate the processing of numerous RayJobs from different departments. This approach supports job preemption and dynamic priority adjustments, ensuring that high-priority jobs receive resources first.
Prerequisites
An ACK Pro cluster or an ACK Lingjun cluster that meets the following requirements:
Cluster version: v1.24 or later.
kubectl: You have installed kubectl and used it to connect to the Kubernetes cluster. For more information, see Obtain a kubeconfig file and connect to a cluster by using kubectl.
KubeRay component: See Install Kuberay-Operator.
ack-kube-queue: v1.21.4 or later, configured to support RayJob resources. See Use ack-kube-queue to manage AI and machine learning workloads.
Default node pool: At least three ECS instances, each with 8 vCPUs and 32 GiB of memory or higher.
Solution for Docker Hub pull failures
Due to network instability, such as issues with carrier networks, image pulls from Docker Hub may fail. We recommend using images that rely on Docker Hub with caution in production environments. This example uses the official Ray image rayproject/ray:2.36.1. If you cannot pull this image, use one of the following solutions:
Subscribe to images from registries outside the Chinese mainland through Container Registry. For more information, see Subscribe to images outside China.
To directly pull images from overseas sources, create a Global Accelerator instance and use its global network acceleration service. For more information, see Use GA to accelerate cross-domain container image pulling in ACK.
Resource quotas
Use the ElasticQuotaTree feature in an ACK cluster with RayJob to automate job scheduling and manage computing resources more flexibly. This allows RayJobs to efficiently schedule workloads within defined resource limits, ensuring that each team can fully use its allocated resources while avoiding waste or excessive competition.
You can configure the ElasticQuotaTree resource quota based on departments or individuals, defining the maximum amount of resources each team can use. Each node in the tree represents the minimum and maximum resource quota available to the corresponding team or department. When a RayJob is submitted, the system automatically checks if the resource quota for that job is sufficient to meet its requirements. The RayJob starts executing on the appropriate compute resources only when its quota is confirmed. This ensures both effective resource utilization and proper job prioritization.
ElasticQuotaTree defines quota information within the cluster, including the quota hierarchy, the amount of resources associated with each quota, and the namespaces bound to them. When a job is submitted in one of these namespaces, it is automatically counted against the resource quota of the corresponding namespace. Refer to the following example to set a resource quota with ElasticQuotaTree.
To build a resource quota system that meets your organization's needs, submit the following ElasticQuotaTree configuration to the cluster.
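The exact tree depends on your organization. The following is a minimal sketch, assuming only the algorithm department and its video team with the min and max values used later in this guide; the root and algorithm totals are placeholder values, and the other quotas mentioned in this guide (such as text, intern-text, intern-video, devops, and infrastructure) would be added as additional children in the same way. Verify the apiVersion and field names against the ElasticQuotaTree version in your cluster.
apiVersion: scheduling.sigs.k8s.io/v1beta1
kind: ElasticQuotaTree
metadata:
  name: elasticquotatree
  namespace: kube-system            # The ElasticQuotaTree is created in the kube-system namespace.
spec:
  root:
    name: root
    max:                            # Placeholder: total resources available to the whole tree.
      cpu: 40
      memory: 40Gi
      nvidia.com/gpu: 8
    min:
      cpu: 40
      memory: 40Gi
      nvidia.com/gpu: 8
    children:
      - name: algorithm             # Department-level quota (placeholder values).
        max:
          cpu: 28
          memory: 28Gi
          nvidia.com/gpu: 8
        min:
          cpu: 24
          memory: 24Gi
          nvidia.com/gpu: 4
        children:
          - name: video             # Team-level quota; values match the video example in this guide.
            max:
              cpu: 14
              memory: 14Gi
              nvidia.com/gpu: 4
            min:
              cpu: 12
              memory: 12Gi
              nvidia.com/gpu: 2
            namespaces:             # Jobs submitted in these namespaces count against this quota.
              - video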
ElasticQuotaTree is a tree structure in which each node defines its maximum resource usage through the max field and its minimum guaranteed resource amount through the min field. If a quota's minimum guarantee cannot be met, the scheduler attempts to reclaim resources from other quotas that are using more than their minimum guaranteed resources in order to run the job. For the intern-text and intern-video quotas, the guaranteed resource amount is set to 0. This means that if an algorithm team member submits a job that requires immediate processing while an intern's job is running, the system can preempt the resources used by the intern's job to prioritize the algorithm team's job, ensuring that high-priority jobs can proceed smoothly.
View the ElasticQuotaTree settings that have taken effect in the kube-system namespace:
kubectl -n kube-system get elasticquotatree elasticquotatree -o yaml
Job queues
The Queue feature assigns RayJobs from different departments and teams to their respective queues. After the ElasticQuotaTree is submitted, ack-kube-queue automatically creates corresponding queues in the cluster for job queuing. The resource quota of each leaf node corresponds to a separate Queue in the cluster. When a RayJob is submitted, ack-kube-queue automatically associates it with the corresponding Queue based on the RayJob's namespace. The job is then automatically placed in the correct queue, and its dequeue is determined by the queuing policy or quota. For more information, see ack-kube-queue manages job queues.
As shown in the following example, the video team is associated with the video namespace. Resources are allocated through the min and max configurations, and Kube Queue automatically creates an associated queue for this quota: root-algorithm-video. Subsequently, when a RayJob with the .spec.suspend field set to true is submitted in the video namespace, a corresponding QueueUnit resource object is automatically created and enters the root-algorithm-video queue. For a RayJob, Kube Queue calculates the total required resources by summing the head Pod's requests with the total resources of each worker group (replicas × a single Pod's resource request). If the RayJob's total resource request fits within the currently available quota, the RayJob is dequeued from root-algorithm-video and passed to the scheduler.
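As a worked example, the sample RayJob created later in this guide requests 4 vCPUs and 4 G of memory for the head Pod and runs one worker group with 2 replicas, each requesting 4 vCPUs and 4 G of memory, so Kube Queue computes:
total CPU    = 4 (head) + 2 × 4 (workers) = 12
total memory = 4 G (head) + 2 × 4 G (workers) = 12 G
This matches the video quota's minimum guarantee (cpu: 12, memory: 12Gi), so one such RayJob can be dequeued while the quota is idle, and a second one has to wait, which is the behavior shown in the following steps.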
After the ElasticQuotaTree is created, Kube Queue automatically creates a corresponding Queue for each leaf node based on the ElasticQuotaTree configuration.
For example, for the algorithm department's video team, the queue root-algorithm-video is automatically created.
kubectl get queue -n kube-queue root-algorithm-video-k42kq -o yaml
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: Queue
metadata:
  annotations:
    kube-queue/parent-quota-fullname: algorithm
    kube-queue/quota-fullname: root/algorithm/video
  creationTimestamp: "2025-01-09T03:32:27Z"
  generateName: root-algorithm-video-
  generation: 1
  labels:
    create-by-kubequeue: "true"
  name: root-algorithm-video-k42kq
  namespace: kube-queue
  resourceVersion: "18282630"
  uid: 5606059e-acf5-4f92-b11a-48a02ef53cdf
spec:
  queuePolicy: Round
status:
  queueItemDetails:
    active: []
    backoff: []
Note
active: Jobs awaiting scheduling, showing their priority and position in the queue.
backoff: Jobs that failed to schedule, typically due to insufficient resources, and are waiting before retrying.
View the queues.
kubectl get queue -n kube-queue
The output appears as follows.
NAME                               AGE
root-algorithm-n54fm               51s
root-algorithm-text-hgbvz          51s
root-algorithm-video-k42kq         51s
root-devops-2zccw                  51s
root-infrastructure-devops-d6zqq   51s
root-infrastructure-vbpkt          51s
root-k8htb                         51s
Create a RayJob
Define a RayJob resource in the video namespace. This automatically associates the job with the root-algorithm-video queue and the corresponding video resource quota.
- name: video
  min:
    cpu: 12
    memory: 12Gi
    nvidia.com/gpu: 2
  max:
    cpu: 14
    memory: 14Gi
    nvidia.com/gpu: 4
  namespaces: # Configure the namespace.
    - video
Note
Minimum guaranteed resources: cpu: 12, memory: 12Gi, nvidia.com/gpu: 2.
Maximum available resources: cpu: 14, memory: 14Gi, nvidia.com/gpu: 4.
Create a ConfigMap to define the Python code that the RayJob will execute in the RayCluster. The sample code creates an actor using the ray.remote decorator and calls the actor's inc() and get_counter() methods.
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: rayjob-video
  namespace: video
data:
  sample_code.py: |
    import ray
    import os
    import requests

    ray.init()

    @ray.remote
    class Counter:
        def __init__(self):
            # Used to verify runtimeEnv
            self.name = os.getenv("counter_name")
            assert self.name == "test_counter"
            self.counter = 0

        def inc(self):
            self.counter += 1

        def get_counter(self):
            return "{} got {}".format(self.name, self.counter)

    counter = Counter.remote()

    for _ in range(2):
        ray.get(counter.inc.remote())

    print(ray.get(counter.get_counter.remote()))

    # Verify that the correct runtime env was used for the job.
    assert requests.__version__ == "2.26.0"
Configure the sample RayJob YAML.
apiVersion: ray.io/v1
kind: RayJob
metadata:
  labels:
    job-type: video
  generateName: rayjob-video-
  namespace: video
spec:
  entrypoint: python /home/ray/samples/sample_code.py
  runtimeEnvYAML: |
    pip:
      - requests==2.26.0
      - pendulum==2.1.2
    env_vars:
      counter_name: "test_counter"
  ttlSecondsAfterFinished: 10
  # If suspend is true, shutdownAfterJobFinishes should also be set to true.
  shutdownAfterJobFinishes: true
  # Suspend specifies whether the RayJob controller should create a RayCluster instance.
  # If a job is applied with the suspend field set to true, the RayCluster will not be created and we will wait for the transition to false.
  # If the RayCluster is already created, it will be deleted. In the case of transition to false, a new RayCluster will be created.
  suspend: true
  rayClusterSpec:
    rayVersion: '2.36.1' # should match the Ray version in the image of the containers
    headGroupSpec:
      rayStartParams:
        dashboard-host: '0.0.0.0'
        num-cpus: "0"
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.36.1
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265 # Ray dashboard
                  name: dashboard
                - containerPort: 10001
                  name: client
              resources:
                limits:
                  cpu: "4"
                  memory: 4G
                requests:
                  cpu: "4"
                  memory: 4G
              volumeMounts:
                - mountPath: /home/ray/samples
                  name: code-sample
          volumes:
            # You set volumes at the Pod level, then mount them into containers inside that Pod
            - name: code-sample
              configMap:
                # Provide the name of the ConfigMap you want to mount.
                name: rayjob-video
                # An array of keys from the ConfigMap to create as files
                items:
                  - key: sample_code.py
                    path: sample_code.py
    workerGroupSpecs:
      - replicas: 2
        groupName: small-group
        rayStartParams: {}
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray:2.36.1
                lifecycle:
                  preStop:
                    exec:
                      command: [ "/bin/sh", "-c", "ray stop" ]
                resources:
                  limits:
                    cpu: "4"
                    memory: 4G
                  requests:
                    cpu: "4"
                    memory: 4G
Use kubectl create -f to create two RayJobs, then view them.
kubectl get rayjob -n video
NAME                 JOB STATUS   DEPLOYMENT STATUS   START TIME             END TIME   AGE
rayjob-video-g2lvn                Initializing        2025-01-10T01:36:24Z              6s
rayjob-video-h4x2q                Suspended           2025-01-10T01:36:25Z   5s 5s      3s
rayjob-video-g2lvn is dequeued and is in the Initializing state, while rayjob-video-h4x2q remains queued in the Suspended state.
Check the enqueue and dequeue times for rayjob-video-g2lvn in its annotations using kube-queue/job-enqueue-timestamp and kube-queue/job-dequeue-timestamp.
kubectl -n video get rayjob rayjob-video-g2lvn -o yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  annotations:
    kube-queue/job-dequeue-timestamp: 2025-01-10 01:36:24.641181026 +0000 UTC m=+132100.596012828
    kube-queue/job-enqueue-timestamp: 2025-01-10 01:36:24.298639916 +0000 UTC m=+132100.253471714
  creationTimestamp: "2025-01-10T01:36:24Z"
For rayjob-video-h4x2q, the annotations only show the enqueue time (kube-queue/job-enqueue-timestamp), with no dequeue time. This indicates that the RayJob has not yet been dequeued for scheduling.
kubectl -n video get rayjob rayjob-video-h4x2q -o yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  annotations:
    kube-queue/job-enqueue-timestamp: 2025-01-10 01:36:25.505182364 +0000 UTC m=+132101.460014182
  creationTimestamp: "2025-01-10T01:36:25Z"
View the Pods. Currently, only the Pods for rayjob-video-g2lvn have started scheduling.
kubectl -n video get pod
NAME                                                           READY   STATUS    RESTARTS   AGE
rayjob-video-g2lvn-9gz66                                       1/1     Running   0          28s
rayjob-video-g2lvn-raycluster-v8tfh-head-6trq5                 1/1     Running   0          49s
rayjob-video-g2lvn-raycluster-v8tfh-small-group-worker-hkt7m   1/1     Running   0          49s
rayjob-video-g2lvn-raycluster-v8tfh-small-group-worker-rbzjn   1/1     Running   0          49s
Check the properties of the queue. The second job, rayjob-video-h4x2q, is now in the backoff list, waiting for resources.
kubectl -n kube-queue get queue root-algorithm-video-k42kq -o yaml
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: Queue
metadata:
  annotations:
    kube-queue/parent-quota-fullname: algorithm
    kube-queue/quota-fullname: root/algorithm/video
  creationTimestamp: "2025-01-09T08:34:57Z"
  generateName: root-algorithm-video-
  generation: 1
  labels:
    create-by-kubequeue: "true"
  name: root-algorithm-video-k42kq
  namespace: kube-queue
  resourceVersion: "19070012"
  uid: 5606059e-acf5-4f92-b11a-48a02ef53cdf
spec:
  queuePolicy: Round
status:
  queueItemDetails:
    active: []
    backoff:
      - name: rayjob-video-h4x2q-ray-qu
        namespace: video
        position: 1
Set up gang scheduling
Use gang scheduling (or co-scheduling) with Ray for distributed tasks that require multiple nodes to start simultaneously. This is useful in the following scenarios:
Large-scale machine learning training: When dealing with very large datasets or complex models, a single machine may not provide sufficient computing resources. In this case, a group of containers needs to work together. Gang scheduling ensures that these containers are scheduled simultaneously, avoiding resource contention and deadlocks, thereby improving training efficiency.
MPI computing framework: Parallel computing with multiple threads under the MPI framework requires master and slave processes to work together. Gang scheduling ensures that these processes are scheduled simultaneously, reducing communication latency and improving computational efficiency.
Data processing and analysis: For applications that need to process massive amounts of data, such as log analysis and real-time stream processing, multiple jobs may need to run simultaneously to complete complex analysis tasks. Gang scheduling ensures that these jobs are scheduled at the same time, improving overall processing speed.
Custom distributed application development: Implement player matchmaking services in a game server architecture, or coordinate data collection and processing from thousands of devices in an Internet of Things (IoT) project.
To enable gang scheduling for a RayJob in an ACK Pro or ACK Lingjun cluster, add the ray.io/scheduler-name: kube-scheduler label to its metadata. After the job is submitted, the Ray Operator automatically injects the labels required for identification, grouping, and gang scheduling when it creates the Pods.
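For example, the sample RayJob shown earlier could enable gang scheduling by adding the label to its metadata. The following sketch shows only the metadata section; the spec is assumed to be unchanged from the earlier sample.
apiVersion: ray.io/v1
kind: RayJob
metadata:
  generateName: rayjob-video-
  namespace: video
  labels:
    job-type: video
    ray.io/scheduler-name: kube-scheduler   # Enables gang scheduling; the Ray Operator injects the required labels into the Pods it creates.
spec:
  # ... same as the spec in the sample RayJob above ...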
When resources are insufficient, get detailed information about scheduling failures by inspecting Kubernetes events. Use --field-selector='type=Warning,reason=GangFailedScheduling' to filter for events related to gang scheduling failures. The event message, which may include cycle xx, provides details about a specific scheduling attempt and explains why the Pod could not be successfully scheduled in that round. The following is an example.
kubectl get events -n algorithm-text --field-selector='type=Warning,reason=GangFailedScheduling' | grep "cycle 1"
5m48s Warning GangFailedScheduling pod/rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-89mlq rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-89mlq in gang failed to be scheduled in cycle 1: 0/0 nodes are available: 3 Insufficient cpu.
5m48s Warning GangFailedScheduling pod/rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-8fwmr rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-8fwmr in gang failed to be scheduled in cycle 1: 0/0 nodes are available: 3 Insufficient cpu.
5m48s Warning GangFailedScheduling pod/rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-8g5wv rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-8g5wv in gang failed to be scheduled in cycle 1: 0/0 nodes are available: 3 Insufficient cpu.
5m46s Warning GangFailedScheduling pod/rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-8tn4w rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-8tn4w in gang failed to be scheduled in cycle 1: 0/0 nodes are available: 3 Insufficient cpu.
5m48s Warning GangFailedScheduling pod/rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-97gpk rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-97gpk in gang failed to be scheduled in cycle 1: 0/0 nodes are available: 3 Insufficient cpu.
5m46s Warning GangFailedScheduling pod/rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-9xsgw rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-9xsgw in gang failed to be scheduled in cycle 1: 0/0 nodes are available: 3 Insufficient cpu.
5m48s Warning GangFailedScheduling pod/rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-gwxhg rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-gwxhg in gang failed to be scheduled in cycle 1: 0/0 nodes are available: 3 Insufficient cpu.
5m48s Warning GangFailedScheduling pod/rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-jzw6k rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-jzw6k in gang failed to be scheduled in cycle 1: 0/0 nodes are available: 3 Insufficient cpu.
5m48s Warning GangFailedScheduling pod/rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-kb55s rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-kb55s in gang failed to be scheduled in cycle 1: 0/0 nodes are available: 3 Insufficient cpu.
5m48s Warning GangFailedScheduling pod/rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-lbvk7 rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-lbvk7 in gang failed to be scheduled in cycle 1: 0/0 nodes are available: 3 Insufficient cpu.
5m46s Warning GangFailedScheduling pod/rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-ms96b rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-ms96b in gang failed to be scheduled in cycle 1: 0/0 nodes are available: 3 Insufficient cpu.
5m48s Warning GangFailedScheduling pod/rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-sgr9g rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-sgr9g in gang failed to be scheduled in cycle 1: 0/0 nodes are available: 3 Insufficient cpu.
5m46s Warning GangFailedScheduling pod/rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-svt6g rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-svt6g in gang failed to be scheduled in cycle 1: 0/0 nodes are available: 3 Insufficient cpu.
5m48s Warning GangFailedScheduling pod/rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-wm5c6 rayjob-sample-dtmtl-raycluster-r9jc7-small-group-worker-wm5c6 in gang failed to be scheduled in cycle 1: 0/0 nodes are available: 3 Insufficient cpu.