
Container Service for Kubernetes: Elastic scaling based on the Ray autoscaler and ACK autoscaler

Last Updated: Mar 26, 2026

Elastic scaling combines the Ray autoscaler and the ACK autoscaler to dynamically adjust both Ray worker pods and the underlying Kubernetes nodes in response to workload demand. This two-layer coordination lets your Ray cluster on ACK grow and shrink automatically, improving resource efficiency and reducing cost without manual intervention.

How it works

Elastic scaling operates across two layers:

Pod layer (Ray autoscaler): Monitors the logical resource requests declared in @ray.remote decorators and adds or removes Ray worker pods to satisfy them. Runs as a sidecar container inside the head pod. No separate configuration is required; it is included with the Ray cluster.

Node layer (ACK autoscaler): Monitors pods stuck in the Pending state. When newly created worker pods cannot be scheduled due to insufficient node capacity, it provisions new nodes. It must be enabled separately on the ACK cluster.

Scale-up sequence: When a job is submitted, the Ray autoscaler reads the logical resource requests (not physical CPU or memory utilization) and creates worker pods to satisfy them. If the ACK cluster lacks capacity to schedule those pods, the pending pods trigger the ACK autoscaler to provision new nodes.
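
To observe this hand-off on a running cluster, you can tail the Ray autoscaler sidecar and inspect a Pending worker pod. The commands below are a minimal sketch: the container name autoscaler and the ${HEAD_POD} and <pending-worker-pod-name> placeholders are assumptions rather than values taken from this guide; the sidecar name is consistent with the two-container (2/2) head pod shown in the verification step below.

# Tail the Ray autoscaler sidecar to watch it turn logical resource demands
# into requests for new worker pods (container name "autoscaler" is assumed).
kubectl -n ${RAY_CLUSTER_NS} logs -f ${HEAD_POD} -c autoscaler

# Describe a Pending worker pod to confirm that the scheduler reports
# insufficient capacity, which is the condition the ACK autoscaler reacts to.
kubectl -n ${RAY_CLUSTER_NS} describe pod <pending-worker-pod-name>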

Prerequisites

Before you begin, ensure that you have:

- Created an ACK cluster and can access it with kubectl.
- Enabled node auto scaling (the ACK autoscaler) for the cluster so that new nodes can be provisioned for pending pods.
- Installed Helm and added the chart repository that provides the aliyunhub/ack-ray-cluster chart.

You can spot-check these prerequisites with the commands below.
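
The following commands are a quick, optional sanity check; the repository name aliyunhub matches the installation command used later in this guide and may differ in your environment.

# kubectl points at the right cluster and the nodes are Ready.
kubectl get node

# Helm is installed and the chart repository is configured.
helm version
helm repo list | grep aliyunhub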

Set up elastic scaling

Step 1: Deploy the Ray cluster

Run the following commands to deploy a Ray cluster in the ACK cluster by using Helm. Set RAY_CLUSTER_NAME and RAY_CLUSTER_NS to the release name and namespace that you want to use before you run them:

# Remove any existing release with the same name (ignore the error if no such release exists).
helm uninstall ${RAY_CLUSTER_NAME} -n ${RAY_CLUSTER_NS}
# Deploy the Ray cluster chart into the target namespace.
helm install ${RAY_CLUSTER_NAME} aliyunhub/ack-ray-cluster -n ${RAY_CLUSTER_NS}
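
After the installation completes, you can optionally verify that the Helm release exists and that a RayCluster custom resource was created. This sketch assumes the ack-ray-cluster chart creates a KubeRay-managed RayCluster object in the same namespace.

# List the Helm release and the RayCluster resource created by the chart.
helm list -n ${RAY_CLUSTER_NS}
kubectl get raycluster -n ${RAY_CLUSTER_NS}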

Step 2: Verify the cluster is running

Run the following command to check that the head pod is ready:

kubectl get pod -n ${RAY_CLUSTER_NS}

Expected output:

NAME                                           READY   STATUS    RESTARTS   AGE
myfirst-ray-cluster-head-kvvdf                 2/2     Running   0          22m

Log in to the head pod and run ray status to confirm the autoscaler is active and no resource demands are pending:

kubectl -n ${RAY_CLUSTER_NS} exec -it myfirst-ray-cluster-head-kvvdf -- bash
ray status

Expected output:

======== Autoscaler status: 2024-01-25 00:00:19.879963 ========
Node status
---------------------------------------------------------------
Healthy:
 1 head-group
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0B/1.86GiB memory
 0B/452.00MiB object_store_memory

Demands:
 (no resource demands)

Replace myfirst-ray-cluster-head-kvvdf with the actual head pod name from your cluster.
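
Instead of copying the name by hand, you can look the head pod up by label. This is a sketch that assumes the head pod carries the ray.io/node-type=head label commonly applied by KubeRay; check the labels on your pods if the query returns nothing.

# Resolve the head pod name by label and store it for later commands.
HEAD_POD=$(kubectl get pod -n ${RAY_CLUSTER_NS} -l ray.io/node-type=head -o jsonpath='{.items[0].metadata.name}')
echo ${HEAD_POD}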

Step 3: Submit a workload to trigger scaling

The following script submits 15 tasks, each requesting 1 vCPU. Because the head pod has --num-cpus set to 0, it does not accept task scheduling. Each worker pod provides 1 vCPU and 1 GB memory by default, so the Ray autoscaler creates 15 worker pods to satisfy the demand. If the ACK cluster lacks sufficient node capacity, the pending worker pods trigger the ACK autoscaler to add nodes.

import time
import ray
import socket

# Connect to the running Ray cluster (the script is executed on the head pod).
ray.init()

# Each task requests 1 logical CPU, which is what the Ray autoscaler uses to
# decide how many worker pods are needed.
@ray.remote(num_cpus=1)
def get_task_hostname():
    time.sleep(120)
    host = socket.gethostbyname(socket.gethostname())
    return host

# Submit 15 tasks; with 1 vCPU per worker pod and 0 CPUs on the head pod,
# satisfying them requires 15 worker pods.
object_refs = []
for _ in range(15):
    object_refs.append(get_task_hostname.remote())

# Wait for at least one task to finish, then fetch and print the IP address of
# the pod that ran each task (ray.get blocks until each result is ready).
ray.wait(object_refs)

for t in object_refs:
    print(ray.get(t))
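
One way to run the script is to copy it into the head pod and execute it there. The file name scale_test.py is only an example, and ${HEAD_POD} stands for the head pod name from Step 2.

# Copy the example script into the head pod and run it with the pod's Python.
kubectl cp scale_test.py ${RAY_CLUSTER_NS}/${HEAD_POD}:/tmp/scale_test.py
kubectl -n ${RAY_CLUSTER_NS} exec -it ${HEAD_POD} -- python /tmp/scale_test.py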

Step 4: Monitor the scaling process

Watch pod creation as the Ray autoscaler provisions worker pods:

kubectl get pod -n ${RAY_CLUSTER_NS} -w

Expected output:

NAME                                           READY   STATUS    RESTARTS   AGE
myfirst-ray-cluster-head-kvvdf                 2/2     Running   0          47m
myfirst-ray-cluster-worker-workergroup-btgmm   1/1     Running   0          30s
myfirst-ray-cluster-worker-workergroup-c2lmq   0/1     Pending   0          30s
myfirst-ray-cluster-worker-workergroup-gstcc   0/1     Pending   0          30s
myfirst-ray-cluster-worker-workergroup-hfshs   0/1     Pending   0          30s
myfirst-ray-cluster-worker-workergroup-nrfh8   1/1     Running   0          30s
myfirst-ray-cluster-worker-workergroup-pjbdw   0/1     Pending   0          29s
myfirst-ray-cluster-worker-workergroup-qxq7v   0/1     Pending   0          30s
myfirst-ray-cluster-worker-workergroup-sm8mt   1/1     Running   0          30s
myfirst-ray-cluster-worker-workergroup-wr87d   0/1     Pending   0          30s
myfirst-ray-cluster-worker-workergroup-xc4kn   1/1     Running   0          30s
...
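
To list only the worker pods that are still waiting for capacity, which are the pods that trigger the node-level scale-up, filter by phase:

kubectl get pod -n ${RAY_CLUSTER_NS} --field-selector=status.phase=Pending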

Watch node provisioning as the ACK autoscaler adds nodes for the pending pods:

kubectl get node -w

Expected output:

NAME                       STATUS     ROLES    AGE   VERSION
cn-hangzhou.172.16.0.204   Ready      <none>   44h   v1.24.6-aliyun.1
cn-hangzhou.172.16.0.17    NotReady   <none>   0s    v1.24.6-aliyun.1
cn-hangzhou.172.16.0.17    NotReady   <none>   0s    v1.24.6-aliyun.1
cn-hangzhou.172.16.0.17    NotReady   <none>   0s    v1.24.6-aliyun.1
cn-hangzhou.172.16.0.17    NotReady   <none>   1s    v1.24.6-aliyun.1
cn-hangzhou.172.16.0.17    NotReady   <none>   11s   v1.24.6-aliyun.1
cn-hangzhou.172.16.0.16    NotReady   <none>   10s   v1.24.6-aliyun.1
cn-hangzhou.172.16.0.16    NotReady   <none>   14s   v1.24.6-aliyun.1
cn-hangzhou.172.16.0.17    NotReady   <none>   31s   v1.24.6-aliyun.1
cn-hangzhou.172.16.0.17    NotReady   <none>   60s   v1.24.6-aliyun.1
cn-hangzhou.172.16.0.17    Ready      <none>   61s   v1.24.6-aliyun.1
cn-hangzhou.172.16.0.16    Ready      <none>   64s   v1.24.6-aliyun.1
...

As shown in the sample output above, the two new nodes transitioned from NotReady to Ready approximately 61 and 64 seconds after they were provisioned.
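
Scale-down follows the same two layers in reverse: once the tasks finish, the Ray autoscaler removes worker pods after they have been idle for its configured timeout, and the ACK autoscaler later reclaims nodes that become empty. You can reuse the watch commands to observe this; the exact timing depends on the idle-timeout and scale-down settings of your cluster.

# Watch worker pods terminate as they go idle, then watch empty nodes be removed.
kubectl get pod -n ${RAY_CLUSTER_NS} -w
kubectl get node -w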

What's next