Container Service for Kubernetes (ACK) integrates the Ray autoscaler with Elastic Container Instance (ECI) to let your Ray cluster scale worker pods on demand. Each new worker pod runs as an isolated ECI instance — no Elastic Compute Service (ECS) nodes to purchase, provision, or maintain. You pay only for the resources your containers consume.
How it works
Autoscaling in a KubeRay cluster operates at three levels:
- Ray actor/task: Some Ray libraries (such as Ray Serve) automatically adjust the number of actors based on incoming request volume.
- Ray node: The Ray autoscaler adjusts the number of Ray pods based on the *logical* resource demand declared in @ray.remote annotations, not on physical CPU or memory utilization. When a task or actor requests more resources than the cluster has available, the Ray autoscaler queues the request and provisions new worker pods. Idle nodes are removed automatically over time.
- Kubernetes node: When the Kubernetes cluster lacks capacity for the new Ray pods, the Kubernetes autoscaler can provision additional nodes. This document covers Ray-level autoscaling to ECI virtual nodes.
With ECI, each new Ray worker pod maps to an ECI instance rather than an ECS node, giving you fast, isolated startup without managing the underlying infrastructure.
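To watch these decisions as they happen, note that the head pod in a KubeRay cluster runs the Ray autoscaler as a sidecar container (named autoscaler by default). A minimal way to follow its log, assuming a hypothetical ${HEAD_POD_NAME} variable that holds your head pod's name:
# Follow the autoscaler sidecar's log; scale-out and scale-in decisions
# (pending resource demands, node launches, idle terminations) appear here.
kubectl -n ${RAY_CLUSTER_NS} logs -f ${HEAD_POD_NAME} -c autoscaler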
Figure: example of the cluster architecture
Prerequisites
Before you begin, ensure that you have:
- A Ray cluster created on ACK. For more information, see Create a Ray cluster based on ACK.
- The ack-virtual-node component deployed, with pods schedulable to ECI virtual nodes. For more information, see Use virtual nodes to schedule pods to run on Elastic Container Instance nodes.
- (Optional) Familiarity with submitting Ray jobs. For more information, see Submit a Ray job.
Configure and verify ECI elastic scaling
Step 1: Verify the virtual-kubelet node
Run the following command to confirm that the virtual-kubelet virtual node is present in your cluster:
kubectl get node
Expected output:
NAME                            STATUS   ROLES    AGE   VERSION
cn-hangzhou.172.XX.XX.20        Ready    <none>   19h   v1.26.3-aliyun.1
cn-hangzhou.172.XX.XX.236       Ready    <none>   82m   v1.26.3-aliyun.1
cn-hangzhou.172.XX.XX.41        Ready    <none>   19h   v1.26.3-aliyun.1
virtual-kubelet-cn-hangzhou-k   Ready    agent    16m   v1.26.3-aliyun.1
The virtual-kubelet-cn-hangzhou-k node represents the ECI virtual node. Worker pods scheduled to this node run as ECI instances.
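Virtual nodes are typically tainted so that only pods which explicitly tolerate ECI scheduling land on them. If you want to inspect this, a quick check:
# Show the taints on the ECI virtual node; regular pods are repelled
# unless they carry a matching toleration.
kubectl describe node virtual-kubelet-cn-hangzhou-k | grep -i taints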
Step 2: Create the values.yaml configuration file
Run the following command to create a values.yaml file that labels the worker group for ECI scheduling:
cat > values.yaml <<EOF
worker:
  groupName: workergroup
  labels:
    alibabacloud.com/eci: "true"
EOF
The alibabacloud.com/eci: "true" label instructs the scheduler to place worker pods on the ECI virtual node.
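Assuming the chart propagates this label to the worker pods (which is what the configuration above sets up), you can later list the ECI-bound workers by label:
# List pods carrying the ECI scheduling label; after scale-out, the worker
# pods created from this group should show up on the virtual node.
kubectl -n ${RAY_CLUSTER_NS} get pod -l alibabacloud.com/eci=true -o wide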
Step 3: Deploy the Ray cluster with ECI support
Run the following commands to redeploy your Ray cluster with the updated configuration:
helm uninstall ${RAY_CLUSTER_NAME} -n ${RAY_CLUSTER_NS}
helm install ${RAY_CLUSTER_NAME} aliyunhub/ack-ray-cluster -n ${RAY_CLUSTER_NS} -f values.yaml
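Before checking individual pods, you can optionally confirm that the release itself deployed cleanly:
# Show the release status and the resources Helm created for the Ray cluster.
helm status ${RAY_CLUSTER_NAME} -n ${RAY_CLUSTER_NS}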
Step 4: Verify the head pod is running
Run the following command to confirm the head pod is ready:
kubectl get pod
Expected output:
NAME                             READY   STATUS    RESTARTS   AGE    IP             NODE                       NOMINATED NODE   READINESS GATES
myfirst-ray-cluster-head-7fgp4   2/2     Running   0          7m2s   172.16.0.241   cn-hangzhou.172.16.0.240   <none>           <none>
The head pod runs on a regular ECS node. Its --num-cpus is set to 0 by default, which means Ray does not schedule compute tasks on the head pod; all tasks go to worker pods.
Step 5: Log in to the head node
Replace the pod name with the actual name of your head pod, then run:
kubectl -n ${RAY_CLUSTER_NS} exec -it myfirst-ray-cluster-head-7fgp4 -- bash
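Once inside the head pod, ray status is a quick way to see the cluster's logical resources as the autoscaler sees them. With --num-cpus=0 on the head, total CPU capacity stays at 0 until worker pods join:
# Print the Ray cluster's node list and logical resource usage and demand.
ray status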
Step 6: Submit a Python job to trigger scale-out
The following script submits two tasks, each requiring 1 vCPU. Because the head pod has --num-cpus=0 and each worker pod provides 1 vCPU and 1 GB of memory by default, the Ray autoscaler detects the unmet logical resource demand and provisions two new ECI worker pods.
The Ray autoscaler scales based on *logical* resource requests declared in @ray.remote, not on physical CPU or memory utilization. The num_cpus=1 declaration per task is what triggers scale-out — not actual CPU usage.
import time
import ray
import socket

ray.init()  # connects to the running Ray cluster when run from the head pod

@ray.remote(num_cpus=1)  # each task logically requests 1 vCPU
def get_task_hostname():
    time.sleep(120)  # keep the task alive long enough to observe scale-out
    host = socket.gethostbyname(socket.gethostname())
    return host

object_refs = []
for _ in range(2):
    object_refs.append(get_task_hostname.remote())

ray.wait(object_refs)

for t in object_refs:
    print(ray.get(t))
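One way to run it, assuming you saved the script above as a hypothetical scale_test.py inside the head pod:
# Runs for roughly two minutes: each task sleeps for 120 seconds while
# holding its 1 logical vCPU, giving the autoscaler time to react.
python scale_test.py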
Step 7: Verify scale-out
Run the following command to confirm that two ECI worker pods are provisioned on the virtual node:
kubectl get pod -o wide
Expected output:
NAME                                           READY   STATUS     RESTARTS   AGE     IP             NODE                            NOMINATED NODE   READINESS GATES
myfirst-ray-cluster-head-7fgp4                 2/2     Running    0          4m56s   172.16.0.241   cn-hangzhou.172.16.0.240        <none>           <none>
myfirst-ray-cluster-worker-workergroup-6s2cl   0/1     Init:0/1   0          4m5s    172.16.0.17    virtual-kubelet-cn-hangzhou-k   <none>           <none>
myfirst-ray-cluster-worker-workergroup-l9qgb   1/1     Running    0          4m5s    172.16.0.16    virtual-kubelet-cn-hangzhou-k   <none>           <none>
Both worker pods run on virtual-kubelet-cn-hangzhou-k, confirming they are ECI instances. Init:0/1 means the pod is still initializing — it transitions to Running once the ECI instance is ready.
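If a worker pod stays in Init:0/1 longer than expected, its events usually show the ECI instance creation progress:
# Describe the initializing worker pod; ECI provisioning events appear
# at the bottom of the output.
kubectl -n ${RAY_CLUSTER_NS} describe pod myfirst-ray-cluster-worker-workergroup-6s2cl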
After the tasks complete and the pods remain idle, the Ray autoscaler automatically terminates the worker pods.
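The Ray autoscaler's idle timeout defaults to 60 seconds, though the exact timing depends on your configuration. To observe the scale-in, watch the pods after the script finishes:
# Watch pod changes; the worker pods terminate once idle past the timeout.
kubectl -n ${RAY_CLUSTER_NS} get pod -w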
What's next
- Virtual nodes: learn more about how virtual nodes work in ACK.
- Elastic scaling based on the Ray autoscaler and ACK autoscaler: configure autoscaling for ECS nodes alongside ECI scaling.