
Container Service for Kubernetes:Elastic scaling based on the Ray autoscaler and ACK autoscaler

Last Updated:Feb 29, 2024

Ray provides the Ray autoscaler, which dynamically adjusts the computing resources of a Ray cluster based on its workloads. Container Service for Kubernetes (ACK) provides the ACK autoscaler, which automatically adjusts the number of nodes based on the workloads in the cluster. Combining the Ray autoscaler with the ACK autoscaler fully leverages the elasticity of cloud computing and improves the efficiency and cost-effectiveness of computing resources.

Prerequisites

Elastic scaling based on the Ray autoscaler and ACK autoscaler

  1. Run the following command to deploy a Ray cluster by using Helm in the ACK cluster:

    # Remove a previous release with the same name, if one exists.
    helm uninstall ${RAY_CLUSTER_NAME} -n ${RAY_CLUSTER_NS}
    # Install the Ray cluster chart into the specified namespace.
    helm install ${RAY_CLUSTER_NAME} aliyunhub/ack-ray-cluster -n ${RAY_CLUSTER_NS}
  2. Run the following command to view the status of resources in the Ray cluster:

    kubectl get pod -n ${RAY_CLUSTER_NS}
    NAME                                           READY   STATUS     RESTARTS   AGE
    myfirst-ray-cluster-head-kvvdf                 2/2     Running    0          22m
  3. Run the following command to log on to the head node and view the cluster status:

    Replace myfirst-ray-cluster-head-kvvdf with the actual name of the head pod in your Ray cluster.

    kubectl -n ${RAY_CLUSTER_NS} exec -it myfirst-ray-cluster-head-kvvdf -- bash
    (base) ray@myfirst-ray-cluster-head-kvvdf:~$ ray status

    Expected output:

    ======== Autoscaler status: 2024-01-25 00:00:19.879963 ========
    Node status
    ---------------------------------------------------------------
    Healthy:
     1 head-group
    Pending:
     (no pending nodes)
    Recent failures:
     (no failures)
    
    Resources
    ---------------------------------------------------------------
    Usage:
     0B/1.86GiB memory
     0B/452.00MiB object_store_memory
    
    Demands:
     (no resource demands)
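
    The `ray status` output above can also be checked in code. The following is a minimal sketch that extracts the healthy node groups by plain string parsing of the sample output shown above (the `healthy_groups` helper is illustrative, not a Ray API):

    ```python
    # Parse the "Healthy:" section of a sample `ray status` report.
    # STATUS is copied from the expected output above.
    STATUS = """\
    ======== Autoscaler status: 2024-01-25 00:00:19.879963 ========
    Node status
    ---------------------------------------------------------------
    Healthy:
     1 head-group
    Pending:
     (no pending nodes)
    Recent failures:
     (no failures)
    """

    def healthy_groups(status_text):
        groups = {}
        in_healthy = False
        for line in status_text.splitlines():
            if line.startswith("Healthy:"):
                in_healthy = True
                continue
            if in_healthy:
                if not line.startswith(" "):  # reached the next section header
                    break
                count, name = line.split()
                groups[name] = int(count)
        return groups

    print(healthy_groups(STATUS))  # {'head-group': 1}
    ```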
  4. Submit the following job to the Ray cluster:

    The following code starts 15 tasks, each of which requires one vCPU. By default, --num-cpus is set to 0 for the head pod, so no tasks are scheduled to it. Each worker pod is configured with 1 vCPU and 1 GB of memory by default, so the Ray cluster requests 15 worker pods. Because the ACK cluster does not have enough node resources, the pending pods trigger the node auto scaling feature.

    import time
    import ray
    import socket
    
    ray.init()
    
    @ray.remote(num_cpus=1)
    def get_task_hostname():
        time.sleep(120)
        host = socket.gethostbyname(socket.gethostname())
        return host
    
    object_refs = []
    for _ in range(15):
        object_refs.append(get_task_hostname.remote())
    
    ray.wait(object_refs, num_returns=len(object_refs))  # Block until all 15 tasks finish.
    
    for t in object_refs:
        print(ray.get(t))
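
    As a sanity check on the scale-out that the job above triggers, the number of worker pods the Ray autoscaler needs can be estimated from the task demands. A small sketch, with values mirroring the job above (the `workers_needed` helper is illustrative, not a Ray API):

    ```python
    import math

    # Estimate how many worker pods the Ray autoscaler must request:
    # total demanded vCPUs, minus what the head pod offers, divided by
    # the vCPUs each worker pod provides, rounded up.
    def workers_needed(num_tasks, cpus_per_task, cpus_per_worker, head_cpus=0):
        demanded = num_tasks * cpus_per_task - head_cpus  # head has --num-cpus=0
        return math.ceil(demanded / cpus_per_worker)

    # 15 tasks x 1 vCPU each, 1 vCPU per worker pod.
    print(workers_needed(15, 1, 1))  # 15
    ```

    With larger worker pods the same demand needs fewer pods, for example `workers_needed(15, 1, 2)` yields 8.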
  5. Run the following command to query the status of pods in the Ray cluster:

    kubectl get pod -n ${RAY_CLUSTER_NS} -w
    # Expected output:
    NAME                                           READY   STATUS    RESTARTS   AGE
    myfirst-ray-cluster-head-kvvdf                 2/2     Running   0          47m
    myfirst-ray-cluster-worker-workergroup-btgmm   1/1     Running   0          30s
    myfirst-ray-cluster-worker-workergroup-c2lmq   0/1     Pending   0          30s
    myfirst-ray-cluster-worker-workergroup-gstcc   0/1     Pending   0          30s
    myfirst-ray-cluster-worker-workergroup-hfshs   0/1     Pending   0          30s
    myfirst-ray-cluster-worker-workergroup-nrfh8   1/1     Running   0          30s
    myfirst-ray-cluster-worker-workergroup-pjbdw   0/1     Pending   0          29s
    myfirst-ray-cluster-worker-workergroup-qxq7v   0/1     Pending   0          30s
    myfirst-ray-cluster-worker-workergroup-sm8mt   1/1     Running   0          30s
    myfirst-ray-cluster-worker-workergroup-wr87d   0/1     Pending   0          30s
    myfirst-ray-cluster-worker-workergroup-xc4kn   1/1     Running   0          30s
    ...
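
    While watching the scale-out, the Running/Pending split can be tallied from listings like the one above. A minimal sketch that parses the sample text by plain string handling (not the Kubernetes API; in practice you would feed it real `kubectl get pod` output):

    ```python
    from collections import Counter

    # LISTING is a shortened copy of the sample `kubectl get pod` output above.
    LISTING = """\
    NAME                                           READY   STATUS    RESTARTS   AGE
    myfirst-ray-cluster-head-kvvdf                 2/2     Running   0          47m
    myfirst-ray-cluster-worker-workergroup-btgmm   1/1     Running   0          30s
    myfirst-ray-cluster-worker-workergroup-c2lmq   0/1     Pending   0          30s
    myfirst-ray-cluster-worker-workergroup-gstcc   0/1     Pending   0          30s
    """

    def phase_counts(listing):
        counts = Counter()
        for line in listing.splitlines()[1:]:  # skip the header row
            if line.strip():
                counts[line.split()[2]] += 1   # STATUS is the third column
        return counts

    counts = phase_counts(LISTING)
    print(counts["Running"], counts["Pending"])  # 2 2
    ```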
  6. Run the following command to query the node status:

    kubectl get node -w
    # Expected output:
    cn-hangzhou.172.16.0.204   Ready    <none>   44h   v1.24.6-aliyun.1
    cn-hangzhou.172.16.0.17    NotReady   <none>   0s    v1.24.6-aliyun.1
    cn-hangzhou.172.16.0.17    NotReady   <none>   0s    v1.24.6-aliyun.1
    cn-hangzhou.172.16.0.17    NotReady   <none>   0s    v1.24.6-aliyun.1
    cn-hangzhou.172.16.0.17    NotReady   <none>   1s    v1.24.6-aliyun.1
    cn-hangzhou.172.16.0.17    NotReady   <none>   11s   v1.24.6-aliyun.1
    cn-hangzhou.172.16.0.16    NotReady   <none>   10s   v1.24.6-aliyun.1
    cn-hangzhou.172.16.0.16    NotReady   <none>   14s   v1.24.6-aliyun.1
    cn-hangzhou.172.16.0.17    NotReady   <none>   31s   v1.24.6-aliyun.1
    cn-hangzhou.172.16.0.17    NotReady   <none>   60s   v1.24.6-aliyun.1
    cn-hangzhou.172.16.0.17    Ready      <none>   61s   v1.24.6-aliyun.1
    cn-hangzhou.172.16.0.16    Ready      <none>   64s   v1.24.6-aliyun.1
    ...

References