
Container Service for Kubernetes: Implement auto scaling with Ray autoscaler and ACK autoscaler

Last updated: Jun 19, 2024

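The Ray distributed computing framework provides the Ray autoscaler component, which dynamically adjusts the compute resources of a Ray Cluster based on its workload. ACK clusters provide the ACK autoscaler component, which automatically adjusts the number of nodes to match what the workloads in the cluster actually need. Combining the two makes fuller use of cloud elasticity and improves both the supply efficiency and the cost-effectiveness of compute resources.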
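
To make the division of labor concrete, the following is a minimal sketch of the arithmetic behind the two scaling levels, not the logic that either autoscaler actually runs. The task and worker sizes are taken from the example job in the procedure of this document. `free_slots` and `pods_per_node` are hypothetical values chosen only for illustration, and both function names are invented for this sketch.

```python
import math


def worker_pods_needed(num_tasks: int, cpus_per_task: float,
                       cpus_per_worker: float) -> int:
    """Worker Pods that must exist so every pending task gets its CPUs."""
    if cpus_per_task > cpus_per_worker:
        raise ValueError("a task cannot fit on a single worker Pod")
    tasks_per_worker = math.floor(cpus_per_worker / cpus_per_task)
    return math.ceil(num_tasks / tasks_per_worker)


def extra_nodes_needed(pending_pods: int, pods_per_node: int) -> int:
    """Nodes that must be added so every unschedulable Pod finds room."""
    return math.ceil(pending_pods / pods_per_node)


# Level 1 (Ray autoscaler): 15 tasks x 1 CPU on workers with 1 CPU each.
workers = worker_pods_needed(num_tasks=15, cpus_per_task=1, cpus_per_worker=1)

# Level 2 (ACK autoscaler): Pods that do not fit on existing nodes stay Pending.
free_slots = 4      # hypothetical: worker Pods the existing nodes can still host
pods_per_node = 6   # hypothetical: worker Pods one newly added node can host
pending = max(0, workers - free_slots)
new_nodes = extra_nodes_needed(pending, pods_per_node)

print(workers, pending, new_nodes)
```

With these illustrative numbers, 15 worker Pods are requested, 11 of them stay Pending, and 2 nodes would be added. Real scheduling also depends on memory, taints, and node pool limits, which this sketch ignores.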

Prerequisites
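
The commands in the procedure below read the release name and namespace from the shell variables `RAY_CLUSTER_NAME` and `RAY_CLUSTER_NS`. As a hedged sketch, a pre-flight check along these lines could confirm that both are set before Helm is invoked. The function name `helm_install_command` and the namespace value `raycluster` are hypothetical; `myfirst-ray-cluster` is inferred from the Pod names in the sample output.

```python
import os

REQUIRED_VARS = ("RAY_CLUSTER_NAME", "RAY_CLUSTER_NS")


def helm_install_command(env=None):
    """Build the `helm install` argument list used in step 1.

    Raises RuntimeError if a required variable is unset or empty, which
    would otherwise produce a malformed helm command.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("unset variables: " + ", ".join(missing))
    return [
        "helm", "install", env["RAY_CLUSTER_NAME"],
        "aliyunhub/ack-ray-cluster",
        "-n", env["RAY_CLUSTER_NS"],
    ]


# Example values; the namespace is a placeholder, not taken from the source.
example_env = {"RAY_CLUSTER_NAME": "myfirst-ray-cluster",
               "RAY_CLUSTER_NS": "raycluster"}
print(" ".join(helm_install_command(example_env)))
```

The function only builds the argument list; passing it to `subprocess.run` is left to the caller so the check can be exercised without a cluster.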

Combine Ray autoscaler with ACK autoscaler to implement auto scaling

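  1. Run the following commands to install the Ray Cluster application in the ACK cluster with Helm. The first command removes an existing release of the same name; skip it if none is installed.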

    helm uninstall ${RAY_CLUSTER_NAME} -n ${RAY_CLUSTER_NS}
    helm install ${RAY_CLUSTER_NAME} aliyunhub/ack-ray-cluster -n ${RAY_CLUSTER_NS} 
  2. Run the following command to check the status of the resources in the Ray Cluster.

    kubectl get pod -n ${RAY_CLUSTER_NS}
    # Expected output:
    NAME                                           READY   STATUS     RESTARTS   AGE
    myfirst-ray-cluster-head-kvvdf                 2/2     Running    0          22m
  3. Run the following commands to log in to the head node and view the cluster status.

    Replace the Pod name with the actual name of the head Pod in your Ray Cluster.

    kubectl -n ${RAY_CLUSTER_NS} exec -it myfirst-ray-cluster-head-kvvdf -- bash
    (base) ray@myfirst-ray-cluster-head-kvvdf:~$ ray status

    Expected output:

    ======== Autoscaler status: 2024-01-25 00:00:19.879963 ========
    Node status
    ---------------------------------------------------------------
    Healthy:
     1 head-group
    Pending:
     (no pending nodes)
    Recent failures:
     (no failures)
    
    Resources
    ---------------------------------------------------------------
    Usage:
     0B/1.86GiB memory
     0B/452.00MiB object_store_memory
    
    Demands:
     (no resource demands)
  4. Submit and run the following job in the Ray Cluster.

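    The code below starts 15 tasks, each of which requests 1 CPU core. In the Ray Cluster created by default, the head Pod has --num-cpus set to 0, so no tasks are scheduled on it, and each worker Pod defaults to 1 CPU core and 1 GB of memory. The Ray Cluster therefore needs to scale out to 15 worker Pods. Because the nodes in the ACK cluster do not have enough resources for them, the Pending Pods automatically trigger ACK node auto scaling.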

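    import time
    import ray
    import socket
    
    # Initialize Ray for this job.
    ray.init()
    
    # Each task reserves 1 CPU, so 15 tasks need 15 CPUs of worker capacity.
    @ray.remote(num_cpus=1)
    def get_task_hostname():
        # Sleep long enough for the scale-out to be observable.
        time.sleep(120)
        # Resolve the IP address of the Pod that ran this task.
        host = socket.gethostbyname(socket.gethostname())
        return host
    
    # Submit 15 tasks; each call returns an object reference immediately.
    object_refs = []
    for _ in range(15):
        object_refs.append(get_task_hostname.remote())
    
    # Block until all tasks finish (by default ray.wait returns after the first one).
    ray.wait(object_refs, num_returns=len(object_refs))
    
    # Print the address returned by each task.
    for t in object_refs:
        print(ray.get(t))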
  5. Run the following command to watch the status of the Pods in the Ray Cluster.

    kubectl get pod -n ${RAY_CLUSTER_NS} -w
    # Expected output:
    NAME                                           READY   STATUS    RESTARTS   AGE
    myfirst-ray-cluster-head-kvvdf                 2/2     Running   0          47m
    myfirst-ray-cluster-worker-workergroup-btgmm   1/1     Running   0          30s
    myfirst-ray-cluster-worker-workergroup-c2lmq   0/1     Pending   0          30s
    myfirst-ray-cluster-worker-workergroup-gstcc   0/1     Pending   0          30s
    myfirst-ray-cluster-worker-workergroup-hfshs   0/1     Pending   0          30s
    myfirst-ray-cluster-worker-workergroup-nrfh8   1/1     Running   0          30s
    myfirst-ray-cluster-worker-workergroup-pjbdw   0/1     Pending   0          29s
    myfirst-ray-cluster-worker-workergroup-qxq7v   0/1     Pending   0          30s
    myfirst-ray-cluster-worker-workergroup-sm8mt   1/1     Running   0          30s
    myfirst-ray-cluster-worker-workergroup-wr87d   0/1     Pending   0          30s
    myfirst-ray-cluster-worker-workergroup-xc4kn   1/1     Running   0          30s
    ...
  6. Run the following command to watch the status of the nodes.

    kubectl get node -w
    # Expected output:
    cn-hangzhou.172.16.0.204   Ready      <none>   44h   v1.24.6-aliyun.1
    cn-hangzhou.172.16.0.17    NotReady   <none>   0s    v1.24.6-aliyun.1
    cn-hangzhou.172.16.0.17    NotReady   <none>   0s    v1.24.6-aliyun.1
    cn-hangzhou.172.16.0.17    NotReady   <none>   0s    v1.24.6-aliyun.1
    cn-hangzhou.172.16.0.17    NotReady   <none>   1s    v1.24.6-aliyun.1
    cn-hangzhou.172.16.0.17    NotReady   <none>   11s   v1.24.6-aliyun.1
    cn-hangzhou.172.16.0.16    NotReady   <none>   10s   v1.24.6-aliyun.1
    cn-hangzhou.172.16.0.16    NotReady   <none>   14s   v1.24.6-aliyun.1
    cn-hangzhou.172.16.0.17    NotReady   <none>   31s   v1.24.6-aliyun.1
    cn-hangzhou.172.16.0.17    NotReady   <none>   60s   v1.24.6-aliyun.1
    cn-hangzhou.172.16.0.17    Ready      <none>   61s   v1.24.6-aliyun.1
    cn-hangzhou.172.16.0.16    Ready      <none>   64s   v1.24.6-aliyun.1
    ...
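
When reading the watch output from steps 5 and 6, it can help to reduce the stream to the latest status per Pod or node. The sketch below does this for text in the layout shown above. It assumes whitespace-separated columns with the status in the third column for Pods and the second for nodes, and the helper name `latest_status` is invented for this example. The sample rows are copied from the expected output above.

```python
from collections import Counter


def latest_status(watch_output: str, status_column: int) -> dict:
    """Return the most recent status seen for each resource name."""
    latest = {}
    for line in watch_output.splitlines():
        fields = line.split()
        # Skip headers, comment lines, and truncated lines such as "...".
        if len(fields) <= status_column or fields[0] in ("NAME", "#"):
            continue
        # Later rows overwrite earlier ones, so the last event wins.
        latest[fields[0]] = fields[status_column]
    return latest


POD_ROWS = """\
NAME                                           READY   STATUS    RESTARTS   AGE
myfirst-ray-cluster-head-kvvdf                 2/2     Running   0          47m
myfirst-ray-cluster-worker-workergroup-btgmm   1/1     Running   0          30s
myfirst-ray-cluster-worker-workergroup-c2lmq   0/1     Pending   0          30s
myfirst-ray-cluster-worker-workergroup-gstcc   0/1     Pending   0          30s
...
"""

NODE_ROWS = """\
cn-hangzhou.172.16.0.204   Ready      <none>   44h   v1.24.6-aliyun.1
cn-hangzhou.172.16.0.17    NotReady   <none>   0s    v1.24.6-aliyun.1
cn-hangzhou.172.16.0.16    NotReady   <none>   10s   v1.24.6-aliyun.1
cn-hangzhou.172.16.0.17    Ready      <none>   61s   v1.24.6-aliyun.1
cn-hangzhou.172.16.0.16    Ready      <none>   64s   v1.24.6-aliyun.1
...
"""

# Count Pods per status (STATUS is the third column, index 2).
pod_counts = Counter(latest_status(POD_ROWS, status_column=2).values())
# Track the final state of each node (STATUS is the second column, index 1).
node_states = latest_status(NODE_ROWS, status_column=1)

print(dict(pod_counts))
print(node_states)
```

In this excerpt the two new nodes, `cn-hangzhou.172.16.0.16` and `cn-hangzhou.172.16.0.17`, end in the Ready state, at which point the Pending worker Pods can be scheduled onto them.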

Related documents