The Ray distributed computing framework provides the Ray autoscaler component, which dynamically adjusts the compute resources of a Ray Cluster based on workload. ACK clusters likewise provide the ACK autoscaler component, which automatically adjusts the number of nodes based on the actual needs of the workloads in the cluster. Combining the elasticity of the Ray autoscaler with that of the ACK autoscaler makes fuller use of the elasticity of the cloud, improving both the efficiency and the cost-effectiveness of compute resource provisioning.
Prerequisites
Combine the Ray autoscaler with the ACK cluster-autoscaler to implement auto scaling
Run the following commands to install the Ray Cluster application in the ACK cluster by using Helm.
# Remove any previous release with the same name, then install the chart.
helm uninstall ${RAY_CLUSTER_NAME} -n ${RAY_CLUSTER_NS}
helm install ${RAY_CLUSTER_NAME} aliyunhub/ack-ray-cluster -n ${RAY_CLUSTER_NS}

Run the following command to check the status of the resources in the Ray Cluster.
kubectl get pod -n ${RAY_CLUSTER_NS}

Expected output:

NAME                             READY   STATUS    RESTARTS   AGE
myfirst-ray-cluster-head-kvvdf   2/2     Running   0          22m

Run the following commands to log in to the head node and check the cluster status.
Replace the pod name with the actual pod name of your Ray Cluster.
kubectl -n ${RAY_CLUSTER_NS} exec -it myfirst-ray-cluster-head-kvvdf -- bash
(base) ray@myfirst-ray-cluster-head-kvvdf:~$ ray status

Expected output:
======== Autoscaler status: 2024-01-25 00:00:19.879963 ========
Node status
---------------------------------------------------------------
Healthy:
 1 head-group
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0B/1.86GiB memory
 0B/452.00MiB object_store_memory

Demands:
 (no resource demands)
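If you prefer to inspect the same information from Python instead of the ray status CLI, the following minimal sketch (an illustration added here, not part of the original walkthrough) reads the cluster totals through Ray's public API. It assumes it is run on the head pod, where ray.init(address="auto") attaches to the already-running cluster.

# Illustrative sketch: query cluster capacity from Python instead of
# the `ray status` CLI. Assumes it runs on the head pod of the cluster.
import ray

ray.init(address="auto")  # attach to the already-running Ray cluster

print(ray.cluster_resources())    # total resources registered with the cluster
print(ray.available_resources())  # resources not currently claimed by tasks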
Submit the following job to the Ray Cluster.

The code below starts 15 tasks, each requesting 1 CPU core. By default, the head pod of the Ray Cluster is created with --num-cpus=0, which means no tasks are scheduled on it, and each worker pod defaults to 1 CPU core and 1 GB of memory. Scheduling the 15 tasks therefore requires scaling out 15 worker pods. Because node resources in the ACK cluster are insufficient, the Pending pods automatically trigger ACK node auto scaling.

import time
import socket

import ray

ray.init()

@ray.remote(num_cpus=1)
def get_task_hostname():
    # Each task holds 1 CPU core for two minutes, then returns the IP
    # address of the pod it ran on.
    time.sleep(120)
    host = socket.gethostbyname(socket.gethostname())
    return host

object_refs = []
for _ in range(15):
    object_refs.append(get_task_hostname.remote())

# Block until at least one task finishes, then fetch all results.
ray.wait(object_refs)

for t in object_refs:
    print(ray.get(t))
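Because each task returns the IP address of the pod it ran on, you can optionally tally the results to confirm that the tasks were spread across the scaled-out workers. This small follow-up is illustrative and not part of the original job; it reuses the object_refs list from the script above, so append it to the same script.

# Optional follow-up (not in the original job): count how many tasks
# ran on each worker pod IP.
from collections import Counter

hosts = ray.get(object_refs)  # object_refs comes from the job script above
for host, count in Counter(hosts).items():
    print(f"{host}: {count} task(s)")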
Run the following command to watch the pod status in the Ray Cluster.

kubectl get pod -n ${RAY_CLUSTER_NS} -w

Expected output:

NAME                                           READY   STATUS    RESTARTS   AGE
myfirst-ray-cluster-head-kvvdf                 2/2     Running   0          47m
myfirst-ray-cluster-worker-workergroup-btgmm   1/1     Running   0          30s
myfirst-ray-cluster-worker-workergroup-c2lmq   0/1     Pending   0          30s
myfirst-ray-cluster-worker-workergroup-gstcc   0/1     Pending   0          30s
myfirst-ray-cluster-worker-workergroup-hfshs   0/1     Pending   0          30s
myfirst-ray-cluster-worker-workergroup-nrfh8   1/1     Running   0          30s
myfirst-ray-cluster-worker-workergroup-pjbdw   0/1     Pending   0          29s
myfirst-ray-cluster-worker-workergroup-qxq7v   0/1     Pending   0          30s
myfirst-ray-cluster-worker-workergroup-sm8mt   1/1     Running   0          30s
myfirst-ray-cluster-worker-workergroup-wr87d   0/1     Pending   0          30s
myfirst-ray-cluster-worker-workergroup-xc4kn   1/1     Running   0          30s
...

Run the following command to watch the node status.
kubectl get node -w

Expected output:

cn-hangzhou.172.16.0.204   Ready      <none>   44h   v1.24.6-aliyun.1
cn-hangzhou.172.16.0.17    NotReady   <none>   0s    v1.24.6-aliyun.1
cn-hangzhou.172.16.0.17    NotReady   <none>   0s    v1.24.6-aliyun.1
cn-hangzhou.172.16.0.17    NotReady   <none>   0s    v1.24.6-aliyun.1
cn-hangzhou.172.16.0.17    NotReady   <none>   1s    v1.24.6-aliyun.1
cn-hangzhou.172.16.0.17    NotReady   <none>   11s   v1.24.6-aliyun.1
cn-hangzhou.172.16.0.16    NotReady   <none>   10s   v1.24.6-aliyun.1
cn-hangzhou.172.16.0.16    NotReady   <none>   14s   v1.24.6-aliyun.1
cn-hangzhou.172.16.0.17    NotReady   <none>   31s   v1.24.6-aliyun.1
cn-hangzhou.172.16.0.17    NotReady   <none>   60s   v1.24.6-aliyun.1
cn-hangzhou.172.16.0.17    Ready      <none>   61s   v1.24.6-aliyun.1
cn-hangzhou.172.16.0.16    Ready      <none>   64s   v1.24.6-aliyun.1
...
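You can also confirm the scale-out from Ray's own view of the cluster. The sketch below is an added illustration rather than part of the original walkthrough; run it on the head pod, where each running head or worker pod registers as one Ray node.

# Illustrative sketch: list the alive Ray nodes to confirm the scale-out.
import ray

ray.init(address="auto")  # attach to the running cluster from the head pod

alive_nodes = [node for node in ray.nodes() if node["Alive"]]
print(f"{len(alive_nodes)} Ray node(s) alive")
for node in alive_nodes:
    print(node["NodeManagerAddress"], node["Resources"].get("CPU", 0), "CPUs")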
Related documentation
You can access Ray's visual web UI, the Ray Dashboard, from your local machine. For more information, see Access the Ray Dashboard locally.