Ray is an open-source unified framework for scaling AI and Python applications. Ray is widely adopted in the machine learning sector. You can quickly create a Ray cluster in a Container Service for Kubernetes (ACK) cluster and integrate the Ray cluster with Simple Log Service, Managed Service for Prometheus, and ApsaraDB for Redis to optimize log management, observability, and availability. The Ray autoscaler can work with the ACK autoscaler to improve the efficiency of computing resource scaling and increase the resource utilization.
What is Ray?
Ray is an open-source unified framework for scaling AI and Python applications. It provides an API to simplify distributed computing to help you efficiently develop parallel processing and distributed Python applications. Ray is widely adopted in the machine learning sector. The unified computing framework of Ray consists of the Ray AI Libraries, Ray Core, and Ray Clusters layers.
Ray on Kubernetes
The KubeRay operator provides a Kubernetes-native way to manage Ray clusters. You can use the KubeRay operator to deploy Ray clusters in Kubernetes environments, including ACK clusters. When you install the KubeRay operator, you need to deploy the operator Deployment and the RayCluster, RayJob, and RayService CustomResourceDefinitions (CRDs).
Ray on Kubernetes can greatly simplify the deployment and management of distributed applications. Ray on Kubernetes provides the following benefits. For more information, see Ray on Kubernetes.
Auto scaling: Kubernetes can automatically scale the number of nodes based on workloads. After you deploy the Ray autoscaler in Kubernetes, Kubernetes can dynamically scale a Ray cluster based on workloads, optimize the resource utilization, and simplify the management of distributed applications.
Fault tolerance: Ray is designed with fault tolerance. This capability is enhanced when Ray runs on Kubernetes. When a Ray node fails, Kubernetes automatically replaces the faulty node to ensure the stability and availability of the Ray cluster.
Resource management: In Kubernetes, you can create resource requests and limits to control and manage resources, such as CPU and memory resources, used by Ray nodes in a fine-grained manner. This helps improve resource utilization and avoid resource waste.
Simple deployment: Kubernetes provides a unified system for deploying, managing, and monitoring containerized applications. Ray on Kubernetes provides a consistent experience for configuring and managing Ray clusters in development, staging, and production environments.
Service discovery and load balancing: Kubernetes supports service discovery and load balancing. You can use Kubernetes to automatically manage connections between Ray nodes and connections between clients and a Ray cluster. This helps simplify network configuration and improve network performance.
Multi-tenancy: You can use namespaces in Kubernetes to isolate Ray clusters that belong to different users or teams and share resources in a Kubernetes cluster.
Monitoring and logging: Kubernetes provides observability capabilities for monitoring and logging, which allow you to trace the status and performance of your Ray cluster. For example, you can use Prometheus and Grafana to collect the performance metrics of Ray clusters.
Compatibility: Kubernetes is the core of cloud-native ecosystems. It is compatible with multiple cloud service providers and technology stacks. After you deploy a Ray cluster in Kubernetes, you can migrate or scale the cluster across different cloud computing platforms or hybrid cloud environments.
Ray on ACK
Container Service for Kubernetes (ACK) is one of the first services to participate in the Certified Kubernetes Conformance Program in the world. ACK provides high-performance containerized application management services and supports lifecycle management for enterprise-class containerized applications. You can use KubeRay to create Ray clusters in ACK clusters in the same way you create ACK clusters in the cloud.
Your Ray cluster can work with Simple Log Service, Managed Service for Prometheus, and ApsaraDB for Redis to improve log management, observability, and availability.
You can use a combination of the ray autoscaler and ACK autoscaler to scale computing resources on demand.
Billing
After you create a Ray cluster in an ACK cluster, you can use Simple Log Service, Managed Service for Prometheus, and ApsaraDB for Redis to improve log management, observability, and availability. In addition to fees incurred by ACK, you must also pay for other resources. For more information about billing, see the following topics:
Managed Service for Prometheus: Billing
Simple Log Service: Billing overview
ApsaraDB for Redis: Billable items
Prerequisites
An ACK Pro cluster is created. The cluster meets the following requirements.
For more information about how to create an ACK cluster, see Create an ACK managed cluster. For more information about how to update an ACK cluster, see Update an ACK cluster.
The Kubernetes version of the cluster is v1.24 or later.
Node specifications: A node that provides at least 8 vCPUs and 32 GB of memory is created.
You can use the recommended minimum specifications in a staging environment. You can configure GPU-accelerated nodes on demand. For more information about Elastic Compute Service (ECS) instance types, see Overview of instance families.
Simple Log Service is enabled for the cluster.
Manage Service for Prometheus is enabled for the cluster.
A kubectl client is connected to the cluster. For more information, see Obtain the kubeconfig file of a cluster and use kubectl to connect to the cluster.
(Optional) An ApsaraDB for Redis instance is created. The instance meets the following requirements.
To deploy a Ray cluster that supports high availability and fault tolerance, an ApsaraDB for Redis instance is used in this example. You can choose to create an ApsaraDB for Redis instance on demand.
The ApsaraDB for Redis instance must be deployed in the same region and virtual private cloud (VPC) of the ACK Pro cluster. For more information, see Step 1: Create an ApsaraDB for Redis instance.
Add a whitelist to allow access from the VPC CIDR block. For more information, see Step 2: Configure whitelists.
Obtain the endpoint of the ApsaraDB for Redis instance. We recommend that you use the VPC endpoint. For more information, see View endpoints.
Obtain the password of the ApsaraDB for Redis instance. For more information, see Change or reset the password.
Step 1: Deploy the ack-kuberay-operator component
The ack-kuberay-operator component is available in the marketplace of the ACK console. This component integrates the open source KubeRay component, enhances the KubeRay component, and minimizes the permissions of the KubeRay component.
Deploy ack-kuberay-operator
Log on to the ACK console. In the left-side navigation pane, choose .
On the Marketplace page, click the Big Data/AI tab and click ack-kuberay-operator. On the ack-kuberay-operator page, click Deploy.
On the Basic Information wizard page, select a cluster and click Next.
On the Parameters tab, confirm Chart Version and Parameters, and click OK.
After ack-kuberay-operator is deployed, you are redirected to the Helm page. You can view the Helm information of ack-kuberay-operator on this page.
Check whether ack-kuberay-operator runs as normal
Run the following command to verify that the status of the operator pod in the kuberay-system
namespace is running
:
kubectl get pod -n kuberay-system
The output indicates that ack-kuberay-operator is installed.
NAME READY STATUS RESTARTS AGE
ack-kuberay-operator-88c879859-f467l 1/1 Running 0 3m48s
Step 2: Use ack-ray-cluster to deploy a Ray cluster
The ack-ray-cluster component provides additional Value configurations to allow you to integrate with Alibaba Cloud services, such as Simple Log Service, Managed Service for Prometheus, and ApsaraDB for Redis. You can use ack-ray-cluster to deploy a Ray cluster to run Ray tasks.
In this step, a Ray cluster named myfirst-ray-cluster is created in the raycluster namespace.
Run the following command to add an aliyunhub helm repo source:
helm repo add aliyunhub https://aliacs-app-catalog.oss-cn-hangzhou.aliyuncs.com/charts-incubator/ helm repo update aliyunhub
Run the following command to install the
ack-ray-cluster
chart:helm search repo ack-ray-cluster
The output indicates that the
ack-ray-cluster
chart is installed.NAME CHART VERSION APP VERSION DESCRIPTION aliyunhub/ack-ray-cluster 1.0.0 1.0.0 A ray cluster for Alibaba Cloud
NoteBy default, the KubeRay autoscaler is enabled for ack-ray-cluster. By default, Ray clusters use the official image
rayproject/ray:2.7.0
provided by the Ray community. For other settings, refer to the Value configurations in theack-ray-cluster
chart.Run the following command to configure environment variables:
export RAY_CLUSTER_NAME='myfirst-ray-cluster' export RAY_CLUSTER_NS='raycluster'
Run the following command to create a namespace:
kubectl create ns ${RAY_CLUSTER_NS}
The output indicates that the namespace is created.
namespace/raycluster created
Run the following command to create a Ray cluster in the
${RAY_CLUSTER_NS}
namespace:helm install ${RAY_CLUSTER_NAME} aliyunhub/ack-ray-cluster -n ${RAY_CLUSTER_NS}
The output indicates that the Ray cluster is created.
NAME: myfirst-ray-cluster LAST DEPLOYED: Tue Feb 6 09:48:29 2024 NAMESPACE: raycluster STATUS: deployed REVISION: 1 TEST SUITE: None
Run the following command to check whether the Ray cluster, Service, and pods are created.
Query the Ray cluster.
kubectl get rayclusters.ray.io -n ${RAY_CLUSTER_NS}
The output indicates that the Ray cluster is created.
NAME DESIRED WORKERS AVAILABLE WORKERS STATUS AGE myfirst-ray-cluster 49s
Query Services.
kubectl get svc -n ${RAY_CLUSTER_NS}
The output indicates that a ClusterIP Service is created.
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE myfirst-ray-cluster-head-svc ClusterIP 192.168.36.189 <none> 10001/TCP,8265/TCP,8080/TCP,6379/TCP,8000/TCP 82s
Query pods.
kubectl get pod -n ${RAY_CLUSTER_NS}
The output indicates that pods are created for the Ray cluster.
NAME READY STATUS RESTARTS AGE ray-cluster-01-head-zx88p 2/2 Running 0 27s ray-cluster-01-worker-workergroup-nt9wv 1/1 Running 0 27s
(Optional) Run the following command to create another Ray cluster named
mysecond-ray-cluster
in thedefault
namespace.NoteYou can use the
ack-ray-cluster
component to create multiple Ray clusters in an ACK cluster.helm install mysecond-ray-cluster aliyunhub/ack-ray-cluster
(Optional) Step 3: Integrate Ray clusters with Simple Log Service
You can integrate Simple Log Service with a Ray cluster to persist logs.
Run the following command to create a global AliyunLogConfig object to enable the Logtail component in the ACK cluster to collect logs generated by the pods of Ray clusters and deliver the logs to a Simple Log Service project.
cat <<EOF | kubectl apply -f - apiVersion: log.alibabacloud.com/v1alpha1 kind: AliyunLogConfig metadata: name: rayclusters namespace: kube-system spec: # The name of the Logstore. If the specified Logstore does not exist, Simple Log Service automatically creates a Logstore. logstore: rayclusters # Configure Logtail. logtailConfig: # The type of data source. If you want to collect text logs, you must set the value to file. inputType: file # The name of the Logtail configuration. The name must be the same as the resource name that is specified in metadata.name. configName: rayclusters inputDetail: # Configure Logtail to collect text logs in simple mode. logType: common_reg_log # The path of the log file. logPath: /tmp/ray/session_*-*-*_*/logs # The name of the log file. You can use wildcard characters such as asterisks (*) and question marks (?) when you specify the log file name. Example: log_*.log. filePattern: "*.*" # If you want to collect container text logs, you must set dockerFile to true. dockerFile: true # The conditions that are used to filter containers. advanced: k8s: IncludeK8sLabel: ray.io/is-ray-node: "yes" ExternalK8sLabelTag: ray.io/cluster: "_raycluster_name_" ray.io/node-type : "_node_type_" EOF
Parameters
Description
logPath
Collect all logs in the
/tmp/ray/session_*-*-*_*/logs
directory of the pods. You can specify a custom path.advanced.k8s.ExternalK8sLabelTag
Add tags to the collected logs for log retrieval. By default, the
_raycluster_name_
and_node_type_
tags are added.For more information about the AliyunLogConfig parameters, see Use CRDs to collect container logs in DaemonSet mode. Simple Log Service is a paid service. For more information, see Billing overview.
Log on to the ACK console and click Clusters in the left-side navigation pane.
On the Clusters page, click the name of the cluster that you want to manage and click Cluster Information in the left-side navigation pane.
On the Cluster Information page, click the Cluster Resources tab. Click the hyperlink next to Log Service Project to access the Simple Log Service project.
Select the Logstore that corresponds to
rayclusters
and view the log content.You can view the logs of different Ray clusters based on tags, such as
_raycluster_name_
.
(Optional) Step 4: Integrate Ray clusters with Managed Service for Prometheus
The ack-ray-cluster component is integrated with Managed Service for Prometheus. To use the Ray cluster monitoring feature, perform the following steps to install ack-ray-cluster. For more information, see Managed Service for Prometheus.
Run the following command to install ack-ray-cluster and set
armsPrometheus.enable
in the values configuration totrue
.helm uninstall ${RAY_CLUSTER_NAME} -n ${RAY_CLUSTER_NS} helm install ${RAY_CLUSTER_NAME} aliyunhub/ack-ray-cluster -n ${RAY_CLUSTER_NS} --set armsPrometheus.enable=true
Log on to the ARMS console.
In the left-side navigation pane, click Integration Center. On the Infrastructure page, find and select Ray. In the Ray panel, select the ACK cluster that you created and click OK.
After the ACK cluster is integrated with Managed Service for Prometheus, click Integration Management to go to the Integration Management page.
On the Component Management tab, click Dashboards in the Component Type section and click Ray Cluster.
Specify Namespace, RayClusterName, and SessionName to filter the monitoring data of tasks that run in the Ray clusters.
(Optional) Step 5: Enable fault tolerance
The Global Control Service (GCS) component is used to manage the metadata of Ray clusters. GCS stores all data in memory, which is not fault-tolerant. When GCS is down, the entire Ray cluster fails. To enable fault tolerance for GCS, you need to create highly-available ApsaraDB for Redis instance for the Ray cluster. When GCS restarts, GCS can retrieve all data from the ApsaraDB for Redis instance so that GCS can function as normal. The ack-ray-cluster component supports integration with ApsaraDB for Redis to enable fault tolerance for GCS. For more information, see KubeRay GCS Fault Toleration Config and GCS Fault Toleration in KubeRay.
Run the following command to create a Secret to store the endpoint and password of the ApsaraDB for Redis instance.
Replace
REDIS_PASSWORD
andRAY_REDIS_ADDRESS
with the endpoint and password of the ApsaraDB for Redis instance. SpecifyRAY_REDIS_ADDRESS
in theredis://{Endpoint of the ApsaraDB for Redis instance}:6379
format. For more information about how to obtain the password of the ApsaraDB for Redis instance, see Change or reset the password.export REDIS_PASSWORD='your redis password' export RAY_REDIS_ADDRESS='redis://<Endpoint of the ApsaraDB for Redis instance>:6379' kubectl create secret generic ${RAY_CLUSTER_NAME}-raycluster-redis -n ${RAY_CLUSTER_NS} --from-literal=address=${RAY_REDIS_ADDRESS} --from-literal=password=${REDIS_PASSWORD}
NoteThe Secret is named in the
${RAY_CLUSTER_NAME}-raycluster-redis
format. Do not change the Secret name.Run the following command to create a Ray cluster:
helm uninstall ${RAY_CLUSTER_NAME} -n ${RAY_CLUSTER_NS} helm install ${RAY_CLUSTER_NAME} aliyunhub/ack-ray-cluster -n ${RAY_CLUSTER_NS} --set armsPrometheus.enable=true --set gcsFaultTolerance.enable=true
When
gcsFaultTolerance.enable
is set totrue
, theray.io/ft-enabled: "true"
annotation is added to the Ray cluster to enable GCS fault tolerance and mount the Secret to the Ray cluster through environment parameters.Run the following command to query the Ray cluster:
kubectl get rayclusters.ray.io ${RAY_CLUSTER_NAME} -n ${RAY_CLUSTER_NS} # Expected output: NAME DESIRED WORKERS AVAILABLE WORKERS STATUS AGE myfirst-ray-cluster 0 0 1 11m
Run the following command to query the pods of the Ray cluster:
kubectl get pod -n ${RAY_CLUSTER_NS} # Expected output: NAME READY STATUS RESTARTS AGE myfirst-ray-cluster-head-vrltd 2/2 Running 0 12m
View data on the ApsaraDB for Redis instance. For more information, see Manage ApsaraDB for Redis instances by using DMS.
The figure indicates that the GCS information of the Ray cluster is stored on the ApsaraDB for Redis instance. When you delete the Ray cluster, the relevant GCS information is deleted from the ApsaraDB for Redis instance.
References
You can access Ray Dashboard from the local network. For more information, see Access Ray Dashboard from the local network.
For more information about how to submit jobs in a Ray cluster, see Submit a Ray job.
For more information about how to use the Ray autoscaler to automatically scale ECS nodes, see Elastic scaling based on the Ray autoscaler and ACK autoscaler.
For more information about how to use the Ray autoscaler to automatically scale virtual Elastic Container Instance nodes, see Elastic scaling of Elastic Container Instance nodes based on the Ray autoscaler.