Deploy a Ray Cluster on Container Service for Kubernetes (ACK)

Ray is an open-source framework for building scalable AI and Python applications, widely used in the machine learning field. This guide shows how to deploy a Ray Cluster on Alibaba Cloud Container Service for Kubernetes (ACK).

Create a cluster

Create an ACK Managed Cluster Pro that meets the following requirements. To create a cluster, see Create an ACK managed cluster. To upgrade an existing cluster, see Manually upgrade a cluster.

Cluster version: v1.24 or later.
Instance type: At least one node with a minimum of 8 vCPUs and 32 GB of memory.
These minimum specifications are intended for a test environment. For production environments, use specifications that match your actual workload. If you require GPU acceleration, configure GPU-accelerated nodes.
For more information about supported ECS instance types, see Instance family.
kubectl: Installed on your local machine and connected to the cluster. For more information, see Obtain the KubeConfig file of a cluster and connect to the cluster by using kubectl.
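
The version requirement above can also be checked from a script. The following is a minimal sketch, not part of the official procedure. meets_min_version is a hypothetical helper that parses a version string such as v1.24.6-aliyun.1 (an illustrative value) and succeeds when the version is v1.24 or later. Reading the version from the first node's kubelet, as in the commented usage lines, is an assumption made for convenience; the ACK console remains the authoritative source for the cluster version.

```shell
# Hypothetical helper: succeeds (exit status 0) if the version string in $1
# is Kubernetes v1.24 or later. The body runs in a subshell so that its
# variables do not leak into the calling shell.
meets_min_version() (
  version="${1#v}"          # drop the leading "v"
  major="${version%%.*}"    # text before the first dot
  rest="${version#*.}"      # text after the first dot
  minor="${rest%%[!0-9]*}"  # leading digits of the minor field
  case "$major" in ''|*[!0-9]*) exit 1 ;; esac   # major must be numeric
  [ -n "$minor" ] || exit 1                      # minor must be present
  [ "$major" -gt 1 ] && exit 0
  [ "$major" -eq 1 ] && [ "$minor" -ge 24 ]
)

# Usage (assumes kubectl is already configured for the cluster):
# v="$(kubectl get nodes -o jsonpath='{.items[0].status.nodeInfo.kubeletVersion}')"
# meets_min_version "$v" && echo "version OK: $v"
```

The helper only compares the major and minor fields, so vendor suffixes after the patch number do not affect the result.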

(Optional) Create an ApsaraDB for Tair instance

To provide fault tolerance and high availability for the Ray Cluster, create an Alibaba Cloud ApsaraDB for Tair (Redis-compatible) instance that meets the following requirements.

The ApsaraDB for Tair instance must be in the same region and Virtual Private Cloud (VPC) as the ACK Managed Cluster Pro. For more information, see Step 1: Create an instance.
Add a whitelist group to allow access from the VPC CIDR block. For more information, see Step 2: Configure a whitelist.
Obtain the connection address of the Tair instance (a VPC endpoint is recommended). For more information, see View connection addresses.
Obtain the password for the Tair instance. For more information, see Change or reset the password.

Install the Kuberay-Operator component

Log on to the Container Service for Kubernetes (ACK) console. In the left-side navigation pane, click Clusters.
Click the name of your target cluster.
Navigate to Operations > Add-ons > Manage Applications, then click Install under Kuberay-Operator.
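
After the installation finishes, one way to confirm from the command line that the operator is in place is to check for the RayCluster custom resource definition. This is a minimal sketch under stated assumptions: the CRD name rayclusters.ray.io is inferred from the ray.io/v1 RayCluster kind used in the manifest in the next section, and operator_crd_present and the KUBECTL variable are hypothetical names introduced for illustration.

```shell
# KUBECTL can be overridden, for example to point at a different binary.
KUBECTL="${KUBECTL:-kubectl}"

# Hypothetical helper: succeeds if the RayCluster CRD is registered
# in the cluster that kubectl is currently connected to.
operator_crd_present() {
  "$KUBECTL" get crd rayclusters.ray.io >/dev/null 2>&1
}

# Usage (requires cluster access):
# if operator_crd_present; then
#   echo "RayCluster CRD found"
# else
#   echo "RayCluster CRD missing"
# fi
```

The check only shows that the CRD exists. It does not prove that the operator pod itself is running, so the console status is still worth a look.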

Deploy the Ray Cluster

Important

Solution for Docker Hub pull failures

Image pulls from Docker Hub may fail because of network instability, such as issues with carrier networks. Use images hosted on Docker Hub with caution in production environments. This example uses the official Ray image rayproject/ray:2.36.1. If you cannot pull this image, use one of the following solutions:

Subscribe to images from registries outside the Chinese mainland through Container Registry. For more information, see Subscribe to images outside China.
To directly pull images from overseas sources, create a Global Accelerator instance and use its global network acceleration service. For more information, see Use GA to accelerate cross-domain container image pulling in ACK.

Run the following commands to create a Ray Cluster named myfirst-ray-cluster and check its deployment status.

Run the following command to create the Ray Cluster resource.

cat <<EOF | kubectl apply -f -
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: myfirst-ray-cluster
  namespace: default
spec:
  suspend: false
  autoscalerOptions:
    env: []
    envFrom: []
    idleTimeoutSeconds: 60
    imagePullPolicy: Always
    resources:
      limits:
        cpu: 2000m
        memory: 2024Mi
      requests:
        cpu: 2000m
        memory: 2024Mi
    securityContext: {}
    upscalingMode: Default
  enableInTreeAutoscaling: false
  headGroupSpec:
    rayStartParams:
      dashboard-host: 0.0.0.0
      num-cpus: "0"
    serviceType: ClusterIP
    template:
      spec:
        containers:
        - image: rayproject/ray:2.36.1
          imagePullPolicy: Always
          name: ray-head
          resources:
            limits:
              cpu: "4"
              memory: 4G
            requests:
              cpu: "1"
              memory: 1G
  workerGroupSpecs:
  - groupName: work1
    maxReplicas: 1000
    minReplicas: 0
    numOfHosts: 1
    rayStartParams: {}
    replicas: 1
    template:
      spec:
        containers:
        - image: rayproject/ray:2.36.1
          imagePullPolicy: Always
          name: ray-worker
          resources:
            limits:
              cpu: "4"
              memory: 4G
            requests:
              cpu: "4"
              memory: 4G
EOF

Run the following commands to check the deployment status.

Check the status of the Ray Cluster.

kubectl get raycluster

Expected output:

NAME                  DESIRED WORKERS   AVAILABLE WORKERS   CPUS   MEMORY   GPUS   STATUS   AGE
myfirst-ray-cluster   1                 1                   5      5G       0      ready    4m19s
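
The cluster can take a few minutes to reach the ready state, so a script may need to poll rather than check once. The following is a minimal sketch of such a poll. wait_for_output is a hypothetical helper, and the commented usage line assumes that the .status.state field of the RayCluster resource carries the value shown in the STATUS column; verify this against your operator version before relying on it.

```shell
# Hypothetical helper: runs the given command once per second until its
# output equals the expected value or the timeout (in seconds) is reached.
# The body runs in a subshell so that its variables stay private.
#   $1 = expected output, $2 = timeout in seconds, remaining args = command
wait_for_output() (
  expected="$1"
  timeout="$2"
  shift 2
  elapsed=0
  while :; do
    if [ "$("$@" 2>/dev/null)" = "$expected" ]; then
      exit 0
    fi
    if [ "$elapsed" -ge "$timeout" ]; then
      exit 1
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
)

# Usage (requires cluster access; waits up to 300 seconds):
# wait_for_output ready 300 \
#   kubectl get raycluster myfirst-ray-cluster -o jsonpath='{.status.state}'
```

The helper counts iterations rather than wall-clock time, so the effective timeout is slightly longer than the nominal value when the polled command itself is slow.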

Check the pods for the Ray Cluster.

kubectl get pod

Expected output:

NAME                                     READY   STATUS    RESTARTS   AGE
myfirst-ray-cluster-head-5q2hk           1/1     Running   0          4m37s
myfirst-ray-cluster-work1-worker-zkjgq   1/1     Running   0          4m31s

Check the services for the Ray Cluster.

kubectl get svc

Expected output:

NAME                           TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)                                         AGE
kubernetes                     ClusterIP   192.168.0.1   <none>        443/TCP                                         21d
myfirst-ray-cluster-head-svc   ClusterIP   None          <none>        10001/TCP,8265/TCP,8080/TCP,6379/TCP,8000/TCP   6m57s
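
With the pods and the head service running, a quick follow-up check is to query Ray itself from the head pod or to open the dashboard on port 8265, which appears in the service output above. The following is a sketch under stated assumptions: head_pod and the KUBECTL variable are hypothetical names, and the labels ray.io/cluster and ray.io/node-type are assumed to be set on the pods by the operator. You can confirm the labels with kubectl get pod --show-labels.

```shell
# KUBECTL can be overridden, for example to point at a different binary.
KUBECTL="${KUBECTL:-kubectl}"
CLUSTER_NAME="myfirst-ray-cluster"

# Hypothetical helper: prints the first head pod of the cluster
# in the form "pod/<name>". The label names are assumptions.
head_pod() {
  "$KUBECTL" get pod -o name \
    -l "ray.io/cluster=${CLUSTER_NAME},ray.io/node-type=head" | head -n 1
}

# Usage (requires cluster access):
# Show the resources that Ray sees from inside the head pod.
# "$KUBECTL" exec "$(head_pod)" -- ray status
# Forward the dashboard port to http://localhost:8265 on your machine.
# "$KUBECTL" port-forward "$(head_pod)" 8265:8265
```

Selecting the pod by label instead of by name avoids hard-coding the random suffix that appears in the pod names.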