Container Service for Kubernetes: Install KubeRay in ACK

Last Updated: Mar 26, 2026

Install the KubeRay Operator on an ACK Managed Cluster Pro and connect it to Simple Log Service and Managed Service for Prometheus. This gives you centralized log storage and metrics dashboards for Ray clusters running on your cluster.

Prerequisites

Before you begin, ensure that you have an ACK Managed Cluster Pro and kubectl access to it.

If you need to create or upgrade a cluster, see Create an ACK managed cluster or Manually upgrade a cluster.

Install KubeRay Operator

Important

KubeRay Operator is in invitational preview. To get access, submit a ticket.

  1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

  2. Click the name of your cluster.

  3. On the cluster details page, choose Operations > Add-ons > Manage Applications.

  4. Find and install Kuberay-Operator.

Verify: After installation, run the following command to confirm the operator pod is running:

kubectl get pods -A | grep kuberay

The operator pod should appear with a Running status. If the pod stays in Pending, run kubectl describe pod <pod-name> -n <namespace> to check for scheduling errors, such as insufficient node resources.
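The check above can be wrapped in a small shell function that prints only the KubeRay pods that are not yet Running. This is a sketch rather than part of the ACK tooling: the function name kuberay_not_running is hypothetical, and it assumes the default column layout of kubectl get pods -A (NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE).

```shell
# Hypothetical helper: reads `kubectl get pods -A` output on stdin and prints
# "<namespace>/<pod> <status>" for every kuberay pod whose STATUS is not Running.
# Assumes the default columns: NAMESPACE NAME READY STATUS RESTARTS AGE.
kuberay_not_running() {
  awk '$2 ~ /kuberay/ && $4 != "Running" { print $1 "/" $2, $4 }'
}

# Usage against a live cluster (requires kubectl access):
#   kubectl get pods -A | kuberay_not_running
```

Empty output means every matching pod reports Running. Any line that is printed gives the namespace and pod name to pass to kubectl describe pod.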

Enable log collection for KubeRay Operator

Collect control-plane logs from the KubeRay Operator itself.

  1. On the cluster details page, choose Operations > Log Center > Control Plane Component Logs.

  2. Click Enable Component Log Collection.

  3. Select kuberay-operator from the drop-down list.

Enable log collection for Ray clusters

Collect logs from Ray cluster pods and forward them to Simple Log Service. The following steps configure Logtail (via the AliyunLogConfig custom resource) to watch the Ray log directory across all Ray pods and tag each log entry with the cluster name and node type.

Note

Simple Log Service is a paid service. See Billing overview for pricing details.

  1. Apply the following manifest to create an AliyunLogConfig object in the kube-system namespace:

    cat <<EOF | kubectl apply -f -
    apiVersion: log.alibabacloud.com/v1alpha1
    kind: AliyunLogConfig
    metadata:
      name: rayclusters
      namespace: kube-system
    spec:
      # The name of the Logstore. If the specified Logstore does not exist, Simple Log Service automatically creates one.
      logstore: rayclusters
      # Configure Logtail.
      logtailConfig:
        # The type of data source. If you want to collect text logs, you must set the value to file.
        inputType: file
        # The name of the Logtail configuration. The name must be the same as the resource name that is specified in metadata.name.
        configName: rayclusters
        inputDetail:
          # Configure Logtail to collect text logs in simple mode.
          logType: common_reg_log
          # The path of the log file.
          logPath: /tmp/ray/session_*-*-*_*/logs
          # The name of the log file. You can use wildcard characters such as asterisks (*) and question marks (?) when you specify the log file name. Example: log_*.log.
          filePattern: "*.*"
          # If you want to collect container text logs, you must set dockerFile to true.
          dockerFile: true
          # The filter conditions of containers.
          advanced:
            k8s:
              IncludeK8sLabel:
                ray.io/is-ray-node: "yes"
              ExternalK8sLabelTag:
                ray.io/cluster: "_raycluster_name_"
                ray.io/node-type: "_node_type_"
    EOF

    Key parameters:

    Parameter | Description
    logPath | The log directory on each Ray pod. Matches all session directories under /tmp/ray/. Specify a custom path if your Ray cluster uses a different log location.
    advanced.k8s.ExternalK8sLabelTag | Adds _raycluster_name_ and _node_type_ tags to each log entry for filtering in Simple Log Service.

    For the full list of AliyunLogConfig parameters, see Use CRDs to collect container logs in DaemonSet mode.

  2. View logs in the ACK console:

    1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

    2. Click the name of your cluster.

    3. Choose Cluster Information > Basic Information > Cluster Resources.

    4. Click the link next to Log Service Project to open the Simple Log Service project.

  3. Select the Logstore named rayclusters to view log entries. Filter by the _raycluster_name_ tag to isolate logs from a specific Ray cluster.
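To see which directories the logPath value in the manifest from step 1 is meant to cover, the same glob can be tried against sample paths with a shell case statement. This is only an illustration: the helper name matches_ray_log_path and the sample session directory name are made up, and shell pattern matching may not reproduce Logtail's own wildcard semantics exactly.

```shell
# Hypothetical helper: succeeds if the given path matches the logPath glob
# from the AliyunLogConfig manifest. Shell pattern matching is used here as an
# approximation of how Logtail expands wildcards.
matches_ray_log_path() {
  case "$1" in
    /tmp/ray/session_*-*-*_*/logs) return 0 ;;
    *) return 1 ;;
  esac
}

# Example session directory name (made up for illustration).
matches_ray_log_path "/tmp/ray/session_2026-03-26_10-15-30_123456_1/logs" && echo "collected"
# A path outside the session directories does not match.
matches_ray_log_path "/tmp/ray/other/logs" || echo "skipped"
```

If your Ray cluster writes logs elsewhere, adjust both logPath in the manifest and the pattern in this sketch.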

Enable monitoring for Ray clusters

Collect Prometheus metrics from Ray cluster pods using a PodMonitor for worker nodes and a ServiceMonitor for the head node. Worker nodes are independent pods, not replicas managed by a ReplicaSet, so grouping them through a Kubernetes Service is not reliable; a PodMonitor selects them directly by pod label instead. The head node exposes metrics through a stable Service endpoint, which makes a ServiceMonitor the appropriate choice.

Note

Managed Service for Prometheus is a paid service. See Managed Service for Prometheus instance billing for pricing details. For setup instructions, see Connect to and configure Managed Service for Prometheus.

  1. Create a PodMonitor to collect metrics from Ray worker pods:

    apiVersion: monitoring.coreos.com/v1
    kind: PodMonitor
    metadata:
      annotations:
        arms.prometheus.io/discovery: 'true'
        arms.prometheus.io/resource: arms
      name: ray-workers-monitor
      namespace: arms-prom
      labels:
        # `release: $HELM_RELEASE`: Prometheus can only detect PodMonitor with this label.
        release: prometheus
        #ray.io/cluster: raycluster-kuberay # $RAY_CLUSTER_NAME: "kubectl get rayclusters.ray.io"
    spec:
      namespaceSelector:
        any: true
      jobLabel: ray-workers
      # Only select Kubernetes Pods with "matchLabels".
      selector:
        matchLabels:
          ray.io/node-type: worker
      # A list of endpoints allowed as part of this PodMonitor.
      podMetricsEndpoints:
      - port: metrics
        relabelings:
        - action: replace
          regex: (.+)
          replacement: $1
          separator: ;
          sourceLabels:
            - __meta_kubernetes_pod_label_ray_io_cluster
          targetLabel: ray_io_cluster

  2. Create a ServiceMonitor to collect metrics from the Ray head node:

    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      annotations:
        arms.prometheus.io/discovery: 'true'
        arms.prometheus.io/resource: arms
      name: ray-head-monitor
      namespace: arms-prom
      labels:
        # `release: $HELM_RELEASE`: Prometheus can only detect ServiceMonitor with this label.
        release: prometheus
    spec:
      namespaceSelector:
        any: true
      jobLabel: ray-head
      # Only select Kubernetes Services with "matchLabels".
      selector:
        matchLabels:
          ray.io/node-type: head
      # A list of endpoints allowed as part of this ServiceMonitor.
      endpoints:
        - port: metrics
          path: /metrics
      targetLabels:
      - ray.io/cluster

  3. Integrate with Application Real-Time Monitoring Service (ARMS) to view dashboards:

    1. Log on to the ARMS console. In the left-side navigation pane, click Integration Center.

    2. Search for Ray, then select it from the results.

    3. In the Ray panel, select your cluster and click OK.

    4. Click Integration Management in the left-side navigation pane, then click the target environment name.

    5. On the Component Management tab, find Dashboards in the Addon Type section, then click Ray Cluster.

    6. Set Namespace, RayClusterName, and SessionName to filter the monitoring data for a specific task.
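The relabeling rule in the PodMonitor from step 1 reads the source label __meta_kubernetes_pod_label_ray_io_cluster. Prometheus Kubernetes service discovery exposes each pod label under the __meta_kubernetes_pod_label_ prefix and converts characters that are not valid in a Prometheus label name to underscores, which is how ray.io/cluster becomes ray_io_cluster. The sketch below reproduces that mapping. The function name pod_meta_label is hypothetical, and the character rule is a simplification.

```shell
# Hypothetical helper: maps a Kubernetes pod label key to the meta label name
# that Prometheus service discovery exposes for it during relabeling.
pod_meta_label() {
  # Replace every character outside [A-Za-z0-9_] with an underscore.
  printf '__meta_kubernetes_pod_label_%s\n' "$(printf '%s' "$1" | sed 's/[^A-Za-z0-9_]/_/g')"
}

pod_meta_label "ray.io/cluster"    # prints __meta_kubernetes_pod_label_ray_io_cluster
pod_meta_label "ray.io/node-type"  # prints __meta_kubernetes_pod_label_ray_io_node_type
```

The same naming applies if you add relabelings for other Ray pod labels, such as ray.io/node-type.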

Verify: After applying the PodMonitor and ServiceMonitor, confirm that Prometheus is scraping your targets. In the ARMS console, both ray-workers-monitor and ray-head-monitor should appear as active scrape targets in the integrated Prometheus instance. If a target does not appear, check that the release: prometheus label is present on the PodMonitor and ServiceMonitor, and that the namespace selector matches the namespace where your Ray cluster is running.
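The label check described above can be scripted as a plain text test on the manifest. This is a minimal sketch under stated assumptions: the function name has_release_label is hypothetical, and it looks only for a release: prometheus line in the YAML it reads rather than parsing the document.

```shell
# Hypothetical helper: reads a PodMonitor or ServiceMonitor manifest (YAML) on
# stdin and succeeds only if it contains an indented `release: prometheus` line.
# This is a plain text check, not a YAML parser.
has_release_label() {
  grep -Eq '^[[:space:]]+release:[[:space:]]*prometheus[[:space:]]*$'
}

# Usage against a live cluster (requires kubectl access):
#   kubectl get podmonitor ray-workers-monitor -n arms-prom -o yaml | has_release_label \
#     || echo "release label missing on ray-workers-monitor"
#   kubectl get servicemonitor ray-head-monitor -n arms-prom -o yaml | has_release_label \
#     || echo "release label missing on ray-head-monitor"
```

Because the check is text based, it can also be run on the manifest files before you apply them.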

What's next

  • Deploy a RayCluster custom resource to start running distributed Ray workloads on your ACK cluster.

  • Use the _raycluster_name_ tag in Simple Log Service to correlate logs across multiple Ray clusters.

  • Set up alerting rules in ARMS based on the Ray cluster metrics collected by Managed Service for Prometheus.