
AnalyticDB: Managed Ray service

Last Updated: Mar 28, 2026

AnalyticDB Ray is a fully managed Ray service built on AnalyticDB for MySQL. Running distributed AI workloads in production requires handling cluster operations, resource scheduling, and system stability — complexity that open source Ray leaves to you. AnalyticDB Ray manages these concerns, so you can focus on building AI applications at scale.

Use cases

  • Multimodal processing — Process large volumes of mixed media (images, video, audio, and text) in parallel across distributed nodes.

  • Search and recommendation — Run real-time inference and ranking pipelines at scale for search relevance and personalized recommendations.

  • Financial risk control — Execute high-throughput, latency-sensitive risk scoring and fraud detection jobs across massive datasets.

  • Embodied intelligence — A validated scenario for AnalyticDB Ray deployments.

How it works

Each Ray cluster resource group contains two types of nodes:

  • Head node — manages Ray metadata, runs the Global Control Store (GCS) service, and schedules tasks. The head node does not execute tasks.

  • Worker nodes — execute tasks and actors. Worker nodes scale automatically based on job demand.

When you submit a job, AnalyticDB Ray schedules tasks to available worker nodes and stores intermediate data in Ray's distributed object store. If the head node restarts, a Redis-based disaster recovery mechanism recovers the cluster state, actors, and tasks automatically.

Prerequisites

Before you begin, ensure that you have:

  • An AnalyticDB for MySQL cluster (Enterprise Edition, Basic Edition, or Data Lakehouse Edition)

Create a Ray cluster resource group

  1. Log on to the AnalyticDB for MySQL console. In the upper-left corner, select a region. In the left-side navigation pane, click Clusters. Find the cluster and click the cluster ID.

  2. In the left-side navigation pane, choose Cluster Management > Resource Management. Click the Resource Groups tab. In the upper-right corner, click Create Resource Group.

  3. In the Create Resource Group panel, enter a resource group name, set Job Type to AI, and configure the following parameters.

    • Deployment Mode — Select RayCluster.
    • Head Resource Specifications — The head node manages Ray metadata, runs GCS, and schedules tasks. Options: small, m.xlarge, m.2xlarge. CPU core counts match Spark resource specifications; for details, see Spark resource specifications. Choose specifications based on the overall scale of your Ray cluster, because the head node handles all scheduling.
    • Worker Group Name — A name for this worker group. One resource group can contain multiple worker groups with different names.
    • Worker Resource Type — CPU: for daily computing, multitasking, or complex logic. GPU: for large-scale data parallel processing, machine learning, or deep learning training.
    • Worker Resource Specifications — CPU: small, m.xlarge, m.2xlarge (same core counts as Spark; see Spark resource specifications). GPU: submit a ticket, because GPU specifications depend on available models and inventory.
    • Worker Disk Storage — Disk space for Ray logs, temporary data, and distributed object store overflow. Unit: GB. Range: 30–2000. Default: 100. Disks are for temporary storage only; do not rely on them for persistent data.
    • Minimum Workers / Maximum Workers — The minimum number of worker nodes in this group (minimum: 1) and the maximum (maximum: 8). If the minimum and maximum differ, AnalyticDB Ray scales the worker count automatically based on the current job load. When multiple worker groups exist, AnalyticDB Ray distributes jobs across groups to avoid overloading or underutilizing any single group.
    • Distribution Unit — (GPU workers only) The number of GPUs allocated to each worker node. Example: 1/3.
  4. Click OK.
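Before clicking OK, the values can be sanity-checked against the documented limits (disk 30–2000 GB, workers 1–8). A minimal sketch; the function name is illustrative, not part of any AnalyticDB SDK:

```python
def validate_worker_group(disk_gb, min_workers, max_workers):
    """Check a worker-group configuration against the documented limits
    before creating the resource group. Returns a list of problems."""
    errors = []
    if not 30 <= disk_gb <= 2000:
        errors.append("Worker Disk Storage must be between 30 and 2000 GB")
    if min_workers < 1:
        errors.append("Minimum Workers must be at least 1")
    if max_workers > 8:
        errors.append("Maximum Workers must be at most 8")
    if min_workers > max_workers:
        errors.append("Minimum Workers cannot exceed Maximum Workers")
    return errors
```

An empty list means the configuration is within the documented ranges.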

Connect to and use the Ray service

Step 1: Get the endpoint URLs

  1. In the left-side navigation pane, choose Cluster Management > Resource Management. Click the Resource Groups tab.

  2. Find the resource group and choose More > Details in the Actions column.

    • Ray Grafana — The Grafana visualization page for monitoring the cluster.
    • Ray Cluster Endpoint — The internal endpoint URL for connecting to the cluster from within the VPC.
    • Ray Dashboard — The public dashboard URL (port 8265). View cluster status and job progress.

Step 2: Submit jobs

Prerequisites

Python 3.7 or later is installed.

Choose a submission method

Two methods are available. Use the Cloud Task Launcher (CTL) method unless you have a specific reason to run the driver locally.

  • Cloud Task Launcher (CTL) (recommended) — Packages and uploads your script to the Ray cluster. The driver runs in the cluster and consumes cluster resources. Simpler setup; no version-matching requirements.

  • ray.init — Connects a locally running driver to the Ray cluster. The driver runs on your local machine and does not consume cluster resources. Local Ray and Python versions must match the Ray cluster version; update your local environment when the cluster version changes.
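The version-matching requirement of the ray.init method can be checked up front. A hedged sketch; the cluster's Ray and Python versions shown here are assumptions, so read the real values from your resource group's Ray Dashboard:

```python
import sys

def check_client_compat(cluster_ray_version, cluster_python, local_ray_version):
    """Return a list of mismatch warnings before connecting with ray.init.

    cluster_ray_version / cluster_python are values you read from the
    cluster's Ray Dashboard; they are illustrative inputs here.
    """
    problems = []
    if local_ray_version != cluster_ray_version:
        problems.append(
            f"Ray version mismatch: local {local_ray_version} "
            f"!= cluster {cluster_ray_version}")
    if sys.version_info[:2] != tuple(cluster_python):
        problems.append(
            f"Python version mismatch: local "
            f"{sys.version_info[0]}.{sys.version_info[1]} "
            f"!= cluster {cluster_python[0]}.{cluster_python[1]}")
    return problems
```

An empty result means the local environment is safe for client-mode connections; any warning means you should update the local environment (or switch to CTL, which has no such requirement).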

Submit jobs using CTL

  1. Install Ray:

    pip3 install "ray[default]"
  2. (Optional) Set the RAY_ADDRESS environment variable to avoid specifying the URL in every command:

    export RAY_ADDRESS="RAY_URL"

    Replace RAY_URL with the URL obtained in Step 1.

  3. Submit a job:

    • If RAY_ADDRESS is set:

      ray job submit --working-dir <working-directory> -- python <script-file>
    • If RAY_ADDRESS is not set:

      ray job submit --address <ray-url> --working-dir <working-directory> -- python <script-file>
    • <ray-url> — The Ray URL from Step 1. Example: http://amv-uf64gwe14****-rayo.ads.aliyuncs.com:8265
    • <working-directory> — The directory that contains your script and all of its dependencies. Example: /root/Ray
    • <script-file> — The Python script to run. Example: scripts.py
    Important

    The system uploads everything in <working-directory> to the head node. Keep the directory minimal — large directories can cause upload failures. All dependency scripts must be in this directory.

    Example:

    ray job submit --address http://amv-uf64gwe14****-rayo.ads.aliyuncs.com:8265 --working-dir /root/Ray -- python scripts.py
  4. Check job status:

    • Run ray job list to list all jobs and their statuses.

    • Or view the Ray Dashboard: in the Resource Groups tab, find the resource group, choose More > Details, and click the Ray Dashboard URL.
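The CTL submission steps above can be sketched as a small pre-flight helper that measures the working directory (everything under it is uploaded) and assembles the `ray job submit` argv. The helper names and the 100 MB threshold are illustrative assumptions, not part of the Ray CLI or a documented limit:

```python
import os

# Assumed safety threshold for upload size; not a documented limit.
MAX_UPLOAD_BYTES = 100 * 1024 * 1024

def dir_size(path):
    """Total size in bytes of all files under path (what gets uploaded)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total

def build_submit_command(script, working_dir, address=None):
    """Assemble the `ray job submit` argv; omit --address when the
    RAY_ADDRESS environment variable is already set."""
    cmd = ["ray", "job", "submit"]
    if address:
        cmd += ["--address", address]
    cmd += ["--working-dir", working_dir, "--", "python", script]
    return cmd
```

Warning when `dir_size(working_dir)` exceeds the threshold before submitting helps avoid the upload failures that large directories can cause.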

Submit jobs using ray.init

  1. Install Ray:

    pip3 install ray
  2. Convert the Ray Dashboard URL to a Ray protocol URL. The dashboard URL uses port 8265; the ray.init() connection requires the ray:// protocol on port 10001.

    • Dashboard URL (from Step 1): http://amv-uf64gwe14****-rayo.ads.aliyuncs.com:8265
    • ray.init URL: ray://amv-uf64gwe14****-rayo.ads.aliyuncs.com:10001
  3. Connect and run your script:

    • Option A — set the RAY_ADDRESS environment variable, then run the script directly:

      export RAY_ADDRESS="ray://<host>:10001"
      python scripts.py
    • Option B — specify the address inside the script, then run it:

      # In scripts.py, before any other Ray calls:
      ray.init(address="ray://<host>:10001")

      # Then run the script from the shell:
      python scripts.py
    Important

    If the Ray URL is incorrect, ray.init() silently starts a local Ray cluster instead of connecting to the remote cluster. Check the output logs to confirm you are connected to the correct cluster.
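The port conversion in step 2 can be done with a small helper, which reduces the chance of the silent-local-cluster pitfall described above. A sketch using only the standard library; the function name is illustrative:

```python
from urllib.parse import urlparse

def dashboard_to_client_url(dashboard_url):
    """Convert a Ray Dashboard URL (http://host:8265) to the ray://
    client URL on port 10001 that ray.init() expects."""
    host = urlparse(dashboard_url).hostname
    if host is None:
        raise ValueError(f"cannot parse host from {dashboard_url!r}")
    return f"ray://{host}:10001"
```

Using a parser instead of string replacement avoids corrupting hostnames that happen to contain the substring "8265".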

Billing

Charges begin when you create a Ray cluster resource group. You are billed for:

  • Worker Disk Storage — Charged by the storage size specified in the Worker Disk Storage parameter.

  • CPU workers — Charged by the AnalyticDB compute unit (ACU) elastic resources used.

  • GPU workers — Charged by GPU specification and quantity.

Usage notes

Worker node restart and deletion

Modifying worker configurations restarts or deletes worker nodes. Schedule these changes during off-peak hours and avoid running jobs on nodes that are about to restart.

When a worker node restarts or is deleted:

  • Drivers, actors, and tasks running on the affected node fail. Ray automatically redeploys actors and tasks.

  • Data in Ray's distributed object store is lost. Jobs that depend on data from the restarted node also fail.

Resource group changes

  • Delete a resource group — Running tasks are interrupted immediately.

  • Delete a worker group — All worker nodes in the group are deleted. See Worker node restart and deletion.

  • Reduce Maximum Workers below the group's previous Minimum Workers — Worker nodes are deleted. See Worker node restart and deletion.

  • Change head resource specifications or the worker resource type — The head node or worker nodes restart. See Worker node restart and deletion.

Automatic scaling

Ray clusters scale based on logical resource requirements, not physical utilization. Scaling may be triggered even when physical resource usage is low.

Some third-party applications create as many tasks as possible to saturate available resources. When automatic scaling is enabled, this can quickly scale the cluster to its maximum size. Understand the task-creation behavior of any third-party program before enabling automatic scaling.
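Logical-resource scaling can be illustrated with a sketch: the group is sized by the CPUs that queued tasks request, not by measured utilization, then clamped to the configured range. This is an illustration of the concept, not AnalyticDB Ray's actual autoscaling algorithm:

```python
import math

def desired_workers(pending_task_cpus, cpus_per_worker, min_workers, max_workers):
    """Size a worker group by logical demand: the CPUs that pending
    tasks *request*, regardless of how busy existing workers are,
    clamped to the configured [min, max] range."""
    if pending_task_cpus > 0:
        needed = math.ceil(pending_task_cpus / cpus_per_worker)
    else:
        needed = 0
    return max(min_workers, min(max_workers, needed))
```

This also shows why a task-flooding application saturates the cluster: a large enough logical request pins the result at `max_workers` even when physical utilization is low.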