AnalyticDB Ray is a fully managed Ray service built into AnalyticDB for MySQL. It enhances open source Ray with production-grade kernel optimizations and simplified cluster operations, so you can run large-scale distributed AI workloads — multimodal processing, search recommendations, financial risk control, and embodied intelligence — directly on your data lakehouse without managing infrastructure.
Prerequisites
Before you begin, ensure that you have:
An AnalyticDB for MySQL Enterprise Edition, Basic Edition, or Data Lakehouse Edition cluster
What AnalyticDB Ray adds over open source Ray
Open source Ray is a distributed computing framework for AI and high-performance computing. It lets you scale single-node Python code to thousand-node clusters with minimal changes, and its built-in modules — Ray Tune, Ray Train, and Ray Serve — integrate seamlessly with TensorFlow and PyTorch.
Running open source Ray in production, however, introduces challenges: distributed job optimization, fine-grained resource scheduling, complex cluster operations, and ensuring system stability at scale.
AnalyticDB Ray addresses these challenges with the following enhancements:
| Enhancement | Description |
|---|---|
| Kernel performance optimization | Improved scheduling and execution efficiency for large-scale distributed workloads. |
| Simplified cluster operations | Managed head node and worker node lifecycle, so you focus on jobs rather than infrastructure. |
| Disaster recovery | A Redis-based mechanism that automatically recovers Ray clusters, actors, and tasks when the head node restarts. |
| Automatic scaling | Dynamic worker node scaling based on logical resource requirements, with multi-worker-group support to prevent overloading or underutilizing individual groups. |
| Data lakehouse integration | Native integration with AnalyticDB for MySQL, enabling integrated Data + AI architectures. |
Billing
When you create a Ray cluster resource group, you are charged for the following:
Storage consumed by the Worker Disk Storage setting.
If Worker Resource Type is set to CPU: AnalyticDB Compute Unit (ACU) elastic resources actually used.
If Worker Resource Type is set to GPU: the GPU specifications and quantity provisioned.
Usage notes
Worker node changes
Modify worker configurations during off-peak hours to minimize impact. Deleting or restarting worker nodes has these effects:
Drivers, actors, and tasks running on the affected nodes fail. Ray automatically redeploys actors and tasks.
Data in Ray's distributed object storage is lost. Tasks that depend on data from the restarted node also fail.
Resource group changes
| Operation | Impact |
|---|---|
| Delete a resource group | Running tasks in the resource group are interrupted. |
| Delete a worker group | The worker group's nodes are deleted. See the impact of worker node deletion above. |
| Set a new Maximum Workers value that is lower than the previous Minimum Workers value | Worker nodes are deleted. See the impact of worker node deletion above. |
| Change any parameter other than Minimum Workers or Maximum Workers (for example, head resource specifications or worker resource types) | Head node or worker nodes restart. See the impact of worker node restart above. |
Automatic scaling
Ray clusters scale based on logical resource requirements, not physical resource utilization. Scaling may trigger even when physical utilization is low.
Some third-party applications create as many tasks as possible to maximize resource usage. When automatic scaling is enabled, this can cause Ray clusters to scale up rapidly to the maximum size. Understand the task-creation logic of any third-party program before enabling automatic scaling to avoid unexpected resource consumption.
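The effect of logical-resource-based scaling can be sketched with simple arithmetic. The following is an illustrative model only, not the actual AnalyticDB for MySQL autoscaler: it assumes the cluster scales to cover the sum of declared CPU requests of pending tasks, clamped to the worker group's Minimum Workers and Maximum Workers, regardless of how busy existing nodes are.

```python
import math

def workers_needed(pending_cpu_requests, cpus_per_worker, min_workers, max_workers):
    """Illustrative sketch of logical-resource-based scaling (not the real autoscaler).

    Scaling is driven by the sum of declared CPU requests of pending tasks,
    not by the physical utilization of the existing nodes.
    """
    total_requested = sum(pending_cpu_requests)           # logical demand, not utilization
    needed = math.ceil(total_requested / cpus_per_worker)
    return max(min_workers, min(needed, max_workers))     # clamp to the worker group's bounds

# 10 pending tasks each requesting 2 logical CPUs, 8-CPU workers, group bounded 1..8:
print(workers_needed([2] * 10, 8, 1, 8))  # 20 CPUs requested -> 3 workers
```

This also shows why a third-party program that floods the cluster with tasks drives the group straight to Maximum Workers: the clamp is the only brake.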
Create a Ray cluster resource group
Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. Find and click your cluster ID.
In the left-side navigation pane, choose Cluster Management > Resource Management. Click the Resource Groups tab. In the upper-right corner, click Create Resource Group.
In the Create Resource Group panel, enter a resource group name, set Job Type to AI, and configure the following parameters.
| Parameter | Description |
|---|---|
| Deployment Mode | Select RayCluster. |
| Head Resource Specifications | The head node manages Ray metadata, runs the Global Control Store (GCS) service, and schedules tasks; it does not execute tasks. The available specifications (small, m.xlarge, or m.2xlarge) define the number of CPU cores, which matches Spark resource specifications. Choose based on the overall scale of your Ray cluster. |
| Worker Group Name | A name for this worker group. One AI resource group can contain multiple worker groups with different names. |
| Worker Resource Type | CPU: for daily computing tasks, multitasking, and complex logical operations. GPU: for large-scale data parallel processing, machine learning, and deep learning training. |
| Worker Resource Specifications | For CPU: small, m.xlarge, or m.2xlarge. CPU core counts match Spark resource specifications. For GPU: submit a ticket for available specifications, as these depend on GPU models and inventory. |
| Worker Disk Storage | Disk used for Ray logs, temporary data, and overflow from distributed object storage. Unit: GB. Valid range: 30 to 2000. Default: 100. Disks are for temporary storage only; do not use them for long-term data. |
| Minimum Workers | Minimum number of worker nodes in this group. Minimum value: 1. |
| Maximum Workers | Maximum number of worker nodes in this group. Maximum value: 8. When the minimum and maximum differ, AnalyticDB for MySQL scales the group dynamically based on task demand. With multiple worker groups, the system distributes load to prevent over- or under-utilization of individual groups. |
| Distribution Unit | GPUs allocated to each worker node (for example, 1/3). Required only when Worker Resource Type is GPU. |

Click OK.
Connect to your Ray cluster and submit jobs
Step 1: Get the cluster URLs
In the left-side navigation pane, choose Cluster Management > Resource Management. Click the Resource Groups tab.
Find your AI resource group and choose More > Details in the Actions column. The following URLs are available:
| URL | Description |
|---|---|
| Ray Cluster Endpoint | Internal URL for connecting to the cluster. |
| Ray Dashboard | Public URL for the Ray visualization page. Click it to view the statuses of your cluster, worker groups, and jobs. |
| Ray Grafana | URL for the Grafana metrics dashboard. |
Step 2: Choose a job submission method
Two methods are available. Use CTL for most scenarios.
| Method | How it works | When to use |
|---|---|---|
| CTL (cloud task launcher) (Recommended) | Packages your script files and uploads them to the Ray cluster for execution. The entry program runs in the cluster and consumes cluster resources. | Default choice for most workloads. No local Ray version management required. |
| ray.init | Connects a locally running entry program to the remote Ray cluster. The local program does not consume cluster resources. | When you need the entry program to run locally. Local Ray and Python versions must match the cluster version; update the local environment if the cluster version changes. |
Submit jobs with CTL
Prerequisites: Python 3.7 or later.
Install Ray with the default extras:

```bash
pip3 install "ray[default]"
```

(Optional) Set the cluster URL as an environment variable so you don't need to specify it on each submission. Replace `<ray-dashboard-url>` with the Ray Dashboard URL from Step 1 (port 8265):

```bash
export RAY_ADDRESS="<ray-dashboard-url>"
```

Submit a job, replacing the following placeholders:

| Placeholder | Description | Example |
|---|---|---|
| `<ray-dashboard-url>` | The Ray Dashboard URL from Step 1 | http://amv-uf64gwe14****-rayo.ads.aliyuncs.com:8265 |
| `<your-working-directory>` | Path to the directory containing your script and dependencies | /root/Ray |
| `<your-script.py>` | Your Python entry script | scripts.py |

If you set `RAY_ADDRESS` in the previous step:

```bash
ray job submit --working-dir <your-working-directory> -- python <your-script.py>
```

If you did not set `RAY_ADDRESS`:

```bash
ray job submit --address <ray-dashboard-url> --working-dir <your-working-directory> -- python <your-script.py>
```

Example with the address specified:

```bash
ray job submit --address http://amv-uf64gwe14****-rayo.ads.aliyuncs.com:8265 --working-dir /root/Ray -- python scripts.py
```

Important: CTL packages and uploads all files in the directory specified by `--working-dir` to the Ray head node. Keep this directory minimal, because large files can cause upload failures. Store all dependency script files in this directory; missing dependencies cause execution failures.

Check the job status using either of these methods:

Run the following command:

```bash
ray job list
```

On the Resource Groups tab, find your AI resource group, choose More > Details in the Actions column, and click the Ray Dashboard URL.
Submit jobs with ray.init
Prerequisites: Python 3.7 or later.
Install Ray:

```bash
pip3 install ray
```

Prepare the Ray URL for ray.init. The Ray Dashboard URL from Step 1 uses port 8265 and the `http://` protocol. For ray.init, change the port to 10001 and the protocol to `ray://`.

| URL type | Format |
|---|---|
| Ray Dashboard URL | http://amv-uf64gwe14****-rayo.ads.aliyuncs.com:8265 |
| ray.init URL | ray://amv-uf64gwe14****-rayo.ads.aliyuncs.com:10001 |

Run the program.

If you set `RAY_ADDRESS` to the ray.init URL format:

```bash
export RAY_ADDRESS="ray://<cluster-hostname>:10001"
python scripts.py
```

If you specify the URL in the script:

```python
ray.init(address="ray://<cluster-hostname>:10001")
```

Then run:

```bash
python scripts.py
```
Important: If the Ray URL is incorrect, ray.init starts a local Ray cluster instead of connecting to the remote cluster. Check the output logs to confirm the connection target.
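One way to avoid passing a wrong address is to derive the ray.init URL from the Dashboard URL mechanically, following the port/protocol rule above. This helper is an illustrative sketch in plain Python, using the masked example hostname from this page:

```python
from urllib.parse import urlparse

def dashboard_to_client_url(dashboard_url):
    """Convert a Ray Dashboard URL (http://host:8265) to a ray.init client URL (ray://host:10001)."""
    parsed = urlparse(dashboard_url)
    if parsed.scheme != "http" or parsed.port != 8265:
        # Refuse anything that is not a Dashboard URL, rather than let
        # ray.init silently fall back to starting a local cluster.
        raise ValueError(f"not a Ray Dashboard URL: {dashboard_url}")
    return f"ray://{parsed.hostname}:10001"

print(dashboard_to_client_url("http://amv-uf64gwe14****-rayo.ads.aliyuncs.com:8265"))
# ray://amv-uf64gwe14****-rayo.ads.aliyuncs.com:10001
```

Failing fast on a malformed URL is safer than letting ray.init fall back to a local cluster and discovering the mistake in the logs later.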