Deep Learning Containers (DLC) of Platform for AI (PAI) supports jobs based on the Ray framework. You can directly submit Ray training scripts to DLC without the need to set up a Ray cluster or configure Kubernetes. Additionally, DLC offers comprehensive log and metric monitoring services to better manage jobs. This topic describes how to submit a Ray training job.
Prerequisites
To submit training jobs using the SDK, you must configure environment variables. For more information, see Install the Credentials tool and Configure environment variables in Linux, macOS, and Windows.
Preparations
Prepare node images
A Ray cluster contains two types of nodes: Head and Worker. DLC jobs use a specified node image to build Head and Worker containers. After you create a job, DLC automatically sets up the Ray cluster and, after the cluster is ready, submits the job to the cluster by initiating a submitter node, which also uses the same image.
The Ray image must be version 2.6 or later and include at least the components in ray[default]; a quick way to verify a custom image is shown after the image list. The following images are supported:
Alibaba Cloud image: PAI provides official images with Ray basic components pre-installed for your convenience.

Ray community image:
The Docker image rayproject/ray is recommended.
The rayproject/ray-ml image, which includes machine learning frameworks like PyTorch and TensorFlow, is also supported.
For GPU support, use a CUDA-compatible image. For more supported image versions, see the Docker image documentation.
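If you build a custom image, you can run a quick check inside the container to confirm that it meets these requirements. The following is a minimal sketch; treating aiohttp as a marker for the ray[default] components is an assumption about how Ray is packaged, not a DLC requirement.

# Rough sanity check for a custom Ray image; run it inside the container.
import importlib.util

import ray

# DLC requires a Ray image of version 2.6 or later.
major, minor = (int(part) for part in ray.__version__.split(".")[:2])
assert (major, minor) >= (2, 6), f"Ray {ray.__version__} is too old; 2.6 or later is required"

# ray[default] installs the dashboard and agent dependencies; aiohttp is one of them,
# so its absence usually means that only the Ray core package is installed.
if importlib.util.find_spec("aiohttp") is None:
    raise RuntimeError("ray[default] components appear to be missing; run: pip install -U 'ray[default]'")

print(f"Ray {ray.__version__} with ray[default] dependencies detected")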
Prepare startup commands and script files
The startup command for a DLC job serves as the entrypoint command for ray job submit. The startup command can be a single line or multiple lines, for example, python /root/code/sample.py, where:
sample.py is the Python script file to be executed. You can mount the script file to the DLC container through a dataset or a code build. Sample code:
import ray
import os

ray.init()

@ray.remote
class Counter:
    def __init__(self):
        # Used to verify runtimeEnv
        self.name = os.getenv("counter_name")
        # assert self.name == "ray"
        self.counter = 0

    def inc(self):
        self.counter += 1

    def get_counter(self):
        return "{} got {}".format(self.name, self.counter)

counter = Counter.remote()
for _ in range(50000):
    ray.get(counter.inc.remote())

print(ray.get(counter.get_counter.remote()))

/root/code/ is the mount path.
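DLC runs this startup command as the entrypoint of a ray job submit call issued from the Submitter node. If you want to reproduce the submission yourself against a running Ray cluster, for example to pass the counter_name variable that the script reads, you can use the Ray job submission SDK. This is a minimal sketch; the dashboard address and the value "ray" are assumptions for illustration.

from ray.job_submission import JobSubmissionClient

# Connect to the dashboard of an existing Ray cluster (the address is an example).
client = JobSubmissionClient("http://127.0.0.1:8265")

# The entrypoint mirrors the DLC startup command; env_vars is one way to provide
# the counter_name variable that sample.py reads.
submission_id = client.submit_job(
    entrypoint="python /root/code/sample.py",
    runtime_env={"env_vars": {"counter_name": "ray"}},
)
print("Submitted Ray job:", submission_id)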
Submit training jobs
Use the console
Go to the Create Job page.
Log on to the PAI console. Select a region and a workspace. Then, click Enter Deep Learning Containers (DLC).
On the Deep Learning Containers (DLC) page, click Create Job.
On the Create Job page, configure the following key parameters. For information about the other parameters, see Create a training job.
Parameter
Description
Example Value
Environment Information
Image config
Select a Ray image on the Alibaba Cloud Image tab.
ray:2.39.0-cpu-py312-ubuntu22.04
Startup Command
The command to be executed by the job.
python /root/code/sample.py
Third-party Libraries
Configure the Environment Dependencies (runtime_env) using a list of third-party libraries.
Note: For production environments, we recommend that you use pre-packaged images to prevent job failures due to temporary dependency installations.
None
Code Builds
Upload your script file to the DLC container using either Online configuration or Local Upload.
Using Local Upload:
Sample code file: sample.py
Mount path: /root/code/.
Resource Information
Source
Choose between Public Resources and Resource Quota for submitting training jobs.
Note: Ray jobs do not support idle resources, preemptible resources, or the preemptible job type. Ray jobs cannot be preempted.
Public Resources
Framework
The framework used.
Ray
Job Resource
Quantity:
A Ray cluster contains two node types: Head and Worker. Exactly one Head node is required; it only runs the entrypoint script and does not act as a Worker node. At least one Worker node is typically used, but this is not mandatory. Each Ray job also automatically creates a Submitter node to execute the startup command, and you can view the job log through the event log of the Submitter. For subscription jobs, the Submitter node consumes a minimal share of your resources; for pay-as-you-go jobs, it uses the smallest available resource type.
Resource Type:
The Logical Resources on the Worker nodes correspond to the physical resources specified during job submission. For example, if you configure a GPU node with 8 GPUs, the Worker nodes also have 8 GPUs by default.
We recommend that you align resource specifications with job demands and use a small number of large nodes instead of many small ones. To prevent OOM errors, each node should ideally have at least 2 GiB of memory, and this should be scaled up as the number of jobs and Actors increases. You can check the resources that the job actually sees with the sketch that follows this procedure.
Number of Nodes: 1 for both types.
Resource Type: ecs.g6.xlarge.
After configuring the parameters, click Confirm.
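Once the job is running, the entrypoint script can optionally confirm that the configured Head and Worker resources are visible to the Ray cluster. This is a minimal sketch and is not required by DLC.

import ray

ray.init()

# Total logical resources registered by the Head and Worker nodes; these should
# match the instance types configured under Job Resource (CPU, GPU, memory).
print(ray.cluster_resources())

# Resources that are not currently claimed by running tasks or actors.
print(ray.available_resources())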
Use the SDK
Install the DLC SDK for Python.
pip install alibabacloud_pai_dlc20201203==1.4.0

Submit a DLC Ray job. Sample code:
#!/usr/bin/env python3
from alibabacloud_tea_openapi.models import Config
from alibabacloud_credentials.client import Client as CredClient
from alibabacloud_pai_dlc20201203.client import Client as DLCClient
from alibabacloud_pai_dlc20201203.models import CreateJobRequest

region_id = '<region-id>'
cred = CredClient()
workspace_id = '12****'

dlc_client = DLCClient(
    Config(credential=cred,
           region_id=region_id,
           endpoint='pai-dlc.{}.aliyuncs.com'.format(region_id),
           protocol='http'))

create_job_resp = dlc_client.create_job(CreateJobRequest().from_map({
    'WorkspaceId': workspace_id,
    'DisplayName': 'dlc-ray-job',
    'JobType': 'RayJob',
    'JobSpecs': [
        {
            "Type": "Head",
            "Image": "dsw-registry-vpc.<region-id>.cr.aliyuncs.com/pai/ray:2.39.0-gpu-py312-cu118-ubuntu22.04",
            "PodCount": 1,
            "EcsSpec": 'ecs.c6.large',
        },
        {
            "Type": "Worker",
            "Image": "dsw-registry-vpc.<region-id>.cr.aliyuncs.com/pai/ray:2.39.0-gpu-py312-cu118-ubuntu22.04",
            "PodCount": 1,
            "EcsSpec": 'ecs.c6.large',
        },
    ],
    "UserCommand": "echo 'Prepare your ray job entrypoint here' && sleep 1800 && echo 'DONE'",
}))

job_id = create_job_resp.body.job_id
print(f'jobId is {job_id}')

Where:
region_id: The region ID. For example, China (Hangzhou) is cn-hangzhou.
workspace_id: The workspace ID, which can be found on the workspace details page. For more information, see Manage workspaces.
Image: Replace <region-id> with the region ID. For example, China (Hangzhou) is cn-hangzhou.
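After the job is created, you typically want to track its state before fetching logs or results. The following sketch continues from the sample code above and polls the job with the same SDK. The get_job call, the GetJobRequest model, and the status values are assumptions based on the DLC API reference; verify them against the SDK version that you install.

import time

from alibabacloud_pai_dlc20201203.models import GetJobRequest

# Poll the job created above until it reaches a terminal state.
# dlc_client and job_id come from the previous sample.
while True:
    job = dlc_client.get_job(job_id, GetJobRequest()).body
    print('Job status:', job.status)
    if job.status in ('Succeeded', 'Failed', 'Stopped'):
        break
    time.sleep(30)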
For more information about how to use the SDK, see Use SDK for Python.
Use the command line
Download the DLC client tool and complete user authentication. For more information, see Before you begin.
Submit a DLC Ray job. Sample code:
./dlc submit rayjob --name=my_ray_job \
    --workers=1 \
    --worker_spec=ecs.g6.xlarge \
    --worker_image=dsw-registry-vpc.<region-id>.cr.aliyuncs.com/pai/ray:2.39.0-cpu-py312-ubuntu22.04 \
    --heads=1 \
    --head_image=dsw-registry-vpc.<region-id>.cr.aliyuncs.com/pai/ray:2.39.0-cpu-py312-ubuntu22.04 \
    --head_spec=ecs.g6.xlarge \
    --command="echo 'Prepare your ray job entrypoint here' && sleep 1800 && echo 'DONE'" \
    --workspace_id=4****

For more information about how to use the command line, see Commands used to submit jobs.
FAQ
Why does a Ray job time out due to prolonged environment preparation?
Check the log of the Head node to determine whether the Ray environment started properly. If it did not, Ray is not running on that node. Follow the instructions in the Preparations section to prepare a compatible image.

We also recommend that you check the event log of the Head node. If the error Readiness probe failed... appears, dependencies required by the readiness check may be missing, or their indirect dependencies may be unavailable. You can reinstall the ray[default] components in the original image using pip or conda, or rebuild the image based on the official Ray image.