
Platform for AI: Quickly submit a Ray job

Last Updated: Dec 05, 2025

Deep Learning Containers (DLC) of Platform for AI (PAI) supports jobs based on the Ray framework. You can directly submit Ray training scripts to DLC without the need to set up a Ray cluster or configure Kubernetes. Additionally, DLC offers comprehensive log and metric monitoring services to better manage jobs. This topic describes how to submit a Ray training job.

Prerequisites

To submit training jobs using the SDK, you must configure environment variables. For more information, see Install the Credentials tool and Configure environment variables in Linux, macOS, and Windows.
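
For reference, the following minimal sketch checks that the environment variables read by the default credential chain are set before you run the SDK samples in this topic. It assumes that you authenticate with an AccessKey pair; other credential types use different variables.

    import os

    # The default credential chain reads an AccessKey pair from these variables.
    # Adjust the names if you use another credential type, such as STS tokens.
    for name in ("ALIBABA_CLOUD_ACCESS_KEY_ID", "ALIBABA_CLOUD_ACCESS_KEY_SECRET"):
        if not os.environ.get(name):
            raise SystemExit(f"{name} is not set; configure it as described above")
    print("Credential environment variables are set")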

Preparations

Prepare node images

A Ray cluster contains two types of nodes: Head and Worker. DLC jobs use a specified node image to build Head and Worker containers. After you create a job, DLC automatically sets up the Ray cluster and, after the cluster is ready, submits the job to the cluster by initiating a submitter node, which also uses the same image.

The Ray image must be version 2.6 or later and must include at least the components in ray[default]. The following images are supported; a sketch for verifying an image follows this list:

  • Alibaba Cloud image: PAI provides official images with the basic Ray components pre-installed for your convenience.

  • Ray community image:

    • The Docker image rayproject/ray is recommended.

    • The rayproject/ray-ml image, which includes machine learning frameworks like PyTorch and TensorFlow, is also supported.

    For GPU support, use an image compatible with CUDA. For more supported image versions, see Docker image documentation.
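
To check whether a custom image meets these requirements, you can run a short script such as the following inside a container started from the image. This is a minimal sketch: it only verifies the Ray version and that a local Ray instance can start, not the availability of every ray[default] component.

    import ray

    # Starting a local Ray instance fails if the basic Ray components are missing.
    ray.init()

    # The image must provide Ray 2.6 or later.
    major, minor = (int(part) for part in ray.__version__.split(".")[:2])
    assert (major, minor) >= (2, 6), f"Ray {ray.__version__} is older than 2.6"

    print(f"Ray {ray.__version__} started with resources: {ray.cluster_resources()}")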

Prepare startup commands and script files

The startup command of a DLC job serves as the entrypoint command for ray job submit; a sketch of the corresponding job submission appears after the following list. The startup command can consist of one or more lines, such as python /root/code/sample.py, where:

  • sample.py is the Python script to be executed. You can mount the script to the DLC container by using a dataset or a code configuration. Sample code:

    import ray
    import os
    
    ray.init()
    
    @ray.remote
    class Counter:
        def __init__(self):
            # Read from the job's runtime environment (runtime_env) to verify that it is applied
            self.name = os.getenv("counter_name")
            # assert self.name == "ray"
            self.counter = 0
    
        def inc(self):
            self.counter += 1
    
        def get_counter(self):
            return "{} got {}".format(self.name, self.counter)
    
    counter = Counter.remote()
    
    for _ in range(50000):
        ray.get(counter.inc.remote())
        print(ray.get(counter.get_counter.remote()))
    
  • /root/code/ is the mount path.
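
When the job runs, the Submitter node issues the equivalent of a ray job submit call against the Head node, with your startup command as the entrypoint. The following sketch shows a roughly equivalent submission through Ray's Python job submission API. The dashboard address, the runtime_env contents, and the counter_name variable are assumptions used for illustration; DLC performs this step for you.

    from ray.job_submission import JobSubmissionClient

    # Address of the Ray dashboard on the Head node (assumption: default port 8265).
    client = JobSubmissionClient("http://127.0.0.1:8265")

    submission_id = client.submit_job(
        # The DLC startup command becomes the job entrypoint.
        entrypoint="python /root/code/sample.py",
        # Optional runtime environment; counter_name is read by sample.py above.
        runtime_env={"env_vars": {"counter_name": "ray"}},
    )
    print(submission_id)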

Submit training jobs

Use the console

  1. Go to the Create Job page.

    1. Log on to the PAI console. Select a region and a workspace. Then, click Enter Deep Learning Containers (DLC).

    2. On the Deep Learning Containers (DLC) page, click Create Job.

  2. On the Create Job page, configure the following key parameters. For information about the other parameters, see Create a training job.

    Environment Information

    • Image config: Select a Ray image on the Alibaba Cloud Image tab.

      Example: ray:2.39.0-cpu-py312-ubuntu22.04

    • Startup Command: The command that the job executes.

      Example: python /root/code/sample.py

    • Third-party Libraries: Configure the environment dependencies (runtime_env) as a list of third-party libraries.

      Note: For production environments, we recommend that you use pre-packaged images to prevent job failures caused by installing dependencies at run time.

      Example: none

    • Code Builds: Upload your script file to the DLC container by using Online configuration or Local Upload.

      Example (Local Upload): sample code file sample.py, mount path /root/code/.

    Resource Information

    • Source: Select Public Resources or Resource Quota for the training job.

      Note: Ray jobs do not support idle resources, preemptible resources, or preemptible job types. Ray jobs cannot be preempted.

      Example: Public Resources

    • Framework: The framework used by the job.

      Example: Ray

    • Job Resource:

      • Quantity: A Ray cluster contains Head and Worker nodes. Exactly one Head node is required; it only runs the entrypoint script and does not act as a Worker node. At least one Worker node is typically used, although this is not mandatory. Each Ray job also automatically creates a Submitter node that executes the startup command, and you can view the job log in the event log of the Submitter node. For subscription jobs, the Submitter node uses a minimal share of your resources. For pay-as-you-go jobs, it uses the smallest available resource type.

      • Resource Type: The logical resources on the Worker nodes correspond to the physical resources specified when the job is submitted. For example, if you configure a GPU node with 8 GPUs, each Worker node also has 8 GPUs by default. We recommend that you match resource specifications to the demands of the job and use a few large nodes rather than many small ones. To prevent out-of-memory (OOM) errors, each node should have at least 2 GiB of memory, and this should be scaled up as the number of jobs and Actors increases.

      Example: Number of Nodes is 1 for both node types; Resource Type is ecs.g6.xlarge.

  3. After configuring the parameters, click Confirm.

Use the SDK

  1. Install the DLC SDK for Python.

    pip install alibabacloud_pai_dlc20201203==1.4.0
  2. Submit a DLC Ray job. Sample code:

    #!/usr/bin/env python3
    
    from alibabacloud_tea_openapi.models import Config
    from alibabacloud_credentials.client import Client as CredClient
    
    from alibabacloud_pai_dlc20201203.client import Client as DLCClient
    from alibabacloud_pai_dlc20201203.models import CreateJobRequest
    
    region_id = '<region-id>'
    cred = CredClient()
    workspace_id = '12****'
    
    dlc_client = DLCClient(
        Config(credential=cred,
               region_id=region_id,
               endpoint='pai-dlc.{}.aliyuncs.com'.format(region_id),
               protocol='http'))
    
    create_job_resp = dlc_client.create_job(CreateJobRequest().from_map({
        'WorkspaceId': workspace_id,
        'DisplayName': 'dlc-ray-job',
        'JobType': 'RayJob',
        'JobSpecs': [
            {
                "Type": "Head",
                "Image": "dsw-registry-vpc.<region-id>.cr.aliyuncs.com/pai/ray:2.39.0-gpu-py312-cu118-ubuntu22.04",
                "PodCount": 1,
                "EcsSpec": 'ecs.c6.large',
            },
            {
                "Type": "Worker",
                "Image": "dsw-registry-vpc.<region-id>.cr.aliyuncs.com/pai/ray:2.39.0-gpu-py312-cu118-ubuntu22.04",
                "PodCount": 1,
                "EcsSpec": 'ecs.c6.large',
            },
        ],
        "UserCommand": "echo 'Prepare your ray job entrypoint here' && sleep 1800 && echo 'DONE'",
    }))
    job_id = create_job_resp.body.job_id
    print(f'jobId is {job_id}')
    

    Where:

    • region_id: The region ID. For example, China (Hangzhou) is cn-hangzhou.

    • workspace_id: The workspace ID, which can be found on the workspace details page. For more information, see Manage workspaces.

    • Image: Replace <region-id> with the region ID. For example, China (Hangzhou) is cn-hangzhou.
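
    After the job is submitted, you can poll its status with the same client. The following is a minimal sketch; the GetJobRequest model and the terminal status values are based on the alibabacloud_pai_dlc20201203 SDK and may vary across SDK versions.

    import time

    from alibabacloud_pai_dlc20201203.models import GetJobRequest

    # Poll until the job reaches a terminal state.
    while True:
        job = dlc_client.get_job(job_id, GetJobRequest()).body
        print(f'job {job_id} status: {job.status}')
        if job.status in ('Succeeded', 'Failed', 'Stopped'):
            break
        time.sleep(30)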

For more information about how to use the SDK, see Use SDK for Python.

Use the command line

  1. Download the DLC client tool and complete user authentication. For more information, see Before you begin.

  2. Submit a DLC Ray job. Sample code:

    ./dlc submit rayjob --name=my_ray_job \
      --workers=1 \
      --worker_spec=ecs.g6.xlarge \
      --worker_image=dsw-registry-vpc.<region-id>.cr.aliyuncs.com/pai/ray:2.39.0-cpu-py312-ubuntu22.04 \
      --heads=1 \
      --head_image=dsw-registry-vpc.<region-id>.cr.aliyuncs.com/pai/ray:2.39.0-cpu-py312-ubuntu22.04 \
      --head_spec=ecs.g6.xlarge \
      --command="echo 'Prepare your ray job entrypoint here' && sleep 1800 && echo 'DONE'" \
      --workspace_id=4****

    For more information about how to use the command line, see Commands used to submit jobs.

FAQ

Why does a Ray job time out due to prolonged environment preparation?

  • Examine the log of the Head node to determine whether the Ray environment starts properly. If it does not, Ray is not functioning on that node. Follow the instructions in the Preparations section to prepare a compatible image. A quick diagnostic sketch follows this list.


  • We recommend that you check the event log of the Head node. If the error Readiness probe failed... appears, it may indicate that dependencies required by the readiness check are missing or that indirect dependencies are unavailable. You can reinstall the ray[default] components in the original image by using pip or conda, or rebuild the image based on the official Ray image.
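
To quickly check whether Ray is running on a node, you can attach to the Ray instance that DLC starts in the container. The following is a minimal diagnostic sketch. It assumes that you run it inside the Head container and that the Ray instance was started with default settings.

    import ray

    # Attach to the Ray instance that is already running on this node.
    ray.init(address="auto")

    # Print one line per node in the cluster. A missing node or a node that is
    # not alive indicates that Ray did not start properly on it.
    for node in ray.nodes():
        print(node["NodeManagerAddress"], "alive:", node["Alive"])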