Create single-node or distributed training jobs in DLC by using the console, Python SDK, or CLI.
Quick start
For an MNIST-based walkthrough of single-GPU or multi-node multi-GPU training, see Distributed training DLC quick start.
Console parameters
Basic information
Configure the Job Name and Tag.
Environment information
Parameter | Description |
Image Configuration | In addition to Alibaba Cloud Image, the following image types are available: |
Mount dataset | Mount data files for model training. Supported dataset types: Mount Path: The mount path in the DLC container. Important: If you use a CPFS dataset, configure a VPC for your DLC job. The VPC must match the VPC of the CPFS file system. Otherwise, the job may remain in the Preparing state. |
Mount storage | Mount a data source path to read data or store output files. For more information, see Use cloud storage in DLC training jobs. |
Startup Command | The startup command for the job. Shell commands are supported. DLC automatically injects common environment variables for PyTorch and TensorFlow. |
Resource information
Parameter | Description |
Resource Type | The default value is General Computing. You can select Lingjun Intelligence Resources only in the China (Ulanqab), Singapore, China (Shenzhen), China (Beijing), China (Shanghai), and China (Hangzhou) regions. |
Source | |
Framework | Supported deep learning training frameworks: TensorFlow, PyTorch, ElasticBatch, XGBoost, OneFlow, MPIJob, and Ray. Note: If you select Resource Quota and use Lingjun AI computing resources, you can submit only TensorFlow, PyTorch, ElasticBatch, MPIJob, and Ray jobs. |
Job Resource | Based on the selected Framework, configure resources for Worker, PS, Chief, Evaluator, and GraphLearn nodes. For the Ray framework, click Add Role to customize Worker roles and run jobs on heterogeneous resources. |
VPC configuration
-
Without a VPC, the job uses a public gateway with limited bandwidth, which may slow down or fail job execution.
-
Configure a VPC with the corresponding vSwitch and security group to improve network bandwidth, stability, and security. The job cluster can directly access services within the VPC.
Important
-
When using a VPC, the resource group instances, dataset storage (OSS), and code repository must all be in the same VPC.
-
For CPFS datasets, configure the job to use the same VPC as the CPFS file system. Otherwise, the DLC training job may remain in the Preparing state.
-
DLC jobs using preemptible Lingjun AI computing resources require a VPC.
Configure the Internet Access Gateway. Two methods are supported:
-
Public Gateway: Limited bandwidth that may be insufficient during high-concurrency access or large file downloads.
-
Private Gateway: To overcome the bandwidth limitations of the public gateway, you can create an Internet NAT Gateway in the VPC of the DLC job, bind an Elastic IP Address (EIP), and configure SNAT entries. For more information, see Improve public network access speed by using a private gateway.
Fault tolerance and diagnosis
Parameter | Description |
Automatic Fault Tolerance | Enable Automatic Fault Tolerance to detect and handle algorithm-level errors and improve GPU utilization. For more information, see AIMaster: An elastic automatic fault tolerance engine. Note: Enabling automatic fault tolerance starts an AIMaster instance that runs alongside the job and consumes the following resources: |
Sanity Check | Enable Sanity Check to check the health of training resources. It automatically isolates faulty nodes and triggers backend O&M to prevent early-stage failures. For more information, see SanityCheck: Compute resource health check. Note: Sanity Check is supported only for PyTorch jobs that have a GPU count greater than 0 and are submitted with a Lingjun AI computing resource quota. |
Roles and permissions
Configure the instance RAM role. For more information, see Configure a DLC RAM role.
Instance RAM role | Description |
Default Role of PAI | Uses the AliyunPAIDLCDefaultRole service-linked role, which has fine-grained MaxCompute and OSS permissions. Temporary credentials grant: |
Custom Role | Select or enter a custom RAM role. The instance assumes this role's permissions when accessing Alibaba Cloud services through STS temporary credentials. |
Does Not Associate Role | No RAM role is associated with the DLC job. This is the default option. |
References
-
View job details, resource usage, and operation logs: View training details.
-
View job billing details: Bill details.
-
Troubleshoot DLC job issues: DLC FAQ.
-
DLC use cases: DLC use cases.
Appendix
Create a job using an SDK or CLI
Python SDK
Step 1: Install the Alibaba Cloud Credentials tool
Install the Credentials tool to configure your credentials for Alibaba Cloud SDK API calls. Prerequisites:
-
Python 3.7 or later.
-
Alibaba Cloud SDK 2.0 or later.
pip install alibabacloud_credentials
Step 2: Obtain an AccessKey
This example uses an AccessKey pair to configure access credentials. To mitigate security risks, we recommend that you set your AccessKey ID and AccessKey secret in the ALIBABA_CLOUD_ACCESS_KEY_ID and ALIBABA_CLOUD_ACCESS_KEY_SECRET environment variables.
-
To obtain an AccessKey, see Create an AccessKey.
-
To learn how to set environment variables, see Configure environment variables.
-
For other methods of configuring credentials, see Install the Credentials tool.
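Before you create a client, it can help to confirm that both environment variables are visible to the Python process. The following is a minimal sketch that uses only the standard library. The helper name `missing_credential_vars` is hypothetical and is not part of any Alibaba Cloud SDK.

```python
import os

# Environment variable names taken from the step above.
REQUIRED_VARS = (
    "ALIBABA_CLOUD_ACCESS_KEY_ID",
    "ALIBABA_CLOUD_ACCESS_KEY_SECRET",
)


def missing_credential_vars(environ=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_credential_vars()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("Both AccessKey environment variables are set.")
```

The check only reports whether the variables exist. It never prints their values, so it is safe to run in shared logs.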
Step 3: Install the Python SDK
-
Install the workspace SDK.
pip install alibabacloud_aiworkspace20210204==3.0.1
-
Install the DLC SDK.
pip install alibabacloud_pai_dlc20201203==1.4.17
Step 4: Submit the job
Using public resources
The following sample code creates and submits a job.
Using a subscription quota
-
Log on to the PAI console.
-
Find your workspace ID on the Workspaces page.
-
Find the resource quota ID of your dedicated resource group.
-
Use the following sample code to create and submit a job. For a list of available public images, see Step 2: Prepare an image.
from alibabacloud_pai_dlc20201203.client import Client
from alibabacloud_credentials.client import Client as CredClient
from alibabacloud_tea_openapi.models import Config
from alibabacloud_pai_dlc20201203.models import (
    CreateJobRequest,
    JobSpec,
    ResourceConfig,
    GetJobRequest,
)

# Initialize a client to access the DLC API.
region = 'cn-hangzhou'
# An AccessKey provides full API access. For better security, we recommend using a RAM user for API access and daily O&M.
# Do not hard-code your AccessKey ID and AccessKey secret in your code. This can leak your credentials and compromise the security of your resources.
# This example authenticates by reading the AccessKey from environment variables using the Credentials SDK.
cred = CredClient()
client = Client(
    config=Config(
        credential=cred,
        region_id=region,
        endpoint=f'pai-dlc.{region}.aliyuncs.com',
    )
)

# Declare the resource configuration for the job. For image selection, you can refer to the public image list in the documentation or provide your own image URL.
spec = JobSpec(
    type='Worker',
    image='registry-vpc.cn-hangzhou.aliyuncs.com/pai-dlc/tensorflow-training:1.15-cpu-py36-ubuntu18.04',
    pod_count=1,
    resource_config=ResourceConfig(cpu='1', memory='2Gi'),
)

# Define the job.
req = CreateJobRequest(
    resource_id='<Replace with your resource quota ID>',
    workspace_id='<Replace with your WorkspaceID>',
    display_name='sample-dlc-job',
    job_type='TFJob',
    job_specs=[spec],
    user_command='echo "Hello World"',
)

# Submit the job.
response = client.create_job(req)
# Get the job ID.
job_id = response.body.job_id

# Query the job status.
job = client.get_job(job_id, GetJobRequest()).body
print('job status:', job.status)

# View the job's command.
print('job command:', job.user_command)
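The sample for a subscription quota queries the job status once, right after submission. If you want to wait for the job to finish, one option is a small polling loop. The following sketch is not part of the DLC SDK: `wait_for_job` is a hypothetical helper, and the set of terminal statuses is an assumption that you should verify against the GetJob API reference.

```python
import time

# Assumed terminal statuses; verify them against the GetJob API reference.
TERMINAL_STATUSES = {"Succeeded", "Failed", "Stopped"}


def wait_for_job(get_status, interval_s=10.0, timeout_s=3600.0, sleep=time.sleep):
    """Poll get_status() until it returns a terminal status or the timeout elapses."""
    waited = 0.0
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if waited >= timeout_s:
            raise TimeoutError(f"job still in status {status!r} after {waited:.0f}s")
        sleep(interval_s)
        waited += interval_s
```

With the `client` and `job_id` from the sample, a call could look like `wait_for_job(lambda: client.get_job(job_id, GetJobRequest()).body.status)`. Passing the status lookup as a callable keeps the loop independent of the SDK, which also makes it easy to test with a stub.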
Using preemptible instances
-
SpotDiscountLimit (Spot discount)
#!/usr/bin/env python3
from alibabacloud_tea_openapi.models import Config
from alibabacloud_credentials.client import Client as CredClient
from alibabacloud_pai_dlc20201203.client import Client as DLCClient
from alibabacloud_pai_dlc20201203.models import CreateJobRequest

region_id = '<region-id>'  # The ID of the region in which the DLC job resides, such as cn-hangzhou.
cred = CredClient()
workspace_id = '12****'  # The ID of the workspace to which the DLC job belongs.

dlc_client = DLCClient(
    Config(credential=cred,
           region_id=region_id,
           endpoint='pai-dlc.{}.aliyuncs.com'.format(region_id),
           protocol='http'))

create_job_resp = dlc_client.create_job(CreateJobRequest().from_map({
    'WorkspaceId': workspace_id,
    'DisplayName': 'sample-spot-job',
    'JobType': 'PyTorchJob',
    'JobSpecs': [
        {
            "Type": "Worker",
            "Image": "dsw-registry-vpc.<region-id>.cr.aliyuncs.com/pai/pytorch-training:1.12-cpu-py39-ubuntu20.04",
            "PodCount": 1,
            "EcsSpec": 'ecs.g7.xlarge',
            "SpotSpec": {
                "SpotStrategy": "SpotWithPriceLimit",
                "SpotDiscountLimit": 0.4,
            }
        },
    ],
    'UserVpc': {
        "VpcId": "vpc-0jlq8l7qech3m2ta2****",
        "SwitchId": "vsw-0jlc46eg4k3pivwpz8****",
        "SecurityGroupId": "sg-0jl4bd9wwh5auei9****",
    },
    "UserCommand": "echo 'Hello World' && ls -R /mnt/data/ && sleep 30 && echo 'DONE'",
}))
job_id = create_job_resp.body.job_id
print(f'jobId is {job_id}')
-
SpotPriceLimit (Spot price)
#!/usr/bin/env python3
from alibabacloud_tea_openapi.models import Config
from alibabacloud_credentials.client import Client as CredClient
from alibabacloud_pai_dlc20201203.client import Client as DLCClient
from alibabacloud_pai_dlc20201203.models import CreateJobRequest

region_id = '<region-id>'  # The ID of the region in which the DLC job resides.
cred = CredClient()
workspace_id = '12****'  # The ID of the workspace to which the DLC job belongs.

dlc_client = DLCClient(
    Config(credential=cred,
           region_id=region_id,
           endpoint='pai-dlc.{}.aliyuncs.com'.format(region_id),
           protocol='http'))

create_job_resp = dlc_client.create_job(CreateJobRequest().from_map({
    'WorkspaceId': workspace_id,
    'DisplayName': 'sample-spot-job',
    'JobType': 'PyTorchJob',
    'JobSpecs': [
        {
            "Type": "Worker",
            "Image": "dsw-registry-vpc.<region-id>.cr.aliyuncs.com/pai/pytorch-training:1.12-cpu-py39-ubuntu20.04",
            "PodCount": 1,
            "EcsSpec": 'ecs.g7.xlarge',
            "SpotSpec": {
                "SpotStrategy": "SpotWithPriceLimit",
                "SpotPriceLimit": 0.011,
            }
        },
    ],
    'UserVpc': {
        "VpcId": "vpc-0jlq8l7qech3m2ta2****",
        "SwitchId": "vsw-0jlc46eg4k3pivwpz8****",
        "SecurityGroupId": "sg-0jl4bd9wwh5auei9****",
    },
    "UserCommand": "echo 'Hello World' && ls -R /mnt/data/ && sleep 30 && echo 'DONE'",
}))
job_id = create_job_resp.body.job_id
print(f'jobId is {job_id}')
Key parameters for preemptible instances.
Parameter | Description |
SpotStrategy | The bidding policy. The bidding type parameters take effect only if you set this parameter to SpotWithPriceLimit. |
SpotDiscountLimit | The spot discount bidding type. |
SpotPriceLimit | The spot price bidding type. |
UserVpc | This parameter is required when you use Lingjun resources to submit jobs. Configure the VPC, vSwitch, and security group ID for the region in which the job resides. |
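The two preemptible samples differ only in the bidding type key inside SpotSpec. A small helper can make that choice explicit. This is a sketch under assumptions: `build_spot_spec` is a hypothetical name, the dictionary keys mirror the samples above, and the only validation is that the limit is positive because the valid ranges are not stated here.

```python
def build_spot_spec(limit_type, limit):
    """Build a SpotSpec mapping shaped like the ones in the samples above."""
    if limit_type not in ("SpotDiscountLimit", "SpotPriceLimit"):
        raise ValueError(f"unknown bidding type: {limit_type!r}")
    if limit <= 0:
        raise ValueError("limit must be a positive number")
    return {
        # The bidding type parameters take effect only with this strategy.
        "SpotStrategy": "SpotWithPriceLimit",
        limit_type: limit,
    }


print(build_spot_spec("SpotDiscountLimit", 0.4))
print(build_spot_spec("SpotPriceLimit", 0.011))
```

The returned mapping can be placed under the "SpotSpec" key of a job spec passed to `CreateJobRequest().from_map(...)`.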
CLI
Step 1: Download the client and authenticate
Download the client for Linux (64-bit) or macOS and complete user authentication. For details, see Preparations.
Step 2: Submit the job
-
Log on to the PAI console.
-
Find your workspace ID on the Workspaces page.
-
Find your resource quota ID.
-
Prepare a parameter file named tfjob.params as shown in the following example. For more information about how to configure the parameter file, see Submit command.
name=test_cli_tfjob_001
workers=1
worker_cpu=4
worker_gpu=0
worker_memory=4Gi
worker_shared_memory=4Gi
worker_image=registry-vpc.cn-beijing.aliyuncs.com/pai-dlc/tensorflow-training:1.12.2PAI-cpu-py27-ubuntu16.04
command=echo good && sleep 120
resource_id=<Replace with your resource quota ID>
workspace_id=<Replace with your WorkspaceID>
-
Run the following command to submit the DLC job to the specified workspace and resource quota by using the parameter file.
./dlc submit tfjob --job_file ./tfjob.params
-
Run the following command to view the submitted DLC job.
./dlc get job <jobID>
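If you generate parameter files for many jobs, you can render them from a Python mapping instead of editing them by hand. The sketch below assumes the one-parameter-per-line `key=value` layout of the tfjob.params example. `render_params` is a hypothetical helper, not part of the DLC client.

```python
def render_params(params):
    """Render a mapping as key=value lines, one parameter per line."""
    lines = []
    for key, value in params.items():
        text = str(value)
        if "\n" in text:
            raise ValueError(f"value for {key!r} must fit on one line")
        lines.append(f"{key}={text}")
    return "\n".join(lines) + "\n"


# Values copied from the tfjob.params example; replace the placeholders.
params = {
    "name": "test_cli_tfjob_001",
    "workers": 1,
    "worker_cpu": 4,
    "worker_gpu": 0,
    "worker_memory": "4Gi",
    "worker_shared_memory": "4Gi",
    "worker_image": "registry-vpc.cn-beijing.aliyuncs.com/pai-dlc/tensorflow-training:1.12.2PAI-cpu-py27-ubuntu16.04",
    "command": "echo good && sleep 120",
    "resource_id": "<Replace with your resource quota ID>",
    "workspace_id": "<Replace with your WorkspaceID>",
}

if __name__ == "__main__":
    # Write the file that the submit command reads with --job_file.
    with open("tfjob.params", "w", encoding="utf-8") as handle:
        handle.write(render_params(params))
```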
Advanced parameters
Parameter (key) | Supported frameworks | Description | Value |
 | ALL | Custom resource release policy. By default, all pod resources are released after the job completes. Set this parameter to pod-exit to release a pod's resources immediately when that pod exits. | pod-exit |
 | ALL | Enable or disable IBGDA when the GPU driver is loaded. | |
 | ALL | If set to true, installs the GDRCopy kernel module (version 2.4.4). | |
 | ALL | Enable or disable NUMA binding. | |
 | ALL | Checks whether the quota has sufficient total resources (node specifications) for all roles in the job at submission time. | |
 | PyTorch | Enable or disable network communication between workers. When enabled, the domain name of each worker is its worker name. | |
 | PyTorch | Defines the network ports to open on each worker. This parameter can be used with If this parameter is not configured, only port 23456 is opened on the master by default. Therefore, make sure that port 23456 is not included in this custom port list. Important: This parameter is mutually exclusive with | A set of strings separated by semicolons (;). Each string can be a port number or a port range specified by a hyphen (-). |
 | PyTorch | Number of network ports to open for each worker. This parameter can be used with If this parameter is not configured, only port 23456 is opened on the master by default. DLC randomly assigns the specified number of ports to each worker. The assigned ports, in a semicolon-separated format, are passed to the worker in the | Integer (up to 65536) |
 | Ray | For Ray, define the runtime environment by configuring RayRuntimeEnv. Important: The environment variable and third-party library configurations are overwritten by this configuration. | Configure environment variables and third-party libraries |
 | Ray | The address of the external GCS Redis. | String |
 | Ray | The username of the external GCS Redis. | String |
 | Ray | The password of the external GCS Redis. | String |
 | Ray | The number of retries for the submitter. | Positive integer (int) |
 | Ray | Sets the shared memory for a node. For example, to configure 1 GiB of shared memory per node: | Positive integer (int) |
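The custom port list value in the table is a set of strings separated by semicolons, where each string is a port number or a hyphenated range. The following sketch shows one way to validate such a value before submitting a job, including the rule that port 23456 must not appear in the list. The function names and the sample input `8000;9000-9002` are hypothetical and are not taken from the DLC documentation.

```python
DEFAULT_MASTER_PORT = 23456  # Opened on the master by default, per the table.


def parse_port_list(value):
    """Expand a semicolon-separated list of ports and hyphenated ranges."""
    ports = []
    for item in value.split(";"):
        item = item.strip()
        if not item:
            continue
        if "-" in item:
            start_text, end_text = item.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(item)
        if not (0 < start <= end <= 65535):
            raise ValueError(f"invalid port or range: {item!r}")
        ports.extend(range(start, end + 1))
    return ports


def check_custom_ports(value):
    """Parse the list and reject it if it contains the default master port."""
    ports = parse_port_list(value)
    if DEFAULT_MASTER_PORT in ports:
        raise ValueError("port 23456 must not be included in the custom port list")
    return ports


print(check_custom_ports("8000;9000-9002"))
```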