PAI-DLC creates single-node or distributed training jobs on Kubernetes, removing the need to provision instances or configure environments. It supports multiple deep learning frameworks and flexible resource configurations.
Quick start
For an MNIST-based walkthrough of single-GPU or multi-node multi-GPU distributed training, see the Distributed Training DLC Quick Start.
Console parameters
Basic information
Configure the Job Name and Tag.
Environment information
| Parameter | Description |
| --- | --- |
| Image Configuration | In addition to selecting an Alibaba Cloud Image, you can use other image types. |
| Mount dataset | Datasets provide the data files required for model training. PAI supports two types of datasets. Mount Path: The path in the DLC container where the dataset is mounted. Important: If you configure a CPFS dataset, you must configure a VPC for DLC and ensure that the VPC is the same as the VPC of the CPFS file system. Otherwise, the submitted job may remain in the "Preparing" state for a long time. |
| Mount storage | You can also mount a data source path to read data or store intermediate files and results. |
| Startup Command | Set the startup command for the job. Shell commands are supported. DLC automatically injects common environment variables for PyTorch and TensorFlow. |
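As a minimal sketch of how a training script started by the Startup Command might consume injected environment variables: the variable names below (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE) are the conventional PyTorch distributed names and are an assumption here, because this page does not list which variables DLC injects. The helper name read_distributed_env is hypothetical.

```python
import os


def read_distributed_env(environ=None):
    """Collect rendezvous settings from environment variables.

    Falls back to single-process defaults when a variable is absent, so the
    same script also runs outside a distributed job.
    The variable names are assumed (PyTorch conventions), not confirmed by DLC docs.
    """
    env = os.environ if environ is None else environ
    return {
        "master_addr": env.get("MASTER_ADDR", "127.0.0.1"),
        # 23456 is the master port this page describes as opened by default.
        "master_port": int(env.get("MASTER_PORT", "23456")),
        "rank": int(env.get("RANK", "0")),
        "world_size": int(env.get("WORLD_SIZE", "1")),
    }


settings = read_distributed_env()
print("rank", settings["rank"], "of", settings["world_size"])
```

Passing a plain dict instead of os.environ makes the helper easy to check locally before the job is submitted.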
Resource information
| Parameter | Description |
| --- | --- |
| Resource Type | The default value is General Computing. Lingjun Intelligence Resources are available in the following regions: China (Ulanqab), Singapore, China (Shenzhen), China (Beijing), China (Shanghai), China (Hangzhou), China (Guangzhou), China (Hong Kong), Malaysia (Kuala Lumpur), Germany (Frankfurt), and Atlanta. |
| Source | |
| Framework | Supported deep learning training frameworks and tools: TensorFlow, PyTorch, ElasticBatch, XGBoost, OneFlow, MPIJob, Ray, Custom, DataJuicer, and MPI. Note: When you select Resource Quota and use Lingjun AI computing resources, you can submit only TensorFlow, PyTorch, ElasticBatch, MPIJob, and Ray jobs. |
| Job Resource | Based on the selected Framework, you can configure resources for Worker, PS, Chief, Evaluator, and GraphLearn node types. If you select the Ray framework, you can click Add Role to customize Worker roles and run jobs on heterogeneous resources. |
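The framework restriction for Lingjun resource quotas can be expressed as a small pre-submission check. This is only a sketch: the names used are the console display names from the table above, not API JobType values, and the function name framework_allowed is hypothetical.

```python
# Frameworks that can be submitted with a Lingjun resource quota (from the table above).
LINGJUN_QUOTA_FRAMEWORKS = {"TensorFlow", "PyTorch", "ElasticBatch", "MPIJob", "Ray"}

# All frameworks listed in the console, as named in the table above.
ALL_FRAMEWORKS = LINGJUN_QUOTA_FRAMEWORKS | {
    "XGBoost", "OneFlow", "Custom", "DataJuicer", "MPI",
}


def framework_allowed(framework, uses_lingjun_quota):
    """Return True if the framework can be used with the chosen resource source."""
    if framework not in ALL_FRAMEWORKS:
        raise ValueError(f"unknown framework: {framework!r}")
    return (not uses_lingjun_quota) or framework in LINGJUN_QUOTA_FRAMEWORKS


print(framework_allowed("XGBoost", uses_lingjun_quota=True))
print(framework_allowed("Ray", uses_lingjun_quota=True))
```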
VPC configuration
-
Without a VPC, the job uses a Public Gateway with limited bandwidth, which may slow down the job or cause it to fail.
-
Configure a VPC with a vSwitch and security group to improve bandwidth, stability, and security. The job cluster can directly access services within the VPC.
Important
-
If you use a VPC, ensure that the job's resource group instances and dataset storage (OSS) are in a VPC in the same region, and that the VPC is connected to the network of the code repository.
-
If you use a CPFS dataset, you must configure a VPC and ensure that the selected VPC is the same as the VPC of the CPFS file system. Otherwise, the submitted DLC training job may remain in the "Preparing" state for a long time.
-
You must configure a VPC when you submit a DLC job that uses preemptible Lingjun AI computing resources.
You can also configure an Internet Access Gateway by using one of the following methods:
-
Public Gateway: Has limited bandwidth that may be insufficient during high-concurrency access or large file downloads.
-
Private Gateway: To overcome Public Gateway bandwidth limits, create an Internet NAT Gateway in the DLC VPC, bind an EIP, and configure SNAT entries. For details, see Improve public network access speed by using a private gateway.
Fault tolerance and diagnosis
| Parameter | Description |
| --- | --- |
| Automatic Fault Tolerance | Enable Automatic Fault Tolerance and configure the required parameters to detect and mitigate algorithm-level errors, which improves GPU utilization. For details, see AIMaster: An elastic automatic fault tolerance engine. Note: When you enable automatic fault tolerance, an AIMaster instance starts and runs with the job instance. This consumes a specific amount of computing resources. |
| Sanity Check | Enable Sanity Check to comprehensively check training resources, isolate faulty nodes, and trigger backend automated O&M processes. This reduces early-stage failures and improves the job success rate. For details, see SanityCheck: Compute resource health check. Note: The health check feature is supported only for PyTorch training jobs that are submitted using a Lingjun AI computing resource quota and have a GPU count greater than 0. |
Roles and permissions
Configure the instance RAM role for the job. For details, see Configure a DLC RAM role.
| Instance RAM role | Description |
| --- | --- |
| Default Role of PAI | The PAI default role grants permissions through STS temporary credentials. |
| Custom Role | Select or enter a custom RAM role. The instance assumes this role's permissions when accessing cloud services through STS temporary credentials. |
| Does Not Associate Role | No RAM role is associated with the DLC job. This is the default option. |
Related topics
-
Job details, resource usage, and operation logs: View training details.
-
Billing details: Bill details.
-
Common issues and solutions: DLC FAQ.
-
Use cases: DLC use cases.
Appendix
Create a job via SDK or CLI
Python SDK
Step 1: Install the Credentials tool
Install the Credentials tool for SDK authentication. Requirements:
-
Python 3.7 or later.
-
Alibaba Cloud SDK 2.0 series.
pip install alibabacloud_credentials
Step 2: Obtain an AccessKey
This example uses an AccessKey pair. Store AccessKey values as environment variables to prevent security risks. The environment variable for the AccessKey ID is ALIBABA_CLOUD_ACCESS_KEY_ID, and the environment variable for the AccessKey secret is ALIBABA_CLOUD_ACCESS_KEY_SECRET.
-
Obtain an AccessKey pair: Create an AccessKey.
-
Set environment variables: Configure environment variables.
-
Other credential methods: Install the Credentials tool.
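Before creating a client, it can help to confirm that both environment variables named above are actually set. The following is a minimal sketch that uses only the standard library; the helper name missing_credential_vars is hypothetical and is not part of the Credentials tool.

```python
import os

# The two environment variables this step asks you to set.
REQUIRED_VARS = (
    "ALIBABA_CLOUD_ACCESS_KEY_ID",
    "ALIBABA_CLOUD_ACCESS_KEY_SECRET",
)


def missing_credential_vars(environ=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_credential_vars()
if missing:
    print("Set these environment variables first:", ", ".join(missing))
else:
    print("AccessKey environment variables are set.")
```

The check only reports variable names and never prints their values, so it is safe to run in shared logs.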
Step 3: Install the Python SDKs
-
Install the workspace SDK.
pip install alibabacloud_aiworkspace20210204==3.0.1
-
Install the DLC SDK.
pip install alibabacloud_pai_dlc20201203==1.4.17
Step 4: Submit the job
Public resources
The following sample code creates and submits a job.
Subscription resource quota
-
Log on to the PAI console.
-
To view your workspace ID: In the left-side navigation pane, click Workspaces. Find the target workspace, click the ⓘ icon next to its name, then view and copy the Workspace ID from the information card that appears.
-
To view the ID of your resource quota for the dedicated resource group: In the left-side navigation pane, choose AI Computing Resources > Resource Quotas. Click the General-purpose Computing Resources tab and obtain the Quota ID from the Name/ID column in the resource quota list.
-
Use the following code to create and submit a job. For a list of available public images, see Step 2: Prepare an image.
from alibabacloud_pai_dlc20201203.client import Client
from alibabacloud_credentials.client import Client as CredClient
from alibabacloud_tea_openapi.models import Config
from alibabacloud_pai_dlc20201203.models import (
    CreateJobRequest,
    JobSpec,
    ResourceConfig,
    GetJobRequest,
)

# Initialize a client to access the DLC API.
region = 'cn-hangzhou'
# An AccessKey pair provides full API access. For security purposes, we recommend that you use a RAM user for API access and daily O&M.
# Do not hard-code your AccessKey ID and AccessKey secret in your code. This may lead to AccessKey leakage and compromise the security of all resources in your account.
# This example shows how to use the Credentials SDK to read the AccessKey from environment variables for authentication.
cred = CredClient()
client = Client(
    config=Config(
        credential=cred,
        region_id=region,
        endpoint=f'pai-dlc.{region}.aliyuncs.com',
    )
)

# Declare the resource configuration for the job. For image selection, you can refer to the public image list in the documentation or provide your own image URL.
spec = JobSpec(
    type='Worker',
    image='registry-vpc.cn-hangzhou.aliyuncs.com/pai-dlc/tensorflow-training:1.15-cpu-py36-ubuntu18.04',
    pod_count=1,
    resource_config=ResourceConfig(cpu='1', memory='2Gi'),
)

# Declare the job's execution details.
req = CreateJobRequest(
    resource_id='<Replace with the ID of your resource quota>',
    workspace_id='<Replace with your WorkspaceID>',
    display_name='sample-dlc-job',
    job_type='TFJob',
    job_specs=[spec],
    user_command='echo "Hello World"',
)

# Submit the job.
response = client.create_job(req)
# Get the job ID.
job_id = response.body.job_id

# Query the job status.
job = client.get_job(job_id, GetJobRequest()).body
print('job status:', job.status)

# View the command executed by the job.
print('user command:', job.user_command)
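The sample above queries the job status once. A script that must wait for the result can poll until the job reaches a final state. The sketch below is independent of the SDK: it takes any zero-argument callable that returns the current status string. The set of terminal status names (Succeeded, Failed, Stopped) is an assumption and should be checked against the DLC API reference; wait_for_job is a hypothetical helper, not an SDK function.

```python
import time

# Assumed terminal status names; verify against the DLC API reference.
TERMINAL_STATUSES = {"Succeeded", "Failed", "Stopped"}


def wait_for_job(get_status, poll_seconds=10.0, timeout_seconds=3600.0, sleep=time.sleep):
    """Poll get_status() until it returns a terminal status or the timeout elapses.

    get_status: zero-argument callable returning the current status string.
    sleep: injectable for testing; defaults to time.sleep.
    """
    waited = 0.0
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"job still in status {status!r} after {waited:.0f}s")
        sleep(poll_seconds)
        waited += poll_seconds


# With the client and job_id from the sample above, usage might look like:
# final = wait_for_job(lambda: client.get_job(job_id, GetJobRequest()).body.status)
```

Injecting the status callable keeps the loop testable without network access and leaves authentication and client setup to the surrounding script.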
Spot instances
-
SpotDiscountLimit (spot discount)
#!/usr/bin/env python3
from alibabacloud_tea_openapi.models import Config
from alibabacloud_credentials.client import Client as CredClient
from alibabacloud_pai_dlc20201203.client import Client as DLCClient
from alibabacloud_pai_dlc20201203.models import CreateJobRequest

region_id = '<region-id>'  # The ID of the region in which the DLC job resides, such as cn-hangzhou.
cred = CredClient()
workspace_id = '12****'  # The ID of the workspace to which the DLC job belongs.

dlc_client = DLCClient(
    Config(credential=cred,
           region_id=region_id,
           endpoint='pai-dlc.{}.aliyuncs.com'.format(region_id),
           protocol='http'))

create_job_resp = dlc_client.create_job(CreateJobRequest().from_map({
    'WorkspaceId': workspace_id,
    'DisplayName': 'sample-spot-job',
    'JobType': 'PyTorchJob',
    'JobSpecs': [
        {
            "Type": "Worker",
            "Image": "dsw-registry-vpc.<region-id>.cr.aliyuncs.com/pai/pytorch-training:1.12-cpu-py39-ubuntu20.04",
            "PodCount": 1,
            "EcsSpec": 'ecs.g7.xlarge',
            "SpotSpec": {
                "SpotStrategy": "SpotWithPriceLimit",
                "SpotDiscountLimit": 0.4,
            }
        },
    ],
    'UserVpc': {
        "VpcId": "vpc-0jlq8l7qech3m2ta2****",
        "SwitchId": "vsw-0jlc46eg4k3pivwpz8****",
        "SecurityGroupId": "sg-0jl4bd9wwh5auei9****",
    },
    "UserCommand": "echo 'Hello World' && ls -R /mnt/data/ && sleep 30 && echo 'DONE'",
}))
job_id = create_job_resp.body.job_id
print(f'jobId is {job_id}')
-
SpotPriceLimit (spot price)
#!/usr/bin/env python3
from alibabacloud_tea_openapi.models import Config
from alibabacloud_credentials.client import Client as CredClient
from alibabacloud_pai_dlc20201203.client import Client as DLCClient
from alibabacloud_pai_dlc20201203.models import CreateJobRequest

region_id = '<region-id>'
cred = CredClient()
workspace_id = '12****'

dlc_client = DLCClient(
    Config(credential=cred,
           region_id=region_id,
           endpoint='pai-dlc.{}.aliyuncs.com'.format(region_id),
           protocol='http'))

create_job_resp = dlc_client.create_job(CreateJobRequest().from_map({
    'WorkspaceId': workspace_id,
    'DisplayName': 'sample-spot-job',
    'JobType': 'PyTorchJob',
    'JobSpecs': [
        {
            "Type": "Worker",
            "Image": "dsw-registry-vpc.<region-id>.cr.aliyuncs.com/pai/pytorch-training:1.12-cpu-py39-ubuntu20.04",
            "PodCount": 1,
            "EcsSpec": 'ecs.g7.xlarge',
            "SpotSpec": {
                "SpotStrategy": "SpotWithPriceLimit",
                "SpotPriceLimit": 0.011,
            }
        },
    ],
    'UserVpc': {
        "VpcId": "vpc-0jlq8l7qech3m2ta2****",
        "SwitchId": "vsw-0jlc46eg4k3pivwpz8****",
        "SecurityGroupId": "sg-0jl4bd9wwh5auei9****",
    },
    "UserCommand": "echo 'Hello World' && ls -R /mnt/data/ && sleep 30 && echo 'DONE'",
}))
job_id = create_job_resp.body.job_id
print(f'jobId is {job_id}')
The following table describes the key parameters.
| Parameter | Description |
| --- | --- |
| SpotStrategy | The bidding policy. The bidding type parameters take effect only if you set this parameter to SpotWithPriceLimit. |
| SpotDiscountLimit | The spot discount bidding type. |
| SpotPriceLimit | The spot price bidding type. |
| UserVpc | This parameter is required when you use Lingjun resources to submit jobs. Configure the VPC, vSwitch, and security group ID for the region in which the job resides. |
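The two samples above differ only in the SpotSpec mapping. A small helper can build that mapping and catch obvious mistakes before the request is sent. This is a sketch under assumptions: each sample sets only one of the two limits, so the helper treats them as mutually exclusive, which this page does not state explicitly. The name build_spot_spec is hypothetical.

```python
def build_spot_spec(discount_limit=None, price_limit=None):
    """Build the SpotSpec mapping used in the JobSpecs entries above.

    Exactly one of discount_limit or price_limit must be given
    (assumed from the samples, not confirmed by the API reference).
    """
    if (discount_limit is None) == (price_limit is None):
        raise ValueError("set exactly one of discount_limit or price_limit")
    # The limits take effect only with the SpotWithPriceLimit strategy.
    spec = {"SpotStrategy": "SpotWithPriceLimit"}
    if discount_limit is not None:
        if discount_limit <= 0:
            raise ValueError("discount_limit must be positive")
        spec["SpotDiscountLimit"] = discount_limit
    else:
        if price_limit <= 0:
            raise ValueError("price_limit must be positive")
        spec["SpotPriceLimit"] = price_limit
    return spec


print(build_spot_spec(discount_limit=0.4))
```

The returned dict can be placed under the "SpotSpec" key of a JobSpecs entry passed to CreateJobRequest().from_map, as in the samples.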
CLI
Step 1: Download the client and authenticate
Download the client tool for Linux (64-bit) or macOS and complete authentication. For details, see Preparations.
Step 2: Submit the job
-
Log on to the PAI console.
-
To view your workspace ID:
In the left-side navigation pane, click Workspaces. Find the target workspace, click the ⓘ icon next to its name and view the Workspace ID in the information card that appears.
-
To view your resource quota ID:
In the left-side navigation pane, choose AI Computing Resources > Resource Quotas. Select the tab of the target resource type, such as General-purpose Computing Resources, and obtain the resource quota ID from the Name/ID column.
-
Create a parameter file named tfjob.params with the following content. For parameter file details, see Submit command.
name=test_cli_tfjob_001
workers=1
worker_cpu=4
worker_gpu=0
worker_memory=4Gi
worker_shared_memory=4Gi
worker_image=registry-vpc.cn-beijing.aliyuncs.com/pai-dlc/tensorflow-training:1.12.2PAI-cpu-py27-ubuntu16.04
command=echo good && sleep 120
resource_id=<Replace with your resource quota ID>
workspace_id=<Replace with your WorkspaceID>
-
Run the following command to submit the DLC job to a specified workspace and resource quota. Use the '--job_file' parameter to specify the path to your parameter file.
./dlc submit tfjob --job_file ./tfjob.params
-
Run the following command to view the DLC job that you submitted.
./dlc get job <jobID>
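When many jobs are submitted with slightly different settings, the parameter file can be generated rather than edited by hand. The sketch below writes key=value lines in the same layout as tfjob.params above. It assumes the file format is exactly one key=value pair per line, as the sample suggests; the helper names render_params and write_params are hypothetical.

```python
from pathlib import Path


def render_params(params):
    """Render a mapping as key=value lines, one per line, in insertion order."""
    lines = []
    for key, value in params.items():
        text = str(value)
        if "\n" in text:
            raise ValueError(f"value for {key!r} must be a single line")
        lines.append(f"{key}={text}")
    return "\n".join(lines) + "\n"


def write_params(path, params):
    """Write the rendered parameters to path and return the path."""
    target = Path(path)
    target.write_text(render_params(params), encoding="utf-8")
    return target


example = {
    "name": "test_cli_tfjob_001",
    "workers": 1,
    "worker_cpu": 4,
    "worker_memory": "4Gi",
    "command": "echo good && sleep 120",
}
print(render_params(example), end="")
```

The generated file can then be passed to the submit command through the --job_file parameter shown above.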
Advanced parameters
|
Parameter |
Supported frameworks |
Description |
Value |
|
|
ALL |
By default, all pod resources are released after the job completes. The only other supported value is 'pod-exit', which releases a pod's resources as soon as the pod exits. |
pod-exit |
|
|
ALL |
Specifies whether to enable the IBGDA feature when the GPU driver is loaded. |
|
|
|
ALL |
Specifies whether to install the GDRCopy kernel module. (Version: 2.4.4) |
|
|
|
ALL |
Specifies whether to enable NUMA core binding. |
|
|
|
ALL |
Specifies whether to check if the total resources (node specifications) in the quota can meet the specifications of all roles in the job upon submission. |
|
|
|
PyTorch |
Specifies whether to allow network communication between workers.
After this feature is enabled, the domain name of each worker is the same as its worker name, such as |
|
|
|
PyTorch |
Allows you to specify the network ports to open on each worker, which can be used with If this parameter is not configured, only port 23456 on the master worker is opened by default. Therefore, make sure that port 23456 is not included in this custom port list. Important
This parameter and |
A set of strings separated by semicolons, where each string is a port number or a port range connected by a hyphen, such as |
|
|
PyTorch |
This allows you to request several network ports for each worker and can be used with If this setting is not configured, only port 23456 is opened on the master node by default. DLC randomly assigns ports to worker nodes based on the number of ports that you specify. The assigned port numbers are passed to the worker nodes through the Important
|
An integer up to 65536. |
|
|
Ray |
When the framework is Ray, you can manually configure RayRuntimeEnv to define the runtime environment. Important
This configuration overrides other environment variable and third-party library settings. |
Configure environment variables and third-party libraries ( |
|
|
Ray |
The address of the external GCS Redis server. |
String |
|
|
Ray |
The username for the external GCS Redis server. |
String |
|
|
Ray |
The password for the external GCS Redis server. |
String |
|
|
Ray |
The number of submitter retries. |
Positive integer (int) |
|
|
Ray |
Configures shared memory for a node. For example, to configure 1 GiB of shared memory for each node, use the following configuration:
|
Positive integer (int) |
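The custom port list value described in the table above is a set of strings separated by semicolons, where each string is a port number or a hyphenated port range, and it must not include port 23456. The following sketch validates such a value before it is submitted. The example string "8000;9000-9002" and the function name parse_port_list are illustrative only, and the 1 to 65535 bound is the standard TCP port range rather than a limit stated on this page.

```python
RESERVED_MASTER_PORT = 23456  # opened on the master worker by default


def parse_port_list(text):
    """Parse a semicolon-separated list of ports and port ranges.

    Returns the sorted list of distinct ports. Raises ValueError for
    malformed items, out-of-range ports, or the reserved master port.
    """
    ports = set()
    for item in text.split(";"):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing semicolon
        start_text, sep, end_text = item.partition("-")
        start = int(start_text)
        end = int(end_text) if sep else start
        if not (1 <= start <= end <= 65535):
            raise ValueError(f"invalid port or range: {item!r}")
        ports.update(range(start, end + 1))
    if RESERVED_MASTER_PORT in ports:
        raise ValueError(f"port {RESERVED_MASTER_PORT} must not be in the custom port list")
    return sorted(ports)


print(parse_port_list("8000;9000-9002"))
```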