Deep Learning Containers (DLC) in the Platform for AI (PAI) console allows you to quickly create standalone or distributed training jobs. Kubernetes automatically starts compute nodes at the underlying layer of DLC, so you do not need to manually purchase instances or configure environments, and your usage habits remain unchanged. DLC is suitable for users who require quick startup of training jobs. It supports a variety of deep learning frameworks and provides flexible resource configurations.
Prerequisites
PAI is activated, and a workspace is created by using your Alibaba Cloud account. To activate PAI, log on to the PAI console, select a region in the top navigation bar, and click Activate after authorization.
The account that you use to perform operations is granted the required permissions. If you use an Alibaba Cloud account, you can skip this prerequisite. If you use a RAM user, you must assign one of the following roles to the RAM user: algorithm developer, algorithm O&M engineer, or workspace administrator.
Submit a job in the PAI console
If this is the first time you use DLC, we recommend that you submit a job in the PAI console. You can also submit a job by using SDK for Python or the command line.
Go to the Create Job page.
Log on to the PAI console, select the target region and workspace, and then click Enter Deep Learning Containers (DLC).
On the Deep Learning Containers (DLC) page, click Create Job.
Configure the parameters in the following sections.
Basic Information
In this section, configure the Job Name and Tag parameters.
Environment Information
Parameter
Description
Node Image
The node image. You can select Alibaba Cloud Image. You can also select one of the following values:
Custom Image: a custom image that you uploaded to PAI. You must make sure that you can pull images from the image repository or the images are stored on a Container Registry Enterprise Edition instance.
Note: If you want to use Lingjun resources and custom images, install Remote Direct Memory Access (RDMA) support to use the high-performance RDMA network of Lingjun.
Image Address: the address of a custom or Alibaba Cloud image that can be accessed over the Internet.
If you enter a private image address, click Enter Username and Password and specify the Username and Password parameters to grant permissions on the private image repository.
You can also use an accelerated image in PAI.
Data Set
The dataset that provides data files required in model training. You can use one of the following dataset types.
Custom Dataset: Create a custom dataset to store data files required in model training. You can configure the Read/Write Permission parameter and select the required version in the Versions panel.
Public Dataset: Select an existing public dataset provided by PAI. Public datasets only support read-only mounting.
Mount Path: the path in the DLC container, such as /mnt/data. You can run commands to query datasets based on the mount path you specified. For more information about mounting configuration, see Use cloud storage for a DLC training job.
Important: If you select a Cloud Parallel File Storage (CPFS) dataset, you must configure a virtual private cloud (VPC) for the DLC job. The VPC must be the same as the VPC configured for the CPFS dataset. Otherwise, the job may stay in the preparing environment state for a long time.
Directly Mount
You can directly mount data sources to read data or store intermediate and result files.
Supported data sources: OSS, General-purpose NAS, Extreme NAS, and BMCPFS. BMCPFS is available only for jobs that use Lingjun resources.
Advanced Settings: You can configure this parameter for different data sources to implement specific features. Examples:
OSS: You can add the {"mountType":"ossfs"} configuration in Advanced Settings to mount an OSS bucket by using ossfs.
General-purpose NAS and CPFS: You can specify the nconnect parameter in Advanced Settings to improve the throughput of NAS access in DLC containers. Sample configuration: {"nconnect":"<Sample value>"}. Replace <Sample value> with a positive integer.
For more information, see Use cloud storage for a DLC training job.
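As a quick sanity check, the first lines of a startup command can verify that the dataset is visible at the mount path. A minimal sketch, assuming the sample path /mnt/data used in this section:

```shell
# Check whether the dataset mount path exists and list its contents.
# /mnt/data is the sample mount path from this guide; adjust it to your job.
MOUNT_PATH="/mnt/data"
if [ -d "$MOUNT_PATH" ]; then
  echo "dataset directory found at $MOUNT_PATH"
  ls "$MOUNT_PATH" | head
else
  echo "no directory at $MOUNT_PATH (running outside a DLC container?)"
fi
```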
Startup Command
The commands that the job runs. Shell commands are supported. DLC automatically injects general PyTorch and TensorFlow environment variables, such as MASTER_ADDR and WORLD_SIZE. You can obtain the variables by using $<Environment variable name>. Sample commands:
Run Python:
python -c "print('Hello World')"
Start PyTorch multi-machine and multi-GPU distributed training:
python -m torch.distributed.launch \
    --nproc_per_node=2 \
    --master_addr=${MASTER_ADDR} \
    --master_port=${MASTER_PORT} \
    --nnodes=${WORLD_SIZE} \
    --node_rank=${RANK} \
    train.py --epochs=100
Set a shell script path as the startup command:
/ml/input/config/launch.sh
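In training code, the injected variables can also be read directly from the environment. A minimal sketch; the fallback defaults are placeholders for local runs, not values that DLC guarantees:

```python
import os

# Read the distributed-training variables that DLC injects into each container.
# The defaults below are local-testing placeholders, not DLC behavior.
master_addr = os.environ.get("MASTER_ADDR", "127.0.0.1")
master_port = int(os.environ.get("MASTER_PORT", "23456"))
world_size = int(os.environ.get("WORLD_SIZE", "1"))
rank = int(os.environ.get("RANK", "0"))

print(f"rank {rank}/{world_size} -> {master_addr}:{master_port}")
```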
Resource Information
Parameter
Description
Resource Type
The resource type. Default value: General Computing Resources. You can select Lingjun Resources only for the China (Ulanqab), Singapore, China (Shenzhen), China (Beijing), China (Shanghai), and China (Hangzhou) regions.
Source
Public Resources:
Billing method: Pay-as-you-go.
Scenarios: Training jobs that run on public resources may encounter queuing delays. We recommend that you use public resources in time-insensitive scenarios that involve a small number of jobs.
Limits: Public resources can provide up to two GPUs and eight vCPUs. To increase the resource quota, contact your sales manager.
Resource Quota: includes general computing resources and Lingjun resources.
Billing method: Subscription.
Scenarios: Resource quotas are suitable for scenarios that require high assurance and involve a large number of jobs.
Special parameters:
Resource Quota: Specifies the number of GPUs, vCPUs, and other resources. For more information, see Create a resource quota.
Priority: Specifies the priority for running the job. Valid values: 1 to 9. A greater value indicates a higher priority.
Preemptible Resources:
Billing method: Pay-as-you-go.
Scenarios: Preemptible resources are suitable for scenarios that require cost reduction. In most cases, preemptible resources offer additional discounts.
Limits: The high availability and stability of preemptible resources are not guaranteed. Resources may not be immediately available or may be released by the system. For more information, see Use a preemptible job.
Framework
The deep learning training framework and tool. Valid values: TensorFlow, PyTorch, ElasticBatch, XGBoost, OneFlow, MPIJob, and Ray.
Note: If you set the Resource Quota parameter to Lingjun resources, you can submit only the following types of jobs: TensorFlow, PyTorch, ElasticBatch, MPIJob, and Ray.
Job Resource
Configure the resources of the following nodes based on the framework you selected: worker nodes, parameter server (PS) nodes, chief nodes, evaluator nodes, and GraphLearn nodes. If you select the Ray framework, you can click Add Role to create a custom Worker role. This enables different types of computing resources to work together seamlessly.
Use public resources: You can configure the following parameters:
Number of Nodes: the number of nodes on which the DLC job runs.
Resource Type: Select an instance type. The prices of different instance types are displayed in the Instance Type panel. For information about the billing, see Billing of DLC.
Use resource quotas: In addition to the Number of Nodes, vCPUs, GPUs, Memory (GiB), and Shared Memory (GiB) parameters, you can also configure the following special parameters:
Node-Specific Scheduling: You can specify a computing node to run the job.
Idle Resources: If you enable this feature, jobs can run on idle resources of quotas that are allocated to other business jobs. This effectively improves resource utilization. However, when the owner of the quota reclaims the resources, jobs that run on the idle resources are terminated and the idle resources are automatically returned.
CPU Affinity: If you enable this feature, processes in a container or pod can be bound to a specific CPU core for execution. This prevents issues such as CPU cache misses and context switches, and improves CPU utilization and application performance. This feature is suitable for scenarios that have high requirements on performance and timeliness.
Use preemptible resources: In addition to the Number of Nodes and Resource Type parameters, you can also configure the Bid Price parameter to specify the maximum bid price for the preemptible resources. You can click the switch icon to change the bidding method.
Bid Price (Discount): The maximum bid price ranges from 10% to 90% of the market price in 10% increments. You obtain the preemptible resources if your bid meets or exceeds the market price and inventory is available.
Bid Price ($/Minutes): The maximum bid price range is based on the market price range.
VPC
If you do not configure a VPC, an Internet connection and public gateways are used. Due to the limited bandwidth of public gateways, the job may become stuck or may not run as expected.
If you configure a VPC and select a vSwitch and a security group, the network bandwidth is increased, and both the performance stability and security are enhanced. After the configuration takes effect, the cluster on which the job runs directly accesses the services in the VPC.
ImportantIf you use VPCs, you must make sure that instances in the resource group and the OSS bucket of the dataset reside in the VPCs of the same region, and that the VPCs are connected to the networks of the code repository.
If you use a CPFS dataset, you must configure a VPC. The VPC must be the same as the VPC configured for the CPFS dataset. Otherwise, the job may stay in the preparing environment state for a long time.
If you use Lingjun preemptible resources to submit a DLC job, you must configure a VPC.
You can also configure the Internet Gateway parameter. Valid values:
Public Gateway: The public bandwidth is limited. The download rate may not meet your business requirements in high concurrency or large file downloading scenarios.
Private Gateway: To increase the limited public bandwidth, you can use a private gateway. You must create an Internet NAT gateway, associate an elastic IP address (EIP), and configure SNAT in the VPC that is associated with the job. For more information, see Improve Internet access rate by using a private gateway.
Fault Tolerance and Diagnosis
Parameter
Description
Automatic Fault Tolerance
After you turn on Automatic Fault Tolerance and configure the related parameters, the system monitors the jobs to identify and handle algorithmic errors. This helps improve GPU utilization. For more information, see AIMaster: Elastic fault tolerance engine.
Note: After you enable Automatic Fault Tolerance, the system starts an AIMaster instance that runs together with the job instance and occupies the following resources:
Resource quotas: 1 CPU core and 1 GB of memory.
Public resources: ecs.c6.large.
Sanity Check
After you turn on Sanity Check, the system detects the resources that are used to run the jobs, isolates faulty nodes, and triggers automated O&M processes in the background. Sanity check effectively reduces job failures in the early stage of training and improves the training success rate.
Note: You can enable sanity check only for PyTorch jobs that run on Lingjun resources and use GPUs.
Roles and Permissions
The following table describes how to configure the Instance RAM Role parameter. For more information, see Associate a RAM role with a DLC job.
Instance RAM Role
Description
Default Roles of PAI
The default roles of PAI are developed based on the AliyunPAIDLCDefaultRole role and have only the permissions to access MaxCompute and OSS. You can use the default roles to implement fine-grained permission management. If you have the temporary credentials issued by the default roles of PAI:
You are granted the same permissions as the owner of a DLC job when you access MaxCompute tables.
When you access OSS, you can access only the bucket that is configured as the default storage path for the current workspace.
Custom Roles
Select or create a custom Resource Access Management (RAM) role. You are granted the same permissions as the custom role you select when you call API operations of other Alibaba Cloud services by using Security Token Service (STS) temporary credentials.
Does Not Associate Role
Do not associate a RAM role with the DLC job. By default, this option is selected.
After you configure the parameters, click Confirm.
What to do next
After you submit the training job, you can perform the following operations:
View the basic information, resource views, and logs of the job. For more information, see View training jobs.
Manage the training job. Clone, stop, or delete the job.
View the analysis report of model training results by using TensorBoard.
Monitor the training job and configure alert rules. For more information, see Training monitoring and alerting.
View detailed information about your job execution bills. For more information, see Bill details.
Forward logs of the DLC job from the workspace to a specific Logstore for custom analysis. For more information, see Subscribe to job logs.
Create a notification rule for the workspace on the Configure Event Notification tab in the PAI console to track and monitor the status of the DLC job.
If you have other questions about DLC jobs, see FAQ about DLC.
View DLC use cases.
Appendix
Submit a job by using SDK for Python or the command line
Use SDK for Python
Step 1: Install the Alibaba Cloud SDK Credentials tool
You must configure valid credential information before you call API operations to manage cloud resources by using Alibaba Cloud SDKs. Prerequisites:
Python 3.7 or later is installed.
Alibaba Cloud SDK V2.0 is installed.
pip install alibabacloud_credentials
Step 2: Obtain an AccessKey pair
In this example, an AccessKey pair is used as the access credential. To prevent account information leaks, we recommend that you configure the AccessKey pair as environment variables. The variable names for the AccessKey ID and AccessKey secret are ALIBABA_CLOUD_ACCESS_KEY_ID and ALIBABA_CLOUD_ACCESS_KEY_SECRET.
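For example, in a POSIX shell the two variables can be set as follows; the values are placeholders for your own AccessKey pair:

```shell
# Placeholder values: replace them with your own AccessKey pair.
# On Windows, use `set` (cmd) or `$Env:` (PowerShell) instead of `export`.
export ALIBABA_CLOUD_ACCESS_KEY_ID="<your-access-key-id>"
export ALIBABA_CLOUD_ACCESS_KEY_SECRET="<your-access-key-secret>"
```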
References:
Step 3: Install SDK for Python
Install the workspace SDK.
pip install alibabacloud_aiworkspace20210204==3.0.1
Install the DLC SDK.
pip install alibabacloud_pai_dlc20201203==1.4.17
Step 4: Submit the job
Use public resources to submit the job
The following sample code provides an example on how to create and submit a job:
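The sketch below shows only the shape of the request body, modeled on the resource-quota example later in this topic. For public resources, an instance type (EcsSpec) is specified instead of a resource quota ID; this mapping is an assumption based on the preemptible-resource samples. Pass the dictionary to CreateJobRequest().from_map() as in those samples.

```python
# Sketch of a CreateJob request body for public resources (assumption:
# EcsSpec replaces ResourceId, mirroring the preemptible-resource samples).
request_body = {
    'WorkspaceId': '<Your workspace ID>',
    'DisplayName': 'sample-public-job',
    'JobType': 'TFJob',
    'JobSpecs': [{
        'Type': 'Worker',
        'Image': 'registry-vpc.cn-hangzhou.aliyuncs.com/pai-dlc/tensorflow-training:1.15-cpu-py36-ubuntu18.04',
        'PodCount': 1,
        'EcsSpec': 'ecs.c6.large',  # public resources are billed per instance type
    }],
    'UserCommand': 'echo "Hello World"',
}
print(request_body['DisplayName'])
```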
Use subscription resource quotas to submit the job
Log on to the PAI console.
On the Workspaces page, obtain your workspace ID.
Obtain the resource quota ID of your dedicated resource group.
Run the following code to create and submit the job. For information about the available public images, see Step 2: Prepare an image.
from alibabacloud_pai_dlc20201203.client import Client
from alibabacloud_credentials.client import Client as CredClient
from alibabacloud_tea_openapi.models import Config
from alibabacloud_pai_dlc20201203.models import (
    CreateJobRequest,
    JobSpec,
    ResourceConfig,
    GetJobRequest,
)

# Initialize a client to access the DLC API operations.
region = 'cn-hangzhou'
# The AccessKey pair of an Alibaba Cloud account has permissions on all API operations. Using these credentials to perform operations is a high-risk operation. We recommend that you use a RAM user to call API operations or perform routine O&M. To create a RAM user, log on to the RAM console.
# We recommend that you do not save the AccessKey ID and the AccessKey secret in your project code. Otherwise, the AccessKey pair may be leaked, and this may compromise the security of all resources within your account.
# In this example, the Credentials SDK reads the AccessKey pair from the environment variables to perform identity verification.
cred = CredClient()
client = Client(
    config=Config(
        credential=cred,
        region_id=region,
        endpoint=f'pai-dlc.{region}.aliyuncs.com',
    )
)

# Specify the resource configurations of the job. You can select a public image or specify an image address. For information about the available public images, see the reference documentation.
spec = JobSpec(
    type='Worker',
    image='registry-vpc.cn-hangzhou.aliyuncs.com/pai-dlc/tensorflow-training:1.15-cpu-py36-ubuntu18.04',
    pod_count=1,
    resource_config=ResourceConfig(cpu='1', memory='2Gi'),
)

# Specify the execution information for the job.
req = CreateJobRequest(
    resource_id='<Your resource quota ID>',
    workspace_id='<Your workspace ID>',
    display_name='sample-dlc-job',
    job_type='TFJob',
    job_specs=[spec],
    user_command='echo "Hello World"',
)

# Submit the job.
response = client.create_job(req)

# Obtain the job ID.
job_id = response.body.job_id

# Query the job status.
job = client.get_job(job_id, GetJobRequest()).body
print('job status:', job.status)

# View the commands that the job runs.
print(job.user_command)
Use preemptible resources to submit jobs
SpotDiscountLimit (Spot discount)
#!/usr/bin/env python3
from alibabacloud_tea_openapi.models import Config
from alibabacloud_credentials.client import Client as CredClient
from alibabacloud_pai_dlc20201203.client import Client as DLCClient
from alibabacloud_pai_dlc20201203.models import CreateJobRequest

region_id = '<region-id>'  # The ID of the region in which the DLC job resides, such as cn-hangzhou.
cred = CredClient()
workspace_id = '12****'  # The ID of the workspace to which the DLC job belongs.

dlc_client = DLCClient(
    Config(credential=cred,
           region_id=region_id,
           endpoint='pai-dlc.{}.aliyuncs.com'.format(region_id),
           protocol='http'))

create_job_resp = dlc_client.create_job(CreateJobRequest().from_map({
    'WorkspaceId': workspace_id,
    'DisplayName': 'sample-spot-job',
    'JobType': 'PyTorchJob',
    'JobSpecs': [
        {
            "Type": "Worker",
            "Image": "dsw-registry-vpc.<region-id>.cr.aliyuncs.com/pai/pytorch-training:1.12-cpu-py39-ubuntu20.04",
            "PodCount": 1,
            "EcsSpec": 'ecs.g7.xlarge',
            "SpotSpec": {
                "SpotStrategy": "SpotWithPriceLimit",
                "SpotDiscountLimit": 0.4,
            }
        },
    ],
    'UserVpc': {
        "VpcId": "vpc-0jlq8l7qech3m2ta2****",
        "SwitchId": "vsw-0jlc46eg4k3pivwpz8****",
        "SecurityGroupId": "sg-0jl4bd9wwh5auei9****",
    },
    "UserCommand": "echo 'Hello World' && ls -R /mnt/data/ && sleep 30 && echo 'DONE'",
}))
job_id = create_job_resp.body.job_id
print(f'jobId is {job_id}')
SpotPriceLimit (Spot price)
#!/usr/bin/env python3
from alibabacloud_tea_openapi.models import Config
from alibabacloud_credentials.client import Client as CredClient
from alibabacloud_pai_dlc20201203.client import Client as DLCClient
from alibabacloud_pai_dlc20201203.models import CreateJobRequest
region_id = '<region-id>'
cred = CredClient()
workspace_id = '12****'
dlc_client = DLCClient(
Config(credential=cred,
region_id=region_id,
endpoint='pai-dlc.{}.aliyuncs.com'.format(region_id),
protocol='http'))
create_job_resp = dlc_client.create_job(CreateJobRequest().from_map({
'WorkspaceId': workspace_id,
'DisplayName': 'sample-spot-job',
'JobType': 'PyTorchJob',
'JobSpecs': [
{
"Type": "Worker",
"Image": "dsw-registry-vpc.<region-id>.cr.aliyuncs.com/pai/pytorch-training:1.12-cpu-py39-ubuntu20.04",
"PodCount": 1,
"EcsSpec": 'ecs.g7.xlarge',
"SpotSpec": {
"SpotStrategy": "SpotWithPriceLimit",
"SpotPriceLimit": 0.011,
}
},
],
'UserVpc': {
"VpcId": "vpc-0jlq8l7qech3m2ta2****",
"SwitchId": "vsw-0jlc46eg4k3pivwpz8****",
"SecurityGroupId": "sg-0jl4bd9wwh5auei9****",
},
"UserCommand": "echo 'Hello World' && ls -R /mnt/data/ && sleep 30 && echo 'DONE'",
}))
job_id = create_job_resp.body.job_id
print(f'jobId is {job_id}')
The following table describes the key configurations.
Parameter | Description
SpotStrategy | The bidding policy. The bid price parameters take effect only if you set this parameter to SpotWithPriceLimit.
SpotDiscountLimit | The spot discount bidding type. For example, a value of 0.4 indicates that the bid price is at most 40% of the pay-as-you-go price.
SpotPriceLimit | The spot price bidding type. The value is the maximum absolute price that you bid.
UserVpc | This parameter is required when you use Lingjun resources to submit jobs. Configure the VPC, vSwitch, and security group ID for the region in which the job resides.
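The two bidding modes can be compared with a line of arithmetic; the pay-as-you-go price below is hypothetical, and the sample bid values match the code above:

```python
# Hypothetical pay-as-you-go hourly price, for illustration only.
pay_as_you_go_price = 1.0

# SpotDiscountLimit mode: bid at most this fraction of the pay-as-you-go price.
spot_discount_limit = 0.4
max_bid_from_discount = pay_as_you_go_price * spot_discount_limit

# SpotPriceLimit mode: bid an absolute maximum price directly.
spot_price_limit = 0.011

print(max_bid_from_discount, spot_price_limit)
```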
Use the command line
Step 1: Download the DLC client and perform user authentication
Download the DLC client for your operating system and verify your credentials. For more information, see Before you begin.
Step 2: Submit the job
Log on to the PAI console.
On the Workspaces page, obtain your workspace ID.
Obtain the resource quota ID.
Create a parameter file named tfjob.params and copy the following content into the file. For information about commands that are used to submit jobs, see Commands used to submit jobs.
name=test_cli_tfjob_001
workers=1
worker_cpu=4
worker_gpu=0
worker_memory=4Gi
worker_shared_memory=4Gi
worker_image=registry-vpc.cn-beijing.aliyuncs.com/pai-dlc/tensorflow-training:1.12.2PAI-cpu-py27-ubuntu16.04
command=echo good && sleep 120
resource_id=<Your resource quota ID>
workspace_id=<Your workspace ID>
Run the following command to specify the params_file parameter and submit the DLC job to the specified workspace and resource quota.
./dlc submit tfjob --job_file ./tfjob.params
Run the following command to query the DLC job that you submitted.
./dlc get job <jobID>