【DSW Gallery】How to use the command line tool to submit DLC tasks

Overview
PAI-DLC (Deep Learning Containers) is a deep learning training platform built on Alibaba Cloud Container Service for Kubernetes (ACK). It provides a flexible, stable, easy-to-use, and high-performance environment for deep learning training.
In addition to submitting tasks through the PAI console, the PAI platform provides a complete SDK and OpenAPI for code-based task submission, making PAI-DLC more flexible for daily production use.
This article introduces how to submit training tasks with the CLI provided by PAI-DLC.
If you need to submit tasks to a PAI-DLC public or dedicated resource group programmatically, see the DLC command-line tool user manual.
Prerequisites
• Activate PAI-DLC and complete authorization. For details, see Cloud Product Dependency and Authorization: DLC.
• If you want to submit a prepaid training task, prepare the resource group cluster in advance. This article uses a DLC dedicated resource group for demonstration.
• You have obtained the AccessKey ID and AccessKey Secret of your Alibaba Cloud account. For details, see Obtaining an AccessKey.
Step 1: Install the DLC CLI
• DLC command-line tool, macOS amd64 version: download link
• DLC command-line tool, macOS arm version: download link
• DLC command-line tool, Linux version: download link
In actual use, place the DLC command-line tool in /usr/local/bin, grant it execute permission with chmod, and point a variable at it with the following commands:
chmod +x /usr/local/bin/dlc
export dlc=/usr/local/bin/dlc
Prepare an AI workspace (required)
A workspace is the top-level concept in PAI. It provides unified computing-resource management and member permission management for teams, with the goal of giving AI developers full-process development tools and AI asset management capabilities that support team collaboration.
When the PAI platform is activated, a default workspace is automatically created for the user.
Prepare an image (required)
To run training tasks, you need to explicitly specify the image used by the compute nodes. PAI-DLC supports several types of images:
• Community images: standard images provided by the community. For details about the different images, see Community Image Version Details.
• PAI platform images: official images provided by Alibaba Cloud PAI, supporting different resource types, Python versions, and the deep learning frameworks TensorFlow and PyTorch. For the full list, see the list of public images.
• Custom images: you can use a custom image that you have added to PAI. Before selecting one, add the custom image to PAI. For details, see Viewing and Adding an Image.
• Image address: you can use your own image by its address. After selecting this option, enter the Docker registry image URL, which must be accessible from the public network, in the configuration box.
Prepare a dataset (optional)
High-quality datasets are the foundation of high-accuracy models and the core goal of data preparation. The PAI platform provides a dataset management module that supports registering various types of data (local data, data stored on Alibaba Cloud, etc.) as datasets, and also supports scanning OSS folders to generate index datasets in preparation for model training.
Prepare code sources (optional)
When submitting model training tasks and similar operations, you usually need to provide your own training code. PAI provides a code source configuration feature so that you can register the code repository you need as an AI asset, making it easy to reference in multiple tasks.
Step 2: Submit the training task
With the resources and configuration required for training prepared above, the next step is to submit the task. For a full description of the API, see the API reference.
Configure the command-line tool
Before using the command-line tool, configure it with your AccessKey, region, and the endpoint of the DLC service in the corresponding region:
# region-id is, for example, cn-beijing or cn-hangzhou
dlc config --access_id {ak-id} \
    --access_key {ak-secret} \
    --endpoint 'pai-dlc.{region-id}.aliyuncs.com' \
    --region cn-hangzhou
Create a prepaid job
The dedicated resource group (also known as the prepaid resource group) is a new feature of PAI and is currently in the testing stage; if you need it, contact the PAI team. Resources in a prepaid resource group are exclusive to the user, who can submit DLC tasks or DSW instances to it and specify the resource amount for each task or instance. When submitted tasks exceed the total resources in the group, they are automatically queued. Users can change the priority of queued tasks to affect the order in which they are executed.
Note that jobs submitted to the resource group follow the All-or-Nothing (also called Gang Scheduling) principle: a job is scheduled for execution only when all of its workers can be guaranteed resources. This avoids starting only some of a job's workers in the cluster and wasting resources.
The following sample command submits a DLC prepaid training task to workspace 46099 and schedules it to the prepaid resource group rgo80s1uv05bplin.


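The original sample command is not reproduced in this copy of the article, so the sketch below is a hypothetical reconstruction using the dlc submit subcommand. The flag names and values are illustrative assumptions, not the authoritative syntax; run `dlc submit tfjob -h` to check the real parameter list.

```shell
# Hypothetical sketch: submit a single-worker TFJob to workspace 46099
# on the prepaid resource group rgo80s1uv05bplin.
# Flag names are illustrative; verify them with `dlc submit tfjob -h`.
dlc submit tfjob \
    --name=prepaid-demo-job \
    --workspace_id=46099 \
    --resource_id=rgo80s1uv05bplin \
    --workers=1 \
    --worker_image=<your-training-image-url> \
    --command='python train.py'
```

On success, the CLI returns the job_id of the newly created task, which the log and status commands below take as input.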
View task status and get logs:
To view a task's logs, specify the task's job_id and pod_id. Taking the task submitted above as an example:
dlc logs dlcylztabt7alg7p dlcylztabt7alg7p-worker-0 --max_events_num=20
Create a post-paid multi-machine distributed training job
Post-paid (pay-as-you-go) tasks are the first business model supported by DLC. Users pull up the required resources on demand, and all resources are released when the task ends.
This article takes a TensorFlow multi-machine distributed task as an example to show how to create one with the DLC command-line tool.

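The distributed sample command is also missing from this copy, so here is a hedged sketch of what such a submission could look like. The PS/worker counts, image, and command are placeholders, and the flag names are illustrative assumptions; consult `dlc submit tfjob -h` for the real list.

```shell
# Hypothetical sketch: a TensorFlow distributed job with 1 PS and 2 workers.
# Flag names and values are illustrative placeholders.
dlc submit tfjob \
    --name=tf-distributed-demo \
    --workspace_id=46099 \
    --workers=2 \
    --ps=1 \
    --worker_image=<your-training-image-url> \
    --command='python train.py' \
    --success_policy=AllWorkers
```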
For TFJob, users can now customize the success policy:
• --success_policy=ChiefWorker: the entire task is considered successful, and ends, as soon as the Chief pod of the TF distributed multi-machine task succeeds.
• --success_policy=AllWorkers (default): the entire task is considered successful only after all Workers of the TF distributed multi-machine task succeed.
View the task YAML
You can use the get job command
./dlc get job dlc1fifr7w6ycnp
Stop a task
You can use the stop job command
./dlc stop job dlc1fif7r7w6ycnq
Step 3: View training tasks
After submitting the PAI-DLC task above, you can check its status in real time with the following command. The command-line tool supports querying tasks along multiple dimensions; run dlc get job -h to see the detailed parameter list. The example below queries tasks in the 'Running' state in the prepaid resource group used for the task submitted earlier.
dlc get job --resource_id='rgo80s1uv05bplin' --status='Running'
