Platform For AI: Use EPL to accelerate AI model training

Last Updated: Jan 19, 2024

Easy Parallel Library (EPL) is an efficient and easy-to-use framework for distributed model training. EPL applies multiple training optimization technologies and provides easy-to-use API operations that allow you to apply parallelism strategies. You can use EPL to reduce costs and improve the efficiency of distributed model training. This topic describes how to use EPL to accelerate TensorFlow distributed model training in Deep Learning Containers (DLC).

Prerequisites

Before you perform the operations described in this topic, make sure that the following requirements are met:

  • The required service-linked role is created for DLC. For more information, see Grant the permissions that are required to use DLC.

  • The official image or one of the following community images is prepared: NVIDIA TensorFlow 1.15 or TensorFlow-GPU 1.15.

    • If you use the official image, you can use EPL without the need to install it. For more information about official images, see Alibaba Cloud image.

    • If you use an open source image, you must first install EPL. For more information about community images, see Community image. For more information about how to install EPL, see Install EPL.

    Note

    If you use DLC, we recommend that you select the community image tensorflow-training:1.15-gpu-py36-cu100-ubuntu18.04. You can run commands to install EPL in DLC.

Step 1: Configure a code build

You can use EPL to write code for TensorFlow-based distributed model training. For more information, see Quick Start.
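
The following sketch shows the basic way in which EPL annotations can be added to a TensorFlow 1.15 training script to enable data parallelism, based on the EPL Quick Start. The model and input pipeline in the sketch are toy placeholders for illustration only; the ResNet-50 example that is used later in this topic already contains the required EPL annotations.

  import epl
  import tensorflow as tf

  # Initialize EPL before the TensorFlow graph is built.
  epl.init()
  # replicate(device_count=1) places one model replica on each GPU, so EPL
  # runs the job as data parallelism across all GPUs allocated to the job.
  epl.set_default_strategy(epl.replicate(device_count=1))

  # Toy model and inputs that stand in for the real ResNet-50 code.
  images = tf.random_normal([32, 224, 224, 3])
  labels = tf.random_uniform([32], maxval=1000, dtype=tf.int64)
  logits = tf.layers.dense(tf.layers.flatten(images), 1000)
  loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
  global_step = tf.train.get_or_create_global_step()
  train_op = tf.train.MomentumOptimizer(0.01, 0.9).minimize(loss, global_step=global_step)

  with tf.train.MonitoredTrainingSession() as sess:
      for _ in range(10):
          _, loss_value = sess.run([train_op, loss])
          print("loss:", loss_value)

In this topic, you do not need to write this code yourself, because the EPL repository already provides an annotated ResNet-50 example in the examples/resnet directory.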

You can also use the sample code provided by EPL to start TensorFlow distributed model training. In this example, the ResNet-50 training example provided by EPL is used to create a code build. You can use the code build to submit a TensorFlow training job. Each time the training job runs, the latest version of the code is automatically cloned. To configure a code build, perform the following steps.

  1. Go to the Code Builds page.

    1. Log on to the PAI console.

    2. In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace that you want to manage.

    3. In the left-side navigation pane, choose AI Computing Asset Management > Source Code Repositories to go to the Code Builds page.

  2. On the Code Builds page, click Create Code Build.

  3. In the Create Code Build panel, configure the parameters and click Submit.

    Set the Repository parameter to https://github.com/alibaba/EasyParallelLibrary.git and the Code Branch parameter to main. For more information about other parameters, see Code builds.

Step 2: Start a training job

  1. Go to the Create Job page.

    1. Log on to the PAI console.

    2. In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace that you want to manage.

    3. In the left-side navigation pane of the Workspace page, choose Model Development and Training > Deep Learning Containers (DLC). Click Create Job on the Distributed Training Jobs page. The Create Job page appears.

  2. On the Create Job page, configure the parameters in the Basic Information and Resource Configuration sections, and then click Submit. For more information about other parameters, see Submit training jobs.

    • Configure the following parameters in the Basic Information section:

      • Resource Quota: Select the public resource group.

      • Job Name: Specify a name for the training job.

      • Node Image: Click Community Image and select tensorflow-training:1.15-gpu-py36-cu100-ubuntu18.04 from the image list.

      • Framework: Select TensorFlow.

      • Code Builds: Click Online Configuration and select the code build that you configured in Step 1 from the drop-down list.

      • Code Branch: Set the value to main.

      • Job Command: Enter the following commands:

        # Install the NCCL libraries that EPL requires. The -y option skips the interactive confirmation prompt.
        apt update
        apt install -y libnccl2 libnccl-dev
        # Install EPL from the code build that is cloned to /root/code.
        cd /root/code/EasyParallelLibrary/
        pip install .
        # Start the ResNet-50 data parallelism training example.
        cd examples/resnet
        bash scripts/train_dp.sh
    • Configure the following parameters in the Resource Configuration section:

      • Number of Nodes: Set the value to 2. You can change the value based on the requirements of the training job.

      • Node Configuration: On the GPU Instance tab, select ecs.gn6v-c8g1.2xlarge.

      • Maximum Duration: Set the value to 2 hours.

  3. On the Distributed Training Jobs page, click the name of the job that you want to manage and go to the job details page. View the running status of the job. For more information, see View training jobs.
