Easy Parallel Library (EPL) is an efficient and easy-to-use framework for distributed model training. EPL incorporates multiple training optimization technologies and provides user-friendly APIs that you can use to implement parallelism strategies. You can use EPL to reduce the costs and improve the efficiency of distributed model training. This topic describes how to use EPL to accelerate distributed TensorFlow model training in Deep Learning Containers (DLC).
Preparations
Before you perform the operations described in this topic, ensure that the following requirements are met:
You have authorized the service-linked role for DLC. For more information, see Cloud service dependencies and authorization: DLC.
An official image or one of the following community images is available: NVIDIA TensorFlow 1.15 or TensorFlow-GPU 1.15.
If you use an official image (an optimized image provided by the PAI team), you can use EPL directly without installing it.
If you use a community image (a standard image provided by the community), you must install EPL before you can use it. For more information about how to install EPL, see Install EPL.
Note: If you use the DLC platform, we recommend that you select the community image tensorflow-training:1.15-gpu-py36-cu100-ubuntu18.04. You can run commands to install EPL in DLC, as shown in the sketch below.
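The following commands are a minimal installation sketch. They mirror the startup command used in Step 2 of this topic; the clone path is an example, and the -y flag is added so that apt runs non-interactively.
# Install the NCCL libraries that EPL depends on for multi-GPU communication.
apt update
apt install -y libnccl2 libnccl-dev
# Clone the EPL repository and install the Python package.
git clone https://github.com/alibaba/EasyParallelLibrary.git
cd EasyParallelLibrary
pip install .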
Step 1: Configure a code build
You can use EPL to write TensorFlow distributed training code. For more information, see QuickStart.
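For reference, the following is a minimal sketch of EPL data parallelism based on the QuickStart. The epl.init and epl.set_default_strategy calls follow the EPL documentation; the model below is a simplified placeholder, not the ResNet50 example used later in this topic.
import tensorflow as tf
import epl

# Initialize EPL and make data parallelism the default strategy.
# device_count=1 means that each model replica occupies one GPU; EPL
# replicates the model across all GPUs and aggregates the gradients.
epl.init()
epl.set_default_strategy(epl.replicate(device_count=1))

# Build the model with standard TensorFlow 1.15 APIs. No other code
# changes are required for data parallelism.
images = tf.placeholder(tf.float32, shape=[None, 224, 224, 3])
labels = tf.placeholder(tf.int64, shape=[None])
logits = tf.layers.dense(tf.layers.flatten(images), units=1000)
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))
train_op = tf.train.MomentumOptimizer(0.01, 0.9).minimize(loss)
With this strategy, you launch the same script on each worker and EPL handles gradient synchronization across the replicas.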
You can also use the sample code provided by EPL to start TensorFlow distributed model training. In this example, the EPL ResNet50 sample code is used to create a code build that you can use to submit a TensorFlow training job. Each time a training job runs, the latest code is automatically cloned. To configure a code build, perform the following steps.
Go to the code builds page.
Log on to the PAI console.
In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace that you want to manage.
In the left-side navigation pane of the workspace, go to the code builds page.
On the Code Configuration page, click Create Code Build.
On the Create Code Build page, configure the parameters and click Submit.
Set Git Repository Address to https://github.com/alibaba/EasyParallelLibrary.git and Code Branch to main. For more information about other parameters, see Code configuration.
Step 2: Start a training job
Go to the Create Job page.
Log on to the PAI console, select a region in the top navigation bar, select a workspace in the right section, and then click Enter Deep Learning Containers (DLC).
On the Distributed Training (DLC) page, click Create Job.
On the Create Job page, configure the following key parameters. For more information about other parameters, see Create a training job. Then, click OK.
In the Basic Information section, customize the job name.
In the Environment Information section, configure the following parameters.
Node Image: Select Official Image > tensorflow-training:1.15-gpu-py36-cu100-ubuntu18.04.
Startup Command: Enter the following commands:
apt update
apt install libnccl2 libnccl-dev
cd /root/code/EasyParallelLibrary/
pip install .
cd examples/resnet
bash scripts/train_dp.sh
Code Configuration: From the Online Configuration drop-down list, select the code build that you configured in Step 1 and set Branch to main.
In the Resource Information section, configure the following parameters.
Resource Source: Select Public Resources.
Framework: Select TensorFlow.
Job Resources: Configure the following parameters for worker nodes:
Number Of Nodes: Set this parameter to 2. You can change the value based on the requirements of the training job.
Resource Specification: Select the GPU specification ecs.gn6v-c8g1.2xlarge.
Maximum Running Time: Set this parameter to 2 hours.
On the Distributed Training (DLC) page, click the name of the job to go to the job details page, where you can view the running status of the job. For more information, see View training details.
References
For more information about EPL, see EPL.
For more information about DLC, see Distributed Training (DLC).