Use EPL in DLC to accelerate training

EPL (Easy Parallel Library) is a distributed model training framework that combines multiple training optimization techniques and provides simple APIs for various parallelization strategies. Use EPL in DLC for cost-effective, high-performance distributed TensorFlow training.

Prerequisites

Ensure the following:

Authorize a service-linked role for the DLC service. For details, see Cloud product dependencies and authorization: DLC.
Set up your image environment with an official image or a community image (NVIDIA TensorFlow 1.15 or TensorFlow-GPU 1.15):
- If you use an official image, EPL is pre-installed.
- If you use a community image, install EPL first. For details, see Install EPL.
Note
For DLC, use the community image tensorflow-training:1.15-gpu-py36-cu100-ubuntu18.04. EPL can be installed through a DLC execution command.

Step 1: Configure a code build

Write distributed TensorFlow training code with EPL. The Quick Start guide provides examples.
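
As an illustration of what such training code can look like, the following is a minimal data-parallel sketch modeled on the pattern in the EPL quick start: call `epl.init()`, then build the model inside `epl.replicate(device_count=1)`. The toy linear model, the random input data, and the function name `train_data_parallel` are assumptions made for this example and are not part of the ResNet-50 sample. The imports are deferred into the function because TensorFlow 1.15 and EPL are typically available only inside the training image.

```python
def train_data_parallel(num_steps=10):
    """Train a toy linear model under EPL data parallelism (illustrative sketch)."""
    # Deferred imports: these packages are expected inside the training image.
    import tensorflow as tf  # TensorFlow 1.15
    import epl

    epl.init()  # initialize EPL before building the graph

    # Everything built in this scope is replicated, one replica per GPU.
    with epl.replicate(device_count=1):
        features = tf.random.normal([32, 10])  # hypothetical input batch
        labels = tf.random.normal([32, 1])     # hypothetical targets
        predictions = tf.layers.dense(features, 1)
        loss = tf.losses.mean_squared_error(labels, predictions)

    global_step = tf.train.get_or_create_global_step()
    optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)
    train_op = optimizer.minimize(loss, global_step=global_step)

    last_loss = None
    with tf.train.MonitoredTrainingSession() as sess:
        for _ in range(num_steps):
            last_loss, _ = sess.run([loss, train_op])
    return last_loss
```

Calling `train_data_parallel()` from the script that the start command launches would run ten training steps and return the final loss value.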

Alternatively, use the ResNet-50 example through a code build. When you submit a training job, the latest code is automatically cloned. Configure the code build as follows.

Go to the Code Configuration page.
1. Log on to the PAI console.
2. In the left navigation pane, click Workspaces. On the Workspace List page, click the name of the workspace that you want to manage.
3. In the left navigation pane of the workspace page, choose AI Computing Asset Management > Source Code Repositories.
On the Source Code Repositories page, click Create Code Build.
On the Create Code Build page, configure the parameters and click Submit.

Set Git Repository Address to https://github.com/alibaba/EasyParallelLibrary.git and Code Branch to main. Other parameters are described in Configure a code build.

Step 2: Create a training job

Go to the Create Job page.
1. Log on to the PAI console. In the top navigation bar, select the target region and workspace, and then click Go to DLC.
2. On the Distributed Training (DLC) page, click Create Task.

On the Create Task page, configure the following key parameters and click OK. Other parameters are described in Create a training job.

In the Basic Information section, enter a job name.

In the Environment Information section, configure the following parameters.

Parameter	Example value
Node Image	Select Alibaba Cloud Image > tensorflow-training:1.15-gpu-py36-cu100-ubuntu18.04.
Start Command	`apt update; apt install libnccl2 libnccl-dev; cd /root/code/EasyParallelLibrary/; pip install .; cd examples/resnet; bash scripts/train_dp.sh`
Source Code Repositories	From the Online Configuration drop-down list, select the `code build` that you created in Step 1 and set Branch to main.
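
The Start Command example runs six shell steps in order: refresh the package index, install the NCCL libraries, install EPL from the cloned repository, and launch the ResNet example script. The sketch below lists those steps and chains them with `&&` so that a failed step stops the ones after it. The `&&` chaining and the helper name `build_start_command` are assumptions of this sketch, not something the console requires.

```python
# Steps taken from the Start Command example, in execution order.
START_STEPS = [
    "apt update",                          # refresh the package index
    "apt install libnccl2 libnccl-dev",    # NCCL runtime and development packages
    "cd /root/code/EasyParallelLibrary/",  # repository path used in the example
    "pip install .",                       # install EPL from source
    "cd examples/resnet",                  # ResNet example directory
    "bash scripts/train_dp.sh",            # the example's training script
]


def build_start_command(steps):
    """Join the steps with '&&' so that a failing step aborts the rest."""
    return " && ".join(steps)


print(build_start_command(START_STEPS))
```

The printed single-line command can be pasted into the Start Command field in place of the multi-step form.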

In the Resource Information section, configure the following parameters.

Parameter	Example value
Resource Source	Select Public Resources.
Framework	Select TensorFlow.
Task Resources	For worker nodes, configure the following parameters: Nodes: Set to 2. Adjust based on your training requirements. Instance Type: Select the GPU specification ecs.gn6v-c8g1.2xlarge.
Maximum Running Time	Set this to 2 hours.
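
To see what the node count means for data-parallel training, the following arithmetic sketch computes the number of model replicas and the resulting global batch size. It assumes one model replica per GPU and one GPU per ecs.gn6v-c8g1.2xlarge instance; the per-replica batch size of 32 is a hypothetical value, so check these figures against your instance specification and training script.

```python
NODES = 2               # worker nodes, from the example configuration
GPUS_PER_NODE = 1       # assumption: one GPU per ecs.gn6v-c8g1.2xlarge instance
PER_REPLICA_BATCH = 32  # hypothetical batch size processed by each replica


def total_replicas(nodes, gpus_per_node):
    """Number of model replicas when each GPU holds one replica."""
    return nodes * gpus_per_node


def global_batch_size(per_replica_batch, replicas):
    """Samples consumed per training step across all replicas."""
    return per_replica_batch * replicas


replicas = total_replicas(NODES, GPUS_PER_NODE)
print(replicas, global_batch_size(PER_REPLICA_BATCH, replicas))
```

Under these assumptions, doubling Nodes doubles the global batch size, which may call for adjusting the learning rate.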

In the job list, click the job name to view its details and monitor its status. For details, see View training job details.
