EPL (Easy Parallel Library) is a distributed model training framework that combines multiple training optimization techniques and provides simple APIs for various parallelization strategies. Use EPL in DLC for cost-effective, high-performance distributed TensorFlow training.
Prerequisites
Ensure the following:
-
Authorize a service-linked role for the DLC service. For more information, see Cloud product dependencies and authorization: DLC.
-
Set up your image environment with an official image or a community image (NVIDIA TensorFlow 1.15 or TensorFlow-GPU 1.15):
-
If you use an official image, EPL is pre-installed.
-
If you use a community image, install EPL first. For more information, see Install EPL.
Note: For DLC, use the community image tensorflow-training:1.15-gpu-py36-cu100-ubuntu18.04. You can install EPL through a DLC execution command.
Step 1: Configure a code build
Write distributed TensorFlow training code with EPL. The Quick Start guide provides examples.
Alternatively, use the ResNet-50 example through a code build. When you submit a training job, the latest code is automatically cloned. Configure the code build as follows.
-
Go to the Code Configuration page.
-
Log on to the PAI console.
-
In the left navigation pane, click Workspaces. On the Workspace List page, click the name of the workspace that you want to manage.
-
In the left navigation pane of the workspace page, open the Source Code Repositories page.
-
On the Source Code Repositories page, click Create Code Build.
-
On the Create Code Build page, configure the parameters and click Submit.
Set Git Repository Address to https://github.com/alibaba/EasyParallelLibrary.git and Code Branch to main. For other parameters, see Configure a code build.
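The start command in Step 2 changes into /root/code/EasyParallelLibrary/, which suggests that the repository is cloned into a directory named after the repository under /root/code. The following Python sketch records the two code build settings from this step and derives that directory. The dictionary keys, the helper name clone_dir, and the /root/code mount root are assumptions inferred from this page, not a documented DLC API.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Values from this step; the key names are illustrative only.
CODE_BUILD = {
    "git_repository_address": "https://github.com/alibaba/EasyParallelLibrary.git",
    "code_branch": "main",
}


def clone_dir(repo_url: str, mount_root: str = "/root/code") -> str:
    """Return the directory a repository is expected to land in.

    Assumes the clone is named after the repository (without ".git")
    and placed under mount_root, as the Step 2 start command implies.
    """
    name = PurePosixPath(urlparse(repo_url).path).name
    if name.endswith(".git"):
        name = name[: -len(".git")]
    return str(PurePosixPath(mount_root) / name)


print(clone_dir(CODE_BUILD["git_repository_address"]))
```

If the printed path differs from the directory used in your start command, adjust the cd step in the start command rather than the code build.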
Step 2: Create a training job
-
Go to the Create Job page.
-
Log on to the PAI console. In the top navigation bar, select the target region and workspace, and then click Go to DLC.
-
On the Distributed Training (DLC) page, click Create Task.
-
On the Create Task page, configure the following key parameters and click OK. Other parameters are described in Create a training job.
-
In the Basic Information section, enter a job name.
-
In the Environment Information section, configure the following parameters.
Node Image: Select Alibaba Cloud Image > tensorflow-training:1.15-gpu-py36-cu100-ubuntu18.04.
Start Command:
apt update
apt install libnccl2 libnccl-dev
cd /root/code/EasyParallelLibrary/
pip install .
cd examples/resnet
bash scripts/train_dp.sh
Source Code Repositories: From the Online Configuration drop-down list, select the code build that you created in Step 1 and set Branch to main.
-
In the Resource Information section, configure the following parameters.
Resource Source: Select Public Resources.
Framework: Select TensorFlow.
Task Resources: For worker nodes, configure the following parameters:
-
Nodes: Set to 2. Adjust based on your training requirements.
-
Instance Type: Select the GPU specification ecs.gn6v-c8g1.2xlarge.
Maximum Running Time: Set this to 2 hours.
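The example Start Command in the Environment Information section runs six shell commands one after another. Depending on how the command is executed, a failed step (for example, a failed pip install) may not stop the steps that follow it. One option, sketched here in Python, is to join the steps with && so that the command stops at the first failure. The names STEPS and fail_fast are illustrative. The commands themselves are copied from the example value.

```python
# Steps copied from the example Start Command, in their original order.
STEPS = [
    "apt update",
    "apt install libnccl2 libnccl-dev",
    "cd /root/code/EasyParallelLibrary/",
    "pip install .",
    "cd examples/resnet",
    "bash scripts/train_dp.sh",
]


def fail_fast(steps):
    """Join shell steps with && so execution stops at the first failing step."""
    cleaned = [step.strip() for step in steps if step.strip()]
    return " && ".join(cleaned)


start_command = fail_fast(STEPS)
print(start_command)
```

The printed single-line command can be pasted into the Start Command field in place of the multi-line version.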
-
Configure the Task resource configuration parameters as follows.
Nodes: Set to 2. Adjust based on your training needs.
Node configuration: On the GPU instance tab, select ecs.gn6v-c8g1.2xlarge.
Maximum running time: 2 hours.
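As a rough check of what the example values request, the following sketch multiplies out the worker configuration. The GPU count per instance is an assumption (ecs.gn6v-c8g1.2xlarge is taken here to provide one GPU), so confirm it against the instance specification. The class and field names are hypothetical and are not part of DLC.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerResources:
    nodes: int
    instance_type: str
    gpus_per_node: int        # assumption, not stated on this page
    max_running_hours: float

    @property
    def total_gpus(self) -> int:
        # Total GPUs across all worker nodes of the job.
        return self.nodes * self.gpus_per_node

    @property
    def max_gpu_hours(self) -> float:
        # Upper bound on GPU time if the job runs until the time limit.
        return self.total_gpus * self.max_running_hours


job = WorkerResources(
    nodes=2,
    instance_type="ecs.gn6v-c8g1.2xlarge",
    gpus_per_node=1,
    max_running_hours=2,
)
print(job.total_gpus, job.max_gpu_hours)
```

Changing nodes or max_running_hours shows how the upper bound on GPU time scales before you submit the job.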
-
In the job list, click your job name to view its details and monitor its status. For more information, see View training job details.