Using EPL for training acceleration in DLC - Platform For AI

Distributed deep learning training across multiple nodes requires complex parallelism orchestration, inter-node communication, and synchronization logic. Easy Parallel Library (EPL) simplifies this by wrapping your existing TensorFlow code with annotations that handle data parallelism, tensor model parallelism, and pipeline parallelism automatically. Use EPL in Deep Learning Containers (DLC) on Platform for AI (PAI) to scale model training with minimal code changes.

How EPL works

EPL provides a unified interface for multiple parallelism strategies. Instead of rewriting your training scripts for distributed execution, add EPL annotations to your existing TensorFlow code. EPL then manages communication and synchronization across nodes.

For API details and parallelism strategy options, see EPL documentation.
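
As a rough illustration of what "adding annotations" can look like, the sketch below wraps a model-building function in a data-parallel annotation. It assumes the `epl.init()` and `epl.replicate(device_count=1)` calls as shown in the EPL README; confirm them against the EPL documentation for your installed version. The helper name `build_with_data_parallelism` and the import fallback are hypothetical and exist only so the snippet also runs on a machine without EPL.

```python
try:
    import epl  # assumption: importable inside the container once EPL is installed
except ImportError:
    epl = None  # fallback so the sketch can be smoke-tested without EPL


def build_with_data_parallelism(build_model):
    """Call build_model() under EPL's replicate annotation when EPL is present.

    build_model is a hypothetical zero-argument function that constructs
    the TensorFlow model and returns whatever the training loop needs.
    """
    if epl is None:
        # No EPL available: build the model as a plain single-process run.
        return build_model()
    epl.init()
    # Assumption: replicate(device_count=1) marks the enclosed model
    # definition for data parallelism, one replica per device.
    with epl.replicate(device_count=1):
        return build_model()
```

The model code itself stays unchanged; only the scope in which it is built differs.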

Prerequisites

Before you begin, make sure that you have:

Authorized the service-linked role for DLC. For details, see Cloud service dependencies and authorization: DLC.
An image running NVIDIA TensorFlow 1.15 or TensorFlow-GPU 1.15
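
To confirm the TensorFlow requirement from inside a container, a small version check such as the following can help. This is a minimal sketch: the helper name `is_supported_tf` is hypothetical, and it only compares the major and minor version, ignoring any patch number or vendor suffix.

```python
import re


def is_supported_tf(version):
    """Return True when the version string belongs to the TensorFlow 1.15 series."""
    match = re.match(r"^(\d+)\.(\d+)", version)
    if match is None:
        return False
    return (int(match.group(1)), int(match.group(2))) == (1, 15)
```

Inside the image, you would pass it the installed version, for example `is_supported_tf(tf.__version__)` after `import tensorflow as tf`.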

Image selection

EPL availability depends on your image type:

Image type	EPL installation	Details
Official image (PAI-optimized)	Pre-installed, ready to use	See Official images
Community image (standard)	Manual installation required	See Community images

Note

For DLC, use the community image tensorflow-training:1.15-gpu-py36-cu100-ubuntu18.04. Install EPL by running the commands in the startup script shown in Create a training job. For other environments, see Install EPL.

Set up a code build

A code build links a Git repository to DLC so that each training job automatically clones the latest code. This example uses the EPL repository, which includes a ResNet50 sample.

Log on to the PAI console.
In the left-side navigation pane, click Workspaces. Click the name of your workspace.
In the left-side navigation pane, choose AI Asset Management > Source Code Repositories.
On the Code Configuration page, click Create Code Build.
Configure the following parameters and click Submit. For other parameters, see Code configuration.
Parameter	Value
Git Repository Address	`https://github.com/alibaba/EasyParallelLibrary.git`
Code Branch	`main`

Create a training job

Log on to the PAI console, select a region, select a workspace, and click Enter Deep Learning Containers (DLC).
On the Distributed Training (DLC) page, click Create Job.
In the Basic Information section, enter a job name.

In the Environment Information section, configure the following parameters. The startup script listed after the table installs the NCCL dependencies, installs EPL from the cloned source, and launches a data-parallel ResNet50 training run.

Parameter	Value
Node Image	Select Official Image > tensorflow-training:1.15-gpu-py36-cu100-ubuntu18.04
Code Configuration	From the Online Configuration drop-down list, select the code build created in Set up a code build. Set Branch to main.
Startup Command	See the startup script below.

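   # Install the NCCL dependencies without prompting for confirmation.
   apt update
   apt install -y libnccl2 libnccl-dev
   # Install EPL from the repository cloned by the code build.
   cd /root/code/EasyParallelLibrary/
   pip install .
   # Launch the data-parallel ResNet50 example.
   cd examples/resnet
   bash scripts/train_dp.sh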

In the Resource Information section, configure the following parameters.

Parameter	Value
Resource Source	Select Public Resources
Framework	Select TensorFlow

In the Job Resource Configuration section, configure the following parameters.

Parameter	Value
Number Of Nodes	`2` (adjust based on your training requirements)
Node Configuration	On the GPU Instance tab, select ecs.gn6v-c8g1.2xlarge
Maximum Running Time	`2` (hours)

Click OK to submit the job.
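
In a data-parallel run, the node count determines how many replicas share each training step. The following is a conceptual sketch in plain Python, not EPL's implementation: it splits one batch across replicas, computes a gradient on each shard, and averages the results the way a mean all-reduce would. The one-dimensional model `y = w * x`, the sample data, and all function names are illustrative assumptions.

```python
def mse_gradient(w, xs, ys):
    """Gradient of mean squared error with respect to w for the model y = w * x."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def shard(items, num_replicas):
    """Split a batch into equal contiguous shards, one per replica."""
    if len(items) % num_replicas != 0:
        raise ValueError("batch size must be divisible by the replica count")
    size = len(items) // num_replicas
    return [items[i * size:(i + 1) * size] for i in range(num_replicas)]


def data_parallel_gradient(w, xs, ys, num_replicas):
    """Average the per-replica gradients, as a mean all-reduce would."""
    x_shards = shard(xs, num_replicas)
    y_shards = shard(ys, num_replicas)
    local = [mse_gradient(w, xi, yi) for xi, yi in zip(x_shards, y_shards)]
    return sum(local) / num_replicas


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
full = mse_gradient(0.5, xs, ys)
parallel = data_parallel_gradient(0.5, xs, ys, num_replicas=2)
```

With equal-sized shards, the averaged gradient matches the gradient computed on the whole batch, so adding nodes spreads the work of each step across replicas without changing the update that the step produces.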

Verify the training job

After you submit the job, verify that it runs successfully:

On the Distributed Training Jobs page, click the job name to open the job details page.
Check the job status and wait for it to show Succeeded.
Review the training logs to confirm that the model is training across both nodes.

For more information about job monitoring, see View training details.
