Enable multi-node distributed training by adding EPL annotations to TensorFlow code.
How it works
EPL (Easy Parallel Library) provides a unified interface for multiple parallelism strategies. Instead of rewriting your training scripts, you add EPL annotations to existing TensorFlow code, and EPL manages the communication and synchronization between workers.
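As an illustration of the annotation style, the following is a minimal sketch of data parallelism based on EPL's own example; the exact API and the other strategies are described in the EPL documentation:

```python
# Sketch of EPL data parallelism. Assumes the epl package and
# TensorFlow 1.15 are installed (as in the image used below).
import epl
import tensorflow as tf

# Initialize EPL before building the TensorFlow graph.
epl.init()

# Replicate the full model on each GPU (data parallelism);
# EPL inserts the gradient aggregation and synchronization.
epl.set_default_strategy(epl.replicate(device_count=1))

# Existing TensorFlow 1.x model-definition and training code
# follows unchanged.
```

Switching to another parallelism strategy changes only these annotations, not the model code itself.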
For API details and parallelism strategy options, see EPL documentation.
Prerequisites
Requirements:
- Service-linked role authorization for DLC. See Cloud service dependencies and authorization: DLC.
- An image running NVIDIA TensorFlow 1.15 or TensorFlow-GPU 1.15.
EPL installation by image type
EPL availability varies by image type:
| Image type | Installation status | Details |
|---|---|---|
| Official image (PAI-optimized) | Pre-installed | See Official images |
| Community image (standard) | Manual installation | See Community images |
For DLC, use the community image tensorflow-training:1.15-gpu-py36-cu100-ubuntu18.04 and install EPL by running the commands in the startup script shown in Create a training job. For other environments, see EPL installation.
Configure a code repository
Configure a code build to link your Git repository to DLC. Each training job clones the latest code. This example uses the EPL repository with a ResNet50 sample.
1. Log on to the PAI console.
2. In the left-side navigation pane, click Workspaces, and then click the name of your workspace.
3. In the left-side navigation pane, choose AI Asset Management > Source Code Repositories.
4. On the Code Configuration page, click Create Code Build.
5. Configure the parameters and click Submit. For parameter details, see Code configuration.
   | Parameter | Value |
   |---|---|
   | Git Repository Address | https://github.com/alibaba/EasyParallelLibrary.git |
   | Code Branch | main |
Create a training job
1. Log on to the PAI console, select a region and workspace, and click Enter Deep Learning Containers (DLC).
2. On the Distributed Training (DLC) page, click Create Job.
3. In the Basic Information section, enter a job name.
4. In the Environment Information section, configure the parameters. The startup script installs NCCL, builds EPL from source, and launches data-parallel ResNet50 training.

   | Parameter | Value |
   |---|---|
   | Node Image | Select the community image tensorflow-training:1.15-gpu-py36-cu100-ubuntu18.04. |
   | Code Configuration | From the Online Configuration drop-down list, select the code build created in Configure a code repository. Set Branch to main. |
   | Startup Command | Use the startup script below. |

   Startup script:

   ```shell
   apt update
   apt install -y libnccl2 libnccl-dev
   cd /root/code/EasyParallelLibrary/
   pip install .
   cd examples/resnet
   bash scripts/train_dp.sh
   ```

5. In the Resource Information section, configure the parameters.

   | Parameter | Value |
   |---|---|
   | Resource Source | Select Public Resources. |
   | Framework | Select TensorFlow. |

6. In the Job Resource Configuration section, configure the parameters.

   | Parameter | Value |
   |---|---|
   | Number Of Nodes | 2 (adjust as needed) |
   | Node Configuration | On the GPU Instance tab, select ecs.gn6v-c8g1.2xlarge. |
   | Maximum Running Time | 2 hours |

7. Click OK to submit the job.
Verify job status
After you submit the job, verify that it runs:
1. On the Distributed Training Jobs page, click the name of the job.
2. Check the job status and wait for it to change to Succeeded.
3. Review the training logs to confirm that the model trains on both nodes.
For job monitoring details, see View training details.