Enable multi-node distributed training by adding EPL annotations to TensorFlow code.
How it works
EPL (Easy Parallel Library) provides a unified interface for multiple parallelism strategies. Instead of rewriting your training scripts, you add EPL annotations to existing TensorFlow code, and EPL manages the communication and synchronization between workers.
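As an illustration of the annotation style, the following is a minimal sketch of data parallelism based on EPL's own example; the exact API and the other strategies are described in the EPL documentation:

```python
# Sketch of EPL data parallelism. Assumes the epl package and
# TensorFlow 1.15 are installed (as in the image used below).
import epl
import tensorflow as tf

# Initialize EPL before building the TensorFlow graph.
epl.init()

# Replicate the full model on each GPU (data parallelism);
# EPL inserts the gradient aggregation and synchronization.
epl.set_default_strategy(epl.replicate(device_count=1))

# Existing TensorFlow 1.x model-definition and training code
# follows unchanged.
```

Switching to another parallelism strategy changes only these annotations, not the model code itself.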
For API details and parallelism strategy options, see EPL documentation.
Prerequisites
Requirements:
- Service-linked role authorization for DLC. See Cloud service dependencies and authorization: DLC.
- An image running NVIDIA TensorFlow 1.15 or TensorFlow-GPU 1.15.
EPL installation by image type
EPL availability varies by image type:
| Image type | Installation status | Details |
|---|---|---|
| Official image (PAI-optimized) | Pre-installed | See Official images |
| Community image (standard) | Manual installation | See Community images |
For DLC, use the community image tensorflow-training:1.15-gpu-py36-cu100-ubuntu18.04 and install EPL by running the commands in the startup script shown in Create a training job. For other environments, see EPL installation.
Configure a code repository
Configure a code build to link your Git repository to DLC. Each training job clones the latest code. This example uses the EPL repository with a ResNet50 sample.
1. Log on to the PAI console.
2. In the left-side navigation pane, click Workspaces, and then click the name of your workspace.
3. In the left-side navigation pane, choose AI Asset Management > Source Code Repositories.
4. On the Code Configuration page, click Create Code Build.
5. Configure the parameters and click Submit. For parameter details, see Code configuration.
   | Parameter | Value |
   |---|---|
   | Git Repository Address | https://github.com/alibaba/EasyParallelLibrary.git |
   | Code Branch | main |
Create a training job
1. Log on to the PAI console, select a region and workspace, and click Enter Deep Learning Containers (DLC).
2. On the Distributed Training (DLC) page, click Create Job.
3. In the Basic Information section, enter a job name.
4. In the Environment Information section, configure the parameters. The startup script installs NCCL, builds EPL from source, and launches data-parallel ResNet50 training.

   | Parameter | Value |
   |---|---|
   | Node Image | Select the community image tensorflow-training:1.15-gpu-py36-cu100-ubuntu18.04. |
   | Code Configuration | From the Online Configuration drop-down list, select the code build created in Configure a code repository. Set Branch to main. |
   | Startup Command | Use the startup script below. |

   Startup script:

   ```shell
   apt update
   apt install -y libnccl2 libnccl-dev
   cd /root/code/EasyParallelLibrary/
   pip install .
   cd examples/resnet
   bash scripts/train_dp.sh
   ```

5. In the Resource Information section, configure the parameters.

   | Parameter | Value |
   |---|---|
   | Resource Source | Select Public Resources. |
   | Framework | Select TensorFlow. |

6. In the Job Resource Configuration section, configure the parameters.

   | Parameter | Value |
   |---|---|
   | Number Of Nodes | 2 (adjust as needed) |
   | Node Configuration | On the GPU Instance tab, select ecs.gn6v-c8g1.2xlarge. |
   | Maximum Running Time | 2 hours |

7. Click OK to submit the job.
Verify job status
After you submit the job, verify that it runs:
1. On the Distributed Training Jobs page, click the name of the job.
2. Check the job status and wait for it to change to Succeeded.
3. Review the training logs to confirm that the model trains on both nodes.
For job monitoring details, see View training details.