Platform For AI: Accelerate distributed deep learning training with EPL

Last Updated: Jun 10, 2026

EPL (Easy Parallel Library) is a distributed model training framework that combines multiple training optimization techniques and provides simple APIs for various parallelization strategies. Use EPL in DLC for cost-effective, high-performance distributed TensorFlow training.

Prerequisites

Before you begin, make sure that the following requirements are met:

  • A service-linked role is authorized for the DLC service. For more information, see Cloud product dependencies and authorization: DLC.

  • An image environment is set up from an official image or a community image (NVIDIA TensorFlow 1.15 or TensorFlow-GPU 1.15).

    Note

    For DLC, use the community image tensorflow-training:1.15-gpu-py36-cu100-ubuntu18.04. You can install EPL in the start command of the DLC job, as shown in Step 2.

Step 1: Configure a code build

Write distributed TensorFlow training code with EPL. The Quick Start guide provides examples.

Alternatively, use the ResNet-50 example from the EPL repository through a code build. Each time you submit a training job, the latest code is automatically cloned from the configured branch. Configure the code build as follows.

  1. Go to the Code Configuration page.

    1. Log on to the PAI console.

    2. In the left navigation pane, click Workspaces. On the Workspace List page, click the name of the workspace that you want to manage.

    3. In the left navigation pane of the workspace page, choose AI Computing Asset Management > Source Code Repositories.

  2. On the Source Code Repositories page, click Create Code Build.

  3. On the Create Code Build page, configure the parameters and click Submit.

    Set Git Repository Address to https://github.com/alibaba/EasyParallelLibrary.git and Code Branch to main. Other parameters are described in Configure a code build.
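Before you create the code build, you may want to confirm that the repository address and branch from the step above resolve. The following is a minimal sketch, assuming git is installed on your local machine and can reach GitHub. The branch_exists helper is a hypothetical name used only for illustration and is not part of EPL or DLC.

```shell
#!/usr/bin/env bash
# Values taken from the code build configuration above.
REPO_URL="https://github.com/alibaba/EasyParallelLibrary.git"
BRANCH="main"

# Hypothetical helper: succeeds only if the remote has a branch with this name.
# git ls-remote --exit-code returns a non-zero status when no ref matches.
branch_exists() {
  git ls-remote --exit-code --heads "$1" "refs/heads/$2" >/dev/null 2>&1
}

if branch_exists "$REPO_URL" "$BRANCH"; then
  echo "Branch '$BRANCH' found in $REPO_URL"
else
  echo "Branch '$BRANCH' not found, or the repository is unreachable" >&2
fi
```

A failed check does not distinguish between a missing branch and a network problem, so treat it as a prompt to inspect the address manually.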

Step 2: Create a training job

  1. Go to the Create Job page.

    1. Log on to the PAI console. In the top navigation bar, select the target region and workspace, and then click Go to DLC.

    2. On the Distributed Training (DLC) page, click Create Task.

  2. On the Create Task page, configure the following key parameters and click OK. Other parameters are described in Create a training job.

    • In the Basic Information section, enter a job name.

    • In the Environment Information section, configure the following parameters.

      • Node Image: Select Alibaba Cloud Image > tensorflow-training:1.15-gpu-py36-cu100-ubuntu18.04.

      • Start Command:

        apt update
        apt install -y libnccl2 libnccl-dev
        cd /root/code/EasyParallelLibrary/
        pip install .
        cd examples/resnet
        bash scripts/train_dp.sh

      • Source Code Repositories: From the Online Configuration drop-down list, select the code build that you created in Step 1 and set Branch to main.

    • In the Resource Information section, configure the following parameters.

      • Resource Source: Select Public Resources.

      • Framework: Select TensorFlow.

      • Task Resources: For worker nodes, configure the following parameters:

        • Nodes: Set to 2. Adjust based on your training requirements.

        • Instance Type: Select the GPU specification ecs.gn6v-c8g1.2xlarge.

      • Maximum Running Time: Set to 2 hours.

    • Configure the task resource parameters as follows.

      • Nodes: Set to 2. Adjust based on your training requirements.

      • Node configuration: On the GPU instance tab, select ecs.gn6v-c8g1.2xlarge.

      • Maximum running time: Set to 2 hours.

  3. In the job list, click the job name to view its details and monitor its status. For more information, see View training job details.
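The start command from the Environment Information section runs unattended, so it can help to make it stop at the first failure and to preview it before submitting the job. The following is a minimal sketch, assuming the code build is mounted at /root/code/EasyParallelLibrary as in the example above. It swaps apt for apt-get, which is the more common choice in scripts. The DRY_RUN variable and the run and start_command helpers are hypothetical names introduced for illustration.

```shell
#!/usr/bin/env bash
# Stop at the first failing command so a broken install never reaches training.
set -eu

# DRY_RUN is a hypothetical switch for this sketch. It defaults to 1, which
# prints each command instead of executing it. Set DRY_RUN=0 to run for real.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it in the current shell.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Same sequence as the Start Command example, with a non-interactive install.
start_command() {
  run apt-get update
  run apt-get install -y libnccl2 libnccl-dev
  run cd /root/code/EasyParallelLibrary/
  run pip install .
  run cd examples/resnet
  run bash scripts/train_dp.sh
}

start_command
```

With the default setting, the script only lists the six commands, which makes it safe to try on a local machine. In a DLC job you would set DRY_RUN=0 so that the commands actually execute.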
