Platform For AI: EPL for accelerating distributed deep learning training

Last Updated: Dec 10, 2025

Easy Parallel Library (EPL) is an efficient and easy-to-use framework for distributed model training. EPL incorporates multiple training optimization technologies and provides user-friendly API operations that enable you to implement parallelism strategies. You can use EPL to reduce costs and improve the efficiency of distributed model training. This topic describes how to use EPL to accelerate TensorFlow distributed model training in Deep Learning Containers (DLC).

Preparations

Before you perform the operations described in this topic, ensure that the following requirements are met:

Step 1: Configure a code build

You can use EPL to write TensorFlow distributed training code. For more information, see QuickStart.
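
For reference, the following is a minimal sketch of how EPL's data parallelism API is typically added to TensorFlow 1.x training code. It follows the basic data parallelism usage shown in the EasyParallelLibrary repository; the placeholder model and loss below are illustrative stand-ins, not the actual ResNet50 example.

  import tensorflow as tf
  import epl


  def model():
      # Placeholder network and loss; the EPL repository example trains
      # ResNet-50 instead of this toy classifier.
      images = tf.placeholder(tf.float32, shape=[None, 224, 224, 3])
      labels = tf.placeholder(tf.int64, shape=[None])
      logits = tf.layers.dense(tf.layers.flatten(images), units=1000)
      loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
      return tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9).minimize(loss)


  # Initialize EPL and replicate the model across the allocated GPUs
  # (data parallelism). device_count=1 means each model replica uses one GPU.
  epl.init()
  with epl.replicate(device_count=1):
      train_op = model()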

You can also use the sample code provided by EPL to start TensorFlow distributed model training. In this example, the ResNet50 training example from the EPL repository is used to create a code build. You can use the code build to submit a TensorFlow training job. Each time the training job runs, the latest version of the code is automatically cloned. To configure a code build, perform the following steps.

  1. Go to the code builds page.

    1. Log on to the PAI console.

    2. In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace that you want to manage.

    3. In the left-side navigation pane, choose AI Asset Management > Source Code Repositories to go to the code builds page.

  2. On the code builds page, click Create Code Build.

  3. On the Create Code Build page, configure the parameters and click Submit.

    Set Git Repository Address to https://github.com/alibaba/EasyParallelLibrary.git and Code Branch to main. For more information about other parameters, see Code configuration.

Step 2: Start a training job

  1. Go to the Create Job page.

    1. Log on to the PAI console, select a region in the top navigation bar, select a workspace in the right section, and then click Enter Deep Learning Containers (DLC).

    2. On the Distributed Training (DLC) page, click Create Job.

  2. On the Create Job page, configure the following key parameters. For more information about other parameters, see Create a training job. Then, click OK.

    • In the Basic Information section, customize the job name.

    • In the Environment Information section, configure the following parameters:

      • Node Image: Select Official Image > tensorflow-training:1.15-gpu-py36-cu100-ubuntu18.04.

      • Startup Command:

        apt update
        apt install -y libnccl2 libnccl-dev
        cd /root/code/EasyParallelLibrary/
        pip install .
        cd examples/resnet
        bash scripts/train_dp.sh

      • Code Configuration: From the Online Configuration drop-down list, select the code build that you configured in Step 1 and set Branch to main.

    • In the Resource Information section, configure the following parameters:

      • Resource Source: Select Public Resources.

      • Framework: Select TensorFlow.

      • Job Resources: Configure the following parameters for worker nodes:

        • Number Of Nodes: Set this parameter to 2. You can change the value based on the requirements of the training job, as illustrated in the sketch after this procedure.

        • Resource Specification: Select the GPU specification ecs.gn6v-c8g1.2xlarge.

      • Maximum Running Time: Set this parameter to 2 hours.


  3. On the Distributed Training (DLC) page, click the name of the job that you want to manage to go to the job details page, where you can view the running status of the job. For more information, see View training details.
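
The Number Of Nodes value determines how many data parallel replicas participate in training, which in turn scales the effective global batch size. The following sketch illustrates the arithmetic only; the per-replica batch size of 32 and the assumption of one GPU per ecs.gn6v-c8g1.2xlarge node are illustrative values, not settings taken from the sample scripts.

  # Illustrative arithmetic only; the real batch size is defined inside the
  # ResNet-50 example scripts in the EasyParallelLibrary repository.
  num_nodes = 2            # "Number Of Nodes" configured for the DLC job
  gpus_per_node = 1        # assumption: one GPU per ecs.gn6v-c8g1.2xlarge node
  per_replica_batch = 32   # assumption: per-GPU batch size used by the script

  replicas = num_nodes * gpus_per_node
  global_batch = replicas * per_replica_batch
  print("replicas:", replicas, "global batch size:", global_batch)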

References