Whale facilitates distributed parallel training. It supports training based on combined parallelism strategies and provides a variety of communication optimization features. This topic describes how to implement distributed parallelism strategies in Whale: initializing Whale, grouping resources with a cluster that maps hardware resources to logical resources, and dividing the model with scopes. This topic also provides a complete example and describes how to enable the communication optimization features of Whale.

Background information

Whale is a flexible, easy-to-use, efficient, and unified distributed training framework. It provides easy-to-use API operations for data parallelism, model parallelism, pipeline parallelism, operator splitting, and hybrid parallelism that combines multiple parallelism strategies. Whale is developed based on TensorFlow and is fully compatible with the TensorFlow API. You only need to add a few lines of code that describe distributed parallelism strategies to an existing TensorFlow model to perform distributed, hybrid parallel training.

Version mapping

Whale supports the following PAI-TensorFlow and Python versions:
  • PAI-TensorFlow versions: PAI-TensorFlow V1.12 and PAI-TensorFlow V1.15
  • Python versions: Python 2.7 and Python 3.6

Use Whale to add distributed strategies

  1. Initialize Whale.
    Call the init() API operation to initialize Whale. For more information, see whale.init. Sample code:
    import whale as wh
    wh.init()
  2. Use a cluster to group resources.
    Whale uses a cluster to abstract and define resources and map hardware resources to logical resources. Whale groups resources by using different strategies. For more information, see whale.cluster.
    cluster = wh.cluster()
  3. Use scopes to divide a model.
    Whale uses scopes to specify the distributed training modes. Whale supports a variety of scope combinations. For more information, see Scope-related operations. Whale supports the following scope types:
    • replica: data parallelism
    • stage: model parallelism
    • pipeline: pipeline parallelism
    • split: operator splitting
    • A combination of the preceding scopes
    The following examples show how to implement distributed training in Whale for reference. A skeleton that assembles initialization, the cluster, and the scopes into one script is shown after this list.
    • Data parallelism
      with wh.replica():
          model_fn()
    • Model parallelism
      Assume that a model has 20 layers. The first 10 layers are divided into one stage, and the last 10 layers are divided into the other stage. Sample code:
      with wh.stage():
          # layers 1~10 of model
          model_fn1()
      with wh.stage():
          # layers 11~20 of model
          model_fn2()
    • Data parallelism, model parallelism, and pipeline parallelism
      To add data parallelism and pipeline parallelism on top of model parallelism, nest the stage scopes inside the replica and pipeline scopes. Sample code:
      with wh.replica():
          with wh.pipeline(num_micro_batch=5):
              with wh.stage():
                  # layers 1~10 of model
                  model_fn1()
              with wh.stage():
                  # layers 11~20 of model
                  model_fn2()
    • Data parallelism and operator splitting
      Assume that the first 10 layers of a model use data parallelism and the subsequent layers use operator splitting. Sample code:
      with wh.replica():
          # layers 1~10 of model
          model_fn1()
      with wh.split():
          # layers 11~12 of model
          model_fn2()
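
For reference, the three steps above can be assembled into a single skeleton. The following sketch nests the scopes from the combined example inside wh.init() and wh.cluster(); model_fn1 and model_fn2 stand for the two halves of your model, as in the examples above.
  import whale as wh

  def main(_):
      wh.init()  # initialize Whale
      with wh.cluster():  # map hardware resources to logical resources
          with wh.replica():  # data parallelism across model replicas
              with wh.pipeline(num_micro_batch=5):  # pipeline parallelism with 5 micro-batches
                  with wh.stage():
                      model_fn1()  # first half of the model
                  with wh.stage():
                      model_fn2()  # second half of the model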

Complete example

  1. Rewrite a model for distributed training.
    Use Whale to transform a standalone model into a model that uses data parallelism. The following example shows only how to rewrite the main code. For more information about the complete code, visit simple_data_parallel.py. For more information about how to use distributed parallelism strategies, see Whale distributed paradigm.
    import whale as wh
    def main(_):
      wh.init()  # init whale
      with wh.cluster():  # cluster
        with wh.replica():  # data parallelism
          your_model_definition()  # construct your model here
    The rewritten main code consists of three calls:
    1. Call wh.init() to initialize Whale.
    2. Call with wh.cluster() to define the cluster and map hardware resources to logical resources.
    3. Call with wh.replica() to specify data parallelism as the parallelism strategy.
    After the standalone model is transformed into a model that uses data parallelism for training, Whale automatically detects task resources and groups the resources based on the data parallelism strategy. A fuller sketch that expands the model definition into a runnable training script is provided after this list.
  2. Split training data.
    In the previous step, mock data is read. If real data is used for distributed training, you must split the training data to ensure that each worker obtains different data.

    Whale automatically splits training data. For more information about training data splitting strategies and configuration methods, see Shard training data. For a plain TensorFlow illustration of manual sharding, see the sketch after this list.

  3. Submit a task.
    You can submit a task in one of the following ways:
    • PAI-Studio
      Use the algorithm component TensorFlow 1120 to submit the task.
    • PAI-DLC
      • Log on to the server and run the following command to manually run the training script:
        python simple_data_parallel.py
        In the preceding command, simple_data_parallel.py is the code file of the distributed training task. Replace the file name as needed.
      • Use Arena to submit the task. For more information, see Manage jobs in Arena.
      • Use Deep Learning Containers (DLC) Dashboard to submit the task. For more information, see Manage jobs in DLC Dashboard.
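
To make the rewrite in step 1 more concrete, the following sketch expands the your_model_definition placeholder into a small TensorFlow 1.x model with mock data, a loss, an optimizer, and a standard training loop. The model body, the training loop, and their placement relative to the Whale scopes are illustrative assumptions, not part of the Whale API; simple_data_parallel.py remains the authoritative example.
  import tensorflow as tf
  import whale as wh

  def your_model_definition():
      # Mock input data: 32 samples with 10 features each, plus random targets.
      features = tf.random_normal([32, 10])
      labels = tf.random_normal([32, 1])
      # A small fully connected network built from standard TensorFlow ops.
      hidden = tf.layers.dense(features, 16, activation=tf.nn.relu)
      predictions = tf.layers.dense(hidden, 1)
      loss = tf.losses.mean_squared_error(labels, predictions)
      global_step = tf.train.get_or_create_global_step()
      train_op = tf.train.GradientDescentOptimizer(0.01).minimize(
          loss, global_step=global_step)
      return train_op, loss

  def main(_):
      wh.init()  # initialize Whale
      with wh.cluster():  # map hardware resources to logical resources
          with wh.replica():  # data parallelism
              train_op, loss = your_model_definition()
          # Standard TensorFlow 1.x training loop; its exact placement relative to
          # the cluster scope is an assumption and may differ in the official example.
          with tf.train.MonitoredTrainingSession() as sess:
              for step in range(100):
                  _, loss_value = sess.run([train_op, loss])
                  if step % 10 == 0:
                      print("step %d, loss %f" % (step, loss_value))

  if __name__ == "__main__":
      tf.app.run(main)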

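Whale's automatic sharding is configured as described in Shard training data. The following sketch only illustrates the underlying idea of step 2 with plain tf.data; it is not part of the Whale API, and the file_pattern, num_workers, and worker_index arguments are hypothetical values describing the current worker.
  import tensorflow as tf

  def build_sharded_dataset(file_pattern, num_workers, worker_index, batch_size=32):
      # Illustrative manual sharding: each worker reads a disjoint subset of files.
      files = tf.data.Dataset.list_files(file_pattern, shuffle=False)
      # Assign every num_workers-th file to this worker so workers see different data.
      files = files.shard(num_shards=num_workers, index=worker_index)
      dataset = files.interleave(tf.data.TFRecordDataset, cycle_length=4)
      dataset = dataset.shuffle(10000).repeat().batch(batch_size)
      return dataset
Each worker calls build_sharded_dataset with its own worker_index, so no two workers train on the same files.
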
Other features of Whale

Whale supports a variety of communication optimization features that you can enable as needed. For more information about the configuration methods, see Communication parameters.