You can call the scope-related operations provided by Whale to divide a model. This topic describes the syntax and parameters of each scope-related operation and provides sample code that shows how to call the operations to divide a model for parallel training.

A distributed training framework distributes different parts of a model to distributed hardware resources by using different parallelism strategies so that data can be computed in parallel. In Whale, this process consists of two parts: grouping and mapping resources, and dividing the model. You can call the scope-related operations provided by Whale to divide a model.

In Whale, a scope-related operation is used to divide the computational graph of a model into multiple subgraphs. Whale provides a variety of scope-related operations to divide a computational graph into multiple subgraphs in different ways for parallel training. That is, Whale implements parallel training at the granularity of subgraphs. Whale provides the following scope-related operations:
Note All the scope-related operations must be called by using the with keyword. For more information about the principles and mechanisms of each parallelism strategy, see Whale distributed paradigm.
  • whale.replica
    • Syntax
      replica(devices=None, name=None)
    • Description

Creates multiple copies of a subgraph and distributes the copies across multiple devices to achieve data parallelism.

    • Parameters
• devices: the GPUs that are used to execute subgraphs. The value is a list of strings. The default value is None.
      • name: the name of the current scope. The value is of the STRING type. The default value is None.
    • Return value

      Returns a Replica object.

    • Examples
      import whale as wh
      with wh.replica():
          Your_Model_Definition()
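The effect of whale.replica can be pictured with a plain-Python sketch. This is illustrative only, not Whale's API: each device holds a full copy of the model and processes its own slice of the input batch. The device names and helper functions below are assumptions.

```python
def shard_batch(batch, devices):
    """Split a batch evenly across devices (data parallelism)."""
    n = len(devices)
    size = len(batch) // n
    return {dev: batch[i * size:(i + 1) * size] for i, dev in enumerate(devices)}

def model(x):
    # Stand-in for the replicated subgraph: every replica runs the
    # same computation on its own shard of the data.
    return [v * 2 for v in x]

devices = ["/gpu:0", "/gpu:1"]
batch = [1, 2, 3, 4]
shards = shard_batch(batch, devices)
outputs = {dev: model(x) for dev, x in shards.items()}
```

Each replica computes independently; in real data parallelism the gradients of the replicas are then aggregated, which the framework handles for you.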
  • whale.split
    • Syntax
split(devices=None, name=None)
    • Description

      Divides subgraphs into multiple groups and distributes different groups of subgraphs to multiple devices to achieve operator splitting.

      The following operators can be split in Whale:
      • tf.layers.dense
      • tf.losses.sparse_softmax_cross_entropy
      • tf.arg_max
      • tf.equal
      • tf.math.equal
      • tf.metrics.accuracy
    • Parameters
• devices: the devices on which the subgraphs are executed. The value is a list of strings. The default value is None, which indicates that device information is obtained from the cluster.slices object.
      • name: the name of the current scope. The value is of the STRING type. The default value is None.
    • Return value

      Returns a Split object.

    • Examples
      import whale as wh
      with wh.split():
          Your_Model_Definition()
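The idea behind operator splitting can be sketched with a column-wise split of a dense layer's weight matrix. This is an illustrative sketch, not Whale's implementation: each device computes the output columns for its shard of the weights, and the partial outputs are concatenated.

```python
def dense(x, weights):
    """Minimal dense layer: output[j] = sum_i x[i] * weights[i][j]."""
    cols = len(weights[0])
    return [sum(x[i] * weights[i][j] for i in range(len(x))) for j in range(cols)]

def split_dense(x, weights, num_devices):
    """Split the weight matrix column-wise, compute each shard on its
    own device, then concatenate the partial outputs."""
    cols = len(weights[0])
    per = cols // num_devices
    outputs = []
    for d in range(num_devices):
        # Each device holds only its columns of the weight matrix.
        shard = [row[d * per:(d + 1) * per] for row in weights]
        outputs.extend(dense(x, shard))
    return outputs

x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]
```

Splitting across two devices produces the same result as the unsplit layer, which is why operators such as tf.layers.dense can be split transparently.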
  • whale.stage
    • Syntax
      stage(devices=None, name=None)
    • Description

      Divides subgraphs into multiple stages and distributes different stages of subgraphs to multiple devices. This achieves model parallelism based on layer splitting.

    • Parameters
• devices: the GPUs that are used to execute subgraphs. The value is a list of strings. The default value is None.
      • name: the name of the current scope. The value is of the STRING type. The default value is None.
    • Return value

      Returns a Stage object.

    • Examples
      import whale as wh
      # Scenario: Divide the model into two stages to achieve model parallelism.
      with wh.stage():
          Model_Part_1()
      with wh.stage():
          Model_Part_2()
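Model parallelism based on layer splitting can be sketched in plain Python. This is illustrative, not Whale's implementation: the model is cut into stages, each stage runs on its own device, and activations flow from one stage to the next.

```python
def stage_1(x):
    # First part of the model (e.g., early layers) on one device.
    return [v + 1 for v in x]

def stage_2(x):
    # Second part of the model on another device.
    return [v * 10 for v in x]

def run_staged(x, stages):
    """Run the stages sequentially, passing activations between them."""
    for stage in stages:
        x = stage(x)
    return x

result = run_staged([1, 2], [stage_1, stage_2])
```

Without pipelining, only one stage is active at a time; combining whale.stage with whale.pipeline overlaps the stages across micro-batches.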
  • whale.pipeline
    • Syntax
      pipeline(num_micro_batch, devices=None, strategy=None, name=None)
    • Description

      Builds pipelines to execute subgraphs to achieve pipeline parallelism. This operation must be used together with the whale.stage operation.

    • Parameters
      • num_micro_batch: the number of microbatches that are executed in parallel. The value is of the INTEGER type.
      • devices: the GPUs that are used to execute subgraphs. The value is a list of strings. The default value is None.
      • strategy: the pipeline parallelism strategy. The PreferForward, PreferBackward, and PreferBackwardOptimizer strategies are supported. The value is of the STRING type. If this parameter is set to None, the PreferBackwardOptimizer strategy is used.
      • name: the name of the current scope. The value is of the STRING type. The default value is None.
    • Return value

      Returns a Pipeline object.

    • Examples
      import whale as wh
      # This operation is usually used together with the whale.stage operation.
      # Scenario: Divide the model into two stages and apply a pipeline parallelism strategy.
      with wh.pipeline(num_micro_batch=5):
          with wh.stage():
              Model_Part_1()
          with wh.stage():
              Model_Part_2()
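Scheduling details such as PreferForward and PreferBackwardOptimizer aside, the core idea of pipeline parallelism is that different stages process different micro-batches at the same time. The following sketch, which is illustrative and not Whale's scheduler, lists which (stage, micro_batch) pairs run in parallel at each time step of a simple forward-only pipeline:

```python
def pipeline_schedule(num_stages, num_micro_batch):
    """Return, per time step, the (stage, micro_batch) pairs that run
    in parallel under a simple forward-only pipeline schedule."""
    steps = []
    for t in range(num_stages + num_micro_batch - 1):
        # Stage s works on micro-batch t - s, if that micro-batch exists.
        active = [(s, t - s) for s in range(num_stages)
                  if 0 <= t - s < num_micro_batch]
        steps.append(active)
    return steps

# Two stages, five micro-batches, as in the example above.
schedule = pipeline_schedule(num_stages=2, num_micro_batch=5)
```

From the second step onward both stages are busy, which is how pipelining reduces the idle time that plain stage-based model parallelism would leave on each device.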