Whale automatically shards training data based on the distributed training mode and the resources that you specify. If the training data cannot be evenly split, you can configure the data sharding strategy by using environment variables. This topic describes the data sharding strategy of Whale, the parameters that control it, and how to configure it.

Data sharding strategy

During distributed training, each worker must obtain a different shard of the training data. If every worker processes the same data, distributed training is no faster than training on a single server, and the benefit of parallel acceleration is lost. Whale automatically shards training data based on the distributed training mode and the resources that you specify. Workers are assigned the master or slave role based on the distributed training mode. Master workers construct the computational graph and drive slave workers to run tasks.

Whale shards training data based on the following strategy:
  1. Determine the role of each worker based on the resources and the distributed training mode, such as data parallelism, model parallelism, pipeline parallelism, and operator splitting.
  2. Split the training data based on the number of master workers.
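The two steps above can be sketched in a few lines of Python. This is an illustrative sketch only, not Whale's actual internals; the function name, the assumption of a list of training files, and the contiguous assignment are all assumptions made for the example.

```python
# Illustrative sketch of Whale-style even data sharding (not Whale's code).
# Each master worker receives a contiguous subset of the training files.

def shard_files(files, num_masters):
    """Evenly split a list of training files among master workers.

    Assumes len(files) is divisible by num_masters; the non-divisible
    cases are handled by the strategies described later in this topic.
    """
    if len(files) % num_masters != 0:
        raise ValueError("file count must be divisible by the number of masters")
    per_worker = len(files) // num_masters
    return [files[i * per_worker:(i + 1) * per_worker]
            for i in range(num_masters)]

shards = shard_files([f"part-{i:04d}" for i in range(6)], num_masters=3)
# Each of the 3 master workers gets 2 of the 6 files.
```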

Data sharding examples in different scenarios

  • Data parallelism
    In the data parallelism mode, each worker assumes the master role. Whale evenly splits the training data based on the number of master workers. In the following figure, three master workers are used. Whale splits the training data into three shards, and each worker obtains a shard. The graphics processing units (GPUs) in each worker obtain training batches in sequence.
    Split the training data in the data parallelism mode
  • Data parallelism and model parallelism

    In the scenario where both the data parallelism and model parallelism modes are applied, the model is split into multiple stages, and different workers are responsible for different stages. In the following figure, a model is split into two stages. The master workers execute the logic of Stage 0, and the slave workers execute the logic of Stage 1. In this example, two master workers and two slave workers are used.

    Whale evenly splits the training data based on the number of master workers. Because two master workers are used, the training data is split into two shards. Each master worker obtains a shard, and the GPUs in each worker obtain training batches in sequence.

    Data sharding in the scenario where the data parallelism and model parallelism modes are applied
  • Scenario where the training data cannot be evenly split
    If the number of files in the training data is not evenly divisible by the number of master workers, Whale provides three processing methods:
    • Whale does not split the training data and randomly shuffles the files.
      Each master worker obtains all the files. To ensure convergence, Whale randomly shuffles the files.
      Note
      • By default, Whale uses this processing method when the training data cannot be evenly split.
      • If the number of files is less than the number of master workers, Whale supports only this processing method.
    • Whale unevenly splits the training data. Specific master workers obtain one more file than the others.

      To split the training data as evenly as possible, Whale allows specific master workers to obtain one more file than the other master workers.

    • Whale evenly splits the training data and discards the excess files.

      We recommend that you use this processing method only when the number of files is large and the discarded files do not affect convergence.
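The three processing methods above can be illustrated with a small sketch. The function name, the round-robin assignment for the uneven split, and the per-worker shuffle seeds are assumptions made for this example; they are not Whale's actual implementation.

```python
import random

def shard_uneven(files, num_masters, method="shuffle"):
    """Sketch of the three ways to handle a file count that is not
    divisible by the number of master workers (illustrative only)."""
    if method == "shuffle":
        # Default: every master obtains all files, shuffled for convergence.
        shards = []
        for seed in range(num_masters):
            shard = list(files)
            random.Random(seed).shuffle(shard)
            shards.append(shard)
        return shards
    if method == "unbalanced":
        # Uneven split: some masters obtain one more file than the others.
        return [files[i::num_masters] for i in range(num_masters)]
    if method == "drop_last":
        # Even split: drop the remainder files, then split evenly.
        usable = len(files) - len(files) % num_masters
        per_worker = usable // num_masters
        return [files[i * per_worker:(i + 1) * per_worker]
                for i in range(num_masters)]
    raise ValueError(f"unknown method: {method}")

files = [f"part-{i}" for i in range(7)]
print([len(s) for s in shard_uneven(files, 3, "unbalanced")])  # [3, 2, 2]
print([len(s) for s in shard_uneven(files, 3, "drop_last")])   # [2, 2, 2]
```

With 7 files and 3 master workers, the uneven split gives one worker an extra file, while the drop-last split discards one file so that every worker obtains two.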

Parameters used for data sharding

The following table describes the parameters that Whale supports for configuring the data sharding strategy. If the number of files in the training data is not divisible by the number of master workers, you can use the following parameters to configure the data sharding strategy.
Parameter: WHALE_UNBALANCED_IO_SLICING
  Description: Specifies whether to split the training data unevenly and allow specific master workers to obtain one more file than the others. Valid values:
  • False: does not split the training data.
  • True: unevenly splits the training data. Specific master workers can obtain one more file than the others.
  Type: BOOL
  Default value: False
  Remarks: Specific workers can obtain more training data files than others. In this case, pay attention to the condition that ends the distributed training. If the number of epochs is used as the stop condition, the training may be suspended. We recommend that you use StopAtStepHook to control when the model training ends.

Parameter: WHALE_DROP_LAST_FILES
  Description: Specifies whether to evenly split the training data and discard the excess files when the number of files in the training data is not divisible by the number of master workers. This parameter has a lower priority than the WHALE_UNBALANCED_IO_SLICING parameter. If both parameters are set to True, Whale unevenly splits the training data. Valid values:
  • False: does not split the training data.
  • True: evenly splits the training data and discards the excess files.
  Type: BOOL
  Default value: False
  Remarks: We recommend that you set this parameter to True only when the number of training data files is large and discarding a small part of the data does not affect convergence.
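WHALE_UNBALANCED_IO_SLICING takes precedence over WHALE_DROP_LAST_FILES when both are set to True. That priority can be sketched as follows; the helper function is illustrative and is not part of Whale's API.

```python
import os

def resolve_strategy(env=os.environ):
    """Pick the sharding strategy for a non-divisible file count (sketch).

    WHALE_UNBALANCED_IO_SLICING has the higher priority: if both
    variables are set to "True", the training data is split unevenly.
    """
    if env.get("WHALE_UNBALANCED_IO_SLICING", "False") == "True":
        return "unbalanced"
    if env.get("WHALE_DROP_LAST_FILES", "False") == "True":
        return "drop_last"
    return "shuffle"  # default: no split; each worker shuffles all files

print(resolve_strategy({"WHALE_UNBALANCED_IO_SLICING": "True",
                        "WHALE_DROP_LAST_FILES": "True"}))  # unbalanced
```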

Configure the data sharding strategy

When the number of files in the training data is not divisible by the number of master workers, you can configure the data sharding strategy by using environment variables. You can set the parameters in your training code or startup scripts as needed.
  • Set parameters in the training code
    You can configure environment variables by using os.environ. For example, add the following code to set the WHALE_UNBALANCED_IO_SLICING parameter. For more information, see the "Parameters used for data sharding" section.
    import os
    os.environ["WHALE_UNBALANCED_IO_SLICING"] = "True"
  • Set parameters in scripts
    A launch script is used in a Deep Learning Containers (DLC) environment or when you start training manually. You can configure the environment variables in the script in one of the following ways:
    • Configure the environment variables at the beginning of the script.
      export WHALE_UNBALANCED_IO_SLICING=True
      ...
      python train.py
      In the preceding script, train.py is the training script.
    • Configure the environment variables before statements are loaded for execution.
      WHALE_UNBALANCED_IO_SLICING=True python train.py
      In the preceding command, train.py is the training script.