Whale allows you to set communication parameters to enable various optimization features. This topic describes how to configure these parameters and lists the communication parameters that Whale supports.

Configuration method

Whale communication parameters are set by using environment variables. You can configure the environment variables in the training code or in a launch script to enable optimization features.
  • Set parameters in the training code
    You can configure environment variables in os.environ. For example, the following code enables hierarchical AllReduce communication by setting the WHALE_COMMUNICATION_HIERARCHICAL_ALL_REDUCE parameter. For more information, see the "Communication parameters" section of this topic.
    import os
    os.environ["WHALE_COMMUNICATION_HIERARCHICAL_ALL_REDUCE"] = "True"
    If you use Machine Learning Studio to submit a task and have not set communication parameters in the training code, you can pass them as PAI-TensorFlow hyperparameters during training and then assign the values to environment variables as shown in the preceding code. For more information about hyperparameters, see Hyperparameters supported by PAI-TensorFlow.
  • Set parameters in scripts
    A launch script is provided in a Deep Learning Containers (DLC) environment or when you start training. You can configure the environment variables in the script in either of the following ways:
    • Configure the environment variables at the beginning of the script.
      export WHALE_COMMUNICATION_HIERARCHICAL_ALL_REDUCE=True
      ...
      python train.py
      In the preceding example, train.py is the training script.
    • Configure the environment variables before statements are loaded for execution.
      WHALE_COMMUNICATION_HIERARCHICAL_ALL_REDUCE=True python train.py
      In the preceding example, train.py is the training script.
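The in-code method can also set several parameters at once. The following Python sketch exports multiple communication parameters as environment variables before training starts. The helper function and the chosen values are illustrative and are not part of the Whale API; set only the parameters that your workload needs.

```python
import os

# Illustrative sketch: export several Whale communication parameters as
# environment variables before the training code builds its graph.
# The helper function and values below are assumptions for illustration,
# not part of the Whale API.
def set_whale_comm_params(**params):
    """Export each keyword argument as an environment variable string."""
    for name, value in params.items():
        os.environ[name] = str(value)

set_whale_comm_params(
    WHALE_COMMUNICATION_HIERARCHICAL_ALL_REDUCE="True",
    WHALE_COMMUNICATION_NUM_COMMUNICATORS=2,  # lower GPU memory usage
)
```

Because the values are plain environment variables, this helper must run before the training framework reads them.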

Communication parameters

The following list describes the communication parameters that are supported by Whale.
  • WHALE_COMMUNICATION_SPARSE_AS_DENSE
    Description: Specifies whether to convert sparse gradients to dense gradients for communication during gradient synchronization. Valid values:
    • False: The conversion feature is disabled. Sparse gradients are communicated by using AllGather.
    • True: Sparse gradients are converted to dense gradients for communication.
    Type: BOOL. Default value: False.
    Remarks: Enable the conversion feature when the dense shape does not swell much after the conversion. For example, the dense shape of the BertLarge model swells little after sparse gradients are converted to dense gradients, so the feature is suitable for this model.
  • WHALE_COMMUNICATION_NUM_SPLITS
    Description: The number of gradient groups into which gradients are split for blended communication.
    Type: INT. Default value: 5.
    Remarks: Whale automatically adjusts the number of gradient groups. Manual configuration is not required.
  • WHALE_COMMUNICATION_NUM_COMMUNICATORS
    Description: The number of communicators that are created for parallel gradient communication. A value of None indicates that the number of communicators equals the number of gradient groups.
    Type: INT. Default value: None.
    Remarks: A greater number of communicators allows more concurrent communication but uses more graphics processing unit (GPU) memory. If GPU memory is insufficient, or if you receive an alert that the memory is exhausted, set this parameter to 2 or 1.
  • WHALE_COMMUNICATION_HIERARCHICAL_ALL_REDUCE
    Description: Specifies whether to enable hierarchical AllReduce communication for dense gradients. Valid values:
    • False: The feature is disabled.
    • True: The feature is enabled.
    Type: BOOL. Default value: False.
    Remarks: Enable this feature to accelerate communication when intra-server and inter-server bandwidths differ greatly, for example, when NVLink is used within a server but the transmission rate between servers is 25 Gbit/s, and the shape of the dense gradients is large.
  • WHALE_COMMUNICATION_HIERARCHICAL_ALL_GATHER
    Description: Specifies whether to enable hierarchical AllGather communication for sparse gradients. Valid values:
    • False: The feature is disabled.
    • True: The feature is enabled.
    Type: BOOL. Default value: False.
    Remarks: Enable this feature to accelerate communication when intra-server and inter-server bandwidths differ greatly and the shape of the sparse gradients is large.
  • WHALE_COMMUNICATION_DENSE_FP16
    Description: Specifies whether to enable half-precision communication for dense gradients. Valid values:
    • False: The feature is disabled.
    • True: The feature is enabled.
    Type: BOOL. Default value: False.
    Remarks: This feature halves the communication traffic of dense gradients. However, it may deteriorate convergence, in which case you need to tune related parameters.
  • WHALE_COMMUNICATION_SPARSE_FP16
    Description: Specifies whether to enable half-precision communication for sparse gradients. Valid values:
    • False: The feature is disabled.
    • True: The feature is enabled.
    Type: BOOL. Default value: False.
    Remarks: This feature halves the communication traffic of sparse gradients. However, it may deteriorate convergence, in which case you need to tune related parameters.
  • WHALE_COMMUNICATION_FP16_SCALE
    Description: The gradient scale factor that is used to prevent gradients from vanishing when half-precision communication is performed. A value of None indicates that no scale factor is applied.
    Type: FLOAT. Default value: None.
    Remarks: When half-precision communication is enabled, you can adjust this parameter to help the model converge.
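To illustrate how the half-precision parameters are used together, the following sketch enables half-precision communication for dense gradients and sets a scale factor. The scale value 128.0 is an illustrative assumption, not a recommendation from this topic; tune it for your own model.

```python
import os

# Illustrative sketch: enable half-precision communication for dense
# gradients and set a scale factor to reduce the risk of small gradients
# vanishing in FP16. The value 128.0 is an assumed example, not a
# documented recommendation.
os.environ["WHALE_COMMUNICATION_DENSE_FP16"] = "True"
os.environ["WHALE_COMMUNICATION_FP16_SCALE"] = "128.0"
```

If convergence degrades after you enable half-precision communication, adjusting the scale factor is the first parameter to revisit.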