AIACC-Training provides a set of commands that you can run to start distributed training. You can run the commands together with environment variables to improve the performance of AIACC-Training. This topic describes the startup commands and the environment variables in AIACC-Training.

Startup commands

Alibaba Cloud provides a set of perseusrun commands to start distributed training based on AIACC-Training and improve training performance. The perseusrun commands provide a unified entry point across underlying communication backends, training modes, distributed training, and elastic training methods. [Figure: comparison between typical startup commands and the perseusrun commands]
The following information describes the syntax of perseusrun commands.
  • Single machine
    By default, the Gloo backend is used. Command syntax:
    perseusrun -np NP [-H localhost:N] -- COMMAND [ARG [ARG...]]
  • Multiple machines
    In this example, the Message Passing Interface (MPI) backend and two machines are used. Command syntax:
    perseusrun --mpi -np NP -H host1:N,host2:N -- COMMAND [ARG [ARG...]]
    In this example, the Gloo backend and two machines are used. Command syntax:
    perseusrun --gloo -np NP -H host1:N,host2:N -- COMMAND [ARG [ARG...]]
    The following information describes parameters in the preceding command syntax.
    • N: the number of startup processes on each machine. In most cases, the value of N is the same as the number of GPUs on the machine.
    • NP: the total number of startup processes. The value of NP is calculated based on the following formula: NP = N × {Total number of machines}.
    • host1, host2: the internal-network IP address of each machine.
    • COMMAND: the Python command for the training program.
    • ARG: the Python parameters for the training program.
    To obtain more information about perseusrun commands, run the perseusrun -h command.
The following sample code provides examples on how to run the perseusrun commands to start training:
# In this example, a single machine and the default Gloo communication backend are used. Eight processes run on the machine.
perseusrun -np 8 -H localhost:8 -- python train.py --model resnet50
perseusrun -np 8 -- python train.py --model resnet50

# In this example, two machines and the MPI communication backend are used. Eight processes run on each machine.
perseusrun --mpi -np 16 -H host1:8,host2:8 -- python train.py --model resnet50

# In this example, four machines and the Gloo communication backend are used. Eight processes run on each machine.
perseusrun --gloo -np 32 -H host1:8,host2:8,host3:8,host4:8 -- python train.py --model resnet50

Environment variables

If you want to use the default settings of AIACC-Training, you do not need to configure environment variables. If you do not want to use the default settings, you can configure environment variables to change the default settings. This section describes how to configure environment variables for AIACC-Training.

The following list describes each Perseus environment variable, its feature, and the corresponding usage suggestion.

PERSEUS_ALLREDUCE_NANCHECK
Feature: Specifies whether to check gradients for Not a Number (NaN) values.
  • 0: disables the check.
  • 1: enables the check at startup.
Default value: 0.
Suggestion: None.
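
For example, to debug a run whose loss unexpectedly becomes NaN, you can enable the check before the startup command. This sketch reuses the single-machine example above:
# Check gradients for NaN values in an eight-GPU single-machine run.
PERSEUS_ALLREDUCE_NANCHECK=1 perseusrun -np 8 -- python train.py --model resnet50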

PERSEUS_ALLREDUCE_DTYPE
Feature: Specifies the gradient compression type for communication among GPUs.
  • 0: enables FP16 gradient compression.
  • 1: disables gradient compression.
  • 2: enables mixed compression. Gradient compression is disabled within a machine, and FP16 gradient compression is performed among machines.
Default value: 0.
Suggestion: By default, FP16 gradient compression is used, and in most cases you can leave the variable unset. If precision decreases during FP32 training, we recommend that you set the variable to 2. If automatic mixed precision (AMP) is enabled, we recommend that you set the variable to 1.
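
For example, in a training job that already enables AMP, you can disable gradient compression as suggested above. This sketch reuses the single-machine example:
# Disable gradient compression, as recommended when AMP is enabled.
PERSEUS_ALLREDUCE_DTYPE=1 perseusrun -np 8 -- python train.py --model resnet50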

PERSEUS_ALLREDUCE_MODE
Feature: Specifies the communication mode among machines. The following modes are supported:
  • 0: all machines use AllReduce for one-level communication.
  • 1: if multiple machines are used and each machine has multiple GPUs, two-level communication is used. At the first level, the data on each machine is reduced to one GPU on that machine. At the second level, those GPUs communicate across machines.
By default, Perseus selects the AllReduce communication mode.
Suggestion: We recommend that you leave the variable unset. Perseus automatically selects the optimal mode.

PERSEUS_ALLREDUCE_STREAMS
Feature: Specifies the maximum number of streams for multi-stream communication. Default value: 4. Valid values: 1 to 12.
Suggestion: In most cases, you can leave the variable unset. You can set a higher value, such as 8, if either of the following conditions is met (see the sketch after this list):
  • Your instance requires high bandwidth. For example, you want to use a gn6v instance that requires a bandwidth of 32 Gbit/s, or you want to use TCP communication on a Super Computing Cluster (SCC) instance.
  • More than 200 MB of GPU memory is available on your instance, but performance scalability does not meet your business requirements.
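
For example, on high-bandwidth instances whose scalability is insufficient, you can raise the stream limit to 8 as suggested above. The host names follow the multi-machine examples:
# Raise the multi-stream communication limit to 8.
PERSEUS_ALLREDUCE_STREAMS=8 perseusrun --gloo -np 16 -H host1:8,host2:8 -- python train.py --model resnet50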

PERSEUS_ALLREDUCE_FUSION
Feature: Specifies the granularity for gradient fusion. Valid values: 0 to 128. For example, you can set the variable to 16 when the total number of gradients is 196. The variable does not have a default value.
Suggestion: We recommend that you leave the variable unset. In that case, AIACC-Training automatically selects an optimal value.

PERSEUS_ACCUMULATE_N_STEPS (Perseus 1.3.0 or later)
Feature: Specifies the number of steps for local gradient accumulation based on multistep. Default value: 1. Sample values: 2, 4, and 8.
Suggestion: If GPU memory is insufficient for a large batch size, you can set the variable to N. This increases the effective batch size N times and retains the original number of epochs. You can also set the variable to N to reduce the communication volume to 1/N.
Note: Local gradient accumulation increases the effective batch size of your training. Adjust the hyperparameters that depend on the batch size, such as the learning rate, accordingly.
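
For example, the following sketch accumulates gradients over four steps, which increases the effective batch size four times in the single-machine example above:
# Accumulate gradients locally for 4 steps before each communication.
PERSEUS_ACCUMULATE_N_STEPS=4 perseusrun -np 8 -- python train.py --model resnet50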

PERSEUS_DOWNSAMPLE_N_ELEMENTS (Perseus 1.3.0 or later)
Feature: Specifies the granularity at which gradients are downsampled for compression based on the gossip method. Default value: 1. Sample values: 2, 4, and 8.
Suggestion: When the step size is large, you can use the gossip method to downsample the communication volume of gradients. For example, for the ResNet50 (notop) model on the ImageNet dataset, you can set the variable to 2, 4, or 8 to downsample the communication volume by 50%, 75%, or 87.5% while the granularity is preserved.

PERSEUS_GRADIENT_MOMENTUM (Perseus 1.3.0 or later)
Feature: Specifies the momentum of the gradient. Default value: 1.
Suggestion: Use the variable together with the PERSEUS_DOWNSAMPLE_N_ELEMENTS environment variable. If you use MomentumSGD for ImageNet training, you can set the variable to 0.9.
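
For example, an ImageNet training job that uses MomentumSGD might combine the two variables. This sketch assumes the two-machine MPI setup from the examples above:
# Downsample gradient communication by 75% and match a momentum of 0.9.
PERSEUS_DOWNSAMPLE_N_ELEMENTS=4 PERSEUS_GRADIENT_MOMENTUM=0.9 perseusrun --mpi -np 16 -H host1:8,host2:8 -- python train.py --model resnet50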

PERSEUS_NCCL_ENABLE (Special version)
Feature: Specifies whether to enable the NVIDIA Collective Communications Library (NCCL) that supports hybrid links.
  • 0: disables the feature.
  • 1: enables the feature.
Default value: 0.
Suggestion: On an SCC instance, you can set the variable to 1 to enable the RDMA and virtual private cloud (VPC) links and use hybrid bandwidth. To obtain details of the feature, submit a ticket.

PERSEUS_ALLREDUCE_GRADIENT_SCALE (Perseus 1.3.0 or later)
Feature: Specifies the coefficient of the gradient scale. Default value: 10.
Suggestion: The variable takes effect only when the PERSEUS_ALLREDUCE_DTYPE environment variable is set to 0 or 2, that is, when precision compression from FP32 to FP16 is performed. When a gradient is compressed from FP32 to FP16, the gradient scale is multiplied by the coefficient. When the gradient is restored from FP16 to FP32, the gradient scale is divided by the coefficient. If a large loss value causes a NaN error, you can set the variable to a lower value.

PERSEUS_OFFLINE_NEG (Perseus 1.3.2 or later)
Feature: Specifies the offline negotiation mode for gradients. Default value: 0. If you set the variable to 1, the system enables offline negotiation for gradients.
Suggestion:
  • For TensorFlow models that have a large number of parameters and poor scalability, we recommend that you set the variable to 1, as shown in the sketch after this list.
  • For models built on other frameworks, we recommend that you set the variable to 1 only when multiple Synchronized Batch Normalization (SyncBN) layers exist.
  • In other cases, we recommend that you set the variable to 0.
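
For example, for a TensorFlow model that has a large number of parameters and scales poorly, you can enable offline negotiation. This sketch reuses the two-machine MPI example:
# Enable offline gradient negotiation for a large TensorFlow model.
PERSEUS_OFFLINE_NEG=1 perseusrun --mpi -np 16 -H host1:8,host2:8 -- python train.py --model resnet50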

PERSEUS_PERF_CHECK_N_STEPS (Perseus 1.3.2 or later)
Feature: Specifies the interval, in steps, at which the system performs anomaly detection on GPU performance. Default value: 0, which disables anomaly detection. For example, if you set the variable to 100, the system performs anomaly detection every 100 steps. If an anomaly is detected on a GPU, details of the anomaly are displayed on the machine where that GPU is installed.
Important: The variable is incompatible with the TensorFlow Accelerated Linear Algebra (XLA) environment. We recommend that you do not set the variable to a non-zero value in such an environment.
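
For example, the following sketch checks GPU performance every 100 steps in the single-machine example above:
# Perform GPU performance anomaly detection every 100 steps.
PERSEUS_PERF_CHECK_N_STEPS=100 perseusrun -np 8 -- python train.py --model resnet50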

PERSEUS_MASTER_PORT (Perseus 1.5.0 or later)
Feature: Specifies the port number that is used to start the master. Default value: 6666.
Suggestion: The variable takes effect only when the PyTorch launcher starts training by using DistributedDataParallel (DDP). By default, PyTorch starts a rendezvous service during training. AIACC-Training starts a similar service that shares the same master_addr value with the PyTorch service. You need to only make sure that the two services use different port numbers.
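
For example, if the default port 6666 conflicts with the PyTorch rendezvous service, you can move the AIACC-Training master to another port. The port number 7777 is an arbitrary assumption:
# Start the AIACC-Training master on port 7777 to avoid a port conflict.
PERSEUS_MASTER_PORT=7777 perseusrun -np 16 -H host1:8,host2:8 -- python train.py --model resnet50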

PERSEUS_NCCL_NETWORK_INTERFACE (Perseus 1.5.0 or later)
Feature: Specifies the network interface controller (NIC) that is used for NCCL communication. Default value: eth0.
Suggestion: You can change the NIC based on your business requirements.

PERSEUS_GLOO_NETWORK_INTERFACE (Perseus 1.5.0 or later)
Feature: Specifies the NIC that is used for Gloo communication. Default value: eth0.
Suggestion: You can change the NIC based on your business requirements.
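
For example, on machines whose high-speed NIC is not eth0, you can point both backends at another interface. The device name eth1 is an assumption:
# Use the eth1 NIC for both NCCL and Gloo communication.
PERSEUS_NCCL_NETWORK_INTERFACE=eth1 PERSEUS_GLOO_NETWORK_INTERFACE=eth1 perseusrun --gloo -np 16 -H host1:8,host2:8 -- python train.py --model resnet50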

GLOO_TIMEOUT_SECONDS (Perseus 1.4.0 or later)
Feature: Specifies the timeout period for Gloo communication. Default value: 60. Unit: seconds.
Suggestion: If communication hangs because of complex training logic or network environment issues, we recommend that you set the variable to a higher value.
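
For example, in a complex network environment you can extend the Gloo timeout. The value of 300 seconds is an arbitrary assumption:
# Allow Gloo communication to wait up to 300 seconds before timing out.
GLOO_TIMEOUT_SECONDS=300 perseusrun --gloo -np 16 -H host1:8,host2:8 -- python train.py --model resnet50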

PERSEUS_CHANGE_HVD_ALLGATHER (Perseus 1.5.0 or later)
Feature: Specifies the calculation method of the Allgather operation. Default value: 0, which is compatible with DDP and mpi4py. If you set the variable to 1, the Horovod method is used.
Example: Assume the tensors tensor1=[0,0] and tensor2=[1,1] are gathered.
  • If you set the variable to 0, the return value is tensor([[0,0], [1,1]]).
  • If you set the variable to 1, the return value is tensor([0,0,1,1]).
Suggestion: If you use PyTorch DDP to perform training, you can use the default value. If you use Horovod to perform training, set the variable to 1. The performance of PyTorch SyncBatchNorm varies based on the value of the variable.

PERSEUS_USE_DDP_LAUNCHER (Perseus 1.5.0 or later)
Feature: Specifies the launcher that starts training by using PyTorch DDP. Default value: 1, which uses the built-in launcher of DDP. If you set the variable to 0, the mpirun launcher of Horovod is used.
Suggestion: If you use PyTorch DDP to perform training, we recommend that you set the variable to 1. If you use Horovod to perform training, we recommend that you set the variable to 0.
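
For example, a Horovod-based training script might combine the two Horovod-related variables. The script name train_hvd.py is a hypothetical placeholder:
# Use the mpirun launcher and the Horovod Allgather method for a Horovod script.
PERSEUS_USE_DDP_LAUNCHER=0 PERSEUS_CHANGE_HVD_ALLGATHER=1 perseusrun --mpi -np 16 -H host1:8,host2:8 -- python train_hvd.py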

You can add environment variables before the perseusrun command. The following example enables mixed precision compression and sets the coefficient of the gradient scale to 5. Sample code:
PERSEUS_ALLREDUCE_DTYPE=2 PERSEUS_ALLREDUCE_GRADIENT_SCALE=5 perseusrun xxx