
Elastic GPU Service: What is AIACC-ACSpeed?

Last Updated: Oct 27, 2023

AIACC 2.0-AIACC Communication Speeding (AIACC-ACSpeed) is a communication optimization library released by Alibaba Cloud for the distributed training of AI models. It is also known as AIACC-Training V2.0. Compared with AIACC-Training V1.5, ACSpeed uses a modular, decoupled design that delivers better compatibility, applicability, and performance.

Introduction

AIACC-ACSpeed (ACSpeed) is an in-house AI training accelerator developed by Alibaba Cloud that delivers significant performance improvements for model training. You can use ACSpeed to optimize the performance of jobs that involve distributed communication, thereby improving computing efficiency and reducing costs.

ACSpeed is fully compatible with mainstream open-source distributed frameworks at the AI framework layer, the collective algorithm layer, and the network layer. This allows ACSpeed to deliver optimized performance from both the hardware and software perspectives.

  • AI framework layer: ACSpeed is fully compatible with the PyTorch framework and transparently optimizes the performance of distributed training by using aiacc-c10d-plugin.

  • Collective algorithm layer: ACSpeed uses collective communication compilation technologies to build adaptive topology algorithms for different models, and transparently optimizes the collective communication topology while remaining compatible with the NCCL runtime.

  • Network layer: ACSpeed optimizes the network infrastructure to improve communication without affecting network stability. Depending on what is available, these optimizations may be applied to the virtual private cloud (VPC), remote direct memory access (RDMA), or eRDMA network of Alibaba Cloud.

Usage notes

You can verify whether the aiacc-c10d-plugin component is initialized by checking the logs of training tasks that have ACSpeed enabled. If the logs contain AIACC-2.0 ACSpeed c10d-plugin init, the component is initialized.

Note

When you set NCCL_DEBUG to INFO, the NCCL component log contains fields that start with AIACC-2.0, such as the AIACC-2.0-Socket field.


The following list describes the training mode that ACSpeed requires, the startup methods that it supports, and its benefits.

  • Training mode: The c10d component of ACSpeed is optimized specifically for PyTorch. You must use native DDP (torch.nn.parallel.DistributedDataParallel) as the training mode to use this component. A minimal example follows this list.

  • Startup method: ACSpeed does not impose limits on startup methods. You can use any custom startup method, such as the launch, run, or spawn scripts of PyTorch, or other scripts. However, if CPU-Affinity is enabled, you can use only the official torch.distributed.run or torch.distributed.launch of torch 1.8 to 1.12 to start ACSpeed.

  • Benefits: ACSpeed improves communication performance for distributed tasks. The greater the current communication bottleneck, the greater the performance improvement that ACSpeed can deliver. If your setup uses multi-GPU instances or the linearity of your cluster is close to 1, there are few or no communication bottlenecks. In such cases, ACSpeed cannot provide significant performance advantages.
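
The training-mode requirement above assumes native DDP. The following is a minimal sketch of such a job, assuming a standard PyTorch installation; the model, the data, and the torchrun launch command are illustrative only and are not part of ACSpeed.

    # minimal_ddp.py -- a minimal native-DDP sketch (illustrative model and data).
    # Hypothetical launch command: torchrun --nproc_per_node=8 minimal_ddp.py
    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def main():
        dist.init_process_group(backend="nccl")          # NCCL backend for GPU training
        local_rank = int(os.environ["LOCAL_RANK"])       # set by torchrun
        torch.cuda.set_device(local_rank)

        model = torch.nn.Linear(1024, 1024).cuda(local_rank)
        ddp_model = DDP(model, device_ids=[local_rank])  # native DDP, as required by the c10d component
        optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

        for _ in range(10):
            x = torch.randn(32, 1024, device="cuda")
            loss = ddp_model(x).sum()
            optimizer.zero_grad()
            loss.backward()                              # DDP synchronizes gradients here
            optimizer.step()

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()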

The following list describes the concepts used in ACSpeed.

  • autotuner: This feature enables adaptive tuning of communication algorithms and is enabled by default. When bucket-tuning is disabled, a message indicating that AIACC tuning has ended is printed after about 200 to 400 iterations.

    The bucket-tuning feature is enabled by default for ACSpeed v1.1.0. A similar message is printed when autotuning ends with bucket-tuning enabled.

    Note

    Model training consumes additional context resources. Therefore, to ensure the accuracy of performance tests, we recommend that you set warmup_iteration to 400 for model training. For example, a training task whose warmup_iteration is set to 220 prints Cost 220 steps in its log.

  • perf_mode: ACSpeed analyzes the performance of each iteration during the autotune process. It supports several modes that report performance statistics from different perspectives. The default mode is time. You can use the AIACC_AUTOTUNE_PERF_MODE environment variable to switch between modes (see the configuration sketch after this list). Valid values:

    • AIACC_AUTOTUNE_PERF_MODE=time: This mode is suitable for scenarios with additional operations such as end-to-end data preprocessing, data loading, and gradient accumulation. This is the default value.

    • AIACC_AUTOTUNE_PERF_MODE=filter: This mode can output perf_time stably. It is suitable for scenarios with additional processing operations such as checkpoint and logging.

    • AIACC_AUTOTUNE_PERF_MODE=profile: Compared with the previous two modes, this mode additionally outputs the proportion of communication in a single iteration of the current model training.

  • Single-instance optimization: This feature is applicable only to Torch 1.6 to Torch 1.9.

  • CPU-Affinity: This feature is disabled by default. To enable CPU-Affinity and enhance the effect of ACSpeed on some instance types, set the following environment variable:

    AIACC_CPU_BINDING_ENABLE=1

    If the program has innate defects, such as performance fluctuations caused by load imbalance, performance loss may occur after the CPU-Affinity feature is enabled. Therefore, this feature is provided as an optional optimization.

  • Bucket-Tuning: This feature enables adaptive tuning of gradient bucketing and can improve the overlap between computing and communication. However, it requires a larger number of warmup_steps (approximately 950 steps). This feature is enabled by default for ACSpeed v1.1.0 and is applicable to Torch 1.8 to Torch 1.13.

    Note

    You can set the following environment variable to disable bucket-tuning:

    AIACC_BUCKET_TUNE_DISABLE=1
  • Cache mechanism: This feature reduces the warmup time of tuning and avoids repeated autotuning of the same model in the same environment. Caches are distinguished by key values such as the model, model name, model size, and cluster size.

    To enable the cache mechanism, set the AIACC_CONFIG_CACHE_DISABLE environment variable to 0. The configuration that delivers the optimal performance during training is saved on the machine whose rank is 0.

    Note

    When the cache mechanism is enabled and a training job with the same key values is run, the system automatically reads and loads the saved configuration and skips the autotune step.

    • A message is printed when you enable the cache mechanism for the first time.

    • A message is printed after the cache is saved.

    • A message is printed when training with the same key values is performed again.

    In training scenarios where the optimal configuration has been saved and the same key values are specified, you can set the AIACC_CONFIG_CACHE_UPDATE_DISABLE environment variable to 0 to allow the saved configuration to be updated. A message is printed when the cache starts to update.
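
The environment variables described above are normally exported in the shell that launches training. The following sketch gathers them in one place by setting them through os.environ, which assumes that the plugin reads them at initialization time; the values shown are examples only, not recommendations.

    # Illustrative ACSpeed-related environment settings (example values only).
    import os

    os.environ.setdefault("AIACC_AUTOTUNE_PERF_MODE", "time")        # time | filter | profile
    os.environ.setdefault("AIACC_CPU_BINDING_ENABLE", "1")           # optional CPU-Affinity
    os.environ.setdefault("AIACC_BUCKET_TUNE_DISABLE", "1")          # disable bucket-tuning if needed
    os.environ.setdefault("AIACC_CONFIG_CACHE_DISABLE", "0")         # enable the cache mechanism
    os.environ.setdefault("AIACC_CONFIG_CACHE_UPDATE_DISABLE", "0")  # allow cache updates

    # ... then import torch and initialize the process group as usual.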

How ACSpeed works

Scenarios

When you use a single GPU-accelerated instance or multiple multi-GPU instances for the distributed training of AI models, the linearity of distributed communication can be used as an indicator of how well single-GPU training scales to multiple GPUs or multiple instances. Linearity is calculated by using the following formulas:

  • The scalability of one instance: Linearity = multi-GPU performance / single-GPU performance / the number of GPUs per instance

  • The scalability of multiple instances: Linearity = multi-instance performance / single-instance performance / the number of instances

The linearity value ranges from 0 to 1. As linearity approaches 1, the scaling efficiency of the training job approaches its theoretical peak. If the linearity is lower than 0.8 and the influence of data I/O and the CPU is excluded, a bottleneck in distributed communication exists. In this scenario, you can use ACSpeed to accelerate distributed training and improve overall performance. The poorer the linearity of the original baseline, the greater the improvement ACSpeed can deliver.
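
As a quick illustration of the preceding formulas, the following sketch computes linearity from throughput numbers; the numbers used are hypothetical.

    # Linearity = parallel performance / baseline performance / scale factor.
    def linearity(parallel_throughput: float, baseline_throughput: float, scale: int) -> float:
        return parallel_throughput / baseline_throughput / scale

    # Single-instance scalability: 8-GPU throughput vs. single-GPU throughput (hypothetical numbers).
    print(linearity(parallel_throughput=5600.0, baseline_throughput=1000.0, scale=8))    # 0.70

    # Multi-instance scalability: 4-instance throughput vs. single-instance throughput.
    print(linearity(parallel_throughput=34000.0, baseline_throughput=10000.0, scale=4))  # 0.85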

Single-instance optimization

The following section describes how ACSpeed works and shows its performance improvements by using PCIe-topo and NVLink-topo servers as examples.

  • PCIe-topo

    • Issue analysis

      Take the GPU topology of an eight-GPU instance without P2P interconnection as an example, in which GPU0 to GPU7 have no P2P interconnection and share the same PCIe bandwidth. Therefore, in distributed training scenarios that require multi-GPU communication, especially scenarios that involve a large amount of inter-GPU communication, performance is bottlenecked by the physical bandwidth limitations. (A sketch for checking P2P connectivity is shown after this list.)

      In this topology, GPU0 to GPU3 and GPU4 to GPU7 are connected to each other through a PCIe bridge (PIX). Communication between any GPU in GPU0 to GPU3 and any GPU in GPU4 to GPU7 must go through the QPI/UPI interface (SYS) between CPU sockets.

    • Optimization

      By default, the native NCCL communication library uses the ncclAllReduce function for collective communication. Given the bandwidth limits of the PCIe topology, training performance can be further improved. ACSpeed reduces the number of communication steps in the collective communication process and achieves asynchronous, pipelined communication between CPUs and GPUs. One of the core improvements is that all AllReduce operations on the data are performed on the CPUs. This optimization is also called CPU-Reduce.

    • Improvement effect

      For training jobs that use single instances of the PCIe-topo type, you can enable the CPU-Reduce optimization to address the poor linearity caused by a relatively high proportion of communication. Compared with the native NCCL mode, this method improves performance by about 20% in scenarios with more than 4 MB of traffic. It also reduces the gradient synchronization time during training and thereby improves end-to-end performance. For example, for ResNet-50 and Transformer-based model training, this method can deliver more than 10% performance improvement.

  • NVLink-topo

    • Issue analysis

      Take the GPU topology of an eight-GPU V100 instance as an example. The number of NVLink channels between different GPU pairs varies, for example NV1 and NV2. One of the algorithms commonly used by NCCL is the binary tree (also known as 2-tree), which cannot achieve the optimal state across different instance topologies.

      Note

      NCCL is a collective communication library for NVIDIA GPUs that implements collective communication and point-to-point communication. Almost all open-source AI frameworks use NCCL for communication.

    • Optimization

      To address the preceding issue, ACSpeed makes full use of high-bandwidth NVLink interconnections, such as the connection between GPU0 and GPU3, to implement its AllReduce algorithms. This improves performance when single-instance communication bottlenecks occur. ACSpeed also implements a set of n-tree algorithms across multiple instances based on the binary-tree topology within a single instance, and thereby optimizes the topology.

    • Improvement effect

      By using directed combinations of n-trees, ACSpeed can make full use of the transceiver capacity of multiple NVLink channels. In this way, ACSpeed improves performance by 20% in scenarios with data traffic of more than 128 MB.
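
To see whether the GPUs on an instance have the P2P interconnection discussed above, you can query PyTorch directly. The following sketch prints a simple P2P connectivity matrix; it assumes a CUDA-enabled PyTorch installation and is independent of ACSpeed.

    # Print a GPU-to-GPU P2P connectivity matrix (1 = P2P access possible, 0 = not).
    import torch

    n = torch.cuda.device_count()
    for i in range(n):
        row = []
        for j in range(n):
            if i == j:
                row.append("X")
            else:
                row.append("1" if torch.cuda.can_device_access_peer(i, j) else "0")
        print(f"GPU{i}: {' '.join(row)}")
    # On a PCIe-topo instance without P2P, the off-diagonal entries are 0 and all traffic
    # shares the PCIe/QPI bandwidth; nvidia-smi topo -m shows similar information.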

Multi-instance optimization

ACSpeed can improve the performance of communication between multiple instances through AllReduce algorithms and multi-stream communication optimization.

  • AllReduce algorithms

    • Issue analysis

      Take a V100 instance as an example. Within a single instance, NVLink supports P2P communication and can provide a bandwidth of up to 300 GB/s. Across multiple instances, however, the network bandwidth is lower than 100 Gbit/s, so the performance of ring-allreduce algorithms is limited and overall throughput degrades.

    • Optimization

      Compared with traditional ring-allreduce algorithms, the hybrid-allreduce algorithm provided by ACSpeed performs hierarchical communication within a single instance and across multiple instances, which makes full use of the high-speed bandwidth within each instance and reduces the traffic over the low-speed network between instances. It combines a variety of algorithms, such as ps, tree, and butterfly, based on the topological characteristics of network interface controllers and the GPU distances of different Alibaba Cloud instance types, to ensure network performance across instance types.

    • Improvement effect

      Performance on V100 16 GB and V100 32 GB instances is significantly improved. For example, the performance of running the VGG16 model on these two instance types can be improved by more than 20%.

  • Multi-stream communication optimization

    • Issue analysis

      In most cases, single-stream communication cannot make full use of the TCP network bandwidth, which you can verify by using iperf. This limits the performance of the AllReduce collective communication algorithm across instances. (A timing sketch for cross-instance all_reduce is shown after this list.)

    • Optimization

      ACSpeed implements the multi-stream feature based on TCP/IP to improve concurrent communication in distributed training and to make full use of the network bandwidth.

    • Improvement effect

      By using multi-stream communication optimization, the overall multi-instance performance can be improved by 5% to 20%.

  • Multi-instance CPU-Reduce

    • Issue analysis

      Given the bandwidth limits of the PCIe topology, the single-instance CPU-Reduce method already delivers better performance than the native NCCL mode. Therefore, for training jobs that use multiple instances of the PCIe-topo type, you can apply CPU-Reduce across instances to make full use of each instance. This method can also improve cross-instance communication performance in scenarios where bandwidth is limited by the number of socket connections.

    • Optimization

      Multi-instance CPU-Reduce uses asynchronous, pipelined communication among instances to improve communication efficiency. This method also prevents the intermediate data stored on the CPU from being repeatedly copied between CPUs and GPUs. To further improve cross-instance communication performance, you can use idle resources to increase the number of cross-instance pipelines.

    • Improvement effect

      For communication-intensive training jobs such as training VGG16 or Transformer-based models on multiple PCIe-topo instances, the end-to-end performance can be improved by over 20%.
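
To get a rough sense of the cross-instance communication cost discussed above, you can time a plain all_reduce with PyTorch. The following sketch is illustrative; the tensor size, iteration counts, and torchrun launch command are assumptions, and the measurement itself does not involve ACSpeed.

    # allreduce_bench.py -- rough all_reduce timing sketch (illustrative sizes).
    # Hypothetical launch command: torchrun --nnodes=2 --nproc_per_node=8 allreduce_bench.py
    import os
    import time
    import torch
    import torch.distributed as dist

    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    tensor = torch.randn(32 * 1024 * 1024, device="cuda")  # ~128 MB of float32 data

    for _ in range(5):                       # warm up to exclude connection setup
        dist.all_reduce(tensor)
    torch.cuda.synchronize()

    iters = 20
    start = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(tensor)
    torch.cuda.synchronize()
    elapsed = (time.perf_counter() - start) / iters

    if dist.get_rank() == 0:
        size_gb = tensor.numel() * tensor.element_size() / 1e9
        print(f"all_reduce of {size_gb:.3f} GB took {elapsed * 1e3:.1f} ms per iteration")

    dist.destroy_process_group()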

Architecture optimization

ACSpeed improves communication performance at the architecture layer through the communication algorithm autotuner, the adaptive CPU-Affinity feature, and end-to-end optimization.

  • Autotuner algorithms

    • Issue analysis

      Different instances and network environments may require different communication algorithms to achieve optimal performance. Therefore, a fixed algorithm can hardly guarantee optimal performance in a real-time network environment.

    • Optimization

      ACSpeed implements the autotuner feature to choose the optimal communication algorithm for the specific network environment during real-time training and thereby ensure optimal end-to-end performance. The autotuner feature includes mechanisms such as warmup, multi-dimensional perf_time output, and top_k_algo retuning.

    • Improvement effect

      This feature ensures that the optimal algorithm is used when different models are trained on multiple instances, and it causes no performance loss in 99% of cases.

  • CPU-Affinity

    • Issue analysis

      Multiple processes in a single instance may compete for resources because of the Non-Uniform Memory Access (NUMA) architecture or different Linux scheduling policies. This consumes additional scheduling context and causes performance inconsistency among the processes in the instance. Because distributed training is mostly synchronous, this also lowers overall performance.

    • Optimization

      ACSpeed introduces the CPU-Affinity mechanism to bind training processes to CPU cores and control the affinity between processes and cores. This eliminates the influence of NUMA and the scheduling overhead. (A generic illustration of CPU core binding appears after this list.)

    • Improvement effect

      This method improves the performance of an eight-GPU single instance that runs the VGG16 model by 3%.

  • End-to-end Optimization

    • Issue analysis

      Model training consists of computing, communication, and parameter updates. Different models generate different amounts of gradient traffic, which leads to performance differences between communication algorithms. The way in which computing, communication, and parameter updates overlap also affects overall performance, and the coupling between computing and communication can lower it further. This calls for systematic, comprehensive optimization.

    • Optimization

      ACSpeed provides comprehensive optimization of the PyTorch framework, covering the overall tuning of forward, backward, and overlap operations. For example, you can use bucket-tuning to improve the end-to-end performance by 2% to 10%.

    • Improvement effect

      ACSpeed optimizes overall performance through transparent plug-ins to ensure a smooth and streamlined user experience.
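
ACSpeed applies CPU binding automatically when AIACC_CPU_BINDING_ENABLE is set to 1. The following sketch is only a generic illustration of pinning a process to a fixed slice of CPU cores on Linux; the core counts are hypothetical and this is not ACSpeed's internal implementation.

    # Pin the current training process to its own slice of CPU cores (Linux only).
    import os

    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    cores_per_rank = 12                              # hypothetical: 96 cores shared by 8 ranks
    first = local_rank * cores_per_rank
    cpu_set = set(range(first, first + cores_per_rank))

    os.sched_setaffinity(0, cpu_set)                 # 0 = the current process
    print(f"rank {local_rank} pinned to cores {sorted(cpu_set)}")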

Contact us

If you need assistance with distributed training, use DingTalk to search for group number 33617640 and join the Alibaba Cloud AIACC support group for external users. (Download DingTalk)