Apsara AI Accelerator (AIACC) is an AI acceleration engine built on Alibaba Cloud IaaS resources. AIACC optimizes models that are built on mainstream AI computing frameworks and improves training and inference performance. AIACC works with the resource management tool FastGPU to build AI computing tasks and make research and development more efficient.

Use AIACC to accelerate an application in deep learning scenarios

The following architecture shows how AIACC is used in deep learning scenarios. It consists of four layers:
  • Resource layer (Alibaba Cloud IaaS resources): provides Alibaba Cloud IaaS resources that can be enabled on demand to meet the elastic computing, storage, and network requirements of large-scale GPU clusters.
  • Scheduling layer (AI acceleration resource management): uses FastGPU to build AI computing tasks and manage the resources of large-scale GPU clusters. For more information, see "What is FastGPU?"
  • Framework layer (AI acceleration engine): uses AIACC to provide unified acceleration for multiple frameworks. AIACC optimizes performance at the data communication level. Distributed training requires data to be exchanged between machines and between GPUs, and this communication must be efficient to achieve the acceleration effect. For more information, see AIACC-Training and AIACC-Inference.
  • Application layer (AI acceleration reference solution): implements deep learning in application scenarios such as image recognition, object detection, video recognition, click-through rate (CTR) prediction, natural language understanding, and speech recognition. Because AIACC provides unified acceleration for multiple frameworks at the framework layer, you only need to make minimal modifications to your code to improve application performance.
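
The data exchange "between machines and between GPUs" described for the framework layer can be illustrated with a two-stage reduction. The following is a minimal pure-Python sketch, not the AIACC implementation: the function names and the nested-list cluster layout are assumptions made for illustration only.

```python
from typing import List

Vector = List[float]


def elementwise_sum(vectors: List[Vector]) -> Vector:
    """Sum equally sized vectors element by element."""
    return [sum(values) for values in zip(*vectors)]


def hierarchical_allreduce_mean(cluster: List[List[Vector]]) -> List[List[Vector]]:
    """Average gradients over every GPU in the cluster.

    cluster[m][g] is the gradient held by GPU g on machine m (hypothetical layout).
    """
    # Stage 1: reduce between the GPUs of each machine (fast local links).
    per_machine = [elementwise_sum(gpus) for gpus in cluster]
    # Stage 2: reduce between machines; only one vector per machine crosses the network.
    total = elementwise_sum(per_machine)
    world_size = sum(len(gpus) for gpus in cluster)
    mean = [value / world_size for value in total]
    # Stage 3: give every GPU its own copy of the averaged gradient.
    return [[list(mean) for _ in gpus] for gpus in cluster]


cluster = [
    [[1.0, 2.0], [3.0, 4.0]],  # machine 0 with two GPUs
    [[5.0, 6.0], [7.0, 8.0]],  # machine 1 with two GPUs
]
result = hierarchical_allreduce_mean(cluster)
print(result[0][0])  # [4.0, 5.0]
```

Reducing inside each machine first means that the slower inter-machine network carries one vector per machine instead of one per GPU, which is why communication is a natural place to optimize.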

Benefits

AIACC provides the following benefits:
  • AIACC is based on Alibaba Cloud IaaS resources that are stable and easy to use.
  • AIACC works with FastGPU to build training tasks. This reduces the time required to create and configure resources and improves GPU resource utilization, which lowers costs.
  • AIACC supports unified acceleration for multiple frameworks. This keeps the adaptation workload small and improves training and inference performance. AI algorithms can be verified in shorter cycles and models can be iterated faster, which makes research and development more efficient.

AIACC-Training

AIACC-Training (formerly known as Ali-Perseus or Perseus-Training) is developed and maintained by the Alibaba Cloud AIACC team. It is built on Alibaba Cloud IaaS resources to efficiently accelerate distributed AI training. AIACC-Training is designed to be compatible with open source systems and to accelerate your distributed training tasks without manual intervention.

  • AIACC-Training accelerates distributed training tasks for models that are built on mainstream AI computing frameworks such as TensorFlow, PyTorch, MXNet, and Caffe.
  • AIACC-Training is compatible with the APIs of PyTorch Distributed Data Parallel (DDP) and Horovod, so it accelerates native distributed training without manual intervention.
  • In terms of underlying acceleration, AIACC-Training is optimized for the characteristics of the Alibaba Cloud network infrastructure and of data-parallel distributed training to achieve significant performance improvements.
The following list describes some of the acceleration features of AIACC-Training:
  • Gradient fusion communication: uses adaptive multi-stream fusion and adaptive gradient fusion to improve the training performance of bandwidth-intensive network models by 50% to 300%.
  • Highly optimized online and offline gradient negotiation: reduces the overhead of gradient negotiation on large-scale nodes by up to two orders of magnitude.
  • Hierarchical Allreduce algorithm: supports FP16 gradient compression and mixed precision compression.
  • Gradient compression based on gossip.
  • Gradient communication optimization based on multistep.
  • Deep optimization for remote direct memory access (RDMA) and elastic RDMA (eRDMA) networks.
  • API extensions for MXNet: support data parallelism and model parallelism of the InsightFace type and enhance the performance of Synchronized Batch Normalization (SyncBN) operators.
  • Group communication operators provided by GroupComm: allow you to build complex training tasks that implement the communication in both data parallelism and model parallelism.
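
The idea behind gradient fusion in the list above can be sketched as packing many small gradient tensors into fewer, larger communication buffers so that fewer collective calls are issued. The following is a simplified illustration, not the AIACC algorithm: it uses a fixed size threshold instead of the adaptive policy that AIACC-Training describes, and the function name, tensor names, and sizes are hypothetical.

```python
from typing import Dict, List


def fuse_gradients(sizes: Dict[str, int], bucket_bytes: int) -> List[List[str]]:
    """Greedily pack gradient tensors, in order, into buckets of at most bucket_bytes.

    A tensor that is larger than bucket_bytes gets a bucket of its own.
    """
    buckets: List[List[str]] = []
    current: List[str] = []
    used = 0
    for name, size in sizes.items():
        # Close the current bucket if this tensor would overflow it.
        if current and used + size > bucket_bytes:
            buckets.append(current)
            current, used = [], 0
        current.append(name)
        used += size
    if current:
        buckets.append(current)
    return buckets


# Hypothetical gradient sizes in bytes, listed in the order they become ready.
grad_sizes = {"conv1": 4_000, "conv2": 6_000, "fc1": 20_000, "fc2": 2_000, "bias": 500}
buckets = fuse_gradients(grad_sizes, bucket_bytes=10_000)
print(buckets)  # [['conv1', 'conv2'], ['fc1'], ['fc2', 'bias']]
print(f"{len(grad_sizes)} communication calls reduced to {len(buckets)}")
```

Each bucket would then be sent with a single collective operation. An adaptive scheme would tune the bucket size and the number of communication streams at run time rather than fixing them in advance.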

AIACC-Training provides benefits in training speed and cost. For more information about test data, visit Stanford DAWNBench.

The following table lists some typical optimization cases of distributed training.

| Customer | Model | Framework | Number of GPUs | Training speed increase |
|---|---|---|---|---|
| An AI chip manufacturer | Image classification | MXNet | 256 | 100% |
| An AI chip manufacturer | Facial recognition | MXNet | 256 | 200% |
| A car manufacturer | FaceNet | PyTorch | 32 | 100% |
| A mobile phone manufacturer | BERT | TensorFlow | 32 | 30% |
| A mobile phone manufacturer | GPT2 | PyTorch | 32 | 30% |
| An AI company | Faster-RCNN | MXNet, Horovod, and BytePS | 128 | 30% |
| An AI company | InsightFace | MXNet, Horovod, and BytePS | 128 | 200% |
| An online education platform | ESPnet | PyTorch-DP | 16 | 30% |
| An online education platform | ESPnet2 | PyTorch-DDP | 16 | 30% |
| An online education platform | CTR | PyTorch | 32 | 80% |
| An online education platform | OCR | PyTorch | 32 | 30% |
| A mobile phone manufacturer | Image classification | PyTorch | 128 | 25% |
| A mobile phone manufacturer | MAE | PyTorch | 32 | 30% |
| A research institute | GPT2 | PyTorch + Megatron | 32 | 30% |
| A social media platform | MMDetection2 | PyTorch | 32 | 30% |
| A financial intelligence company | InsightFace | PyTorch | 32 | 50% |
| A mobile phone manufacturer | Detection2 | PyTorch | 64 | 25% |
| A visual team | InsightFace | MXNet | 64 | 50% |
| A game vendor | ResNet | PyTorch | 32 | 30% |
| A city brain project | InsightFace | MXNet | 16 | 42% |
| A pharmaceutical technology company | Autoencoder | PyTorch | 32 | 30% |
| An autonomous driving company | Swin Transformer | PyTorch | 32 | 70% |
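
To relate the percentages in the table to wall-clock time, the short sketch below converts a speed increase into a throughput factor and a reduction in training time. It assumes that "training speed increase" means an increase in training throughput over the unaccelerated baseline, which the table does not state explicitly. The function names are illustrative.

```python
def speedup_factor(increase_percent: float) -> float:
    """Throughput relative to the baseline, for example 100% -> 2.0x."""
    return 1.0 + increase_percent / 100.0


def time_saved_percent(increase_percent: float) -> float:
    """Share of the baseline training time that is saved for the same workload."""
    return (1.0 - 1.0 / speedup_factor(increase_percent)) * 100.0


for pct in (30, 100, 200):
    print(f"+{pct}% speed -> {speedup_factor(pct):.2f}x throughput, "
          f"{time_saved_percent(pct):.1f}% less training time")
# +30% speed -> 1.30x throughput, 23.1% less training time
# +100% speed -> 2.00x throughput, 50.0% less training time
# +200% speed -> 3.00x throughput, 66.7% less training time
```

Under this reading, a 100% speed increase halves the training time, and a 200% increase cuts it to one third.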

AIACC-Inference

AIACC-Inference optimizes models that are built on TensorFlow, or on frameworks that can export models in the Open Neural Network Exchange (ONNX) format, to improve inference performance.

AIACC-Inference provides benefits in inference speed and cost. For more information about test data, visit Stanford DAWNBench.

AIACC-Inference provides a model conversion tool to convert existing models to TensorFlow or ONNX models. AIACC-Inference also provides TensorFlow and ONNX acceleration engines.

The following list describes some of the acceleration features of AIACC-Inference:
  • The TensorFlow and ONNX acceleration engines split and fuse model subgraphs. The split subgraphs are passed to the high-performance operator acceleration library for acceleration.
  • The high-performance operator acceleration library selects the optimal operator from among the self-developed high-performance operators and NVIDIA operators. The library then generates a list of high-performance operators that the acceleration engine uses to split and pass subgraphs.
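
The two steps above, operator selection followed by subgraph splitting, can be sketched as follows. This is a simplified illustration and not the AIACC-Inference engine: the model is treated as a linear sequence of operators rather than a graph, and the function names, operator names, implementation labels, and latency values are all hypothetical.

```python
from typing import Dict, List, Set, Tuple


def pick_fastest(latencies: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """For each operator type, choose the implementation with the lowest latency."""
    return {op: min(impls, key=impls.get) for op, impls in latencies.items()}


def split_subgraphs(ops: List[str], accelerated: Set[str]) -> List[Tuple[bool, List[str]]]:
    """Group consecutive operators into (is_accelerated, operators) segments."""
    segments: List[Tuple[bool, List[str]]] = []
    for op in ops:
        flag = op in accelerated
        if segments and segments[-1][0] == flag:
            # Same kind as the previous operator: fuse into the current segment.
            segments[-1][1].append(op)
        else:
            segments.append((flag, [op]))
    return segments


# Hypothetical measured latencies in milliseconds per implementation.
latencies = {
    "Conv2D": {"custom": 0.8, "vendor": 1.1},
    "MatMul": {"custom": 0.5, "vendor": 0.4},
    "Relu": {"custom": 0.1, "vendor": 0.2},
}
best = pick_fastest(latencies)
model_ops = ["Conv2D", "Relu", "StringSplit", "MatMul", "Relu"]
segments = split_subgraphs(model_ops, set(best))
print(best)      # {'Conv2D': 'custom', 'MatMul': 'vendor', 'Relu': 'custom'}
print(segments)  # [(True, ['Conv2D', 'Relu']), (False, ['StringSplit']), (True, ['MatMul', 'Relu'])]
```

In this sketch, segments flagged True would be handed to the operator library, and the remaining segments would run in the original framework.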