Apsara AI Accelerator (AIACC) is an AI acceleration engine built on Alibaba Cloud IaaS resources. AIACC optimizes models that are built on mainstream AI computing frameworks to improve training and inference performance, and works with the resource management tool FastGPU to build AI computing tasks and make research and development more efficient.
Use AIACC to accelerate an application in deep learning scenarios
The infrastructure for using AIACC in deep learning scenarios consists of the following layers.
Resource layer (Alibaba Cloud IaaS resources): provides Alibaba Cloud IaaS resources, which can be enabled on demand to meet the elastic computing, storage, and network requirements of large-scale GPU clusters.
Scheduling layer (AI acceleration resource management): uses FastGPU to build AI computing tasks and manage the resources of large-scale GPU clusters. For more information, see What is FastGPU?.
Framework layer (AI acceleration engine): uses AIACC to accelerate multiple frameworks at the same time. AIACC applies performance optimizations that center on data communication: during distributed training, data must be exchanged between machines and between GPUs, and optimizing this exchange is what produces the acceleration. For more information, see AIACC-Training and AIACC-Inference.
Application layer (AI acceleration reference solution): implements deep learning in application scenarios such as image recognition, object detection, video recognition, click-through rate (CTR) prediction, natural language understanding, and speech recognition. Because AIACC provides unified acceleration for multiple frameworks at the framework layer, you only need to make minimal code modifications to improve application performance.
Benefits
AIACC provides the following benefits:
AIACC is built on Alibaba Cloud IaaS resources, which are stable and easy to use.
AIACC works with FastGPU to build training tasks, which reduces the time required to create and configure resources, improves GPU utilization, and lowers costs.
AIACC provides unified acceleration for multiple frameworks, which allows the frameworks to work together smoothly and improves training and inference performance. Shorter verification cycles for AI algorithms enable faster model iteration, which makes research and development more efficient.
AIACC-Training
AIACC-Training (formerly known as Ali-Perseus or Perseus-Training) is developed and maintained by the Alibaba Cloud AIACC team based on Alibaba Cloud IaaS resources to achieve efficient acceleration for AI distributed training. AIACC-Training is designed to be compatible with open source systems and to accelerate your distributed training tasks without your manual intervention.
AIACC-Training allows you to accelerate distributed training tasks by using models that are built based on mainstream AI computing frameworks such as TensorFlow, PyTorch, MXNet, and Caffe.
AIACC-Training is compatible with the APIs of PyTorch Distributed Data Parallel (DDP) and Horovod, and improves the performance of native distributed training without manual intervention.
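For illustration, the following is a minimal PyTorch DDP training script of the kind that AIACC-Training is designed to accelerate transparently. The model, data, and hyperparameters are placeholders, and the script is assumed to be launched with torchrun (for example, `torchrun --nproc_per_node=8 train.py`).

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # Standard DDP initialization for a torchrun launch.
    dist.init_process_group(backend="nccl")
    local_rank = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)

    # Placeholder model; any torch.nn.Module works the same way.
    model = torch.nn.Linear(1024, 1024).cuda()
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for step in range(100):
        # Random placeholder data standing in for a real DataLoader.
        inputs = torch.randn(32, 1024, device="cuda")
        targets = torch.randn(32, 1024, device="cuda")
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(inputs), targets)
        # Gradient allreduce happens during backward(); this is the
        # communication path that AIACC-Training optimizes.
        loss.backward()
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```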
At the underlying layer, AIACC-Training is optimized for the characteristics of the Alibaba Cloud network infrastructure and for the communication patterns of data-parallel distributed training, which yields significant performance improvements.
The following content lists some of the acceleration features of AIACC-Training:
Gradient fusion communication: uses adaptive multi-stream fusion and adaptive gradient fusion to improve the training performance of bandwidth-intensive network models by 50% to 300%. A simplified sketch of the fusion idea follows this list.
Highly optimized online and offline gradient negotiation: reduces the negotiation overhead on large-scale nodes by up to two orders of magnitude.
Hierarchical Allreduce algorithm: supports FP16 gradient compression and mixed precision compression.
Gossip-based gradient compression.
Multi-step gradient communication optimization.
Deep optimization for remote direct memory access (RDMA) and elastic RDMA (eRDMA) networks.
API extensions for MXNet: support data parallelism and model parallelism of the InsightFace type and enhance the performance of Synchronized Batch Normalization (SyncBN) operators.
Group communication operators provided by GroupComm: allow you to build complex training tasks that require communication for both data parallelism and model parallelism.
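To make the gradient fusion idea above concrete, here is a simplified, hypothetical sketch, not AIACC-Training's actual implementation: many small gradient tensors are flattened into one buffer so that a single allreduce replaces many latency-bound calls. AIACC-Training additionally adapts the fusion granularity and uses multiple communication streams.

```python
import torch
import torch.distributed as dist

def fused_allreduce(grads):
    """Average a list of gradient tensors across ranks with one fused
    allreduce instead of one collective call per tensor (a simplified
    illustration of gradient fusion)."""
    # Flatten all gradients into a single contiguous buffer.
    flat = torch.cat([g.reshape(-1) for g in grads])
    # One collective call amortizes the per-message latency that
    # dominates when a model has many small gradient tensors.
    dist.all_reduce(flat, op=dist.ReduceOp.SUM)
    flat /= dist.get_world_size()
    # Scatter the averaged values back into the original tensors.
    offset = 0
    for g in grads:
        n = g.numel()
        g.copy_(flat[offset:offset + n].view_as(g))
        offset += n
```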
AIACC-Training provides benefits in training speed and cost. For more information about test data, visit Stanford DAWNBench.
The following table lists some typical optimization cases of distributed training.
| Customer | Model | Framework | Number of GPUs | Training speed increase |
| --- | --- | --- | --- | --- |
| An AI chip manufacturer | Image classification | MXNet | 256 | 100% |
| An AI chip manufacturer | Facial recognition | MXNet | 256 | 200% |
| A car manufacturer | FaceNet | PyTorch | 32 | 100% |
| A mobile phone manufacturer | BERT | TensorFlow | 32 | 30% |
| A mobile phone manufacturer | GPT-2 | PyTorch | 32 | 30% |
| An AI company | Faster R-CNN | MXNet, Horovod, and BytePS | 128 | 30% |
| An AI company | InsightFace | MXNet, Horovod, and BytePS | 128 | 200% |
| An online education platform | ESPnet | PyTorch-DP | 16 | 30% |
| An online education platform | ESPnet2 | PyTorch-DDP | 16 | 30% |
| An online education platform | CTR | PyTorch | 32 | 80% |
| An online education platform | OCR | PyTorch | 32 | 30% |
| A mobile phone manufacturer | Image classification | PyTorch | 128 | 25% |
| A mobile phone manufacturer | MAE | PyTorch | 32 | 30% |
| A research institute | GPT-2 | PyTorch+Megatron | 32 | 30% |
| A social media platform | MMDetection2 | PyTorch | 32 | 30% |
| A financial intelligence company | InsightFace | PyTorch | 32 | 50% |
| A mobile phone manufacturer | Detection2 | PyTorch | 64 | 25% |
| A computer vision team | InsightFace | MXNet | 64 | 50% |
| A game vendor | ResNet | PyTorch | 32 | 30% |
| A city brain project | InsightFace | MXNet | 16 | 42% |
| A pharmaceutical technology company | Autoencoder | PyTorch | 32 | 30% |
| An autonomous driving company | Swin Transformer | PyTorch | 32 | 70% |
AIACC-Inference
AIACC-Inference optimizes models that are built on TensorFlow, as well as models that can be exported in the Open Neural Network Exchange (ONNX) format, to improve inference performance.
AIACC-Inference provides benefits in inference speed and cost. For more information about test data, visit Stanford DAWNBench.
AIACC-Inference provides a model conversion tool that converts existing models into TensorFlow or ONNX models, and provides TensorFlow and ONNX acceleration engines to accelerate them.
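As an example of producing such a model, a PyTorch model can be exported to ONNX with the standard torch.onnx.export API; the resulting file is the kind of ONNX model that an ONNX acceleration engine consumes. The ResNet-50 model below is only a placeholder.

```python
import torch
import torchvision

# Placeholder model; any traceable torch.nn.Module can be exported.
model = torchvision.models.resnet50(weights=None).eval()
dummy_input = torch.randn(1, 3, 224, 224)

# Standard PyTorch-to-ONNX export. The resulting model.onnx file is
# the kind of artifact an ONNX acceleration engine can then optimize.
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
    opset_version=13,
)
```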
The following content lists some of the acceleration features of AIACC-Inference:
The TensorFlow and ONNX acceleration engines split the model into subgraphs and fuse operators within them. Each split subgraph is passed to the high-performance operator acceleration library for acceleration.
The high-performance operator acceleration library selects, for each operator, the optimal implementation from self-developed high-performance operators and NVIDIA operators. The library then generates a list of high-performance operators that guides how the acceleration engine splits and passes subgraphs.
For more information about how to install and use AIACC-Inference, see the following topics: