Apsara AI Accelerator (AIACC) is an AI acceleration engine developed based on Alibaba Cloud IaaS resources. AIACC is used to optimize models based on mainstream AI computing frameworks, accelerate deep learning applications, and improve model training and inference performance. You can use AIACC together with the resource management tool FastGPU to create AI computing tasks to improve research and development efficiency.
Use AIACC to accelerate an application in deep learning scenarios
AIACC consists of a training accelerator, AIACC-Training, and an inference accelerator, AIACC-Inference. The following table describes the infrastructure that is used when you apply AIACC in deep learning scenarios.
| Layer | Description |
| --- | --- |
| Resource layer (Alibaba Cloud IaaS resources) | Uses Alibaba Cloud IaaS resources at the resource layer. The resources can be provisioned on demand to meet the requirements of large-scale GPU clusters for elastic computing, storage, and network resources. |
| Scheduling layer (AI acceleration resource management) | Uses FastGPU to create AI computing tasks and manage the resources of large-scale GPU clusters at the scheduling layer. For more information, see What is FastGPU? |
| Framework layer (AI acceleration engine) | Uses AIACC to accelerate multiple frameworks in a centralized manner at the framework layer. AIACC applies performance optimization techniques that are based on data communication: during distributed training, AIACC exchanges data between machines and between GPUs to deliver the acceleration effect. For more information, see AIACC-Training and AIACC-Inference. |
| Application layer (AI acceleration solution) | Implements deep learning in application scenarios such as image recognition, object detection, video recognition, click-through rate (CTR) prediction, natural language understanding, and speech recognition. Because AIACC accelerates multiple frameworks in a centralized manner at the framework layer, you need to make only minimal modifications to your code to improve application performance. |
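The data exchange that the framework layer performs, between GPUs on one machine and then between machines, can be sketched as a two-level gradient reduction. The following is a minimal pure-Python illustration of the idea, not the actual AIACC implementation, which uses high-performance collectives over NCCL and RDMA:

```python
# Minimal sketch of two-level gradient averaging: GPUs on each node
# reduce locally first over fast intra-node links, then node-level
# partial sums are exchanged over the network between machines.
# Pure-Python illustration only; not AIACC's real communication path.

def hierarchical_average(gradients_per_node):
    """gradients_per_node: list of nodes, each a list of per-GPU gradient vectors."""
    total_gpus = sum(len(node) for node in gradients_per_node)
    # Step 1: intra-node reduce (communication between GPUs in one machine).
    node_sums = [[sum(vals) for vals in zip(*node)] for node in gradients_per_node]
    # Step 2: inter-node reduce (communication between machines).
    global_sum = [sum(vals) for vals in zip(*node_sums)]
    # Step 3: every GPU receives the averaged gradient.
    return [g / total_gpus for g in global_sum]

# Two nodes with two GPUs each, gradient vectors of length 3.
grads = [
    [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]],   # node 0
    [[2.0, 2.0, 2.0], [2.0, 2.0, 2.0]],   # node 1
]
print(hierarchical_average(grads))  # -> [2.0, 2.0, 2.0]
```

Reducing inside each machine first keeps most of the traffic on fast local links, so only one partial sum per machine crosses the network.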
Benefits
AIACC provides the following benefits:
AIACC is based on Alibaba Cloud IaaS resources that are stable and easy to use.
You can use AIACC together with FastGPU to create training tasks. This reduces the time to create and configure resources, and improves GPU resource utilization to reduce costs.
AIACC supports centralized acceleration for multiple frameworks. This allows smooth collaboration between the frameworks and improves training and inference performance.
AI algorithms are developed in shorter verification cycles, which ensures faster model iteration. This makes research and development more efficient.
AIACC-Training
AIACC-Training (formerly known as Ali-Perseus or Perseus-Training) is developed and maintained by the Alibaba Cloud AIACC team based on Alibaba Cloud IaaS resources to achieve efficient acceleration for AI distributed training. AIACC-Training is designed to be compatible with open source systems and to accelerate your distributed training tasks without manual intervention.
The following table describes the architecture of AIACC-Training.
| Layer | Description |
| --- | --- |
| Mainstream AI computing framework | AIACC-Training accelerates distributed training tasks for models that are built on mainstream AI computing frameworks such as TensorFlow, PyTorch, MXNet, and Caffe. |
| Interface layer | The interface layer provides unified interfaces and components that are used to interact and communicate with the AIACC-Training system, including unified communication interface classes, unified basic component classes, unified basic communication classes, and a unified gradient entry layer. AIACC-Training is compatible with the APIs of PyTorch DistributedDataParallel (DDP) and Horovod and can accelerate native distributed training without manual intervention. |
| Underlying acceleration layer | The underlying acceleration layer uses high-performance distributed communication libraries to optimize model performance in a unified manner, combining gradient negotiation optimization, gradient fusion optimization, gradient compression optimization, and communication operation optimization. AIACC-Training is tuned for the characteristics of the Alibaba Cloud network infrastructure and for the data-parallel distributed training strategy to achieve significant performance improvements. |
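One of the optimizations named above, gradient negotiation, can be sketched in a few lines: before a fused collective operation can start, all workers must agree on which gradient tensors have finished their backward pass on every worker. The function and tensor names below are illustrative, not AIACC's actual API:

```python
# Sketch of gradient negotiation in data-parallel training: workers
# finish backward passes for different layers at different times, so
# they must agree on the common set of ready tensors, in a fixed
# global order, before launching a fused all-reduce.
# Illustrative pure-Python code; not AIACC's real negotiation protocol.

def negotiate_ready(ready_per_worker, canonical_order):
    """Return the tensors that are ready on every worker, in global order."""
    common = set(canonical_order)
    for ready in ready_per_worker:
        common &= set(ready)
    return [name for name in canonical_order if name in common]

order = ["layer1.w", "layer1.b", "layer2.w", "layer2.b"]
workers = [
    ["layer2.b", "layer1.w", "layer2.w"],  # worker 0 has finished these
    ["layer1.w", "layer2.b"],              # worker 1 is slightly behind
]
print(negotiate_ready(workers, order))  # -> ['layer1.w', 'layer2.b']
```

Only the agreed-on tensors are communicated in the current round; reducing the cost of this agreement step on large clusters is what the negotiation optimization below targets.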
AIACC-Training provides the following acceleration features:
Gradient fusion communication allows you to use adaptive multi-stream fusion and adaptive gradient fusion to improve the training performance of bandwidth-intensive network models by 50% to 300%.
Highly optimized online and offline gradient-based negotiation reduces the overhead of gradient-based negotiation on large-scale nodes by up to two orders of magnitude.
Hierarchical Allreduce algorithm supports FP16 gradient compression and mixed precision compression.
Gradient compression based on gossip is supported.
Gradient communication optimization based on multistep is supported.
Deep optimization is implemented for remote direct memory access (RDMA) and elastic RDMA (eRDMA) networks.
API extensions for MXNet support data parallelism and model parallelism of the InsightFace type and enhance the performance of Synchronized Batch Normalization (SyncBN) operators.
Group communication operators provided by GroupComm allow you to build complex training tasks that implement the communication in both data parallelism and model parallelism.
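The first feature in the list, gradient fusion, packs many small gradient tensors into larger buckets so that each bucket needs only one collective call instead of one call per tensor. The following is a minimal greedy-bucketing sketch under an illustrative bucket-size knob; adaptive fusion in AIACC tunes such parameters automatically:

```python
# Sketch of gradient fusion: greedily pack gradient tensors (by size in
# bytes) into fixed-capacity fusion buckets. Each bucket is communicated
# with a single collective call, cutting per-call latency overhead.
# The bucket size and greedy policy are illustrative assumptions.

def fuse_into_buckets(tensor_sizes, bucket_bytes):
    """Return lists of tensor indices, one list per fusion bucket."""
    buckets, current, current_bytes = [], [], 0
    for idx, size in enumerate(tensor_sizes):
        if current and current_bytes + size > bucket_bytes:
            buckets.append(current)      # bucket is full: flush it
            current, current_bytes = [], 0
        current.append(idx)
        current_bytes += size
    if current:
        buckets.append(current)
    return buckets

sizes = [100, 200, 50, 400, 300, 150]    # six gradient tensors (bytes)
buckets = fuse_into_buckets(sizes, 500)
print(buckets)                            # -> [[0, 1, 2], [3], [4, 5]]
print(len(sizes), "collective calls reduced to", len(buckets))
```

Fewer, larger transfers keep the network busy with payload rather than per-message overhead, which is why this helps bandwidth-intensive models most.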
AIACC-Training provides significant benefits in training speed and cost. For more information about test data, visit Stanford DAWNBench.
The following table provides typical optimization cases of distributed training.
| Customer | Model | Framework | Number of GPUs | Training speed increase |
| --- | --- | --- | --- | --- |
| An AI chip manufacturer | Image classification | MXNet | 256 | 100% |
| An AI chip manufacturer | Face Service | MXNet | 256 | 200% |
| A car manufacturer | FaceNet | PyTorch | 32 | 100% |
| A mobile phone manufacturer | BERT | TensorFlow | 32 | 30% |
| A mobile phone manufacturer | GPT2 | PyTorch | 32 | 30% |
| An AI company | Faster-RCNN | MXNet, Horovod, and BytePS | 128 | 30% |
| An AI company | InsightFace | MXNet, Horovod, and BytePS | 128 | 200% |
| An online education platform | ESPnet | PyTorch-DP | 16 | 30% |
| An online education platform | ESPnet2 | PyTorch-DDP | 16 | 30% |
| An online education platform | CTR | PyTorch | 32 | 80% |
| An online education platform | OCR | PyTorch | 32 | 30% |
| A mobile phone manufacturer | Image classification | PyTorch | 128 | 25% |
| A mobile phone manufacturer | MAE | PyTorch | 32 | 30% |
| A research institute | GPT2 | PyTorch+Megatron | 32 | 30% |
| A social media platform | MMDetection2 | PyTorch | 32 | 30% |
| A financial intelligence company | InsightFace | PyTorch | 32 | 50% |
| A mobile phone manufacturer | Detection2 | PyTorch | 64 | 25% |
| A visual team | InsightFace | MXNet | 64 | 50% |
| A game vendor | ResNet | PyTorch | 32 | 30% |
| A city brain project | InsightFace | MXNet | 16 | 42% |
| A pharmaceutical technology company | Autoencoder | PyTorch | 32 | 30% |
| An autonomous driving company | swin-transformer | PyTorch | 32 | 70% |
For more information about how to install and use AIACC-Training, see the following topics:
AIACC-Inference
AIACC-Inference provides significant benefits in inference speed and cost. For more information about test data, visit Stanford DAWNBench.
The high-performance operator acceleration library selects, for each operator, the optimal implementation from among the self-developed high-performance operators and the NVIDIA operators. The library then generates a list of high-performance operators that the acceleration engine uses to split and dispatch subgraphs.
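The per-operator selection described above can be sketched as an argmin over measured candidate costs. The operator names, implementation labels, and costs below are illustrative assumptions, not AIACC-Inference's actual data:

```python
# Sketch of per-operator implementation selection: for each operator,
# pick the cheapest candidate among "self-developed" and vendor kernels
# according to a measured cost. The numbers here are made up for
# illustration; a real engine would benchmark kernels on the target GPU.

def select_operators(candidates):
    """candidates: {op_name: {impl_name: cost_ms}} -> {op_name: impl_name}"""
    return {op: min(impls, key=impls.get) for op, impls in candidates.items()}

measured = {
    "conv2d":  {"self_developed": 0.8, "nvidia": 1.1},
    "softmax": {"self_developed": 0.4, "nvidia": 0.3},
}
print(select_operators(measured))
# -> {'conv2d': 'self_developed', 'softmax': 'nvidia'}
```

The resulting operator list is what the acceleration engine consults when it splits the model graph into subgraphs and dispatches each one to its fastest implementation.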
For more information about how to install and use AIACC-Inference, see the following topics: