
Elastic GPU Service: What is AIACC?

Last Updated: Dec 05, 2023

Apsara AI Accelerator (AIACC) is an AI acceleration engine built on Alibaba Cloud IaaS resources. AIACC optimizes models that are built on mainstream AI computing frameworks to improve training and inference performance. AIACC also works with the resource management tool FastGPU to build AI computing tasks and make research and development more efficient.

Use AIACC to accelerate an application in deep learning scenarios

The following architecture describes how AIACC is used in deep learning scenarios, from the resource layer up to the application layer:

  • Resource layer (Alibaba Cloud IaaS resources): provides Alibaba Cloud IaaS resources that can be enabled on demand to meet the elastic computing, storage, and network requirements of large-scale GPU clusters.

  • Scheduling layer (AI acceleration resource management): uses FastGPU to build AI computing tasks and manage the resources of large-scale GPU clusters. For more information, see What is FastGPU?.

  • Framework layer (AI acceleration engine): uses AIACC to accelerate multiple frameworks at the same time. AIACC applies performance optimizations that are based on data communication: during distributed training, data must be exchanged between machines and between GPUs, and AIACC optimizes this exchange to deliver the acceleration effect. For more information, see AIACC-Training and AIACC-Inference.

  • Application layer (AI acceleration reference solution): implements deep learning in various application scenarios such as image recognition, object detection, video recognition, click-through rate (CTR) prediction, natural language understanding, and speech recognition. Because AIACC provides unified acceleration for multiple frameworks at the framework layer, you only need to make minimal modifications to the code to improve application performance.

Benefits

AIACC provides the following benefits:

  • AIACC is based on Alibaba Cloud IaaS resources that are stable and easy to use.

  • AIACC works with FastGPU to build training tasks, which reduces the time required to create and configure resources and improves GPU utilization to reduce costs.

  • AIACC provides unified acceleration for multiple frameworks, which allows the frameworks to work together smoothly and improves training and inference performance. Shorter verification cycles for AI algorithms enable faster model iteration, which makes research and development more efficient.

AIACC-Training

AIACC-Training (formerly known as Ali-Perseus or Perseus-Training) is developed and maintained by the Alibaba Cloud AIACC team based on Alibaba Cloud IaaS resources to achieve efficient acceleration for AI distributed training. AIACC-Training is designed to be compatible with open source systems and to accelerate your distributed training tasks without your manual intervention.

  • AIACC-Training allows you to accelerate distributed training tasks by using models that are built based on mainstream AI computing frameworks such as TensorFlow, PyTorch, MXNet, and Caffe.

  • AIACC-Training is compatible with the APIs of PyTorch Distributed Data Parallel (DDP) and Horovod, and accelerates the performance of native distributed training without your manual intervention (see the sketch after this list).

  • At the underlying layer, AIACC-Training optimizes for the characteristics of the Alibaba Cloud network infrastructure and for the communication policy of data-parallel distributed training to achieve significant performance improvements.
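
The following minimal sketch shows the kind of training script this compatibility targets: a standard PyTorch DDP loop with no AIACC-specific code. The model, sizes, and hyperparameters are illustrative assumptions, not part of AIACC's documentation.

```python
# Minimal, standard PyTorch DDP training loop (illustrative).
# No AIACC-specific code appears here because AIACC-Training is
# API-compatible with native DDP.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # The launcher (for example, torchrun) sets rank and world size.
    dist.init_process_group(backend="nccl")
    local_rank = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)

    # A toy model; real workloads would use ResNet, BERT, and so on.
    model = torch.nn.Linear(1024, 1024).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for step in range(100):
        x = torch.randn(32, 1024, device=local_rank)
        loss = model(x).sum()
        optimizer.zero_grad()
        # Gradient allreduce runs during backward(); this communication
        # path is what a drop-in accelerator optimizes.
        loss.backward()
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

A script like this is typically started with a standard launcher, for example `torchrun --nproc_per_node=<GPUs per machine> train.py`.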

The following content lists some of the acceleration features of AIACC-Training:

  • Gradient fusion communication: allows you to use adaptive multi-stream fusion and adaptive gradient fusion to improve the training performance of bandwidth-intensive network models by 50% to 300% (a conceptual sketch follows this list).

  • Highly optimized online and offline gradient-based negotiation: reduces the overhead of gradient-based negotiation on large-scale nodes by up to two orders of magnitude.

  • Hierarchical Allreduce algorithm: supports FP16 gradient compression and mixed precision compression.

  • Gradient compression based on gossip.

  • Gradient communication optimization based on multistep.

  • Deep optimization for remote direct memory access (RDMA) and elastic RDMA (eRDMA) networks.

  • API extensions for MXNet: support data parallelism and model parallelism of the InsightFace type and enhance the performance of Synchronized Batch Normalization (SyncBN) operators.

  • Group communication operators provided by GroupComm: allow you to build complex training tasks that implement the communication in both data parallelism and model parallelism.
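
As a conceptual illustration of the gradient fusion idea referenced above, the following sketch shows why fusing many small gradients into a single flat buffer before one collective call reduces per-call latency overhead. This is a generic sketch of the technique, not AIACC's implementation; the fused_allreduce helper and the identity stand-in for the collective are hypothetical.

```python
# Conceptual sketch of gradient fusion: many small allreduce calls each
# pay a fixed latency cost, so fusing gradients into one flat buffer and
# issuing a single collective call amortizes that cost.
# Generic illustration only, not AIACC's implementation.
import torch

def fused_allreduce(grads, allreduce_fn):
    """Concatenate gradients, run one collective, then scatter back."""
    flat = torch.cat([g.reshape(-1) for g in grads])  # one fusion buffer
    flat = allreduce_fn(flat)                         # single collective call
    out, offset = [], 0
    for g in grads:
        n = g.numel()
        out.append(flat[offset:offset + n].reshape(g.shape))
        offset += n
    return out

# Example with an identity stand-in for the collective, on CPU tensors:
grads = [torch.randn(3, 3), torch.randn(10), torch.randn(2, 5)]
reduced = fused_allreduce(grads, lambda t: t)
assert all(r.shape == g.shape for r, g in zip(reduced, grads))
```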

AIACC-Training provides benefits in training speed and cost. For more information about test data, visit Stanford DAWNBench.

The following table lists some typical optimization cases of distributed training.

| Customer | Model | Framework | Number of GPUs | Training speed increase |
| --- | --- | --- | --- | --- |
| An AI chip manufacturer | Image classification | MXNet | 256 | 100% |
| An AI chip manufacturer | Facial recognition | MXNet | 256 | 200% |
| A car manufacturer | FaceNet | PyTorch | 32 | 100% |
| A mobile phone manufacturer | BERT | TensorFlow | 32 | 30% |
| A mobile phone manufacturer | GPT2 | PyTorch | 32 | 30% |
| An AI company | Faster-RCNN | MXNet, Horovod, and BytePS | 128 | 30% |
| An AI company | InsightFace | MXNet, Horovod, and BytePS | 128 | 200% |
| An online education platform | ESPnet | PyTorch-DP | 16 | 30% |
| An online education platform | ESPnet2 | PyTorch-DDP | 16 | 30% |
| An online education platform | CTR | PyTorch | 32 | 80% |
| An online education platform | OCR | PyTorch | 32 | 30% |
| A mobile phone manufacturer | Image classification | PyTorch | 128 | 25% |
| A mobile phone manufacturer | MAE | PyTorch | 32 | 30% |
| A research institute | GPT2 | PyTorch+Megatron | 32 | 30% |
| A social media platform | MMDetection2 | PyTorch | 32 | 30% |
| A financial intelligence company | InsightFace | PyTorch | 32 | 50% |
| A mobile phone manufacturer | Detection2 | PyTorch | 64 | 25% |
| A visual team | insightface | MXNet | 64 | 50% |
| A game vendor | ResNet | PyTorch | 32 | 30% |
| A city brain project | InsightFace | MXNet | 16 | 42% |
| A pharmaceutical technology company | Autoencoder | PyTorch | 32 | 30% |
| An autonomous driving company | swin-transformer | PyTorch | 32 | 70% |

AIACC-Inference

AIACC-Inference optimizes models that are built on TensorFlow or on frameworks that can export models in the Open Neural Network Exchange (ONNX) format, which improves inference performance.

AIACC-Inference provides benefits in inference speed and cost. For more information about test data, visit Stanford DAWNBench.

AIACC-Inference provides a model conversion tool that converts existing models to TensorFlow or ONNX models, and provides TensorFlow and ONNX acceleration engines that accelerate them.
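
For example, a PyTorch model can be exported to ONNX with PyTorch's standard torch.onnx.export API before it is handed to an ONNX-based engine. The model choice, file name, and input shape below are illustrative assumptions:

```python
# Export a PyTorch model to ONNX so that an ONNX-based inference engine
# can consume it. Model, file name, and input shape are illustrative.
import torch
import torchvision

# weights=None builds an untrained network (torchvision >= 0.13 API).
model = torchvision.models.resnet50(weights=None).eval()
dummy_input = torch.randn(1, 3, 224, 224)  # example NCHW input

torch.onnx.export(
    model,
    dummy_input,
    "resnet50.onnx",                        # file the engine would load
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch"}},   # allow variable batch size
    opset_version=13,
)
```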

The following content lists some of the acceleration features of AIACC-Inference:

  • The TensorFlow and ONNX acceleration engines split and fuse model subgraphs, and pass the split subgraphs to the high-performance operator acceleration library for acceleration.

  • The high-performance operator acceleration library selects the optimal operator from self-developed high-performance operators and NVIDIA operators, and generates a list of high-performance operators that the acceleration engines use to decide how to split and pass subgraphs. The sketch below illustrates the general idea.
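
The following sketch illustrates operator selection under simple assumptions: benchmark candidate implementations of the same operator for a given input shape and keep the fastest. The pick_fastest helper is hypothetical and is not AIACC's internal logic.

```python
# Simplified illustration of operator selection: time each candidate
# implementation for a given input and keep the fastest.
# Hypothetical sketch; not AIACC's internal logic. CPU tensors are used
# so that no device synchronization is needed for timing.
import time
import torch

def pick_fastest(candidates, x, warmup=3, iters=20):
    best_name, best_time = None, float("inf")
    for name, fn in candidates.items():
        for _ in range(warmup):          # warm up caches
            fn(x)
        start = time.perf_counter()
        for _ in range(iters):
            fn(x)
        elapsed = (time.perf_counter() - start) / iters
        if elapsed < best_time:
            best_name, best_time = name, elapsed
    return best_name

# Two candidate implementations of the same "operator":
candidates = {
    "matmul": lambda x: x @ x.T,
    "einsum": lambda x: torch.einsum("ij,kj->ik", x, x),
}
print(pick_fastest(candidates, torch.randn(256, 256)))
```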

For more information about how to install and use AIACC-Inference, see the following topics: