Apsara AI Accelerator (AIACC) is an AI acceleration engine built on Alibaba Cloud IaaS resources. AIACC optimizes models that are built on mainstream AI computing frameworks to improve training and inference performance. AIACC can work with the FastGPU resource management tool to build AI computing tasks and make research and development more efficient.

Use AIACC to accelerate an application in deep learning scenarios

  • Resource layer (Alibaba Cloud IaaS resources): uses Alibaba Cloud IaaS resources at the resource layer. These resources can be provisioned on demand to meet the elastic computing, storage, and network requirements of large-scale GPU clusters.
  • Scheduling layer (AI acceleration resource management): uses FastGPU to build AI computing tasks and manage the resources of large-scale GPU clusters at the scheduling layer. For more information, see What is FastGPU?
  • Framework layer (AI acceleration engine): uses AIACC to provide unified acceleration across multiple frameworks at the framework layer. AIACC applies performance optimization techniques centered on data communication: distributed training must exchange data between machines and between GPUs, and AIACC optimizes this communication to deliver the acceleration effect. For more information, see AIACC-Training and AIACC-Inference.
  • Application layer (AI acceleration reference solution): implements deep learning in application scenarios such as image recognition, object detection, video recognition, click-through rate (CTR) prediction, natural language understanding, and speech recognition. Because AIACC provides unified acceleration for multiple frameworks at the framework layer, you need only minimal code modifications to improve application performance, as illustrated by the sketch after this list.
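
The following minimal sketch illustrates the "small code change" pattern that the application layer relies on. It uses the open-source Horovod API for PyTorch as a generic example of framework-level data parallelism; AIACC-Training's own interface is described in its documentation, so treat the imports and function names here as illustrative rather than AIACC-specific.

    import torch
    import horovod.torch as hvd

    hvd.init()
    torch.cuda.set_device(hvd.local_rank())  # assumes a GPU per worker

    model = torch.nn.Linear(784, 10).cuda()
    # Scale the learning rate by the number of workers, a common convention.
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

    # The only distributed-specific changes: wrap the optimizer and sync state.
    optimizer = hvd.DistributedOptimizer(optimizer,
                                         named_parameters=model.named_parameters())
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)
    hvd.broadcast_optimizer_state(optimizer, root_rank=0)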

Benefits

AIACC provides the following benefits:
  • AIACC is based on Alibaba Cloud IaaS resources, which are stable and easy to use.
  • AIACC works with FastGPU to build training tasks, which reduces the time required to create and configure resources, improves GPU utilization, and lowers costs.
  • AIACC provides unified acceleration for multiple frameworks, which keeps the adaptation workload small while improving training and inference performance. AI algorithms can be verified in shorter cycles, which speeds up model iteration and makes research and development more efficient.

AIACC-Training

AIACC-Training optimizes models that are built on mainstream AI computing frameworks such as TensorFlow, PyTorch, MXNet, and Caffe to improve training performance.

AIACC-Training provides benefits in training speed and cost. For more information about test data, visit Stanford DAWNBench.

AIACC-Training abstracts communication interface classes and basic communication component classes for mainstream AI computing frameworks, and provides unified basic communication classes and gradient entry points at the API layer to optimize distributed performance in a unified manner.
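
As a conceptual illustration of such an abstraction, the following Python sketch defines a unified communication interface that a framework plugin could call. All class and method names here are hypothetical and only illustrate the pattern, not AIACC's actual classes.

    from abc import ABC, abstractmethod
    import numpy as np

    class CommBackend(ABC):
        """Hypothetical unified communication interface (names are illustrative)."""

        @abstractmethod
        def allreduce(self, grad: np.ndarray) -> np.ndarray:
            """Sum a gradient across all workers and return the result."""

        @abstractmethod
        def broadcast(self, tensor: np.ndarray, root_rank: int = 0) -> np.ndarray:
            """Distribute a tensor from root_rank to every worker."""

    class LocalBackend(CommBackend):
        """Single-process stand-in used here so the sketch runs anywhere."""
        def allreduce(self, grad):
            return grad  # with one worker, the sum is the gradient itself
        def broadcast(self, tensor, root_rank=0):
            return tensor

    # A framework plugin calls the same entry point regardless of backend:
    backend = LocalBackend()
    reduced = backend.allreduce(np.ones(4))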

AIACC-Training supports both data parallelism and model parallelism, with data parallelism as the primary method. The following list shows some of the acceleration features of AIACC-Training; a conceptual sketch of gradient fusion follows the list:
  • Gradient fusion communication: allows you to use adaptive multi-stream fusion and adaptive gradient fusion to improve the training performance of bandwidth-intensive network models by 50% to 300%.
  • Decentralized gradient-based negotiation: reduces the traffic of gradient-based negotiation on large-scale nodes by up to two orders of magnitude.
  • Hierarchical Allreduce algorithm: supports FP16 gradient compression and mixed precision compression.
  • NaN check: can be enabled during training to locate the gradient from which a NaN originates on SM60 and later GPU architectures.
  • API extensions for MXNet: support data parallelism and InsightFace-style model parallelism.
  • Deep optimization for remote direct memory access (RDMA) networks.
  • Hybrid link communication (RDMA and VPC).
  • Gradient compression based on gossip.
  • Gradient communication optimization based on multistep.
  • Operators that synchronize batch normalization (BN) across GPUs for MXNet.
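
The sketch below illustrates the idea behind gradient fusion communication (the first feature above): many small gradients are packed into one flat buffer so that a single communication call replaces many. This is a simplified, fixed-bucket version; the adaptive multi-stream and adaptive fusion tuning that AIACC performs is not modeled here, and the allreduce argument is a stand-in.

    import numpy as np

    def _flush(bucket, allreduce):
        # Pack the bucket into one flat buffer, communicate once, then unpack.
        flat = allreduce(np.concatenate([g.ravel() for g in bucket]))
        out, offset = [], 0
        for g in bucket:
            out.append(flat[offset:offset + g.size].reshape(g.shape))
            offset += g.size
        return out

    def fused_allreduce(grads, allreduce, bucket_bytes=4 << 20):
        # Group small gradients into ~bucket_bytes buckets; one call per bucket
        # replaces one call per gradient, which is the core of gradient fusion.
        reduced, bucket, nbytes = [], [], 0
        for g in grads:
            bucket.append(g)
            nbytes += g.nbytes
            if nbytes >= bucket_bytes:
                reduced.extend(_flush(bucket, allreduce))
                bucket, nbytes = [], 0
        if bucket:
            reduced.extend(_flush(bucket, allreduce))
        return reduced

    # Single-process stand-in for a real allreduce (identity for one worker).
    grads = [np.random.rand(256, 256).astype(np.float32) for _ in range(8)]
    fused = fused_allreduce(grads, allreduce=lambda buf: buf)
    assert all(np.array_equal(a, b) for a, b in zip(fused, grads))
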
For more information about how to install and use AIACC-Training, see the related installation and usage topics.

AIACC-Inference

AIACC-Inference can optimize models that are built on TensorFlow or that can be exported in the Open Neural Network Exchange (ONNX) format to improve inference performance.

AIACC-Inference provides benefits in inference speed and cost. For more information about test data, visit Stanford DAWNBench.

AIACC-Inference provides a model conversion tool that converts existing models into TensorFlow or ONNX models, together with TensorFlow and ONNX acceleration engines that perform the acceleration. An example of the ONNX export step appears after this paragraph.
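
For example, a PyTorch model can be exported to the ONNX format with the standard torch.onnx.export API before it is handed to an ONNX acceleration engine. This shows only the generic export step; AIACC-Inference's own conversion tool has its own interface, and the model and file names below are placeholders.

    import torch
    import torchvision

    # Placeholder model; any torch.nn.Module with a fixed input shape works.
    model = torchvision.models.resnet50(weights=None).eval()  # torchvision >= 0.13
    dummy = torch.randn(1, 3, 224, 224)  # example input that traces the graph

    torch.onnx.export(model, dummy, "resnet50.onnx", opset_version=13,
                      input_names=["input"], output_names=["output"])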

The following list shows some of the acceleration features of AIACC-Inference:
  • The TensorFlow and ONNX acceleration engines split and fuse model subgraphs, and pass the split subgraphs to the high-performance operator acceleration library for acceleration.
  • The high-performance operator acceleration library selects the optimal operator among self-developed high-performance operators and NVIDIA operators, and generates a list of high-performance operators that the acceleration engines use when splitting and passing subgraphs. A simplified sketch of this selection idea follows the list.
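
The following sketch illustrates, in simplified form, how an operator library might pick the fastest implementation among interchangeable kernels by micro-benchmarking them on a sample input. It is a conceptual stand-in: AIACC's actual selection logic, operator set, and interfaces are not described in this document, and all names below are hypothetical.

    import time
    import numpy as np

    def softmax_two_pass(x):
        # Straightforward max-subtracted softmax.
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def softmax_via_logsumexp(x):
        # Numerically equivalent variant with a different operation mix.
        m = x.max(axis=-1, keepdims=True)
        lse = m + np.log(np.exp(x - m).sum(axis=-1, keepdims=True))
        return np.exp(x - lse)

    def pick_fastest(candidates, sample, repeats=20):
        # Time each interchangeable implementation and keep the quickest.
        best_name, best_fn, best_time = None, None, float("inf")
        for name, fn in candidates.items():
            fn(sample)  # warm-up run, excluded from timing
            start = time.perf_counter()
            for _ in range(repeats):
                fn(sample)
            elapsed = (time.perf_counter() - start) / repeats
            if elapsed < best_time:
                best_name, best_fn, best_time = name, fn, elapsed
        return best_name, best_fn

    name, op = pick_fastest({"two_pass": softmax_two_pass,
                             "logsumexp": softmax_via_logsumexp},
                            np.random.rand(64, 1024))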