Big picture of machine learning technology

An AI cluster aims to deliver higher aggregate AI computing power and plays a bridging role in the stack. Upward, it supplies effective aggregate computing power to AI applications; this high aggregate computing power is what enables large AI models and massive data. Downward, it brings the full capability of AI computing chips into play through balanced design of cluster computing, network, and storage; a memory-access or network bottleneck, for example, lowers AI chip efficiency. The key points of AI cluster design are: 1) single machine: computing optimization and software/hardware co-optimization to realize the effective computing power of the AI accelerator; 2) multiple machines: communication optimization to minimize the computing-efficiency loss introduced by inter-machine data exchange. We introduce the relevant technologies and systems from these two aspects: single-machine acceleration and cluster (multi-machine) acceleration.

First, we give an overview of Alibaba's existing technology stack in terms of computing and interconnection.

DPCA server and DPCA virtualization technology

On Alibaba Cloud's DPCA hardware platform, the virtualization architecture has been upgraded to make compute virtualization clearer and more concise, so that virtual machines can deliver performance close to that of physical machines. As shown in the figure, the main features of the DPCA server architecture are: the I/O path has been moved from a traditional software implementation to hardware and pass-through devices, with storage virtualization and network virtualization implemented on the MOC card; at the same time, the control system and monitoring programs have all been offloaded onto the MOC card. The physical machines that provide computing services run only a Linux operating system and a lightweight virtual machine monitor tailored by Alibaba Cloud. In general, the DPCA hardware platform, together with the lightweight host OS and lightweight virtual machine monitor, constitutes a new generation of lightweight, efficient virtualization under the DPCA architecture.

GPU

Running AI computation on CPUs often does not achieve optimal cost performance, so AI chips with massively parallel computing capability, purpose-built to accelerate AI workloads, have emerged; GPUs, FPGAs, and AI ASICs are the most representative. The GPU is still the most mature and widely used accelerator, and Alibaba's upper-layer frameworks have done extensive compilation and optimization work for it. GPUs are widely deployed within Alibaba and are also the main source of AI computing power sold on the cloud. We have achieved simultaneous release of GPU-based cloud products with each latest GPU generation, and we are at the forefront of the industry in the security, operability, and user experience of GPUs on the cloud: our hot-upgrade capability for general computing in GPU virtualization scenarios ranks first in the industry; we were the first cloud vendor to publish pre-research on SRIOV-based GPU hot migration; and we were the first in the industry to deliver GRID-based vGPU technology on the cloud, leading the trend of vGPU cloudification and laying the GPU computing infrastructure for cloud gaming in the 5G era.

GPU training chips have always led the development of GPU technology. In addition to rapid growth in basic FP32 computing power, throughput has been greatly improved through reduced precision; Tensor Cores, for example, are another trend in raising computing power. Driven by the demands of multi-card and multi-machine communication, GPU communication has progressed through PCIe P2P, NVLink-based high-speed interconnect, and GPUDirect RDMA over RDMA networks. On Alibaba Cloud, because computing power must be shared among multiple tenants, how to partition computing power and isolate communication under these different communication modes is a technology Alibaba Cloud has been studying continuously, including, most recently, programmable topology partitioning for NVSwitch-based fully connected NVLink scenarios.
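To make these communication paths concrete, the following minimal sketch (generic PyTorch, not Alibaba Cloud tooling) reports which GPU pairs on a host can communicate directly peer-to-peer over NVLink or PCIe; it assumes a multi-GPU machine with PyTorch installed.

```python
import torch

def report_p2p():
    """Print which GPU pairs support direct peer-to-peer access.

    P2P access means one GPU can read/write another GPU's memory
    directly over NVLink or PCIe, without staging through host RAM.
    """
    n = torch.cuda.device_count()
    for src in range(n):
        for dst in range(n):
            if src == dst:
                continue
            ok = torch.cuda.can_device_access_peer(src, dst)
            print(f"GPU{src} -> GPU{dst}: {'P2P' if ok else 'via host'}")

if __name__ == "__main__":
    report_p2p()
```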

FPGA

Since their birth, FPGA devices have been widely used in wired and wireless communications, aerospace, medical electronics, automotive electronics, and other fields because their highly flexible programmability delivers ASIC-like performance and energy efficiency. However, compared with CPUs and GPUs, FPGAs have a longer development cycle (though only a half or even a third of an ASIC's) and a higher barrier to development and use, so FPGA developers are far fewer than CPU and GPU developers, and the range and popularity of applications are correspondingly limited. In FPGA we have strong customization and self-development capabilities: the industry's first single-card, dual-chip Xilinx FPGA board, jointly developed by Alibaba Cloud and AIS, demonstrates our capability for independent technology innovation at the board and HDK levels.

Shuntian platform: FPGA as a service (FaaS)

The FPGA instances on the cloud offer a rich set of functions. The Alibaba Cloud FaaS (FPGA as a Service) Shuntian platform provides a unified hardware platform and middleware on the cloud, greatly reducing the cost of developing and deploying accelerators. Third-party ISV accelerator IP can quickly be turned into services offered to users, removing the hardware barrier between acceleration technology and end users, who can consume acceleration services on demand without knowing the underlying hardware. To give accelerator providers and users a more efficient, unified development and deployment platform, the FaaS Shuntian platform provides two development kits, an HDK and an SDK. The logical architecture of FaaS is shown in the following figure:

The Alibaba Cloud FaaS Shuntian platform supports the most comprehensive set of DMA technologies, including DMA, XDMA, and QDMA; the same architecture supports RTL and HLS development, verification, and testing; and it is the only cloud vendor whose software architecture supports both major FPGA manufacturers, Xilinx and Intel. Its shell technology is comprehensive, high-quality, and compatible, and supports dynamic hot upgrade via partial reconfiguration (PR), making the FaaS Shuntian platform the infrastructure for Alibaba Group's FPGA heterogeneous acceleration business. It is fully compatible with all FPGA devices introduced by the Group and has successfully served major business segments such as Taobao, Youku, Ant, and Cloud Security.

AliDNN

Unlike the one-way adaptation of software to hardware in the GPU environment, FPGAs and Alibaba's self-developed NPU give us the opportunity to define the hardware, enabling deep software/hardware co-optimization according to business characteristics. AliDNN is a fully self-developed, FPGA-based deep learning acceleration engine spanning instruction set, accelerator, SDK, and compiler. The instruction-set-plus-compiler design gives AliDNN full flexibility: deep learning frameworks (TensorFlow, Caffe, etc.) can call the AliDNN engine directly, and the compiler (Sinian) compiles deep learning models into accelerator instructions for execution. Full-stack co-design and optimization across algorithm, runtime, compiler, and accelerator make AliDNN highly efficient, providing deep learning inference services with high throughput and low latency.

NPU

AliNPU (including Hanguang 800) was designed after analyzing the needs of AI application scenarios within Alibaba Group: it is deeply optimized for CNN models while also supporting some general models, such as RNNs. This is a targeted optimization for a specific class of deep learning algorithms, pushing the cost-performance ratio of the relevant applications to the extreme. Official results show the cost-performance ratio of Hanguang 800 to be far higher than that of competing products, making it the strongest AI inference chip in the world at the time.

RDMA

RDMA is currently the most popular high-performance network technology in the industry; it greatly reduces data transmission time and CPU overhead and is recognized as a key technology for improving the performance of artificial intelligence, scientific computing, and distributed storage. Based on the new HAIL network architecture and combined with self-developed switches, Alibaba has built a complete technical system spanning host networking, message middleware, Pangu distributed storage, network operations, and product operations, achieving the world's largest RDMA deployment across dozens of data centers, far surpassing major cloud vendors such as Amazon and Microsoft. This "highway network" across data centers lets clusters break through transmission-speed bottlenecks; it effectively supports popular products such as the ESSD cloud disk, the SCC cloud supercomputer, PAI machine learning, and the POLARDB cloud-native database, and it helps the e-commerce databases withstand the peak traffic of the Double 11 shopping festival. Meanwhile, lossy RDMA technology that can cross PODs has entered experimental testing at Alibaba, which will further expand the scope of RDMA applications.

EXSPARCL communication library

The self-developed EXSPARCL (Extremely Scalable and High Performance Alibaba gRoll Communication Library) collective communication library provides general collective communication functions and is compatible with NVIDIA's NCCL. ExSparcl is specially optimized for the high-speed interconnect architectures and multi-NIC features of large-scale AI clusters, making full use of inter-device interconnect bandwidth to ensure linear scaling of communication and business performance. Through topology awareness of the cluster/host physical interconnect and optimal selection of network routes, it implements an innovative congestion-free algorithm that guarantees high-speed communication within and between nodes. For example, for the SCC training cluster architecture, a rank-remapping Halving-Doubling algorithm ensures that collective communication incurs no congestion or queuing caused by path conflicts. In large-scale environments, collective communication performance (AllReduce/AllGather) improves several-fold over NVIDIA's NCCL library, with correspondingly clear gains in business performance. In addition, the topology-awareness feature can be used for fault avoidance, greatly enhancing network availability. See: https://www.qbitai.com/2020/03/11987.html
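To make the Halving-Doubling idea concrete, here is a minimal, self-contained Python simulation of the algorithm's data movement: reduce-scatter by recursive halving, then allgather by recursive doubling. It is an illustrative sketch, not EXSPARCL code, and it omits the rank remapping and routing that the production library adds; it assumes the rank count is a power of two and the vector length divides evenly.

```python
import numpy as np

def halving_doubling_allreduce(chunks):
    """Simulate a Halving-Doubling allreduce over P ranks (P a power of two).

    chunks: one equal-length numpy array per rank.
    Returns per-rank results; every rank ends with the element-wise sum.
    """
    P, n = len(chunks), len(chunks[0])
    assert P & (P - 1) == 0, "rank count must be a power of two"
    data = [c.astype(float).copy() for c in chunks]
    lo, hi = [0] * P, [n] * P  # the slice each rank is still reducing

    # Phase 1: reduce-scatter by recursive halving. Partners start far
    # apart (distance P/2) and exchange half of their current slice.
    d = P // 2
    while d >= 1:
        for r in range(P):
            p = r ^ d
            if r < p:  # handle each pair once; both share the same slice
                mid = (lo[r] + hi[r]) // 2
                data[r][lo[r]:mid] += data[p][lo[p]:mid]  # r keeps lower half
                data[p][mid:hi[p]] += data[r][mid:hi[r]]  # p keeps upper half
                hi[r], lo[p] = mid, mid
        d //= 2

    # Phase 2: allgather by recursive doubling. Partners start adjacent
    # and exchange their owned slices, doubling coverage each step.
    d = 1
    while d < P:
        for r in range(P):
            p = r ^ d
            if r < p:  # owned slices are disjoint and adjacent
                data[r][lo[p]:hi[p]] = data[p][lo[p]:hi[p]]
                data[p][lo[r]:hi[r]] = data[r][lo[r]:hi[r]]
                lo[r] = lo[p] = min(lo[r], lo[p])
                hi[r] = hi[p] = max(hi[r], hi[p])
        d *= 2
    return data

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    chunks = [rng.integers(0, 10, 8) for _ in range(4)]  # 4 ranks, 8 elements
    out = halving_doubling_allreduce(chunks)
    assert all(np.allclose(o, np.sum(chunks, axis=0)) for o in out)
    print("allreduce result:", out[0])
```

Each rank sends and receives only n/P elements per step in phase 1, which is what lets a careful mapping of partners onto physical links avoid path conflicts entirely.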

Apsara AI Acceleration Tool (AIACC)

The Apsara AI Acceleration tool (AIACC) supports distributed performance acceleration for four mainstream AI computing frameworks, TensorFlow, PyTorch, MXNet, and Caffe, through a unified framework, with deep performance optimization for both VPC and RDMA networks; it can improve training performance by 1x to 10x across different scenarios and training scales. At the same time, AIACC is decoupled from the AI computing frameworks themselves: on one hand this makes it easy to track the forward iteration of each framework's community version, and on the other hand users can obtain the acceleration without modifying the model or algorithm code written against each framework.
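AIACC's own integration interface is not shown in this article, so the following sketch uses standard PyTorch distributed training only to illustrate the decoupling idea: the model and algorithm code never name the communication stack, which is selected purely by configuration (here a hypothetical DIST_BACKEND environment variable), so an accelerated drop-in library can be swapped in without touching the model.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # The communication stack is chosen by configuration alone; "nccl" is
    # the stock backend, and an accelerated drop-in (an AIACC-style
    # library) would be selected the same way. DIST_BACKEND is a
    # hypothetical variable used here for illustration.
    dist.init_process_group(backend=os.environ.get("DIST_BACKEND", "nccl"))
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    model = DDP(torch.nn.Linear(1024, 1024).cuda(),
                device_ids=[local_rank])   # gradient allreduce lives here
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    for _ in range(10):                    # dummy training steps
        x = torch.randn(32, 1024, device="cuda")
        loss = model(x).square().mean()
        opt.zero_grad()
        loss.backward()                    # triggers backend allreduce
        opt.step()

if __name__ == "__main__":
    main()  # launch with: torchrun --nproc_per_node=<gpus> script.py
```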

FaaS Shuntian supports high-speed FPGA chip interconnection (up to 600G)

Alibaba Cloud and AIS jointly developed the industry's first single-card, dual-chip FPGA board, AliFPGAx2. The SerDes bandwidth between the two FPGAs can reach 600G, and two AliFPGAx2 boards in one server can also be interconnected over high-speed SerDes through optical cables. FaaS F3 thus provides users with a powerful interconnect topology for building FPGA clusters, with high-speed DMA software and hardware support provided through the FaaS Shuntian shell. It supports single-card 2-chip, dual-card 4-chip, and 8-card 16-chip interconnection, and the interconnect channels can be flexibly partitioned into hardware-isolated groups through software to interconnect and isolate the different topology requirements of cloud users. The following figure shows a typical 2-card 4-chip interconnection topology.
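As a toy illustration of software-configured isolation partitions (the actual shell interface is not described here), the sketch below models the 2-card / 4-chip case as a small graph and derives which chips remain mutually reachable once a partition assignment disables cross-partition links; all names and the link layout are hypothetical.

```python
# Hypothetical 2-card / 4-chip topology: each card has two chips joined
# by on-card SerDes, and optical cables join the cards pairwise.
chips = ["card0.fpga0", "card0.fpga1", "card1.fpga0", "card1.fpga1"]
links = [
    ("card0.fpga0", "card0.fpga1"),  # on-card SerDes
    ("card1.fpga0", "card1.fpga1"),  # on-card SerDes
    ("card0.fpga0", "card1.fpga0"),  # optical cable between cards
    ("card0.fpga1", "card1.fpga1"),  # optical cable between cards
]

def islands(partition):
    """Connected components after disabling cross-partition links.

    partition: dict chip -> tenant ID. Links whose endpoints belong to
    different tenants are disabled in hardware, so each tenant ends up
    with an isolated interconnect island.
    """
    parent = {c: c for c in chips}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for a, b in links:
        if partition[a] == partition[b]:  # link stays enabled
            parent[find(a)] = find(b)
    groups = {}
    for c in chips:
        groups.setdefault(find(c), []).append(c)
    return list(groups.values())

# Two tenants, each granted one chip per card (a 2-chip island).
partition = {"card0.fpga0": 0, "card1.fpga0": 0,
             "card0.fpga1": 1, "card1.fpga1": 1}
print(islands(partition))
# [['card0.fpga0', 'card1.fpga0'], ['card0.fpga1', 'card1.fpga1']]
```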

High performance computing uses parallel clusters of computing nodes connected with low latency and high bandwidth; through parallel computing it solves floating-point-intensive scientific and engineering models, including AI deep learning models. DPCA bare metal servers serve as the computing nodes, with high-speed MPI communication between nodes over RoCE NICs, and interconnection with VPC, IO, network, and cloud disks through the DPCA MOC card. The cluster can thus output supercomputing-class computing power while keeping "cloud native" elasticity and unified operation and maintenance of all nodes. We also developed the elastic high-performance computing E-HPC PaaS platform as a high-performance computing software stack to create and manage SCC clusters, and we output SCC clusters directly to customers who have the ability to build their own HPC platforms.
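For a flavor of the MPI communication such nodes run, here is a minimal allreduce example using the standard mpi4py bindings; it is generic MPI code, not E-HPC-specific, and assumes an MPI implementation and mpi4py are installed on the cluster.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Each rank contributes a vector; Allreduce sums them across all nodes,
# exercising the low-latency fabric between the computing nodes.
local = np.full(4, rank, dtype="d")
total = np.empty(4, dtype="d")
comm.Allreduce(local, total, op=MPI.SUM)

if rank == 0:
    expected = size * (size - 1) / 2   # sum of rank IDs 0..size-1
    print(f"{size} ranks, allreduce result: {total} (expected {expected})")
# run with: mpirun -np 4 python allreduce_demo.py
```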

SCC-training cluster: large-scale AI training cluster

In response to the computing power requirements of large-scale AI training, Alibaba has carried out an integrated design from hardware to algorithm. From a performance perspective, one core goal of cluster design is to improve the data-interaction capability between accelerators, reduce the proportion of non-compute overhead, and achieve linear scaling of computing power with cluster size. The heterogeneous cluster system therefore takes communication optimization as its breakthrough point: it redefines the network architecture of the server and the whole system, first removing communication bottlenecks in hardware, and then exploiting the enhanced communication capability through software/hardware cooperation. The internal codename of this work is EFlops, and it achieves linear speedup of AI training. The results were published at the top academic conference HPCA 2020 (https://www.csdn.net/article/a/2020-03-03/15988517). At present, the single-cluster AI computing power of the system under construction can reach 500 PFlops (FP16).
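Linear scaling can be quantified with a simple efficiency metric: measured throughput at N accelerators divided by N times the single-accelerator throughput. The helper below computes it; the example numbers are made up for illustration.

```python
def scaling_efficiency(throughput_n, throughput_1, n):
    """Fraction of ideal linear speedup achieved at n accelerators.

    1.0 means perfectly linear scaling; communication and other
    non-compute overheads push the value below 1.0.
    """
    return throughput_n / (n * throughput_1)

# Hypothetical measurements: images/sec at 1 GPU and at 64 GPUs.
print(f"efficiency: {scaling_efficiency(60000.0, 1000.0, 64):.1%}")  # 93.8%
```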

In addition, liquid cooling is another force driving the evolution of AI cluster architecture. Because of power-consumption limits, one cabinet can currently accommodate only two or four 8-card servers, making it hard to raise computing density. Liquid cooling breaks this limitation: a single tank can hold more than 100 GPU cards, saving power, floor space, and fiber while improving stability. Accordingly, the cluster architecture has been redesigned around the size and cabling characteristics of the immersion liquid-cooled tank, and this work has entered the experimental stage.

Third, while pursuing ultimate performance, we also attend to computing cost optimization and to the flexibility gained by decoupling general-purpose and heterogeneous computing power.

Heterogeneous hardware layer virtualization

Alibaba IaaS's GPU virtualization technology is the foundation of AI computing power on the cloud. Alibaba IaaS has done secondary development on existing open source GPU virtualization technology, adapting it to the public cloud with improvements in security, high reliability, monitoring, and other key functions.

For security isolation of public cloud GPU servers, Alibaba Cloud's heterogeneous IaaS layer implements preliminary security filtering in the instance's internal driver, filtering and interception of GPU privileged instructions at the host virtualization layer, and fault-tolerant handling of the host PCIe protocol, ensuring that a customer instance can neither be attacked nor attack other customers.

At present, mainstream heterogeneous virtualization still takes the form of device pass-through, on top of which newer technologies such as SRIOV and vGPU sliced virtualization have gradually evolved. Alibaba IaaS relies on this GPU virtualization technology to deliver AI clusters (GPU/FPGA) at cloud scale as products. GPU virtualization has come a long way, progressing through pass-through virtualization, sliced virtualization with SRIOV and vGPU, Intel's GVT-g technology, and so on.

In the era of general-purpose computing, virtualization was introduced mainly to improve CPU utilization, but with the introduction of hot upgrade and hot migration, the security and reliability of computing resources have been pushed to a new level. GPU virtualization likewise plays an important role in improving GPU resource utilization and defragmenting computing resources.

For monitoring of public cloud GPU servers, Alibaba Cloud's heterogeneous IaaS layer provides GPU-specific cloud monitoring that reports the GPU's current operating state, temperature, video memory usage, and other information in real time, and supports multiple customizable monitoring modes, including GPU custom monitoring and the GPU cloud monitoring plug-in.
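The kind of per-GPU telemetry described here can be sampled with NVIDIA's standard NVML bindings; the sketch below is a generic example using the pynvml package, not Alibaba Cloud's monitoring agent.

```python
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        h = pynvml.nvmlDeviceGetHandleByIndex(i)
        temp = pynvml.nvmlDeviceGetTemperature(h, pynvml.NVML_TEMPERATURE_GPU)
        mem = pynvml.nvmlDeviceGetMemoryInfo(h)         # bytes: total/used/free
        util = pynvml.nvmlDeviceGetUtilizationRates(h)  # percent: gpu/memory
        print(f"GPU{i}: {temp}C, "
              f"mem {mem.used // 2**20}/{mem.total // 2**20} MiB, "
              f"util {util.gpu}%")
finally:
    pynvml.nvmlShutdown()
```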

For high availability of public cloud GPU servers, the Alibaba Cloud heterogeneous IaaS layer has developed and deployed unique GPU server hot upgrade and hot migration functions. When the host where an instance resides needs maintenance such as a system software update or hardware servicing, the customer can be upgraded without any perceptible interruption, ensuring the stability and continuity of the customer's business. With hot migration capabilities based on SRIOV and GRID vGPU, Alibaba Cloud is in the industry's first echelon.

Heterogeneous GPU container support (cGPU container AI computing power isolation technology)

Cloud native has become an industry trend in cloud services. How to support heterogeneous computing in cloud native environments, and how to run multiple containers on a single GPU with isolation between them, has been widely explored. NVIDIA vGPU, NVIDIA MPS, and competitors' vCUDA solutions all let users consume the GPU at finer granularity.

The Alibaba Cloud GPU team launched the Haotian cGPU solution, a disruptive innovation compared with the alternatives. The common industry approach is to replace the CUDA library to intercept calls, which requires recompiling statically linked programs and re-adapting to every CUDA upgrade. Haotian cGPU achieves both computing power scheduling and video memory isolation without replacing the CUDA static or dynamic libraries and without recompilation, so CUDA, cuDNN, and other components can be upgraded at any time without adaptation.

cGPU is a self-developed host kernel driver. Its advantages are:

Adapted to the open source standard Kubernetes and NVIDIA Docker solutions

Transparent to the user side: AI applications run without recompilation and without replacing the CUDA library

Operating on NVIDIA devices at the low level is more stable and convergent, whereas the CUDA-layer API changes frequently and some non-public cuDNN APIs are hard to capture

Supports isolation of both GPU video memory and computing power

The combination of the Alibaba Cloud GPU team's Haotian cGPU solution with the GPU shared scheduling of the container service creates a low-cost, reliable, and user-friendly solution for large-scale GPU scheduling and isolation.
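From inside a container, the effect of memory isolation can be observed with standard CUDA queries; the sketch below uses PyTorch's public API and assumes, hypothetically, a container that was granted only a slice of the physical card by a memory-isolating sharing layer.

```python
import torch

# Under a memory-isolating GPU sharing scheme (such as cGPU), the
# device's reported total memory is assumed here to reflect the granted
# slice rather than the whole physical card. This probe itself is
# generic CUDA/PyTorch code.
free, total = torch.cuda.mem_get_info(0)  # bytes visible to this container
print(f"visible GPU memory: {total / 2**30:.1f} GiB "
      f"({free / 2**30:.1f} GiB free)")
```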

Software pooling

EAIS.EI: resource scheduling layer pooling

EAIS decouples the number of CPU cores from the back-end heterogeneous acceleration devices through software pooling. A front-end pure-CPU ECS instance can dynamically attach or detach back-end acceleration devices, communicating with them over an encrypted gRPC protocol. The back-end devices can include GPUs, FPGAs, NPUs, and other heterogeneous accelerators, all uniformly scheduled and managed through the software pool.
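The article does not publish EAIS's wire protocol, so the following sketch only illustrates the pattern with the standard grpcio library: an encrypted channel from the front-end ECS to a pooled-accelerator service. The service name, method, and address are hypothetical.

```python
import grpc

# Hypothetical endpoint: stands in for a pooled-accelerator control
# plane reachable from the front-end ECS instance.
POOL_ADDR = "accel-pool.example.internal:443"

def attach_accelerator(kind: str = "gpu"):
    creds = grpc.ssl_channel_credentials()           # encrypted channel
    with grpc.secure_channel(POOL_ADDR, creds) as channel:
        # A real client would use a stub generated from the service's
        # .proto file, e.g.:
        #   stub = accel_pb2_grpc.AcceleratorPoolStub(channel)
        #   lease = stub.Attach(accel_pb2.AttachRequest(kind=kind))
        # Here we only verify connectivity, since the actual schema
        # is not public.
        grpc.channel_ready_future(channel).result(timeout=5)
        print(f"connected to pool; would request one '{kind}' device")

if __name__ == "__main__":
    attach_accelerator("gpu")
```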

HARP: Runtime layer pooling

The main purpose of accelerator cluster pooling is to improve cluster resource utilization. Cluster managers such as K8s are very limited in managing acceleration resources: they generally allocate them in pass-through mode at whole-card granularity, and CPU and acceleration resources are bound together and allocated as whole machines, which wastes considerable resources in practice. The data center is trending toward disaggregation: from physical machines to virtual machines, virtual NICs, and distributed storage, the parts of a machine are increasingly assembled over the network, in dedicated cabinets, in virtualized and configurable form. Compared with disks, acceleration resources are more tightly coupled to general computing; even so, as technology progresses, attaching accelerators to general computing (CPU) through the network or other interconnects has become a technical trend.

Heterogeneous Accelerator Resource Pooling (HARP) currently takes the GPU as the main acceleration resource and will be extended to NPUs and other acceleration hardware in the future. By adding an intermediate layer between the user program and the driver (currently software, expandable to hardware in the future), it virtualizes acceleration resources and dynamically allocates local or remote accelerators to users, so as to better manage and utilize them.
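HARP's internals are not described beyond this, so the sketch below only illustrates the intermediate-layer pattern in Python: a proxy that presents one "device" interface while routing calls to a local or a remote backend chosen at attach time. All class and method names are hypothetical.

```python
from dataclasses import dataclass

class LocalGPU:
    """Backend that would call the local driver directly (stubbed)."""
    def alloc(self, mb: int) -> str:
        return f"local buffer of {mb} MiB"

class RemoteGPU:
    """Backend that would forward calls over the network (stubbed)."""
    def __init__(self, host: str):
        self.host = host
    def alloc(self, mb: int) -> str:
        return f"remote buffer of {mb} MiB on {self.host}"

@dataclass
class PooledDevice:
    """The intermediate layer: one device interface, pluggable backend.

    The application sees only this object; whether the accelerator is
    local or remote is decided by the pool scheduler at attach time.
    """
    backend: object
    def alloc(self, mb: int) -> str:
        return self.backend.alloc(mb)

def attach(prefer_local: bool) -> PooledDevice:
    # A real scheduler would consult pool-wide utilization; this toy
    # version just honors a flag.
    backend = LocalGPU() if prefer_local else RemoteGPU("node-17")
    return PooledDevice(backend)

dev = attach(prefer_local=False)
print(dev.alloc(256))   # application code is identical either way
```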

The advantages of the HARP implementation are:

Transparent to upper-layer applications, with no requirements on the running environment (physical machine, container, VM)

Supports both local and remote acceleration resources; local mode performs no differently from using the GPU directly

A control API can partition video memory and computing resources when a single card is shared

Hides the details of the underlying hardware from upper-layer applications and automates some low-level configuration (local physical GPU, a physical GPU on another host connected by PCIe switch, vGPU, next-generation GPU support, Compute Instance)

Can implement additional profiling functions and even generate traceable replays

The HARP resource pool also needs to support other acceleration chips, so we hope to establish a unified interface: a chip that supports this interface can join the resource pool scheduling system with only a small amount of work. To that end we have joined with Shanghai Jiao Tong University and Tsinghua University, as well as the chip manufacturer Cambricon, to establish an industry alliance for China's heterogeneous resource pooling technology standards.

Hardware layer pooling

As an upgrade of software pooling, hardware pooling uses self-developed or third-party plug-in cards and high-speed bus interconnects within a rack, or across racks at small scale, to decouple general-purpose computing from accelerators and recombine them flexibly, yielding medium-scale accelerator pools matched to any mix of accelerators. At the same time, it provides better reliability, operability, and SLAs for handling accelerator card hardware faults.

Remote Resources - Local Access

Finally, building on research and development in the field of core electronic devices, high-end chips, and basic software, Alibaba Cloud's IaaS and PaaS services today have the following characteristics:

The design goal is to create hardware computing infrastructure that supplies computing power with excellent cost performance and product diversity.

Technology R&D and roadmap built mainly on software/hardware co-design form our differentiated competitiveness and technical grasp.

The ability to deliver elastic computing power, from fine-grained sharing of a single accelerator to large-scale cluster computing power, is a core capability that must be built.

Differentiated, highly reliable SLA service capabilities.
