
In recent years, deep learning has been widely deployed across industry in areas such as computer vision, natural language processing, and search, advertising, and recommendation. The exponential growth in model parameter counts, together with demand from new businesses for more complex models, requires cloud providers' elastic computing services to lower computing cost and raise computing efficiency. Deep learning inference in particular is becoming a focus of optimization. Against this background, Alibaba Cloud's T-Head (Pingtouge) team launched Yitian 710, the world's first 5 nm Arm server chip. The chip is based on the Arm Neoverse N2 architecture and supports the latest Armv9 instruction set, including extensions such as i8mm and bf16, which give it a performance advantage in scientific and AI computing.

In this article, we focus on the ECS g8y instance, which uses the Yitian 710 chip, and test and compare its performance on deep learning inference tasks.

01 Workloads

For this analysis, we selected four common deep learning inference scenarios, covering image classification, object detection, natural language processing, and search recommendation.

02 Platforms

Instance type

We tested two Alibaba Cloud instance types, ECS g8y (Yitian 710) and ECS g7 (Ice Lake), both with 8 vCPUs.

Deep Learning Framework

On all platforms, we use TensorFlow v2.10.0 and PyTorch 1.12.1.

On Arm devices, TensorFlow supports two backends; we use the OneDNN backend. OneDNN is an open source, cross-platform deep learning library that can integrate the Arm Compute Library (ACL), Arm's machine learning compute library. Using this backend on Arm devices yields higher performance.

OneDNN support in PyTorch is still experimental, so we use the default OpenBLAS backend for PyTorch.
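As a rough illustration of how the OneDNN backend is typically selected in TensorFlow, the sketch below sets two environment variables before TensorFlow is imported. `TF_ENABLE_ONEDNN_OPTS` is TensorFlow's switch for OneDNN optimizations and `DNNL_DEFAULT_FPMATH_MODE` is OneDNN's switch for allowing BF16 arithmetic on FP32 models. Whether the tests in this article used exactly these settings is an assumption, and the helper name `configure_onednn_env` is hypothetical.

```python
import os


def configure_onednn_env(use_bf16_fast_math: bool = True) -> dict:
    """Set environment variables that select the OneDNN backend.

    Hypothetical helper for illustration; the article does not state
    which settings were used in its tests.
    """
    settings = {"TF_ENABLE_ONEDNN_OPTS": "1"}  # turn on OneDNN optimizations
    if use_bf16_fast_math:
        # Let OneDNN use BF16 instructions for FP32 operations where supported.
        settings["DNNL_DEFAULT_FPMATH_MODE"] = "BF16"
    os.environ.update(settings)
    return settings


applied = configure_onednn_env()
print(applied)
# import tensorflow as tf  # import only after the environment is configured
```

The variables are read when the library initializes, so they need to be set before the first `import tensorflow`, or exported in the shell that launches the process.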

BFloat16

BFloat16 (BF16) is a floating-point format. It has the same number of exponent bits as IEEE single precision (FP32), 8, but only 7 fraction bits instead of 23. BF16 therefore covers almost the same range as FP32, but with lower precision. BF16 suits deep learning well because reduced numeric precision generally does not significantly reduce a model's prediction accuracy, while the 16-bit format saves space and speeds up computation.
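The relationship between the two formats can be sketched in a few lines: a BF16 value is the upper 16 bits of the FP32 bit pattern. The helper names below are illustrative, round-to-nearest-even is one common conversion choice, and NaN and infinity handling is omitted for brevity.

```python
import struct


def fp32_bits(x: float) -> int:
    """Return the 32-bit IEEE 754 single-precision pattern of x."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def to_bf16_bits(x: float) -> int:
    """Keep the top 16 bits of the FP32 pattern, rounding to nearest even."""
    bits = fp32_bits(x)
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF


def from_bf16_bits(b: int) -> float:
    """Widen a BF16 pattern back to a float by zero-filling the low 16 bits."""
    return struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))[0]


def bf16_round(x: float) -> float:
    """Round x to the nearest value representable in BF16."""
    return from_bf16_bits(to_bf16_bits(x))


print(hex(to_bf16_bits(1.0)))  # 0x3f80: sign 0, exponent 0x7f, fraction 0
print(bf16_round(3.14159))     # 3.140625: only about 2-3 decimal digits survive
print(bf16_round(1e38))        # still finite: the range matches FP32
```

The last line is the point of the format: a 16-bit IEEE half-precision value would overflow at this magnitude, while BF16 keeps the FP32 exponent and gives up fraction bits instead.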

03 TensorFlow Performance Comparison

With the help of the new BF16 instructions, g8y significantly improves deep learning inference performance and outperforms g7 in multiple scenarios. In addition, because Yitian 710 is a self-developed chip, g8y has a price advantage of up to 30% over g7.

The following four figures compare results for the ResNet50, SSD, BERT, and DIN models. ResNet50, SSD, and BERT come from the MLPerf Inference Benchmark project, and DIN is the click-through rate prediction model proposed by Alibaba. The blue bars show the direct performance comparison, and the orange bars show performance per unit price. For example, on ResNet50, g8y delivers 1.43 times the performance of g7 and 2.05 times its performance per unit price.
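The two ResNet50 figures quoted above can be cross-checked against the stated price advantage. The sketch below assumes that "performance per unit price" means throughput divided by instance price, which the article does not spell out; the function name is illustrative.

```python
def implied_price_ratio(perf_ratio: float, perf_per_price_ratio: float) -> float:
    """Return price(g8y) / price(g7) implied by the two ratios.

    Assumes perf_per_price = perf / price, so
    price_ratio = perf_ratio / perf_per_price_ratio.
    """
    return perf_ratio / perf_per_price_ratio


ratio = implied_price_ratio(1.43, 2.05)  # both ratios are quoted in the text
print(f"implied g8y/g7 price ratio: {ratio:.3f}")     # 0.698
print(f"implied price advantage:    {1 - ratio:.1%}")  # 30.2%
```

Under that assumption the implied price gap is about 30%, which is consistent with the price advantage of up to 30% stated earlier.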

Note: for ResNet50, the batch size is 32 and the test image size is 224 × 224.

Note: for SSD, the batch size is 1 and the test image size is 1200 × 1200.

04 PyTorch Performance Comparison

OneDNN support in PyTorch on Arm is still experimental, so this experiment uses the default OpenBLAS backend. OpenBLAS is an open source linear algebra library, to which we added an optimized BFloat16 matrix multiplication implementation for Arm Neoverse N2.

OpenBLAS BFloat16 matrix multiplication optimization

Matrix multiplication is closely tied to deep learning. For example, the commonly used fully connected and convolution layers are ultimately computed as matrix multiplications. Accelerating matrix multiplication therefore accelerates model computation as a whole.

OpenBLAS is a widely used computing library and is the default backend of NumPy, PyTorch, and other libraries. During our investigation, we found that it did not support the bf16 instruction extension of Yitian 710. After discussing with the community, we decided to use vector instructions supported by Yitian 710, such as BFMMLA, to implement matrix multiplication for the bf16 data format. This implementation greatly improved performance, as shown in Figure 5. It has been contributed to the open source community and is included in OpenBLAS 0.3.21, the latest release at the time of writing.
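To make the conversion concrete, here is a minimal pure-Python sketch of both cases: a fully connected layer written directly as a matrix product, and a 2-D convolution lowered to a matrix product with the common im2col technique. It is illustrative only; production frameworks use heavily optimized variants, and the function names are not taken from any library.

```python
def matmul(a, b):
    """Plain matrix product of two matrices given as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def fully_connected(x, weights, bias):
    """Fully connected layer: y = x @ W + b for a batch of row vectors."""
    return [[v + b for v, b in zip(row, bias)] for row in matmul(x, weights)]


def im2col(image, k):
    """Unroll every k x k patch (stride 1, no padding) into one row."""
    h, w = len(image), len(image[0])
    return [
        [image[i + di][j + dj] for di in range(k) for dj in range(k)]
        for i in range(h - k + 1)
        for j in range(w - k + 1)
    ]


def conv2d_as_matmul(image, kernel):
    """Convolution (cross-correlation, as used in deep learning) via im2col."""
    k = len(kernel)
    patches = im2col(image, k)                          # (out_h * out_w) x (k * k)
    kernel_col = [[v] for row in kernel for v in row]   # (k * k) x 1
    flat = [r[0] for r in matmul(patches, kernel_col)]  # one value per patch
    out_w = len(image[0]) - k + 1
    return [flat[i:i + out_w] for i in range(0, len(flat), out_w)]


def conv2d_direct(image, kernel):
    """Reference sliding-window implementation used to check the result."""
    k = len(kernel)
    out_h, out_w = len(image) - k + 1, len(image[0]) - k + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(k) for dj in range(k))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0], [0, -1]]
print(conv2d_as_matmul(image, kernel))  # [[-4, -4], [-4, -4]]
print(fully_connected([[1, 2]], [[1, 0], [0, 1]], [1, 1]))  # [[2, 3]]
```

With several output channels, the kernel column becomes a matrix with one column per filter, so the whole layer is a single large matrix product that a BLAS library can execute.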
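Conceptually, a BFMMLA-style instruction accumulates the product of a 2 × 4 BF16 block and a 4 × 2 BF16 block into a 2 × 2 FP32 block, so a matrix multiplication can be tiled into many such steps. The Python model below sketches that tiling only. It is not the OpenBLAS kernel, which is written in low-level code; it accumulates in Python floats rather than FP32, simplifies rounding, and uses hypothetical function names.

```python
import struct


def bf16_round(x: float) -> float:
    """Round x to the nearest BF16 value (NaN/inf handling omitted)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def mmla_2x4x2(acc, a_blk, b_blk):
    """acc (2x2) += a_blk (2x4) @ b_blk (4x2): one BFMMLA-style step."""
    for i in range(2):
        for j in range(2):
            acc[i][j] += sum(a_blk[i][k] * b_blk[k][j] for k in range(4))


def bf16_matmul(a, b):
    """C = A @ B with BF16-rounded inputs, tiled into 2x4x2 block steps.

    For brevity, M and N must be multiples of 2 and K a multiple of 4.
    """
    m, k, n = len(a), len(b), len(b[0])
    assert m % 2 == 0 and n % 2 == 0 and k % 4 == 0
    a16 = [[bf16_round(v) for v in row] for row in a]
    b16 = [[bf16_round(v) for v in row] for row in b]
    c = [[0.0] * n for _ in range(m)]
    for i in range(0, m, 2):
        for j in range(0, n, 2):
            acc = [[0.0, 0.0], [0.0, 0.0]]  # wide accumulator for one 2x2 tile
            for p in range(0, k, 4):
                a_blk = [a16[i + r][p:p + 4] for r in range(2)]
                b_blk = [b16[p + r][j:j + 2] for r in range(4)]
                mmla_2x4x2(acc, a_blk, b_blk)
            for r in range(2):
                c[i + r][j:j + 2] = acc[r]
    return c


a = [[1, 2, 3, 4], [5, 6, 7, 8]]
b = [[1, 0], [0, 1], [1, 0], [0, 1]]
print(bf16_matmul(a, b))  # [[4.0, 6.0], [12.0, 14.0]]
```

Inputs are stored in 16 bits while partial sums stay in a wider accumulator, which is why this style of kernel roughly doubles the data processed per vector register without accumulating rounding error at BF16 precision.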

Figure 5: OpenBLAS matrix multiplication performance comparison

Note: the matrices used in this test have 1000 rows and 1000 columns.

PyTorch CNN Performance

OpenBLAS is the default backend of PyTorch, so the matrix multiplication optimization carries over to deep learning models implemented in PyTorch. We take VGG19, a model in which convolution accounts for a large share of the computation, as an example. During inference, all convolution operators are converted into matrix multiplications and computed by OpenBLAS. The following figure shows the performance comparison for VGG19:

05 Conclusion

The analysis in this article shows that, for multiple deep learning models, inference performance on the Alibaba Cloud Yitian instance g8y is higher than on a g7 instance of the same specification. This is mainly due to the new instructions in Arm Neoverse N2 and continually improving software support (OneDNN, ACL, and OpenBLAS). The Alibaba Cloud compiler team contributed some of these software optimizations. We will continue to work on software and hardware optimization in this area to make Arm-based instances more competitive for ML/AI.

ECS Yitian instance deep learning inference performance measurement
