Alibaba's BladeDISC Deep Learning Compiler Is Officially Open Source

One: Introduction

With the rapid development of deep learning, AI model structures are evolving quickly and new computing hardware keeps emerging. For developers, this means not only making effective use of computing power in complex and changing scenarios, but also keeping up with the continuous iteration of computing frameworks. Deep learning compilers have become a widely discussed way to address these problems: they let users focus on upper-level model development, reduce the manual cost of hand-tuned performance optimization, and squeeze more performance out of the hardware. Alibaba Cloud Machine Learning PAI has open sourced BladeDISC, a dynamic shape deep learning compiler that was put into real business use relatively early in the industry. This article explains BladeDISC's design principles and applications in detail.

Two: What is BladeDISC

BladeDISC is Alibaba's newly open-sourced, MLIR-based dynamic shape deep learning compiler.

1 Main Features

* Multiple front-end framework support: TensorFlow, PyTorch
* Multiple backend hardware support: CUDA, ROCm, x86
* Complete support for dynamic shape semantic compilation
* Support for both inference and training
* Lightweight API, general-purpose and transparent to users (see the sketch after this list)
* Support for both a plug-in mode embedded in the host framework and a standalone deployment mode
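
As a hedged illustration of the plug-in mode on the TensorFlow side, the snippet below follows the wrapper name used in the project's open-source examples (`blade_disc_tf`); treat the exact module and function names as subject to change across releases. A corresponding PyTorch example appears later, in the section on front-end framework support.

```python
import numpy as np
import tensorflow as tf
import blade_disc_tf as disc  # wrapper name as used in BladeDISC's examples

# Enable BladeDISC: qualifying subgraphs are clustered and compiled transparently;
# the rest of the TensorFlow program is unchanged.
disc.enable()

x = tf.constant(np.random.rand(4, 128).astype(np.float32))
w = tf.constant(np.random.rand(128, 64).astype(np.float32))
y = tf.nn.relu(tf.matmul(x, w))  # runs as usual; compiled kernels are used where possible
print(y.shape)
```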

Three: Background of deep learning compilers

In recent years, deep learning compilers have been an extremely active, relatively new technical direction, from earlier projects such as TensorFlow XLA, TVM, Tensor Comprehensions, and Glow, to the later, highly popular MLIR and its extension projects in different fields such as IREE and mlir-hlo. Different companies and communities are doing a great deal of exploration and development in this field.

1 Jointly driven by the AI wave and the chip wave: from early budding to vigorous development
Deep learning compilers have received continuous attention in recent years mainly for the following reasons:

The requirements of framework performance optimization in terms of model generalization

Deep learning is still developing rapidly, and innovative application fields keep emerging. Making effective use of hardware computing power in complex and changing scenarios has become a crucial part of the whole AI application chain. In the early days, the focus of neural network performance work was on frameworks and operator libraries; this responsibility was largely carried by the deep learning framework, the operator libraries provided by hardware vendors, and the manual optimization work of business teams.

The figure above roughly divides deep learning frameworks of recent years into three generations. One trend is that the user-facing APIs become more and more flexible, which in turn poses greater challenges for underlying performance. The first generation, represented by Caffe, describes a neural network as a sequence of layers. The second generation, represented by TensorFlow, describes the computation as a finer-grained graph of operators. The third generation consists of dynamic graphs, as in PyTorch and TensorFlow Eager Mode. As frameworks become more flexible and expressive, optimizing the underlying performance becomes harder, and business teams often have to supplement it with manual optimization. Such work depends on the specific business and on an understanding of the underlying hardware; it is labor-intensive and hard to generalize. Deep learning compilers combine compile-time graph-level optimization with automatic or semi-automatic code generation, generalizing the principles of manual optimization to replace it, and thereby resolving the conflict between the flexibility and the performance of deep learning frameworks.

AI Framework Requirements for Hardware Generalization

On the surface, the rapid development of AI in recent years is obvious to all, but behind it, decades of growth in hardware computing power have been the core driving force catalyzing AI's prosperity. As transistor scaling faces increasing physical challenges, improving chip computing power becomes harder and harder, Moore's Law is approaching its limits, and DSA chips with innovative architectures are arriving in waves, while traditional x86 and ARM vendors are also strengthening their competitiveness in different fields. This flourishing of hardware brings new challenges to the development of AI frameworks.

Hardware innovation is one problem; making full use of its computing power in real business scenarios is another. New AI hardware vendors face not only the challenge of innovating on hardware, but also the heavy manpower investment required on the software stack. How to remain compatible with an ever-growing range of hardware has become one of the core difficulties of today's deep learning frameworks, and it is a problem that compilers are expected to solve.

Requirements of AI system platforms for front-end framework generalization

Today's mainstream deep learning frameworks include TensorFlow, PyTorch, Keras, JAX, and others. Each has its own strengths and weaknesses and a different style of user-facing API, but all face the same problems of hardware adaptation and making full use of hardware computing power. Different teams choose different frameworks according to their modeling scenarios and habits, while the performance optimization tools and hardware adaptation solutions of cloud vendors and platforms must support different front-end frameworks and even anticipate future framework evolution. Google uses XLA to support both TensorFlow and JAX, and other open source communities have evolved access layers such as Torch-XLA and Torch-MLIR. Although these access solutions still have issues with ease of use and maturity, they reflect the demand for front-end framework generalization at the AI system layer and the corresponding technical trend.

2 What is a deep learning compiler

Traditional compilers take a high-level language as input, sparing users from writing machine code directly and letting them work in relatively flexible and efficient languages, while introducing optimizations during compilation to solve the performance problems introduced by high-level languages and to balance development efficiency against performance. A deep learning compiler plays a similar role: its input is an even more flexible, more abstract description of a computation graph, and its output includes machine code and execution engines for CPUs, GPUs, and other heterogeneous hardware platforms.

One of the missions of a traditional compiler is to reduce the burden on programmers. The high-level language used as compiler input mainly describes logic; for the programmer's convenience, its descriptions are abstract and flexible, and whether that logic can execute efficiently on the machine is an important test of the compiler. Deep learning, as a field that has developed extremely fast in recent years, cares greatly about performance, and the same tension exists between the flexibility and abstraction of the high-level description and the underlying computing performance; this is why compilers specifically for deep learning appeared. Another important mission of traditional compilers is to ensure that the programmer's high-level code can run on hardware with different architectures and instruction sets, and this too is reflected in deep learning compilers: facing a new hardware device, no one has the manpower to rewrite all of a framework's operators for every target. A deep learning compiler therefore provides an intermediate-layer IR, lowers the top-level framework's model graph into that IR, performs general graph-level optimizations on the IR, and finally generates machine code for each target platform at the back end.

The goal of a deep learning compiler is to perform performance optimization and hardware adaptation for AI computing tasks in the manner of a general-purpose compiler: it lets users focus on upper-layer model development, reduces the manual cost of hand-tuned performance optimization, and squeezes more performance out of the hardware.

3 Bottlenecks on the way to large-scale application

Although deep learning compilers today share many similarities with traditional compilers in goals and technical architecture, and have shown good potential, their practical scope of application still lags well behind that of traditional compilers. The main difficulties include:

Ease of use

The original intention of deep learning compilers is to reduce the manual cost of performance optimization and hardware adaptation. At this stage, however, large-scale deployment and application of deep learning compilers is still a big challenge, and the bar for using a compiler remains relatively high. The main reasons include:

Integration with front-end frameworks. Different frameworks have different abstractions and API surfaces for deep learning tasks, with their own semantics and mechanisms, and the set of operator types that can appear at the compiler's input is open-ended. How to transparently support a user's computation graph description without guaranteeing full coverage of every operator is one of the key factors in whether a deep learning compiler can be widely adopted.

Dynamic shapes and dynamic computation graphs. At this stage, mainstream deep learning compilers mostly compile for specific static input shapes, and offer limited or no support for dynamic computation graphs containing control-flow semantics. Yet AI applications contain plenty of such workloads. In those cases users can only rewrite the computation graph into a static or semi-static one by hand, or find a way to carve out compiler-friendly subgraphs and hand only those to the compiler. This clearly increases the engineering burden of applying a deep learning compiler; worse, many workloads simply cannot be rewritten statically by hand, which makes the compiler useless for them.

Compilation overhead. As a performance optimization tool, a deep learning compiler has real practical value only if its compilation overhead is small enough compared with the performance benefit it brings. Some application scenarios have tight requirements: for a training task of ordinary scale that takes a few days, a compilation overhead of several hours may be unacceptable. For application engineers, a compiler that slows down model debugging also adds to the difficulty and burden of development and deployment.

Transparency to users. Some AI compilers are not fully automatic; their performance depends on high-level abstract implementation templates provided by users. They are mainly efficiency tools for operator development engineers, reducing the manual cost of tuning individual operator implementations, but they place relatively high demands on the user's experience with operator development and familiarity with the hardware architecture. In addition, for software developers of new hardware, the existing abstractions are often insufficient to describe the operator implementations needed on innovative hardware architectures, and they need to know the compiler architecture well enough to do secondary development or even structural refactoring; the bar and the development burden remain high.

Robustness

At present, most mainstream AI compiler projects are still experimental products, with a considerable gap between their maturity and the demands of industrial application. Robustness here includes whether the compilation of an input computation graph succeeds at all, whether the computed results are correct, and whether extreme corner cases with bad performance are avoided.

Performance

In essence, compiler optimization replaces the manual cost of hand-tuning with a limited compilation overhead, by generalizing and abstracting manual optimization techniques, or optimization methods that humans do not easily discover. How to distill and abstract that methodology is the most fundamental and difficult problem in the whole chain. A deep learning compiler really delivers value only when it can genuinely match or surpass manual optimization in performance, or genuinely reduce labor costs by a large margin.

Achieving this goal, however, is not easy. Most deep learning tasks are tensor-level computations with high demands on how parallel work is split. How to fold manual optimization techniques into the compiler in a generalized way, how to avoid an explosion of compilation overhead, and how to coordinate optimizations across layers once the stack is layered are all questions with many unknowns left to explore. These have also become the problems that the next generation of deep learning compilers, represented by the MLIR framework, needs to focus on solving.

Four: Main technical characteristics of BladeDISC

The project started with the goal of solving the static shape limitations of the then-current versions of XLA and TVM. It was internally named DISC (DynamIc Shape Compiler), with the aim of building a deep learning compiler that fully supports dynamic shape semantics and can be used in real business.

Since the team began working on deep learning compilers four years ago, the dynamic shape problem has been one of the serious obstacles to putting them into real business use. At that time, mainstream frameworks, including XLA, were compilers built on static shape semantics. The typical approach required the user to specify the input shapes, or had the compiler capture the actual input shape combinations of the subgraph to be compiled at runtime and generate one compilation result per shape combination.
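
To make the "one compilation result per shape combination" behavior concrete, the sketch below uses TensorFlow's XLA integration (`tf.function(jit_compile=True)`) purely as an illustration of static-shape compilation: every new input shape leads to a new trace and a new compiled executable.

```python
import tensorflow as tf

@tf.function(jit_compile=True)   # XLA-style static-shape compilation
def dense_relu(x, w):
    return tf.nn.relu(tf.matmul(x, w))

w = tf.random.normal([128, 64])
for batch in (1, 7, 33):                 # three distinct input shape combinations
    x = tf.random.normal([batch, 128])
    dense_relu(x, w)                     # each new shape triggers a fresh compilation
```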

The advantages of a static shape compiler are obvious: when complete static shape information is available at compile time, the compiler can make better optimization decisions and produce better CodeGen performance, as well as better memory/GPU-memory optimization plans and execution schedules. The disadvantages, however, are also very obvious, and include:

Significantly increased compilation overhead. An offline compilation warm-up process is introduced, which greatly complicates the deployment of inference tasks; for training, iteration speed becomes unstable and the overall training time may even regress. In some business scenarios the range of shape variation is effectively unbounded, so the compilation cache never converges and the approach becomes unusable.

Increased memory usage. The extra host and GPU memory consumed by the compilation cache often leads to OOM in the actual deployment environment, which directly blocks the business from going live.

Mitigations such as manually padding inputs to static shapes are not user-friendly; they greatly reduce the generality and transparency of the deployment and hurt iteration efficiency (illustrated below).
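
The kind of manual padding mitigation referred to above can be sketched as follows; the bucketing helper is hypothetical and not part of any framework. Sequences are padded up to a small set of fixed lengths so that a static-shape compiler only ever sees a few shapes, at the cost of wasted computation and hand-tuned bucket boundaries.

```python
import numpy as np

BUCKETS = (32, 64, 128)  # hand-picked sequence-length buckets

def pad_to_bucket(ids: np.ndarray) -> np.ndarray:
    """Pad [batch, seq_len] token ids to the nearest bucket length so that only
    len(BUCKETS) static shapes ever reach the compiler."""
    seq_len = ids.shape[1]
    bucket = min(b for b in BUCKETS if b >= seq_len)  # assumes seq_len <= max(BUCKETS)
    return np.pad(ids, ((0, 0), (0, bucket - seq_len)))

padded = pad_to_bucket(np.ones((8, 45), dtype=np.int64))
print(padded.shape)  # (8, 64): the 19 padding columns are wasted compute
```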

In the summer of 2020, DISC completed its first version, supporting only the TensorFlow front end and the Nvidia GPU back end, and was officially launched inside Alibaba for real use. It was first applied in several business scenarios that had long been troubled by dynamic shape problems and achieved the expected results: with a single compilation and no special handling of the computation graph by the user, it fully supports dynamic shape semantics with performance close to that of a static shape compiler. Compared with optimization frameworks based on hand-written operator libraries such as TensorRT, DISC's compiler-based automatic codegen architecture showed clear advantages in both performance and ease of use on the non-standard, often customized models common in real business.

Since the second quarter of 2020, DISC has continued to invest in R&D. Aiming at the aforementioned bottlenecks that keep deep learning compilers from large-scale deployment from a cloud platform's perspective, it has gradually improved performance, operator coverage and robustness, CPU and new-hardware support, and front-end framework support. In terms of scenario coverage and performance, it has gradually replaced the team's earlier work based on static shape frameworks such as XLA and TVM, and has become the main optimization path by which PAI-Blade supports Alibaba's internal and external businesses. After 2021, DISC's performance on CPU and GPGPU back ends was further improved significantly, and more engineering effort was put into supporting new hardware. At the end of 2021, in order to encourage broader technical exchange, cooperation and co-construction, as well as wider user feedback, the project was officially renamed BladeDISC and its first open-source version was released.

Five: Key technologies

The overall architecture of BladeDISC and its positioning within Alibaba Cloud's related products are shown in the figure below:

1 MLIR infrastructure

MLIR is a project initiated by Google in 2019. Its core is a flexible set of multi-level IR infrastructure and compiler utility libraries, deeply influenced by LLVM and reusing many of its good ideas. We chose to build on MLIR mainly because of its relatively rich infrastructure, its modular and easily extensible design, and its strong ability to glue different layers and dialects together.

2 Dynamic shape compilation

The figure above shows the main pass pipeline design of BladeDISC. Compared with the currently mainstream deep learning compiler projects, its main technical features are as follows:

Layer IR Design

BladeDISC chose HLO as the core graph-level IR for connecting to different front-end frameworks, but HLO was originally designed for XLA with purely static shape semantics. In static scenarios, the shapes in HLO IR are static, and all shape computations are folded into compile-time constants kept in the compilation result; in dynamic shape scenarios, the IR itself must be able to express shape computations and propagate dynamic shape information. BladeDISC has worked closely with the MHLO community since the start of the project: on top of XLA's HLO IR, it extended a set of IR with complete dynamic shape expressiveness and added the corresponding infrastructure and front-end operator conversion logic. This implementation has been fully upstreamed to the MHLO community, ensuring IR consistency for other MHLO-based projects.

Runtime shape calculation, memory management, and kernel scheduling

The main challenge of dynamic shape compilation is handling dynamic computation graph semantics within a static compilation flow. To fully support dynamic shapes, the compiled result must be able to perform shape inference at runtime, and code must be generated not only for data computation but also for shape computation. The computed shape information is then used for memory/GPU-memory management, parameter selection during kernel scheduling, and so on. BladeDISC's pass pipeline was designed with these requirements in mind and adopts a host-device joint codegen scheme. Taking the GPU back end as an example, shape computation, memory/GPU-memory allocation and release, hardware management, and kernel launch are all covered by automatically generated runtime code, in order to obtain a complete end-to-end dynamic shape solution and the best possible overall performance.
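
The following is only a conceptual Python sketch of what the host-side code generated for a dynamic-shape graph has to do at runtime; the `ctx` object and the kernel name are hypothetical and do not correspond to BladeDISC's actual generated code.

```python
def compiled_matmul_relu(ctx, x, w):
    # 1. Runtime shape inference: output shapes depend on the actual inputs.
    m, k = x.shape
    k2, n = w.shape
    assert k == k2, "contraction dimensions must match"

    # 2. Memory/GPU-memory management driven by the derived shapes.
    out = ctx.alloc(shape=(m, n), dtype=x.dtype)

    # 3. Kernel scheduling parameters chosen from the runtime shapes.
    grid = ((m + 63) // 64, (n + 63) // 64)
    ctx.launch("fused_matmul_relu_kernel", grid=grid, args=(x, w, out))
    return out
```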

Performance issues under dynamic shape

When shapes are unknown or only partially known, the performance challenges facing a deep learning compiler are further amplified. On most mainstream hardware back ends, BladeDISC handles the compute-intensive and memory-intensive parts with different strategies, so as to strike a good balance between performance on one hand and complexity and compilation overhead on the other.

For the compute-intensive part, different shapes require more refined schedules to achieve good performance. The pass pipeline is designed mainly to allow selecting, at runtime, the appropriate operator-library implementation for the concrete shapes, and to handle layout issues under dynamic shape semantics.

Automatic fusion of the memory-intensive part, one of the main sources of a deep learning compiler's performance gains, also faces challenges when shapes are unknown. Many decisions that are deterministic under static shape semantics, such as instruction-level vectorization, codegen template selection, and whether implicit broadcasting is needed, become much more complex in dynamic shape scenarios. For these problems, BladeDISC chose to move some optimization decisions from compile time to runtime: multiple kernel versions are generated at compile time according to certain rules, and the best implementation is selected automatically at runtime based on the actual shapes. This mechanism, called speculation, is implemented in BladeDISC on top of host-device joint code generation (see the conceptual sketch below). In addition, without concrete shape values at compile time it is easy to lose optimization opportunities at every level, from graph-level linear algebra simplification and fusion decisions down to instruction-level CSE and constant folding. During the design of its IR and pass pipeline, BladeDISC paid special attention to abstracting shape constraints in the IR and using them throughout the pipeline, for example the constraint relationships between dimension sizes that are unknown at compile time. These play a visible role in overall performance, helping the results stay close to, and sometimes exceed, those of a static shape compiler.
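
Below is a conceptual sketch of the speculation mechanism described above; all names are hypothetical. Several kernel versions are emitted at compile time, and the generated host code picks one per execution based on the properties of the actual shapes.

```python
def run_fused_kernel(ctx, inputs):
    innermost = inputs[0].shape[-1]
    if innermost % 4 == 0:
        # Vectorized variant: only valid when the innermost dimension is a multiple of 4.
        ctx.launch("fused_kernel_vec4", args=inputs)
    elif any(t.shape != inputs[0].shape for t in inputs):
        # Variant that handles implicit broadcasting between mismatched shapes.
        ctx.launch("fused_kernel_broadcast", args=inputs)
    else:
        # Generic fallback that is always correct.
        ctx.launch("fused_kernel_generic", args=inputs)
```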

Large granularity operator fusion

Before starting the BladeDISC project, the team had already explored large-granularity operator fusion and automatic code generation on top of a static shape compiler. The basic idea is to use shared memory on GPU, or low-access-overhead Memory Cache on CPU, to stitch computation subgraphs with different schedules into the same kernel, compounding multiple parallel loops. This codegen approach is called fusion-stitching. This automatic code generation for memory-intensive subgraphs breaks the limits that conventional loop fusion and input/output fusion place on fusion granularity: while keeping code-generation quality, it greatly increases fusion granularity without exploding complexity or compilation overhead. The whole process is completely transparent to the user, with no need to manually specify schedule descriptions.

Implementing fusion-stitching under dynamic shape semantics involves even more complexity than under static shape semantics. The shape-constraint abstraction under dynamic shapes reduces this complexity to some extent, allowing overall performance to approach, and in some cases exceed, that of hand-written operator implementations.

3 Support for multiple front-end frameworks

The AICompiler framework was designed from the start with extension to different front-end frameworks in mind. On the PyTorch side, a lightweight Converter translates TorchScript into DHLO IR, covering PyTorch inference jobs; MLIR's relatively complete IR infrastructure makes such a Converter easier to build. BladeDISC consists of the Compiler and a Bridge layer that adapts to the different front-end frameworks. The Bridge is further split into a graph-level pass inside the host framework and runtime Ops, which connect to the host framework as a plug-in. This way of working lets BladeDISC transparently support front-end computation graphs and adapt to the various versions of the user's host framework.
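
As a hedged example of the PyTorch path, the snippet below follows the `torch_blade.optimize` entry point shown in the project's quick-start material; the exact signature may differ between releases. TorchScript is obtained by tracing, lowered to DHLO IR by the Converter, and the compiled subgraphs then run through the plug-in runtime Ops.

```python
import torch
import torch_blade  # BladeDISC's PyTorch plug-in (name as used in its examples)

class TinyMLP(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(128, 64)

    def forward(self, x):
        return torch.relu(self.fc(x))

model = TinyMLP().eval()
example = torch.randn(8, 128)

with torch.no_grad():
    # Trace to TorchScript and hand compiler-friendly subgraphs to BladeDISC.
    opt_model = torch_blade.optimize(model, allow_tracing=True, model_inputs=(example,))

print(opt_model(example).shape)  # compiled subgraphs execute via the plug-in runtime Ops
```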

4 Runtime Environment Adaptation

To execute the compiled results in the respective runtime environments of hosts such as TensorFlow and PyTorch, and to manage at runtime the state information that is hard to express at the IR level, we implemented a unified Compiler architecture for different runtime environments and introduced a runtime abstraction layer, RAL (Runtime Abstraction Layer).

RAL adapts to various runtime environments, from which users can choose according to their needs, including:

* Whole-graph compilation with standalone execution. When the entire computation graph can be compiled, RAL provides a simple runtime together with an implementation of the RAL Driver on top of it, so that the compiler's output can run directly without the framework, reducing framework overhead.
* Subgraph compilation and execution inside TensorFlow.
* Subgraph compilation and execution inside PyTorch.

These environments differ in resource management and API semantics. RAL abstracts a minimal set of APIs with clearly defined semantics, isolating the compiler from the runtime so that the compiled results can be executed in different environments. In addition, the RAL layer implements stateless compilation, which solves the problem of handling state when a compiled graph is executed many times: it simplifies code generation on one hand, and on the other makes it easier to support multi-threaded concurrent execution (for example in inference) as well as error handling and rollback.
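
The sketch below is a hypothetical, minimal rendering in Python of the RAL idea described above, not the actual interface: the compiled entry point is stateless and only interacts with a small context API, so the same compilation result can be driven by TensorFlow, PyTorch, or a standalone runtime.

```python
from typing import Any, Protocol, Sequence

class RalContext(Protocol):
    """Hypothetical minimal context API; each hosting environment supplies its own implementation."""
    def alloc(self, nbytes: int) -> Any: ...
    def dealloc(self, buf: Any) -> None: ...
    def launch_kernel(self, name: str, args: Sequence[Any]) -> None: ...

def compiled_entry(ctx: RalContext, inputs: Sequence[Any]) -> Sequence[Any]:
    # Stateless compilation: all resources and device interaction go through ctx,
    # which makes concurrent execution, error handling, and rollback easier to support.
    out = ctx.alloc(4 * 8 * 64)                        # e.g. an 8x64 float32 output buffer
    ctx.launch_kernel("fused_kernel_0", [*inputs, out])
    return [out]
```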

Six: Application scenarios

BladeDISC's typical application scenarios can loosely be divided into two categories. One is as a general-purpose, transparent performance optimization tool on mainstream hardware platforms (including Nvidia GPU, x86 CPU, etc.), reducing the manpower needed to deploy AI jobs and improving model iteration efficiency. The other important scenario is helping new hardware with AI workload adaptation and access support.

At present, BladeDISC is widely used by Alibaba's internal users and by external users on Alibaba Cloud across many different application scenarios. The covered model types include typical AI applications in NLP, machine translation, speech ASR, speech TTS, image detection and recognition, AI for science, and more; the covered industries include the Internet, e-commerce, autonomous driving, security, online entertainment, healthcare and biotech, and others.

In inference scenarios, BladeDISC complements vendor-provided inference optimization tools such as TensorRT well. Its main differentiating advantages include:

* Complete dynamic shape semantics support for dynamic shape workloads
* Performance advantages on non-standard models, thanks to the model generalization of the compiler-based technology path
* More flexible choice of deployment modes, including the transparency advantage of supporting the front-end framework as a plug-in

On new hardware support, the general situation is that, apart from leading vendors such as Nvidia with deep accumulated expertise, most GPGPU hardware, including ROCm-based devices, already has quite competitive hardware specs; but these vendors' accumulation on the AI software stack is relatively thin, so the common problem is that the hardware's computing power cannot be fully used and the hardware is hard to put into production. As mentioned above, the compiler-based technology path naturally generalizes across hardware back ends and complements the technical strengths of hardware vendors. BladeDISC currently has relatively mature support for GPGPU and general-purpose CPU architectures. Taking GPGPU as an example, most of the technology stack on Nvidia GPU can be migrated to hardware with similar architectures, such as Haiguang DCU and AMD GPU. BladeDISC's strong hardware generalization capability, combined with the strong general-purpose nature of the hardware itself, addresses the performance and usability problems of adapting new hardware.

Seven: Open source ecology - vision and future

We decided to build an open source ecosystem mainly based on the following considerations:

BladeDISC originated from the business needs of the Alibaba Cloud computing platform team. During its development, discussions and exchanges with peers in the MLIR/MHLO/IREE and other communities gave us valuable input and reference. As we keep improving it alongside the iteration of business needs, we also want to open source it to the community. At present, most projects in the AI compiler field are still experimental, there are few truly production-ready products, and work across different technology stacks is relatively fragmented. We hope to give our experience and understanding back to the community, and to have more and better exchanges and joint work with deep learning compiler developers and AI system practitioners, contributing our technical strength to the industry.
We also hope that open sourcing will bring us more user feedback from real business scenarios, helping us continue to improve and iterate and providing input for future work.
Going forward, we plan to publish release versions regularly, roughly every two months. BladeDISC's near-term roadmap is as follows:

* Ongoing robustness and performance improvements
* Complete support for compute-intensive operators in the x86 backend, and complete end-to-end open-source x86 backend support
* Large-granularity automatic code generation based on stitching on GPGPU
* AMD ROCm GPU backend support
* Support for PyTorch training scenarios

In the medium and long term, we will continue to invest in the exploratory directions below. Feedback, suggestions, and technical discussion of any kind are welcome, and we look forward to building the open source community together with anyone who is interested.

* Support for and adaptation to more new hardware architectures, and distillation of a software-hardware co-design methodology for new hardware
* Exploration of automatic code generation for compute-intensive operators and of global layout optimization under dynamic shape semantics
* Optimization exploration of sparse subgraphs
* Exploration of runtime scheduling strategies, memory/GPU-memory optimization, etc. under dynamic shape semantics
* Technical exploration of combining model compression with compilation optimization
* Support for and optimization of more AI job types, such as graph neural networks
