How to Break the Dilemma of Limited Computing Power on Edge Chips
1. Background
1.1 No chip, no AI
The rapid development of the artificial intelligence industry, whether in the realization of algorithms, the acquisition and storage of massive data, or the provision of computing power, is inseparable from its only physical foundation: chips. Whether AI chips that meet market and business needs are available is therefore crucial.
AI chips can be divided into four quadrants according to whether the target application is training or inference and whether it runs in the cloud or on a terminal, as shown in the figure above.
Cloud training: Most deep neural networks are trained in the cloud. In this market, NVIDIA GPUs are the most widely used.
Cloud inference: Unlike model training, many companies have launched dedicated chips for cloud inference, including Google's TPU, Intel's Nervana series, and Cambricon's MLU100. Pingtouge plans to launch its first AI chip, the Ali-NPU, for cloud inference this year.
Edge inference: Compared with the cloud, the edge currently focuses mainly on inference. Injecting AI computing power into the edge and empowering edge intelligence is the general trend, and more and more AI applications are being developed and deployed on edge devices.
Edge training: At the same time, as Google proposed Federated Learning (FL) and launched the first commercial-grade edge-side distributed machine learning system, learning and training on edge and embedded devices is becoming increasingly important. Compared with training in the cloud, training on edge devices better protects user privacy, enables more personalized services, makes the model smarter, and lets users experience updated models as quickly as possible.
1.2 Why is edge computing necessary?
Because edge computing offers low latency, bandwidth savings, offline availability, and privacy protection, many applications are better suited to inference on edge devices; intelligence is sinking to terminal devices, and intelligent edge computing is on the rise. For example, in video surveillance, processing the data of millions of high-definition cameras entirely in the cloud would put enormous pressure on the network. Likewise, inference for autonomous driving cannot run in the cloud; otherwise, a network problem could have catastrophic consequences.
1.3 Why a self-developed algorithm engine?
Edge devices are far more than just cameras and mobile phones. Their application scenarios vary widely, from autonomous driving with huge computing demands, to food-delivery robots with multi-sensor fusion, to wearable devices that are sensitive to power consumption and cost. In current AI application scenarios, edge devices mainly perform inference, which requires edge-device chips to have sufficient computing power. At present, however, the computing power of edge processor chips is rather limited. It is the engine's responsibility to make better use of this limited computing power, shield the underlying hardware details, and enable businesses to be implemented quickly.
ACE (AI Labs Compute Engine) is the computing engine in our device-cloud integration that supports heterogeneous computing with CPUs, GPUs, DSPs, and dedicated AI chips on edge devices. For self-developed hardware, we start from the selection of the main chip and the data-processing chip and cooperate deeply with chip manufacturers, carrying out in-depth customization and optimization from the driver layer through the HAL layer to the system layer, to build an algorithm engine for proprietary self-developed hardware that supports upper-level business well.
Taking Tmall Genie as an example, we worked closely with our algorithm colleagues to customize and optimize a certain low-end chip for the actual business. Through ACE, we successfully deployed algorithms such as gesture recognition that have relatively high real-time computing requirements.
2. Architecture overview
Computing engine:
Computing layer: accelerates algorithms through model quantization, heterogeneous computing, memory-friendly design, and assembly optimization;
Access layer: arranges algorithm services as calculation graphs and provides commonly used operators to shorten the development cycle.
Model management:
Cloud: connects to the AutoAI platform to generate on-device models;
Edge: receives instructions from the cloud and actively reports information.
3. Computing engine
3.1 Computing layer
Artificial intelligence should not exist only in expensive enterprise services; we also want to bring the joy of AI into thousands of households, injecting it into smart devices such as smart TVs, IoT devices, and smart speakers so that more users can experience the changes AI brings. Because the computing power of these terminal smart devices is generally limited, we accelerate algorithms through model quantization, heterogeneous computing, memory-friendly design, and assembly optimization.
■ Model quantization
When we first applied the gesture algorithm to a product using a low-end computing chip, the detection model used in the cloud was a float32 model that took hundreds of milliseconds per frame. After initial optimization, the single-core time consumption was reduced to 130 ms, but this was still some distance from the ideal speed of 20 frames/s.
We adopted model quantization for further acceleration. Compared with floating-point calculation, quantized fixed-point calculation saves computing resources and memory. We then deeply optimized the quantized inference process.
After quantization, the single-core time consumption dropped to 59 ms, and the detection frame rate reached 17 frames/s, faster than the 4-core float32 model.
To achieve a better quantization acceleration effect, we applied conventional techniques such as memory-friendly design and assembly optimization, and introduced the QNNPACK acceleration library, which is designed specifically for accelerating quantized models and uses the same quantization scheme as TFLite. In the end, we shortened the single-core time of the gesture model to 41 ms, a total speedup of 3.17x, and saved 74% of the model memory.
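For context, the quantization scheme shared by TFLite and QNNPACK maps float32 values to uint8 with an affine transform defined by a scale and a zero point. The sketch below illustrates only that mapping; the tensor values, scale, and zero point are made-up examples, not parameters from our gesture model.

```python
import numpy as np

def quantize(x, scale, zero_point):
    # Affine (asymmetric) quantization: q = round(x / scale) + zero_point,
    # clamped to the uint8 range [0, 255].
    q = np.round(x / scale) + zero_point
    return np.clip(q, 0, 255).astype(np.uint8)

def dequantize(q, scale, zero_point):
    # Approximate recovery of the float value: x ~= scale * (q - zero_point).
    return scale * (q.astype(np.int32) - zero_point)

# Hypothetical tensor and quantization parameters, for illustration only.
x = np.array([-0.8, 0.0, 0.37, 0.99], dtype=np.float32)
scale, zero_point = 2.0 / 255, 128   # assumes values roughly in [-1.0, 1.0]
q = quantize(x, scale, zero_point)
print(q, dequantize(q, scale, zero_point))
```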
Taking the standard mobilenet_v2_1.0_224 model as an example, single-core quantization achieves a speedup of about 2.2x.
For multi-threaded scenarios, we also increased kernel parallelism; with two cores the speedup is about 3x.
■ Heterogeneous acceleration
Because many other services are already running on the CPU, the CPU resources available for computing are relatively tight, and the performance of general-purpose CPUs can no longer grow according to Moore's Law. Data growth, by contrast, demands computing performance that grows faster than Moore's Law would allow, leaving a huge gap between computing power requirements and actual performance. A mainstream solution is heterogeneous computing.
Unlike general-purpose computing, which pursues universality, special-purpose computing is optimized for specific scenarios; its performance and power efficiency are often orders of magnitude better than a general-purpose CPU's, but its application range is limited. The relationship between general and special purpose is like that between a generalist who knows a little about everything but excels at nothing, and a specialist who has studied a few fields deeply but knows little about the rest. Heterogeneous computing is typically a mixed system of a CPU plus special-purpose devices (GPU, DSP, VPU, FPGA, etc.), which use different instruction sets and architectures, working together on the computation.
Taking the business on a certain chip as an example, many services occupy the CPU while the GPU is relatively idle. In this situation, running the pen-tip detection algorithm on the CPU with 4 threads took 260 ms, with CPU utilization exceeding 240%, which had a great impact on other services. Using CPU+GPU heterogeneous computing instead takes 150 ms and reduces CPU utilization from 245% to 50%, making better use of the overall resources.
On this chip, besides the GPU, the heterogeneous computing resources also include a fixed-point accelerator, the VPU, which can be used once the pen-tip detection model has been quantized. The quantized pen-tip detection model takes about 76 ms on 4 CPU threads, a big improvement over float32's 260 ms, but it consumes all of the CPU's computing resources and keeps other services from running normally. Using CPU+VPU heterogeneous computing instead takes only 51 ms, frees most of the CPU, and reduces overall power consumption.
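The backend choices described above (CPU threads, CPU+GPU, CPU+VPU for quantized models) can be summarized as a simple selection policy. The sketch below only illustrates that policy under assumed conditions; the Runner classes and the 50% load threshold are hypothetical and do not represent ACE's actual interface.

```python
# A minimal sketch of heterogeneous backend selection. CpuRunner, GpuRunner,
# VpuRunner, and the 0.5 load threshold are hypothetical stand-ins for the
# vendor runtimes, not ACE's real interface.
class CpuRunner:
    def __init__(self, model, threads=4):
        self.model, self.threads = model, threads
    def run(self, frame):
        return f"CPU x{self.threads} inference on {self.model}"

class GpuRunner:
    def __init__(self, model):
        self.model = model
    def run(self, frame):
        return f"GPU inference on {self.model}"

class VpuRunner:
    def __init__(self, model):
        self.model = model
    def run(self, frame):
        return f"VPU fixed-point inference on {self.model}"

def pick_backend(model, is_quantized, cpu_load, has_gpu, has_vpu):
    # Prefer the VPU for quantized models, offload to the GPU when other
    # services already occupy the CPU, and otherwise fall back to CPU threads.
    if is_quantized and has_vpu:
        return VpuRunner(model)
    if has_gpu and cpu_load > 0.5:
        return GpuRunner(model)
    return CpuRunner(model)

runner = pick_backend("pen_tip_detection", is_quantized=True,
                      cpu_load=0.8, has_gpu=True, has_vpu=True)
print(runner.run(frame=None))
```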
3.2 Access layer
The role of the access layer is to simplify the algorithm development process, improve the efficiency of debugging and maintenance, and ultimately enable rapid business implementation. We mainly use the following approaches to achieve this goal:
Connect to the AutoAI one-stop solution to unify the entire process of model training, calculation graph construction, and model management;
Develop commonly used High Level & Low Level operators together with the algorithm team to build a modular operator library;
Provide an API/UI that simplifies calculation graph construction and supports packaging calculation graphs, models, configurations, and other resources into a single file, reducing management difficulty (see the sketch after this list);
Support hybrid calculation graphs that mix deep learning and traditional algorithms to reduce engineering code, and provide performance analysis, problem localization, and algorithm evaluation to lower debugging difficulty and speed up business implementation.
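To make this concrete, here is a minimal sketch of how an algorithm service could be described as a calculation graph of reusable operators and bundled with its resources into a single file. The graph format, operator names, and .ace file extension are assumptions for illustration, not ACE's actual API.

```python
import json
import zipfile

# Hypothetical calculation graph: a gesture service built from reusable operators.
graph = {
    "nodes": [
        {"name": "camera",  "op": "VideoSource",    "params": {"fps": 20}},
        {"name": "resize",  "op": "Resize",         "params": {"w": 224, "h": 224}},
        {"name": "detect",  "op": "QuantizedModel", "params": {"model": "gesture_uint8.bin"}},
        {"name": "track",   "op": "Tracker",        "params": {}},   # traditional (non-DL) operator
        {"name": "publish", "op": "ResultSink",     "params": {"topic": "gesture"}},
    ],
    "edges": [
        ["camera", "resize"], ["resize", "detect"],
        ["detect", "track"], ["track", "publish"],
    ],
}

# Bundle the graph description and configuration into a single file; model
# weights would be added the same way on a real device.
with zipfile.ZipFile("gesture_service.ace", "w") as pkg:
    pkg.writestr("graph.json", json.dumps(graph, indent=2))
    pkg.writestr("config.json", json.dumps({"log_level": "info"}))
```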
4. Model management
Model management is an important function of ACE. It consists of two parts, cloud and edge, and can flexibly manage along two dimensions: model and business. Previously, the edge could only update models through system software upgrades, which meant that the release and grayscale testing of new models had to follow the pace of software upgrades. Sometimes, however, we want to quickly test new models on a small set of devices, and that requires a model management system.
4.1 Cloud model management
The model management system is a cloud backend system that manages each algorithm model on the edge, and its core lives in the cloud. The structure of the cloud model management system is shown in the figure above. From the backend interface, the server can control the devices; the following operations on a device are currently supported (a sketch follows the list):
Query: query the model details on a certain device;
Download: download a certain model to the device;
Reload: switch a certain model/business on the device;
Reset: reset the device's models to their initial state (used only in emergencies).
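As an illustration of the four operations, the sketch below shows what cloud-to-edge control messages and a minimal edge-side handler might look like. The message fields, model names, and URL are hypothetical; this is not the real ACE protocol.

```python
# Hypothetical cloud-to-edge control messages for the four operations above.
commands = [
    {"op": "query",    "device": "device_001"},
    {"op": "download", "device": "device_001", "model": "gesture_v2",
     "url": "https://example.com/models/gesture_v2.bin"},
    {"op": "reload",   "device": "device_001", "business": "gesture", "model": "gesture_v2"},
    {"op": "reset",    "device": "device_001"},  # emergency only: restore the initial state
]

def handle(cmd, local_models):
    """Illustrative edge-side dispatcher; real code would download, verify, and hot-swap models."""
    op = cmd["op"]
    if op == "query":
        return dict(local_models)
    if op == "download":
        local_models[cmd["model"]] = cmd["url"]
    elif op == "reload":
        local_models["active:" + cmd["business"]] = cmd["model"]
    elif op == "reset":
        local_models.clear()
    return {"status": "ok"}

state = {}
for cmd in commands:
    print(cmd["op"], "->", handle(cmd, state))
```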
4.2 Edge model management
Generally speaking, model management is based only on the model dimension, that is, models are assumed to be independent of one another. In some cases this is not enough; we use pet detection and gesture detection in video as an example to illustrate model coupling.
As shown in the figure above, to save on-device storage and computation, the pet and gesture businesses share one model. Relying only on single-dimensional, model-based management therefore no longer meets the demand.
To solve this problem, we add a business dimension on top of the model dimension. A business is an abstraction above models; businesses and models have a many-to-many relationship, and several businesses may share the same model, which resolves the coupling problem well.
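A minimal sketch of this two-dimensional bookkeeping follows; the business and model names are made up for illustration. The point is that a shared model may be unloaded only when no business still references it.

```python
# Hypothetical many-to-many mapping between businesses and the models they use.
business_to_models = {
    "pet_detection":     ["shared_detector", "pet_classifier"],
    "gesture_detection": ["shared_detector", "gesture_classifier"],
}

def referenced_models(mapping):
    """Models still referenced by at least one business; only unreferenced models may be unloaded."""
    used = set()
    for models in mapping.values():
        used.update(models)
    return used

# Removing the gesture business must not unload "shared_detector",
# because the pet business still depends on it.
del business_to_models["gesture_detection"]
print(referenced_models(business_to_models))  # {'shared_detector', 'pet_classifier'}
```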
5. Looking to the future
The mission of ACE is to enable algorithms to be applied quickly and accurately to self-developed edge devices such as Tmall Genie and robots, and to support related upper-level services well. Although there are still many deficiencies, we will continue to improve its usability, keep optimizing and accelerating the lower layers, and cooperate more deeply with manufacturers to combine software and hardware and make reasonable use of the limited computing resources on the device.