
Ultra Low Latency and High Performance Deep Learning Processor with FPGA

Alibaba's Infrastructure Service Group and the algorithm team from Machine Intelligence Technologies have successfully developed an ultra low latency and high performance deep learning processor (DLP) on FPGA.

Image recognition and analysis have always been a staple of Alibaba's research and development (R&D) projects and have played an important role in Alibaba's product innovation. However, these applications typically involve heavy workloads with strict requirements on service quality. Current solutions such as GPUs cannot deliver low latency and high performance at the same time.

In order to provide a good user experience while applying deep learning, Alibaba's Infrastructure Service Group and the algorithm team from Machine Intelligence Technologies have architected an ultra low latency and high performance DLP (deep learning processor) on FPGA.

The DLP FPGA supports sparse convolution and low precision data computing at the same time, and a customized ISA (Instruction Set Architecture) was defined to meet the requirements for flexibility and user experience. Latency test results with ResNet18 (sparse kernel) show that Alibaba's FPGA has a latency of only 0.174 ms.

In this article, we will briefly discuss how Alibaba and the team from Machine Intelligence Technologies are able to achieve such a feat with the new DLP FPGA.


Alibaba's newly developed DLP has four types of modules, classified by function.

  1. Computing: Convolution, Batch Normalization, Activation and other calculations
  2. Data Path: Data storage, movement and reshaping
  3. Parameter: Storage of weights and other parameters, decoding
  4. Instruction: Instruction unit and global control


The Protocol Engine (PE) in the DLP supports:

  1. Int4 data type input.
  2. Int32 data type output.
  3. Int16 quantization.
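To make the three data types above concrete, here is a minimal software sketch of one multiply-accumulate path: Int4 operands, an Int32 accumulator, and a saturating step down to Int16. The function names, the scale of 0.25, and the use of a right shift for the Int16 step are assumptions made for illustration; the article does not describe how the PE implements these stages.

```python
def quantize_int4(values, scale):
    """Round floats to signed 4-bit integers, saturating at [-8, 7]."""
    return [max(-8, min(7, round(v / scale))) for v in values]


def mac_int32(a_q, w_q):
    """Multiply-accumulate Int4 operands, saturating like a 32-bit accumulator."""
    acc = sum(a * w for a, w in zip(a_q, w_q))
    return max(-2**31, min(2**31 - 1, acc))


def requantize_int16(acc, shift):
    """Scale the accumulator down by a right shift (assumed) and saturate to Int16."""
    return max(-32768, min(32767, acc >> shift))


# Hypothetical activations and weights, both quantized with an assumed scale of 0.25.
a_q = quantize_int4([0.5, -1.0, 0.25, 1.75], scale=0.25)   # [2, -4, 1, 7]
w_q = quantize_int4([1.0, 0.5, -2.0, 0.25], scale=0.25)    # [4, 2, -8, 1]
acc = mac_int32(a_q, w_q)                                   # 8 - 8 - 8 + 7 = -1
out = requantize_int16(acc, shift=0)
```

The wide accumulator is what lets many narrow Int4 products be summed without overflow before the result is narrowed again.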

The PE also achieves over 90% efficiency. Furthermore, the DLP's weight loading path supports a CSR (compressed sparse row) decoder and data pre-fetching.
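As a rough illustration of what a CSR decoder does, the sketch below stores a small sparse weight matrix as three arrays (non-zero values, their column indices, and per-row offsets) and expands it back to dense form. This is the textbook CSR layout written in software; the function names and the example matrix are hypothetical, and the hardware decoder's actual format details are not given in the article.

```python
def csr_encode(dense):
    """Compress a dense matrix (list of rows) into CSR arrays."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for col, value in enumerate(row):
            if value != 0:
                values.append(value)
                col_idx.append(col)
        # row_ptr[r + 1] marks where row r ends in `values`.
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


def csr_decode(values, col_idx, row_ptr, n_cols):
    """Expand CSR arrays back into a dense matrix."""
    dense = []
    for r in range(len(row_ptr) - 1):
        row = [0] * n_cols
        for k in range(row_ptr[r], row_ptr[r + 1]):
            row[col_idx[k]] = values[k]
        dense.append(row)
    return dense


weights = [[0, 3, 0, 0],
           [0, 0, 0, 0],
           [-2, 0, 0, 5]]
values, col_idx, row_ptr = csr_encode(weights)
# values == [3, -2, 5], col_idx == [1, 0, 3], row_ptr == [0, 1, 1, 3]
restored = csr_decode(values, col_idx, row_ptr, n_cols=4)
```

Only the three non-zero weights are stored, which is why a sparse kernel reduces the amount of weight data that has to be loaded.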


Re-training is needed to develop a high accuracy model. Four main steps are used to obtain both sparse weights and low precision feature maps.


We used an effective method to train the ResNet18 model to be sparse and low precision (arXiv:1707.09870). The key component in our method is discretization. We focused on compressing and accelerating deep models whose network weights are represented by a very small number of bits, referred to as extremely low bit neural networks. We then modeled this as a discretely constrained optimization problem.

Borrowing the idea from the Alternating Direction Method of Multipliers (ADMM), we decoupled the continuous parameters from the discrete constraints of the network and cast the original hard problem into several subproblems. We proposed to solve these subproblems using extragradient and iterative quantization algorithms, which leads to considerably faster convergence than conventional optimization methods.
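The decoupling described above can be sketched on a toy problem. The code below is not the authors' training procedure: it assumes a simple quadratic loss in place of a network loss, uses a closed-form update where the text mentions extragradient, and restricts weights to a scaled ternary set {-alpha, 0, +alpha}. All names and values are hypothetical. What it does show is the three alternating steps: a continuous update, a projection onto the discrete set by iterative quantization, and a dual update.

```python
def project_ternary(v, iterations=20):
    """Alternately fit a scale alpha and codes q in {-1, 0, 1} to v."""
    alpha = sum(abs(x) for x in v) / len(v) or 1.0  # fall back to 1.0 if v is all zeros
    q = [0] * len(v)
    for _ in range(iterations):
        # With alpha fixed, the best code is the nearest allowed level.
        q = [max(-1, min(1, round(x / alpha))) for x in v]
        energy = sum(t * t for t in q)
        if energy == 0:
            break
        # With q fixed, the best scale is a least-squares fit.
        alpha = sum(x * t for x, t in zip(v, q)) / energy
    return alpha, q


def admm_quantize(target, rho=1.0, steps=50):
    """Toy ADMM loop for the assumed loss f(W) = 0.5 * ||W - target||^2."""
    n = len(target)
    g = [0.0] * n      # discrete copy of the weights
    lam = [0.0] * n    # scaled dual variable
    alpha, q = 1.0, [0] * n
    for _ in range(steps):
        # Continuous step: closed-form minimiser of f(W) + rho/2 * ||W - G + lam||^2.
        w = [(t + rho * (gi - li)) / (1.0 + rho) for t, gi, li in zip(target, g, lam)]
        # Discrete step: project W + lam onto the scaled ternary set.
        alpha, q = project_ternary([wi + li for wi, li in zip(w, lam)])
        g = [alpha * t for t in q]
        # Dual step: accumulate the remaining disagreement between W and G.
        lam = [li + wi - gi for li, wi, gi in zip(lam, w, g)]
    return alpha, q


alpha, q = admm_quantize([0.9, -1.1, 0.05, 0.0, 1.0])
# q == [1, -1, 0, 0, 1]: the small weights collapse to zero (sparsity),
# the large ones share a single scale alpha (low precision).
```

Keeping the continuous copy and the discrete copy separate is what allows each subproblem to be solved with a method suited to it.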

Extensive experiments on image recognition and object detection verify that the proposed algorithm is more effective than state-of-the-art approaches for extremely low bit neural networks.



As mentioned previously, low latency alone is not enough for most online services and usage scenarios, because the algorithm model changes frequently. The FPGA development cycle is very long; a customized design usually takes a few weeks or months to finish. To solve this challenge, we designed an instruction set architecture (ISA) and a compiler that reduce model upgrade time to only a few minutes.


The SW-HW co-development platform consists of the following items:

  1. Compiler: Model graph analysis and instruction generation.
  2. API/Driver: CPU-FPGA DMA picture reshape, weight compression.
  3. ISA Controller: Instruction decoding, task scheduling, multi-thread pipeline management.
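The division of labor listed above can be illustrated with a toy compiler that walks a model graph and emits an instruction stream. Everything in this sketch is hypothetical: the opcode names, the `Instruction` fields, and the layer dictionary format are invented for illustration and are not the DLP's real ISA or compiler interface. The point is only that a changed model yields a new instruction list, while the hardware design itself stays fixed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instruction:
    opcode: str        # hypothetical opcode name
    operands: tuple    # hypothetical buffer or parameter names


# Assumed mapping from layer types to opcodes (not the real ISA).
OPCODES = {"conv": "CONV", "batch_norm": "BN", "relu": "ACT"}


def compile_graph(layers):
    """Turn an ordered list of layer descriptions into instructions."""
    program = []
    for index, layer in enumerate(layers):
        if layer["type"] not in OPCODES:
            raise ValueError(f"unsupported layer type: {layer['type']}")
        weights = layer.get("weights")
        if weights is not None:
            # Parameters are loaded before the layer that consumes them.
            program.append(Instruction("LOAD_W", (weights,)))
        src, dst = f"buf{index}", f"buf{index + 1}"
        program.append(Instruction(OPCODES[layer["type"]], (src, dst)))
    return program


model = [
    {"type": "conv", "weights": "conv1.csr"},
    {"type": "batch_norm", "weights": "bn1.params"},
    {"type": "relu"},
]
program = compile_graph(model)
opcodes = [ins.opcode for ins in program]
# opcodes == ["LOAD_W", "CONV", "LOAD_W", "BN", "ACT"]
```

Under this model, upgrading the network means recompiling the list of layers rather than redesigning the FPGA logic.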

Hardware Card

The DLP was implemented on an Alibaba-designed FPGA card, which has a PCIe interface and DDR4 memory. Combined with this FPGA card, the DLP can benefit application scenarios such as online image search on Alibaba.




FPGA test results with ResNet18 show that our design achieves ultra low latency while maintaining very high performance, with chip power below 70 W.



