BEVFormer-accelerate: Accelerate BEVFormer based on EasyCV


BEVFormer is a vision-only perception algorithm for autonomous driving. By fusing the spatial and temporal features of surround-view camera images, it explicitly generates BEV (bird's-eye-view) features with strong representational capacity and applies them to downstream tasks such as 3D detection and segmentation, achieving SOTA results. We integrated the BEVFormer algorithm into the open-source EasyCV framework and optimized the code for training speed and convergence speed. We also used the inference optimization tool PAI-Blade to further optimize the model. This article covers the following parts: 1. The idea behind the BEVFormer algorithm 2. Training speed and convergence speed optimization 3. Inference acceleration with PAI-Blade.
BEVFormer algorithm idea
As shown in the figure above, BEVFormer consists of the following three parts:
1. Backbone: extracts multi-scale, multi-camera features from the six surround-view camera images.
2. BEV encoder: this module mainly consists of two parts, Temporal Self-Attention and Spatial Cross-Attention.
a. Spatial Cross-Attention uses the intrinsic and extrinsic parameters of the cameras to query the multi-camera features at the corresponding positions, fusing them under a unified BEV perspective.
b. Temporal Self-Attention fuses the history BEV feature with the current BEV feature through a self-attention module.
c. Together, the two modules output a BEV feature that contains both multi-view and temporal information for the downstream 3D detection and segmentation tasks.
3. Det&Seg Head: the task heads for the specific tasks.
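To make the role of the camera parameters in Spatial Cross-Attention more concrete, here is a minimal sketch of the geometric lookup only. A 3D reference point is projected into each camera with a 3x4 projection matrix (intrinsics times extrinsics), and only the cameras in which the point lands inside the image are kept. The function names, the matrix layout and the toy cameras are assumptions made for illustration; they are not taken from the EasyCV or BEVFormer code, and the actual feature sampling around the projected points is not shown.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Matrix = List[List[float]]  # 3x4 projection matrix: intrinsics @ extrinsics


def project_point(point_xyz: Sequence[float], proj: Matrix,
                  img_w: int, img_h: int) -> Optional[Tuple[float, float]]:
    """Project a 3D point to pixel coordinates, or return None if not visible."""
    homo = (point_xyz[0], point_xyz[1], point_xyz[2], 1.0)
    u, v, depth = (sum(row[i] * homo[i] for i in range(4)) for row in proj)
    if depth <= 1e-5:  # point is behind (or on) the camera plane
        return None
    px, py = u / depth, v / depth
    if 0.0 <= px < img_w and 0.0 <= py < img_h:
        return (px, py)
    return None  # projects outside the image


def visible_cameras(point_xyz: Sequence[float], projs: Sequence[Matrix],
                    img_w: int, img_h: int) -> Dict[int, Tuple[float, float]]:
    """Return {camera index: pixel position} for the cameras that see the point."""
    hits = {}
    for cam_idx, proj in enumerate(projs):
        pixel = project_point(point_xyz, proj, img_w, img_h)
        if pixel is not None:
            hits[cam_idx] = pixel
    return hits


# Hypothetical toy cameras: FRONT looks along +z, BACK looks along -z.
FRONT = [[100.0, 0.0, 50.0, 0.0], [0.0, 100.0, 50.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
BACK = [[-100.0, 0.0, -50.0, 0.0], [0.0, 100.0, -50.0, 0.0], [0.0, 0.0, -1.0, 0.0]]
```

A point in front of the vehicle is then only looked up in the front camera, so each BEV query interacts with the few views that can actually see it rather than with all six images.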
BEVFormer training optimization
Training Acceleration Optimization
We optimize the training code in terms of data reading and memory-copy overhead.
• Data reading
• Use turbojpeg, a more efficient image decoding library.
• BEVFormer needs time-series data as input during training; we changed the serial reading of these frames to parallel reading.
• Resize first and then run the other preprocessing steps, which avoids the computation spent on extra pixels.
• Memory copy optimization
• Use pin_memory=True, and fix the pin_memory bug in mmcv's DataContainer.
• Replace numpy operations in the code with torch.tensor operations to avoid unnecessary host-to-device (h2d) copies.
• Other
• Use torch.backends.cudnn.benchmark=True (note: this should only be used when the input shapes are fixed; with dynamic inputs it increases training time).
• Fix the bug where torch.cuda.amp mixed precision failed at the LayerNorm layer.
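The change from serial to parallel reading of the time-series frames can be sketched with a thread pool from the standard library. This is a minimal illustration under stated assumptions, not the EasyCV implementation: `load_frame` stands for whatever function decodes one image, and `fake_load_frame` is a hypothetical stand-in used only to keep the example self-contained.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

Frame = TypeVar("Frame")


def load_sequence_serial(paths: Sequence[str],
                         load_frame: Callable[[str], Frame]) -> List[Frame]:
    """Baseline: decode the frames of one temporal sample one after another."""
    return [load_frame(path) for path in paths]


def load_sequence_parallel(paths: Sequence[str],
                           load_frame: Callable[[str], Frame],
                           max_workers: int = 4) -> List[Frame]:
    """Decode the frames concurrently; results keep the order of `paths`."""
    if not paths:
        return []
    workers = min(max_workers, len(paths))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in input order, so the temporal
        # order of the frames is preserved.
        return list(pool.map(load_frame, paths))


def fake_load_frame(path: str) -> str:
    """Hypothetical stand-in for reading and decoding one image file."""
    return "decoded:" + path
```

Threads only help when the per-frame work spends most of its time in file I/O or in native code that releases the GIL, which is common for image decoders but should be checked for the decoder in use.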
throughput (samples/s)
BEVFormer-tiny bs=32
EasyCV BEVFormer-tiny bs=32
9.84 (+177%)
BEVFormer-base bs=5
EasyCV BEVFormer-base bs=5
Precision Convergence Optimization
We use additional data augmentation methods and different loss functions to optimize the model. At the same time, additional training strategies are added to further improve the model convergence speed and accuracy.
• Data augmentation methods
• rand scale (training with inputs of different resolutions; in our experiments this added at least 20% extra training time, so it is not used in the experiments below)
• rand_flip (randomly flip the image with 50% probability)
• Loss function
• Use smooth L1 loss or balanced L1 loss instead of L1 loss. (In experiments on the mini dataset both losses improved accuracy; balanced L1 loss is used in the experiments below.)
• Training strategy
• Using the one2many branch
This approach comes from H-Deformable-DETR. DETR-style detectors assign GT boxes with one2one matching. This lets the model skip the redundant NMS post-processing at test time, but only a small number of queries are assigned as positive samples, so the model converges much more slowly during training than with one2many matching. We therefore add auxiliary queries during training so that the same GT box is matched to multiple auxiliary queries, and use an attention mask to isolate the information of the one2one branch from that of the one2many branch. This significantly improves convergence speed during training, and only the one2one branch needs to be kept for prediction at test time. (In our experiments we added 1800 auxiliary queries, and each GT box was matched to 4 queries for training.)
• CBGS in the one2many branch
Our experiments are carried out on the NuScenes dataset. Its 3D detection task has 10 classes, and the number of samples per class is extremely unbalanced. Many algorithms use CBGS to balance samples between classes, but this expands the entire dataset by 4.5 times. It brings some accuracy improvement, but also a huge training cost. We instead balance samples within the one2many branch: classes with many instances are matched with fewer auxiliary queries, and long-tailed classes are matched with more. With CBGS in the one2many branch, convergence speed improves further while training time stays consistent with the baseline, and the final accuracy also improves to some extent. (The number of matches per GT box for the 10 classes changes in the experiment: [4, 4, 4, 4, 4, 4, 4, 4, 4, 4] -> [2, 3, 7, 7, 9, 6, 7, 6, 2, 5])
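The two ingredients of the one2many branch described above can be sketched in a few lines: an attention mask that keeps the one2one and one2many query groups from seeing each other, and a per-class repetition of GT boxes that decides how many auxiliary queries each box can be matched to. This is a rough sketch under assumptions, not the EasyCV code. The function names are hypothetical, the mapping of the ten counts to class indices is assumed to follow the list order, and the Hungarian matching that would consume the repeated GT list is not shown. The mask uses the convention of PyTorch boolean attention masks, where True means "not allowed to attend".

```python
from typing import List, Sequence


def build_branch_mask(num_one2one: int, num_one2many: int) -> List[List[bool]]:
    """Self-attention mask; True means query i may NOT attend to query j.

    Queries [0, num_one2one) form the one2one branch, the rest form the
    one2many branch. Attention is blocked only across the two branches.
    """
    total = num_one2one + num_one2many
    return [[(i < num_one2one) != (j < num_one2one) for j in range(total)]
            for i in range(total)]


def repeat_gt_indices(gt_labels: Sequence[int],
                      repeats_per_class: Sequence[int]) -> List[int]:
    """Repeat each GT box index as many times as its class allows.

    The repeated list is what a one-to-one matcher would be run against
    for the auxiliary queries, so a box repeated k times gets k matches.
    """
    repeated: List[int] = []
    for gt_idx, label in enumerate(gt_labels):
        repeated.extend([gt_idx] * repeats_per_class[label])
    return repeated


# Match counts from the article; the class order is an assumption.
UNIFORM_REPEATS = [4] * 10
CBGS_REPEATS = [2, 3, 7, 7, 9, 6, 7, 6, 2, 5]
```

With `CBGS_REPEATS`, a box of class 0 contributes two training targets while a box of class 4 contributes nine, so rare classes receive more supervision without enlarging the dataset itself.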
We conduct experiments on a single machine with eight 80 GB GPUs; the results are shown in the table below:
config setting
throughput (samples/s)
Official BEVFormer-base
EasyCV BEVFormer-base
EasyCV BEVFormer-base-one2manybranch
42.48 (+0.57)
EasyCV BEVFormer-base-cbgs_one2manybranch
53.28 (+0.84)
42.63 (+0.72)
The model convergence rate is shown in the figure below:
As the figure shows, the optimizations above greatly improve the convergence speed of the model: only 75% of the training time is needed to reach the final accuracy of the baseline. The final NDS is also 0.8 higher than the baseline.

For detailed configuration, training log and model weights, refer to:
Use the BEVFormer model on the Alibaba Cloud machine learning platform PAI
PAI-DSW (Data Science Workshop) is a cloud IDE developed by Alibaba Cloud's machine learning platform PAI. It provides an interactive programming environment for developers of all kinds. DSW Gallery (link) provides a variety of Notebook examples that help users get started with DSW and build machine learning applications. We have also published a sample Notebook of BEVFormer for 3D detection in DSW Gallery (see the picture below). You are welcome to try it!
Inference Acceleration Using PAI-Blade
PAI-Blade is a model optimization tool developed by Alibaba Cloud's machine learning platform PAI that accelerates inference for different models on different devices. PAI-Blade follows the principles of ease of use, robustness and high performance: it encapsulates model deployment optimization behind a unified, simple API. Once the Blade environment is installed, users can deploy high-performance models through simple code calls without knowing technical details such as ONNX, TensorRT or compilation optimization. For more about PAI-Blade and related technologies, see [PAI-Blade Introduction].
Blade is supported in PAI-EasyCV, and users can export the trained model by configuring the relevant export parameters in the training config of PAI-EasyCV.
Median (FPS)
easycv script
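Speed figures such as median FPS are usually obtained by timing repeated forward passes after a few warm-up runs. The following is a minimal sketch of that measurement, assuming the model call is wrapped in a zero-argument function `fn`; the helper names and the default run counts are hypothetical and do not describe how the numbers in this article were produced.

```python
import statistics
import time
from typing import Callable, List


def measure_latencies(fn: Callable[[], object], warmup: int = 3, runs: int = 10,
                      clock: Callable[[], float] = time.perf_counter) -> List[float]:
    """Time `runs` calls of `fn` in seconds after `warmup` untimed calls."""
    for _ in range(warmup):
        fn()
    latencies = []
    for _ in range(runs):
        start = clock()
        fn()
        latencies.append(clock() - start)
    return latencies


def median_fps(latencies: List[float]) -> float:
    """Frames per second at the median latency (one frame per call)."""
    return 1.0 / statistics.median(latencies)
```

When timing a GPU model, `fn` should wait for the device to finish (for example by synchronizing before returning), because CUDA kernels are launched asynchronously and the wall-clock time would otherwise be underestimated.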
Environmental preparation
We provide a PAI-Blade + PAI-EasyCV image for users to use directly. Image address: easycv-blade-torch181-cuda111.tar
Users can also build their own inference environment based on the images that Blade releases daily [PAI-Blade community image release].
Note when building the environment yourself: BEVFormer-base uses resnet101-dcn as the image backbone, and the DCN operator is a custom operator in mmcv. To export TorchScript we modified its interface, so mmcv must be compiled from source.
1. Clone the mmcv source code
$ git clone
2. Replace the mmcv files
Pay attention to the mmcv version when replacing the files, and make sure the interfaces match. mmcv 1.6.0 has been verified.
Refer to the modified files in the easycv/thirdparty/mmcv/ directory. Use mmcv/ops/csrc/pytorch/modulated_deform_conv.cpp and mmcv/ops/ to replace the original files in mmcv.
3. Compile from source
Please refer to mmcv source code compilation:
Export Blade model
The configuration for exporting the Blade model is in the export field of the config file, as follows:
export = dict(
    blade_config=dict(
        # ... (options before the operator list are not shown)
        'aten::select', 'aten::index', 'aten::slice', 'aten::view',
        'aten::upsample', 'aten::clamp'
        # ...
    ))
Export command:
$ export PYTHONPATH='./'
$ python tools/ configs/detection3d/bevformer/ bevformer_base.pth bevformer_export.pth
Blade Model Inference
Inference script:
import mmcv
from easycv.predictors import BEVFormerPredictor

blade_model_path = 'bevformer_export.pth.blade'
config_file = 'configs/detection3d/bevformer/'
# The argument list is cut off in the original; passing the two paths
# under these keyword names is an assumption.
predictor = BEVFormerPredictor(
    model_path=blade_model_path,
    config_file=config_file,
)

inputs_file = 'nuscenes_infos_temporal_val.pkl'  # Take the NuScenes val dataset file as an example
input_samples = mmcv.load(inputs_file)['infos']
predict_results = predictor(input_samples)
For NuScenes dataset preparation, please refer to: NuScenes dataset preparation
We integrated the BEVFormer algorithm into the EasyCV framework and improved it in terms of training acceleration, accuracy and convergence, and inference acceleration. Many new BEV perception algorithms have emerged recently, such as BEVFormerv2. Through perspective supervision, BEVFormerv2 is no longer limited to backbones pre-trained for depth estimation or 3D detection and can directly use recent, more effective large backbones (such as ConvNext, DCNv3, etc.). Its two-stage detection scheme further strengthens the model, achieving SOTA results on the camera-based 3D detection task of the NuScenes dataset.
EasyCV will continue to follow the industry's SOTA methods. You are welcome to follow and use it, and we welcome feedback, suggestions for improvement and technical discussion. We also look forward to colleagues who are interested in open-source community building joining us.
