UC CV Model Optimization Summary

Background

CV models are widely used in our business, but we have observed that the CV models running on the UC cluster still have considerable room for improvement in GPU utilization. Without optimization, a large amount of GPU resources is needed to meet latency requirements. To improve resource utilization and save computing resources, we explored optimizations for these models (mainly PyTorch) and experimented on the Detectron2 model, which consumes a relatively large share of resources.

Detectron2 uses a backbone network for feature extraction. The extracted features are fed into the proposal generator (RPN), which produces candidate boxes, refines them, and performs deduplication and preliminary filtering. Finally, roi_heads scores and sorts the proposals and outputs the detection scores and boxes.
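The three stages can be summarized with the simplified sketch below; the class and signatures are illustrative placeholders standing in for Detectron2's backbone, proposal_generator, and roi_heads, not the exact Detectron2 API.

import torch
import torch.nn as nn

class SimplifiedRCNN(nn.Module):
    def __init__(self, backbone: nn.Module, proposal_generator: nn.Module, roi_heads: nn.Module):
        super().__init__()
        self.backbone = backbone                      # feature extraction
        self.proposal_generator = proposal_generator  # candidate boxes, refinement, filtering
        self.roi_heads = roi_heads                    # scoring, sorting, final outputs

    def forward(self, images: torch.Tensor):
        features = self.backbone(images)
        proposals = self.proposal_generator(images, features)
        results = self.roi_heads(features, proposals)
        return results  # detection scores and boxes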

Most of the model's computation time is spent in the backbone. The model was deployed online directly in Python. Before optimization, the serving capacity of a single card was weak: queuing started once a single card exceeded 12 QPS, and beyond 20 QPS latency was dramatically amplified or the service even became abnormal. To keep the service running stably, the online business could only add more computing resources to raise QPS; at request peaks, the service handles around 1850 QPS. However, raising QPS by adding resources wastes a lot of hardware: GPU utilization could only be kept at about 20% even during peak periods.

Optimization scheme exploration

The optimization mainly considers two aspects: 1) improving the inference speed of the model under a single CUDA stream; 2) using multiple CUDA streams to further improve GPU utilization.

Backend optimization of model inference

What we focus on in an optimization backend

As an engineering team, we pay particular attention to the following aspects of optimization tools:

• Optimization success rate: PyTorch is flexible and extensible and has a rich set of op types, which leads to highly diverse modeling and inference code. On the other hand, most backend operators support only a subset of PyTorch. Maximizing the optimization success rate under these conditions is the most important consideration.

• Optimization performance: on the one hand, it should fully exploit the deployment hardware, reduce user response time, and save costs; on the other hand, it allows a more complex deep learning model to be used while keeping the response time unchanged, improving business accuracy metrics.

• Difficulty of optimization and deployment: optimization tools need to provide simple, stable, and easy-to-use APIs so that users can quickly complete model optimization. At the same time, the optimized model needs to be easy to deploy into the existing engineering system.

Several mainstream optimization methods

Native TensorRT

From the perspective of optimization performance, the current best practice for CV models is to convert the whole graph to TensorRT to make full use of its acceleration capabilities. The team has also accumulated a lot of TensorRT acceleration experience and a number of TensorRT plugins. There are two main ways to convert a PyTorch model to TensorRT:

1. Export the PyTorch model to ONNX, then parse the ONNX graph into a TensorRT network with onnx2trt;

2. Directly call the TensorRT network API to build an equivalent network and load the weights.

The problem with both approaches is that TensorRT supports fewer operator types than ONNX, and many operators cannot be converted between PyTorch and ONNX either, so the optimization success rate of these two methods is low. In addition, some TensorRT plugin implementations have robustness problems, and there may be numerical differences across models (for example roiAlign).
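As an illustration of path (1) and of where it typically fails, below is a minimal sketch that exports a model to ONNX and parses it with TensorRT's ONNX parser. The model, input shape, and file names are placeholders; unsupported operators surface as parser errors at the marked step.

import torch
import tensorrt as trt

model = MyDetector().cuda().eval()                    # placeholder PyTorch model
dummy = torch.randn(1, 3, 800, 800, device="cuda")
torch.onnx.export(model, dummy, "detector.onnx", opset_version=13)

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("detector.onnx", "rb") as f:
    if not parser.parse(f.read()):                    # unsupported ops fail here
        for i in range(parser.num_errors):
            print(parser.get_error(i))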

TensorRT-based Backend

Considering the diversity of PyTorch operators, we further tried TensorRT-based backends built on a subgraph architecture. "Subgraph" here means that, before optimization, the optimizable ops in the computation graph are selected and clustered into subgraphs; each subgraph is converted to the corresponding backend for optimization, and ops that cannot be converted to that backend fall back to PyTorch for execution.

This architecture achieves the best possible optimization effect while guaranteeing the deployment success rate. At present, there are two mainstream community tools that access TensorRT through this subgraph-clustering architecture: Torch-TensorRT and PAI Blade.

In terms of graph-conversion architecture, Torch-TensorRT supports two kinds of input: TorchScript and Torch FX modules. It implements a large number of converters for PyTorch ops and calls the TensorRT network API inside each converter to translate the op and build the network. PAI Blade takes TorchScript as input; each subgraph first goes through ONNX and is then converted into a TensorRT network. The route via ONNX can reuse mature community tools, but the extra intermediate representation makes the customization path longer; writing converters directly shortens the customization path but greatly increases development effort, since every op and each of its variants must be handled from scratch. At the current stage, the route via ONNX is more mature.

In terms of coverage of TensorRT features, both provide dynamic shapes, low-precision optimization, user-defined plugins, and other functions. However, for Torch-TensorRT to enable advanced features such as dynamic shapes and int8 quantization, the input model must be convertible to TensorRT as a whole network, whereas PAI Blade can enable these advanced features per subgraph under the subgraph-clustering architecture.

In terms of optimization success rate and performance, PAI Blade has developed a large number of JIT passes around TorchScript's characteristics, which makes the optimized graph cleaner and more efficient, and its clustering covers a wider range of ops, thus achieving a larger optimization effect. However, because of the ONNX intermediate representation, some uncommon operators cannot be passed to the TensorRT backend. Torch-TensorRT, on the other hand, is less mature at the current stage, and its optimization success rate is also lower.

In addition to TensorRT, we also investigated OneFlow, PaddlePaddle, and other popular backends in the industry. These backends have problems similar to native TensorRT, such as incomplete operator support and limited opset coverage.

After a period of comparison across all these aspects, we finally selected the PAI Blade optimization scheme for its better usability, robustness, and performance, and rewrote the original model implementation to adapt it to the modified graph.

Multi CUDA Stream Optimization

A PyTorch model places the CUDA operations of all threads on the default stream, so kernels launched by multiple threads execute serially and also block the execution of other streams. The inference requests handled by different threads are independent of each other, with no dependencies between their executions, so their work can be split across separate streams. On a typical single-CUDA-stream timeline, all threads' kernels are serialized on the default stream.
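A minimal sketch of the multi-stream approach is shown below; the model path and input handling are placeholders. Each worker thread launches its kernels on its own CUDA stream instead of the shared default stream.

import torch

model = torch.jit.load("opt_model.pt").cuda().eval()  # placeholder model path

def infer(img: torch.Tensor) -> torch.Tensor:
    # Each serving thread calls infer(); its kernels run on a private stream,
    # so requests from different threads no longer serialize on the default stream.
    stream = torch.cuda.Stream()
    with torch.cuda.stream(stream), torch.no_grad():
        out = model(img.cuda(non_blocking=True))
    # Wait only for this request's kernels before returning the result.
    stream.synchronize()
    return out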

Optimization process

Model inference code rewriting

A PyTorch model does not contain an explicit computation graph at the nn.Module level; it has to be converted to TorchScript to obtain an inference graph that can be optimized. On the one hand, TorchScript is only a subset of Python and supports limited syntax; on the other hand, PyTorch's dynamic-graph nature leads to highly diverse inference code. This conflict means that TorchScript export fails unless the inference code is modified. We therefore modified the Detectron2 inference code, mainly in the following areas:

Syntax and logic rewriting

1. Explicitly add type annotations to variables and parameters

TorchScript treats parameters without explicit type annotations as Tensors. If a variable is not annotated and its actual type is not Tensor, a type-mismatch exception is raised during inference.

# Before annotation
def forward(self, x, dim):
    return torch.sum(x, dim=dim)

# After annotation
def forward(self, x: torch.Tensor, dim: int):
    return torch.sum(x, dim=dim)

2. Keep the element types of containers such as Dict and List consistent

TorchScript requires that the data stored in a container object all have the same type; for example, every element of a List must have the same type.

# Before modification
x = [1, "str"]
num_data = x[0]
str_data = x[1]

# After modification
int_x = [1]
str_x = ["str"]
num_data = int_x[0]
str_data = str_x[0]

3. Rewrite magic methods that TorchScript does not support, such as __getattr__

from typing import Dict

def __init__(self):
    # Collect the sub-modules in a temporary dict, then register them as a ModuleDict
    # instead of relying on add_module/__getattr__, which TorchScript cannot script.
    self.local_modules_tmp: Dict[str, nn.Module] = {}
    # self.add_module("acb", lateral_conv)
    self.local_modules_tmp["acb"] = lateral_conv.cuda()
    self.local_modules = nn.ModuleDict(self.local_modules_tmp)

def forward(self, input_node):
    # Look up the target module by name with an explicit loop over the ModuleDict.
    for kn, vv in self.local_modules.items():
        if name == kn:
            input_node = vv(input_node)
    # input_node = self.__getattr__(name)(input_node)

4. Replace * argument unpacking

boxlists = list(zip(*sampled_boxes))  # * unpacking is not scriptable; replace with an explicit for loop (see the sketch below)
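A minimal sketch of the replacement, assuming sampled_boxes is a per-level list of per-image results (the names are illustrative):

boxlists = []
num_images = len(sampled_boxes[0])
for i in range(num_images):
    per_image = []
    for per_level in sampled_boxes:
        per_image.append(per_level[i])
    boxlists.append(per_image)  # same transposition as list(zip(*sampled_boxes))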

5. Modify list, dict, and NoneType members that hold sub-modules: a list is replaced with torch.nn.ModuleList, a dict with torch.nn.ModuleDict, and so on (see the sketch below).
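An illustrative sketch of this replacement (module names are placeholders):

# Before: plain Python containers are not registered as sub-modules and are hard to script
self.blocks = [conv1, conv2]
self.named_blocks = {"p2": conv1, "p3": conv2}

# After: use the container modules TorchScript understands
self.blocks = torch.nn.ModuleList([conv1, conv2])
self.named_blocks = torch.nn.ModuleDict({"p2": conv1, "p3": conv2})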

6. Use the bypass mechanism to skip code branches that are irrelevant to inference

The model code may contain both training-related and inference-related parts, as well as pre/post-processing and other logic that does not belong to model inference itself. Such code can be bypassed directly through the relevant mechanism.

# Before modification
if self.training:
    do_some_training_thing
else:
    do_some_eval_thing

# After modification
if self.training:
    assert not torch.jit.is_scripting(), "Bypass the code below"
    do_some_training_thing  # the code after the assert will not be compiled
else:
    do_some_eval_thing

7. The deployment environment currently uses single-card inference, so torch.nn.SyncBatchNorm can be replaced with torch.nn.BatchNorm2d.
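A minimal sketch of this replacement, converting SyncBatchNorm layers back to BatchNorm2d in place; the helper name is our own, not a Detectron2 or PyTorch API.

import torch

def convert_sync_bn(module: torch.nn.Module) -> torch.nn.Module:
    # Recursively swap SyncBatchNorm children for equivalent BatchNorm2d layers.
    for name, child in module.named_children():
        if isinstance(child, torch.nn.SyncBatchNorm):
            bn = torch.nn.BatchNorm2d(
                child.num_features, eps=child.eps, momentum=child.momentum,
                affine=child.affine, track_running_stats=child.track_running_stats)
            bn.load_state_dict(child.state_dict())  # copy weights and running stats
            setattr(module, name, bn)
        else:
            convert_sync_bn(child)
    return module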

Modifying the keys of the pretrained parameters

After the model code is modified, the organization of the torch.nn.Module changes, which changes the keys in its state_dict. Because of the key mismatch, the pretrained model can no longer be loaded directly, so we need to modify the keys of the pretrained checkpoint so that it loads correctly.

self.linear = xxx                          # key: "linear"
self.linear = torch.nn.ModuleList([xxx])   # keys: "linear.0", "linear.1", ...
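A minimal sketch of the corresponding key remapping; the checkpoint path and prefix rule are illustrative.

import torch

state_dict = torch.load("model_final.pth", map_location="cpu")
remapped = {}
for k, v in state_dict.items():
    # e.g. "linear.weight" -> "linear.0.weight" after wrapping the layer in a ModuleList
    new_k = "linear.0." + k[len("linear."):] if k.startswith("linear.") else k
    remapped[new_k] = v
model.load_state_dict(remapped, strict=True)  # `model` is the rewritten module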

Graph splitting

In actual tests, optimizing the whole graph end to end did not give the best performance, and for some logic GPU execution is not necessarily faster than the CPU. Therefore, we split the graph by network component (backbone, neck, detection head) and by code logic, and optimized each part in the most suitable way.
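An illustrative sketch of this split (the component and file names are placeholders): each component is scripted and optimized separately, while pre/post-processing and glue logic stay in Python.

import torch

backbone_ts = torch.jit.script(model.backbone)          # optimized with PAI Blade below
rpn_ts = torch.jit.script(model.proposal_generator)     # optimized separately
roi_heads_ts = torch.jit.script(model.roi_heads)        # optimized separately
torch.jit.save(backbone_ts, "./opt_model_backbone.pt")  # consumed in the next step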

Conversion and subgraph clustering analysis

PAI Blade optimization

import torch
import torch_blade
import torch_blade.tensorrt

model_path = "./opt_model_backbone.pt"

# Load the TorchScript model to be optimized
script_model = torch.jit.load(model_path)
script_model.cuda().eval()
...

# Configure the optimization
cfg = torch_blade.Config()
cfg.optimization_pipeline = "TensorRT"  # use the TensorRT backend

# Run the optimization
with cfg:
    opt_model = torch_blade.optimize(script_model, False, model_inputs=img)
print("Optimize finish!")

# The optimized model is still a TorchScript module and can be serialized with the usual API
torch.jit.save(opt_model, opt_file_name)
print("save finish!")

# Load the optimized model and run inference
opt_model = torch.jit.load(opt_file_name)
opt_model(img)
Subgraph clustering analysis

By default, PAI Blade only clusters subgraphs that contain at least three basic ops. For time-consuming ops that are not clustered into a TensorRT group (trt_grp) for execution, a whitelist can be set manually so that the clustering pass handles those ops specially. In addition, if the optimized performance does not meet expectations, the clustering information can be analyzed to check whether key subgraphs failed to be clustered into a trt_grp.

Multiple CUDA Streams

Optimization results

After the two optimizations went live for the Detectron2 model, the QPS and GPU utilization of the system improved greatly. After optimization, nearly 65% of the computing resources can be saved.

Summary

As UC's model inference optimization team, our main goal is to improve model inference performance and reduce the use of computing resources without affecting business results. In this round of optimization, we evaluated optimization tools from three perspectives: optimization success rate, optimization performance, and the difficulty of optimization and deployment, and finally selected PAI Blade for integration into our system. Judging by the final performance gains and resource savings, the result basically met our initial expectations. Performance optimization, however, has no end point: as model structures keep evolving and model scale keeps growing, new challenges will continue to emerge, and the success rate, absolute performance, and ease of use of optimization tools will remain the core criteria for our technology selection.
