Many PyTorch users implement the post-processing part of a detection model using a TensorRT Plugin. This allows them to export the entire model to TensorRT. Blade offers good scalability. If you have a custom TensorRT Plugin, you can use it with Blade for optimization. This topic describes how to use Blade to optimize a detection model that already uses a TensorRT Plugin.
Background information
TensorRT is a powerful tool for inference optimization on NVIDIA GPUs. Blade deeply integrates TensorRT as one of its underlying optimization methods, alongside several other techniques: computation graph optimization, vendor libraries such as TensorRT and oneDNN, AI compiler optimization, Blade's hand-optimized operator library, Blade mixed precision, and Blade EasyCompression.
RetinaNet is a one-stage detection network. Its basic structure consists of a backbone, multiple subnets, and Non-Maximum Suppression (NMS) post-processing. Many training frameworks implement RetinaNet; Detectron2 is a typical example. An earlier topic described how to export a RetinaNet (Detectron2) model using the scripting_with_instances method and quickly optimize it with Blade. For more information, see RetinaNet optimization case 1: Optimize a RetinaNet (Detectron2) model using Blade.
Although many PyTorch users export models to ONNX and then deploy them with TensorRT, this process has limitations. The support for ONNX export and ONNX opsets in TensorRT is limited, which can make the optimization process unreliable. It is especially difficult to export the post-processing part of a detection network to ONNX for TensorRT optimization. The post-processing code is also often inefficient. Therefore, many users implement the post-processing part using the TensorRT Plugin mechanism. This allows the entire model to be exported to TensorRT.
In comparison, optimizing with Blade and TorchScript Custom C++ Operators is simpler than implementing the post-processing part using the TensorRT Plugin mechanism. For more information, see RetinaNet optimization case 2: Optimize a model using Blade and a Custom C++ Operator. Blade also has good scalability. If you have already implemented a custom TensorRT Plugin, you can use it with Blade for collaborative optimization.
Limits
The environment used in this topic must meet the following requirements:
-
System environment: Linux with Python 3.6 or later, GCC 5.4 or later, an NVIDIA Tesla T4 GPU, CUDA 10.2, cuDNN 8.0.5.39, and TensorRT 7.2.2.3.
-
Framework: PyTorch 1.8.1 or later and Detectron2 0.4.1 or later.
-
Inference optimization tool: Blade 3.16.0 or later (dynamically linked to TensorRT).
Procedure
The following steps describe how to optimize a model using Blade and a TensorRT Plugin:
-
Step 1: Create a PyTorch model with a TensorRT Plugin
Implement the post-processing part of RetinaNet using a TensorRT Plugin.
-
Step 2: Call Blade to optimize the model
Call the `blade.optimize` interface to optimize the model and save the optimized model.
-
Step 3: Load and run the optimized model
After you perform a performance test on the original and optimized models, you can load the optimized model for inference if you are satisfied with the results.
Step 1: Create a PyTorch model with a TensorRT Plugin
Blade can work with the TensorRT extension mechanism for collaborative optimization. This section describes how to use TensorRT extensions to implement the post-processing part of RetinaNet. For more information about developing and compiling TensorRT Plugins, see the NVIDIA Deep Learning TensorRT Documentation. The program logic for the RetinaNet post-processing part in this topic is from the NVIDIA open source community. For more information, see Retinanet-Examples. This topic extracts the core code to explain the development and implementation process of a Custom Operator.
-
Download and decompress the sample code.
```shell
wget -nv https://pai-blade.oss-cn-zhangjiakou.aliyuncs.com/tutorials/retinanet_example/retinanet-examples.tar.gz -O retinanet-examples.tar.gz
tar xvfz retinanet-examples.tar.gz 1>/dev/null
```
-
Compile the TensorRT Plugin.
The sample code includes the TensorRT Plugin implementation and registration for the `decode` and `nms` post-processing of RetinaNet. The official PyTorch documentation provides three ways to compile custom operators: building with CMake, just-in-time (JIT) compilation, and building with Setuptools. For more information, see Extending TorchScript with Custom C++ Operators. The three methods suit different scenarios; select one based on your needs. For simplicity, this topic uses JIT compilation. The following is the sample code.
Note: Before you compile, you must configure dependency libraries such as TensorRT, CUDA, and cuDNN.
```python
import torch.utils.cpp_extension
import os

codebase = "retinanet-examples"
sources = [
    'csrc/plugins/plugin.cpp',
    'csrc/cuda/decode.cu',
    'csrc/cuda/nms.cu',
]
sources = [os.path.join(codebase, src) for src in sources]
torch.utils.cpp_extension.load(
    name="plugin",
    sources=sources,
    build_directory=codebase,
    extra_include_paths=[
        '/usr/local/TensorRT/include/',
        '/usr/local/cuda/include/',
        '/usr/local/cuda/include/thrust/system/cuda/detail',
    ],
    extra_cflags=['-std=c++14', '-O2', '-Wall'],
    extra_ldflags=['-L/usr/local/TensorRT/lib/', '-lnvinfer'],
    extra_cuda_cflags=[
        '-std=c++14',
        '--expt-extended-lambda',
        '--use_fast_math',
        '-Xcompiler', '-Wall,-fno-gnu-unique',
        '-gencode=arch=compute_75,code=sm_75',
    ],
    is_python_module=False,
    with_cuda=True,
    verbose=False,
)
```
-
Encapsulate the RetinaNet convolution model part.
Encapsulate the backbone and heads of the RetinaNet model into a separate `RetinaNetBackboneAndHeads` module.

```python
import torch
from typing import List
from torch import Tensor
from torch.testing import assert_allclose
from detectron2 import model_zoo

# This class encapsulates the backbone and heads of RetinaNet.
class RetinaNetBackboneAndHeads(torch.nn.Module):
    def __init__(self, model):
        super().__init__()
        self.model = model

    def preprocess(self, img):
        batched_inputs = [{"image": img}]
        images = self.model.preprocess_image(batched_inputs)
        return images.tensor

    def forward(self, images):
        features = self.model.backbone(images)
        features = [features[f] for f in self.model.head_in_features]
        cls_heads, box_heads = self.model.head(features)
        cls_heads = [cls.sigmoid() for cls in cls_heads]
        box_heads = [b.contiguous() for b in box_heads]
        return cls_heads, box_heads

retinanet_model = model_zoo.get("COCO-Detection/retinanet_R_50_FPN_3x.yaml", trained=True).eval()
retinanet_bacbone_heads = RetinaNetBackboneAndHeads(retinanet_model)
```
-
Build the RetinaNet post-processing network using a TensorRT Plugin. If you have already created a TensorRT Engine, you can skip this step.
-
Create a TensorRT Engine.
To make the TensorRT Plugin effective, you must implement the following features:
-
Dynamically load the compiled plugin.so file using `ctypes.cdll.LoadLibrary`.
-
The `build_retinanet_decode` function uses the `tensorrt` Python API to build the post-processing network and then builds it into an engine.
The following is the sample code.
```python
import os
import numpy as np
import tensorrt as trt
import ctypes

# Load the TensorRT Plugin dynamic-link library.
codebase = "retinanet-examples"
ctypes.cdll.LoadLibrary(os.path.join(codebase, 'plugin.so'))

TRT_LOGGER = trt.Logger()
trt.init_libnvinfer_plugins(TRT_LOGGER, "")
PLUGIN_CREATORS = trt.get_plugin_registry().plugin_creator_list

# Get a TensorRT Plugin instance by name.
def get_trt_plugin(plugin_name, field_collection):
    plugin = None
    for plugin_creator in PLUGIN_CREATORS:
        if plugin_creator.name != plugin_name:
            continue
        if plugin_name == "RetinaNetDecode":
            plugin = plugin_creator.create_plugin(
                name=plugin_name, field_collection=field_collection
            )
        if plugin_name == "RetinaNetNMS":
            plugin = plugin_creator.create_plugin(
                name=plugin_name, field_collection=field_collection
            )
    assert plugin is not None, "plugin not found"
    return plugin

# Function to build the TensorRT network.
def build_retinanet_decode(example_outputs,
                           input_image_shape,
                           anchors_list,
                           test_score_thresh=0.05,
                           test_nms_thresh=0.5,
                           test_topk_candidates=1000,
                           max_detections_per_image=100,
                           ):
    builder = trt.Builder(TRT_LOGGER)
    EXPLICIT_BATCH = 1 << (int)(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    network = builder.create_network(EXPLICIT_BATCH)
    config = builder.create_builder_config()
    config.max_workspace_size = 3 ** 20

    cls_heads, box_heads = example_outputs
    profile = builder.create_optimization_profile()
    decode_scores = []
    decode_boxes = []
    decode_class = []

    input_blob_names = []
    input_blob_types = []

    def _add_input(head_tensor, head_name):
        input_blob_names.append(head_name)
        input_blob_types.append("Float")
        head_shape = list(head_tensor.shape)[-3:]
        profile.set_shape(
            head_name, [1] + head_shape, [20] + head_shape, [1000] + head_shape)
        return network.add_input(
            name=head_name, dtype=trt.float32, shape=[-1] + head_shape
        )

    # Build network inputs.
    cls_head_inputs = []
    cls_head_strides = [input_image_shape[-1] // cls_head.shape[-1] for cls_head in cls_heads]
    for idx, cls_head in enumerate(cls_heads):
        cls_head_name = "cls_head" + str(idx)
        cls_head_inputs.append(_add_input(cls_head, cls_head_name))

    box_head_inputs = []
    for idx, box_head in enumerate(box_heads):
        box_head_name = "box_head" + str(idx)
        box_head_inputs.append(_add_input(box_head, box_head_name))

    output_blob_names = []
    output_blob_types = []

    # Build the decode part of the network.
    for idx, anchors in enumerate(anchors_list):
        field_coll = trt.PluginFieldCollection([
            trt.PluginField("topk_candidates", np.array([test_topk_candidates], dtype=np.int32), trt.PluginFieldType.INT32),
            trt.PluginField("score_thresh", np.array([test_score_thresh], dtype=np.float32), trt.PluginFieldType.FLOAT32),
            trt.PluginField("stride", np.array([cls_head_strides[idx]], dtype=np.int32), trt.PluginFieldType.INT32),
            trt.PluginField("num_anchors", np.array([anchors.numel()], dtype=np.int32), trt.PluginFieldType.INT32),
            trt.PluginField("anchors", anchors.contiguous().cpu().numpy().astype(np.float32), trt.PluginFieldType.FLOAT32),
        ])
        decode_layer = network.add_plugin_v2(
            inputs=[cls_head_inputs[idx], box_head_inputs[idx]],
            plugin=get_trt_plugin("RetinaNetDecode", field_coll),
        )
        decode_scores.append(decode_layer.get_output(0))
        decode_boxes.append(decode_layer.get_output(1))
        decode_class.append(decode_layer.get_output(2))

    # Build the NMS part of the network.
    scores_layer = network.add_concatenation(decode_scores)
    boxes_layer = network.add_concatenation(decode_boxes)
    class_layer = network.add_concatenation(decode_class)
    field_coll = trt.PluginFieldCollection([
        trt.PluginField("nms_thresh", np.array([test_nms_thresh], dtype=np.float32), trt.PluginFieldType.FLOAT32),
        trt.PluginField("max_detections_per_image", np.array([max_detections_per_image], dtype=np.int32), trt.PluginFieldType.INT32),
    ])
    nms_layer = network.add_plugin_v2(
        inputs=[scores_layer.get_output(0), boxes_layer.get_output(0), class_layer.get_output(0)],
        plugin=get_trt_plugin("RetinaNetNMS", field_coll),
    )
    nms_layer.get_output(0).name = "scores"
    nms_layer.get_output(1).name = "boxes"
    nms_layer.get_output(2).name = "classes"
    nms_outputs = [network.mark_output(nms_layer.get_output(k)) for k in range(3)]

    config.add_optimization_profile(profile)
    cuda_engine = builder.build_engine(network, config)
    assert cuda_engine is not None
    return cuda_engine
```
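The `RetinaNetDecode` and `RetinaNetNMS` plugins run box decoding and greedy non-maximum suppression inside TensorRT. As a reference for what the NMS stage computes, the following plain-Python sketch (our illustration, not the plugin's CUDA implementation) keeps the highest-scoring boxes and drops any box whose IoU with an already-kept box exceeds `nms_thresh`:

```python
def iou(a, b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    if inter == 0.0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def greedy_nms(scores, boxes, nms_thresh=0.5, max_detections=100):
    """Return the indices of the boxes kept by greedy NMS, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap any already-kept box too much.
        if all(iou(boxes[i], boxes[j]) <= nms_thresh for j in keep):
            keep.append(i)
            if len(keep) == max_detections:
                break
    return keep
```

Unlike this sketch, the real plugin also carries class IDs through suppression and pads its outputs to a fixed `max_detections_per_image` so that output shapes stay static.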
-
Create the `cuda_engine` based on the actual number, types, and shapes of the outputs of `RetinaNetBackboneAndHeads`.

```python
import numpy as np
from detectron2.data.detection_utils import read_image

# Download a test image (Jupyter shell command).
!wget http://images.cocodataset.org/val2017/000000439715.jpg -q -O input.jpg
img = read_image('./input.jpg')
img = torch.from_numpy(np.ascontiguousarray(img.transpose(2, 0, 1)))

example_inputs = retinanet_bacbone_heads.preprocess(img)
example_outputs = retinanet_bacbone_heads(example_inputs)
cell_anchors = [c.contiguous() for c in retinanet_model.anchor_generator.cell_anchors]
cuda_engine = build_retinanet_decode(
    example_outputs, example_inputs.shape, cell_anchors)
```
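In `build_retinanet_decode`, the `_add_input` helper registers each head tensor with a dynamic batch dimension: the network input is declared with shape `[-1, C, H, W]`, and the optimization profile bounds the batch size with a (min, opt, max) triple of 1, 20, and 1000. The following plain-Python helper (a hypothetical name, used only for illustration) spells out that triple:

```python
def profile_shapes(head_shape, min_batch=1, opt_batch=20, max_batch=1000):
    """Return the (min, opt, max) shapes passed to profile.set_shape for one head.

    head_shape is the per-sample (C, H, W) shape of a head tensor; the batch
    dimension stays dynamic (-1) in the network definition and is bounded by
    this triple at engine-build time.
    """
    head_shape = list(head_shape)
    return ([min_batch] + head_shape,
            [opt_batch] + head_shape,
            [max_batch] + head_shape)
```

TensorRT tunes kernels for the `opt` shape, so in practice you would set it to the batch size you expect to serve most often.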
-
Use Blade extensions to support models that use both PyTorch and a TensorRT Engine.
The following code recombines the backbone-and-heads part and the TensorRT Plugin post-processing part using the `RetinaNetWrapper`, `RetinaNetBackboneAndHeads`, and `RetinaNetPostProcess` classes.

```python
import blade.torch

# The post-processing part supported by the Blade TensorRT extension.
class RetinaNetPostProcess(torch.nn.Module):
    def __init__(self, cuda_engine):
        super().__init__()
        blob_names = [cuda_engine.get_binding_name(idx) for idx in range(cuda_engine.num_bindings)]
        input_blob_names = blob_names[:-3]
        input_blob_types = ["Float"] * len(input_blob_names)
        output_blob_names = blob_names[-3:]
        output_blob_types = ["Float"] * len(output_blob_names)

        self.trt_ext_plugin = torch.classes.torch_addons.TRTEngineExtension(
            bytes(cuda_engine.serialize()),
            (input_blob_names, output_blob_names, input_blob_types, output_blob_types),
        )

    def forward(self, inputs: List[Tensor]):
        return self.trt_ext_plugin.forward(inputs)

# Use PyTorch and a TensorRT Engine together.
class RetinaNetWrapper(torch.nn.Module):
    def __init__(self, model, trt_postproc):
        super().__init__()
        self.backbone_and_heads = model
        self.trt_postproc = torch.jit.script(trt_postproc)

    def forward(self, images):
        cls_heads, box_heads = self.backbone_and_heads(images)
        return self.trt_postproc(cls_heads + box_heads)

trt_postproc = RetinaNetPostProcess(cuda_engine)
retinanet_mix_trt = RetinaNetWrapper(retinanet_bacbone_heads, trt_postproc)

# Export and save as a TorchScript file.
retinanet_script = torch.jit.trace(retinanet_mix_trt, (example_inputs, ), check_trace=False)
torch.jit.save(retinanet_script, 'retinanet_script.pt')
torch.save(example_inputs, 'example_inputs.pth')

outputs = retinanet_script(example_inputs)
```

The newly assembled `torch.nn.Module` has the following features:
-
It uses the `torch.classes.torch_addons.TRTEngineExtension` interface, which is supported by the Blade TensorRT extension.
-
It supports TorchScript model export. The preceding code uses `torch.jit.trace` to export the model.
-
It supports saving the model in the TorchScript format.
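Note that `RetinaNetPostProcess` relies on a binding-order convention: the engine lists all input bindings before its three output bindings, so slicing the binding-name list splits the inputs from the `scores`, `boxes`, and `classes` outputs. A plain-Python sketch of that slicing (the helper name is ours, not part of the sample code):

```python
def split_bindings(binding_names, num_outputs=3):
    """Split an engine's binding names into (inputs, outputs).

    Assumes the engine orders all input bindings before its output
    bindings, as in the engine built by build_retinanet_decode.
    """
    return binding_names[:-num_outputs], binding_names[-num_outputs:]
```

If you modify the post-processing network, verify this assumption with `cuda_engine.binding_is_input` rather than relying on position alone.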
Step 2: Call Blade to optimize the model
-
Call the Blade optimization interface.
Call the `blade.optimize` interface to optimize the model. The following is the sample code. For more information about the `blade.optimize` interface, see Optimize a PyTorch model.

```python
import blade
import blade.torch
import ctypes
import torch
import os

codebase = "retinanet-examples"
ctypes.cdll.LoadLibrary(os.path.join(codebase, 'plugin.so'))

blade_config = blade.Config()
blade_config.gpu_config.disable_fp16_accuracy_check = True

script_model = torch.jit.load('retinanet_script.pt')
example_inputs = torch.load('example_inputs.pth')
test_data = [(example_inputs,)]  # The input data for PyTorch is a list of tuples.

with blade_config:
    optimized_model, opt_spec, report = blade.optimize(
        script_model,         # The TorchScript model exported in the previous step.
        'o1',                 # Enable Blade O1 level optimization.
        device_type='gpu',    # The target device is a GPU.
        test_data=test_data,  # Provide a set of test data to assist with optimization and testing.
    )
```
-
Print the optimization report and save the model.
The model optimized by Blade is still a TorchScript model. After optimization, you can use the following code to print the optimization report and save the optimized model.
```python
# Print the optimization report.
print("Report: {}".format(report))
# Save the optimized model.
torch.jit.save(optimized_model, 'optimized.pt')
```

The printed optimization report is as follows. For more information about the fields in the optimization report, see Optimization report.

```
Report: {
  "software_context": [
    { "software": "pytorch", "version": "1.8.1+cu102" },
    { "software": "cuda", "version": "10.2.0" }
  ],
  "hardware_context": { "device_type": "gpu", "microarchitecture": "T4" },
  "user_config": "",
  "diagnosis": {
    "model": "unnamed.pt",
    "test_data_source": "user provided",
    "shape_variation": "undefined",
    "message": "Unable to deduce model inputs information (data type, shape, value range, etc.)",
    "test_data_info": "0 shape: (1, 3, 480, 640) data type: float32"
  },
  "optimizations": [
    {
      "name": "PtTrtPassFp16",
      "status": "effective",
      "speedup": "4.37",
      "pre_run": "40.59 ms",
      "post_run": "9.28 ms"
    }
  ],
  "overall": {
    "baseline": "40.02 ms",
    "optimized": "9.27 ms",
    "speedup": "4.32"
  },
  "model_info": { "input_format": "torch_script" },
  "compatibility_list": [
    { "device_type": "gpu", "microarchitecture": "T4" }
  ],
  "model_sdk": {}
}
```
-
Perform a performance test on the original and optimized models.
The following is the sample code for the performance test.
```python
import time

@torch.no_grad()
def benchmark(model, inp):
    # Warm up.
    for i in range(100):
        model(inp)
    torch.cuda.synchronize()
    start = time.time()
    for i in range(200):
        model(inp)
    torch.cuda.synchronize()
    elapsed_ms = (time.time() - start) * 1000
    print("Latency: {:.2f}".format(elapsed_ms / 200))

# Test the performance of the original model.
benchmark(script_model, example_inputs)
# Test the performance of the optimized model.
benchmark(optimized_model, example_inputs)
```

The following are the reference results for this test.

```
Latency: 40.71
Latency: 9.35
```

The results show that over 200 runs, the average latencies of the original and optimized models are 40.71 ms and 9.35 ms, respectively.
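The printed output above suggests that `report` is a JSON string, so you can also parse it programmatically, for example to make a deployment pipeline fail when the speedup falls below a threshold. The following is a minimal sketch; the abridged `report_json` sample mirrors only the `overall` field of the report shown earlier:

```python
import json

def check_speedup(report_json, min_speedup=2.0):
    """Parse a Blade optimization report and verify the overall speedup."""
    report = json.loads(report_json)
    speedup = float(report["overall"]["speedup"])
    if speedup < min_speedup:
        raise RuntimeError(
            "speedup {:.2f}x is below the required {:.2f}x".format(speedup, min_speedup))
    return speedup

# Abridged sample mirroring the "overall" field of the report above.
report_json = '{"overall": {"baseline": "40.02 ms", "optimized": "9.27 ms", "speedup": "4.32"}}'
print(check_speedup(report_json))  # prints 4.32
```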
Step 3: Load and run the optimized model
Optional: During the trial period, add the following environment variable setting to prevent the program from quitting unexpectedly due to an authentication failure:

```shell
export BLADE_AUTH_USE_COUNTING=1
```

Get authenticated to use PAI-Blade:

```shell
export BLADE_REGION=<region>
export BLADE_TOKEN=<token>
```

Configure the following parameters based on your business requirements:
<region>: the region where you use PAI-Blade. You can join the DingTalk group of PAI-Blade users to obtain the regions where PAI-Blade can be used. For information about the QR code of the DingTalk group, see Install PAI-Blade.
<token>: the authentication token that is required to use PAI-Blade. You can join the DingTalk group of PAI-Blade users to obtain the authentication token. For information about the QR code of the DingTalk group, see Install PAI-Blade.
-
Load and run the optimized model.
The model optimized by Blade is still a TorchScript model. You can load the optimized model without switching environments.
```python
import blade.runtime.torch
import torch
from torch.testing import assert_allclose
import ctypes
import os

codebase = "retinanet-examples"
# Load the TensorRT Plugin dynamic-link library before loading the model.
ctypes.cdll.LoadLibrary(os.path.join(codebase, 'plugin.so'))

optimized_model = torch.jit.load('optimized.pt')
example_inputs = torch.load('example_inputs.pth')

with torch.no_grad():
    pred = optimized_model(example_inputs)
```