RetinaNet is a one-stage object detection network. Its basic structure consists of a backbone, multiple subnetworks, and Non-Maximum Suppression (NMS) for post-processing. Many training frameworks, such as Detectron2, implement RetinaNet. This topic describes how to use Blade to optimize a RetinaNet model, using the standard implementation in Detectron2 as an example.
Limits
- System environment: A Linux system with Python 3.6 or later and CUDA 10.2.
- Framework: PyTorch 1.8.1 or later and Detectron2 0.4.1 or later.
- Inference optimization tool: Blade 3.16.0 or later.
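Before starting, it can help to verify that the installed packages meet these minimum versions. The following is a dependency-free sketch of such a check (the helper name is illustrative; production code could obtain installed versions via `importlib.metadata.version`):

```python
def version_at_least(installed: str, required: str) -> bool:
    """Compare dotted version strings numerically, ignoring local suffixes such as '+cu102'."""
    def parts(v: str):
        return [int(p) for p in v.split("+")[0].split(".")]
    return parts(installed) >= parts(required)

# PyTorch 1.8.1+cu102 satisfies the 1.8.1 minimum; Detectron2 0.4.0 does not satisfy 0.4.1.
print(version_at_least("1.8.1+cu102", "1.8.1"))  # True
print(version_at_least("0.4.0", "0.4.1"))        # False
```

A numeric comparison is used instead of a string comparison so that, for example, version 1.10.0 correctly compares as newer than 1.8.1.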
Procedure
- Step 1: Export the model
  Export the model using either the `TracingAdapter` or the `scripting_with_instances` method provided by Detectron2.
- Step 2: Call Blade to optimize the model
  Call the `blade.optimize` interface to optimize the model and save the optimized model.
- Step 3: Load and run the optimized model
  Run a performance test. If the results are satisfactory, load the optimized model for inference.
Step 1: Export the model
Detectron2 is a flexible, extensible, and configurable open source training framework from Facebook AI Research (FAIR) for object detection and image segmentation. Because of the framework's flexibility, exporting a model using conventional methods might fail or produce incorrect results. To support TorchScript deployment, Detectron2 provides two export methods: TracingAdapter and scripting_with_instances. For more information, see Detectron2 Usage.
The following example uses `scripting_with_instances` to demonstrate the model export process.

```python
import torch
import numpy as np
from torch import Tensor
from torch.testing import assert_allclose
from detectron2 import model_zoo
from detectron2.export import scripting_with_instances
from detectron2.structures import Boxes
from detectron2.data.detection_utils import read_image

# Export the RetinaNet model using scripting_with_instances.
def load_retinanet(config_path):
    model = model_zoo.get(config_path, trained=True).eval()
    fields = {
        "pred_boxes": Boxes,
        "scores": Tensor,
        "pred_classes": Tensor,
    }
    script_model = scripting_with_instances(model, fields)
    return model, script_model

# Download a sample image.
# wget http://images.cocodataset.org/val2017/000000439715.jpg -q -O input.jpg
img = read_image('./input.jpg')
img = torch.from_numpy(np.ascontiguousarray(img.transpose(2, 0, 1)))

# Run and compare the results before and after exporting the model.
pytorch_model, script_model = load_retinanet("COCO-Detection/retinanet_R_50_FPN_3x.yaml")
with torch.no_grad():
    batched_inputs = [{"image": img.float()}]
    pred1 = pytorch_model(batched_inputs)
    pred2 = script_model(batched_inputs)
    assert_allclose(pred1[0]['instances'].scores, pred2[0].scores)
```

Step 2: Call Blade to optimize the model
- Call the Blade optimization interface.
  Call the `blade.optimize` interface to optimize the model. The following code provides an example. For more information about the `blade.optimize` interface, see Optimize a PyTorch model.

  ```python
  import blade

  test_data = [(batched_inputs,)]  # The input data for PyTorch is a list of tuples.

  optimized_model, opt_spec, report = blade.optimize(
      script_model,         # The TorchScript model exported in the previous step.
      'o1',                 # Enable Blade O1 level optimization.
      device_type='gpu',    # The target device is a GPU.
      test_data=test_data,  # A given set of test data to assist with optimization and testing.
  )
  ```
- Print the optimization report and save the model.
The model optimized by Blade is still a TorchScript model. After optimization is complete, use the following code to print the optimization report and save the optimized model.
  ```python
  # Print the optimization report.
  print("Report: {}".format(report))
  # Save the optimized model.
  torch.jit.save(optimized_model, 'optimized.pt')
  ```

  The following is the printed optimization report. For more information about the fields in the optimization report, see Optimization report.

  ```
  Report: {
    "software_context": [
      {
        "software": "pytorch",
        "version": "1.8.1+cu102"
      },
      {
        "software": "cuda",
        "version": "10.2.0"
      }
    ],
    "hardware_context": {
      "device_type": "gpu",
      "microarchitecture": "T4"
    },
    "user_config": "",
    "diagnosis": {
      "model": "unnamed.pt",
      "test_data_source": "user provided",
      "shape_variation": "undefined",
      "message": "Unable to deduce model inputs information (data type, shape, value range, etc.)",
      "test_data_info": "0 shape: (3, 480, 640) data type: float32"
    },
    "optimizations": [
      {
        "name": "PtTrtPassFp16",
        "status": "effective",
        "speedup": "3.77",
        "pre_run": "40.64 ms",
        "post_run": "10.78 ms"
      }
    ],
    "overall": {
      "baseline": "40.73 ms",
      "optimized": "10.76 ms",
      "speedup": "3.79"
    },
    "model_info": {
      "input_format": "torch_script"
    },
    "compatibility_list": [
      {
        "device_type": "gpu",
        "microarchitecture": "T4"
      }
    ],
    "model_sdk": {}
  }
  ```
- Run a performance test on the models before and after optimization.
The following code provides an example of a performance test.
  ```python
  import time

  @torch.no_grad()
  def benchmark(model, inp):
      # Warm-up runs.
      for i in range(100):
          model(inp)
      torch.cuda.synchronize()
      # Timed runs.
      start = time.time()
      for i in range(200):
          model(inp)
      torch.cuda.synchronize()
      elapsed_ms = (time.time() - start) * 1000
      print("Latency: {:.2f}".format(elapsed_ms / 200))

  # Test the speed of the original model.
  benchmark(pytorch_model, batched_inputs)
  # Test the speed of the optimized model.
  benchmark(optimized_model, batched_inputs)
  ```

  The following are the reference results for this test. They show that after 200 runs, the average latencies of the original and optimized models are 42.38 ms and 10.77 ms, respectively.

  ```
  Latency: 42.38
  Latency: 10.77
  ```
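The latency numbers above are also summarized in the optimization report from Step 2. Since the report body is a JSON document, a script can extract the overall speedup programmatically, for example to gate deployment in a CI pipeline. The following is a minimal sketch, assuming the report layout shown earlier (the function name and the `Report: ` prefix handling are illustrative):

```python
import json

def overall_speedup(report_str: str) -> float:
    """Extract the overall speedup from a Blade optimization report string."""
    prefix = "Report: "
    payload = report_str[len(prefix):] if report_str.startswith(prefix) else report_str
    return float(json.loads(payload)["overall"]["speedup"])

# Example using the "overall" section from the report above.
sample = 'Report: {"overall": {"baseline": "40.73 ms", "optimized": "10.76 ms", "speedup": "3.79"}}'
print(overall_speedup(sample))  # 3.79
```

A pipeline could then compare the returned value against a minimum acceptable speedup before promoting the optimized model.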
Step 3: Load and run the optimized model
Optional: During the trial period, add the following environment variable setting to prevent the program from quitting unexpectedly due to an authentication failure:

```shell
export BLADE_AUTH_USE_COUNTING=1
```

Get authenticated to use PAI-Blade:

```shell
export BLADE_REGION=<region>
export BLADE_TOKEN=<token>
```

Configure the following parameters based on your business requirements:
- `<region>`: the region where you use PAI-Blade. You can join the DingTalk group of PAI-Blade users to obtain the regions where PAI-Blade can be used. For information about the QR code of the DingTalk group, see Install PAI-Blade.
- `<token>`: the authentication token that is required to use PAI-Blade. You can obtain the token from the same DingTalk group.
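Before loading Blade in production, it can help to fail fast when these authentication variables are missing. The following is a minimal sketch (the function name is illustrative, and the values assigned below are placeholders, not real credentials):

```python
import os

REQUIRED_VARS = ("BLADE_REGION", "BLADE_TOKEN")

def missing_blade_auth_vars():
    """Return the names of required PAI-Blade variables not set in the environment."""
    return [name for name in REQUIRED_VARS if not os.environ.get(name)]

os.environ["BLADE_REGION"] = "example-region"  # placeholder for illustration
os.environ["BLADE_TOKEN"] = "example-token"    # placeholder for illustration
print(missing_blade_auth_vars())  # []
```

Calling this check at startup and raising an error when the returned list is non-empty gives a clearer failure than an authentication error deep inside inference code.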
- Deploy the model.
The model optimized by Blade is still a TorchScript model. Therefore, you can load the optimized model without switching environments.
  ```python
  import blade.runtime.torch  # Required to load Blade-optimized TorchScript models.
  import numpy as np
  import torch
  from torch.testing import assert_allclose
  from detectron2 import model_zoo
  from detectron2.data.detection_utils import read_image

  pytorch_model = model_zoo.get("COCO-Detection/retinanet_R_50_FPN_3x.yaml", trained=True).eval()
  optimized_model = torch.jit.load('optimized.pt')

  img = read_image('./input.jpg')
  img = torch.from_numpy(np.ascontiguousarray(img.transpose(2, 0, 1)))

  with torch.no_grad():
      batched_inputs = [{"image": img.float()}]
      pred1 = pytorch_model(batched_inputs)
      pred2 = optimized_model(batched_inputs)
      # The optimized model runs in reduced precision, so compare with a tolerance.
      assert_allclose(pred1[0]['instances'].scores, pred2[0].scores, rtol=1e-3, atol=1e-2)
  ```
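After deployment, downstream code typically thresholds detections by confidence before using them. The following is a framework-free sketch of that post-processing step (the field layout mirrors the `Instances` fields registered in Step 1; the function name and threshold are illustrative):

```python
def filter_detections(scores, pred_classes, pred_boxes, score_thresh=0.5):
    """Keep only detections whose confidence score meets the threshold."""
    keep = [i for i, s in enumerate(scores) if s >= score_thresh]
    return (
        [scores[i] for i in keep],
        [pred_classes[i] for i in keep],
        [pred_boxes[i] for i in keep],
    )

# Example with three hypothetical detections (boxes in x1, y1, x2, y2 format).
scores = [0.91, 0.42, 0.77]
pred_classes = [17, 0, 3]
pred_boxes = [[10, 20, 50, 80], [5, 5, 30, 30], [60, 40, 120, 100]]
print(filter_detections(scores, pred_classes, pred_boxes))
# ([0.91, 0.77], [17, 3], [[10, 20, 50, 80], [60, 40, 120, 100]])
```

In real code the inputs would come from the model output, for example by calling `.tolist()` on the `scores`, `pred_classes`, and `pred_boxes` tensors.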