RetinaNet is a one-stage object detection network. The basic structure of RetinaNet consists of a backbone, multiple subnetworks, and Non-Maximum Suppression (NMS), which is a post-processing algorithm. RetinaNet is implemented in many training frameworks, and Detectron2 is a typical one. This topic describes how to use PAI-Blade, provided by Machine Learning Platform for AI (PAI), to optimize a RetinaNet model that is implemented in the Detectron2 framework.

Limits

The environment used for the procedure described in this topic must meet the following version requirements:
  • System environment: Python 3.6 or later and Compute Unified Device Architecture (CUDA) 10.2 in Linux
  • Framework: PyTorch 1.8.1 or later, and Detectron2 0.4.1 or later
  • Inference optimization tool: PAI-Blade V3.16.0 or later
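To quickly confirm that your environment meets these requirements, you can run a check similar to the following minimal sketch. It inspects only the PyTorch, CUDA, and Detectron2 versions; checking the installed PAI-Blade version is not covered here.

# Minimal sketch: verify that the environment meets the version requirements above.
import torch
import detectron2

print("PyTorch:", torch.__version__)          # expect 1.8.1 or later
print("CUDA:", torch.version.cuda)            # expect 10.2
print("Detectron2:", detectron2.__version__)  # expect 0.4.1 or later

# GPU optimization requires a CUDA-capable device.
assert torch.cuda.is_available(), "A CUDA-capable GPU is required."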

Procedure

To use PAI-Blade to optimize a RetinaNet model that is in the Detectron2 framework, perform the following steps:
  1. Step 1: Export the RetinaNet model to be optimized

    Call the TracingAdapter or scripting_with_instances API provided by Detectron2 to export the RetinaNet model that you want to optimize.

  2. Step 2: Use PAI-Blade to optimize the model

    Call the blade.optimize method to optimize the model and save the optimized model.

  3. Step 3: Load and run the optimized model

    If the optimized model passes the performance testing and meets your expectations, load the optimized model for inference.

Step 1: Export the RetinaNet model to be optimized

Detectron2 is an open source training framework built by Facebook AI Research (FAIR). It implements object detection and segmentation algorithms and is flexible, extensible, and configurable. Because of this flexibility, exporting a model in the regular way may fail or produce an incorrect result. To ensure that a model can be deployed in the TorchScript format, Detectron2 allows you to export the model by calling the TracingAdapter or scripting_with_instances API. For more information, see Usage.

PAI-Blade supports TorchScript models of all types. The following sample code provides an example on how to export a model in the TorchScript format by using the scripting_with_instances API.
import torch
import numpy as np

from torch import Tensor
from torch.testing import assert_allclose

from detectron2 import model_zoo
from detectron2.export import scripting_with_instances
from detectron2.structures import Boxes
from detectron2.data.detection_utils import read_image

# Call the scripting_with_instances API to export the RetinaNet model. 
def load_retinanet(config_path):
    model = model_zoo.get(config_path, trained=True).eval()
    fields = {
        "pred_boxes": Boxes,
        "scores": Tensor,
        "pred_classes": Tensor,
    }
    script_model = scripting_with_instances(model, fields)
    return model, script_model

# Download a sample image. 
# wget http://images.cocodataset.org/val2017/000000439715.jpg -q -O input.jpg
img = read_image('./input.jpg')
img = torch.from_numpy(np.ascontiguousarray(img.transpose(2, 0, 1)))

# Run the model and check that the outputs are consistent before and after the export.
pytorch_model, script_model = load_retinanet("COCO-Detection/retinanet_R_50_FPN_3x.yaml")
with torch.no_grad():
    batched_inputs = [{"image": img.float()}]
    pred1 = pytorch_model(batched_inputs)
    pred2 = script_model(batched_inputs)

assert_allclose(pred1[0]['instances'].scores, pred2[0].scores)
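The preceding example uses the scripting_with_instances API. If you prefer tracing, Detectron2 also provides the TracingAdapter API mentioned above. The following sketch follows the usage pattern described in the Detectron2 deployment documentation. Note that a traced model takes the image tensor directly and returns flattened tensors instead of Instances objects, so its outputs are handled differently from the scripted model.

# Alternative sketch: export the model by tracing with TracingAdapter.
# The traced model takes the image tensor directly and returns flattened
# tensors (boxes, scores, classes) instead of Instances objects.
from detectron2.export import TracingAdapter

def trace_retinanet(config_path, image):
    model = model_zoo.get(config_path, trained=True).eval()
    inputs = [{"image": image}]
    wrapper = TracingAdapter(model, inputs)
    with torch.no_grad():
        traced_model = torch.jit.trace(wrapper, (image,))
    return traced_model

traced_model = trace_retinanet("COCO-Detection/retinanet_R_50_FPN_3x.yaml", img.float())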

Step 2: Use PAI-Blade to optimize the model

  1. Call the blade.optimize method of PAI-Blade.
    Call the blade.optimize method to optimize the model. The following sample code provides an example. For more information about the blade.optimize method, see Optimize a PyTorch model.
    import blade
    
    test_data = [(batched_inputs,)] # The test data used for a model in PyTorch is a list of tuples of tensors. 
    optimized_model, opt_spec, report = blade.optimize(
        script_model,  # The model in the TorchScript format exported in the previous step.  
        'o1',  # The optimization level of PAI-Blade. In this example, the optimization level is o1. 
    device_type='gpu',  # The type of the device on which the model is run. In this example, a GPU is used. 
        test_data=test_data,  # The given set of test data, which facilitates optimization and testing. 
    )
  2. Display the optimization report and save the optimized model.
    The optimized model is still in the TorchScript format. After the optimization is complete, you can run the following code to display the optimization report and save the optimized model:
    # Display the optimization report. 
    print("Report: {}".format(report))
    # Save the optimized model. 
    torch.jit.save(optimized_model, 'optimized.pt')
    The following sample code provides a sample optimization report. For more information about the parameters in the report, see Optimization report.
    Report: {
      "software_context": [
        {
          "software": "pytorch",
          "version": "1.8.1+cu102"
        },
        {
          "software": "cuda",
          "version": "10.2.0"
        }
      ],
      "hardware_context": {
        "device_type": "gpu",
        "microarchitecture": "T4"
      },
      "user_config": "",
      "diagnosis": {
        "model": "unnamed.pt",
        "test_data_source": "user provided",
        "shape_variation": "undefined",
        "message": "Unable to deduce model inputs information (data type, shape, value range, etc.)",
        "test_data_info": "0 shape: (3, 480, 640) data type: float32"
      },
      "optimizations": [
        {
          "name": "PtTrtPassFp16",
          "status": "effective",
          "speedup": "3.77",
          "pre_run": "40.64 ms",
          "post_run": "10.78 ms"
        }
      ],
      "overall": {
        "baseline": "40.73 ms",
        "optimized": "10.76 ms",
        "speedup": "3.79"
      },
      "model_info": {
        "input_format": "torch_script"
      },
      "compatibility_list": [
        {
          "device_type": "gpu",
          "microarchitecture": "T4"
        }
      ],
      "model_sdk": {}
    }
  3. Test the performance of the original model and the optimized model.
    The following sample code provides an example on how to test the performance of the models:
    import time
    
    @torch.no_grad()
    def benchmark(model, inp):
        # Warm up the model.
        for i in range(100):
            model(inp)
        torch.cuda.synchronize()
        # Measure the average latency over 200 runs.
        start = time.time()
        for i in range(200):
            model(inp)
        torch.cuda.synchronize()
        elapsed_ms = (time.time() - start) * 1000
        print("Latency: {:.2f}".format(elapsed_ms / 200))
    
    # Test the latency of the original model. 
    benchmark(pytorch_model, batched_inputs)
    # Test the latency of the optimized model. 
    benchmark(optimized_model, batched_inputs)
    The following results of this performance testing are for your reference:
    Latency: 42.38
    Latency: 10.77
    The preceding results show that after each model is run 200 times, the average latency of the original model is 42.38 ms and the average latency of the optimized model is 10.77 ms.
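Optionally, you can keep the optimization report together with the optimized model for later reference. The following is a minimal sketch; it assumes that converting the report object to a string yields the JSON-like text shown in the sample report above.

# Minimal sketch: persist the optimization report next to the optimized model.
# Assumption: str(report) yields the JSON-like text shown in the sample report.
with open('optimized_report.json', 'w') as f:
    f.write(str(report))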

Step 3: Load and run the optimized model

  1. Optional: During the trial period, set the following environment variable to prevent the program from quitting unexpectedly due to an authentication failure:
    export BLADE_AUTH_USE_COUNTING=1
  2. Get authenticated to use PAI-Blade.
    export BLADE_REGION=<region>
    export BLADE_TOKEN=<token>
    Configure the following parameters based on your business requirements:
    • <region>: the region where you use PAI-Blade. You can join the DingTalk group of PAI-Blade users to obtain the regions where PAI-Blade can be used. For information about the QR code of the DingTalk group, see Obtain an access token.
    • <token>: the authentication token that is required to use PAI-Blade. You can join the DingTalk group of PAI-Blade users to obtain the authentication token. For information about the QR code of the DingTalk group, see Obtain an access token.
  3. Deploy the model.
    The optimized model is still in the TorchScript format. Therefore, you can load and run it without changing your environment.
    import blade.runtime.torch
    import numpy as np
    import torch
    
    from torch.testing import assert_allclose
    from detectron2 import model_zoo
    from detectron2.data.detection_utils import read_image
    
    pytorch_model = model_zoo.get("COCO-Detection/retinanet_R_50_FPN_3x.yaml", trained=True).eval()
    optimized_model = torch.jit.load('optimized.pt')
    
    img = read_image('./input.jpg')
    img = torch.from_numpy(np.ascontiguousarray(img.transpose(2, 0, 1)))
    
    with torch.no_grad():
        batched_inputs = [{"image": img.float()}]
        pred1 = pytorch_model(batched_inputs)
        pred2 = optimized_model(batched_inputs)
    
    assert_allclose(pred1[0]['instances'].scores, pred2[0].scores, rtol=1e-3, atol=1e-2)
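To put the preceding steps together, the following sketch wraps the optimized model in a small helper that takes an image path and returns the predictions for that image. The OptimizedRetinaNetPredictor class and its predict method are illustrative names only and are not part of PAI-Blade or Detectron2.

# Illustrative sketch only: a minimal wrapper around the optimized model.
# The class and method names are hypothetical, not PAI-Blade or Detectron2 APIs.
import blade.runtime.torch  # Import the PAI-Blade runtime before loading the optimized model, as in the example above.
import numpy as np
import torch

from detectron2.data.detection_utils import read_image

class OptimizedRetinaNetPredictor:
    def __init__(self, model_path):
        self.model = torch.jit.load(model_path).eval()

    @torch.no_grad()
    def predict(self, image_path):
        # Read the image as an HWC array and convert it to a CHW float tensor.
        img = read_image(image_path)
        img = torch.from_numpy(np.ascontiguousarray(img.transpose(2, 0, 1))).float()
        return self.model([{"image": img}])

predictor = OptimizedRetinaNetPredictor('optimized.pt')
outputs = predictor.predict('./input.jpg')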