
Platform For AI:Optimize RetinaNet (Detectron2) with Blade

Last Updated:Mar 10, 2026

RetinaNet is a one-stage object detection network. Its basic structure consists of a backbone and multiple subnetworks, with Non-Maximum Suppression (NMS) used for post-processing. Many training frameworks, such as Detectron2, implement RetinaNet. This topic describes how to use Blade to optimize a RetinaNet (Detectron2) model, using the standard implementation in Detectron2 as an example.

Limits

The environment must meet the following requirements:
  • System environment: A Linux system with Python 3.6 or later and CUDA 10.2.
  • Framework: PyTorch 1.8.1 or later and Detectron2 0.4.1 or later.
  • Inference optimization tool: Blade 3.16.0 or later.
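The version requirements above can be verified before you start. The following is a minimal standard-library sketch; the `meets_minimum` helper is hypothetical and not part of Blade or Detectron2:

```python
import sys

def meets_minimum(version: str, minimum: str) -> bool:
    """Compare dotted version strings numerically, e.g. '1.10.0' >= '1.8.1'.

    Local build suffixes such as '+cu102' are stripped before comparison.
    """
    parse = lambda v: tuple(int(p) for p in v.split("+")[0].split(".")[:3])
    return parse(version) >= parse(minimum)

# Python 3.6 or later is required by the environment listed above.
assert sys.version_info >= (3, 6), "Python 3.6 or later is required"

print(meets_minimum("1.8.1+cu102", "1.8.1"))  # True: meets the PyTorch minimum
print(meets_minimum("0.4", "0.4.1"))          # False: below the Detectron2 minimum
```

In practice you would pass `torch.__version__` and `detectron2.__version__` to the helper instead of literal strings.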

Procedure

The procedure to optimize a RetinaNet (Detectron2) model using Blade is as follows:
  1. Step 1: Export the model

    Export the model using either the TracingAdapter or scripting_with_instances method provided by Detectron2.

  2. Step 2: Call Blade to optimize the model

    Call the blade.optimize interface to optimize the model and save the optimized model.

  3. Step 3: Load and run the optimized model

    Run a performance test. If the results are satisfactory, load the optimized model for inference.

Step 1: Export the model

Detectron2 is a flexible, extensible, and configurable open source training framework from Facebook AI Research (FAIR) for object detection and image segmentation. Because of the framework's flexibility, exporting a model using conventional methods might fail or produce incorrect results. To support TorchScript deployment, Detectron2 provides two export methods: TracingAdapter and scripting_with_instances. For more information, see Detectron2 Usage.
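The export pitfall mentioned above can be seen on a toy module: `torch.jit.trace` records only the branch taken for the example input, while `torch.jit.script` compiles the data-dependent control flow. This is a minimal illustration of why conventional tracing can produce incorrect results on flexible models, not the RetinaNet model itself:

```python
import torch

class Toy(torch.nn.Module):
    def forward(self, x):
        # Data-dependent control flow: tracing records only one branch.
        if x.sum() > 0:
            return x + 1
        return x - 1

model = Toy()
pos, neg = torch.ones(3), -torch.ones(3)

traced = torch.jit.trace(model, pos)   # only the "x + 1" branch is recorded
scripted = torch.jit.script(model)     # both branches are compiled

print(torch.equal(traced(neg), model(neg)))    # False: wrong branch baked in
print(torch.equal(scripted(neg), model(neg)))  # True
```

Detectron2's `scripting_with_instances` builds on `torch.jit.script` for exactly this reason, after making the `Instances` output structure scriptable.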

Blade supports any TorchScript model as input. The following example uses scripting_with_instances to demonstrate the model export process.
import torch
import numpy as np

from torch import Tensor
from torch.testing import assert_allclose

from detectron2 import model_zoo
from detectron2.export import scripting_with_instances
from detectron2.structures import Boxes
from detectron2.data.detection_utils import read_image

# Export the RetinaNet model using scripting_with_instances.
def load_retinanet(config_path):
    model = model_zoo.get(config_path, trained=True).eval()
    fields = {
        "pred_boxes": Boxes,
        "scores": Tensor,
        "pred_classes": Tensor,
    }
    script_model = scripting_with_instances(model, fields)
    return model, script_model

# Download a sample image.
# wget http://images.cocodataset.org/val2017/000000439715.jpg -q -O input.jpg
img = read_image('./input.jpg')
img = torch.from_numpy(np.ascontiguousarray(img.transpose(2, 0, 1)))

# Run and compare the results before and after exporting the model.
pytorch_model, script_model = load_retinanet("COCO-Detection/retinanet_R_50_FPN_3x.yaml")
with torch.no_grad():
    batched_inputs = [{"image": img.float()}]
    pred1 = pytorch_model(batched_inputs)
    pred2 = script_model(batched_inputs)

assert_allclose(pred1[0]['instances'].scores, pred2[0].scores)

Step 2: Call Blade to optimize the model

  1. Call the Blade optimization interface.
    Call the blade.optimize interface to optimize the model. The following code provides an example. For more information about the blade.optimize interface, see Optimize a PyTorch model.
    import blade
    
    test_data = [(batched_inputs,)] # The input data for PyTorch is a list of tuples.
    optimized_model, opt_spec, report = blade.optimize(
        script_model,  # The TorchScript model exported in the previous step.
        'o1',  # Enable Blade O1 level optimization.
        device_type='gpu', # The target device is a GPU.
        test_data=test_data, # A given set of test data to assist with optimization and testing.
    )
  2. Print the optimization report and save the model.
    The model optimized by Blade is still a TorchScript model. After optimization is complete, use the following code to print the optimization report and save the optimized model.
    # Print the optimization report.
    print("Report: {}".format(report))
    # Save the optimized model.
    torch.jit.save(optimized_model, 'optimized.pt')
    The following is the printed optimization report. For more information about the fields in the optimization report, see Optimization report.
    Report: {
      "software_context": [
        {
          "software": "pytorch",
          "version": "1.8.1+cu102"
        },
        {
          "software": "cuda",
          "version": "10.2.0"
        }
      ],
      "hardware_context": {
        "device_type": "gpu",
        "microarchitecture": "T4"
      },
      "user_config": "",
      "diagnosis": {
        "model": "unnamed.pt",
        "test_data_source": "user provided",
        "shape_variation": "undefined",
        "message": "Unable to deduce model inputs information (data type, shape, value range, etc.)",
        "test_data_info": "0 shape: (3, 480, 640) data type: float32"
      },
      "optimizations": [
        {
          "name": "PtTrtPassFp16",
          "status": "effective",
          "speedup": "3.77",
          "pre_run": "40.64 ms",
          "post_run": "10.78 ms"
        }
      ],
      "overall": {
        "baseline": "40.73 ms",
        "optimized": "10.76 ms",
        "speedup": "3.79"
      },
      "model_info": {
        "input_format": "torch_script"
      },
      "compatibility_list": [
        {
          "device_type": "gpu",
          "microarchitecture": "T4"
        }
      ],
      "model_sdk": {}
    }
  3. Run a performance test on the models before and after optimization.
    The following code provides an example of a performance test.
    import time
    
    @torch.no_grad()
    def benchmark(model, inp):
        for i in range(100):
            model(inp)
        torch.cuda.synchronize()
        start = time.time()
        for i in range(200):
            model(inp)
        torch.cuda.synchronize()
        elapsed_ms = (time.time() - start) * 1000
        print("Latency: {:.2f}".format(elapsed_ms / 200))
    
    # Test the speed of the original model.
    benchmark(pytorch_model, batched_inputs)
    # Test the speed of the optimized model.
    benchmark(optimized_model, batched_inputs)
    The following are the reference results for this test.
    Latency: 42.38
    Latency: 10.77
    The results show that after 200 runs, the average latencies of the original and optimized models are 42.38 ms and 10.77 ms, respectively.
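Assuming the report printed in the previous step is a JSON string, its headline numbers can be cross-checked programmatically. The following sketch parses an excerpt of the `overall` section shown above:

```python
import json

# Excerpt of the "overall" section from the optimization report above.
report_str = '''{
  "overall": {"baseline": "40.73 ms", "optimized": "10.76 ms", "speedup": "3.79"}
}'''

overall = json.loads(report_str)["overall"]
baseline = float(overall["baseline"].split()[0])   # 40.73
optimized = float(overall["optimized"].split()[0])  # 10.76

# The ratio of the two latencies matches the reported speedup.
print(round(baseline / optimized, 2))  # 3.79
```

The same parsing applies to the `report` object returned by `blade.optimize` if you serialize it with `str(report)` before loading; treat the exact string format as an assumption and verify it against your Blade version.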

Step 3: Load and run the optimized model

  1. Optional: During the trial period, set the following environment variable to prevent the program from quitting unexpectedly due to an authentication failure:

    export BLADE_AUTH_USE_COUNTING=1
  2. Get authenticated to use PAI-Blade.

    export BLADE_REGION=<region>
    export BLADE_TOKEN=<token>

    Configure the following parameters based on your business requirements:

    • <region>: the region where you use PAI-Blade. You can join the DingTalk group of PAI-Blade users to obtain the regions where PAI-Blade can be used. For information about the QR code of the DingTalk group, see Install PAI-Blade.

    • <token>: the authentication token that is required to use PAI-Blade. You can join the DingTalk group of PAI-Blade users to obtain the authentication token. For information about the QR code of the DingTalk group, see Install PAI-Blade.

  3. Deploy the model.
    The model optimized by Blade is still a TorchScript model. Therefore, you can load the optimized model without switching environments.
    import blade.runtime.torch
    import numpy as np
    import torch
    
    from torch.testing import assert_allclose
    from detectron2 import model_zoo
    from detectron2.data.detection_utils import read_image
    
    pytorch_model = model_zoo.get("COCO-Detection/retinanet_R_50_FPN_3x.yaml", trained=True).eval()
    optimized_model = torch.jit.load('optimized.pt')
    
    img = read_image('./input.jpg')
    img = torch.from_numpy(np.ascontiguousarray(img.transpose(2, 0, 1)))
    
    with torch.no_grad():
        batched_inputs = [{"image": img.float()}]
        pred1 = pytorch_model(batched_inputs)
        pred2 = optimized_model(batched_inputs)
    
    assert_allclose(pred1[0]['instances'].scores, pred2[0].scores, rtol=1e-3, atol=1e-2)