Platform For AI: Optimize RetinaNet (Detectron2) with Blade

Last Updated: Apr 01, 2026

RetinaNet is a one-stage object detection network. Its architecture combines a backbone, multiple subnetworks, and Non-Maximum Suppression (NMS) for post-processing. This tutorial shows how to use PAI-Blade to optimize a RetinaNet model trained with Detectron2.

Prerequisites

Before you begin, ensure that you have:

  • A Linux system with Python 3.6 or later and CUDA 10.2

  • PyTorch 1.8.1 or later and Detectron2 0.4.1 or later

  • Blade 3.16.0 or later
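The version requirements above can be checked with a short script. The following is a minimal sketch, assuming each package exposes a `__version__` attribute (PyTorch and Detectron2 do; for Blade this is an assumption, and the script reports `None` if the attribute is missing). The helper names `parse_version`, `meets_minimum`, and `installed_version` are illustrative.

```python
import importlib
import re
import sys


def parse_version(text):
    """Turn '1.8.1+cu102' into (1, 8, 1); local/build suffixes are ignored."""
    match = re.match(r"(\d+(?:\.\d+)*)", text)
    if match is None:
        raise ValueError("no numeric version in {!r}".format(text))
    return tuple(int(part) for part in match.group(1).split("."))


def meets_minimum(installed, required):
    """Return True when the installed version is at least the required one."""
    a, b = parse_version(installed), parse_version(required)
    # Pad the shorter tuple with zeros so '0.4' compares as '0.4.0'.
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def installed_version(module_name):
    """Return module.__version__, or None if the module or attribute is missing."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, "__version__", None)


# Minimum versions taken from the prerequisites list.
MINIMUMS = {"torch": "1.8.1", "detectron2": "0.4.1", "blade": "3.16.0"}

if __name__ == "__main__":
    print("python ok:", sys.version_info >= (3, 6))
    for name, minimum in MINIMUMS.items():
        found = installed_version(name)
        ok = found is not None and meets_minimum(found, minimum)
        print("{}: found={} required>={} ok={}".format(name, found, minimum, ok))
```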

How it works

The optimization workflow has three stages:

  1. Export the trained RetinaNet model to TorchScript format using Detectron2's `scripting_with_instances`.

  2. Call `blade.optimize` to apply O1-level FP16 optimization and save the optimized model.

  3. Authenticate with PAI-Blade and load the optimized model for inference.

Step 1: Export the model

Detectron2 is an open source object detection and image segmentation framework from Facebook AI Research (FAIR). Its flexible, extensible design means conventional export methods can fail or produce incorrect results. To support TorchScript deployment, Detectron2 provides two export methods:

  • `scripting_with_instances`: Exports the model using TorchScript scripting. Works reliably with models that use structured output types like Instances.

  • `TracingAdapter`: Wraps the model for tracing-based export. Use this when scripting is not feasible for your model architecture.

For full details on both methods, see Detectron2 deployment documentation.
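To make the difference between scripting and tracing concrete, the sketch below uses plain `torch.jit.script` and `torch.jit.trace`, the PyTorch mechanisms that the two Detectron2 helpers rely on. The toy `Branchy` module is hypothetical and not part of Detectron2; it only illustrates how tracing can silently produce incorrect results when a model has data-dependent control flow.

```python
import warnings

import torch


class Branchy(torch.nn.Module):
    """Toy module whose output depends on a data-dependent branch."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if x.sum() > 0:
            return x + 10
        return x - 10


model = Branchy().eval()
positive = torch.ones(3)
negative = -torch.ones(3)

# Scripting compiles the Python source, so both branches are preserved.
scripted = torch.jit.script(model)

# Tracing records only the operations executed for the example input,
# so the branch taken for `positive` is baked into the traced graph.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the TracerWarning about the branch
    traced = torch.jit.trace(model, positive)

print(torch.equal(scripted(negative), model(negative)))  # True
print(torch.equal(traced(negative), model(negative)))    # False
```

This is why the scripted and original outputs are compared after export: an export that runs without errors is not necessarily equivalent to the original model.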

Blade accepts any TorchScript model as input. The following example uses scripting_with_instances to export the RetinaNet model:

import torch
import numpy as np

from torch import Tensor
from torch.testing import assert_allclose

from detectron2 import model_zoo
from detectron2.export import scripting_with_instances
from detectron2.structures import Boxes
from detectron2.data.detection_utils import read_image

# Export the RetinaNet model using scripting_with_instances.
def load_retinanet(config_path):
    model = model_zoo.get(config_path, trained=True).eval()
    fields = {
        "pred_boxes": Boxes,
        "scores": Tensor,
        "pred_classes": Tensor,
    }
    script_model = scripting_with_instances(model, fields)
    return model, script_model

# Download a sample image.
# wget http://images.cocodataset.org/val2017/000000439715.jpg -q -O input.jpg
img = read_image('./input.jpg')
img = torch.from_numpy(np.ascontiguousarray(img.transpose(2, 0, 1)))

# Verify that the exported model produces the same results as the original.
pytorch_model, script_model = load_retinanet("COCO-Detection/retinanet_R_50_FPN_3x.yaml")
with torch.no_grad():
    batched_inputs = [{"image": img.float()}]
    pred1 = pytorch_model(batched_inputs)
    pred2 = script_model(batched_inputs)

assert_allclose(pred1[0]['instances'].scores, pred2[0].scores)

Step 2: Optimize the model with Blade

Call the optimization interface

Call `blade.optimize` to optimize the TorchScript model from the previous step. This example uses O1-level optimization, which applies FP16 mixed-precision acceleration and is the recommended starting point for GPU inference. FP16 reduces memory bandwidth requirements and enables faster compute on modern GPUs, typically with negligible accuracy loss.

For all blade.optimize parameters, see Optimize a PyTorch model.

import blade

# Blade expects input as a list of tuples — one tuple per forward pass.
# The inner tuple matches the model's positional arguments.
test_data = [(batched_inputs,)]

optimized_model, opt_spec, report = blade.optimize(
    script_model,       # The TorchScript model exported in the previous step.
    'o1',               # O1 level: FP16 mixed-precision optimization.
    device_type='gpu',  # Target device is GPU.
    test_data=test_data,
)
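The comments in the call above describe the layout expected for `test_data`. As an illustration of that convention only (this is not Blade's internal code, and `replay_test_data`, `fake_model`, and the placeholder inputs are hypothetical), each tuple can be read as the positional arguments of one forward call:

```python
def replay_test_data(model, test_data):
    """Call the model once per tuple, unpacking the tuple as positional args."""
    outputs = []
    for args in test_data:
        if not isinstance(args, tuple):
            raise TypeError("each test_data entry must be a tuple of positional arguments")
        outputs.append(model(*args))
    return outputs


def fake_model(batched_inputs):
    """Stand-in for the scripted RetinaNet: takes one positional argument."""
    return [len(item["image"]) for item in batched_inputs]


# One image per batch; the "image" value is a placeholder list, not a tensor.
fake_batched_inputs = [{"image": [0.0, 0.0, 0.0]}]

# A single forward pass whose only positional argument is the batch.
# The trailing comma matters: (x,) is a one-element tuple, (x) is just x.
fake_test_data = [(fake_batched_inputs,)]

print(replay_test_data(fake_model, fake_test_data))  # [[3]]
```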

Save the optimized model

The optimized model is still a TorchScript model. Print the optimization report and save the model:

# Print the optimization report.
print("Report: {}".format(report))
# Save the optimized model.
torch.jit.save(optimized_model, 'optimized.pt')

A sample optimization report looks like this:

{
  "software_context": [
    {
      "software": "pytorch",
      "version": "1.8.1+cu102"
    },
    {
      "software": "cuda",
      "version": "10.2.0"
    }
  ],
  "hardware_context": {
    "device_type": "gpu",
    "microarchitecture": "T4"
  },
  "user_config": "",
  "diagnosis": {
    "model": "unnamed.pt",
    "test_data_source": "user provided",
    "shape_variation": "undefined",
    "message": "Unable to deduce model inputs information (data type, shape, value range, etc.)",
    "test_data_info": "0 shape: (3, 480, 640) data type: float32"
  },
  "optimizations": [
    {
      "name": "PtTrtPassFp16",
      "status": "effective",
      "speedup": "3.77",
      "pre_run": "40.64 ms",
      "post_run": "10.78 ms"
    }
  ],
  "overall": {
    "baseline": "40.73 ms",
    "optimized": "10.76 ms",
    "speedup": "3.79"
  },
  "model_info": {
    "input_format": "torch_script"
  },
  "compatibility_list": [
    {
      "device_type": "gpu",
      "microarchitecture": "T4"
    }
  ],
  "model_sdk": {}
}

For a description of all report fields, see Optimization report.
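As a sanity check on the report, the overall speedup can be recomputed from the latency fields. The sketch below assumes the report is available as a JSON string shaped like the sample above; only the fields it reads are reproduced here, and `parse_ms` and `summarize_report` are illustrative helper names.

```python
import json

# Subset of the sample report shown above.
SAMPLE_REPORT = """
{
  "optimizations": [
    {"name": "PtTrtPassFp16", "status": "effective", "speedup": "3.77",
     "pre_run": "40.64 ms", "post_run": "10.78 ms"}
  ],
  "overall": {"baseline": "40.73 ms", "optimized": "10.76 ms", "speedup": "3.79"}
}
"""


def parse_ms(text):
    """Convert a latency string such as '40.73 ms' to a float in milliseconds."""
    value, unit = text.split()
    if unit != "ms":
        raise ValueError("unexpected unit: {!r}".format(unit))
    return float(value)


def summarize_report(report_json):
    """Return (overall speedup, names of passes marked effective)."""
    report = json.loads(report_json)
    overall = report["overall"]
    speedup = parse_ms(overall["baseline"]) / parse_ms(overall["optimized"])
    effective = [
        item["name"]
        for item in report["optimizations"]
        if item["status"] == "effective"
    ]
    return speedup, effective


speedup, effective = summarize_report(SAMPLE_REPORT)
print("{:.2f}x from {}".format(speedup, effective))  # 3.79x from ['PtTrtPassFp16']
```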

Run a performance test

Compare the latency of the original and optimized models. The benchmark runs 100 warmup iterations to let the GPU reach steady state, then measures average latency over 200 iterations:

import time

@torch.no_grad()
def benchmark(model, inp):
    # Warmup: allow GPU kernels to initialize and reach steady state.
    for i in range(100):
        model(inp)
    torch.cuda.synchronize()
    start = time.time()
    for i in range(200):
        model(inp)
    torch.cuda.synchronize()
    elapsed_ms = (time.time() - start) * 1000
    print("Latency: {:.2f}".format(elapsed_ms / 200))

# Test the original model.
benchmark(pytorch_model, batched_inputs)
# Test the optimized model.
benchmark(optimized_model, batched_inputs)

Reference results on a T4 GPU:

Latency: 42.38
Latency: 10.77

The results show that, averaged over 200 runs, latency drops from 42.38 ms for the original model to 10.77 ms for the optimized model, a speedup of roughly 3.9x.
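The benchmark above reports only a mean. A variant that records per-iteration latencies also makes the median and tail latency visible. The following is a framework-agnostic sketch: `measure_latency` is an illustrative name, and `sync` is a hypothetical hook where `torch.cuda.synchronize` would be passed for GPU models.

```python
import statistics
import time


def measure_latency(fn, warmup=100, iters=200, sync=None):
    """Time fn() per call and return mean, median, and p95 latency in ms."""

    def run_once():
        fn()
        if sync is not None:
            sync()  # wait for asynchronous work (e.g. CUDA kernels) to finish

    # Warmup iterations are executed but not timed.
    for _ in range(warmup):
        run_once()

    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        run_once()
        samples.append((time.perf_counter() - start) * 1000.0)

    samples.sort()
    p95_index = min(len(samples) - 1, int(0.95 * len(samples)))
    return {
        "mean_ms": statistics.mean(samples),
        "median_ms": statistics.median(samples),
        "p95_ms": samples[p95_index],
    }


# CPU-only demonstration with a trivial workload.
stats = measure_latency(lambda: sum(range(1000)), warmup=10, iters=50)
print(stats)
```

With the models from this tutorial, the call would look like `measure_latency(lambda: optimized_model(batched_inputs), sync=torch.cuda.synchronize)` inside a `torch.no_grad()` context. Synchronizing after every iteration adds a small overhead compared with timing the whole loop, so the numbers are not directly comparable with the reference results.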

Step 3: Load and run the optimized model

Authenticate with PAI-Blade

During the trial period, set the following environment variable to prevent program exits caused by authentication failures:

export BLADE_AUTH_USE_COUNTING=1

Set the following environment variables to authenticate:

export BLADE_REGION=<region>
export BLADE_TOKEN=<token>

Replace the placeholders:

  • <region>: The region where you use PAI-Blade. Join the PAI-Blade DingTalk user group to get the list of available regions. For the QR code, see Install PAI-Blade.

  • <token>: The authentication token for PAI-Blade. Obtain it from the PAI-Blade DingTalk user group. For the QR code, see Install PAI-Blade.
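It can help to confirm that both variables are present before loading the model, so that a missing value is caught early. A minimal sketch using only the variable names given above; `missing_blade_env` is a hypothetical helper and not part of PAI-Blade:

```python
import os

REQUIRED_VARS = ("BLADE_REGION", "BLADE_TOKEN")


def missing_blade_env(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_blade_env()
if missing:
    print("Set these variables before loading the model:", ", ".join(missing))
else:
    print("PAI-Blade authentication variables are set.")
```

The check only verifies that the variables exist; whether the region and token are accepted is still decided by PAI-Blade at run time.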

Deploy the model

Because the optimized model is still a TorchScript model, load and run it without switching environments:

import blade.runtime.torch  # Load the Blade runtime before loading the optimized model.
import numpy as np
import torch

from torch.testing import assert_allclose
from detectron2 import model_zoo
from detectron2.data.detection_utils import read_image

# Load the original model as a reference, and the optimized model saved in Step 2.
pytorch_model = model_zoo.get("COCO-Detection/retinanet_R_50_FPN_3x.yaml", trained=True).eval()
optimized_model = torch.jit.load('optimized.pt')

# Read the sample image and convert it from HWC to CHW layout.
img = read_image('./input.jpg')
img = torch.from_numpy(np.ascontiguousarray(img.transpose(2, 0, 1)))

with torch.no_grad():
    batched_inputs = [{"image": img.float()}]
    pred1 = pytorch_model(batched_inputs)
    pred2 = optimized_model(batched_inputs)

# FP16 introduces small numerical differences, so compare with relaxed tolerances.
assert_allclose(pred1[0]['instances'].scores, pred2[0].scores, rtol=1e-3, atol=1e-2)
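The final assertion passes explicit `rtol` and `atol` values, unlike the check in Step 1, which is consistent with the small numerical differences that FP16 can introduce. For each element, the criterion is |actual - expected| <= atol + rtol * |expected|. The pure-Python sketch below restates that rule for scalars; `is_close` is an illustrative name and the score values are made up.

```python
def is_close(actual, expected, rtol, atol):
    """Scalar version of the element-wise closeness rule."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


# Made-up detection scores, for illustration only.
reference_score = 0.900
fp16_score = 0.905

# Tolerances used in the deployment check: the difference of 0.005 is accepted.
print(is_close(fp16_score, reference_score, rtol=1e-3, atol=1e-2))  # True

# A much tighter tolerance rejects the same pair.
print(is_close(fp16_score, reference_score, rtol=1e-6, atol=1e-6))  # False
```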

What's next