In general, conventional inference optimization technologies are designed for models
with static input shapes. If the inputs of a model have dynamic shapes, these optimization
technologies may not take effect. However, in production scenarios, more and more models
must handle inputs of dynamic shapes, so the need to optimize the inference performance
of such models keeps growing. This topic describes how to use Machine Learning Platform
for AI (PAI)-Blade to optimize a model with dynamic input shapes.
Limits
The environment used for the procedure in this topic must meet the following version
requirements:
- System environment: Python 3.6 or later in Linux
- Framework: PyTorch 1.7.1
- Device and backend: NVIDIA T4 and CUDA 11.0
- Inference optimization tool: PAI-Blade V3.17.0 or later
Step 1: Make preparations
- Download the pre-trained parameters of the model and test data.
To save you time, the pre-trained parameters have been copied from TorchVision to Object Storage
Service (OSS). The test data is randomly selected from the ImageNet-1k validation set.
It has already been preprocessed and can be used directly after download.
wget http://pai-blade.oss-cn-zhangjiakou.aliyuncs.com/share/dynamic_ranges_pratice/resnet50-19c8e357.pth -O resnet50-19c8e357.pth
wget http://pai-blade.oss-cn-zhangjiakou.aliyuncs.com/share/dynamic_ranges_pratice/imagenet_val_example.pt -O imagenet_val_example.pt
- Define the model, and load the pre-trained parameters and test data to generate a
TorchScript model.
import torch
import torchvision
# Construct a Resnet50 model.
model = torchvision.models.resnet50().eval().cuda()
# Load the pre-trained parameters.
ckpt = torch.load('resnet50-19c8e357.pth')
model.load_state_dict(ckpt)
# Load the test data.
example_input = torch.load('imagenet_val_example.pt').cuda()
# Generate a TorchScript model.
traced_model = torch.jit.trace(model, example_input).cuda().eval()
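Optionally, you can confirm that the traced model reproduces the original model's output on the example input. The following check is a minimal sketch and not part of the original procedure:
# Optional sanity check: the traced model should closely reproduce the original model's output.
with torch.no_grad():
    assert torch.allclose(model(example_input), traced_model(example_input), atol=1e-4)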
Step 2: Construct PAI-Blade config objects for optimization
You can construct PAI-Blade config objects based on the range of dynamic shapes. PAI-Blade
supports a dynamic shape range in any dimension. In this topic, the batch dimension
is used as an example.
- Define the range of dynamic shapes.
A valid range of dynamic shapes must be defined by the following three fields:
- min: the lower limit of the range.
- max: the upper limit of the range.
- opts: one or more shapes that require special optimization. In general, the optimized
model accelerates inference on inputs of these shapes at a higher rate.
Take note of the following rules when you set the preceding fields:
- Each group of shapes that you specify in the min, max, and opts fields must contain the same number of shapes, which must equal the number of inputs of the model.
- Numbers at the same position in the shapes that you specify for the min, max, and opts fields must satisfy the min_num <= opt_num <= max_num rule.
A sanity-check sketch that verifies these rules appears at the end of this step.
The following sample code provides an example on how to define the range of dynamic
shapes:
shapes = {
    "min": [[1, 3, 224, 224]],
    "max": [[10, 3, 224, 224]],
    "opts": [
        [[5, 3, 224, 224]],
        [[8, 3, 224, 224]],
    ]
}
In addition, PAI-Blade allows you to define multiple ranges of dynamic shapes. If
the range between the lower and upper limits is excessively large, the optimized
model may not show a clear advantage in inference acceleration. In this case, you can
split the large range into multiple smaller ranges to improve the acceleration effect.
For more information about how to define multiple ranges of dynamic shapes, see
Appendix: Define multiple ranges of dynamic shapes in this topic.
- Construct PAI-Blade config objects based on the defined range of dynamic shapes.
import blade
import blade.torch as blade_torch
# The config object related to PAI-Blade Torch. This config object is used to specify the range of dynamic shapes.
blade_torch_cfg = blade_torch.Config()
blade_torch_cfg.dynamic_tuning_shapes = shapes
# The config object related to PAI-Blade. This config object is used to disable FP16 precision verification to achieve an optimal acceleration effect.
gpu_config = {
    "disable_fp16_accuracy_check": True,
}
blade_config = blade.Config(
    gpu_config=gpu_config
)
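Before you run the optimization, you can sanity-check the range definition against the rules described above. The check_shape_range helper below is a hypothetical sketch and is not part of the PAI-Blade API:
def check_shape_range(shapes):
    # Hypothetical helper: verify the rules that a shape range definition must follow.
    num_inputs = len(shapes["min"])
    # Each group must contain one shape per model input.
    assert len(shapes["max"]) == num_inputs
    for opt_group in shapes["opts"]:
        assert len(opt_group) == num_inputs
        # Numbers at the same position must satisfy min_num <= opt_num <= max_num.
        for min_s, opt_s, max_s in zip(shapes["min"], opt_group, shapes["max"]):
            assert all(lo <= o <= hi for lo, o, hi in zip(min_s, opt_s, max_s))

check_shape_range(shapes)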
Step 3: Use PAI-Blade to optimize the model
- Call the blade.optimize method to optimize the model. The following sample code provides an example. For more information, see Python method.
with blade_torch_cfg:
    optimized_model, _, report = blade.optimize(
        traced_model,                  # The model to be optimized.
        'o1',                          # Lossless optimization.
        config=blade_config,
        device_type='gpu',             # Optimization for GPU devices.
        test_data=[(example_input,)]   # The test data.
    )
Take note of the following items when you optimize the model:
- The first return value of the blade.optimize method is the optimized model. Its data type remains the same as that of the original model. In this example, a TorchScript model is passed in, and an optimized TorchScript model is returned.
- Make sure that the test data falls within the dynamic shape range that you define, as illustrated by the check below.
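The following assertion is only an illustration; it checks that the batch size of the example_input data from Step 1 lies within the range defined in Step 2:
# Illustrative check: the batch size of the test data must lie between the defined lower and upper limits.
assert shapes["min"][0][0] <= example_input.shape[0] <= shapes["max"][0][0]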
- Display the optimization report after the optimization is complete.
print("Report: {}".format(report))
The following example shows an optimization report:
Report: {
  "software_context": [
    {
      "software": "pytorch",
      "version": "1.7.1+cu110"
    },
    {
      "software": "cuda",
      "version": "11.0.0"
    }
  ],
  "hardware_context": {
    "device_type": "gpu",
    "microarchitecture": "T4"
  },
  "user_config": "",
  "diagnosis": {
    "model": "unnamed.pt",
    "test_data_source": "user provided",
    "shape_variation": "undefined",
    "message": "Unable to deduce model inputs information (data type, shape, value range, etc.)",
    "test_data_info": "0 shape: (1, 3, 224, 224) data type: float32"
  },
  "optimizations": [
    {
      "name": "PtTrtPassFp16",
      "status": "effective",
      "speedup": "4.06",
      "pre_run": "6.55 ms",
      "post_run": "1.61 ms"
    }
  ],
  "overall": {
    "baseline": "6.54 ms",
    "optimized": "1.61 ms",
    "speedup": "4.06"
  },
  "model_info": {
    "input_format": "torch_script"
  },
  "compatibility_list": [
    {
      "device_type": "gpu",
      "microarchitecture": "T4"
    }
  ],
  "model_sdk": {}
}
The optimization report shows that an optimization item named PtTrtPassFp16 takes effect.
Inference on the test data is accelerated by a factor of 4.06, from 6.55 ms to 1.61 ms.
The preceding optimization report is only for reference. The actual optimization effect
of your model prevails. For more information about the fields in the optimization report, see
Optimization report.
- Invoke PyTorch-related functions to save and load the optimized TorchScript model.
file_name = "resnet50_opt.pt"
# Save the optimized model to a local device.
torch.jit.save(optimized_model, file_name)
# Load the optimized model from the disk.
optimized_model = torch.jit.load(file_name)
Step 4: Verify the performance and accuracy of the model
After the optimization is complete, you can verify the information in the optimization
report by running a Python script.
- Define the benchmark method, which warms up the model 10 times and then runs it 100 times to obtain the average inference time as a measure of inference speed.
import time

@torch.no_grad()
def benchmark(model, test_data):
    # Switch the model to evaluation mode.
    model = model.eval()
    # Warm up the model.
    for i in range(0, 10):
        model(test_data)
    # Wait for the warm-up kernels to finish before the timer starts.
    torch.cuda.synchronize()
    # Run the model in timed mode.
    num_runs = 100
    start = time.time()
    for i in range(0, num_runs):
        model(test_data)
    # Wait for the timed kernels to finish before the timer stops.
    torch.cuda.synchronize()
    elapsed = time.time() - start
    rt_ms = elapsed / num_runs * 1000.0
    # Display the results.
    print("{:.2f} ms.".format(rt_ms))
    return rt_ms
- Define multiple groups of test data in different shapes.
dummy_inputs = []
batch_num = [1, 3, 5, 7, 9]
for n in batch_num:
    dummy_inputs.append(torch.randn(n, 3, 224, 224).cuda())
- Traverse all groups of test data, call the benchmark method to test the original and optimized models, and then display the results.
for inp in dummy_inputs:
    print(f'--------------test with shape {list(inp.shape)}--------------')
    print(" Origin model inference cost: ", end='')
    origin_rt = benchmark(traced_model, inp)
    print(" Optimized model inference cost: ", end='')
    opt_rt = benchmark(optimized_model, inp)
    speedup = origin_rt / opt_rt
    print(' Speed up: {:.2f}'.format(speedup))
    print('')
The system displays information similar to the following output:
--------------test with shape [1, 3, 224, 224]--------------
Origin model inference cost: 6.54 ms.
Optimized model inference cost: 1.66 ms.
Speed up: 3.94
--------------test with shape [3, 3, 224, 224]--------------
Origin model inference cost: 10.79 ms.
Optimized model inference cost: 2.40 ms.
Speed up: 4.49
--------------test with shape [5, 3, 224, 224]--------------
Origin model inference cost: 16.27 ms.
Optimized model inference cost: 3.25 ms.
Speed up: 5.01
--------------test with shape [7, 3, 224, 224]--------------
Origin model inference cost: 22.62 ms.
Optimized model inference cost: 4.39 ms.
Speed up: 5.16
--------------test with shape [9, 3, 224, 224]--------------
Origin model inference cost: 28.83 ms.
Optimized model inference cost: 5.25 ms.
Speed up: 5.49
The output shows the test results for all groups of test data in different shapes.
The inference speed of the optimized model is 3.94 to 5.49 times that of the original
model. The preceding results are only for reference. The actual optimization effect
of your model prevails.
- Verify the accuracy of the optimized model by using the example_input test data that you prepared in Step 1.
origin_output = traced_model(example_input)
_, pred = origin_output.topk(1, 1, True, True)
print("origin model output: {}".format(pred))
opt_output = optimized_model(example_input)
_, pred = opt_output.topk(1, 1, True, True)
print("optimized model output: {}".format(pred))
The system displays information similar to the following output:
origin model output: tensor([[834]], device='cuda:0')
optimized model output: tensor([[834]], device='cuda:0')
The output shows that both the original model and optimized model classify the
example_input test data as Category 834.
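Because the PtTrtPassFp16 optimization runs the model in FP16, the raw logits of the two models may differ slightly even though the predicted class is identical. The following optional check, which is not part of the original procedure, quantifies that difference:
# The FP16-optimized model may produce slightly different logits; inspect the maximum deviation.
max_diff = (origin_output - opt_output).abs().max().item()
print("max absolute difference between logits: {:.4f}".format(max_diff))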
Step 5: Load and run the optimized model
After the verification is complete, you can deploy the optimized model. PAI-Blade
provides an SDK for Python and an SDK for C++ that you can integrate. For more information
about how to use the SDK for C++, see Use an SDK to deploy a TensorFlow model for inference. The following section describes how to use the SDK for Python to deploy a model.
- Optional: During the trial period, add the following environment variable setting to prevent
the program from exiting unexpectedly due to an authentication failure:
export BLADE_AUTH_USE_COUNTING=1
- Get authenticated to use PAI-Blade.
export BLADE_REGION=<region>
export BLADE_TOKEN=<token>
Configure the following parameters based on your business requirements:
- <region>: the region where you use PAI-Blade. You can join the DingTalk group of PAI-Blade
users to obtain the regions where PAI-Blade can be used. For information about the
QR code of the DingTalk group, see Obtain an access token.
- <token>: the authentication token that is required to use PAI-Blade. You can join the DingTalk
group of PAI-Blade users to obtain the authentication token. For information about
the QR code of the DingTalk group, see Obtain an access token.
- Load and run the optimized model.
Aside from adding import blade.runtime.torch to the inference code, you do not need to
write any extra code to integrate the PAI-Blade SDK or modify the original inference code.
import torch
import blade.runtime.torch
# Replace <your_optimized_model_path> with the path of the optimized model.
opt_model_dir = <your_optimized_model_path>
# Replace <your_infer_data> with the data on which you want to perform inference.
infer_data = <your_infer_data>
model = torch.jit.load(opt_model_dir)
output = model(infer_data)
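Because the model was optimized over a dynamic batch range, the deployed model accepts any batch size within the range defined in Step 2. The following lines are a minimal sketch under that assumption; they reuse the resnet50_opt.pt file saved in Step 3 and randomly generated input data:
import torch
import blade.runtime.torch

model = torch.jit.load("resnet50_opt.pt")
# Batch sizes 1, 4, and 10 all fall inside the range [1, 10] defined in Step 2.
for batch_size in (1, 4, 10):
    batch = torch.randn(batch_size, 3, 224, 224).cuda()
    output = model(batch)
    print(batch_size, output.shape)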
Appendix: Define multiple ranges of dynamic shapes
If the range between the lower and upper limits is excessively large, the optimized
model may not show a clear advantage in inference acceleration. In this case, you can
split the large range into multiple smaller ranges to improve the acceleration effect.
The following sample code provides an example on how to define multiple ranges of
dynamic shapes:
shapes1 = {
    "min": [[1, 3, 224, 224]],
    "max": [[5, 3, 224, 224]],
    "opts": [
        [[5, 3, 224, 224]],
    ]
}
shapes2 = {
    "min": [[5, 3, 224, 224]],
    "max": [[10, 3, 224, 224]],
    "opts": [
        [[8, 3, 224, 224]],
    ]
}
shapes = [shapes1, shapes2]
Then, you can use this
shapes configuration to construct the above-mentioned PAI-Blade config objects. For more
information, see
Step 2: Construct PAI-Blade config objects for optimization.
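Assuming that the dynamic_tuning_shapes attribute accepts a list of range definitions in the same way as a single definition, as the reference to Step 2 implies, the assignment would look like the following sketch:
blade_torch_cfg = blade_torch.Config()
# Assumption: a list of range definitions can be assigned in the same way as a single range.
blade_torch_cfg.dynamic_tuning_shapes = shapes  # shapes = [shapes1, shapes2]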