In general, conventional inference optimization technologies are designed for models
with static input shapes. If the inputs of a model have dynamic shapes, these optimization
technologies may not take effect. However, in production scenarios, more and more models
must handle inputs of dynamic shapes, so the need to optimize the inference performance
of such models keeps growing. This topic describes how to use Machine Learning Platform
for AI (PAI)-Blade to optimize a model with dynamic input shapes.
Limits
The environment used for the procedure in this topic must meet the following version
requirements:
- System environment: Python 3.6 or later in Linux
- Framework: PyTorch 1.7.1
- Device and backend: NVIDIA T4 and CUDA 11.0
- Inference optimization tool: PAI-Blade V3.17.0 or later
Step 1: Make preparations
- Download the pre-trained parameters of the model and test data.
To save you time, the pre-trained parameters have been copied from TorchVision to Object Storage
Service (OSS). The test data is randomly selected from the ImageNet-1k validation set.
It has already been preprocessed and can be used directly after download.
wget http://pai-blade.oss-cn-zhangjiakou.aliyuncs.com/share/dynamic_ranges_pratice/resnet50-19c8e357.pth -O resnet50-19c8e357.pth
wget http://pai-blade.oss-cn-zhangjiakou.aliyuncs.com/share/dynamic_ranges_pratice/imagenet_val_example.pt -O imagenet_val_example.pt
- Define the model, and load the pre-trained parameters and test data to generate a
TorchScript model.
import torch
import torchvision
# Construct a Resnet50 model.
model = torchvision.models.resnet50().eval().cuda()
# Load the pre-trained parameters.
ckpt = torch.load('resnet50-19c8e357.pth')
model.load_state_dict(ckpt)
# Load the test data.
example_input = torch.load('imagenet_val_example.pt').cuda()
# Generate a TorchScript model.
traced_model = torch.jit.trace(model, example_input).cuda().eval()
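Optionally, you can confirm that the traced model reproduces the original model's output on the example input. The following check is a minimal sketch and not part of the original procedure:
# Optional sanity check: the traced model should closely reproduce the original model's output.
with torch.no_grad():
    assert torch.allclose(model(example_input), traced_model(example_input), atol=1e-4)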
Step 2: Construct PAI-Blade config objects for optimization
You can construct PAI-Blade config objects based on the range of dynamic shapes. PAI-Blade
supports a dynamic shape range in any dimension. In this topic, the batch dimension
is used as an example.
- Define the range of dynamic shapes.
A valid range of dynamic shapes must be defined by the following three fields:
- min: the lower limit of the range.
- max: the upper limit of the range.
- opts: one or more shapes that require special optimization. In general, the optimized
model accelerates inference on inputs of these shapes at a higher rate.
Take note of the following rules when you set the preceding fields:
- Each group of shapes that you specify in the min, max, and opts fields must contain the same number of shapes, which must equal the number of inputs of the model.
- Numbers at the same position in the shapes that you specify for the min, max, and opts fields must satisfy the min_num <= opt_num <= max_num rule.
A sanity-check sketch that verifies these rules appears at the end of this step.
The following sample code provides an example on how to define the range of dynamic
shapes:
shapes = {
    "min": [[1, 3, 224, 224]],
    "max": [[10, 3, 224, 224]],
    "opts": [
        [[5, 3, 224, 224]],
        [[8, 3, 224, 224]],
    ]
}
In addition, PAI-Blade allows you to define multiple ranges of dynamic shapes. If
the range between the lower and upper limits is excessively large, the optimized
model may not show a clear advantage in inference acceleration. In this case, you can
split the large range into multiple smaller ranges to improve the acceleration effect.
For more information about how to define multiple ranges of dynamic shapes, see
Appendix: Define multiple ranges of dynamic shapes in this topic.
- Construct PAI-Blade config objects based on the defined range of dynamic shapes.
import blade
import blade.torch as blade_torch
# The config object related to PAI-Blade Torch. This config object is used to specify the range of dynamic shapes.
blade_torch_cfg = blade_torch.Config()
blade_torch_cfg.dynamic_tuning_shapes = shapes
# The config object related to PAI-Blade. This config object is used to disable FP16 precision verification to achieve an optimal acceleration effect.
gpu_config = {
    "disable_fp16_accuracy_check": True,
}
blade_config = blade.Config(
    gpu_config=gpu_config
)
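Before you run the optimization, you can sanity-check the range definition against the rules described above. The check_shape_range helper below is a hypothetical sketch and is not part of the PAI-Blade API:
def check_shape_range(shapes):
    # Hypothetical helper: verify the rules that a shape range definition must follow.
    num_inputs = len(shapes["min"])
    # Each group must contain one shape per model input.
    assert len(shapes["max"]) == num_inputs
    for opt_group in shapes["opts"]:
        assert len(opt_group) == num_inputs
        # Numbers at the same position must satisfy min_num <= opt_num <= max_num.
        for min_s, opt_s, max_s in zip(shapes["min"], opt_group, shapes["max"]):
            assert all(lo <= o <= hi for lo, o, hi in zip(min_s, opt_s, max_s))

check_shape_range(shapes)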
Step 3: Use PAI-Blade to optimize the model
- Call the blade.optimize method to optimize the model. The following sample code provides an example. For more information, see Python method.
with blade_torch_cfg:
    optimized_model, _, report = blade.optimize(
        traced_model,                  # The model to be optimized.
        'o1',                          # Lossless optimization.
        config=blade_config,
        device_type='gpu',             # Optimization for GPU devices.
        test_data=[(example_input,)]   # The test data.
    )
Take note of the following items when you optimize the model:
- The first return value of the blade.optimize method is the optimized model. Its data type remains the same as that of the original model. In this example, a TorchScript model is passed in, and an optimized TorchScript model is returned.
- Make sure that the test data falls within the dynamic shape range that you define, as illustrated by the check below.
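The following assertion is only an illustration; it checks that the batch size of the example_input data from Step 1 lies within the range defined in Step 2:
# Illustrative check: the batch size of the test data must lie between the defined lower and upper limits.
assert shapes["min"][0][0] <= example_input.shape[0] <= shapes["max"][0][0]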
- Display the optimization report after the optimization is complete.
print("Report: {}".format(report))
The following example shows an optimization report:
Report: {
  "software_context": [
    {
      "software": "pytorch",
      "version": "1.7.1+cu110"
    },
    {
      "software": "cuda",
      "version": "11.0.0"
    }
  ],
  "hardware_context": {
    "device_type": "gpu",
    "microarchitecture": "T4"
  },
  "user_config": "",
  "diagnosis": {
    "model": "unnamed.pt",
    "test_data_source": "user provided",
    "shape_variation": "undefined",
    "message": "Unable to deduce model inputs information (data type, shape, value range, etc.)",
    "test_data_info": "0 shape: (1, 3, 224, 224) data type: float32"
  },
  "optimizations": [
    {
      "name": "PtTrtPassFp16",
      "status": "effective",
      "speedup": "4.06",
      "pre_run": "6.55 ms",
      "post_run": "1.61 ms"
    }
  ],
  "overall": {
    "baseline": "6.54 ms",
    "optimized": "1.61 ms",
    "speedup": "4.06"
  },
  "model_info": {
    "input_format": "torch_script"
  },
  "compatibility_list": [
    {
      "device_type": "gpu",
      "microarchitecture": "T4"
    }
  ],
  "model_sdk": {}
}
The optimization report shows that an optimization item named PtTrtPassFp16 takes effect.
Inference on the test data is accelerated by a factor of 4.06, from 6.55 ms to 1.61 ms.
The preceding optimization report is only for reference. The actual optimization effect
of your model prevails. For more information about the fields in the optimization report, see
Optimization report.
- Invoke PyTorch-related functions to save and load the optimized TorchScript model.
file_name = "resnet50_opt.pt"
# Save the optimized model to a local device.
torch.jit.save(optimized_model, file_name)
# Load the optimized model from the disk.
optimized_model = torch.jit.load(file_name)
Step 4: Verify the performance and accuracy of the model
After the optimization is complete, you can verify the information in the optimization
report by running a Python script.
- Define the benchmark method, which warms up the model 10 times and then runs it 100 times to obtain the average inference time as a measure of inference speed.
import time

@torch.no_grad()
def benchmark(model, test_data):
    # Switch the model to evaluation mode.
    model = model.eval()
    # Warm up the model.
    for i in range(0, 10):
        model(test_data)
    # Wait for the warm-up kernels to finish before the timer starts.
    torch.cuda.synchronize()
    # Run the model in timed mode.
    num_runs = 100
    start = time.time()
    for i in range(0, num_runs):
        model(test_data)
    # Wait for the timed kernels to finish before the timer stops.
    torch.cuda.synchronize()
    elapsed = time.time() - start
    rt_ms = elapsed / num_runs * 1000.0
    # Display the results.
    print("{:.2f} ms.".format(rt_ms))
    return rt_ms
- Define multiple groups of test data in different shapes.
dummy_inputs = []
batch_num = [1, 3, 5, 7, 9]
for n in batch_num:
    dummy_inputs.append(torch.randn(n, 3, 224, 224).cuda())
- Traverse all groups of test data, call the benchmark method to test the original and optimized models, and then display the results.
for inp in dummy_inputs:
    print(f'--------------test with shape {list(inp.shape)}--------------')
    print(" Origin model inference cost: ", end='')
    origin_rt = benchmark(traced_model, inp)
    print(" Optimized model inference cost: ", end='')
    opt_rt = benchmark(optimized_model, inp)
    speedup = origin_rt / opt_rt
    print(' Speed up: {:.2f}'.format(speedup))
    print('')
The system displays information similar to the following output:
--------------test with shape [1, 3, 224, 224]--------------
Origin model inference cost: 6.54 ms.
Optimized model inference cost: 1.66 ms.
Speed up: 3.94
--------------test with shape [3, 3, 224, 224]--------------
Origin model inference cost: 10.79 ms.
Optimized model inference cost: 2.40 ms.
Speed up: 4.49
--------------test with shape [5, 3, 224, 224]--------------
Origin model inference cost: 16.27 ms.
Optimized model inference cost: 3.25 ms.
Speed up: 5.01
--------------test with shape [7, 3, 224, 224]--------------
Origin model inference cost: 22.62 ms.
Optimized model inference cost: 4.39 ms.
Speed up: 5.16
--------------test with shape [9, 3, 224, 224]--------------
Origin model inference cost: 28.83 ms.
Optimized model inference cost: 5.25 ms.
Speed up: 5.49
The output shows the test results for all groups of test data in different shapes.
The inference speed of the optimized model is 3.94 to 5.49 times that of the original
model. The preceding results are only for reference. The actual optimization effect
of your model prevails.
- Verify the accuracy of the optimized model by using the example_input test data that you prepared in Step 1.
origin_output = traced_model(example_input)
_, pred = origin_output.topk(1, 1, True, True)
print("origin model output: {}".format(pred))
opt_output = optimized_model(example_input)
_, pred = opt_output.topk(1, 1, True, True)
print("optimized model output: {}".format(pred))
The system displays information similar to the following output:
origin model output: tensor([[834]], device='cuda:0')
optimized model output: tensor([[834]], device='cuda:0')
The output shows that both the original model and optimized model classify the
example_input test data as Category 834.
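Because the PtTrtPassFp16 optimization runs the model in FP16, the raw logits of the two models may differ slightly even though the predicted class is identical. The following optional check, which is not part of the original procedure, quantifies that difference:
# The FP16-optimized model may produce slightly different logits; inspect the maximum deviation.
max_diff = (origin_output - opt_output).abs().max().item()
print("max absolute difference between logits: {:.4f}".format(max_diff))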
Step 5: Load and run the optimized model
After the verification is complete, you can deploy the optimized model. PAI-Blade
provides an SDK for Python and an SDK for C++ that you can integrate. For more information
about how to use the SDK for C++, see Use an SDK to deploy a TensorFlow model for inference. The following section describes how to use the SDK for Python to deploy a model.
- Optional: During the trial period, add the following environment variable setting to prevent
the program from exiting unexpectedly due to an authentication failure:
export BLADE_AUTH_USE_COUNTING=1
- Get authenticated to use PAI-Blade.
export BLADE_REGION=<region>
export BLADE_TOKEN=<token>
Configure the following parameters based on your business requirements:
- <region>: the region where you use PAI-Blade. You can join the DingTalk group of PAI-Blade
users to obtain the regions where PAI-Blade can be used. For information about the
QR code of the DingTalk group, see Obtain an access token.
- <token>: the authentication token that is required to use PAI-Blade. You can join the DingTalk
group of PAI-Blade users to obtain the authentication token. For information about
the QR code of the DingTalk group, see Obtain an access token.
- Load and run the optimized model.
Aside from adding import blade.runtime.torch to the inference code, you do not need to
write any extra code to integrate the PAI-Blade SDK or modify the original inference code.
import torch
import blade.runtime.torch
# Replace <your_optimized_model_path> with the path of the optimized model.
opt_model_dir = <your_optimized_model_path>
# Replace <your_infer_data> with the data on which you want to perform inference.
infer_data = <your_infer_data>
model = torch.jit.load(opt_model_dir)
output = model(infer_data)
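Because the model was optimized over a dynamic batch range, the deployed model accepts any batch size within the range defined in Step 2. The following lines are a minimal sketch under that assumption; they reuse the resnet50_opt.pt file saved in Step 3 and randomly generated input data:
import torch
import blade.runtime.torch

model = torch.jit.load("resnet50_opt.pt")
# Batch sizes 1, 4, and 10 all fall inside the range [1, 10] defined in Step 2.
for batch_size in (1, 4, 10):
    batch = torch.randn(batch_size, 3, 224, 224).cuda()
    output = model(batch)
    print(batch_size, output.shape)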
Appendix: Define multiple ranges of dynamic shapes
If the range between the lower and upper limits is excessively large, the optimized
model may not show a clear advantage in inference acceleration. In this case, you can
split the large range into multiple smaller ranges to improve the acceleration effect.
The following sample code provides an example on how to define multiple ranges of
dynamic shapes:
shapes1 = {
    "min": [[1, 3, 224, 224]],
    "max": [[5, 3, 224, 224]],
    "opts": [
        [[5, 3, 224, 224]],
    ]
}
shapes2 = {
    "min": [[5, 3, 224, 224]],
    "max": [[10, 3, 224, 224]],
    "opts": [
        [[8, 3, 224, 224]],
    ]
}
shapes = [shapes1, shapes2]
Then, you can use this
shapes configuration to construct the above-mentioned PAI-Blade config objects. For more
information, see
Step 2: Construct PAI-Blade config objects for optimization.
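Assuming that the dynamic_tuning_shapes attribute accepts a list of range definitions in the same way as a single definition, as the reference to Step 2 implies, the assignment would look like the following sketch:
blade_torch_cfg = blade_torch.Config()
# Assumption: a list of range definitions can be assigned in the same way as a single range.
blade_torch_cfg.dynamic_tuning_shapes = shapes  # shapes = [shapes1, shapes2]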