Use PAI-Blade to optimize a TensorFlow ResNet50 model

ResNet50 is a classic, widely used network architecture. Optimizing a ResNet50 model is therefore of practical value in many model deployment and inference scenarios. This topic describes how to use Platform for AI (PAI)-Blade to optimize a TensorFlow ResNet50 model.

Background information

Residual Neural Network (ResNet) is often treated as the "Hello World" of deep learning models in the image field. ResNet models are widely used in object classification. ResNet is also a classic backbone network for computer vision and can be used to extract convolutional features from images. Typical ResNet networks include ResNet26, ResNet50, and ResNet101.

Limits

The environment used for the procedure in this topic must meet the following version requirements:

System environment: Python 3.6 or later and Compute Unified Device Architecture (CUDA) 10.0 in Linux
Framework: TensorFlow 1.15
Inference optimization tool: PAI-Blade V3.17.0 or later
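
The version requirements above can be checked with a short script before you start. The following is a minimal sketch and not part of PAI-Blade: `version_at_least` is a hypothetical helper that assumes version strings contain only numeric components separated by dots.

```python
import sys


def version_at_least(found, required):
    """Return True if dotted version `found` is >= `required`.

    Hypothetical helper: assumes purely numeric components such as "3.17.0".
    """
    a = tuple(int(part) for part in found.split("."))
    b = tuple(int(part) for part in required.split("."))
    # Pad the shorter version with zeros so that "3.17" equals "3.17.0".
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


# This topic requires Python 3.6 or later.
python_ok = sys.version_info >= (3, 6)
print("Python version OK:", python_ok)

# Compare a PAI-Blade version string against the documented minimum.
print(version_at_least("3.17.0", "3.17.0"))
print(version_at_least("3.9.2", "3.17.0"))
```

Comparing integer tuples instead of raw strings avoids ranking "3.9.2" above "3.17.0", which a plain string comparison would do.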

Procedure

Perform the following steps to optimize a TensorFlow ResNet50 model by using PAI-Blade:

Step 1: Make preparations
Install a wheel package of PAI-Blade that supports TensorRT optimization, and download a ResNet50 model and test data.
Step 2: Use PAI-Blade to optimize the model
Call the blade.optimize method to optimize the model.
Step 3: Verify the performance
Verify the information in the optimization report by testing the inference speed of the original model and the optimized model.
Step 4: Load and run the optimized model
Integrate PAI-Blade SDK to load the optimized model for inference.

Step 1: Make preparations

In this example, TensorRT is the main optimization item that takes effect for the ResNet50 model. Therefore, you must install PAI-Blade V3.17.0 or later, which supports TensorRT optimization.

Install the PAI-Blade package that corresponds to TensorFlow 1.15.0 and CUDA 10.0.

pip3 install pai_blade_gpu==3.17.0 -f https://pai-blade.oss-cn-zhangjiakou.aliyuncs.com/release/repo.html

Download a TensorFlow ResNet50 model and the corresponding test data.

wget http://pai-blade.cn-hangzhou.oss.aliyun-inc.com/tutorials/tf_resnet50_v1.5.tar.gz

The downloaded tf_resnet50_v1.5.tar.gz package contains the frozen.pb file, which is the ResNet50 model, and the corresponding test data for different batch sizes. You must manually decompress the package.
```
tar zxvf tf_resnet50_v1.5.tar.gz
```
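
After decompression, it can help to confirm that the files used later in this topic are in place. The following is a minimal sketch: `missing_files` is a hypothetical helper, and the directory and file names are the ones that the later sample code in this topic reads.

```python
import os


def missing_files(local_dir, expected):
    """Return the names in `expected` that are not regular files in `local_dir`."""
    return [name for name in expected
            if not os.path.isfile(os.path.join(local_dir, name))]


# File names taken from this topic; the package may contain more test data files.
missing = missing_files("./tf_resnet50_v1.5", ["frozen.pb", "test_bc1.npy"])
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("Model and test data found.")
```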

Step 2: Use PAI-Blade to optimize the model

Obtain the TensorFlow model and test data from the TAR package that you downloaded in the previous step.

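import os
# Select the GPU before TensorFlow is imported. "1" means the second GPU; use "0" on a single-GPU machine.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"
import numpy as np
import time
# TensorFlow is required to parse the GraphDef model.
import tensorflow.compat.v1 as tf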
import blade
from blade.model.tf_model import TfModel

def _load_model_and_data():
    local_dir = "./tf_resnet50_v1.5/"
    model_path = os.path.abspath(os.path.join(local_dir, "frozen.pb"))
    data_path = os.path.abspath(os.path.join(local_dir, "test_bc1.npy"))
    graph_def = tf.GraphDef()
    with open(model_path, 'rb') as f:
        graph_def.ParseFromString(f.read())
    test_data = np.load(data_path, allow_pickle=True, encoding='bytes').item()
    return graph_def, test_data

# Let's go!

# Load resnet model and test data.
graph_def, test_data = _load_model_and_data()
print(test_data)

Use PAI-Blade to perform TensorRT optimization.

Based on the type of input, TensorRT optimization can be performed in one of the following two modes:

Static shape optimization

This mode applies if all inputs in model inference requests have the same shape. For example, a service may accept only inputs of one specific size in order to reduce latency. The following sample code provides an example:

config = blade.Config()
config.gpu_config.aicompiler.enable = False
config.gpu_config.disable_fp16_accuracy_check = True
config.gpu_config.tensorrt.enable = True  # TensorRT optimization is enabled by default. Set this parameter to False to disable it.

# The `optimize` function is the entry point to Blade's one-stop optimization.
optimized_model_static, opt_spec_static, report = blade.optimize(
    graph_def,  # The original model. In this example, a TF GraphDef.
    'o1',  # Optimization level: o1 or o2.
    device_type='gpu',  # Target device on which the optimized model runs.
    config=config,  # The blade.Config object that holds detailed optimization settings.
    outputs=['softmax_tensor'],  # Names of the output nodes. If you omit them, Blade tries to infer them.
    test_data=[test_data]
)
print(report)

The system displays an optimization report similar to the following output:

{
  "software_context": [
    {
      "software": "tensorflow",
      "version": "1.15.0"
    },
    {
      "software": "cuda",
      "version": "10.0.0"
    }
  ],
  "hardware_context": {
    "device_type": "gpu",
    "microarchitecture": "T4"
  },
  "user_config": "",
  "diagnosis": {
    "model": "tmp_graph.pbtxt",
    "test_data_source": "user provided",
    "shape_variation": "dynamic",
    "message": "",
    "test_data_info": "input_tensor:0 shape: (1, 224, 224, 3) data type: float32"
  },
  "optimizations": [
    {
      "name": "Tf2TrtPlus",
      "status": "effective",
      "speedup": "3.37",
      "pre_run": "6.81 ms",
      "post_run": "2.02 ms"
    },
    {
      "name": "TfStripUnusedNodes",
      "status": "effective",
      "speedup": "na",
      "pre_run": "na",
      "post_run": "na"
    },
    {
      "name": "TfFoldConstants",
      "status": "effective",
      "speedup": "na",
      "pre_run": "na",
      "post_run": "na"
    }
  ],
  "overall": {
    "baseline": "6.98 ms",
    "optimized": "2.11 ms",
    "speedup": "3.31"
  },
  "model_info": {
    "input_format": "frozen_pb"
  },
  "compatibility_list": [
    {
      "device_type": "gpu",
      "microarchitecture": "T4"
    }
  ],
  "model_sdk": {}
}

No extra parameters are set for the optimization, so static shape optimization is used and the optimized shape is taken from test_data. The preceding optimization report shows that the Tf2TrtPlus optimization item takes effect. With static shape optimization, if the shape of an input at inference time differs from the shape used for the optimization, the inference falls back to the original TensorFlow graph and inference efficiency drops sharply.

The preceding optimization report is only for reference. The actual optimization effects of your model prevail. For more information about the parameters in the optimization report, see Optimization report.
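
The fields of the report can also be read programmatically. The following is a minimal sketch that assumes the report returned by blade.optimize is JSON text with the structure shown above. `summarize_report` is a hypothetical helper, and the sample input is a trimmed copy of the preceding report.

```python
import json


def summarize_report(report_text):
    """Return (names of effective optimizations, overall speedup) from a report."""
    report = json.loads(report_text)
    effective = [item["name"] for item in report["optimizations"]
                 if item["status"] == "effective"]
    # Latencies are strings such as "6.98 ms"; keep only the numeric part.
    baseline = float(report["overall"]["baseline"].split()[0])
    optimized = float(report["overall"]["optimized"].split()[0])
    return effective, round(baseline / optimized, 2)


# Trimmed copy of the sample report in this topic.
sample = json.dumps({
    "optimizations": [
        {"name": "Tf2TrtPlus", "status": "effective"},
        {"name": "TfStripUnusedNodes", "status": "effective"},
        {"name": "TfFoldConstants", "status": "effective"},
    ],
    "overall": {"baseline": "6.98 ms", "optimized": "2.11 ms", "speedup": "3.31"},
})

names, speedup = summarize_report(sample)
print(names)
print(speedup)
```

Recomputing the ratio from the baseline and optimized latencies gives the same value as the speedup field of the sample report, which is a quick consistency check when you compare reports across runs.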

Dynamic shape optimization

If the service that you deploy supports the batching feature, the server packs the requests that are received within a specific period of time into a batch. The batch size depends on how many requests arrive in that period, so it can vary from batch to batch. To handle such dynamic shapes, PAI-Blade supports dynamic shape optimization by integrating TensorRT. You only need to configure the TensorRTConfig class to enable dynamic shape optimization. The following sample code provides an example. For more information about how to configure the TensorRTConfig class, see Appendix: TensorRTConfig in this topic.

config_dynamic = blade.Config()
config_dynamic.gpu_config.aicompiler.enable = False
config_dynamic.gpu_config.disable_fp16_accuracy_check = True 
config_dynamic.gpu_config.tensorrt.enable = True 
config_dynamic.gpu_config.tensorrt.dynamic_tuning_shapes = {
    "min": [1, 224, 224, 3],
    "opts": [
        [1, 224, 224, 3],
        [2, 224, 224, 3],
        [4, 224, 224, 3],
        [8, 224, 224, 3],
    ],
    "max": [8, 224, 224, 3],
}

# Call Blade's one-stop optimization, with a dynamic shapes setting for TensorRT optimization.
optimized_model_dynamic, opt_spec_dynamic, report = blade.optimize(
    graph_def,  
    'o1',  
    device_type='gpu',  
    config=config_dynamic,
    outputs=['softmax_tensor'],  
    test_data=[test_data]
)
print(report)

# Save the optimized model so that it can be loaded in Step 4.
with tf.gfile.GFile('optimized_model_dynamic.pb', mode='wb') as f:
    f.write(optimized_model_dynamic.SerializeToString())

The system displays an optimization report similar to the following output:

{
  "software_context": [
    {
      "software": "tensorflow",
      "version": "1.15.0"
    },
    {
      "software": "cuda",
      "version": "10.0.0"
    }
  ],
  "hardware_context": {
    "device_type": "gpu",
    "microarchitecture": "T4"
  },
  "user_config": "",
  "diagnosis": {
    "model": "tmp_graph.pbtxt",
    "test_data_source": "user provided",
    "shape_variation": "dynamic",
    "message": "",
    "test_data_info": "input_tensor:0 shape: (1, 224, 224, 3) data type: float32"
  },
  "optimizations": [
    {
      "name": "Tf2TrtPlus",
      "status": "effective",
      "speedup": "3.96",
      "pre_run": "7.98 ms",
      "post_run": "2.02 ms"
    },
    {
      "name": "TfStripUnusedNodes",
      "status": "effective",
      "speedup": "na",
      "pre_run": "na",
      "post_run": "na"
    },
    {
      "name": "TfFoldConstants",
      "status": "effective",
      "speedup": "na",
      "pre_run": "na",
      "post_run": "na"
    }
  ],
  "overall": {
    "baseline": "7.87 ms",
    "optimized": "2.52 ms",
    "speedup": "3.12"
  },
  "model_info": {
    "input_format": "frozen_pb"
  },
  "compatibility_list": [
    {
      "device_type": "gpu",
      "microarchitecture": "T4"
    }
  ],
  "model_sdk": {}
}

The preceding optimization report is similar to the report for static shape optimization. Note that TensorRT optimization takes effect only if the shape of the inference input falls within the range defined by the min and max parameters. Otherwise, the inference falls back to the original TensorFlow graph.
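
The range rule described above can be expressed as a small check. The following is a minimal sketch that only mirrors the rule as stated in this topic. The actual dispatch happens inside PAI-Blade, and `shape_in_range` is a hypothetical helper.

```python
def shape_in_range(shape, min_shape, max_shape):
    """Return True if every dimension of `shape` lies within [min, max]."""
    if not (len(shape) == len(min_shape) == len(max_shape)):
        return False
    return all(lo <= dim <= hi
               for dim, lo, hi in zip(shape, min_shape, max_shape))


# Range taken from the dynamic shape example in this topic.
min_shape = [1, 224, 224, 3]
max_shape = [8, 224, 224, 3]

# A batch of 4 is inside the range, so TensorRT optimization would apply.
print(shape_in_range([4, 224, 224, 3], min_shape, max_shape))
# A batch of 16 exceeds the maximum, so inference would fall back to TensorFlow.
print(shape_in_range([16, 224, 224, 3], min_shape, max_shape))
```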

Step 3: Verify the performance

After the optimization is complete, you can run the following Python script to verify the information in the optimization report:

import time
with tf.Session(config=TfModel.new_session_config()) as sess, opt_spec_dynamic:
    sess.graph.as_default()
    tf.import_graph_def(optimized_model_dynamic, name="")

    # Warmup!
    for i in range(0, 100):
        sess.run(['softmax_tensor:0'], test_data)

    # Benchmark!
    num_runs = 1000
    start = time.time()
    for i in range(0, num_runs):
        sess.run(['softmax_tensor:0'], test_data)
    elapsed = time.time() - start
    rt_ms = elapsed / num_runs * 1000.0

    # Show the result!
    print("Latency of optimized model: {:.2f}".format(rt_ms))

The system displays information similar to the following output:

Latency of optimized model: 2.26

In the preceding output, the inference latency of the optimized model is 2.26 ms, which is close to the 2.52 ms reported in the optimized field of the overall section of the optimization report. The shape of the test data falls within the range specified for dynamic shape optimization, so the optimization takes effect. The preceding figures are only for reference. The actual optimization effects of your model prevail.
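
To compare the original model and the optimized model under identical conditions, the warmup-then-measure pattern of the preceding script can be wrapped in a reusable function. The following is a minimal sketch: `benchmark_ms` is a hypothetical helper, and a trivial stand-in workload replaces the sess.run call so that the block runs without a GPU.

```python
import time


def benchmark_ms(fn, warmup=100, num_runs=1000):
    """Return the mean latency of `fn()` in milliseconds after `warmup` untimed calls."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(num_runs):
        fn()
    elapsed = time.perf_counter() - start
    return elapsed / num_runs * 1000.0


# Stand-in workload. In the preceding script, this would be
# lambda: sess.run(['softmax_tensor:0'], test_data)
calls = []
latency = benchmark_ms(lambda: calls.append(1), warmup=10, num_runs=50)
print("Mean latency: {:.4f} ms over 50 timed runs".format(latency))
```

The sketch uses time.perf_counter instead of time.time because it is a monotonic clock intended for measuring short intervals.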

Step 4: Load and run the optimized model

After the verification is complete, you can deploy the optimized model. PAI-Blade provides an SDK for Python and an SDK for C++ that you can integrate. For more information about how to use the SDK for C++, see Use an SDK to deploy a TensorFlow model for inference. The following section describes how to use the SDK for Python to deploy a model.

Optional: During the trial period, set the following environment variable to prevent the program from quitting unexpectedly due to an authentication failure:
```
export BLADE_AUTH_USE_COUNTING=1
```
Get authenticated to use PAI-Blade.
```
export BLADE_REGION=<region>
export BLADE_TOKEN=<token>
```
Configure the following parameters based on your business requirements:
- <region>: the region where you use PAI-Blade. You can join the DingTalk group of PAI-Blade users to obtain the regions where PAI-Blade can be used. For information about the QR code of the DingTalk group, see Obtain an access token.
- <token>: the authentication token that is required to use PAI-Blade. You can join the DingTalk group of PAI-Blade users to obtain the authentication token. For information about the QR code of the DingTalk group, see Obtain an access token.
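
Before you run inference, you can check that the authentication variables are set. The following is a minimal sketch: `missing_blade_env` is a hypothetical helper that only checks whether the two variables named above are set. It does not validate the region or the token.

```python
import os


def missing_blade_env(env=None):
    """Return the names of the required PAI-Blade variables that are unset or empty."""
    env = os.environ if env is None else env
    required = ["BLADE_REGION", "BLADE_TOKEN"]
    return [name for name in required if not env.get(name)]


missing = missing_blade_env()
if missing:
    print("Set these variables before running inference:", ", ".join(missing))
else:
    print("PAI-Blade authentication variables are set.")
```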

Load and run the optimized model.

Add import blade.runtime.tensorflow to the inference code. No other code is required to integrate PAI-Blade SDK, and you do not need to modify the original inference code. The following example uses the model that was optimized by using TensorRT for dynamic shapes.

import numpy as np
import tensorflow.compat.v1 as tf
# Required so that TensorFlow can run the model optimized by PAI-Blade.
import blade.runtime.tensorflow

infer_data = np.load('./tf_resnet50_v1.5/test_bc1.npy', allow_pickle=True, encoding='bytes').item()
# optimized model produced by blade.optimize
model_path = './optimized_model_dynamic.pb'

graph_def = tf.GraphDef()
with open(model_path, 'rb') as f:
    graph_def.ParseFromString(f.read())
    
with tf.Session() as sess:
    sess.graph.as_default()
    tf.import_graph_def(graph_def, name="")

    print(sess.run(['softmax_tensor:0'], infer_data))

Appendix: TensorRTConfig

PAI-Blade provides the TensorRTConfig class so that you can adjust TensorRT optimization to meet various deployment requirements.

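from typing import Any, Dict, List

class TensorRTConfig():
    def __init__(self) -> None:
        # Whether to enable TensorRT optimization.
        self.enable = True
        # Shapes for dynamic shape optimization. The keys are min, max, and opts.
        self.dynamic_tuning_shapes: Dict[str, List[List[Any]]] = dict()
        # ...... (other attributes are omitted in this topic)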

The TensorRTConfig class contains the following key parameters:

enable: specifies whether to enable TensorRT optimization. The value of this parameter is a Boolean value. By default, TensorRT optimization is enabled when optimization is performed on a GPU device.
dynamic_tuning_shapes: the dictionary of shapes for the optimization. Requests received by a server may contain inputs of different sizes. To perform optimization based on these inputs, PAI-Blade supports dynamic shape optimization. You need to set this parameter to define a dictionary of shapes for the optimization.
The dynamic_tuning_shapes parameter contains three keys: min, max, and opts. The min and max keys specify the minimum and maximum sizes that the inputs can have. The value type of these keys is List[List[int]]. The opts key specifies multiple sizes of inputs to optimize for. The value type of this key is List[List[List[int]]].
Note
The sizes specified by the opts key must fall within the range defined by the minimum and maximum sizes. Otherwise, the "Dim value in 'opts' is not between min_dim and max_dim" error is reported for TensorRT optimization.
The following sample code provides an example on how to set the dynamic_tuning_shapes parameter. In the example, every size specified by the opts key falls within the range defined by the minimum and maximum sizes.
```
{
    "min": [[1, 3, 224, 224], [1, 50]],     # Lower bound of the dynamic range for each input.
    "opts": [
        [[1, 3, 512, 512], [1, 60]],
        [[1, 3, 320, 320], [1, 55]],
    ],                                      # Shapes to be optimized in the same way as static shapes.
    "max": [[1, 3, 1024, 1024], [1, 70]],   # Upper bound of the dynamic range for each input.
}
```
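
The constraint in the preceding note can be checked before you call blade.optimize. The following is a minimal sketch that assumes the multi-input layout shown above, in which min and max hold one shape per input and opts holds a list of such per-input shape lists. `check_tuning_shapes` is a hypothetical helper and not part of PAI-Blade.

```python
def check_tuning_shapes(shapes):
    """Return a list of problems found in a dynamic_tuning_shapes-style dict."""
    problems = []
    for opt_index, opt in enumerate(shapes["opts"]):
        for input_index, (shape, lo, hi) in enumerate(
                zip(opt, shapes["min"], shapes["max"])):
            for axis, (dim, dim_lo, dim_hi) in enumerate(zip(shape, lo, hi)):
                if not dim_lo <= dim <= dim_hi:
                    problems.append(
                        "opts[{}] input {} axis {}: {} not in [{}, {}]".format(
                            opt_index, input_index, axis, dim, dim_lo, dim_hi))
    return problems


# The example from this appendix: every opts shape is inside the range.
example = {
    "min": [[1, 3, 224, 224], [1, 50]],
    "opts": [
        [[1, 3, 512, 512], [1, 60]],
        [[1, 3, 320, 320], [1, 55]],
    ],
    "max": [[1, 3, 1024, 1024], [1, 70]],
}
print(check_tuning_shapes(example))

# A deliberately invalid variant: 2048 exceeds the maximum of 1024 on two axes.
bad = dict(example, opts=[[[1, 3, 2048, 2048], [1, 60]]])
for problem in check_tuning_shapes(bad):
    print(problem)
```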