Platform For AI: Use PAI-Blade to optimize a TensorFlow BERT model

Last Updated: Dec 22, 2023

Bidirectional Encoder Representations from Transformers (BERT) is a pre-trained language representation model. BERT models achieve state-of-the-art results in a wide range of natural language processing (NLP) tasks and are one of the major breakthroughs in NLP in recent years. However, BERT models contain a large number of parameters and require a large amount of computation. Therefore, optimizing BERT models is essential in actual production scenarios. This topic describes how to use Machine Learning Platform for AI (PAI)-Blade to optimize a BERT model that is trained by TensorFlow.

Limits

The environment used for the procedure in this topic must meet the following version requirements:

  • System environment: Python 3.6 or later and Compute Unified Device Architecture (CUDA) 10.0 in Linux

  • Framework: TensorFlow 1.15

  • Inference optimization tool: PAI-Blade V3.16.0 or later

Procedure

To optimize a BERT model by using PAI-Blade, perform the following steps:

  1. Step 1: Make preparations

    Download the model to be optimized, and prepare test data by using the tokenizers library.

  2. Step 2: Use PAI-Blade to optimize the model

    Call the blade.optimize method to optimize the model, and save the optimized model.

  3. Step 3: Verify the performance and accuracy of the model

    Test the inference speeds and inference results of the original and optimized models to verify the information in the optimization report that is generated.

  4. Step 4: Load and run the optimized model

    Integrate the PAI-Blade SDK to load the optimized model for inference.

Step 1: Make preparations

  1. Run the following command to install the tokenizers library:

    pip3 install tokenizers
  2. Run the following commands to download the model and decompress the model to the specified directory:

    wget http://pai-blade.oss-cn-zhangjiakou.aliyuncs.com/tutorials/bert_example/nlu_general_news_classification_base.tar.gz
    mkdir nlu_general_news_classification_base
    tar zxvf nlu_general_news_classification_base.tar.gz -C nlu_general_news_classification_base
  3. Run the saved_model_cli command provided by TensorFlow to view the basic information of the model:

    saved_model_cli show --dir nlu_general_news_classification_base --all

    The following output is returned:

    MetaGraphDef with tag-set: 'serve' contains the following SignatureDefs:
    
    signature_def['serving_default']:
      The given SavedModel SignatureDef contains the following input(s):
        inputs['input_ids'] tensor_info:
            dtype: DT_INT32
            shape: (-1, -1)
            name: input_ids:0
        inputs['input_mask'] tensor_info:
            dtype: DT_INT32
            shape: (-1, -1)
            name: input_mask:0
        inputs['segment_ids'] tensor_info:
            dtype: DT_INT32
            shape: (-1, -1)
            name: segment_ids:0
      The given SavedModel SignatureDef contains the following output(s):
        outputs['logits'] tensor_info:
            dtype: DT_FLOAT
            shape: (-1, 28)
            name: app/ez_dense/BiasAdd:0
        outputs['predictions'] tensor_info:
            dtype: DT_INT32
            shape: (-1)
            name: ArgMax:0
        outputs['probabilities'] tensor_info:
            dtype: DT_FLOAT
            shape: (-1, 28)
            name: Softmax:0
      Method name is: tensorflow/serving/predict

    In the preceding output, the news classification model has three input tensors, namely input_ids:0, input_mask:0, and segment_ids:0, and three output tensors, namely logits, predictions, and probabilities. The predictions output, whose tensor name is ArgMax:0, indicates the category to which a piece of news belongs.
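
    If you prefer to inspect the signature in Python instead of from the command line, the following minimal sketch reads the same information from the MetaGraphDef. It assumes the model directory that you created in the previous step:

    import tensorflow.compat.v1 as tf

    # Load the MetaGraphDef for the 'serve' tag and print the serving signature.
    with tf.Session(graph=tf.Graph()) as sess:
        meta_graph = tf.saved_model.loader.load(sess, ['serve'], './nlu_general_news_classification_base')
        signature = meta_graph.signature_def['serving_default']
        print({name: info.name for name, info in signature.inputs.items()})
        print({name: info.name for name, info in signature.outputs.items()})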

  4. Call the tokenizers library to prepare test data.

    from tokenizers import BertWordPieceTokenizer
    
    # Initialize a tokenizer by using the vocab.txt file in the model directory. 
    tokenizer = BertWordPieceTokenizer('./nlu_general_news_classification_base/vocab.txt')
    
    # Group four pieces of news into a batch for encoding. 
    news = [
        'Declaration of a national public health emergency in Mexico due to over 1,000 confirmed cases. Chinanews.com reported on March 31 that the number of confirmed COVID-19 cases in Mexico had exceeded 1,000. On March 30, the Mexican government declared a national public health emergency. Solutions were launched to curb the spread of COVID-19. ',
        'Data released by the National Bureau of Statistics showed that the Manufacturing Purchasing Managers\' Index (PMI) for China was 50.1% in August, 0.3 percentage points lower than that in July but still higher than the critical point. ',
        'Reported on August 31, China standard time, in the final round of group stage matches of men\'s blind football at Tokyo 2020 Paralympic Games, China beat hosts Japan 2-0 and advanced to the semi-final match. China\'s football player Zhu Ruiming scored a brace in this match. ',
        'By August 30, the Zhurong rover has been traveling on the surface of Mars for 100 days. In the past 100 days, it has traveled a total of 1,064 meters in the south from its landing spot. It carries six scientific instruments and has obtained around 10 GB of raw scientific data. ',
    ]
    tokenized = tokenizer.encode_batch(news)
    
    # Pad each sequence to a length of 128 tokens. 
    def pad(seq, seq_len, padding_val):
        return seq + [padding_val] * (seq_len - len(seq))
    
    input_ids = [pad(tok.ids, 128, 0) for tok in tokenized]
    segment_ids = [pad(tok.type_ids, 128, 0) for tok in tokenized]
    input_mask = [pad([1] * len(tok.ids), 128, 0) for tok in tokenized]
    
    # The final test data is in the format of TensorFlow feed_dict arguments. 
    test_data = {
        "input_ids:0": input_ids,
        "segment_ids:0": segment_ids,
        "input_mask:0": input_mask,
    }
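
    Optionally, you can run a quick sanity check on the prepared batch before you continue. The following sketch only prints the shape of each input, which should be (4, 128) for the batch that is used in this topic:

    import numpy as np

    # Each input is expected to be a 4 x 128 matrix: four news items, 128 tokens each.
    for name, value in test_data.items():
        print(name, np.array(value).shape)
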
  5. Load the model and perform inference on the test data.

    import tensorflow.compat.v1 as tf
    import json
    
    # Load the label mapping file and obtain the category names that correspond to the output integers. 
    with open('./nlu_general_news_classification_base/label_mapping.json') as f:
        MAPPING = {v: k for k, v in json.load(f).items()}
    
    # Load and run the model. 
    cfg = tf.ConfigProto()
    cfg.gpu_options.allow_growth = True
    with tf.Session(config=cfg) as sess:
        tf.saved_model.loader.load(sess, ['serve'], './nlu_general_news_classification_base')
        result = sess.run('ArgMax:0', test_data)
        print([MAPPING[r] for r in result])

    The following inference result is returned as expected:

    ['International', 'Finance', 'Sports', 'Science']

Step 2: Use PAI-Blade to optimize the model

  1. Call the blade.optimize method to optimize the model. The following sample code provides an example. For more information, see Python method.

    import blade
    
    saved_model_dir = 'nlu_general_news_classification_base'
    optimized_model, _, report = blade.optimize(
        saved_model_dir,       # The path of the model file. 
        'o1',                  # Lossless optimization. 
        device_type='gpu',     # Optimization for GPU devices. 
        test_data=[test_data]  # The test data. 
    )

    Take note of the following items when you optimize the model:

    • The first return value of the blade.optimize method indicates the optimized model. The data type remains the same as that of the original model. In this example, the path of the SavedModel file is specified in the input, and the path of the optimized SavedModel file is returned.

    • You do not need to set the inputs and outputs parameters. PAI-Blade automatically infers the input and output nodes.

  2. Display the optimization report after the optimization is complete.

    print("Report: {}".format(report))

    The following output shows an example of the optimization report:

    Report: {
      "software_context": [
        {
          "software": "tensorflow",
          "version": "1.15.0"
        },
        {
          "software": "cuda",
          "version": "10.0.0"
        }
      ],
      "hardware_context": {
        "device_type": "gpu",
        "microarchitecture": "T4"
      },
      "user_config": "",
      "diagnosis": {
        "model": "nlu_general_news_classification_base",
        "test_data_source": "user provided",
        "shape_variation": "dynamic",
        "message": "",
        "test_data_info": "input_ids:0 shape: (4, 128) data type: int64\nsegment_ids:0 shape: (4, 128) data type: int64\ninput_mask:0 shape: (4, 128) data type: int64"
      },
      "optimizations": [
        {
          "name": "TfStripUnusedNodes",
          "status": "effective",
          "speedup": "na",
          "pre_run": "na",
          "post_run": "na"
        },
        {
          "name": "TfStripDebugOps",
          "status": "effective",
          "speedup": "na",
          "pre_run": "na",
          "post_run": "na"
        },
        {
          "name": "TfAutoMixedPrecisionGpu",
          "status": "effective",
          "speedup": "1.46",
          "pre_run": "35.04 ms",
          "post_run": "24.02 ms"
        },
        {
          "name": "TfAicompilerGpu",
          "status": "effective",
          "speedup": "2.43",
          "pre_run": "23.99 ms",
          "post_run": "9.87 ms"
        }
      ],
      "overall": {
        "baseline": "35.01 ms",
        "optimized": "9.90 ms",
        "speedup": "3.54"
      },
      "model_info": {
        "input_format": "saved_model"
      },
      "compatibility_list": [
        {
          "device_type": "gpu",
          "microarchitecture": "T4"
        }
      ],
      "model_sdk": {}
    }

    The preceding optimization report shows that two optimization items, TfAutoMixedPrecisionGpu and TfAicompilerGpu, deliver measurable speedups. Overall, the model is accelerated by a factor of 3.54, and the inference time is reduced from about 35 ms to 9.9 ms. The optimization report is for reference only. The actual optimization effect of your model may vary. For more information about the fields in the optimization report, see Optimization report.
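
    If you want to read fields of the report programmatically, for example to log the overall speedup, you can parse the printed text. The following sketch assumes that str(report) yields the JSON text shown above:

    import json

    # Assumption: str(report) is valid JSON, as in the sample report above.
    report_dict = json.loads(str(report))
    print(report_dict['overall']['speedup'])  # For example, "3.54".
    print([o['name'] for o in report_dict['optimizations'] if o['status'] == 'effective'])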

  3. Display the path of the optimized model.

    print("Optimized model: {}".format(optimized_model))

    The system displays information similar to the following output:

    Optimized model: /root/nlu_general_news_classification_base_blade_opt_20210901141823/nlu_general_news_classification_base

    The output shows the new path of the optimized model.
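
    The optimized model is written to an automatically generated directory. If you want to keep it under a stable path for the deployment in Step 4, you can copy the directory. The following sketch uses the Python standard library, and the destination path is only an example:

    import shutil

    # Copy the optimized SavedModel directory to a fixed location (example path).
    deploy_dir = './nlu_general_news_classification_base_optimized'
    shutil.copytree(optimized_model, deploy_dir)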

Step 3: Verify the performance and accuracy of the model

After the optimization is complete, you can verify the information in the optimization report by running a Python script.

  1. Define the benchmark method. The method warms up the model with 10 runs and then runs the model 1,000 times to obtain the average inference time, which indicates the inference speed.

    import time
    
    def benchmark(model, test_data):
        tf.reset_default_graph()
        with tf.Session() as sess:
            sess.graph.as_default()
            tf.saved_model.loader.load(sess, ['serve'], model)
            # Warmup!
            for i in range(0, 10):
                result = sess.run('ArgMax:0', test_data)
            # Benchmark!
            num_runs = 1000
            start = time.time()
            for i in range(0, num_runs):
                result = sess.run('ArgMax:0', test_data)
            elapsed = time.time() - start
            rt_ms = elapsed / num_runs * 1000.0
            # Show the result!
            print("Latency of model: {:.2f} ms.".format(rt_ms))
            print("Predict result: {}".format([MAPPING[r] for r in result]))
  2. Call the benchmark method to verify the original model.

    benchmark('nlu_general_news_classification_base', test_data)

    The system displays information similar to the following output:

    Latency of model: 36.20 ms.
    Predict result: ['International', 'Finance', 'Sports', 'Science']

    The output shows that the inference time of the original model is 36.20 ms. This value is close to the "baseline" value of "35.01 ms" that is nested under the "overall" field in the optimization report. The prediction result ['International', 'Finance', 'Sports', 'Science'] is as expected. The inference time in the preceding output is for reference only. The actual inference time of your model may vary.

  3. Call the benchmark method to verify the optimized model.

    import os
    os.environ['TAO_COMPILATION_MODE_ASYNC'] = '0'
    
    benchmark(optimized_model, test_data)

    The preceding optimization report shows that AICompiler has an optimization effect on the model. However, AICompiler works asynchronously, and the original model may be used for inference during compilation. To ensure the accuracy of the verification, you must change the compilation mode to synchronous by setting the environment variable TAO_COMPILATION_MODE_ASYNC to 0 before you call the benchmark method.

    The system displays information similar to the following output:

    Latency of model: 9.87 ms.
    Predict result: ['International', 'Finance', 'Sports', 'Science']

    The output shows that the inference time of the optimized model is 9.87 ms. This value is close to the "optimized" value of "9.90 ms" that is nested under the "overall" field in the optimization report. The prediction result ['International', 'Finance', 'Sports', 'Science'] is as expected. The inference time in the preceding output is for reference only. The actual inference time of your model may vary.
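
    In addition to checking the predicted labels, you can compare the numeric outputs of the two models. Because the TfAutoMixedPrecisionGpu item runs part of the model in lower precision, small numeric differences are expected. The following is a minimal sketch that compares the Softmax:0 probabilities of the original and optimized models:

    import numpy as np
    import tensorflow.compat.v1 as tf

    def get_probabilities(model_dir, feed):
        # Run the 'Softmax:0' output so that the numeric outputs of the two models
        # can be compared, not only the predicted labels.
        tf.reset_default_graph()
        with tf.Session() as sess:
            tf.saved_model.loader.load(sess, ['serve'], model_dir)
            return np.array(sess.run('Softmax:0', feed))

    ref = get_probabilities('nlu_general_news_classification_base', test_data)
    opt = get_probabilities(optimized_model, test_data)
    # A small difference is expected because of the mixed-precision optimization.
    print("Max absolute difference: {:.6f}".format(np.abs(ref - opt).max()))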

Step 4: Load and run the optimized model

After the verification is complete, you can deploy the optimized model. PAI-Blade provides an SDK for Python and an SDK for C++ that you can integrate. For more information about how to use the SDK for C++, see Use an SDK to deploy a TensorFlow model for inference. The following section describes how to use the SDK for Python to deploy a model.

  1. Optional: During the trial period, set the following environment variable to prevent the program from exiting unexpectedly because of an authentication failure:

    export BLADE_AUTH_USE_COUNTING=1
  2. Get authenticated to use PAI-Blade.

    export BLADE_REGION=<region>
    export BLADE_TOKEN=<token>

    Configure the following parameters based on your business requirements:

    • <region>: the region where you use PAI-Blade. You can join the DingTalk group of PAI-Blade users to obtain the regions where PAI-Blade can be used. For information about the QR code of the DingTalk group, see Obtain an access token.

    • <token>: the authentication token that is required to use PAI-Blade. You can join the DingTalk group of PAI-Blade users to obtain the authentication token. For information about the QR code of the DingTalk group, see Obtain an access token.

  3. Load and run the optimized model.

    Add import blade.runtime.tensorflow to the inference code. Apart from this import, you do not need to write extra code to integrate the PAI-Blade SDK or modify the original inference code.

    import tensorflow.compat.v1 as tf
    import blade.runtime.tensorflow
    # Replace <your_optimized_model_path> with the path of the optimized model. 
    savedmodel_dir = <your_optimized_model_path>
    # Replace <your_infer_data> with the data on which you want to perform inference. 
    infer_data = <your_infer_data>
    
    with tf.Session() as sess:
        sess.graph.as_default()
        tf.saved_model.loader.load(sess, ['serve'], savedmodel_dir)
        result = sess.run('ArgMax:0', infer_data)
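
    For example, if you reuse the test data and the label mapping file from Step 1, the integration code might look like the following sketch. The optimized_model variable is the path that blade.optimize returns in Step 2:

    import json
    import tensorflow.compat.v1 as tf
    import blade.runtime.tensorflow  # Importing this module is all that is required to integrate the SDK.

    # Paths and data from the earlier steps of this topic.
    savedmodel_dir = optimized_model
    with open('./nlu_general_news_classification_base/label_mapping.json') as f:
        MAPPING = {v: k for k, v in json.load(f).items()}

    with tf.Session() as sess:
        tf.saved_model.loader.load(sess, ['serve'], savedmodel_dir)
        result = sess.run('ArgMax:0', test_data)  # test_data is prepared as in Step 1.
        print([MAPPING[r] for r in result])       # Expected: ['International', 'Finance', 'Sports', 'Science']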