AICompiler is an AI compiler optimization component integrated into Machine Learning Platform for AI (PAI)-Blade. It provides a static shape compiler and a dynamic shape compiler. You can use AICompiler to improve model inference performance in a transparent, generally applicable manner without additional configuration. This topic describes how to use AICompiler to optimize TensorFlow and PyTorch models.
Background information
In recent years, AI model structures have evolved rapidly, the underlying computing hardware has become more diverse, and user habits have become more varied. Manually tuning the performance and efficiency of every model on every hardware platform is therefore increasingly difficult, which has made AI compiler optimization an area of active development.
Traditional compilers take code written in a high-level language as input and generate machine code so that you do not have to write it by hand. Deep learning compilers work in a similar way: they take flexible, higher-level computational graphs as input and produce underlying machine code and execution engines for hardware platforms such as CPUs or GPUs. AICompiler is developed as a general-purpose compiler that improves the performance of AI computing tasks. When you use AICompiler, you need to focus only on upper-level model development, which reduces manual optimization effort and makes full use of hardware performance.
Over the past two years, the PAI team has invested heavily in AI compiler optimization. As one of the resulting optimization components, AICompiler has been integrated into PAI-Blade to help you optimize and deploy models for inference in a transparent, generally applicable manner. The following table describes the two compilers that AICompiler provides.
| Compiler | TensorFlow | PyTorch | Scenario |
| --- | --- | --- | --- |
| Static shape compiler | Supported | Not supported | Suitable for tasks in which the shapes of the computational graph are static or change only slightly; delivers optimal performance in such cases. |
| Dynamic shape compiler | Supported | Supported | Suitable for all types of tasks. |
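Both compilers are selected through the `config` parameter of the `blade.optimize` method. The following minimal sketch only restates the two configurations that appear in the examples later in this topic; no other parameters are assumed.

```python
import blade

# Dynamic shape compiler: the default behavior of an empty config,
# as used in the dynamic shape examples below.
dynamic_shape_config = blade.Config()

# Static shape compiler: enabled through the flag used in the
# static shape TensorFlow example below.
static_shape_config = blade.Config(enable_static_shape_compilation_opt=True)
```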
Use the dynamic shape compiler to optimize a TensorFlow model
- Download the sample model and test data.

  ```bash
  # Download the sample model and test data.
  wget https://pai-blade.cn-hangzhou.oss.aliyun-inc.com/test_public_model/bbs/tf_aicompiler_demo/frozen.pb
  wget https://pai-blade.cn-hangzhou.oss.aliyun-inc.com/test_public_model/bbs/tf_aicompiler_demo/test_bc4.npy
  ```
- Load the model and test data. Then, call the `blade.optimize` method. You do not need to configure additional settings.

  ```python
  import numpy as np
  import tensorflow as tf
  import blade

  # Load the model and test data.
  graph_def = tf.GraphDef()
  with open('./frozen.pb', 'rb') as f:
      graph_def.ParseFromString(f.read())
  test_data = np.load('test_bc4.npy', allow_pickle=True, encoding='bytes').item()

  # Optimize the model.
  optimized_model, opt_spec, report = blade.optimize(
      graph_def,                 # The original model, here a TensorFlow GraphDef.
      'o1',                      # Optimization level: o1 or o2.
      device_type='gpu',         # Target device on which the optimized model runs.
      config=blade.Config(),
      inputs=['encoder_memory_placeholder', 'encoder_memory_length_placeholder'],
      outputs=['score', 'seq_id'],
      test_data=[test_data]
      # verbose=True
  )

  # Save the optimization results.
  tf.train.write_graph(optimized_model, "./", "optimized.pb", as_text=False)
  print("Report: {}".format(report))
  ```
- After the model is optimized, the `blade.optimize` method returns an optimization report from which you can view the optimization effects achieved by AICompiler. In the following example report, performance on a Tesla T4 GPU is improved by a factor of 2.23 in a transparent, generally applicable manner. For more information about the fields in the report, see Optimization report.

  ```json
  {
    "name": "TfAicompilerGpu",
    "status": "effective",
    "speedup": "2.23",
    "pre_run": "120.54 ms",
    "post_run": "53.99 ms"
  }
  ```
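The optimized model is saved as a regular frozen GraphDef, so it can be deployed in the same way as the original model. The following is a minimal sketch, not part of the original example, of how the saved `optimized.pb` might be loaded and run. It assumes a TensorFlow 1.x-style session and that the keys in `test_bc4.npy` correspond to the input placeholders listed above; adjust the name mapping if your data is organized differently.

```python
import numpy as np
import tensorflow as tf
import blade  # Assumption: importing blade registers any custom ops the optimized graph may use.

# Load the optimized graph written by tf.train.write_graph above.
graph_def = tf.GraphDef()
with open('./optimized.pb', 'rb') as f:
    graph_def.ParseFromString(f.read())

# Reuse the same test data that was passed to blade.optimize.
test_data = np.load('test_bc4.npy', allow_pickle=True, encoding='bytes').item()

with tf.Graph().as_default() as graph:
    tf.import_graph_def(graph_def, name='')
    with tf.Session(graph=graph) as sess:
        feeds = {}
        for name, value in test_data.items():
            # Assumption: keys are placeholder or tensor names; normalize them to 'name:0'.
            name = name.decode() if isinstance(name, bytes) else name
            if not name.endswith(':0'):
                name += ':0'
            feeds[graph.get_tensor_by_name(name)] = value
        score, seq_id = sess.run(['score:0', 'seq_id:0'], feed_dict=feeds)
        print(score, seq_id)
```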
Use the static shape compiler to optimize a TensorFlow model
The model and test data are loaded in the same way as in the preceding section. To specify the static shape compiler for your models, set `enable_static_shape_compilation_opt` to `True` in the `config` parameter when you call the `blade.optimize` method. The following sample code provides an example:

```python
optimized_model, opt_spec, report = blade.optimize(
    graph_def,               # The original model, here a TensorFlow GraphDef.
    'o1',                    # Optimization level: o1 or o2.
    device_type='gpu',       # Target device on which the optimized model runs.
    # Provide an additional config to enable static shape compilation:
    config=blade.Config(enable_static_shape_compilation_opt=True),
    inputs=['encoder_memory_placeholder', 'encoder_memory_length_placeholder'],
    outputs=['score', 'seq_id'],
    test_data=[test_data]
    # verbose=True
)
```
For more information about the advanced configurations of the config parameter, see Table 1.
After the model is optimized, the `blade.optimize` method returns an optimization report from which you can view the optimization effects achieved by AICompiler. In this example, performance on a Tesla T4 GPU is improved by a factor of 2.35. For more information about the fields in the report, see Optimization report.

```json
{
  "name": "TfAicompilerGpu",
  "status": "effective",
  "speedup": "2.35",
  "pre_run": "114.91 ms",
  "post_run": "48.86 ms"
}
```
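The `pre_run` and `post_run` values in the report are the measured latencies before and after optimization, and `speedup` is their ratio. If you want to verify them yourself, the following minimal sketch, under the assumption of a TensorFlow 1.x-style session and feed keys in the form `'<placeholder name>:0'`, times the original and optimized GraphDefs on the same data.

```python
import time
import tensorflow as tf

def benchmark(graph_def, feeds_by_name, fetch_names, warmup=10, iters=100):
    """Return the average latency in milliseconds over `iters` runs."""
    with tf.Graph().as_default() as graph:
        tf.import_graph_def(graph_def, name='')
        with tf.Session(graph=graph) as sess:
            feeds = {graph.get_tensor_by_name(k): v for k, v in feeds_by_name.items()}
            fetches = [name + ':0' for name in fetch_names]
            for _ in range(warmup):       # Warm up kernels and memory caches.
                sess.run(fetches, feed_dict=feeds)
            start = time.time()
            for _ in range(iters):
                sess.run(fetches, feed_dict=feeds)
            return (time.time() - start) / iters * 1000.0

# Example usage with the graphs from the sections above (feeds_by_name is assumed
# to map placeholder tensor names to the arrays in test_bc4.npy):
# orig_ms = benchmark(graph_def, feeds_by_name, ['score', 'seq_id'])
# opt_ms = benchmark(optimized_model, feeds_by_name, ['score', 'seq_id'])
# print("speedup: {:.2f}".format(orig_ms / opt_ms))
```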
Use the dynamic shape compiler to optimize a PyTorch model
- Download the model.

  ```bash
  # PyTorch 1.6.0
  # Python 3.6
  wget https://pai-blade.cn-hangzhou.oss.aliyun-inc.com/test_public_model/bbs/pt_aicompiler_demo/orig_decoder_v2.pt
  ```
- Load the model and test data. Then, call the `blade.optimize` method. You do not need to configure additional settings.

  ```python
  import os
  import time
  import torch
  # To use Blade, just import it.
  import blade

  # Load the model.
  pt_file = 'orig_decoder_v2.pt'
  batch = 8
  model = torch.jit.load(pt_file)

  # Prepare the test data.
  def get_test_data(batch_size=1):
      decoder_input_t = torch.LongTensor([1] * batch_size).cuda()
      decoder_hidden_t = torch.rand(batch_size, 1, 256).cuda()
      decoder_hidden_t = decoder_hidden_t * 1.0
      decoder_hidden_t = torch.tanh(decoder_hidden_t)
      output_highfeature_t = torch.rand(batch_size, 448, 4, 50).cuda()
      attention_sum_t = torch.rand(batch_size, 1, 4, 50).cuda()
      decoder_attention_t = torch.rand(batch_size, 1, 4, 50).cuda()
      et_mask = torch.rand(batch_size, 4, 50).cuda()
      return (decoder_input_t, decoder_hidden_t, output_highfeature_t,
              attention_sum_t, decoder_attention_t, et_mask)

  dummy = get_test_data(batch)

  # Optimize the model.
  optimized_model, opt_spec, report = blade.optimize(
      model,              # The original model, here a TorchScript model.
      'o1',               # Optimization level: o1 or o2.
      device_type='gpu',  # Target device on which the optimized model runs.
      test_data=[dummy],  # For PyTorch, the test data is a list of tuples of tensors.
      config=blade.Config()
  )
  print("spec: {}".format(opt_spec))
  print("report: {}".format(report))

  # Save the optimization results.
  torch.jit.save(optimized_model, 'optimized_decoder.pt')
  ```
- After the model is optimized, the `blade.optimize` method returns an optimization report from which you can view the optimization effects achieved by AICompiler. In the following example report, performance on a Tesla T4 GPU is improved by a factor of 2.45 in a transparent, generally applicable manner. For more information about the fields in the report, see Optimization report.

  ```json
  "optimizations": [
    {
      "name": "PyTorchMlir",
      "status": "effective",
      "speedup": "2.45",
      "pre_run": "1.99 ms",
      "post_run": "0.81 ms"
    }
  ],
  ```
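The optimized model is saved as a regular TorchScript file, so it can be loaded with `torch.jit.load` like any other scripted model. The following is a minimal sketch, not part of the original example; it assumes that the `dummy` tuple from the previous code block is still available and that importing `blade` registers any custom operators the optimized model may require.

```python
import torch
import blade  # Assumption: importing blade registers custom ops used by the optimized model.

# Load the optimized TorchScript model saved by torch.jit.save above.
optimized = torch.jit.load('optimized_decoder.pt').cuda().eval()

with torch.no_grad():
    # `dummy` is the tuple of CUDA tensors produced by get_test_data(batch) above.
    outputs = optimized(*dummy)
print(outputs)
```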