AICompiler is an AI compiler optimization component integrated into PAI-Blade. Call `blade.optimize()` to improve model inference performance on GPUs, with no manual hardware tuning required.
How it works
Optimizing deep learning models for inference requires detailed knowledge of hardware architecture, instruction sets, and memory access patterns. AICompiler handles this automatically through two compiler modes:
| Compiler | TensorFlow | PyTorch | When to use |
|---|---|---|---|
| Dynamic shape compiler | Supported | Supported | General-purpose. Use this when input shapes vary at runtime or when you are unsure which to choose. |
| Static shape compiler | Supported | Not supported | Use this when input shapes are fixed or change only slightly. Achieves better peak performance than the dynamic shape compiler. |
PAI-Blade automatically selects the appropriate compiler based on your model. If the shapes are static or change minimally, it applies the static shape compiler for better performance. No configuration is needed to enable this behavior.
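As a rough mental model, the choice comes down to whether the input shapes seen at runtime stay constant. The sketch below is illustrative only; the `choose_compiler` helper and its shape-stability heuristic are not part of the PAI-Blade API and do not reflect its internal selection logic:

```python
# Illustrative heuristic only; PAI-Blade's actual selection logic is internal.
def choose_compiler(observed_shapes):
    """Pick a compiler mode from a list of input-shape tuples seen at runtime."""
    if len(set(observed_shapes)) == 1:
        return 'static'   # Every call saw the same shape: static shapes pay off.
    return 'dynamic'      # Shapes vary: the dynamic shape compiler is the safe default.

print(choose_compiler([(8, 128), (8, 128), (8, 128)]))  # static
print(choose_compiler([(8, 100), (8, 173), (8, 42)]))   # dynamic
```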
Prerequisites
Before you begin, ensure that you have:
- A PAI-Blade environment with `blade` installed
- A TensorFlow or PyTorch model to optimize
- GPU compute resources (the examples use an NVIDIA Tesla T4)
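Before running the examples, it can help to confirm that the required packages are importable. This check is a convenience sketch, not part of PAI-Blade; the `installed` helper is hypothetical:

```python
import importlib.util

def installed(name):
    """Return True if the named package is importable in this environment."""
    return importlib.util.find_spec(name) is not None

# Packages the examples below depend on.
for pkg in ('blade', 'tensorflow', 'torch', 'numpy'):
    print('{}: {}'.format(pkg, 'ok' if installed(pkg) else 'missing'))
```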
Optimize a TensorFlow model
The examples below use an open source automatic speech recognition (ASR) model. Both the dynamic and static shape compiler examples share the same model and test data.
Step 1: Download the model and test data.
```shell
wget https://pai-blade.cn-hangzhou.oss.aliyun-inc.com/test_public_model/bbs/tf_aicompiler_demo/frozen.pb
wget https://pai-blade.cn-hangzhou.oss.aliyun-inc.com/test_public_model/bbs/tf_aicompiler_demo/test_bc4.npy
```
Step 2: Load the model and call `blade.optimize()`.
The core call is:
```python
optimized_model, opt_spec, report = blade.optimize(model, 'o1', device_type='gpu', ...)
```
The optimization level `'o1'` enables general-purpose optimizations. Use `'o2'` for more aggressive optimizations. The complete example:
```python
import numpy as np
import tensorflow as tf
import blade

# Load the model and test data.
graph_def = tf.GraphDef()
with open('./frozen.pb', 'rb') as f:
    graph_def.ParseFromString(f.read())
test_data = np.load('test_bc4.npy', allow_pickle=True, encoding='bytes').item()

# Optimize the model.
optimized_model, opt_spec, report = blade.optimize(
    graph_def,          # The original model, a TensorFlow GraphDef.
    'o1',               # Optimization level: o1 or o2.
    device_type='gpu',  # Target device for the optimized model.
    config=blade.Config(),
    inputs=['encoder_memory_placeholder', 'encoder_memory_length_placeholder'],
    outputs=['score', 'seq_id'],
    test_data=[test_data]
)

# Save the optimized model.
tf.train.write_graph(optimized_model, './', 'optimized.pb', as_text=False)
print('Report: {}'.format(report))
```
Step 3: Review the optimization report.
After optimization, blade.optimize() returns a report showing the performance gains. The following report shows a 2.23x speedup on an NVIDIA Tesla T4 GPU using the dynamic shape compiler:
```json
{
    "name": "TfAicompilerGpu",
    "status": "effective",
    "speedup": "2.23",
    "pre_run": "120.54 ms",
    "post_run": "53.99 ms"
}
```
For details on report fields, see Optimization report.
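The `speedup` field is simply the ratio of `pre_run` to `post_run` latency. As a quick sanity check, you can recompute it from the report yourself; the snippet below only relies on the field names shown above:

```python
import json

# The report returned by blade.optimize(), as shown above.
report_json = '''{
    "name": "TfAicompilerGpu",
    "status": "effective",
    "speedup": "2.23",
    "pre_run": "120.54 ms",
    "post_run": "53.99 ms"
}'''

report = json.loads(report_json)
pre = float(report['pre_run'].split()[0])    # latency before optimization, in ms
post = float(report['post_run'].split()[0])  # latency after optimization, in ms
print('recomputed speedup: {:.2f}x'.format(pre / post))  # 2.23x
```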
Use the static shape compiler
If your model's input shapes are fixed or change minimally, pass `enable_static_shape_compilation_opt=True` in `blade.Config()` to use the static shape compiler:
```python
optimized_model, opt_spec, report = blade.optimize(
    graph_def,
    'o1',
    device_type='gpu',
    config=blade.Config(enable_static_shape_compilation_opt=True),
    inputs=['encoder_memory_placeholder', 'encoder_memory_length_placeholder'],
    outputs=['score', 'seq_id'],
    test_data=[test_data]
)
```
For advanced config options, see Table 1.
The static shape compiler achieves better peak performance when shapes are consistent. The following report shows a 2.35x speedup on the same NVIDIA Tesla T4 GPU:
```json
{
    "name": "TfAicompilerGpu",
    "status": "effective",
    "speedup": "2.35",
    "pre_run": "114.91 ms",
    "post_run": "48.86 ms"
}
```
Optimize a PyTorch model
This example uses an open source ASR model and requires PyTorch 1.6.0 and Python 3.6.
Step 1: Download the model.
```shell
# PyTorch 1.6.0
# Python 3.6
wget https://pai-blade.cn-hangzhou.oss.aliyun-inc.com/test_public_model/bbs/pt_aicompiler_demo/orig_decoder_v2.pt
```
Step 2: Load the model and call `blade.optimize()`.
The core call structure is the same as for TensorFlow. For PyTorch, pass test data as a list of input tuples:
```python
optimized_model, opt_spec, report = blade.optimize(model, 'o1', device_type='gpu', test_data=[dummy], ...)
```
The complete example:
```python
import torch
import blade

# Load the TorchScript model.
pt_file = 'orig_decoder_v2.pt'
batch = 8
model = torch.jit.load(pt_file)

# Prepare test data.
def get_test_data(batch_size=1):
    decoder_input_t = torch.LongTensor([1] * batch_size).cuda()
    decoder_hidden_t = torch.rand(batch_size, 1, 256).cuda()
    decoder_hidden_t = decoder_hidden_t * 1.0
    decoder_hidden_t = torch.tanh(decoder_hidden_t)
    output_highfeature_t = torch.rand(batch_size, 448, 4, 50).cuda()
    attention_sum_t = torch.rand(batch_size, 1, 4, 50).cuda()
    decoder_attention_t = torch.rand(batch_size, 1, 4, 50).cuda()
    et_mask = torch.rand(batch_size, 4, 50).cuda()
    return (decoder_input_t, decoder_hidden_t, output_highfeature_t,
            attention_sum_t, decoder_attention_t, et_mask)

dummy = get_test_data(batch)

# Optimize the model.
optimized_model, opt_spec, report = blade.optimize(
    model,              # The original TorchScript model.
    'o1',               # Optimization level: o1 or o2.
    device_type='gpu',  # Target device for the optimized model.
    test_data=[dummy],  # For PyTorch, test data is a list of input tuples.
    config=blade.Config()
)
print('spec: {}'.format(opt_spec))
print('report: {}'.format(report))

# Save the optimized model.
torch.jit.save(optimized_model, 'optimized_decoder.pt')
```
Step 3: Review the optimization report.
The following report shows a 2.45x speedup on an NVIDIA Tesla T4 GPU:
```json
{
    "optimizations": [
        {
            "name": "PyTorchMlir",
            "status": "effective",
            "speedup": "2.45",
            "pre_run": "1.99 ms",
            "post_run": "0.81 ms"
        }
    ]
}
```
For details on report fields, see Optimization report.
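Note that the PyTorch report nests its entries in an `optimizations` list rather than a single top-level object, as in the TensorFlow examples. A small loop can pull out the passes that took effect; this sketch uses only the field names shown above:

```python
import json

# The PyTorch report, as shown above: entries nested under "optimizations".
report_json = '''{
    "optimizations": [
        {
            "name": "PyTorchMlir",
            "status": "effective",
            "speedup": "2.45",
            "pre_run": "1.99 ms",
            "post_run": "0.81 ms"
        }
    ]
}'''

report = json.loads(report_json)
effective = [o for o in report['optimizations'] if o['status'] == 'effective']
for opt in effective:
    print('{}: {}x ({} -> {})'.format(
        opt['name'], opt['speedup'], opt['pre_run'], opt['post_run']))
```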