AICompiler is an AI compiler optimization component that is integrated with Machine Learning Platform for AI (PAI)-Blade. AICompiler provides static shape and dynamic shape compilers. You can use AICompiler to improve the performance of model inference in a transparent and universally applied manner without the need to configure additional settings. This topic describes how to use AICompiler to optimize TensorFlow and PyTorch models.

Background information

In recent years, the structures of AI models have kept evolving, the underlying computing hardware has become more varied, and the ways in which users work with models have become more diverse. As a result, it is increasingly difficult to improve the performance and efficiency of AI models by hand, and AI compiler optimization has attracted wide attention.

Traditional compilers take code written in a high-level language as input and generate machine code, so that you do not need to write it by hand. Deep learning compilers work in a similar way: they take flexible computational graphs at a higher level of abstraction as input, and they output the underlying machine code and execution engines for hardware platforms such as CPUs or GPUs. AICompiler is developed as a general-purpose compiler that improves the performance of AI computing tasks. When you use AICompiler, you need to focus only on upper-level model development. This helps you reduce manual optimization efforts and make full use of the hardware.

Over the past two years, the PAI team has invested a lot of time and resources in developing AI compiler optimization. As one of the optimization components, AICompiler has been integrated with PAI-Blade to help you optimize and deploy models for inference in a transparent and universally applied manner.

AICompiler provides static shape and dynamic shape compilers. The dynamic shape compiler is suitable for all types of tasks, including model inference tasks in which the shapes of computational graphs change dramatically. The static shape compiler is suitable for tasks in which the shapes of computational graphs are static or change only slightly, and helps achieve optimal performance in these cases. The following table compares the two compilers.
Compiler | TensorFlow | PyTorch | Scenario
Static shape compiler | Supported | Not supported | Suitable for tasks in which the shapes of computational graphs are static or change only slightly; helps achieve optimal performance.
Dynamic shape compiler | Supported | Supported | Suitable for all types of tasks.
By default, PAI-Blade automatically determines whether the dynamic shape compiler is suitable for your models, and you do not need to provide additional information as input. If the shapes of computational graphs in your tasks are static or change only slightly, PAI-Blade uses the static shape compiler for your models to ensure better performance. The following sections provide examples of how to use the static shape and dynamic shape compilers to optimize different models.

Use the dynamic shape compiler to optimize a TensorFlow model

In this example, an open source automatic speech recognition (ASR) model is used for demonstration.
  1. Download the model and test data.
    # Download the sample model and test data. 
    wget https://pai-blade.cn-hangzhou.oss.aliyun-inc.com/test_public_model/bbs/tf_aicompiler_demo/frozen.pb
    wget https://pai-blade.cn-hangzhou.oss.aliyun-inc.com/test_public_model/bbs/tf_aicompiler_demo/test_bc4.npy
  2. Load the model and test data. Then, call the blade.optimize method. You do not need to configure additional settings.
    import numpy as np
    import tensorflow as tf
    import blade
    
    # Load the model and test data. 
    graph_def = tf.GraphDef()
    with open('./frozen.pb', 'rb') as f:
        graph_def.ParseFromString(f.read())
    test_data = np.load('test_bc4.npy', allow_pickle=True, encoding='bytes',).item()
    
    # Optimize the model. 
    optimized_model, opt_spec, report = blade.optimize(
        graph_def,  # The original model; a TensorFlow GraphDef in this example.
        'o1',  # Optimization level, 'o1' or 'o2'.
        device_type='gpu',  # Target device to run the optimized model on.
        config=blade.Config(),
        inputs=['encoder_memory_placeholder', 'encoder_memory_length_placeholder'],
        outputs=['score', 'seq_id'],
        test_data=[test_data],
        # verbose=True
    )
    
    # Save the optimization results. 
    tf.train.write_graph(optimized_model, "./", "optimized.pb", as_text=False)
    print("Report: {}".format(report))
  3. After the model is optimized, the blade.optimize method returns an optimization report in which you can view the optimization effects achieved by AICompiler. In the following example of an optimization report, performance on a Tesla T4 GPU is improved by 2.23 times in a transparent and universally applied manner. For more information about the fields in the report, see Optimization report. A minimal sketch that loads the saved optimized.pb and runs it on the test data is provided after the report example.
        {
          "name": "TfAicompilerGpu",
          "status": "effective",
          "speedup": "2.23",
          "pre_run": "120.54 ms",
          "post_run": "53.99 ms"
        }
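To verify the effect in your own environment, you can load the saved optimized.pb like any other frozen graph and run it on the downloaded test data. The following is a minimal sketch rather than part of the official example: it assumes TensorFlow 1.x session APIs, that the keys of the test data dictionary are the placeholder names passed to blade.optimize, and that importing blade registers any custom operations the optimized graph may rely on.

import numpy as np
import tensorflow as tf
import blade  # Import blade first in case the optimized graph uses custom ops registered by the SDK.

# Load the optimized frozen graph saved in the previous step.
graph_def = tf.GraphDef()
with open('./optimized.pb', 'rb') as f:
    graph_def.ParseFromString(f.read())

# Reuse the test data that was used for optimization.
test_data = np.load('test_bc4.npy', allow_pickle=True, encoding='bytes').item()

with tf.Graph().as_default() as graph:
    tf.import_graph_def(graph_def, name='')
    with tf.Session(graph=graph) as sess:
        # Assumption: the dictionary maps placeholder names to numpy arrays.
        feed_dict = {}
        for name, value in test_data.items():
            if isinstance(name, bytes):
                name = name.decode()
            tensor_name = name if ':' in name else name + ':0'
            feed_dict[graph.get_tensor_by_name(tensor_name)] = value
        score, seq_id = sess.run(['score:0', 'seq_id:0'], feed_dict=feed_dict)
        print(score.shape, seq_id.shape)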

Use the static shape compiler to optimize a TensorFlow model

If the shapes of computational graphs in your tasks are static or change only slightly, you can set the config input parameter of the blade.optimize method to specify the static shape compiler for your models. The following sample code provides an example:
optimized_model, opt_spec, report = blade.optimize(
    graph_def,  # The original model; a TensorFlow GraphDef in this example.
    'o1',  # Optimization level, 'o1' or 'o2'.
    device_type='gpu',  # Target device to run the optimized model on.
    # Provide an additional config to enable static shape compilation:
    config=blade.Config(enable_static_shape_compilation_opt=True),
    inputs=['encoder_memory_placeholder', 'encoder_memory_length_placeholder'],
    outputs=['score', 'seq_id'],
    test_data=[test_data],
    # verbose=True
)
For more information about the advanced configurations of the input parameter config, see Table 1.
After the model is optimized, the blade.optimize method returns an optimization report in which you can view the optimization effects achieved by AICompiler. In this example, performance on a Tesla T4 GPU is improved by 2.35 times. For more information about the fields in the report, see Optimization report.
    {
      "name": "TfAicompilerGpu",
      "status": "effective",
      "speedup": "2.35",
      "pre_run": "114.91 ms",
      "post_run": "48.86 ms"
    }
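If you are not sure whether your workload is stable enough for static shape compilation, one option is to optimize the model with both configurations on the same test data and compare the returned reports. The following minimal sketch reuses graph_def, test_data, and the input and output names from the examples above; it is only an illustration of the idea, not a required workflow.

import blade

# Try both the default (dynamic shape) configuration and static shape
# compilation, then compare the two optimization reports.
candidates = {
    'dynamic': blade.Config(),
    'static': blade.Config(enable_static_shape_compilation_opt=True),
}

optimized = {}
for name, config in candidates.items():
    model, spec, report = blade.optimize(
        graph_def,
        'o1',
        device_type='gpu',
        config=config,
        inputs=['encoder_memory_placeholder', 'encoder_memory_length_placeholder'],
        outputs=['score', 'seq_id'],
        test_data=[test_data],
    )
    optimized[name] = model
    print("{} report: {}".format(name, report))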

Use the dynamic shape compiler to optimize a PyTorch model

In this example, an open source ASR model is used for demonstration.
  1. Download the model.
    # PyTorch 1.6.0
    # Python 3.6
    wget https://pai-blade.cn-hangzhou.oss.aliyun-inc.com/test_public_model/bbs/pt_aicompiler_demo/orig_decoder_v2.pt
  2. Load the model and test data. Then, call the blade.optimize method. You do not need to configure additional settings.
    import os
    import time
    import torch
    # To use blade, just import it.
    import blade
    
    # Load the model. 
    pt_file = 'orig_decoder_v2.pt'
    batch = 8
    model = torch.jit.load(pt_file)
    
    # Prepare the test data. 
    def get_test_data(batch_size=1):
        decoder_input_t = torch.LongTensor([1] * batch_size).cuda()
        decoder_hidden_t = torch.rand(batch_size, 1, 256).cuda()
        decoder_hidden_t = decoder_hidden_t * 1.0
        decoder_hidden_t = torch.tanh(decoder_hidden_t)
        output_highfeature_t = torch.rand(batch_size, 448, 4, 50).cuda()
        attention_sum_t = torch.rand(batch_size, 1, 4, 50).cuda()
        decoder_attention_t = torch.rand(batch_size, 1, 4, 50).cuda()
        et_mask = torch.rand(batch_size, 4, 50).cuda()
    
        return (decoder_input_t, decoder_hidden_t, output_highfeature_t, attention_sum_t, decoder_attention_t, et_mask)
    
    dummy = get_test_data(batch)
    
    # Optimize the model. 
    optimized_model, opt_spec, report = blade.optimize(
        model,  # The original model; a TorchScript model in this example.
        'o1',  # Optimization level, 'o1' or 'o2'.
        device_type='gpu',  # Target device to run the optimized model on.
        test_data=[dummy],  # For PyTorch, the test data is a list of input tuples.
        config=blade.Config()
    )
    
    print("spec: {}".format(opt_spec))
    print("report: {}".format(report))
    
    # Save the optimization results. 
    torch.jit.save(optimized_model, 'optimized_decoder.pt')
  3. After the model is optimized, the blade.optimize method returns an optimization report in which you can view the optimization effects achieved by AICompiler. In the following example of an optimization report, performance on a Tesla T4 GPU is improved by 2.45 times in a transparent and universally applied manner. For more information about the fields in the report, see Optimization report. A minimal sketch that reloads the saved optimized_decoder.pt and compares its latency with the original model follows the report example.
      "optimizations": [
        {
          "name": "PyTorchMlir",
          "status": "effective",
          "speedup": "2.45",
          "pre_run": "1.99 ms",
          "post_run": "0.81 ms"
        }
      ],
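After you save optimized_decoder.pt, you can reload it and compare its latency with the original TorchScript model on the same inputs. The following minimal sketch reuses the file names and the get_test_data helper from the example above; the warm-up and iteration counts are arbitrary and only meant for a rough comparison.

import time
import torch

def benchmark(model, inputs, warmup=10, iters=100):
    # Return the average latency in milliseconds over `iters` runs after a warm-up phase.
    with torch.no_grad():
        for _ in range(warmup):
            model(*inputs)
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(iters):
            model(*inputs)
        torch.cuda.synchronize()
    return (time.time() - start) / iters * 1000.0

dummy = get_test_data(batch_size=8)  # The helper defined in the example above.
original = torch.jit.load('orig_decoder_v2.pt')
optimized = torch.jit.load('optimized_decoder.pt')

print("original : {:.2f} ms".format(benchmark(original, dummy)))
print("optimized: {:.2f} ms".format(benchmark(optimized, dummy)))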