Manually install AIACC-Inference for Torch

Last Updated: Oct 27, 2023

AIACC-Inference can optimize models that are built based on the Torch framework to significantly improve inference performance. This topic describes how to manually install AIACC-Inference for Torch and provides examples on inference effects.

Prerequisites

An Alibaba Cloud GPU-accelerated instance is created.

  • Specification: The instance is configured with an A10, V100, or T4 GPU.

    Note

    For more information, see Overview of instance families.

  • Image: The OS of the image used by the instance is Ubuntu 16.04 LTS or CentOS 7.x.

Background information

AIACC-Inference for Torch cuts the computational graphs of models, fuses layers, and uses high-performance operators to significantly improve inference performance in PyTorch. You can optimize the inference performance of deep learning models in the PyTorch framework by using just-in-time (JIT) compilation, without specifying the precision or input sizes in advance.

AIACC-Inference for Torch allows you to call the aiacctorch.compile(model) operation to improve inference performance. You only need to first call the torch.jit.script or torch.jit.trace operation to convert your PyTorch model to a TorchScript model. For more information, see TorchScript. This topic provides examples of how to call the torch.jit.script and torch.jit.trace operations to improve inference performance.
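
The overall workflow is summarized in the following minimal sketch, which uses torchvision's ResNet-50 as a stand-in model. The complete, runnable examples are provided later in this topic.

import torch
import torchvision.models as models
import aiacctorch # Import the AIACC-Inference for Torch package.

# Convert the PyTorch model to a TorchScript model, then compile it with AIACC-Inference for Torch.
mod = models.resnet50(pretrained=True).eval().cuda()
mod_jit = torch.jit.script(mod)        # or: torch.jit.trace(mod, example_input)
mod_jit = aiacctorch.compile(mod_jit)  # accelerate the TorchScript model

# Run inference as usual.
in_t = torch.randn([1, 3, 224, 224]).float().cuda()
out = mod_jit(in_t)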

Install AIACC-Inference for Torch

AIACC-Inference for Torch provides Conda and WHL installation packages. The Conda package works out of the box, whereas the WHL package requires you to install the dependencies yourself. Use the package that suits your business requirements. A quick verification check is provided at the end of this section.

  • Conda package

    A large number of dependency packages are pre-installed in the Conda package. Before you install the Conda package, you only need to manually install a CUDA driver. Perform the following operations:

    Important

    We recommend that you do not change the pre-installed dependency packages in the Conda package. If you change them, version mismatches may cause errors when the demos run.

    1. Connect to the GPU-accelerated instance.

    2. Install a CUDA driver of v470.57.02 or later.

    3. Download the Conda package.

      wget https://aiacc-inference-public.oss-cn-beijing.aliyuncs.com/aiacc-inference-torch/aiacc-inference-torch-miniconda-latest.tar.bz2
    4. Decompress the Conda package.

      mkdir ./aiacc-inference-miniconda && tar -xvf ./aiacc-inference-torch-miniconda-latest.tar.bz2 -C ./aiacc-inference-miniconda
    5. Load the Conda package.

      source ./aiacc-inference-miniconda/bin/activate
  • WHL package

    Before you install the WHL package, you must manually install the required dependency packages. Perform the following operations:

    1. Connect to the GPU-accelerated instance.

    2. Use one of the following methods to install the dependency packages. The WHL package depends on a variety of software, so proceed with caution when you install the dependencies.

      • Method 1

        1. Install the following dependency packages:

          • CUDA 11.1

          • cuDNN 8.3.1.22

          • TensorRT 8.2.3.0

        2. Specify the dependency libraries of TensorRT and CUDA for the LD_LIBRARY_PATH environment variable.

          You can run the following commands to set the environment variable, or append them to your ~/.bashrc file to make the settings persistent. In this example, the CUDA libraries are stored in the /usr/local/cuda/ directory and the TensorRT libraries are stored in the /usr/local/TensorRT/ directory. Change the directories based on your business requirements.

          export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
          export LD_LIBRARY_PATH=/usr/local/TensorRT/lib:$LD_LIBRARY_PATH
        3. Apply the environment variable settings. If you appended the preceding commands to your ~/.bashrc file, reload it:

          source ~/.bashrc
      • Method 2

        Use the NVIDIA pip index to install the dependency packages.

        pip install nvidia-pyindex && \
        pip install nvidia-tensorrt==8.2.3.0
    3. Install PyTorch 1.9.0+cu111.

      pip3 install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html
    4. Download and install AIACC-Inference for Torch.

      pip install aiacctorch -f https://aiacc-inference-public.oss-cn-beijing.aliyuncs.com/aiacc-inference-torch/aiacctorch_stable.html -f https://download.pytorch.org/whl/torch_stable.html
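
After either installation method completes, you can optionally verify the setup from a Python interpreter. The version printed below assumes the WHL installation steps in this topic; the Conda package may bundle different versions.

import torch
import aiacctorch # An ImportError here indicates that the installation failed.

print(torch.__version__)          # 1.9.0+cu111 if you followed the WHL installation steps
print(torch.cuda.is_available())  # True on a GPU-accelerated instance with a working CUDA driver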

Run inference on the ResNet-50 model

In this example, the Conda package is installed, and the torch.jit.script operation is called to run inference on the ResNet-50 model 1,000 times. The average inference time is reduced from 3.68 ms to 0.396 ms.

  • Original inference

    The following content shows the original code:

    import time
    import torch
    import torchvision.models as models
    mod = models.resnet50(pretrained=True).eval()
    mod_jit = torch.jit.script(mod)
    mod_jit = mod_jit.cuda()
    
    in_t = torch.randn([1,3,224,224]).float().cuda()
    
    # Warming up
    for _ in range(10):
        mod_jit(in_t)
    
    inference_count = 1000
    # inference test
    start = time.time()
    for _ in range(inference_count):
        mod_jit(in_t)
    end = time.time()
    print(f"use {(end-start)/inference_count*1000} ms each inference")
    print(f"{inference_count/(end-start)} step/s")

    The output shows that the average inference time is about 3.68 ms.

  • Accelerated inference

    To improve inference performance, add the following commands to the original code:

    import aiacctorch
    mod_jit = aiacctorch.compile(mod_jit)

    The following content shows the updated code:

    import time
    import aiacctorch #Import the AIACC-Inference for Torch package.
    import torch
    import torchvision.models as models
    mod = models.resnet50(pretrained=True).eval()
    mod_jit = torch.jit.script(mod)
    mod_jit = mod_jit.cuda()
    mod_jit = aiacctorch.compile(mod_jit) #Compile the AIACC-Inference for Torch code.
    
    in_t = torch.randn([1,3,224,224]).float().cuda()
    
    # Warming up
    for _ in range(10):
        mod_jit(in_t)
    
    inference_count = 1000
    # inference test
    start = time.time()
    for _ in range(inference_count):
        mod_jit(in_t)
    end = time.time()
    print(f"use {(end-start)/inference_count*1000} ms each inference")
    print(f"{inference_count/(end-start)} step/s")

    The output shows that the average inference time is about 0.396 ms. Compared with the original inference time of about 3.68 ms, AIACC-Inference for Torch significantly improves inference performance.


Run inference on the Bert-Base model

In this example, the torch.jit.trace operation is called to run inference on the Bert-Base model. The average inference time is reduced from 4.95 ms to 0.419 ms.

  1. Install the transformers package.

    pip install transformers
  2. Run the demos of original inference and accelerated inference, and view the inference results.

    • Original inference

      The following content shows the original code:

      from transformers import BertModel, BertTokenizer, BertConfig
      import torch
      import time
      
      enc = BertTokenizer.from_pretrained("bert-base-uncased")
      
      # Tokenizing input text
      text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
      tokenized_text = enc.tokenize(text)
      
      # Masking one of the input tokens
      masked_index = 8
      tokenized_text[masked_index] = '[MASK]'
      indexed_tokens = enc.convert_tokens_to_ids(tokenized_text)
      segments_ids = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, ]
      
      # Creating a dummy input
      tokens_tensor = torch.tensor([indexed_tokens]).cuda()
      segments_tensors = torch.tensor([segments_ids]).cuda()
      dummy_input = [tokens_tensor, segments_tensors]
      
      # Initializing the model with the torchscript flag
      # Flag set to True even though it is not necessary as this model does not have an LM Head.
      config = BertConfig(vocab_size_or_config_json_file=32000, hidden_size=768,
                          num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072, torchscript=True)
      
      # Instantiating the model
      model = BertModel(config)
      
      # The model needs to be in evaluation mode
      model.eval()
      
      # If you are instantiating the model with `from_pretrained` you can also easily set the TorchScript flag
      model = BertModel.from_pretrained("bert-base-uncased", torchscript=True)
      
      model = model.eval().cuda()
      
      # Creating the trace
      traced_model = torch.jit.trace(model, dummy_input)
      
      # Warming up
      for _ in range(10):
          all_encoder_layers, pooled_output = traced_model(*dummy_input)
      
      inference_count = 1000
      # inference test
      start = time.time()
      for _ in range(inference_count):
          traced_model(*dummy_input)
      end = time.time()
      print(f"use {(end-start)/inference_count*1000} ms each inference")
      print(f"{inference_count/(end-start)} step/s")

      The output shows that the average inference time is about 4.95 ms.

    • Accelerated inference

      To improve inference performance, add the following commands to the original code:

      import aiacctorch
      traced_model = aiacctorch.compile(traced_model)

      The following content shows the updated code:

      from transformers import BertModel, BertTokenizer, BertConfig
      import torch
      import aiacctorch #Import the AIACC-Inference for Torch package.
      import time
      
      enc = BertTokenizer.from_pretrained("bert-base-uncased")
      
      # Tokenizing input text
      text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
      tokenized_text = enc.tokenize(text)
      
      # Masking one of the input tokens
      masked_index = 8
      tokenized_text[masked_index] = '[MASK]'
      indexed_tokens = enc.convert_tokens_to_ids(tokenized_text)
      segments_ids = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, ]
      
      # Creating a dummy input
      tokens_tensor = torch.tensor([indexed_tokens]).cuda()
      segments_tensors = torch.tensor([segments_ids]).cuda()
      dummy_input = [tokens_tensor, segments_tensors]
      
      # Initializing the model with the torchscript flag
      # Flag set to True even though it is not necessary as this model does not have an LM Head.
      config = BertConfig(vocab_size_or_config_json_file=32000, hidden_size=768,
          num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072, torchscript=True)
      
      # Instantiating the model
      model = BertModel(config)
      
      # The model needs to be in evaluation mode
      model.eval()
      
      # If you are instantiating the model with `from_pretrained` you can also easily set the TorchScript flag
      model = BertModel.from_pretrained("bert-base-uncased", torchscript=True)
      
      model = model.eval().cuda()
      
      # Creating the trace
      traced_model = torch.jit.trace(model, dummy_input)
      traced_model = aiacctorch.compile(traced_model) #Compile the AIACC-Inference for Torch code.
      
      # Warming up
      for _ in range(10):
          all_encoder_layers, pooled_output = traced_model(*dummy_input)
      
      inference_count = 1000
      # inference test
      start = time.time()
      for _ in range(inference_count):
          traced_model(*dummy_input)
      end = time.time()
      print(f"use {(end-start)/inference_count*1000} ms each inference")
      print(f"{inference_count/(end-start)} step/s")

      The output shows that the average inference time is about 0.419 ms. Compared with the original inference time of about 4.95 ms, AIACC-Inference for Torch significantly improves inference performance.


Run inference for dynamic input sizes on the ResNet-50 model

AIACC-Inference for Torch supports dynamic input sizes. In this example, three input tensors with different batch sizes and spatial dimensions are used to run inference on the ResNet-50 model.

import time
import aiacctorch #Import the AIACC-Inference for Torch package.
import torch
import torchvision.models as models
mod = models.resnet50(pretrained=True).eval()
mod_jit = torch.jit.script(mod)
mod_jit = mod_jit.cuda()
mod_jit = aiacctorch.compile(mod_jit) #Compile the AIACC-Inference for Torch code.

in_t = torch.randn([1,3,224,224]).float().cuda()
in_2t = torch.randn([1,3,448,448]).float().cuda()
in_3t = torch.randn([16,3,640,640]).float().cuda()

# Warming up
for _ in range(10):
    mod_jit(in_t)
    mod_jit(in_3t)

inference_count = 1000
# inference test
start = time.time()
for _ in range(inference_count):
    mod_jit(in_t)
    mod_jit(in_2t)
    mod_jit(in_3t)
end = time.time()
print(f"use {(end-start)/(inference_count*3)*1000} ms each inference")
print(f"{inference_count/(end-start)} step/s")

The output shows that the average inference time is about 9.84 ms.

Note

You must make sure that inference is run on both the minimum and maximum tensor input sizes during the warm-up stage. In this example, the input sizes range from 1 × 3 × 224 × 224 to 16 × 3 × 640 × 640, so both of these sizes are used during warm-up. This prevents recompilation errors and shortens the compilation time.

Comparison of performance data

The following table compares the inference performance of models optimized by using AIACC-Inference for Torch with the inference performance of the same models in native PyTorch. The test environment is configured as follows:

  • Instance specification: a GPU-accelerated instance configured with an NVIDIA A10 GPU.

  • CUDA version: 11.5.

  • CUDA driver version: 470.57.02.

Model              | Input size          | AIACC-Inference-Torch (ms) | PyTorch Half (ms) | PyTorch Float (ms) | Acceleration ratio
-------------------|---------------------|----------------------------|-------------------|--------------------|-------------------
resnet50           | 1x3x224x224         | 0.470                      | 3.438             | 2.919              | 6.22
mobilenet-v2-100   | 1x3x224x224         | 0.239                      | 2.805             | 2.007              | 8.69
SRGAN-X4           | 1x3x272x480         | 23.070                     | 35.864            | 132.003            | 5.74
YOLO-V3            | 1x3x640x640         | 3.869                      | 8.807             | 15.705             | 4.06
bert-base-uncased  | 1 × 128 and 1 × 128 | 0.942                      | 3.153             | 3.761              | 4.00
bert-large-uncased | 1 × 128 and 1 × 128 | 1.330                      | 6.118             | 7.111              | 5.34
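
This topic does not include the benchmark script that produced the preceding numbers. As a rough illustration only, the following sketch shows one way the PyTorch Half baseline for resnet50 could be measured: the unoptimized TorchScript model is run in FP16 and timed over 1,000 inferences, with an explicit torch.cuda.synchronize() call so that the measurement includes all queued GPU work. The methodology used for the official numbers may differ.

import time
import torch
import torchvision.models as models

# Hypothetical FP16 baseline ("PyTorch Half" column): unoptimized TorchScript model.
mod = models.resnet50(pretrained=True).eval().half().cuda()
mod_jit = torch.jit.script(mod)

in_t = torch.randn([1, 3, 224, 224]).half().cuda()

# Warming up
for _ in range(10):
    mod_jit(in_t)

torch.cuda.synchronize()
inference_count = 1000
start = time.time()
for _ in range(inference_count):
    mod_jit(in_t)
torch.cuda.synchronize()  # wait for all queued GPU work before stopping the timer
end = time.time()
print(f"use {(end-start)/inference_count*1000} ms each inference")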