AIACC-Inference can optimize models that are built with the PyTorch framework to significantly improve inference performance. This topic describes how to manually install AIACC-Inference for Torch and provides examples that show the resulting inference speedup.
Prerequisites
An Alibaba Cloud GPU-accelerated instance is created.
Specification: The instance is configured with an A10, V100, or T4 GPU.
Note: For more information, see Overview of instance families.
Image: The OS of the image used by the instance is Ubuntu 16.04 LTS or CentOS 7.x.
Background information
AIACC-Inference for Torch partitions the computational graph of a model, fuses layers, and uses high-performance operators to significantly improve inference performance in PyTorch. The optimization is applied through just-in-time (JIT) compilation, so you do not need to specify the precision or input sizes in advance.
To use AIACC-Inference for Torch, first convert your PyTorch model to a TorchScript model by calling torch.jit.script or torch.jit.trace, and then call aiacctorch.compile(model) on the converted model. For more information, see TorchScript. This topic provides examples that use both torch.jit.script and torch.jit.trace.
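The workflow described above (convert to TorchScript, then compile) can be wrapped in a small helper that falls back to the unoptimized model when the package is not installed. This is a minimal sketch: accelerate_if_available is a hypothetical helper name, and only the aiacctorch.compile(model) call is taken from this topic.

```python
import importlib.util


def accelerate_if_available(scripted_model):
    """Return the AIACC-compiled model, or the input unchanged if aiacctorch is missing.

    `scripted_model` is expected to be the result of torch.jit.script or
    torch.jit.trace. The helper name is hypothetical, not part of aiacctorch.
    """
    if importlib.util.find_spec("aiacctorch") is None:
        # Package not installed: keep running the plain TorchScript model.
        return scripted_model
    import aiacctorch  # imported lazily so the helper also loads without it

    return aiacctorch.compile(scripted_model)
```

With a helper like this, the same script can run on machines with and without AIACC-Inference for Torch, which may be convenient when you compare the two configurations.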
Install AIACC-Inference for Torch
AIACC-Inference for Torch provides Conda and WHL installation packages. The Conda package works out of the box. Choose either package based on your business requirements.
Conda package
Most dependency packages are pre-installed in the Conda package. Before you install the Conda package, you only need to manually install a CUDA driver. Perform the following operations:
Important: We recommend that you do not change the pre-installed dependency packages in the Conda package. If you change them, the demo may report errors caused by inconsistent versions.
Install a CUDA driver of v470.57.02 or later.
Download the Conda package.
wget https://aiacc-inference-public.oss-cn-beijing.aliyuncs.com/aiacc-inference-torch/aiacc-inference-torch-miniconda-latest.tar.bz2
Decompress the Conda package.
mkdir ./aiacc-inference-miniconda && tar -xvf ./aiacc-inference-torch-miniconda-latest.tar.bz2 -C ./aiacc-inference-miniconda
Load the Conda package.
source ./aiacc-inference-miniconda/bin/activate
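Before you run the demos, it can help to confirm that the installed driver meets the 470.57.02 minimum stated above. The following is a minimal sketch rather than an official check: the function names are hypothetical, and the query assumes that nvidia-smi is on the PATH.

```python
import shutil
import subprocess

MIN_DRIVER = "470.57.02"  # minimum driver version stated in this topic


def parse_version(text):
    """Turn '470.57.02' into (470, 57, 2) so that versions compare numerically."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_meets_minimum(installed, minimum=MIN_DRIVER):
    """Return True if the installed driver version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


def query_driver_version():
    """Ask nvidia-smi for the driver version; return None if it is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    lines = result.stdout.strip().splitlines()
    # With several GPUs, nvidia-smi prints one line per GPU; use the first.
    return lines[0].strip() if result.returncode == 0 and lines else None
```

A typical use would be to call query_driver_version() and, if it returns a value, pass that value to driver_meets_minimum().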
WHL package
Before you install the WHL package, you must manually install the required dependency packages. Perform the following operations:
Use one of the following methods to install the dependency packages. The WHL package depends on a variety of software. Proceed with caution when you install the dependency packages.
Method 1
Install the following dependency packages:
CUDA 11.1
cuDNN 8.3.1.22
TensorRT 8.2.3.0
Specify the dependency libraries of TensorRT and CUDA for the LD_LIBRARY_PATH environment variable.
You can run the following commands to specify the dependency libraries. In this example, the dependency library of CUDA is stored in the /usr/local/cuda/ directory and the dependency library of TensorRT is stored in the /usr/local/TensorRT/ directory. You can change the directories based on your business requirements.
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/local/TensorRT/lib:$LD_LIBRARY_PATH
If you added the export commands to the ~/.bashrc file, reload the file for the environment variables to take effect.
source ~/.bashrc
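To confirm that the variable contains the directories you expect, you could run a check such as the following. This is a minimal sketch: missing_library_dirs is a hypothetical helper, and the two directories are the example locations used in this method, which you may have changed.

```python
import os


def missing_library_dirs(required_dirs, ld_library_path=None):
    """Return the required directories that are absent from LD_LIBRARY_PATH."""
    if ld_library_path is None:
        ld_library_path = os.environ.get("LD_LIBRARY_PATH", "")
    # LD_LIBRARY_PATH is a colon-separated list; normalize entries so that
    # a trailing slash does not cause a false mismatch.
    present = {
        os.path.normpath(entry)
        for entry in ld_library_path.split(":")
        if entry
    }
    return [d for d in required_dirs if os.path.normpath(d) not in present]


required = ["/usr/local/cuda/lib64", "/usr/local/TensorRT/lib"]
print(missing_library_dirs(required))  # an empty list means both are present
```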
Method 2
Use the NVIDIA pip index to install the dependency packages.
pip install nvidia-pyindex && \
pip install nvidia-tensorrt==8.2.3.0
Install PyTorch 1.9.0+cu111.
pip3 install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html
Download and install AIACC-Inference for Torch.
pip install aiacctorch -f https://aiacc-inference-public.oss-cn-beijing.aliyuncs.com/aiacc-inference-torch/aiacctorch_stable.html -f https://download.pytorch.org/whl/torch_stable.html
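Because the WHL package depends on specific versions, you may want to compare the installed versions against the ones listed in this section. The following is a minimal sketch that uses the standard importlib.metadata module. The version_mismatches helper is hypothetical, and the nvidia-tensorrt entry applies only if you used Method 2, because a TensorRT that is installed outside pip is not visible to this check.

```python
from importlib import metadata

# Versions listed in this topic for the WHL installation path.
EXPECTED = {
    "torch": "1.9.0+cu111",
    "torchvision": "0.10.0+cu111",
    "torchaudio": "0.9.0",
    "nvidia-tensorrt": "8.2.3.0",
}


def version_mismatches(expected=EXPECTED):
    """Map each package whose installed version differs to (expected, installed).

    `installed` is None when the package is not installed at all.
    """
    mismatches = {}
    for name, wanted in expected.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches
```

An empty result from version_mismatches() suggests that the pinned packages match. Any entry in the result shows the expected version next to the version that was found.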
Run inference on the ResNet-50 model
This example uses the Conda installation and the torch.jit.script operation to run inference on the ResNet-50 model 1,000 times. The average inference time is reduced from 3.68 ms to 0.396 ms.
Original inference
The following content shows the original code:
import time
import torch
import torchvision.models as models
mod = models.resnet50(pretrained=True).eval()
mod_jit = torch.jit.script(mod)
mod_jit = mod_jit.cuda()
in_t = torch.randn([1,3,224,224]).float().cuda()
# Warming up
for _ in range(10):
    mod_jit(in_t)
inference_count = 1000
# inference test
start = time.time()
for _ in range(inference_count):
    mod_jit(in_t)
end = time.time()
print(f"use {(end-start)/inference_count*1000} ms each inference")
print(f"{inference_count/(end-start)} step/s")
The following figure shows that the average inference time is about 3.68 ms.
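The script above measures time on the host with time.time(). CUDA work is queued asynchronously, so a host-side timer can stop before the GPU has finished unless the device is synchronized first. The following is a minimal sketch of a reusable timing helper that accepts an optional synchronization callback. The benchmark function and its parameters are hypothetical, and the figures in this topic were produced by the scripts as shown, so a synchronized measurement may report different values.

```python
import time


def benchmark(fn, warmup=10, iterations=1000, sync=None):
    """Return the average wall-clock time of fn() in milliseconds.

    `sync` is an optional zero-argument callable that blocks until queued
    device work has finished, for example torch.cuda.synchronize.
    """
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()  # make sure warm-up work is done before the timer starts
    start = time.perf_counter()
    for _ in range(iterations):
        fn()
    if sync is not None:
        sync()  # wait for outstanding work before the timer stops
    elapsed = time.perf_counter() - start
    return elapsed / iterations * 1000
```

For the ResNet-50 script, a call could look like benchmark(lambda: mod_jit(in_t), sync=torch.cuda.synchronize).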
Accelerated inference
To improve inference performance, add the following lines to the original code:
import aiacctorch
mod_jit = aiacctorch.compile(mod_jit)
The following content shows the updated code:
import time
import aiacctorch # Import the AIACC-Inference for Torch package.
import torch
import torchvision.models as models
mod = models.resnet50(pretrained=True).eval()
mod_jit = torch.jit.script(mod)
mod_jit = mod_jit.cuda()
mod_jit = aiacctorch.compile(mod_jit) # Compile the model with AIACC-Inference for Torch.
in_t = torch.randn([1,3,224,224]).float().cuda()
# Warming up
for _ in range(10):
    mod_jit(in_t)
inference_count = 1000
# inference test
start = time.time()
for _ in range(inference_count):
    mod_jit(in_t)
end = time.time()
print(f"use {(end-start)/inference_count*1000} ms each inference")
print(f"{inference_count/(end-start)} step/s")
The following figure shows that the average inference time is about 0.396 ms. Compared with the 3.68 ms of the original inference, AIACC-Inference for Torch makes inference about 9.3 times faster.
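The improvement can be expressed as a latency ratio and as throughput. The following lines are a small worked calculation that uses only the two averages reported in this section. The helper names are illustrative.

```python
def speedup(baseline_ms, optimized_ms):
    """Latency ratio: how many times faster the optimized run is."""
    return baseline_ms / optimized_ms


def throughput(latency_ms):
    """Inferences per second for a given per-inference latency in milliseconds."""
    return 1000.0 / latency_ms


resnet50_speedup = speedup(3.68, 0.396)
print(f"speedup: {resnet50_speedup:.2f}x")
print(f"throughput: {throughput(3.68):.0f} -> {throughput(0.396):.0f} step/s")
```

For the reported averages, this gives a speedup of about 9.29 and a throughput increase from about 272 to about 2,525 inferences per second.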
Run inference on the Bert-Base model
In this example, the torch.jit.trace operation is called to run inference on the Bert-Base model. The average inference time is reduced from 4.95 ms to 0.419 ms.
Install the transformers package.
pip install transformers
Run the demos of original inference and accelerated inference, and view the inference results.
Original inference
The following content shows the original code:
from transformers import BertModel, BertTokenizer, BertConfig
import torch
import time
enc = BertTokenizer.from_pretrained("bert-base-uncased")
# Tokenizing input text
text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
tokenized_text = enc.tokenize(text)
# Masking one of the input tokens
masked_index = 8
tokenized_text[masked_index] = '[MASK]'
indexed_tokens = enc.convert_tokens_to_ids(tokenized_text)
segments_ids = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
# Creating a dummy input
tokens_tensor = torch.tensor([indexed_tokens]).cuda()
segments_tensors = torch.tensor([segments_ids]).cuda()
dummy_input = [tokens_tensor, segments_tensors]
# Initializing the model with the torchscript flag
# Flag set to True even though it is not necessary as this model does not have an LM Head.
config = BertConfig(vocab_size_or_config_json_file=32000, hidden_size=768,
    num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072, torchscript=True)
# Instantiating the model
model = BertModel(config)
# The model needs to be in evaluation mode
model.eval()
# If you are instantiating the model with `from_pretrained` you can also easily set the TorchScript flag
model = BertModel.from_pretrained("bert-base-uncased", torchscript=True)
model = model.eval().cuda()
# Creating the trace
traced_model = torch.jit.trace(model, dummy_input)
# Warming up
for _ in range(10):
    all_encoder_layers, pooled_output = traced_model(*dummy_input)
inference_count = 1000
# inference test
start = time.time()
for _ in range(inference_count):
    traced_model(*dummy_input)
end = time.time()
print(f"use {(end-start)/inference_count*1000} ms each inference")
print(f"{inference_count/(end-start)} step/s")
The following figure shows that the average inference time is about 4.95 ms.
Accelerated inference
To improve inference performance, add the following lines to the original code:
import aiacctorch
traced_model = aiacctorch.compile(traced_model)
The following content shows the updated code:
from transformers import BertModel, BertTokenizer, BertConfig
import torch
import aiacctorch # Import the AIACC-Inference for Torch package.
import time
enc = BertTokenizer.from_pretrained("bert-base-uncased")
# Tokenizing input text
text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
tokenized_text = enc.tokenize(text)
# Masking one of the input tokens
masked_index = 8
tokenized_text[masked_index] = '[MASK]'
indexed_tokens = enc.convert_tokens_to_ids(tokenized_text)
segments_ids = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
# Creating a dummy input
tokens_tensor = torch.tensor([indexed_tokens]).cuda()
segments_tensors = torch.tensor([segments_ids]).cuda()
dummy_input = [tokens_tensor, segments_tensors]
# Initializing the model with the torchscript flag
# Flag set to True even though it is not necessary as this model does not have an LM Head.
config = BertConfig(vocab_size_or_config_json_file=32000, hidden_size=768,
    num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072, torchscript=True)
# Instantiating the model
model = BertModel(config)
# The model needs to be in evaluation mode
model.eval()
# If you are instantiating the model with `from_pretrained` you can also easily set the TorchScript flag
model = BertModel.from_pretrained("bert-base-uncased", torchscript=True)
model = model.eval().cuda()
# Creating the trace
traced_model = torch.jit.trace(model, dummy_input)
traced_model = aiacctorch.compile(traced_model) # Compile the model with AIACC-Inference for Torch.
# Warming up
for _ in range(10):
    all_encoder_layers, pooled_output = traced_model(*dummy_input)
inference_count = 1000
# inference test
start = time.time()
for _ in range(inference_count):
    traced_model(*dummy_input)
end = time.time()
print(f"use {(end-start)/inference_count*1000} ms each inference")
print(f"{inference_count/(end-start)} step/s")
The following figure shows that the average inference time is about 0.419 ms. Compared with the 4.95 ms of the original inference, AIACC-Inference for Torch makes inference about 11.8 times faster.
Run inference for dynamic input sizes on the ResNet-50 model
AIACC-Inference for Torch supports dynamic input sizes, so the same compiled model can run inference on inputs of different shapes. In this example, three input sizes that differ in height and width (and, for the largest input, in batch size) are used to run inference on the ResNet-50 model.
import time
import aiacctorch #Import the AIACC-Inference for Torch package.
import torch
import torchvision.models as models
mod = models.resnet50(pretrained=True).eval()
mod_jit = torch.jit.script(mod)
mod_jit = mod_jit.cuda()
mod_jit = aiacctorch.compile(mod_jit) #Compile the AIACC-Inference for Torch code.
in_t = torch.randn([1,3,224,224]).float().cuda()
in_2t = torch.randn([1,3,448,448]).float().cuda()
in_3t = torch.randn([16,3,640,640]).float().cuda()
# Warming up
for _ in range(10):
    mod_jit(in_t)
    mod_jit(in_3t)
inference_count = 1000
# inference test
start = time.time()
for _ in range(inference_count):
    mod_jit(in_t)
    mod_jit(in_2t)
    mod_jit(in_3t)
end = time.time()
print(f"use {(end-start)/(inference_count*3)*1000} ms each inference")
print(f"{inference_count/(end-start)} step/s")
The following figure shows that the average inference time is about 9.84 ms.
During the warming up stage, you must run inference with both the smallest and the largest tensor input sizes that you expect. This prevents recompilation errors and shortens the compilation time. For example, if the input sizes range from 1 × 3 × 224 × 224 to 16 × 3 × 640 × 640, run inference with both of these sizes during the warming up stage.
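If the expected input sizes are known in advance, the two warm-up shapes can be selected programmatically. The following is a minimal sketch with a hypothetical warmup_shapes helper. It assumes that the smallest and largest inputs can be ranked by total element count, which holds for the three sizes in this example. If dimensions vary independently, for example a larger batch with a smaller image, you may need to track the extremes of each dimension instead.

```python
import math


def warmup_shapes(expected_shapes):
    """Return the (smallest, largest) shapes ranked by total element count."""
    smallest = min(expected_shapes, key=math.prod)
    largest = max(expected_shapes, key=math.prod)
    return smallest, largest


shapes = [(1, 3, 224, 224), (1, 3, 448, 448), (16, 3, 640, 640)]
smallest, largest = warmup_shapes(shapes)
print(smallest, largest)
```

The two returned shapes can then be used to create the tensors that are passed to the model in the warm-up loop.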
Comparison of performance data
The following table compares the inference latency of models optimized by using AIACC-Inference for Torch with the latency of the same models in PyTorch. The test environment is configured as follows:
Instance specification: a GPU-accelerated instance configured with an NVIDIA A10 GPU.
CUDA version: 11.5.
CUDA driver version: 470.57.02.
Model | Input size | AIACC-Inference-Torch (ms) | PyTorch Half (ms) | PyTorch Float (ms) | Acceleration ratio |
resnet50 | 1x3x224x224 | 0.46974873542785645 | 3.4382946491241455 | 2.9194235801696777 | 6.22 |
mobilenet-v2-100 | 1x3x224x224 | 0.23872756958007812 | 2.8045766353607178 | 2.0068271160125732 | 8.69 |
SRGAN-X4 | 1x3x272x480 | 23.070229649543762 | 35.863523721694946 | 132.00348043441772 | 5.74 |
YOLO-V3 | 1x3x640x640 | 3.869319200515747 | 8.807475328445435 | 15.704705834388735 | 4.06 |
bert-base-uncased | 1x128 and 1x128 | 0.9421144723892212 | 3.1525989770889282 | 3.761411190032959 | 4.00 |
bert-large-uncased | 1x128 and 1x128 | 1.3300731182098389 | 6.11789083480835 | 7.110695481300354 | 5.34 |