AIACC-Inference (AIACC Inference Acceleration) can optimize models built on the Torch framework and significantly improve inference performance. This topic describes how to manually install AIACC-Inference for Torch and provides examples that demonstrate the acceleration effect.

Prerequisites

An Alibaba Cloud GPU instance is created that meets the following requirements:
  • Instance type: equipped with an NVIDIA P100, V100, or T4 GPU
    Note: For more information, see Instance families.
  • Instance image: Ubuntu 16.04 LTS or CentOS 7.x

Background information

AIACC-Inference for Torch significantly improves PyTorch inference performance by partitioning the model's computational graph, fusing layers, and using high-performance operator implementations. You do not need to specify a precision or an input size: deep learning models built on PyTorch are optimized for inference through JIT compilation.

AIACC-Inference for Torch accelerates inference with a single call to the aiacctorch.compile(model) interface. You only need to convert the PyTorch model to a TorchScript model first by using the torch.jit.script or torch.jit.trace interface. For more information, see the official PyTorch documentation. This topic provides examples that accelerate inference with torch.jit.script and with torch.jit.trace.
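
The basic pattern is only a few lines. The following minimal sketch summarizes the examples later in this topic; ResNet-18 is used purely for illustration, and any model that can be converted to TorchScript is handled the same way:

  import aiacctorch
  import torch
  import torchvision.models as models

  # 1. Convert the PyTorch model to a TorchScript model
  model = models.resnet18().eval().cuda()
  scripted = torch.jit.script(model)  # or torch.jit.trace(model, example_input)

  # 2. Compile the TorchScript model with AIACC-Inference for Torch
  scripted = aiacctorch.compile(scripted)

  # 3. Call the compiled model exactly like the original one
  output = scripted(torch.randn(1, 3, 224, 224).cuda())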

Prepare and install the AIACC-Inference for Torch package

AIACC-Inference for Torch is provided as two types of packages: a one-click Conda package and a whl package. You can choose either one based on your business scenario. A quick verification snippet is provided after the installation steps.

  • Conda package

    Most of the required dependencies are preinstalled in the one-click Conda package. You only need to install the GPU driver manually and then install the Conda package. Procedure:

    Note: Do not modify the dependencies preinstalled in the Conda package. Otherwise, the demos may fail to run because of version mismatches.
    1. Connect to the instance remotely.
    2. Install a GPU driver of version 455.23.05 or later yourself.
    3. Download the Conda package.
      wget https://aiacc-inference-torch.oss-cn-hangzhou.aliyuncs.com/aiacc-inference-torch-miniconda-latest.tar.bz2
    4. Decompress the Conda package.
      mkdir ./aiacc-inference-miniconda && tar -xvf ./aiacc-inference-torch-miniconda-latest.tar.bz2 -C ./aiacc-inference-miniconda
    5. Activate the Conda environment.
      source ./aiacc-inference-miniconda/bin/activate
  • whl package
    You need to install the required dependencies manually before you install the whl package. Procedure:
    1. Connect to the instance remotely.
    2. Install the dependencies by using one of the following methods. Because the whl package depends on a specific combination of software versions, configure them carefully.
      • Method 1
        1. Install the following dependencies yourself:
          • CUDA 11.1
          • cuDNN 8.2.0.53
          • TensorRT 7.2.3.4
        2. Add the TensorRT and CUDA library directories to the system LD_LIBRARY_PATH environment variable.

          The following commands assume that the CUDA libraries are located in /usr/local/cuda/ and the TensorRT libraries in /usr/local/TensorRT/. Replace the paths based on your actual setup, and append the commands to your ~/.bashrc file so that they take effect in the next step.

          export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
          export LD_LIBRARY_PATH=/usr/local/TensorRT/lib:$LD_LIBRARY_PATH
        3. Apply the environment variables.
          source ~/.bashrc
      • Method 2

        Install the dependencies by using the NVIDIA pip packages.

        pip install nvidia-pyindex && \
        pip install nvidia-tensorrt==7.2.3.4
    3. Install PyTorch 1.9.0+cu111.
      pip3 install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html
    4. Download the package that matches the Python version used by your business.
      • Python 3.6
        wget https://aiacc-inference-torch.oss-cn-hangzhou.aliyuncs.com/aiacctorch-0.4.0a1-cp36-cp36m-linux_x86_64.whl
      • Python 3.7
        wget https://aiacc-inference-torch.oss-cn-hangzhou.aliyuncs.com/aiacctorch-0.4.0a1-cp37-cp37m-linux_x86_64.whl
      • Python 3.8
        wget https://aiacc-inference-torch.oss-cn-hangzhou.aliyuncs.com/aiacctorch-0.4.0a1-cp38-cp38-linux_x86_64.whl
      • Python 3.9
        wget https://aiacc-inference-torch.oss-cn-hangzhou.aliyuncs.com/aiacctorch-0.4.0a1-cp39-cp39-linux_x86_64.whl
    5. Install the downloaded Python package.
      pip install *.whl
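
Regardless of which package type you installed, you can run a quick sanity check from the environment you just set up. The following snippet is only a minimal verification sketch; it confirms that aiacctorch can be imported and that PyTorch sees the GPU:

  import sys
  import torch
  import aiacctorch  # raises ImportError if the installation did not succeed

  print(sys.version)                # Python version, which must match the downloaded whl package
  print(torch.__version__)          # 1.9.0+cu111 is expected for the whl-based installation
  print(torch.cuda.is_available())  # should print True on a GPU instance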

Run inference based on the ResNet50 model

The following example assumes that the Conda package is installed. It runs inference tasks on the ResNet50 model by calling the torch.jit.script interface, averages the time over 1,000 runs, and reduces the inference latency from 9.12 ms to within 1.15 ms.

  • Original version

    The original code is as follows:

    import time
    import torch
    import torchvision.models as models
    mod = models.resnet50()
    mod_jit = torch.jit.script(mod)
    mod_jit = mod_jit.cuda()
    
    in_t = torch.randn([1,3,224,224]).float().cuda()
    
    # Warming up
    for _ in range(10):
        mod_jit(in_t)
    
    inference_count = 1000
    # inference test
    start = time.time()
    for _ in range(inference_count):
        mod_jit(in_t)
    end = time.time()
    print(f"use {(end-start)/inference_count*1000} ms each inference")
    print(f"{inference_count/(end-start)} step/s")

    The output shows that each inference takes about 9.12 ms.

  • Accelerated version

    You only need to add the following two lines to the original sample code to accelerate inference:

    import aiacctorch
    mod_jit = aiacctorch.compile(mod_jit)

    The updated code is as follows:

    import time
    import aiacctorch  # Import the aiacc package
    import torch
    import torchvision.models as models
    mod = models.resnet50()
    mod_jit = torch.jit.script(mod)
    mod_jit = mod_jit.cuda()
    mod_jit = aiacctorch.compile(mod_jit)  # Compile the model
    
    in_t = torch.randn([1,3,224,224]).float().cuda()
    
    # Warming up
    for _ in range(10):
        mod_jit(in_t)
    
    inference_count = 1000
    # inference test
    start = time.time()
    for _ in range(inference_count):
        mod_jit(in_t)
    end = time.time()
    print(f"use {(end-start)/inference_count*1000} ms each inference")
    print(f"{inference_count/(end-start)} step/s")

    The output shows that each inference takes about 1.15 ms, a significant improvement over the previous 9.12 ms.


Run inference based on the Bert-Base model

The following example runs inference tasks on the Bert-Base model by calling the torch.jit.trace interface and reduces the inference latency from 8.13 ms to within 3.91 ms.

  1. Install the transformers package.
    pip install transformers
  2. Run the original and accelerated versions of the demo and compare the results.
    • Original version

      The original code is as follows:

      from transformers import BertModel, BertTokenizer, BertConfig
      import torch
      import time
      
      enc = BertTokenizer.from_pretrained("bert-base-uncased")
      
      # Tokenizing input text
      text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
      tokenized_text = enc.tokenize(text)
      
      # Masking one of the input tokens
      masked_index = 8
      tokenized_text[masked_index] = '[MASK]'
      indexed_tokens = enc.convert_tokens_to_ids(tokenized_text)
      segments_ids = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
      
      # Creating a dummy input
      tokens_tensor = torch.tensor([indexed_tokens]).cuda()
      segments_tensors = torch.tensor([segments_ids]).cuda()
      dummy_input = [tokens_tensor, segments_tensors]
      
      # Initializing the model with the torchscript flag
      # Flag set to True even though it is not necessary as this model does not have an LM Head.
      config = BertConfig(vocab_size=32000, hidden_size=768,
          num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072, torchscript=True)
      
      # Instantiating the model
      model = BertModel(config)
      
      # The model needs to be in evaluation mode
      model.eval()
      
      # If you are instantiating the model with `from_pretrained` you can also easily set the TorchScript flag
      model = BertModel.from_pretrained("bert-base-uncased", torchscript=True)
      
      model = model.eval().cuda()
      
      # Creating the trace
      traced_model = torch.jit.trace(model, dummy_input)
      
      # Warming up
      for _ in range(10):
          all_encoder_layers, pooled_output = traced_model(*dummy_input)
      
      inference_count = 1000
      # inference test
      start = time.time()
      for _ in range(inference_count):
          traced_model(*dummy_input)
      end = time.time()
      print(f"use {(end-start)/inference_count*1000} ms each inference")
      print(f"{inference_count/(end-start)} step/s")

      The output shows that each inference takes about 8.13 ms.

    • Accelerated version

      You only need to add the following two lines to the original sample code to accelerate inference:

      import aiacctorch
      traced_model = aiacctorch.compile(traced_model)

      The updated code is as follows:

      from transformers import BertModel, BertTokenizer, BertConfig
      import torch
      import aiacctorch  # Import the aiacc package
      import time
      
      enc = BertTokenizer.from_pretrained("bert-base-uncased")
      
      # Tokenizing input text
      text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
      tokenized_text = enc.tokenize(text)
      
      # Masking one of the input tokens
      masked_index = 8
      tokenized_text[masked_index] = '[MASK]'
      indexed_tokens = enc.convert_tokens_to_ids(tokenized_text)
      segments_ids = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
      
      # Creating a dummy input
      tokens_tensor = torch.tensor([indexed_tokens]).cuda()
      segments_tensors = torch.tensor([segments_ids]).cuda()
      dummy_input = [tokens_tensor, segments_tensors]
      
      # Initializing the model with the torchscript flag
      # Flag set to True even though it is not necessary as this model does not have an LM Head.
      config = BertConfig(vocab_size=32000, hidden_size=768,
          num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072, torchscript=True)
      
      # Instantiating the model
      model = BertModel(config)
      
      # The model needs to be in evaluation mode
      model.eval()
      
      # If you are instantiating the model with `from_pretrained` you can also easily set the TorchScript flag
      model = BertModel.from_pretrained("bert-base-uncased", torchscript=True)
      
      model = model.eval().cuda()
      
      # Creating the trace
      traced_model = torch.jit.trace(model, dummy_input)
      traced_model = aiacctorch.compile(traced_model)  # Compile the model
      
      # Warming up
      for _ in range(10):
          all_encoder_layers, pooled_output = traced_model(*dummy_input)
      
      inference_count = 1000
      # inference test
      start = time.time()
      for _ in range(inference_count):
          traced_model(*dummy_input)
      end = time.time()
      print(f"use {(end-start)/inference_count*1000} ms each inference")
      print(f"{inference_count/(end-start)} step/s")

      The output shows that each inference takes about 3.91 ms, a significant improvement over the previous 8.13 ms.
