Machine Learning Platform for AI (PAI)-Blade provides an SDK for C++ that you can use to deploy optimized models for inference. This topic describes how to use PAI-Blade SDK to deploy a PyTorch model.

Prerequisites

  • A PyTorch model is optimized by using PAI-Blade. For more information, see Optimize a PyTorch model.
  • An SDK is installed, and an authentication token is obtained. In this example, the SDK for the Pre-CXX11 application binary interface (ABI) and the .deb package of V3.7.0 are used.
    Note A model that is optimized by using PAI-Blade can run properly only if the corresponding SDK is installed.

Prepare the environment

This section describes how to prepare the environment in which the model is deployed. In this example, Ubuntu 18.04 64-bit is used.

  1. Prepare the server.
    Prepare an Elastic Compute Service (ECS) instance that is configured with the following specifications:
    • Instance type: ecs.gn6i-c4g1.xlarge (NVIDIA Tesla T4 GPU)
    • Operating system: Ubuntu 18.04 64-bit
    • CUDA version: 10.0
    • GPU driver version: 440.64.00
    • cuDNN version: 7.6.5
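    After the instance is created, you can optionally confirm that the instance matches the preceding specifications. The following commands are a minimal check and assume that the NVIDIA driver, the CUDA 10.0 toolkit, and cuDNN are installed in their default locations:
    # Check the GPU model and the driver version, for example, Tesla T4 and 440.64.00.
    nvidia-smi
    
    # Check the CUDA toolkit version, which is expected to be 10.0 in this example.
    nvcc --version
    
    # Check the cuDNN version, which is expected to be 7.6.5 in this example.
    # The header path may differ depending on how cuDNN was installed.
    grep -A 2 "#define CUDNN_MAJOR" /usr/local/cuda/include/cudnn.h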
  2. Install Python 3.
    # Update pip. 
    python3 -m pip install --upgrade pip
    
    # Install virtualenv and create a virtual environment in which you can install PyTorch. 
    pip3 install virtualenv==16.0
    python3 -m virtualenv venv
    
    # Activate virtualenv. 
    source venv/bin/activate
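    The compile and run commands later in this topic use the PyTorch wheel that is installed in this virtual environment. After you install the PyTorch wheel that matches your PAI-Blade SDK, as described in Optimize a PyTorch model, you can run the following optional check to confirm that PyTorch is available and is built for CUDA 10.0:
    # Confirm that PyTorch can be imported, that it is a CUDA 10.0 build, and that the GPU is visible.
    # The exact version number depends on the wheel that you installed.
    python3 -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"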

Deploy a model for inference

To load and run an optimized model by using PAI-Blade SDK, you need only link the libraries in the SDK when you compile the inference code. You do not need to modify the original code logic.

  1. Prepare the model and test data.
    In this example, an optimized sample model is used. Run the following commands to download the sample model and the test data. You can also use your own optimized model. For more information about how to optimize a model by using PAI-Blade, see Optimize a PyTorch model.
    # Download the optimized sample model. 
    wget http://pai-blade.oss-cn-zhangjiakou.aliyuncs.com/demo/sdk/pytorch/optimized_resnet50.pt
    # Download the test data. 
    wget http://pai-blade.oss-cn-zhangjiakou.aliyuncs.com/demo/sdk/pytorch/inputs.pth
  2. Prepare the inference code.
    You can run a PyTorch model that is optimized by using PAI-Blade in the same way as a regular PyTorch model. You do not need to write extra code or set extra parameters. In this example, the following inference code is used:
    #include <torch/script.h>
    #include <torch/serialize.h>
    #include <chrono>
    #include <iostream>
    #include <fstream>
    #include <memory>
    
    int benchmark(torch::jit::script::Module &module,
                 std::vector<torch::jit::IValue> &inputs) {
      // Warm up: run 10 iterations that are not measured.
      for (int k = 0; k < 10; ++k) {
        module.forward(inputs);
      }
      auto start = std::chrono::system_clock::now();
      // Measure the latency of 20 iterations.
      for (int k = 0; k < 20; ++k) {
        module.forward(inputs);
      }
      auto end = std::chrono::system_clock::now();
      std::chrono::duration<double> elapsed_seconds = end - start;
      std::time_t end_time = std::chrono::system_clock::to_time_t(end);
    
      std::cout << "finished computation at " << std::ctime(&end_time)
                << "\nelapsed time: " << elapsed_seconds.count() << "s"
                << "\navg latency: " << 1000.0 * elapsed_seconds.count()/20 << "ms\n";
      return 0;
    }
    
    torch::Tensor load_data(const char* data_file) {
      std::ifstream file(data_file, std::ios::binary);
      std::vector<char> data((std::istreambuf_iterator<char>(file)), std::istreambuf_iterator<char>());
      torch::IValue ivalue = torch::pickle_load(data);
      CHECK(ivalue.isTensor());
      return ivalue.toTensor();
    }
    
    int main(int argc, const char* argv[]) {
      if (argc != 3) {
        std::cerr << "usage: example-app <path-to-exported-script-module> <path-to-saved-test-data>\n";
        return -1;
      }
    
      torch::jit::script::Module module;
      try {
        // Deserialize the ScriptModule from a file using torch::jit::load().
        module = torch::jit::load(argv[1]);
        auto image_tensor = load_data(argv[2]);
    
        std::vector<torch::IValue> inputs{image_tensor};
        benchmark(module, inputs);
        auto outputs = module.forward(inputs);
      }
      catch (const c10::Error& e) {
        std::cerr << "error loading the model" << std::endl << e.what();
        return -1;
      }
    
      std::cout << "ok\n";
    }
    Save the preceding sample code to a local file named torch_app.cc.
  3. Compile the inference code.
    When you compile the code, link the LibTorch libraries together with the libtorch_blade.so and libral_base_context.so libraries that the SDK installs in the /usr/local/lib directory. Run the following command to compile the code:
    TORCH_DIR=$(python3 -c "import torch; import os; print(os.path.dirname(torch.__file__))")
    g++ torch_app.cc -std=c++14 \
        -D_GLIBCXX_USE_CXX11_ABI=0 \
        -I ${TORCH_DIR}/include \
        -I ${TORCH_DIR}/include/torch/csrc/api/include \
        -Wl,--no-as-needed \
        -L /usr/local/lib \
        -L ${TORCH_DIR}/lib \
        -l torch -l torch_cuda -l torch_cpu -l c10 -l c10_cuda \
        -l torch_blade -l ral_base_context \
        -o torch_app
    You can modify the following parameters based on your business requirements:
    • torch_app.cc: the name of the file that contains the inference code.
    • /usr/local/lib: the installation path of the SDK. In most cases, you do not need to modify this parameter.
    • torch_app: the name of the executable program that is generated after compilation.
    On some operating system and compiler versions, linking succeeds only if the -Wl,--no-as-needed option is specified.
    Notice
    • Set the value of the _GLIBCXX_USE_CXX11_ABI macro based on the ABI of your LibTorch build. You can check the ABI by running the command shown after this notice.
    • PyTorch for CUDA 10.0 provided by PAI-Blade is compiled by using GNU Compiler Collection (GCC) 7.5. If you use the CXX11 ABI, make sure that the GCC version is 7.5. If you use the Pre-CXX11 ABI, no limits are placed on the GCC version.
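    If you are not sure which ABI your LibTorch build uses, you can query it from the installed PyTorch package before you compile. The following optional check uses the standard torch.compiled_with_cxx11_abi() API:
    # Prints True for the CXX11 ABI (compile with -D_GLIBCXX_USE_CXX11_ABI=1) and
    # False for the Pre-CXX11 ABI (compile with -D_GLIBCXX_USE_CXX11_ABI=0, as in this example).
    python3 -c "import torch; print(torch.compiled_with_cxx11_abi())"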
  4. Run the model for inference on a local device.
    Use the executable program to load and run the optimized model. In this example, the executable program torch_app and the optimized sample model optimized_resnet50.pt are used.
    export BLADE_REGION=<region>    # The region in which PAI-Blade is used, for example, cn-beijing or cn-shanghai.
    export BLADE_TOKEN=<token>
    export LD_LIBRARY_PATH=/usr/local/lib:${TORCH_DIR}/lib:${LD_LIBRARY_PATH}
    ./torch_app optimized_resnet50.pt inputs.pth
    Modify the following parameters based on your business requirements:
    • <region>: the region in which you use PAI-Blade. You can join the DingTalk group of PAI-Blade users to obtain the regions in which PAI-Blade can be used.
    • <token>: the authentication token that is required to use PAI-Blade. You can join the DingTalk group of PAI-Blade users to obtain the authentication token.
    • torch_app: the executable program that is generated after compilation.
    • optimized_resnet50.pt: the PyTorch model that is optimized by using PAI-Blade. In this example, the optimized sample model that is downloaded in Step 1 is used.
    • inputs.pth: the test data. In this example, the test data that is downloaded in Step 1 is used.
    If output similar to the following is displayed, the model runs as expected:
    finished computation at Wed Jan 27 20:03:38 2021
    
    elapsed time: 0.513882s
    avg latency: 25.6941ms
    ok
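    If the program cannot start because a shared library is not found, make sure that LD_LIBRARY_PATH contains both the SDK installation path and the LibTorch library path, and check the libraries that the executable is linked against. The following optional check uses the standard ldd tool:
    # libtorch_blade.so and libral_base_context.so are expected to resolve to /usr/local/lib,
    # and the LibTorch libraries to ${TORCH_DIR}/lib.
    ldd torch_app | grep -E "torch|ral_base_context"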