Optimize models with Dynamic Shape inputs by using Blade - Platform For AI

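Conventional inference optimization generally targets models with Static Shape inputs. If the input shape changes at inference time, the optimization may no longer take effect. In production, more and more models take Dynamic Shape inputs, so there is strong demand for optimizing inference across different input shapes. This topic describes how to use Blade to optimize a model with Dynamic Shape inputs.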

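Limits

The environment used in this topic must meet the following version requirements:

System environment: Linux with Python 3.6 or later.
Framework: PyTorch 1.7.1.
Device and backend: NVIDIA T4, CUDA 11.0.
Inference optimization tool: Blade 3.17.0 or later.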

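Procedure

To optimize a ResNet50 model with Dynamic Shape inputs by using Blade, perform the following steps:

Step 1: Make preparations
Build the test data and the model. This topic uses the standard ResNet50 model from torchvision.
Step 2: Configure the config used for optimization
Configure the Blade config based on the Dynamic Shape range.
Step 3: Call Blade to optimize the model
Call the blade.optimize interface to optimize the model, and save the optimized model.
Step 4: Verify performance and correctness
Test the inference speed and inference results before and after optimization to verify the information in the optimization report.
Step 5: Load and run the optimized model
Integrate the Blade SDK and load the optimized model for inference.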

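Step 1: Make preparations

Download the pretrained model parameters and the test data.

The pretrained parameters come from torchvision and are stored in OSS to speed up the download. The test data is randomly selected from the ImageNet-1k validation set and is already preprocessed, so you can use it directly after download.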

wget http://pai-blade.oss-cn-zhangjiakou.aliyuncs.com/share/dynamic_ranges_pratice/resnet50-19c8e357.pth -O resnet50-19c8e357.pth
wget http://pai-blade.oss-cn-zhangjiakou.aliyuncs.com/share/dynamic_ranges_pratice/imagenet_val_example.pt -O imagenet_val_example.pt

Define the model, load the model parameters and the test data, and generate a TorchScript.

import torch
import torchvision

# Build ResNet50.
model = torchvision.models.resnet50().eval().cuda()
# Load the pretrained parameters.
ckpt = torch.load('resnet50-19c8e357.pth')
model.load_state_dict(ckpt)
# Load the test data.
example_input = torch.load('imagenet_val_example.pt').cuda()
# Generate the TorchScript.
traced_model = torch.jit.trace(model, example_input).cuda().eval()

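Step 2: Configure the config used for optimization

Configure the Blade config based on the Dynamic Shape range. Blade supports dynamic ranges on any dimension. This topic uses the batch dimension to demonstrate the configuration.

Define the Dynamic Shape range.
A valid dynamic range must include the following three fields:
- min: the lower bound of the dynamic shape.
- max: the upper bound of the dynamic shape.
- opts: the shapes that require special optimization. You can specify more than one. The optimized model usually achieves a higher inference speedup on these shapes.
The three fields must follow these rules:
- Each group of shapes in min, max, and opts has the same length, which equals the number of network inputs.
- The values at corresponding positions of each group of shapes in min, max, and opts must satisfy min_num <= opt_num <= max_num.
For example, construct the following Dynamic Shape range.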
```
shapes = {
    "min": [[1, 3, 224, 224]],
    "max": [[10, 3, 224, 224]],
    "opts": [
        [[5, 3, 224, 224]],
        [[8, 3, 224, 224]],
    ]
}
```
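In addition, Blade supports multiple dynamic ranges. If the gap between the upper and lower bounds of the Dynamic Shape is too large, the optimized model may show little speedup. You can split one large range into several smaller ranges, which usually yields a better speedup. For how to set multiple dynamic ranges, see Appendix: Set multiple dynamic ranges below.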
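
The field rules above can be checked before a range is handed to Blade. The following is a minimal sketch in plain Python. The function name validate_dynamic_range is hypothetical and is not part of the Blade API; it only mirrors the two rules stated in this step.

```python
def validate_dynamic_range(shapes):
    """Check one dynamic range against the rules stated in this step.

    Hypothetical helper, not part of the Blade API.
    Returns a list of error strings; an empty list means no rule was violated.
    """
    errors = [f"missing field: {key}"
              for key in ("min", "max", "opts") if key not in shapes]
    if errors:
        return errors

    lower, upper = shapes["min"], shapes["max"]
    # "min" and "max" hold one group of shapes each; "opts" holds several groups.
    groups = [("min", lower), ("max", upper)]
    groups += [(f"opts[{i}]", g) for i, g in enumerate(shapes["opts"])]

    # Rule 1: every group lists one shape per network input.
    num_inputs = len(lower)
    for name, group in groups:
        if len(group) != num_inputs:
            errors.append(f"{name} has {len(group)} shapes, expected {num_inputs}")
    if errors:
        return errors

    # Rule 2: min <= value <= max at every position of every shape.
    for name, group in groups:
        for idx, shape in enumerate(group):
            lo, hi = lower[idx], upper[idx]
            if not (len(shape) == len(lo) == len(hi)):
                errors.append(f"{name} input {idx}: rank mismatch")
                continue
            for dim, (a, b, c) in enumerate(zip(lo, shape, hi)):
                if not a <= b <= c:
                    errors.append(
                        f"{name} input {idx} dim {dim}: {a} <= {b} <= {c} does not hold"
                    )
    return errors


shapes = {
    "min": [[1, 3, 224, 224]],
    "max": [[10, 3, 224, 224]],
    "opts": [[[5, 3, 224, 224]], [[8, 3, 224, 224]]],
}
print(validate_dynamic_range(shapes))  # []
```

To check several ranges, call the helper once per range and collect the results.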

Build the Blade config from the defined Dynamic Shape range.

import blade
import blade.torch as blade_torch

# Blade Torch config, used to set the Dynamic Shapes.
blade_torch_cfg = blade_torch.Config()
blade_torch_cfg.dynamic_tuning_shapes = shapes

# Blade config, used to disable the FP16 accuracy check to obtain the best speedup.
gpu_config = {
    "disable_fp16_accuracy_check": True,
}
blade_config = blade.Config(
    gpu_config=gpu_config
)

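Step 3: Call Blade to optimize the model

Call blade.optimize to optimize the model, as shown in the following sample code. For a detailed description of this interface, see the Python interface documentation.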
```
with blade_torch_cfg:
    optimized_model, _, report = blade.optimize(
        traced_model,          # The TorchScript model to optimize.
        'o1',                  # o1: lossless optimization.
        config=blade_config,
        device_type='gpu',     # Optimize for GPU devices.
        test_data=[(example_input,)]  # Test data.
    )
```
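When you optimize the model, note the following:
- The first return value of blade.optimize is the optimized model, whose data type is the same as that of the input model. In this example, the input is a TorchScript and the return value is the optimized TorchScript.
- Make sure that the test_data you pass in falls within the defined Dynamic Shape range.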
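
The second note can be checked before optimization is started. The sketch below compares input shapes against the min and max fields of a range. The helper shapes_in_range is hypothetical and is not part of the Blade API. It works on plain lists, so for tensors you would pass something like [list(t.shape) for t in inputs].

```python
def shapes_in_range(input_shapes, dynamic_range):
    """Return True if every input shape lies within [min, max] of the range.

    Hypothetical helper, not part of the Blade API. `input_shapes` holds one
    shape per network input.
    """
    lower, upper = dynamic_range["min"], dynamic_range["max"]
    if not (len(input_shapes) == len(lower) == len(upper)):
        return False
    for shape, lo, hi in zip(input_shapes, lower, upper):
        if not (len(shape) == len(lo) == len(hi)):
            return False
        if any(not a <= s <= b for a, s, b in zip(lo, shape, hi)):
            return False
    return True


batch_range = {"min": [[1, 3, 224, 224]], "max": [[10, 3, 224, 224]]}
print(shapes_in_range([[1, 3, 224, 224]], batch_range))   # True
print(shapes_in_range([[16, 3, 224, 224]], batch_range))  # False
```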

After optimization is complete, print the optimization report.

print("Report: {}".format(report))

The printed optimization report is similar to the following output.

Report: {
  "software_context": [
    {
      "software": "pytorch",
      "version": "1.7.1+cu110"
    },
    {
      "software": "cuda",
      "version": "11.0.0"
    }
  ],
  "hardware_context": {
    "device_type": "gpu",
    "microarchitecture": "T4"
  },
  "user_config": "",
  "diagnosis": {
    "model": "unnamed.pt",
    "test_data_source": "user provided",
    "shape_variation": "undefined",
    "message": "Unable to deduce model inputs information (data type, shape, value range, etc.)",
    "test_data_info": "0 shape: (1, 3, 224, 224) data type: float32"
  },
  "optimizations": [
    {
      "name": "PtTrtPassFp16",
      "status": "effective",
      "speedup": "4.06",
      "pre_run": "6.55 ms",
      "post_run": "1.61 ms"
    }
  ],
  "overall": {
    "baseline": "6.54 ms",
    "optimized": "1.61 ms",
    "speedup": "4.06"
  },
  "model_info": {
    "input_format": "torch_script"
  },
  "compatibility_list": [
    {
      "device_type": "gpu",
      "microarchitecture": "T4"
    }
  ],
  "model_sdk": {}
}

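The optimization report shows that the PtTrtPassFp16 optimization item took effect in this example and brought a speedup of about 4.06x, reducing the inference time of the model on the test data from 6.55 ms to 1.61 ms. These results apply only to this example; your actual results may differ. For details about the fields in the optimization report, see Optimization report.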
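
If you want to read these numbers in a script rather than by eye, the report can be parsed. The sketch below assumes that the report is available as a JSON string with the field names shown in the sample output above. This topic does not state the type of the report value, so treat that as an assumption. The function summarize_report is hypothetical.

```python
import json

# Abridged copy of the sample report shown above. In practice you would pass
# the report returned by blade.optimize (assumed here to be a JSON string).
report_text = """
{
  "optimizations": [
    {"name": "PtTrtPassFp16", "status": "effective", "speedup": "4.06",
     "pre_run": "6.55 ms", "post_run": "1.61 ms"}
  ],
  "overall": {"baseline": "6.54 ms", "optimized": "1.61 ms", "speedup": "4.06"}
}
"""


def summarize_report(text):
    """Return (overall speedup, names of the optimizations that took effect)."""
    data = json.loads(text)
    speedup = float(data["overall"]["speedup"])
    effective = [
        item["name"]
        for item in data.get("optimizations", [])
        if item.get("status") == "effective"
    ]
    return speedup, effective


speedup, effective = summarize_report(report_text)
print(speedup, effective)  # 4.06 ['PtTrtPassFp16']
```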

Call the PyTorch functions to save and load the optimized TorchScript model.

file_name = "resnet50_opt.pt"
# Save the optimized model to the local disk.
torch.jit.save(optimized_model, file_name)
# Load the optimized model from the disk.
optimized_model = torch.jit.load(file_name)

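Step 4: Verify performance and correctness

After optimization is complete, use a Python script to verify the information in the optimization report.

Define the benchmark method. It warms up the model 10 times, then runs it 100 times, and takes the average inference time as the inference speed.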

import time

@torch.no_grad()
def benchmark(model, test_data):
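    # Switch the model to evaluation mode.
    model = model.eval()

    # Warm up.
    for i in range(0, 10):
        model(test_data)
    # Wait for the warm-up kernels to finish so they are not counted in the timing.
    torch.cuda.synchronize()

    # Start the timed runs.
    num_runs = 100
    start = time.time()
    for i in range(0, num_runs):
        model(test_data)
    torch.cuda.synchronize()
    elapsed = time.time() - start
    rt_ms = elapsed / num_runs * 1000.0

    # Print the result.
    print("{:.2f} ms.".format(rt_ms))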
    return rt_ms
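
The warm-up-then-average pattern does not depend on the model, so it can be tried out on any callable. The following framework-independent sketch uses a hypothetical helper named time_callable. The optional sync argument stands in for a blocking call such as torch.cuda.synchronize and is invoked both after warm-up and after the timed runs, so that queued device work is not attributed to the wrong phase.

```python
import time


def time_callable(fn, warmup=10, runs=100, sync=None):
    """Average wall-clock time of fn() in milliseconds.

    Hypothetical helper. `sync` is an optional zero-argument callable that
    blocks until queued device work is done.
    """
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()  # Keep warm-up work out of the timed region.
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    if sync is not None:
        sync()  # Wait for the timed work to finish before reading the clock.
    return (time.perf_counter() - start) / runs * 1000.0


avg_ms = time_callable(lambda: sum(range(1000)), warmup=2, runs=5)
print("{:.4f} ms.".format(avg_ms))
```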

Define a series of test data with different shapes.

dummy_inputs = []
batch_num = [1, 3, 5, 7, 9]
for n in batch_num:
    dummy_inputs.append(torch.randn(n, 3, 224, 224).cuda())

Iterate over each group of test data, call the benchmark method on the model before and after optimization, and print the results.

for inp in dummy_inputs:
    print(f'--------------test with shape {list(inp.shape)}--------------')
    print("  Origin model inference cost:     ", end='')
    origin_rt = benchmark(traced_model, inp)
    print("  Optimized model inference cost:  ", end='')
    opt_rt = benchmark(optimized_model, inp)
    speedup = origin_rt / opt_rt
    print('  Speed up: {:.2f}'.format(speedup))
    print('')

The system returns results similar to the following.

--------------test with shape [1, 3, 224, 224]--------------
  Origin model inference cost:     6.54 ms.
  Optimized model inference cost:  1.66 ms.
  Speed up: 3.94

--------------test with shape [3, 3, 224, 224]--------------
  Origin model inference cost:     10.79 ms.
  Optimized model inference cost:  2.40 ms.
  Speed up: 4.49

--------------test with shape [5, 3, 224, 224]--------------
  Origin model inference cost:     16.27 ms.
  Optimized model inference cost:  3.25 ms.
  Speed up: 5.01

--------------test with shape [7, 3, 224, 224]--------------
  Origin model inference cost:     22.62 ms.
  Optimized model inference cost:  4.39 ms.
  Speed up: 5.16

--------------test with shape [9, 3, 224, 224]--------------
  Origin model inference cost:     28.83 ms.
  Optimized model inference cost:  5.25 ms.
  Speed up: 5.49

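The results show that, for test data of different shapes, the inference speed of the optimized model is 3.94 to 5.49 times that of the original model. These results apply only to this example; your actual results may differ.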

Use the real test data example_input prepared in Step 1 to verify the correctness of the optimized model.

origin_output = traced_model(example_input)
_, pred = origin_output.topk(1, 1, True, True)
print("origin model output: {}".format(pred))
opt_output = optimized_model(example_input)
_, pred = opt_output.topk(1, 1, True, True)
print("optimized model output: {}".format(pred))

The system returns results similar to the following.

origin model output: tensor([[834]], device='cuda:0')
optimized model output: tensor([[834]], device='cuda:0')

The preceding results show that the model predicts class 834 for the test data example_input both before and after optimization.
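
A single sample is a narrow check. With more samples, the same comparison can be summarized as a top-1 agreement rate. The sketch below is written in plain Python so that it runs without a GPU. The helper top1_agreement and the toy scores are hypothetical. With real outputs you could pass the result of calling .tolist() on each 2-D output tensor.

```python
def top1_agreement(origin_logits, opt_logits):
    """Fraction of samples whose top-1 class matches between two outputs.

    Hypothetical helper. Each argument is a list of per-sample score lists.
    """
    if len(origin_logits) != len(opt_logits) or not origin_logits:
        raise ValueError("outputs must be non-empty and have the same length")

    def argmax(row):
        return max(range(len(row)), key=row.__getitem__)

    matches = sum(
        argmax(a) == argmax(b) for a, b in zip(origin_logits, opt_logits)
    )
    return matches / len(origin_logits)


# Toy scores for two samples and three classes; the second copy differs slightly,
# as reduced-precision outputs might.
origin = [[0.1, 2.5, 0.3], [1.9, 0.2, 0.4]]
optimized = [[0.11, 2.49, 0.31], [1.88, 0.21, 0.41]]
print(top1_agreement(origin, optimized))  # 1.0
```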

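Step 5: Load and run the optimized model

After verification, you need to deploy the model. Blade provides runtime SDKs for Python and C++ that you can integrate. For how to use the C++ SDK, see Use an SDK to deploy a TensorFlow model for inference. The following describes how to use the Python SDK to deploy the model.

Optional: During the trial period, you can set the following environment variable to prevent the program from exiting because of an authentication failure.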
```
export BLADE_AUTH_USE_COUNTING=1
```
Obtain the authentication credentials.
```
export BLADE_REGION=<region>
export BLADE_TOKEN=<token>
```
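Replace the following parameters based on your actual situation:
- <region>: the region that Blade supports. To obtain this information, join the Blade user group. For the QR code of the user group, see Obtain a token.
- <token>: the authentication token. To obtain this information, join the Blade user group. For the QR code of the user group, see Obtain a token.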
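
Before a service loads the model, it can be convenient to confirm that both variables are actually set in the process environment. The sketch below is one way to do that. The helper missing_blade_env is hypothetical. It only inspects the environment and does not contact Blade or validate the values.

```python
import os

# Variable names taken from the export commands above.
REQUIRED_VARS = ("BLADE_REGION", "BLADE_TOKEN")


def missing_blade_env(env=None):
    """Return the names of required variables that are unset or empty.

    Hypothetical helper. Pass a dict to check something other than os.environ.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


print(missing_blade_env({"BLADE_REGION": "<region>"}))  # ['BLADE_TOKEN']
```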

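Load and run the optimized model.

Apart from adding one line, import blade.runtime.torch, you do not need to write any extra code to integrate Blade. Your existing inference code requires no other changes.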

import torch
import blade.runtime.torch
# Replace <your_optimized_model_path> with the path of the optimized model.
opt_model_dir = <your_optimized_model_path>
# Replace <your_infer_data> with the data used for inference.
infer_data = <your_infer_data>

model = torch.jit.load(opt_model_dir)
output = model(infer_data)

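Appendix: Set multiple dynamic ranges

If the gap between the upper and lower bounds of the Dynamic Shape is too large, the optimized model may show little speedup. You can split one large range into several smaller ranges, which usually yields a better speedup. For example, set the following Dynamic Shape ranges.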

shapes1 = {
    "min": [[1, 3, 224, 224]],
    "max": [[5, 3, 224, 224]],
    "opts": [
        [[5, 3, 224, 224]],
    ]
}
shapes2 = {
    "min": [[5, 3, 224, 224]],
    "max": [[10, 3, 224, 224]],
    "opts": [
        [[8, 3, 224, 224]],
    ]
}
shapes = [shapes1, shapes2]

You can use this shapes value to configure the optimization config described above. For details, see Step 2: Configure the config used for optimization.
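
When there are many ranges, writing each dictionary by hand becomes repetitive. The sketch below generates a list of ranges from a sequence of batch boundaries for a single-input model. The helper build_batch_ranges is hypothetical. It uses the upper bound of each range as its only opts entry, which is an illustrative choice and differs from the hand-written example above, where the second range uses a batch size of 8.

```python
def build_batch_ranges(boundaries, tail_dims):
    """Build one dynamic range per pair of consecutive batch boundaries.

    Hypothetical helper for a single-input model. Adjacent ranges share a
    boundary value, as in the hand-written example in this appendix.
    """
    if len(boundaries) < 2 or list(boundaries) != sorted(boundaries):
        raise ValueError("boundaries must be ascending with at least two values")
    ranges = []
    for low, high in zip(boundaries, boundaries[1:]):
        ranges.append({
            "min": [[low, *tail_dims]],
            "max": [[high, *tail_dims]],
            # Illustrative choice: tune for the largest batch in the range.
            "opts": [[[high, *tail_dims]]],
        })
    return ranges


shapes = build_batch_ranges([1, 5, 10], [3, 224, 224])
print(shapes[0]["max"])  # [[5, 3, 224, 224]]
```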