To ensure that a model meets the deployment standards before you deploy it in a production environment, you can use the model analysis and optimization commands supported by the cloud-native AI suite to benchmark, analyze, and optimize the model. In this topic, a ResNet18 model provided by PyTorch is used as an example and V100 GPUs are used to accelerate the model.
Prerequisites
A Container Service for Kubernetes (ACK) Pro cluster is created and the Kubernetes version of the cluster is 1.20 or later. The cluster contains at least one GPU-accelerated node. For more information about how to update an ACK cluster, see Update an ACK cluster.
An Object Storage Service (OSS) bucket is created. A persistent volume (PV) and a persistent volume claim (PVC) are created. For more information, see Mount a statically provisioned ossfs 1.0 volume.
The latest version of the Arena client is installed. For more information, see Configure the Arena client.
Background information
Data scientists focus on the accuracy of models, whereas R&D engineers are more concerned about their performance. When neither party understands the other's domain, misunderstandings can easily arise, and a model may fail to meet the performance requirements after it is released as an online service. To prevent this issue, benchmark a model before you release it. If the model does not meet the performance requirements, identify the performance bottlenecks and optimize the model.
Introduction to the model analysis and optimization commands
The cloud-native AI suite supports multiple model analysis and optimization commands. You can run the commands to benchmark models, analyze the network structure, check the duration of each operator, and view the GPU utilization. Then, you can identify the performance bottlenecks of a model and use TensorRT to optimize the model. This helps you release models that meet the performance requirements of a production environment. The following figure shows the model lifecycle assisted by the model analysis and optimization commands.
Model Training: The model is trained based on a given dataset.
Model Benchmark: A benchmark is performed on the model to check whether the latency, throughput, and GPU utilization of the model meet the requirements.
Model Profile: The model is analyzed to identify performance bottlenecks.
Model Optimize: The GPU inference capability of the model is optimized by using tools such as TensorRT.
Model Serving: The model is deployed as an online service.
If the model still does not meet the performance requirements after you optimize the model, you can repeat the preceding phases.
How to run the commands
You can use Arena to submit model analysis, optimization, benchmark, and evaluation jobs to ACK Pro clusters. You can run the arena model analyze --help command to view the help information.
$ arena model analyze --help
submit a model analyze job.
Available Commands:
profile Submit a model profile job.
evaluate Submit a model evaluate job.
optimize Submit a model optimize job.
benchmark Submit a model benchmark job
Usage:
arena model analyze [flags]
arena model analyze [command]
Available Commands:
benchmark Submit a model benchmark job
delete Delete a model job
evaluate Submit a model evaluate job
get Get a model job
list List all the model jobs
optimize Submit a model optimize job, this is a experimental feature
profile Submit a model profile job

Step 1: Prepare a model
We recommend that you use TorchScript to deploy PyTorch models. In this topic, a ResNet18 model provided by PyTorch is used as an example.
Convert the model: convert the ResNet18 model to a TorchScript model and save it.
import torch
import torchvision

model = torchvision.models.resnet18(pretrained=True)

# Switch the model to eval mode
model.eval()

# An example input you would normally provide to your model's forward() method
dummy_input = torch.rand(1, 3, 224, 224)

# Use torch.jit.trace to generate a torch.jit.ScriptModule via tracing
traced_script_module = torch.jit.trace(model, dummy_input)

# Save the TorchScript model
traced_script_module.save("resnet18.pt")

The following table describes the fields of the model configuration file that is used by the subsequent benchmark, analysis, and optimization jobs.

Parameter         Description
model_name        The name of the model.
model_platform    The platform or framework used by the model, such as TorchScript or ONNX.
model_path        The path in which the model is stored.
inputs            The input parameters.
outputs           The output parameters.
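Before you upload the model file in the next substep, you can optionally verify that the saved TorchScript model loads and runs. The following Python sketch is for illustration only and is not part of the Arena workflow; the file name and input shape match the example above.

import torch

# Load the TorchScript model that was saved by torch.jit.trace.
loaded_model = torch.jit.load("resnet18.pt")
loaded_model.eval()

# Run a dummy input with the same shape that is declared in the model configuration file: [1, 3, 224, 224].
dummy_input = torch.rand(1, 3, 224, 224)
with torch.no_grad():
    output = loaded_model(dummy_input)

# ResNet18 returns 1,000 class scores per image.
print(output.shape)  # Expected: torch.Size([1, 1000])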
After the model is converted, upload the model file resnet18.pt to OSS. In this example, the OSS path of the model file is oss://bucketname/models/resnet18/resnet18.pt. For more information, see Upload objects.
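You can upload the model file by using the OSS console, ossutil, or the OSS SDK. The following Python sketch uses the oss2 SDK; the endpoint, AccessKey pair, and bucket name are placeholders that you must replace with your own values.

import oss2

# Replace the placeholders with your own AccessKey pair, OSS endpoint, and bucket name.
auth = oss2.Auth("<your-access-key-id>", "<your-access-key-secret>")
bucket = oss2.Bucket(auth, "https://oss-cn-beijing.aliyuncs.com", "bucketname")

# Upload the local resnet18.pt file to oss://bucketname/models/resnet18/resnet18.pt.
bucket.put_object_from_file("models/resnet18/resnet18.pt", "resnet18.pt")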
Step 2: Perform a benchmark
Before you deploy a model in a production environment, you can perform a benchmark to evaluate the performance of the model. In this step, a benchmark job is submitted by using Arena, and a PVC named oss-pvc in the default namespace of the cluster is used as an example. For more information, see Mount a statically provisioned ossfs 1.0 volume.
Prepare and upload the configuration file of the model.
Create a configuration file for the model. In this example, the configuration file is named config.json.

{
  "model_name": "resnet18",
  "model_platform": "torchscript",
  "model_path": "/data/models/resnet18/resnet18.pt",
  "inputs": [
    {
      "name": "input",
      "data_type": "float32",
      "shape": [1, 3, 224, 224]
    }
  ],
  "outputs": [
    {
      "name": "output",
      "data_type": "float32",
      "shape": [1000]
    }
  ]
}

Upload the configuration file to OSS. The OSS path of the configuration file is oss://bucketname/models/resnet18/config.json.
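If you prefer to generate the configuration file from a script instead of writing the JSON by hand, the following Python sketch writes the same content as the example above to a local config.json file, which you can then upload to OSS as described.

import json

# The values mirror the config.json example above.
config = {
    "model_name": "resnet18",
    "model_platform": "torchscript",
    "model_path": "/data/models/resnet18/resnet18.pt",
    "inputs": [
        {"name": "input", "data_type": "float32", "shape": [1, 3, 224, 224]}
    ],
    "outputs": [
        {"name": "output", "data_type": "float32", "shape": [1000]}
    ],
}

# Write the configuration to a local config.json file.
with open("config.json", "w") as f:
    json.dump(config, f, indent=2)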
Run the following command to submit a benchmark job to the ACK Pro cluster:
arena model analyze benchmark \
    --name=resnet18-benchmark \
    --namespace=default \
    --image=registry.cn-beijing.aliyuncs.com/kube-ai/easy-inference:1.0.2 \
    --gpus=1 \
    --data=oss-pvc:/data \
    --model-config-file=/data/models/resnet18/config.json \
    --report-path=/data/models/resnet18 \
    --concurrency=5 \
    --duration=60

Parameter              Description
--gpus                 The number of GPUs that are used.
--data                 The PVC for the cluster and the path to which the PVC is mounted.
--model-config-file    The path of the configuration file.
--report-path          The path in which the benchmark report is stored.
--concurrency          The number of concurrent requests.
--duration             The duration of the benchmark job. Unit: seconds.

Important: You cannot specify the --requests and --duration parameters at the same time. Specify only one of them when you submit a benchmark job. If you specify both, the system uses the --duration parameter by default. To specify the total number of requests that the benchmark job sends, use the --requests parameter.
Run the following command to query the status of the job:
arena model analyze list -A

Expected output:

NAMESPACE   NAME                 STATUS     TYPE        DURATION   AGE   GPU(Requested)
default     resnet18-benchmark   COMPLETE   Benchmark   0s         2d    1

View the benchmark report. If the STATUS column displays COMPLETE, the benchmark job is completed. Then, you can find a benchmark report named benchmark_result.txt in the path specified by the --report-path parameter. The report contains content similar to the following:

{
    "p90_latency": 7.511,
    "p95_latency": 7.86,
    "p99_latency": 9.34,
    "min_latency": 7.019,
    "max_latency": 12.269,
    "mean_latency": 7.312,
    "median_latency": 7.206,
    "throughput": 136,
    "gpu_mem_used": 1.47,
    "gpu_utilization": 21.280
}

The following table describes the metrics that are included in a benchmark report.
Metric            Description                       Unit
p90_latency       90th percentile response time     Milliseconds
p95_latency       95th percentile response time     Milliseconds
p99_latency       99th percentile response time     Milliseconds
min_latency       Fastest response time             Milliseconds
max_latency       Slowest response time             Milliseconds
mean_latency      Average response time             Milliseconds
median_latency    Median response time              Milliseconds
throughput        Throughput                        Times
gpu_mem_used      GPU memory usage                  GB
gpu_utilization   GPU utilization                   Percentage
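Because the benchmark report is plain JSON, you can check it against your own performance targets in a script before you decide whether to optimize or deploy the model. The following Python sketch is illustrative; the report path matches the --report-path value used above, and the threshold values are assumptions that you should replace with your own requirements.

import json

# Path of the report written by the benchmark job. The directory matches
# the --report-path value that is specified when the job is submitted.
REPORT_PATH = "/data/models/resnet18/benchmark_result.txt"

# Example targets. Replace them with the requirements of your service.
MAX_P99_LATENCY_MS = 10.0
MIN_THROUGHPUT = 100

with open(REPORT_PATH) as f:
    report = json.load(f)

meets_targets = (
    report["p99_latency"] <= MAX_P99_LATENCY_MS
    and report["throughput"] >= MIN_THROUGHPUT
)
print(f"p99 latency: {report['p99_latency']} ms, "
      f"throughput: {report['throughput']}, meets targets: {meets_targets}")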
Step 3: Analyze the model
After you perform a benchmark, you can run the arena model analyze profile command to analyze the model and identify performance bottlenecks.
Run the following command to submit a model analysis job to the ACK Pro cluster:
arena model analyze profile \
    --name=resnet18-profile \
    --namespace=default \
    --image=registry.cn-beijing.aliyuncs.com/kube-ai/easy-inference:1.0.2 \
    --gpus=1 \
    --data=oss-pvc:/data \
    --model-config-file=/data/models/resnet18/config.json \
    --report-path=/data/models/resnet18/log/ \
    --tensorboard \
    --tensorboard-image=registry.cn-beijing.aliyuncs.com/kube-ai/easy-inference:1.0.2

Parameter              Description
--gpus                 The number of GPUs that are used.
--data                 The PVC for the cluster and the path to which the PVC is mounted.
--model-config-file    The path of the configuration file.
--report-path          The path in which the analysis report is stored.
--tensorboard          Specifies whether to view the analysis report in TensorBoard.
--tensorboard-image    The URL of the image that is used to deploy TensorBoard.
Run the following command to query the status of the job:
arena model analyze list -A

Expected output:

NAMESPACE   NAME               STATUS     TYPE      DURATION   AGE   GPU(Requested)
default     resnet18-profile   COMPLETE   Profile   13s        2d    1

Run the following command to query the status of TensorBoard:

kubectl get service -n default

Expected output:

NAME                           TYPE       CLUSTER-IP       EXTERNAL-IP   PORT(S)          AGE
resnet18-profile-tensorboard   NodePort   172.16.158.170   <none>        6006:30582/TCP   2d20h

Run the following command to enable port forwarding and access TensorBoard:

kubectl port-forward svc/resnet18-profile-tensorboard -n default 6006:6006

Expected output:

Forwarding from 127.0.X.X:6006 -> 6006
Forwarding from [::1]:6006 -> 6006

Enter http://localhost:6006 into the address bar of your browser to view the analysis results. In the left-side navigation pane, click Views to view the analysis results based on multiple dimensions and identify performance bottlenecks. You can optimize the model based on the analysis results.
Step 4: Optimize the model
You can use Arena to optimize a model.
Run the following command to submit a model optimization job to the ACK Pro cluster:
arena model analyze optimize \
    --name=resnet18-optimize \
    --namespace=default \
    --image=registry.cn-beijing.aliyuncs.com/kube-ai/easy-inference:1.0.2 \
    --gpus=1 \
    --data=oss-pvc:/data \
    --optimizer=tensorrt \
    --model-config-file=/data/models/resnet18/config.json \
    --export-path=/data/models/resnet18

Parameter              Description
--gpus                 The number of GPUs that are used.
--data                 The PVC for the cluster and the path to which the PVC is mounted.
--optimizer            The optimization method. Valid values: tensorrt (default) and aiacc-torch.
--model-config-file    The path of the configuration file.
--export-path          The path in which the optimized model is stored.
Run the following command to query the status of the job:
arena model analyze list -A

Expected output:

NAMESPACE   NAME                STATUS     TYPE       DURATION   AGE   GPU(Requested)
default     resnet18-optimize   COMPLETE   Optimize   16s        2d    1

View the optimized model. If the STATUS column displays COMPLETE, the optimization job is completed. Then, you can find the optimized model file named opt_resnet18.pt in the path specified by the --export-path parameter.

Change the value of the model_path field in the model configuration file to the path of the optimized model file that you obtained in the preceding step, and perform a benchmark again. For more information about how to perform a benchmark, see Step 2: Perform a benchmark.

The following table describes the metric values before and after the model is optimized.
Metric            Before optimization     After optimization
p90_latency       7.511 milliseconds      5.162 milliseconds
p95_latency       7.86 milliseconds       5.428 milliseconds
p99_latency       9.34 milliseconds       6.64 milliseconds
min_latency       7.019 milliseconds      4.827 milliseconds
max_latency       12.269 milliseconds     8.426 milliseconds
mean_latency      7.312 milliseconds      5.046 milliseconds
median_latency    7.206 milliseconds      4.972 milliseconds
throughput        136 times               198 times
gpu_mem_used      1.47 GB                 1.6 GB
gpu_utilization   21.280%                 10.912%
The statistics show that the latency and throughput of the model are significantly improved and the GPU utilization is reduced after optimization. If the model still does not meet the performance requirements, you can repeat the preceding steps to analyze and optimize the model.
Step 5: Deploy the model
If the model meets the performance requirements, you can deploy the model as an online service. Arena allows you to use NVIDIA Triton Inference Server to deploy TorchScript models. For more information, see Nvidia Triton Server.
Create a configuration file named config.pbtxt.

Important: Do not change the file name.

name: "resnet18"
platform: "pytorch_libtorch"
max_batch_size: 1
default_model_filename: "opt_resnet18.pt"
input [
  {
    name: "input__0"
    format: FORMAT_NCHW
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "output__0",
    data_type: TYPE_FP32,
    dims: [ 1000 ]
  }
]

Note: For more information about the parameters in the configuration file, see Model Repository.
Create the following directory structure in OSS:
oss://bucketname/triton/model-repository/
    resnet18/
        config.pbtxt
        1/
            opt_resnet18.pt

Note: The 1/ directory is a convention of NVIDIA Triton Inference Server. The directory name indicates the version number of the model. A model repository can store different versions of a model. For more information, see Model Repository.

Use Arena to deploy the model. You can deploy a model in GPU exclusive mode or GPU sharing mode.
GPU exclusive mode: You can use this mode to deploy inference services that require high stability. In this mode, each GPU accelerates only one model, and models do not compete for GPU resources. Run the following command to deploy a model in GPU exclusive mode:

arena serve triton \
    --name=resnet18-serving \
    --gpus=1 \
    --replicas=1 \
    --image=nvcr.io/nvidia/tritonserver:21.05-py3 \
    --data=oss-pvc:/data \
    --model-repository=/data/triton/model-repository \
    --allow-metrics=true

GPU sharing mode: You can use this mode to deploy long-tail inference services or inference services that require cost-efficiency. In this mode, a GPU is shared by multiple models, and each model is allowed to use only a specified amount of GPU memory. If you deploy models in GPU sharing mode, you must set the --gpumemory parameter. This parameter specifies the amount of GPU memory that is allocated to each pod. You can specify a proper value based on the gpu_mem_used metric in the benchmark result. For example, if the value of the gpu_mem_used metric is 1.6 GB, you can set the --gpumemory parameter to 2 GB. The value of this parameter must be a positive integer. Run the following command to deploy a model in GPU sharing mode:

arena serve triton \
    --name=resnet18 \
    --gpumemory=2 \
    --replicas=1 \
    --image=nvcr.io/nvidia/tritonserver:21.12-py3 \
    --data=oss-pvc:/data \
    --model-repository=/data/triton/model-repository \
    --allow-metrics=true
Run the following command to query the status of the deployment:
arena serve list -A

Expected output:

NAMESPACE   NAME               TYPE     VERSION        DESIRED   AVAILABLE   ADDRESS          PORTS                    GPU
default     resnet18-serving   Triton   202202141817   1         1           172.16.147.248   RESTFUL:8000,GRPC:8001   1

If the value of AVAILABLE equals the value of DESIRED, the model is deployed.
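After the service is available, you can send a test request to verify that the deployed model returns predictions. The following Python sketch uses the tritonclient package and assumes that port 8000 of the inference service is reachable on localhost, for example through kubectl port-forward; check the actual Service name with kubectl get service before you forward the port. The input and output names match the config.pbtxt example above.

import numpy as np
import tritonclient.http as httpclient

# Assumes that the Triton HTTP port (8000) is reachable on localhost,
# for example through kubectl port-forward.
client = httpclient.InferenceServerClient(url="localhost:8000")

# Build a dummy input that matches config.pbtxt: 1 x 3 x 224 x 224, FP32.
data = np.random.rand(1, 3, 224, 224).astype(np.float32)
infer_input = httpclient.InferInput("input__0", list(data.shape), "FP32")
infer_input.set_data_from_numpy(data)

# Request the output tensor that is declared in config.pbtxt.
infer_output = httpclient.InferRequestedOutput("output__0")

response = client.infer("resnet18", inputs=[infer_input], outputs=[infer_output])
scores = response.as_numpy("output__0")
print(scores.shape)  # Expected: (1, 1000)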