PyTorch is a deep learning framework that can be used to train models. This topic
describes how to use NVIDIA Triton Inference Server or TorchServe to deploy a PyTorch
model as an inference service.
Deployment method
You can use NVIDIA Triton Inference Server or TorchServe to deploy a PyTorch model
as an inference service. We recommend that you use NVIDIA Triton Inference Server.
Method 1: Use NVIDIA Triton Inference Server to deploy a PyTorch model as an inference
service
- Submit a standalone PyTorch training job and convert the trained model to TorchScript. For more information, see Use Arena to submit standalone PyTorch training jobs.
Note In this example, a Bidirectional Encoder Representations from Transformers (BERT) model is trained with PyTorch 1.6. The model is converted to TorchScript and saved in the triton directory of a persistent volume claim (PVC). The model is then deployed by using NVIDIA Triton Inference Server. A conversion sketch is provided after the directory layout below.
The following model directory structure is required by Triton:
└── chnsenticorp                        # The name of the model.
    ├── 1623831335                      # The version of the model.
    │   └── model.savedmodel            # The model file.
    │       ├── saved_model.pb
    │       └── variables
    │           ├── variables.data-00000-of-00001
    │           └── variables.index
    └── config.pbtxt                    # The configuration of Triton.
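The following Python snippet is a minimal sketch of how a trained BERT classifier might be converted to TorchScript before the model file is uploaded to the model repository. The use of the Hugging Face transformers library, the pretrained model name, and the output file name are assumptions for illustration and are not part of the original example.
import torch
from transformers import BertForSequenceClassification, BertTokenizer

# Hypothetical pretrained model; replace with the model produced by your training job.
model_name = "bert-base-chinese"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(
    model_name, num_labels=2, torchscript=True  # torchscript=True makes the outputs traceable
)
model.eval()

# Build an example input with the fixed sequence length used at serving time (128 tokens).
example = tokenizer("An example sentence.", padding="max_length",
                    max_length=128, return_tensors="pt")

# Trace the model and save the TorchScript program.
with torch.no_grad():
    traced = torch.jit.trace(model, example["input_ids"])
traced.save("model.pt")  # hypothetical file name; place it in the model repository that Triton reads
After the file is saved, upload it together with config.pbtxt to the triton directory in your OSS bucket so that Triton can load it from the mounted PVC.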
- Run the following command to query the GPU resources available in the cluster:
arena top node
Expected output:
NAME                      IPADDRESS      ROLE    STATUS  GPU(Total)  GPU(Allocated)
cn-beijing.192.168.0.100  192.168.0.100  <none>  Ready   1           0
cn-beijing.192.168.0.101  192.168.0.101  <none>  Ready   1           0
cn-beijing.192.168.0.99   192.168.0.99   <none>  Ready   1           0
---------------------------------------------------------------------------------------------------
Allocated/Total GPUs of nodes which own resource nvidia.com/gpu In Cluster:
0/3 (0.0%)
The preceding output shows that the cluster has three GPU-accelerated nodes on which
you can deploy the model.
- Upload the model file to your Object Storage Service (OSS) bucket. For more information,
see Upload objects.
- Use the following YAML file to create a persistent volume (PV) and a persistent volume
claim (PVC):
apiVersion: v1
kind: PersistentVolume
metadata:
  name: model-csi-pv
spec:
  capacity:
    storage: 5Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: model-csi-pv  # The value must be the same as the name of the PV.
    volumeAttributes:
      bucket: "Your Bucket"
      url: "Your oss url"
      akId: "Your Access Key Id"
      akSecret: "Your Access Key Secret"
      otherOpts: "-o max_stat_cache_size=0 -o allow_other"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-pvc
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 5Gi
- Run the following command to deploy the model by using NVIDIA Triton Inference Server:
arena serve triton \
  --name=bert-triton \
  --namespace=inference \
  --gpus=1 \
  --replicas=1 \
  --image=nvcr.io/nvidia/tritonserver:20.12-py3 \
  --data=model-pvc:/models \
  --model-repository=/models/triton
Expected output:
configmap/bert-triton-202106251740-triton-serving created
configmap/bert-triton-202106251740-triton-serving labeled
service/bert-triton-202106251740-tritoninferenceserver created
deployment.apps/bert-triton-202106251740-tritoninferenceserver created
INFO[0001] The Job bert-triton has been submitted successfully
INFO[0001] You can run `arena get bert-triton --type triton-serving` to check the job status
- Run the following command to check the deployment progress of the model:
arena serve list -n inference
Expected output:
NAME         TYPE    VERSION       DESIRED  AVAILABLE  ADDRESS       PORTS
bert-triton  Triton  202106251740  1        1          172.16.70.14  RESTFUL:8000,GRPC:8001
- Run the following command to query the details about the inference service:
arena serve get bert-triton -n inference
Expected output:
Name: bert-triton
Namespace: inference
Type: Triton
Version: 202106251740
Desired: 1
Available: 1
Age: 5m
Address: 172.16.70.14
Port: RESTFUL:8000,GRPC:8001
Instances:
NAME                                                             STATUS   AGE  READY  RESTARTS  NODE
----                                                             ------   ---  -----  --------  ----
bert-triton-202106251740-tritoninferenceserver-667cf4c74c-s6nst  Running  5m   1/1    0         cn-beijing.192.168.0.89
The output shows that the model is successfully deployed by using NVIDIA Triton Inference Server. Port 8001 is exposed for the gRPC API and port 8000 is exposed for the RESTful API.
- Configure an Internet-facing Ingress. For more information, see Create an Ingress.
Note By default, the inference service deployed by using NVIDIA Triton Inference Server is accessible only through a cluster IP address and cannot be reached from outside the cluster. You must create an Ingress for the inference service based on the following configurations:
- Set Namespace to inference.
- Set Service Port to 8000. This port is exposed for the RESTful API.
- After you create the Ingress, go to the Ingresses page and find the Ingress. The value in the Rules column contains the address of the Ingress.

- Run the following command to call the inference service by using the address of the Ingress. NVIDIA Triton Inference Server complies with the interface specifications of KFServing. For more information, see NVIDIA Triton Server API. A Python client sketch is provided after the expected output.
curl "http://<Ingress address>"
Expected output:
{
  "name": "chnsenticorp",
  "versions": [
    "1623831335"
  ],
  "platform": "tensorflow_savedmodel",
  "inputs": [
    {
      "name": "input_ids",
      "datatype": "INT64",
      "shape": [
        -1,
        128
      ]
    }
  ],
  "outputs": [
    {
      "name": "probabilities",
      "datatype": "FP32",
      "shape": [
        -1,
        2
      ]
    }
  ]
}
The output shows that the inference service is available, which indicates that it is successfully deployed.
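If you prefer to call the service programmatically instead of with curl, the following Python snippet is a minimal sketch that queries the model metadata and sends an inference request over the RESTful API. The Ingress address and the token IDs are placeholders, and the endpoint paths assume the standard KFServing v2 protocol that Triton exposes.
import requests

# Replace <Ingress address> with the address shown in the Rules column of the Ingress.
base_url = "http://<Ingress address>"
model_name = "chnsenticorp"

# Query the model metadata; the response should match the expected output above.
metadata = requests.get(f"{base_url}/v2/models/{model_name}").json()
print(metadata)

# Send an inference request. The token IDs below are placeholders; in practice they come
# from the same tokenizer that was used during training, padded to a length of 128.
payload = {
    "inputs": [
        {
            "name": "input_ids",
            "shape": [1, 128],
            "datatype": "INT64",
            "data": [101] + [0] * 127,
        }
    ]
}
response = requests.post(f"{base_url}/v2/models/{model_name}/infer", json=payload)
print(response.json()["outputs"])  # contains the "probabilities" tensor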
Method 2: Use TorchServe to deploy a PyTorch model as an inference service
- Use torch-model-archiver to package a PyTorch model into a .mar file. For more information, see torch-model-archiver.
- Upload the model file to your Object Storage Service (OSS) bucket. For more information,
see Upload objects.
- Use the following YAML file to create a persistent volume (PV) and a persistent volume
claim (PVC):
apiVersion: v1
kind: PersistentVolume
metadata:
  name: model-csi-pv
spec:
  capacity:
    storage: 5Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: model-csi-pv  # The value must be the same as the name of the PV.
    volumeAttributes:
      bucket: "Your Bucket"
      url: "Your oss url"
      akId: "Your Access Key Id"
      akSecret: "Your Access Key Secret"
      otherOpts: "-o max_stat_cache_size=0 -o allow_other"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-pvc
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 5Gi
- Run the following command to deploy the PyTorch model:
arena serve custom \
  --name=torchserve-demo \
  --gpus=1 \
  --replicas=1 \
  --image=pytorch/torchserve:0.4.2-gpu \
  --port=8000 \
  --restful-port=8001 \
  --metrics-port=8002 \
  --data=model-pvc:/data \
  'torchserve --start --model-store /data/models --ts-config /data/config/ts.properties'
Note
- For the --image parameter, you can specify an official TorchServe image or a custom TorchServe image.
- You must set the --model-store parameter to the path where the PyTorch model (.mar file) is stored.
A Python client sketch is provided after the expected output below.
Expected output:
service/torchserve-demo-202109101624 created
deployment.apps/torchserve-demo-202109101624-custom-serving created
INFO[0001] The Job torchserve-demo has been submitted successfully
INFO[0001] You can run `arena get torchserve-demo --type custom-serving` to check the job status
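After the service is running, you can send a prediction request to the TorchServe inference API. The following Python snippet is a minimal sketch: the service address, inference port, model name, and input text are assumptions that depend on your ts.properties file, the Service created by Arena, and the handler packaged into the .mar file.
import requests

# Placeholders (assumptions): the inference port must match the inference_address
# configured in /data/config/ts.properties, and the model name must match the name
# used when the .mar file was created with torch-model-archiver.
host = "<service address>"   # cluster IP of the torchserve-demo Service or an Ingress address
port = 8000                  # hypothetical inference port
model_name = "<model name>"

# TorchServe exposes predictions at /predictions/<model name> on the inference port.
url = f"http://{host}:{port}/predictions/{model_name}"
response = requests.post(url, data="An example sentence.".encode("utf-8"))
print(response.status_code, response.text)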
- Perform Step 6 to Step 10 in Method 1: Use NVIDIA Triton Inference Server to deploy a PyTorch model as an inference
service to verify that the inference service is successfully deployed.