PyTorch is a deep learning framework that you can use to train models. This topic describes how to deploy a PyTorch model as an inference service using the NVIDIA Triton inference server or TorchServe.
Prerequisites
- A Kubernetes cluster with GPU nodes is available. For more information, see Add GPU nodes to a cluster.
- Your Kubernetes cluster can access the Internet. For more information, see Allow cluster nodes to access the Internet.
- Arena is installed. For more information, see Configure the Arena client.
Step 1: Deploy the PyTorch model
NVIDIA Triton
This method uses a BERT model trained with PyTorch 1.16. You convert the model to TorchScript, store it in the triton directory of a PVC, and then deploy it using the NVIDIA Triton inference server.
The NVIDIA Triton inference server requires the model repository to have the following directory structure:
└── chnsenticorp                              # The name of the model.
    ├── 1623831335                            # The version of the model.
    │   └── model.savedmodel                  # The model file.
    │       ├── saved_model.pb
    │       └── variables
    │           ├── variables.data-00000-of-00001
    │           └── variables.index
    └── config.pbtxt                          # The Triton configuration.
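Before you upload a model repository, it can help to check locally that it follows this layout. The following is a minimal sketch using only the Python standard library. The function name `check_model_repository` is hypothetical, and the checks reflect only the layout shown above (a `config.pbtxt` file and at least one numeric, non-empty version directory per model); they are an assumption for illustration, not the validation that Triton itself performs.

```python
from pathlib import Path


def check_model_repository(repo_root):
    """Return a list of problems found in a model repository directory.

    Hypothetical helper: it checks only the layout described in this topic.
    """
    problems = []
    root = Path(repo_root)
    model_dirs = sorted(p for p in root.iterdir() if p.is_dir()) if root.is_dir() else []
    if not model_dirs:
        problems.append(f"{root}: no model directories found")

    for model_dir in model_dirs:
        # config.pbtxt sits directly under the model directory.
        if not (model_dir / "config.pbtxt").is_file():
            problems.append(f"{model_dir.name}: missing config.pbtxt")

        # Version directories have numeric names, such as 1623831335.
        versions = sorted(
            p for p in model_dir.iterdir() if p.is_dir() and p.name.isdigit()
        )
        if not versions:
            problems.append(f"{model_dir.name}: no numeric version directory")

        # Each version directory should contain the model file or directory.
        for version in versions:
            if not any(version.iterdir()):
                problems.append(f"{model_dir.name}/{version.name}: version directory is empty")

    return problems
```

An empty list means that the directory matches the layout. Any other result names the model or version directory that needs attention.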
- Complete a standalone PyTorch training job and convert the PyTorch model to TorchScript. For more information, see Use Arena to submit a standalone PyTorch training job.
- Run the following command to query the GPU resources available in the cluster:
      arena top node
  Expected output:
      NAME                      IPADDRESS      ROLE    STATUS  GPU(Total)  GPU(Allocated)
      cn-beijing.192.168.0.100  192.168.0.100  <none>  Ready   1           0
      cn-beijing.192.168.0.101  192.168.0.101  <none>  Ready   1           0
      cn-beijing.192.168.0.99   192.168.0.99   <none>  Ready   1           0
      ---------------------------------------------------------------------------------------------------
      Allocated/Total GPUs of nodes which own resource nvidia.com/gpu In Cluster: 0/3 (0.0%)
  The preceding output shows that the cluster has three GPU-accelerated nodes on which you can deploy the model.
- Upload the model to a bucket in Object Storage Service (OSS).
  Important: The following steps for uploading the model to OSS are for a Linux system. For upload instructions on other operating systems, see Quick start for the ossutil command-line tool.
  - Create a bucket named examplebucket.
    - Run the following command to create examplebucket:
          ossutil64 mb oss://examplebucket
    - The following output indicates that examplebucket is created:
          0.668238(s) elapsed
  - Upload the model to the examplebucket bucket:
        ossutil64 cp model.savedmodel oss://examplebucket
- Create a PV and a PVC.
  - Use the following template to create a pytorch-pv-pvc.yaml file.
        apiVersion: v1
        kind: PersistentVolume
        metadata:
          name: model-csi-pv
        spec:
          capacity:
            storage: 5Gi
          accessModes:
            - ReadWriteMany
          persistentVolumeReclaimPolicy: Retain
          csi:
            driver: ossplugin.csi.alibabacloud.com
            volumeHandle: model-csi-pv # Must be the same as the PV name.
            volumeAttributes:
              bucket: "Your Bucket"
              url: "Your oss url"
              akId: "Your Access Key Id"
              akSecret: "Your Access Key Secret"
              otherOpts: "-o max_stat_cache_size=0 -o allow_other"
        ---
        apiVersion: v1
        kind: PersistentVolumeClaim
        metadata:
          name: model-pvc
          namespace: inference
        spec:
          accessModes:
            - ReadWriteMany
          resources:
            requests:
              storage: 5Gi
    Parameter: Description
    - bucket: The name of the OSS bucket. The name must be globally unique in OSS. For more information, see Bucket naming conventions.
    - url: The URL used to access the object in the bucket. For more information, see Obtain the URLs of one or more objects.
    - akId, akSecret: The AccessKey ID and AccessKey Secret for accessing OSS. We recommend accessing OSS as a RAM user. For more information, see Create an AccessKey pair.
    - otherOpts: Custom mount options for the OSS volume.
      - -o max_stat_cache_size=0 disables metadata caching. The system retrieves the latest metadata from OSS for each file access.
      - -o allow_other allows other users to access the mounted file system.
      For more information about parameter settings, see ossfs mount options.
  - Run the following command to create the PV and PVC:
        kubectl apply -f pytorch-pv-pvc.yaml
- Run the following command to deploy the model using the NVIDIA Triton inference server:
      arena serve triton \
        --name=bert-triton \
        --namespace=inference \
        --gpus=1 \
        --replicas=1 \
        --image=nvcr.io/nvidia/tritonserver:20.12-py3 \
        --data=model-pvc:/models \
        --model-repository=/models/triton
  Expected output:
      configmap/bert-triton-202106251740-triton-serving created
      configmap/bert-triton-202106251740-triton-serving labeled
      service/bert-triton-202106251740-tritoninferenceserver created
      deployment.apps/bert-triton-202106251740-tritoninferenceserver created
      INFO[0001] The Job bert-triton has been submitted successfully
      INFO[0001] You can run `arena get bert-triton --type triton-serving` to check the job status
TorchServe
- Use torch-model-archiver to package the PyTorch model into a .mar file required by TorchServe. For more information, see torch-model-archiver.
- Upload the model to a bucket in Object Storage Service (OSS).
  Important: The following steps for uploading the model to OSS are for a Linux system. For upload instructions on other operating systems, see Quick start for the ossutil command-line tool.
  - Create a bucket named examplebucket.
    - Run the following command to create examplebucket:
          ossutil64 mb oss://examplebucket
    - The following output indicates that examplebucket is created:
          0.668238(s) elapsed
  - Upload the model to the examplebucket bucket:
        ossutil64 cp model.savedmodel oss://examplebucket
- Create a PV and a PVC.
  - Use the following template to create a pytorch-pv-pvc.yaml file.
        apiVersion: v1
        kind: PersistentVolume
        metadata:
          name: model-csi-pv
        spec:
          capacity:
            storage: 5Gi
          accessModes:
            - ReadWriteMany
          persistentVolumeReclaimPolicy: Retain
          csi:
            driver: ossplugin.csi.alibabacloud.com
            volumeHandle: model-csi-pv # Must be the same as the PV name.
            volumeAttributes:
              bucket: "Your Bucket"
              url: "Your oss url"
              akId: "Your Access Key Id"
              akSecret: "Your Access Key Secret"
              otherOpts: "-o max_stat_cache_size=0 -o allow_other"
        ---
        apiVersion: v1
        kind: PersistentVolumeClaim
        metadata:
          name: model-pvc
          namespace: inference
        spec:
          accessModes:
            - ReadWriteMany
          resources:
            requests:
              storage: 5Gi
    Parameter: Description
    - bucket: The name of the OSS bucket. The name must be globally unique in OSS. For more information, see Bucket naming conventions.
    - url: The URL used to access the object in the bucket. For more information, see Obtain the URLs of one or more objects.
    - akId, akSecret: The AccessKey ID and AccessKey Secret for accessing OSS. We recommend accessing OSS as a RAM user. For more information, see Create an AccessKey pair.
    - otherOpts: Custom mount options for the OSS volume.
      - -o max_stat_cache_size=0 disables metadata caching. The system retrieves the latest metadata from OSS for each file access.
      - -o allow_other allows other users to access the mounted file system.
      For more information about parameter settings, see ossfs mount options.
  - Run the following command to create the PV and PVC:
        kubectl apply -f pytorch-pv-pvc.yaml
- Run the following command to deploy the TorchServe service:
      arena serve custom \
        --name=torchserve-demo \
        --gpus=1 \
        --replicas=1 \
        --image=pytorch/torchserve:0.4.2-gpu \
        --port=8000 \
        --restful-port=8001 \
        --metrics-port=8002 \
        --data=model-pvc:/data \
        'torchserve --start --model-store /data/models --ts-config /data/config/ts.properties'
  Note:
  - You can use an official image or a custom TorchServe image.
  - The path for the --model-store parameter must be the directory containing your .mar model archive.
  Expected output:
      service/torchserve-demo-202109101624 created
      deployment.apps/torchserve-demo-202109101624-custom-serving created
      INFO[0001] The Job torchserve-demo has been submitted successfully
      INFO[0001] You can run `arena get torchserve-demo --type custom-serving` to check the job status
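Because the --model-store path must point to the directory that holds the .mar archive, a quick local check of that directory before upload can catch a misplaced file. The following is a minimal sketch. The function name `find_model_archives` and the idea of running it against a local copy of the directory are assumptions for illustration and are not part of Arena or TorchServe.

```python
from pathlib import Path


def find_model_archives(model_store):
    """Return the sorted names of .mar files directly inside model_store.

    Hypothetical helper: it looks only at the top level of the directory,
    because the note above says --model-store must be the directory that
    contains the .mar archive.
    """
    store = Path(model_store)
    if not store.is_dir():
        raise NotADirectoryError(f"model store not found: {store}")
    return sorted(p.name for p in store.glob("*.mar") if p.is_file())
```

An empty result suggests that the archive is missing or is nested in a subdirectory rather than in the model store itself.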
Step 2: Verify the deployment
- Run the following command to check the deployment status of the PyTorch model:
      arena serve list -n inference
  Expected output:
      NAME         TYPE    VERSION       DESIRED  AVAILABLE  ADDRESS       PORTS
      bert-triton  Triton  202106251740  1        1          172.16.70.14  RESTFUL:8000,GRPC:8001
- Run the following command to view the inference service details:
      arena serve get bert-triton -n inference
  Expected output:
      Name:       bert-triton
      Namespace:  inference
      Type:       Triton
      Version:    202106251740
      Desired:    1
      Available:  1
      Age:        5m
      Address:    172.16.70.14
      Port:       RESTFUL:8000,GRPC:8001

      Instances:
        NAME                                                             STATUS   AGE  READY  RESTARTS  NODE
        ----                                                             ------   ---  -----  --------  ----
        bert-triton-202106251740-tritoninferenceserver-667cf4c74c-s6nst  Running  5m   1/1    0         cn-beijing.192.168.0.89
  The output indicates that the model was successfully deployed using the NVIDIA Triton inference server. The service provides two API ports: 8001 (gRPC) and 8000 (REST).
- By default, the NVIDIA Triton inference server deployment exposes the service through a ClusterIP, which is reachable only from within the cluster. To access the service from the Internet, you must configure an Internet-facing Ingress.
  - On the Clusters page, click the name of your cluster. In the left-side navigation pane, open the Ingresses page.
  - From the Namespace drop-down list at the top of the page, select inference.
  - In the upper-right corner of the page, click Create Ingress.
    - Set Service to bert-triton.
    - Set Port to 8000 (REST).
    - Configure other parameters as needed. For more information, see Create and use an NGINX Ingress to expose an application.
- After the Ingress is created, find its address in the Rules column on the Ingresses page.
- Use the Ingress address to invoke the inference service API. The NVIDIA Triton inference server complies with the KFServing API specification. For more information, see the NVIDIA Triton Server API documentation.
      curl "http://<your_ingress_address>"
  Expected output:
      {
        "name": "chnsenticorp",
        "versions": [
          "1623831335"
        ],
        "platform": "tensorflow_savedmodel",
        "inputs": [
          {
            "name": "input_ids",
            "datatype": "INT64",
            "shape": [-1, 128]
          }
        ],
        "outputs": [
          {
            "name": "probabilities",
            "datatype": "FP32",
            "shape": [-1, 2]
          }
        ]
      }
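The metadata above is enough to construct an inference request by hand. The following is a minimal sketch of a client for the KFServing v2 REST protocol that uses only the Python standard library. The helper names, the padding value 0, and the token IDs in the usage note are assumptions for illustration. The sketch also assumes that your Ingress rule forwards the `/v2/...` paths to port 8000 of the service.

```python
import json
import urllib.request

MAX_SEQ_LEN = 128  # Second dimension of input_ids in the model metadata.


def model_url(base_url, model, version=None):
    """Build the v2 protocol URL for a model, optionally pinned to a version."""
    url = f"{base_url.rstrip('/')}/v2/models/{model}"
    if version is not None:
        url += f"/versions/{version}"
    return url


def build_infer_request(token_ids, pad_id=0):
    """Build a v2 inference request body for a single sequence.

    pad_id=0 is an assumption; use the padding ID of your tokenizer.
    """
    if len(token_ids) > MAX_SEQ_LEN:
        raise ValueError(f"at most {MAX_SEQ_LEN} token IDs are supported")
    padded = list(token_ids) + [pad_id] * (MAX_SEQ_LEN - len(token_ids))
    return {
        "inputs": [
            {
                "name": "input_ids",
                "shape": [1, MAX_SEQ_LEN],
                "datatype": "INT64",
                "data": padded,
            }
        ],
        "outputs": [{"name": "probabilities"}],
    }


def infer(base_url, model, token_ids, version=None, timeout=10):
    """POST the request to the model's infer endpoint and return the parsed JSON."""
    body = json.dumps(build_infer_request(token_ids)).encode("utf-8")
    request = urllib.request.Request(
        model_url(base_url, model, version) + "/infer",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

For example, `infer("http://<your_ingress_address>", "chnsenticorp", [101, 2769, 102])` would send one padded sequence. The token IDs here are placeholders; in practice they come from the tokenizer used during training.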