PyTorch is a deep learning framework that can be used to train models. This topic describes how to use NVIDIA Triton Inference Server or TorchServe to deploy a PyTorch model as an inference service.

Prerequisites

Deployment method

You can use NVIDIA Triton Inference Server or TorchServe to deploy a PyTorch model as an inference service. We recommend that you use NVIDIA Triton Inference Server.

Method 1: Use NVIDIA Triton Inference Server to deploy a PyTorch model as an inference service

  1. Submit a standalone PyTorch training job and convert the trained model to TorchScript. For more information, see Use Arena to submit standalone PyTorch training jobs.
    Note In this example, a Bidirectional Encoder Representations from Transformers (BERT) model is trained with PyTorch 1.16. The trained model is converted to TorchScript and saved in the triton directory of a persistent volume claim (PVC). The model is then deployed by using NVIDIA Triton Inference Server. A sketch of the conversion step is shown after the directory structure below.
    The following model directory structure is required by Triton:
    └── chnsenticorp # The name of the model. 
        ├── 1623831335 # The version of the model. 
        │   └── model.savedmodel # The model file. 
        │       ├── saved_model.pb
        │       └── variables
        │           ├── variables.data-00000-of-00001
        │           └── variables.index
        └── config.pbtxt # The configuration of Triton. 
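    The following code is a minimal sketch of how a trained model can be converted to TorchScript and written into the model repository. The stand-in model, the vocabulary size, and the export path /models/triton/chnsenticorp/1623831335 with the file name model.pt are illustrative assumptions; the actual layout and file name must match config.pbtxt and the Triton backend that you use.

    # A minimal sketch, assuming the trained model is available as a torch.nn.Module
    # and the PVC is mounted at /models. Replace the stand-in model with your own.
    import os
    import torch

    class TinyClassifier(torch.nn.Module):
        """Stand-in for the trained BERT classifier; for illustration only."""
        def __init__(self, vocab_size: int = 21128, hidden: int = 128, num_labels: int = 2):
            super().__init__()
            self.embed = torch.nn.EmbeddingBag(vocab_size, hidden)
            self.head = torch.nn.Linear(hidden, num_labels)

        def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
            # input_ids: [batch, 128] token IDs, matching the input declared in config.pbtxt
            return self.head(self.embed(input_ids))

    model = TinyClassifier().eval()

    # Trace the model with a representative input to produce TorchScript.
    example_input = torch.randint(0, 21128, (1, 128))
    scripted = torch.jit.trace(model, example_input)

    # Save the TorchScript file under <repository>/<model-name>/<version>/.
    export_dir = "/models/triton/chnsenticorp/1623831335"
    os.makedirs(export_dir, exist_ok=True)
    scripted.save(os.path.join(export_dir, "model.pt"))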
  2. Run the following command to query the GPU resources available in the cluster:
    arena top node

    Expected output:

    NAME                      IPADDRESS      ROLE    STATUS  GPU(Total)  GPU(Allocated)
    cn-beijing.192.168.0.100  192.168.0.100  <none>  Ready   1           0
    cn-beijing.192.168.0.101  192.168.0.101  <none>  Ready   1           0
    cn-beijing.192.168.0.99   192.168.0.99   <none>  Ready   1           0
    ---------------------------------------------------------------------------------------------------
    Allocated/Total GPUs of nodes which own resource nvidia.com/gpu In Cluster:
    0/3 (0.0%)

    The preceding output shows that the cluster has three GPU-accelerated nodes on which you can deploy the model.

  3. Upload the model file to your Object Storage Service (OSS) bucket. For more information, see Upload objects.
  4. Use the following YAML file to create a persistent volume (PV) and a persistent volume claim (PVC):
    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: model-csi-pv
    spec:
      capacity:
        storage: 5Gi
      accessModes:
        - ReadWriteMany
      persistentVolumeReclaimPolicy: Retain
      csi:
        driver: ossplugin.csi.alibabacloud.com
        volumeHandle: model-csi-pv   # The value must be the same as the name of the PV. 
        volumeAttributes:
          bucket: "Your Bucket"
          url: "Your oss url"
          akId: "Your Access Key Id"
          akSecret: "Your Access Key Secret"
          otherOpts: "-o max_stat_cache_size=0 -o allow_other"
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: model-pvc
    spec:
      accessModes:
      - ReadWriteMany
      resources:
        requests:
          storage: 5Gi
  5. Run the following command to deploy the model by using NVIDIA Triton Inference Server:
    arena serve triton \
     --name=bert-triton \
     --namespace=inference \
     --gpus=1 \
     --replicas=1 \
     --image=nvcr.io/nvidia/tritonserver:20.12-py3 \
     --data=model-pvc:/models \
     --model-repository=/models/triton

    Expected output:

    configmap/bert-triton-202106251740-triton-serving created
    configmap/bert-triton-202106251740-triton-serving labeled
    service/bert-triton-202106251740-tritoninferenceserver created
    deployment.apps/bert-triton-202106251740-tritoninferenceserver created
    INFO[0001] The Job bert-triton has been submitted successfully
    INFO[0001] You can run `arena get bert-triton --type triton-serving` to check the job status
  6. Run the following command to check the deployment progress of the model:
    arena serve list -n inference

    Expected output:

    NAME            TYPE        VERSION       DESIRED  AVAILABLE  ADDRESS        PORTS
    bert-triton     Triton      202106251740  1        1          172.16.70.14   RESTFUL:8000,GRPC:8001
  7. Run the following command to query the details about the inference service:
    arena serve get bert-triton -n inference

    Expected output:

    Name:       bert-triton
    Namespace:  inference
    Type:       Triton
    Version:    202106251740
    Desired:    1
    Available:  1
    Age:        5m
    Address:    172.16.70.14
    Port:       RESTFUL:8000,GRPC:8001
    
    
    Instances:
      NAME                                                             STATUS   AGE  READY  RESTARTS  NODE
      ----                                                             ------   ---  -----  --------  ----
      bert-triton-202106251740-tritoninferenceserver-667cf4c74c-s6nst  Running  5m   1/1    0         cn-beijing.192.168.0.89

    The output shows that the model is successfully deployed by using NVIDIA Triton Inference Server. Port 8001 is exposed for the gRPC API and port 8000 is exposed for the RESTful API.

  8. Configure an Internet-facing Ingress. For more information, see Create an Ingress.
    Note By default, the inference service deployed by using NVIDIA Triton Inference Server is exposed through a cluster IP address that cannot be accessed from outside the cluster. You must create an Ingress for the inference service based on the following configurations:
    • Set Namespace to inference.
    • Set Service Port to 8000. This port is exposed for the RESTful API.
  9. After you create the Ingress, go to the Ingresses page and find the Ingress. The value in the Rules column contains the address of the Ingress.
  10. Run the following command to call the inference service by using the address of the Ingress. NVIDIA Triton Inference Server complies with the KFServing inference protocol. For more information, see NVIDIA Triton Server API.
    curl "http://<Ingress address>/v2/models/chnsenticorp"

    Expected output:

    {
        "name":"chnsenticorp",
        "versions":[
            "1623831335"
        ],
        "platform":"tensorflow_savedmodel",
        "inputs":[
            {
                "name":"input_ids",
                "datatype":"INT64",
                "shape":[
                    -1,
                    128
                ]
            }
        ],
        "outputs":[
            {
                "name":"probabilities",
                "datatype":"FP32",
                "shape":[
                    -1,
                    2
                ]
            }
        ]
    }

    The output shows that the model metadata is returned, which indicates that the inference service is successfully deployed.
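    The same calls can be made from Python through the KServe v2 REST protocol that Triton implements. In the following sketch, the Ingress address is a placeholder and the token IDs are dummy values; in a real request they come from the BERT tokenizer that was used during training.

    # A minimal sketch of querying the inference service through the KServe v2 REST API.
    import requests

    BASE = "http://<Ingress address>"   # replace with the address from the Ingress rules
    MODEL = "chnsenticorp"

    # Model metadata, equivalent to the curl call above.
    print(requests.get(f"{BASE}/v2/models/{MODEL}").json())

    # Inference request: one sequence of 128 token IDs, matching the declared
    # input "input_ids" with shape [-1, 128] and datatype INT64.
    payload = {
        "inputs": [
            {
                "name": "input_ids",
                "datatype": "INT64",
                "shape": [1, 128],
                "data": [0] * 128,   # dummy token IDs for illustration
            }
        ]
    }
    response = requests.post(f"{BASE}/v2/models/{MODEL}/infer", json=payload)
    print(response.json())   # expected to contain the "probabilities" output with shape [1, 2]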

Method 2: Use TorchServe to deploy a PyTorch model as an inference service

  1. Use torch-model-archiver to package a PyTorch model into a .mar file. For more information, see torch-model-archiver.
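    The packaging step is a command-line call; the following sketch simply drives it from Python. The model name, file names, handler, and export path are illustrative assumptions.

    # A minimal sketch that invokes the torch-model-archiver CLI
    # (installed with `pip install torch-model-archiver`) to build a .mar file.
    import subprocess

    subprocess.run(
        [
            "torch-model-archiver",
            "--model-name", "bert",              # name used later in /predictions/<name>
            "--version", "1.0",
            "--serialized-file", "model.pt",     # TorchScript file exported from training
            "--handler", "text_classifier",      # built-in handler, or the path to a custom handler.py
            "--export-path", "./model-store",    # directory that receives bert.mar
        ],
        check=True,
    )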
  2. Upload the model file to your Object Storage Service (OSS) bucket. For more information, see Upload objects.
  3. Use the following YAML file to create a persistent volume (PV) and a persistent volume claim (PVC):
    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: model-csi-pv
    spec:
      capacity:
        storage: 5Gi
      accessModes:
        - ReadWriteMany
      persistentVolumeReclaimPolicy: Retain
      csi:
        driver: ossplugin.csi.alibabacloud.com
        volumeHandle: model-csi-pv   # The value must be the same as the name of the PV. 
        volumeAttributes:
          bucket: "Your Bucket"
          url: "Your oss url"
          akId: "Your Access Key Id"
          akSecret: "Your Access Key Secret"
          otherOpts: "-o max_stat_cache_size=0 -o allow_other"
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: model-pvc
    spec:
      accessModes:
      - ReadWriteMany
      resources:
        requests:
          storage: 5Gi
  4. Run the following command to deploy the PyTorch model:
    arena serve custom \
      --name=torchserve-demo \
      --gpus=1 \
      --replicas=1 \
      --image=pytorch/torchserve:0.4.2-gpu \
      --port=8000 \
      --restful-port=8001 \
      --metrics-port=8002 \
      --data=model-pvc:/data \
      'torchserve --start --model-store /data/models --ts-config /data/config/ts.properties'
    Note
    • For the --image parameter, you can specify the official TorchServe image or a custom TorchServe image.
    • You must set the --model-store flag in the torchserve command to the path where the packaged model (.mar file) is stored.

    Expected output:

    service/torchserve-demo-202109101624 created
    deployment.apps/torchserve-demo-202109101624-custom-serving created
    INFO[0001] The Job torchserve-demo has been submitted successfully
    INFO[0001] You can run `arena get torchserve-demo --type custom-serving` to check the job status
  5. Perform Step 6 to Step 10 in Method 1: Use NVIDIA Triton Inference Server to deploy a PyTorch model as an inference service to verify that the inference service is successfully deployed.
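    As with Method 1, after an address is available you can also check the service from Python through TorchServe's inference API. In the following sketch, the service address, port, and model name bert are placeholders; the actual inference port depends on the ts.properties file that the torchserve command loads.

    # A minimal sketch of verifying the TorchServe deployment.
    import requests

    BASE = "http://<service address>:8000"   # use the Ingress or cluster IP and the configured inference port

    # Health check: TorchServe returns {"status": "Healthy"} when it is ready.
    print(requests.get(f"{BASE}/ping").json())

    # Prediction request against the registered model; the payload format depends on the handler.
    response = requests.post(f"{BASE}/predictions/bert", data="This is a test sentence.")
    print(response.status_code, response.text)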