
Container Service for Kubernetes:Deploy a TensorFlow model as an inference service

Last Updated: Mar 26, 2026

After training a TensorFlow model, you need to serve it as a network-accessible API for applications to call. This guide shows you how to use Arena to deploy a TensorFlow SavedModel as a TensorFlow Serving inference service on an ACK cluster, covering model upload to OSS, persistent storage configuration, serving instance launch, and external access through an Ingress.

Prerequisites

Before you begin, ensure that you have an ACK cluster with GPU-accelerated nodes, the Arena client installed, and access to Object Storage Service (OSS).

This guide uses a BERT model trained with TensorFlow 1.15 and exported as a SavedModel.

Prepare model storage

Step 1: Check available GPU resources

Run the following command to verify GPU availability in the cluster:

arena top node

The output lists all GPU nodes and their allocation status:

NAME                      IPADDRESS      ROLE    STATUS  GPU(Total)  GPU(Allocated)
cn-beijing.192.168.0.100  192.168.0.100  <none>  Ready   1           0
cn-beijing.192.168.0.101  192.168.0.101  <none>  Ready   1           0
cn-beijing.192.168.0.99   192.168.0.99   <none>  Ready   1           0
---------------------------------------------------------------------------------------------------
Allocated/Total GPUs of nodes which own resource nvidia.com/gpu In Cluster:
0/3 (0.0%)

The cluster has three GPU nodes, each with one unallocated GPU.

Step 2: Upload the model to OSS

Important

The following steps use ossutil on Linux. For other operating systems, see ossutil.

  1. Install ossutil.

  2. Create a bucket named examplebucket:

    ossutil64 mb oss://examplebucket

    The following output confirms the bucket is created:

    0.668238(s) elapsed
  3. Upload the SavedModel to the bucket. A SavedModel is a directory, so copy it recursively:

    ossutil64 cp -r model.savedmodel oss://examplebucket

Step 3: Create a persistent volume and persistent volume claim

To mount the OSS bucket as a volume inside the serving container, create a persistent volume (PV) and a persistent volume claim (PVC).

  1. Create a file named Tensorflow.yaml with the following content:

    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: model-csi-pv
    spec:
      capacity:
        storage: 5Gi
      accessModes:
        - ReadWriteMany
      persistentVolumeReclaimPolicy: Retain
      csi:
        driver: ossplugin.csi.alibabacloud.com
        volumeHandle: model-csi-pv   # Must match the PV name above
        volumeAttributes:
          bucket: "Your Bucket"
          url: "Your oss url"
          akId: "Your Access Key Id"
          akSecret: "Your Access Key Secret"
          otherOpts: "-o max_stat_cache_size=0 -o allow_other"
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: model-pvc
    spec:
      accessModes:
      - ReadWriteMany
      resources:
        requests:
          storage: 5Gi

    Replace the following parameters:

      • bucket: The name of the OSS bucket. For naming rules, see Bucket naming conventions.

      • url: The endpoint URL for the bucket. For instructions on getting the URL, see Obtain the URL of a single object or the URLs of multiple objects.

      • akId: The AccessKey ID used to access the OSS bucket. Use a Resource Access Management (RAM) user's AccessKey. For details, see Create an AccessKey pair.

      • akSecret: The AccessKey secret paired with the AccessKey ID above.

      • otherOpts: Custom mount options for the OSS bucket. -o max_stat_cache_size=0 disables metadata caching so the system always reads the latest object metadata from OSS. -o allow_other lets other users on the node access the mounted bucket. For additional options, see Custom parameters supported by ossfs.
  2. Apply the manifest to create the PV and PVC:

    kubectl apply -f Tensorflow.yaml
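After applying the manifest, it is worth confirming that the claim has bound to the volume before launching the serving job. The following sketch writes a small check script using the resource names from the manifest above; run it from a machine where kubectl points at the cluster:

```shell
# Sketch: generate a check script for the PV/PVC created above.
cat > check-storage.sh <<'EOF'
#!/bin/sh
# Both resources should report STATUS "Bound" once the PVC matches the PV.
kubectl get pv model-csi-pv
kubectl get pvc model-pvc
EOF
chmod +x check-storage.sh
```

If the PVC stays in Pending, re-check the bucket name, endpoint URL, and AccessKey pair in Tensorflow.yaml.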

Deploy the inference service

Step 4: Launch TensorFlow Serving

Run the following command to deploy a TensorFlow Serving instance named bert-tfserving:

arena serve tensorflow \
  --name=bert-tfserving \
  --model-name=chnsenticorp \
  --gpus=1 \
  --image=tensorflow/serving:1.15.0-gpu \
  --data=model-pvc:/models \
  --model-path=/models/tensorflow \
  --version-policy=specific:1623831335

Parameter descriptions:

  • --name: The name of the serving job.

  • --model-name: The model name TensorFlow Serving uses to identify the model in API requests.

  • --gpus: The number of GPUs to allocate to the serving instance.

  • --image: The TensorFlow Serving container image. Must match the TensorFlow version used during training.

  • --data: Mounts the PVC into the container. Format: <pvc-name>:<mount-path>.

  • --model-path: The path inside the container where the model is stored.

  • --version-policy: The model version to load. specific:<version> pins serving to a single version.

The following output confirms the job is submitted:

configmap/bert-tfserving-202106251556-tf-serving created
configmap/bert-tfserving-202106251556-tf-serving labeled
configmap/bert-tfserving-202106251556-tensorflow-serving-cm created
service/bert-tfserving-202106251556-tensorflow-serving created
deployment.apps/bert-tfserving-202106251556-tensorflow-serving created
INFO[0003] The Job bert-tfserving has been submitted successfully
INFO[0003] You can run `arena get bert-tfserving --type tf-serving` to check the job status

Step 5: Verify the service is running

List all running inference services:

arena serve list

The output shows the bert-tfserving service with its address and ports:

NAME            TYPE        VERSION       DESIRED  AVAILABLE  ADDRESS        PORTS
bert-tfserving  Tensorflow  202106251556  1        1          172.16.95.171  GRPC:8500,RESTFUL:8501

Get the full details of the service:

arena serve get bert-tfserving
Name:       bert-tfserving
Namespace:  inference
Type:       Tensorflow
Version:    202106251556
Desired:    1
Available:  1
Age:        4m
Address:    172.16.95.171
Port:       GRPC:8500,RESTFUL:8501


Instances:
  NAME                                                             STATUS   AGE  READY  RESTARTS  NODE
  ----                                                             ------   ---  -----  --------  ----
  bert-tfserving-202106251556-tensorflow-serving-8554d58d67-jd2z9  Running  4m   1/1    0         cn-beijing.192.168.0.88

The service is deployed in the inference namespace. Port 8500 serves gRPC requests and port 8501 serves HTTP/RESTful requests.
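Before exposing the service externally, you can reach the cluster IP from a local machine with kubectl port-forward. The sketch below writes a helper script; the service name is taken from the deployment output in Step 4 (the timestamp suffix will differ in your cluster), and the status path follows the TensorFlow Serving REST API:

```shell
# Sketch: generate a script that forwards the REST port and queries
# the model status endpoint.
cat > test-serving.sh <<'EOF'
#!/bin/sh
# Service name from the Step 4 output; 8501 is the RESTful port.
kubectl -n inference port-forward \
  svc/bert-tfserving-202106251556-tensorflow-serving 8501:8501 &
PF_PID=$!
sleep 2
# TensorFlow Serving's model status endpoint: /v1/models/<model-name>
curl -s "http://127.0.0.1:8501/v1/models/chnsenticorp"
kill "$PF_PID"
EOF
chmod +x test-serving.sh
```

Run the script from a machine with kubectl access to the cluster; a healthy service returns the same model_version_status response shown in Step 7 below.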

Access the service externally

arena serve tensorflow assigns a cluster IP by default, which is only reachable from within the cluster. Create an Ingress to expose the service for external access.

Step 6: Create an Ingress

  1. In the ACK console, go to the Clusters page, click the target cluster, and navigate to Network > Ingresses in the left-side navigation pane.

  2. From the Namespace drop-down list at the top of the page, select the inference namespace (the same namespace shown in the service details above).

  3. Click Create Ingress in the upper-right corner. For a full description of Ingress parameters, see Create an NGINX Ingress. Use the following settings:

    • Name: Tensorflow

    • Rules:

      • Domain name: Enter a custom domain, for example test.example.com

      • Path: /

      • Rule: ImplementationSpecific (default)

      • Service name: The service name returned by kubectl get service -n inference (in this example, bert-tfserving-202106251556-tensorflow-serving)

      • Port: 8501

  4. After the Ingress is created, return to the Ingresses page. The Rules column shows the Ingress address.
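The console steps above can also be expressed as a manifest. The following is a minimal sketch: it assumes the service name from the Step 4 output (the timestamp suffix will differ in your cluster) and the test.example.com example domain. Apply it with kubectl apply -n inference.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: tensorflow
spec:
  rules:
  - host: test.example.com          # the custom domain from the rule above
    http:
      paths:
      - path: /
        pathType: ImplementationSpecific
        backend:
          service:
            # Service name from the Step 4 output; the timestamp suffix
            # (202106251556) will differ in your cluster.
            name: bert-tfserving-202106251556-tensorflow-serving
            port:
              number: 8501          # the RESTful port
```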


Step 7: Call the inference API

Run the following command to query the model status endpoint of the inference service, replacing <Ingress address> with the address shown in the Rules column. If the domain configured in the Ingress rule does not resolve through DNS, pass it in a Host header so the request matches the rule. For more information about TensorFlow Serving, see TensorFlow Serving API.

curl -H "Host: test.example.com" "http://<Ingress address>/v1/models/chnsenticorp"

A successful response looks like this:

{
 "model_version_status": [
  {
   "version": "1623831335",
   "state": "AVAILABLE",
   "status": {
    "error_code": "OK",
    "error_message": ""
   }
  }
 ]
}

The "state": "AVAILABLE" status confirms the model is loaded and ready to serve inference requests.
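Beyond the status check, predictions go through the :predict endpoint of the TensorFlow Serving REST API. The sketch below only constructs the request URLs from the values used in this guide; the request body depends on the model's serving signature, so the commented curl call uses a placeholder body, not the real BERT input format:

```shell
# Values taken from the steps above.
INGRESS_HOST="test.example.com"
MODEL_NAME="chnsenticorp"

# Model status (the call made in Step 7) and prediction endpoints:
STATUS_URL="http://${INGRESS_HOST}/v1/models/${MODEL_NAME}"
PREDICT_URL="http://${INGRESS_HOST}/v1/models/${MODEL_NAME}:predict"
echo "$STATUS_URL"
echo "$PREDICT_URL"

# The body below is a placeholder; the actual feature names depend on
# the model's serving signature (inspect it with saved_model_cli).
# curl -d '{"instances": [{"text": "..."}]}' "$PREDICT_URL"
```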