Elastic inference based on ECI - Container Service for Kubernetes

After a model is trained, it is usually deployed as an inference service. The number of calls to an inference service changes dynamically with business demand, so elastic scaling is required to handle varying loads and reduce costs. Conventional deployment solutions cannot meet the requirements of large-scale, highly concurrent systems. Alibaba Cloud allows you to run workloads on Elastic Container Instance (ECI) to enable elastic scaling of inference services. This topic describes how to run elastic inference workloads on ECI.

Prerequisites

A model is ready for deployment. In this topic, a BERT model trained with TensorFlow 1.15 is used.
The ack-virtual-node, ack-alibaba-cloud-metrics-adapter, and arena components are installed. For more information about how to manage the components, see Manage components. For more information about ack-virtual-node, see Connection overview.

Procedure

Upload the trained model to an Object Storage Service (OSS) bucket. For more information, see Upload objects.
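The object keys that you choose during the upload determine where the model appears inside the container. The following Python sketch derives the keys from the values used in the deployment command later in this topic (`--data=model-pvc:/data`, `--model-path=/data/models/tensorflow/chnsenticorp`, and version `1623831335`). It assumes that the bucket root is mounted at `/data` and that the model is a standard TensorFlow SavedModel stored in a numeric version directory. The function name `oss_object_key` is hypothetical and is not part of any SDK.

```python
import posixpath


def oss_object_key(model_path, mount_path, version, relative_file):
    """Return the OSS object key for one model file.

    Assumes the bucket root is mounted at mount_path, so the key is the
    model path relative to the mount point, followed by the version
    directory that TensorFlow Serving expects, followed by the file.
    """
    relative_model_dir = posixpath.relpath(model_path, mount_path)
    return posixpath.join(relative_model_dir, str(version), relative_file)


# Files that a TensorFlow SavedModel directory typically contains.
SAVED_MODEL_FILES = [
    "saved_model.pb",
    "variables/variables.index",
    "variables/variables.data-00000-of-00001",
]

keys = [
    oss_object_key("/data/models/tensorflow/chnsenticorp", "/data", 1623831335, f)
    for f in SAVED_MODEL_FILES
]
for key in keys:
    print(key)
```

If your model files use different names or your PV mounts a subdirectory of the bucket, adjust the inputs accordingly.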

Create a persistent volume (PV) and a persistent volume claim (PVC).

Create a file named pvc.yaml and copy the following code into the file:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: model-csi-pv
spec:
  capacity:
    storage: 5Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: model-csi-pv # The value must be the same as the name of the PV. 
    volumeAttributes:
      bucket: "Your Bucket"
      url: "Your oss url"
      akId: "Your Access Key Id"
      akSecret: "Your Access Key Secret"
      otherOpts: "-o max_stat_cache_size=0 -o allow_other"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-pvc
spec:
  accessModes:
  - ReadWriteMany
  volumeName: model-csi-pv
  storageClassName: ""
  resources:
    requests:
      storage: 5Gi

Parameter	Description
bucket	The name of the OSS bucket, which is globally unique in OSS. For more information, see Bucket naming conventions.
url	The URL that is used to access an object in the bucket. For more information, see Obtain the URL of a single object or the URLs of multiple objects.
akId, akSecret	The AccessKey ID and AccessKey secret that are used to access the OSS bucket. We recommend that you access the OSS bucket as a Resource Access Management (RAM) user. For more information, see Create an AccessKey pair.
otherOpts	Custom parameters for mounting the OSS bucket. Set `-o max_stat_cache_size=0` to disable metadata caching. If this feature is disabled, the system retrieves the latest metadata from OSS each time it attempts to access objects in OSS. Set `-o allow_other` to allow other users to access the OSS bucket that you mounted. For more information about other parameters, see Custom parameters supported by ossfs.

Run the following command to create the PV and PVC:
```
kubectl apply -f pvc.yaml
```

Deploy an inference service.

You can use annotations to specify the resource type that you want to request. The following table describes the annotations.

Parameter	Description
alibabacloud.com/burst-resource	Valid values: empty (default): only existing Elastic Compute Service (ECS) resources in the cluster are used. eci: ECIs are used when the ECS resources in the cluster are insufficient. eci_only: only ECIs are used, and the ECS resources in the cluster are not used.
k8s.aliyun.com/eci-use-specs	To use GPU resources on ECI, you must use this annotation to specify the GPU-accelerated instance type.

Run the following command to deploy the inference service:

arena serve tensorflow \
   --namespace=default \
   --name=bert-tfserving \
   --model-name=chnsenticorp  \
   --gpus=1  \
   --image=tensorflow/serving:1.15.0-gpu \
   --data=model-pvc:/data \
   --model-path=/data/models/tensorflow/chnsenticorp \
   --version-policy=specific:1623831335 \
   --annotation=alibabacloud.com/burst-resource=eci_only \
   --annotation=k8s.aliyun.com/eci-use-specs=ecs.gn6i-c4g1.xlarge

Run the following command to query the status of the inference service:

arena serve list

Expected output:

NAME            TYPE        VERSION       DESIRED  AVAILABLE  ADDRESS        PORTS                   GPU
bert-tfserving  Tensorflow  202207181536  1        1          172.16.52.170  GRPC:8500,RESTFUL:8501  1

Run the following command to query the status of the pods:

kubectl get pods -o wide

Expected output:

NAME                                                              READY   STATUS    RESTARTS   AGE    IP              NODE                           NOMINATED NODE   READINESS GATES
bert-tfserving-202207181536-tensorflow-serving-547797c546-djh58   1/1     Running   0          114s   192.168.0.246   virtual-kubelet-cn-beijing-h   <none>           <none>

The NODE column in the output shows virtual-kubelet-cn-beijing-h, which is a virtual node. This indicates that the pod is deployed on an elastic container instance.
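If you want to run this check in a script instead of reading the NODE column, a minimal Python sketch is shown below. It parses the JSON printed by `kubectl get pods -o json` and reports whether each pod is scheduled to a node whose name starts with `virtual-kubelet`. The prefix is an assumption taken from the sample output above, and the function name `pods_on_virtual_nodes` is hypothetical.

```python
import json


def pods_on_virtual_nodes(pod_list_json, node_prefix="virtual-kubelet"):
    """Map each pod name to True if its node name starts with node_prefix.

    pod_list_json is the text printed by `kubectl get pods -o json`.
    Pods that are not scheduled yet have no nodeName and map to False.
    """
    pod_list = json.loads(pod_list_json)
    result = {}
    for pod in pod_list.get("items", []):
        node_name = pod.get("spec", {}).get("nodeName", "")
        result[pod["metadata"]["name"]] = node_name.startswith(node_prefix)
    return result


# Example input that mirrors the structure of the kubectl JSON output.
sample = json.dumps({
    "items": [
        {
            "metadata": {"name": "bert-tfserving-202207181536-tensorflow-serving-547797c546-djh58"},
            "spec": {"nodeName": "virtual-kubelet-cn-beijing-h"},
        }
    ]
})
print(pods_on_virtual_nodes(sample))
```

To use the sketch against a live cluster, pass in the standard output of `kubectl get pods -o json`, for example by capturing it with `subprocess.run`.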

Configure a Horizontal Pod Autoscaler (HPA). The HPA can automatically adjust the number of replicated pods in a Kubernetes cluster based on workloads.

Run the following command to query the Deployment of the inference service:

kubectl get deployment

Expected output:

NAME                                             READY   UP-TO-DATE   AVAILABLE   AGE
bert-tfserving-202207181536-tensorflow-serving   1/1     1            1           2m18s

Run the following command to query the Service:

kubectl get service

Expected output:

NAME                                             TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)             AGE
bert-tfserving-202207181536-tensorflow-serving   ClusterIP   172.16.52.170   <none>        8500/TCP,8501/TCP   2m45s

Create a file named bert-tfserving-eci-hpa.yaml and copy the following code into the file:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: bert-tfserving-eci-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: bert-tfserving-202207181536-tensorflow-serving
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: External
    external:
      metric:
        name: sls_ingress_qps
        selector:
          matchLabels:
            sls.project: "k8s-log-{cluster id}"
            sls.logstore: "nginx-ingress"
            sls.ingress.route: "default-bert-tfserving-202207181536-tensorflow-serving-8501"
      target:
        type: AverageValue
        averageValue: "10"

The following table describes the parameters.

Parameter	Description
`scaleTargetRef`	Specifies the object to which the HPA is bound. In this example, the value is set to the name of the Deployment of the inference service, as returned by the `kubectl get deployment` command.
`minReplicas`	The minimum number of replicated pods.
`maxReplicas`	The maximum number of replicated pods.
`sls.project`	The name of the Simple Log Service (SLS) project used by the cluster. The value of the parameter must be in the format of `k8s-log-{cluster id}`. Replace `{cluster id}` with your actual cluster ID.
`sls.logstore`	The name of the Simple Log Service Logstore. The default value is `nginx-ingress`.
`sls.ingress.route`	The Ingress route that is used to expose the service. The value must be in the format of `{namespace}-{service name}-{service port}`. In this example, the value is set to `default-bert-tfserving-202207181536-tensorflow-serving-8501`.
`metric.name`	The name of the metric. In this example, the value is set to `sls_ingress_qps`.
`target.averageValue`	The queries per second (QPS) threshold that triggers scale-out activities. In this example, the value is set to `10`. Because the target type is `AverageValue`, the HPA divides the metric value by the current number of pods, so a scale-out activity is triggered when the average QPS per pod is greater than 10.
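To illustrate how an `AverageValue` target of 10 translates into a replica count, the following Python sketch approximates the HPA calculation for an external metric. It assumes the default tolerance of 10% and ignores stabilization windows, scaling policies, and pods that are not ready, so treat it as an estimate rather than the exact controller behavior. The function name `desired_replicas` is hypothetical.

```python
import math


def desired_replicas(total_qps, target_avg_qps, current_replicas,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Estimate the replica count that the HPA would request.

    total_qps is the value reported for the external metric.
    target_avg_qps corresponds to target.averageValue in the HPA.
    current_replicas must be at least 1.
    """
    usage_ratio = total_qps / (target_avg_qps * current_replicas)
    if abs(1.0 - usage_ratio) <= tolerance:
        # Within tolerance: keep the current number of replicas.
        desired = current_replicas
    else:
        desired = math.ceil(total_qps / target_avg_qps)
    # The result is always clamped to the minReplicas..maxReplicas range.
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(35, 10, current_replicas=1))   # scale out
print(desired_replicas(250, 10, current_replicas=4))  # capped by maxReplicas
print(desired_replicas(0, 10, current_replicas=3))    # scale in to minReplicas
```

With the values in this example, a total of 35 QPS on one pod leads to four pods, and any load above 100 QPS is capped at the `maxReplicas` value of 10.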

Run the following command to query the status of the HPA:

kubectl get hpa

Expected output:

NAME                     REFERENCE                                                   TARGETS      MINPODS   MAXPODS   REPLICAS   AGE
bert-tfserving-eci-hpa   Deployment/bert-tfserving-202207181536-tensorflow-serving   0/10 (avg)   1         10        1          116s

Configure an Internet-facing Ingress.
By default, the inference service that is deployed by running the arena serve tensorflow command is assigned only a cluster IP address. The service cannot be accessed over the Internet. Therefore, you must create an Internet-facing Ingress for the inference service.
1. Log on to the ACK console. In the left-side navigation pane, click Clusters.
2. On the Clusters page, find the cluster that you want to manage and click its name. In the left-side navigation pane, choose Network > Ingresses.
3. On the Ingresses page, select the namespace where the inference service resides from the Namespace drop-down list, and click Create Ingress. Set the parameters that are described in the following list. For more information about the parameters, see Create an NGINX Ingress.
  - Name: In this example, the value is set to bert-tfserving.
  - Rules:
    - Domain Name: Enter a custom domain name, such as test.example.com.
    - Mappings:
      - Path: The root path / is used in this example.
      - Rule: The default rule (ImplementationSpecific) is used in this example.
      - Service Name: In this example, the value is set to bert-tfserving-202207181536-tensorflow-serving, which is the Service name returned by the kubectl get service command.
      - Port: Port 8501 is used in this example.
4. On the Ingresses page, check the address of the Ingress in the Rules column.
Use the Ingress address that you obtained in the preceding step to perform stress tests on the inference service. If the QPS value is greater than the averageValue configured in the HPA, a scale-out activity is triggered, and the number of pods does not exceed the value of maxReplicas. If the QPS value is smaller than the averageValue, a scale-in activity is triggered.
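A stress test can be as simple as sending many concurrent prediction requests through the Ingress. The following Python sketch builds a request for the TensorFlow Serving REST API (`/v1/models/{model name}:predict`) and sends it from a thread pool. It assumes the model name `chnsenticorp` and the domain name `test.example.com` from this topic. The Ingress address and the request payload are placeholders because the input signature of your model is not described here, and the helper names `build_predict_request` and `run_load` are hypothetical.

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def build_predict_request(ingress_address, host, model_name, instances):
    """Build a POST request for the TensorFlow Serving REST predict API."""
    url = f"http://{ingress_address}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    headers = {"Host": host, "Content-Type": "application/json"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def run_load(make_request, total, concurrency, send):
    """Send `total` requests with `concurrency` worker threads.

    Returns (succeeded, failed, elapsed_seconds). `send` is any callable
    that takes a request and raises an exception on failure.
    """
    def send_one(_):
        try:
            send(make_request())
            return True
        except Exception:
            return False

    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(send_one, range(total)))
    elapsed = time.monotonic() - start
    succeeded = sum(results)
    return succeeded, total - succeeded, elapsed


# Example usage (placeholders: replace the address and the payload):
# make = lambda: build_predict_request(
#     "<ingress address>", "test.example.com", "chnsenticorp",
#     [{"<input name>": "<input value>"}])
# send = lambda req: urllib.request.urlopen(req, timeout=10).read()
# print(run_load(make, total=2000, concurrency=50, send=send))
```

While the load runs, `kubectl get hpa` shows the TARGETS and REPLICAS columns change as the HPA reacts.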