After a model training job is completed, the model is usually deployed as an inference service. The number of calls to an inference service changes dynamically with business requirements, so elastic scaling is required to handle varying loads and reduce costs. Conventional deployment solutions cannot meet the requirements of large-scale, highly concurrent systems. Alibaba Cloud allows you to deploy workloads on elastic container instances to enable elastic scaling of inference services. This topic describes how to run elastic inference workloads on elastic container instances.
Prerequisites
A model is ready for deployment. In this topic, a BERT model trained with TensorFlow 1.15 is used.
The ack-virtual-node, ack-alibaba-cloud-metrics-adapter, and arena components are installed. For more information about how to manage the components, see Manage components. For more information about ack-virtual-node, see Connection overview.
Procedure
Upload the trained model to an Object Storage Service (OSS) bucket. For more information, see Upload objects.
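For example, the upload can be sketched with the ossutil command-line tool. The bucket name and local directory below are placeholders; the destination prefix is chosen to match the model path that is mounted later in this topic:

```shell
# Sketch only: replace the bucket name and local directory with your own.
# The destination prefix matches the model path used later
# (/data/models/tensorflow/chnsenticorp, where /data maps to the bucket root).
ossutil cp -r ./chnsenticorp oss://your-bucket/models/tensorflow/chnsenticorp/
```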
Create a persistent volume (PV) and a persistent volume claim (PVC).
Create a file named pvc.yaml and copy the following code into the file:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: model-csi-pv
spec:
  capacity:
    storage: 5Gi
  accessModes:
  - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: model-csi-pv # The value must be the same as the name of the PV.
    volumeAttributes:
      bucket: "Your Bucket"
      url: "Your oss url"
      akId: "Your Access Key Id"
      akSecret: "Your Access Key Secret"
      otherOpts: "-o max_stat_cache_size=0 -o allow_other"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-pvc
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 5Gi
The following table describes the parameters.

bucket: The name of the OSS bucket, which is globally unique in OSS. For more information, see Bucket naming conventions.
url: The URL that is used to access an object in the bucket. For more information, see Obtain the URL of a single object or the URLs of multiple objects.
akId and akSecret: The AccessKey ID and AccessKey secret that are used to access the OSS bucket. We recommend that you access the OSS bucket as a Resource Access Management (RAM) user. For more information, see Create an AccessKey pair.
otherOpts: Custom parameters for mounting the OSS bucket. Set -o max_stat_cache_size=0 to disable metadata caching, so that the system retrieves the latest metadata from OSS each time it attempts to access objects in OSS. Set -o allow_other to allow other users to access the OSS bucket that you mounted. For more information about other parameters, see Custom parameters supported by ossfs.
Run the following command to create the PV and PVC:
kubectl apply -f pvc.yaml
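Optionally, you can confirm that the PV and PVC are created and bound before you continue. A quick check, assuming kubectl is configured for the cluster:

```shell
# Both resources should report a STATUS of Bound.
kubectl get pv model-csi-pv
kubectl get pvc model-pvc
```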
Deploy an inference service.
Run the following command to deploy the inference service:

arena serve tensorflow \
  --namespace=default \
  --name=bert-tfserving \
  --model-name=chnsenticorp \
  --gpus=1 \
  --image=tensorflow/serving:1.15.0-gpu \
  --data=model-pvc:/data \
  --model-path=/data/models/tensorflow/chnsenticorp \
  --version-policy=specific:1623831335 \
  --annotation=alibabacloud.com/burst-resource=eci_only \
  --annotation=k8s.aliyun.com/eci-use-specs=ecs.gn6i-c4g1.xlarge

You can use annotations to specify the resource type that you want to request. The following table describes the parameters.

alibabacloud.com/burst-resource: Valid values:
- If the parameter is left empty, only existing Elastic Compute Service (ECS) resources in the cluster are used. This is the default setting.
- eci: Elastic container instances are used when the ECS resources in the cluster are insufficient.
- eci_only: Only elastic container instances are used. The ECS resources in the cluster are not used.
k8s.aliyun.com/eci-use-specs: To use the GPU resources of elastic container instances, you must use this annotation to specify a GPU-accelerated instance type.
Run the following command to query the status of the inference service:
arena serve list
Expected output:
NAME            TYPE        VERSION       DESIRED  AVAILABLE  ADDRESS        PORTS                   GPU
bert-tfserving  Tensorflow  202207181536  1        1          172.16.52.170  GRPC:8500,RESTFUL:8501  1
Run the following command to query the status of the pods:
kubectl get pods -o wide
Expected output:
NAME                                                             READY  STATUS   RESTARTS  AGE   IP             NODE                          NOMINATED NODE  READINESS GATES
bert-tfserving-202207181536-tensorflow-serving-547797c546-djh58  1/1    Running  0         114s  192.168.0.246  virtual-kubelet-cn-beijing-h  <none>          <none>
The output shows that the pod is scheduled to the node virtual-kubelet-cn-beijing-h, which is a virtual node. This indicates that the pod is deployed on an elastic container instance.
Configure a Horizontal Pod Autoscaler (HPA). The HPA can automatically adjust the number of replicated pods in a Kubernetes cluster based on workloads.
Run the following command to query the Deployment of the inference service:
kubectl get deployment
Expected output:
NAME                                            READY  UP-TO-DATE  AVAILABLE  AGE
bert-tfserving-202207181536-tensorflow-serving  1/1    1           1          2m18s
Run the following command to query the Service:
kubectl get service
Expected output:
NAME                                            TYPE       CLUSTER-IP     EXTERNAL-IP  PORT(S)            AGE
bert-tfserving-202207181536-tensorflow-serving  ClusterIP  172.16.52.170  <none>       8500/TCP,8501/TCP  2m45s
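Before you expose the service over the Internet, you can run a quick smoke test against the RESTful port. The sketch below assumes kubectl access to the cluster: it forwards port 8501 to your machine and queries the TensorFlow Serving model status endpoint (/v1/models/{model name}); the model name matches the --model-name flag that was passed to arena serve.

```shell
# Forward the RESTful port of the Service to localhost in the background.
kubectl port-forward svc/bert-tfserving-202207181536-tensorflow-serving 8501:8501 &
# Query the model status; a healthy model reports the state AVAILABLE.
curl http://localhost:8501/v1/models/chnsenticorp
```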
Create a file named bert-tfserving-eci-hpa.yaml and copy the following code into the file:

apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: bert-tfserving-eci-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: bert-tfserving-202207181536-tensorflow-serving
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: External
    external:
      metric:
        name: sls_ingress_qps
        selector:
          matchLabels:
            sls.project: "k8s-log-{cluster id}"
            sls.logstore: "nginx-ingress"
            sls.ingress.route: "default-bert-tfserving-202207181536-tensorflow-serving-8501"
      target:
        type: AverageValue
        averageValue: 10
The following table describes the parameters.

scaleTargetRef: Specifies the object to which the HPA is bound. In this example, the value is set to the name of the Deployment of the inference service configured in Step a.
minReplicas: The minimum number of replicated pods.
maxReplicas: The maximum number of replicated pods.
sls.project: The name of the Log Service project that is used by the cluster. The value of the parameter must be in the format of k8s-log-{cluster id}.
sls.logstore: The name of the Log Service Logstore. The default value is nginx-ingress.
sls.ingress.route: The Ingress that is used to expose the service. In this example, the value is set to {namespace}-{service name}-{service port}.
metric.name: The name of the metric. In this example, the value is set to sls_ingress_qps.
target.averageValue: The queries per second (QPS) value that triggers scale-out activities. In this example, the value of this parameter is set to 10. A scale-out activity is triggered when the QPS value is greater than 10.

Run the following command to create the HPA:
kubectl apply -f bert-tfserving-eci-hpa.yaml
Run the following command to query the status of the HPA:
kubectl get hpa
Expected output:
NAME                    REFERENCE                                                   TARGETS     MINPODS  MAXPODS  REPLICAS  AGE
bert-tfserving-eci-hpa  Deployment/bert-tfserving-202207181536-tensorflow-serving  0/10 (avg)  1        10       1         116s
Configure an Internet-facing Ingress.
By default, the inference service that is deployed by running the arena serve tensorflow command is assigned only a cluster IP address and cannot be accessed over the Internet. Therefore, you must create an Internet-facing Ingress for the inference service.
Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, click the name of the cluster that you want to manage and choose Network > Ingresses in the left-side navigation pane.
In the upper part of the Ingresses page, select the namespace where the inference service resides from the Namespace drop-down list, and click Create Ingress. Set the parameters that are described in the following table. For more information about the parameters, see Create an NGINX Ingress.
Name: In this example, the value is set to bert-tfserving.
Rules:
  Domain Name: Enter a custom domain name, such as test.example.com.
  Mappings:
    Path: The root path / is used in this example.
    Rule: The default rule (ImplementationSpecific) is used in this example.
    Service Name: In this example, the value is set to the service name bert-tfserving-202207181536-tensorflow-serving specified in Step b.
    Port: Port 8501 is used in this example.
On the Ingresses page, check the address of the Ingress in the Rules column.
Use the address that is obtained from Step 5 to perform stress tests on the inference service. If the QPS value is greater than the averageValue configured in the HPA, a scale-out activity is triggered, and the number of pods does not exceed the value of maxReplicas. If the QPS value is smaller than the averageValue, a scale-in activity is triggered.
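As a sketch, the stress test and the resulting scaling behavior can be observed as follows. The hey load generator is an assumption (any HTTP benchmarking tool works), and test.example.com stands for the domain name that you configured in the Ingress:

```shell
# Generate load for 2 minutes at roughly 20 queries per second,
# which exceeds the averageValue of 10 configured in the HPA.
hey -z 2m -q 20 -c 1 http://test.example.com/v1/models/chnsenticorp
# In a second terminal, watch the HPA add replicas (up to maxReplicas).
kubectl get hpa bert-tfserving-eci-hpa --watch
```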