After a model training job is complete, the model is usually deployed as an inference service. The number of calls to an inference service changes dynamically with business requirements. Elastic scaling is required to handle different loads and reduce costs. Conventional deployment solutions cannot meet the requirements of large-scale, highly concurrent systems. Alibaba Cloud allows you to deploy workloads on Elastic Container Instance (ECI) to enable elastic scaling of inference services. This topic describes how to run elastic inference workloads on elastic container instances.
Prerequisites
A model is ready for deployment. In this topic, a BERT model trained with TensorFlow 1.15 is used.
The ack-virtual-node, ack-alibaba-cloud-metrics-adapter, and arena components are installed. For more information about how to manage the components, see Manage components. For more information about ack-virtual-node, see Connection overview.
Procedure
Upload the trained model to an Object Storage Service (OSS) bucket. For more information, see Upload objects.
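For reference, the following is a minimal sketch of uploading an exported SavedModel directory with the ossutil CLI. The bucket name and paths are placeholders; they must match the model path that is mounted later in this topic.
# Recursively upload the local SavedModel directory to the OSS bucket.
# Replace examplebucket and the local path with your own values.
ossutil cp -r ./chnsenticorp oss://examplebucket/models/tensorflow/chnsenticorp/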
Create a persistent volume (PV) and a persistent volume claim (PVC).
Create a file named pvc.yaml and copy the following code into the file:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: model-csi-pv
spec:
  capacity:
    storage: 5Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: model-csi-pv  # The value must be the same as the name of the PV.
    volumeAttributes:
      bucket: "Your Bucket"
      url: "Your oss url"
      akId: "Your Access Key Id"
      akSecret: "Your Access Key Secret"
      otherOpts: "-o max_stat_cache_size=0 -o allow_other"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-pvc
spec:
  accessModes:
    - ReadWriteMany
  volumeName: model-csi-pv
  storageClassName: ""
  resources:
    requests:
      storage: 5Gi

The following list describes the parameters:

bucket: The name of the OSS bucket, which is globally unique in OSS. For more information, see Bucket naming conventions.
url: The URL that is used to access an object in the bucket. For more information, see Obtain the URL of a single object or the URLs of multiple objects.
akId and akSecret: The AccessKey ID and AccessKey secret that are used to access the OSS bucket. We recommend that you access the OSS bucket as a Resource Access Management (RAM) user. For more information, see Create an AccessKey pair.
otherOpts: Custom parameters for mounting the OSS bucket. Set -o max_stat_cache_size=0 to disable metadata caching. If this feature is disabled, the system retrieves the latest metadata from OSS each time it attempts to access objects in OSS. Set -o allow_other to allow other users to access the OSS bucket that you mounted. For more information about other parameters, see Custom parameters supported by ossfs.
Run the following command to create the PV and PVC:
kubectl apply -f pvc.yaml
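Optionally, you can confirm that the PV and PVC are bound before you deploy the inference service. The resource names below are the ones defined in pvc.yaml.
# Both resources should report the Bound status.
kubectl get pv model-csi-pv
kubectl get pvc model-pvc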
Deploy an inference service.
You can use annotations to specify the type of resources that you want to request when you deploy the service. The following list describes the annotations.
alibabacloud.com/burst-resource: Valid values:
- If the annotation is left empty, only existing Elastic Compute Service (ECS) resources in the cluster are used. This is the default setting.
- eci: ECIs are used when the ECS resources in the cluster are insufficient.
- eci_only: Only ECIs are used. The ECS resources in the cluster are not used.
k8s.aliyun.com/eci-use-specs: To use GPU resources on ECIs, you must use this annotation to specify a GPU-accelerated instance type.
Run the following command to deploy the inference service:

arena serve tensorflow \
  --namespace=default \
  --name=bert-tfserving \
  --model-name=chnsenticorp \
  --gpus=1 \
  --image=tensorflow/serving:1.15.0-gpu \
  --data=model-pvc:/data \
  --model-path=/data/models/tensorflow/chnsenticorp \
  --version-policy=specific:1623831335 \
  --annotation=alibabacloud.com/burst-resource=eci_only \
  --annotation=k8s.aliyun.com/eci-use-specs=ecs.gn6i-c4g1.xlarge

Run the following command to query the status of the services:
arena serve list

Expected output:
NAME            TYPE        VERSION       DESIRED  AVAILABLE  ADDRESS        PORTS                   GPU
bert-tfserving  Tensorflow  202207181536  1        1          172.16.52.170  GRPC:8500,RESTFUL:8501  1

Run the following command to query the status of the pods:
kubectl get pods -o wide

Expected output:
NAME                                                              READY  STATUS   RESTARTS  AGE   IP             NODE                          NOMINATED NODE  READINESS GATES
bert-tfserving-202207181536-tensorflow-serving-547797c546-djh58  1/1    Running  0         114s  192.168.0.246  virtual-kubelet-cn-beijing-h  <none>          <none>

The output shows that the pod is scheduled to the virtual node virtual-kubelet-cn-beijing-h. This indicates that the pod is deployed on an elastic container instance.
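Optionally, you can verify that TensorFlow Serving responds before you expose the service. The following is a minimal sketch that assumes the service name and RESTful port shown by kubectl get service in the next step; the service name contains the deployment timestamp and may differ in your cluster.
# Forward the RESTful port (8501) of the service to your local machine.
kubectl port-forward svc/bert-tfserving-202207181536-tensorflow-serving 8501:8501
# In another terminal, query the status of the chnsenticorp model.
curl http://localhost:8501/v1/models/chnsenticorp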
Configure a Horizontal Pod Autoscaler (HPA). The HPA can automatically adjust the number of replicated pods in a Kubernetes cluster based on workloads.
Run the following command to query the Deployment of the inference service:
kubectl get deployment

Expected output:
NAME                                            READY  UP-TO-DATE  AVAILABLE  AGE
bert-tfserving-202207181536-tensorflow-serving  1/1    1           1          2m18s

Run the following command to query the Service:
kubectl get service

Expected output:
NAME                                            TYPE       CLUSTER-IP     EXTERNAL-IP  PORT(S)            AGE
bert-tfserving-202207181536-tensorflow-serving  ClusterIP  172.16.52.170  <none>       8500/TCP,8501/TCP  2m45s

Create a file named bert-tfserving-eci-hpa.yaml and copy the following code into the file:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: bert-tfserving-eci-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: bert-tfserving-202207181536-tensorflow-serving
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: External
      external:
        metric:
          name: sls_ingress_qps
          selector:
            matchLabels:
              sls.project: "k8s-log-{cluster id}"
              sls.logstore: "nginx-ingress"
              sls.ingress.route: "default-bert-tfserving-202207181536-tensorflow-serving-8501"
        target:
          type: AverageValue
          averageValue: "10"

The following list describes the parameters.
scaleTargetRef: Specifies the object to which the HPA is bound. In this example, the value is set to the name of the Deployment of the inference service that you obtained in the previous step.
minReplicas: The minimum number of replicated pods.
maxReplicas: The maximum number of replicated pods.
sls.project: The name of the Simple Log Service (SLS) project used by the cluster. The value must be in the format of k8s-log-{cluster id}. Replace {cluster id} with your actual cluster ID.
sls.logstore: The name of the Simple Log Service Logstore. The default value is nginx-ingress.
sls.ingress.route: The Ingress route that is used to expose the service. The value is in the format of {namespace}-{service name}-{service port}.
metric.name: The name of the metric. In this example, the value is set to sls_ingress_qps.
target.averageValue: The queries per second (QPS) value that triggers scale-out activities. In this example, the value is set to 10. A scale-out activity is triggered when the QPS value is greater than 10.

Run the following command to create the HPA:

kubectl apply -f bert-tfserving-eci-hpa.yaml

Run the following command to query the status of the HPA:
kubectl get hpa

Expected output:
NAME                    REFERENCE                                                   TARGETS     MINPODS  MAXPODS  REPLICAS  AGE
bert-tfserving-eci-hpa  Deployment/bert-tfserving-202207181536-tensorflow-serving  0/10 (avg)  1        10       1         116s
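If the TARGETS column shows unknown values, you can check whether the external metrics API served by ack-alibaba-cloud-metrics-adapter is available. This is an optional check, not part of the original procedure.
# List the external metrics that the metrics adapter exposes to the HPA controller.
kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1"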
Configure an Internet-facing Ingress.
By default, the inference service that is deployed by running the arena serve tensorflow command is assigned only a cluster IP address. The service cannot be accessed over the Internet. Therefore, you must create an Internet-facing Ingress for the inference service.

Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, find the cluster that you want to manage and click its name. In the left-side navigation pane, open the Ingresses page.
On the Ingresses page, select the namespace where the inference service resides from the Namespace drop-down list, and click Create Ingress. Set the parameters that are described in the following table. For more information about the parameters, see Create an NGINX Ingress.
Name: In this example, the value is set to bert-tfserving.
Rules:
- Domain Name: Enter a custom domain name, such as test.example.com.
- Mappings:
  - Path: The root path / is used in this example.
  - Rule: The default rule (ImplementationSpecific) is used in this example.
  - Service Name: In this example, the value is set to the service name bert-tfserving-202207181536-tensorflow-serving that you queried earlier.
  - Port: Port 8501 is used in this example.
On the Ingresses page, check the address of the Ingress in the Rules column.
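Before you run stress tests, you can send a single request through the Ingress to confirm that the service is reachable. The following sketch assumes the domain name test.example.com configured above; replace <ingress-address> with the address shown in the Rules column.
# Query the model status through the Ingress by setting the Host header.
curl -H "Host: test.example.com" http://<ingress-address>/v1/models/chnsenticorp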
Use the address that is obtained in the previous step to perform stress tests on the inference service. If the QPS value is greater than the averageValue configured in the HPA, a scale-out activity is triggered, and the number of pods does not exceed the value of maxReplicas. If the QPS value is smaller than the averageValue, a scale-in activity is triggered. A minimal stress test sketch is shown below.
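The following sketch uses a plain curl loop to generate load; a dedicated load-testing tool can be used instead. It assumes the domain name test.example.com; replace <ingress-address> with the Ingress address.
# Generate sustained requests against the inference service.
while true; do
  curl -s -o /dev/null -H "Host: test.example.com" "http://<ingress-address>/v1/models/chnsenticorp"
done

# In another terminal, watch the HPA scale the Deployment out and back in.
kubectl get hpa bert-tfserving-eci-hpa -w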