After training a model, you typically deploy it as an inference service. Inference traffic is variable, so a fixed pool of Elastic Compute Service (ECS) nodes can't absorb sudden bursts without over-provisioning. By scheduling inference pods on Elastic Container Instance (ECI), Container Service for Kubernetes (ACK) scales capacity on demand without requiring you to manage nodes.
This topic walks through deploying a BERT model as a TensorFlow Serving inference service on ECI, configuring Horizontal Pod Autoscaler (HPA) to scale based on queries per second (QPS), and exposing the service over the Internet.
How it works
The following components work together to deliver elastic inference:
Client → Ingress (Internet-facing) → TF Serving pods (on ECI)
                                            ↑
                                   HPA (scale-out/in)
                                            ↑
                                     Metrics Adapter ← SLS (nginx-ingress QPS)
-
Upload the trained model to an Object Storage Service (OSS) bucket.
-
Mount the OSS bucket into the cluster using a persistent volume (PV) and persistent volume claim (PVC).
-
Deploy the inference service using arena, with annotations that direct new pods to ECI.
-
Configure HPA with an external metric sourced from Simple Log Service (SLS) to trigger scale-out when QPS exceeds a threshold.
-
Create an Internet-facing Ingress so external clients can reach the service.
-
Run a stress test to verify that the HPA scales pods out and back in based on QPS.
Prerequisites
Before you begin, ensure that you have:
-
A trained model ready for deployment. This topic uses a BERT model trained with TensorFlow 1.15.
-
The following components installed in your ACK cluster: ack-virtual-node, ack-alibaba-cloud-metrics-adapter, and arena. For installation instructions, see Manage components. For information about ack-virtual-node, see Connection overview.
Step 1: Upload the model to OSS
Upload your trained model files to an OSS bucket. For instructions, see Upload objects.
Step 2: Mount the model using a PV and PVC
Create a PV that maps to your OSS bucket and a PVC that the inference pod will mount.
-
Create a file named pvc.yaml with the following content:

    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: model-csi-pv
    spec:
      capacity:
        storage: 5Gi
      accessModes:
        - ReadWriteMany
      persistentVolumeReclaimPolicy: Retain
      csi:
        driver: ossplugin.csi.alibabacloud.com
        volumeHandle: model-csi-pv # Must match the PV name above
        volumeAttributes:
          bucket: "<your-bucket-name>"
          url: "<your-oss-url>"
          akId: "<your-access-key-id>"
          akSecret: "<your-access-key-secret>"
          otherOpts: "-o max_stat_cache_size=0 -o allow_other"
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: model-pvc
    spec:
      accessModes:
        - ReadWriteMany
      volumeName: model-csi-pv
      storageClassName: ""
      resources:
        requests:
          storage: 5Gi

Replace the placeholders:
| Placeholder | Description |
| --- | --- |
| <your-bucket-name> | Name of the OSS bucket. Bucket names are globally unique. See Bucket naming conventions. |
| <your-oss-url> | URL used to access objects in the bucket. See Obtain the URL of a single object or the URLs of multiple objects. |
| <your-access-key-id> | AccessKey ID for OSS access. Use a Resource Access Management (RAM) user with least-privilege permissions. See Create an AccessKey pair. |
| <your-access-key-secret> | AccessKey secret corresponding to the AccessKey ID above. |

The otherOpts field accepts custom mount options for ossfs:
-
-o max_stat_cache_size=0: disables metadata caching so the pod always reads the latest object metadata from OSS.
-
-o allow_other: allows processes running as other users in the pod to access the mounted bucket.
For additional mount options, see Custom parameters supported by ossfs.
-
Apply the manifest:
kubectl apply -f pvc.yaml
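Because the manifest ships with angle-bracket placeholders, it can be worth confirming that none are left before you apply it. The following is a minimal sketch, not part of the official procedure: the function name find_placeholders and the sample text are hypothetical, and the pattern only recognizes the `<your-...>` style used in this topic.

```python
import re
from typing import List

# Matches placeholders written like <your-bucket-name>.
PLACEHOLDER = re.compile(r"<your-[a-z0-9-]+>")


def find_placeholders(manifest_text: str) -> List[str]:
    """Return the distinct placeholders still present, in order of first appearance."""
    seen: List[str] = []
    for match in PLACEHOLDER.findall(manifest_text):
        if match not in seen:
            seen.append(match)
    return seen


# Hypothetical fragment in which only the url field has been filled in.
sample = (
    'bucket: "<your-bucket-name>"\n'
    'url: "filled-in-value"\n'
    'akId: "<your-access-key-id>"\n'
)

leftover = find_placeholders(sample)
print(leftover)  # ['<your-bucket-name>', '<your-access-key-id>']
```

To check the real file, you could pass the contents of pvc.yaml to find_placeholders and apply the manifest only when the returned list is empty.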
Step 3: Deploy the inference service
Use arena to deploy a TensorFlow Serving inference service. The --annotation flags control whether pods land on ECS nodes or ECI.
-
Run the following command:
    arena serve tensorflow \
      --namespace=default \
      --name=bert-tfserving \
      --model-name=chnsenticorp \
      --gpus=1 \
      --image=tensorflow/serving:1.15.0-gpu \
      --data=model-pvc:/data \
      --model-path=/data/models/tensorflow/chnsenticorp \
      --version-policy=specific:1623831335 \
      --annotation=alibabacloud.com/burst-resource=eci_only \
      --annotation=k8s.aliyun.com/eci-use-specs=ecs.gn6i-c4g1.xlarge

The two annotations control ECI scheduling:
| Annotation | Description |
| --- | --- |
| alibabacloud.com/burst-resource | Controls where pods are scheduled. Leave blank to use ECS only (default). Set to eci to use ECI when ECS capacity is insufficient. Set to eci_only to use ECI exclusively. |
| k8s.aliyun.com/eci-use-specs | Specifies the GPU-accelerated ECI instance type. Required when using ECI GPU resources. |
-
Verify the service is running:
    arena serve list

Expected output:

    NAME            TYPE        VERSION       DESIRED  AVAILABLE  ADDRESS        PORTS                   GPU
    bert-tfserving  Tensorflow  202207181536  1        1          172.16.52.170  GRPC:8500,RESTFUL:8501  1
-
Confirm the pod is running on ECI:
    kubectl get pods -o wide

Expected output:

    NAME                                                              READY   STATUS    RESTARTS   AGE    IP              NODE                           NOMINATED NODE   READINESS GATES
    bert-tfserving-202207181536-tensorflow-serving-547797c546-djh58   1/1     Running   0          114s   192.168.0.246   virtual-kubelet-cn-beijing-h   <none>           <none>

The NODE value virtual-kubelet-cn-beijing-h confirms the pod is running on an ECI instance, not an ECS node.
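When there are many replicas, reading the NODE column by eye gets tedious. The sketch below groups pods by the kind of node they landed on, using the `spec.nodeName` field from `kubectl get pods -o json` output. It assumes, based on the example output in this step, that ECI-backed virtual nodes have names starting with virtual-kubelet. The function name, the sample data, and the ECS node name are hypothetical.

```python
from typing import Dict, List


def pods_by_node_kind(pod_list: dict, virtual_prefix: str = "virtual-kubelet") -> Dict[str, List[str]]:
    """Group pod names into 'eci', 'ecs', and 'unscheduled' by node name."""
    result: Dict[str, List[str]] = {"eci": [], "ecs": [], "unscheduled": []}
    for item in pod_list.get("items", []):
        name = item["metadata"]["name"]
        node = item.get("spec", {}).get("nodeName")
        if not node:
            # Pending pods have no nodeName yet.
            result["unscheduled"].append(name)
        elif node.startswith(virtual_prefix):
            result["eci"].append(name)
        else:
            result["ecs"].append(name)
    return result


# Trimmed, hypothetical stand-in for `kubectl get pods -o json` output.
sample = {
    "items": [
        {"metadata": {"name": "serving-a"}, "spec": {"nodeName": "virtual-kubelet-cn-beijing-h"}},
        {"metadata": {"name": "serving-b"}, "spec": {"nodeName": "example-ecs-node-1"}},
        {"metadata": {"name": "serving-c"}, "spec": {}},
    ]
}

groups = pods_by_node_kind(sample)
print(groups)
```

With real data, you could save the kubectl JSON output to a file, load it with json.load, and pass the result to the same function.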
Step 4: Configure HPA for QPS-based scaling
HPA automatically adjusts the number of inference pods based on the sls_ingress_qps metric from SLS. Scale-out is triggered when the average QPS per pod exceeds averageValue, and scale-in is triggered when it drops below that value.
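To make the threshold concrete, the following sketch approximates how the Kubernetes HPA controller sizes a Deployment for an external metric with an AverageValue target. It is a simplified model, not the controller's code: it assumes that sls_ingress_qps reports the total QPS for the route and that the default 10% tolerance applies, and it ignores stabilization windows and scaling behavior policies. The function name is hypothetical.

```python
import math


def desired_replicas(total_qps: float, current_replicas: int,
                     average_value: float = 10.0,
                     min_replicas: int = 1, max_replicas: int = 10,
                     tolerance: float = 0.1) -> int:
    """Approximate the HPA replica count for an AverageValue target."""
    if current_replicas <= 0:
        raise ValueError("current_replicas must be positive")
    # Ratio of the current per-pod average to the target per-pod average.
    usage_ratio = total_qps / (average_value * current_replicas)
    if abs(usage_ratio - 1.0) <= tolerance:
        desired = current_replicas  # close enough to the target: no change
    else:
        desired = math.ceil(total_qps / average_value)
    # The result is always clamped to the minReplicas..maxReplicas range.
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(45, 1))    # 5  (45 QPS needs five pods at 10 QPS each)
print(desired_replicas(10.5, 1))  # 1  (within the 10% tolerance)
print(desired_replicas(250, 5))   # 10 (capped at maxReplicas)
print(desired_replicas(3, 5))     # 1  (scale-in, floored at minReplicas)
```

In a real cluster, scale-in usually lags behind this calculation because the controller waits for a stabilization window before it removes pods.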
-
Check the Deployment and Service names created by arena:

    kubectl get deployment

Expected output:

    NAME                                             READY   UP-TO-DATE   AVAILABLE   AGE
    bert-tfserving-202207181536-tensorflow-serving   1/1     1            1           2m18s

    kubectl get service

Expected output:

    NAME                                             TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)             AGE
    bert-tfserving-202207181536-tensorflow-serving   ClusterIP   172.16.52.170   <none>        8500/TCP,8501/TCP   2m45s
-
Create a file named bert-tfserving-eci-hpa.yaml with the following content:

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: bert-tfserving-eci-hpa
      namespace: default
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: bert-tfserving-202207181536-tensorflow-serving
      minReplicas: 1
      maxReplicas: 10
      metrics:
        - type: External
          external:
            metric:
              name: sls_ingress_qps
              selector:
                matchLabels:
                  sls.project: "k8s-log-{cluster id}"
                  sls.logstore: "nginx-ingress"
                  sls.ingress.route: "default-bert-tfserving-202207181536-tensorflow-serving-8501"
            target:
              type: AverageValue
              averageValue: "10"

Key parameters:

| Parameter | Description |
| --- | --- |
| scaleTargetRef | The Deployment to scale, which is the inference service Deployment created in Step 3. |
| minReplicas | Minimum number of pods. |
| maxReplicas | Maximum number of pods. |
| sls.project | Name of the SLS project for your cluster, in the format k8s-log-{cluster id}. Replace {cluster id} with your actual cluster ID. |
| sls.logstore | SLS Logstore name. Default is nginx-ingress. |
| sls.ingress.route | Identifies the Ingress route to monitor, in the format {namespace}-{service name}-{service port}. |
| averageValue | QPS threshold per pod that triggers scale-out. Set to 10 in this example. |
-
Apply the HPA manifest:
    kubectl apply -f bert-tfserving-eci-hpa.yaml
-
Verify the HPA is active:
    kubectl get hpa

Expected output:

    NAME                     REFERENCE                                                   TARGETS      MINPODS   MAXPODS   REPLICAS   AGE
    bert-tfserving-eci-hpa   Deployment/bert-tfserving-202207181536-tensorflow-serving   0/10 (avg)   1         10        1          116s
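The selector labels in the HPA manifest are easy to mistype because they are assembled from several names. As an illustration only, this sketch builds the three labels from their parts, following the k8s-log-{cluster id} and {namespace}-{service name}-{service port} formats described in this step. The function name and the cluster ID value are hypothetical.

```python
from typing import Dict


def sls_selector_labels(cluster_id: str, namespace: str, service: str,
                        port: int, logstore: str = "nginx-ingress") -> Dict[str, str]:
    """Build the matchLabels used to select the sls_ingress_qps metric."""
    return {
        "sls.project": f"k8s-log-{cluster_id}",
        "sls.logstore": logstore,
        "sls.ingress.route": f"{namespace}-{service}-{port}",
    }


labels = sls_selector_labels(
    cluster_id="example-cluster-id",  # hypothetical; use your actual cluster ID
    namespace="default",
    service="bert-tfserving-202207181536-tensorflow-serving",
    port=8501,
)
for key, value in labels.items():
    print(f'{key}: "{value}"')
```

The printed route value matches the one in the example manifest, which makes it a quick way to compare against what you pasted.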
Step 5: Expose the service over the Internet
By default, arena serve tensorflow assigns only a cluster IP address to the service. To accept external traffic, create an Internet-facing Ingress.
-
Log on to the ACK console and click Clusters in the left-side navigation pane.
-
Click the target cluster name. In the left navigation pane, choose Network > Ingresses.
-
Select the namespace where the inference service resides from the Namespace drop-down list, then click Create Ingress. Configure the following parameters. For details, see Create an NGINX Ingress.
| Parameter | Example value |
| --- | --- |
| Name | bert-tfserving |
| Domain name | test.example.com (use your own domain) |
| Path | / |
| Rule | ImplementationSpecific (default) |
| Service name | bert-tfserving-202207181536-tensorflow-serving |
| Port | 8501 |
-
On the Ingresses page, find the Ingress you created and note its address in the Rules column.
Step 6: Verify elastic scaling with a stress test
Use the Ingress address from Step 5 to send load to the inference service.
-
When QPS exceeds the averageValue (10 in this example), HPA triggers scale-out and new pods are scheduled on ECI. The total number of pods stays within maxReplicas (10).
-
When QPS drops below averageValue, HPA triggers scale-in.
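This topic does not prescribe a load-testing tool, so the following is only one possible sketch of a paced client, using the Python standard library. It targets the TensorFlow Serving REST predict endpoint (`/v1/models/<model>:predict`) through the Ingress and sets the Host header to the Ingress domain. The function names are hypothetical, and the request payload depends on your model's input signature, so the `instances` value is an assumption that you must replace.

```python
import json
import time
import urllib.request
from typing import Callable, Tuple


def build_predict_request(ingress_address: str, host: str, model_name: str,
                          instances: list) -> urllib.request.Request:
    """Build a POST request for the TensorFlow Serving REST predict API."""
    url = f"http://{ingress_address}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    headers = {"Host": host, "Content-Type": "application/json"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def send_once(request: urllib.request.Request, timeout_s: float = 5.0) -> bool:
    """Send one request and report whether it returned a 2xx status."""
    try:
        with urllib.request.urlopen(request, timeout=timeout_s) as response:
            return 200 <= response.status < 300
    except OSError:  # covers URLError, HTTPError, and timeouts
        return False


def run_load(send: Callable[[], bool], qps: float, total_requests: int) -> Tuple[int, int]:
    """Call send() total_requests times, paced at roughly qps calls per second."""
    interval = 1.0 / qps
    ok = failed = 0
    next_at = time.monotonic()
    for _ in range(total_requests):
        delay = next_at - time.monotonic()
        if delay > 0:
            time.sleep(delay)
        if send():
            ok += 1
        else:
            failed += 1
        next_at += interval
    return ok, failed


# Example wiring (not executed here; the address and payload are placeholders):
# request = build_predict_request("<ingress-address>", "test.example.com",
#                                 "chnsenticorp", [{"your_input": "..."}])
# print(run_load(lambda: send_once(request), qps=30, total_requests=3000))
```

Because this client sends requests one at a time, its real rate is limited by response latency. To hold a rate well above averageValue, you may need several copies running in parallel or a dedicated load-testing tool. While the load runs, `kubectl get hpa` and `kubectl get pods -o wide` show the replica count rising and the new pods landing on virtual nodes.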