Container Service for Kubernetes: Elastic Inference on ECI

Last Updated: Mar 26, 2026

After training a model, you typically deploy it as an inference service. Inference traffic is bursty, and a fixed pool of Elastic Compute Service (ECS) nodes can absorb sudden spikes only if you over-provision it. By scheduling inference pods on Elastic Container Instance (ECI), Container Service for Kubernetes (ACK) adds capacity on demand, and you do not have to manage nodes.

This topic walks through deploying a BERT model as a TensorFlow Serving inference service on ECI, configuring Horizontal Pod Autoscaler (HPA) to scale based on queries per second (QPS), and exposing the service over the Internet.

How it works

The following components work together to deliver elastic inference:

Client → Ingress (Internet-facing) → TF Serving pods (on ECI)
                                            ↑
                                     HPA (scale-out/in)
                                            ↑
                           Metrics Adapter ← SLS (nginx-ingress QPS)

The workflow consists of the following steps:

  1. Upload the trained model to an Object Storage Service (OSS) bucket.

  2. Mount the OSS bucket into the cluster using a persistent volume (PV) and persistent volume claim (PVC).

  3. Deploy the inference service using arena, with annotations that direct new pods to ECI.

  4. Configure HPA with an external metric sourced from Simple Log Service (SLS) to trigger scale-out when QPS exceeds a threshold.

  5. Create an Internet-facing Ingress so external clients can reach the service.

  6. Run a stress test to verify that the HPA scales pods out and back in based on QPS.

Prerequisites

Before you begin, ensure that you have:

  • A trained model ready for deployment. This topic uses a BERT model trained with TensorFlow 1.15.

  • The following components installed in your ACK cluster: ack-virtual-node, ack-alibaba-cloud-metrics-adapter, and arena. For installation instructions, see Manage components. For information about ack-virtual-node, see Connection overview.

Step 1: Upload the model to OSS

Upload your trained model files to an OSS bucket. For instructions, see Upload objects.
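
The deployment command later in this topic mounts the bucket at /data, reads the model from /data/models/tensorflow/chnsenticorp, and pins version 1623831335. TensorFlow Serving expects each model version in a numbered subdirectory, which suggests the object layout sketched below. This is a minimal sketch, assuming that ossutil is installed and configured and that the PV mounts the bucket root. The function names, the bucket name my-bucket, and the local export path are hypothetical.

```shell
# Build the OSS destination that matches the paths used in Step 3.
# Arguments: BUCKET MODEL_NAME VERSION
model_destination() {
  bucket=$1; model=$2; version=$3
  echo "oss://$bucket/models/tensorflow/$model/$version/"
}

# Upload a local SavedModel export directory recursively.
# Assumes ossutil is installed and configured; not called by this snippet.
# Arguments: LOCAL_DIR BUCKET MODEL_NAME VERSION
upload_model() {
  ossutil cp "$1" "$(model_destination "$2" "$3" "$4")" -r
}

# Show where version 1623831335 of the chnsenticorp model would land.
model_destination my-bucket chnsenticorp 1623831335
```

To perform the upload, you would call, for example, upload_model ./export/1623831335 my-bucket chnsenticorp 1623831335 with your own bucket name and export directory.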

Step 2: Mount the model using a PV and PVC

Create a PV that maps to your OSS bucket and a PVC that the inference pod will mount.

  1. Create a file named pvc.yaml with the following content:

    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: model-csi-pv
    spec:
      capacity:
        storage: 5Gi
      accessModes:
        - ReadWriteMany
      persistentVolumeReclaimPolicy: Retain
      csi:
        driver: ossplugin.csi.alibabacloud.com
        volumeHandle: model-csi-pv   # Must match the PV name above
        volumeAttributes:
          bucket: "<your-bucket-name>"
          url: "<your-oss-url>"
          akId: "<your-access-key-id>"
          akSecret: "<your-access-key-secret>"
          otherOpts: "-o max_stat_cache_size=0 -o allow_other"
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: model-pvc
    spec:
      accessModes:
        - ReadWriteMany
      volumeName: model-csi-pv
      storageClassName: ""
      resources:
        requests:
          storage: 5Gi

    Replace the placeholders:

    Placeholder               Description
    <your-bucket-name>        Name of the OSS bucket. Bucket names are globally unique. See Bucket naming conventions.
    <your-oss-url>            URL used to access objects in the bucket. See Obtain the URL of a single object or the URLs of multiple objects.
    <your-access-key-id>      AccessKey ID for OSS access. Use a Resource Access Management (RAM) user with least-privilege permissions. See Create an AccessKey pair.
    <your-access-key-secret>  AccessKey secret that corresponds to the AccessKey ID above.

    The otherOpts field accepts custom mount options for ossfs:

    • -o max_stat_cache_size=0 — disables metadata caching so the pod always reads the latest object metadata from OSS.

    • -o allow_other — allows processes running as other users in the pod to access the mounted bucket.

    For additional mount options, see Custom parameters supported by ossfs.

  2. Apply the manifest:

    kubectl apply -f pvc.yaml

Step 3: Deploy the inference service

Use arena to deploy a TensorFlow Serving inference service. The --annotation flags control whether pods land on ECS nodes or ECI.

  1. Run the following command:

    arena serve tensorflow \
      --namespace=default \
      --name=bert-tfserving \
      --model-name=chnsenticorp \
      --gpus=1 \
      --image=tensorflow/serving:1.15.0-gpu \
      --data=model-pvc:/data \
      --model-path=/data/models/tensorflow/chnsenticorp \
      --version-policy=specific:1623831335 \
      --annotation=alibabacloud.com/burst-resource=eci_only \
      --annotation=k8s.aliyun.com/eci-use-specs=ecs.gn6i-c4g1.xlarge

    The two annotations control ECI scheduling:

    Annotation                       Description
    alibabacloud.com/burst-resource  Controls where pods are scheduled. Leave blank to use ECS only (default). Set to eci to use ECI when ECS capacity is insufficient. Set to eci_only to use ECI exclusively.
    k8s.aliyun.com/eci-use-specs     Specifies the GPU-accelerated ECI instance type. Required when using ECI GPU resources.

  2. Verify the service is running:

    arena serve list

    Expected output:

    NAME            TYPE        VERSION       DESIRED  AVAILABLE  ADDRESS        PORTS                   GPU
    bert-tfserving  Tensorflow  202207181536  1        1          172.16.52.170  GRPC:8500,RESTFUL:8501  1

  3. Confirm the pod is running on ECI:

    kubectl get pods -o wide

    Expected output:

    NAME                                                              READY   STATUS    RESTARTS   AGE    IP              NODE                           NOMINATED NODE   READINESS GATES
    bert-tfserving-202207181536-tensorflow-serving-547797c546-djh58   1/1     Running   0          114s   192.168.0.246   virtual-kubelet-cn-beijing-h   <none>           <none>

    The NODE value virtual-kubelet-cn-beijing-h confirms the pod is running on an ECI instance, not an ECS node.

Step 4: Configure HPA for QPS-based scaling

HPA adjusts the number of inference pods based on the sls_ingress_qps metric, which the metrics adapter reads from SLS. HPA scales out when the average QPS per pod exceeds averageValue and scales in when it drops below that value.
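
The scaling rule can be made concrete with a little arithmetic. For an External metric with an AverageValue target, Kubernetes computes the desired replica count as the total metric value divided by the per-pod target, rounded up, and then clamps the result to the minReplicas and maxReplicas bounds. The following is a minimal sketch of that calculation. It ignores the HPA tolerance band and stabilization windows, and the function name is hypothetical.

```shell
# Estimate the replica count HPA would request.
# Arguments: TOTAL_QPS TARGET_QPS_PER_POD MIN_REPLICAS MAX_REPLICAS
desired_replicas() {
  total=$1; target=$2; min=$3; max=$4
  # Integer ceiling of total / target.
  n=$(( (total + target - 1) / target ))
  # Clamp to the configured bounds.
  if [ "$n" -lt "$min" ]; then n=$min; fi
  if [ "$n" -gt "$max" ]; then n=$max; fi
  echo "$n"
}

desired_replicas 35 10 1 10    # prints 4: ceil(35 / 10)
desired_replicas 250 10 1 10   # prints 10: capped at the maximum
desired_replicas 0 10 1 10     # prints 1: held at the minimum
```

With the values used in this topic (a target of 10 QPS per pod and at most 10 pods), the deployment stops scaling out once total traffic reaches about 100 QPS.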

  1. Check the Deployment and Service names created by arena:

    kubectl get deployment

    Expected output:

    NAME                                             READY   UP-TO-DATE   AVAILABLE   AGE
    bert-tfserving-202207181536-tensorflow-serving   1/1     1            1           2m18s

    kubectl get service

    Expected output:

    NAME                                             TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)             AGE
    bert-tfserving-202207181536-tensorflow-serving   ClusterIP   172.16.52.170   <none>        8500/TCP,8501/TCP   2m45s

  2. Create a file named bert-tfserving-eci-hpa.yaml with the following content:

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: bert-tfserving-eci-hpa
      namespace: default
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: bert-tfserving-202207181536-tensorflow-serving
      minReplicas: 1
      maxReplicas: 10
      metrics:
      - type: External
        external:
          metric:
            name: sls_ingress_qps
            selector:
              matchLabels:
                sls.project: "k8s-log-{cluster id}"
                sls.logstore: "nginx-ingress"
                sls.ingress.route: "default-bert-tfserving-202207181536-tensorflow-serving-8501"
          target:
            type: AverageValue
            averageValue: "10"

    Key parameters:

    Parameter          Description
    scaleTargetRef     The Deployment to scale, that is, the inference service Deployment created in Step 3.
    minReplicas        Minimum number of pods.
    maxReplicas        Maximum number of pods.
    sls.project        Name of the SLS project for your cluster, in the format k8s-log-{cluster id}. Replace {cluster id} with your actual cluster ID.
    sls.logstore       SLS Logstore name. Default is nginx-ingress.
    sls.ingress.route  Identifies the Ingress route to monitor, in the format {namespace}-{service name}-{service port}.
    averageValue       QPS threshold per pod that triggers scale-out. Set to 10 in this example.

  3. Apply the HPA manifest:

    kubectl apply -f bert-tfserving-eci-hpa.yaml

  4. Verify the HPA is active:

    kubectl get hpa

    Expected output:

    NAME                     REFERENCE                                                   TARGETS      MINPODS   MAXPODS   REPLICAS   AGE
    bert-tfserving-eci-hpa   Deployment/bert-tfserving-202207181536-tensorflow-serving   0/10 (avg)   1         10        1          116s
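
If the TARGETS column does not show a value, a mistyped sls.ingress.route label is one possible cause. The label follows the {namespace}-{service name}-{service port} format, so it can be derived from the Service listed earlier in this step. The following is a minimal sketch with a hypothetical helper name. It also checks the 63-character limit that Kubernetes places on label values; whether the metrics adapter imposes further constraints is not covered here.

```shell
# Build the sls.ingress.route label value.
# Arguments: NAMESPACE SERVICE_NAME SERVICE_PORT
ingress_route() {
  route="$1-$2-$3"
  # Kubernetes label values may be at most 63 characters long.
  if [ "${#route}" -gt 63 ]; then
    echo "route label exceeds 63 characters: $route" >&2
    return 1
  fi
  echo "$route"
}

# Reproduces the value used in bert-tfserving-eci-hpa.yaml.
ingress_route default bert-tfserving-202207181536-tensorflow-serving 8501
```

The example prints default-bert-tfserving-202207181536-tensorflow-serving-8501, which is 59 characters long and therefore a valid label value.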

Step 5: Expose the service over the Internet

By default, arena serve tensorflow assigns only a cluster IP address to the service. To accept external traffic, create an Internet-facing Ingress.

  1. Log on to the ACK console and click Clusters in the left-side navigation pane.

  2. Click the target cluster name. In the left navigation pane, choose Network > Ingresses.

  3. Select the namespace where the inference service resides from the Namespace drop-down list, then click Create Ingress. Configure the following parameters. For details, see Create an NGINX Ingress.

    Parameter     Example value
    Name          bert-tfserving
    Domain name   test.example.com (use your own domain)
    Path          /
    Rule          ImplementationSpecific (default)
    Service name  bert-tfserving-202207181536-tensorflow-serving
    Port          8501

  4. On the Ingresses page, find the Ingress you created and note its address in the Rules column.
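
Before running a stress test, it can help to confirm that the Ingress routes requests to TensorFlow Serving. The following is a minimal sketch, assuming that the REST port (8501) serves the standard TensorFlow Serving model status endpoint GET /v1/models/<model name> and that you pass the Ingress address noted above. The function names are hypothetical. The Host header is set explicitly in case the domain name does not yet resolve to the Ingress address.

```shell
# Build the model status URL.
# Arguments: INGRESS_ADDRESS MODEL_NAME
model_status_url() {
  printf 'http://%s/v1/models/%s\n' "$1" "$2"
}

# Query the model status through the Ingress.
# Not called by this snippet; requires curl and network access.
# Arguments: INGRESS_ADDRESS MODEL_NAME DOMAIN_NAME
check_model() {
  curl -sS -H "Host: $3" "$(model_status_url "$1" "$2")"
}
```

For example, check_model <ingress-address> chnsenticorp test.example.com should return a JSON document that describes the state of each loaded model version.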

Step 6: Verify elastic scaling with a stress test

Use the Ingress address from Step 5 to send load to the inference service.

  • When QPS exceeds the averageValue (10 in this example), HPA triggers scale-out and new pods are scheduled on ECI. The total number of pods stays within maxReplicas (10).

  • When QPS drops below averageValue, HPA triggers scale-in.
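
The source does not name a load-testing tool, so the following is only a minimal sketch of a fixed-rate load generator built on curl. It assumes that requests to the TensorFlow Serving model status endpoint are counted in the Ingress QPS metric in the same way as prediction requests. The function name, rate, and duration are hypothetical.

```shell
# Send RATE requests per second for DURATION seconds.
# Arguments: URL HOST_HEADER RATE DURATION
send_load() {
  url=$1; host=$2; rate=$3; duration=$4
  sent=0
  second=0
  while [ "$second" -lt "$duration" ]; do
    i=0
    while [ "$i" -lt "$rate" ]; do
      # Run each request in the background so slow responses
      # do not lower the send rate.
      curl -s -o /dev/null --max-time 5 -H "Host: $host" "$url" &
      i=$((i + 1))
      sent=$((sent + 1))
    done
    sleep 1
    second=$((second + 1))
  done
  # Wait for outstanding requests before reporting.
  wait
  echo "sent $sent requests"
}
```

For example, send_load "http://<ingress-address>/v1/models/chnsenticorp" test.example.com 30 120 sends about 30 requests per second for two minutes. While it runs, kubectl get hpa bert-tfserving-eci-hpa -w in a second terminal shows the TARGETS and REPLICAS columns change as HPA reacts.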