Container Service for Kubernetes: Elastic Container Instance-based elastic inference

Last Updated: Oct 11, 2023

After a model training job is completed, the model is usually deployed as an inference service. The number of calls to an inference service changes dynamically with business demand, so elastic scaling is required to handle varying loads and control costs. Conventional deployment solutions cannot meet the needs of large-scale, highly concurrent systems. Alibaba Cloud allows you to deploy workloads on elastic container instances to enable elastic scaling of inference services. This topic describes how to run elastic inference workloads on elastic container instances.

Prerequisites

  • A model is ready for deployment. In this topic, a BERT model trained with TensorFlow 1.15 is used.

  • The ack-virtual-node, ack-alibaba-cloud-metrics-adapter, and arena components are installed. For more information about how to manage the components, see Manage components. For more information about ack-virtual-node, see Connection overview.

Procedure

  1. Upload the trained model to an Object Storage Service (OSS) bucket. For more information, see Upload objects.
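
    For example, if you use the ossutil command-line tool, you can recursively copy the model directory to the bucket. The local path and bucket name below are placeholders; replace them with your own values. The destination prefix matches the model path that the inference service mounts in a later step.

      ossutil cp -r ./models/tensorflow/chnsenticorp oss://Your-Bucket/models/tensorflow/chnsenticorp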

  2. Create a persistent volume (PV) and a persistent volume claim (PVC).

    1. Create a file named pvc.yaml and copy the following code into the file:

      apiVersion: v1
      kind: PersistentVolume
      metadata:
        name: model-csi-pv
      spec:
        capacity:
          storage: 5Gi
        accessModes:
          - ReadWriteMany
        persistentVolumeReclaimPolicy: Retain
        csi:
          driver: ossplugin.csi.alibabacloud.com
          volumeHandle: model-csi-pv # The value must be the same as the name of the PV. 
          volumeAttributes:
            bucket: "Your Bucket"
            url: "Your oss url"
            akId: "Your Access Key Id"
            akSecret: "Your Access Key Secret"
            otherOpts: "-o max_stat_cache_size=0 -o allow_other"
      ---
      apiVersion: v1
      kind: PersistentVolumeClaim
      metadata:
        name: model-pvc
      spec:
        accessModes:
        - ReadWriteMany
        resources:
          requests:
            storage: 5Gi

      The parameters are described as follows:

      • bucket: The name of the OSS bucket, which is globally unique in OSS. For more information, see Bucket naming conventions.

      • url: The URL that is used to access an object in the bucket. For more information, see Obtain the URL of a single object or the URLs of multiple objects.

      • akId and akSecret: The AccessKey ID and AccessKey secret that are used to access the OSS bucket. We recommend that you access the OSS bucket as a Resource Access Management (RAM) user. For more information, see Create an AccessKey pair.

      • otherOpts: Custom parameters for mounting the OSS bucket.

        • Set -o max_stat_cache_size=0 to disable metadata caching. If this feature is disabled, the system retrieves the latest metadata from OSS each time it attempts to access objects in OSS.

        • Set -o allow_other to allow other users to access the OSS bucket that you mounted.

        For more information about other parameters, see Custom parameters supported by ossfs.

    2. Run the following command to create the PV and PVC:

      kubectl apply -f pvc.yaml
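
      To verify that the PV and PVC were created and that the PVC is bound, you can run:

      kubectl get pv model-csi-pv
      kubectl get pvc model-pvc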
  3. Deploy an inference service.

    1. Deploy the inference service. You can use annotations to specify the resource type that you want to request:

      • alibabacloud.com/burst-resource: Valid values:

        • If the annotation is left empty, only existing Elastic Compute Service (ECS) resources in the cluster are used. This is the default setting.

        • eci: Elastic container instances are used when the ECS resources in the cluster are insufficient.

        • eci_only: Only elastic container instances are used. The ECS resources in the cluster are not used.

      • k8s.aliyun.com/eci-use-specs: To use the GPU resources of elastic container instances, you must use this annotation to specify a GPU-accelerated instance type.

      Run the following command to deploy the service:

      arena serve tensorflow \
         --namespace=default \
         --name=bert-tfserving \
         --model-name=chnsenticorp  \
         --gpus=1  \
         --image=tensorflow/serving:1.15.0-gpu \
         --data=model-pvc:/data \
         --model-path=/data/models/tensorflow/chnsenticorp \
         --version-policy=specific:1623831335 \
         --annotation=alibabacloud.com/burst-resource=eci_only \
         --annotation=k8s.aliyun.com/eci-use-specs=ecs.gn6i-c4g1.xlarge
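
      The two annotations take effect on pods. The following is a minimal sketch of how they would appear in the pod template of a plain Kubernetes Deployment, independent of arena; the object names and labels are illustrative and are not the objects that arena generates.

      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: bert-tfserving-example        # illustrative name
      spec:
        replicas: 1
        selector:
          matchLabels:
            app: bert-tfserving-example
        template:
          metadata:
            labels:
              app: bert-tfserving-example
            annotations:
              alibabacloud.com/burst-resource: eci_only           # use only elastic container instances
              k8s.aliyun.com/eci-use-specs: ecs.gn6i-c4g1.xlarge  # GPU-accelerated instance type
          spec:
            containers:
            - name: serving
              image: tensorflow/serving:1.15.0-gpu
              resources:
                limits:
                  nvidia.com/gpu: 1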
    2. Run the following command to query the status of the inference service:

      arena serve list

      Expected output:

      NAME            TYPE        VERSION       DESIRED  AVAILABLE  ADDRESS        PORTS                   GPU
      bert-tfserving  Tensorflow  202207181536  1        1          172.16.52.170  GRPC:8500,RESTFUL:8501  1
    3. Run the following command to query the status of the pods:

      kubectl get pods -o wide

      Expected output:

      NAME                                                              READY   STATUS    RESTARTS   AGE    IP              NODE                           NOMINATED NODE   READINESS GATES
      bert-tfserving-202207181536-tensorflow-serving-547797c546-djh58   1/1     Running   0          114s   192.168.0.246   virtual-kubelet-cn-beijing-h   <none>           <none>

      The output shows that the pod is scheduled to the virtual node virtual-kubelet-cn-beijing-h. This indicates that the pod is deployed on an elastic container instance.
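
      To confirm which nodes in the cluster are virtual nodes, you can list nodes by label. The type=virtual-kubelet selector below is an assumption about how the ack-virtual-node component labels its nodes; adjust it if your virtual nodes use a different label.

      kubectl get nodes -l type=virtual-kubelet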

  4. Configure a Horizontal Pod Autoscaler (HPA). The HPA can automatically adjust the number of replicated pods in a Kubernetes cluster based on workloads.

    1. Run the following command to query the Deployment of the inference service:

      kubectl get deployment

      Expected output:

      NAME                                             READY   UP-TO-DATE   AVAILABLE   AGE
      bert-tfserving-202207181536-tensorflow-serving   1/1     1            1           2m18s
    2. Run the following command to query the Service:

      kubectl get service

      Expected output:

      NAME                                             TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)             AGE
      bert-tfserving-202207181536-tensorflow-serving   ClusterIP   172.16.52.170   <none>        8500/TCP,8501/TCP   2m45s
    3. Create a file named bert-tfserving-eci-hpa.yaml and copy the following code into the file:

      apiVersion: autoscaling/v2beta2
      kind: HorizontalPodAutoscaler
      metadata:
        name: bert-tfserving-eci-hpa
        namespace: default
      spec:
        scaleTargetRef:
          apiVersion: apps/v1
          kind: Deployment
          name: bert-tfserving-202207181536-tensorflow-serving
        minReplicas: 1
        maxReplicas: 10
        metrics:
        - type: External
          external:
            metric:
              name: sls_ingress_qps
              selector:
                matchLabels:
                  sls.project: "k8s-log-{cluster id}"
                  sls.logstore: "nginx-ingress"
                  sls.ingress.route: "default-bert-tfserving-202207181536-tensorflow-serving-8501"
            target:
              type: AverageValue
              averageValue: 10

      The parameters are described as follows:

      • scaleTargetRef: Specifies the object to which the HPA is bound. In this example, the value is set to the name of the Deployment of the inference service that you queried in Step a.

      • minReplicas: The minimum number of replicated pods.

      • maxReplicas: The maximum number of replicated pods.

      • sls.project: The name of the Log Service project that is used by the cluster. The value must be in the format of k8s-log-{cluster id}.

      • sls.logstore: The name of the Log Service Logstore. The default value is nginx-ingress.

      • sls.ingress.route: The Ingress route that is used to expose the service. In this example, the value is set to {namespace}-{service name}-{service port}.

      • metric.name: The name of the metric. In this example, the value is set to sls_ingress_qps.

      • target.averageValue: The queries per second (QPS) value that triggers scale-out activities. In this example, the value is set to 10. A scale-out activity is triggered when the QPS value is greater than 10.
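
      Then run the following command to create the HPA from the file:

      kubectl apply -f bert-tfserving-eci-hpa.yaml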

    4. Run the following command to query the status of the HPA:

      kubectl get hpa

      Expected output:

      NAME                     REFERENCE                                                   TARGETS      MINPODS   MAXPODS   REPLICAS   AGE
      bert-tfserving-eci-hpa   Deployment/bert-tfserving-202207181536-tensorflow-serving   0/10 (avg)   1         10        1          116s
  5. Configure an Internet-facing Ingress.

    By default, the inference service that is deployed by running the arena serve tensorflow command is assigned only a cluster IP address. The service cannot be accessed over the Internet. Therefore, you must create an Internet-facing Ingress for the inference service.

    1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

    2. On the Clusters page, click the name of the cluster that you want to manage and choose Network > Ingresses in the left-side navigation pane.

    3. In the upper part of the Ingresses page, select the namespace where the inference service resides from the Namespace drop-down list, and click Create Ingress. Set the parameters that are described in the following table. For more information about the parameters, see Create an NGINX Ingress.

      • Name: In this example, the value is set to bert-tfserving.

      • Rules:

        • Domain Name: Enter a custom domain name, such as test.example.com.

        • Mappings:

          • Path: The root path / is used in this example.

          • Rule: The default rule (ImplementationSpecific) is used in this example.

          • Service Name: In this example, the value is set to the service name bert-tfserving-202207181536-tensorflow-serving specified in Step b.

          • Port: Port 8501 is used in this example.

    4. On the Ingresses page, check the address of the Ingress in the Rules column.
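
    Alternatively, you can create the Ingress with kubectl instead of the console. The following manifest is a sketch of an equivalent configuration; the metadata name is illustrative, and ingressClassName: nginx assumes that the cluster's NGINX Ingress controller uses the class name nginx.

      apiVersion: networking.k8s.io/v1
      kind: Ingress
      metadata:
        name: bert-tfserving                # illustrative name
        namespace: default
      spec:
        ingressClassName: nginx             # assumes the NGINX Ingress class is named "nginx"
        rules:
        - host: test.example.com            # the custom domain name used in this example
          http:
            paths:
            - path: /
              pathType: ImplementationSpecific
              backend:
                service:
                  name: bert-tfserving-202207181536-tensorflow-serving
                  port:
                    number: 8501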

  6. Use the address that is obtained from Step 5 to perform stress tests on the inference service. If the QPS value is greater than the averageValue configured in the HPA, a scale-out activity is triggered and the number of pods does not exceed the value of maxReplicas. If the QPS value is smaller than the averageValue, a scale-in activity is triggered.
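
    For example, you can generate load with a benchmarking tool such as hey. The following command is a sketch: it assumes that hey is installed, that test.example.com resolves to the Ingress address, and that request.json contains a prediction request in the TensorFlow Serving REST format for the chnsenticorp model deployed above.

      hey -z 2m -c 10 -m POST -D request.json -H "Content-Type: application/json" http://test.example.com/v1/models/chnsenticorp:predict

    While the test runs, you can watch the HPA adjust the number of pods:

      kubectl get hpa bert-tfserving-eci-hpa -w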