Alibaba Cloud Service Mesh: Use Model Service Mesh to roll out a multi-model inference service

Last Updated: Mar 04, 2024

When you need to run multiple machine learning models to perform inference, you can use Model Service Mesh (ModelMesh) to roll out and manage a multi-model inference service. ModelMesh is implemented based on KServe ModelMesh and is optimized for high-scale, high-density, and frequently changing model use cases. ModelMesh intelligently loads and unloads models to and from memory to strike a balance between responsiveness and computational cost. This simplifies the deployment and O&M of a multi-model inference service and improves inference efficiency and performance.

Prerequisites

Note

In this example, an ASM ingress gateway is used as the gateway of the cluster. The default gateway name is ingressgateway, port 8008 is enabled, and the HTTP protocol is used.
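
Before you proceed, you can check whether the ingress gateway Service exposes port 8008. The following is a minimal check, assuming that the ASM ingress gateway is deployed as the istio-ingressgateway Service in the istio-system namespace of the ACK cluster (adjust the Service name and namespace if your deployment differs):

    # List the ports exposed by the ingress gateway Service.
    # Port 8008 should appear in the output if the prerequisite is met.
    kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[*].port}'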

Features

ModelMesh provides the following features:

  • Cache management: Pods are managed as a distributed least recently used (LRU) cache. Copies of models are loaded and unloaded based on usage frequency and current request volumes.

  • Intelligent placement and loading: Model placement is balanced by both the cache age across the pods and the request load. Queues are used to handle concurrent model loads and minimize the impact on runtime traffic.

  • Resiliency: Failed model loads are automatically retried in different pods.

  • Operational simplicity: Rolling model updates are handled automatically and seamlessly.

Step 1: Enable the ModelMesh feature in ASM

  1. Create a model-mesh.yaml file that contains the content in the following code block.

    If you set the value of the enabled parameter to true, the ModelMesh feature is enabled. If you set the value to false, the ModelMesh feature is disabled.

    apiVersion: istio.alibabacloud.com/v1beta1
    kind: ASMKServeConfig
    metadata:
      name: default
    spec:
      enabled: true
      multiModel: true
      tag: v0.11.0
  2. Use kubectl to connect to the ACK cluster (or ASM instance) based on the information in the kubeconfig file, and then run the following command to enable the ModelMesh feature:

    kubectl apply -f model-mesh.yaml
  3. Use kubectl to connect to the ACK cluster based on the information in the kubeconfig file, and then run the following command to check whether a ServingRuntime resource is available:

    kubectl get servingruntimes -n modelmesh-serving

    Expected output:

    NAME                DISABLED   MODELTYPE     CONTAINERS   AGE
    mlserver-1.x                   sklearn       mlserver     1m
    ovms-1.x                       openvino_ir   ovms         1m
    torchserve-0.x                 pytorch-mar   torchserve   1m
    triton-2.x                     keras         triton       1m

    A ServingRuntime resource defines the templates for pods that can serve one or more particular model formats. Pods are automatically provisioned depending on the framework of the deployed model.

    The following table describes the runtimes and model formats supported by ModelMesh. For more information, see Supported Model Formats. If these model servers cannot meet all of your specific requirements, you can create custom model serving runtimes. For more information, see Use ModelMesh to create a custom model serving runtime. To view the full definition of a runtime, including its pod template, see the example command after the table.

    ServingRuntime   Supported model frameworks
    mlserver-1.x     sklearn, xgboost, lightgbm
    ovms-1.x         openvino_ir, onnx
    torchserve-0.x   pytorch-mar
    triton-2.x       tensorflow, pytorch, onnx, tensorrt
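
    You can print one of the preceding ServingRuntime resources to see its pod template and the model formats that it supports. The following is a minimal example, assuming that the mlserver-1.x runtime listed in the preceding output exists in the modelmesh-serving namespace:

    # Print the full ServingRuntime definition, including the serving
    # container template and the supported model formats.
    kubectl get servingruntimes mlserver-1.x -n modelmesh-serving -o yaml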

Step 2: Configure an ASM environment

  1. Synchronize the modelmesh-serving namespace from the ACK cluster to the ASM instance. For more information, see Synchronize automatic sidecar proxy injection labels from a Kubernetes cluster on the data plane to an ASM instance. After synchronization, confirm that the modelmesh-serving namespace exists.

  2. Create an Istio gateway for the ingress gateway.

    1. Create a grpc-gateway.yaml file that contains the following content:

      apiVersion: networking.istio.io/v1beta1
      kind: Gateway
      metadata:
        name: grpc-gateway
        namespace: modelmesh-serving
      spec:
        selector:
          istio: ingressgateway
        servers:
          - hosts:
              - '*'
            port:
              name: grpc
              number: 8008
              protocol: GRPC
      
    2. Use kubectl to connect to the ACK cluster (or ASM instance) based on the information in the kubeconfig file, and then run the following command to create an Istio gateway:

      kubectl apply -f grpc-gateway.yaml
  3. Create a virtual service.

    1. Create a vs-modelmesh-serving-service.yaml file that contains the following content:

      apiVersion: networking.istio.io/v1beta1
      kind: VirtualService
      metadata:
        name: vs-modelmesh-serving-service
        namespace: modelmesh-serving
      spec:
        gateways:
          - grpc-gateway
        hosts:
          - '*'
        http:
          - match:
              - port: 8008
            name: default
            route:
              - destination:
                  host: modelmesh-serving
                  port:
                    number: 8033
      
    2. Use kubectl to connect to the ACK cluster (or ASM instance) based on the information in the kubeconfig file, and then run the following command to create a virtual service:

      kubectl apply -f vs-modelmesh-serving-service.yaml
  4. Configure the gRPC-JSON transcoder so that HTTP/JSON requests sent to the gateway can be transcoded into gRPC requests.

    1. Create a grpcjsontranscoder-for-kservepredictv2.yaml file that contains the following content:

      apiVersion: istio.alibabacloud.com/v1beta1
      kind: ASMGrpcJsonTranscoder
      metadata:
        name: grpcjsontranscoder-for-kservepredictv2
        namespace: istio-system
      spec:
        builtinProtoDescriptor: kserve_predict_v2
        isGateway: true
        portNumber: 8008
        workloadSelector:
          labels:
            istio: ingressgateway
    2. Use kubectl to connect to the ACK cluster (or ASM instance) based on the information in the kubeconfig file, and then run the following command to deploy the gRPC-JSON transcoder:

      kubectl apply -f grpcjsontranscoder-for-kservepredictv2.yaml
    3. Create a grpcjsontranscoder-increasebufferlimit.yaml file that contains the following content. The per_connection_buffer_limit_bytes parameter increases the per-connection buffer limit so that large responses can be returned.

      apiVersion: networking.istio.io/v1alpha3
      kind: EnvoyFilter
      metadata:
        labels:
          asm-system: "true"
          manager: asm-voyage
          provider: asm
        name: grpcjsontranscoder-increasebufferlimit
        namespace: istio-system
      spec:
        configPatches:
        - applyTo: LISTENER
          match:
            context: GATEWAY
            listener:
              portNumber: 8008
            proxy:
              proxyVersion: ^1.*
          patch:
            operation: MERGE
            value:
              per_connection_buffer_limit_bytes: 100000000
        workloadSelector:
          labels:
            istio: ingressgateway
      
    4. Use kubectl to connect to the ACK cluster (or ASM instance) based on the information in the kubeconfig file, and then run the following command to deploy the Envoy filter (after it is deployed, you can verify the configuration as shown in the check after this list):

      kubectl apply -f grpcjsontranscoder-increasebufferlimit.yaml
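
After you complete the preceding steps, you can optionally confirm that the gateway configuration is in place. The following is a minimal check, assuming the resource names used in this example and the same kubeconfig that you used when applying the files in this step:

    # Confirm that the Istio gateway and the virtual service exist.
    kubectl get gateways.networking.istio.io,virtualservices.networking.istio.io -n modelmesh-serving
    # Confirm that the buffer-limit Envoy filter exists.
    kubectl get envoyfilters.networking.istio.io grpcjsontranscoder-increasebufferlimit -n istio-system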

Step 3: Deploy a sample model

  1. Create a StorageClass. For more information, see Mount a dynamically provisioned NAS volume.

    1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

    2. On the Clusters page, click the name of the cluster that you want to manage and choose Volumes > StorageClasses in the left-side navigation pane.

    3. In the upper-right corner of the StorageClasses page, click Create, configure the parameters, and then click Create. In this example, the alibabacloud-cnfs-nas StorageClass is used, which the PVC in the next step references.

  2. Create a persistent volume claim (PVC).

    1. Create a my-models-pvc.yaml file that contains the following content:

      apiVersion: v1
      kind: PersistentVolumeClaim
      metadata:
        name: my-models-pvc
        namespace: modelmesh-serving
      spec:
        accessModes:
          - ReadWriteMany
        resources:
          requests:
            storage: 1Gi
        storageClassName: alibabacloud-cnfs-nas
        volumeMode: Filesystem
    2. Use kubectl to connect to the ACK cluster based on the information in the kubeconfig file, and then run the following command to create a PVC:

      kubectl apply -f my-models-pvc.yaml
    3. Run the following command to view the PVC in the modelmesh-serving namespace:

      kubectl get pvc -n modelmesh-serving

      Expected output:

      NAME             STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS            AGE
      my-models-pvc    Bound    nas-379c32e1-c0ef-43f3-8277-9eb4606b53f8   1Gi        RWX            alibabacloud-cnfs-nas   2h
  3. Create a pod to access the PVC.

    To use the new PVC, you must mount it as a volume on a Kubernetes pod and then use that pod to upload the model files to the persistent volume.

    1. Create a pvc-access.yaml file that contains the following content.

      The following YAML file creates a pod named pvc-access and instructs the Kubernetes controller to claim the previously requested PVC by specifying "my-models-pvc" as the claim name.

      apiVersion: v1
      kind: Pod
      metadata:
        name: "pvc-access"
      spec:
        containers:
          - name: main
            image: ubuntu
            command: ["/bin/sh", "-ec", "sleep 10000"]
            volumeMounts:
              - name: "my-pvc"
                mountPath: "/mnt/models"
        volumes:
          - name: "my-pvc"
            persistentVolumeClaim:
              claimName: "my-models-pvc"
    2. Use kubectl to connect to the ACK cluster based on the information in the kubeconfig file, and then run the following command to create a pod:

      kubectl apply -n modelmesh-serving -f pvc-access.yaml
    3. Run the following command to verify that the pvc-access pod is running:

      kubectl get pods -n modelmesh-serving | grep pvc-access

      Expected output:

      pvc-access             1/1     Running   0          51m
  4. Store the model on the persistent volume.

    Add the AI model to the persistent volume. In this example, the MNIST handwritten digit recognition model trained with scikit-learn is used. A copy of the mnist-svm.joblib model file can be downloaded from the kserve/modelmesh-minio-examples repository.

    1. Use kubectl to connect to the ACK cluster based on the information in the kubeconfig file, and then run the following command to copy the mnist-svm.joblib model file to the /mnt/models folder in the pvc-access pod:

      kubectl -n modelmesh-serving cp mnist-svm.joblib pvc-access:/mnt/models/
    2. Run the following command to verify that the model exists on the persistent volume:

      kubectl -n modelmesh-serving exec -it pvc-access -- ls -alr /mnt/models/

      Expected output:

      -rw-r--r-- 1  501 staff 344817 Oct 30 11:23 mnist-svm.joblib
  5. Deploy an inference service.

    1. Create a sklearn-mnist.yaml file that contains the following content:

      apiVersion: serving.kserve.io/v1beta1
      kind: InferenceService
      metadata:
        name: sklearn-mnist
        namespace: modelmesh-serving
        annotations:
          serving.kserve.io/deploymentMode: ModelMesh
      spec:
        predictor:
          model:
            modelFormat:
              name: sklearn
            storage:
              parameters:
                type: pvc
                name: my-models-pvc
              path: mnist-svm.joblib
    2. Use kubectl to connect to the ACK cluster based on the information in the kubeconfig file, and then run the following command to deploy the sklearn-mnist inference service:

      kubectl apply -f sklearn-mnist.yaml
    3. Wait a few dozen seconds (the exact time depends on how fast the image is pulled), and then run the following command to check whether the sklearn-mnist inference service is deployed:

      kubectl get isvc -n modelmesh-serving

      Expected output:

      NAME            URL                                               READY
      sklearn-mnist   grpc://modelmesh-serving.modelmesh-serving:8033   True
  6. Perform an inference.

    Run the following curl command to send an inference request to the sklearn-mnist model. Replace ASM_GW_IP with the IP address of the ASM ingress gateway (for a way to look up the address, see the example after the response below). The data array contains the grayscale values of the 64 pixels in the image scan of the digit to be classified.

    MODEL_NAME="sklearn-mnist"
    ASM_GW_IP="IP address of the ingress gateway"
    curl -X POST -k "http://${ASM_GW_IP}:8008/v2/models/${MODEL_NAME}/infer" -d '{"inputs": [{"name": "predict", "shape": [1, 64], "datatype": "FP32", "contents": {"fp32_contents": [0.0, 0.0, 1.0, 11.0, 14.0, 15.0, 3.0, 0.0, 0.0, 1.0, 13.0, 16.0, 12.0, 16.0, 8.0, 0.0, 0.0, 8.0, 16.0, 4.0, 6.0, 16.0, 5.0, 0.0, 0.0, 5.0, 15.0, 11.0, 13.0, 14.0, 0.0, 0.0, 0.0, 0.0, 2.0, 12.0, 16.0, 13.0, 0.0, 0.0, 0.0, 0.0, 0.0, 13.0, 16.0, 16.0, 6.0, 0.0, 0.0, 0.0, 0.0, 16.0, 16.0, 16.0, 7.0, 0.0, 0.0, 0.0, 0.0, 11.0, 13.0, 12.0, 1.0, 0.0]}}]}'

    The following code block shows the JSON response, which indicates that the scanned digit is classified as 8.

    {
     "modelName": "sklearn-mnist__isvc-3c10c62d34",
     "outputs": [
      {
       "name": "predict",
       "datatype": "INT64",
       "shape": [
        "1",
        "1"
       ],
       "contents": {
        "int64Contents": [
         "8"
        ]
       }
      }
     ]
    }
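
    In the preceding request, the ASM_GW_IP variable must be set to the IP address of the ASM ingress gateway. The following is a minimal way to look up the address, assuming that the gateway is exposed as the istio-ingressgateway LoadBalancer Service in the istio-system namespace of the ACK cluster (adjust the Service name and namespace if your deployment differs):

    # Read the external IP address assigned to the ingress gateway Service.
    ASM_GW_IP=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
    echo "${ASM_GW_IP}"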

References

  • When you deploy multiple models that require different runtime environments, or when you need to improve model inference efficiency or control resource allocation, you can use ModelMesh to create custom model serving runtimes. The fine-tuned configurations of custom model serving runtimes ensure that each model runs in the most appropriate environment. For more information, see Use ModelMesh to create a custom model serving runtime.

  • When you need to process large amounts of natural language data or want to build complex language understanding systems, you can use a large language model (LLM) as an inference service. For more information, see Use an LLM as an inference service.

  • When you encounter pod errors, you can troubleshoot them by referring to Pod troubleshooting.