Alibaba Cloud Service Mesh: Multi-model inference serving with ModelMesh

Last Updated: Jun 20, 2026

When you need to run multiple machine learning models for inference, use ModelMesh to deploy and manage multi-model inference services. Based on KServe ModelMesh, this feature is optimized for high-volume, high-density, and frequently changing model use cases. It intelligently loads models into memory and unloads them to balance responsiveness and compute resources, simplifying the deployment and operation of multi-model inference services while improving inference efficiency and performance.

Prerequisites

Note

This topic uses an ASM ingress gateway as the cluster gateway. The gateway is named ingressgateway by default and exposes port 8008 for HTTP traffic.

Features

ModelMesh provides the following features:

Feature                             Description
Cache management                    • Automatically optimizes and manages pod memory based on usage frequency and recency.
                                    • Loads and unloads model replicas based on usage frequency and current request volume.
Intelligent placement and loading   • Balances model placement across pods based on cache age and request load.
                                    • Uses queues to handle concurrent model loading and minimize impact on runtime traffic.
Resilience                          Automatically retries failed model loads on different pods.
Rolling updates                     Automatically and seamlessly handles rolling model updates.

Step 1: Enable ModelMesh in ASM

  1. Log on to the ASM console. In the left-side navigation pane, choose Service Mesh > Mesh Management.

  2. On the Mesh Management page, click the name of the target ASM instance. In the left-side navigation pane, choose Ecosystem > KServe on ASM.

  3. On the KServe on ASM page, click Install Model Service Mesh to enable the ModelMesh feature.

    Note: KServe depends on CertManager. Installing KServe automatically installs the CertManager component. If you use a self-managed CertManager, disable Automatically install the CertManager component in the cluster.

  4. After a few minutes, when the components are ready, run the following command using the KubeConfig for the cluster to verify that the ServingRuntime resources are ready.

    kubectl get servingruntimes -n modelmesh-serving

    Expected output:

    NAME                DISABLED   MODELTYPE     CONTAINERS   AGE
    mlserver-1.x                   sklearn       mlserver     1m
    ovms-1.x                       openvino_ir   ovms         1m
    torchserve-0.x                 pytorch-mar   torchserve   1m
    triton-2.x                     keras         triton       1m

    A ServingRuntime defines a pod template for serving one or more specific model formats. ModelMesh automatically provisions the corresponding pod based on the framework of the deployed model.

    The default runtimes and their supported model formats are listed in the following table. For more information, see supported-model-formats. If these model servers do not meet your requirements, you can create a custom model serving runtime. For more information, see Customize a model serving runtime with ModelMesh.

    Model serving runtime   Supported model frameworks
    mlserver-1.x            sklearn, xgboost, lightgbm
    ovms-1.x                openvino_ir, onnx
    torchserve-0.x          pytorch-mar
    triton-2.x              tensorflow, pytorch, onnx, tensorrt
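The readiness check in step 4 can also be scripted instead of read by eye. The following is a minimal sketch, not part of kubectl or ASM: `missing_runtimes` is a hypothetical helper name, and it assumes the four default runtime names listed in the table above.

```shell
# Hypothetical helper (not part of kubectl or ASM): reads the output of
# `kubectl get servingruntimes` on stdin and prints each default runtime
# that does not appear in the NAME column.
missing_runtimes() {
  input=$(cat)
  for rt in mlserver-1.x ovms-1.x torchserve-0.x triton-2.x; do
    # awk exits 0 only if some row has the runtime name in column 1.
    printf '%s\n' "$input" |
      awk -v rt="$rt" '$1 == rt { found = 1 } END { exit !found }' ||
      printf '%s\n' "$rt"
  done
}

# Usage against a live cluster (prints nothing when all four are present):
#   kubectl get servingruntimes -n modelmesh-serving | missing_runtimes
```

Keeping the parsing separate from the kubectl call means the helper can be tried on saved output before it is pointed at a cluster.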

Step 2: Configure the ASM environment

  1. Synchronize namespaces from the Kubernetes cluster to the ASM instance. For more information, see Synchronize auto-injection labels from a data plane cluster to an ASM instance. After synchronization, confirm that the modelmesh-serving namespace exists.

  2. Create an ingress gateway rule.

    1. Create a file named grpc-gateway.yaml with the following content.

      apiVersion: networking.istio.io/v1beta1
      kind: Gateway
      metadata:
        name: grpc-gateway
        namespace: modelmesh-serving
      spec:
        selector:
          istio: ingressgateway
        servers:
          - hosts:
              - '*'
            port:
              name: grpc
              number: 8008
              protocol: GRPC
      
    2. Using the KubeConfig for the cluster associated with your ASM instance, run the following command to create the gateway rule.

      kubectl apply -f grpc-gateway.yaml
  3. Create a virtual service.

    1. Create a file named vs-modelmesh-serving-service.yaml with the following content.

      apiVersion: networking.istio.io/v1beta1
      kind: VirtualService
      metadata:
        name: vs-modelmesh-serving-service
        namespace: modelmesh-serving
      spec:
        gateways:
          - grpc-gateway
        hosts:
          - '*'
        http:
          - match:
              - port: 8008
            name: default
            route:
              - destination:
                  host: modelmesh-serving
                  port:
                    number: 8033
      
    2. Using the KubeConfig for the cluster associated with your ASM instance, run the following command to create the virtual service.

      kubectl apply -f vs-modelmesh-serving-service.yaml
  4. Configure the gRPC-JSON transcoder.

    1. Create a file named grpcjsontranscoder-for-kservepredictv2.yaml with the following content.

      apiVersion: istio.alibabacloud.com/v1beta1
      kind: ASMGrpcJsonTranscoder
      metadata:
        name: grpcjsontranscoder-for-kservepredictv2
        namespace: istio-system
      spec:
        builtinProtoDescriptor: kserve_predict_v2
        isGateway: true
        portNumber: 8008
        workloadSelector:
          labels:
            istio: ingressgateway
    2. Using the KubeConfig for the cluster associated with your ASM instance, run the following command to deploy the gRPC-JSON transcoder.

      kubectl apply -f grpcjsontranscoder-for-kservepredictv2.yaml
    3. Create a file named grpcjsontranscoder-increasebufferlimit.yaml with the following content. This configuration raises the response size limit by setting per_connection_buffer_limit_bytes to 100000000 bytes (about 100 MB) on the gateway listener for port 8008.

      apiVersion: networking.istio.io/v1alpha3
      kind: EnvoyFilter
      metadata:
        labels:
          asm-system: "true"
          manager: asm-voyage
          provider: asm
        name: grpcjsontranscoder-increasebufferlimit
        namespace: istio-system
      spec:
        configPatches:
        - applyTo: LISTENER
          match:
            context: GATEWAY
            listener:
              portNumber: 8008
            proxy:
              proxyVersion: ^1.*
          patch:
            operation: MERGE
            value:
              per_connection_buffer_limit_bytes: 100000000
        workloadSelector:
          labels:
            istio: ingressgateway
      
    4. Using the KubeConfig for the cluster associated with your ASM instance, run the following command to deploy the EnvoyFilter.

      kubectl apply -f grpcjsontranscoder-increasebufferlimit.yaml
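Step 2 applies four manifests, and a typo in any file name is easy to miss. As a rough sketch, the helper below uses `kubectl get -f`, which looks up the resources named in a manifest file, to confirm that each one exists. `verify_applied` is a hypothetical name, and the sketch assumes you run it with the same KubeConfig and working directory that you used for `kubectl apply`.

```shell
# Hypothetical helper: for each manifest file passed as an argument, asks
# the API server whether the resources declared in that file exist.
# Returns non-zero if any manifest could not be found.
verify_applied() {
  failed=0
  for manifest in "$@"; do
    if kubectl get -f "$manifest" >/dev/null 2>&1; then
      printf 'ok       %s\n' "$manifest"
    else
      printf 'missing  %s\n' "$manifest"
      failed=1
    fi
  done
  return "$failed"
}

# Usage with the four files created in this step:
#   verify_applied grpc-gateway.yaml vs-modelmesh-serving-service.yaml \
#     grpcjsontranscoder-for-kservepredictv2.yaml \
#     grpcjsontranscoder-increasebufferlimit.yaml
```

Reading the resource identity from the files avoids hard-coding resource kinds and names a second time.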

Step 3: Deploy a sample model

  1. Create a StorageClass. For more information, see Use dynamic NAS volumes.

    1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

    2. On the Clusters page, click the name of your cluster. In the left-side navigation pane, choose Volumes > StorageClasses.

    3. In the upper-right corner of the StorageClasses page, click Create, configure the following parameters, and then click Create.

      Set Name to alibabacloud-cnfs-nas. Set Volume Type to NAS and Storage Driver to CSI. Set Reclaim Policy to Delete. In the Mount Options section, add entries for nolock,tcp,noresvport and vers=3. Select the appropriate Mount Point Domain and set Path to /.

  2. Create a persistent volume claim (PVC).

    1. Create a file named my-models-pvc.yaml with the following content.

      apiVersion: v1
      kind: PersistentVolumeClaim
      metadata:
        name: my-models-pvc
        namespace: modelmesh-serving
      spec:
        accessModes:
          - ReadWriteMany
        resources:
          requests:
            storage: 1Gi
        storageClassName: alibabacloud-cnfs-nas
        volumeMode: Filesystem
    2. Using the KubeConfig for the ACK cluster, run the following command to create the persistent volume claim.

      kubectl apply -f my-models-pvc.yaml
    3. Run the following command to view the PVCs in the modelmesh-serving namespace.

      kubectl get pvc -n modelmesh-serving

      Expected output:

      NAME             STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS            AGE
      my-models-pvc    Bound    nas-379c32e1-c0ef-43f3-8277-9eb4606b53f8   1Gi        RWX            alibabacloud-cnfs-nas   2h
  3. Create a pod to access the PVC.

    To use the new PVC, mount it as a volume in a Kubernetes pod. You can then use that pod to upload model files to the persistent volume.

    1. Create a file named pvc-access.yaml with the following content.

      The following YAML creates a pod named pvc-access that requests the PVC you previously created by specifying the claim name "my-models-pvc".

      apiVersion: v1
      kind: Pod
      metadata:
        name: "pvc-access"
      spec:
        containers:
          - name: main
            image: ubuntu
            command: ["/bin/sh", "-ec", "sleep 10000"]
            volumeMounts:
              - name: "my-pvc"
                mountPath: "/mnt/models"
        volumes:
          - name: "my-pvc"
            persistentVolumeClaim:
              claimName: "my-models-pvc"
    2. Using the KubeConfig for the ACK cluster, run the following command to create the pod.

      kubectl apply -n modelmesh-serving -f pvc-access.yaml
    3. Confirm that the pvc-access pod is in the Running state.

      kubectl get pods -n modelmesh-serving | grep pvc-access

      Expected output:

      pvc-access             1/1     Running   0          51m
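Instead of re-running the two `kubectl get` checks until the PVC is Bound and the pod is Running, you can block on both conditions. This is a sketch under stated assumptions: `wait_for_pvc_access` is a hypothetical helper name, the 120-second timeouts are arbitrary, and the `--for=jsonpath` form of `kubectl wait` requires kubectl 1.23 or later.

```shell
# Hypothetical helper: blocks until my-models-pvc is Bound and the
# pvc-access pod is Ready, or until either wait times out.
# Assumes kubectl 1.23+, which added --for=jsonpath to `kubectl wait`.
wait_for_pvc_access() {
  ns="${1:-modelmesh-serving}"   # namespace, defaults to the one used in this topic
  kubectl wait --for=jsonpath='{.status.phase}'=Bound \
    pvc/my-models-pvc -n "$ns" --timeout=120s &&
    kubectl wait --for=condition=Ready \
      pod/pvc-access -n "$ns" --timeout=120s
}

# Usage:
#   wait_for_pvc_access && echo "volume is mounted and ready for uploads"
```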
  4. Store the model on the persistent volume.

    Add the AI model to the storage volume. This topic uses an MNIST handwritten digit recognition model trained with scikit-learn. You can download a copy of the mnist-svm.joblib model file from the kserve/modelmesh-minio-examples repository.

    1. Using the KubeConfig for the ACK cluster, run the following command to copy the mnist-svm.joblib model file to the /mnt/models folder of the pvc-access pod.

      kubectl -n modelmesh-serving cp mnist-svm.joblib pvc-access:/mnt/models/
    2. Run the following command to confirm that the model was uploaded successfully.

      kubectl -n modelmesh-serving exec -it pvc-access -- ls -alr /mnt/models/

      Expected output:

      -rw-r--r-- 1  501 staff 344817 Oct 30 11:23 mnist-svm.joblib
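A directory listing shows that the file arrived but not that it arrived intact. One way to check, sketched below, is to compare the POSIX `cksum` value and byte count of the local file with those of the copy in the pod. `verify_upload` is a hypothetical helper name, and the sketch assumes that `cksum` is available both locally and in the ubuntu image used by pvc-access.

```shell
# Hypothetical helper: compares the CRC and size of a local file with the
# copy under /mnt/models in the pvc-access pod.
verify_upload() {
  file="$1"
  # cksum prints "<CRC> <size> <name>"; keep only the first two fields.
  local_sum=$(cksum "$file" | awk '{ print $1, $2 }')
  remote_sum=$(kubectl -n modelmesh-serving exec pvc-access -- \
    cksum "/mnt/models/$(basename "$file")" | awk '{ print $1, $2 }')
  if [ -n "$local_sum" ] && [ "$local_sum" = "$remote_sum" ]; then
    echo "upload verified: $remote_sum"
  else
    echo "mismatch: local='$local_sum' remote='$remote_sum'" >&2
    return 1
  fi
}

# Usage:
#   verify_upload mnist-svm.joblib
```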
  5. Deploy the inference service.

    1. Create a file named sklearn-mnist.yaml with the following content.

      apiVersion: serving.kserve.io/v1beta1
      kind: InferenceService
      metadata:
        name: sklearn-mnist
        namespace: modelmesh-serving
        annotations:
          serving.kserve.io/deploymentMode: ModelMesh
      spec:
        predictor:
          model:
            modelFormat:
              name: sklearn
            storage:
              parameters:
                type: pvc
                name: my-models-pvc
              path: mnist-svm.joblib
    2. Using the KubeConfig for the ACK cluster, run the following command to deploy the sklearn-mnist inference service.

      kubectl apply -f sklearn-mnist.yaml
    3. After a few moments (deployment time varies with image pull speed), run the following command to verify that the sklearn-mnist inference service is ready.

      kubectl get isvc -n modelmesh-serving

      A READY status of True indicates a successful deployment.

      NAME            URL                                               READY
      sklearn-mnist   grpc://modelmesh-serving.modelmesh-serving:8033   True
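Because deployment time varies, a polling loop can replace repeated manual checks. The sketch below assumes that the READY column reflects a status condition of type Ready on the InferenceService, which it reads with a JSONPath query. `wait_for_isvc` is a hypothetical helper name, and the default timeout and interval are arbitrary.

```shell
# Hypothetical helper: polls an InferenceService until its Ready condition
# is "True" or the timeout (in seconds) elapses.
#   $1 = InferenceService name, $2 = timeout (default 300), $3 = interval (default 5)
wait_for_isvc() {
  name="$1"
  timeout="${2:-300}"
  interval="${3:-5}"
  waited=0
  while [ "$waited" -lt "$timeout" ]; do
    # Assumption: readiness is exposed as a condition of type "Ready".
    ready=$(kubectl get isvc "$name" -n modelmesh-serving \
      -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}' \
      2>/dev/null) || ready=""
    if [ "$ready" = "True" ]; then
      return 0
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  return 1
}

# Usage:
#   wait_for_isvc sklearn-mnist 600 10 && echo "sklearn-mnist is ready"
```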
  6. Send an inference request.

    Use the curl command to send an inference request to the sklearn-mnist model. Before you run the command, set ASM_GW_IP to the IP address of the ASM ingress gateway. The data array contains the grayscale values of the 64 pixels of a scanned image of a handwritten digit.

    MODEL_NAME="sklearn-mnist"
    ASM_GW_IP=""
    curl -X POST -k "http://${ASM_GW_IP}:8008/v2/models/${MODEL_NAME}/infer" -d '{"inputs": [{"name": "predict", "shape": [1, 64], "datatype": "FP32", "contents": {"fp32_contents": [0.0, 0.0, 1.0, 11.0, 14.0, 15.0, 3.0, 0.0, 0.0, 1.0, 13.0, 16.0, 12.0, 16.0, 8.0, 0.0, 0.0, 8.0, 16.0, 4.0, 6.0, 16.0, 5.0, 0.0, 0.0, 5.0, 15.0, 11.0, 13.0, 14.0, 0.0, 0.0, 0.0, 0.0, 2.0, 12.0, 16.0, 13.0, 0.0, 0.0, 0.0, 0.0, 0.0, 13.0, 16.0, 16.0, 6.0, 0.0, 0.0, 0.0, 0.0, 16.0, 16.0, 16.0, 7.0, 0.0, 0.0, 0.0, 0.0, 11.0, 13.0, 12.0, 1.0, 0.0]}}]}'

    The following JSON response indicates that the model recognized the digit as 8.

    {
     "modelName": "sklearn-mnist__isvc-3c10c62d34",
     "outputs": [
      {
       "name": "predict",
       "datatype": "INT64",
       "shape": [
        "1",
        "1"
       ],
       "contents": {
        "int64Contents": [
         "8"
        ]
       }
      }
     ]
    }
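The request body and the response above follow a fixed shape, so both ends can be wrapped in small helpers when you want to send other images. This is an illustrative sketch only: `build_infer_payload` and `extract_prediction` are hypothetical names, the input is assumed to be 64 comma-separated pixel values, and the sed-based parsing handles only single-value responses like the one shown (a JSON tool is more robust for anything larger).

```shell
# Hypothetical helper: wraps 64 comma-separated pixel values in the request
# body used by the curl command in this step. Fails on any other count.
build_infer_payload() {
  pixels="$1"
  count=$(printf '%s\n' "$pixels" | awk -F',' '{ print NF }')
  if [ "$count" -ne 64 ]; then
    echo "expected 64 pixel values, got $count" >&2
    return 1
  fi
  printf '{"inputs": [{"name": "predict", "shape": [1, 64], "datatype": "FP32", "contents": {"fp32_contents": [%s]}}]}\n' "$pixels"
}

# Hypothetical helper: reads a response on stdin and prints the single
# value in int64Contents. Whitespace is stripped first so that the
# pretty-printed response matches on one line.
extract_prediction() {
  tr -d ' \n\t' |
    sed -n 's/.*"int64Contents":\["\([0-9]*\)"].*/\1/p'
}

# Usage (PIXELS holds the 64 values; ASM_GW_IP and MODEL_NAME as above):
#   build_infer_payload "$PIXELS" |
#     curl -s -X POST "http://${ASM_GW_IP}:8008/v2/models/${MODEL_NAME}/infer" -d @- |
#     extract_prediction
```

Validating the value count before sending catches truncated input locally rather than through a server-side shape error.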

Related topics

  • To accommodate varying environment needs, optimize inference efficiency, or control resource allocation for your multi-model deployments, customize a model serving runtime. This allows you to fine-tune the environment and ensure that each model executes under optimal conditions. For more information, see Customize a model serving runtime with ModelMesh.

  • To process large amounts of natural language data or build complex language understanding systems, convert a large language model into an inference service. For more information, see Convert a large language model into an inference service.

  • If your pods encounter exceptions at runtime, see Troubleshoot pod issues.