
Alibaba Cloud Service Mesh:Use ModelMesh to create a custom model serving runtime

Last Updated: Mar 05, 2024

When you deploy multiple models that require different runtime environments, or when you need to improve model inference efficiency or control resource allocation, you can use Model Service Mesh (ModelMesh) to create custom model serving runtimes. The fine-tuned configurations of custom model serving runtimes ensure that each model runs in the most appropriate environment. This can help you improve service quality, reduce costs, and simplify O&M of complex models. This topic describes how to use ModelMesh Serving to customize a model serving runtime.


Prerequisites

A Container Service for Kubernetes (ACK) cluster is added to your Service Mesh (ASM) instance, and your ASM instance meets the version requirement for ModelMesh.

Feature description

By default, ModelMesh is integrated with the following model serving runtimes.

Triton Inference Server
- Developed by: NVIDIA
- Applicable frameworks: TensorFlow, PyTorch, TensorRT, and ONNX
- Description: This model server is suitable for high-performance, scalable, and low-latency inference services and provides tools for management and monitoring.

MLServer
- Developed by: Seldon
- Applicable frameworks: SKLearn, XGBoost, and LightGBM
- Description: This model server provides a unified API and framework and supports multiple frameworks and advanced features.

OpenVINO Model Server
- Developed by: Intel
- Applicable frameworks: OpenVINO IR and ONNX
- Description: This model server uses the hardware acceleration technology of Intel and supports multiple frameworks.

TorchServe
- Developed by: PyTorch
- Applicable frameworks: PyTorch (including the eager mode)
- Description: TorchServe is a lightweight and scalable model server developed by PyTorch.

If the preceding model servers cannot meet your requirements, for example, if your inference involves custom processing logic or your model uses a framework that the preceding model servers do not support, you can create a custom serving runtime.

Step 1: Create a custom serving runtime

A namespace-scoped ServingRuntime or a cluster-scoped ClusterServingRuntime defines the templates for pods that can serve one or more particular model formats. Each ServingRuntime or ClusterServingRuntime defines key information such as the container image of a runtime and a list of the supported model formats. Other configurations for the runtime can be passed by environment variables in the spec field.

The ServingRuntime CustomResourceDefinitions (CRDs) allow for improved flexibility and extensibility, enabling you to customize reusable runtimes without modifying the ModelMesh controller code or other resources in the controller namespace. This means that you can easily build a custom runtime to support your framework.

To create custom serving runtimes, you must build a new container image with support for the desired framework and then create a ServingRuntime resource that uses that image. This is especially easy if the framework of the desired runtime uses Python bindings. In this case, you can use the extension point of MLServer to add additional frameworks. MLServer provides a serving interface. ModelMesh Serving integrates MLServer as a ServingRuntime.

To build a Python-based custom serving runtime, perform the following steps:

  1. Implement a class that inherits from the MLModel class of MLServer.

    You can extend MLServer by adding an implementation of the MLModel class. Two main functions are involved: load() and predict(). Depending on your needs, you can use the load() function to load your model and the predict() function to make predictions. You can also view example implementations of the MLModel class in the MLServer documentation.
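The shape of such a class can be sketched as follows. This is a runnable illustration only: the base class here is a minimal stand-in so the sketch does not require MLServer to be installed, and the model, tensor names, and payload shapes are hypothetical. In a real runtime you would inherit from mlserver.MLModel and use MLServer's InferenceRequest/InferenceResponse types.

```python
import asyncio

class MLModel:
    """Minimal stand-in for mlserver.MLModel so this sketch runs standalone."""
    def __init__(self, settings=None):
        self._settings = settings
        self.ready = False

class AdditionModel(MLModel):
    async def load(self) -> bool:
        # In a real runtime, load your model artifacts here
        # (for example, with joblib.load on the model file).
        self._offset = 10  # toy "learned" parameter
        self.ready = True
        return self.ready

    async def predict(self, payload: dict) -> dict:
        # payload mimics the shape of an inference request:
        # named input tensors carrying "data" lists.
        data = payload["inputs"][0]["data"]
        result = [x + self._offset for x in data]
        return {"outputs": [{"name": "output-0", "data": result}]}

async def main() -> dict:
    model = AdditionModel()
    await model.load()
    request = {"inputs": [{"name": "input-0", "data": [1, 2, 3]}]}
    return await model.predict(request)

print(asyncio.run(main()))  # {'outputs': [{'name': 'output-0', 'data': [11, 12, 13]}]}
```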

  2. Package the model class and dependencies into a container image.

    After the model class is implemented, you need to package its dependencies, including MLServer, into an image that is supported as a ServingRuntime resource. MLServer provides a helper for you to build an image by using the mlserver build command. For more information, see Building a custom image.
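For example, assuming the model class and its dependencies live in the current directory, the image can be built with the MLServer CLI. The image tag below is a placeholder; replace it with a name in a registry that your cluster can pull from.

```shell
# Build a ServingRuntime-compatible image from the current directory.
# "my-custom-runtime:0.1" is a placeholder tag.
mlserver build . -t my-custom-runtime:0.1
```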

  3. Create a new ServingRuntime resource by using that image.

    1. Create a new ServingRuntime resource by using the following content and point it to the image you created:

      apiVersion: serving.kserve.io/v1alpha1
      kind: ServingRuntime
      metadata:
        name: {{CUSTOM-RUNTIME-NAME}}
      spec:
        supportedModelFormats:
          - name: {{MODEL-FORMAT-NAMES}}
            version: "1"
            autoSelect: true
        multiModel: true
        grpcDataEndpoint: port:8001
        grpcEndpoint: port:8085
        containers:
          - name: mlserver
            image: {{CUSTOM-IMAGE-NAME}}
            env:
              - name: MLSERVER_MODELS_DIR
                value: "/models/_mlserver_models/"
              - name: MLSERVER_GRPC_PORT
                value: "8001"
              # The default HTTP port 8080 conflicts with ModelMesh.
              - name: MLSERVER_HTTP_PORT
                value: "8002"
              # Prevent MLServer from serving a fallback model at startup.
              - name: MLSERVER_LOAD_MODELS_AT_STARTUP
                value: "false"
              # A dummy model name so that MLServer does not return an error
              # before any model is loaded.
              - name: MLSERVER_MODEL_NAME
                value: dummy-model
              # Listen on localhost only, inside the pod.
              - name: MLSERVER_HOST
                value: "127.0.0.1"
              # Remove the gRPC maximum message size limit.
              - name: MLSERVER_GRPC_MAX_MESSAGE_LENGTH
                value: "-1"
            resources:
              requests:
                cpu: 500m
                memory: 1Gi
              limits:
                cpu: "5"
                memory: 1Gi
        builtInAdapter:
          serverType: mlserver
          runtimeManagementPort: 8001
          memBufferBytes: 134217728
          modelLoadingTimeoutMillis: 90000




Replace the following placeholders in the YAML file:

      {{CUSTOM-RUNTIME-NAME}}: the name of the runtime, such as my-model-server-0.x.

      {{MODEL-FORMAT-NAMES}}: the list of model formats that the runtime supports, such as my-model. For example, when you deploy a model of the my-model format, ModelMesh checks the model format against this list to determine whether this runtime is suitable for the model.

      {{CUSTOM-IMAGE-NAME}}: the image created in Step 2.
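The format-matching behavior can be sketched in a few lines of Python. This is a conceptual illustration of how a model format is checked against each runtime's supported formats, not the actual ModelMesh controller code; the runtime names and format entries are hypothetical.

```python
# Each entry mirrors a ServingRuntime's supportedModelFormats section.
runtimes = [
    {"name": "triton-2.x",
     "supportedModelFormats": [{"name": "onnx", "autoSelect": True}]},
    {"name": "my-model-server-0.x",
     "supportedModelFormats": [{"name": "my-model", "autoSelect": True}]},
]

def matching_runtimes(model_format: str, runtimes: list) -> list:
    """Return runtimes that declare the model's format with autoSelect enabled."""
    return [
        rt["name"]
        for rt in runtimes
        if any(fmt["name"] == model_format and fmt.get("autoSelect", False)
               for fmt in rt["supportedModelFormats"])
    ]

print(matching_runtimes("my-model", runtimes))  # ['my-model-server-0.x']
```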

    2. Run the following command to create a ServingRuntime resource:

      kubectl apply -f ${Name of the YAML file}.yaml

      After you create the ServingRuntime resource, you can see the new custom runtime in your ModelMesh deployment.
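For example, you can list the ServingRuntime resources to confirm that the custom runtime is registered (assuming ModelMesh Serving runs in the modelmesh-serving namespace):

```shell
# List ServingRuntime resources; the custom runtime should appear in the output.
kubectl get servingruntimes -n modelmesh-serving
```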

Step 2: Deploy a model

To deploy a model by using the newly created runtime, you must create an InferenceService resource to serve the model. This resource is the main interface used by KServe and ModelMesh to manage models. It represents the logical endpoint of the model for serving inferences.

  1. Create an InferenceService resource to serve the model by using the following content:

    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    metadata:
      name: my-model-sample
      namespace: modelmesh-serving
      annotations:
        serving.kserve.io/deploymentMode: ModelMesh
    spec:
      predictor:
        model:
          modelFormat:
            name: my-model
          runtime: my-model-server-0.x
          storage:
            key: localMinIO
            path: sklearn/mnist-svm.joblib

    In the YAML file, the InferenceService resource names the model my-model-sample and declares its model format my-model, the format supported by the example custom runtime created in the previous step. The optional runtime field is also specified, explicitly telling ModelMesh to use the my-model-server-0.x runtime to deploy this model. The storage field specifies where the model resides. In this case, the localMinIO instance that is deployed by using the quickstart guide of ModelMesh Serving is used.

  2. Run the following command to deploy the InferenceService resource:

    kubectl apply -f ${Name of the YAML file}.yaml
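You can then check the status of the InferenceService. Once the model has been loaded, its READY column shows True (assuming the modelmesh-serving namespace used in the example above):

```shell
# Check whether the model has been loaded and is ready to serve.
kubectl get inferenceservice my-model-sample -n modelmesh-serving
```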