
Alibaba Cloud Service Mesh:Create a custom model serving runtime with ModelMesh

Last Updated: Mar 11, 2026

ModelMesh includes built-in runtimes for common inference frameworks, but some workloads require custom pre/post-processing logic, unsupported frameworks, or fine-grained resource control. A custom ServingRuntime lets you define exactly how models are loaded and served, then register that runtime so ModelMesh schedules and scales models onto it automatically.

This guide walks through building a Python-based custom runtime on MLServer, packaging it as a container image, registering it as a ServingRuntime, and deploying a model against it.

Prerequisites

Before you begin, ensure that you have:

  • A Kubernetes cluster with ModelMesh Serving deployed

  • kubectl configured to access the cluster

  • A container registry that you can push images to

Built-in runtimes

Before building a custom runtime, check whether a built-in runtime already supports your model format.

| Model server | Developer | Supported frameworks | Best for |
| --- | --- | --- | --- |
| Triton Inference Server | NVIDIA | TensorFlow, PyTorch, TensorRT, ONNX | High-performance, scalable, low-latency inference with built-in monitoring |
| MLServer | Seldon | SKLearn, XGBoost, LightGBM | Unified API across multiple ML frameworks |
| OpenVINO Model Server | Intel | OpenVINO, ONNX | Intel hardware acceleration |
| TorchServe | PyTorch | PyTorch (including eager mode) | Lightweight PyTorch-native serving |

If none of these runtimes support your framework, or if your inference pipeline requires custom logic, create a custom serving runtime as described below.

How it works

A ServingRuntime (namespace-scoped) or ClusterServingRuntime (cluster-scoped) defines the pod template for serving one or more model formats. Each resource specifies:

  • The container image that runs the inference server

  • A list of supported model formats

  • Environment variables for runtime configuration

ServingRuntime is a Kubernetes CustomResourceDefinition (CRD), so you can create reusable runtimes without modifying the ModelMesh controller or any resources in the controller namespace.

For Python-based frameworks, the fastest path is to extend MLServer. MLServer provides the serving interface, you provide the model logic, and ModelMesh handles scheduling and scaling.

The workflow has three steps:

  1. Implement the model class by extending the MLServer MLModel class.

  2. Package the class and its dependencies into a container image.

  3. Register the image as a ServingRuntime in Kubernetes.

Step 1: Implement the MLModel class

Create a Python class that inherits from the MLModel class of MLServer. Implement two methods:

  • load() -- Initialize the model (load weights, warm up caches).

  • predict() -- Run inference on an incoming request.

from typing import List
from mlserver import MLModel, types
from mlserver.utils import get_model_uri

# List the filenames your model uses so MLServer can locate them.
WELLKNOWN_MODEL_FILENAMES = ["model.json", "model.dat"]


class CustomMLModel(MLModel):

    async def load(self) -> bool:
        model_uri = await get_model_uri(
            self._settings, wellknown_filenames=WELLKNOWN_MODEL_FILENAMES
        )
        # TODO: Load the model from model_uri and store it as an instance attribute.
        #       Example: self._model = joblib.load(model_uri)
        self.ready = True
        return self.ready

    async def predict(
        self, payload: types.InferenceRequest
    ) -> types.InferenceResponse:
        payload = self._check_request(payload)
        return types.InferenceResponse(
            model_name=self.name,
            model_version=self.version,
            outputs=self._predict_outputs(payload),
        )

    def _check_request(self, payload):
        # TODO: Validate the request -- check input tensor names, shapes, and types.
        return payload

    def _predict_outputs(self, payload) -> List[types.ResponseOutput]:
        # TODO: Extract input data from payload.
        # TODO: Run the data through your model's prediction logic.
        # TODO: Construct and return a list of ResponseOutput objects.
        outputs = []
        return outputs

For more examples, see the MLServer custom runtime documentation.
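To make the TODOs above concrete, here is an illustrative sketch of the data flow inside `_predict_outputs`. It is not the MLServer API itself: it uses plain dictionaries that mirror the fields of the V2 inference protocol tensors (`name`, `shape`, `datatype`, `data`) and a trivial "sum each input tensor" stand-in for real model logic, so it runs without MLServer installed.

```python
from typing import Dict, List


# Illustrative stand-in for CustomMLModel._predict_outputs. Each dict
# mirrors the V2 protocol tensor fields; the "model" just sums each
# input tensor's data. Replace the sum with your real inference call.
def predict_outputs(inputs: List[Dict]) -> List[Dict]:
    outputs = []
    for tensor in inputs:
        result = float(sum(tensor["data"]))  # stand-in for model inference
        outputs.append(
            {
                "name": f"output-{tensor['name']}",  # illustrative naming
                "shape": [1],        # one scalar result per input tensor
                "datatype": "FP32",  # V2 inference protocol datatype string
                "data": [result],
            }
        )
    return outputs


request_inputs = [
    {"name": "input-0", "shape": [3], "datatype": "FP32", "data": [1.0, 2.0, 3.0]}
]
print(predict_outputs(request_inputs))
```

In the real class, construct `types.ResponseOutput` objects carrying the same fields instead of plain dictionaries.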

Step 2: Package the runtime into a container image

Bundle your CustomMLModel class, MLServer, and all dependencies into a container image.

Option A: Use the MLServer CLI (recommended)

MLServer provides the mlserver build command to generate an image automatically. For details, see Building a custom image.

Option B: Write a Dockerfile

# Use a Python base image that matches your dependency requirements.
FROM python:3.8-slim-buster
RUN pip install mlserver

# Place your MLModel implementation on the Python path.
COPY ./custom_model.py /opt/custom_model.py
ENV PYTHONPATH=/opt/

# Set environment variables for ModelMesh compatibility.
# These can also be set in the ServingRuntime YAML, but embedding them here
# ensures consistent behavior during local testing.
ENV MLSERVER_MODELS_DIR=/models/_mlserver_models \
    MLSERVER_GRPC_PORT=8001 \
    MLSERVER_HTTP_PORT=8002 \
    MLSERVER_LOAD_MODELS_AT_STARTUP=false \
    MLSERVER_MODEL_NAME=dummy-model

# Point MLServer to your class so the implementation field is optional
# in model settings.
ENV MLSERVER_MODEL_IMPLEMENTATION=custom_model.CustomMLModel

# Use the shell form so ${MLSERVER_MODELS_DIR} is expanded at container
# start; the JSON exec form does not perform variable substitution.
CMD mlserver start ${MLSERVER_MODELS_DIR}

Build and push the image:

docker build -t <your-registry>/<your-image-name>:<tag> .
docker push <your-registry>/<your-image-name>:<tag>

Replace the following placeholders with actual values:

| Placeholder | Description | Example |
| --- | --- | --- |
| <your-registry> | Container registry address | registry.cn-hangzhou.aliyuncs.com/my-namespace |
| <your-image-name> | Image name | custom-mlserver |
| <tag> | Image version tag | v1.0 |

Step 3: Create the ServingRuntime

Define a ServingRuntime resource that points to your container image and declares the model formats it supports.


apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: <custom-runtime-name>          # Example: my-model-server-0.x
spec:
  supportedModelFormats:
    - name: <model-format-name>        # Example: my-model
      version: "1"
      autoSelect: true
  multiModel: true
  grpcDataEndpoint: port:8001
  grpcEndpoint: port:8085
  containers:
    - name: mlserver
      image: <your-image>              # The image built in Step 2
      env:
        - name: MLSERVER_MODELS_DIR
          value: "/models/_mlserver_models/"
        - name: MLSERVER_GRPC_PORT
          # Must match grpcDataEndpoint above.
          value: "8001"
        - name: MLSERVER_HTTP_PORT
          # Default is 8080, which conflicts with ModelMesh internal ports.
          value: "8002"
        - name: MLSERVER_LOAD_MODELS_AT_STARTUP
          # ModelMesh manages model loading; disable MLServer's built-in loader.
          value: "false"
        - name: MLSERVER_MODEL_NAME
          # Dummy name prevents MLServer from erroring when no models are loaded yet.
          value: dummy-model
        - name: MLSERVER_HOST
          # Bind to localhost so MLServer only listens inside the pod.
          value: "127.0.0.1"
        - name: MLSERVER_GRPC_MAX_MESSAGE_LENGTH
          # Unlimited (-1) because ModelMesh enforces its own message size limits.
          value: "-1"
      resources:
        requests:
          cpu: 500m
          memory: 1Gi
        limits:
          cpu: "5"
          memory: 1Gi
  builtInAdapter:
    serverType: mlserver
    runtimeManagementPort: 8001
    memBufferBytes: 134217728
    modelLoadingTimeoutMillis: 90000

Replace the following placeholders with actual values:

| Placeholder | Description | Example |
| --- | --- | --- |
| <custom-runtime-name> | A unique name for the runtime | my-model-server-0.x |
| <model-format-name> | The model format this runtime supports. ModelMesh matches incoming models against this value. | my-model |
| <your-image> | The container image from Step 2 | registry.cn-hangzhou.aliyuncs.com/my-namespace/custom-mlserver:v1.0 |

Apply the resource:

kubectl apply -f <serving-runtime>.yaml

After the resource is created, the custom runtime appears in your ModelMesh deployment and is ready to serve models that match the declared format.

Step 4: Deploy a model

Create an InferenceService to deploy a model on the custom runtime. InferenceService is the primary resource that KServe and ModelMesh use to manage model endpoints.


apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-model-sample
  namespace: modelmesh-serving
  annotations:
    serving.kserve.io/deploymentMode: ModelMesh
spec:
  predictor:
    model:
      modelFormat:
        name: my-model                 # Must match the format declared in the ServingRuntime
      runtime: my-model-server-0.x   # Optional: explicitly select the runtime created in Step 3
      storage:
        key: localMinIO
        path: sklearn/mnist-svm.joblib

Key fields:

| Field | Purpose |
| --- | --- |
| modelFormat.name | Tells ModelMesh which runtime to use. Must match a supportedModelFormats entry in your ServingRuntime. |
| runtime | (Optional) Explicitly selects a runtime by name. If omitted, ModelMesh auto-selects based on modelFormat. |
| storage.key | References a preconfigured storage backend (for example, localMinIO from the ModelMesh Serving quickstart). |
| storage.path | Path to the model artifact within the storage backend. |

Apply the resource:

kubectl apply -f <inference-service>.yaml

Verify that the model is ready:

kubectl get inferenceservice my-model-sample -n modelmesh-serving

Expected output:

NAME              URL   READY   AGE
my-model-sample         True    1m

The READY column shows True after ModelMesh finishes loading the model into the custom runtime.
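Once READY is True, the model accepts KServe V2 inference requests. The snippet below is a sketch that only builds the request body: the tensor name, shape, service host, and port are assumptions to adapt to your model, and the HTTP call itself is commented out because it requires network access to the cluster (and assumes the optional ModelMesh REST proxy is enabled).

```python
import json

# Build a KServe V2 inference request body. The tensor name, shape, and
# data below are placeholders; use the names and shapes that your
# CustomMLModel expects in _check_request.
payload = {
    "inputs": [
        {
            "name": "input-0",   # assumed tensor name
            "shape": [1, 64],    # e.g. one flattened 8x8 image
            "datatype": "FP32",
            "data": [0.0] * 64,  # placeholder input values
        }
    ]
}
body = json.dumps(payload)

# Sending the request (assumes the ModelMesh REST proxy is reachable;
# adjust the service host and port for your cluster):
# import requests
# url = "http://modelmesh-serving:8008/v2/models/my-model-sample/infer"
# print(requests.post(url, data=body).json())

print(body[:40])
```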

Debugging

If the custom runtime fails to load models or returns inference errors, enable debug logging by adding these environment variables to the ServingRuntime:

env:
  - name: MLSERVER_DEBUG
    value: "true"
  - name: MLSERVER_MODEL_PARALLEL_WORKERS
    # Set to 0 to disable parallel workers, which simplifies log output.
    value: "0"

Check the runtime pod logs for detailed error messages:

kubectl logs -n modelmesh-serving <pod-name> -c mlserver
