ModelMesh includes built-in runtimes for common inference frameworks, but some workloads require custom pre/post-processing logic, unsupported frameworks, or fine-grained resource control. A custom ServingRuntime lets you define exactly how models are loaded and served, then register that runtime so ModelMesh schedules and scales models onto it automatically.
This guide walks through building a Python-based custom runtime on MLServer, packaging it as a container image, registering it as a ServingRuntime, and deploying a model against it.
Prerequisites
Before you begin, ensure that you have:
A Container Service for Kubernetes (ACK) cluster added to your Service Mesh (ASM) instance
An ASM instance of version 1.18.0.134 or later
Built-in runtimes
Before building a custom runtime, check whether a built-in runtime already supports your model format.
| Model server | Developer | Supported frameworks | Best for |
|---|---|---|---|
| Triton Inference Server | NVIDIA | TensorFlow, PyTorch, TensorRT, ONNX | High-performance, scalable, low-latency inference with built-in monitoring |
| MLServer | Seldon | SKLearn, XGBoost, LightGBM | Unified API across multiple ML frameworks |
| OpenVINO Model Server | Intel | OpenVINO, ONNX | Intel hardware acceleration |
| TorchServe | PyTorch | PyTorch (including eager mode) | Lightweight PyTorch-native serving |
If none of these runtimes support your framework, or if your inference pipeline requires custom logic, create a custom serving runtime as described below.
How it works
A ServingRuntime (namespace-scoped) or ClusterServingRuntime (cluster-scoped) defines the pod template for serving one or more model formats. Each resource specifies:
The container image that runs the inference server
A list of supported model formats
Environment variables for runtime configuration
ServingRuntime is a Kubernetes CustomResourceDefinition (CRD), so you can create reusable runtimes without modifying the ModelMesh controller or any resources in the controller namespace.
For Python-based frameworks, the fastest path is to extend MLServer: MLServer provides the serving interface, you provide the model logic, and ModelMesh handles scheduling and scaling.
The workflow has three steps:
Implement the model class by extending the MLServer MLModel class.
Package the class and its dependencies into a container image.
Register the image as a ServingRuntime in Kubernetes.
Step 1: Implement the MLModel class
Create a Python class that inherits from the MLModel class of MLServer. Implement two methods:
load() -- Initialize the model (load weights, warm up caches).
predict() -- Run inference on an incoming request.
```python
from typing import List

from mlserver import MLModel, types
from mlserver.utils import get_model_uri

# List the filenames your model uses so MLServer can locate them.
WELLKNOWN_MODEL_FILENAMES = ["model.json", "model.dat"]


class CustomMLModel(MLModel):
    async def load(self) -> bool:
        model_uri = await get_model_uri(
            self._settings, wellknown_filenames=WELLKNOWN_MODEL_FILENAMES
        )
        # TODO: Load the model from model_uri and store it as an instance attribute.
        # Example: self._model = joblib.load(model_uri)
        self.ready = True
        return self.ready

    async def predict(
        self, payload: types.InferenceRequest
    ) -> types.InferenceResponse:
        payload = self._check_request(payload)
        return types.InferenceResponse(
            model_name=self.name,
            model_version=self.version,
            outputs=self._predict_outputs(payload),
        )

    def _check_request(self, payload):
        # TODO: Validate the request -- check input tensor names, shapes, and types.
        return payload

    def _predict_outputs(self, payload) -> List[types.ResponseOutput]:
        # TODO: Extract input data from payload.
        # TODO: Run the data through your model's prediction logic.
        # TODO: Construct and return a list of ResponseOutput objects.
        outputs = []
        return outputs
```

For more examples, see the MLServer custom runtime documentation.
Step 2: Package the runtime into a container image
Bundle your CustomMLModel class, MLServer, and all dependencies into a container image.
Option A: Use the MLServer CLI (recommended)
MLServer provides the mlserver build command to generate an image automatically. For details, see Building a custom image.
Option B: Write a Dockerfile
```dockerfile
# Use a Python base image that matches your dependency requirements.
FROM python:3.8-slim-buster

RUN pip install mlserver

# Place your MLModel implementation on the Python path.
COPY ./custom_model.py /opt/custom_model.py
ENV PYTHONPATH=/opt/

# Set environment variables for ModelMesh compatibility.
# These can also be set in the ServingRuntime YAML, but embedding them here
# ensures consistent behavior during local testing.
ENV MLSERVER_MODELS_DIR=/models/_mlserver_models \
    MLSERVER_GRPC_PORT=8001 \
    MLSERVER_HTTP_PORT=8002 \
    MLSERVER_LOAD_MODELS_AT_STARTUP=false \
    MLSERVER_MODEL_NAME=dummy-model

# Point MLServer to your class so the implementation field is optional
# in model settings.
ENV MLSERVER_MODEL_IMPLEMENTATION=custom_model.CustomMLModel

# Use the shell form of CMD so ${MLSERVER_MODELS_DIR} is expanded at runtime;
# the exec (JSON array) form does not perform variable substitution.
CMD mlserver start ${MLSERVER_MODELS_DIR}
```

Build and push the image:
```shell
docker build -t <your-registry>/<your-image-name>:<tag> .
docker push <your-registry>/<your-image-name>:<tag>
```

Replace the following placeholders with actual values:
| Placeholder | Description | Example |
|---|---|---|
| <your-registry> | Container registry address | registry.cn-hangzhou.aliyuncs.com/my-namespace |
| <your-image-name> | Image name | custom-mlserver |
| <tag> | Image version tag | v1.0 |
Step 3: Create the ServingRuntime
Define a ServingRuntime resource that points to your container image and declares the model formats it supports.
Replace the following placeholders with actual values:
| Placeholder | Description | Example |
|---|---|---|
| <custom-runtime-name> | A unique name for the runtime | my-model-server-0.x |
| <model-format-name> | The model format this runtime supports. ModelMesh matches incoming models against this value. | my-model |
| <your-image> | The container image from Step 2 | registry.cn-hangzhou.aliyuncs.com/my-namespace/custom-mlserver:v1.0 |
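A minimal manifest, modeled on the MLServer example from the upstream ModelMesh Serving documentation, might look like the following. The port numbers match the environment variables baked into the image in Step 2; the resource requests are illustrative values to adjust for your workload:

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: <custom-runtime-name>
spec:
  supportedModelFormats:
    - name: <model-format-name>
      version: "1"
      autoSelect: true
  multiModel: true
  grpcDataEndpoint: port:8001
  grpcEndpoint: port:8085
  containers:
    - name: mlserver
      image: <your-image>
      env:
        - name: MLSERVER_MODELS_DIR
          value: "/models/_mlserver_models/"
        - name: MLSERVER_GRPC_PORT
          value: "8001"
        - name: MLSERVER_HTTP_PORT
          value: "8002"
        - name: MLSERVER_LOAD_MODELS_AT_STARTUP
          value: "false"
        - name: MLSERVER_MODEL_NAME
          value: dummy-model
      resources:
        requests:
          cpu: 500m
          memory: 1Gi
        limits:
          cpu: "1"
          memory: 1Gi
```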
Apply the resource:
```shell
kubectl apply -f <serving-runtime>.yaml
```

After the resource is created, the custom runtime appears in your ModelMesh deployment and is ready to serve models that match the declared format.
Step 4: Deploy a model
Create an InferenceService to deploy a model on the custom runtime. InferenceService is the primary resource that KServe and ModelMesh use to manage model endpoints.
Key fields:
| Field | Purpose |
|---|---|
| modelFormat.name | Tells ModelMesh which runtime to use. Must match a supportedModelFormats entry in your ServingRuntime. |
| runtime | (Optional) Explicitly selects a runtime by name. If omitted, ModelMesh auto-selects based on modelFormat. |
| storage.key | References a preconfigured storage backend (for example, localMinIO from the ModelMesh Serving quickstart). |
| storage.path | Path to the model artifact within the storage backend. |
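Putting these fields together, a sketch of such an InferenceService could look like the following (the placeholders echo the ServingRuntime from Step 3; the storage path is a placeholder for your own artifact location):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-model-sample
  namespace: modelmesh-serving
  annotations:
    serving.kserve.io/deploymentMode: ModelMesh
spec:
  predictor:
    model:
      modelFormat:
        name: <model-format-name>
      # Optional: pin the runtime explicitly instead of relying on auto-selection.
      runtime: <custom-runtime-name>
      storage:
        key: localMinIO
        path: <path-to-model>
```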
Apply the resource:
```shell
kubectl apply -f <inference-service>.yaml
```

Verify that the model is ready:

```shell
kubectl get inferenceservice my-model-sample -n modelmesh-serving
```

Expected output:
```
NAME              URL   READY   AGE
my-model-sample         True    1m
```

The READY column shows True after ModelMesh finishes loading the model into the custom runtime.
Debugging
If the custom runtime fails to load models or returns inference errors, enable debug logging by adding these environment variables to the ServingRuntime:
```yaml
env:
  - name: MLSERVER_DEBUG
    value: "true"
  - name: MLSERVER_MODEL_PARALLEL_WORKERS
    # Set to 0 to disable parallel workers, which simplifies log output.
    value: "0"
```

Check the runtime pod logs for detailed error messages:

```shell
kubectl logs -n modelmesh-serving <pod-name> -c mlserver
```

What's next
Learn more about ModelMesh Serving custom runtimes in the upstream documentation.
Explore the MLServer custom runtime examples for additional patterns.