When you need to run multiple machine learning models for inference, use ModelMesh to deploy and manage multi-model inference services. Based on KServe ModelMesh, this feature is optimized for high-volume, high-density, and frequently changing model use cases. It intelligently loads models into memory and unloads them to balance responsiveness and compute resources, simplifying the deployment and operation of multi-model inference services while improving inference efficiency and performance.
Prerequisites
- You have added a cluster to an ASM instance of version 1.18.0.134 or later.
- You have created an ingress gateway for the cluster. For more information, see Create an ingress gateway.
This topic uses an ASM ingress gateway as the cluster gateway. The gateway is named ingressgateway by default and exposes port 8008 for HTTP traffic.
Features
ModelMesh provides the following features:
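| Feature | Description |
| --- | --- |
| Cache management | Pods are managed as a distributed LRU cache. Copies of models are loaded and unloaded based on usage frequency and current request volume. |
| Intelligent placement and loading | Model placement is balanced by both the cache age across pods and the request load. Queues handle concurrent model loads to minimize the impact on runtime traffic. |
| Resilience | Automatically retries failed model loads on different pods. |
| Rolling updates | Automatically and seamlessly handles rolling model updates. |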
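
To make the idea of loading and unloading models on demand more concrete, here is a minimal Python sketch of a least-recently-used (LRU) model cache. It is an illustration only and does not reproduce ModelMesh's actual placement logic: the `ModelCache` class, its `capacity` parameter, and the `loader` callback are hypothetical names, and plain LRU eviction is used as a simple stand-in policy.

```python
from collections import OrderedDict
from typing import Callable, List


class ModelCache:
    """Toy LRU cache: keeps at most `capacity` models loaded in memory."""

    def __init__(self, capacity: int, loader: Callable[[str], object]) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.loader = loader          # hypothetical: loads a model by name
        self._loaded: "OrderedDict[str, object]" = OrderedDict()
        self.evicted: List[str] = []  # record of unloaded models, for inspection

    def get(self, name: str) -> object:
        if name in self._loaded:
            # Cache hit: mark the model as most recently used.
            self._loaded.move_to_end(name)
            return self._loaded[name]
        # Cache miss: unload the least recently used model if the cache is full.
        if len(self._loaded) >= self.capacity:
            victim, _ = self._loaded.popitem(last=False)
            self.evicted.append(victim)
        model = self.loader(name)
        self._loaded[name] = model
        return model

    def loaded_models(self) -> List[str]:
        """Names of loaded models, least recently used first."""
        return list(self._loaded)


cache = ModelCache(capacity=2, loader=lambda name: f"<model {name}>")
cache.get("a")
cache.get("b")
cache.get("a")  # "a" becomes the most recently used
cache.get("c")  # the cache is full, so "b" is unloaded
print(cache.loaded_models())  # ['a', 'c']
print(cache.evicted)          # ['b']
```

With a capacity of two, requesting a third model unloads whichever model was used least recently, which is the trade-off between responsiveness and memory footprint that the feature list describes.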
Step 1: Enable ModelMesh in ASM
- Log on to the ASM console and go to the Mesh Management page.
- On the Mesh Management page, click the name of the target ASM instance. In the left-side navigation pane, go to the KServe on ASM page.
- On the KServe on ASM page, click Install Model Service Mesh to enable the ModelMesh feature.

  Note: KServe depends on CertManager. Installing KServe automatically installs the CertManager component. If you use a self-managed CertManager, disable Automatically install the CertManager component in the cluster.
- After a few minutes, when the components are ready, use the KubeConfig for the cluster to run the following command and verify that the ServingRuntime resources are ready.

      kubectl get servingruntimes -n modelmesh-serving

  Expected output:

      NAME             DISABLED   MODELTYPE     CONTAINERS   AGE
      mlserver-1.x                sklearn       mlserver     1m
      ovms-1.x                    openvino_ir   ovms         1m
      torchserve-0.x              pytorch-mar   torchserve   1m
      triton-2.x                  keras         triton       1m

  A ServingRuntime defines a pod template for serving one or more specific model formats. ModelMesh automatically provisions the corresponding pod based on the framework of the deployed model.
The default runtimes and their supported model formats are listed in the following table. For more information, see supported-model-formats. If these model servers do not meet your requirements, you can create a custom model serving runtime. For more information, see Customize a model serving runtime with ModelMesh.
| Model serving runtime | Supported model frameworks |
| --- | --- |
| mlserver-1.x | sklearn, xgboost, lightgbm |
| ovms-1.x | openvino_ir, onnx |
| torchserve-0.x | pytorch-mar |
| triton-2.x | tensorflow, pytorch, onnx, tensorrt |
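
The runtime table can also be read in the other direction: given a model format, which default runtimes list it as supported? The following Python sketch encodes the table as a dictionary and performs that lookup. The `DEFAULT_RUNTIMES` name and the `runtimes_for` helper are illustrative and not part of ModelMesh; the actual runtime selection is performed by ModelMesh itself.

```python
from typing import Dict, List

# Default runtimes and their supported model formats, copied from the table above.
DEFAULT_RUNTIMES: Dict[str, List[str]] = {
    "mlserver-1.x": ["sklearn", "xgboost", "lightgbm"],
    "ovms-1.x": ["openvino_ir", "onnx"],
    "torchserve-0.x": ["pytorch-mar"],
    "triton-2.x": ["tensorflow", "pytorch", "onnx", "tensorrt"],
}


def runtimes_for(model_format: str) -> List[str]:
    """Return the default runtimes whose format list includes `model_format`."""
    return [
        runtime
        for runtime, formats in DEFAULT_RUNTIMES.items()
        if model_format in formats
    ]


print(runtimes_for("sklearn"))  # ['mlserver-1.x']
print(runtimes_for("onnx"))     # ['ovms-1.x', 'triton-2.x']
print(runtimes_for("caffe"))    # []
```

An empty result, as for the made-up `caffe` format here, corresponds to the case where none of the default runtimes fits and a custom model serving runtime is needed.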
Step 2: Configure the ASM environment
- Synchronize namespaces from the Kubernetes cluster to the ASM instance. For more information, see Synchronize auto-injection labels from a data plane cluster to an ASM instance. After synchronization, confirm that the modelmesh-serving namespace exists.
- Create an ingress gateway rule.
  - Create a file named grpc-gateway.yaml with the following content.
  - Using the KubeConfig for the cluster associated with your ASM instance, run the following command to create the gateway rule.

        kubectl apply -f grpc-gateway.yaml
- Create a virtual service.
  - Create a file named vs-modelmesh-serving-service.yaml with the following content.
  - Using the KubeConfig for the cluster associated with your ASM instance, run the following command to create the virtual service.

        kubectl apply -f vs-modelmesh-serving-service.yaml
- Configure the gRPC-JSON transcoder.
  - Create a file named grpcjsontranscoder-for-kservepredictv2.yaml with the following content.

        apiVersion: istio.alibabacloud.com/v1beta1
        kind: ASMGrpcJsonTranscoder
        metadata:
          name: grpcjsontranscoder-for-kservepredictv2
          namespace: istio-system
        spec:
          builtinProtoDescriptor: kserve_predict_v2
          isGateway: true
          portNumber: 8008
          workloadSelector:
            labels:
              istio: ingressgateway

  - Using the KubeConfig for the cluster associated with your ASM instance, run the following command to deploy the gRPC-JSON transcoder.

        kubectl apply -f grpcjsontranscoder-for-kservepredictv2.yaml

  - Create a file named grpcjsontranscoder-increasebufferlimit.yaml with the following content. This configuration increases the response size limit by setting the per_connection_buffer_limit_bytes parameter.
  - Using the KubeConfig for the cluster associated with your ASM instance, run the following command to deploy the EnvoyFilter.

        kubectl apply -f grpcjsontranscoder-increasebufferlimit.yaml
Step 3: Deploy a sample model
- Create a StorageClass. For more information, see Use dynamic NAS volumes.
  - Log on to the ACK console. In the left navigation pane, click Clusters.
  - On the Clusters page, click the name of your cluster. In the left navigation pane, go to the StorageClasses page.
  - In the upper-right corner of the StorageClasses page, click Create, configure the following parameters, and then click Create.

    Set Name to alibabacloud-cnfs-nas. Set Volume Type to NAS and Storage Driver to CSI. Set Reclaim Policy to Delete. In the Mount Options section, add entries for nolock,tcp,noresvport and vers=3. Select the appropriate Mount Point Domain and set Path to /.
- Create a persistent volume claim (PVC).
  - Create a file named my-models-pvc.yaml with the following content.

        apiVersion: v1
        kind: PersistentVolumeClaim
        metadata:
          name: my-models-pvc
          namespace: modelmesh-serving
        spec:
          accessModes:
            - ReadWriteMany
          resources:
            requests:
              storage: 1Gi
          storageClassName: alibabacloud-cnfs-nas
          volumeMode: Filesystem

  - Using the KubeConfig for the ACK cluster, run the following command to create the persistent volume claim.

        kubectl apply -f my-models-pvc.yaml

  - Run the following command to view the PVCs in the modelmesh-serving namespace.

        kubectl get pvc -n modelmesh-serving

    Expected output:

        NAME            STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS            AGE
        my-models-pvc   Bound    nas-379c32e1-c0ef-43f3-8277-9eb4606b53f8   1Gi        RWX            alibabacloud-cnfs-nas   2h
- Create a pod to access the PVC.

  To use the new PVC, mount it as a volume in a Kubernetes pod. You can then use that pod to upload model files to the persistent volume.
  - Create a file named pvc-access.yaml with the following content.

    The following YAML creates a pod named pvc-access that requests the PVC you previously created by specifying the claim name "my-models-pvc".

        apiVersion: v1
        kind: Pod
        metadata:
          name: "pvc-access"
        spec:
          containers:
            - name: main
              image: ubuntu
              command: ["/bin/sh", "-ec", "sleep 10000"]
              volumeMounts:
                - name: "my-pvc"
                  mountPath: "/mnt/models"
          volumes:
            - name: "my-pvc"
              persistentVolumeClaim:
                claimName: "my-models-pvc"

  - Using the KubeConfig for the ACK cluster, run the following command to create the pod.

        kubectl apply -n modelmesh-serving -f pvc-access.yaml

  - Confirm that the pvc-access pod is in the Running state.

        kubectl get pods -n modelmesh-serving | grep pvc-access

    Expected output:

        pvc-access   1/1   Running   0   51m
- Store the model on the persistent volume.

  This topic uses an MNIST handwritten digit recognition model trained with scikit-learn. You can download a copy of the mnist-svm.joblib model file from the kserve/modelmesh-minio-examples repository.
  - Using the KubeConfig for the ACK cluster, run the following command to copy the mnist-svm.joblib model file to the /mnt/models folder of the pvc-access pod.

        kubectl -n modelmesh-serving cp mnist-svm.joblib pvc-access:/mnt/models/

  - Run the following command to confirm that the model was uploaded successfully.

        kubectl -n modelmesh-serving exec -it pvc-access -- ls -alr /mnt/models/

    Expected output:

        -rw-r--r-- 1 501 staff 344817 Oct 30 11:23 mnist-svm.joblib
- Deploy the inference service.
  - Create a file named sklearn-mnist.yaml with the following content.
  - Using the KubeConfig for the ACK cluster, run the following command to deploy the sklearn-mnist inference service.

        kubectl apply -f sklearn-mnist.yaml

  - After a few moments (deployment time varies with image pull speed), run the following command to verify that the sklearn-mnist inference service is ready.

        kubectl get isvc -n modelmesh-serving

    A READY status of True indicates a successful deployment.

        NAME            URL                                               READY
        sklearn-mnist   grpc://modelmesh-serving.modelmesh-serving:8033   True
- Send an inference request.

  Use the curl command to send an inference request to the sklearn-mnist model. The data array represents the grayscale values of 64 pixels from a scanned image of a handwritten digit.

      MODEL_NAME="sklearn-mnist"
      ASM_GW_IP=""  # Set this to the IP address of your ASM ingress gateway.
      curl -X POST -k "http://${ASM_GW_IP}:8008/v2/models/${MODEL_NAME}/infer" -d '{"inputs": [{"name": "predict", "shape": [1, 64], "datatype": "FP32", "contents": {"fp32_contents": [0.0, 0.0, 1.0, 11.0, 14.0, 15.0, 3.0, 0.0, 0.0, 1.0, 13.0, 16.0, 12.0, 16.0, 8.0, 0.0, 0.0, 8.0, 16.0, 4.0, 6.0, 16.0, 5.0, 0.0, 0.0, 5.0, 15.0, 11.0, 13.0, 14.0, 0.0, 0.0, 0.0, 0.0, 2.0, 12.0, 16.0, 13.0, 0.0, 0.0, 0.0, 0.0, 0.0, 13.0, 16.0, 16.0, 6.0, 0.0, 0.0, 0.0, 0.0, 16.0, 16.0, 16.0, 7.0, 0.0, 0.0, 0.0, 0.0, 11.0, 13.0, 12.0, 1.0, 0.0]}}]}'

  The following JSON response indicates that the model recognized the digit as 8.

      {
        "modelName": "sklearn-mnist__isvc-3c10c62d34",
        "outputs": [
          {
            "name": "predict",
            "datatype": "INT64",
            "shape": ["1", "1"],
            "contents": {
              "int64Contents": ["8"]
            }
          }
        ]
      }
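
The same request body can be assembled, and the response decoded, programmatically. The Python sketch below builds the JSON body used in the curl command and extracts the predicted digit from a response shaped like the example output. The helper names `build_infer_request` and `parse_prediction` are illustrative, and no network call is made; sending the body to the gateway is left to whichever HTTP client you prefer.

```python
import json
from typing import Any, Dict, List

# Grayscale values of the 8x8 image from the curl example (64 pixels).
PIXELS: List[float] = [
    0.0, 0.0, 1.0, 11.0, 14.0, 15.0, 3.0, 0.0,
    0.0, 1.0, 13.0, 16.0, 12.0, 16.0, 8.0, 0.0,
    0.0, 8.0, 16.0, 4.0, 6.0, 16.0, 5.0, 0.0,
    0.0, 5.0, 15.0, 11.0, 13.0, 14.0, 0.0, 0.0,
    0.0, 0.0, 2.0, 12.0, 16.0, 13.0, 0.0, 0.0,
    0.0, 0.0, 0.0, 13.0, 16.0, 16.0, 6.0, 0.0,
    0.0, 0.0, 0.0, 16.0, 16.0, 16.0, 7.0, 0.0,
    0.0, 0.0, 0.0, 11.0, 13.0, 12.0, 1.0, 0.0,
]


def build_infer_request(pixels: List[float]) -> Dict[str, Any]:
    """Build the request body for POST /v2/models/<model>/infer."""
    if len(pixels) != 64:
        raise ValueError(f"expected 64 pixel values, got {len(pixels)}")
    return {
        "inputs": [
            {
                "name": "predict",
                "shape": [1, 64],
                "datatype": "FP32",
                "contents": {"fp32_contents": [float(p) for p in pixels]},
            }
        ]
    }


def parse_prediction(response_text: str) -> int:
    """Extract the predicted digit from the JSON response."""
    response = json.loads(response_text)
    # The example response carries int64 values as strings, so convert explicitly.
    return int(response["outputs"][0]["contents"]["int64Contents"][0])


body = json.dumps(build_infer_request(PIXELS))

# Response copied from the example output in this topic.
sample_response = """
{"modelName": "sklearn-mnist__isvc-3c10c62d34",
 "outputs": [{"name": "predict", "datatype": "INT64", "shape": ["1", "1"],
              "contents": {"int64Contents": ["8"]}}]}
"""
print(parse_prediction(sample_response))  # 8
```

Validating the pixel count before sending avoids a round trip for a request whose data cannot match the declared shape of [1, 64].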
Related topics
- To accommodate varying environment needs, optimize inference efficiency, or control resource allocation for your multi-model deployments, customize a model serving runtime. This allows you to fine-tune the environment and ensure that each model executes under optimal conditions. For more information, see Customize a model serving runtime with ModelMesh.
- To process large amounts of natural language data or build complex language understanding systems, convert a large language model into an inference service. For more information, see Convert a large language model into an inference service.
- If your pods encounter exceptions at runtime, see Troubleshoot pod issues.