
Alibaba Cloud Service Mesh:Use dynamic subset load balancing to accelerate Model Service Mesh inference

Last Updated:Mar 11, 2026

In a multi-model Model Service Mesh deployment, a Kubernetes service distributes inference requests randomly across all model serving runtimes. If a request reaches a runtime that does not host the target model, Model Service Mesh reroutes it internally until it finds the correct runtime. This rerouting adds latency.

Dynamic subset load balancing solves this problem by grouping runtimes based on the models they host. When an inference request arrives at the Service Mesh (ASM) gateway, the gateway reads the model name from the request header and routes the request directly to a runtime that hosts that model. This eliminates rerouting and reduces inference latency.

For the concept overview, see Dynamic subset load balancing.

How it works

Dynamic subset load balancing relies on two Istio resources: a DestinationRule defines *which runtimes* to group together, and a VirtualService defines *how to match requests* to those groups.

  • DestinationRule groups runtimes into subsets. ASM extends the standard DestinationRule with a dynamicSubset field that groups endpoints by a label key. Model Service Mesh automatically updates runtime labels as models are loaded or unloaded, so subsets stay current without manual intervention.

  • VirtualService matches incoming requests to subsets. ASM extends the standard VirtualService with a headerToDynamicSubsetKey field that maps a request header (such as model) to the dynamic subset label key. The gateway uses this mapping to route each request to the correct subset.

If no subset matches the request, the fallbackPolicy determines what happens:

  • ANY_ENDPOINT: Routes to any available runtime regardless of labels. This preserves availability at the cost of potential rerouting. This tutorial uses ANY_ENDPOINT.
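To make the grouping concrete, the fragment below sketches how a labeled runtime pod might look. The label key is the one used throughout this tutorial; the pod name and label value are placeholders, because Model Service Mesh writes and updates these labels automatically.

```yaml
# Illustrative fragment only: a model serving runtime pod as Model Service
# Mesh might label it. The pod name and label value are placeholders; the
# label is managed automatically as models are loaded and unloaded.
apiVersion: v1
kind: Pod
metadata:
  name: modelmesh-serving-triton-example   # hypothetical pod name
  namespace: modelmesh-serving
  labels:
    modelmesh.asm.alibabacloud.com: tf-mnist   # maintained by Model Service Mesh
```

A DestinationRule that groups on this key then yields one dynamic subset per distinct label value, with no static subset definitions to maintain.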

Prerequisites

Before you begin, complete Use Model Service Mesh to roll out a multi-model inference service. This tutorial builds on the ACK cluster, the modelmesh-serving namespace, and the my-models-pvc persistent volume claim created there.

Step 1: Deploy a second model

Dynamic subset load balancing targets multi-model scenarios. This step adds a tf-mnist model (a TensorFlow-based MNIST model served by the Triton runtime) alongside the existing sklearn-mnist model.

Note

This step uses the persistent volume claim (PVC) my-models-pvc created in Use Model Service Mesh to roll out a multi-model inference service. The tf-mnist model content is everything inside the mnist directory.

Store the model on the persistent volume

  1. Connect to the ACK cluster with kubectl, then copy the mnist model directory to the persistent volume:

    kubectl -n modelmesh-serving cp mnist pvc-access:/mnt/models/
  2. Verify that the model exists on the persistent volume:

    kubectl -n modelmesh-serving exec -it pvc-access -- ls -alr /mnt/models/

    Expected output:

    -rw-r--r-- 1  502 staff 344817 Apr 23 08:17 mnist-svm.joblib
    drwxr-xr-x 3 root root    4096 Apr 23 08:23 mnist
    drwxr-xr-x 1 root root    4096 Apr 23 08:17 ..
    drwxrwxrwx 3 root root    4096 Apr 23 08:23 .

Deploy the inference service

  1. Create a file named tf-mnist.yaml with the following content:

    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    metadata:
      name: tf-mnist
      namespace: modelmesh-serving
      annotations:
        serving.kserve.io/deploymentMode: ModelMesh
    spec:
      predictor:
        model:
          modelFormat:
            name: tensorflow
          storage:
            parameters:
              type: pvc
              name: my-models-pvc
            path: mnist
  2. Apply the manifest:

    kubectl apply -f tf-mnist.yaml
  3. Wait for the image to pull, then verify that both models are ready:

    kubectl get isvc -n modelmesh-serving

    Expected output:

    NAME            URL                                               READY
    sklearn-mnist   grpc://modelmesh-serving.modelmesh-serving:8033   True
    tf-mnist        grpc://modelmesh-serving.modelmesh-serving:8033   True

(Optional) Step 2: Benchmark inference latency before optimization

Use fortio to measure baseline latency before enabling dynamic subset load balancing. To obtain the ASM ingress gateway IP address, see Integrate KServe with ASM to implement inference services based on cloud-native AI models.

  1. Set the gateway IP and run a 60-second load test against the tf-mnist model:

    ASM_GW_IP="<your-asm-gateway-ip>"
    fortio load -jitter=False -H 'model: tf-mnist' -c 1 -qps 100 -t 60s -payload '{"inputs": [{ "name": "inputs", "shape": [1, 784], "datatype": "FP32", "contents": { "fp32_contents": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.01176471, 0.07058824, 0.07058824, 0.07058824, 0.49411765, 0.53333336, 0.6862745, 0.10196079, 0.6509804, 1.0, 0.96862745, 0.49803922, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.11764706, 0.14117648, 0.36862746, 0.6039216, 0.6666667, 0.99215686, 0.99215686, 0.99215686, 0.99215686, 0.99215686, 0.88235295, 0.6745098, 0.99215686, 0.9490196, 0.7647059, 0.2509804, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.19215687, 0.93333334, 0.99215686, 0.99215686, 0.99215686, 0.99215686, 0.99215686, 0.99215686, 0.99215686, 0.99215686, 0.9843137, 0.3647059, 0.32156864, 0.32156864, 0.21960784, 0.15294118, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.07058824, 0.85882354, 0.99215686, 0.99215686, 0.99215686, 0.99215686, 0.99215686, 0.7764706, 0.7137255, 0.96862745, 0.94509804, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.3137255, 0.6117647, 0.41960785, 0.99215686, 0.99215686, 0.8039216, 0.04313726, 0.0, 0.16862746, 0.6039216, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.05490196, 0.00392157, 0.6039216, 0.99215686, 0.3529412, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.54509807, 0.99215686, 0.74509805, 0.00784314, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.04313726, 0.74509805, 0.99215686, 0.27450982, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.13725491, 0.94509804, 0.88235295, 0.627451, 0.42352942, 0.00392157, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.31764707, 0.9411765, 0.99215686, 0.99215686, 0.46666667, 0.09803922, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.1764706, 0.7294118, 0.99215686, 0.99215686, 0.5882353, 0.10588235, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0627451, 0.3647059, 0.9882353, 0.99215686, 0.73333335, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.9764706, 0.99215686, 0.9764706, 0.2509804, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.18039216, 0.50980395, 0.7176471, 0.99215686, 0.99215686, 0.8117647, 0.00784314, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.15294118, 0.5803922, 0.8980392, 0.99215686, 0.99215686, 0.99215686, 0.98039216, 0.7137255, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.09411765, 0.44705883, 0.8666667, 0.99215686, 0.99215686, 0.99215686, 0.99215686, 0.7882353, 0.30588236, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 
0.0, 0.0, 0.0, 0.0, 0.09019608, 0.25882354, 0.8352941, 0.99215686, 0.99215686, 0.99215686, 0.99215686, 0.7764706, 0.31764707, 0.00784314, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.07058824, 0.67058825, 0.85882354, 0.99215686, 0.99215686, 0.99215686, 0.99215686, 0.7647059, 0.3137255, 0.03529412, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.21568628, 0.6745098, 0.8862745, 0.99215686, 0.99215686, 0.99215686, 0.99215686, 0.95686275, 0.52156866, 0.04313726, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.53333336, 0.99215686, 0.99215686, 0.99215686, 0.83137256, 0.5294118, 0.5176471, 0.0627451, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] }}]}' -a ${ASM_GW_IP}:8008/v2/models/tf-mnist/infer

    Expected output:

    16:06:53.107 r1 [INF] scli.go:125> Starting, command="Φορτίο", version="1.63.7 h1:S6e+z36nV6o8RYQSUI9EWYxhCoPJy4VdAB2HQROUqMg= go1.22.2 amd64 linux", go-max-procs=8
    Fortio 1.63.7 running at 100 queries per second, 8->8 procs, for 1m0s: 192.168.0.7:8008/v2/models/tf-mnist/infer
    16:06:53.107 r1 [INF] httprunner.go:121> Starting http test, run=0, url="192.168.0.7:8008/v2/models/tf-mnist/infer", threads=1, qps="100.0", warmup="parallel", conn-reuse=""
    16:06:53.107 r1 [WRN] http_client.go:170> Assuming http:// on missing scheme for '192.168.0.7:8008/v2/models/tf-mnist/infer'
    Starting at 100 qps with 1 thread(s) [gomax 8] for 1m0s : 6000 calls each (total 6000)
    16:07:53.172 r1 [INF] periodic.go:851> T000 ended after 1m0.007662555s : 6000 calls. qps=99.98723070575699
    Ended after 1m0.007716622s : 6000 calls. qps=99.987
    16:07:53.172 r1 [INF] periodic.go:581> Run ended, run=0, elapsed=60007716622, calls=6000, qps=99.98714061718327
    Sleep times : count 5999 avg 0.0021861297 +/- 0.001025 min -0.011550303 max 0.003734337 sum 13.1145918
    Aggregated Function Time : count 6000 avg 0.0072130591 +/- 0.0007502 min 0.006125677 max 0.020551562 sum 43.2783545
    # range, mid point, percentile, count
    >= 0.00612568 <= 0.007 , 0.00656284 , 35.88, 2153
    > 0.007 <= 0.008 , 0.0075 , 90.85, 3298
    > 0.008 <= 0.009 , 0.0085 , 97.77, 415
    > 0.009 <= 0.01 , 0.0095 , 99.42, 99
    > 0.01 <= 0.011 , 0.0105 , 99.68, 16
    > 0.011 <= 0.012 , 0.0115 , 99.77, 5
    > 0.012 <= 0.014 , 0.013 , 99.90, 8
    > 0.014 <= 0.016 , 0.015 , 99.95, 3
    > 0.016 <= 0.018 , 0.017 , 99.97, 1
    > 0.018 <= 0.02 , 0.019 , 99.98, 1
    > 0.02 <= 0.0205516 , 0.0202758 , 100.00, 1
    # target 50% 0.00725682
    # target 75% 0.00771164
    # target 90% 0.00798454
    # target 99% 0.00974747
    # target 99.9% 0.014
    Error cases : no data
    # Socket and IP used for each connection:
    [0]   1 socket used, resolved to 192.168.0.7:8008, connection timing : count 1 avg 0.004218262 +/- 0 min 0.004218262 max 0.004218262 sum 0.004218262
    Connection time histogram (s) : count 1 avg 0.004218262 +/- 0 min 0.004218262 max 0.004218262 sum 0.004218262
    # range, mid point, percentile, count
    >= 0.00421826 <= 0.00421826 , 0.00421826 , 100.00, 1
    # target 50% 0.00421826
    # target 75% 0.00421826
    # target 90% 0.00421826
    # target 99% 0.00421826
    # target 99.9% 0.00421826
    Sockets used: 1 (for perfect keepalive, would be 1)
    Uniform: false, Jitter: false, Catchup allowed: true
    IP addresses distribution:
    192.168.0.7:8008: 1
    Code 200 : 6000 (100.0 %)
    Response Header Sizes : count 6000 avg 233.00067 +/- 0.02581 min 233 max 234 sum 1398004
    Response Body/Total Sizes : count 6000 avg 454.00067 +/- 0.02581 min 454 max 455 sum 2724004
    All done 6000 calls (plus 1 warmup) 7.213 ms avg, 100.0 qps
    Successfully wrote 11561 bytes of Json data to 2024-04-24-160653_192_168_0_7_8008_v2_models_tf_mnist_infer_iZbp1jfq2w26u2kpxa9r3dZ.json
  2. View the fortio results in a browser:

    fortio server

    Open localhost:8080, click saved results, and select the JSON file to view the latency distribution.

    Latency distribution before optimization

    Without dynamic subset load balancing, some inference requests show increased latency because they are rerouted through Model Service Mesh before reaching the correct runtime.
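The 784 floats in the fortio payload are a flattened 28×28 MNIST image, which makes the command unwieldy. As a local sketch (no cluster required), the payload can be generated into a file instead; the all-zero image and the payload.json file name are stand-ins, not the digit used above.

```shell
# Sketch: build the KServe v2 inference payload programmatically instead of
# inlining 784 floats. The all-zero image is a placeholder; substitute real
# pixel values for a meaningful prediction.
python3 - <<'EOF' > payload.json
import json

body = {"inputs": [{
    "name": "inputs",
    "shape": [1, 784],
    "datatype": "FP32",
    "contents": {"fp32_contents": [0.0] * 784},  # flattened 28x28 image
}]}
print(json.dumps(body))
EOF

# Confirm the payload carries the expected number of pixel values.
python3 -c "import json; d = json.load(open('payload.json')); \
print(len(d['inputs'][0]['contents']['fp32_contents']))"
```

fortio can then read the file with -payload-file payload.json in place of the inline -payload string.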

Step 3: Enable dynamic subset load balancing

Inference requests reach Model Service Mesh through the modelmesh-serving service in the modelmesh-serving namespace. Configure a DestinationRule and a VirtualService on this service to route requests directly to the correct model serving runtime.

Create the DestinationRule

Apply the following DestinationRule to group model serving runtimes dynamically by the models they host. For more information, see Manage destination rules.

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: modelmesh-serving
  namespace: modelmesh-serving
spec:
  host: modelmesh-serving
  trafficPolicy:
    loadBalancer:
      dynamicSubset:
        subsetSelectors:
          - fallbackPolicy: ANY_ENDPOINT
            keys:
              - modelmesh.asm.alibabacloud.com

Key fields:

  • dynamicSubset: Enables dynamic grouping of endpoints based on runtime labels, instead of static subset definitions.
  • keys: The label key used to group runtimes. Model Service Mesh sets the modelmesh.asm.alibabacloud.com label on each runtime to indicate which models it hosts.
  • fallbackPolicy: ANY_ENDPOINT: When no subset matches, routes the request to any available runtime instead of rejecting it.

Update the VirtualService

Apply the following VirtualService to map the model request header to the dynamic subset key. For more information, see Manage virtual services.

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: vs-modelmesh-serving-service
  namespace: modelmesh-serving
spec:
  gateways:
    - grpc-gateway
  hosts:
    - '*'
  http:
    - headerToDynamicSubsetKey:
        - header: model
          key: modelmesh.asm.alibabacloud.com
      match:
        - port: 8008
      name: default
      route:
        - destination:
            host: modelmesh-serving
            port:
              number: 8033

The headerToDynamicSubsetKey field is an ASM extension to the standard Istio VirtualService. When the gateway receives a request with a model: tf-mnist header, it looks up the value against the modelmesh.asm.alibabacloud.com label and routes the request to a runtime in the matching dynamic subset.
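The selection logic can be pictured as filtering endpoints by label. The sketch below runs locally with no cluster; the runtime names and the models they host are hypothetical.

```shell
# Conceptual sketch (no cluster needed): how the gateway narrows endpoints to
# a dynamic subset. Runtime names and label values here are hypothetical.
subset_key="modelmesh.asm.alibabacloud.com"
header_value="tf-mnist"   # value of the request's "model" header

# Hypothetical runtimes and the model each one's label advertises.
endpoints="runtime-a=sklearn-mnist runtime-b=tf-mnist runtime-c=tf-mnist"

for ep in $endpoints; do
  name="${ep%%=*}"
  model="${ep#*=}"
  # Only endpoints whose label value matches the header value join the subset.
  if [ "$model" = "$header_value" ]; then
    echo "subset member: ${name} (${subset_key}=${model})"
  fi
done
```

With fallbackPolicy: ANY_ENDPOINT, an empty match would instead fall back to all endpoints rather than failing the request.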

(Optional) Step 4: Benchmark inference latency after optimization

Run the same fortio test from Step 2 to compare latency.

Latency distribution after optimization

After enabling dynamic subset load balancing, all inference requests route directly to the correct runtime. The latency distribution tightens significantly, with no outliers caused by internal rerouting.