Alibaba Cloud Service Mesh:Integrate KServe with ASM to deploy cloud-native inference services

Last Updated: Mar 11, 2026

When you run machine learning models on Kubernetes, you need automated scaling, traffic routing, and lifecycle management for inference endpoints. KServe (formerly KFServing) provides these capabilities as a Kubernetes-native inference platform. By integrating KServe with Alibaba Cloud Service Mesh (ASM), you deploy and manage inference services through the ASM control plane -- with automatic traffic-based scaling, scale-to-zero, and canary deployments built in.

Prerequisites

Before you begin, make sure that:

  - An ASM instance is created, and a Container Service for Kubernetes (ACK) cluster is added to the instance.

  - kubectl is configured to connect to the ACK cluster.

  - Knative is available in the cluster, because this procedure uses the serverless deployment mode.

How it works

KServe runs on your Container Service for Kubernetes (ACK) cluster, managed by ASM. When you deploy an InferenceService resource, KServe:

  1. Provisions a model server and loads the model from the specified storage URI.

  2. Creates Knative services for serverless scaling, including scale-to-zero when idle.

  3. Generates Istio routing resources (a VirtualService and a Gateway) so the model is accessible through the ASM ingress gateway.

KServe supports two deployment modes:

Mode                     Scaling behavior
Serverless (Knative)     Automatic scaling, including scale-to-zero
Kubernetes Deployment    Standard Kubernetes resource allocation

The following procedure uses serverless mode with Knative.
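
The deployment mode can also be chosen per service. The following sketch is based on upstream KServe rather than on this procedure: the serving.kserve.io/deploymentMode annotation and the sklearn-iris-raw name are assumptions, so verify them against the KServe documentation for your version before relying on them.

    apiVersion: "serving.kserve.io/v1beta1"
    kind: "InferenceService"
    metadata:
      name: "sklearn-iris-raw"
      annotations:
        # Assumed annotation: switches this service from the default
        # serverless (Knative) mode to a plain Kubernetes Deployment.
        serving.kserve.io/deploymentMode: "RawDeployment"
    spec:
      predictor:
        model:
          modelFormat:
            name: sklearn
          storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"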

For more details, see the KServe project on GitHub.

Step 1: Enable KServe on ASM

KServe depends on cert-manager for certificate management. When you enable KServe, cert-manager is installed automatically.

  1. Log on to the ASM console. In the left-side navigation pane, choose Service Mesh > Mesh Management.

  2. On the Mesh Management page, click the name of the ASM instance. In the left-side navigation pane, choose Ecosystem > KServe on ASM.

  3. On the KServe on ASM page, click Enable KServe on ASM.

If you already have cert-manager installed in your cluster, turn off Automatically install the CertManager component in the cluster to avoid conflicts.
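
To confirm the installation, you can check the cert-manager pods from the ACK cluster. This is a minimal check that assumes cert-manager runs in its default cert-manager namespace:

    # List cert-manager pods; they should all reach the Running state.
    kubectl get pods -n cert-manager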

Step 2: Get the ingress gateway IP address

Record the ASM ingress gateway IP address. You need it to send inference requests in Step 4.

  1. Log on to the ASM console. In the left-side navigation pane, choose Service Mesh > Mesh Management.

  2. On the Mesh Management page, click the name of the ASM instance. In the left-side navigation pane, choose ASM Gateways > Ingress Gateway.

  3. On the Ingress Gateway page, note the Service address of the ingress gateway.
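
If you prefer the command line, you can usually read the same address from the gateway Service in the ACK cluster. This sketch assumes the ingress gateway Service uses the default name istio-ingressgateway in the istio-system namespace; adjust both values if your gateway was deployed differently:

    # Print the external IP address of the ASM ingress gateway Service.
    kubectl get service istio-ingressgateway -n istio-system \
      -o jsonpath='{.status.loadBalancer.ingress[0].ip}'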

Step 3: Create an inference service

Deploy a scikit-learn iris classification model as an InferenceService to verify the integration.

  1. Connect to the ACK cluster with kubectl and create a namespace for KServe resources:

    kubectl create namespace kserve-test
  2. Create a file named isvc.yaml with the following content:

    apiVersion: "serving.kserve.io/v1beta1"
    kind: "InferenceService"
    metadata:
      name: "sklearn-iris"
    spec:
      predictor:
        model:
          modelFormat:
            name: sklearn
          storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"
  3. Deploy the inference service in the kserve-test namespace:

    kubectl apply -f isvc.yaml -n kserve-test
  4. Verify the service is ready:

    kubectl get inferenceservices sklearn-iris -n kserve-test

    Expected output:

    NAME           URL                                           READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION            AGE
    sklearn-iris   http://sklearn-iris.kserve-test.example.com   True           100                              sklearn-iris-predictor-00001   3h26m

    The READY column shows True when the service is available.

  5. (Optional) View the auto-generated Istio resources

    After the inference service is created, KServe automatically generates a virtual service and an Istio gateway for routing traffic to the model. To view these resources in the ASM console (a kubectl alternative is shown after this procedure):

    1. Log on to the ASM console. In the left-side navigation pane, choose Service Mesh > Mesh Management.

    2. On the Mesh Management page, click the name of the ASM instance. In the left-side navigation pane, choose Traffic Management Center > VirtualService.

    3. On the VirtualService page, click the Refresh icon next to Namespace and select kserve-test from the drop-down list to view the virtual service.

    4. In the left-side navigation pane, choose ASM Gateways > Gateway. On the Gateway page, select knative-serving from the Namespace drop-down list to view the Istio gateway.
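
As a kubectl alternative to the console steps above, you can list the generated resources directly. The namespaces match the ones shown in the console, and the resource names are generated by KServe, so list them rather than guessing. Depending on how your ASM instance synchronizes Istio resources, these objects may live on the ASM control plane (use its kubeconfig) rather than in the ACK cluster:

    # VirtualService generated for the inference service.
    kubectl get virtualservice -n kserve-test

    # Istio gateway used for routing inference traffic.
    kubectl get gateway -n knative-serving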

Step 4: Send inference requests

After the inference service is running, send prediction requests through the ASM ingress gateway. The following steps apply to Linux and macOS.

  1. Create an input file with sample data for the iris classification model:

    cat <<EOF > "./iris-input.json"
    {
      "instances": [
        [6.8,  2.8,  4.8,  1.4],
        [6.0,  3.4,  4.5,  1.6]
      ]
    }
    EOF
  2. Get the service hostname:

    SERVICE_HOSTNAME=$(kubectl get inferenceservice sklearn-iris -n kserve-test -o jsonpath='{.status.url}' | cut -d "/" -f 3)
    echo $SERVICE_HOSTNAME

    Expected output:

    sklearn-iris.kserve-test.example.com
  3. Send a prediction request through the ingress gateway. Replace <ingress-gateway-ip> with the IP address from Step 2:

    curl -H "Host: ${SERVICE_HOSTNAME}" \
      http://<ingress-gateway-ip>:80/v1/models/sklearn-iris:predict \
      -d @./iris-input.json

    Expected output:

    {"predictions": [1, 1]}

    The model classifies both input samples as class 1 (Iris versicolor).

  4. Run a load test

    To evaluate throughput and latency under sustained traffic, deploy a preconfigured load test job (a command for watching autoscaling during the test is shown after this procedure):

    1. Deploy the load test application:

      kubectl create -f https://alibabacloudservicemesh.oss-cn-beijing.aliyuncs.com/kserve/v0.7/loadtest.yaml
    2. Find the load test pod name:

      kubectl get pod

      Expected output:

      NAME                                                       READY   STATUS      RESTARTS   AGE
      load-testxhwtq-pj9fq                                       0/1     Completed   0          3m24s
      sklearn-iris-predictor-00001-deployment-857f9bb56c-vg8tf   2/2     Running     0          51m
    3. View the test results. Replace <load-test-pod-name> with the actual pod name from the previous step:

      kubectl logs <load-test-pod-name>

      Expected output:

      Requests      [total, rate, throughput]         30000, 500.02, 500.01
      Duration      [total, attack, wait]             59.999s, 59.998s, 1.352ms
      Latencies     [min, mean, 50, 90, 95, 99, max]  1.196ms, 1.463ms, 1.378ms, 1.588ms, 1.746ms, 2.99ms, 18.873ms
      Bytes In      [total, mean]                     690000, 23.00
      Bytes Out     [total, mean]                     2460000, 82.00
      Success       [ratio]                           100.00%
      Status Codes  [code:count]                      200:30000
      Error Set:

      This result shows a 100% success rate across 30,000 requests at 500 requests per second, with a mean latency of 1.463 ms.
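
While the load test runs, you can watch the predictor pods in the kserve-test namespace to observe the traffic-based scaling described at the beginning of this topic. This is a minimal check; after traffic stops, the predictor eventually scales back down (scale-to-zero):

    # Watch pod changes in the kserve-test namespace; press Ctrl+C to stop.
    kubectl get pods -n kserve-test -w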

What's next