Alibaba Cloud Service Mesh: Canary releases for KServe inference services

Last Updated: Mar 11, 2026

When you deploy a new model version, routing all traffic to it at once risks downtime if the model has issues. Canary releases let you shift a small percentage of traffic to the new version first, validate its behavior, and then either promote it or roll back -- without affecting most users.

How it works

KServe tracks two revisions during a canary release:

Revision           KServe term               Role
Current (stable)   LatestRolledoutRevision   The revision that was last promoted to receive 100% of traffic
Canary (new)       LatestReadyRevision       The newest revision that is ready to serve traffic

The canaryTrafficPercent field in the InferenceService spec controls the percentage of traffic routed to the canary revision. KServe routes the remainder to the current revision automatically. For example, canaryTrafficPercent: 10 sends 10% to the canary and 90% to the current revision.

If a revision is unhealthy or defective, KServe does not route traffic to it.
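
To see which revisions KServe is tracking at any moment, you can read them from the InferenceService status. The following is a minimal sketch, not an official tool: the function name show_revisions is illustrative, and it assumes that the two revisions in the table above are exposed as latestRolledoutRevision and latestReadyRevision under status.components.predictor.

```shell
# Print the two revisions that KServe tracks for an InferenceService.
# Usage: show_revisions <namespace> <isvc-name>
# Assumption: both fields live under status.components.predictor.
show_revisions() {
  for field in latestRolledoutRevision latestReadyRevision; do
    printf '%s: ' "$field"
    kubectl get isvc "$2" -n "$1" \
      -o jsonpath="{.status.components.predictor.$field}"
    echo
  done
}

# Example: show_revisions kserve-test sklearn-iris
```

When no canary release is in progress, both fields are expected to name the same revision.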

Prerequisites

Before you begin, make sure that you have:

Verify initial traffic distribution

After deploying the inference service, confirm that 100% of traffic goes to the initial revision.

Check the sklearn-iris inference service:

kubectl get isvc -n kserve-test sklearn-iris

Expected output:

NAME           URL                                           READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION            AGE
sklearn-iris   http://sklearn-iris.kserve-test.example.com   True           100                              sklearn-iris-predictor-00001   79s

The LATEST column shows 100, which confirms that all traffic goes to the single existing revision (sklearn-iris-predictor-00001).
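
If you want to run this check from a script rather than read the table, one option is to query the traffic entries in the status directly. This sketch assumes the status.components.predictor.traffic layout (with its latestRevision and percent fields) that appears in the tag-based routing section of this topic; the function names are hypothetical.

```shell
# Print the traffic percentage assigned to the latest revision.
# Usage: latest_traffic_percent <namespace> <isvc-name>
latest_traffic_percent() {
  kubectl get isvc "$2" -n "$1" \
    -o jsonpath='{.status.components.predictor.traffic[?(@.latestRevision==true)].percent}'
}

# Succeed only if the latest revision receives the expected share.
# Usage: expect_latest_percent <namespace> <isvc-name> <expected-percent>
expect_latest_percent() {
  _actual=$(latest_traffic_percent "$1" "$2")
  if [ "$_actual" = "$3" ]; then
    echo "OK: latest revision receives ${_actual}% of traffic"
  else
    echo "MISMATCH: expected ${3}%, got ${_actual:-no value}" >&2
    return 1
  fi
}

# Example: expect_latest_percent kserve-test sklearn-iris 100
```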

Configure a canary release

Split traffic between the current and canary revisions by adding the canaryTrafficPercent field to the InferenceService spec.

  1. Apply the following configuration to route 10% of traffic to a new model version:

    kubectl apply -n kserve-test -f - <<EOF
    apiVersion: "serving.kserve.io/v1beta1"
    kind: "InferenceService"
    metadata:
      name: "sklearn-iris"
    spec:
      predictor:
        canaryTrafficPercent: 10
        model:
          modelFormat:
            name: sklearn
          storageUri: "gs://kfserving-examples/models/sklearn/1.0/model-2"
    EOF

    This sets canaryTrafficPercent to 10, which tells KServe to send 10% of traffic to the canary revision and 90% to the current revision. The storageUri points to the updated model (model-2).

  2. Verify the traffic split:

    kubectl get isvc -n kserve-test sklearn-iris

    Expected output:

    NAME           URL                                           READY   PREV   LATEST   PREVROLLEDOUTREVISION          LATESTREADYREVISION            AGE
    sklearn-iris   http://sklearn-iris.kserve-test.example.com   True    90     10       sklearn-iris-predictor-00001   sklearn-iris-predictor-00002   11m

    PREV shows 90 (current revision) and LATEST shows 10 (canary revision).

  3. Confirm that both revisions are running:

    kubectl get pod -n kserve-test

    Expected output:

    NAME                                                       READY   STATUS    RESTARTS   AGE
    sklearn-iris-predictor-00001-deployment-7965bcc66-grdbq    2/2     Running   0          12m
    sklearn-iris-predictor-00002-deployment-6744dbbd8c-wfghv   2/2     Running   0          86s

    Two pods are running, one for each revision. The revision name embedded in each pod name identifies it: predictor-00001 for the current revision and predictor-00002 for the canary.
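
The manifests in this topic differ only in the canaryTrafficPercent field and the storageUri value, so they can be generated from a small helper instead of being edited by hand. The following is a sketch under that assumption; render_isvc is a hypothetical name, and the commented kubectl wait step assumes that the InferenceService reports a Ready condition.

```shell
# Render an InferenceService manifest for the sklearn-iris example.
# Usage: render_isvc <storage-uri> [canary-percent]
# Omitting the percentage leaves canaryTrafficPercent out of the manifest.
render_isvc() {
  echo 'apiVersion: "serving.kserve.io/v1beta1"'
  echo 'kind: "InferenceService"'
  echo 'metadata:'
  echo '  name: "sklearn-iris"'
  echo 'spec:'
  echo '  predictor:'
  if [ -n "${2:-}" ]; then
    echo "    canaryTrafficPercent: $2"
  fi
  echo '    model:'
  echo '      modelFormat:'
  echo '        name: sklearn'
  echo "      storageUri: \"$1\""
}

# Example: send 10% of traffic to model-2, then wait until the service is ready.
# render_isvc "gs://kfserving-examples/models/sklearn/1.0/model-2" 10 \
#   | kubectl apply -n kserve-test -f -
# kubectl wait --for=condition=Ready isvc/sklearn-iris -n kserve-test --timeout=300s
```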

Promote the canary to receive all traffic

After you validate the canary revision, promote it by removing the canaryTrafficPercent field and reapplying the InferenceService configuration.

  1. Apply the configuration without canaryTrafficPercent:

    kubectl apply -n kserve-test -f - <<EOF
    apiVersion: "serving.kserve.io/v1beta1"
    kind: "InferenceService"
    metadata:
      name: "sklearn-iris"
    spec:
      predictor:
        model:
          modelFormat:
            name: sklearn
          storageUri: "gs://kfserving-examples/models/sklearn/1.0/model-2"
    EOF

    Without canaryTrafficPercent, KServe routes 100% of traffic to the latest ready revision.

  2. Verify that all traffic goes to the new revision:

    kubectl get isvc -n kserve-test sklearn-iris

    Expected output:

    NAME           URL                                           READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION            AGE
    sklearn-iris   http://sklearn-iris.kserve-test.example.com   True           100                              sklearn-iris-predictor-00002   18m

    All traffic goes to sklearn-iris-predictor-00002.
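
As an alternative to reapplying the full manifest, the same field can be changed with kubectl patch, which makes it easier to widen the canary in stages before you promote it. This is a sketch rather than a recommended procedure: the function names, the intermediate percentages, and the pause between steps are illustrative assumptions.

```shell
# Set canaryTrafficPercent on an existing InferenceService.
# Usage: set_canary_percent <namespace> <isvc-name> <percent>
set_canary_percent() {
  kubectl patch isvc "$2" -n "$1" --type merge \
    -p "{\"spec\":{\"predictor\":{\"canaryTrafficPercent\":$3}}}"
}

# Promote the canary by removing canaryTrafficPercent from the spec.
# The JSON patch "remove" operation fails if the field is not present.
# Usage: promote_canary <namespace> <isvc-name>
promote_canary() {
  kubectl patch isvc "$2" -n "$1" --type json \
    -p '[{"op":"remove","path":"/spec/predictor/canaryTrafficPercent"}]'
}

# Example: widen the canary in two stages, then promote it.
# for pct in 25 50; do
#   set_canary_percent kserve-test sklearn-iris "$pct"
#   sleep 300   # illustrative pause to observe the canary
# done
# promote_canary kserve-test sklearn-iris
```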

Roll back to the previous revision

If the canary revision has issues, set canaryTrafficPercent to 0 to shift all traffic back to the previous stable revision.

  1. Set canaryTrafficPercent to 0:

    kubectl apply -n kserve-test -f - <<EOF
    apiVersion: "serving.kserve.io/v1beta1"
    kind: "InferenceService"
    metadata:
      name: "sklearn-iris"
    spec:
      predictor:
        canaryTrafficPercent: 0
        model:
          modelFormat:
            name: sklearn
          storageUri: "gs://kfserving-examples/models/sklearn/1.0/model-2"
    EOF

  2. Verify the rollback:

    kubectl get isvc -n kserve-test sklearn-iris

    Expected output:

    NAME           URL                                           READY   PREV   LATEST   PREVROLLEDOUTREVISION          LATESTREADYREVISION            AGE
    sklearn-iris   http://sklearn-iris.kserve-test.example.com   True    100    0        sklearn-iris-predictor-00001   sklearn-iris-predictor-00002   22m

    PREV shows 100 -- all traffic goes to the previous revision (sklearn-iris-predictor-00001).
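
A rollback can also be tied to a validation step, so that a failed check shifts traffic back automatically. The sketch below assumes that you supply your own validation command (for example, a smoke test against the canary); validate_or_rollback is a hypothetical helper, and it uses kubectl patch in place of reapplying the manifest.

```shell
# Run a validation command; if it fails, set canaryTrafficPercent to 0
# so that all traffic returns to the previous revision.
# Usage: validate_or_rollback <namespace> <isvc-name> <command> [args...]
validate_or_rollback() {
  _ns=$1
  _name=$2
  shift 2
  if "$@"; then
    echo "validation passed; leaving the canary in place"
  else
    echo "validation failed; rolling back $_name" >&2
    kubectl patch isvc "$_name" -n "$_ns" --type merge \
      -p '{"spec":{"predictor":{"canaryTrafficPercent":0}}}'
    return 1
  fi
}

# Example (./smoke-test.sh is a placeholder for your own check):
# validate_or_rollback kserve-test sklearn-iris ./smoke-test.sh
```

The function returns a non-zero status after a rollback, so a calling pipeline can stop at that point.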

Route traffic to a specific revision by tag

Tag-based routing lets you send requests directly to a specific revision by URL, independent of the percentage-based traffic split. Use this to test a canary revision in isolation before routing live traffic to it.

  1. Enable tag-based routing by adding the serving.kserve.io/enable-tag-routing annotation:

    kubectl apply -n kserve-test -f - <<EOF
    apiVersion: "serving.kserve.io/v1beta1"
    kind: "InferenceService"
    metadata:
      name: "sklearn-iris"
      annotations:
        serving.kserve.io/enable-tag-routing: "true"
    spec:
      predictor:
        canaryTrafficPercent: 10
        model:
          modelFormat:
            name: sklearn
          storageUri: "gs://kfserving-examples/models/sklearn/1.0/model-2"
    EOF

  2. Check the tagged URLs in the InferenceService status:

    kubectl get isvc -n kserve-test sklearn-iris -oyaml

    The status.components.predictor.traffic section shows two tagged entries:

    traffic:
    - latestRevision: true
      percent: 10
      revisionName: sklearn-iris-predictor-00003
      tag: latest
      url: http://latest-sklearn-iris-predictor.kserve-test.example.com
    - latestRevision: false
      percent: 90
      revisionName: sklearn-iris-predictor-00001
      tag: prev
      url: http://prev-sklearn-iris-predictor.kserve-test.example.com

    Each revision gets a dedicated URL with a prefix that matches its tag:

    Revision   Tag      URL
    Canary     latest   http://latest-sklearn-iris-predictor.kserve-test.example.com
    Current    prev     http://prev-sklearn-iris-predictor.kserve-test.example.com

  3. Send a request to a specific revision by setting the Host header to the hostname of its tagged URL:

    curl -v \
      -H "Host: latest-sklearn-iris-predictor.kserve-test.example.com" \
      http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/sklearn-iris:predict \
      -d @./iris-input.json

    Variable          Description
    ${INGRESS_HOST}   Hostname or IP address of the ASM ingress gateway
    ${INGRESS_PORT}   Port of the ASM ingress gateway

    To route the request to the current revision instead, change the Host header to prev-sklearn-iris-predictor.kserve-test.example.com.
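
Instead of copying the hostname by hand, you can derive it from the tagged URL in the status. The following sketch assumes the traffic entries shown in step 2; tag_url and url_to_host are hypothetical helper names, and INGRESS_HOST and INGRESS_PORT must already be set as described above.

```shell
# Print the URL that KServe assigned to a traffic tag (for example, latest or prev).
# Usage: tag_url <namespace> <isvc-name> <tag>
tag_url() {
  kubectl get isvc "$2" -n "$1" \
    -o jsonpath="{.status.components.predictor.traffic[?(@.tag=='$3')].url}"
}

# Strip the scheme and any path so that the value can be used as a Host header.
# Usage: url_to_host <url>
url_to_host() {
  printf '%s\n' "$1" | sed -e 's|^[a-zA-Z]*://||' -e 's|/.*$||'
}

# Example: send a request to the canary through its tag.
# host=$(url_to_host "$(tag_url kserve-test sklearn-iris latest)")
# curl -H "Host: ${host}" \
#   "http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/sklearn-iris:predict" \
#   -d @./iris-input.json
```

Passing prev instead of latest targets the current revision.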