Implement canary releases of inference services with KServe - Alibaba Cloud Service Mesh

When you deploy a new model version, routing all traffic to it at once risks downtime if the model has issues. Canary releases let you shift a small percentage of traffic to the new version first, validate its behavior, and then either promote it or roll back -- without affecting most users.

How it works

KServe tracks two revisions during a canary release:

Revision	KServe term	Role
Current (stable)	`LatestRolledoutRevision`	The revision that was last promoted to receive 100% of traffic
Canary (new)	`LatestReadyRevision`	The newest revision that is ready to serve traffic

The canaryTrafficPercent field in the InferenceService spec controls the percentage of traffic routed to the canary revision. KServe routes the remainder to the current revision automatically. For example, canaryTrafficPercent: 10 sends 10% to the canary and 90% to the current revision.
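The arithmetic behind the split can be sketched as a small shell helper. The function name traffic_split is hypothetical and is not part of KServe or kubectl; it only illustrates that the current revision receives whatever the canary does not.

```shell
# Print how traffic is divided for a given canaryTrafficPercent value.
# traffic_split is a hypothetical helper, not a KServe command.
traffic_split() {
  canary=$1
  # Reject empty or non-numeric input.
  case "$canary" in
    ''|*[!0-9]*) echo "percent must be an integer from 0 to 100" >&2; return 1 ;;
  esac
  if [ "$canary" -gt 100 ]; then
    echo "percent must be an integer from 0 to 100" >&2
    return 1
  fi
  # The current revision receives the remainder.
  echo "canary=${canary} current=$((100 - canary))"
}

traffic_split 10   # prints: canary=10 current=90
```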

If a revision is unhealthy or defective, KServe does not route traffic to it.

Prerequisites

Before you begin, make sure that you have:

A working inference service, deployed by following the topic "Integrate the cloud-native inference service KServe with ASM"

Verify initial traffic distribution

After deploying the inference service, confirm that 100% of traffic goes to the initial revision.

Check the sklearn-iris inference service:

kubectl get isvc -n kserve-test sklearn-iris

Expected output:

NAME           URL                                           READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION            AGE
sklearn-iris   http://sklearn-iris.kserve-test.example.com   True           100                              sklearn-iris-predictor-00001   79s

The LATEST column shows 100, which confirms that all traffic goes to the single existing revision (sklearn-iris-predictor-00001).
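For scripted checks, the same information can probably be read from the status.components.predictor.traffic field of the InferenceService rather than from the table. The sketch below assumes that each traffic entry exposes revisionName and percent, and that kubectl is pointed at the cluster; revision_with_percent is a hypothetical helper.

```shell
# Read "revision<TAB>percent" lines on stdin and print the revision that
# receives the given percentage of traffic (hypothetical helper).
revision_with_percent() {
  awk -F '\t' -v want="$1" '$2 == want { print $1 }'
}

# Assumed usage against a live cluster (needs kubectl access):
#   kubectl get isvc sklearn-iris -n kserve-test \
#     -o jsonpath='{range .status.components.predictor.traffic[*]}{.revisionName}{"\t"}{.percent}{"\n"}{end}' \
#     | revision_with_percent 100
```

If the command prints sklearn-iris-predictor-00001, the initial revision is the only one receiving traffic.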

Configure a canary release

Split traffic between the current and canary revisions by adding the canaryTrafficPercent field to the InferenceService spec.

Apply the following configuration to route 10% of traffic to a new model version:

kubectl apply -n kserve-test -f - <<EOF
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "sklearn-iris"
spec:
  predictor:
    canaryTrafficPercent: 10
    model:
      modelFormat:
        name: sklearn
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model-2"
EOF

This sets canaryTrafficPercent to 10, which tells KServe to send 10% of traffic to the canary revision and 90% to the current revision. The storageUri points to the updated model (model-2).
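Because the manifests in this topic differ only in the canaryTrafficPercent line, one way to avoid copy-paste errors is to generate the manifest from parameters. The following is a minimal sketch; render_isvc is a hypothetical helper, and the manifest it prints mirrors the one above.

```shell
# Print an InferenceService manifest for sklearn-iris.
# $1: storageUri of the model; $2: optional canaryTrafficPercent.
# render_isvc is a hypothetical helper, not part of KServe.
render_isvc() {
  storage_uri=$1
  percent=${2:-}
  cat <<EOF
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "sklearn-iris"
spec:
  predictor:
EOF
  # Emit the canary field only when a percentage was supplied.
  if [ -n "$percent" ]; then
    echo "    canaryTrafficPercent: ${percent}"
  fi
  cat <<EOF
    model:
      modelFormat:
        name: sklearn
      storageUri: "${storage_uri}"
EOF
}

# Preview the manifest; pipe it to `kubectl apply -n kserve-test -f -` to apply it.
render_isvc "gs://kfserving-examples/models/sklearn/1.0/model-2" 10
```

Calling the helper without the second argument omits the canaryTrafficPercent field.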

Verify the traffic split:

kubectl get isvc -n kserve-test sklearn-iris

Expected output:

NAME           URL                                           READY   PREV   LATEST   PREVROLLEDOUTREVISION          LATESTREADYREVISION            AGE
sklearn-iris   http://sklearn-iris.kserve-test.example.com   True    90     10       sklearn-iris-predictor-00001   sklearn-iris-predictor-00002   11m

PREV shows 90 (current revision) and LATEST shows 10 (canary revision).

Confirm that both revisions are running:

kubectl get pod -n kserve-test

Expected output:

NAME                                                       READY   STATUS    RESTARTS   AGE
sklearn-iris-predictor-00001-deployment-7965bcc66-grdbq    2/2     Running   0          12m
sklearn-iris-predictor-00002-deployment-6744dbbd8c-wfghv   2/2     Running   0          86s

Two pods are running, one for each revision. Each pod name starts with the name of its revision: sklearn-iris-predictor-00001 for the current revision and sklearn-iris-predictor-00002 for the canary.
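To map pods to revisions in a script, the revision name can be cut out of the pod name. This sketch assumes the naming pattern shown in the output above (revision name, then -deployment-, then generated suffixes); revision_of_pod is a hypothetical helper.

```shell
# Print the revision name embedded in a predictor pod name.
revision_of_pod() {
  # Remove the longest suffix that starts with "-deployment-".
  echo "${1%%-deployment-*}"
}

revision_of_pod "sklearn-iris-predictor-00001-deployment-7965bcc66-grdbq"
# prints: sklearn-iris-predictor-00001

# If the revisions are backed by Knative Serving (an assumption about this
# setup), the revision is usually also available as a pod label:
#   kubectl get pod -n kserve-test -L serving.knative.dev/revision
```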

Promote the canary to receive all traffic

After you validate the canary revision, promote it by removing the canaryTrafficPercent field and reapplying the InferenceService configuration.

Apply the configuration without canaryTrafficPercent:

kubectl apply -n kserve-test -f - <<EOF
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "sklearn-iris"
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model-2"
EOF

Without canaryTrafficPercent, KServe routes 100% of traffic to the latest ready revision.

Verify that all traffic goes to the new revision:

kubectl get isvc -n kserve-test sklearn-iris

Expected output:

NAME           URL                                           READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION            AGE
sklearn-iris   http://sklearn-iris.kserve-test.example.com   True           100                              sklearn-iris-predictor-00002   18m

All traffic goes to sklearn-iris-predictor-00002.
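As an alternative to reapplying the whole manifest, a JSON merge patch can change only the canaryTrafficPercent field. This is a sketch and not the procedure documented here: it assumes kubectl patch with --type merge is acceptable in your environment, and canary_patch is a hypothetical helper. In a JSON merge patch, a null value removes the field, which corresponds to promotion.

```shell
# Build a JSON merge patch for spec.predictor.canaryTrafficPercent.
# Pass a number to set the split, or "null" to remove the field (promotion).
# canary_patch is a hypothetical helper.
canary_patch() {
  printf '{"spec":{"predictor":{"canaryTrafficPercent":%s}}}\n' "$1"
}

# Assumed usage (requires cluster access):
#   kubectl patch isvc sklearn-iris -n kserve-test --type merge -p "$(canary_patch null)"
canary_patch null
```

Passing 0 instead of null produces the payload for pinning traffic to the previous revision. If you manage the resource with kubectl apply, keep the applied manifest consistent with any patch so that a later apply does not reintroduce the old value.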

Roll back to the previous revision

If the canary revision has issues, set canaryTrafficPercent to 0 to shift all traffic back to the previous stable revision.

Set canaryTrafficPercent to 0:

kubectl apply -n kserve-test -f - <<EOF
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "sklearn-iris"
spec:
  predictor:
    canaryTrafficPercent: 0
    model:
      modelFormat:
        name: sklearn
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model-2"
EOF

Verify the rollback:

kubectl get isvc -n kserve-test sklearn-iris

Expected output:

NAME           URL                                           READY   PREV   LATEST   PREVROLLEDOUTREVISION          LATESTREADYREVISION            AGE
sklearn-iris   http://sklearn-iris.kserve-test.example.com   True    100    0        sklearn-iris-predictor-00001   sklearn-iris-predictor-00002   22m

PREV shows 100 -- all traffic goes to the previous revision (sklearn-iris-predictor-00001).

Route traffic to a specific revision by tag

Tag-based routing lets you send requests directly to a specific revision by URL, independent of the percentage-based traffic split. Use this to test a canary revision in isolation before routing live traffic to it.

Enable tag-based routing by adding the serving.kserve.io/enable-tag-routing annotation:

kubectl apply -n kserve-test -f - <<EOF
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "sklearn-iris"
  annotations:
    serving.kserve.io/enable-tag-routing: "true"
spec:
  predictor:
    canaryTrafficPercent: 10
    model:
      modelFormat:
        name: sklearn
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model-2"
EOF

Check the tagged URLs in the InferenceService status:

kubectl get isvc -n kserve-test sklearn-iris -oyaml

The status.components.predictor.traffic section shows two tagged entries:

traffic:
- latestRevision: true
  percent: 10
  revisionName: sklearn-iris-predictor-00003
  tag: latest
  url: http://latest-sklearn-iris-predictor.kserve-test.example.com
- latestRevision: false
  percent: 90
  revisionName: sklearn-iris-predictor-00001
  tag: prev
  url: http://prev-sklearn-iris-predictor.kserve-test.example.com

Each revision gets a dedicated URL with a prefix that matches its tag:

Revision	Tag	URL
Canary	`latest`	`http://latest-sklearn-iris-predictor.kserve-test.example.com`
Current	`prev`	`http://prev-sklearn-iris-predictor.kserve-test.example.com`
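The Host header takes a hostname, not a full URL, so the scheme has to be removed from the tagged URL before use. A minimal sketch of that step, with host_from_url as a hypothetical helper:

```shell
# Strip the scheme and any path from a URL, leaving the value for the Host header.
host_from_url() {
  host=${1#*://}      # drop "http://" or "https://"
  echo "${host%%/*}"  # drop anything after the first "/"
}

host_from_url "http://latest-sklearn-iris-predictor.kserve-test.example.com"
# prints: latest-sklearn-iris-predictor.kserve-test.example.com
```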

Send a request to a specific revision by setting the Host header to the hostname of its tagged URL (the URL without the http:// prefix):
```
curl -v \
  -H "Host: latest-sklearn-iris-predictor.kserve-test.example.com" \
  http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/sklearn-iris:predict \
  -d @./iris-input.json
```

Variable	Description
`${INGRESS_HOST}`	Hostname or IP address of the ASM ingress gateway
`${INGRESS_PORT}`	Port of the ASM ingress gateway

To route the request to the current revision instead, change the Host header to prev-sklearn-iris-predictor.kserve-test.example.com.
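The curl request reads its payload from iris-input.json, which this topic does not define. If you do not already have the file from the prerequisite topic, the following sketch creates one. The payload is an assumption based on the upstream KServe sklearn-iris example (two samples with four features each); adjust it if your model expects a different input.

```shell
# Write a sample request body for the sklearn-iris model.
# The values are assumed from the upstream KServe example, not from this topic.
cat > iris-input.json <<'EOF'
{
  "instances": [
    [6.8, 2.8, 4.8, 1.4],
    [6.0, 3.4, 4.5, 1.6]
  ]
}
EOF
```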