If you use KServe to implement canary releases of inference services, you can better manage the deployment of inference services, reduce the impact of potential errors and failures on users, and guarantee the high availability and stability of the inference services.
Feature description
KServe supports a canary release of an inference service so that a new version of the inference service can receive a portion of the traffic. If a release step fails, the canary release policy can also roll the service back to the earlier version.
In KServe, the last ready revision receives 100% of the traffic. The canaryTrafficPercent field specifies the percentage of the traffic that is routed to the new revision. Based on the value of the canaryTrafficPercent field, KServe automatically splits traffic between the latest ready revision and the previously rolled-out revision.
When the first revision of an inference service is deployed, it receives 100% of the traffic. When multiple revisions are deployed, as in Step 2, a canary release policy can be configured to route 10% of the traffic to the new revision (LatestReadyRevision) and 90% of the traffic to the earlier revision (LatestRolledoutRevision). If a revision is unhealthy or defective, traffic will not be routed to that revision to ensure stability and reliability.
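Conceptually, canaryTrafficPercent defines a per-request weighted split: each incoming request is routed to the new revision with a probability equal to the configured percentage. The following toy sketch (plain shell with awk, not KServe code) simulates a 10% canary weight over 10,000 requests:

```shell
# Toy simulation of a 10/90 canary split (illustration only, not KServe code).
# Each simulated request goes to the canary revision with probability 0.10.
canary=$(awk 'BEGIN { srand(); n = 0; for (i = 0; i < 10000; i++) if (rand() < 0.10) n++; print n }')
echo "canary received ${canary} of 10000 requests"
```

Over many requests, the canary count converges to roughly 10% of the total, which is how the 10/90 split in Step 2 plays out in practice.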
Prerequisites
An inference service is deployed and can run normally. For more information, see Integrate the cloud-native inference service KServe with ASM.
Step 1: View the traffic distribution of the inference service
After the inference service mentioned in Prerequisites is deployed, you can see that 100% of the traffic is routed to revision 1 of the inference service.
Run the following command to view information about the sklearn-iris inference service:
kubectl get isvc -n kserve-test sklearn-iris
Expected output:
NAME           URL                                           READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION            AGE
sklearn-iris   http://sklearn-iris.kserve-test.example.com   True           100                              sklearn-iris-predictor-00001   79s
The output indicates that the proportion of traffic routed to the latest revision, as shown in the LATEST column, is 100%.
Step 2: Update the configuration of the inference service by configuring a canary release policy
Run the following command to add the canaryTrafficPercent field to the predictor field and update storageUri to point to a new model version:

kubectl apply -n kserve-test -f - <<EOF
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "sklearn-iris"
spec:
  predictor:
    canaryTrafficPercent: 10
    model:
      modelFormat:
        name: sklearn
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model-2"
EOF
After the command is executed, the configuration of the sklearn-iris inference service is updated. The added canaryTrafficPercent field is set to 10, indicating that 10% of the traffic is routed to the new inference service (revision 2) and the remaining 90% is still routed to the old inference service (revision 1). As defined by the canary release policy, traffic is split between the new revision 2 and the earlier revision 1.

Run the following command to view information about the sklearn-iris inference service:
kubectl get isvc -n kserve-test sklearn-iris
Expected output:
NAME           URL                                           READY   PREV   LATEST   PREVROLLEDOUTREVISION          LATESTREADYREVISION            AGE
sklearn-iris   http://sklearn-iris.kserve-test.example.com   True    90     10       sklearn-iris-predictor-00001   sklearn-iris-predictor-00002   11m
The output indicates that 90% of the traffic is routed to the old inference service (revision 1), and 10% of the traffic is routed to the new inference service (revision 2).
Run the following command to view the information about the running pods:
kubectl get pod -n kserve-test
Expected output:
NAME                                                       READY   STATUS    RESTARTS   AGE
sklearn-iris-predictor-00001-deployment-7965bcc66-grdbq    2/2     Running   0          12m
sklearn-iris-predictor-00002-deployment-6744dbbd8c-wfghv   2/2     Running   0          86s
The output indicates that one pod is running for the old inference service and one pod is running for the new inference service.
Note: The name of the pod for revision 1 contains predictor-00001, and the name of the pod for revision 2 contains predictor-00002.
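Because the revision name is embedded in each pod name, you can recover the revision that a pod belongs to directly from the listing. A small illustration (plain shell; the pod names below are copied from the sample output above):

```shell
# Map each pod back to its revision by stripping the deployment and
# replica-set suffixes from the pod name.
pods='sklearn-iris-predictor-00001-deployment-7965bcc66-grdbq
sklearn-iris-predictor-00002-deployment-6744dbbd8c-wfghv'
revs=$(echo "$pods" | sed -E 's/^(.*-predictor-[0-9]+)-deployment-.*/\1/')
echo "$revs"
# sklearn-iris-predictor-00001
# sklearn-iris-predictor-00002
```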
Step 3: Switch to the new revision
If the new inference service works well and passes validation tests, you can switch to the new revision by removing the canaryTrafficPercent field and reapplying the custom resource of the inference service.
Run the following command to remove the canaryTrafficPercent field and reapply the custom resource of the inference service to switch to the new revision:

kubectl apply -n kserve-test -f - <<EOF
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "sklearn-iris"
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model-2"
EOF
Run the following command to view information about the sklearn-iris inference service:
kubectl get isvc -n kserve-test sklearn-iris
Expected output:
NAME           URL                                           READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION            AGE
sklearn-iris   http://sklearn-iris.kserve-test.example.com   True           100                              sklearn-iris-predictor-00002   18m
The output indicates that all traffic is routed to the new inference service (revision 2).
Related operations
Roll back to an earlier revision
You can roll back to the old inference service (revision 1) by setting the canaryTrafficPercent field of the new inference service (revision 2) to 0. When the setting takes effect, the inference service rolls back from revision 2 to revision 1, and the proportion of traffic routed to revision 2 changes to 0.
Run the following command to set the proportion of the traffic routed to the inference service (revision 2) to 0%:
kubectl apply -n kserve-test -f - <<EOF
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "sklearn-iris"
spec:
  predictor:
    canaryTrafficPercent: 0
    model:
      modelFormat:
        name: sklearn
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model-2"
EOF
Run the following command to view the information about the sklearn-iris inference service:
kubectl get isvc -n kserve-test sklearn-iris
Expected output:
NAME           URL                                           READY   PREV   LATEST   PREVROLLEDOUTREVISION          LATESTREADYREVISION            AGE
sklearn-iris   http://sklearn-iris.kserve-test.example.com   True    100    0        sklearn-iris-predictor-00001   sklearn-iris-predictor-00002   22m
The output indicates that 100% of the traffic is routed to the old inference service (revision 1).
Use a tag to route traffic
You can enable tag-based routing by setting the annotation serving.kserve.io/enable-tag-routing to true, and then explicitly route traffic to the new inference service (revision 2) or the old inference service (revision 1) by using the tag in the request URL.
Run the following command to apply the inference service of the new revision with canaryTrafficPercent set to 10 and serving.kserve.io/enable-tag-routing set to true:

kubectl apply -n kserve-test -f - <<EOF
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "sklearn-iris"
  annotations:
    serving.kserve.io/enable-tag-routing: "true"
spec:
  predictor:
    canaryTrafficPercent: 10
    model:
      modelFormat:
        name: sklearn
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model-2"
EOF
Run the following command to view the status of the inference service:

kubectl get isvc -n kserve-test sklearn-iris -oyaml

....
status:
  address:
    url: http://sklearn-iris.kserve-test.svc.cluster.local
  components:
    predictor:
      address:
        url: http://sklearn-iris-predictor.kserve-test.svc.cluster.local
      latestCreatedRevision: sklearn-iris-predictor-00003
      latestReadyRevision: sklearn-iris-predictor-00003
      latestRolledoutRevision: sklearn-iris-predictor-00001
      previousRolledoutRevision: sklearn-iris-predictor-00001
      traffic:
      - latestRevision: true
        percent: 10
        revisionName: sklearn-iris-predictor-00003
        tag: latest
        url: http://latest-sklearn-iris-predictor.kserve-test.example.com
      - latestRevision: false
        percent: 90
        revisionName: sklearn-iris-predictor-00001
        tag: prev
        url: http://prev-sklearn-iris-predictor.kserve-test.example.com
      url: http://sklearn-iris-predictor.kserve-test.example.com
....
The output contains two URLs: one for the new inference service and one for the old inference service. You can distinguish them by the latest- or prev- prefix added to the URL.

The URL of the new inference service is http://latest-sklearn-iris-predictor.kserve-test.example.com.
The URL of the old inference service is http://prev-sklearn-iris-predictor.kserve-test.example.com.

Run the following command to add the corresponding URL to a request based on the specific revision that you want to access, call the inference service, and obtain the results.
In the following command, ${INGRESS_HOST} and ${INGRESS_PORT} indicate the host and port of the ingress gateway, and latest-sklearn-iris-predictor.kserve-test.example.com indicates the URL of the inference service that you want to access. You can modify the configuration based on your business requirements.

curl -v -H "Host: latest-sklearn-iris-predictor.kserve-test.example.com" http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/sklearn-iris:predict -d @./iris-input.json
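The request above reads its payload from ./iris-input.json, which this topic does not show. A minimal payload for the sklearn iris example can be created as follows (the feature values here follow the common KServe sklearn sample; adjust the instances for your own model):

```shell
# Write a sample prediction payload. Each inner list holds the four iris
# features: sepal length, sepal width, petal length, petal width.
cat > ./iris-input.json <<'EOF'
{
  "instances": [
    [6.8, 2.8, 4.8, 1.4],
    [6.0, 3.4, 4.5, 1.6]
  ]
}
EOF
```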