
Alibaba Cloud Service Mesh: Use KServe to implement canary releases of inference services

Last Updated: Dec 29, 2023

If you use KServe to implement canary releases, you can manage the rollout of inference services in a more controlled manner, reduce the impact of potential errors and failures on users, and ensure the high availability and stability of the inference services.

Feature description

KServe supports canary releases of inference services, in which a new revision of an inference service receives only a portion of the traffic. If a release step fails, the canary release policy can also roll the service back to the earlier revision.

In KServe, the last ready revision receives 100% of the traffic. The canaryTrafficPercent field specifies the percentage of the traffic that should be routed to the new revision. Based on the value of the canaryTrafficPercent field, KServe automatically distributes traffic to the last ready revision and the revision that is being released.

When the first revision of an inference service is deployed, it receives 100% of the traffic. When multiple revisions are deployed, as in Step 2, a canary release policy can be configured to route 10% of the traffic to the new revision (LatestReadyRevision) and 90% of the traffic to the earlier revision (LatestRolledoutRevision). If a revision is unhealthy or defective, traffic will not be routed to that revision to ensure stability and reliability.
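For illustration, the following sketch shows where the canaryTrafficPercent field is placed under the predictor field of an InferenceService resource. Step 2 walks through the complete example:

spec:
  predictor:
    canaryTrafficPercent: 10  # 10% of traffic goes to the latest ready revision; 90% stays on the last rolled-out revision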

Prerequisites

An inference service is deployed and can run normally. For more information, see Integrate the cloud-native inference service KServe with ASM.

Step 1: View the traffic distribution of the inference service

After the inference service mentioned in Prerequisites is deployed, you can see that 100% of the traffic is routed to revision 1 of the inference service.

Run the following command to view information about the sklearn-iris inference service:

kubectl get isvc -n kserve-test sklearn-iris

Expected output:

NAME           URL                                           READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION            AGE
sklearn-iris   http://sklearn-iris.kserve-test.example.com   True           100                              sklearn-iris-predictor-00001   79s

The LATEST column shows 100, which indicates that 100% of the traffic is routed to the latest ready revision (sklearn-iris-predictor-00001).

Step 2: Update the configuration of the inference service by configuring a canary release policy

  1. Run the following command to add the canaryTrafficPercent field to the predictor field and update the storageUri field to point to the new model version:

    kubectl apply -n kserve-test -f - <<EOF
    apiVersion: "serving.kserve.io/v1beta1"
    kind: "InferenceService"
    metadata:
      name: "sklearn-iris"
    spec:
      predictor:
        canaryTrafficPercent: 10
        model:
          modelFormat:
            name: sklearn
          storageUri: "gs://kfserving-examples/models/sklearn/1.0/model-2"
    EOF
    

    After the command is executed, the configuration of the sklearn-iris inference service is updated. The canaryTrafficPercent field is set to 10, which means that 10% of the traffic is routed to the new inference service (revision 2) and the remaining 90% is still routed to the old inference service (revision 1).

  2. Run the following command to view information about the sklearn-iris inference service:

    kubectl get isvc -n kserve-test sklearn-iris

    Expected output:

    NAME           URL                                           READY   PREV   LATEST   PREVROLLEDOUTREVISION          LATESTREADYREVISION            AGE
    sklearn-iris   http://sklearn-iris.kserve-test.example.com   True    90     10       sklearn-iris-predictor-00001   sklearn-iris-predictor-00002   11m

    The output indicates that 90% of the traffic is routed to the old inference service (revision 1), and 10% of the traffic is routed to the new inference service (revision 2).

  3. Run the following command to view the information about the running pods:

    kubectl get pod -n kserve-test

    Expected output:

    NAME                                                       READY   STATUS    RESTARTS   AGE
    sklearn-iris-predictor-00001-deployment-7965bcc66-grdbq    2/2     Running   0          12m
    sklearn-iris-predictor-00002-deployment-6744dbbd8c-wfghv   2/2     Running   0          86s

    The output indicates that one pod is running for the old inference service (revision 1) and one pod is running for the new inference service (revision 2).

    Note

    The name of the pod for revision 1 contains predictor-00001 and that for revision 2 contains predictor-00002.
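
    If you want to list only the pods that belong to a specific revision, you can filter by the revision label that Knative adds to the pods of a KServe inference service. The following command is a sketch that assumes the standard serving.knative.dev/revision label:

    # List only the pods of revision 2 by filtering on the Knative revision label
    kubectl get pod -n kserve-test -l serving.knative.dev/revision=sklearn-iris-predictor-00002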

Step 3: Switch to the new revision

If the new inference service works well and passes validation tests, you can switch to the new revision by removing the canaryTrafficPercent field and reapplying the InferenceService custom resource.

  1. Run the following command to remove the canaryTrafficPercent field and reapply the InferenceService custom resource to switch to the new revision:

    kubectl apply -n kserve-test -f - <<EOF
    apiVersion: "serving.kserve.io/v1beta1"
    kind: "InferenceService"
    metadata:
      name: "sklearn-iris"
    spec:
      predictor:
        model:
          modelFormat:
            name: sklearn
          storageUri: "gs://kfserving-examples/models/sklearn/1.0/model-2"
    EOF
    
  2. Run the following command to view information about the sklearn-iris inference service:

    kubectl get isvc -n kserve-test sklearn-iris

    Expected output:

    NAME           URL                                           READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION            AGE
    sklearn-iris   http://sklearn-iris.kserve-test.example.com   True           100                              sklearn-iris-predictor-00002   18m

    The output indicates that all traffic is routed to the new inference service (revision 2).

Related operations

Roll back to an earlier revision

You can roll back to the old inference service (revision 1) by setting the canaryTrafficPercent field of the new inference service (revision 2) to 0. When the setting takes effect, the proportion of traffic routed to revision 2 changes to 0 and all traffic is routed to revision 1 again.

  1. Run the following command to set the proportion of the traffic routed to the new inference service (revision 2) to 0%:

    kubectl apply -n kserve-test -f - <<EOF
    apiVersion: "serving.kserve.io/v1beta1"
    kind: "InferenceService"
    metadata:
      name: "sklearn-iris"
    spec:
      predictor:
        canaryTrafficPercent: 0
        model:
          modelFormat:
            name: sklearn
          storageUri: "gs://kfserving-examples/models/sklearn/1.0/model-2"
    EOF
    
  2. Run the following command to view the information about the sklearn-iris inference service:

    kubectl get isvc -n kserve-test sklearn-iris

    Expected output:

    NAME           URL                                           READY   PREV   LATEST   PREVROLLEDOUTREVISION          LATESTREADYREVISION            AGE
    sklearn-iris   http://sklearn-iris.kserve-test.example.com   True    100    0        sklearn-iris-predictor-00001   sklearn-iris-predictor-00002   22m

    The output indicates that 100% of the traffic is routed to the old inference service (revision 1).

Use a tag to route traffic

You can enable tag-based routing by setting the serving.kserve.io/enable-tag-routing annotation to true. You can then explicitly route traffic to the new or the old revision of the inference service by using the tag-specific URL of a revision in the request.

  1. Run the following command to apply the InferenceService with canaryTrafficPercent set to 10 and the serving.kserve.io/enable-tag-routing annotation set to true. Note that updating the annotation creates a new revision, which is why the latest revision in the following output is sklearn-iris-predictor-00003:

    kubectl apply -n kserve-test -f - <<EOF
    apiVersion: "serving.kserve.io/v1beta1"
    kind: "InferenceService"
    metadata:
      name: "sklearn-iris"
      annotations:
        serving.kserve.io/enable-tag-routing: "true"
    spec:
      predictor:
        canaryTrafficPercent: 10
        model:
          modelFormat:
            name: sklearn
          storageUri: "gs://kfserving-examples/models/sklearn/1.0/model-2"
    EOF
    
  2. Run the following command to view the status of the inference service:

    kubectl get isvc -n kserve-test sklearn-iris -oyaml

    Expected output:

    ...
    status:
      address:
        url: http://sklearn-iris.kserve-test.svc.cluster.local
      components:
        predictor:
          address:
            url: http://sklearn-iris-predictor.kserve-test.svc.cluster.local
          latestCreatedRevision: sklearn-iris-predictor-00003
          latestReadyRevision: sklearn-iris-predictor-00003
          latestRolledoutRevision: sklearn-iris-predictor-00001
          previousRolledoutRevision: sklearn-iris-predictor-00001
          traffic:
          - latestRevision: true
            percent: 10
            revisionName: sklearn-iris-predictor-00003
            tag: latest
            url: http://latest-sklearn-iris-predictor.kserve-test.example.com
          - latestRevision: false
            percent: 90
            revisionName: sklearn-iris-predictor-00001
            tag: prev
            url: http://prev-sklearn-iris-predictor.kserve-test.example.com
          url: http://sklearn-iris-predictor.kserve-test.example.com
    ...

    The traffic section of the output contains two tag-specific URLs: one for the new inference service and one for the old inference service. You can distinguish them by the latest- or prev- prefix added to the URL.

    The URL of the new inference service is http://latest-sklearn-iris-predictor.kserve-test.example.com.

    The URL of the old inference service is http://prev-sklearn-iris-predictor.kserve-test.example.com.

  3. Run the following command to call the inference service and obtain the results. Specify the URL of the revision that you want to access in the Host header of the request.

    In the following command, ${INGRESS_HOST} and ${INGRESS_PORT} indicate the host and port of the ingress gateway, and latest-sklearn-iris-predictor.kserve-test.example.com indicates the URL of the inference service that you want to access. You can modify the configuration based on your business requirements.

    curl -v -H "Host: latest-sklearn-iris-predictor.kserve-test.example.com" http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/sklearn-iris:predict -d @./iris-input.json
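
    The iris-input.json file is the request payload created in the prerequisite topic. If you no longer have it, the sklearn-iris sample input from the KServe examples looks like the following sketch:

    {"instances": [[6.8, 2.8, 4.8, 1.4], [6.0, 3.4, 4.5, 1.6]]}

    Similarly, to explicitly route a request to the old inference service, specify the prev- URL in the Host header:

    curl -v -H "Host: prev-sklearn-iris-predictor.kserve-test.example.com" http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/sklearn-iris:predict -d @./iris-input.json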