KServe, formerly known as KFServing, is a model server and inference engine for cloud-native environments. It supports automatic scaling, scale-to-zero, and canary deployments. This topic describes how to deploy KServe to provide AI services based on Alibaba Cloud Service Mesh (ASM) and Container Service for Kubernetes (ACK).

Prerequisites

  - An ASM instance is created, and an ACK cluster is added to the ASM instance as the data plane.
  - kubectl is configured to connect to the ACK cluster on the data plane.

Background information

As a model server, KServe supports the deployment of machine learning and deep learning models at scale. KServe can run in traditional Kubernetes Deployment mode or in serverless mode with support for scale-to-zero. It provides traffic-based automatic scaling and blue-green or canary deployments for models.
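
For example, a canary rollout is configured directly on the InferenceService resource. The following is a minimal sketch, separate from the walkthrough below, that uses the KServe canaryTrafficPercent field to shift 10% of traffic to an updated model; the service name and storage URI are placeholders:
  apiVersion: "serving.kserve.io/v1beta1"
  kind: "InferenceService"
  metadata:
    name: "my-model" # Hypothetical service name.
  spec:
    predictor:
      # Route 10% of traffic to the revision created by this update;
      # the rest stays on the previous ready revision.
      canaryTrafficPercent: 10
      model:
        modelFormat:
          name: sklearn
        storageUri: "gs://example-bucket/models/sklearn/v2" # Placeholder URI.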

Step 1: Install the KServe component cert-manager

  1. Log on to the ASM console. In the left-side navigation pane, choose Service Mesh > Mesh Management.
  2. On the Mesh Management page, click the name of the ASM instance. In the left-side navigation pane, choose Ecosystem > KServe on ASM.
  3. On the KServe on ASM page, click Enable KServe on ASM.
    KServe depends on the cert-manager component. When you install KServe, the cert-manager component is automatically installed. If you want to use your own cert-manager component, turn off Automatically install the CertManager component in the cluster.
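
After KServe is enabled, you can verify from the command line that cert-manager is running. The following check is a sketch that assumes cert-manager was installed into the default cert-manager namespace of the data-plane cluster:
  # The cert-manager, cert-manager-cainjector, and cert-manager-webhook pods should be in the Running state.
  kubectl get pods -n cert-manager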

Step 2: Query the IP address of the ASM instance's ingress gateway

  1. Log on to the ASM console. In the left-side navigation pane, choose Service Mesh > Mesh Management.
  2. On the Mesh Management page, click the name of the ASM instance. In the left-side navigation pane, choose ASM Gateways > Ingress Gateway.
  3. On the Ingress Gateway page, view and save the service address of the ASM instance's ingress gateway.
    The service address is the IP address of the ingress gateway.
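
Alternatively, you can query the same address with kubectl. The following sketch assumes the default ingress gateway service name istio-ingressgateway in the istio-system namespace of the data-plane cluster:
  # Read the external IP address of the ingress gateway service.
  kubectl get svc istio-ingressgateway -n istio-system -o jsonpath='{.status.loadBalancer.ingress[0].ip}'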

Step 3: Create an inference service

In this topic, a trained scikit-learn model is used for testing.

  1. Use kubectl to connect to the ACK cluster on the data plane, and run the following command to create a namespace in which KServe resources will be deployed:
    kubectl create namespace kserve-test
  2. Create an inference service named sklearn-iris.
    1. Create a file named isvc.yaml that contains the following content:
      apiVersion: "serving.kserve.io/v1beta1"
      kind: "InferenceService"
      metadata:
        name: "sklearn-iris"
      spec:
        predictor:
          model:
            modelFormat:
              name: sklearn
            storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"
    2. Run the following command to create the sklearn-iris service in the kserve-test namespace:
      kubectl apply -f isvc.yaml -n kserve-test
  3. Run the following command to check whether the sklearn-iris service is created:
    kubectl get inferenceservices sklearn-iris -n kserve-test
    Expected output:
    NAME           URL                                           READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION            AGE
    sklearn-iris   http://sklearn-iris.kserve-test.example.com   True           100                              sklearn-iris-predictor-00001   3h26m
    The output shows that the value of READY is True, which indicates that the sklearn-iris service is created. A scripted alternative to this check is sketched after this list.
  4. Optional: View the virtual service and gateway created for the scikit-learn model.
    After the sklearn-iris service is created, a virtual service and a gateway are automatically created for the scikit-learn model. To view the created virtual service and gateway, perform the following steps:
    1. Log on to the ASM console. In the left-side navigation pane, choose Service Mesh > Mesh Management.
    2. On the Mesh Management page, click the name of the ASM instance. In the left-side navigation pane, choose Traffic Management Center > VirtualService.
    3. On the VirtualService page, click the Refresh icon next to Namespace and select kserve-test from the Namespace drop-down list to view the created virtual service in the service list.
    4. In the left-side navigation pane, choose ASM Gateways > Gateway.
    5. In the upper part of the Gateway page, select knative-serving from the Namespace drop-down list to view the created gateway in the gateway list.
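
Instead of repeatedly running the readiness query above, you can block until the service becomes ready. This is a minimal sketch that relies on the Ready condition that KServe sets on the InferenceService resource:
  # Wait up to 5 minutes for the InferenceService to report Ready.
  kubectl wait --for=condition=Ready inferenceservice/sklearn-iris -n kserve-test --timeout=300s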

Step 4: Access the service provided by the scikit-learn model

The following section describes how to access the service provided by the scikit-learn model on Linux and macOS.

  1. Run the following command to create an input file for the scikit-learn model:
    cat <<EOF > "./iris-input.json"
    {
      "instances": [
        [6.8,  2.8,  4.8,  1.4],
        [6.0,  3.4,  4.5,  1.6]
      ]
    }
    EOF
  2. Test access to the service provided by the scikit-learn model through the ingress gateway.
    1. Run the following command to obtain the value of SERVICE_HOSTNAME:
      SERVICE_HOSTNAME=$(kubectl get inferenceservice sklearn-iris -n kserve-test -o jsonpath='{.status.url}' | cut -d "/" -f 3)
      echo $SERVICE_HOSTNAME
      Expected output:
      sklearn-iris.kserve-test.example.com
    2. Run the following command to access the service. In this step, set ASM_GATEWAY to the IP address of the ingress gateway queried in Step 2.
      ASM_GATEWAY="XXXX" # Replace XXXX with the IP address of the ingress gateway. 
      curl -H "Host: ${SERVICE_HOSTNAME}" http://${ASM_GATEWAY}:80/v1/models/sklearn-iris:predict -d @./iris-input.json
      Expected output:
      {"predictions": [1, 1]}
  3. Test the performance of the service provided by the scikit-learn model.
    1. Run the following command to deploy an application for stress testing:
      kubectl create -f https://alibabacloudservicemesh.oss-cn-beijing.aliyuncs.com/kserve/v0.7/loadtest.yaml
    2. Run the following command to query the names of pods:
      kubectl get pod
      Expected output:
      NAME                                                       READY   STATUS      RESTARTS   AGE
      load-testxhwtq-pj9fq                                       0/1     Completed   0          3m24s
      sklearn-iris-predictor-00001-deployment-857f9bb56c-vg8tf   2/2     Running     0          51m
    3. Run the following command to view the test result logs:
      kubectl logs load-testxhwtq-pj9fq # Replace load-testxhwtq-pj9fq with the name of the stress testing pod in your cluster. 
      Expected output:
      Requests      [total, rate, throughput]         30000, 500.02, 500.01
      Duration      [total, attack, wait]             59.999s, 59.998s, 1.352ms
      Latencies     [min, mean, 50, 90, 95, 99, max]  1.196ms, 1.463ms, 1.378ms, 1.588ms, 1.746ms, 2.99ms, 18.873ms
      Bytes In      [total, mean]                     690000, 23.00
      Bytes Out     [total, mean]                     2460000, 82.00
      Success       [ratio]                           100.00%
      Status Codes  [code:count]                      200:30000
      Error Set:
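
After the stress test completes and traffic stops, you can also observe the scale-to-zero behavior described in the background information, provided that the service runs in serverless mode. The following sketch assumes the default Knative autoscaling settings, under which an idle predictor is scaled down after its stable window elapses:
  # Watch the predictor pods. With no incoming traffic, they are
  # expected to terminate once the autoscaler's idle window passes.
  kubectl get pods -n kserve-test -l serving.kserve.io/inferenceservice=sklearn-iris -w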