Container Service for Kubernetes: Quickly deploy an inference service based on KServe

Last Updated: Feb 14, 2025

KServe is a Kubernetes-based machine learning model serving framework. It lets you deploy one or more trained models through Kubernetes CustomResourceDefinitions (CRDs) to model serving runtimes such as TFServing, TorchServe, and Triton Inference Server. This simplifies and accelerates the processes of deploying, updating, and scaling models. This topic describes how to quickly deploy an inference service with KServe in Knative to help you serve machine learning models effectively in production environments.

After you use a KServe InferenceService to deploy models in Knative, you can use the following serverless features provided by KServe (see the configuration sketch after this list):

  • Scaling to zero

  • Auto scaling based on requests per second (RPS), concurrency, and CPU and GPU metrics

  • Version management

  • Traffic management

  • Security authentication

  • Out-of-the-box metrics
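To make these features concrete, the following sketch shows how scale-to-zero and request-based auto scaling can be configured on an InferenceService. The field names (minReplicas, maxReplicas, scaleMetric, and scaleTarget) and the example values are assumptions based on the open source KServe specification; verify them against the KServe version installed in your cluster before you use them.

    kubectl apply -f - <<EOF
    apiVersion: "serving.kserve.io/v1beta1"
    kind: "InferenceService"
    metadata:
      name: "sklearn-iris-autoscaling"    # hypothetical name used only for this sketch
    spec:
      predictor:
        minReplicas: 0           # allow the predictor to scale to zero when it receives no requests
        maxReplicas: 3           # upper bound for auto scaling
        scaleMetric: concurrency # scale on concurrent requests; rps, cpu, and memory are also defined by the spec
        scaleTarget: 10          # target value per pod for the selected metric
        model:
          modelFormat:
            name: sklearn
          storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"
    EOF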

KServe introduction

  • ModelServer and MLServer

    ModelServer and MLServer are two model serving runtimes that KServe uses to deploy and manage machine learning models, and both provide out-of-the-box model serving. ModelServer is a Python model serving runtime that implements KServe prediction protocol v1. MLServer implements KServe prediction protocol v2 with REST and gRPC (see the request examples after this list). You can also build custom model servers for complex use cases. In addition, KServe provides basic API primitives that make it easy to build custom model serving runtimes, and you can use other tools such as BentoML to build custom model serving images.

  • KServe Controller

    The KServe Controller is a key component of KServe. It manages custom InferenceService resources, and creates and deploys Knative Services to automate resource scaling. It scales the Deployment that backs a Knative Service based on traffic volume, and automatically scales the Service pods to zero when no requests are sent to the Knative Service. Auto scaling uses model serving resources more efficiently and prevents resource waste.

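The two prediction protocols expect different request formats. The v1 format is the {"instances": [...]} body that is used in Step 3 of this topic. For comparison, the following sketch shows the general shape of a v2 (Open Inference Protocol) request for the same sample; the endpoint path /v2/models/<model-name>/infer, the tensor name input-0, and the file name are assumptions based on the open source protocol definition, not values taken from this topic.

    # v2-style request body (for runtimes such as MLServer that implement prediction protocol v2)
    cat <<EOF > "./v2-iris-input.json"
    {
      "inputs": [
        {
          "name": "input-0",
          "shape": [1, 4],
          "datatype": "FP32",
          "data": [6.8, 2.8, 4.8, 1.4]
        }
      ]
    }
    EOF
    # The request would then be sent to /v2/models/<model-name>/infer instead of /v1/models/<model-name>:predict.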

Prerequisites

Knative is deployed in your cluster. For more information, see Deploy and manage Knative.

Step 1: Deploy the KServe component

  1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

  2. On the Clusters page, find the cluster that you want to manage and click its name. In the left-side navigation pane, choose Applications > Knative.

  3. On the Components tab, find KServe and click Deploy in the Actions column. Complete the deployment as prompted.

    If Deployed is displayed in the Status column of the KServe component, the component is installed. You can also verify the installation from the command line, as shown in the sketch below.
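    The following sketch checks that the InferenceService CRD is registered and that a KServe controller pod is running. The exact namespace and pod names depend on how the component is installed, so the broad grep filters are intentional.

    kubectl get crd | grep serving.kserve.io      # the inferenceservices.serving.kserve.io CRD should be listed
    kubectl get pods -A | grep kserve             # a KServe controller pod should be in the Running state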

Step 2: Deploy an inference service

First, deploy a predictive inference service that uses a scikit-learn model trained on the Iris dataset. The dataset covers three Iris species: Iris Setosa (index 0), Iris Versicolour (index 1), and Iris Virginica (index 2). You can send inference requests to the Service to predict the species of an Iris sample.

Note

The Iris dataset contains 50 samples for each species. Each sample has four features: sepal length, sepal width, petal length, and petal width.

  1. Run the following command to deploy an inference service named sklearn-iris:

    kubectl apply -f - <<EOF
    apiVersion: "serving.kserve.io/v1beta1"
    kind: "InferenceService"
    metadata:
      name: "sklearn-iris"
    spec:
      predictor:
        model:
          modelFormat:
            name: sklearn
          storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"
    EOF
  2. Run the following command to query the status of the Service:

    kubectl get inferenceservices sklearn-iris

    Expected output:

    NAME           URL                                                         READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION                    AGE
    sklearn-iris   http://sklearn-iris-predictor-default.default.example.com   True           100                              sklearn-iris-predictor-default-00001   51s
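    Optionally, you can inspect the resources that KServe created for the Service. The label key serving.kserve.io/inferenceservice in the following sketch is an assumption based on the open source KServe release; adjust it if your pods carry different labels.

    kubectl get pods -l serving.kserve.io/inferenceservice=sklearn-iris   # predictor pods created for the InferenceService
    kubectl get ksvc | grep sklearn-iris                                  # the underlying Knative Service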

Step 3: Access the Service

The IP address and access method of the Service vary based on the gateway that is used.
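If you are not sure which gateway your Knative deployment uses, you can inspect the Knative networking configuration. The following sketch assumes that the standard config-network ConfigMap from open source Knative exists in the knative-serving namespace; the exact key name for the ingress class varies across Knative versions, so the output is simply filtered for ingress-related entries.

    kubectl -n knative-serving get cm config-network -o yaml | grep -i ingress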

ALB

  1. Run the following command to query the address of the ALB gateway:

    kubectl get albconfig knative-internet                         

    Expected output:

    NAME               ALBID                    DNSNAME                                              PORT&PROTOCOL   CERTID   AGE
    knative-internet   alb-hvd8nngl0l*******   alb-hvd8nngl0l******.cn-<region>.alb.aliyuncs.com                               2
  2. Run the following command to create a file named ./iris-input.json that contains the inference requests:

    cat <<EOF > "./iris-input.json"
    {
      "instances": [
        [6.8,  2.8,  4.8,  1.4],
        [6.0,  3.4,  4.5,  1.6]
      ]
    }
    EOF
  3. Run the following command to access the Service:

    INGRESS_DOMAIN=$(kubectl get albconfig knative-internet -o jsonpath='{.status.loadBalancer.dnsname}')
    SERVICE_HOSTNAME=$(kubectl get inferenceservice sklearn-iris -o jsonpath='{.status.url}' | cut -d "/" -f 3)
    curl -v -H "Host: ${SERVICE_HOSTNAME}" "http://${INGRESS_DOMAIN}/v1/models/sklearn-iris:predict" -d @./iris-input.json

    Expected output:

    *   Trying 120.77.XX.XX...
    * TCP_NODELAY set
    * Connected to alb-hvd8nngl0l******.cn-<region>.alb.aliyuncs.com (120.77.XX.XX) port 80 (#0)
    > POST /v1/models/sklearn-iris:predict HTTP/1.1
    > Host: sklearn-iris-predictor-default.default.example.com
    > User-Agent: curl/7.58.0
    > Accept: */*
    > Content-Length: 76
    > Content-Type: application/x-www-form-urlencoded
    > 
    * upload completely sent off: 76 out of 76 bytes
    < HTTP/1.1 200 OK
    < Date: Thu, 13 Jul 2023 01:48:44 GMT
    < Content-Type: application/json
    < Content-Length: 21
    < Connection: keep-alive
    < 
    * Connection #0 to host alb-hvd8nngl0l******.cn-<region>.alb.aliyuncs.com left intact
    {"predictions":[1,1]}

    {"predictions": [1, 1]} is returned, which indicates that both samples sent to the inference service match index is 1. This means that the Irises in both samples are Iris Versicolour.

MSE

  1. Run the following command to query the address of the MSE gateway:

    kubectl -n knative-serving get ing stats-ingress

    Expected output:

    NAME            CLASS                  HOSTS   ADDRESS                         PORTS   AGE
    stats-ingress   knative-ingressclass   *       192.168.XX.XX,47.107.XX.XX      80      15d

    47.107.XX.XX in the ADDRESS column is the public IP address of the MSE gateway, which is used to access the inference Service. The order in which the public and private IP addresses of the MSE gateway are listed is not fixed. In some cases, the public IP address comes first, for example, 47.107.XX.XX,192.168.XX.XX. A sketch that selects the public IP address automatically is provided at the end of this procedure.

  2. Run the following command to create a file named ./iris-input.json that contains the inference requests:

    cat <<EOF > "./iris-input.json"
    {
      "instances": [
        [6.8,  2.8,  4.8,  1.4],
        [6.0,  3.4,  4.5,  1.6]
      ]
    }
    EOF
  3. Run the following command to access the Service:

    # The order of the public and private IP addresses of the MSE gateway is not fixed. In this example, the public IP address is used to access the inference Service. Use ingress[1] if the public IP address is listed second (for example, 192.168.XX.XX,47.107.XX.XX) and ingress[0] if it is listed first. Choose the index based on the actual order of the IP addresses.
    INGRESS_HOST=$(kubectl -n knative-serving get ing stats-ingress -o jsonpath='{.status.loadBalancer.ingress[1].ip}')
    SERVICE_HOSTNAME=$(kubectl get inferenceservice sklearn-iris -o jsonpath='{.status.url}' | cut -d "/" -f 3)
    curl -v -H "Host: ${SERVICE_HOSTNAME}" "http://${INGRESS_HOST}/v1/models/sklearn-iris:predict" -d @./iris-input.json

    Expected output:

    *   Trying 47.107.XX.XX... # 47.107.XX.XX is the public IP address of the MSE gateway. 
    * TCP_NODELAY set
    * Connected to 47.107.XX.XX (47.107.XX.XX) port 80 (#0)
    > POST /v1/models/sklearn-iris:predict HTTP/1.1
    > Host: sklearn-iris-predictor-default.default.example.com
    > User-Agent: curl/7.58.0
    > Accept: */*
    > Content-Length: 76
    > Content-Type: application/x-www-form-urlencoded
    > 
    * upload completely sent off: 76 out of 76 bytes
    < HTTP/1.1 200 OK
    < content-length: 21
    < content-type: application/json
    < date: Tue, 11 Jul 2023 09:56:00 GMT
    < server: istio-envoy
    < req-cost-time: 5
    < req-arrive-time: 1689069360639
    < resp-start-time: 1689069360645
    < x-envoy-upstream-service-time: 4
    < 
    * Connection #0 to host 47.107.XX.XX left intact
    {"predictions":[1,1]}

    {"predictions": [1, 1]} is returned, which indicates that both samples sent to the inference Service match index is 1. This means that the Irises in both samples are Iris Versicolour.

Kourier

  1. Run the following command to query the address of the Kourier gateway:

    kubectl -n knative-serving get svc kourier

    Expected output:

    NAME      TYPE           CLUSTER-IP    EXTERNAL-IP      PORT(S)                      AGE
    kourier   LoadBalancer   192.168.XX.XX   121.40.XX.XX  80:31158/TCP,443:32491/TCP   49m

    The EXTERNAL-IP column shows that the public IP address of the Kourier gateway is 121.40.XX.XX. The Service ports are HTTP 80 and HTTPS 443.

  2. Run the following command to create a file named ./iris-input.json that contains the inference requests:

    cat <<EOF > "./iris-input.json"
    {
      "instances": [
        [6.8,  2.8,  4.8,  1.4],
        [6.0,  3.4,  4.5,  1.6]
      ]
    }
    EOF
  3. Run the following command to access the Service:

    INGRESS_HOST=$(kubectl -n knative-serving get service kourier -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
    SERVICE_HOSTNAME=$(kubectl get inferenceservice sklearn-iris -o jsonpath='{.status.url}' | cut -d "/" -f 3)
    curl -v -H "Host: ${SERVICE_HOSTNAME}" "http://${INGRESS_HOST}/v1/models/sklearn-iris:predict" -d @./iris-input.json

    Expected output:

    *   Trying 121.40.XX.XX...
    * TCP_NODELAY set
    * Connected to 121.40.XX.XX (121.40.XX.XX) port 80 (#0)
    > POST /v1/models/sklearn-iris:predict HTTP/1.1
    > Host: sklearn-iris-predictor-default.default.example.com
    > User-Agent: curl/7.58.0
    > Accept: */*
    > Content-Length: 76
    > Content-Type: application/x-www-form-urlencoded
    > 
    * upload completely sent off: 76 out of 76 bytes
    < HTTP/1.1 200 OK
    < content-length: 21
    < content-type: application/json
    < date: Wed, 12 Jul 2023 08:23:13 GMT
    < server: envoy
    < x-envoy-upstream-service-time: 4
    < 
    * Connection #0 to host 121.40.XX.XX left intact
    {"predictions":[1,1]}

    {"predictions": [1, 1]} is returned, which indicates that both samples sent to the inference Service match index is 1. This means that the Irises in both samples are Iris Versicolour.
