Container Compute Service: Quickly deploy an inference service using KServe

Last Updated: Jan 19, 2026

You can use KServe to deploy AI models as serverless inference services. KServe provides key features such as auto-scaling, multi-version management, and canary releases.
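
These features surface directly in the InferenceService API. The following is a minimal sketch, not part of this walkthrough: minReplicas, maxReplicas, and canaryTrafficPercent are standard KServe v1beta1 fields, while the service name and storage URI below are placeholders.

    apiVersion: "serving.kserve.io/v1beta1"
    kind: "InferenceService"
    metadata:
      name: "my-model"                # placeholder name
    spec:
      predictor:
        minReplicas: 1                # auto-scaling lower bound
        maxReplicas: 5                # auto-scaling upper bound
        canaryTrafficPercent: 10      # route 10% of traffic to the latest revision
        model:
          modelFormat:
            name: sklearn
          storageUri: "gs://example-bucket/models/my-model"   # placeholder URI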

Prerequisites

The KServe component must be deployed. For more information, see Deploy the KServe component.
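
To verify the installation, you can list the KServe controller pods. The namespace depends on how the component was deployed; the upstream default is kserve:

    kubectl get pods -n kserve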

Step 1: Deploy the InferenceService

First, deploy an InferenceService with prediction capabilities. This service uses a scikit-learn model trained on the iris dataset. The dataset has three output classes: Iris Setosa (index: 0), Iris Versicolour (index: 1), and Iris Virginica (index: 2). You can then send inference requests to the deployed model to predict the corresponding iris plant class.

Note

The iris dataset consists of 50 data points for each of three types of iris flowers. Each sample contains four features: the length and width of its sepals and petals.
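
For reference, each sample in the request payloads used below is an array of the four features in the standard scikit-learn ordering. The example values here are the first record of the dataset, an Iris Setosa sample:

    # [sepal length, sepal width, petal length, petal width], in centimeters
    [5.1, 3.5, 1.4, 0.2]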

  1. Run the following command to create an InferenceService named sklearn-iris.

    kubectl apply -f - <<EOF
    apiVersion: "serving.kserve.io/v1beta1"
    kind: "InferenceService"
    metadata:
      name: "sklearn-iris"
    spec:
      predictor:
        model:
          modelFormat:
            name: sklearn
          storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"
    EOF
  2. Run the following command to check the service status.

    kubectl get inferenceservices sklearn-iris

    Expected output:

    NAME           URL                                                         READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION                    AGE
    sklearn-iris   http://sklearn-iris-predictor-default.default.example.com   True           100                              sklearn-iris-predictor-default-00001   51s
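
    If you script this step, you can instead block until the service reports Ready. This is a generic kubectl pattern rather than part of this topic; the 300-second timeout is an arbitrary choice.

    kubectl wait --for=condition=Ready inferenceservice/sklearn-iris --timeout=300s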

Step 2: Access the service

The IP address and access method vary depending on the service gateway. Select the appropriate method for your gateway.
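
If you are not sure which gateway is installed, you can check for the two resources this topic relies on. These checks assume the default names used below:

    # A Kourier installation exposes this Service; an ALB installation is described by an AlbConfig.
    kubectl -n knative-serving get svc kourier
    kubectl get albconfig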

ALB

  1. Run the following command to retrieve the Application Load Balancer (ALB) endpoint.

    kubectl get albconfig knative-internet

    Expected output:

    NAME               ALBID                   DNSNAME                                             PORT&PROTOCOL   CERTID   AGE
    knative-internet   alb-hvd8nngl0l*******   alb-hvd8nngl0l******.cn-<region>.alb.aliyuncs.com                            2
  2. Run the following command to write the JSON input for the inference request to the ./iris-input.json file.

    cat <<EOF > "./iris-input.json"
    {
      "instances": [
        [6.8,  2.8,  4.8,  1.4],
        [6.0,  3.4,  4.5,  1.6]
      ]
    }
    EOF
  3. Run the following command to access the service.

    INGRESS_DOMAIN=$(kubectl get albconfig knative-internet -o jsonpath='{.status.loadBalancer.dnsname}')
    SERVICE_HOSTNAME=$(kubectl get inferenceservice sklearn-iris -o jsonpath='{.status.url}' | cut -d "/" -f 3)
    curl -v -H "Host: ${SERVICE_HOSTNAME}" "http://${INGRESS_DOMAIN}/v1/models/sklearn-iris:predict" -d @./iris-input.json

    Expected output:

    *   Trying 120.77.XX.XX...
    * TCP_NODELAY set
    * Connected to alb-hvd8nngl0l******.cn-<region>.alb.aliyuncs.com (120.77.XX.XX) port 80 (#0)
    > POST /v1/models/sklearn-iris:predict HTTP/1.1
    > Host: sklearn-iris-predictor-default.default.example.com
    > User-Agent: curl/7.58.0
    > Accept: */*
    > Content-Length: 76
    > Content-Type: application/x-www-form-urlencoded
    > 
    * upload completely sent off: 76 out of 76 bytes
    < HTTP/1.1 200 OK
    < Date: Thu, 13 Jul 2023 01:48:44 GMT
    < Content-Type: application/json
    < Content-Length: 21
    < Connection: keep-alive
    < 
    * Connection #0 to host alb-hvd8nngl0l******.cn-<region>.alb.aliyuncs.com left intact
    {"predictions":[1,1]}

    The response {"predictions": [1, 1]} indicates that the model classifies both submitted samples as Iris Versicolour (index 1).
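
    As an optional sanity check of the class mapping (not part of the original procedure), you can send the first record of the iris dataset, an Iris Setosa sample; a model trained on this dataset should return index 0:

    curl -s -H "Host: ${SERVICE_HOSTNAME}" "http://${INGRESS_DOMAIN}/v1/models/sklearn-iris:predict" -d '{"instances": [[5.1, 3.5, 1.4, 0.2]]}'
    # Should return {"predictions":[0]}, that is, Iris Setosa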

Kourier

  1. Run the following command to retrieve the Kourier service endpoint.

    kubectl -n knative-serving get svc kourier

    Expected output:

    NAME      TYPE           CLUSTER-IP      EXTERNAL-IP    PORT(S)                      AGE
    kourier   LoadBalancer   192.168.XX.XX   121.40.XX.XX   80:31158/TCP,443:32491/TCP   49m

    The access IP address for the service is 121.40.XX.XX, and the ports are 80 (HTTP) and 443 (HTTPS).

  2. Run the following command to write the JSON input for the inference request to the ./iris-input.json file.

    cat <<EOF > "./iris-input.json"
    {
      "instances": [
        [6.8,  2.8,  4.8,  1.4],
        [6.0,  3.4,  4.5,  1.6]
      ]
    }
    EOF
  3. Run the following command to access the service.

    INGRESS_HOST=$(kubectl -n knative-serving get service kourier -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
    SERVICE_HOSTNAME=$(kubectl get inferenceservice sklearn-iris -o jsonpath='{.status.url}' | cut -d "/" -f 3)
    curl -v -H "Host: ${SERVICE_HOSTNAME}" "http://${INGRESS_HOST}/v1/models/sklearn-iris:predict" -d @./iris-input.json

    Expected output:

    *   Trying 121.40.XX.XX...
    * TCP_NODELAY set
    * Connected to 121.40.XX.XX (121.40.XX.XX) port 80 (#0)
    > POST /v1/models/sklearn-iris:predict HTTP/1.1
    > Host: sklearn-iris-predictor-default.default.example.com
    > User-Agent: curl/7.58.0
    > Accept: */*
    > Content-Length: 76
    > Content-Type: application/x-www-form-urlencoded
    > 
    * upload completely sent off: 76 out of 76 bytes
    < HTTP/1.1 200 OK
    < content-length: 21
    < content-type: application/json
    < date: Wed, 12 Jul 2023 08:23:13 GMT
    < server: envoy
    < x-envoy-upstream-service-time: 4
    < 
    * Connection #0 to host 121.40.XX.XX left intact
    {"predictions":[1,1]}

    The response {"predictions": [1, 1]} indicates that the model classifies both submitted samples as Iris Versicolour (index 1).
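
When you no longer need the example service, you can delete it and its revisions with a single command:

    kubectl delete inferenceservice sklearn-iris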