Container Compute Service: Quickly deploy an inference service using KServe

Last Updated: Mar 26, 2026

Use KServe on ACK Serverless Knative to deploy AI models as serverless inference services with automatic scaling, multi-version management, and phased releases.

Prerequisites

Before you begin, ensure that you have an ACK cluster with Knative Serving deployed and KServe installed (so that the InferenceService CRD is available), an ingress gateway for Knative (ALB or Kourier, as used in Step 2), and a kubectl client that is connected to the cluster.

Step 1: Deploy the InferenceService

Because the model is deployed as a KServe InferenceService rather than a plain Kubernetes Service, you only need to provide a storageUri: auto-scaling, traffic management, and versioning are handled automatically.

The following example deploys a scikit-learn model trained on the iris dataset. The model accepts four-feature inputs (sepal length, sepal width, petal length, petal width) and predicts one of three iris classes: Iris Setosa (index 0), Iris Versicolour (index 1), or Iris Virginica (index 2).

Note

The iris dataset contains 50 samples per class, each with four measurements.
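
The storageUri in the manifest below points to a public example bucket maintained by the KServe project, from which the sklearn runtime loads a serialized model (typically a model.joblib file). As an optional inspection step, you can list the artifact with the Google Cloud SDK's gsutil; note that gsutil may require authentication to be configured even for public buckets.

    # Optional: list the example model artifact behind the storageUri.
    gsutil ls gs://kfserving-examples/models/sklearn/1.0/model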

  1. Create an InferenceService named sklearn-iris.

    kubectl apply -f - <<EOF
    apiVersion: "serving.kserve.io/v1beta1"
    kind: "InferenceService"
    metadata:
      name: "sklearn-iris"
    spec:
      predictor:
        model:
          modelFormat:
            name: sklearn
          storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"
    EOF
  2. Verify that the service is ready.

    kubectl get inferenceservices sklearn-iris

    The expected output is similar to:

    NAME           URL                                                         READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION                    AGE
    sklearn-iris   http://sklearn-iris-predictor-default.default.example.com   True           100                              sklearn-iris-predictor-default-00001   51s

    The service is ready when the READY column shows True.
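
If you prefer to block until the service becomes ready instead of polling, or need to debug a service that stays not ready, the following sketch may help. It assumes the default namespace and relies on the Ready condition that KServe sets on the InferenceService.

    # Block until the InferenceService reports Ready (give up after 5 minutes).
    kubectl wait --for=condition=Ready inferenceservice/sklearn-iris --timeout=300s

    # If it never becomes Ready, the resource's conditions usually explain why.
    kubectl describe inferenceservice sklearn-iris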

Step 2: Send an inference request

The access method depends on your ingress gateway. Select the section that matches your setup.
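
If you are not sure which gateway your cluster uses, a quick probe against the same resources the sections below query can tell you. This is a convenience sketch, not an official check.

    # Probe for the ALB gateway configuration used in the ALB section.
    kubectl get albconfig knative-internet >/dev/null 2>&1 && echo "ALB gateway found"
    # Probe for the Kourier LoadBalancer Service used in the Kourier section.
    kubectl -n knative-serving get svc kourier >/dev/null 2>&1 && echo "Kourier gateway found"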

Application Load Balancer (ALB)

  1. Get the ALB endpoint.

    kubectl get albconfig knative-internet

    The expected output is similar to:

    NAME               ALBID                    DNSNAME                                              PORT&PROTOCOL   CERTID   AGE
    knative-internet   alb-hvd8nngl0l*******   alb-hvd8nngl0l******.cn-<region>.alb.aliyuncs.com                               2
  2. Create the input file for the inference request.

    cat <<EOF > "./iris-input.json"
    {
      "instances": [
        [6.8,  2.8,  4.8,  1.4],
        [6.0,  3.4,  4.5,  1.6]
      ]
    }
    EOF
  3. Send the inference request.

    INGRESS_DOMAIN=$(kubectl get albconfig knative-internet -o jsonpath='{.status.loadBalancer.dnsname}')
    SERVICE_HOSTNAME=$(kubectl get inferenceservice sklearn-iris -o jsonpath='{.status.url}' | cut -d "/" -f 3)
    curl -v -H "Host: ${SERVICE_HOSTNAME}" "http://${INGRESS_DOMAIN}/v1/models/sklearn-iris:predict" -d @./iris-input.json

    The expected output is similar to:

    *   Trying 120.77.XX.XX...
    * TCP_NODELAY set
    * Connected to alb-hvd8nngl0l******.cn-<region>.alb.aliyuncs.com (120.77.XX.XX) port 80 (#0)
    > POST /v1/models/sklearn-iris:predict HTTP/1.1
    > Host: sklearn-iris-predictor-default.default.example.com
    > User-Agent: curl/7.58.0
    > Accept: */*
    > Content-Length: 76
    > Content-Type: application/x-www-form-urlencoded
    >
    * upload completely sent off: 76 out of 76 bytes
    < HTTP/1.1 200 OK
    < Date: Thu, 13 Jul 2023 01:48:44 GMT
    < Content-Type: application/json
    < Content-Length: 21
    < Connection: keep-alive
    <
    * Connection #0 to host alb-hvd8nngl0l******.cn-<region>.alb.aliyuncs.com left intact
    {"predictions":[1,1]}

    Both data points are predicted as index 1, which corresponds to Iris Versicolour.
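
Beyond :predict, KServe's V1 inference protocol also exposes a model status endpoint over the same gateway. A ready model typically returns a small JSON document containing the model name and a ready flag. A minimal check, reusing the variables set in the previous step:

    # Query model status through the ALB gateway (KServe V1 protocol).
    curl -H "Host: ${SERVICE_HOSTNAME}" "http://${INGRESS_DOMAIN}/v1/models/sklearn-iris"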

Kourier

  1. Get the Kourier service endpoint.

    kubectl -n knative-serving get svc kourier

    The expected output is similar to:

    NAME      TYPE           CLUSTER-IP      EXTERNAL-IP      PORT(S)                      AGE
    kourier   LoadBalancer   192.168.XX.XX   121.40.XX.XX     80:31158/TCP,443:32491/TCP   49m

    The external IP (121.40.XX.XX) is the access address. The service listens on ports 80 (HTTP) and 443 (HTTPS). If the EXTERNAL-IP column is empty or shows a hostname instead of an IP address, see the sketch after this procedure.

  2. Create the input file for the inference request.

    cat <<EOF > "./iris-input.json"
    {
      "instances": [
        [6.8,  2.8,  4.8,  1.4],
        [6.0,  3.4,  4.5,  1.6]
      ]
    }
    EOF
  3. Send the inference request.

    INGRESS_HOST=$(kubectl -n knative-serving get service kourier -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
    SERVICE_HOSTNAME=$(kubectl get inferenceservice sklearn-iris -o jsonpath='{.status.url}' | cut -d "/" -f 3)
    curl -v -H "Host: ${SERVICE_HOSTNAME}" "http://${INGRESS_HOST}/v1/models/sklearn-iris:predict" -d @./iris-input.json

    The expected output is similar to:

    *   Trying 121.40.XX.XX...
    * TCP_NODELAY set
    * Connected to 121.40.XX.XX (121.40.XX.XX) port 80 (#0)
    > POST /v1/models/sklearn-iris:predict HTTP/1.1
    > Host: sklearn-iris-predictor-default.default.example.com
    > User-Agent: curl/7.58.0
    > Accept: */*
    > Content-Length: 76
    > Content-Type: application/x-www-form-urlencoded
    >
    * upload completely sent off: 76 out of 76 bytes
    < HTTP/1.1 200 OK
    < content-length: 21
    < content-type: application/json
    < date: Wed, 12 Jul 2023 08:23:13 GMT
    < server: envoy
    < x-envoy-upstream-service-time: 4
    <
    * Connection #0 to host 121.40.XX.XX left intact
    {"predictions":[1,1]}

    Both data points are predicted as index 1, which corresponds to Iris Versicolour.
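
Some cloud providers publish a hostname instead of an IP address for LoadBalancer Services. If the EXTERNAL-IP column in step 1 was empty or showed a hostname, the following sketch falls back to the hostname field of the Service status before sending the request:

    # Prefer the LoadBalancer IP; fall back to its hostname if no IP is set.
    INGRESS_HOST=$(kubectl -n knative-serving get service kourier -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
    if [ -z "${INGRESS_HOST}" ]; then
      INGRESS_HOST=$(kubectl -n knative-serving get service kourier -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
    fi
    echo "Using ingress endpoint: ${INGRESS_HOST}"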