Container Service for Kubernetes: Deploy an inference service with KServe

Last Updated: Nov 06, 2025

You can use KServe on ACK Knative to deploy AI models as serverless inference services. This provides key features like auto-scaling, multi-version management, and canary releases.

Step 1: Install and configure KServe

To ensure smooth integration between KServe and Knative's ALB Ingress or Kourier gateway, first install the KServe component, then modify its default settings to disable its built-in Istio VirtualService creation.

  1. Install the KServe component.

    1. Log on to the ACK console. In the left navigation pane, click Clusters.

    2. On the Clusters page, find the cluster you want and click its name. In the left navigation pane, choose Applications > Knative.

    3. On the Components tab, find and deploy the KServe component in the Add-on section.

  2. Disable Istio VirtualService creation.

    Edit the inferenceservice-config ConfigMap to set disableIstioVirtualHost to true.

    kubectl get configmap inferenceservice-config -n kserve -o yaml \
    | sed 's/"disableIstioVirtualHost": false/"disableIstioVirtualHost": true/g' \
    | kubectl apply -f -

    Expected output:

    configmap/inferenceservice-config configured
  3. Verify the configuration change.

    kubectl get configmap inferenceservice-config -n kserve -o yaml \
    | grep '"disableIstioVirtualHost":' \
    | tail -n1 \
    | awk -F':' '{gsub(/[ ,]/,"",$2); print $2}'

    The output should be true.

  4. Restart the KServe controller to apply the changes.

    kubectl rollout restart deployment kserve-controller-manager -n kserve
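
    To confirm that the restart has completed before you continue, you can optionally wait for the rollout to finish. This is a supplementary check, not part of the original procedure:

    kubectl rollout status deployment kserve-controller-manager -n kserve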

Step 2: Deploy the InferenceService

This example deploys a scikit-learn classification model trained on the Iris dataset. The service accepts an array of four measurements for a flower and predicts which of the three species it belongs to.

Input (an array of four numerical features):

  1. Sepal length

  2. Sepal width

  3. Petal length

  4. Petal width

Output (the predicted class index):

  • 0: Iris setosa

  • 1: Iris versicolour

  • 2: Iris virginica

  1. Create a file named inferenceservice.yaml with the following content, which defines the InferenceService.

    apiVersion: "serving.kserve.io/v1beta1"
    kind: "InferenceService"
    metadata:
      name: "sklearn-iris"
    spec:
      predictor:
        model:
          # The format of the model, in this case scikit-learn
          modelFormat:
            name: sklearn
          image: "kube-ai-registry.cn-shanghai.cr.aliyuncs.com/ai-sample/kserve-sklearn-server:v0.12.0"
          command:
          - sh
          - -c
          - "python -m sklearnserver --model_name=sklearn-iris --model_dir=/models --http_port=8080"
  2. Deploy the InferenceService.

    kubectl apply -f inferenceservice.yaml
  3. Check the service status.

    kubectl get inferenceservices sklearn-iris

    In the output, when the READY column shows True, the service is up and running.

    NAME           URL                                                         READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION                    AGE
    sklearn-iris   http://sklearn-iris-predictor-default.default.example.com   True           100                              sklearn-iris-predictor-default-00001   51s
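
    If you prefer not to poll the status manually, you can block until the service reports readiness. The following is an optional sketch that waits on the InferenceService's Ready condition; adjust the timeout to your environment:

    kubectl wait --for=condition=Ready inferenceservice/sklearn-iris --timeout=300s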

Step 3: Access the service

Send an inference request to the service via the cluster's ingress gateway.

  1. On the Services tab of the Knative page, get the gateway address and the default domain name to access the service.

    The following figure shows an example using an ALB Ingress; the interface for a Kourier gateway is similar. (You can also read the service's default domain with kubectl, as shown in the sketch at the end of this procedure.)
  2. Prepare the request data.

    In your local terminal, create a file named ./iris-input.json containing the request payload. This example includes two samples to be predicted.

    cat <<EOF > "./iris-input.json"
    {
      "instances": [
        [6.8,  2.8,  4.8,  1.4],
        [6.0,  3.4,  4.5,  1.6]
      ]
    }
    EOF
  3. Send an inference request from your local terminal. Replace ${INGRESS_DOMAIN} with the gateway address obtained in the first substep, and set the Host header to the domain shown in the URL column of the status output in Step 2 (sklearn-iris-predictor-default.default.example.com in this example).

    curl -H "Content-Type: application/json" -H "Host: sklearn-iris-predictor-default.default.example.com" "http://${INGRESS_DOMAIN}/v1/models/sklearn-iris:predict" -d @./iris-input.json

    The output indicates that the model predicted that both input samples belong to class 1 (Iris versicolour).

    {"predictions":[1,1]}

Billing

The KServe and Knative components themselves do not incur additional charges. However, you will be billed for the underlying resources you use, including computing resources such as Elastic Compute Service (ECS) instances and Elastic Container Instances, and network resources such as Application Load Balancer (ALB) and Classic Load Balancer (CLB) instances. For details, see Cloud resource fees.

FAQ

Why is my InferenceService stuck in a Not Ready state?

To debug an InferenceService that fails to become ready, first inspect its events, then check the status of its associated pods, and review container logs.

Follow these steps:

  1. Run kubectl describe inferenceservice <your-service-name> and check the events for any error messages. Replace <your-service-name> with your actual service name.

  2. Run kubectl get pods to see if any pods associated with the service are in an Error or CrashLoopBackOff state. Pods for an InferenceService are typically prefixed with the service name.

  3. If a pod is in an error state, check its logs with kubectl logs <pod-name> -c kserve-container to diagnose the failure. This can reveal issues such as a model failing to download due to network problems or an incorrect model file format.
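
The three checks above can be run as one short shell pass. This is a sketch, assuming your service is named sklearn-iris as in this topic and that its pods carry the standard serving.kserve.io/inferenceservice label; substitute your own service name:

  SERVICE_NAME=sklearn-iris
  # 1. Inspect the InferenceService events for error messages.
  kubectl describe inferenceservice "${SERVICE_NAME}"
  # 2. List the pods that belong to the service and check their state.
  kubectl get pods -l serving.kserve.io/inferenceservice="${SERVICE_NAME}"
  # 3. Review recent model server logs from the matching pods.
  kubectl logs -l serving.kserve.io/inferenceservice="${SERVICE_NAME}" -c kserve-container --tail=100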

How do I deploy my own custom-trained model?

  1. Upload your model file to an accessible Object Storage Service (OSS) bucket.

  2. Configure your InferenceService manifest to point to the model:

    • Set the spec.predictor.model.storageUri field to the URI of your model file in the OSS bucket.

    • Set the modelFormat field based on your model's framework, such as tensorflow, pytorch, or onnx.
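
Putting the two settings together, the following is a sketch of a manifest for a custom scikit-learn model. The service name my-custom-model and the path oss://examplebucket/models/my-model/ are placeholders, and access to the bucket (credentials and endpoint) must be configured in your environment as required:

  cat <<EOF | kubectl apply -f -
  apiVersion: "serving.kserve.io/v1beta1"
  kind: "InferenceService"
  metadata:
    name: "my-custom-model"
  spec:
    predictor:
      model:
        modelFormat:
          name: sklearn
        # Replace with the OSS URI of your model directory.
        storageUri: "oss://examplebucket/models/my-model/"
  EOF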

How do I configure GPU resources for my model?

If your model requires a GPU for inference, request GPU resources by adding a resources field to the model section under predictor in your InferenceService YAML manifest, for example:

  spec:
    predictor:
      model:
        resources:
          requests:
            nvidia.com/gpu: "1"
          limits:
            nvidia.com/gpu: "1"

For more information on using GPUs with Knative, see Use GPU resources.
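
After you apply the updated manifest, you can optionally confirm that the GPU was allocated to the predictor pod. This sketch assumes the service is named sklearn-iris as in this topic and uses the serving.kserve.io/inferenceservice label that KServe applies to its pods:

  kubectl describe pods -l serving.kserve.io/inferenceservice=sklearn-iris | grep "nvidia.com/gpu"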

References