You can use KServe on ACK Knative to deploy AI models as serverless inference services. This provides key features like auto-scaling, multi-version management, and canary releases.
Step 1: Install and configure KServe
To ensure smooth integration between KServe and Knative's ALB Ingress or Kourier gateway, first install the KServe component, then modify its default settings to disable its built-in Istio VirtualService creation.
Install the KServe component.
Log on to the ACK console. In the left navigation pane, click Clusters.
On the Clusters page, find the cluster you want and click its name. In the left navigation pane, choose Applications > Knative.
On the Components tab, find and deploy the KServe component in the Add-on section.
Disable Istio VirtualService creation.
Edit the `inferenceservice-config` ConfigMap to set `disableIstioVirtualHost` to `true`.

```shell
kubectl get configmap inferenceservice-config -n kserve -o yaml \
  | sed 's/"disableIstioVirtualHost": false/"disableIstioVirtualHost": true/g' \
  | kubectl apply -f -
```

Expected output:

```text
configmap/inferenceservice-config configured
```

Verify the configuration change.
```shell
kubectl get configmap inferenceservice-config -n kserve -o yaml \
  | grep '"disableIstioVirtualHost":' \
  | tail -n1 \
  | awk -F':' '{gsub(/[ ,]/,"",$2); print $2}'
```

The output should be `true`.

Restart the KServe controller to apply the changes.
```shell
kubectl rollout restart deployment kserve-controller-manager -n kserve
```
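Optionally, confirm that the controller restarted successfully before moving on:

```shell
# Wait for the restarted controller Deployment to become available.
kubectl rollout status deployment kserve-controller-manager -n kserve

# Check that the KServe pods are running.
kubectl get pods -n kserve
```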
Step 2: Deploy the InferenceService
This example deploys a scikit-learn classification model trained on the Iris dataset. The service accepts an array of four measurements for a flower and predicts which of the three species it belongs to.
Input: an array of four numerical features per sample, for example `[6.8, 2.8, 4.8, 1.4]`.

Output: the predicted class index, for example `1` (Iris Versicolour).
Create a file named `inferenceservice.yaml` to deploy the InferenceService.

```yaml
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "sklearn-iris"
spec:
  predictor:
    model:
      # The format of the model, in this case scikit-learn
      modelFormat:
        name: sklearn
      image: "kube-ai-registry.cn-shanghai.cr.aliyuncs.com/ai-sample/kserve-sklearn-server:v0.12.0"
      command:
        - sh
        - -c
        - "python -m sklearnserver --model_name=sklearn-iris --model_dir=/models --http_port=8080"
```

Deploy the InferenceService.
```shell
kubectl apply -f inferenceservice.yaml
```

Check the service status.
```shell
kubectl get inferenceservices sklearn-iris
```

In the output, when the `READY` column shows `True`, the service is up and running.

```text
NAME           URL                                                          READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION                    AGE
sklearn-iris   http://sklearn-iris-predictor-default.default.example.com   True            100                              sklearn-iris-predictor-default-00001   51s
```
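If `READY` stays `False` for a while, it can help to watch the resource and check the predictor pods, whose names are prefixed with the service name:

```shell
# Watch the InferenceService until READY flips to True (Ctrl+C to stop).
kubectl get inferenceservice sklearn-iris -w

# List the predictor pods created for the service.
kubectl get pods | grep sklearn-iris
```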
Step 3: Access the service
Send an inference request to the service via the cluster's ingress gateway.
On the Services tab of the Knative page, get the gateway address and the default domain name to access the service.
The following figure shows an example using an ALB Ingress. The interface for a Kourier gateway is similar.

Prepare the request data.
In your local terminal, create a file named `./iris-input.json` containing the request payload. This example includes two samples to be predicted.

```shell
cat <<EOF > "./iris-input.json"
{
  "instances": [
    [6.8, 2.8, 4.8, 1.4],
    [6.0, 3.4, 4.5, 1.6]
  ]
}
EOF
```

Send an inference request from your local terminal to access the service. Replace `${INGRESS_DOMAIN}` with the gateway address you obtained earlier in this step.

```shell
curl -H "Content-Type: application/json" \
     -H "Host: sklearn-iris-predictor.default.example.com" \
     "http://${INGRESS_DOMAIN}/v1/models/sklearn-iris:predict" \
     -d @./iris-input.json
```

The output indicates that the model predicted both input samples belong to class `1` (Iris Versicolour).

```json
{"predictions":[1,1]}
```
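If your terminal cannot reach the gateway, you can also exercise the service from inside the cluster. The snippet below is a sketch: the in-cluster hostname follows Knative's default `<service>.<namespace>.svc.cluster.local` pattern and is assumed here; check the URL reported by `kubectl get ksvc` for the exact value.

```shell
# Assumption: the predictor's cluster-local hostname; verify with `kubectl get ksvc`.
kubectl run curl-test --rm -it --image=curlimages/curl --restart=Never --command -- \
  curl -s -H "Content-Type: application/json" \
  "http://sklearn-iris-predictor-default.default.svc.cluster.local/v1/models/sklearn-iris:predict" \
  -d '{"instances": [[6.8, 2.8, 4.8, 1.4]]}'
```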
Billing
The KServe and Knative components themselves do not incur additional charges. However, you will be billed for the underlying resources you use, including computing resources such as Elastic Compute Service (ECS) instances and Elastic Container Instances, and network resources such as Application Load Balancer (ALB) and Classic Load Balancer (CLB) instances. For details, see Cloud resource fees.
FAQ
Why is my InferenceService stuck in a Not Ready state?
To debug an InferenceService that fails to become ready, first inspect its events, then check the status of its associated pods, and review container logs.
Follow these steps:
1. Run `kubectl describe inferenceservice <your-service-name>` and check the events for any error messages. Replace `<your-service-name>` with your actual service name.
2. Run `kubectl get pods` to see if any pods associated with the service are in an `Error` or `CrashLoopBackOff` state. Pods for an InferenceService are typically prefixed with the service name.
3. If a pod is in an error state, check its logs with `kubectl logs <pod-name> -c kserve-container` to diagnose the failure. This can reveal issues such as a model failing to download due to network problems or an incorrect model file format. Additional Knative-level checks are sketched after this list.
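Because an InferenceService is backed by a Knative Service, it can also help to inspect the Knative resources and recent events directly when the InferenceService events are not conclusive:

```shell
# Inspect the Knative Service and revisions behind the predictor.
kubectl get ksvc
kubectl get revisions

# Show recent events in the namespace, most recent last.
kubectl get events --sort-by=.lastTimestamp | tail -n 20
```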
How do I deploy my own custom-trained model?
Upload your model file to an accessible Object Storage Service (OSS) bucket.
Configure your InferenceService manifest to point to the model (a minimal example is sketched after this list):

- Set the `spec.predictor.model.storageUri` field to the URI of your model file in the OSS bucket.
- Set the `modelFormat` field based on your model's framework, such as `tensorflow`, `pytorch`, or `onnx`.
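The manifest below is a minimal sketch rather than a drop-in configuration: the service name, model format, and `storageUri` are placeholders, and it assumes the OSS bucket is exposed to the cluster through a PVC (hence the `pvc://` URI). Adjust the URI scheme to match how your bucket is mounted.

```yaml
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "my-custom-model"        # placeholder name
spec:
  predictor:
    model:
      modelFormat:
        name: tensorflow         # or pytorch, onnx, sklearn, ...
      # Placeholder URI: assumes a PVC backed by the OSS bucket.
      storageUri: "pvc://my-oss-pvc/models/my-custom-model"
```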
How do I configure GPU resources for my model?
If your model requires a GPU for inference, you can request GPU resources by adding a resources section under spec.predictor.model in your InferenceService YAML manifest, as shown in the following example.
For more information on using GPUs with Knative, see Use GPU resources.
```yaml
spec:
  predictor:
    model:
      resources:
        requests:
          nvidia.com/gpu: "1"
        limits:
          nvidia.com/gpu: "1"
```

References
Best practices for deploying AI inference services in Knative.
ACK Knative provides an application template for Stable Diffusion. For details, see Deploy a Stable Diffusion Service in a production environment based on Knative.