Use KServe on ACK Serverless Knative to deploy AI models as serverless inference services with automatic scaling, multi-version management, and phased releases.
Prerequisites
Before you begin, ensure that you have:
- The KServe component deployed. For more information, see Deploy the KServe component.
Step 1: Deploy the InferenceService
Because the model is deployed as a KServe InferenceService rather than a plain Kubernetes Service, you only need to provide a storageUri that points to the model artifact. KServe handles auto-scaling, traffic management, and versioning for you.
The following example deploys a scikit-learn model trained on the iris dataset. The model accepts four-feature inputs (sepal length, sepal width, petal length, petal width) and predicts one of three iris classes: Iris Setosa (index 0), Iris Versicolour (index 1), or Iris Virginica (index 2).
The iris dataset contains 50 samples per class, each with four measurements.
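The model artifact referenced by the storageUri in the following manifest is a prebuilt example published for KServe. To illustrate what such an artifact contains, the sketch below shows how a comparable iris classifier could be trained and serialized with scikit-learn and joblib. The file name model.joblib and the LogisticRegression estimator are illustrative assumptions, not the exact recipe used to build the published model.

```python
# Sketch: training and serializing an iris classifier comparable to the example model.
# Assumes scikit-learn and joblib are installed; the estimator and file name are
# illustrative, not the exact artifact hosted at the storageUri below.
from joblib import dump
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()  # 150 samples, 4 features, 3 classes
model = LogisticRegression(max_iter=200).fit(iris.data, iris.target)

# KServe's sklearn runtime loads a serialized model file (typically model.joblib)
# from the directory referenced by storageUri.
dump(model, "model.joblib")
```

The serialized file is then uploaded to object storage (for example, OSS or GCS) and referenced by the storageUri of the InferenceService.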
- Create an InferenceService named sklearn-iris.

```shell
kubectl apply -f - <<EOF
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "sklearn-iris"
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"
EOF
```

- Verify that the service is ready.

```shell
kubectl get inferenceservices sklearn-iris
```

The expected output is similar to:

```
NAME           URL                                                          READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION                    AGE
sklearn-iris   http://sklearn-iris-predictor-default.default.example.com   True            100                              sklearn-iris-predictor-default-00001   51s
```

The service is ready when the READY column shows True.
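Instead of rerunning kubectl get, you can poll readiness programmatically. The following sketch uses the Kubernetes Python client and assumes a reachable kubeconfig, the default namespace, and the sklearn-iris name used above; it is an illustrative example, not part of the KServe deployment itself.

```python
# Sketch: polling InferenceService readiness with the Kubernetes Python client.
# Assumes a local kubeconfig and that the InferenceService lives in the default namespace.
import time
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

for _ in range(30):
    isvc = api.get_namespaced_custom_object(
        group="serving.kserve.io", version="v1beta1",
        namespace="default", plural="inferenceservices", name="sklearn-iris")
    conditions = isvc.get("status", {}).get("conditions", [])
    # The service is usable once the Ready condition reports True.
    if any(c.get("type") == "Ready" and c.get("status") == "True" for c in conditions):
        print("InferenceService is ready:", isvc["status"].get("url"))
        break
    time.sleep(10)
```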
Step 2: Send an inference request
The access method depends on your ingress gateway. Select the section that matches your setup.
Application Load Balancer (ALB)
- Get the ALB endpoint.

```shell
kubectl get albconfig knative-internet
```

The expected output is similar to:

```
NAME               ALBID                   DNSNAME                                             PORT&PROTOCOL   CERTID   AGE
knative-internet   alb-hvd8nngl0l*******   alb-hvd8nngl0l******.cn-<region>.alb.aliyuncs.com                            2
```

- Create the input file for the inference request.

```shell
cat <<EOF > "./iris-input.json"
{
  "instances": [
    [6.8, 2.8, 4.8, 1.4],
    [6.0, 3.4, 4.5, 1.6]
  ]
}
EOF
```

- Send the inference request.

```shell
INGRESS_DOMAIN=$(kubectl get albconfig knative-internet -o jsonpath='{.status.loadBalancer.dnsname}')
SERVICE_HOSTNAME=$(kubectl get inferenceservice sklearn-iris -o jsonpath='{.status.url}' | cut -d "/" -f 3)
curl -v -H "Host: ${SERVICE_HOSTNAME}" "http://${INGRESS_DOMAIN}/v1/models/sklearn-iris:predict" -d @./iris-input.json
```

The expected output is similar to:

```
*   Trying 120.77.XX.XX...
* TCP_NODELAY set
* Connected to alb-hvd8nngl0l******.cn-<region>.alb.aliyuncs.com (120.77.XX.XX) port 80 (#0)
> POST /v1/models/sklearn-iris:predict HTTP/1.1
> Host: sklearn-iris-predictor-default.default.example.com
> User-Agent: curl/7.58.0
> Accept: */*
> Content-Length: 76
> Content-Type: application/x-www-form-urlencoded
>
* upload completely sent off: 76 out of 76 bytes
< HTTP/1.1 200 OK
< Date: Thu, 13 Jul 2023 01:48:44 GMT
< Content-Type: application/json
< Content-Length: 21
< Connection: keep-alive
<
* Connection #0 to host alb-hvd8nngl0l******.cn-<region>.alb.aliyuncs.com left intact
{"predictions":[1,1]}
```

Both data points are predicted as index 1, which corresponds to Iris Versicolour.
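The same request can be issued from application code rather than curl. The following Python sketch uses the requests library and assumes the ALB domain and service hostname resolved in the previous step; with Kourier, you would substitute the gateway's external IP for the domain. The endpoint values shown are placeholders taken from the example output above.

```python
# Sketch: sending the same prediction request from Python with the requests library.
# ingress and host mirror the INGRESS_DOMAIN and SERVICE_HOSTNAME values resolved above;
# replace them with the values from your own cluster.
import requests

ingress = "alb-hvd8nngl0l******.cn-<region>.alb.aliyuncs.com"  # ALB DNS name (or Kourier external IP)
host = "sklearn-iris-predictor-default.default.example.com"    # InferenceService hostname

payload = {"instances": [[6.8, 2.8, 4.8, 1.4], [6.0, 3.4, 4.5, 1.6]]}

# The Host header is what the ingress gateway uses to route the request
# to the correct Knative service behind the shared endpoint.
resp = requests.post(
    f"http://{ingress}/v1/models/sklearn-iris:predict",
    json=payload,
    headers={"Host": host},
)
print(resp.json())  # e.g. {"predictions": [1, 1]}
```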
Kourier
- Get the Kourier service endpoint.

```shell
kubectl -n knative-serving get svc kourier
```

The expected output is similar to:

```
NAME      TYPE           CLUSTER-IP      EXTERNAL-IP    PORT(S)                      AGE
kourier   LoadBalancer   192.168.XX.XX   121.40.XX.XX   80:31158/TCP,443:32491/TCP   49m
```

The external IP (121.40.XX.XX) is the access address. The service listens on ports 80 (HTTP) and 443 (HTTPS).

- Create the input file for the inference request.

```shell
cat <<EOF > "./iris-input.json"
{
  "instances": [
    [6.8, 2.8, 4.8, 1.4],
    [6.0, 3.4, 4.5, 1.6]
  ]
}
EOF
```

- Send the inference request.

```shell
INGRESS_HOST=$(kubectl -n knative-serving get service kourier -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
SERVICE_HOSTNAME=$(kubectl get inferenceservice sklearn-iris -o jsonpath='{.status.url}' | cut -d "/" -f 3)
curl -v -H "Host: ${SERVICE_HOSTNAME}" "http://${INGRESS_HOST}/v1/models/sklearn-iris:predict" -d @./iris-input.json
```

The expected output is similar to:

```
*   Trying 121.40.XX.XX...
* TCP_NODELAY set
* Connected to 121.40.XX.XX (121.40.XX.XX) port 80 (#0)
> POST /v1/models/sklearn-iris:predict HTTP/1.1
> Host: sklearn-iris-predictor-default.default.example.com
> User-Agent: curl/7.58.0
> Accept: */*
> Content-Length: 76
> Content-Type: application/x-www-form-urlencoded
>
* upload completely sent off: 76 out of 76 bytes
< HTTP/1.1 200 OK
< content-length: 21
< content-type: application/json
< date: Wed, 12 Jul 2023 08:23:13 GMT
< server: envoy
< x-envoy-upstream-service-time: 4
<
* Connection #0 to host 121.40.XX.XX left intact
{"predictions":[1,1]}
```

Both data points are predicted as index 1, which corresponds to Iris Versicolour.
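If you consume the response in code, the returned indices can be mapped back to class names using the index-to-class mapping listed at the beginning of this topic. A minimal sketch:

```python
# Sketch: converting returned class indices into readable labels,
# based on the iris class mapping described earlier in this topic.
class_names = {0: "Iris Setosa", 1: "Iris Versicolour", 2: "Iris Virginica"}

response = {"predictions": [1, 1]}  # response body from the previous step
labels = [class_names[i] for i in response["predictions"]]
print(labels)  # ['Iris Versicolour', 'Iris Versicolour']
```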