Running LLM inference on Kubernetes means managing Layer 7 routing, load balancing across GPU pods, and scaling GPU resources based on request concurrency, all of which is hard to get right without dedicated infrastructure tooling. The ACK Gateway with Inference Extension solves this by integrating the Kubernetes Gateway API and the Inference Extension specification with the Knative serverless architecture: you get intelligent traffic scheduling and concurrency-based autoscaling without building custom routing infrastructure.
How it works
Gateway with Inference Extension introduces two Custom Resource Definitions (CRDs) for AI inference scenarios:
- InferencePool: Groups pods that share the same compute configuration, accelerator type, foundation model, and model server. An InferencePool can span multiple nodes for high availability.
- InferenceObjective: Defines the model an InferencePool serves and the criticality level of that workload. Pods marked as `Critical` receive priority processing.
Adding the AI gateway annotation to a Knative Service triggers automatic integration with these CRDs, enabling intelligent traffic scheduling without additional configuration.
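You normally do not create these resources by hand in this guide, because the annotation described above wires them up automatically. For orientation only, the two CRDs can look roughly like the following. This is a hedged sketch: the apiVersion and field names (`selector`, `targetPortNumber`, `poolRef`, `criticality`) follow the upstream Gateway API Inference Extension and may differ in the version that ACK ships.

```yaml
# Sketch only: apiVersion and field names may vary by component release.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-pool
spec:
  selector:
    app: qwen             # Pods grouped into this pool
  targetPortNumber: 8080  # Port the model server listens on
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceObjective
metadata:
  name: qwen-objective
spec:
  poolRef:
    name: qwen-pool       # The pool that serves this workload
  criticality: Critical   # Critical workloads receive priority processing
```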
Prerequisites
Before you begin, ensure that you have:
Cluster requirements:
- An ACK Pro managed cluster with Knative deployed. See Deploy and manage Knative components.
- The Gateway API component installed.
- The Gateway with Inference Extension component, version v1.4.0-apsara.4 or later, with Enable Gateway API Inference Extension selected during installation.
- GPU nodes with at least 32 GiB of memory per node. Label each GPU node using Node Label in the console: set Key to `ack.aliyun.com/nvidia-driver-version` and Value to `550.144.03`. Use GPU driver version 550.144.03 or later. For details, see Customize the GPU driver version of a node by specifying a version number.

Other resources:

- An Object Storage Service (OSS) bucket in the same region as your cluster, so you can use the bucket's internal endpoint, which avoids public network traffic fees and reduces latency.
Set up your environment variables
Declare the following variables before you start. Later steps reference these variables directly so you can copy commands without modification.
```shell
export BUCKET_NAME="<your-bucket-name>"             # Your OSS bucket name
export ACCESS_KEY_ID="<your-access-key-id>"         # Your Alibaba Cloud AccessKey ID
export ACCESS_KEY_SECRET="<your-access-key-secret>" # Your Alibaba Cloud AccessKey Secret
```
Replace the placeholder values with your actual values. All subsequent commands use these variables.
Step 1: Enable Gateway API support in Knative
Configure Knative to use the Gateway API as its ingress controller.
- Edit the `config-network` ConfigMap.

  ```shell
  kubectl edit configmap config-network -n knative-serving
  ```

- In the `data` field, set `ingress.class` to the following value, then save.

  ```yaml
  apiVersion: v1
  data:
    ...
    ingress.class: gateway-api.ingress.networking.knative.dev
    ...
  kind: ConfigMap
  metadata:
    name: config-network
    namespace: knative-serving
    ...
  ```

- Verify the change took effect.

  ```shell
  kubectl get configmap config-network -n knative-serving -o yaml | grep "ingress.class"
  ```

  Expected output:

  ```
  ingress.class: gateway-api.ingress.networking.knative.dev
  ```
Step 2: Create an inference gateway
Create a Gateway resource that listens for external requests on port 8888.
- Create a file named `knative-gateway.yaml`.

  ```yaml
  kind: Gateway
  apiVersion: gateway.networking.k8s.io/v1
  metadata:
    name: knative-gateway
    namespace: knative-serving
  spec:
    gatewayClassName: ack-gateway
    listeners:
    - name: default
      port: 80
      protocol: HTTP
      allowedRoutes:
        namespaces:
          from: All
    - name: llm-gw
      protocol: HTTP
      port: 8888 # Listening port for the inference service
      allowedRoutes:
        namespaces:
          from: All
  ```

- Deploy the Gateway.

  ```shell
  kubectl apply -f knative-gateway.yaml
  ```

- Verify the Gateway is ready.

  ```shell
  kubectl get gateway knative-gateway -n knative-serving
  ```

  The `PROGRAMMED` field must be `True` and the `ADDRESS` field must show an assigned IP address.

  ```
  NAME              CLASS         ADDRESS        PROGRAMMED   AGE
  knative-gateway   ack-gateway   47.XX.XX.198   True         22s
  ```
Step 3: Prepare model data and configure storage
Mount model data from OSS using a static PersistentVolume (PV) to avoid downloading the model each time a container starts.
Download the model and upload it to OSS
This guide uses the Qwen1.5-4B-Chat model as an example. You can use a temporary Elastic Compute Service (ECS) instance to prepare the model data, then release it after the upload completes.
- Purchase an ECS instance, then download the model.

  ```shell
  # Install Git LFS
  sudo yum install -y git git-lfs
  git lfs install

  # Clone the repository without downloading large files first
  GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/qwen/Qwen1.5-4B-Chat.git

  # Download the large model files
  cd Qwen1.5-4B-Chat
  git lfs pull
  ```

- Upload the model to your OSS bucket using ossutil. For ossutil installation instructions, see Install ossutil.

  ```shell
  # Create the target directory in OSS
  ossutil mkdir oss://${BUCKET_NAME}/models/Qwen1.5-4B-Chat

  # Upload all model files recursively
  ossutil cp -r ./ oss://${BUCKET_NAME}/models/Qwen1.5-4B-Chat
  ```
Configure a PersistentVolume and PersistentVolumeClaim
Create an OSS static PersistentVolumeClaim (PVC) for faster model loading. For background, see Use a static ossfs 1.0 persistent volume.
- Create a Secret with your OSS credentials.

  ```shell
  kubectl create secret generic oss-secret \
    --from-literal=akId="${ACCESS_KEY_ID}" \
    --from-literal=akSecret="${ACCESS_KEY_SECRET}" \
    --namespace default
  ```

- Create a file named `oss-storage.yaml`. Replace `<your-bucket-name>` with your actual bucket name.

  ```yaml
  apiVersion: v1
  kind: PersistentVolume
  metadata:
    name: llm-model
    labels:
      alicloud-pvname: llm-model
  spec:
    capacity:
      storage: 30Gi
    accessModes:
    - ReadWriteMany
    persistentVolumeReclaimPolicy: Retain
    storageClassName: oss
    csi:
      driver: ossplugin.csi.alibabacloud.com
      volumeHandle: llm-model
      nodePublishSecretRef:
        name: oss-secret
        namespace: default
      volumeAttributes:
        bucket: "<your-bucket-name>"
        url: "http://oss-cn-hangzhou-internal.aliyuncs.com" # Endpoint for the bucket region
        path: "/models/Qwen1.5-4B-Chat" # Path to the model in OSS
  ---
  apiVersion: v1
  kind: PersistentVolumeClaim
  metadata:
    name: llm-model
    namespace: default
  spec:
    accessModes:
    - ReadWriteMany
    storageClassName: oss
    resources:
      requests:
        storage: 30Gi
    selector:
      matchLabels:
        alicloud-pvname: llm-model # Binds to the PV above by label
  ```

- Deploy the PV and PVC.

  ```shell
  kubectl apply -f oss-storage.yaml
  ```
Step 4: Deploy the Knative inference service
Create a Knative Service that enables the AI inference gateway and runs vLLM as the inference engine.
- Create a file named `qwen-service.yaml`. Key annotations and what they do:

  | Annotation | Value | Description |
  | --- | --- | --- |
  | `knative.aliyun.com/ai-gateway` | `inference` | Enables AI inference gateway integration |
  | `knative.aliyun.com/ai-gateway-inference-priority` | `"1"` | Sets the routing priority for this service |
  | `autoscaling.knative.dev/metric` | `"concurrency"` | Scales based on concurrent request count |
  | `autoscaling.knative.dev/target` | `"2"` | Target number of concurrent requests per pod |
  | `autoscaling.knative.dev/max-scale` | `"3"` | Maximum number of running pods |
  | `autoscaling.knative.dev/min-scale` | `"1"` | Minimum pods to keep running. Set to at least 1 because LLM containers take a long time to start, and cold starts cause request timeouts |

  ```yaml
  apiVersion: serving.knative.dev/v1
  kind: Service
  metadata:
    name: qwen
    namespace: default
    annotations:
      knative.aliyun.com/ai-gateway: inference
      knative.aliyun.com/ai-gateway-inference-priority: "1"
    labels:
      release: qwen
  spec:
    template:
      metadata:
        annotations:
          autoscaling.knative.dev/metric: "concurrency"
          autoscaling.knative.dev/target: "2"
          autoscaling.knative.dev/max-scale: "3"
          autoscaling.knative.dev/min-scale: "1"
        labels:
          release: qwen
      spec:
        containers:
        - name: vllm-container
          image: ac2-registry.cn-hangzhou.cr.aliyuncs.com/ac2/vllm:0.4.1-ubuntu22.04
          command:
          - sh
          - -c
          - python3 -m vllm.entrypoints.openai.api_server --port 8080 --trust-remote-code --model /models/Qwen1.5-4B-Chat/ --gpu-memory-utilization 0.95 --max-model-len 8192 --dtype half
          ports:
          - containerPort: 8080
          readinessProbe:
            tcpSocket:
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 5
          resources:
            limits:
              cpu: "32"
              memory: 64Gi
              nvidia.com/gpu: "1"
            requests:
              cpu: "8"
              memory: 32Gi
              nvidia.com/gpu: "1"
          volumeMounts:
          - mountPath: /models/Qwen1.5-4B-Chat # Must match the --model path in the start command
            name: llm-model
        volumes:
        - name: llm-model
          persistentVolumeClaim:
            claimName: llm-model
  ```

- Deploy the service.

  ```shell
  kubectl apply -f qwen-service.yaml
  ```

- Wait for the service to be ready. LLM containers can take several minutes to start.

  ```shell
  kubectl get ksvc qwen -n default
  ```

  The service is ready when `READY` shows `True`:

  ```
  NAME   URL                               LATESTCREATED   LATESTREADY   READY   REASON
  qwen   http://qwen.default.example.com   qwen-00001      qwen-00001    True
  ```
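The autoscaling annotations above boil down to simple arithmetic. The sketch below is an illustrative simplification, not the actual Knative Pod Autoscaler (which averages concurrency over sliding time windows); it shows how the target, min-scale, and max-scale values interact:

```python
import math

def desired_pods(concurrent_requests: int, target: int = 2,
                 min_scale: int = 1, max_scale: int = 3) -> int:
    """Simplified model of a concurrency-based scaling decision."""
    # Enough pods so that each handles at most `target` concurrent requests...
    raw = math.ceil(concurrent_requests / target)
    # ...clamped to the configured min-scale/max-scale bounds.
    return max(min_scale, min(max_scale, raw))

print(desired_pods(0))   # 1 (min-scale keeps one warm pod, avoiding cold starts)
print(desired_pods(5))   # 3 (ceil(5/2) = 3)
print(desired_pods(20))  # 3 (capped at max-scale)
```

With min-scale at 1 the service never scales to zero, which matters here because a vLLM container must load multi-gigabyte model weights before it can serve requests.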
Step 5: Validate the inference service
After the service is ready, send a test request through the gateway.
- Get the gateway IP address.

  ```shell
  export GATEWAY_HOST=$(kubectl -n knative-serving get gateway/knative-gateway -o jsonpath='{.status.addresses[0].value}')
  echo "Gateway address: $GATEWAY_HOST"
  ```

- Send a test request in OpenAI-compatible format.

  ```shell
  curl http://${GATEWAY_HOST}:8888/v1/chat/completions \
    -H "Host: qwen.default.example.com" \
    -H "Content-Type: application/json" \
    -d '{
      "model": "/models/Qwen1.5-4B-Chat/",
      "messages": [
        {"role": "user", "content": "Describe Kubernetes in one sentence."}
      ],
      "max_tokens": 50
    }'
  ```

  A successful response returns JSON with a `choices` field. The `content` value inside `choices` contains the model's reply.
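If you script the validation, the reply can be pulled out of the response body as shown below. The sample is a hand-written, abridged body in the OpenAI-compatible shape that vLLM returns; real responses carry additional fields such as `id` and `usage`.

```python
import json

# Abridged, hand-written example of an OpenAI-compatible response body
sample_body = '''
{
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Kubernetes is a platform for orchestrating containers."
      },
      "finish_reason": "stop"
    }
  ]
}
'''

response = json.loads(sample_body)
# The model's reply lives at choices[0].message.content
reply = response["choices"][0]["message"]["content"]
print(reply)  # -> Kubernetes is a platform for orchestrating containers.
```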
Billing
The Knative component itself has no extra fees. The underlying cloud resources are billed separately:
- GPU instances: GPU instances are expensive. Use node auto-scaling to keep costs under control.
- OSS: Billed for storage and requests. Public network access also incurs outbound traffic fees.
- Server Load Balancer (SLB): The Internet-facing SLB instance attached to the gateway incurs traffic fees.
For a complete breakdown, see Cloud product resource fees.
What's next
You can extend this setup to support more advanced AI service patterns in Knative:
- Deploy A2A in Knative: agent-to-agent communication with on-demand scaling.
- Deploy MCP Server in Knative: event-driven execution for AI services.