Built on the Kubernetes Gateway API and the Inference Extension specification, the Gateway with Inference Extension component works with the Knative Serverless architecture to simplify managing generative AI inference services. It provides efficient Layer-7 routing and load balancing across multiple inference service workloads and enables GPU resource autoscaling based on request concurrency.
How it works
Gateway with Inference Extension extends the Gateway API for AI inference scenarios with the following CustomResourceDefinitions (CRDs).
InferencePool: Logically groups resources for AI model services. It represents a set of Pods that share the same compute configuration, accelerator type, base model, and model server. An InferencePool can span multiple nodes to provide high availability.
InferenceObjective: Defines the objectives for a model service, specifying the model name served by Pods in an InferencePool and its criticality level. Workloads marked as Critical receive higher processing priority.
In Knative, enabling the AI gateway annotation allows a Knative Service to automatically use these CRDs for intelligent traffic scheduling.
Prerequisites
You have created an ACK Managed Pro cluster that meets the following requirements:
Knative is deployed. For more information, see Deploy and manage Knative components.
The Gateway API component is installed.
The Gateway with Inference Extension component of version v1.4.0-apsara.4 or later is installed, and you selected Enable Gateway API Inference Extension during installation.
The cluster contains GPU nodes, each with at least 32 GiB of memory (this topic uses Qwen1.5-4B as an example). A node label is required on the nodes to specify the driver version: set the key to ack.aliyun.com/nvidia-driver-version and the value to 550.144.03. We recommend a GPU node driver version of 550.144.03 or later. For more information, see Customize the GPU driver version of a node by specifying a version number.
You have created an OSS Bucket.
We recommend choosing the same region as your cluster to avoid cross-region data transfer charges and reduce latency.
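The driver recommendation above comes down to a dotted-version comparison. The following Python sketch shows one way to check a label value against 550.144.03. The function names are hypothetical, and the sketch assumes the label value is a plain dotted numeric version.

```python
MIN_DRIVER_VERSION = "550.144.03"  # recommended minimum from this topic


def parse_version(version: str) -> tuple:
    """Split a dotted numeric version such as '550.144.03' into a tuple of integers."""
    return tuple(int(part) for part in version.strip().split("."))


def driver_meets_minimum(label_value: str, minimum: str = MIN_DRIVER_VERSION) -> bool:
    """Return True if the driver version is at least the recommended minimum."""
    # Tuples compare element by element, so 560.35.03 ranks above 550.144.03.
    return parse_version(label_value) >= parse_version(minimum)
```

Comparing integer tuples rather than raw strings avoids lexicographic mistakes such as ranking "99.0.0" above "550.144.03".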
Step 1: Enable Gateway API support in Knative
Modify the Knative network configuration to specify the Gateway API as the Ingress controller.
Edit the config-network ConfigMap.

    kubectl edit configmap config-network -n knative-serving

In the data field, modify ingress.class and then save your changes.

    apiVersion: v1
    data:
      ...
      # Modify ingress.class to use the Gateway API as the Ingress controller.
      ingress.class: gateway-api.ingress.networking.knative.dev
      ...
    kind: ConfigMap
    metadata:
      name: config-network
      namespace: knative-serving
      ...

Verify that the change has taken effect.

    kubectl get configmap config-network -n knative-serving -o yaml | grep "ingress.class"

Expected output:

    ingress.class: gateway-api.ingress.networking.knative.dev
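If you want to script the verification above instead of reading the grep output, a minimal Python sketch could look like the following. It assumes you pass in the output of `kubectl get configmap config-network -n knative-serving -o json`; the function name `uses_gateway_api` is hypothetical.

```python
import json

EXPECTED_INGRESS_CLASS = "gateway-api.ingress.networking.knative.dev"


def uses_gateway_api(configmap_json: str) -> bool:
    """Return True if the ConfigMap's data sets ingress.class to the Gateway API class."""
    configmap = json.loads(configmap_json)
    # A ConfigMap with no entries may omit "data" or set it to null.
    data = configmap.get("data") or {}
    return data.get("ingress.class") == EXPECTED_INGRESS_CLASS
```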
Step 2: Create an inference gateway resource
Create a Gateway resource to listen for external requests. This example configures the gateway to listen on port 8888.
Create the gateway configuration file knative-gateway.yaml.

    kind: Gateway
    apiVersion: gateway.networking.k8s.io/v1
    metadata:
      name: knative-gateway
      namespace: knative-serving
    spec:
      gatewayClassName: ack-gateway
      listeners:
      - name: default
        port: 80
        protocol: HTTP
        allowedRoutes:
          namespaces:
            from: All
      - name: llm-gw
        protocol: HTTP
        # The port that the inference service listens on.
        port: 8888
        allowedRoutes:
          namespaces:
            from: All

Deploy the gateway resource.

    kubectl apply -f knative-gateway.yaml

Check the gateway status.

    kubectl get gateway knative-gateway -n knative-serving

In the output, ensure that PROGRAMMED is True and that an IP address is assigned in the ADDRESS field.

    NAME              CLASS         ADDRESS        PROGRAMMED   AGE
    knative-gateway   ack-gateway   47.XX.XX.198   True         22s
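The status check above can also be automated. The sketch below assumes the input is the output of `kubectl get gateway knative-gateway -n knative-serving -o json` and reads the standard Gateway API status fields (`status.conditions` and `status.addresses`). The function name `gateway_ready` is hypothetical.

```python
import json
from typing import Optional, Tuple


def gateway_ready(gateway_json: str) -> Tuple[bool, Optional[str]]:
    """Return (ready, address): ready is True when Programmed=True and an address is assigned."""
    gateway = json.loads(gateway_json)
    status = gateway.get("status") or {}
    # The PROGRAMMED column corresponds to the condition of type "Programmed".
    programmed = any(
        cond.get("type") == "Programmed" and cond.get("status") == "True"
        for cond in status.get("conditions") or []
    )
    addresses = status.get("addresses") or []
    address = addresses[0].get("value") if addresses else None
    return programmed and address is not None, address
```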
Step 3: Prepare model data and configure storage
To avoid re-downloading the model every time a container starts, we recommend using an OSS static volume to store and mount the model data.
1. Download the model and upload it to OSS
This step uses the Qwen1.5-4B-Chat model as an example. You can temporarily purchase an ECS instance to prepare the model data and release it after you are finished.
Download the model to a local directory.
    # Install Git LFS
    sudo yum install -y git git-lfs
    git lfs install
    # Clone the model repository (skip smudge to speed up)
    GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/qwen/Qwen1.5-4B-Chat.git
    # Download the actual large files
    cd Qwen1.5-4B-Chat
    git lfs pull

Use ossutil to upload the model to your OSS Bucket.

Replace <Bucket-Name> with your actual OSS Bucket name. To install ossutil, see Install ossutil.

    # Create a directory.
    ossutil mkdir oss://<Bucket-Name>/models/Qwen1.5-4B-Chat
    # Upload files recursively (-r indicates recursive upload).
    ossutil cp -r ./ oss://<Bucket-Name>/models/Qwen1.5-4B-Chat
2. Configure a PV and PVC
To improve model loading performance, this example creates an OSS static volume. For detailed steps, see Use an ossfs 1.0 static volume.
Create an OSS access credential (Secret).

Replace <AccessKey-ID> and <AccessKey-Secret> with your actual information.

    kubectl create secret generic oss-secret \
      --from-literal=akId='<AccessKey-ID>' \
      --from-literal=akSecret='<AccessKey-Secret>' \
      --namespace default

Create the oss-storage.yaml file.

    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: llm-model
      labels:
        alicloud-pvname: llm-model
    spec:
      capacity:
        storage: 30Gi
      # Access mode
      accessModes:
      - ReadWriteMany
      persistentVolumeReclaimPolicy: Retain
      storageClassName: oss
      csi:
        driver: ossplugin.csi.alibabacloud.com
        volumeHandle: llm-model
        # Get AccessKey information from the Secret object.
        nodePublishSecretRef:
          name: oss-secret
          namespace: default
        volumeAttributes:
          # Replace with your actual OSS Bucket name.
          bucket: "<Your-Bucket-Name>"
          # The internal endpoint for the bucket's region.
          url: "http://oss-cn-hangzhou-internal.aliyuncs.com"
          # The relative path in OSS.
          path: "/models/Qwen1.5-4B-Chat"
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: llm-model
      namespace: default
    spec:
      accessModes:
      - ReadWriteMany
      storageClassName: oss
      resources:
        requests:
          # Requested storage size, which cannot exceed the total volume size.
          storage: 30Gi
      selector:
        matchLabels:
          # Select the PV by using this label.
          alicloud-pvname: llm-model

Deploy the PV and PVC.

    kubectl apply -f oss-storage.yaml
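The three volumeAttributes values in the PV depend on your bucket, its region, and the model path. As an illustration only, the sketch below derives them in Python. It assumes the internal endpoint follows the same `oss-<region>-internal.aliyuncs.com` pattern as the cn-hangzhou example in this topic; confirm the endpoint for your region before using it. The function name `oss_volume_attributes` is hypothetical.

```python
def oss_volume_attributes(bucket: str, region: str, path: str) -> dict:
    """Build the volumeAttributes mapping used in the PersistentVolume above."""
    if not path.startswith("/"):
        # The example uses a path relative to the bucket root with a leading slash.
        path = "/" + path
    return {
        "bucket": bucket,
        # Assumed pattern, modeled on the cn-hangzhou internal endpoint in this topic.
        "url": f"http://oss-{region}-internal.aliyuncs.com",
        "path": path,
    }
```

For example, `oss_volume_attributes("my-bucket", "cn-hangzhou", "models/Qwen1.5-4B-Chat")` yields the same url and path values as the YAML in this step, with `my-bucket` standing in for your bucket name.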
Step 4: Deploy the Knative inference service
Create a Knative Service, enable the AI gateway feature, and configure the vLLM engine for inference.
Create the service configuration file qwen-service.yaml.

Key configurations:

knative.aliyun.com/ai-gateway: inference: Enables the inference gateway extension.
autoscaling.knative.dev/metric: "concurrency": Autoscales based on the number of concurrent requests.

    apiVersion: serving.knative.dev/v1
    kind: Service
    metadata:
      name: qwen
      namespace: default
      annotations:
        # Enable the AI inference gateway.
        knative.aliyun.com/ai-gateway: inference
        knative.aliyun.com/ai-gateway-inference-priority: "1"
      labels:
        release: qwen
    spec:
      template:
        metadata:
          annotations:
            # Autoscaling metric: concurrency.
            autoscaling.knative.dev/metric: "concurrency"
            # Target concurrency.
            autoscaling.knative.dev/target: "2"
            # Maximum number of instances.
            autoscaling.knative.dev/max-scale: "3"
            # Minimum number of instances. For large models, a minimum of 1 is recommended to keep an instance warm and avoid request timeouts on cold starts.
            autoscaling.knative.dev/min-scale: "1"
          labels:
            release: qwen
        spec:
          containers:
          - name: vllm-container
            image: ac2-registry.cn-hangzhou.cr.aliyuncs.com/ac2/vllm:0.4.1-ubuntu22.04
            command:
            - sh
            - -c
            - python3 -m vllm.entrypoints.openai.api_server --port 8080 --trust-remote-code --model /models/Qwen1.5-4B-Chat/ --gpu-memory-utilization 0.95 --max-model-len 8192 --dtype half
            ports:
            - containerPort: 8080
            readinessProbe:
              tcpSocket:
                port: 8080
              initialDelaySeconds: 15
              periodSeconds: 5
            resources:
              limits:
                cpu: "32"
                memory: 64Gi
                # Request GPU resources.
                nvidia.com/gpu: "1"
              requests:
                cpu: "8"
                memory: 32Gi
                nvidia.com/gpu: "1"
            volumeMounts:
            # The mount path must match the model parameter in the startup command.
            - mountPath: /models/Qwen1.5-4B-Chat
              name: llm-model
          volumes:
          - name: llm-model
            persistentVolumeClaim:
              claimName: llm-model

Deploy the service.

    kubectl apply -f qwen-service.yaml

Check the deployment progress (wait for Ready to be True).

    kubectl get ksvc qwen -n default
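To get a feel for how the autoscaling annotations in this step interact, the following Python sketch estimates the instance count for a given number of in-flight requests. It is a deliberate simplification: the actual Knative autoscaler averages concurrency over a time window and has additional behavior (such as panic mode) that this sketch ignores. The function name `desired_replicas` is hypothetical, and the defaults mirror the target, min-scale, and max-scale values used above.

```python
import math


def desired_replicas(
    concurrent_requests: int,
    target: int = 2,      # autoscaling.knative.dev/target
    min_scale: int = 1,   # autoscaling.knative.dev/min-scale
    max_scale: int = 3,   # autoscaling.knative.dev/max-scale
) -> int:
    """Estimate the instance count: ceil(concurrency / target), clamped to [min_scale, max_scale]."""
    wanted = math.ceil(concurrent_requests / target)
    return max(min_scale, min(max_scale, wanted))
```

With these settings, three concurrent requests would call for two instances, and anything above six is capped at three instances, so additional requests queue rather than trigger further GPU scale-out.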
Step 5: Verify the inference service
After the service is deployed, use the gateway IP address to access the inference API.
Get the gateway IP address.

    export GATEWAY_HOST=$(kubectl -n knative-serving get gateway/knative-gateway -o jsonpath='{.status.addresses[0].value}')
    echo "Gateway IP address: $GATEWAY_HOST"

Send a test request.

This step simulates an OpenAI-formatted chat request.

    curl http://${GATEWAY_HOST}:8888/v1/chat/completions \
      -H "Host: qwen.default.example.com" \
      -H "Content-Type: application/json" \
      -d '{
        "model": "/models/Qwen1.5-4B-Chat/",
        "messages": [
          {"role": "user", "content": "Explain Kubernetes in one sentence."}
        ],
        "max_tokens": 50
      }'

The terminal should return JSON data that contains the choices field, where content contains the model's response.
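The same test request can be sent from application code. The sketch below builds an equivalent request with the Python standard library and extracts the reply text, assuming the response follows the OpenAI chat completion layout (`choices[0].message.content`). The function names `build_chat_request` and `extract_reply` are hypothetical; the Host header, port, and model path are taken from this topic.

```python
import json
import urllib.request


def build_chat_request(gateway_host: str, prompt: str, port: int = 8888) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request routed through the gateway."""
    payload = {
        "model": "/models/Qwen1.5-4B-Chat/",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 50,
    }
    return urllib.request.Request(
        url=f"http://{gateway_host}:{port}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        # The Host header selects the Knative Service behind the gateway IP address.
        headers={"Host": "qwen.default.example.com", "Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body: str) -> str:
    """Return the assistant's text from a chat completion response body."""
    return json.loads(response_body)["choices"][0]["message"]["content"]
```

To send the request, pass the result to `urllib.request.urlopen(request, timeout=60)` and feed the decoded response body to `extract_reply`.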
Billing
The Knative component itself does not incur additional charges. However, you will be billed for the cloud resources that your services use, such as compute, networking, and storage.
GPU instances: GPU instances are expensive. To control costs, we recommend using them with node scaling.
OSS: Charges include OSS storage and request fees. If public access is involved, you also incur egress traffic charges.
Classic Load Balancer (CLB): The public-facing load balancer instance bound to the gateway incurs traffic fees.
For more information, see Cloud product resource fees.
Related documents
Knative also supports deploying other services, such as A2A and MCP Server. This allows you to apply Serverless benefits like on-demand scaling and event-driven patterns to other advanced AI services.