Deploy LLM services and implement intelligent routing in Knative - Container Service for Kubernetes

Built on the Kubernetes Gateway API and the Inference Extension specification, the Gateway with Inference Extension component works with the Knative Serverless architecture to simplify managing generative AI inference services. It provides efficient Layer-7 routing and load balancing across multiple inference service workloads and enables GPU resource autoscaling based on request concurrency.

How it works

Gateway with Inference Extension extends the Gateway API for AI inference scenarios with the following CustomResourceDefinitions (CRDs).

InferencePool: Logically groups resources for AI model services. It represents a set of Pods that share the same compute configuration, accelerator type, base model, and model server. An InferencePool can span multiple nodes to provide high availability.
InferenceObjective: Defines the objectives for a model service, specifying the model name served by Pods in an InferencePool and its criticality level. Workloads marked as Critical receive higher processing priority.

In Knative, adding the AI gateway annotation (knative.aliyun.com/ai-gateway: inference) to a Knative Service makes the Service use these CRDs automatically for intelligent traffic scheduling.

Prerequisites

You have created an ACK Managed Pro cluster that meets the following requirements:
- Knative is deployed. For more information, see Deploy and manage Knative components.
- The Gateway API component is installed.
- The Gateway with Inference Extension component of version v1.4.0-apsara.4 or later is installed, and you selected Enable Gateway API Inference Extension during installation.
- The cluster contains GPU nodes, each with at least 32 GiB of memory (this topic uses Qwen1.5-4B-Chat as an example). The nodes must carry a node label that specifies the driver version: set the key to ack.aliyun.com/nvidia-driver-version and the value to 550.144.03.
  We recommend a GPU node driver version of 550.144.03 or later. For more information, see Customize the GPU driver version of a node by specifying a version number.
You have created an OSS Bucket.
We recommend choosing the same region as your cluster to avoid cross-region data transfer charges and reduce latency.

Step 1: Enable Gateway API support in Knative

Modify the Knative network configuration to specify the Gateway API as the Ingress controller.

Edit the config-network ConfigMap.

kubectl edit configmap config-network -n knative-serving

In the data field, modify ingress.class and then save your changes.

apiVersion: v1
data:
  ...
  # Modify ingress.class to use the Gateway API as the Ingress controller.
  ingress.class: gateway-api.ingress.networking.knative.dev 
  ...
kind: ConfigMap
metadata:
  name: config-network
  namespace: knative-serving
  ...

Verify that the change has taken effect.

kubectl get configmap config-network -n knative-serving -o yaml | grep "ingress.class"

Expected output:

  ingress.class: gateway-api.ingress.networking.knative.dev
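
For scripted setups, the same change can be applied without an interactive editor by using kubectl patch with a merge patch. The following Python sketch only builds that command. The helper name build_patch_command is illustrative and not part of Knative or ACK.

```python
import json
import shlex

# Value taken from the ConfigMap example above.
INGRESS_CLASS = "gateway-api.ingress.networking.knative.dev"


def build_patch_command(ingress_class: str = INGRESS_CLASS) -> list:
    """Return the argv for a merge patch that sets ingress.class in config-network."""
    patch = {"data": {"ingress.class": ingress_class}}
    return [
        "kubectl", "patch", "configmap", "config-network",
        "-n", "knative-serving",
        "--type", "merge",
        "-p", json.dumps(patch),
    ]


if __name__ == "__main__":
    # Print a shell-safe command line; run it yourself or pass the list to subprocess.run.
    print(shlex.join(build_patch_command()))
```

A merge patch touches only the ingress.class key and leaves the other entries in the data field unchanged.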

Step 2: Create an inference gateway resource

Create a Gateway resource to listen for external requests. This example defines a default HTTP listener on port 80 and a listener named llm-gw on port 8888, which receives the inference requests.

Create the gateway configuration file knative-gateway.yaml.

kind: Gateway
apiVersion: gateway.networking.k8s.io/v1
metadata:
  name: knative-gateway
  namespace: knative-serving
spec:
  gatewayClassName: ack-gateway
  listeners:
  - name: default
    port: 80
    protocol: HTTP
    allowedRoutes:
      namespaces:
        from: All
  - name: llm-gw
    protocol: HTTP
    # The gateway port that receives inference requests.
    port: 8888  
    allowedRoutes:
      namespaces:
        from: All

Deploy the gateway resource.
```
kubectl apply -f knative-gateway.yaml
```

Check the gateway status.

kubectl get gateway knative-gateway -n knative-serving

In the output, ensure that PROGRAMMED is True and that an IP address is assigned in the ADDRESS field.

NAME              CLASS         ADDRESS        PROGRAMMED   AGE
knative-gateway   ack-gateway   47.XX.XX.198   True         22s
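
In automation, the same check can be made against the JSON form of the Gateway. The sketch below assumes the object was exported with kubectl get gateway knative-gateway -n knative-serving -o json and reads the standard Gateway API status fields (the Programmed condition and status.addresses). The function name gateway_ready is illustrative.

```python
import json


def gateway_ready(gateway: dict) -> tuple:
    """Return (programmed, addresses) for a Gateway object in kubectl -o json form."""
    status = gateway.get("status", {})
    # The gateway is usable once the Programmed condition reports "True".
    programmed = any(
        c.get("type") == "Programmed" and c.get("status") == "True"
        for c in status.get("conditions", [])
    )
    # Collect every assigned address value (IP address or hostname).
    addresses = [a["value"] for a in status.get("addresses", []) if a.get("value")]
    return programmed, addresses


# Usage (requires cluster access):
#   kubectl get gateway knative-gateway -n knative-serving -o json > gateway.json
#   with open("gateway.json") as fh:
#       ok, addrs = gateway_ready(json.load(fh))
```

The gateway is ready for the next steps when the first value is True and the address list is not empty.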

Step 3: Prepare model data and configure storage

To avoid re-downloading the model every time a container starts, we recommend using an OSS static volume to store and mount the model data.

1. Download the model and upload it to OSS

This step uses the Qwen1.5-4B-Chat model as an example. You can temporarily purchase an ECS instance to prepare the model data and release it after you are finished.

Download the model to a local directory.

# Install Git LFS
sudo yum install -y git git-lfs
git lfs install
# Clone the model repository without downloading LFS objects (GIT_LFS_SKIP_SMUDGE=1) to speed up the clone
GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/qwen/Qwen1.5-4B-Chat.git
# Download the actual large files
cd Qwen1.5-4B-Chat
git lfs pull
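
Because the clone skips the smudge step, any file that git lfs pull failed to fetch remains a small pointer stub, and uploading such a stub to OSS would leave the model unusable. The following sketch, which is not part of the original procedure, lists files that still begin with the Git LFS pointer header. The function name find_lfs_pointers is illustrative.

```python
from pathlib import Path

# Every Git LFS pointer file starts with this line.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def find_lfs_pointers(repo_dir: str) -> list:
    """Return relative paths of files that are still Git LFS pointer stubs."""
    root = Path(repo_dir)
    stubs = []
    for path in sorted(root.rglob("*")):
        # Skip directories and everything under the .git metadata directory.
        if not path.is_file() or ".git" in path.relative_to(root).parts:
            continue
        with path.open("rb") as fh:
            # Only the first bytes are read, so large weight files are cheap to check.
            if fh.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX:
                stubs.append(str(path.relative_to(root)))
    return stubs


# Usage: an empty list means all large files were downloaded.
#   print(find_lfs_pointers("Qwen1.5-4B-Chat"))
```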

Use ossutil to upload the model to your OSS Bucket.

Replace <Bucket-Name> with your actual OSS Bucket name.

To install ossutil, see Install ossutil.

# Create a directory.
ossutil mkdir oss://<Bucket-Name>/models/Qwen1.5-4B-Chat
# Upload files recursively (-r indicates recursive upload).
ossutil cp -r ./ oss://<Bucket-Name>/models/Qwen1.5-4B-Chat

2. Configure a PV and PVC

To improve model loading performance, this example creates an OSS static volume. For detailed steps, see Use an ossfs 1.0 static volume.

Create an OSS access credential (Secret).

Replace <AccessKey-ID> and <AccessKey-Secret> with your actual information.

kubectl create secret generic oss-secret \
  --from-literal=akId='<AccessKey-ID>' \
  --from-literal=akSecret='<AccessKey-Secret>' \
  --namespace default

Create the oss-storage.yaml file.

apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-model
  labels:
    alicloud-pvname: llm-model
spec:
  capacity:
    storage: 30Gi
  # Access mode
  accessModes:
    - ReadWriteMany            
  persistentVolumeReclaimPolicy: Retain
  storageClassName: oss
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: llm-model
    # Get AccessKey information from the Secret object.
    nodePublishSecretRef:
      name: oss-secret
      namespace: default
    volumeAttributes:
      # Replace with your actual OSS Bucket name.
      bucket: "<Bucket-Name>"
      # The internal endpoint for the bucket's region. This example uses cn-hangzhou; replace it if your bucket is in another region.
      url: "http://oss-cn-hangzhou-internal.aliyuncs.com"
      # The relative path in OSS.
      path: "/models/Qwen1.5-4B-Chat"     
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-model
  namespace: default
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: oss
  resources:
    requests:
      # Requested storage size, which cannot exceed the total volume size.
      storage: 30Gi
  selector:
    matchLabels:
      # Select the PV by using this label.
      alicloud-pvname: llm-model

Deploy the PV and PVC.
```
kubectl apply -f oss-storage.yaml
```
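
A statically provisioned PVC binds to its PV only when the storage class matches, the PV offers the requested access modes, the PV capacity covers the request, and the PVC selector matches the PV labels. The sketch below checks these conditions on the two objects as Python dictionaries (for example, loaded from kubectl get ... -o json). The helper names are illustrative, and parse_quantity handles only the Ki, Mi, Gi, and Ti suffixes, which is a subset of the Kubernetes quantity syntax.

```python
_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_quantity(quantity: str) -> int:
    """Convert a quantity such as '30Gi' to bytes (binary suffixes only)."""
    for suffix, factor in _UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


def binding_problems(pv: dict, pvc: dict) -> list:
    """Return the reasons why the PVC could not bind to the PV (empty if none)."""
    problems = []
    pv_spec, pvc_spec = pv["spec"], pvc["spec"]
    if pv_spec.get("storageClassName") != pvc_spec.get("storageClassName"):
        problems.append("storageClassName differs")
    if not set(pvc_spec["accessModes"]) <= set(pv_spec["accessModes"]):
        problems.append("PV does not offer the requested access modes")
    requested = parse_quantity(pvc_spec["resources"]["requests"]["storage"])
    if requested > parse_quantity(pv_spec["capacity"]["storage"]):
        problems.append("requested storage exceeds PV capacity")
    wanted = pvc_spec.get("selector", {}).get("matchLabels", {})
    labels = pv["metadata"].get("labels", {})
    if any(labels.get(key) != value for key, value in wanted.items()):
        problems.append("selector labels do not match the PV")
    return problems
```

For the manifests in this step, the function returns an empty list. If the PVC stays in the Pending state after deployment, these four conditions are a reasonable first thing to check.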

Step 4: Deploy the Knative inference service

Create a Knative Service, enable the AI gateway feature, and configure the vLLM engine for inference.

Create the service configuration file qwen-service.yaml.

Key configurations:

- knative.aliyun.com/ai-gateway set to inference: enables the inference gateway extension.
- autoscaling.knative.dev/metric set to "concurrency": autoscales based on the number of concurrent requests.

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: qwen
  namespace: default
  annotations:
    # Enable the AI inference gateway.
    knative.aliyun.com/ai-gateway: inference          
    knative.aliyun.com/ai-gateway-inference-priority: "1"
  labels:
    release: qwen
spec:
  template:
    metadata:
      annotations:
        # Autoscaling metric: concurrency.
        autoscaling.knative.dev/metric: "concurrency" 
        # Target concurrency.
        autoscaling.knative.dev/target: "2"           
        # Maximum number of instances.
        autoscaling.knative.dev/max-scale: "3"        
        # Minimum number of instances. For large models, a minimum of 1 is recommended to keep an instance warm and avoid request timeouts on cold starts.
        autoscaling.knative.dev/min-scale: "1"        
      labels:
        release: qwen
    spec:
      containers:
      - name: vllm-container
        image: ac2-registry.cn-hangzhou.cr.aliyuncs.com/ac2/vllm:0.4.1-ubuntu22.04
        command:
        - sh
        - -c
        - python3 -m vllm.entrypoints.openai.api_server --port 8080 --trust-remote-code --model /models/Qwen1.5-4B-Chat/ --gpu-memory-utilization 0.95 --max-model-len 8192 --dtype half
        ports:
        - containerPort: 8080
        readinessProbe:
          tcpSocket:
            port: 8080
          initialDelaySeconds: 15
          periodSeconds: 5
        resources:
          limits:
            cpu: "32"
            memory: 64Gi
            # Request GPU resources.
            nvidia.com/gpu: "1" 
          requests:
            cpu: "8"
            memory: 32Gi
            nvidia.com/gpu: "1"
        volumeMounts:
        # The mount path must match the --model argument in the startup command.
        - mountPath: /models/Qwen1.5-4B-Chat 
          name: llm-model
      volumes:
      - name: llm-model
        persistentVolumeClaim:
          claimName: llm-model
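
To see how the autoscaling annotations interact, the following sketch models the steady-state replica count for concurrency-based scaling. It is a simplified approximation, not the Knative autoscaler itself: it ignores the averaging windows and panic mode, and the utilization parameter stands in for the target utilization percentage that Knative applies to the target (the actual value depends on your cluster's autoscaler configuration, so it is an assumption here).

```python
import math


def desired_replicas(
    concurrent_requests: float,
    target: float = 2.0,       # autoscaling.knative.dev/target
    min_scale: int = 1,        # autoscaling.knative.dev/min-scale
    max_scale: int = 3,        # autoscaling.knative.dev/max-scale
    utilization: float = 1.0,  # assumed fraction of the target that is actually aimed for
) -> int:
    """Approximate the steady-state Pod count for a given total concurrency."""
    effective_target = target * utilization
    # Enough Pods so that each one handles at most effective_target requests.
    wanted = math.ceil(concurrent_requests / effective_target)
    # Clamp to the configured bounds.
    return max(min_scale, min(max_scale, wanted))


# With the annotations above: 4 in-flight requests need 2 Pods, 5 need 3,
# and any higher load stays capped at max-scale (3).
```

Because each Pod in this example requests one GPU, the max-scale value also bounds the number of GPUs that the service can occupy.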

Deploy the service.
```
kubectl apply -f qwen-service.yaml
```
Check the deployment progress and wait until READY is True.
```
kubectl get ksvc qwen -n default
```
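
The hostname that the Knative Service is published under depends on the domain configured in your cluster, and requests sent through the gateway must carry it in the Host header. The sketch below reads both the Ready condition and the hostname from the JSON form of the Service (kubectl get ksvc qwen -n default -o json). The function name ksvc_ready_and_host is illustrative.

```python
from typing import Optional, Tuple
from urllib.parse import urlparse


def ksvc_ready_and_host(ksvc: dict) -> Tuple[bool, Optional[str]]:
    """Return (ready, hostname) for a Knative Service in kubectl -o json form."""
    status = ksvc.get("status", {})
    ready = any(
        c.get("type") == "Ready" and c.get("status") == "True"
        for c in status.get("conditions", [])
    )
    # status.url looks like http://qwen.default.example.com; keep only the host part.
    url = status.get("url")
    host = urlparse(url).hostname if url else None
    return ready, host
```

If the returned hostname differs from qwen.default.example.com, use the returned value as the Host header when you test the service.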

Step 5: Verify the inference service

After the service is deployed, use the gateway IP address to access the inference API.

Get the gateway IP address.

export GATEWAY_HOST=$(kubectl -n knative-serving get gateway/knative-gateway -o jsonpath='{.status.addresses[0].value}')
echo "Gateway IP address: $GATEWAY_HOST"

Send a test request.

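This request uses the OpenAI-compatible chat completions format. The Host header identifies the Knative Service to route to; adjust it if your cluster uses a different domain.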

curl http://${GATEWAY_HOST}:8888/v1/chat/completions \
  -H "Host: qwen.default.example.com" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/models/Qwen1.5-4B-Chat/",
    "messages": [
      {"role": "user", "content": "Explain Kubernetes in one sentence."}
    ],
    "max_tokens": 50
  }'

The response is a JSON object whose choices field contains the model's reply in message.content.
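
The same request can be sent from Python with only the standard library. This is a minimal sketch under the assumptions of this topic (port 8888, the Host header qwen.default.example.com, and the model path used in the startup command). The function names are illustrative.

```python
import json
import urllib.request


def build_chat_request(
    gateway_host: str,
    prompt: str = "Explain Kubernetes in one sentence.",
    host_header: str = "qwen.default.example.com",
    port: int = 8888,
    model: str = "/models/Qwen1.5-4B-Chat/",
    max_tokens: int = 50,
) -> urllib.request.Request:
    """Build the chat completions request that is routed through the gateway."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"http://{gateway_host}:{port}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        # The Host header selects the Knative Service behind the gateway.
        headers={"Host": host_header, "Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_json: dict) -> str:
    """Return the assistant text from a chat completions response."""
    return response_json["choices"][0]["message"]["content"]


# Usage (requires network access to the gateway):
#   req = build_chat_request("<gateway IP address>")
#   with urllib.request.urlopen(req, timeout=60) as resp:
#       print(extract_reply(json.load(resp)))
```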

Billing

The Knative component itself does not incur additional charges. However, you will be billed for the cloud resources that your services use, such as compute, networking, and storage.

GPU instances: GPU instances are expensive. To control costs, we recommend using them with node scaling.
OSS: Charges include OSS storage and request fees. If public access is involved, you also incur egress traffic charges.
Classic Load Balancer (CLB): The public-facing load balancer instance bound to the gateway incurs traffic fees.

For more information, see Cloud product resource fees.
