
Container Service for Kubernetes:Deploy LLM services and implement smart routing using an AI inference gateway in Knative

Last Updated: Mar 26, 2026

Running LLM inference on Kubernetes requires managing Layer 7 routing, load balancing across GPU pods, and scaling GPU resources based on request concurrency — all without dedicated infrastructure tooling. The ACK Gateway with Inference Extension solves this by integrating the Kubernetes Gateway API and the Inference Extension specification with the Knative serverless architecture. You get intelligent traffic scheduling and concurrency-based autoscaling without building custom routing infrastructure.

How it works

Gateway with Inference Extension introduces two Custom Resource Definitions (CRDs) for AI inference scenarios:

  • InferencePool: Groups pods that share the same compute configuration, accelerator type, foundation model, and model server. An InferencePool can span multiple nodes for high availability.

  • InferenceObjective: Defines the model an InferencePool serves and the criticality level of that workload. Pods marked as Critical receive priority processing.

Adding the AI gateway annotation to a Knative Service triggers automatic integration with these CRDs, enabling intelligent traffic scheduling without additional configuration.
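For orientation, the generated resources look roughly like the following. This is an illustrative sketch only: the annotation creates them for you, so you do not write these manifests yourself, and the exact API version and field names follow the upstream Gateway API Inference Extension specification and may differ in your component version.

```yaml
# Sketch of what the gateway manages on your behalf (do not apply manually).
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-pool
  namespace: default
spec:
  selector:
    release: qwen            # Pods sharing the same model and accelerator setup
  targetPortNumber: 8080     # Port the model server listens on
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceObjective
metadata:
  name: qwen-objective
  namespace: default
spec:
  modelName: qwen            # Model the pool serves
  criticality: Critical      # Critical workloads receive priority processing
  poolRef:
    name: qwen-pool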

Prerequisites

Before you begin, ensure that you have:

Cluster requirements:

  • A Container Service for Kubernetes (ACK) cluster that contains GPU-accelerated nodes, with Knative and the ACK Gateway with Inference Extension component installed.

Other resources:

  • An Object Storage Service (OSS) bucket for storing the model data, and an Alibaba Cloud AccessKey pair that can access it.

Set up your environment variables

Declare the following variables before you start. Later steps reference these variables directly so you can copy commands without modification.

export BUCKET_NAME="<your-bucket-name>"         # Your OSS bucket name
export ACCESS_KEY_ID="<your-access-key-id>"     # Your Alibaba Cloud AccessKey ID
export ACCESS_KEY_SECRET="<your-access-key-secret>" # Your Alibaba Cloud AccessKey Secret

Replace the placeholder values with your actual values. All subsequent commands use these variables.
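To avoid running later commands with a placeholder that was never replaced, you can add a small guard. The `check_env` helper below is not part of this guide's tooling, just an illustrative sketch:

```shell
# check_env: fail fast if any required variable is empty or still a "<placeholder>".
check_env() {
  for v in BUCKET_NAME ACCESS_KEY_ID ACCESS_KEY_SECRET; do
    val=$(eval echo "\$${v}")
    case "$val" in
      ""|"<"*) echo "ERROR: $v is not set to a real value" >&2; return 1 ;;
    esac
  done
  echo "All variables set"
}
```

Run `check_env` after the exports; it prints `All variables set` when every variable holds a real value.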

Step 1: Enable Gateway API support in Knative

Configure Knative to use the Gateway API as its ingress controller.

  1. Edit the config-network ConfigMap.

    kubectl edit configmap config-network -n knative-serving
  2. In the data field, set ingress.class to the following value, then save.

    apiVersion: v1
    data:
      ...
      ingress.class: gateway-api.ingress.networking.knative.dev
      ...
    kind: ConfigMap
    metadata:
      name: config-network
      namespace: knative-serving
      ...

  3. Verify the change took effect.

    kubectl get configmap config-network -n knative-serving -o yaml | grep "ingress.class"

    Expected output:

    ingress.class: gateway-api.ingress.networking.knative.dev
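If you prefer a non-interactive change (for example, in a setup script), the same edit can be applied as a merge patch. A sketch, with the `kubectl` line commented out because it requires access to your cluster:

```shell
# Merge patch equivalent to the manual edit above.
PATCH='{"data":{"ingress.class":"gateway-api.ingress.networking.knative.dev"}}'
# kubectl patch configmap config-network -n knative-serving --type merge -p "$PATCH"
```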

Step 2: Create an inference gateway

Create a Gateway resource that listens for external requests on port 8888.

  1. Create a file named knative-gateway.yaml.

    kind: Gateway
    apiVersion: gateway.networking.k8s.io/v1
    metadata:
      name: knative-gateway
      namespace: knative-serving
    spec:
      gatewayClassName: ack-gateway
      listeners:
      - name: default
        port: 80
        protocol: HTTP
        allowedRoutes:
          namespaces:
            from: All
      - name: llm-gw
        protocol: HTTP
        port: 8888        # Listening port for the inference service
        allowedRoutes:
          namespaces:
            from: All
  2. Deploy the Gateway.

    kubectl apply -f knative-gateway.yaml
  3. Verify the Gateway is ready.

    kubectl get gateway knative-gateway -n knative-serving

    The PROGRAMMED field must be True and the ADDRESS field must show an assigned IP address.

    NAME              CLASS         ADDRESS        PROGRAMMED   AGE
    knative-gateway   ack-gateway   47.XX.XX.198   True         22s

Step 3: Prepare model data and configure storage

Mount model data from OSS using a static PersistentVolume (PV) to avoid downloading the model each time a container starts.

Download the model and upload it to OSS

This guide uses the Qwen1.5-4B-Chat model as an example. You can use a temporary Elastic Compute Service (ECS) instance to prepare the model data, then release it after the upload completes.

  1. Purchase an ECS instance, then download the model.

    # Install Git LFS
    sudo yum install -y git git-lfs
    git lfs install
    
    # Clone the repository without downloading large files first
    GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/qwen/Qwen1.5-4B-Chat.git
    
    # Download the large model files
    cd Qwen1.5-4B-Chat
    git lfs pull
  2. Upload the model to your OSS bucket using ossutil.

    For ossutil installation instructions, see Install ossutil.

    # Create the target directory in OSS
    ossutil mkdir oss://${BUCKET_NAME}/models/Qwen1.5-4B-Chat
    
    # Upload all model files recursively
    ossutil cp -r ./ oss://${BUCKET_NAME}/models/Qwen1.5-4B-Chat

Configure a PersistentVolume and PersistentVolumeClaim

Create an OSS static PersistentVolumeClaim (PVC) for faster model loading. For background, see Use a static ossfs 1.0 persistent volume.

  1. Create a Secret with your OSS credentials.

    kubectl create secret generic oss-secret \
      --from-literal=akId="${ACCESS_KEY_ID}" \
      --from-literal=akSecret="${ACCESS_KEY_SECRET}" \
      --namespace default
  2. Create a file named oss-storage.yaml. Replace <your-bucket-name> with your actual bucket name.

    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: llm-model
      labels:
        alicloud-pvname: llm-model
    spec:
      capacity:
        storage: 30Gi
      accessModes:
        - ReadWriteMany
      persistentVolumeReclaimPolicy: Retain
      storageClassName: oss
      csi:
        driver: ossplugin.csi.alibabacloud.com
        volumeHandle: llm-model
        nodePublishSecretRef:
          name: oss-secret
          namespace: default
        volumeAttributes:
          bucket: "<your-bucket-name>"
          url: "http://oss-cn-hangzhou-internal.aliyuncs.com"   # Endpoint for the bucket region
          path: "/models/Qwen1.5-4B-Chat"                       # Path to the model in OSS
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: llm-model
      namespace: default
    spec:
      accessModes:
        - ReadWriteMany
      storageClassName: oss
      resources:
        requests:
          storage: 30Gi
      selector:
        matchLabels:
          alicloud-pvname: llm-model    # Binds to the PV above by label
  3. Deploy the PV and PVC.

    kubectl apply -f oss-storage.yaml

Step 4: Deploy the Knative inference service

Create a Knative Service that enables the AI inference gateway and runs vLLM as the inference engine.

  1. Create a file named qwen-service.yaml. Key annotations and what they do:

    • knative.aliyun.com/ai-gateway: inference. Enables AI inference gateway integration.
    • knative.aliyun.com/ai-gateway-inference-priority: "1". Sets the routing priority for this service.
    • autoscaling.knative.dev/metric: "concurrency". Scales based on the number of concurrent requests.
    • autoscaling.knative.dev/target: "2". Target number of concurrent requests per pod.
    • autoscaling.knative.dev/max-scale: "3". Maximum number of running pods.
    • autoscaling.knative.dev/min-scale: "1". Minimum number of pods to keep running. Set this to at least 1 because LLM containers take a long time to start, and cold starts cause request timeouts.

    apiVersion: serving.knative.dev/v1
    kind: Service
    metadata:
      name: qwen
      namespace: default
      annotations:
        knative.aliyun.com/ai-gateway: inference
        knative.aliyun.com/ai-gateway-inference-priority: "1"
      labels:
        release: qwen
    spec:
      template:
        metadata:
          annotations:
            autoscaling.knative.dev/metric: "concurrency"
            autoscaling.knative.dev/target: "2"
            autoscaling.knative.dev/max-scale: "3"
            autoscaling.knative.dev/min-scale: "1"
          labels:
            release: qwen
        spec:
          containers:
          - name: vllm-container
            image: ac2-registry.cn-hangzhou.cr.aliyuncs.com/ac2/vllm:0.4.1-ubuntu22.04
            command:
            - sh
            - -c
            - python3 -m vllm.entrypoints.openai.api_server --port 8080 --trust-remote-code --model /models/Qwen1.5-4B-Chat/ --gpu-memory-utilization 0.95 --max-model-len 8192 --dtype half
            ports:
            - containerPort: 8080
            readinessProbe:
              tcpSocket:
                port: 8080
              initialDelaySeconds: 15
              periodSeconds: 5
            resources:
              limits:
                cpu: "32"
                memory: 64Gi
                nvidia.com/gpu: "1"
              requests:
                cpu: "8"
                memory: 32Gi
                nvidia.com/gpu: "1"
            volumeMounts:
            - mountPath: /models/Qwen1.5-4B-Chat   # Must match the --model path in the start command
              name: llm-model
          volumes:
          - name: llm-model
            persistentVolumeClaim:
              claimName: llm-model
  2. Deploy the service.

    kubectl apply -f qwen-service.yaml
  3. Wait for the service to be ready. LLM containers can take several minutes to start.

    kubectl get ksvc qwen -n default

    The service is ready when READY shows True:

    NAME   URL                                        LATESTCREATED   LATESTREADY   READY   REASON
    qwen   http://qwen.default.example.com            qwen-00001      qwen-00001    True
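The autoscaling annotations above size the revision by concurrency. As a rough illustration of the sizing rule under these settings (the real autoscaler also averages load over time windows, so this is a sketch, not its actual implementation):

```shell
# desired_pods: illustrative concurrency-based sizing with target=2,
# min-scale=1, max-scale=3, matching the annotations above.
desired_pods() {
  concurrent=$1
  target=2; min_scale=1; max_scale=3
  n=$(( (concurrent + target - 1) / target ))   # ceil(concurrent / target)
  [ "$n" -gt "$max_scale" ] && n=$max_scale
  [ "$n" -lt "$min_scale" ] && n=$min_scale
  echo "$n"
}
```

For example, 10 concurrent requests would call for ceil(10/2) = 5 pods, capped at max-scale, so 3 pods; with no traffic, min-scale keeps 1 pod running.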

Step 5: Validate the inference service

After the service is ready, send a test request through the gateway.

  1. Get the gateway IP address.

    export GATEWAY_HOST=$(kubectl -n knative-serving get gateway/knative-gateway -o jsonpath='{.status.addresses[0].value}')
    echo "Gateway address: $GATEWAY_HOST"
  2. Send a test request in OpenAI-compatible format.

    curl http://${GATEWAY_HOST}:8888/v1/chat/completions \
      -H "Host: qwen.default.example.com" \
      -H "Content-Type: application/json" \
      -d '{
        "model": "/models/Qwen1.5-4B-Chat/",
        "messages": [
          {"role": "user", "content": "Describe Kubernetes in one sentence."}
        ],
        "max_tokens": 50
      }'

    A successful response returns JSON with a choices field. The content value inside choices contains the model's reply.
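To print only the model's reply instead of the full JSON, you can pipe the curl output through a small helper. This sketch uses python3's standard library rather than assuming a tool such as jq is installed:

```shell
# extract_reply: print the assistant message from an OpenAI-compatible
# chat-completions response read on stdin.
extract_reply() {
  python3 -c 'import json,sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
}
```

Usage: append `| extract_reply` to the curl command above.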

Billing

The Knative component itself has no extra fees. The underlying cloud resources are billed separately:

  • GPU instances: Billed per instance; GPU specifications cost significantly more than general-purpose instances. Use node auto-scaling so that idle GPU nodes are released and costs stay under control.

  • OSS: Billed for storage and requests. Public network access also incurs outbound traffic fees.

  • Server Load Balancer (SLB): The Internet-facing SLB instance attached to the gateway incurs traffic fees.

For a complete breakdown, see Cloud product resource fees.

What's next

You can extend this setup to support more advanced AI service patterns in Knative.