Container Service for Kubernetes: Deploy LLM services and implement intelligent routing in Knative

Last Updated: Jun 25, 2026

Built on the Kubernetes Gateway API and the Inference Extension specification, the Gateway with Inference Extension component works with the Knative Serverless architecture to simplify managing generative AI inference services. It provides efficient Layer-7 routing and load balancing across multiple inference service workloads and enables GPU resource autoscaling based on request concurrency.

How it works

Gateway with Inference Extension extends the Gateway API for AI inference scenarios with the following CustomResourceDefinitions (CRDs).

  • InferencePool: Logically groups resources for AI model services. It represents a set of Pods that share the same compute configuration, accelerator type, base model, and model server. An InferencePool can span multiple nodes to provide high availability.

  • InferenceObjective: Defines the objectives for a model service, specifying the model name served by Pods in an InferencePool and its criticality level. Workloads marked as Critical receive higher processing priority.

In Knative, adding the knative.aliyun.com/ai-gateway: inference annotation to a Knative Service lets the Service use these CRDs automatically for intelligent traffic scheduling.
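
To confirm that the inference CRDs are present in a cluster, a quick check along the following lines may help. This is a minimal sketch: list_inference_crds is a hypothetical helper name, and the filter is a loose substring match on "inference" because the exact CRD group names depend on the installed component version.

```shell
# Hypothetical helper: list CRDs whose names mention "inference".
# The substring match is an assumption; adjust it to the CRD names
# that your component version actually installs.
list_inference_crds() {
  kubectl get crd -o name | grep -i 'inference'
}
```

Run list_inference_crds against the cluster. An empty result suggests that the Gateway with Inference Extension component is not installed.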

Prerequisites

  • You have created an ACK Managed Pro cluster that meets the following requirements:

  • You have created an OSS Bucket.

    We recommend choosing the same region as your cluster to avoid cross-region data transfer charges and reduce latency.

Step 1: Enable Gateway API support in Knative

Modify the Knative network configuration to specify the Gateway API as the Ingress controller.

  1. Edit the config-network ConfigMap.

    kubectl edit configmap config-network -n knative-serving
  2. In the data field, modify ingress.class and then save your changes.

    apiVersion: v1
    data:
      ...
      # Modify ingress.class to use the Gateway API as the Ingress controller.
      ingress.class: gateway-api.ingress.networking.knative.dev 
      ...
    kind: ConfigMap
    metadata:
      name: config-network
      namespace: knative-serving
      ...
  3. Verify that the change has taken effect.

    kubectl get configmap config-network -n knative-serving -o yaml | grep "ingress.class"

    Expected output:

      ingress.class: gateway-api.ingress.networking.knative.dev
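
For scripted setups, the same change can be made without opening an editor. The sketch below wraps kubectl patch and a JSONPath read-back in two helper functions. The function names are hypothetical, and the ConfigMap name, namespace, and value are the ones used in this step.

```shell
# Hypothetical helper: set ingress.class with a JSON merge patch
# instead of an interactive edit.
set_gateway_api_ingress() {
  kubectl patch configmap config-network -n knative-serving --type merge \
    -p '{"data":{"ingress.class":"gateway-api.ingress.networking.knative.dev"}}'
}

# Hypothetical helper: print the current value of ingress.class.
# The dot in the key is escaped so that JSONPath reads it as a single key.
get_ingress_class() {
  kubectl get configmap config-network -n knative-serving \
    -o jsonpath='{.data.ingress\.class}'
}
```

Call set_gateway_api_ingress once, then compare the output of get_ingress_class with gateway-api.ingress.networking.knative.dev.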

Step 2: Create an inference gateway resource

Create a Gateway resource to receive external requests. This example defines two HTTP listeners: a default listener on port 80 and an inference listener, llm-gw, on port 8888.

  1. Create the gateway configuration file knative-gateway.yaml.

    kind: Gateway
    apiVersion: gateway.networking.k8s.io/v1
    metadata:
      name: knative-gateway
      namespace: knative-serving
    spec:
      gatewayClassName: ack-gateway
      listeners:
      - name: default
        port: 80
        protocol: HTTP
        allowedRoutes:
          namespaces:
            from: All
      - name: llm-gw
        protocol: HTTP
        # The port on which the gateway listens for inference requests.
        port: 8888  
        allowedRoutes:
          namespaces:
            from: All
  2. Deploy the gateway resource.

    kubectl apply -f knative-gateway.yaml
  3. Check the gateway status.

    kubectl get gateway knative-gateway -n knative-serving

    In the output, ensure that PROGRAMMED is True and that an IP address is assigned in the ADDRESS field.

    NAME              CLASS         ADDRESS        PROGRAMMED   AGE
    knative-gateway   ack-gateway   47.XX.XX.198   True         22s
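
The readiness check can also be scripted. The following sketch parses the table output shown above. gateway_is_ready is a hypothetical helper, and it assumes the default column order (NAME, CLASS, ADDRESS, PROGRAMMED, AGE).

```shell
# Hypothetical helper: read `kubectl get gateway` table output on stdin and
# succeed only if the named gateway has an address and PROGRAMMED is True.
# If ADDRESS is empty, the columns shift left and the check fails.
gateway_is_ready() {
  awk -v name="$1" '
    NR > 1 && $1 == name { found = 1; ok = ($3 != "" && $4 == "True") }
    END { exit (found && ok) ? 0 : 1 }
  '
}
```

For example, pipe kubectl get gateway knative-gateway -n knative-serving into gateway_is_ready knative-gateway. As an alternative, kubectl wait --for=condition=Programmed gateway/knative-gateway -n knative-serving --timeout=120s blocks until the condition is met.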

Step 3: Prepare model data and configure storage

To avoid re-downloading the model every time a container starts, we recommend using an OSS static volume to store and mount the model data.

1. Download the model and upload it to OSS

This step uses the Qwen1.5-4B-Chat model as an example. You can temporarily purchase an ECS instance to prepare the model data and release it after you are finished.

  1. Download the model to a local directory.

    # Install Git LFS
    sudo yum install -y git git-lfs
    git lfs install
    # Clone the repository without the large files (GIT_LFS_SKIP_SMUDGE=1 defers the LFS downloads).
    GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/qwen/Qwen1.5-4B-Chat.git
    # Download the actual large files
    cd Qwen1.5-4B-Chat
    git lfs pull
  2. Use ossutil to upload the model to your OSS Bucket.

    Replace <Bucket-Name> with the name of your OSS Bucket. To install ossutil, see Install ossutil.

    # Create a directory.
    ossutil mkdir oss://<Bucket-Name>/models/Qwen1.5-4B-Chat
    # Upload files recursively (-r indicates recursive upload).
    ossutil cp -r ./ oss://<Bucket-Name>/models/Qwen1.5-4B-Chat
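
Because the clone skips the large files at first, it can be worth checking between git lfs pull and the upload that no weight file is still a Git LFS pointer stub. The sketch below is one way to do that. find_lfs_pointers is a hypothetical helper, and the 4 KiB size cutoff is an assumption based on pointer files being very small text files.

```shell
# Hypothetical helper: print every file under directory $1 that is still a
# Git LFS pointer (a small text stub) rather than the real file content.
# Only small files outside .git are inspected, so large weights are not read.
find_lfs_pointers() {
  find "$1" -type f ! -path '*/.git/*' -size -4k \
    -exec grep -l -m 1 '^version https://git-lfs.github.com/spec/v1' {} +
}
```

Run find_lfs_pointers . in the model directory. Any file it prints was not downloaded, so run git lfs pull again before uploading.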

2. Configure a PV and PVC

To improve model loading performance, this example creates an OSS static volume. For detailed steps, see Use an ossfs 1.0 static volume.

  1. Create an OSS access credential (Secret).

    Replace <AccessKey-ID> and <AccessKey-Secret> with your actual information.

    kubectl create secret generic oss-secret \
      --from-literal=akId='<AccessKey-ID>' \
      --from-literal=akSecret='<AccessKey-Secret>' \
      --namespace default
  2. Create the oss-storage.yaml file.

    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: llm-model
      labels:
        alicloud-pvname: llm-model
    spec:
      capacity:
        storage: 30Gi
      # Access mode
      accessModes:
        - ReadWriteMany            
      persistentVolumeReclaimPolicy: Retain
      storageClassName: oss
      csi:
        driver: ossplugin.csi.alibabacloud.com
        volumeHandle: llm-model
        # Get AccessKey information from the Secret object.
        nodePublishSecretRef:
          name: oss-secret
          namespace: default
        volumeAttributes:
          # Replace with your actual OSS Bucket name.
          bucket: "<Your-Bucket-Name>"         
          # The internal endpoint for the bucket's region. This example uses China (Hangzhou); replace it if your bucket is in another region.
          url: "http://oss-cn-hangzhou-internal.aliyuncs.com" 
          # The relative path in OSS.
          path: "/models/Qwen1.5-4B-Chat"     
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: llm-model
      namespace: default
    spec:
      accessModes:
        - ReadWriteMany
      storageClassName: oss
      resources:
        requests:
          # Requested storage size, which cannot exceed the total volume size.
          storage: 30Gi
      selector:
        matchLabels:
          # Select the PV by using this label.
          alicloud-pvname: llm-model
  3. Deploy the PV and PVC.

    kubectl apply -f oss-storage.yaml
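
Before deploying the service, you may want to confirm that the claim is bound to the volume. A minimal sketch follows. pvc_phase is a hypothetical helper that reads the standard status.phase field of a PersistentVolumeClaim.

```shell
# Hypothetical helper: print the phase of a PVC (for example, Pending or Bound).
# $1: PVC name, $2: namespace (defaults to "default").
pvc_phase() {
  kubectl get pvc "$1" -n "${2:-default}" -o jsonpath='{.status.phase}'
}
```

For the claim created above, pvc_phase llm-model is expected to print Bound once the PV and PVC are matched.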

Step 4: Deploy the Knative inference service

Create a Knative Service, enable the AI gateway feature, and configure the vLLM engine for inference.

  1. Create the service configuration file qwen-service.yaml.

    Key configurations:

    • knative.aliyun.com/ai-gateway: inference: Enables the inference gateway extension.

    • autoscaling.knative.dev/metric: "concurrency": Autoscales based on the number of concurrent requests.

    apiVersion: serving.knative.dev/v1
    kind: Service
    metadata:
      name: qwen
      namespace: default
      annotations:
        # Enable the AI inference gateway.
        knative.aliyun.com/ai-gateway: inference          
        knative.aliyun.com/ai-gateway-inference-priority: "1"
      labels:
        release: qwen
    spec:
      template:
        metadata:
          annotations:
            # Autoscaling metric: concurrency.
            autoscaling.knative.dev/metric: "concurrency" 
            # Target concurrency.
            autoscaling.knative.dev/target: "2"           
            # Maximum number of instances.
            autoscaling.knative.dev/max-scale: "3"        
            # Minimum number of instances. For large models, a minimum of 1 is recommended to keep an instance warm and avoid request timeouts on cold starts.
            autoscaling.knative.dev/min-scale: "1"        
          labels:
            release: qwen
        spec:
          containers:
          - name: vllm-container
            image: ac2-registry.cn-hangzhou.cr.aliyuncs.com/ac2/vllm:0.4.1-ubuntu22.04
            command:
            - sh
            - -c
            - python3 -m vllm.entrypoints.openai.api_server --port 8080 --trust-remote-code --model /models/Qwen1.5-4B-Chat/ --gpu-memory-utilization 0.95 --max-model-len 8192 --dtype half
            ports:
            - containerPort: 8080
            readinessProbe:
              tcpSocket:
                port: 8080
              initialDelaySeconds: 15
              periodSeconds: 5
            resources:
              limits:
                cpu: "32"
                memory: 64Gi
                # Request GPU resources.
                nvidia.com/gpu: "1" 
              requests:
                cpu: "8"
                memory: 32Gi
                nvidia.com/gpu: "1"
            volumeMounts:
            # The mount path must match the --model argument in the startup command.
            - mountPath: /models/Qwen1.5-4B-Chat 
              name: llm-model
          volumes:
          - name: llm-model
            persistentVolumeClaim:
              claimName: llm-model
  2. Deploy the service.

    kubectl apply -f qwen-service.yaml
  3. Check the deployment progress (wait for Ready to be True).

    kubectl get ksvc qwen -n default
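
To get a feel for how the concurrency annotations in qwen-service.yaml translate into Pod counts, the arithmetic can be sketched as follows. This is a deliberately simplified model: it divides the number of in-flight requests by the per-Pod target, rounds up, and clamps the result to the min-scale and max-scale bounds. It ignores details of the real autoscaler such as target utilization and averaging windows. estimate_replicas is a hypothetical helper.

```shell
# Hypothetical helper: rough replica estimate for concurrency-based scaling.
# $1: concurrent requests, $2: per-Pod target, $3: min-scale, $4: max-scale
estimate_replicas() {
  er_want=$(( ($1 + $2 - 1) / $2 ))   # integer ceiling of $1 / $2
  if [ "$er_want" -lt "$3" ]; then er_want=$3; fi
  if [ "$er_want" -gt "$4" ]; then er_want=$4; fi
  echo "$er_want"
}

# Values from qwen-service.yaml: target 2, min-scale 1, max-scale 3.
estimate_replicas 5 2 1 3   # prints 3
```

Under this model, the service stays at one Pod when idle and reaches its maximum of three Pods at around five concurrent requests.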

Step 5: Verify the inference service

After the service is deployed, use the gateway IP address to access the inference API.

  1. Get the gateway IP address.

    export GATEWAY_HOST=$(kubectl -n knative-serving get gateway/knative-gateway -o jsonpath='{.status.addresses[0].value}')
    echo "Gateway IP address: $GATEWAY_HOST"
  2. Send a test request.

    This step sends a chat completion request in the OpenAI API format.

    curl http://${GATEWAY_HOST}:8888/v1/chat/completions \
      -H "Host: qwen.default.example.com" \
      -H "Content-Type: application/json" \
      -d '{
        "model": "/models/Qwen1.5-4B-Chat/",
        "messages": [
          {"role": "user", "content": "Explain Kubernetes in one sentence."}
        ],
        "max_tokens": 50
      }'

    The response is a JSON object that contains a choices field. The model's reply is in choices[0].message.content.
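
For repeated checks, the request and a basic response test can be wrapped in small functions. The sketch below reuses the URL, Host header, and model path from the request above. chat_request and has_choices are hypothetical helpers, and the 120-second timeout is an arbitrary assumption.

```shell
# Hypothetical helper: send one chat completion request through the gateway.
# $1: gateway address, $2: prompt (must not contain double quotes or backslashes)
chat_request() {
  curl -sS --fail --max-time 120 "http://$1:8888/v1/chat/completions" \
    -H "Host: qwen.default.example.com" \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"/models/Qwen1.5-4B-Chat/\", \"messages\": [{\"role\": \"user\", \"content\": \"$2\"}], \"max_tokens\": 50}"
}

# Hypothetical helper: succeed if the text on stdin contains a "choices" key.
# This is a plain substring check, not a JSON parser.
has_choices() {
  grep -q '"choices"'
}
```

For example, chat_request "$GATEWAY_HOST" "Explain Kubernetes in one sentence." | has_choices && echo "inference OK" prints the message only when the gateway returns a successful completion response.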

Billing

The Knative component itself does not incur additional charges. However, you will be billed for the cloud resources that your services use, such as compute, networking, and storage.

  • GPU instances: GPU instances are expensive. To control costs, we recommend using them with node scaling.

  • OSS: Charges include OSS storage and request fees. If public access is involved, you also incur egress traffic charges.

  • Classic Load Balancer (CLB): The public-facing load balancer instance bound to the gateway incurs traffic fees.

For more information, see Cloud product resource fees.

Related documents

Knative also supports deploying other services, such as A2A and MCP Server. This allows you to apply Serverless benefits like on-demand scaling and event-driven patterns to other advanced AI services.