Container Service for Kubernetes: Implement inference request queueing and priority scheduling with intelligent inference routing

Last Updated: Aug 29, 2025

Gateway with Inference Extension supports load-aware request queueing and priority scheduling for inference services. When a generative AI inference server is operating at full capacity, the gateway prioritizes requests in the queue based on their assigned model criticality. This ensures that requests for high-priority models are processed first. This topic introduces these capabilities of Gateway with Inference Extension.

Important

This feature requires version 1.4.0 or later of Gateway with Inference Extension.

Background

For generative AI inference services, the request throughput of a single inference server is strictly limited by its GPU resources. When many concurrent requests are sent to the same server, resources such as the key-value (KV) cache in the inference engine become fully occupied, degrading response times and token throughput for all requests.

Gateway with Inference Extension addresses this by monitoring multiple metrics to assess the internal state of each inference server. When a server's load reaches capacity, the gateway queues incoming inference requests, preventing the server from being overloaded and maintaining overall service quality.

Prerequisites

Note

For the image used in this topic, we recommend A10 GPUs for ACK clusters and the GN8IS GPU type for Alibaba Cloud Container Compute Service (ACS) GPU compute.

Because the LLM image is large, we recommend that you push it to Container Registry in advance and pull it over the internal network. Pulling over the public network is limited by the bandwidth of the cluster's elastic IP address (EIP) and may take significantly longer.
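
For example, assuming the sample image can be pulled from your environment and that you use a Container Registry (ACR) instance with a VPC endpoint, you can copy the image in advance with standard docker commands. The registry address and namespace below are placeholders; replace them with your own values and update the image field of the Deployment in Step 1 accordingly:

    # Pull the sample image, then push it to your own ACR instance over its VPC endpoint (placeholders shown).
    docker pull registry-cn-hangzhou.ack.aliyuncs.com/dev/qwen-2.5-7b-instruct-lora:v0.1
    docker tag registry-cn-hangzhou.ack.aliyuncs.com/dev/qwen-2.5-7b-instruct-lora:v0.1 <your-registry-vpc-endpoint>/<namespace>/qwen-2.5-7b-instruct-lora:v0.1
    docker push <your-registry-vpc-endpoint>/<namespace>/qwen-2.5-7b-instruct-lora:v0.1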

Procedure

Step 1: Deploy a sample inference service

  1. Create vllm-service.yaml.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        app: qwen
      name: qwen
    spec:
      replicas: 5
      selector:
        matchLabels:
          app: qwen
      template:
        metadata:
          annotations:
            prometheus.io/path: /metrics
            prometheus.io/port: "8000"
            prometheus.io/scrape: "true"
          labels:
            app: qwen
        spec:
          containers:
          - command:
            - sh
            - -c
            - vllm serve /models/Qwen-2.5-7B-Instruct --port 8000 --enable_prefix_caching --trust-remote-code --served-model-name qwen --max-model-len 8192 --gpu-memory-utilization 0.95 --enforce-eager --enable-lora --max-loras 2 --max-cpu-loras 4 --lora-modules travel-helper-v1=/models/Qwen-TravelHelper-Lora travel-helper-v2=/models/Qwen-TravelHelper-Lora-v2
            image: registry-cn-hangzhou.ack.aliyuncs.com/dev/qwen-2.5-7b-instruct-lora:v0.1
            imagePullPolicy: IfNotPresent
            name: custom-serving
            ports:
            - containerPort: 8000
              name: http
              protocol: TCP
            readinessProbe:
              failureThreshold: 3
              initialDelaySeconds: 30
              periodSeconds: 30
              successThreshold: 1
              tcpSocket:
                port: 8000
              timeoutSeconds: 1
            resources:
              limits:
                nvidia.com/gpu: "1"
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
            volumeMounts:
            - mountPath: /dev/shm
              name: dshm
          dnsPolicy: ClusterFirst
          restartPolicy: Always
          schedulerName: default-scheduler
          securityContext: {}
          terminationGracePeriodSeconds: 30
          volumes:
          - emptyDir:
              medium: Memory
              sizeLimit: 30Gi
            name: dshm
    ---
    apiVersion: v1
    kind: Service
    metadata:
      labels:
        app: qwen
      name: qwen
    spec:
      ports:
      - name: http-serving
        port: 8000
        protocol: TCP
        targetPort: 8000
      selector:
        app: qwen
  2. Deploy the sample inference service.

    kubectl apply -f vllm-service.yaml
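
    Optionally, confirm that all five replicas are running and ready before you continue. Because the image is large, pulling it and loading the model can take a while:

    kubectl get pods -l app=qwen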

Step 2: Configure inference routing

In this step, you will create InferencePool and InferenceModel resources. By adding the inference-epp-env.networking.x-k8s.io/experimental-use-queueing: "true" and inference-epp-env.networking.x-k8s.io/experimental-use-scheduler-v2: "true" annotations to the InferencePool, you enable request queueing for the selected inference service.

  1. Create a file named inference-pool.yaml.

    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      annotations:
        inference-epp-env.networking.x-k8s.io/experimental-use-queueing: "true"
        inference-epp-env.networking.x-k8s.io/experimental-use-scheduler-v2: "true"
      name: qwen-pool
      namespace: default
    spec:
      extensionRef:
        group: ""
        kind: Service
        name: qwen-ext-proc
      selector:
        app: qwen
      targetPortNumber: 8000
    ---
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceModel
    metadata:
      name: qwen-model
    spec:
      criticality: Critical
      modelName: qwen
      poolRef:
        group: inference.networking.x-k8s.io
        kind: InferencePool
        name: qwen-pool
      targetModels:
      - name: qwen
        weight: 100
    ---
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceModel
    metadata:
      name: travel-helper-model
    spec:
      criticality: Standard
      modelName: travel-helper
      poolRef:
        group: inference.networking.x-k8s.io
        kind: InferencePool
        name: qwen-pool
      targetModels:
      - name: travel-helper-v1
        weight: 100

    This configuration defines two InferenceModel resources that represent the two models served by the sample inference service:

    • qwen-model: Represents the base model qwen and is assigned the Critical criticality level.

    • travel-helper-model: Represents the LoRA model travel-helper and is assigned the Standard criticality level.

    The available criticality levels, in order of priority, are Critical > Standard > Sheddable. When queueing is enabled and the backend servers are at full capacity, requests for higher-priority models are processed before requests for lower-priority ones.

  2. Deploy the inference routing configuration.

    kubectl apply -f inference-pool.yaml
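
    Optionally, confirm that the resources exist. This assumes that the InferencePool and InferenceModel CRDs were installed along with the Gateway with Inference Extension component:

    kubectl get inferencepools.inference.networking.x-k8s.io,inferencemodels.inference.networking.x-k8s.io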

Step 3: Deploy the gateway and routing rules

In this step, you will configure a gateway and an HTTPRoute to route requests for the qwen and travel-helper models to the qwen-pool backend InferencePool.

  1. Create a file named inference-gateway.yaml.

    apiVersion: gateway.networking.k8s.io/v1
    kind: GatewayClass
    metadata:
      name: inference-gateway
    spec:
      controllerName: gateway.envoyproxy.io/gatewayclass-controller
    ---
    apiVersion: gateway.networking.k8s.io/v1
    kind: Gateway
    metadata:
      name: inference-gateway
    spec:
      gatewayClassName: inference-gateway
      listeners:
        - name: llm-gw
          protocol: HTTP
          port: 8081
    ---
    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: llm-route
      namespace: default
    spec:
      parentRefs:
      - group: gateway.networking.k8s.io
        kind: Gateway
        name: inference-gateway
        sectionName: llm-gw
      rules:
      - backendRefs:
        - group: inference.networking.x-k8s.io
          kind: InferencePool
          name: qwen-pool
        matches:
        - headers:
          - type: Exact
            name: X-Gateway-Model-Name
            value: qwen
        - headers:
          - type: RegularExpression
            name: X-Gateway-Model-Name
            value: travel-helper.*
    ---
    apiVersion: gateway.envoyproxy.io/v1alpha1
    kind: BackendTrafficPolicy
    metadata:
      name: backend-timeout
    spec:
      timeout:
        http:
          requestTimeout: 24h
      targetRef:
        group: gateway.networking.k8s.io
        kind: Gateway
        name: inference-gateway
  2. Deploy the gateway.

    kubectl apply -f inference-gateway.yaml
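
    Optionally, check that the gateway and route were accepted before you continue:

    kubectl get gateway inference-gateway
    kubectl get httproute llm-route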

Step 4: Validate queueing and priority scheduling

In this step, you will use the vLLM benchmark tool to simultaneously load test both the qwen and travel-helper models, pushing the inference servers to full capacity.

  1. Deploy the benchmark workload.

    kubectl apply -f- <<EOF
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        app: vllm-benchmark
      name: vllm-benchmark
      namespace: default
    spec:
      progressDeadlineSeconds: 600
      replicas: 1
      revisionHistoryLimit: 10
      selector:
        matchLabels:
          app: vllm-benchmark
      strategy:
        rollingUpdate:
          maxSurge: 25%
          maxUnavailable: 25%
        type: RollingUpdate
      template:
        metadata:
          labels:
            app: vllm-benchmark
        spec:
          containers:
          - command:
            - sh
            - -c
            - sleep inf
            image: registry-cn-hangzhou.ack.aliyuncs.com/dev/llm-benchmark:random-and-qa
            imagePullPolicy: IfNotPresent
            name: vllm-benchmark
            resources: {}
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
          dnsPolicy: ClusterFirst
          restartPolicy: Always
          schedulerName: default-scheduler
          securityContext: {}
          terminationGracePeriodSeconds: 30
    EOF
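
    Wait until the benchmark Deployment is available before you continue:

    kubectl rollout status deployment/vllm-benchmark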
  2. Get the internal IP of the gateway.

    export GW_IP=$(kubectl get svc -n envoy-gateway-system -l gateway.envoyproxy.io/owning-gateway-namespace=default,gateway.envoyproxy.io/owning-gateway-name=inference-gateway -o jsonpath='{.items[0].spec.clusterIP}')
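
    Optionally, send a single request through the gateway to verify routing before you start the load tests. This is a minimal smoke test that assumes curl is available in the benchmark image; the X-Gateway-Model-Name header is set explicitly to match the HTTPRoute rule, and the prompt and token count are arbitrary. Because GW_IP was exported in the previous step, the URL is expanded by your local shell before the command runs in the pod:

    kubectl exec -it deploy/vllm-benchmark -- curl -s http://${GW_IP}:8081/v1/completions \
    -H "Content-Type: application/json" \
    -H "X-Gateway-Model-Name: qwen" \
    -d '{"model": "qwen", "prompt": "Hello", "max_tokens": 16}'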
  3. Open two separate terminal windows and run the load tests simultaneously.

    Important

    The following data was generated in a test environment and is for reference only. Your results may vary depending on your environment.

    Terminal 1: Load test the qwen (critical) model

    kubectl exec -it deploy/vllm-benchmark -- env GW_IP=${GW_IP} python3 /root/vllm/benchmarks/benchmark_serving.py \
    --backend vllm \
    --model /models/DeepSeek-R1-Distill-Qwen-7B \
    --served-model-name qwen \
    --trust-remote-code \
    --dataset-name random \
    --random-prefix-len 1000 \
    --random-input-len 3000 \
    --random-output-len 3000 \
    --random-range-ratio 0.2 \
    --num-prompts 300 \
    --max-concurrency 60 \
    --host $GW_IP \
    --port 8081 \
    --endpoint /v1/completions \
    --save-result \
    2>&1 | tee benchmark_serving.txt

    Expected output:

    ============ Serving Benchmark Result ============
    Successful requests:                     293       
    Benchmark duration (s):                  1005.55   
    Total input tokens:                      1163919   
    Total generated tokens:                  837560    
    Request throughput (req/s):              0.29      
    Output token throughput (tok/s):         832.94    
    Total Token throughput (tok/s):          1990.43   
    ---------------Time to First Token----------------
    Mean TTFT (ms):                          21329.91  
    Median TTFT (ms):                        15754.01  
    P99 TTFT (ms):                           140782.55 
    -----Time per Output Token (excl. 1st token)------
    Mean TPOT (ms):                          58.58     
    Median TPOT (ms):                        58.36     
    P99 TPOT (ms):                           91.09     
    ---------------Inter-token Latency----------------
    Mean ITL (ms):                           58.32     
    Median ITL (ms):                         50.56     
    P99 ITL (ms):                            64.12     
    ==================================================

    Terminal 2: Load test the travel-helper (standard) model

    kubectl exec -it deploy/vllm-benchmark -- env GW_IP=${GW_IP} python3 /root/vllm/benchmarks/benchmark_serving.py \
    --backend vllm \
    --model /models/DeepSeek-R1-Distill-Qwen-7B \
    --served-model-name travel-helper \
    --trust-remote-code \
    --dataset-name random \
    --random-prefix-len 1000 \
    --random-input-len 3000 \
    --random-output-len 3000 \
    --random-range-ratio 0.2 \
    --num-prompts 300 \
    --max-concurrency 60 \
    --host $GW_IP \
    --port 8081 \
    --endpoint /v1/completions \
    --save-result \
    2>&1 | tee benchmark_serving.txt

    Expected output:

    ============ Serving Benchmark Result ============
    Successful requests:                     165       
    Benchmark duration (s):                  889.41    
    Total input tokens:                      660560    
    Total generated tokens:                  492207    
    Request throughput (req/s):              0.19      
    Output token throughput (tok/s):         553.41    
    Total Token throughput (tok/s):          1296.10   
    ---------------Time to First Token----------------
    Mean TTFT (ms):                          44201.12  
    Median TTFT (ms):                        28757.03  
    P99 TTFT (ms):                           214710.13 
    -----Time per Output Token (excl. 1st token)------
    Mean TPOT (ms):                          67.38     
    Median TPOT (ms):                        60.51     
    P99 TPOT (ms):                           118.36    
    ---------------Inter-token Latency----------------
    Mean ITL (ms):                           66.98     
    Median ITL (ms):                         51.25     
    P99 ITL (ms):                            64.87     
    ==================================================

    The benchmark results confirm the effectiveness of priority scheduling under full load:

    • Faster response times: The high-priority qwen model's mean Time to First Token (TTFT) was about half that of the standard-priority travel-helper model (roughly 21.3 s versus 44.2 s).

    • Higher reliability: The qwen model also completed far more requests successfully (293 of 300, versus 165 of 300 for travel-helper), about 95% fewer failed requests.