Container Service for Kubernetes: Use Gateway with Inference Extension to queue and prioritize inference requests

Last Updated: Mar 26, 2026

Gateway with Inference Extension supports load-aware request queueing and priority scheduling for inference services. When GPU resources are saturated, the gateway queues incoming requests and processes them based on model priority, ensuring high-priority models receive faster responses.

Important

This feature requires Gateway with Inference Extension version 1.4.0 or later.

How it works

Generative AI inference servers have a hard limit on GPU throughput. When too many concurrent requests arrive, resources such as KV (key-value) cache fill up entirely, degrading response times and token throughput for all requests.

Gateway with Inference Extension monitors each inference server's internal metrics to detect saturation. When a server is at capacity, the gateway queues requests centrally rather than letting them pile up inside the server. It then dispatches queued requests in priority order: requests for high-priority models are always dispatched before requests for lower-priority ones.
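The gating decision described above can be sketched roughly as follows. This is an illustration only, not the gateway's implementation: the metric fields (`kv_cache_utilization`, `waiting_requests`), both thresholds, and the function names are hypothetical, because this topic does not document the actual saturation criteria.

```python
from dataclasses import dataclass

# Hypothetical thresholds chosen for this example only.
KV_CACHE_LIMIT = 0.9
WAITING_LIMIT = 5


@dataclass
class ServerMetrics:
    name: str
    kv_cache_utilization: float  # fraction of KV cache in use, 0.0 to 1.0
    waiting_requests: int        # requests waiting inside the inference server


def is_saturated(m: ServerMetrics) -> bool:
    """Treat a server as saturated when either hypothetical limit is reached."""
    return (m.kv_cache_utilization >= KV_CACHE_LIMIT
            or m.waiting_requests >= WAITING_LIMIT)


def pick_server(servers):
    """Return the least-loaded unsaturated server.

    Returns None when every server is saturated, which signals that the
    request should be held in the central gateway queue instead.
    """
    available = [s for s in servers if not is_saturated(s)]
    if not available:
        return None
    return min(available, key=lambda s: s.kv_cache_utilization)


servers = [
    ServerMetrics("qwen-0", kv_cache_utilization=0.97, waiting_requests=8),
    ServerMetrics("qwen-1", kv_cache_utilization=0.55, waiting_requests=0),
]
print(pick_server(servers).name)  # qwen-1
```

When `pick_server` returns `None`, the request stays in the gateway queue until a server reports spare capacity.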

Priority levels

Assign a criticality level to each InferenceModel to control queue priority under saturation:

Criticality level    Priority order
Critical             Highest
Standard             Medium
Schedulable          Lowest

When queueing is enabled and backend servers are saturated, Critical requests are dispatched before Standard requests, which are dispatched before Schedulable requests.
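The dispatch order can be illustrated with a small priority queue. This is a minimal sketch under stated assumptions: the `GatewayQueue` class and the numeric ranks are hypothetical, and first-in-first-out ordering within a single criticality level is an assumption made for the example, not documented behavior.

```python
import heapq
import itertools

# Lower rank is dispatched earlier. The level names come from the
# criticality table; the numeric ranks are illustrative.
PRIORITY = {"Critical": 0, "Standard": 1, "Schedulable": 2}


class GatewayQueue:
    """Toy central queue that orders requests by criticality."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # arrival counter used as a tie-breaker

    def enqueue(self, request_id, criticality):
        rank = PRIORITY[criticality]
        heapq.heappush(self._heap, (rank, next(self._seq), request_id))

    def dispatch(self):
        """Remove and return the queued request with the highest priority."""
        _, _, request_id = heapq.heappop(self._heap)
        return request_id

    def __len__(self):
        return len(self._heap)


queue = GatewayQueue()
arrivals = [("r1", "Standard"), ("r2", "Schedulable"),
            ("r3", "Critical"), ("r4", "Standard")]
for request_id, level in arrivals:
    queue.enqueue(request_id, level)

order = [queue.dispatch() for _ in range(len(queue))]
print(order)  # ['r3', 'r1', 'r4', 'r2']
```

The Critical request `r3` arrives third but leaves first, and the Schedulable request `r2` leaves last even though it arrived second.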

Prerequisites

Before you begin, make sure that the following requirements are met:

Gateway with Inference Extension version 1.4.0 or later is installed in your cluster.

GPU resources that can run the sample image used in this topic are available: A10 GPUs for ACK clusters, or GN8IS GPUs for ACS GPU computing power.

Because the LLM image is large, transfer it to Container Registry in advance and pull it by using the internal network address. Pulling the image over the public network depends on the bandwidth of the cluster's elastic IP address (EIP) and can result in long wait times.

Enable request queueing and priority scheduling

Step 1: Deploy a sample inference service

Create vllm-service.yaml.

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: qwen
  name: qwen
spec:
  progressDeadlineSeconds: 600
  replicas: 5
  selector:
    matchLabels:
      app: qwen
  template:
    metadata:
      annotations:
        prometheus.io/path: /metrics
        prometheus.io/port: "8000"
        prometheus.io/scrape: "true"
      labels:
        app: qwen
        alibabacloud.com/compute-class: gpu
        alibabacloud.com/compute-qos: default
        alibabacloud.com/gpu-model-series: GN8IS
    spec:
      containers:
        - command:
            - sh
            - -c
            - vllm serve /models/Qwen-2.5-7B-Instruct --port 8000 --enable_prefix_caching --trust-remote-code --served-model-name qwen --gpu-memory-utilization 0.95 --enforce-eager --enable-lora --max-loras 2 --max-cpu-loras 4 --lora-modules travel-helper-v1=/models/Qwen-TravelHelper-Lora travel-helper-v2=/models/Qwen-TravelHelper-Lora-v2
          image: registry-cn-hangzhou.ack.aliyuncs.com/dev/qwen-2.5-7b-instruct-lora:v0.1
          imagePullPolicy: IfNotPresent
          name: custom-serving
          ports:
            - containerPort: 8000
              name: http
              protocol: TCP
          readinessProbe:
            failureThreshold: 3
            initialDelaySeconds: 30
            periodSeconds: 30
            successThreshold: 1
            tcpSocket:
              port: 8000
            timeoutSeconds: 1
          resources:
            limits:
              nvidia.com/gpu: "1"
              cpu: "8"
              memory: 30G
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
            - mountPath: /dev/shm
              name: dshm
      restartPolicy: Always
      volumes:
        - emptyDir:
            medium: Memory
            sizeLimit: 30Gi
          name: dshm
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: qwen
  name: qwen
spec:
  ports:
    - name: http-serving
      port: 8000
      protocol: TCP
      targetPort: 8000
  selector:
    app: qwen

Deploy the sample inference service.

kubectl apply -f vllm-service.yaml

Step 2: Configure inference routing

Create InferencePool and InferenceModel resources. Adding the inference-epp-env.networking.x-k8s.io/experimental-use-queueing: "true" and inference-epp-env.networking.x-k8s.io/experimental-use-scheduler-v2: "true" annotations to InferencePool enables request queueing for the selected inference services.

  1. Create a file named inference-pool.yaml.

    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      annotations:
        inference-epp-env.networking.x-k8s.io/experimental-use-queueing: "true"
        inference-epp-env.networking.x-k8s.io/experimental-use-scheduler-v2: "true"
      name: qwen-pool
      namespace: default
    spec:
      extensionRef:
        group: ""
        kind: Service
        name: qwen-ext-proc
      selector:
        app: qwen
      targetPortNumber: 8000
    ---
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceModel
    metadata:
      name: qwen-model
    spec:
      criticality: Critical
      modelName: qwen
      poolRef:
        group: inference.networking.x-k8s.io
        kind: InferencePool
        name: qwen-pool
      targetModels:
      - name: qwen
        weight: 100
    ---
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceModel
    metadata:
      name: travel-helper-model
    spec:
      criticality: Standard
      modelName: travel-helper
      poolRef:
        group: inference.networking.x-k8s.io
        kind: InferencePool
        name: qwen-pool
      targetModels:
      - name: travel-helper-v1
        weight: 100

    This configuration defines two InferenceModel resources for the models served by the sample inference service:

    Model                                             Criticality
    qwen-model (base model qwen)                      Critical
    travel-helper-model (LoRA model travel-helper)    Standard

  2. Deploy the inference routing configuration.

    kubectl apply -f inference-pool.yaml
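The relationship between a requested model name, its target models, and its criticality can be sketched as a lookup. This is only an illustration of the configuration above, not the extension's code: the `INFERENCE_MODELS` dictionary and the `resolve` function are hypothetical, and treating `weight` as a proportional traffic split across target models is an assumption for the example.

```python
import random

# Mirrors the two InferenceModel resources in inference-pool.yaml.
INFERENCE_MODELS = {
    "qwen": {
        "criticality": "Critical",
        "targets": [("qwen", 100)],
    },
    "travel-helper": {
        "criticality": "Standard",
        "targets": [("travel-helper-v1", 100)],
    },
}


def resolve(model_name, rng=random):
    """Map a requested model name to (target model, criticality).

    The target is drawn at random in proportion to the weights, so a
    single target with weight 100 is always selected.
    """
    spec = INFERENCE_MODELS[model_name]
    names = [name for name, _ in spec["targets"]]
    weights = [weight for _, weight in spec["targets"]]
    target = rng.choices(names, weights=weights, k=1)[0]
    return target, spec["criticality"]


print(resolve("qwen"))           # ('qwen', 'Critical')
print(resolve("travel-helper"))  # ('travel-helper-v1', 'Standard')
```

Because each InferenceModel in this topic lists exactly one target model, the mapping is deterministic here.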

Step 3: Deploy the gateway and routing rules

Configure a Gateway and an HTTPRoute to route requests for the qwen and travel-helper models to the qwen-pool backend InferencePool.

  1. Create a file named inference-gateway.yaml.

    apiVersion: gateway.networking.k8s.io/v1
    kind: GatewayClass
    metadata:
      name: inference-gateway
    spec:
      controllerName: gateway.envoyproxy.io/gatewayclass-controller
    ---
    apiVersion: gateway.networking.k8s.io/v1
    kind: Gateway
    metadata:
      name: inference-gateway
    spec:
      gatewayClassName: inference-gateway
      listeners:
        - name: llm-gw
          protocol: HTTP
          port: 8081
    ---
    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: llm-route
      namespace: default
    spec:
      parentRefs:
      - group: gateway.networking.k8s.io
        kind: Gateway
        name: inference-gateway
        sectionName: llm-gw
      rules:
      - backendRefs:
        - group: inference.networking.x-k8s.io
          kind: InferencePool
          name: qwen-pool
        matches:
        - headers:
          - type: Exact
            name: X-Gateway-Model-Name
            value: qwen
        - headers:
          - type: RegularExpression
            name: X-Gateway-Model-Name
            value: travel-helper.*
    ---
    apiVersion: gateway.envoyproxy.io/v1alpha1
    kind: BackendTrafficPolicy
    metadata:
      name: backend-timeout
    spec:
      timeout:
        http:
          requestTimeout: 24h
      targetRef:
        group: gateway.networking.k8s.io
        kind: Gateway
        name: inference-gateway
  2. Deploy the gateway.

    kubectl apply -f inference-gateway.yaml
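The two header matches in the HTTPRoute rule are alternatives: a request is routed to qwen-pool if either one matches. The sketch below mimics that logic for illustration. The `routes_to_pool` function is hypothetical, the header lookup is case-sensitive for simplicity, and requiring the regular expression to match the whole header value is an assumption about the gateway implementation.

```python
import re

HEADER = "X-Gateway-Model-Name"

# The two entries under `matches` in the HTTPRoute rule.
MATCHES = [
    {"type": "Exact", "value": "qwen"},
    {"type": "RegularExpression", "value": "travel-helper.*"},
]


def routes_to_pool(headers):
    """Return True if any header match in the rule accepts the request."""
    value = headers.get(HEADER)
    if value is None:
        return False
    for match in MATCHES:
        if match["type"] == "Exact" and value == match["value"]:
            return True
        # Assumption: the pattern must cover the entire header value.
        if (match["type"] == "RegularExpression"
                and re.fullmatch(match["value"], value)):
            return True
    return False


for name in ["qwen", "travel-helper", "travel-helper-v1", "qwen2"]:
    print(name, routes_to_pool({HEADER: name}))
```

Under these assumptions, `qwen`, `travel-helper`, and `travel-helper-v1` are routed to the pool, while `qwen2` is not.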

Step 4: Validate queueing and priority scheduling

Use the vLLM benchmark tool to simultaneously load test both the qwen and travel-helper models, pushing the inference servers to full capacity.

  1. Deploy the benchmark workload.

    kubectl apply -f- <<EOF
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        app: vllm-benchmark
      name: vllm-benchmark
      namespace: default
    spec:
      progressDeadlineSeconds: 600
      replicas: 1
      revisionHistoryLimit: 10
      selector:
        matchLabels:
          app: vllm-benchmark
      strategy:
        rollingUpdate:
          maxSurge: 25%
          maxUnavailable: 25%
        type: RollingUpdate
      template:
        metadata:
          creationTimestamp: null
          labels:
            app: vllm-benchmark
        spec:
          containers:
          - command:
            - sh
            - -c
            - sleep inf
            image: registry-cn-hangzhou.ack.aliyuncs.com/dev/llm-benchmark:random-and-qa
            imagePullPolicy: IfNotPresent
            name: vllm-benchmark
            resources: {}
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
          dnsPolicy: ClusterFirst
          restartPolicy: Always
          schedulerName: default-scheduler
          securityContext: {}
          terminationGracePeriodSeconds: 30
    EOF
  2. Get the cluster IP address of the gateway Service.

    export GW_IP=$(kubectl get svc -n envoy-gateway-system -l gateway.envoyproxy.io/owning-gateway-namespace=default,gateway.envoyproxy.io/owning-gateway-name=inference-gateway -o jsonpath='{.items[0].spec.clusterIP}')
  3. Open two separate terminal windows, make sure that the GW_IP variable is set in each one, and run the two load tests at the same time.

    Important

    The following data was generated in a test environment and is for reference only. Your results may vary depending on your environment.

    Terminal 1: Load test the `qwen` (Critical) model

    kubectl exec -it deploy/vllm-benchmark -- env GW_IP=${GW_IP} python3 /root/vllm/benchmarks/benchmark_serving.py \
    --backend vllm \
    --model /models/DeepSeek-R1-Distill-Qwen-7B \
    --served-model-name qwen \
    --trust-remote-code \
    --dataset-name random \
    --random-prefix-len 1000 \
    --random-input-len 3000 \
    --random-output-len 3000 \
    --random-range-ratio 0.2 \
    --num-prompts 300 \
    --max-concurrency 60 \
    --host $GW_IP \
    --port 8081 \
    --endpoint /v1/completions \
    --save-result \
    2>&1 | tee benchmark_serving_qwen.txt

    Expected output:

    ============ Serving Benchmark Result ============
    Successful requests:                     293
    Benchmark duration (s):                  1005.55
    Total input tokens:                      1163919
    Total generated tokens:                  837560
    Request throughput (req/s):              0.29
    Output token throughput (tok/s):         832.94
    Total Token throughput (tok/s):          1990.43
    ---------------Time to First Token----------------
    Mean TTFT (ms):                          21329.91
    Median TTFT (ms):                        15754.01
    P99 TTFT (ms):                           140782.55
    -----Time per Output Token (excl. 1st token)------
    Mean TPOT (ms):                          58.58
    Median TPOT (ms):                        58.36
    P99 TPOT (ms):                           91.09
    ---------------Inter-token Latency----------------
    Mean ITL (ms):                           58.32
    Median ITL (ms):                         50.56
    P99 ITL (ms):                            64.12
    ==================================================

    Terminal 2: Load test the `travel-helper` (Standard) model

    kubectl exec -it deploy/vllm-benchmark -- env GW_IP=${GW_IP} python3 /root/vllm/benchmarks/benchmark_serving.py \
    --backend vllm \
    --model /models/DeepSeek-R1-Distill-Qwen-7B \
    --served-model-name travel-helper \
    --trust-remote-code \
    --dataset-name random \
    --random-prefix-len 1000 \
    --random-input-len 3000 \
    --random-output-len 3000 \
    --random-range-ratio 0.2 \
    --num-prompts 300 \
    --max-concurrency 60 \
    --host $GW_IP \
    --port 8081 \
    --endpoint /v1/completions \
    --save-result \
    2>&1 | tee benchmark_serving_travel_helper.txt

    Expected output:

    ============ Serving Benchmark Result ============
    Successful requests:                     165
    Benchmark duration (s):                  889.41
    Total input tokens:                      660560
    Total generated tokens:                  492207
    Request throughput (req/s):              0.19
    Output token throughput (tok/s):         553.41
    Total Token throughput (tok/s):          1296.10
    ---------------Time to First Token----------------
    Mean TTFT (ms):                          44201.12
    Median TTFT (ms):                        28757.03
    P99 TTFT (ms):                           214710.13
    -----Time per Output Token (excl. 1st token)------
    Mean TPOT (ms):                          67.38
    Median TPOT (ms):                        60.51
    P99 TPOT (ms):                           118.36
    ---------------Inter-token Latency----------------
    Mean ITL (ms):                           66.98
    Median ITL (ms):                         51.25
    P99 ITL (ms):                            64.87
    ==================================================

    The results confirm priority scheduling under full load:

    Metric                 qwen (Critical)      travel-helper (Standard)
    Mean TTFT (ms)         21,329               44,201
    Successful requests    293 / 300 (97.7%)    165 / 300 (55%)
    Mean TPOT (ms)         58.58                67.38

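    Under saturation, Standard-priority requests wait longer in the gateway queue before they are dispatched, which is why their mean Time to First Token (TTFT) is approximately 2x that of Critical requests (44,201 ms compared with 21,329 ms). Fewer Standard requests also complete successfully: 165 of 300, compared with 293 of 300 for Critical requests. This is the expected trade-off: Critical requests are served first, while Standard requests wait.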
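The comparison figures can be reproduced from the two benchmark reports with a few lines of arithmetic. The snippet below copies the values from the sample outputs above; the `results` dictionary layout is only a convenience for this example.

```python
# Values copied from the two sample benchmark outputs in this step.
results = {
    "qwen": {"mean_ttft_ms": 21329.91, "ok": 293, "sent": 300},
    "travel-helper": {"mean_ttft_ms": 44201.12, "ok": 165, "sent": 300},
}

# How much longer Standard requests wait for their first token.
ratio = (results["travel-helper"]["mean_ttft_ms"]
         / results["qwen"]["mean_ttft_ms"])
print(f"TTFT ratio: {ratio:.2f}x")  # TTFT ratio: 2.07x

# Share of the 300 prompts that completed successfully for each model.
for name, r in results.items():
    print(f"{name}: {r['ok'] / r['sent']:.1%} successful")
# qwen: 97.7% successful
# travel-helper: 55.0% successful
```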