
Container Service for Kubernetes: Implement inference request queueing and priority scheduling with Gateway with Inference Extension

Last updated: Aug 29, 2025

Gateway with Inference Extension supports load-aware queueing and priority scheduling of inference requests. When the backend model servers of a generative AI inference service are saturated, queued inference requests can be scheduled by model priority, so that requests for higher-priority models are served first. This topic describes the inference request queueing and priority scheduling capabilities of Gateway with Inference Extension.

Important

This topic requires Gateway with Inference Extension v1.4.0 or later.

Background information

For generative AI inference services, the request throughput of a single inference server is strictly limited by its GPU resources. When a large number of requests are sent to the same inference server at once, resources such as the inference engine's KV cache become fully occupied, degrading the response time and token throughput of all requests.

Gateway with Inference Extension evaluates the internal state of each inference server through multiple server-side metrics, and queues inference requests when the servers are saturated. This prevents excess requests from reaching the inference servers and degrading overall service quality.
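The load-aware queueing behavior described above can be sketched as follows. This is a minimal illustration, not the actual extension code: it assumes a single saturation signal (KV cache utilization) and a fixed threshold, whereas the real extension combines several server-side metrics.

```python
import collections

# Illustrative sketch only, NOT the actual Gateway with Inference
# Extension code. Assumption: the router sees one saturation signal
# (KV cache utilization) and queues requests once a fixed threshold
# is exceeded.
class LoadAwareRouter:
    def __init__(self, kv_cache_threshold=0.8):
        self.kv_cache_threshold = kv_cache_threshold
        self.queue = collections.deque()  # requests waiting for capacity
        self.dispatched = []              # requests forwarded to the server

    def handle(self, request_id, kv_cache_utilization):
        """Forward immediately if the server has headroom, else queue."""
        if kv_cache_utilization < self.kv_cache_threshold:
            self.dispatched.append(request_id)
        else:
            self.queue.append(request_id)

    def drain(self, kv_cache_utilization):
        """Re-dispatch queued requests once utilization drops again."""
        while self.queue and kv_cache_utilization < self.kv_cache_threshold:
            self.dispatched.append(self.queue.popleft())

router = LoadAwareRouter()
router.handle("req-1", kv_cache_utilization=0.55)  # forwarded
router.handle("req-2", kv_cache_utilization=0.95)  # queued: server saturated
router.drain(kv_cache_utilization=0.40)            # req-2 forwarded now
print(router.dispatched)  # ['req-1', 'req-2']
```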

Prerequisites

Note

For the image used in this topic, we recommend the A10 GPU type for ACK clusters, and the L20 (GN8IS) GPU type for ACS GPU compute.

In addition, because LLM images are large, we recommend that you transfer the image to Container Registry (ACR) in advance and pull it over the internal network. Pulling directly over the public internet depends on the bandwidth of the cluster's EIP and can take a long time.

Procedure

Step 1: Deploy the sample inference service

  1. Create vllm-service.yaml.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        app: qwen
      name: qwen
    spec:
      replicas: 5
      selector:
        matchLabels:
          app: qwen
      template:
        metadata:
          annotations:
            prometheus.io/path: /metrics
            prometheus.io/port: "8000"
            prometheus.io/scrape: "true"
          labels:
            app: qwen
        spec:
          containers:
          - command:
            - sh
            - -c
            - vllm serve /models/Qwen-2.5-7B-Instruct --port 8000 --enable_prefix_caching --trust-remote-code --served-model-name qwen --max-model-len 8192 --gpu-memory-utilization 0.95 --enforce-eager --enable-lora --max-loras 2 --max-cpu-loras 4 --lora-modules travel-helper-v1=/models/Qwen-TravelHelper-Lora travel-helper-v2=/models/Qwen-TravelHelper-Lora-v2
            image: registry-cn-hangzhou.ack.aliyuncs.com/dev/qwen-2.5-7b-instruct-lora:v0.1
            imagePullPolicy: IfNotPresent
            name: custom-serving
            ports:
            - containerPort: 8000
              name: http
              protocol: TCP
            readinessProbe:
              failureThreshold: 3
              initialDelaySeconds: 30
              periodSeconds: 30
              successThreshold: 1
              tcpSocket:
                port: 8000
              timeoutSeconds: 1
            resources:
              limits:
                nvidia.com/gpu: "1"
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
            volumeMounts:
            - mountPath: /dev/shm
              name: dshm
          dnsPolicy: ClusterFirst
          restartPolicy: Always
          schedulerName: default-scheduler
          securityContext: {}
          terminationGracePeriodSeconds: 30
          volumes:
          - emptyDir:
              medium: Memory
              sizeLimit: 30Gi
            name: dshm
    ---
    apiVersion: v1
    kind: Service
    metadata:
      labels:
        app: qwen
      name: qwen
    spec:
      ports:
      - name: http-serving
        port: 8000
        protocol: TCP
        targetPort: 8000
      selector:
        app: qwen
  2. Deploy the sample inference service.

    kubectl apply -f vllm-service.yaml

Step 2: Deploy the inference route

This step creates an InferencePool resource and two InferenceModel resources, and enables queueing for the inference service selected by the InferencePool by adding the inference-epp-env.networking.x-k8s.io/experimental-use-queueing: "true" and inference-epp-env.networking.x-k8s.io/experimental-use-scheduler-v2: "true" annotations to the InferencePool.

  1. Create inference-pool.yaml.

    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      annotations:
        inference-epp-env.networking.x-k8s.io/experimental-use-queueing: "true"
        inference-epp-env.networking.x-k8s.io/experimental-use-scheduler-v2: "true"
      name: qwen-pool
      namespace: default
    spec:
      extensionRef:
        group: ""
        kind: Service
        name: qwen-ext-proc
      selector:
        app: qwen
      targetPortNumber: 8000
    ---
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceModel
    metadata:
      name: qwen-model
    spec:
      criticality: Critical
      modelName: qwen
      poolRef:
        group: inference.networking.x-k8s.io
        kind: InferencePool
        name: qwen-pool
      targetModels:
      - name: qwen
        weight: 100
    ---
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceModel
    metadata:
      name: travel-helper-model
    spec:
      criticality: Standard
      modelName: travel-helper
      poolRef:
        group: inference.networking.x-k8s.io
        kind: InferencePool
        name: qwen-pool
      targetModels:
      - name: travel-helper-v1
        weight: 100

    Alongside the InferencePool resource, two InferenceModel resources are declared, representing the two models that the sample inference service can serve:

    • qwen-model: declares the base model qwen served by the sample inference service, and marks it as critical through the criticality: Critical field.

    • travel-helper-model: declares travel-helper, a LoRA model served on top of the base model, and marks it as standard through the criticality: Standard field.

    Model criticality can be declared as Critical, Standard, or Sheddable, with priority Critical > Standard > Sheddable. With queueing enabled, when the backend model servers are saturated, requests for higher-priority models are served first.

  2. Deploy the inference route.

    kubectl apply -f inference-pool.yaml
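The criticality-based ordering described above can be illustrated with a small sketch. This is an assumption-laden model, not the actual endpoint picker (EPP) scheduler: it simply dequeues by criticality level, FIFO within each level.

```python
import heapq
import itertools

# Illustrative sketch only, NOT the actual EPP scheduler: queued requests
# are dequeued by model criticality (Critical > Standard > Sheddable),
# FIFO within the same criticality level.
CRITICALITY_RANK = {"Critical": 0, "Standard": 1, "Sheddable": 2}

class PriorityRequestQueue:
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps FIFO order per level

    def enqueue(self, request_id, criticality):
        rank = CRITICALITY_RANK[criticality]
        heapq.heappush(self._heap, (rank, next(self._counter), request_id))

    def dequeue(self):
        _, _, request_id = heapq.heappop(self._heap)
        return request_id

q = PriorityRequestQueue()
q.enqueue("travel-helper-req-1", "Standard")  # Standard arrives first...
q.enqueue("qwen-req-1", "Critical")           # ...but Critical jumps ahead
q.enqueue("travel-helper-req-2", "Standard")
order = [q.dequeue() for _ in range(3)]
print(order)  # ['qwen-req-1', 'travel-helper-req-1', 'travel-helper-req-2']
```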

Step 3: Deploy the gateway and gateway routing rules

By matching the model name in the request, requests for the qwen and travel-helper models are routed to the backend InferencePool named qwen-pool.

  1. Create inference-gateway.yaml.

    apiVersion: gateway.networking.k8s.io/v1
    kind: GatewayClass
    metadata:
      name: inference-gateway
    spec:
      controllerName: gateway.envoyproxy.io/gatewayclass-controller
    ---
    apiVersion: gateway.networking.k8s.io/v1
    kind: Gateway
    metadata:
      name: inference-gateway
    spec:
      gatewayClassName: inference-gateway
      listeners:
        - name: llm-gw
          protocol: HTTP
          port: 8081
    ---
    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: llm-route
      namespace: default
    spec:
      parentRefs:
      - group: gateway.networking.k8s.io
        kind: Gateway
        name: inference-gateway
        sectionName: llm-gw
      rules:
      - backendRefs:
        - group: inference.networking.x-k8s.io
          kind: InferencePool
          name: qwen-pool
        matches:
        - headers:
          - type: Exact
            name: X-Gateway-Model-Name
            value: qwen
        - headers:
          - type: RegularExpression
            name: X-Gateway-Model-Name
            value: travel-helper.*
    ---
    apiVersion: gateway.envoyproxy.io/v1alpha1
    kind: BackendTrafficPolicy
    metadata:
      name: backend-timeout
    spec:
      timeout:
        http:
          requestTimeout: 24h
      targetRef:
        group: gateway.networking.k8s.io
        kind: Gateway
        name: inference-gateway
  2. Deploy the gateway.

    kubectl apply -f inference-gateway.yaml
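The HTTPRoute above accepts a request when its X-Gateway-Model-Name header either exactly equals qwen or matches the regular expression travel-helper.*. A quick sketch of that matching logic, assuming full-string regex semantics (as in Envoy's RE2-based matcher):

```python
import re

# Sketch of the HTTPRoute header matching above, assuming full-string
# regex semantics. A request is routed to qwen-pool if its
# X-Gateway-Model-Name header matches either rule.
def routes_to_qwen_pool(model_name: str) -> bool:
    exact = model_name == "qwen"                                      # type: Exact
    regex = re.fullmatch(r"travel-helper.*", model_name) is not None  # type: RegularExpression
    return exact or regex

for name in ["qwen", "travel-helper-v1", "travel-helper-v2", "other-model"]:
    print(f"{name}: {routes_to_qwen_pool(name)}")
# qwen: True, travel-helper-v1: True, travel-helper-v2: True, other-model: False
```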

Step 4: Verify request queueing and priority scheduling

Using an ACK cluster as an example, run vllm benchmark against the qwen and travel-helper models at the same time to saturate the model servers.

  1. Deploy the benchmark workload.

    kubectl apply -f- <<EOF
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        app: vllm-benchmark
      name: vllm-benchmark
      namespace: default
    spec:
      progressDeadlineSeconds: 600
      replicas: 1
      revisionHistoryLimit: 10
      selector:
        matchLabels:
          app: vllm-benchmark
      strategy:
        rollingUpdate:
          maxSurge: 25%
          maxUnavailable: 25%
        type: RollingUpdate
      template:
        metadata:
          labels:
            app: vllm-benchmark
        spec:
          containers:
          - command:
            - sh
            - -c
            - sleep inf
            image: registry-cn-hangzhou.ack.aliyuncs.com/dev/llm-benchmark:random-and-qa
            imagePullPolicy: IfNotPresent
            name: vllm-benchmark
            resources: {}
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
          dnsPolicy: ClusterFirst
          restartPolicy: Always
          schedulerName: default-scheduler
          securityContext: {}
          terminationGracePeriodSeconds: 30
    EOF
  2. Obtain the internal IP address of the gateway.

    export GW_IP=$(kubectl get svc -n envoy-gateway-system -l gateway.envoyproxy.io/owning-gateway-namespace=default,gateway.envoyproxy.io/owning-gateway-name=inference-gateway -o jsonpath='{.items[0].spec.clusterIP}')
  3. Open two terminal windows and benchmark both models at the same time.

    Important

    The following data was produced in a test environment and is for reference only. Actual benchmark results may vary with your environment.

    qwen:

    kubectl exec -it deploy/vllm-benchmark -- env GW_IP=${GW_IP} python3 /root/vllm/benchmarks/benchmark_serving.py \
    --backend vllm \
    --model /models/DeepSeek-R1-Distill-Qwen-7B \
    --served-model-name qwen \
    --trust-remote-code \
    --dataset-name random \
    --random-prefix-len 1000 \
    --random-input-len 3000 \
    --random-output-len 3000 \
    --random-range-ratio 0.2 \
    --num-prompts 300 \
    --max-concurrency 60 \
    --host $GW_IP \
    --port 8081 \
    --endpoint /v1/completions \
    --save-result \
    2>&1 | tee benchmark_serving.txt

    Expected output:

    ============ Serving Benchmark Result ============
    Successful requests:                     293       
    Benchmark duration (s):                  1005.55   
    Total input tokens:                      1163919   
    Total generated tokens:                  837560    
    Request throughput (req/s):              0.29      
    Output token throughput (tok/s):         832.94    
    Total Token throughput (tok/s):          1990.43   
    ---------------Time to First Token----------------
    Mean TTFT (ms):                          21329.91  
    Median TTFT (ms):                        15754.01  
    P99 TTFT (ms):                           140782.55 
    -----Time per Output Token (excl. 1st token)------
    Mean TPOT (ms):                          58.58     
    Median TPOT (ms):                        58.36     
    P99 TPOT (ms):                           91.09     
    ---------------Inter-token Latency----------------
    Mean ITL (ms):                           58.32     
    Median ITL (ms):                         50.56     
    P99 ITL (ms):                            64.12     
    ==================================================

    travel-helper:

    kubectl exec -it deploy/vllm-benchmark -- env GW_IP=${GW_IP} python3 /root/vllm/benchmarks/benchmark_serving.py \
    --backend vllm \
    --model /models/DeepSeek-R1-Distill-Qwen-7B \
    --served-model-name travel-helper \
    --trust-remote-code \
    --dataset-name random \
    --random-prefix-len 1000 \
    --random-input-len 3000 \
    --random-output-len 3000 \
    --random-range-ratio 0.2 \
    --num-prompts 300 \
    --max-concurrency 60 \
    --host $GW_IP \
    --port 8081 \
    --endpoint /v1/completions \
    --save-result \
    2>&1 | tee benchmark_serving.txt

    Expected output:

    ============ Serving Benchmark Result ============
    Successful requests:                     165       
    Benchmark duration (s):                  889.41    
    Total input tokens:                      660560    
    Total generated tokens:                  492207    
    Request throughput (req/s):              0.19      
    Output token throughput (tok/s):         553.41    
    Total Token throughput (tok/s):          1296.10   
    ---------------Time to First Token----------------
    Mean TTFT (ms):                          44201.12  
    Median TTFT (ms):                        28757.03  
    P99 TTFT (ms):                           214710.13 
    -----Time per Output Token (excl. 1st token)------
    Mean TPOT (ms):                          67.38     
    Median TPOT (ms):                        60.51     
    P99 TPOT (ms):                           118.36    
    ---------------Inter-token Latency----------------
    Mean ITL (ms):                           66.98     
    Median ITL (ms):                         51.25     
    P99 ITL (ms):                            64.87     
    ==================================================

As the results show, with the model servers saturated, the mean TTFT of qwen requests is about 50% lower than that of travel-helper requests, and the number of failed qwen requests is about 95% lower.
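The headline comparison can be recomputed from the two result tables, assuming 300 prompts per run (per --num-prompts 300):

```python
# Recompute the comparison from the two benchmark result tables above
# (assuming 300 prompts per run, per --num-prompts 300).
NUM_PROMPTS = 300

qwen = {"successful": 293, "mean_ttft_ms": 21329.91}
travel_helper = {"successful": 165, "mean_ttft_ms": 44201.12}

ttft_reduction = 1 - qwen["mean_ttft_ms"] / travel_helper["mean_ttft_ms"]
qwen_errors = NUM_PROMPTS - qwen["successful"]                    # 7
travel_helper_errors = NUM_PROMPTS - travel_helper["successful"]  # 135
error_reduction = 1 - qwen_errors / travel_helper_errors

print(f"TTFT reduction:  {ttft_reduction:.1%}")   # roughly half
print(f"Error reduction: {error_reduction:.1%}")  # roughly 95%
```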