Container Compute Service: Implement prefix-aware load balancing with Gateway with Inference Extension

Updated: Jul 28, 2025

With the Gateway with Inference Extension component, you can specify different load balancing policies for inference routing based on the usage scenarios of your generative AI inference services. This topic describes how to use the Gateway with Inference Extension component to implement a prefix-aware load balancing policy.

Important
  • Before you read this topic, make sure that you are familiar with the concepts of InferencePool and InferenceModel.

  • This topic requires Gateway with Inference Extension 1.4.0 or later.

Background information

Automatic prefix caching in vLLM

vLLM supports automatic prefix caching (APC). APC caches the KV cache computed for previous requests. If a new request shares a prefix with a previous request, it can directly reuse the existing KV cache and skip the KV cache computation for the shared prefix, which speeds up the processing of LLM inference requests.
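
vLLM enables APC through a startup flag. The sample Deployment in Step 1 already passes this flag; the following is a trimmed excerpt of that serve command, showing only the flags relevant to prefix caching (the full command appears in the Deployment YAML below):

    vllm serve /models/Qwen-2.5-7B-Instruct \
      --port 8000 \
      --enable_prefix_caching \
      --served-model-name /model/qwen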

Prefix-aware load balancing policy

A prefix-aware load balancing policy routes requests that share the same prefix to the same inference server pod whenever possible.

When APC is enabled on the model server, a prefix-aware load balancing policy maximizes the prefix cache hit ratio and reduces request response time. This policy is mainly suitable for scenarios with a large number of requests that share prefixes; evaluate whether it fits your actual business scenario.

Typical use cases include:

  • Long-document queries: Users repeatedly query the same long document (for example, a software manual or an annual report) with different questions.

  • Multi-turn conversations: Users may interact with the application multiple times within the same chat session.

Prerequisites

Note

For the image used in this topic, we recommend the A10 GPU type for ACK clusters and the GN8IS (8th-generation GPU B) type for ACS GPU compute power.

In addition, because LLM images are large, we recommend that you copy the image to Container Registry (ACR) in advance and pull it over the internal network. Pulling directly from the Internet depends on the bandwidth of the cluster's EIP and can take a long time.
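
A minimal sketch of copying the sample image to ACR, assuming a hypothetical ACR repository address (registry-vpc.cn-hangzhou.aliyuncs.com/<your-namespace>); replace it with your own repository and update the image field of the Deployment in Step 1 accordingly:

    # Pull the public sample image, then push it to your own ACR repository (hypothetical address).
    docker pull registry-cn-hangzhou.ack.aliyuncs.com/dev/qwen-2.5-7b-instruct-lora:v0.1
    docker tag registry-cn-hangzhou.ack.aliyuncs.com/dev/qwen-2.5-7b-instruct-lora:v0.1 \
      registry-vpc.cn-hangzhou.aliyuncs.com/<your-namespace>/qwen-2.5-7b-instruct-lora:v0.1
    docker push registry-vpc.cn-hangzhou.aliyuncs.com/<your-namespace>/qwen-2.5-7b-instruct-lora:v0.1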

Procedure

Step 1: Deploy the sample inference service

  1. Create vllm-service.yaml.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        app: qwen
      name: qwen
    spec:
      progressDeadlineSeconds: 600
      replicas: 5
      selector:
        matchLabels:
          app: qwen
      template:
        metadata:
          annotations:
            prometheus.io/path: /metrics
            prometheus.io/port: "8000"
            prometheus.io/scrape: "true"
          labels:
            app: qwen
            alibabacloud.com/compute-class: gpu
            alibabacloud.com/compute-qos: default
            alibabacloud.com/gpu-model-series: GN8IS
        spec:
          containers:
            - command:
                - sh
                - -c
                - vllm serve /models/Qwen-2.5-7B-Instruct --port 8000 --enable_prefix_caching --trust-remote-code --served-model-name /model/qwen --gpu-memory-utilization 0.95 --enforce-eager --enable-lora --max-loras 2 --max-cpu-loras 4 --lora-modules travel-helper-v1=/models/Qwen-TravelHelper-Lora travel-helper-v2=/models/Qwen-TravelHelper-Lora-v2
              image: registry-cn-hangzhou.ack.aliyuncs.com/dev/qwen-2.5-7b-instruct-lora:v0.1
              imagePullPolicy: IfNotPresent
              name: custom-serving
              ports:
                - containerPort: 8000
                  name: http
                  protocol: TCP
              readinessProbe:
                failureThreshold: 3
                initialDelaySeconds: 30
                periodSeconds: 30
                successThreshold: 1
                tcpSocket:
                  port: 8000
                timeoutSeconds: 1
              resources:
                limits:
                  nvidia.com/gpu: "1"
                  cpu: "8"
                  memory: 30G
              terminationMessagePath: /dev/termination-log
              terminationMessagePolicy: File
              volumeMounts:
                - mountPath: /dev/shm
                  name: dshm
          restartPolicy: Always
          volumes:
            - emptyDir:
                medium: Memory
                sizeLimit: 30Gi
              name: dshm
    ---
    apiVersion: v1
    kind: Service
    metadata:
      labels:
        app: qwen
      name: qwen
    spec:
      ports:
        - name: http-serving
          port: 8000
          protocol: TCP
          targetPort: 8000
      selector:
        app: qwen
  2. Deploy the sample inference service.

    kubectl apply -f vllm-service.yaml
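
    Optionally, wait until all replicas are ready before you continue (a suggested check; the app: qwen label comes from the Deployment above):

    kubectl get pods -l app=qwen
    kubectl rollout status deployment/qwen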

Step 2: Deploy the inference route

This step creates an InferencePool resource and an InferenceModel resource.

  1. Create inference-pool.yaml.

    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      annotations:
        inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
      name: vllm-qwen-pool
    spec:
      targetPortNumber: 8000
      selector:
        app: qwen
      extensionRef:
        name: inference-gateway-ext-proc
    ---
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceModel
    metadata:
      name: inferencemodel-qwen
    spec:
      modelName: /model/qwen
      criticality: Critical
      poolRef:
        group: inference.networking.x-k8s.io
        kind: InferencePool
        name: vllm-qwen-pool
      targetModels:
      - name: /model/qwen
        weight: 100

    In the InferencePool resource, the inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE" annotation enables the prefix-aware load balancing policy for the pods in the InferencePool.

  2. Deploy the inference route.

    kubectl apply -f inference-pool.yaml
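
    Optionally, confirm that both resources were created (a suggested check; this assumes the inferencepool and inferencemodel resource kinds registered by Gateway with Inference Extension, and the resource names come from the YAML above):

    kubectl get inferencepool vllm-qwen-pool
    kubectl get inferencemodel inferencemodel-qwen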

Step 3: Deploy the gateway and gateway routing rules

This step creates a gateway that listens on ports 8080 and 8081. On port 8081, an HTTPRoute resource sets the route backend to the InferencePool provided by the inference extension, so inference requests are routed to the set of pods selected by the InferencePool. On port 8080, an HTTPRoute resource sets the route backend to a Service, so inference requests are routed to the same set of pods through an ordinary HTTP least-request load balancing policy.

  1. Create inference-gateway.yaml.

    apiVersion: gateway.networking.k8s.io/v1
    kind: GatewayClass
    metadata:
      name: qwen-inference-gateway-class
    spec:
      controllerName: gateway.envoyproxy.io/gatewayclass-controller
    ---
    apiVersion: gateway.networking.k8s.io/v1
    kind: Gateway
    metadata:
      name: qwen-inference-gateway
    spec:
      gatewayClassName: qwen-inference-gateway-class
      listeners:
        - name: http
          protocol: HTTP
          port: 8080
        - name: llm-gw
          protocol: HTTP
          port: 8081
    ---
    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: qwen-backend
    spec:
      parentRefs:
        - name: qwen-inference-gateway
          sectionName: llm-gw
      rules:
        - backendRefs:
            - group: inference.networking.x-k8s.io
              kind: InferencePool
              name: vllm-qwen-pool
          matches:
            - path:
                type: PathPrefix
                value: /
    ---
    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: qwen-backend-no-inference
    spec:
      parentRefs:
      - group: gateway.networking.k8s.io
        kind: Gateway
        name: qwen-inference-gateway
        sectionName: http
      rules:
      - backendRefs:
        - group: ""
          kind: Service
          name: qwen
          port: 8000
          weight: 1
        matches:
        - path:
            type: PathPrefix
            value: /
    ---
    apiVersion: gateway.envoyproxy.io/v1alpha1
    kind: BackendTrafficPolicy
    metadata:
      name: backend-timeout
    spec:
      timeout:
        http:
          requestTimeout: 1h
      targetRef:
        group: gateway.networking.k8s.io
        kind: Gateway
        name: qwen-inference-gateway
  2. Deploy the gateway.

    kubectl apply -f inference-gateway.yaml
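
    Optionally, wait until the gateway is programmed and has been assigned an address before verification (a suggested check):

    kubectl wait gateway/qwen-inference-gateway --for=condition=Programmed --timeout=5m
    kubectl get gateway qwen-inference-gateway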

Step 4: Verify the routing rules

  1. Create round1.txt and round2.txt. Both files contain the same leading content. By sending round1.txt and then round2.txt as the body of LLM requests and then checking the logs of the intelligent routing extension (the component referenced by extensionRef), you can verify whether prefix-aware routing is triggered.

    round1.txt:

    echo '{"max_tokens":24,"messages":[{"content":"Hi, here'\''s some system prompt: hi hi hi hi hi hi hi hi hi hi.For user 3, here are some other context: hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi.I would like to test your intelligence. for this purpose I would like you to play zork. you can interact with the game by typing in commands. I will forward these commands to the game and type in any response. are you ready?","role":"user"}],"model":"/model/qwen","stream":true,"stream_options":{"include_usage":true},"temperature":0}' > round1.txt

    round2.txt:

    echo '{"max_tokens":3,"messages":[{"content":"Hi, here'\''s some system prompt: hi hi hi hi hi hi hi hi hi hi.For user 3, here are some other context: hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi.I would like to test your intelligence. for this purpose I would like you to play zork. you can interact with the game by typing in commands. I will forward these commands to the game and type in any response. are you ready?","role":"user"},{"content":"Hi there! It looks like you're setting up a fun test. I'm ready to play Zork! You can","role":"assistant"},{"content":"% zork\nWelcome to Dungeon. This version created 11-MAR-91.\nYou are in an open field west of a big white house with a boarded\nfront door.\nThere is a small mailbox here.\n>","role":"user"},{"content":"Great!","role":"assistant"},{"content":"Opening the mailbox reveals:\n A leaflet.\n>","role":"user"}],"model":"/model/qwen","stream":true,"stream_options":{"include_usage":true},"temperature":0}' > round2.txt
  2. Obtain the public IP address of the gateway.

    export GATEWAY_IP=$(kubectl get gateway/qwen-inference-gateway -o jsonpath='{.status.addresses[0].value}')
  3. Send two chat requests to simulate a multi-turn conversation scenario.

    curl -X POST ${GATEWAY_IP}:8081/v1/chat/completions -H 'Content-Type: application/json' -d @./round1.txt
    curl -X POST ${GATEWAY_IP}:8081/v1/chat/completions -H 'Content-Type: application/json' -d @./round2.txt
  4. Check the logs to confirm whether prefix-aware load balancing takes effect.

    kubectl logs deploy/epp-default-inference-gateway-ext-proc -n envoy-gateway-system|grep "Do prefix"

    Expected output:

    2025-05-23T03:33:09Z    INFO    scheduling/prefixcache_filter.go:311    Do prefix-aware routing!        {"request": "v68m4zx472", "matching ratio": " 0.54 > 0.50"}

    The Do prefix-aware routing! message in the log indicates that prefix-aware load balancing has taken effect.

(Optional) Step 5: Evaluate inference service performance with a multi-turn conversation test

This step uses an ACK cluster as an example to show how to run a multi-turn conversation test with a load testing tool and compare the prefix-aware load balancing effect of plain HTTP routing with that of inference routing.

  1. Deploy the llm-qa-benchmark load testing tool.

    kubectl apply -f- <<EOF
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        app: llm-qa-benchmark
      name: llm-qa-benchmark
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: llm-qa-benchmark
      template:
        metadata:
          labels:
            app: llm-qa-benchmark
        spec:
          containers:
          - command:
            - sh
            - -c
            - sleep inf
            image: registry-cn-hangzhou.ack.aliyuncs.com/dev/llm-qa-benchmark:v0.1
            imagePullPolicy: IfNotPresent
            name: llm-qa-benchmark
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
          restartPolicy: Always
    EOF
  2. Obtain the internal IP address of the gateway.

    export GW_IP=$(kubectl get svc -n envoy-gateway-system -l gateway.envoyproxy.io/owning-gateway-namespace=default,gateway.envoyproxy.io/owning-gateway-name=qwen-inference-gateway -o jsonpath='{.items[0].spec.clusterIP}')
  3. Run the benchmark.

    Important

    The following test results were generated in a test environment. Actual results depend on your environment.

    Plain HTTP routing

    kubectl exec -it deploy/llm-qa-benchmark -- env GW_IP=${GW_IP} python3 multi-round-qa.py \
        --num-users 8 \
        --num-rounds 15 \
        --qps 0.1 \
        --shared-system-prompt 100 \
        --sharegpt \
        --user-history-prompt 2000 \
        --answer-len 100 \
        --model /model/qwen \
        --time 600 \
        --base-url http://${GW_IP}:8080/v1

    Expected output:

    ==================== Performance summary ======================
      QPS: 0.1000 reqs/s
    
      Processing speed: 0.1080 reqs/s
    
      Requests on-the-fly: 0
    
      Input tokens per second: 259.0703 tokens/s
    
      Output tokens per second: 4.8576 tokens/s
    
      Average generation throughput (per request): 26.6710 tokens/req/s
    
      Average TTFT: 0.3669s
    
    Time range: 1748231183.2753935 - 1748231766.4799275 (583.20s)
    ===============================================================

    Inference service routing

    kubectl exec -it deploy/llm-qa-benchmark -- env GW_IP=${GW_IP} python3 multi-round-qa.py \
        --num-users 8 \
        --num-rounds 15 \
        --qps 0.1 \
        --shared-system-prompt 100 \
        --sharegpt \
        --user-history-prompt 2000 \
        --answer-len 100 \
        --model /model/qwen \
        --time 600 \
        --base-url http://${GW_IP}:8081/v1

    Expected output:

    ==================== Performance summary ======================
      QPS: 0.1000 reqs/s
    
      Processing speed: 0.1081 reqs/s
    
      Requests on-the-fly: 0
    
      Input tokens per second: 259.3009 tokens/s
    
      Output tokens per second: 4.8548 tokens/s
    
      Average generation throughput (per request): 26.9300 tokens/req/s
    
      Average TTFT: 0.2761s
    
    Time range: 1748231885.874972 - 1748232468.5918882 (582.72s)
    ===============================================================

    The average TTFT of inference service routing (0.2761s) is noticeably lower than that of plain HTTP routing (0.3669s), showing the benefit of prefix-aware load balancing.