Container Service for Kubernetes: Configure intelligent routing on an inference gateway for LLM inference services

Updated: Sep 05, 2025

For traditional HTTP requests, classic load-balancing algorithms can distribute requests evenly across workloads. For an LLM inference service, however, the load that each request places on the backend is hard to predict. The inference gateway (Gateway with Inference Extension) is an enhanced component built on the Kubernetes community Gateway API and its Inference Extension specification. It uses intelligent routing to optimize load balancing across multiple inference workloads, provides different load-balancing policies for different LLM inference scenarios, and supports capabilities such as canary releases of models and inference request queuing.

Prerequisites

Step 1: Configure intelligent routing for the inference service

Gateway with Inference Extension provides two intelligent-routing load-balancing policies to suit different inference service requirements.

  • Load balancing based on request queue length and GPU cache utilization (default policy).

  • Prefix-aware load balancing (Prefix Cache Aware Routing).

You enable the inference gateway's intelligent routing for an inference service by declaring InferencePool and InferenceModel resources for it. You can flexibly adjust the InferencePool and InferenceModel configuration according to how the backend inference service is deployed and which load-balancing policy you choose.

Load balancing based on request queue length and GPU cache utilization

When the InferencePool has no annotations, the intelligent routing policy based on request queue length and GPU cache utilization is used by default. This policy dynamically distributes requests according to the real-time load of the backend inference service, including the request queue length and GPU cache utilization, to achieve optimal load balancing.
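The effect of this policy can be sketched roughly as follows. This is an illustrative Python simplification, not the gateway's actual implementation; the metric names and the queue-first ordering are assumptions for illustration:

```python
# Illustrative sketch of load-aware endpoint picking: prefer the pod with the
# shortest request queue, breaking ties by lower GPU KV-cache utilization.
# Metric names and the scoring order are assumptions, not the gateway's real code.
from dataclasses import dataclass

@dataclass
class PodMetrics:
    name: str
    queue_len: int        # number of waiting requests reported by the model server
    kv_cache_util: float  # GPU KV-cache utilization in [0.0, 1.0]

def pick_endpoint(pods: list[PodMetrics]) -> str:
    # Lower tuple = less loaded; queue length dominates, cache utilization tie-breaks.
    return min(pods, key=lambda p: (p.queue_len, p.kv_cache_util)).name

pods = [
    PodMetrics("vllm-0", queue_len=4, kv_cache_util=0.90),
    PodMetrics("vllm-1", queue_len=1, kv_cache_util=0.35),
    PodMetrics("vllm-2", queue_len=1, kv_cache_util=0.20),
]
print(pick_endpoint(pods))  # → vllm-2
```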

  1. Create a file named inference_networking.yaml.

    Single-node vLLM deployment

    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      name: qwen-inference-pool
    spec:
      targetPortNumber: 8000
      selector:
        alibabacloud.com/inference-workload: vllm-inference
      extensionRef:
        name: inference-gateway-ext-proc
    ---
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceModel
    metadata:
      name: qwen-inference-model
    spec:
      modelName: /models/Qwen3-32B
      criticality: Critical
      poolRef:
        group: inference.networking.x-k8s.io
        kind: InferencePool
        name: qwen-inference-pool
      targetModels:
      - name: /models/Qwen3-32B
        weight: 100

    Single-node SGLang deployment

    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      name: qwen-inference-pool
      annotations:
        inference.networking.x-k8s.io/model-server-runtime: sglang
    spec:
      targetPortNumber: 8000
      selector:
        alibabacloud.com/inference-workload: sgl-inference
      extensionRef:
        name: inference-gateway-ext-proc
    ---
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceModel
    metadata:
      name: qwen-inference-model
    spec:
      modelName: /models/Qwen3-32B
      criticality: Critical
      poolRef:
        group: inference.networking.x-k8s.io
        kind: InferencePool
        name: qwen-inference-pool
      targetModels:
      - name: /models/Qwen3-32B
        weight: 100

    Distributed vLLM deployment

    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      name: qwen-inference-pool
    spec:
      targetPortNumber: 8000
      selector:
        alibabacloud.com/inference-workload: vllm-multi-nodes
        role: leader
      extensionRef:
        name: inference-gateway-ext-proc
    ---
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceModel
    metadata:
      name: qwen-inference-model
    spec:
      modelName: /models/Qwen3-32B
      criticality: Critical
      poolRef:
        group: inference.networking.x-k8s.io
        kind: InferencePool
        name: qwen-inference-pool
      targetModels:
      - name: /models/Qwen3-32B
        weight: 100

    Distributed SGLang deployment

    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      name: qwen-inference-pool
      annotations:
        inference.networking.x-k8s.io/model-server-runtime: sglang
    spec:
      targetPortNumber: 8000
      selector:
        alibabacloud.com/inference-workload: sglang-multi-nodes
        role: leader
      extensionRef:
        name: inference-gateway-ext-proc
    ---
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceModel
    metadata:
      name: qwen-inference-model
    spec:
      modelName: /models/Qwen3-32B
      criticality: Critical
      poolRef:
        group: inference.networking.x-k8s.io
        kind: InferencePool
        name: qwen-inference-pool
      targetModels:
      - name: /models/Qwen3-32B
        weight: 100

    SGLang PD (prefill/decode) disaggregated deployment

    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      name: qwen-inference-pool
    spec:
      targetPortNumber: 8000
      selector:
        alibabacloud.com/inference_backend: sglang # Selects both the prefill and decode workloads
    ---
    # An InferenceTrafficPolicy specifies the traffic policy applied to the InferencePool
    apiVersion: inferenceextension.alibabacloud.com/v1alpha1
    kind: InferenceTrafficPolicy
    metadata:
      name: inference-policy
    spec:
      poolRef:
        name: qwen-inference-pool
      modelServerRuntime: sglang # Specifies that the backend model server runtime is SGLang
      profile:
        pd:  # The backend service is deployed in PD (prefill/decode) disaggregated mode
          pdRoleLabelName: rolebasedgroup.workloads.x-k8s.io/role # Pod label that distinguishes the prefill and decode roles in the InferencePool
          kvTransfer:
            bootstrapPort: 34000 # Bootstrap port used by the SGLang PD-disaggregated service for KV cache transfer; must match the disaggregation-bootstrap-port argument specified in the RoleBasedGroup deployment.
  2. Apply the configuration to create the load balancing based on request queue length and GPU cache utilization.

    kubectl create -f inference_networking.yaml

Prefix-aware load balancing (Prefix Cache Aware Routing)

Prefix Cache Aware Routing is a policy that sends requests sharing the same prefix content to the same inference server Pod whenever possible. When automatic prefix caching (APC) is enabled on the model server, this policy increases the prefix cache hit rate and reduces request response time.

Important

The vLLM v0.9.2 version and the SGLang framework used in this topic enable prefix caching by default, so you do not need to redeploy the service to turn it on.

To enable prefix-aware load balancing, add the following annotation to the InferencePool: inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
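The idea behind prefix-aware routing can be sketched as follows. This is an assumption-laden Python simplification, not the gateway's actual algorithm; the hash function and the prefix window size are hypothetical:

```python
# Illustrative sketch of prefix-aware routing: requests that share the same
# leading prompt content hash to the same pod, so that pod's automatic prefix
# cache (APC) can be reused across conversation turns. The SHA-256 hashing and
# the 128-character window are assumptions for illustration only.
import hashlib

PREFIX_WINDOW = 128  # assumed: only the leading characters decide pod affinity

def route_by_prefix(prompt: str, pods: list[str]) -> str:
    """Map a prompt's leading characters to a stable pod choice."""
    digest = hashlib.sha256(prompt[:PREFIX_WINDOW].encode("utf-8")).digest()
    return pods[int.from_bytes(digest[:8], "big") % len(pods)]

pods = ["vllm-inference-0", "vllm-inference-1", "vllm-inference-2"]
shared_prefix = "You are a helpful assistant. " * 6  # longer than PREFIX_WINDOW

# Two turns of the same conversation share the prefix, so they land on the same pod.
assert route_by_prefix(shared_prefix + "turn 1", pods) == route_by_prefix(shared_prefix + "turn 2", pods)
```

This also illustrates why the multi-turn test later in this topic expects both requests to hit the same pod: their message bodies begin with an identical first turn.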

  1. Create a file named Prefix_Cache.yaml.

    Single-node vLLM deployment

    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      name: qwen-inference-pool
      annotations:
        inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
    spec:
      targetPortNumber: 8000
      selector:
        alibabacloud.com/inference-workload: vllm-inference
      extensionRef:
        name: inference-gateway-ext-proc
    ---
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceModel
    metadata:
      name: qwen-inference-model
    spec:
      modelName: /models/Qwen3-32B
      criticality: Critical
      poolRef:
        group: inference.networking.x-k8s.io
        kind: InferencePool
        name: qwen-inference-pool
      targetModels:
      - name: /models/Qwen3-32B
        weight: 100

    Single-node SGLang deployment

    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      name: qwen-inference-pool
      annotations:
        inference.networking.x-k8s.io/model-server-runtime: sglang
        inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
    spec:
      targetPortNumber: 8000
      selector:
        alibabacloud.com/inference-workload: sgl-inference
      extensionRef:
        name: inference-gateway-ext-proc
    ---
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceModel
    metadata:
      name: qwen-inference-model
    spec:
      modelName: /models/Qwen3-32B
      criticality: Critical
      poolRef:
        group: inference.networking.x-k8s.io
        kind: InferencePool
        name: qwen-inference-pool
      targetModels:
      - name: /models/Qwen3-32B
        weight: 100

    Distributed vLLM deployment

    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      name: qwen-inference-pool
      annotations:
        inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
    spec:
      targetPortNumber: 8000
      selector:
        alibabacloud.com/inference-workload: vllm-multi-nodes
        role: leader
      extensionRef:
        name: inference-gateway-ext-proc
    ---
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceModel
    metadata:
      name: qwen-inference-model
    spec:
      modelName: /models/Qwen3-32B
      criticality: Critical
      poolRef:
        group: inference.networking.x-k8s.io
        kind: InferencePool
        name: qwen-inference-pool
      targetModels:
      - name: /models/Qwen3-32B
        weight: 100

    Distributed SGLang deployment

    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      name: qwen-inference-pool
      annotations:
        inference.networking.x-k8s.io/model-server-runtime: sglang
        inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
    spec:
      targetPortNumber: 8000
      selector:
        alibabacloud.com/inference-workload: sglang-multi-nodes
        role: leader
      extensionRef:
        name: inference-gateway-ext-proc
    ---
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceModel
    metadata:
      name: qwen-inference-model
    spec:
      modelName: /models/Qwen3-32B
      criticality: Critical
      poolRef:
        group: inference.networking.x-k8s.io
        kind: InferencePool
        name: qwen-inference-pool
      targetModels:
      - name: /models/Qwen3-32B
        weight: 100

    SGLang PD (prefill/decode) disaggregated deployment

    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      name: qwen-inference-pool
    spec:
      targetPortNumber: 8000
      selector:
        alibabacloud.com/inference_backend: sglang # Selects both the prefill and decode workloads
    ---
    # An InferenceTrafficPolicy specifies the traffic policy applied to the InferencePool
    apiVersion: inferenceextension.alibabacloud.com/v1alpha1
    kind: InferenceTrafficPolicy
    metadata:
      name: inference-policy
    spec:
      poolRef:
        name: qwen-inference-pool
      modelServerRuntime: sglang # Specifies that the backend model server runtime is SGLang
      profile:
        pd:  # The backend service is deployed in PD (prefill/decode) disaggregated mode
          trafficPolicy:
            prefixCache: # Declares the prefix-cache load-balancing policy
              mode: estimate
          prefillPolicyRef: prefixCache
          decodePolicyRef: prefixCache # Both prefill and decode use prefix-aware load balancing
          pdRoleLabelName: rolebasedgroup.workloads.x-k8s.io/role # Pod label that distinguishes the prefill and decode roles in the InferencePool
          kvTransfer:
            bootstrapPort: 34000 # Bootstrap port used by the SGLang PD-disaggregated service for KV cache transfer; must match the disaggregation-bootstrap-port argument specified in the RoleBasedGroup deployment.
  2. Apply the configuration to create the prefix-aware load balancing.

    kubectl create -f Prefix_Cache.yaml

The following describes the InferencePool and InferenceModel configuration items.

  • metadata.annotations.inference.networking.x-k8s.io/model-server-runtime (string): Specifies the model server runtime, such as sglang.

  • metadata.annotations.inference.networking.x-k8s.io/routing-strategy (string): Specifies the routing strategy. Valid values: DEFAULT, PREFIX_CACHE. Default: the intelligent routing strategy based on request queue length and GPU cache utilization.

  • spec.targetPortNumber (int): Specifies the port number of the inference service.

  • spec.selector (map[string]string): Selector used to match the Pods of the inference service.

  • spec.extensionRef (ObjectReference): Declares the inference extension service.

  • spec.modelName (string): The model name, used for route matching.

  • spec.criticality (string): The criticality level of the model. Valid values: Critical, Standard.

  • spec.poolRef (PoolReference): The associated InferencePool resource.

Step 2: Deploy the gateway

  1. Create a file named gateway_networking.yaml.

    apiVersion: gateway.networking.k8s.io/v1
    kind: GatewayClass
    metadata:
      name: inference-gateway-class
    spec:
      controllerName: inference.networking.x-k8s.io/gateway-controller
    ---
    apiVersion: gateway.networking.k8s.io/v1
    kind: Gateway
    metadata:
      name: inference-gateway
    spec:
      gatewayClassName: inference-gateway-class
      listeners:
      - name: http-llm
        protocol: HTTP
        port: 8080
    ---
    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: inference-route
    spec:
      parentRefs:
      - name: inference-gateway
      rules:
      - matches:
        - path:
            type: PathPrefix
            value: /v1
        backendRefs:
        - name: qwen-inference-pool
          kind: InferencePool
          group: inference.networking.x-k8s.io
  2. Create the GatewayClass, Gateway, and HTTPRoute resources to route LLM inference traffic on port 8080.

    kubectl create -f gateway_networking.yaml
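The HTTPRoute above matches requests by PathPrefix. A minimal sketch of that matching rule (segment-wise, as defined by the Gateway API), written in Python purely for illustration:

```python
# Illustrative sketch of Gateway API PathPrefix matching: "/v1" matches "/v1"
# itself and any path whose next segment follows "/v1" at a "/" boundary.
# This is a simplification of the spec's segment-wise matching, not gateway code.
def path_prefix_match(prefix: str, path: str) -> bool:
    if path == prefix:
        return True
    return path.startswith(prefix.rstrip("/") + "/")

assert path_prefix_match("/v1", "/v1/chat/completions")  # routed to the InferencePool
assert not path_prefix_match("/v1", "/v1beta/models")    # different first segment, not matched
```

Segment-wise matching is why /v1/chat/completions reaches the backend while a path like /v1beta/models does not, even though both begin with the characters "/v1".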

Step 3: Verify the inference gateway configuration

  1. Run the following command to obtain the external access address of the gateway:

    export GATEWAY_HOST=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}')
  2. Use curl to test access to the service on port 8080:

    curl http://${GATEWAY_HOST}:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "/models/Qwen3-32B",
        "messages": [
          {"role": "user", "content": "Hello, this is a test"}
        ],
        "max_tokens": 50
      }'
  3. Verify the load-balancing policies.

    Verify load balancing based on request queue length and GPU cache utilization

    The default policy routes requests intelligently based on request queue length and GPU cache utilization. You can verify it by stress-testing the inference service and observing its TTFT and throughput metrics.

    For specific test methods, see Configure observability metrics and dashboards for LLM services.

    Verify prefix-aware load balancing

    Create test files to verify that prefix-aware load balancing takes effect.

    1. Generate round1.txt:

      echo '{"max_tokens":24,"messages":[{"content":"Hi, here'\''s some system prompt: hi hi hi hi hi hi hi hi hi hi.For user 3, here are some other context: hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi.I would like to test your intelligence. for this purpose I would like you to play zork. you can interact with the game by typing in commands. I will forward these commands to the game and type in any response. are you ready?","role":"user"}],"model":"/models/Qwen3-32B","stream":true,"stream_options":{"include_usage":true},"temperature":0}' > round1.txt
    2. Generate round2.txt:

      echo '{"max_tokens":3,"messages":[{"content":"Hi, here'\''s some system prompt: hi hi hi hi hi hi hi hi hi hi.For user 3, here are some other context: hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi.I would like to test your intelligence. for this purpose I would like you to play zork. you can interact with the game by typing in commands. I will forward these commands to the game and type in any response. are you ready?","role":"user"},{"content":"Hi there! It looks like you'\''re setting up a fun test. I'\''m ready to play Zork! You can","role":"assistant"},{"content":"% zork\nWelcome to Dungeon. This version created 11-MAR-91.\nYou are in an open field west of a big white house with a boarded\nfront door.\nThere is a small mailbox here.\n>","role":"user"},{"content":"Great!","role":"assistant"},{"content":"Opening the mailbox reveals:\n A leaflet.\n>","role":"user"}],"model":"/models/Qwen3-32B","stream":true,"stream_options":{"include_usage":true},"temperature":0}' > round2.txt
    3. Run the following commands to send the test requests:

      curl -X POST ${GATEWAY_HOST}:8080/v1/chat/completions -H 'Content-Type: application/json' -d @./round1.txt
      curl -X POST ${GATEWAY_HOST}:8080/v1/chat/completions -H 'Content-Type: application/json' -d @./round2.txt
    4. Check the logs of the Inference Extension Processor to confirm that prefix-aware load balancing took effect:

      kubectl logs deploy/inference-gateway-ext-proc -n envoy-gateway-system | grep "Request Handled"

      If the two log entries show the same pod name, prefix-aware load balancing is in effect.

      For detailed test methods and results of prefix-aware load balancing, see Evaluate inference service performance with multi-turn conversation tests.