For traditional HTTP requests, classic load balancing algorithms can distribute requests evenly across workloads. For LLM inference services, however, the load each request places on the backend is hard to predict. The inference gateway (Gateway with Inference Extension) is an enhanced component built on the Kubernetes community Gateway API and its Inference Extension specification. It uses intelligent routing to optimize load balancing across multiple inference workloads, provides different load balancing policies for different LLM inference scenarios, and supports capabilities such as canary model releases and inference request queuing.
Prerequisites
Step 1: Configure intelligent routing for the inference service
Gateway with Inference Extension provides two intelligent routing load balancing policies to suit different inference service requirements:
- Load balancing based on request queue length and GPU cache utilization (the default policy).
- Prefix-aware load balancing (Prefix Cache Aware Routing).
You enable the gateway's intelligent routing for an inference service by declaring InferencePool and InferenceModel resources for it. Adjust the InferencePool and InferenceModel configuration to match how the backend inference service is deployed and which load balancing policy you choose.
Load balancing based on request queue length and GPU cache utilization
When the InferencePool has no annotations, intelligent routing based on request queue length and GPU cache utilization is used by default. This policy dynamically distributes requests according to the real-time load of the backend inference service, including its request queue length and GPU cache utilization, to achieve the best load balancing result.
Create an inference_networking.yaml file that matches your deployment mode.

Single-node vLLM deployment:

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: vllm-inference
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100
```

Single-node SGLang deployment:

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
  annotations:
    inference.networking.x-k8s.io/model-server-runtime: sglang
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: sgl-inference
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100
```

Distributed vLLM deployment:

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: vllm-multi-nodes
    role: leader
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100
```

Distributed SGLang deployment:

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
  annotations:
    inference.networking.x-k8s.io/model-server-runtime: sglang
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: sglang-multi-nodes
    role: leader
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100
```

SGLang PD-disaggregated deployment:

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference_backend: sglang # Selects both the prefill and decode workloads
---
# InferenceTrafficPolicy specifies the traffic policy applied to the InferencePool
apiVersion: inferenceextension.alibabacloud.com/v1alpha1
kind: InferenceTrafficPolicy
metadata:
  name: inference-policy
spec:
  poolRef:
    name: qwen-inference-pool
  modelServerRuntime: sglang # The backend serving runtime is SGLang
  profile:
    pd: # The backend service is deployed with prefill/decode (PD) disaggregation
      pdRoleLabelName: rolebasedgroup.workloads.x-k8s.io/role # Pod label that distinguishes the prefill and decode roles within the InferencePool
      kvTransfer:
        bootstrapPort: 34000 # Bootstrap port used by the SGLang PD-disaggregated service for KVCache transfer; must match the disaggregation-bootstrap-port argument in the RoleBasedGroup deployment
```

Create the load balancing configuration based on request queue length and GPU cache utilization:
```shell
kubectl create -f inference_networking.yaml
```
Prefix-aware load balancing (Prefix Cache Aware Routing)
Prefix-aware load balancing (Prefix Cache Aware Routing) sends requests that share the same prefix content to the same inference server pod whenever possible. When the model server has automatic prefix caching (APC) enabled, this policy raises the prefix cache hit rate and reduces response latency.
vLLM v0.9.2 and the SGLang framework used in this topic enable prefix caching by default, so you do not need to redeploy the service to turn it on.
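If you run an engine build where prefix caching is not on by default, it can usually be switched on at server start. The launch commands below are a sketch only; the model path matches this topic's examples, and you should verify the flags against the engine version you actually deploy.

```shell
# vLLM: turn on automatic prefix caching (APC) explicitly
vllm serve /models/Qwen3-32B --enable-prefix-caching

# SGLang: the radix (prefix) cache is enabled by default and is only
# turned off when --disable-radix-cache is passed, so simply omit that flag
python3 -m sglang.launch_server --model-path /models/Qwen3-32B
```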
To enable the prefix-aware load balancing policy, add the following annotation to the InferencePool: inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
Create a Prefix_Cache.yaml file that matches your deployment mode.

Single-node vLLM deployment:

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
  annotations:
    inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: vllm-inference
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100
```

Single-node SGLang deployment:

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
  annotations:
    inference.networking.x-k8s.io/model-server-runtime: sglang
    inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: sgl-inference
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100
```

Distributed vLLM deployment:

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
  annotations:
    inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: vllm-multi-nodes
    role: leader
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100
```

Distributed SGLang deployment:

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
  annotations:
    inference.networking.x-k8s.io/model-server-runtime: sglang
    inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: sglang-multi-nodes
    role: leader
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100
```

SGLang PD-disaggregated deployment:

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference_backend: sglang # Selects both the prefill and decode workloads
---
# InferenceTrafficPolicy specifies the traffic policy applied to the InferencePool
apiVersion: inferenceextension.alibabacloud.com/v1alpha1
kind: InferenceTrafficPolicy
metadata:
  name: inference-policy
spec:
  poolRef:
    name: qwen-inference-pool
  modelServerRuntime: sglang # The backend serving runtime is SGLang
  profile:
    pd: # The backend service is deployed with prefill/decode (PD) disaggregation
      trafficPolicy:
        prefixCache: # Declare the prefix-cache load balancing policy
          mode: estimate
        prefillPolicyRef: prefixCache
        decodePolicyRef: prefixCache # Apply prefix-aware load balancing to both prefill and decode
      pdRoleLabelName: rolebasedgroup.workloads.x-k8s.io/role # Pod label that distinguishes the prefill and decode roles within the InferencePool
      kvTransfer:
        bootstrapPort: 34000 # Bootstrap port used by the SGLang PD-disaggregated service for KVCache transfer; must match the disaggregation-bootstrap-port argument in the RoleBasedGroup deployment
```

Create the prefix-aware load balancing configuration:
```shell
kubectl create -f Prefix_Cache.yaml
```
Step 2: Deploy the gateway
Create a gateway_networking.yaml file. It defines the GatewayClass, Gateway, and HTTPRoute resources that expose the LLM inference route on port 8080.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: inference-gateway-class
spec:
  controllerName: inference.networking.x-k8s.io/gateway-controller
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway
spec:
  gatewayClassName: inference-gateway-class
  listeners:
  - name: http-llm
    protocol: HTTP
    port: 8080
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: inference-route
spec:
  parentRefs:
  - name: inference-gateway
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /v1
    backendRefs:
    - name: qwen-inference-pool
      kind: InferencePool
      group: inference.networking.x-k8s.io
```
```shell
kubectl create -f gateway_networking.yaml
```
Step 3: Verify the inference gateway configuration
Run the following command to obtain the gateway's external address:
```shell
export GATEWAY_HOST=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}')
```

Test access to the service on port 8080 with curl:
```shell
curl http://${GATEWAY_HOST}:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/models/Qwen3-32B",
    "messages": [
      {"role": "user", "content": "Hello, this is a test"}
    ],
    "max_tokens": 50
  }'
```

Verify the different load balancing policies.
Verify load balancing based on request queue length and GPU cache utilization
The default policy performs intelligent routing based on request queue length and GPU cache utilization. You can verify it by load-testing the inference service and observing its TTFT (time to first token) and throughput metrics.
For the detailed test method, see Configure observability metrics and dashboards for LLM services.
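One simple way to generate load is a throwaway Kubernetes Job that sends requests to the gateway from several pods in parallel while you watch the TTFT and throughput dashboards. This is only a sketch under assumptions: the Job name, image tag, request counts, and the GATEWAY_HOST value are illustrative and should be adapted to your cluster (use the gateway address obtained above).

```yaml
# Illustrative load-generation Job: 4 parallel pods each send 50 chat
# completion requests to the inference gateway.
apiVersion: batch/v1
kind: Job
metadata:
  name: inference-load-test
spec:
  parallelism: 4
  completions: 4
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: load
        image: curlimages/curl:8.8.0
        command: ["/bin/sh", "-c"]
        args:
        - |
          for i in $(seq 1 50); do
            curl -s -o /dev/null -w "request $i: HTTP %{http_code}\n" \
              http://${GATEWAY_HOST}:8080/v1/chat/completions \
              -H 'Content-Type: application/json' \
              -d '{"model":"/models/Qwen3-32B","messages":[{"role":"user","content":"ping"}],"max_tokens":16}'
          done
        env:
        - name: GATEWAY_HOST
          value: inference-gateway   # hypothetical address; replace with your gateway address
```

Increase `parallelism` and the loop count to raise the load, and compare per-pod metrics to confirm requests are being spread according to backend load.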
Verify prefix-aware load balancing
Create test files to verify that prefix-aware load balancing takes effect.
Generate round1.txt:
```shell
echo '{"max_tokens":24,"messages":[{"content":"Hi, here'\''s some system prompt: hi hi hi hi hi hi hi hi hi hi.For user 3, here are some other context: hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi.I would like to test your intelligence. for this purpose I would like you to play zork. you can interact with the game by typing in commands. I will forward these commands to the game and type in any response. are you ready?","role":"user"}],"model":"/models/Qwen3-32B","stream":true,"stream_options":{"include_usage":true},"temperature":0}' > round1.txt
```

Generate round2.txt:
```shell
echo '{"max_tokens":3,"messages":[{"content":"Hi, here'\''s some system prompt: hi hi hi hi hi hi hi hi hi hi.For user 3, here are some other context: hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi.I would like to test your intelligence. for this purpose I would like you to play zork. you can interact with the game by typing in commands. I will forward these commands to the game and type in any response. are you ready?","role":"user"},{"content":"Hi there! It looks like you'\''re setting up a fun test. I'\''m ready to play Zork! You can","role":"assistant"},{"content":"% zork\nWelcome to Dungeon. This version created 11-MAR-91.\nYou are in an open field west of a big white house with a boarded\nfront door.\nThere is a small mailbox here.\n>","role":"user"},{"content":"Great!","role":"assistant"},{"content":"Opening the mailbox reveals:\n A leaflet.\n>","role":"user"}],"model":"/models/Qwen3-32B","stream":true,"stream_options":{"include_usage":true},"temperature":0}' > round2.txt
```

Run the following commands to test:
```shell
curl -X POST ${GATEWAY_HOST}:8080/v1/chat/completions -H 'Content-Type: application/json' -d @./round1.txt
curl -X POST ${GATEWAY_HOST}:8080/v1/chat/completions -H 'Content-Type: application/json' -d @./round2.txt
```

Check the Inference Extension Processor logs to confirm that prefix-aware load balancing took effect:
```shell
kubectl logs deploy/inference-gateway-ext-proc -n envoy-gateway-system | grep "Request Handled"
```

If the two log entries show the same pod name, prefix-aware load balancing is working.
For the detailed test method and results of prefix-aware load balancing, see Evaluate inference service performance with multi-turn conversation tests.