For traditional HTTP requests, classic load balancing algorithms can distribute requests evenly across workloads. For LLM inference services, however, the load each request places on the backend is hard to predict. The inference gateway (Gateway with Inference Extension) is an enhanced component built on the Kubernetes community Gateway API and its Inference Extension specification. It uses intelligent routing to optimize load balancing across multiple inference workloads, provides different load balancing policies for different LLM inference scenarios, and supports capabilities such as canary model releases and inference request queuing.
Prerequisites
Step 1: Configure intelligent routing for the inference service
Gateway with Inference Extension provides two intelligent routing load balancing policies to suit different inference service requirements:
- Load balancing based on request queue length and GPU cache utilization (the default policy).
- Prefix-aware load balancing (Prefix Cache Aware Routing).
You enable the gateway's intelligent routing for an inference service by declaring InferencePool and InferenceModel resources for it. Adjust the InferencePool and InferenceModel configuration to match how the backend inference service is deployed and which load balancing policy you choose.
Load balancing based on request queue length and GPU cache utilization
When the InferencePool has no annotations, intelligent routing based on request queue length and GPU cache utilization is used by default. This policy dynamically distributes requests according to the real-time load of the backend inference service, including its request queue length and GPU cache utilization, to achieve the best load balancing result.
Create an inference_networking.yaml file that matches your deployment mode.

Single-node vLLM deployment:

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: vllm-inference
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100
```

Single-node SGLang deployment:

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
  annotations:
    inference.networking.x-k8s.io/model-server-runtime: sglang
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: sgl-inference
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100
```

Distributed vLLM deployment:

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: vllm-multi-nodes
    role: leader
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100
```

Distributed SGLang deployment:

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
  annotations:
    inference.networking.x-k8s.io/model-server-runtime: sglang
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: sglang-multi-nodes
    role: leader
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100
```

SGLang PD-disaggregated deployment:

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference_backend: sglang # Selects both the prefill and decode workloads
---
# InferenceTrafficPolicy specifies the traffic policy applied to the InferencePool
apiVersion: inferenceextension.alibabacloud.com/v1alpha1
kind: InferenceTrafficPolicy
metadata:
  name: inference-policy
spec:
  poolRef:
    name: qwen-inference-pool
  modelServerRuntime: sglang # The backend serving runtime is SGLang
  profile:
    pd: # The backend service is deployed with prefill/decode (PD) disaggregation
      pdRoleLabelName: rolebasedgroup.workloads.x-k8s.io/role # Pod label that distinguishes the prefill and decode roles within the InferencePool
      kvTransfer:
        bootstrapPort: 34000 # Bootstrap port used by the SGLang PD-disaggregated service for KVCache transfer; must match the disaggregation-bootstrap-port argument in the RoleBasedGroup deployment
```

Create the load balancing configuration based on request queue length and GPU cache utilization:
```shell
kubectl create -f inference_networking.yaml
```
Prefix-aware load balancing (Prefix Cache Aware Routing)
Prefix-aware load balancing (Prefix Cache Aware Routing) sends requests that share the same prefix content to the same inference server pod whenever possible. When the model server has automatic prefix caching (APC) enabled, this policy raises the prefix cache hit rate and reduces response latency.
vLLM v0.9.2 and the SGLang framework used in this topic enable prefix caching by default, so you do not need to redeploy the service to turn it on.
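If you run an engine build where prefix caching is not on by default, it can usually be switched on at server start. The launch commands below are a sketch only; the model path matches this topic's examples, and you should verify the flags against the engine version you actually deploy.

```shell
# vLLM: turn on automatic prefix caching (APC) explicitly
vllm serve /models/Qwen3-32B --enable-prefix-caching

# SGLang: the radix (prefix) cache is enabled by default and is only
# turned off when --disable-radix-cache is passed, so simply omit that flag
python3 -m sglang.launch_server --model-path /models/Qwen3-32B
```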
To enable the prefix-aware load balancing policy, add the following annotation to the InferencePool: inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
Create a Prefix_Cache.yaml file that matches your deployment mode.

Single-node vLLM deployment:

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
  annotations:
    inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: vllm-inference
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100
```

Single-node SGLang deployment:

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
  annotations:
    inference.networking.x-k8s.io/model-server-runtime: sglang
    inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: sgl-inference
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100
```

Distributed vLLM deployment:

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
  annotations:
    inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: vllm-multi-nodes
    role: leader
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100
```

Distributed SGLang deployment:

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
  annotations:
    inference.networking.x-k8s.io/model-server-runtime: sglang
    inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference-workload: sglang-multi-nodes
    role: leader
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-inference-model
spec:
  modelName: /models/Qwen3-32B
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-inference-pool
  targetModels:
  - name: /models/Qwen3-32B
    weight: 100
```

SGLang PD-disaggregated deployment:

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
spec:
  targetPortNumber: 8000
  selector:
    alibabacloud.com/inference_backend: sglang # Selects both the prefill and decode workloads
---
# InferenceTrafficPolicy specifies the traffic policy applied to the InferencePool
apiVersion: inferenceextension.alibabacloud.com/v1alpha1
kind: InferenceTrafficPolicy
metadata:
  name: inference-policy
spec:
  poolRef:
    name: qwen-inference-pool
  modelServerRuntime: sglang # The backend serving runtime is SGLang
  profile:
    pd: # The backend service is deployed with prefill/decode (PD) disaggregation
      trafficPolicy:
        prefixCache: # Declare the prefix-cache load balancing policy
          mode: estimate
        prefillPolicyRef: prefixCache
        decodePolicyRef: prefixCache # Apply prefix-aware load balancing to both prefill and decode
      pdRoleLabelName: rolebasedgroup.workloads.x-k8s.io/role # Pod label that distinguishes the prefill and decode roles within the InferencePool
      kvTransfer:
        bootstrapPort: 34000 # Bootstrap port used by the SGLang PD-disaggregated service for KVCache transfer; must match the disaggregation-bootstrap-port argument in the RoleBasedGroup deployment
```

Create the prefix-aware load balancing configuration:
```shell
kubectl create -f Prefix_Cache.yaml
```
Step 2: Deploy the gateway
Create a gateway_networking.yaml file. It defines the GatewayClass, Gateway, and HTTPRoute resources that expose the LLM inference route on port 8080.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: inference-gateway-class
spec:
  controllerName: inference.networking.x-k8s.io/gateway-controller
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway
spec:
  gatewayClassName: inference-gateway-class
  listeners:
  - name: http-llm
    protocol: HTTP
    port: 8080
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: inference-route
spec:
  parentRefs:
  - name: inference-gateway
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /v1
    backendRefs:
    - name: qwen-inference-pool
      kind: InferencePool
      group: inference.networking.x-k8s.io
```
```shell
kubectl create -f gateway_networking.yaml
```
Step 3: Verify the inference gateway configuration
Run the following command to obtain the gateway's external address:
```shell
export GATEWAY_HOST=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}')
```

Test access to the service on port 8080 with curl:
```shell
curl http://${GATEWAY_HOST}:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/models/Qwen3-32B",
    "messages": [
      {"role": "user", "content": "Hello, this is a test"}
    ],
    "max_tokens": 50
  }'
```

Verify the different load balancing policies.
Verify load balancing based on request queue length and GPU cache utilization
The default policy performs intelligent routing based on request queue length and GPU cache utilization. You can verify it by load-testing the inference service and observing its TTFT (time to first token) and throughput metrics.
For the detailed test method, see Configure observability metrics and dashboards for LLM services.
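One simple way to generate load is a throwaway Kubernetes Job that sends requests to the gateway from several pods in parallel while you watch the TTFT and throughput dashboards. This is only a sketch under assumptions: the Job name, image tag, request counts, and the GATEWAY_HOST value are illustrative and should be adapted to your cluster (use the gateway address obtained above).

```yaml
# Illustrative load-generation Job: 4 parallel pods each send 50 chat
# completion requests to the inference gateway.
apiVersion: batch/v1
kind: Job
metadata:
  name: inference-load-test
spec:
  parallelism: 4
  completions: 4
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: load
        image: curlimages/curl:8.8.0
        command: ["/bin/sh", "-c"]
        args:
        - |
          for i in $(seq 1 50); do
            curl -s -o /dev/null -w "request $i: HTTP %{http_code}\n" \
              http://${GATEWAY_HOST}:8080/v1/chat/completions \
              -H 'Content-Type: application/json' \
              -d '{"model":"/models/Qwen3-32B","messages":[{"role":"user","content":"ping"}],"max_tokens":16}'
          done
        env:
        - name: GATEWAY_HOST
          value: inference-gateway   # hypothetical address; replace with your gateway address
```

Increase `parallelism` and the loop count to raise the load, and compare per-pod metrics to confirm requests are being spread according to backend load.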
Verify prefix-aware load balancing
Create test files to verify that prefix-aware load balancing takes effect.
Generate round1.txt:
```shell
echo '{"max_tokens":24,"messages":[{"content":"Hi, here'\''s some system prompt: hi hi hi hi hi hi hi hi hi hi.For user 3, here are some other context: hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi.I would like to test your intelligence. for this purpose I would like you to play zork. you can interact with the game by typing in commands. I will forward these commands to the game and type in any response. are you ready?","role":"user"}],"model":"/models/Qwen3-32B","stream":true,"stream_options":{"include_usage":true},"temperature":0}' > round1.txt
```

Generate round2.txt:
```shell
echo '{"max_tokens":3,"messages":[{"content":"Hi, here'\''s some system prompt: hi hi hi hi hi hi hi hi hi hi.For user 3, here are some other context: hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi.I would like to test your intelligence. for this purpose I would like you to play zork. you can interact with the game by typing in commands. I will forward these commands to the game and type in any response. are you ready?","role":"user"},{"content":"Hi there! It looks like you'\''re setting up a fun test. I'\''m ready to play Zork! You can","role":"assistant"},{"content":"% zork\nWelcome to Dungeon. This version created 11-MAR-91.\nYou are in an open field west of a big white house with a boarded\nfront door.\nThere is a small mailbox here.\n>","role":"user"},{"content":"Great!","role":"assistant"},{"content":"Opening the mailbox reveals:\n A leaflet.\n>","role":"user"}],"model":"/models/Qwen3-32B","stream":true,"stream_options":{"include_usage":true},"temperature":0}' > round2.txt
```

Run the following commands to test:
```shell
curl -X POST ${GATEWAY_HOST}:8080/v1/chat/completions -H 'Content-Type: application/json' -d @./round1.txt
curl -X POST ${GATEWAY_HOST}:8080/v1/chat/completions -H 'Content-Type: application/json' -d @./round2.txt
```

Check the Inference Extension Processor logs to confirm that prefix-aware load balancing took effect:
```shell
kubectl logs deploy/inference-gateway-ext-proc -n envoy-gateway-system | grep "Request Handled"
```

If the two log entries show the same pod name, prefix-aware load balancing is working.
For the detailed test method and results of prefix-aware load balancing, see Evaluate inference service performance with multi-turn conversation tests.