With the Gateway with Inference Extension component, you can specify different load balancing strategies for inference routing based on the usage scenarios of your generative AI inference services. This topic describes how to use the Gateway with Inference Extension component to implement a prefix-aware load balancing strategy.
Before reading this topic, make sure that you are familiar with the concepts of InferencePool and InferenceModel.
This topic requires Gateway with Inference Extension 1.4.0 or later.
Background information
Automatic prefix caching in vLLM
vLLM supports automatic prefix caching (APC). APC caches the KV cache of requests that vLLM has already computed. If a new request shares a prefix with a previous request, it can directly reuse the existing KV cache and skip the KV cache computation for the shared prefix, which speeds up the processing of LLM inference requests.
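For reference, APC is enabled when the model server is started. The following is only a minimal sketch of launching a vLLM OpenAI-compatible server with prefix caching explicitly enabled; it is not the exact startup command of the sample image in this topic, and recent vLLM versions may already enable the feature by default:
# Launch a vLLM OpenAI-compatible server with automatic prefix caching enabled.
# The model path /model/qwen matches the sample service used later in this topic.
vllm serve /model/qwen \
  --port 8000 \
  --enable-prefix-caching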
Prefix-aware load balancing strategy
A prefix-aware load balancing strategy sends requests that share the same prefix to the same inference server Pod whenever possible.
When APC is enabled on the model server, the prefix-aware load balancing strategy maximizes the prefix cache hit ratio and reduces request response time. This strategy is mainly suitable for scenarios with a large number of requests that share prefixes; evaluate it against your actual business scenario.
Typical use cases include:
Long document queries: users repeatedly query the same long document (such as a software manual or an annual report) with different questions.
Multi-turn conversations: a user may interact with the application multiple times within the same chat session.
Prerequisites
An ACK managed cluster with a GPU node pool has been created. You can also install the ACK Virtual Node component in the ACK managed cluster to use ACS GPU compute power.
Gateway with Inference Extension 1.4.0 or later has been installed, with the Enable Gateway API Inference Extension option selected. For the operation entry point, see Install components.
For the image used in this topic, the A10 GPU type is recommended for ACK clusters, and the GN8IS (8th-generation GPU B) type is recommended for ACS GPU compute power.
Because the LLM image is large, we recommend that you transfer it to ACR in advance and pull it over the internal network. Pulling directly from the public internet depends on the bandwidth of the cluster's EIP and can take a long time.
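For example, the image can be transferred with standard docker commands. The registry, region, namespace, and image names below are placeholders; replace them with your own values, and reference the ACR internal (VPC) address in your Deployment so that the cluster pulls over the internal network:
# Pull the public image, retag it for your ACR repository, and push it.
docker pull <public-registry>/<llm-image>:<tag>
docker tag <public-registry>/<llm-image>:<tag> registry.<region>.aliyuncs.com/<namespace>/<llm-image>:<tag>
docker push registry.<region>.aliyuncs.com/<namespace>/<llm-image>:<tag>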
Procedure
Step 1: Deploy the sample inference service
Create vllm-service.yaml.
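The full manifest is not reproduced in this section. The following is only a minimal sketch of what vllm-service.yaml might look like, inferred from the resources referenced later in this topic (label app: qwen, container port 8000, model path /model/qwen, Service name qwen); the image, command, replica count, and GPU request are assumptions that you need to adapt to your environment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: qwen
  labels:
    app: qwen
spec:
  replicas: 3
  selector:
    matchLabels:
      app: qwen
  template:
    metadata:
      labels:
        app: qwen
    spec:
      containers:
      - name: vllm
        # Placeholder: replace with your vLLM image address in ACR.
        image: <your-vllm-image>
        # Serve the model under /model/qwen on port 8000 with automatic prefix caching enabled.
        # Model files are assumed to be baked into the image or mounted separately.
        command: ["sh", "-c", "vllm serve /model/qwen --port 8000 --enable-prefix-caching"]
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: "1"
---
apiVersion: v1
kind: Service
metadata:
  name: qwen
spec:
  selector:
    app: qwen
  ports:
  - name: http
    port: 8000
    targetPort: 8000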
Deploy the sample inference service.
kubectl apply -f vllm-service.yaml
Step 2: Deploy the inference route
This step creates the InferencePool and InferenceModel resources.
Create inference-pool.yaml.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  annotations:
    inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
  name: vllm-qwen-pool
spec:
  targetPortNumber: 8000
  selector:
    app: qwen
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: inferencemodel-qwen
spec:
  modelName: /model/qwen
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: vllm-qwen-pool
  targetModels:
  - name: /model/qwen
    weight: 100
In the InferencePool resource, the inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE" annotation enables the prefix-aware load balancing strategy for the Pods in the InferencePool.
Deploy the inference route.
kubectl apply -f inference-pool.yaml
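Optionally, confirm that both resources were created. The plural resource names below assume the standard CRD naming of the inference extension:
kubectl get inferencepools.inference.networking.x-k8s.io vllm-qwen-pool
kubectl get inferencemodels.inference.networking.x-k8s.io inferencemodel-qwen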
Step 3: Deploy the gateway and gateway routing rules
This step creates a gateway with two ports, 8080 and 8081. On port 8081, an HTTPRoute resource sets the routing backend to the InferencePool provided by the inference extension, so inference requests are routed to the set of Pods selected by the InferencePool. On port 8080, an HTTPRoute resource sets the routing backend to a Service, so inference requests are routed to the same set of Pods through the ordinary HTTP least-request load balancing strategy.
Create inference-gateway.yaml.
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: qwen-inference-gateway-class
spec:
  controllerName: gateway.envoyproxy.io/gatewayclass-controller
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: qwen-inference-gateway
spec:
  gatewayClassName: qwen-inference-gateway-class
  listeners:
  - name: http
    protocol: HTTP
    port: 8080
  - name: llm-gw
    protocol: HTTP
    port: 8081
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: qwen-backend
spec:
  parentRefs:
  - name: qwen-inference-gateway
    sectionName: llm-gw
  rules:
  - backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: vllm-qwen-pool
    matches:
    - path:
        type: PathPrefix
        value: /
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: qwen-backend-no-inference
spec:
  parentRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: qwen-inference-gateway
    sectionName: http
  rules:
  - backendRefs:
    - group: ""
      kind: Service
      name: qwen
      port: 8000
      weight: 1
    matches:
    - path:
        type: PathPrefix
        value: /
---
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: backend-timeout
spec:
  timeout:
    http:
      requestTimeout: 1h
  targetRef:
    group: gateway.networking.k8s.io
    kind: Gateway
    name: qwen-inference-gateway
Deploy the gateway.
kubectl apply -f inference-gateway.yaml
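Optionally, check that the gateway has been accepted and assigned an address before you continue:
kubectl get gateway qwen-inference-gateway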
Step 4: Verify the routing rules
Create round1.txt and round2.txt. Both files contain the same piece of content. Send round1.txt and then round2.txt as the body of LLM requests, and then check the logs of the intelligent routing extensionRef to verify whether prefix-aware routing is triggered.
round1.txt:
echo '{"max_tokens":24,"messages":[{"content":"Hi, here'\''s some system prompt: hi hi hi hi hi hi hi hi hi hi.For user 3, here are some other context: hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi.I would like to test your intelligence. for this purpose I would like you to play zork. you can interact with the game by typing in commands. I will forward these commands to the game and type in any response. are you ready?","role":"user"}],"model":"/model/qwen","stream":true,"stream_options":{"include_usage":true},"temperature":0}' > round1.txtround2.txt:
echo '{"max_tokens":3,"messages":[{"content":"Hi, here'\''s some system prompt: hi hi hi hi hi hi hi hi hi hi.For user 3, here are some other context: hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi.I would like to test your intelligence. for this purpose I would like you to play zork. you can interact with the game by typing in commands. I will forward these commands to the game and type in any response. are you ready?","role":"user"},{"content":"Hi there! It looks like you're setting up a fun test. I'm ready to play Zork! You can","role":"assistant"},{"content":"% zork\nWelcome to Dungeon. This version created 11-MAR-91.\nYou are in an open field west of a big white house with a boarded\nfront door.\nThere is a small mailbox here.\n>","role":"user"},{"content":"Great!","role":"assistant"},{"content":"Opening the mailbox reveals:\n A leaflet.\n>","role":"user"}],"model":"/model/qwen","stream":true,"stream_options":{"include_usage":true},"temperature":0}' > round2.txt擷取Gateway的公網IP。
export GATEWAY_IP=$(kubectl get gateway/qwen-inference-gateway -o jsonpath='{.status.addresses[0].value}')發起兩次會話請求,類比多輪對話情境。
curl -X POST ${GATEWAY_IP}:8081/v1/chat/completions -H 'Content-Type: application/json' -d @./round1.txt curl -X POST ${GATEWAY_IP}:8081/v1/chat/completions -H 'Content-Type: application/json' -d @./round2.txt查看日誌,確認首碼負載平衡是否生效。
kubectl logs deploy/epp-default-inference-gateway-ext-proc -n envoy-gateway-system|grep "Do prefix"預期輸出:
2025-05-23T03:33:09Z INFO scheduling/prefixcache_filter.go:311 Do prefix-aware routing! {"request": "v68m4zx472", "matching ratio": " 0.54 > 0.50"}
As you can see, the log contains the message Do prefix-aware routing!, which indicates that prefix-aware load balancing has taken effect.
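In addition to the gateway extension logs, you can optionally check prefix cache behavior on the model server side. As a hedged sketch: vLLM's OpenAI-compatible server exposes Prometheus metrics on its /metrics endpoint, and recent versions include prefix-cache-related counters, although the exact metric names vary by version; <qwen-pod-name> below is a placeholder for one of the inference Pods, and an empty result may simply mean your vLLM version does not export such metrics:
# Forward the vLLM port locally and look for prefix-cache-related metrics.
kubectl port-forward pod/<qwen-pod-name> 8000:8000 &
curl -s localhost:8000/metrics | grep -i prefix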
(Optional) Step 5: Evaluate inference service performance with a multi-turn conversation test
This step uses an ACK cluster as an example to show how to run a multi-turn conversation test with a benchmarking tool and compare the prefix-aware load balancing effect of the inference route with ordinary HTTP routing.
Deploy the llm-qa-benchmark tool.
kubectl apply -f- <<EOF apiVersion: apps/v1 kind: Deployment metadata: labels: app: llm-qa-benchmark name: llm-qa-benchmark spec: replicas: 1 selector: matchLabels: app: llm-qa-benchmark template: metadata: labels: app: llm-qa-benchmark spec: containers: - command: - sh - -c - sleep inf image: registry-cn-hangzhou.ack.aliyuncs.com/dev/llm-qa-benchmark:v0.1 imagePullPolicy: IfNotPresent name: llm-qa-benchmark terminationMessagePath: /dev/termination-log terminationMessagePolicy: File restartPolicy: Always EOF擷取Gateway內網IP。
export GW_IP=$(kubectl get svc -n envoy-gateway-system -l gateway.envoyproxy.io/owning-gateway-namespace=default,gateway.envoyproxy.io/owning-gateway-name=qwen-inference-gateway -o jsonpath='{.items[0].spec.clusterIP}')
Run the benchmark.
Important: The following test results were produced in a test environment. Actual results depend on your environment.
Ordinary HTTP routing
kubectl exec -it deploy/llm-qa-benchmark -- env GW_IP=${GW_IP} python3 multi-round-qa.py \
  --num-users 8 \
  --num-rounds 15 \
  --qps 0.1 \
  --shared-system-prompt 100 \
  --sharegpt \
  --user-history-prompt 2000 \
  --answer-len 100 \
  --model /model/qwen \
  --time 600 \
  --base-url http://${GW_IP}:8080/v1
Expected output:
==================== Performance summary ======================
  QPS: 0.1000 reqs/s
  Processing speed: 0.1080 reqs/s
  Requests on-the-fly: 0
  Input tokens per second: 259.0703 tokens/s
  Output tokens per second: 4.8576 tokens/s
  Average generation throughput (per request): 26.6710 tokens/req/s
  Average TTFT: 0.3669s
  Time range: 1748231183.2753935 - 1748231766.4799275 (583.20s)
===============================================================
Inference route
kubectl exec -it deploy/llm-qa-benchmark -- env GW_IP=${GW_IP} python3 multi-round-qa.py \
  --num-users 8 \
  --num-rounds 15 \
  --qps 0.1 \
  --shared-system-prompt 100 \
  --sharegpt \
  --user-history-prompt 2000 \
  --answer-len 100 \
  --model /model/qwen \
  --time 600 \
  --base-url http://${GW_IP}:8081/v1
Expected output:
==================== Performance summary ======================
  QPS: 0.1000 reqs/s
  Processing speed: 0.1081 reqs/s
  Requests on-the-fly: 0
  Input tokens per second: 259.3009 tokens/s
  Output tokens per second: 4.8548 tokens/s
  Average generation throughput (per request): 26.9300 tokens/req/s
  Average TTFT: 0.2761s
  Time range: 1748231885.874972 - 1748232468.5918882 (582.72s)
===============================================================
As you can see, the Average TTFT of the inference route is noticeably better (lower) than that of ordinary HTTP routing.