Gateway with Inference Extension supports load-aware queueing and priority scheduling of inference requests. When the backend model servers of a generative AI inference service are saturated, queued inference requests can be scheduled by model criticality, so that requests for higher-priority models are served first. This topic describes the request queueing and priority scheduling capabilities of Gateway with Inference Extension.
This topic requires Gateway with Inference Extension v1.4.0 or later.
Background information
For generative AI inference services, the request throughput of a single inference server is strictly limited by GPU resources. When a large number of requests are sent to the same inference server at once, resources such as the inference engine's KV cache become fully occupied, degrading the response time and token throughput of all requests.
Gateway with Inference Extension evaluates the internal state of inference servers through multiple server-side metrics. When an inference server is saturated, incoming inference requests are queued instead of being forwarded, preventing excess requests from reaching the server and degrading overall service quality.
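Conceptually, the load-aware gating described above works like the following sketch. This is a simplification for illustration only, not the actual extension implementation; the metric names and threshold values are assumptions.

```python
# Conceptual sketch of load-aware admission (NOT the actual extension code):
# forward a request only while backend metrics stay below assumed saturation
# thresholds; otherwise hold it in a queue.
from collections import deque
from dataclasses import dataclass


@dataclass
class BackendMetrics:
    kv_cache_utilization: float  # fraction of KV cache in use, 0.0-1.0
    waiting_requests: int        # requests already waiting inside the engine


class LoadAwareAdmitter:
    def __init__(self, kv_threshold=0.9, waiting_threshold=8):
        # Threshold values here are illustrative assumptions.
        self.kv_threshold = kv_threshold
        self.waiting_threshold = waiting_threshold
        self.queue = deque()

    def saturated(self, m: BackendMetrics) -> bool:
        return (m.kv_cache_utilization >= self.kv_threshold
                or m.waiting_requests >= self.waiting_threshold)

    def admit(self, request, m: BackendMetrics):
        """Return the request if it may be forwarded now, else queue it."""
        if self.saturated(m):
            self.queue.append(request)
            return None
        return request


admitter = LoadAwareAdmitter()
busy = BackendMetrics(kv_cache_utilization=0.95, waiting_requests=2)
idle = BackendMetrics(kv_cache_utilization=0.30, waiting_requests=0)
assert admitter.admit("req-1", busy) is None      # held back: KV cache saturated
assert admitter.admit("req-2", idle) == "req-2"   # forwarded immediately
```

Queued requests are released to the backend as capacity frees up, which is where the priority scheduling described later in this topic comes in.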
Prerequisites
An ACK managed cluster with a GPU node pool has been created. Alternatively, you can install the ACK Virtual Node component in the ACK managed cluster to use ACS GPU computing power.
Gateway with Inference Extension v1.4.0 or later is installed, with the Gateway API inference extension option selected. For instructions, see Install components.
For the image used in this topic, the A10 GPU type is recommended for ACK clusters, and the L20 (GN8IS) type is recommended for ACS GPU computing power.
In addition, because LLM images are large, we recommend that you copy the image to ACR in advance and pull it over the internal network. Pulling directly from the public internet depends on the bandwidth of the cluster's EIP and can take a long time.
Procedure
Step 1: Deploy the sample inference service
Create vllm-service.yaml.
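The contents of vllm-service.yaml are not reproduced here. As a rough, hypothetical illustration only, a minimal manifest consistent with the resources used in later steps (label app: qwen, container port 8000, base model qwen plus LoRA adapter travel-helper-v1) might look like the following. The image address, model paths, and vLLM flags are assumptions and must be replaced with values for your environment.

```yaml
# Hypothetical sketch only -- image, model paths, and flags are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: qwen
  labels:
    app: qwen
spec:
  replicas: 1
  selector:
    matchLabels:
      app: qwen
  template:
    metadata:
      labels:
        app: qwen
    spec:
      containers:
      - name: vllm
        image: <your-vllm-image>   # assumption: an image pre-pulled to ACR
        command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
        args:
        - --model=/models/DeepSeek-R1-Distill-Qwen-7B
        - --served-model-name=qwen
        - --enable-lora
        - --lora-modules=travel-helper-v1=/models/travel-helper-v1  # assumed adapter path
        - --port=8000
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: "1"
---
apiVersion: v1
kind: Service
metadata:
  name: qwen
spec:
  selector:
    app: qwen
  ports:
  - port: 8000
    targetPort: 8000
```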
Deploy the sample inference service.
kubectl apply -f vllm-service.yaml
Step 2: Deploy the inference route
This step creates the InferencePool and InferenceModel resources, and enables queueing for the inference services selected by the InferencePool by adding the inference-epp-env.networking.x-k8s.io/experimental-use-queueing: "true" and inference-epp-env.networking.x-k8s.io/experimental-use-scheduler-v2: "true" annotations to the InferencePool.
Create inference-pool.yaml.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  annotations:
    inference-epp-env.networking.x-k8s.io/experimental-use-queueing: "true"
    inference-epp-env.networking.x-k8s.io/experimental-use-scheduler-v2: "true"
  name: qwen-pool
  namespace: default
spec:
  extensionRef:
    group: ""
    kind: Service
    name: qwen-ext-proc
  selector:
    app: qwen
  targetPortNumber: 8000
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen-model
spec:
  criticality: Critical
  modelName: qwen
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-pool
  targetModels:
  - name: qwen
    weight: 100
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: travel-helper-model
spec:
  criticality: Standard
  modelName: travel-helper
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: qwen-pool
  targetModels:
  - name: travel-helper-v1
    weight: 100

Alongside the InferencePool resource, two InferenceModel resources are declared, representing the two models that the sample inference service can serve:
- qwen-model: declares the base model qwen provided by the sample inference service, and sets its criticality level to Critical via the criticality: Critical field.
- travel-helper-model: declares the LoRA model travel-helper that the sample inference service provides on top of the base model, and sets its criticality level to Standard via the criticality: Standard field.
A model's criticality level can be declared as Critical, Standard, or Sheddable. The priority order of the three levels is Critical > Standard > Sheddable. With queueing enabled, when the backend model servers are saturated, requests for higher-priority models are served first.
Deploy the inference route.
kubectl apply -f inference-pool.yaml
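The criticality-based dispatch described above can be sketched conceptually as a priority queue that drains Critical requests before Standard ones, and Standard before Sheddable, preserving arrival order within each level. This is a simplified illustration, not the extension's actual scheduler.

```python
# Conceptual sketch of criticality-based dispatch; NOT the actual
# Gateway with Inference Extension scheduler.
import heapq
import itertools

# Lower number = higher priority, mirroring Critical > Standard > Sheddable.
CRITICALITY = {"Critical": 0, "Standard": 1, "Sheddable": 2}


class PriorityQueue:
    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # FIFO tie-break within one criticality level

    def enqueue(self, criticality, request):
        heapq.heappush(self._heap, (CRITICALITY[criticality], next(self._seq), request))

    def dequeue(self):
        return heapq.heappop(self._heap)[2]


q = PriorityQueue()
q.enqueue("Standard", "travel-helper req 1")
q.enqueue("Critical", "qwen req 1")
q.enqueue("Sheddable", "batch req 1")
q.enqueue("Critical", "qwen req 2")

# When backend capacity frees up, Critical requests drain first, in arrival order.
order = [q.dequeue() for _ in range(4)]
assert order == ["qwen req 1", "qwen req 2",
                 "travel-helper req 1", "batch req 1"]
```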
Step 3: Deploy the gateway and gateway routing rules
By matching the model name in the request, requests for the qwen and travel-helper models are routed to the backend InferencePool named qwen-pool.
Create inference-gateway.yaml.
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: inference-gateway
spec:
  controllerName: gateway.envoyproxy.io/gatewayclass-controller
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway
spec:
  gatewayClassName: inference-gateway
  listeners:
  - name: llm-gw
    protocol: HTTP
    port: 8081
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
  namespace: default
spec:
  parentRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: inference-gateway
    sectionName: llm-gw
  rules:
  - backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: qwen-pool
    matches:
    - headers:
      - type: Exact
        name: X-Gateway-Model-Name
        value: qwen
    - headers:
      - type: RegularExpression
        name: X-Gateway-Model-Name
        value: travel-helper.*
---
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: backend-timeout
spec:
  timeout:
    http:
      requestTimeout: 24h
  targetRef:
    group: gateway.networking.k8s.io
    kind: Gateway
    name: inference-gateway

Deploy the gateway.
kubectl apply -f inference-gateway.yaml
Step 4: Verify request queueing and priority scheduling
Taking an ACK cluster as an example, use vllm benchmark to stress-test the qwen and travel-helper models at the same time, so that the model server becomes saturated.
Deploy the benchmark workload.
kubectl apply -f- <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: vllm-benchmark
  name: vllm-benchmark
  namespace: default
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: vllm-benchmark
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: vllm-benchmark
    spec:
      containers:
      - command:
        - sh
        - -c
        - sleep inf
        image: registry-cn-hangzhou.ack.aliyuncs.com/dev/llm-benchmark:random-and-qa
        imagePullPolicy: IfNotPresent
        name: vllm-benchmark
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
EOF

Obtain the internal IP address of the gateway.
export GW_IP=$(kubectl get svc -n envoy-gateway-system -l gateway.envoyproxy.io/owning-gateway-namespace=default,gateway.envoyproxy.io/owning-gateway-name=inference-gateway -o jsonpath='{.items[0].spec.clusterIP}')

Open two terminal windows and benchmark both models at the same time.
Important: The following data was generated in a test environment and is for reference only. Actual benchmark results may vary across environments.
qwen:
kubectl exec -it deploy/vllm-benchmark -- env GW_IP=${GW_IP} python3 /root/vllm/benchmarks/benchmark_serving.py \
  --backend vllm \
  --model /models/DeepSeek-R1-Distill-Qwen-7B \
  --served-model-name qwen \
  --trust-remote-code \
  --dataset-name random \
  --random-prefix-len 1000 \
  --random-input-len 3000 \
  --random-output-len 3000 \
  --random-range-ratio 0.2 \
  --num-prompts 300 \
  --max-concurrency 60 \
  --host $GW_IP \
  --port 8081 \
  --endpoint /v1/completions \
  --save-result \
  2>&1 | tee benchmark_serving.txt

Expected output:
============ Serving Benchmark Result ============
Successful requests:              293
Benchmark duration (s):           1005.55
Total input tokens:               1163919
Total generated tokens:           837560
Request throughput (req/s):       0.29
Output token throughput (tok/s):  832.94
Total Token throughput (tok/s):   1990.43
---------------Time to First Token----------------
Mean TTFT (ms):                   21329.91
Median TTFT (ms):                 15754.01
P99 TTFT (ms):                    140782.55
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                   58.58
Median TPOT (ms):                 58.36
P99 TPOT (ms):                    91.09
---------------Inter-token Latency----------------
Mean ITL (ms):                    58.32
Median ITL (ms):                  50.56
P99 ITL (ms):                     64.12
==================================================

travel-helper:
kubectl exec -it deploy/vllm-benchmark -- env GW_IP=${GW_IP} python3 /root/vllm/benchmarks/benchmark_serving.py \
  --backend vllm \
  --model /models/DeepSeek-R1-Distill-Qwen-7B \
  --served-model-name travel-helper \
  --trust-remote-code \
  --dataset-name random \
  --random-prefix-len 1000 \
  --random-input-len 3000 \
  --random-output-len 3000 \
  --random-range-ratio 0.2 \
  --num-prompts 300 \
  --max-concurrency 60 \
  --host $GW_IP \
  --port 8081 \
  --endpoint /v1/completions \
  --save-result \
  2>&1 | tee benchmark_serving.txt

Expected output:
============ Serving Benchmark Result ============
Successful requests:              165
Benchmark duration (s):           889.41
Total input tokens:               660560
Total generated tokens:           492207
Request throughput (req/s):       0.19
Output token throughput (tok/s):  553.41
Total Token throughput (tok/s):   1296.10
---------------Time to First Token----------------
Mean TTFT (ms):                   44201.12
Median TTFT (ms):                 28757.03
P99 TTFT (ms):                    214710.13
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                   67.38
Median TPOT (ms):                 60.51
P99 TPOT (ms):                    118.36
---------------Inter-token Latency----------------
Mean ITL (ms):                    66.98
Median ITL (ms):                  51.25
P99 ITL (ms):                     64.87
==================================================

As the results show, with the model server saturated, the mean TTFT of requests for the qwen model is about 50% lower than that of the travel-helper model, and the number of failed requests is about 95% lower.
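As a sanity check, the relative improvements can be recomputed from the numbers reported above (300 prompts were sent to each model):

```python
# Recompute the comparison directly from the benchmark outputs above.
num_prompts = 300

qwen_mean_ttft_ms = 21329.91
helper_mean_ttft_ms = 44201.12
qwen_failed = num_prompts - 293      # qwen: 293 successful requests
helper_failed = num_prompts - 165    # travel-helper: 165 successful requests

ttft_reduction = 1 - qwen_mean_ttft_ms / helper_mean_ttft_ms
error_reduction = 1 - qwen_failed / helper_failed

assert round(ttft_reduction, 2) == 0.52   # mean TTFT roughly halved
assert round(error_reduction, 2) == 0.95  # ~95% fewer failed requests
```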