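The ACK Gateway with Inference Extension component supports intelligent load balancing for inference services and also supports traffic mirroring for inference requests. When you deploy a new inference model in a production environment, you can mirror production traffic to evaluate the new model and confirm that its performance and stability meet your requirements before you release it. This topic describes how to use ACK Gateway with Inference Extension to mirror inference requests.
Before you read this topic, make sure that you understand the concepts of InferencePool and InferenceModel.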
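Prerequisites
An ACK managed cluster with a GPU node pool is created. Alternatively, you can install the ACK Virtual Node component in the ACK managed cluster to use ACS GPU computing power in ACK.
ACK Gateway with Inference Extension is installed, with Enable Gateway API Inference Extension selected. For the entry point, see Install components.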
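The image used in this topic requires more than 16 GiB of GPU memory. The memory actually available on a T4 card (16 GiB) is not enough to start the application. We therefore recommend A10 cards for ACK clusters and GN8IS (8th-generation GPU B) cards for ACS GPU computing power.
Because LLM images are large, we also recommend that you copy the image to ACR in advance and pull it over the internal network address. The speed of pulling directly over the public network depends on the bandwidth of the cluster's EIP, and the wait can be long.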
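Workflow
The example in this topic deploys the following resources:
Two inference services, vllm-llama2-7b-pool and vllm-llama2-7b-pool-1 (APP and APP1 in the figure below).
A gateway whose Service type is ClusterIP.
An HTTPRoute resource that configures the traffic forwarding and mirroring rules.
An InferencePool and the corresponding InferenceModel resource, which enable intelligent load balancing for APP.
A regular Service in front of APP1. Intelligent load balancing is not currently supported for mirrored traffic, so a regular Service is required.
A sleep application, which serves as the test client.
The following figure shows the traffic mirroring flow in this example.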
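The client accesses the gateway, and the HTTPRoute identifies production traffic by its prefix matching rule.
After the rule matches:
Production traffic is forwarded as usual to the corresponding InferencePool and, after intelligent load balancing, on to the backend APP.
The rule's HTTPFilter (of type RequestMirror in this example) sends the mirrored traffic to the specified Service, which forwards it to the backend APP1.
Both APP and APP1 return their responses normally, but the gateway processes only the response returned from the InferencePool and ignores the response from the mirror service. The client sees only the result from the primary service.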
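The flow above can be pictured with a toy shell sketch. The functions app, app1, and gateway are hypothetical stand-ins invented for illustration and are not part of ACK Gateway with Inference Extension. The sketch also assumes that the mirrored copy is sent in a fire-and-forget manner. It shows only the key property: the client receives the primary response, and the mirror response is discarded.

```shell
# Toy model of request mirroring. app, app1, and gateway are hypothetical
# stand-ins for the two vLLM backends and the gateway; nothing here contacts a cluster.
app()  { echo "response from APP for: $1"; }
app1() { echo "response from APP1 for: $1"; }

gateway() {
  # Mirrored copy: sent in the background, and its response is discarded.
  app1 "$1" >/dev/null 2>&1 &
  # Primary path: only this response is returned to the client.
  app "$1"
}

gateway "introduce yourself"
```

Running the sketch prints only the line produced by app, which mirrors how the client in this topic sees only the result from the primary service.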
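Procedure
Deploy the example inference services vllm-llama2-7b-pool and vllm-llama2-7b-pool-1.
This step provides only the YAML for vllm-llama2-7b-pool. The configuration of vllm-llama2-7b-pool-1 differs from that of vllm-llama2-7b-pool only in its name. Modify the corresponding fields in the YAML yourself and deploy it.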
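Because the two services differ only in name, the second manifest can be derived from the first by text substitution rather than by hand. The following is a minimal sketch, assuming that the string vllm-llama2-7b-pool appears in the manifest only in name and label fields. The function name rename_pool and the file name vllm-llama2-7b-pool.yaml mentioned afterward are hypothetical.

```shell
# Derive the manifest for vllm-llama2-7b-pool-1 from the manifest of vllm-llama2-7b-pool.
# Reads YAML on stdin and writes the renamed YAML to stdout.
rename_pool() {
  # The first expression strips an existing "-1" suffix, so re-running the
  # function on already-renamed input does not produce "-1-1".
  sed -e 's/vllm-llama2-7b-pool-1/vllm-llama2-7b-pool/g' \
      -e 's/vllm-llama2-7b-pool/vllm-llama2-7b-pool-1/g'
}

# Example with an inline fragment instead of a real manifest file.
printf 'metadata:\n  name: vllm-llama2-7b-pool\n' | rename_pool
```

If the first manifest is saved as vllm-llama2-7b-pool.yaml (an assumed file name), running rename_pool < vllm-llama2-7b-pool.yaml | kubectl apply -f - would deploy the renamed copy. Review the generated YAML first, because the substitution also rewrites any other field that happens to contain the same string.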
Deploy the InferencePool and InferenceModel resources, and the Service for the vllm-llama2-7b-pool-1 application.
    # =============================================================
    # inference_rules.yaml
    # =============================================================
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      name: vllm-llama2-7b-pool
    spec:
      targetPortNumber: 8000
      selector:
        app: vllm-llama2-7b-pool
      extensionRef:
        name: inference-gateway-ext-proc
    ---
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceModel
    metadata:
      name: inferencemodel-sample
    spec:
      modelName: /model/llama2
      criticality: Critical
      poolRef:
        group: inference.networking.x-k8s.io
        kind: InferencePool
        name: vllm-llama2-7b-pool
      targetModels:
      - name: /model/llama2
        weight: 100
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: vllm-llama2-7b-pool-1
    spec:
      selector:
        app: vllm-llama2-7b-pool-1
      ports:
      - protocol: TCP
        port: 8000
        targetPort: 8000
      type: ClusterIP

Deploy the Gateway and HTTPRoute.
The gateway's Service type is ClusterIP, so the gateway is reachable only from inside the cluster. You can change the type to LoadBalancer as needed.
    # =============================================================
    # gateway.yaml
    # =============================================================
    kind: GatewayClass
    apiVersion: gateway.networking.k8s.io/v1
    metadata:
      name: example-gateway-class
      labels:
        example: http-routing
    spec:
      controllerName: gateway.envoyproxy.io/gatewayclass-controller
    ---
    apiVersion: gateway.networking.k8s.io/v1
    kind: Gateway
    metadata:
      labels:
        example: http-routing
      name: example-gateway
      namespace: default
    spec:
      gatewayClassName: example-gateway-class
      infrastructure:
        parametersRef:
          group: gateway.envoyproxy.io
          kind: EnvoyProxy
          name: custom-proxy-config
      listeners:
      - allowedRoutes:
          namespaces:
            from: Same
        name: http
        port: 80
        protocol: HTTP
    ---
    apiVersion: gateway.envoyproxy.io/v1alpha1
    kind: EnvoyProxy
    metadata:
      name: custom-proxy-config
      namespace: default
    spec:
      provider:
        type: Kubernetes
        kubernetes:
          envoyService:
            type: ClusterIP
    ---
    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: mirror-route
      labels:
        example: http-routing
    spec:
      parentRefs:
      - name: example-gateway
      hostnames:
      - "example.com"
      rules:
      - matches:
        - path:
            type: PathPrefix
            value: /
        backendRefs:
        - group: inference.networking.x-k8s.io
          kind: InferencePool
          name: vllm-llama2-7b-pool
          weight: 1
        filters:
        - type: RequestMirror
          requestMirror:
            backendRef:
              kind: Service
              name: vllm-llama2-7b-pool-1
              port: 8000

Deploy the sleep application.
    # =============================================================
    # sleep.yaml
    # =============================================================
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: sleep
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: sleep
      labels:
        app: sleep
        service: sleep
    spec:
      ports:
      - port: 80
        name: http
      selector:
        app: sleep
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: sleep
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: sleep
      template:
        metadata:
          labels:
            app: sleep
        spec:
          terminationGracePeriodSeconds: 0
          serviceAccountName: sleep
          containers:
          - name: sleep
            image: registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/curl:asm-sleep
            command: ["/bin/sleep", "infinity"]
            imagePullPolicy: IfNotPresent
            volumeMounts:
            - mountPath: /etc/sleep/tls
              name: secret-volume
          volumes:
          - name: secret-volume
            secret:
              secretName: sleep-secret
              optional: true

Verify traffic mirroring.
Obtain the gateway address.

    export GATEWAY_ADDRESS=$(kubectl get gateway/example-gateway -o jsonpath='{.status.addresses[0].value}')

Send a test request.
    kubectl exec deployment/sleep -it -- curl -X POST ${GATEWAY_ADDRESS}/v1/chat/completions \
      -H 'Content-Type: application/json' -H "host: example.com" \
      -d '{
        "model": "/model/llama2",
        "max_completion_tokens": 100,
        "temperature": 0,
        "messages": [
          {
            "role": "user",
            "content": "introduce yourself"
          }
        ]
      }'

Expected output:
    {"id":"chatcmpl-eb67bf29-1f87-4e29-8c3e-a83f3c74cd87","object":"chat.completion","created":1745207283,"model":"/model/llama2","choices":[{"index":0,"message":{"role":"assistant","content":"\n [INST] I'm a [/INST]\n\n [INST] I'm a [/INST]\n\n [INST] I'm a [/INST]\n\n [INST] I'm a [/INST]\n\n [INST] I'm a [/INST]\n\n [INST] I'm a [/INST]\n\n [INST] I'm a [/INST]\n\n ","tool_calls":[]},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":15,"total_tokens":115,"completion_tokens":100,"prompt_tokens_details":null},"prompt_logprobs":null}

View the application logs.
    echo "original logs↓↓↓" && kubectl logs deployments/vllm-llama2-7b-pool | grep /v1/chat/completions | grep OK
    echo "mirror logs↓↓↓" && kubectl logs deployments/vllm-llama2-7b-pool-1 | grep /v1/chat/completions | grep OK

Expected output:
    original logs↓↓↓
    INFO: 10.2.14.146:39478 - "POST /v1/chat/completions HTTP/1.1" 200 OK
    INFO: 10.2.14.146:60660 - "POST /v1/chat/completions HTTP/1.1" 200 OK
    mirror logs↓↓↓
    INFO: 10.2.14.146:39742 - "POST /v1/chat/completions HTTP/1.1" 200 OK
    INFO: 10.2.14.146:59976 - "POST /v1/chat/completions HTTP/1.1" 200 OK

Both vllm-llama2-7b-pool and vllm-llama2-7b-pool-1 received the requests, which shows that traffic mirroring is in effect.
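The visual comparison of the two logs can also be scripted. The following is a minimal sketch with two hypothetical helper functions, count_ok_requests and mirror_matches. It assumes that each deployment runs a single replica, that both logs cover the same period, and that APP1 receives no traffic other than the mirrored requests. Under other conditions the two counts may legitimately differ.

```shell
# Count successful chat-completion requests in vLLM access-log lines read from stdin.
count_ok_requests() {
  # grep -c exits non-zero when nothing matches; "|| true" keeps the count (0) usable.
  grep -c 'POST /v1/chat/completions HTTP/1.1" 200 OK' || true
}

# Succeed only if at least one request was served and both counts are equal.
mirror_matches() {
  [ "$1" -gt 0 ] && [ "$1" -eq "$2" ]
}

# Usage against the deployments in this topic (requires cluster access):
#   original=$(kubectl logs deployments/vllm-llama2-7b-pool | count_ok_requests)
#   mirror=$(kubectl logs deployments/vllm-llama2-7b-pool-1 | count_ok_requests)
#   mirror_matches "$original" "$mirror" && echo "mirrored request count matches"
```

Matching on the full request line rather than on two separate grep filters avoids counting unrelated log lines that merely contain OK.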