The ACK Gateway with Inference Extension component supports traffic mirroring for inference requests in addition to intelligent load balancing for inference services. When deploying a new inference model in a production environment, you can mirror production traffic to it to verify that its performance and stability meet requirements before you officially release it. This topic describes how to use ACK Gateway with Inference Extension to mirror traffic for inference requests.
Before reading this topic, make sure you understand the concepts of InferencePool and InferenceModel.
Prerequisites
An ACK managed cluster with a GPU node pool is created. You can also install the ACK Virtual Node component in the ACK managed cluster to use ACS GPU compute power.
ACK Gateway with Inference Extension is installed, with Enable Gateway API Inference Extension selected during installation. For more information, see Step 2: Install the Gateway with Inference Extension component.
For the image used in this topic, we recommend A10 GPUs for ACK clusters and GN8IS GPUs for Alibaba Cloud Container Compute Service (ACS) GPU compute power.
Because the LLM image is large, we recommend that you push it to Container Registry in advance and pull it over the internal network. Pulling over the public network is limited by the bandwidth of the cluster's elastic IP address (EIP) and may take a long time.
Workflow
This example deploys the following resources:
Two inference services: vllm-llama2-7b-pool and vllm-llama2-7b-pool-1 (APP and APP1 in the following figure).
A ClusterIP Service that serves as a gateway.
An HTTPRoute that configures specific traffic forwarding and mirroring rules.
An InferencePool and a corresponding InferenceModel that enable intelligent load balancing for APP, and a regular Service for APP1. Intelligent load balancing is currently not supported for mirrored traffic, so APP1 requires a regular Service.
A Sleep application as a test client.
The following figure shows the traffic mirroring process.
When the client accesses the gateway, the HTTPRoute identifies production traffic based on prefix matching rules.
After the rule matches successfully:
Production traffic is normally forwarded to the corresponding InferencePool, and then forwarded to the backend APP after intelligent load balancing.
The HTTPFilter in the rule sends mirrored traffic to the specified Service, which then forwards it to the backend APP1.
Both APP and APP1 return responses, but the gateway processes only the response from the InferencePool and ignores the response from the mirror Service. The client sees only the result from the primary service.
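The forwarding-plus-mirroring behavior described above is expressed in a single HTTPRoute rule with a RequestMirror filter. A condensed sketch of that rule (the full manifests appear in the procedure):

```yaml
# Condensed HTTPRoute rule: primary traffic goes to the InferencePool,
# and a copy of each request is mirrored to the regular Service for APP1.
rules:
- matches:
  - path:
      type: PathPrefix
      value: /
  backendRefs:                          # primary backend; its response is returned to the client
  - group: inference.networking.x-k8s.io
    kind: InferencePool
    name: vllm-llama2-7b-pool
  filters:
  - type: RequestMirror                 # mirror copy; its response is discarded
    requestMirror:
      backendRef:
        kind: Service
        name: vllm-llama2-7b-pool-1
        port: 8000
```

Note that the mirror target must be a regular Service, not an InferencePool, which is why APP1 is exposed through a ClusterIP Service.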
Procedure
Deploy the sample inference services vllm-llama2-7b-pool and vllm-llama2-7b-pool-1.
This step only provides the YAML file for vllm-llama2-7b-pool. The configuration for vllm-llama2-7b-pool-1 is identical except for the name. Modify the corresponding fields in the YAML file when you deploy the vllm-llama2-7b-pool-1 inference service.
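For orientation, an inference service Deployment for this example generally follows the shape below. This is an illustrative sketch only, not the official manifest: the image address and model arguments are hypothetical placeholders. The parts that matter for the rest of this topic are the app label, which the InferencePool selector matches, and port 8000.

```yaml
# Illustrative sketch only -- NOT the official manifest.
# <your-registry>/vllm-openai:latest is a placeholder image address.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama2-7b-pool
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-llama2-7b-pool
  template:
    metadata:
      labels:
        app: vllm-llama2-7b-pool        # must match the InferencePool selector
    spec:
      containers:
      - name: vllm
        image: <your-registry>/vllm-openai:latest   # placeholder image
        args: ["--model", "/model/llama2", "--port", "8000"]   # placeholder args
        ports:
        - containerPort: 8000           # matches targetPortNumber in the InferencePool
        resources:
          limits:
            nvidia.com/gpu: "1"         # schedules the pod onto a GPU node
```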
Deploy the InferencePool and InferenceModel, and the Service for the vllm-llama2-7b-pool-1 application.
# =============================================================
# inference_rules.yaml
# =============================================================
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: vllm-llama2-7b-pool
spec:
  targetPortNumber: 8000
  selector:
    app: vllm-llama2-7b-pool
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: inferencemodel-sample
spec:
  modelName: /model/llama2
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: vllm-llama2-7b-pool
  targetModels:
  - name: /model/llama2
    weight: 100
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-llama2-7b-pool-1
spec:
  selector:
    app: vllm-llama2-7b-pool-1
  ports:
  - protocol: TCP
    port: 8000
    targetPort: 8000
  type: ClusterIP
Deploy the Gateway and HTTPRoute.
The Gateway uses a ClusterIP Service, which can only be accessed from within the cluster. You can change the Service type to LoadBalancer based on your actual needs.
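To make the gateway reachable from outside the cluster, the EnvoyProxy resource is where the Service type is changed. A minimal sketch of the LoadBalancer variant:

```yaml
# Variant of the EnvoyProxy resource: expose the gateway through a
# LoadBalancer Service instead of the ClusterIP used in this example.
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: EnvoyProxy
metadata:
  name: custom-proxy-config
  namespace: default
spec:
  provider:
    type: Kubernetes
    kubernetes:
      envoyService:
        type: LoadBalancer   # ClusterIP in this example; LoadBalancer for external access
```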
# =============================================================
# gateway.yaml
# =============================================================
kind: GatewayClass
apiVersion: gateway.networking.k8s.io/v1
metadata:
  name: example-gateway-class
  labels:
    example: http-routing
spec:
  controllerName: gateway.envoyproxy.io/gatewayclass-controller
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  labels:
    example: http-routing
  name: example-gateway
  namespace: default
spec:
  gatewayClassName: example-gateway-class
  infrastructure:
    parametersRef:
      group: gateway.envoyproxy.io
      kind: EnvoyProxy
      name: custom-proxy-config
  listeners:
  - allowedRoutes:
      namespaces:
        from: Same
    name: http
    port: 80
    protocol: HTTP
---
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: EnvoyProxy
metadata:
  name: custom-proxy-config
  namespace: default
spec:
  provider:
    type: Kubernetes
    kubernetes:
      envoyService:
        type: ClusterIP
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: mirror-route
  labels:
    example: http-routing
spec:
  parentRefs:
  - name: example-gateway
  hostnames:
  - "example.com"
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /
    backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: vllm-llama2-7b-pool
      weight: 1
    filters:
    - type: RequestMirror
      requestMirror:
        backendRef:
          kind: Service
          name: vllm-llama2-7b-pool-1
          port: 8000
Deploy the sleep application.
# =============================================================
# sleep.yaml
# =============================================================
apiVersion: v1
kind: ServiceAccount
metadata:
  name: sleep
---
apiVersion: v1
kind: Service
metadata:
  name: sleep
  labels:
    app: sleep
    service: sleep
spec:
  ports:
  - port: 80
    name: http
  selector:
    app: sleep
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sleep
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sleep
  template:
    metadata:
      labels:
        app: sleep
    spec:
      terminationGracePeriodSeconds: 0
      serviceAccountName: sleep
      containers:
      - name: sleep
        image: registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/curl:asm-sleep
        command: ["/bin/sleep", "infinity"]
        imagePullPolicy: IfNotPresent
        volumeMounts:
        - mountPath: /etc/sleep/tls
          name: secret-volume
      volumes:
      - name: secret-volume
        secret:
          secretName: sleep-secret
          optional: true
Verify traffic mirroring.
Obtain the gateway address.
export GATEWAY_ADDRESS=$(kubectl get gateway/example-gateway -o jsonpath='{.status.addresses[0].value}')
Send a test request.
kubectl exec deployment/sleep -it -- curl -X POST ${GATEWAY_ADDRESS}/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H "host: example.com" \
  -d '{
    "model": "/model/llama2",
    "max_completion_tokens": 100,
    "temperature": 0,
    "messages": [
      {
        "role": "user",
        "content": "introduce yourself"
      }
    ]
  }'
Expected output:
{"id":"chatcmpl-eb67bf29-1f87-4e29-8c3e-a83f3c74cd87","object":"chat.completion","created":1745207283,"model":"/model/llama2","choices":[{"index":0,"message":{"role":"assistant","content":"\n [INST] I'm a [/INST]\n\n [INST] I'm a [/INST]\n\n [INST] I'm a [/INST]\n\n [INST] I'm a [/INST]\n\n [INST] I'm a [/INST]\n\n [INST] I'm a [/INST]\n\n [INST] I'm a [/INST]\n\n ","tool_calls":[]},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":15,"total_tokens":115,"completion_tokens":100,"prompt_tokens_details":null},"prompt_logprobs":null}
Check the application logs.
echo "original logs↓↓↓" && kubectl logs deployments/vllm-llama2-7b-pool | grep /v1/chat/completions | grep OK
echo "mirror logs↓↓↓" && kubectl logs deployments/vllm-llama2-7b-pool-1 | grep /v1/chat/completions | grep OK
Expected output:
original logs↓↓↓
INFO: 10.2.14.146:39478 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO: 10.2.14.146:60660 - "POST /v1/chat/completions HTTP/1.1" 200 OK
mirror logs↓↓↓
INFO: 10.2.14.146:39742 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO: 10.2.14.146:59976 - "POST /v1/chat/completions HTTP/1.1" 200 OK
The output shows that requests are routed to both vllm-llama2-7b-pool and vllm-llama2-7b-pool-1, which indicates that traffic mirroring works as expected.