The ACK Gateway with Inference Extension component provides intelligent load balancing for inference services and can also mirror inference requests. Before you publish a new inference model in production, you can mirror production traffic to it and confirm that its performance and stability meet your requirements.
Traffic mirroring (also called shadowing) sends a copy of live production requests to a candidate model without affecting the responses your users receive. Mirrored requests are fire-and-forget: the gateway forwards them out of band, ignores any response from the mirror target, and returns only the primary service's response to the client. Because users never see the shadow service's output, mirroring is a low-risk way to validate a new LLM against real traffic before release.
This topic explains how to configure ACK Gateway with Inference Extension to mirror inference requests from a primary vLLM service to a shadow service.
This topic assumes that you are familiar with the InferencePool and InferenceModel concepts.
Prerequisites
Before you begin, ensure that you have:
An ACK managed cluster with a GPU node pool. Alternatively, install the ACK Virtual Node component to use Container Compute Service (ACS) GPU compute power.
ACK Gateway with Inference Extension installed, with Enable Gateway API Inference Extension selected. For installation instructions, see Step 2: Install the ACK Gateway with Inference Extension component.
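One way to sanity-check the second prerequisite is to look for the Inference Extension CRDs. The sketch below is a hypothetical helper, not part of the official procedure. The CRD names are an assumption derived from the API group and kinds used later in this topic (inference.networking.x-k8s.io, InferencePool, InferenceModel), so adjust them if your installation registers different names.

```shell
# Hypothetical helper: succeeds only if both Inference Extension CRDs exist.
# Assumption: the CRDs follow the usual <plural>.<group> naming convention.
check_inference_crds() {
  kubectl get crd \
    inferencepools.inference.networking.x-k8s.io \
    inferencemodels.inference.networking.x-k8s.io
}
```

Run check_inference_crds against the cluster before you continue. A non-zero exit status suggests that the component or its Inference Extension option is not installed.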
How it works
This example deploys the following resources:
| Resource | Description |
|---|---|
| vllm-llama2-7b-pool | Primary inference service (APP) |
| vllm-llama2-7b-pool-1 | Shadow inference service (APP1) |
| Gateway (ClusterIP) | Entry point for inference requests |
| HTTPRoute (mirror-route) | Routes production traffic to the primary service and mirrors it to the shadow service |
| InferencePool + InferenceModel | Enables intelligent load balancing for the primary service |
| Service for APP1 | Regular ClusterIP Service for the shadow service. Intelligent load balancing is not applied to mirrored traffic, so a standard Service is required |
| Sleep | Test client |
The following diagram shows the traffic flow.
The client sends a request to the gateway.
The HTTPRoute matches the request using a PathPrefix rule.
Production traffic is forwarded to the InferencePool, which applies intelligent load balancing before routing to APP.
The RequestMirror filter sends a copy of the request to the shadow Service, which forwards it to APP1.
Both APP and APP1 process the request, but the gateway returns only APP's response to the client. APP1's response is discarded.
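The fire-and-forget behavior in this flow can be illustrated with a toy shell model. This is only a sketch of the semantics, not how the gateway is implemented. The three function names are hypothetical stand-ins for APP, APP1, and the gateway.

```shell
# Toy stand-ins for the two backends (hypothetical, not real services).
primary_backend() { echo "primary handled: $1"; }
shadow_backend()  { echo "shadow handled: $1"; }

# Toy gateway: send a copy of the request to the shadow backend in the
# background and drop its output, then return only the primary response.
toy_gateway() {
  shadow_backend "$1" >/dev/null 2>&1 &
  primary_backend "$1"
}
```

Calling toy_gateway "introduce yourself" prints only the primary backend's line. The shadow backend still runs, but the caller never sees its output or waits for it.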
Deploy traffic mirroring for inference services
Step 1: Deploy the inference services
Deploy vllm-llama2-7b-pool using the following YAML. The configuration for vllm-llama2-7b-pool-1 is otherwise identical: copy the YAML and replace vllm-llama2-7b-pool with vllm-llama2-7b-pool-1 in the Deployment name, the selector labels, and the Pod template labels. Do not rename the shared chat-template ConfigMap.
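If you prefer to script the rename instead of editing by hand, a sed filter along the following lines could derive the shadow manifest. It is a sketch under two assumptions: the primary manifest is stored in a file such as inference_app.yaml (the file name used in the Clean up section), and the ConfigMap name does not contain the string vllm-llama2-7b-pool. The function name and the output file name in the usage note are hypothetical.

```shell
# Hypothetical helper: read a manifest on stdin and write a copy on stdout
# in which every vllm-llama2-7b-pool becomes vllm-llama2-7b-pool-1.
# The optional "-1" in the pattern keeps the filter idempotent, so names
# that were already renamed are not given a second suffix.
make_shadow_manifest() {
  sed 's/vllm-llama2-7b-pool\(-1\)\{0,1\}/vllm-llama2-7b-pool-1/g'
}
```

For example, make_shadow_manifest < inference_app.yaml > inference_app_1.yaml writes the renamed copy to a new file. Because the filter renames every occurrence, review the result before you apply it and confirm that the ConfigMap reference is unchanged.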
Step 2: Deploy the InferencePool, InferenceModel, and shadow Service
# =============================================================
# inference_rules.yaml
# =============================================================
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: vllm-llama2-7b-pool
spec:
  targetPortNumber: 8000
  selector:
    app: vllm-llama2-7b-pool
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: inferencemodel-sample
spec:
  modelName: /model/llama2
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: vllm-llama2-7b-pool
  targetModels:
  - name: /model/llama2
    weight: 100
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-llama2-7b-pool-1
spec:
  selector:
    app: vllm-llama2-7b-pool-1
  ports:
  - protocol: TCP
    port: 8000
    targetPort: 8000
  type: ClusterIP
Step 3: Deploy the Gateway and HTTPRoute
# =============================================================
# gateway.yaml
# =============================================================
kind: GatewayClass
apiVersion: gateway.networking.k8s.io/v1
metadata:
  name: example-gateway-class
  labels:
    example: http-routing
spec:
  controllerName: gateway.envoyproxy.io/gatewayclass-controller
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  labels:
    example: http-routing
  name: example-gateway
  namespace: default
spec:
  gatewayClassName: example-gateway-class
  infrastructure:
    parametersRef:
      group: gateway.envoyproxy.io
      kind: EnvoyProxy
      name: custom-proxy-config
  listeners:
  - allowedRoutes:
      namespaces:
        from: Same
    name: http
    port: 80
    protocol: HTTP
---
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: EnvoyProxy
metadata:
  name: custom-proxy-config
  namespace: default
spec:
  provider:
    type: Kubernetes
    kubernetes:
      envoyService:
        type: ClusterIP
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: mirror-route
  labels:
    example: http-routing
spec:
  parentRefs:
  - name: example-gateway
  hostnames:
  - "example.com"
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /
    backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: vllm-llama2-7b-pool
      weight: 1
    filters:
    - type: RequestMirror
      requestMirror:
        backendRef:
          kind: Service
          name: vllm-llama2-7b-pool-1
          port: 8000
The Gateway uses a ClusterIP Service and is accessible only from within the cluster. Change envoyService.type to LoadBalancer if you need external access. The RequestMirror filter copies each incoming request and sends it to vllm-llama2-7b-pool-1. Responses from the mirror target are always discarded; only the InferencePool response is returned to the client. Before you apply this HTTPRoute, vllm-llama2-7b-pool-1 receives no traffic; the log check in Step 5 confirms that mirroring is active.
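Before you send test traffic, it can help to confirm that the Gateway is ready and that the route carries the mirror filter. The helpers below are a sketch with hypothetical function names. They assume that the gateway reports the standard Gateway API Programmed condition and that the resources keep the names used in gateway.yaml.

```shell
# Hypothetical helper: block until the Gateway reports Programmed=True.
# The timeout defaults to 120s and can be overridden with the first argument.
wait_for_gateway() {
  kubectl wait --for=condition=Programmed gateway/example-gateway \
    --timeout="${1:-120s}"
}

# Hypothetical helper: print the filter types configured on mirror-route.
list_route_filters() {
  kubectl get httproute mirror-route \
    -o jsonpath='{.spec.rules[*].filters[*].type}'
}

# Succeeds only if a RequestMirror filter is present on the route.
route_is_mirroring() {
  list_route_filters | grep -q 'RequestMirror'
}
```

A typical sequence is wait_for_gateway followed by route_is_mirroring. If the second call fails, inspect the HTTPRoute status for the reason the filter was not accepted.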
Step 4: Deploy the test client
# =============================================================
# sleep.yaml
# =============================================================
apiVersion: v1
kind: ServiceAccount
metadata:
  name: sleep
---
apiVersion: v1
kind: Service
metadata:
  name: sleep
  labels:
    app: sleep
    service: sleep
spec:
  ports:
  - port: 80
    name: http
  selector:
    app: sleep
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sleep
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sleep
  template:
    metadata:
      labels:
        app: sleep
    spec:
      terminationGracePeriodSeconds: 0
      serviceAccountName: sleep
      containers:
      - name: sleep
        image: registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/curl:asm-sleep
        command: ["/bin/sleep", "infinity"]
        imagePullPolicy: IfNotPresent
        volumeMounts:
        - mountPath: /etc/sleep/tls
          name: secret-volume
      volumes:
      - name: secret-volume
        secret:
          secretName: sleep-secret
          optional: true
Step 5: Verify traffic mirroring
Get the gateway address.
export GATEWAY_ADDRESS=$(kubectl get gateway/example-gateway -o jsonpath='{.status.addresses[0].value}')
Send a test request.
kubectl exec deployment/sleep -it -- curl -X POST ${GATEWAY_ADDRESS}/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H "host: example.com" \
  -d '{
    "model": "/model/llama2",
    "max_completion_tokens": 100,
    "temperature": 0,
    "messages": [
      {
        "role": "user",
        "content": "introduce yourself"
      }
    ]
  }'
The expected output is similar to:
{"id":"chatcmpl-eb67bf29-1f87-4e29-8c3e-a83f3c74cd87","object":"chat.completion","created":1745207283,"model":"/model/llama2","choices":[{"index":0,"message":{"role":"assistant","content":"\n [INST] I'm a [/INST]\n\n [INST] I'm a [/INST]\n\n [INST] I'm a [/INST]\n\n [INST] I'm a [/INST]\n\n [INST] I'm a [/INST]\n\n [INST] I'm a [/INST]\n\n [INST] I'm a [/INST]\n\n ","tool_calls":[]},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":15,"total_tokens":115,"completion_tokens":100,"prompt_tokens_details":null},"prompt_logprobs":null}
Confirm that both services received the request by checking their logs.
echo "primary service logs:" && kubectl logs deployments/vllm-llama2-7b-pool | grep /v1/chat/completions | grep OK
echo "mirror service logs:" && kubectl logs deployments/vllm-llama2-7b-pool-1 | grep /v1/chat/completions | grep OK
The expected output is similar to:
primary service logs:
INFO: 10.2.14.146:39478 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO: 10.2.14.146:60660 - "POST /v1/chat/completions HTTP/1.1" 200 OK
mirror service logs:
INFO: 10.2.14.146:39742 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO: 10.2.14.146:59976 - "POST /v1/chat/completions HTTP/1.1" 200 OK
200 OK entries in both deployment logs confirm that traffic mirroring is working correctly.
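To go one step beyond eyeballing the logs, you could compare how many successful completions each service recorded. The following is a sketch with hypothetical helper names. It assumes the log line format shown in the expected output and a single replica per Deployment, because kubectl logs on a Deployment reads from only one of its Pods.

```shell
# Count successful chat-completion requests in log text read from stdin.
# grep -c exits non-zero when the count is 0, so "|| true" keeps that quiet.
count_ok() {
  grep -c 'POST /v1/chat/completions HTTP/1.1" 200 OK' || true
}

# Hypothetical helper: print both counts and succeed only if they match.
compare_mirror_counts() {
  primary_ok=$(kubectl logs deployments/vllm-llama2-7b-pool | count_ok)
  mirror_ok=$(kubectl logs deployments/vllm-llama2-7b-pool-1 | count_ok)
  echo "primary=$primary_ok mirror=$mirror_ok"
  [ "$primary_ok" -eq "$mirror_ok" ]
}
```

Run compare_mirror_counts after the test request completes. Mirrored requests are sent out of band, so the shadow count can briefly lag behind the primary count; rerun the check after a short pause before you treat a mismatch as a failure.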
Clean up
Remove all resources created in this tutorial to avoid unnecessary GPU costs.
kubectl delete -f sleep.yaml
kubectl delete -f gateway.yaml
kubectl delete -f inference_rules.yaml
kubectl delete -f inference_app.yaml