Mirror Inference Traffic Safely with ACK Gateway Extension - Container Service for Kubernetes

The ACK Gateway with Inference Extension component supports traffic mirroring for inference requests in addition to intelligent load balancing for inference services. Before you publish a new inference model in production, you can mirror production traffic to it and confirm that its performance and stability meet your requirements.

Traffic mirroring (also called shadowing) lets you send a copy of live production requests to a candidate model without affecting the responses your users receive. The mirrored requests are fire-and-forget: the gateway forwards them out of band, ignores any responses from the mirror target, and returns only the primary service's response to the client. This makes traffic mirroring a low-risk way to validate a new model against real traffic before you release it. The shadow service still consumes its own GPU resources while it processes the mirrored requests.

This topic explains how to configure ACK Gateway with Inference Extension to mirror inference requests from a primary vLLM service to a shadow service.

Important

Before you begin, make sure you understand the concepts of InferencePool and InferenceModel.

Prerequisites

Before you begin, ensure that you have:

An ACK managed cluster with a GPU node pool. Alternatively, install the ACK Virtual Node component to use Container Compute Service (ACS) GPU compute power.
ACK Gateway with Inference Extension installed, with Enable Gateway API Inference Extension selected. For the installation entry, see Step 2: Install the ACK Gateway with Inference Extension component.

Note

The container image used in this guide requires more than 16 GiB of GPU memory, so a T4 card (16 GiB) may not be sufficient. Use an A10 GPU card for ACK clusters, or 8th-gen GPU B for ACS GPU compute power.

The LLM image is large. To reduce pull time, push it to Container Registry (ACR) in advance and pull it over the internal network. Pulling over the public internet depends on your cluster's Elastic IP Address (EIP) bandwidth and may be slow.

How it works

This example deploys the following resources:

Resource	Description
`vllm-llama2-7b-pool`	Primary inference service (APP)
`vllm-llama2-7b-pool-1`	Shadow inference service (APP1)
Gateway (ClusterIP)	Entry point for inference requests
HTTPRoute (`mirror-route`)	Routes production traffic to the primary service and mirrors it to the shadow service
InferencePool + InferenceModel	Enables intelligent load balancing for the primary service
Service for APP1	Regular ClusterIP Service for the shadow service. Intelligent load balancing is not applied to mirrored traffic, so a standard Service is required
Sleep	Test client

The following diagram shows the traffic flow.

The client sends a request to the gateway.
The HTTPRoute matches the request using a PathPrefix rule.
Production traffic is forwarded to the InferencePool, which applies intelligent load balancing before routing to APP.
The RequestMirror filter sends a copy of the request to the shadow Service, which forwards it to APP1.
Both APP and APP1 process the request, but the gateway returns only APP's response to the client. APP1's response is discarded.

Deploy traffic mirroring for inference services

Step 1: Deploy the inference services

Save the following YAML as inference_app.yaml and apply it with kubectl apply -f inference_app.yaml to deploy vllm-llama2-7b-pool. The configuration for vllm-llama2-7b-pool-1 is identical apart from its name: copy the YAML, replace vllm-llama2-7b-pool with vllm-llama2-7b-pool-1 in the Deployment name, selector labels, and Pod template labels, and apply the copy as well. Do not rename the shared chat-template ConfigMap.

# =============================================================
# inference_app.yaml
# =============================================================
apiVersion: v1
kind: ConfigMap
metadata:
  name: chat-template
data:
  llama-2-chat.jinja: |
    {% if messages[0]['role'] == 'system' %}
      {% set system_message = '<<SYS>>\n' + messages[0]['content'] | trim + '\n<</SYS>>\n\n' %}
      {% set messages = messages[1:] %}
    {% else %}
        {% set system_message = '' %}
    {% endif %}

    {% for message in messages %}
        {% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}
            {{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}
        {% endif %}

        {% if loop.index0 == 0 %}
            {% set content = system_message + message['content'] %}
        {% else %}
            {% set content = message['content'] %}
        {% endif %}
        {% if message['role'] == 'user' %}
            {{ bos_token + '[INST] ' + content | trim + ' [/INST]' }}
        {% elif message['role'] == 'assistant' %}
            {{ ' ' + content | trim + ' ' + eos_token }}
        {% endif %}
    {% endfor %}
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama2-7b-pool
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-llama2-7b-pool
  template:
    metadata:
      annotations:
        prometheus.io/path: /metrics
        prometheus.io/port: '8000'
        prometheus.io/scrape: 'true'
      labels:
        app: vllm-llama2-7b-pool
    spec:
      containers:
        - name: lora
          image: "registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/llama2-with-lora:v0.2"
          imagePullPolicy: IfNotPresent
          command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
          args:
          - "--model"
          - "/model/llama2"
          - "--tensor-parallel-size"
          - "1"
          - "--port"
          - "8000"
          - '--gpu_memory_utilization'
          - '0.8'
          - "--enable-lora"
          - "--max-loras"
          - "4"
          - "--max-cpu-loras"
          - "12"
          - "--lora-modules"
          - 'sql-lora=/adapters/yard1/llama-2-7b-sql-lora-test_0'
          - 'sql-lora-1=/adapters/yard1/llama-2-7b-sql-lora-test_1'
          - 'sql-lora-2=/adapters/yard1/llama-2-7b-sql-lora-test_2'
          - 'sql-lora-3=/adapters/yard1/llama-2-7b-sql-lora-test_3'
          - 'sql-lora-4=/adapters/yard1/llama-2-7b-sql-lora-test_4'
          - 'tweet-summary=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_0'
          - 'tweet-summary-1=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_1'
          - 'tweet-summary-2=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_2'
          - 'tweet-summary-3=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_3'
          - 'tweet-summary-4=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_4'
          - '--chat-template'
          - '/etc/vllm/llama-2-chat.jinja'
          env:
            - name: PORT
              value: "8000"
          ports:
            - containerPort: 8000
              name: http
              protocol: TCP
          livenessProbe:
            failureThreshold: 2400
            httpGet:
              path: /health
              port: http
              scheme: HTTP
            initialDelaySeconds: 5
            periodSeconds: 5
            successThreshold: 1
            timeoutSeconds: 1
          readinessProbe:
            failureThreshold: 6000
            httpGet:
              path: /health
              port: http
              scheme: HTTP
            initialDelaySeconds: 5
            periodSeconds: 5
            successThreshold: 1
            timeoutSeconds: 1
          resources:
            limits:
              nvidia.com/gpu: 1
            requests:
              nvidia.com/gpu: 1
          volumeMounts:
            - mountPath: /data
              name: data
            - mountPath: /dev/shm
              name: shm
            - mountPath: /etc/vllm
              name: chat-template
      restartPolicy: Always
      schedulerName: default-scheduler
      terminationGracePeriodSeconds: 30
      volumes:
        - name: data
          emptyDir: {}
        - name: shm
          emptyDir:
            medium: Memory
        - name: chat-template
          configMap:
            name: chat-template
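
The rename for the shadow Deployment can also be scripted. The following is a minimal sketch, assuming the manifest above is saved as inference_app.yaml. The function name make_shadow_manifest and the output file name inference_app_shadow.yaml are hypothetical. It relies on every occurrence of vllm-llama2-7b-pool in the manifest sitting at the end of a line, which holds for the YAML shown here but may not hold for a modified manifest.

```shell
# Sketch: derive the shadow manifest from the primary one.
# Reads a manifest on stdin and writes the renamed copy to stdout.
# The `$` anchor keeps the rewrite idempotent: a line that already ends in
# vllm-llama2-7b-pool-1 does not match and is left unchanged.
make_shadow_manifest() {
  sed 's/vllm-llama2-7b-pool$/vllm-llama2-7b-pool-1/'
}

# Usage (file names are assumptions, not part of the original guide):
#   make_shadow_manifest < inference_app.yaml > inference_app_shadow.yaml
#   kubectl apply -f inference_app.yaml -f inference_app_shadow.yaml
```

The chat-template ConfigMap does not contain the Deployment name, so it passes through unchanged and applying it a second time has no effect.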

Step 2: Deploy the InferencePool, InferenceModel, and shadow Service

Save the following YAML as inference_rules.yaml and apply it with kubectl apply -f inference_rules.yaml. The shadow Service selects Pods labeled app: vllm-llama2-7b-pool-1, so the labels on the shadow Deployment from Step 1 must match.

# =============================================================
# inference_rules.yaml
# =============================================================
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: vllm-llama2-7b-pool
spec:
  targetPortNumber: 8000
  selector:
    app: vllm-llama2-7b-pool
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: inferencemodel-sample
spec:
  modelName: /model/llama2
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: vllm-llama2-7b-pool
  targetModels:
  - name: /model/llama2
    weight: 100
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-llama2-7b-pool-1
spec:
  selector:
    app: vllm-llama2-7b-pool-1
  ports:
  - protocol: TCP
    port: 8000
    targetPort: 8000
  type: ClusterIP
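
Because mirrored requests are fire-and-forget, a shadow Service whose selector matches no Pods fails silently. The following is a minimal sketch of a check for that case, assuming the cluster still serves the core Endpoints API. The function name shadow_has_endpoints is hypothetical.

```shell
# Sketch: succeed only if the shadow Service has at least one ready endpoint.
# An empty result usually means the shadow Pods are not ready yet, or their
# labels do not match the Service selector app=vllm-llama2-7b-pool-1.
shadow_has_endpoints() {
  shadow_ips=$(kubectl get endpoints vllm-llama2-7b-pool-1 \
    -o jsonpath='{.subsets[*].addresses[*].ip}') || return 1
  [ -n "$shadow_ips" ]
}

# Usage:
#   shadow_has_endpoints || echo "shadow Service has no ready endpoints" >&2
```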

Step 3: Deploy the Gateway and HTTPRoute

Save the following YAML as gateway.yaml and apply it with kubectl apply -f gateway.yaml.

# =============================================================
# gateway.yaml
# =============================================================
kind: GatewayClass
apiVersion: gateway.networking.k8s.io/v1
metadata:
  name: example-gateway-class
  labels:
    example: http-routing
spec:
  controllerName: gateway.envoyproxy.io/gatewayclass-controller
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  labels:
    example: http-routing
  name: example-gateway
  namespace: default
spec:
  gatewayClassName: example-gateway-class
  infrastructure:
    parametersRef:
      group: gateway.envoyproxy.io
      kind: EnvoyProxy
      name: custom-proxy-config
  listeners:
  - allowedRoutes:
      namespaces:
        from: Same
    name: http
    port: 80
    protocol: HTTP
---
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: EnvoyProxy
metadata:
  name: custom-proxy-config
  namespace: default
spec:
  provider:
    type: Kubernetes
    kubernetes:
      envoyService:
        type: ClusterIP
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: mirror-route
  labels:
    example: http-routing
spec:
  parentRefs:
    - name: example-gateway
  hostnames:
    - "example.com"
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /
      backendRefs:
      - group: inference.networking.x-k8s.io
        kind: InferencePool
        name: vllm-llama2-7b-pool
        weight: 1
      filters:
      - type: RequestMirror
        requestMirror:
          backendRef:
            kind: Service
            name: vllm-llama2-7b-pool-1
            port: 8000

The Gateway is exposed through a ClusterIP Service, so it is reachable only from within the cluster. To allow external access, change envoyService.type to LoadBalancer.

The RequestMirror filter copies each incoming request and sends it to vllm-llama2-7b-pool-1. Responses from the mirror target are always discarded; only the InferencePool response is returned to the client. Until this HTTPRoute is applied, vllm-llama2-7b-pool-1 receives no traffic. The log check in Step 5 confirms that mirroring is active.
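
After the HTTPRoute is applied, its status conditions show whether the gateway accepted it. The following is a minimal sketch that reads one condition by type, using the standard Gateway API condition types Accepted and ResolvedRefs. The function name route_condition is hypothetical, and whether an unresolvable mirror backend sets ResolvedRefs to False is an assumption that depends on the gateway implementation.

```shell
# Sketch: print the status ("True", "False", or empty) of one condition
# reported by the first parent Gateway of the mirror-route HTTPRoute.
route_condition() {
  kubectl get httproute mirror-route \
    -o jsonpath="{.status.parents[0].conditions[?(@.type==\"$1\")].status}"
}

# Usage:
#   [ "$(route_condition Accepted)" = "True" ] || echo "route not accepted" >&2
#   [ "$(route_condition ResolvedRefs)" = "True" ] || echo "a backend reference is unresolved" >&2
```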

Step 4: Deploy the test client

Save the following YAML as sleep.yaml and apply it with kubectl apply -f sleep.yaml.

# =============================================================
# sleep.yaml
# =============================================================
apiVersion: v1
kind: ServiceAccount
metadata:
  name: sleep
---
apiVersion: v1
kind: Service
metadata:
  name: sleep
  labels:
    app: sleep
    service: sleep
spec:
  ports:
  - port: 80
    name: http
  selector:
    app: sleep
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sleep
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sleep
  template:
    metadata:
      labels:
        app: sleep
    spec:
      terminationGracePeriodSeconds: 0
      serviceAccountName: sleep
      containers:
      - name: sleep
        image:  registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/curl:asm-sleep
        command: ["/bin/sleep", "infinity"]
        imagePullPolicy: IfNotPresent
        volumeMounts:
        - mountPath: /etc/sleep/tls
          name: secret-volume
      volumes:
      - name: secret-volume
        secret:
          secretName: sleep-secret
          optional: true
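
The vLLM Pods can take several minutes to pull the image and load the model, so it helps to wait for every Deployment to finish rolling out before you send test traffic. The following is a minimal sketch. The function name wait_for_deployments and the 15-minute timeout are assumptions; adjust the timeout to your image pull and model load times.

```shell
# Sketch: block until each named Deployment has rolled out, stopping at the
# first one that fails or times out.
wait_for_deployments() {
  for deploy_name in "$@"; do
    kubectl rollout status "deployment/$deploy_name" --timeout=15m || return 1
  done
}

# Usage:
#   wait_for_deployments vllm-llama2-7b-pool vllm-llama2-7b-pool-1 sleep
```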

Step 5: Verify traffic mirroring

Get the gateway address.

export GATEWAY_ADDRESS=$(kubectl get gateway/example-gateway -o jsonpath='{.status.addresses[0].value}')
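
If the Gateway has not been assigned an address yet, the command above sets GATEWAY_ADDRESS to an empty string and the curl request in the next step fails with a confusing error. The following is a minimal sketch of a guarded variant. The function name gateway_address is hypothetical.

```shell
# Sketch: print the Gateway address, or fail with a message if the Gateway
# status does not list one yet.
gateway_address() {
  gw_addr=$(kubectl get gateway/example-gateway \
    -o jsonpath='{.status.addresses[0].value}' 2>/dev/null)
  if [ -z "$gw_addr" ]; then
    echo "example-gateway has no address yet" >&2
    return 1
  fi
  printf '%s\n' "$gw_addr"
}

# Usage:
#   GATEWAY_ADDRESS=$(gateway_address) && export GATEWAY_ADDRESS
```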

Send a test request.

kubectl exec deployment/sleep -it -- curl -X POST ${GATEWAY_ADDRESS}/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H "host: example.com" \
  -d '{
    "model": "/model/llama2",
    "max_completion_tokens": 100,
    "temperature": 0,
    "messages": [
      {
        "role": "user",
        "content": "introduce yourself"
      }
    ]
  }'

The expected output is similar to:

{"id":"chatcmpl-eb67bf29-1f87-4e29-8c3e-a83f3c74cd87","object":"chat.completion","created":1745207283,"model":"/model/llama2","choices":[{"index":0,"message":{"role":"assistant","content":"\n         [INST] I'm a [/INST]\n\n         [INST] I'm a [/INST]\n\n         [INST] I'm a [/INST]\n\n         [INST] I'm a [/INST]\n\n         [INST] I'm a [/INST]\n\n         [INST] I'm a [/INST]\n\n         [INST] I'm a [/INST]\n\n        ","tool_calls":[]},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":15,"total_tokens":115,"completion_tokens":100,"prompt_tokens_details":null},"prompt_logprobs":null}

Confirm that both services received the request by checking their logs.

echo "primary service logs:" && kubectl logs deployments/vllm-llama2-7b-pool | grep /v1/chat/completions | grep OK
echo "mirror service logs:" && kubectl logs deployments/vllm-llama2-7b-pool-1 | grep /v1/chat/completions | grep OK

The expected output is similar to:

primary service logs:
INFO:     10.2.14.146:39478 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     10.2.14.146:60660 - "POST /v1/chat/completions HTTP/1.1" 200 OK
mirror service logs:
INFO:     10.2.14.146:39742 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     10.2.14.146:59976 - "POST /v1/chat/completions HTTP/1.1" 200 OK

200 OK entries in both deployment logs confirm that traffic mirroring is working correctly.
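
The same log check can be turned into a comparison of request counts. The following is a minimal sketch, assuming each Deployment runs a single replica as in this guide and that neither Pod has restarted since the first test request. The function names count_ok and mirroring_matches are hypothetical.

```shell
# Sketch: count successful chat-completion requests in one Deployment's log.
# `grep -c` exits non-zero when the count is 0, so `|| true` keeps the
# function usable under `set -e`.
count_ok() {
  kubectl logs "deployments/$1" \
    | grep -c '"POST /v1/chat/completions HTTP/1.1" 200 OK' || true
}

# Sketch: succeed when the primary service has served at least one request
# and the shadow service has served the same number.
mirroring_matches() {
  primary_count=$(count_ok vllm-llama2-7b-pool)
  mirror_count=$(count_ok vllm-llama2-7b-pool-1)
  echo "primary=$primary_count mirror=$mirror_count"
  [ "$primary_count" -gt 0 ] && [ "$primary_count" -eq "$mirror_count" ]
}

# Usage:
#   mirroring_matches || echo "request counts differ" >&2
```

Because the gateway does not wait for the mirror target, the shadow count can briefly lag behind the primary count right after a request.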

Clean up

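Remove all resources created in this tutorial to avoid unnecessary GPU costs. The shadow Deployment was created from a copy of inference_app.yaml, so delete it by name.

kubectl delete -f sleep.yaml
kubectl delete -f gateway.yaml
kubectl delete -f inference_rules.yaml
kubectl delete deployment vllm-llama2-7b-pool-1
kubectl delete -f inference_app.yaml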