Container Compute Service: Use ACK Gateway with Inference Extension to implement traffic mirroring for inference services

Last Updated: Jul 28, 2025

The ACK Gateway with Inference Extension component supports traffic mirroring for inference requests while providing intelligent load balancing for inference services. When you deploy a new inference model in a production environment, you can mirror production traffic to the new model to verify that its performance and stability meet requirements before you officially release it. This topic describes how to use ACK Gateway with Inference Extension to implement traffic mirroring for inference requests.

Important

Before reading this topic, make sure you understand the concepts of InferencePool and InferenceModel.

Prerequisites

Note

For the image used in this topic, we recommend that you use A10 cards for ACK clusters and GN8IS cards for Alibaba Cloud Container Compute Service (ACS) GPU computing power.

Due to the large size of the LLM image, we recommend that you transfer it to Container Registry in advance and pull it using the internal network address. The speed of pulling from the public network depends on the bandwidth configuration of the cluster elastic IP address (EIP), which may result in longer wait times.
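
A minimal sketch of staging the image in your own Container Registry repository with the docker CLI; the registry address, namespace, and repository name below are placeholders, not values from this topic:

    # Pull the public demo image, retag it for your own repository, and push it.
    # <your-registry> and <your-namespace> are placeholders that you must replace.
    docker pull registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/llama2-with-lora:v0.2
    docker tag registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/llama2-with-lora:v0.2 <your-registry>/<your-namespace>/llama2-with-lora:v0.2
    docker push <your-registry>/<your-namespace>/llama2-with-lora:v0.2

After the push completes, change the image field in the Deployment YAML in this topic to the internal network address of your repository.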

Workflow

This example deploys the following resources:

  • Two inference services: vllm-llama2-7b-pool and vllm-llama2-7b-pool-1 (APP and APP1 in the following figure).

  • A gateway that is exposed through a ClusterIP Service.

  • An HTTPRoute that configures specific traffic forwarding and mirroring rules.

  • An InferencePool and a corresponding InferenceModel to enable intelligent load balancing for APP, and a regular Service for APP1. Intelligent load balancing is currently not supported for mirrored traffic, so APP1 must be exposed through a regular Service.

  • A Sleep application as a test client.

The traffic mirroring process is as follows.

  • When the client accesses the gateway, the HTTPRoute identifies production traffic based on prefix matching rules.

  • After the rule matches successfully:

    • Production traffic is forwarded to the corresponding InferencePool and then, after intelligent load balancing, to the backend APP.

    • The RequestMirror filter in the rule sends a copy of the traffic to the specified Service, which then forwards it to the backend APP1.

  • Both APP and APP1 return responses, but the gateway processes only the response returned from the InferencePool and ignores the response from the mirrored Service. The client sees only the processing result of the main service.

Procedure

  1. Deploy the sample inference services vllm-llama2-7b-pool and vllm-llama2-7b-pool-1.

    This step provides the YAML file only for vllm-llama2-7b-pool. The configuration of vllm-llama2-7b-pool-1 is identical except for the name. Modify the corresponding fields in the following YAML file when you deploy the vllm-llama2-7b-pool-1 inference service; the sketch after the YAML shows one way to do this.


    # =============================================================
    # inference_app.yaml
    # =============================================================
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: chat-template
    data:
      llama-2-chat.jinja: |
        {% if messages[0]['role'] == 'system' %}
          {% set system_message = '<<SYS>>\n' + messages[0]['content'] | trim + '\n<</SYS>>\n\n' %}
          {% set messages = messages[1:] %}
        {% else %}
            {% set system_message = '' %}
        {% endif %}
    
        {% for message in messages %}
            {% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}
                {{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}
            {% endif %}
    
            {% if loop.index0 == 0 %}
                {% set content = system_message + message['content'] %}
            {% else %}
                {% set content = message['content'] %}
            {% endif %}
            {% if message['role'] == 'user' %}
                {{ bos_token + '[INST] ' + content | trim + ' [/INST]' }}
            {% elif message['role'] == 'assistant' %}
                {{ ' ' + content | trim + ' ' + eos_token }}
            {% endif %}
        {% endfor %}
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: vllm-llama2-7b-pool
      namespace: default
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: vllm-llama2-7b-pool
      template:
        metadata:
          annotations:
            prometheus.io/path: /metrics
            prometheus.io/port: '8000'
            prometheus.io/scrape: 'true'
          labels:
            app: vllm-llama2-7b-pool
        spec:
          containers:
            - name: lora
              image: "registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/llama2-with-lora:v0.2"
              imagePullPolicy: IfNotPresent
              command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
              args:
              - "--model"
              - "/model/llama2"
              - "--tensor-parallel-size"
              - "1"
              - "--port"
              - "8000"
              - '--gpu_memory_utilization'
              - '0.8'
              - "--enable-lora"
              - "--max-loras"
              - "4"
              - "--max-cpu-loras"
              - "12"
              - "--lora-modules"
              - 'sql-lora=/adapters/yard1/llama-2-7b-sql-lora-test_0'
              - 'sql-lora-1=/adapters/yard1/llama-2-7b-sql-lora-test_1'
              - 'sql-lora-2=/adapters/yard1/llama-2-7b-sql-lora-test_2'
              - 'sql-lora-3=/adapters/yard1/llama-2-7b-sql-lora-test_3'
              - 'sql-lora-4=/adapters/yard1/llama-2-7b-sql-lora-test_4'
              - 'tweet-summary=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_0'
              - 'tweet-summary-1=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_1'
              - 'tweet-summary-2=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_2'
              - 'tweet-summary-3=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_3'
              - 'tweet-summary-4=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_4'
              - '--chat-template'
              - '/etc/vllm/llama-2-chat.jinja'
              env:
                - name: PORT
                  value: "8000"
              ports:
                - containerPort: 8000
                  name: http
                  protocol: TCP
              livenessProbe:
                failureThreshold: 2400
                httpGet:
                  path: /health
                  port: http
                  scheme: HTTP
                initialDelaySeconds: 5
                periodSeconds: 5
                successThreshold: 1
                timeoutSeconds: 1
              readinessProbe:
                failureThreshold: 6000
                httpGet:
                  path: /health
                  port: http
                  scheme: HTTP
                initialDelaySeconds: 5
                periodSeconds: 5
                successThreshold: 1
                timeoutSeconds: 1
              resources:
                limits:
                  nvidia.com/gpu: 1
                requests:
                  nvidia.com/gpu: 1
              volumeMounts:
                - mountPath: /data
                  name: data
                - mountPath: /dev/shm
                  name: shm
                - mountPath: /etc/vllm
                  name: chat-template
          restartPolicy: Always
          schedulerName: default-scheduler
          terminationGracePeriodSeconds: 30
          volumes:
            - name: data
              emptyDir: {}
            - name: shm
              emptyDir:
                medium: Memory
            - name: chat-template
              configMap:
                name: chat-template
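
    The following sketch shows one way to generate the manifest for vllm-llama2-7b-pool-1 and deploy both services. The file names and the sed substitution are illustrative assumptions, not part of the official procedure; you can also copy the YAML and edit the fields manually.

    # Save the YAML above as inference_app.yaml, then derive the second manifest by
    # renaming every occurrence of the workload name (labels and selectors included).
    # The chat-template ConfigMap is unchanged and is applied twice, which is harmless.
    sed 's/vllm-llama2-7b-pool/vllm-llama2-7b-pool-1/g' inference_app.yaml > inference_app_1.yaml
    kubectl apply -f inference_app.yaml -f inference_app_1.yaml
    # Wait until both Deployments are available. Model startup can take a long time.
    kubectl wait --for=condition=Available deployment/vllm-llama2-7b-pool deployment/vllm-llama2-7b-pool-1 --timeout=30m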
  2. Deploy the InferencePool and InferenceModel, and the Service for the vllm-llama2-7b-pool-1 application.

    # =============================================================
    # inference_rules.yaml
    # =============================================================
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      name: vllm-llama2-7b-pool
    spec:
      targetPortNumber: 8000
      selector:
        app: vllm-llama2-7b-pool
      extensionRef:
        name: inference-gateway-ext-proc
    ---
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceModel
    metadata:
      name: inferencemodel-sample
    spec:
      modelName: /model/llama2
      criticality: Critical
      poolRef:
        group: inference.networking.x-k8s.io
        kind: InferencePool
        name: vllm-llama2-7b-pool
      targetModels:
      - name: /model/llama2
        weight: 100
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: vllm-llama2-7b-pool-1
    spec:
      selector:
        app: vllm-llama2-7b-pool-1
      ports:
      - protocol: TCP
        port: 8000
        targetPort: 8000
      type: ClusterIP
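
    For example, you can apply the manifest and confirm that the resources exist. The file name inference_rules.yaml comes from the comment above; the fully qualified resource names are used because short names may vary between component versions.

    kubectl apply -f inference_rules.yaml
    kubectl get inferencepools.inference.networking.x-k8s.io vllm-llama2-7b-pool
    kubectl get inferencemodels.inference.networking.x-k8s.io inferencemodel-sample
    kubectl get service vllm-llama2-7b-pool-1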
  3. Deploy the Gateway and HTTPRoute.

    The Gateway is exposed through a ClusterIP Service, which can be accessed only from within the cluster. You can change the Service type to LoadBalancer as needed.
    # =============================================================
    # gateway.yaml
    # =============================================================
    kind: GatewayClass
    apiVersion: gateway.networking.k8s.io/v1
    metadata:
      name: example-gateway-class
      labels:
        example: http-routing
    spec:
      controllerName: gateway.envoyproxy.io/gatewayclass-controller
    ---
    apiVersion: gateway.networking.k8s.io/v1
    kind: Gateway
    metadata:
      labels:
        example: http-routing
      name: example-gateway
      namespace: default
    spec:
      gatewayClassName: example-gateway-class
      infrastructure:
        parametersRef:
          group: gateway.envoyproxy.io
          kind: EnvoyProxy
          name: custom-proxy-config
      listeners:
      - allowedRoutes:
          namespaces:
            from: Same
        name: http
        port: 80
        protocol: HTTP
    ---
    apiVersion: gateway.envoyproxy.io/v1alpha1
    kind: EnvoyProxy
    metadata:
      name: custom-proxy-config
      namespace: default
    spec:
      provider:
        type: Kubernetes
        kubernetes:
          envoyService:
            type: ClusterIP
    ---
    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: mirror-route
      labels:
        example: http-routing
    spec:
      parentRefs:
        - name: example-gateway
      hostnames:
        - "example.com"
      rules:
        - matches:
            - path:
                type: PathPrefix
                value: /
          backendRefs:
          - group: inference.networking.x-k8s.io
            kind: InferencePool
            name: vllm-llama2-7b-pool
            weight: 1
          filters:
          - type: RequestMirror
            requestMirror:
              backendRef:
                kind: Service
                name: vllm-llama2-7b-pool-1
                port: 8000
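
    As an optional check, apply the manifest and wait for the gateway to be programmed. The condition name follows the Gateway API specification; the wait timeout is an arbitrary value for this sketch.

    kubectl apply -f gateway.yaml
    kubectl wait --for=condition=Programmed gateway/example-gateway --timeout=120s
    # The address in the gateway status is the ClusterIP of the generated Envoy Service.
    kubectl get gateway example-gateway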
  4. Deploy the sleep application.

    # =============================================================
    # sleep.yaml
    # =============================================================
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: sleep
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: sleep
      labels:
        app: sleep
        service: sleep
    spec:
      ports:
      - port: 80
        name: http
      selector:
        app: sleep
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: sleep
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: sleep
      template:
        metadata:
          labels:
            app: sleep
        spec:
          terminationGracePeriodSeconds: 0
          serviceAccountName: sleep
          containers:
          - name: sleep
            image: registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/curl:asm-sleep
            command: ["/bin/sleep", "infinity"]
            imagePullPolicy: IfNotPresent
            volumeMounts:
            - mountPath: /etc/sleep/tls
              name: secret-volume
          volumes:
          - name: secret-volume
            secret:
              secretName: sleep-secret
              optional: true
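
    For example, apply the manifest and wait for the test client to become available:

    kubectl apply -f sleep.yaml
    kubectl wait --for=condition=Available deployment/sleep --timeout=120s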
  5. Verify traffic mirroring.

    1. Obtain the gateway address.

      export GATEWAY_ADDRESS=$(kubectl get gateway/example-gateway -o jsonpath='{.status.addresses[0].value}')
    2. Send a test request.

      kubectl exec deployment/sleep -it -- curl -X POST ${GATEWAY_ADDRESS}/v1/chat/completions -H 'Content-Type: application/json' -H "host: example.com" -d '{
          "model": "/model/llama2",
          "max_completion_tokens": 100,
          "temperature": 0,
          "messages": [
            {
              "role": "user",
              "content": "introduce yourself"
            }
          ]
      }'

      Expected output:

      {"id":"chatcmpl-eb67bf29-1f87-4e29-8c3e-a83f3c74cd87","object":"chat.completion","created":1745207283,"model":"/model/llama2","choices":[{"index":0,"message":{"role":"assistant","content":"\n         [INST] I'm a [/INST]\n\n         [INST] I'm a [/INST]\n\n         [INST] I'm a [/INST]\n\n         [INST] I'm a [/INST]\n\n         [INST] I'm a [/INST]\n\n         [INST] I'm a [/INST]\n\n         [INST] I'm a [/INST]\n\n        ","tool_calls":[]},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":15,"total_tokens":115,"completion_tokens":100,"prompt_tokens_details":null},"prompt_logprobs":null}%
    3. Check the application logs.

      echo "original logs↓↓↓" && kubectl logs deployments/vllm-llama2-7b-pool | grep /v1/chat/completions | grep OK
      echo "mirror logs↓↓↓" && kubectl logs deployments/vllm-llama2-7b-pool-1 | grep /v1/chat/completions | grep OK

      Expected output:

      original logs↓↓↓
      INFO:     10.2.14.146:39478 - "POST /v1/chat/completions HTTP/1.1" 200 OK
      INFO:     10.2.14.146:60660 - "POST /v1/chat/completions HTTP/1.1" 200 OK
      mirror logs↓↓↓
      INFO:     10.2.14.146:39742 - "POST /v1/chat/completions HTTP/1.1" 200 OK
      INFO:     10.2.14.146:59976 - "POST /v1/chat/completions HTTP/1.1" 200 OK

      The output shows that requests are routed to both vllm-llama2-7b-pool and vllm-llama2-7b-pool-1, which indicates that traffic mirroring works as expected.
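
      If you send several requests, you can also compare the request counts in the two logs; they should grow at the same rate because every production request is mirrored. The following check is optional and not part of the procedure above.

      # Count chat completion requests in the production and mirror services.
      kubectl logs deployment/vllm-llama2-7b-pool | grep -c "POST /v1/chat/completions"
      kubectl logs deployment/vllm-llama2-7b-pool-1 | grep -c "POST /v1/chat/completions"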