Container Compute Service: Use ACK Gateway with Inference Extension to implement traffic mirroring for inference services

Last Updated: Jul 28, 2025

The ACK Gateway with Inference Extension component supports traffic mirroring for inference requests while providing intelligent load balancing for inference services. When you deploy a new inference model in a production environment, you can mirror production traffic to the new model to verify that its performance and stability meet requirements before you officially release it. This topic describes how to use ACK Gateway with Inference Extension to implement traffic mirroring for inference requests.

Important

Before reading this topic, make sure you understand the concepts of InferencePool and InferenceModel.

Prerequisites

Note

For the image used in this topic, we recommend that you use A10 cards for ACK clusters and GN8IS cards for Alibaba Cloud Container Compute Service (ACS) GPU computing power.

Due to the large size of the LLM image, we recommend that you transfer it to Container Registry in advance and pull it using the internal network address. The speed of pulling from the public network depends on the bandwidth configuration of the cluster elastic IP address (EIP), which may result in longer wait times.
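
A minimal sketch of staging the image in your own Container Registry repository with the docker CLI; the registry address, namespace, and repository name below are placeholders, not values from this topic:

    # Pull the public demo image, retag it for your own repository, and push it.
    # <your-registry> and <your-namespace> are placeholders that you must replace.
    docker pull registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/llama2-with-lora:v0.2
    docker tag registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/llama2-with-lora:v0.2 <your-registry>/<your-namespace>/llama2-with-lora:v0.2
    docker push <your-registry>/<your-namespace>/llama2-with-lora:v0.2

After the push completes, change the image field in the Deployment YAML in this topic to the internal network address of your repository.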

Workflow

This example deploys the following resources:

  • Two inference services: vllm-llama2-7b-pool and vllm-llama2-7b-pool-1 (APP and APP1 in the following figure).

  • A gateway that is exposed through a ClusterIP Service.

  • An HTTPRoute that configures specific traffic forwarding and mirroring rules.

  • An InferencePool and a corresponding InferenceModel to enable intelligent load balancing for APP, and a regular Service for APP1. Intelligent load balancing is currently not supported for mirrored traffic, so APP1 must be exposed through a regular Service.

  • A Sleep application as a test client.

The traffic mirroring process is as follows.

  • When the client accesses the gateway, the HTTPRoute identifies production traffic based on prefix matching rules.

  • After the rule matches successfully:

    • Production traffic is forwarded to the corresponding InferencePool and then, after intelligent load balancing, to the backend APP.

    • The RequestMirror filter in the rule sends a copy of the traffic to the specified Service, which then forwards it to the backend APP1.

  • Both APP and APP1 return responses, but the gateway processes only the response returned from the InferencePool and ignores the response from the mirrored Service. The client sees only the processing result of the main service.

Procedure

  1. Deploy the sample inference services vllm-llama2-7b-pool and vllm-llama2-7b-pool-1.

    This step provides the YAML file only for vllm-llama2-7b-pool. The configuration of vllm-llama2-7b-pool-1 is identical except for the name. Modify the corresponding fields in the following YAML file when you deploy the vllm-llama2-7b-pool-1 inference service; the sketch after the YAML shows one way to do this.


    # =============================================================
    # inference_app.yaml
    # =============================================================
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: chat-template
    data:
      llama-2-chat.jinja: |
        {% if messages[0]['role'] == 'system' %}
          {% set system_message = '<<SYS>>\n' + messages[0]['content'] | trim + '\n<</SYS>>\n\n' %}
          {% set messages = messages[1:] %}
        {% else %}
            {% set system_message = '' %}
        {% endif %}
    
        {% for message in messages %}
            {% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}
                {{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}
            {% endif %}
    
            {% if loop.index0 == 0 %}
                {% set content = system_message + message['content'] %}
            {% else %}
                {% set content = message['content'] %}
            {% endif %}
            {% if message['role'] == 'user' %}
                {{ bos_token + '[INST] ' + content | trim + ' [/INST]' }}
            {% elif message['role'] == 'assistant' %}
                {{ ' ' + content | trim + ' ' + eos_token }}
            {% endif %}
        {% endfor %}
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: vllm-llama2-7b-pool
      namespace: default
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: vllm-llama2-7b-pool
      template:
        metadata:
          annotations:
            prometheus.io/path: /metrics
            prometheus.io/port: '8000'
            prometheus.io/scrape: 'true'
          labels:
            app: vllm-llama2-7b-pool
        spec:
          containers:
            - name: lora
              image: "registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/llama2-with-lora:v0.2"
              imagePullPolicy: IfNotPresent
              command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
              args:
              - "--model"
              - "/model/llama2"
              - "--tensor-parallel-size"
              - "1"
              - "--port"
              - "8000"
              - '--gpu_memory_utilization'
              - '0.8'
              - "--enable-lora"
              - "--max-loras"
              - "4"
              - "--max-cpu-loras"
              - "12"
              - "--lora-modules"
              - 'sql-lora=/adapters/yard1/llama-2-7b-sql-lora-test_0'
              - 'sql-lora-1=/adapters/yard1/llama-2-7b-sql-lora-test_1'
              - 'sql-lora-2=/adapters/yard1/llama-2-7b-sql-lora-test_2'
              - 'sql-lora-3=/adapters/yard1/llama-2-7b-sql-lora-test_3'
              - 'sql-lora-4=/adapters/yard1/llama-2-7b-sql-lora-test_4'
              - 'tweet-summary=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_0'
              - 'tweet-summary-1=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_1'
              - 'tweet-summary-2=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_2'
              - 'tweet-summary-3=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_3'
              - 'tweet-summary-4=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_4'
              - '--chat-template'
              - '/etc/vllm/llama-2-chat.jinja'
              env:
                - name: PORT
                  value: "8000"
              ports:
                - containerPort: 8000
                  name: http
                  protocol: TCP
              livenessProbe:
                failureThreshold: 2400
                httpGet:
                  path: /health
                  port: http
                  scheme: HTTP
                initialDelaySeconds: 5
                periodSeconds: 5
                successThreshold: 1
                timeoutSeconds: 1
              readinessProbe:
                failureThreshold: 6000
                httpGet:
                  path: /health
                  port: http
                  scheme: HTTP
                initialDelaySeconds: 5
                periodSeconds: 5
                successThreshold: 1
                timeoutSeconds: 1
              resources:
                limits:
                  nvidia.com/gpu: 1
                requests:
                  nvidia.com/gpu: 1
              volumeMounts:
                - mountPath: /data
                  name: data
                - mountPath: /dev/shm
                  name: shm
                - mountPath: /etc/vllm
                  name: chat-template
          restartPolicy: Always
          schedulerName: default-scheduler
          terminationGracePeriodSeconds: 30
          volumes:
            - name: data
              emptyDir: {}
            - name: shm
              emptyDir:
                medium: Memory
            - name: chat-template
              configMap:
                name: chat-template
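
    The following sketch shows one way to generate the manifest for vllm-llama2-7b-pool-1 and deploy both services. The file names and the sed substitution are illustrative assumptions, not part of the official procedure; you can also copy the YAML and edit the fields manually.

    # Save the YAML above as inference_app.yaml, then derive the second manifest by
    # renaming every occurrence of the workload name (labels and selectors included).
    # The chat-template ConfigMap is unchanged and is applied twice, which is harmless.
    sed 's/vllm-llama2-7b-pool/vllm-llama2-7b-pool-1/g' inference_app.yaml > inference_app_1.yaml
    kubectl apply -f inference_app.yaml -f inference_app_1.yaml
    # Wait until both Deployments are available. Model startup can take a long time.
    kubectl wait --for=condition=Available deployment/vllm-llama2-7b-pool deployment/vllm-llama2-7b-pool-1 --timeout=30m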
  2. Deploy the InferencePool and InferenceModel, and the Service for the vllm-llama2-7b-pool-1 application.

    # =============================================================
    # inference_rules.yaml
    # =============================================================
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      name: vllm-llama2-7b-pool
    spec:
      targetPortNumber: 8000
      selector:
        app: vllm-llama2-7b-pool
      extensionRef:
        name: inference-gateway-ext-proc
    ---
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceModel
    metadata:
      name: inferencemodel-sample
    spec:
      modelName: /model/llama2
      criticality: Critical
      poolRef:
        group: inference.networking.x-k8s.io
        kind: InferencePool
        name: vllm-llama2-7b-pool
      targetModels:
      - name: /model/llama2
        weight: 100
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: vllm-llama2-7b-pool-1
    spec:
      selector:
        app: vllm-llama2-7b-pool-1
      ports:
      - protocol: TCP
        port: 8000
        targetPort: 8000
      type: ClusterIP
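
    For example, you can apply the manifest and confirm that the resources exist. The file name inference_rules.yaml comes from the comment above; the fully qualified resource names are used because short names may vary between component versions.

    kubectl apply -f inference_rules.yaml
    kubectl get inferencepools.inference.networking.x-k8s.io vllm-llama2-7b-pool
    kubectl get inferencemodels.inference.networking.x-k8s.io inferencemodel-sample
    kubectl get service vllm-llama2-7b-pool-1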
  3. Deploy the Gateway and HTTPRoute.

    The Gateway is exposed through a ClusterIP Service, which can be accessed only from within the cluster. You can change the Service type to LoadBalancer as needed.
    # =============================================================
    # gateway.yaml
    # =============================================================
    kind: GatewayClass
    apiVersion: gateway.networking.k8s.io/v1
    metadata:
      name: example-gateway-class
      labels:
        example: http-routing
    spec:
      controllerName: gateway.envoyproxy.io/gatewayclass-controller
    ---
    apiVersion: gateway.networking.k8s.io/v1
    kind: Gateway
    metadata:
      labels:
        example: http-routing
      name: example-gateway
      namespace: default
    spec:
      gatewayClassName: example-gateway-class
      infrastructure:
        parametersRef:
          group: gateway.envoyproxy.io
          kind: EnvoyProxy
          name: custom-proxy-config
      listeners:
      - allowedRoutes:
          namespaces:
            from: Same
        name: http
        port: 80
        protocol: HTTP
    ---
    apiVersion: gateway.envoyproxy.io/v1alpha1
    kind: EnvoyProxy
    metadata:
      name: custom-proxy-config
      namespace: default
    spec:
      provider:
        type: Kubernetes
        kubernetes:
          envoyService:
            type: ClusterIP
    ---
    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: mirror-route
      labels:
        example: http-routing
    spec:
      parentRefs:
        - name: example-gateway
      hostnames:
        - "example.com"
      rules:
        - matches:
            - path:
                type: PathPrefix
                value: /
          backendRefs:
          - group: inference.networking.x-k8s.io
            kind: InferencePool
            name: vllm-llama2-7b-pool
            weight: 1
          filters:
          - type: RequestMirror
            requestMirror:
              backendRef:
                kind: Service
                name: vllm-llama2-7b-pool-1
                port: 8000
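
    As an optional check, apply the manifest and wait for the gateway to be programmed. The condition name follows the Gateway API specification; the wait timeout is an arbitrary value for this sketch.

    kubectl apply -f gateway.yaml
    kubectl wait --for=condition=Programmed gateway/example-gateway --timeout=120s
    # The address in the gateway status is the ClusterIP of the generated Envoy Service.
    kubectl get gateway example-gateway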
  4. Deploy the sleep application.

    # =============================================================
    # sleep.yaml
    # =============================================================
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: sleep
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: sleep
      labels:
        app: sleep
        service: sleep
    spec:
      ports:
      - port: 80
        name: http
      selector:
        app: sleep
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: sleep
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: sleep
      template:
        metadata:
          labels:
            app: sleep
        spec:
          terminationGracePeriodSeconds: 0
          serviceAccountName: sleep
          containers:
          - name: sleep
            image: registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/curl:asm-sleep
            command: ["/bin/sleep", "infinity"]
            imagePullPolicy: IfNotPresent
            volumeMounts:
            - mountPath: /etc/sleep/tls
              name: secret-volume
          volumes:
          - name: secret-volume
            secret:
              secretName: sleep-secret
              optional: true
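
    For example, apply the manifest and wait for the test client to become available:

    kubectl apply -f sleep.yaml
    kubectl wait --for=condition=Available deployment/sleep --timeout=120s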
  5. Verify traffic mirroring.

    1. Obtain the gateway address.

      export GATEWAY_ADDRESS=$(kubectl get gateway/example-gateway -o jsonpath='{.status.addresses[0].value}')
    2. Send a test request.

      kubectl exec deployment/sleep -it -- curl -X POST ${GATEWAY_ADDRESS}/v1/chat/completions -H 'Content-Type: application/json' -H "host: example.com" -d '{
          "model": "/model/llama2",
          "max_completion_tokens": 100,
          "temperature": 0,
          "messages": [
            {
              "role": "user",
              "content": "introduce yourself"
            }
          ]
      }'

      Expected output:

      {"id":"chatcmpl-eb67bf29-1f87-4e29-8c3e-a83f3c74cd87","object":"chat.completion","created":1745207283,"model":"/model/llama2","choices":[{"index":0,"message":{"role":"assistant","content":"\n         [INST] I'm a [/INST]\n\n         [INST] I'm a [/INST]\n\n         [INST] I'm a [/INST]\n\n         [INST] I'm a [/INST]\n\n         [INST] I'm a [/INST]\n\n         [INST] I'm a [/INST]\n\n         [INST] I'm a [/INST]\n\n        ","tool_calls":[]},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":15,"total_tokens":115,"completion_tokens":100,"prompt_tokens_details":null},"prompt_logprobs":null}%
    3. Check the application logs.

      echo "original logs↓↓↓" && kubectl logs deployments/vllm-llama2-7b-pool | grep /v1/chat/completions | grep OK
      echo "mirror logs↓↓↓" && kubectl logs deployments/vllm-llama2-7b-pool-1 | grep /v1/chat/completions | grep OK

      Expected output:

      original logs↓↓↓
      INFO:     10.2.14.146:39478 - "POST /v1/chat/completions HTTP/1.1" 200 OK
      INFO:     10.2.14.146:60660 - "POST /v1/chat/completions HTTP/1.1" 200 OK
      mirror logs↓↓↓
      INFO:     10.2.14.146:39742 - "POST /v1/chat/completions HTTP/1.1" 200 OK
      INFO:     10.2.14.146:59976 - "POST /v1/chat/completions HTTP/1.1" 200 OK

      The output shows that requests are routed to both vllm-llama2-7b-pool and vllm-llama2-7b-pool-1, which indicates that traffic mirroring works as expected.
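
      If you send several requests, you can also compare the request counts in the two logs; they should grow at the same rate because every production request is mirrored. The following check is optional and not part of the procedure above.

      # Count chat completion requests in the production and mirror services.
      kubectl logs deployment/vllm-llama2-7b-pool | grep -c "POST /v1/chat/completions"
      kubectl logs deployment/vllm-llama2-7b-pool-1 | grep -c "POST /v1/chat/completions"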