Container Compute Service: Implementing prefix-aware load balancing by using Gateway with Inference Extension

Last Updated: Jul 28, 2025

With the Gateway with Inference Extension component, you can specify different load balancing strategies for inference routing based on the usage scenario of your generative AI inference service. This topic describes how to implement a prefix-aware load balancing strategy by using the Gateway with Inference Extension component.

Important
  • Before reading this topic, make sure you understand the concepts of InferencePool and InferenceModel.

  • This topic requires Gateway with Inference Extension 1.4.0 or later.

Background information

APC of vLLM

vLLM supports automatic prefix caching (APC). APC caches the key-value (KV) cache that vLLM has already computed for earlier requests. If a new request shares a prefix with a historical request, it can directly reuse the existing KV cache and skip the KV cache computation for the shared prefix, which accelerates the processing of LLM inference requests.
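
For reference, APC is enabled when the vLLM server is launched. The following is a minimal sketch of starting a vLLM OpenAI-compatible server with APC enabled; the model path and served model name are taken from the Deployment in Step 1, which already passes this flag.

    # Enable automatic prefix caching when launching vLLM.
    # vLLM accepts the flag as --enable-prefix-caching or --enable_prefix_caching.
    vllm serve /models/Qwen-2.5-7B-Instruct \
      --port 8000 \
      --enable-prefix-caching \
      --served-model-name /model/qwen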

Prefix-aware load balancing strategy

A prefix-aware load balancing strategy sends requests that share the same prefix content to the same inference server pod whenever possible.

When APC is enabled on the model server, a prefix-aware load balancing strategy can maximize the cache hit ratio and reduce request response time. The strategy is best suited to workloads in which many requests share prefixes, so evaluate it against your actual business scenario.

Typical usage scenarios:

  • Long document queries: Users repeatedly query the same long document (such as software manuals or annual reports) with different queries.

  • Multi-round conversations: Users may interact with the application multiple times in the same session.
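
As a minimal illustration of the long document query scenario (hypothetical payloads, assuming ${GATEWAY_IP} points to the gateway that you deploy later in this topic): the two requests below repeat the same long document and differ only in the final question, so a prefix-aware router should send both to the same backend pod, and the second request can reuse the KV cache built for the first.

    # DOC is a placeholder for a long, identical document included in both requests.
    DOC="<contents of a long software manual>"
    # First question about the document.
    curl -s ${GATEWAY_IP}:8081/v1/chat/completions -H 'Content-Type: application/json' \
      -d '{"model":"/model/qwen","messages":[{"role":"user","content":"'"${DOC}"' Question: how do I install it?"}]}'
    # Second question; everything before the question is an identical prefix.
    curl -s ${GATEWAY_IP}:8081/v1/chat/completions -H 'Content-Type: application/json' \
      -d '{"model":"/model/qwen","messages":[{"role":"user","content":"'"${DOC}"' Question: how do I upgrade it?"}]}'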

Prerequisites

Note

For the image used in this topic, we recommend A10 GPUs for ACK clusters and the GN8IS GPU model series for Alibaba Cloud Container Compute Service (ACS) GPU compute power.

Because the LLM image is large, we recommend that you transfer it to Container Registry in advance and pull it over the internal network address. The speed of pulling over the public network depends on the bandwidth configuration of the cluster's elastic IP address (EIP), which may result in long wait times.
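
A minimal sketch of transferring the image, assuming a hypothetical Container Registry repository address (replace the region and namespace placeholders with your own, and use the internal/VPC endpoint of your registry when you set the image field in the YAML below):

    # Pull the public image, retag it to your own Container Registry repository, and push it.
    # registry.<your-region>.aliyuncs.com/<your-namespace> is a placeholder, not a real address.
    docker pull registry-cn-hangzhou.ack.aliyuncs.com/dev/qwen-2.5-7b-instruct-lora:v0.1
    docker tag registry-cn-hangzhou.ack.aliyuncs.com/dev/qwen-2.5-7b-instruct-lora:v0.1 \
      registry.<your-region>.aliyuncs.com/<your-namespace>/qwen-2.5-7b-instruct-lora:v0.1
    docker push registry.<your-region>.aliyuncs.com/<your-namespace>/qwen-2.5-7b-instruct-lora:v0.1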

Procedure

Step 1: Deploy a sample inference service

  1. Create vllm-service.yaml.

    View the YAML content

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        app: qwen
      name: qwen
    spec:
      progressDeadlineSeconds: 600
      replicas: 5
      selector:
        matchLabels:
          app: qwen
      template:
        metadata:
          annotations:
            prometheus.io/path: /metrics
            prometheus.io/port: "8000"
            prometheus.io/scrape: "true"
          labels:
            app: qwen
            alibabacloud.com/compute-class: gpu
            alibabacloud.com/compute-qos: default
            alibabacloud.com/gpu-model-series: GN8IS
        spec:
          containers:
            - command:
                - sh
                - -c
                - vllm serve /models/Qwen-2.5-7B-Instruct --port 8000 --enable_prefix_caching --trust-remote-code --served-model-name /model/qwen --gpu-memory-utilization 0.95 --enforce-eager --enable-lora --max-loras 2 --max-cpu-loras 4 --lora-modules travel-helper-v1=/models/Qwen-TravelHelper-Lora travel-helper-v2=/models/Qwen-TravelHelper-Lora-v2
              image: registry-cn-hangzhou.ack.aliyuncs.com/dev/qwen-2.5-7b-instruct-lora:v0.1
              imagePullPolicy: IfNotPresent
              name: custom-serving
              ports:
                - containerPort: 8000
                  name: http
                  protocol: TCP
              readinessProbe:
                failureThreshold: 3
                initialDelaySeconds: 30
                periodSeconds: 30
                successThreshold: 1
                tcpSocket:
                  port: 8000
                timeoutSeconds: 1
              resources:
                limits:
                  nvidia.com/gpu: "1"
                  cpu: "8"
                  memory: 30G
              terminationMessagePath: /dev/termination-log
              terminationMessagePolicy: File
              volumeMounts:
                - mountPath: /dev/shm
                  name: dshm
          restartPolicy: Always
          volumes:
            - emptyDir:
                medium: Memory
                sizeLimit: 30Gi
              name: dshm
    ---
    apiVersion: v1
    kind: Service
    metadata:
      labels:
        app: qwen
      name: qwen
    spec:
      ports:
        - name: http-serving
          port: 8000
          protocol: TCP
          targetPort: 8000
      selector:
        app: qwen
  2. Deploy the sample inference service.

    kubectl apply -f vllm-service.yaml
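
    You can optionally wait for all replicas to become ready before continuing. Pulling the image and loading the model can take several minutes.

    kubectl rollout status deployment/qwen
    kubectl get pods -l app=qwen -o wide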

Step 2: Deploy inference routing

In this step, you create InferencePool and InferenceModel resources.

  1. Create inference-pool.yaml.

    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      annotations:
        inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
      name: vllm-qwen-pool
    spec:
      targetPortNumber: 8000
      selector:
        app: qwen
      extensionRef:
        name: inference-gateway-ext-proc
    ---
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceModel
    metadata:
      name: inferencemodel-qwen
    spec:
      modelName: /model/qwen
      criticality: Critical
      poolRef:
        group: inference.networking.x-k8s.io
        kind: InferencePool
        name: vllm-qwen-pool
      targetModels:
      - name: /model/qwen
        weight: 100

    In the InferencePool resource, the inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE" annotation enables the prefix-aware load balancing strategy for the pods selected by the pool.

  2. Deploy the inference routing.

    kubectl apply -f inference-pool.yaml
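
    Optionally verify that the two resources were created:

    kubectl get inferencepool vllm-qwen-pool
    kubectl get inferencemodel inferencemodel-qwen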

Step 3: Deploy the gateway and gateway routing rules

In this step, you create a gateway with ports 8080 and 8081. On port 8081 of the gateway, an HTTPRoute resource specifies the InferencePool resource provided by the inference extension as the gateway routing backend. Inference requests will be routed to the pod set specified by the InferencePool resource. On port 8080 of the gateway, an HTTPRoute resource specifies the Service as the gateway routing backend. Inference requests will be routed to the same pod set using the standard HTTP least request load balancing strategy.

  1. Create inference-gateway.yaml.

    apiVersion: gateway.networking.k8s.io/v1
    kind: GatewayClass
    metadata:
      name: qwen-inference-gateway-class
    spec:
      controllerName: gateway.envoyproxy.io/gatewayclass-controller
    ---
    apiVersion: gateway.networking.k8s.io/v1
    kind: Gateway
    metadata:
      name: qwen-inference-gateway
    spec:
      gatewayClassName: qwen-inference-gateway-class
      listeners:
        - name: http
          protocol: HTTP
          port: 8080
        - name: llm-gw
          protocol: HTTP
          port: 8081
    ---
    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: qwen-backend
    spec:
      parentRefs:
        - name: qwen-inference-gateway
          sectionName: llm-gw
      rules:
        - backendRefs:
            - group: inference.networking.x-k8s.io
              kind: InferencePool
              name: vllm-qwen-pool
          matches:
            - path:
                type: PathPrefix
                value: /
    ---
    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: qwen-backend-no-inference
    spec:
      parentRefs:
      - group: gateway.networking.k8s.io
        kind: Gateway
        name: qwen-inference-gateway
        sectionName: http
      rules:
      - backendRefs:
        - group: ""
          kind: Service
          name: qwen
          port: 8000
          weight: 1
        matches:
        - path:
            type: PathPrefix
            value: /
    ---
    apiVersion: gateway.envoyproxy.io/v1alpha1
    kind: BackendTrafficPolicy
    metadata:
      name: backend-timeout
    spec:
      timeout:
        http:
          requestTimeout: 1h
      targetRef:
        group: gateway.networking.k8s.io
        kind: Gateway
        name: qwen-inference-gateway
  2. Deploy the gateway.

    kubectl apply -f inference-gateway.yaml
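
    Optionally wait until the gateway is programmed and an address has been assigned. Programmed is a standard Gateway API status condition, so a generic kubectl wait works here.

    kubectl wait --for=condition=Programmed gateway/qwen-inference-gateway --timeout=5m
    kubectl get gateway qwen-inference-gateway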

Step 4: Verify the routing rules

  1. Create round1.txt and round2.txt. The two request bodies share the same prefix content. By sending round1.txt and round2.txt as LLM request bodies in sequence and then checking the logs of the extension component referenced by the extensionRef field of the InferencePool, you can check whether the prefix-aware feature of intelligent routing is triggered.

    round1.txt:

    echo '{"max_tokens":24,"messages":[{"content":"Hi, here'\''s some system prompt: hi hi hi hi hi hi hi hi hi hi.For user 3, here are some other context: hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi.I would like to test your intelligence. for this purpose I would like you to play zork. you can interact with the game by typing in commands. I will forward these commands to the game and type in any response. are you ready?","role":"user"}],"model":"/model/qwen","stream":true,"stream_options":{"include_usage":true},"temperature":0}' > round1.txt

    round2.txt:

    echo '{"max_tokens":3,"messages":[{"content":"Hi, here'\''s some system prompt: hi hi hi hi hi hi hi hi hi hi.For user 3, here are some other context: hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi.I would like to test your intelligence. for this purpose I would like you to play zork. you can interact with the game by typing in commands. I will forward these commands to the game and type in any response. are you ready?","role":"user"},{"content":"Hi there! It looks like you're setting up a fun test. I'm ready to play Zork! You can","role":"assistant"},{"content":"% zork\nWelcome to Dungeon. This version created 11-MAR-91.\nYou are in an open field west of a big white house with a boarded\nfront door.\nThere is a small mailbox here.\n>","role":"user"},{"content":"Great!","role":"assistant"},{"content":"Opening the mailbox reveals:\n A leaflet.\n>","role":"user"}],"model":"/model/qwen","stream":true,"stream_options":{"include_usage":true},"temperature":0}' > round2.txt
  2. Get the public IP address of the gateway.

    export GATEWAY_IP=$(kubectl get gateway/qwen-inference-gateway -o jsonpath='{.status.addresses[0].value}')
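
    The address may take a short time to be populated after the gateway is created. Confirm that the variable is not empty before sending requests:

    echo ${GATEWAY_IP}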
  3. Initiate two session requests to simulate a multi-round conversation scenario.

    curl -X POST ${GATEWAY_IP}:8081/v1/chat/completions -H 'Content-Type: application/json' -d @./round1.txt
    curl -X POST ${GATEWAY_IP}:8081/v1/chat/completions -H 'Content-Type: application/json' -d @./round2.txt
  4. View the logs to confirm whether the prefix load balancing is effective.

    kubectl logs deploy/epp-default-inference-gateway-ext-proc -n envoy-gateway-system|grep "Do prefix"

    Expected output:

    2025-05-23T03:33:09Z    INFO    scheduling/prefixcache_filter.go:311    Do prefix-aware routing!        {"request": "v68m4zx472", "matching ratio": " 0.54 > 0.50"}

    The Do prefix-aware routing! message in the log indicates that prefix-aware load balancing has taken effect.
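
    Optionally, you can also inspect vLLM's own prefix cache metrics on one of the backend pods. The exact metric names vary by vLLM version, so the following sketch greps broadly for prefix_cache in the Prometheus output.

    # Forward a local port to one qwen pod and inspect the vLLM metrics endpoint.
    kubectl port-forward deploy/qwen 8000:8000 &
    curl -s localhost:8000/metrics | grep -i prefix_cache
    kill %1   # stop the port-forward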

(Optional) Step 5: Evaluate inference service performance through multi-round conversation testing

This step demonstrates how to use a stress testing tool for multi-round conversation testing to compare the effects of regular HTTP routing and inference routing with prefix load balancing. In this example, an ACK cluster is used.

  1. Deploy the llm-qa-benchmark stress testing tool.

    kubectl apply -f- <<EOF
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        app: llm-qa-benchmark
      name: llm-qa-benchmark
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: llm-qa-benchmark
      template:
        metadata:
          labels:
            app: llm-qa-benchmark
        spec:
          containers:
          - command:
            - sh
            - -c
            - sleep inf
            image: registry-cn-hangzhou.ack.aliyuncs.com/dev/llm-qa-benchmark:v0.1
            imagePullPolicy: IfNotPresent
            name: llm-qa-benchmark
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
          restartPolicy: Always
    EOF
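
    Optionally wait for the benchmark pod to become ready:

    kubectl rollout status deployment/llm-qa-benchmark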
  2. Get the internal IP address of the gateway.

    export GW_IP=$(kubectl get svc -n envoy-gateway-system -l gateway.envoyproxy.io/owning-gateway-namespace=default,gateway.envoyproxy.io/owning-gateway-name=qwen-inference-gateway -o jsonpath='{.items[0].spec.clusterIP}')
  3. Perform the stress test.

    Important

    The following test results are generated from a test environment. Actual results may vary depending on your specific environment.

    Regular HTTP routing

    kubectl exec -it deploy/llm-qa-benchmark -- env GW_IP=${GW_IP} python3 multi-round-qa.py \
        --num-users 8 \
        --num-rounds 15 \
        --qps 0.1 \
        --shared-system-prompt 100 \
        --sharegpt \
        --user-history-prompt 2000 \
        --answer-len 100 \
        --model /model/qwen \
        --time 600 \
        --base-url http://${GW_IP}:8080/v1

    Expected output:

    ==================== Performance summary ======================
      QPS: 0.1000 reqs/s
    
      Processing speed: 0.1080 reqs/s
    
      Requests on-the-fly: 0
    
      Input tokens per second: 259.0703 tokens/s
    
      Output tokens per second: 4.8576 tokens/s
    
      Average generation throughput (per request): 26.6710 tokens/req/s
    
      Average TTFT: 0.3669s
    
    Time range: 1748231183.2753935 - 1748231766.4799275 (583.20s)
    ===============================================================

    Inference service routing

     kubectl exec -it deploy/llm-qa-benchmark -- env GW_IP=${GW_IP} python3 multi-round-qa.py \
        --num-users 8 \
        --num-rounds 15 \
        --qps 0.1 \
        --shared-system-prompt 100 \
        --sharegpt \
        --user-history-prompt 2000 \
        --answer-len 100 \
        --model /model/qwen \
        --time 600 \
        --base-url http://${GW_IP}:8081/v1

    Expected output:

    ==================== Performance summary ======================
      QPS: 0.1000 reqs/s
    
      Processing speed: 0.1081 reqs/s
    
      Requests on-the-fly: 0
    
      Input tokens per second: 259.3009 tokens/s
    
      Output tokens per second: 4.8548 tokens/s
    
      Average generation throughput (per request): 26.9300 tokens/req/s
    
      Average TTFT: 0.2761s
    
    Time range: 1748231885.874972 - 1748232468.5918882 (582.72s)
    ===============================================================

    The Average TTFT of inference service routing (0.2761s) is significantly lower than that of regular HTTP routing (0.3669s), which shows the benefit of the prefix-aware load balancing strategy in this multi-round conversation scenario.