Container Service for Kubernetes: Prefix-aware routing with estimation mode

Last Updated: Mar 26, 2026

The Gateway with Inference Extension component supports multiple load balancing policies for LLM inference services. This topic shows how to enable prefix-aware routing with estimation mode — a policy that routes requests sharing the same prompt prefix to the same inference server pod, increasing the key-value (KV) cache hit rate and reducing time to first token (TTFT).

Important

Before you begin, make sure you understand InferencePool and InferenceModel. This topic applies only to Gateway with Inference Extension version 1.4.0 or later.

How it works

Automatic prefix caching in vLLM

vLLM supports automatic prefix caching (APC). When a request is processed, vLLM saves the computed KV cache for that request. If a subsequent request shares the same prefix, vLLM reuses the cached result instead of recomputing it, which accelerates LLM inference.
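
APC is enabled on each model server through a vLLM startup flag. The snippet below is a minimal excerpt of the command used by the sample Deployment in Step 1 (the full argument list is shown there):

    # Start vLLM with automatic prefix caching enabled.
    # Model path, port, and served model name match the sample Deployment in Step 1.
    vllm serve /models/Qwen-2.5-7B-Instruct \
      --port 8000 \
      --enable_prefix_caching \
      --served-model-name /model/qwen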

Prefix-aware routing with estimation mode

Prefix-aware routing with estimation mode is a load balancing policy that routes requests with the same prefix to the same inference server pod whenever possible.

What "estimation mode" means: The gateway does not query each inference server's cache state directly. Instead, it tracks the requests sent to each server and *estimates* which server is most likely to have a given prefix cached. This approach avoids tight coupling to the inference engine's internal cache API while still improving the prefix cache hit ratio.

When APC is enabled on the model server, this policy reduces TTFT by increasing the prefix cache hit ratio. It is best suited for workloads where many requests share the same prefix:

  • Long-document queries: Users ask different questions about the same long document, such as a software manual or an annual report, in the same session.

  • Multi-turn conversations: Users continue interacting with the same application across multiple turns of a conversation.

Prerequisites

Before you begin, ensure that an ACK cluster with GPU nodes is available and that the Gateway with Inference Extension component (version 1.4.0 or later) is installed in the cluster.

Note

For the image used in this topic, we recommend A10 GPUs for ACK clusters and L20 (GN8IS) GPUs for ACS GPU compute power.

Because LLM images are large, transfer the image to Alibaba Cloud Container Registry (ACR) beforehand and pull it over an internal network address. Pulling directly from the public network is slow and is limited by the bandwidth of the cluster's elastic IP addresses (EIPs).
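
For example, assuming you use an ACR instance in the same region as the cluster, you can pull the public image once, push it to your own repository, and reference the internal (VPC) address in the Deployment. The registry domain and namespace below are placeholders; replace them with your own ACR address:

    # Pull the public image, retag it for your ACR repository, and push it.
    # <your-namespace> and the registry-vpc domain are placeholders for your own ACR instance.
    docker pull registry-cn-hangzhou.ack.aliyuncs.com/dev/qwen-2.5-7b-instruct-lora:v0.1
    docker tag registry-cn-hangzhou.ack.aliyuncs.com/dev/qwen-2.5-7b-instruct-lora:v0.1 \
      registry-vpc.cn-hangzhou.aliyuncs.com/<your-namespace>/qwen-2.5-7b-instruct-lora:v0.1
    docker push registry-vpc.cn-hangzhou.aliyuncs.com/<your-namespace>/qwen-2.5-7b-instruct-lora:v0.1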

Enable prefix-aware routing

Step 1: Deploy a sample inference service

  1. Create a file named vllm-service.yaml.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        app: qwen
      name: qwen
    spec:
      replicas: 5
      selector:
        matchLabels:
          app: qwen
      template:
        metadata:
          annotations:
            prometheus.io/path: /metrics
            prometheus.io/port: "8000"
            prometheus.io/scrape: "true"
          labels:
            app: qwen
        spec:
          containers:
          - command:
            - sh
            - -c
            - vllm serve /models/Qwen-2.5-7B-Instruct --port 8000 --enable_prefix_caching --trust-remote-code --served-model-name /model/qwen --max-model-len 8192 --gpu-memory-utilization 0.95 --enforce-eager --enable-lora --max-loras 2 --max-cpu-loras 4 --lora-modules travel-helper-v1=/models/Qwen-TravelHelper-Lora travel-helper-v2=/models/Qwen-TravelHelper-Lora-v2
            image: registry-cn-hangzhou.ack.aliyuncs.com/dev/qwen-2.5-7b-instruct-lora:v0.1
            imagePullPolicy: IfNotPresent
            name: custom-serving
            ports:
            - containerPort: 8000
              name: http
              protocol: TCP
            readinessProbe:
              failureThreshold: 3
              initialDelaySeconds: 30
              periodSeconds: 30
              successThreshold: 1
              tcpSocket:
                port: 8000
              timeoutSeconds: 1
            resources:
              limits:
                nvidia.com/gpu: "1"
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
            volumeMounts:
            - mountPath: /dev/shm
              name: dshm
          dnsPolicy: ClusterFirst
          restartPolicy: Always
          schedulerName: default-scheduler
          securityContext: {}
          terminationGracePeriodSeconds: 30
          volumes:
          - emptyDir:
              medium: Memory
              sizeLimit: 30Gi
            name: dshm
    ---
    apiVersion: v1
    kind: Service
    metadata:
      labels:
        app: qwen
      name: qwen
    spec:
      ports:
      - name: http-serving
        port: 8000
        protocol: TCP
        targetPort: 8000
      selector:
        app: qwen
  2. Deploy the inference service.

    kubectl apply -f vllm-service.yaml
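
    Before continuing, you can confirm that the inference service is ready. Model loading can take several minutes:

    # All 5 replicas should become Running and ready.
    kubectl get pods -l app=qwen
    kubectl rollout status deployment/qwen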

Step 2: Deploy inference routing

Create an InferencePool resource and an InferenceModel resource. The InferencePool annotation enables prefix-aware routing with estimation mode for all pods in the pool.

  1. Create a file named inference-pool.yaml.

    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      annotations:
        inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
      name: vllm-qwen-pool
    spec:
      targetPortNumber: 8000
      selector:
        app: qwen
      extensionRef:
        name: inference-gateway-ext-proc
    ---
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceModel
    metadata:
      name: inferencemodel-qwen
    spec:
      modelName: /model/qwen
      criticality: Critical
      poolRef:
        group: inference.networking.x-k8s.io
        kind: InferencePool
        name: vllm-qwen-pool
      targetModels:
      - name: /model/qwen
        weight: 100

    The annotation inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE" on the InferencePool enables prefix-aware routing with estimation mode for pods in that pool.

  2. Deploy the inference routing resources.

    kubectl apply -f inference-pool.yaml
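
    Optionally, verify that the inference routing resources were created:

    # List the InferencePool and InferenceModel created above.
    kubectl get inferencepool vllm-qwen-pool
    kubectl get inferencemodel inferencemodel-qwen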

Step 3: Deploy the gateway and routing rules

This step creates a gateway with two ports:

  • Port 8081: routes inference requests through the InferencePool using prefix-aware load balancing.

  • Port 8080: routes inference requests using standard HTTP least-request load balancing (used as the baseline in the performance test).

  1. Create a file named inference-gateway.yaml.

    apiVersion: gateway.networking.k8s.io/v1
    kind: GatewayClass
    metadata:
      name: qwen-inference-gateway-class
    spec:
      controllerName: gateway.envoyproxy.io/gatewayclass-controller
    ---
    apiVersion: gateway.networking.k8s.io/v1
    kind: Gateway
    metadata:
      name: qwen-inference-gateway
    spec:
      gatewayClassName: qwen-inference-gateway-class
      listeners:
        - name: http
          protocol: HTTP
          port: 8080
        - name: llm-gw
          protocol: HTTP
          port: 8081
    ---
    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: qwen-backend
    spec:
      parentRefs:
        - name: qwen-inference-gateway
          sectionName: llm-gw
      rules:
        - backendRefs:
            - group: inference.networking.x-k8s.io
              kind: InferencePool
              name: vllm-qwen-pool
          matches:
            - path:
                type: PathPrefix
                value: /
    ---
    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: qwen-backend-no-inference
    spec:
      parentRefs:
      - group: gateway.networking.k8s.io
        kind: Gateway
        name: qwen-inference-gateway
        sectionName: http
      rules:
      - backendRefs:
        - group: ""
          kind: Service
          name: qwen
          port: 8000
          weight: 1
        matches:
        - path:
            type: PathPrefix
            value: /
    ---
    apiVersion: gateway.envoyproxy.io/v1alpha1
    kind: BackendTrafficPolicy
    metadata:
      name: backend-timeout
    spec:
      timeout:
        http:
          requestTimeout: 1h
      targetRef:
        group: gateway.networking.k8s.io
        kind: Gateway
        name: qwen-inference-gateway
  2. Deploy the gateway.

    kubectl apply -f inference-gateway.yaml
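
    Optionally, confirm that the gateway has been programmed and assigned an address before proceeding:

    # The gateway is ready when its PROGRAMMED column is True and an ADDRESS is assigned.
    kubectl get gateway qwen-inference-gateway
    kubectl get httproute qwen-backend qwen-backend-no-inference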

Step 4: Verify the routing rule

Send two requests that share the same prefix content and check the extension logs to confirm prefix-aware routing is active.

  1. Create round1.txt and round2.txt. Both files use the same message prefix.

    echo '{"max_tokens":24,"messages":[{"content":"Hi, here'\''s some system prompt: hi hi hi hi hi hi hi hi hi hi.For user 3, here are some other context: hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi.I would like to test your intelligence. for this purpose I would like you to play zork. you can interact with the game by typing in commands. I will forward these commands to the game and type in any response. are you ready?","role":"user"}],"model":"/model/qwen","stream":true,"stream_options":{"include_usage":true},"temperature":0}' > round1.txt
    echo '{"max_tokens":3,"messages":[{"content":"Hi, here'\''s some system prompt: hi hi hi hi hi hi hi hi hi hi.For user 3, here are some other context: hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi.I would like to test your intelligence. for this purpose I would like you to play zork. you can interact with the game by typing in commands. I will forward these commands to the game and type in any response. are you ready?","role":"user"},{"content":"Hi there! It looks like you'\''re setting up a fun test. I'\''m ready to play Zork! You can","role":"assistant"},{"content":"% zork\nWelcome to Dungeon. This version created 11-MAR-91.\nYou are in an open field west of a big white house with a boarded\nfront door.\nThere is a small mailbox here.\n>","role":"user"},{"content":"Great!","role":"assistant"},{"content":"Opening the mailbox reveals:\n A leaflet.\n>","role":"user"}],"model":"/model/qwen","stream":true,"stream_options":{"include_usage":true},"temperature":0}' > round2.txt
  2. Get the gateway's public IP address.

    export GATEWAY_IP=$(kubectl get gateway/qwen-inference-gateway -o jsonpath='{.status.addresses[0].value}')
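
    You can print the variable to confirm that an address was retrieved. If the output is empty, wait for the gateway to be assigned an address and run the command again:

    echo ${GATEWAY_IP}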
  3. Send two sequential requests to simulate a multi-turn conversation.

    curl -X POST ${GATEWAY_IP}:8081/v1/chat/completions -H 'Content-Type: application/json' -d @./round1.txt
    curl -X POST ${GATEWAY_IP}:8081/v1/chat/completions -H 'Content-Type: application/json' -d @./round2.txt
  4. Check the extension logs to confirm prefix-aware routing is active.

    kubectl logs deploy/epp-default-inference-gateway-ext-proc -n envoy-gateway-system | grep "Do prefix"

    Expected output:

    2025-05-23T03:33:09Z    INFO    scheduling/prefixcache_filter.go:311    Do prefix-aware routing!        {"request": "v68m4zx472", "matching ratio": " 0.54 > 0.50"}

    The Do prefix-aware routing! message confirms that the policy is active. The matching ratio field shows that the prefix match ratio (0.54) exceeded the threshold (0.50), triggering prefix-aware routing.
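
    If you also want to observe the effect on the model server side, you can inspect the metrics endpoint of a qwen pod. The exact prefix cache metric names depend on the vLLM version, so the command below simply filters for prefix-related metrics. Replace <qwen-pod-name> with a pod name from kubectl get pods -l app=qwen:

    # Forward the pod's HTTP port locally and filter its Prometheus metrics.
    kubectl port-forward pod/<qwen-pod-name> 8000:8000 &
    curl -s localhost:8000/metrics | grep -i prefix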

(Optional) Step 5: Test inference service performance

This step uses the llm-qa-benchmark stress testing tool to compare TTFT between standard HTTP routing and inference routing under a simulated multi-turn conversation workload.

Important

The following results were generated in a test environment. Your actual results may vary.

  1. Deploy the llm-qa-benchmark stress testing tool.

    kubectl apply -f- <<EOF
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        app: llm-qa-benchmark
      name: llm-qa-benchmark
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: llm-qa-benchmark
      template:
        metadata:
          labels:
            app: llm-qa-benchmark
        spec:
          containers:
          - command:
            - sh
            - -c
            - sleep inf
            image: registry-cn-hangzhou.ack.aliyuncs.com/dev/llm-qa-benchmark:v0.1
            imagePullPolicy: IfNotPresent
            name: llm-qa-benchmark
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
          restartPolicy: Always
    EOF
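
    Wait for the benchmark pod to become ready before running the tests:

    kubectl rollout status deployment/llm-qa-benchmark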
  2. Get the gateway's internal IP address.

    export GW_IP=$(kubectl get svc -n envoy-gateway-system \
      -l gateway.envoyproxy.io/owning-gateway-namespace=default,gateway.envoyproxy.io/owning-gateway-name=qwen-inference-gateway \
      -o jsonpath='{.items[0].spec.clusterIP}')
  3. Run both tests and compare the results. Both commands use the same workload parameters: 8 users, 15 rounds, 0.1 QPS, a 100-token shared system prompt, a 2,000-token user history, and 100-token answers over a 600-second run.

    Standard HTTP routing (port 8080):

    kubectl exec -it deploy/llm-qa-benchmark -- env GW_IP=${GW_IP} python3 multi-round-qa.py \
      --num-users 8 \
      --num-rounds 15 \
      --qps 0.1 \
      --shared-system-prompt 100 \
      --sharegpt \
      --user-history-prompt 2000 \
      --answer-len 100 \
      --model /model/qwen \
      --time 600 \
      --base-url http://${GW_IP}:8080/v1

    Expected output:

    ==================== Performance summary ======================
     QPS: 0.1000 reqs/s
    
     Processing speed: 0.1080 reqs/s
    
     Requests on-the-fly: 0
    
     Input tokens per second: 259.0703 tokens/s
    
     Output tokens per second: 4.8576 tokens/s
    
     Average generation throughput (per request): 26.6710 tokens/req/s
    
     Average TTFT: 0.3669s
    
    Time range: 1748231183.2753935 - 1748231766.4799275 (583.20s)
    ===============================================================

    Inference service routing (port 8081):

    kubectl exec -it deploy/llm-qa-benchmark -- env GW_IP=${GW_IP} python3 multi-round-qa.py \
      --num-users 8 \
      --num-rounds 15 \
      --qps 0.1 \
      --shared-system-prompt 100 \
      --sharegpt \
      --user-history-prompt 2000 \
      --answer-len 100 \
      --model /model/qwen \
      --time 600 \
      --base-url http://${GW_IP}:8081/v1

    Expected output:

    ==================== Performance summary ======================
     QPS: 0.1000 reqs/s
    
     Processing speed: 0.1081 reqs/s
    
     Requests on-the-fly: 0
    
     Input tokens per second: 259.3009 tokens/s
    
     Output tokens per second: 4.8548 tokens/s
    
     Average generation throughput (per request): 26.9300 tokens/req/s
    
     Average TTFT: 0.2761s
    
    Time range: 1748231885.874972 - 1748232468.5918882 (582.72s)
    ===============================================================

    In this test, inference service routing reduced the average TTFT from 0.3669s to 0.2761s, an improvement of approximately 25% over standard HTTP routing under the same workload.