
Container Service for Kubernetes: Using the prefix-cache-aware routing capability in estimation mode

Last Updated: Oct 22, 2025

The Gateway with Inference Extension component lets you specify different load balancing policies for inference service routing in various generative AI scenarios. This topic describes how to use the Gateway with Inference Extension component to implement prefix-cache-aware routing in estimation mode.

Important
  • Before you begin, make sure that you understand the concepts of InferencePool and InferenceModel.

  • This topic requires the Gateway with Inference Extension component version 1.4.0 or later.

Background information

Automatic prefix caching in vLLM

vLLM supports automatic prefix caching (APC). APC caches the key-value (KV) state that vLLM computed for previous requests. If a new request shares a prefix with a previous request, it can reuse the existing KV cache and skip the KV cache computation for the shared prefix, which accelerates the processing of large language model (LLM) inference requests.
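
For reference, APC is enabled by passing the --enable-prefix-caching flag (vLLM also accepts the underscore form --enable_prefix_caching) when you start the server, as the deployment in Step 1 does. A minimal sketch, reusing the model path from this topic:

    vllm serve /models/Qwen-2.5-7B-Instruct --port 8000 --enable-prefix-caching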

Prefix-cache-aware routing in estimation mode

Prefix-cache-aware routing in estimation mode is a load balancing policy that sends requests that share the same prefix to the same inference server pod whenever possible. The Gateway with Inference Extension component tracks requests routed to each inference server. It then estimates the prefix cache status of each server to improve the cache hit ratio of the inference engine.

When the APC feature is enabled on the model server, the prefix-cache-aware routing policy in estimation mode can maximize the cache hit ratio and reduce the request response time. This policy is ideal for scenarios with many requests that share prefixes. You should evaluate its suitability based on your business needs.
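
Conceptually, the policy compares a per-server prefix matching ratio against a threshold; the "0.54 > 0.50" entry in the Step 4 log output is such a comparison. The following shell sketch is only a simplified, character-level illustration of the matching ratio idea; the component's actual estimation logic is internal and works differently:

    # Hypothetical illustration only: the fraction of the new request that
    # matches the prefix of a request a server has already processed.
    lcp_ratio() {
      local a="$1" b="$2" i=0
      while (( i < ${#a} && i < ${#b} )) && [[ "${a:i:1}" == "${b:i:1}" ]]; do
        i=$(( i + 1 ))
      done
      # Ratio of the shared prefix to the length of the new request.
      awk -v m="$i" -v n="${#a}" 'BEGIN { printf "%.2f\n", m / n }'
    }

    prev='Shared system prompt. Question: summarize chapter 1.'
    next='Shared system prompt. Question: summarize chapter 2.'
    lcp_ratio "$next" "$prev"   # High ratio: prefer the server that handled "prev".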

Typical scenarios include the following:

  • Long document queries: Users repeatedly query the same long document, such as a software manual or an annual report, with different queries.

  • Multi-turn conversations: Users may interact with the application multiple times in the same chat session.

Prerequisites

Note

For the image described in this topic, we recommend A10 GPUs for ACK clusters and GN8IS GPUs for Alibaba Cloud Container Compute Service (ACS) GPU compute power.

Due to the large size of the LLM image, we recommend that you transfer it to Container Registry in advance and pull it using the internal network address. The speed of pulling from the public network depends on the bandwidth configuration of the cluster elastic IP address (EIP), which may result in longer wait times.
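
For example, you can mirror the image used in Step 1 to your Container Registry instance with standard docker commands. Here, <your-registry> is a placeholder for your own registry address; after pushing, update the image field in vllm-service.yaml accordingly:

    docker pull registry-cn-hangzhou.ack.aliyuncs.com/dev/qwen-2.5-7b-instruct-lora:v0.1
    docker tag registry-cn-hangzhou.ack.aliyuncs.com/dev/qwen-2.5-7b-instruct-lora:v0.1 <your-registry>/qwen-2.5-7b-instruct-lora:v0.1
    docker push <your-registry>/qwen-2.5-7b-instruct-lora:v0.1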

Procedure

Step 1: Deploy a sample inference service

  1. Create a file named vllm-service.yaml.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        app: qwen
      name: qwen
    spec:
      replicas: 5
      selector:
        matchLabels:
          app: qwen
      template:
        metadata:
          annotations:
            prometheus.io/path: /metrics
            prometheus.io/port: "8000"
            prometheus.io/scrape: "true"
          labels:
            app: qwen
        spec:
          containers:
          - command:
            - sh
            - -c
            - vllm serve /models/Qwen-2.5-7B-Instruct --port 8000 --enable_prefix_caching --trust-remote-code --served-model-name /model/qwen --max-model-len 8192 --gpu-memory-utilization 0.95 --enforce-eager --enable-lora --max-loras 2 --max-cpu-loras 4 --lora-modules travel-helper-v1=/models/Qwen-TravelHelper-Lora travel-helper-v2=/models/Qwen-TravelHelper-Lora-v2
            image: registry-cn-hangzhou.ack.aliyuncs.com/dev/qwen-2.5-7b-instruct-lora:v0.1
            imagePullPolicy: IfNotPresent
            name: custom-serving
            ports:
            - containerPort: 8000
              name: http
              protocol: TCP
            readinessProbe:
              failureThreshold: 3
              initialDelaySeconds: 30
              periodSeconds: 30
              successThreshold: 1
              tcpSocket:
                port: 8000
              timeoutSeconds: 1
            resources:
              limits:
                nvidia.com/gpu: "1"
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
            volumeMounts:
            - mountPath: /dev/shm
              name: dshm
          dnsPolicy: ClusterFirst
          restartPolicy: Always
          schedulerName: default-scheduler
          securityContext: {}
          terminationGracePeriodSeconds: 30
          volumes:
          - emptyDir:
              medium: Memory
              sizeLimit: 30Gi
            name: dshm
    ---
    apiVersion: v1
    kind: Service
    metadata:
      labels:
        app: qwen
      name: qwen
    spec:
      ports:
      - name: http-serving
        port: 8000
        protocol: TCP
        targetPort: 8000
      selector:
        app: qwen
  2. Deploy the sample inference service.

    kubectl apply -f vllm-service.yaml
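
    (Optional) Wait until all five replicas are ready before you continue. A standard check:

    kubectl get pods -l app=qwen

    Each pod should show 1/1 in the READY column.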

Step 2: Deploy an inference route

In this step, you create an InferencePool resource and an InferenceModel resource.

  1. Create a file named inference-pool.yaml.

    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      annotations:
        inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
      name: vllm-qwen-pool
    spec:
      targetPortNumber: 8000
      selector:
        app: qwen
      extensionRef:
        name: inference-gateway-ext-proc
    ---
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceModel
    metadata:
      name: inferencemodel-qwen
    spec:
      modelName: /model/qwen
      criticality: Critical
      poolRef:
        group: inference.networking.x-k8s.io
        kind: InferencePool
        name: vllm-qwen-pool
      targetModels:
      - name: /model/qwen
        weight: 100

    In the InferencePool resource, add the inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE" annotation. This enables prefix-cache-aware routing in estimation mode for the pods in the InferencePool.
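
    If the InferencePool already exists in your cluster, you can also set the annotation in place with kubectl annotate instead of re-applying the YAML:

    kubectl annotate inferencepool vllm-qwen-pool inference.networking.x-k8s.io/routing-strategy=PREFIX_CACHE --overwrite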

  2. Deploy the inference route.

    kubectl apply -f inference-pool.yaml
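
    (Optional) Confirm that both resources were created:

    kubectl get inferencepool vllm-qwen-pool
    kubectl get inferencemodel inferencemodel-qwen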

Step 3: Deploy the gateway and gateway routing rules

In this step, you create a gateway that listens on ports 8080 and 8081. On port 8081, an HTTPRoute resource specifies the InferencePool provided by the inference extension as the backend, so inference requests are routed to the set of pods defined in the InferencePool using the prefix-cache-aware policy. On port 8080, another HTTPRoute resource specifies the Service as the backend, so inference requests are routed to the same set of pods using the standard HTTP least-request load balancing policy.

  1. Create a file named inference-gateway.yaml.

    apiVersion: gateway.networking.k8s.io/v1
    kind: GatewayClass
    metadata:
      name: qwen-inference-gateway-class
    spec:
      controllerName: gateway.envoyproxy.io/gatewayclass-controller
    ---
    apiVersion: gateway.networking.k8s.io/v1
    kind: Gateway
    metadata:
      name: qwen-inference-gateway
    spec:
      gatewayClassName: qwen-inference-gateway-class
      listeners:
        - name: http
          protocol: HTTP
          port: 8080
        - name: llm-gw
          protocol: HTTP
          port: 8081
    ---
    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: qwen-backend
    spec:
      parentRefs:
        - name: qwen-inference-gateway
          sectionName: llm-gw
      rules:
        - backendRefs:
            - group: inference.networking.x-k8s.io
              kind: InferencePool
              name: vllm-qwen-pool
          matches:
            - path:
                type: PathPrefix
                value: /
    ---
    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: qwen-backend-no-inference
    spec:
      parentRefs:
      - group: gateway.networking.k8s.io
        kind: Gateway
        name: qwen-inference-gateway
        sectionName: http
      rules:
      - backendRefs:
        - group: ""
          kind: Service
          name: qwen
          port: 8000
          weight: 1
        matches:
        - path:
            type: PathPrefix
            value: /
    ---
    apiVersion: gateway.envoyproxy.io/v1alpha1
    kind: BackendTrafficPolicy
    metadata:
      name: backend-timeout
    spec:
      timeout:
        http:
          requestTimeout: 1h
      targetRef:
        group: gateway.networking.k8s.io
        kind: Gateway
        name: qwen-inference-gateway
  2. Deploy the gateway.

    kubectl apply -f inference-gateway.yaml
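
    Wait until the gateway is programmed and has been assigned an address. With the standard Gateway API status columns, you can check:

    kubectl get gateway qwen-inference-gateway

    The PROGRAMMED column should show True and the ADDRESS column should be populated before you proceed.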

Step 4: Verify the routing rules

  1. Create two files named round1.txt and round2.txt. The two request bodies share an identical prefix (the first user message), which simulates two turns of the same conversation. Send the content of round1.txt and then round2.txt as the request bodies for the LLM, and then check the logs of the extension processor referenced by extensionRef to verify that prefix-cache-aware routing in estimation mode is triggered.

    round1.txt:

    echo '{"max_tokens":24,"messages":[{"content":"Hi, here'\''s some system prompt: hi hi hi hi hi hi hi hi hi hi.For user 3, here are some other context: hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi.I would like to test your intelligence. for this purpose I would like you to play zork. you can interact with the game by typing in commands. I will forward these commands to the game and type in any response. are you ready?","role":"user"}],"model":"/model/qwen","stream":true,"stream_options":{"include_usage":true},"temperature":0}' > round1.txt

    round2.txt:

    echo '{"max_tokens":3,"messages":[{"content":"Hi, here'\''s some system prompt: hi hi hi hi hi hi hi hi hi hi.For user 3, here are some other context: hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi.I would like to test your intelligence. for this purpose I would like you to play zork. you can interact with the game by typing in commands. I will forward these commands to the game and type in any response. are you ready?","role":"user"},{"content":"Hi there! It looks like you're setting up a fun test. I'm ready to play Zork! You can","role":"assistant"},{"content":"% zork\nWelcome to Dungeon. This version created 11-MAR-91.\nYou are in an open field west of a big white house with a boarded\nfront door.\nThere is a small mailbox here.\n>","role":"user"},{"content":"Great!","role":"assistant"},{"content":"Opening the mailbox reveals:\n A leaflet.\n>","role":"user"}],"model":"/model/qwen","stream":true,"stream_options":{"include_usage":true},"temperature":0}' > round2.txt
  2. Retrieve the public IP address of the gateway.

    export GATEWAY_IP=$(kubectl get gateway/qwen-inference-gateway -o jsonpath='{.status.addresses[0].value}')
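
    Confirm that the variable is not empty before you send requests:

    echo ${GATEWAY_IP}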
  3. Send two session requests to simulate a multi-turn conversation scenario.

    curl -X POST ${GATEWAY_IP}:8081/v1/chat/completions -H 'Content-Type: application/json' -d @./round1.txt
    curl -X POST ${GATEWAY_IP}:8081/v1/chat/completions -H 'Content-Type: application/json' -d @./round2.txt
  4. Check the logs to confirm that prefix-cache-aware routing in estimation mode is working.

    kubectl logs deploy/epp-default-inference-gateway-ext-proc -n envoy-gateway-system | grep "Do prefix"

    Expected output:

    2025-05-23T03:33:09Z    INFO    scheduling/prefixcache_filter.go:311    Do prefix-aware routing!        {"request": "v68m4zx472", "matching ratio": " 0.54 > 0.50"}

    The Do prefix-aware routing! message indicates that prefix-cache-aware routing in estimation mode is working: the estimated matching ratio (0.54) exceeded the threshold (0.50).

(Optional) Step 5: Evaluate inference service performance with multi-turn conversation tests

This step uses an ACK cluster as an example and shows how to use a stress testing tool to run multi-turn conversation tests. The goal is to compare the performance of standard HTTP routing with that of prefix-cache-aware inference routing.

  1. Deploy the llm-qa-benchmark stress testing tool.

    kubectl apply -f- <<EOF
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        app: llm-qa-benchmark
      name: llm-qa-benchmark
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: llm-qa-benchmark
      template:
        metadata:
          labels:
            app: llm-qa-benchmark
        spec:
          containers:
          - command:
            - sh
            - -c
            - sleep inf
            image: registry-cn-hangzhou.ack.aliyuncs.com/dev/llm-qa-benchmark:v0.1
            imagePullPolicy: IfNotPresent
            name: llm-qa-benchmark
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
          restartPolicy: Always
    EOF
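
    Wait for the benchmark deployment to become available before running commands in it:

    kubectl wait --for=condition=Available deployment/llm-qa-benchmark --timeout=120s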
  2. Retrieve the internal IP address of the gateway.

    export GW_IP=$(kubectl get svc -n envoy-gateway-system -l gateway.envoyproxy.io/owning-gateway-namespace=default,gateway.envoyproxy.io/owning-gateway-name=qwen-inference-gateway -o jsonpath='{.items[0].spec.clusterIP}')
  3. Run the stress test.

    Important

    The following results are from a test environment. Your actual results may vary.

    Standard HTTP routing

    kubectl exec -it deploy/llm-qa-benchmark -- env GW_IP=${GW_IP} python3 multi-round-qa.py \
        --num-users 8 \
        --num-rounds 15 \
        --qps 0.1 \
        --shared-system-prompt 100 \
        --sharegpt \
        --user-history-prompt 2000 \
        --answer-len 100 \
        --model /model/qwen \
        --time 600 \
        --base-url http://${GW_IP}:8080/v1

    Expected output:

    ==================== Performance summary ======================
      QPS: 0.1000 reqs/s
    
      Processing speed: 0.1080 reqs/s
    
      Requests on-the-fly: 0
    
      Input tokens per second: 259.0703 tokens/s
    
      Output tokens per second: 4.8576 tokens/s
    
      Average generation throughput (per request): 26.6710 tokens/req/s
    
      Average TTFT: 0.3669s
    
    Time range: 1748231183.2753935 - 1748231766.4799275 (583.20s)
    ===============================================================

    Inference service routing

    kubectl exec -it deploy/llm-qa-benchmark -- env GW_IP=${GW_IP} python3 multi-round-qa.py \
        --num-users 8 \
        --num-rounds 15 \
        --qps 0.1 \
        --shared-system-prompt 100 \
        --sharegpt \
        --user-history-prompt 2000 \
        --answer-len 100 \
        --model /model/qwen \
        --time 600 \
        --base-url http://${GW_IP}:8081/v1

    Expected output:

    ==================== Performance summary ======================
      QPS: 0.1000 reqs/s
    
      Processing speed: 0.1081 reqs/s
    
      Requests on-the-fly: 0
    
      Input tokens per second: 259.3009 tokens/s
    
      Output tokens per second: 4.8548 tokens/s
    
      Average generation throughput (per request): 26.9300 tokens/req/s
    
      Average TTFT: 0.2761s
    
    Time range: 1748231885.874972 - 1748232468.5918882 (582.72s)
    ===============================================================

    The results show that the average TTFT with prefix-cache-aware inference routing (0.2761s) is approximately 25% lower than with standard HTTP routing (0.3669s), while throughput is essentially unchanged.