The Gateway with Inference Extension component lets you specify different load balancing policies for inference service routing in various generative AI scenarios. This topic describes how to use the Gateway with Inference Extension component to implement prefix-cache-aware routing in estimation mode.
Before you begin, make sure that you understand the concepts of InferencePool and InferenceModel.
This topic requires the Gateway with Inference Extension component version 1.4.0 or later.
Background information
Automatic prefix caching in vLLM
vLLM supports automatic prefix caching (APC). APC caches the key-value (KV) state that vLLM has computed for previous requests. If a new request shares a prefix with a previous request, it can reuse the existing KV cache. This allows the new request to skip the KV cache calculation for the shared prefix. This process accelerates the processing of large language model (LLM) inference requests.
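The idea behind APC can be illustrated with a minimal sketch: KV state is cached per block of tokens, and each block's cache key covers the entire prefix up to that block, so a key match implies the whole preceding prefix matches. The block size, hashing scheme, and class names below are illustrative assumptions, not vLLM's actual implementation.

```python
# Conceptual sketch of automatic prefix caching (APC).
# BLOCK_SIZE and the hashing scheme are illustrative only.
from hashlib import sha256

BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative)

class PrefixKVCache:
    def __init__(self):
        self.blocks = {}  # block hash -> cached KV state (stubbed here)

    def block_hashes(self, tokens):
        # Each block's hash covers the entire prefix up to that block,
        # so a hash match implies the whole preceding prefix matches.
        hashes = []
        for i in range(0, len(tokens) - len(tokens) % BLOCK_SIZE, BLOCK_SIZE):
            prefix = tuple(tokens[: i + BLOCK_SIZE])
            hashes.append(sha256(repr(prefix).encode()).hexdigest())
        return hashes

    def cached_prefix_len(self, tokens):
        # Number of leading tokens whose KV state can be reused.
        n = 0
        for h in self.block_hashes(tokens):
            if h not in self.blocks:
                break
            n += BLOCK_SIZE
        return n

    def insert(self, tokens):
        for h in self.block_hashes(tokens):
            self.blocks[h] = "kv-state"  # placeholder for real KV tensors

cache = PrefixKVCache()
cache.insert(list(range(40)))                        # first request: caches 2 full blocks
print(cache.cached_prefix_len(list(range(40))))      # 32: both full blocks reused
print(cache.cached_prefix_len(list(range(20, 60))))  # 0: no shared prefix
```

A request that shares the first 32 tokens with an earlier request skips recomputing the KV state for those tokens; a request with no shared prefix gets no benefit.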
Prefix-cache-aware routing in estimation mode
Prefix-cache-aware routing in estimation mode is a load balancing policy that sends requests that share the same prefix to the same inference server pod whenever possible. The Gateway with Inference Extension component tracks requests routed to each inference server. It then estimates the prefix cache status of each server to improve the cache hit ratio of the inference engine.
When the APC feature is enabled on the model server, the prefix-cache-aware routing policy in estimation mode can maximize the cache hit ratio and reduce the request response time. This policy is ideal for scenarios with many requests that share prefixes. You should evaluate its suitability based on your business needs.
Typical scenarios include the following:
Long document queries: Users repeatedly query the same long document, such as a software manual or an annual report, with different queries.
Multi-turn conversations: Users may interact with the application multiple times in the same chat session.
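A minimal sketch of how estimation-mode routing could behave: the gateway remembers which prompt prefixes it routed to each pod and estimates a matching ratio for each new request, routing to the best-matching pod when the ratio exceeds a threshold (0.50 here, mirroring the "matching ratio > 0.50" check that appears in the component's logs). The chunk size, data structures, and function names are assumptions for illustration, not the component's actual implementation.

```python
# Illustrative sketch of prefix-cache-aware routing in estimation mode.
# Data structures and parameters are assumptions, not the real implementation.
CHUNK = 8         # characters per tracked prefix chunk (illustrative)
THRESHOLD = 0.50  # route to the best pod only above this matching ratio

routed_prefixes = {}  # pod name -> set of prefix-chunk keys seen on that pod

def chunk_keys(prompt):
    return [prompt[: i + CHUNK]
            for i in range(0, len(prompt) - len(prompt) % CHUNK, CHUNK)]

def matching_ratio(prompt, pod):
    keys = chunk_keys(prompt)
    if not keys:
        return 0.0
    seen = routed_prefixes.get(pod, set())
    matched = 0
    for k in keys:
        if k not in seen:
            break  # only the contiguous leading prefix counts
        matched += 1
    return matched / len(keys)

def pick_pod(prompt, pods, fallback):
    best = max(pods, key=lambda p: matching_ratio(prompt, p))
    ratio = matching_ratio(prompt, best)
    chosen = best if ratio > THRESHOLD else fallback(pods)
    routed_prefixes.setdefault(chosen, set()).update(chunk_keys(prompt))
    return chosen

pods = ["pod-a", "pod-b"]
first = pick_pod("shared system prompt ... user 1", pods,
                 fallback=lambda ps: ps[0])
second = pick_pod("shared system prompt ... user 1, turn 2", pods,
                  fallback=lambda ps: ps[1])
print(first, second)  # the second request follows the first to the same pod
```

The second request extends the first request's prompt, so its estimated matching ratio against the first pod exceeds the threshold and it is routed there, where the KV cache for the shared prefix is likely warm.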
Prerequisites
An ACK managed cluster with a GPU node pool is available. You can also install the ACK Virtual Node component in the ACK managed cluster to use ACS GPU computing power.
The Gateway with Inference Extension component of version 1.4.0 or later is installed and you have selected Enable Gateway API Inference Extension. For more information, see Install components.
For the image used in this topic, we recommend A10 GPUs for ACK clusters and GN8IS GPUs for Alibaba Cloud Container Compute Service (ACS) GPU computing power.
Because the LLM image is large, we recommend that you push it to Container Registry in advance and pull it over the internal network. The speed of pulling from the public network depends on the bandwidth of the cluster's elastic IP address (EIP), which may result in long wait times.
Procedure
Step 1: Deploy a sample inference service
Create a file named vllm-service.yaml.
Deploy the sample inference service.
kubectl apply -f vllm-service.yaml
Step 2: Deploy an inference route
In this step, you create an InferencePool resource and an InferenceModel resource.
Create a file named inference-pool.yaml.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  annotations:
    inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
  name: vllm-qwen-pool
spec:
  targetPortNumber: 8000
  selector:
    app: qwen
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: inferencemodel-qwen
spec:
  modelName: /model/qwen
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: vllm-qwen-pool
  targetModels:
  - name: /model/qwen
    weight: 100
In the InferencePool resource, add the inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE" annotation. This enables prefix-cache-aware routing in estimation mode for the pods in the InferencePool.
Deploy the inference route.
kubectl apply -f inference-pool.yaml
Step 3: Deploy the gateway and gateway routing rules
In this step, you create a gateway that listens on ports 8080 and 8081. On port 8081, an HTTPRoute resource specifies the InferencePool provided by the inference extension as the backend, and inference requests are routed to the pods defined in the InferencePool. On port 8080, another HTTPRoute resource specifies the Service as the backend, and inference requests are routed to the same pods using the standard HTTP least-request load balancing policy. This setup lets you compare the two policies against the same set of pods.
Create a file named inference-gateway.yaml.
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: qwen-inference-gateway-class
spec:
  controllerName: gateway.envoyproxy.io/gatewayclass-controller
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: qwen-inference-gateway
spec:
  gatewayClassName: qwen-inference-gateway-class
  listeners:
  - name: http
    protocol: HTTP
    port: 8080
  - name: llm-gw
    protocol: HTTP
    port: 8081
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: qwen-backend
spec:
  parentRefs:
  - name: qwen-inference-gateway
    sectionName: llm-gw
  rules:
  - backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: vllm-qwen-pool
    matches:
    - path:
        type: PathPrefix
        value: /
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: qwen-backend-no-inference
spec:
  parentRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: qwen-inference-gateway
    sectionName: http
  rules:
  - backendRefs:
    - group: ""
      kind: Service
      name: qwen
      port: 8000
      weight: 1
    matches:
    - path:
        type: PathPrefix
        value: /
---
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: backend-timeout
spec:
  timeout:
    http:
      requestTimeout: 1h
  targetRef:
    group: gateway.networking.k8s.io
    kind: Gateway
    name: qwen-inference-gateway
Deploy the gateway.
kubectl apply -f inference-gateway.yaml
Step 4: Verify the routing rules
Create files named round1.txt and round2.txt. Both files contain an identical content section. Use the content of round1.txt and then round2.txt as the request body for the LLM. Then, check the extensionRef component logs to verify that prefix-cache-aware routing in estimation mode is triggered.
round1.txt:
echo '{"max_tokens":24,"messages":[{"content":"Hi, here'\''s some system prompt: hi hi hi hi hi hi hi hi hi hi.For user 3, here are some other context: hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi.I would like to test your intelligence. for this purpose I would like you to play zork. you can interact with the game by typing in commands. I will forward these commands to the game and type in any response. are you ready?","role":"user"}],"model":"/model/qwen","stream":true,"stream_options":{"include_usage":true},"temperature":0}' > round1.txt
round2.txt:
echo '{"max_tokens":3,"messages":[{"content":"Hi, here'\''s some system prompt: hi hi hi hi hi hi hi hi hi hi.For user 3, here are some other context: hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi.I would like to test your intelligence. for this purpose I would like you to play zork. you can interact with the game by typing in commands. I will forward these commands to the game and type in any response. are you ready?","role":"user"},{"content":"Hi there! It looks like you're setting up a fun test. I'm ready to play Zork! You can","role":"assistant"},{"content":"% zork\nWelcome to Dungeon. This version created 11-MAR-91.\nYou are in an open field west of a big white house with a boarded\nfront door.\nThere is a small mailbox here.\n>","role":"user"},{"content":"Great!","role":"assistant"},{"content":"Opening the mailbox reveals:\n A leaflet.\n>","role":"user"}],"model":"/model/qwen","stream":true,"stream_options":{"include_usage":true},"temperature":0}' > round2.txt
Retrieve the public IP address of the gateway.
export GATEWAY_IP=$(kubectl get gateway/qwen-inference-gateway -o jsonpath='{.status.addresses[0].value}')
Send two session requests to simulate a multi-turn conversation scenario.
curl -X POST ${GATEWAY_IP}:8081/v1/chat/completions -H 'Content-Type: application/json' -d @./round1.txt
curl -X POST ${GATEWAY_IP}:8081/v1/chat/completions -H 'Content-Type: application/json' -d @./round2.txt
Check the logs to confirm that prefix-cache-aware routing in estimation mode is working.
kubectl logs deploy/epp-default-inference-gateway-ext-proc -n envoy-gateway-system | grep "Do prefix"
Expected output:
2025-05-23T03:33:09Z INFO scheduling/prefixcache_filter.go:311 Do prefix-aware routing! {"request": "v68m4zx472", "matching ratio": " 0.54 > 0.50"}
The log contains the Do prefix-aware routing! message, which indicates that prefix-cache-aware routing in estimation mode is working.
(Optional) Step 5: Evaluate inference service performance with multi-turn conversation tests
This step uses an ACK cluster as an example and shows how to use a stress testing tool to run multi-turn conversation tests. The goal is to compare the performance of standard HTTP routing with that of prefix-cache-aware inference routing.
Deploy the llm-qa-benchmark stress testing tool.
kubectl apply -f- <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: llm-qa-benchmark
  name: llm-qa-benchmark
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llm-qa-benchmark
  template:
    metadata:
      labels:
        app: llm-qa-benchmark
    spec:
      containers:
      - command:
        - sh
        - -c
        - sleep inf
        image: registry-cn-hangzhou.ack.aliyuncs.com/dev/llm-qa-benchmark:v0.1
        imagePullPolicy: IfNotPresent
        name: llm-qa-benchmark
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      restartPolicy: Always
EOF
Retrieve the internal IP address of the gateway.
export GW_IP=$(kubectl get svc -n envoy-gateway-system -l gateway.envoyproxy.io/owning-gateway-namespace=default,gateway.envoyproxy.io/owning-gateway-name=qwen-inference-gateway -o jsonpath='{.items[0].spec.clusterIP}')
Run the stress test.
Important: The following results are from a test environment. Your actual results may vary.
Standard HTTP routing
kubectl exec -it deploy/llm-qa-benchmark -- env GW_IP=${GW_IP} python3 multi-round-qa.py \
  --num-users 8 \
  --num-rounds 15 \
  --qps 0.1 \
  --shared-system-prompt 100 \
  --sharegpt \
  --user-history-prompt 2000 \
  --answer-len 100 \
  --model /model/qwen \
  --time 600 \
  --base-url http://${GW_IP}:8080/v1
Expected output:
==================== Performance summary ======================
QPS: 0.1000 reqs/s
Processing speed: 0.1080 reqs/s
Requests on-the-fly: 0
Input tokens per second: 259.0703 tokens/s
Output tokens per second: 4.8576 tokens/s
Average generation throughput (per request): 26.6710 tokens/req/s
Average TTFT: 0.3669s
Time range: 1748231183.2753935 - 1748231766.4799275 (583.20s)
===============================================================
Inference service routing
kubectl exec -it deploy/llm-qa-benchmark -- env GW_IP=${GW_IP} python3 multi-round-qa.py \
  --num-users 8 \
  --num-rounds 15 \
  --qps 0.1 \
  --shared-system-prompt 100 \
  --sharegpt \
  --user-history-prompt 2000 \
  --answer-len 100 \
  --model /model/qwen \
  --time 600 \
  --base-url http://${GW_IP}:8081/v1
Expected output:
==================== Performance summary ======================
QPS: 0.1000 reqs/s
Processing speed: 0.1081 reqs/s
Requests on-the-fly: 0
Input tokens per second: 259.3009 tokens/s
Output tokens per second: 4.8548 tokens/s
Average generation throughput (per request): 26.9300 tokens/req/s
Average TTFT: 0.2761s
Time range: 1748231885.874972 - 1748232468.5918882 (582.72s)
===============================================================
The results show that the Average TTFT for inference service routing (0.2761s) is significantly lower than that for standard HTTP routing (0.3669s).
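The TTFT figures from the two runs can be compared directly. A quick calculation using the sample values above (your measurements will differ):

```python
# Relative TTFT improvement computed from the sample test outputs above.
ttft_http = 0.3669       # standard HTTP routing, seconds
ttft_inference = 0.2761  # prefix-cache-aware inference routing, seconds

improvement = (ttft_http - ttft_inference) / ttft_http
print(f"TTFT reduced by {improvement:.1%}")  # ≈ 24.7%
```

In this sample run, prefix-cache-aware routing cut average TTFT by roughly a quarter at the same request rate.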