The Gateway with Inference Extension component supports multiple load balancing policies for LLM inference services. This topic shows how to enable prefix-aware routing with estimation mode — a policy that routes requests sharing the same prompt prefix to the same inference server pod, increasing the key-value (KV) cache hit rate and reducing time to first token (TTFT).
Before you begin, make sure you understand InferencePool and InferenceModel. This topic applies only to Gateway with Inference Extension version 1.4.0 or later.
How it works
Automatic prefix caching in vLLM
vLLM supports automatic prefix caching (APC). When a request is processed, vLLM saves the computed KV cache for that request. If a subsequent request shares the same prefix, vLLM reuses the cached result instead of recomputing it, which accelerates LLM inference.
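For reference, APC is controlled by vLLM's `--enable-prefix-caching` flag (it is enabled by default in recent vLLM versions). A minimal sketch, assuming the model is mounted at `/model/qwen` and served on port 8000 as in the rest of this topic:

```shell
# Illustrative startup command; adjust the model path and port to your deployment.
vllm serve /model/qwen --port 8000 --enable-prefix-caching
```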
Prefix-aware routing with estimation mode
Prefix-aware routing with estimation mode is a load balancing policy that routes requests with the same prefix to the same inference server pod whenever possible.
What "estimation mode" means: The gateway does not query each inference server's cache state directly. Instead, it tracks the requests sent to each server and *estimates* which server is most likely to have a given prefix cached. This approach avoids tight coupling to the inference engine's internal cache API while still improving the prefix cache hit ratio.
When APC is enabled on the model server, this policy reduces TTFT by increasing the prefix cache hit ratio. It is best suited for workloads where many requests share the same prefix:
Long-document queries: Users ask different questions about the same long document, such as a software manual or an annual report, in the same session.
Multi-turn conversations: Users continue interacting with the same application across multiple turns of a conversation.
Prerequisites
Before you begin, ensure that you have:
An ACK managed cluster with a GPU node pool, or an ACK managed cluster with the ACK Virtual Node component installed to use ACS GPU computing power
Gateway with Inference Extension version 1.4.0 or later installed, with Enable Gateway API Inference Extension (Requires a deployed inference service) selected during installation (see Install components)
For the image used in this topic, we recommend A10 GPUs for ACK clusters and L20 (GN8IS) GPUs for ACS GPU computing power.
Because LLM images are large, transfer the image to Alibaba Cloud Container Registry (ACR) beforehand and pull it over an internal network address. Pulling directly from the public network is slow and depends on the bandwidth of the cluster's elastic IP addresses (EIPs). A typical transfer with Docker looks like the sketch below.
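The registry address, namespace, and image name below are placeholders; substitute your own ACR instance's region and VPC endpoint:

```shell
# Placeholders: replace <public-image>, <region>, <namespace>, <image>, and <tag> with your own values.
docker pull <public-image>:<tag>
docker tag <public-image>:<tag> registry-vpc.<region>.aliyuncs.com/<namespace>/<image>:<tag>
docker push registry-vpc.<region>.aliyuncs.com/<namespace>/<image>:<tag>
```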
Enable prefix-aware routing
Step 2: Deploy inference routing
Create an InferencePool resource and an InferenceModel resource. The InferencePool annotation enables prefix-aware routing with estimation mode for all pods in the pool.
Create a file named `inference-pool.yaml`.

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  annotations:
    inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"
  name: vllm-qwen-pool
spec:
  targetPortNumber: 8000
  selector:
    app: qwen
  extensionRef:
    name: inference-gateway-ext-proc
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: inferencemodel-qwen
spec:
  modelName: /model/qwen
  criticality: Critical
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: vllm-qwen-pool
  targetModels:
  - name: /model/qwen
    weight: 100
```

The `inference.networking.x-k8s.io/routing-strategy: "PREFIX_CACHE"` annotation on the InferencePool enables prefix-aware routing with estimation mode for pods in that pool.

Deploy the inference routing resources.

```shell
kubectl apply -f inference-pool.yaml
```
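Optionally, verify that both resources exist. This assumes the Inference Extension CRDs were installed along with the component:

```shell
kubectl get inferencepool vllm-qwen-pool
kubectl get inferencemodel inferencemodel-qwen
```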
Step 3: Deploy the gateway and routing rules
This step creates a gateway with two ports:
| Port | Purpose |
|---|---|
| 8081 | Routes inference requests through the InferencePool using prefix-aware load balancing |
| 8080 | Routes inference requests using standard HTTP least-request load balancing (used as a baseline in the performance test) |
Create a file named `inference-gateway.yaml`.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: qwen-inference-gateway-class
spec:
  controllerName: gateway.envoyproxy.io/gatewayclass-controller
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: qwen-inference-gateway
spec:
  gatewayClassName: qwen-inference-gateway-class
  listeners:
  - name: http
    protocol: HTTP
    port: 8080
  - name: llm-gw
    protocol: HTTP
    port: 8081
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: qwen-backend
spec:
  parentRefs:
  - name: qwen-inference-gateway
    sectionName: llm-gw
  rules:
  - backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: vllm-qwen-pool
    matches:
    - path:
        type: PathPrefix
        value: /
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: qwen-backend-no-inference
spec:
  parentRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: qwen-inference-gateway
    sectionName: http
  rules:
  - backendRefs:
    - group: ""
      kind: Service
      name: qwen
      port: 8000
      weight: 1
    matches:
    - path:
        type: PathPrefix
        value: /
---
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: backend-timeout
spec:
  timeout:
    http:
      requestTimeout: 1h
  targetRef:
    group: gateway.networking.k8s.io
    kind: Gateway
    name: qwen-inference-gateway
```

Deploy the gateway.

```shell
kubectl apply -f inference-gateway.yaml
```
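Before moving on, you can optionally check that the gateway has been programmed and assigned an address, using the standard Gateway API status:

```shell
kubectl get gateway qwen-inference-gateway
# The ADDRESS column should be populated once the gateway is ready.
```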
Step 4: Verify the routing rule
Send two requests that share the same prefix content and check the extension logs to confirm prefix-aware routing is active.
Create `round1.txt` and `round2.txt`. Both files use the same message prefix.

```shell
echo '{"max_tokens":24,"messages":[{"content":"Hi, here'\''s some system prompt: hi hi hi hi hi hi hi hi hi hi.For user 3, here are some other context: hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi.I would like to test your intelligence. for this purpose I would like you to play zork. you can interact with the game by typing in commands. I will forward these commands to the game and type in any response. are you ready?","role":"user"}],"model":"/model/qwen","stream":true,"stream_options":{"include_usage":true},"temperature":0}' > round1.txt

echo '{"max_tokens":3,"messages":[{"content":"Hi, here'\''s some system prompt: hi hi hi hi hi hi hi hi hi hi.For user 3, here are some other context: hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi.I would like to test your intelligence. for this purpose I would like you to play zork. you can interact with the game by typing in commands. I will forward these commands to the game and type in any response. are you ready?","role":"user"},{"content":"Hi there! It looks like you'\''re setting up a fun test. I'\''m ready to play Zork! You can","role":"assistant"},{"content":"% zork\nWelcome to Dungeon. This version created 11-MAR-91.\nYou are in an open field west of a big white house with a boarded\nfront door.\nThere is a small mailbox here.\n>","role":"user"},{"content":"Great!","role":"assistant"},{"content":"Opening the mailbox reveals:\n A leaflet.\n>","role":"user"}],"model":"/model/qwen","stream":true,"stream_options":{"include_usage":true},"temperature":0}' > round2.txt
```

Get the gateway's public IP address.

```shell
export GATEWAY_IP=$(kubectl get gateway/qwen-inference-gateway -o jsonpath='{.status.addresses[0].value}')
```

Send two sequential requests to simulate a multi-turn conversation.

```shell
curl -X POST ${GATEWAY_IP}:8081/v1/chat/completions -H 'Content-Type: application/json' -d @./round1.txt
curl -X POST ${GATEWAY_IP}:8081/v1/chat/completions -H 'Content-Type: application/json' -d @./round2.txt
```

Check the extension logs to confirm prefix-aware routing is active.

```shell
kubectl logs deploy/epp-default-inference-gateway-ext-proc -n envoy-gateway-system | grep "Do prefix"
```

Expected output:

```
2025-05-23T03:33:09Z INFO scheduling/prefixcache_filter.go:311 Do prefix-aware routing! {"request": "v68m4zx472", "matching ratio": " 0.54 > 0.50"}
```

The `Do prefix-aware routing!` message confirms that the policy is active. The `matching ratio` field shows that the prefix match ratio (0.54) exceeded the threshold (0.50), triggering prefix-aware routing.
(Optional) Step 5: Test inference service performance
This step uses the llm-qa-benchmark stress testing tool to compare TTFT between standard HTTP routing and inference routing under a simulated multi-turn conversation workload.
The following results were generated in a test environment. Your actual results may vary.
Deploy the `llm-qa-benchmark` stress testing tool.

```shell
kubectl apply -f- <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: llm-qa-benchmark
  name: llm-qa-benchmark
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llm-qa-benchmark
  template:
    metadata:
      labels:
        app: llm-qa-benchmark
    spec:
      containers:
      - command:
        - sh
        - -c
        - sleep inf
        image: registry-cn-hangzhou.ack.aliyuncs.com/dev/llm-qa-benchmark:v0.1
        imagePullPolicy: IfNotPresent
        name: llm-qa-benchmark
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      restartPolicy: Always
EOF
```

Get the gateway's internal IP address.

```shell
export GW_IP=$(kubectl get svc -n envoy-gateway-system \
  -l gateway.envoyproxy.io/owning-gateway-namespace=default,gateway.envoyproxy.io/owning-gateway-name=qwen-inference-gateway \
  -o jsonpath='{.items[0].spec.clusterIP}')
```

Run both tests and compare the results. Both commands use the same workload parameters: 8 users, 15 rounds, 0.1 QPS, a 100-token shared system prompt, 2,000-token user history, and 100-token answers over 600 seconds.

Standard HTTP routing (port 8080):

```shell
kubectl exec -it deploy/llm-qa-benchmark -- env GW_IP=${GW_IP} python3 multi-round-qa.py \
  --num-users 8 \
  --num-rounds 15 \
  --qps 0.1 \
  --shared-system-prompt 100 \
  --sharegpt \
  --user-history-prompt 2000 \
  --answer-len 100 \
  --model /model/qwen \
  --time 600 \
  --base-url http://${GW_IP}:8080/v1
```

Expected output:

```
==================== Performance summary ======================
QPS: 0.1000 reqs/s
Processing speed: 0.1080 reqs/s
Requests on-the-fly: 0
Input tokens per second: 259.0703 tokens/s
Output tokens per second: 4.8576 tokens/s
Average generation throughput (per request): 26.6710 tokens/req/s
Average TTFT: 0.3669s
Time range: 1748231183.2753935 - 1748231766.4799275 (583.20s)
===============================================================
```

Inference service routing (port 8081):

```shell
kubectl exec -it deploy/llm-qa-benchmark -- env GW_IP=${GW_IP} python3 multi-round-qa.py \
  --num-users 8 \
  --num-rounds 15 \
  --qps 0.1 \
  --shared-system-prompt 100 \
  --sharegpt \
  --user-history-prompt 2000 \
  --answer-len 100 \
  --model /model/qwen \
  --time 600 \
  --base-url http://${GW_IP}:8081/v1
```

Expected output:
```
==================== Performance summary ======================
QPS: 0.1000 reqs/s
Processing speed: 0.1081 reqs/s
Requests on-the-fly: 0
Input tokens per second: 259.3009 tokens/s
Output tokens per second: 4.8548 tokens/s
Average generation throughput (per request): 26.9300 tokens/req/s
Average TTFT: 0.2761s
Time range: 1748231885.874972 - 1748232468.5918882 (582.72s)
===============================================================
```

Under the same workload, inference service routing reduces average TTFT from 0.3669s to 0.2761s, an improvement of approximately 25% over standard HTTP routing.