Prefix cache-aware routing in precise mode routes each request to the vLLM replica that already holds the longest matching KV cache prefix, maximizing cache hit ratios and reducing response latency for shared-prefix LLM workloads. Each vLLM replica publishes information about its cached KV blocks to the Gateway with Inference Extension through ZeroMQ event messages, and the gateway selects the replica with the highest cache hit probability for each incoming request.
Key concepts
KV cache
During inference, the model generates key-value pairs for each token in the context. Caching these pairs lets the model skip redundant computation for tokens it has already processed, which speeds up inference and lowers response latency.
Automatic Prefix Caching (APC)
vLLM's APC stores the KV cache of previously computed requests. When a new request shares a prefix with a cached request, vLLM reuses the existing KV cache for the shared prefix, skipping recomputation.
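The reuse of a shared prefix can be illustrated with a small sketch. This is not vLLM's implementation: the function names are hypothetical, and plain SHA-256 over a text encoding stands in for the real block hashing. It only shows the general idea that tokens are grouped into fixed-size blocks, each block's hash depends on everything before it, and two requests with a common prefix therefore produce identical leading block hashes.

```python
import hashlib
from typing import List, Sequence


def block_hashes(token_ids: Sequence[int], block_size: int) -> List[str]:
    """Hash each full block of tokens, chaining in the previous block's hash.

    Illustrative only: a trailing partial block is ignored, and the hash
    input format is an assumption made for this sketch.
    """
    hashes: List[str] = []
    parent = ""
    full_length = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full_length, block_size):
        block = token_ids[start:start + block_size]
        payload = parent + ":" + ",".join(map(str, block))
        parent = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(parent)
    return hashes


def shared_prefix_blocks(a: Sequence[str], b: Sequence[str]) -> int:
    """Count how many leading block hashes two requests have in common."""
    count = 0
    for left, right in zip(a, b):
        if left != right:
            break
        count += 1
    return count


# Two requests that share an 8-token "system prompt" and then diverge.
system_prompt = list(range(100, 108))
req1 = system_prompt + [1, 2, 3, 4]
req2 = system_prompt + [9, 9, 9, 9]

h1 = block_hashes(req1, block_size=4)
h2 = block_hashes(req2, block_size=4)
reusable = shared_prefix_blocks(h1, h2)
print(reusable)  # 2 blocks (8 tokens) could be served from cache
```

Because each hash chains in its parent, a block is only reusable when every block before it also matches, which is why the benefit comes from shared prefixes rather than shared fragments anywhere in the prompt.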
Precise mode vs. estimated mode
| | Precise mode | Estimated mode |
|---|---|---|
| Cache monitoring | Receives KV cache block distribution directly from each vLLM replica | Infers cache state without direct reporting |
| Cache hit accuracy | Higher — routes based on actual cache state | Lower — cannot precisely track KV cache distribution |
| Requirements | vLLM v0.10.0 or later with KV cache event reporting enabled at startup | No additional vLLM configuration required |
| Best for | Workloads with many shared-prefix requests | Scenarios where vLLM version constraints prevent precise mode |
Use precise mode when your inference service runs vLLM v0.10.0 or later and your workload includes requests with shared system prompts or conversation history.
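As a rough mental model of what precise mode does with the reported cache state, the sketch below picks the replica whose cached blocks cover the longest leading run of a request's block hashes. The index structure, the function names, and the tie-breaking rule are assumptions made for illustration; the gateway's actual scoring logic is not described in this document.

```python
from typing import Dict, List, Set


def matched_prefix_len(request_blocks: List[str], cached: Set[str]) -> int:
    """Number of leading request blocks that a replica already holds."""
    count = 0
    for block_hash in request_blocks:
        if block_hash not in cached:
            break
        count += 1
    return count


def pick_replica(request_blocks: List[str], index: Dict[str, Set[str]]) -> str:
    """Choose the replica with the longest cached prefix.

    Ties are broken by replica name so the result is deterministic
    (an arbitrary choice for this sketch).
    """
    return max(
        sorted(index),
        key=lambda replica: matched_prefix_len(request_blocks, index[replica]),
    )


# Hypothetical per-replica view built from KV cache events.
index = {
    "10.0.0.5": {"b0", "b1", "b2"},
    "10.0.0.6": {"b0"},
    "10.0.0.7": set(),
}
request = ["b0", "b1", "b3"]

chosen = pick_replica(request, index)
print(chosen)  # 10.0.0.5
```

Estimated mode would have to guess the contents of `index`; precise mode fills it from what each replica actually reports, which is where the accuracy difference in the table comes from.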
Prerequisites
Before you begin, ensure that you have:
An ACK managed cluster with a GPU node pool. Alternatively, install the ACK Virtual Node component to use ACK with ACS GPU computing power.
This tutorial deploys the Qwen3-32B model, which requires more than 64 GB of GPU memory. Use the ecs.gn8is-2x.8xlarge instance type for GPU node pools, or the GU8TF card type for ACS virtual nodes.
Gateway with Inference Extension version 1.4.0-aliyun.3 or later, with Enable Gateway API Inference Extension turned on. See Install the Gateway with Inference Extension add-on.
Deploy the model service
Step 1: Prepare the Qwen3-32B model files
Install git-lfs if it is not already installed.
    # On RHEL/CentOS-based systems
    yum install git-lfs

    # On Debian/Ubuntu-based systems
    apt-get install git-lfs

For other installation methods, see Installing Git Large File Storage.
Download the Qwen3-32B model from ModelScope.
    git lfs install
    GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/Qwen/Qwen3-32B.git
    cd Qwen3-32B/
    git lfs pull

Create an OSS folder and upload the model files. For instructions on installing ossutil, see Install ossutil.
    ossutil mkdir oss://<YOUR-BUCKET-NAME>/Qwen3-32B
    ossutil cp -r ./Qwen3-32B oss://<YOUR-BUCKET-NAME>/Qwen3-32B

Create llm-model.yaml to define an OSS-backed Secret, persistent volume (PV), and persistent volume claim (PVC). For background, see Use ossfs 1.0 to create a statically provisioned volume.

    apiVersion: v1
    kind: Secret
    metadata:
      name: oss-secret
    stringData:
      akId: <YOUR-OSS-AK> # AccessKey ID for OSS access
      akSecret: <YOUR-OSS-SK> # AccessKey Secret for OSS access
    ---
    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: llm-model
      labels:
        alicloud-pvname: llm-model
    spec:
      capacity:
        storage: 30Gi
      accessModes:
        - ReadOnlyMany
      persistentVolumeReclaimPolicy: Retain
      csi:
        driver: ossplugin.csi.alibabacloud.com
        volumeHandle: llm-model
        nodePublishSecretRef:
          name: oss-secret
          namespace: default
        volumeAttributes:
          bucket: <YOUR-BUCKET-NAME> # OSS bucket name
          url: <YOUR-BUCKET-ENDPOINT> # OSS endpoint, e.g., oss-cn-hangzhou-internal.aliyuncs.com
          otherOpts: "-o umask=022 -o max_stat_cache_size=0 -o allow_other"
          path: <YOUR-MODEL-PATH> # Path to model files, e.g., /Qwen3-32B/
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: llm-model
    spec:
      accessModes:
        - ReadOnlyMany
      resources:
        requests:
          storage: 30Gi
      selector:
        matchLabels:
          alicloud-pvname: llm-model

Apply the manifest.

    kubectl create -f llm-model.yaml
Step 2: Deploy the vLLM inference service
Create vllm.yaml.

The following table describes the startup parameters and environment variables that enable precise-mode routing. The --block-size and PYTHONHASHSEED values must match the corresponding fields in the InferenceTrafficPolicy you create in the next section.

| Parameter / variable | Description |
|---|---|
| --kv-events-config | KV cache event publishing configuration. Set enable_kv_cache_events to true and publisher to zmq. For endpoint, use the naming convention tcp://epp-<InferencePool namespace>-<InferencePool name>.envoy-gateway-system.<cluster local domain>:5557. For topic, use kv@${POD_IP}@<served model name>. In this example, the InferencePool named qwen-inference-pool is in the default namespace and the model name is Qwen3-32B, so the values are tcp://epp-default-qwen-inference-pool.envoy-gateway-system.svc.cluster.local:5557 and kv@${POD_IP}@Qwen3-32B. |
| --prefix-caching-hash-algo | Hash algorithm for KV cache prefix blocks. Must be sha256_cbor_64bit. |
| --block-size | Number of tokens per KV cache prefix block. Must match blockSize in InferenceTrafficPolicy. In this example: 64. |
| PYTHONHASHSEED | Python hash seed. Must be a non-zero value and must match hashSeed in InferenceTrafficPolicy. In this example: 42. |

Deploy the vLLM inference service.

    kubectl create -f vllm.yaml
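The endpoint and topic naming conventions for --kv-events-config are easy to mistype, so it can help to generate the value. The following is a minimal sketch based only on the conventions described above: the function name and its parameters are hypothetical, the default cluster domain svc.cluster.local is taken from this tutorial's example, and the exact JSON shape that vLLM accepts should be checked against the vLLM documentation for your version.

```python
import json
from typing import Dict, Union


def kv_events_config(
    pool_namespace: str,
    pool_name: str,
    served_model: str,
    cluster_domain: str = "svc.cluster.local",
    port: int = 5557,
) -> Dict[str, Union[str, bool]]:
    """Build the fields for --kv-events-config from the naming convention.

    ${POD_IP} is left as a literal placeholder; it is expected to be
    expanded from the pod's environment when the container starts.
    """
    endpoint = (
        f"tcp://epp-{pool_namespace}-{pool_name}"
        f".envoy-gateway-system.{cluster_domain}:{port}"
    )
    topic = "kv@${POD_IP}@" + served_model
    return {
        "enable_kv_cache_events": True,
        "publisher": "zmq",
        "endpoint": endpoint,
        "topic": topic,
    }


config = kv_events_config("default", "qwen-inference-pool", "Qwen3-32B")
print(json.dumps(config))
```

For this tutorial's values, the resulting endpoint and topic are the same strings listed in the table, which gives a quick way to confirm that a hand-written configuration follows the convention.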
Deploy inference routing
Step 1: Deploy the inference routing policy
Create inference-policy.yaml. The blockSize and hashSeed values must match the --block-size and PYTHONHASHSEED values in your vLLM deployment.

    # InferencePool selects the vLLM workload pods for routing
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      name: qwen-inference-pool
    spec:
      targetPortNumber: 8000
      selector:
        app: qwen3
    ---
    # InferenceTrafficPolicy configures KV cache-aware load balancing for the pool
    apiVersion: inferenceextension.alibabacloud.com/v1alpha1
    kind: InferenceTrafficPolicy
    metadata:
      name: inference-policy
    spec:
      poolRef:
        name: qwen-inference-pool
      profile:
        single: # Backend is a single-model vLLM deployment
          trafficPolicy:
            prefixCache:
              mode: tracking # Enables KV cache-aware load balancing (precise mode)
              trackingConfig:
                indexerConfig:
                  tokenProcessorConfig:
                    blockSize: 64 # Must match vLLM --block-size
                    hashSeed: 42 # Must match vLLM PYTHONHASHSEED
                    model: Qwen/Qwen3-32B # Official ModelScope model name

Apply the policy.

    kubectl apply -f inference-policy.yaml
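Because mismatched settings between the vLLM deployment and the policy are hard to spot by eye, a small pre-flight check can help. The sketch below is an assumption-laden illustration: it takes the vLLM container arguments, its environment, and the policy's tokenProcessorConfig as plain Python values that you supply yourself (it does not read the manifests), and the function names are hypothetical.

```python
from typing import Dict, List, Optional

REQUIRED_HASH_ALGO = "sha256_cbor_64bit"


def flag_value(args: List[str], flag: str) -> Optional[str]:
    """Return the value of a CLI flag given as '--flag value' or '--flag=value'."""
    for i, arg in enumerate(args):
        if arg == flag and i + 1 < len(args):
            return args[i + 1]
        if arg.startswith(flag + "="):
            return arg.split("=", 1)[1]
    return None


def find_mismatches(
    vllm_args: List[str], vllm_env: Dict[str, str], policy: Dict[str, int]
) -> List[str]:
    """List the settings that disagree between vLLM and the routing policy."""
    problems: List[str] = []

    block_size = flag_value(vllm_args, "--block-size")
    if block_size is None or int(block_size) != policy["blockSize"]:
        problems.append(
            f"--block-size is {block_size}, policy blockSize is {policy['blockSize']}"
        )

    seed = vllm_env.get("PYTHONHASHSEED")
    if seed is None or int(seed) == 0:
        problems.append("PYTHONHASHSEED must be set to a non-zero value")
    elif int(seed) != policy["hashSeed"]:
        problems.append(
            f"PYTHONHASHSEED is {seed}, policy hashSeed is {policy['hashSeed']}"
        )

    algo = flag_value(vllm_args, "--prefix-caching-hash-algo")
    if algo != REQUIRED_HASH_ALGO:
        problems.append(f"--prefix-caching-hash-algo is {algo}")

    return problems


# Values from this tutorial.
policy = {"blockSize": 64, "hashSeed": 42}
vllm_env = {"PYTHONHASHSEED": "42"}
good_args = ["--block-size", "64", "--prefix-caching-hash-algo=sha256_cbor_64bit"]
bad_args = ["--block-size", "16", "--prefix-caching-hash-algo=sha256_cbor_64bit"]

ok = find_mismatches(good_args, vllm_env, policy)
bad = find_mismatches(bad_args, vllm_env, policy)
print(ok)   # []
print(bad)  # one entry describing the block size mismatch
```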
Step 2: Deploy the gateway and routing rules
Create inference-gateway.yaml with the Gateway, HTTPRoute, and a backend timeout policy.

    apiVersion: gateway.networking.k8s.io/v1
    kind: Gateway
    metadata:
      name: inference-gateway
    spec:
      gatewayClassName: ack-gateway
      listeners:
        - name: http-llm
          protocol: HTTP
          port: 8080
    ---
    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: inference-route
    spec:
      parentRefs:
        - name: inference-gateway
      rules:
        - matches:
            - path:
                type: PathPrefix
                value: /v1
          backendRefs:
            - name: qwen-inference-pool
              kind: InferencePool
              group: inference.networking.x-k8s.io
    ---
    apiVersion: gateway.envoyproxy.io/v1alpha1
    kind: BackendTrafficPolicy
    metadata:
      name: backend-timeout
    spec:
      timeout:
        http:
          requestTimeout: 24h
      targetRef:
        group: gateway.networking.k8s.io
        kind: Gateway
        name: inference-gateway

Apply the manifest.

    kubectl apply -f inference-gateway.yaml
Step 3: Verify routing
To confirm precise-mode routing is working, send two requests that share the same prefix and check that both are routed to the same vLLM replica.
Create the two request payloads. Both share the same
contentsegment in the first message.echo '{"max_tokens":24,"messages":[{"content":"Hi, here'\''s some system prompt: hi hi hi hi hi hi hi hi hi hi.For user 3, here are some other context: hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi.I would like to test your intelligence. for this purpose I would like you to play zork. you can interact with the game by typing in commands. I will forward these commands to the game and type in any response. are you ready?","role":"user"}],"model":"Qwen3-32B","stream":true,"stream_options":{"include_usage":true},"temperature":0}' > round1.txtecho '{"max_tokens":3,"messages":[{"content":"Hi, here'\''s some system prompt: hi hi hi hi hi hi hi hi hi hi.For user 3, here are some other context: hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi.I would like to test your intelligence. for this purpose I would like you to play zork. you can interact with the game by typing in commands. I will forward these commands to the game and type in any response. are you ready?","role":"user"},{"content":"Hi there! It looks like you'\''re setting up a fun test. I'\''m ready to play Zork! You can","role":"assistant"},{"content":"% zork\nWelcome to Dungeon. 
This version created 11-MAR-91.\nYou are in an open field west of a big white house with a boarded\nfront door.\nThere is a small mailbox here.\n>","role":"user"},{"content":"Great!","role":"assistant"},{"content":"Opening the mailbox reveals:\n A leaflet.\n>","role":"user"}],"model":"Qwen3-32B","stream":true,"stream_options":{"include_usage":true},"temperature":0}' > round2.txtGet the gateway's public IP address.
    export GATEWAY_IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}')

Send both requests to simulate a multi-turn conversation.
    curl -X POST $GATEWAY_IP:8080/v1/chat/completions -H 'Content-Type: application/json' -d @./round1.txt
    curl -X POST $GATEWAY_IP:8080/v1/chat/completions -H 'Content-Type: application/json' -d @./round2.txt

Check the inference extension logs and confirm that both entries show the same endpoint.Address value. If the addresses match, both requests were routed to the same vLLM replica, confirming that prefix cache-aware routing in precise mode is working.

    kubectl logs deploy/epp-default-qwen-inference-pool -n envoy-gateway-system | grep "handled"

Expected output:

    2025-08-19T10:16:12Z LEVEL(-2) requestcontrol/director.go:278 Request handled {"x-request-id": "00d5c24e-b3c8-461d-9848-7bb233243eb9", "model": "Qwen3-32B", "resolvedTargetModel": "Qwen3-32B", "criticality": "Critical", "model": "Qwen3-32B", "targetModel": "Qwen3-32B", "endpoint": "{NamespacedName:default/qwen3-779c54544f-9c4vz Address:10.0.0.5 Labels:map[app:qwen3 pod-template-hash:779c54544f]}"}
    2025-08-19T10:16:19Z LEVEL(-2) requestcontrol/director.go:278 Request handled {"x-request-id": "401925f5-fe65-46e3-8494-5afd83921ba5", "model": "Qwen3-32B", "resolvedTargetModel": "Qwen3-32B", "criticality": "Critical", "model": "Qwen3-32B", "targetModel": "Qwen3-32B", "endpoint": "{NamespacedName:default/qwen3-779c54544f-9c4vz Address:10.0.0.5 Labels:map[app:qwen3 pod-template-hash:779c54544f]}"}

In this example, both requests show Address:10.0.0.5, confirming they were routed to the same pod.
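When there are more than a couple of log entries, comparing addresses by eye gets tedious. The following sketch extracts the Address field from "Request handled" lines and reports whether they all point to the same replica. It assumes the log format shown in the expected output above; the helper name is hypothetical, and the sample lines are abbreviated versions of that output.

```python
import re
from typing import List

# Matches the IPv4 address inside the endpoint field, e.g. "Address:10.0.0.5".
ADDRESS_RE = re.compile(r"Address:(\d{1,3}(?:\.\d{1,3}){3})")


def routed_addresses(log_text: str) -> List[str]:
    """Return the target address of every 'Request handled' log line, in order."""
    addresses: List[str] = []
    for line in log_text.splitlines():
        if "Request handled" not in line:
            continue
        match = ADDRESS_RE.search(line)
        if match:
            addresses.append(match.group(1))
    return addresses


# Abbreviated sample based on the expected output shown above.
sample_log = (
    '2025-08-19T10:16:12Z LEVEL(-2) requestcontrol/director.go:278 Request handled '
    '{"model": "Qwen3-32B", "endpoint": "{NamespacedName:default/qwen3-779c54544f-9c4vz '
    'Address:10.0.0.5 Labels:map[app:qwen3]}"}\n'
    '2025-08-19T10:16:19Z LEVEL(-2) requestcontrol/director.go:278 Request handled '
    '{"model": "Qwen3-32B", "endpoint": "{NamespacedName:default/qwen3-779c54544f-9c4vz '
    'Address:10.0.0.5 Labels:map[app:qwen3]}"}\n'
)

addresses = routed_addresses(sample_log)
same_replica = len(addresses) > 0 and len(set(addresses)) == 1
print(addresses, same_replica)  # ['10.0.0.5', '10.0.0.5'] True
```

To run it against a live cluster, the output of the kubectl logs command can be saved to a file and passed to routed_addresses in place of sample_log.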