KV Cache-aware load balancing is designed for generative AI inference. It significantly improves the efficiency of large language model (LLM) services by dynamically routing requests to optimal compute nodes. This topic shows you how to use the Gateway with Inference Extension component to implement a KV Cache-aware load balancing policy.
Concepts
vLLM
vLLM is a framework designed for efficient and user-friendly construction of LLM inference services. It supports various large language models, including Qwen, and optimizes LLM inference efficiency through techniques like PagedAttention, dynamic batch inference (Continuous Batching), and model quantization.
KV cache
During inference, an LLM computes attention key (K) and value (V) tensors for every token it processes. The KV cache keeps these tensors in GPU memory so that later decoding steps can reuse them instead of recomputing them, which substantially reduces per-token latency.
vLLM automatic prefix caching
vLLM utilizes automatic prefix caching (APC) to speed up inference. By reusing the KV cache for requests that share the same prefix, the system avoids unnecessary re-computation and reduces latency.
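To make the block-reuse idea concrete, the following is a simplified sketch of chained prefix-block hashing. It is illustrative only: vLLM's actual implementation serializes token blocks differently (this topic later configures the `sha256_cbor_64bit` algorithm), and the `BLOCK_SIZE` value, function name, and serialization used here are assumptions.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per prefix block; illustrative (this topic uses 64 for vLLM)

def prefix_block_hashes(token_ids):
    """Chain-hash fixed-size token blocks so each hash identifies an entire prefix.

    Because each block's hash includes the previous block's hash, two requests
    produce the same hash for block N only if their first N blocks are identical.
    Incomplete trailing blocks are not hashed (they cannot be cached as a block).
    """
    hashes, parent = [], b""
    usable = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for i in range(0, usable, BLOCK_SIZE):
        block = token_ids[i:i + BLOCK_SIZE]
        digest = hashlib.sha256(parent + repr(block).encode()).hexdigest()
        hashes.append(digest)
        parent = digest.encode()
    return hashes

# Two requests that share an 8-token prefix share their first two block hashes,
# so the KV cache entries for those blocks can be reused.
a = prefix_block_hashes([1, 2, 3, 4, 5, 6, 7, 8])
b = prefix_block_hashes([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
assert b[:2] == a and len(b) == 3
```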
KV Cache-aware versus prefix-aware load balancing
A KV Cache-aware load balancing policy works as follows:
Each vLLM workload reports its KV cache block information to the Gateway with Inference Extension component via event messages. Using this data, the Gateway routes new requests to the workload with the highest cache hit ratio based on the request content. Such routing maximizes the prefix cache hit ratio and reduces response time, making this policy ideal for scenarios with a high volume of requests that share common prefixes.
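The selection step can be sketched as follows. This is not the component's actual API: the function names and the data layout (a set of reported KV Cache block hashes per pod) are illustrative assumptions.

```python
def pick_pod(request_blocks, pod_cached_blocks):
    """Route to the pod whose reported cache covers the longest prefix of the request.

    request_blocks:    ordered prefix-block hashes computed from the request content.
    pod_cached_blocks: mapping of pod name -> set of block hashes that pod reported
                       via KV Cache events.
    """
    def prefix_hits(cached):
        hits = 0
        for block_hash in request_blocks:  # prefix match: stop at the first miss
            if block_hash not in cached:
                break
            hits += 1
        return hits

    return max(pod_cached_blocks, key=lambda pod: prefix_hits(pod_cached_blocks[pod]))

# pod-a cached the first two prefix blocks, pod-b only the first,
# so the request is routed to pod-a for the higher cache hit ratio.
pods = {"pod-a": {"h1", "h2"}, "pod-b": {"h1"}}
assert pick_pod(["h1", "h2", "h3"], pods) == "pod-a"
```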
Similar to prefix-aware load balancing, KV Cache-aware load balancing also aims to maximize the prefix cache hit ratio by leveraging the prefix caching mechanism of the inference service framework.
- KV Cache-aware load balancing: directly receives the KV Cache block distribution via events, allowing precise cache-hit maximization. This policy requires vLLM v0.10.0 or later.
- Prefix-aware load balancing: decoupled from the inference engine, which enables broader compatibility, but cannot accurately detect the actual KV Cache distribution on individual pods.
To use KV Cache-aware load balancing, your inference service must use vLLM v0.10.0 or later. You must also configure specific startup arguments to enable KV Cache event reporting.
Prerequisites
- You have created an ACK managed cluster with a GPU node pool. Optionally, install the ACK Virtual Node component in the cluster to use ACS GPU computing power in ACK.
- This topic uses the Qwen3-32B model as an example, which requires more than 64 GB of GPU memory. The ecs.gn8is-2x.8xlarge instance type or the ACS GU8TF card type is recommended.
- You have installed Gateway with Inference Extension 1.4.0-aliyun.3 or later with the Enable Gateway API Inference Extension option selected. For instructions, see Install the Gateway with Inference Extension component.
Deploy the model service
Step 1: Prepare the Qwen3-32B model files
Download the Qwen3-32B model from ModelScope.
Make sure that the git-lfs plugin is installed. If it is not installed, run `yum install git-lfs` or `apt-get install git-lfs`. For more information about installation methods, see Installing Git Large File Storage.

```shell
git lfs install
GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/Qwen/Qwen3-32B.git
cd Qwen3-32B/
git lfs pull
```

Create a folder in OSS and upload the model files to it.
For information about how to install and use ossutil, see Install ossutil.
```shell
ossutil mkdir oss://<YOUR-BUCKET-NAME>/Qwen3-32B
ossutil cp -r ./Qwen3-32B oss://<YOUR-BUCKET-NAME>/Qwen3-32B
```

Create a persistent volume (PV) named `llm-model` and a persistent volume claim (PVC) for the target cluster. For more information, see Use ossfs 1.0 to create a statically provisioned volume.

Create an llm-model.yaml file. This file contains the configurations for a Secret, a statically provisioned PV, and a statically provisioned PVC.
```yaml
apiVersion: v1
kind: Secret
metadata:
  name: oss-secret
stringData:
  akId: <YOUR-OSS-AK> # The AccessKey ID used to access OSS
  akSecret: <YOUR-OSS-SK> # The AccessKey secret used to access OSS
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-model
  labels:
    alicloud-pvname: llm-model
spec:
  capacity:
    storage: 30Gi
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: llm-model
    nodePublishSecretRef:
      name: oss-secret
      namespace: default
    volumeAttributes:
      bucket: <YOUR-BUCKET-NAME> # The name of the bucket.
      url: <YOUR-BUCKET-ENDPOINT> # The endpoint, such as oss-cn-hangzhou-internal.aliyuncs.com.
      otherOpts: "-o umask=022 -o max_stat_cache_size=0 -o allow_other"
      path: <YOUR-MODEL-PATH> # In this example, the path is /Qwen3-32B/.
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-model
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 30Gi
  selector:
    matchLabels:
      alicloud-pvname: llm-model
```

Create the Secret, the statically provisioned PV, and the statically provisioned PVC.
```shell
kubectl create -f llm-model.yaml
```
Step 2: Deploy the vLLM inference service
Create a `vllm.yaml` file. The following describes some of the startup parameters and environment variables.
- `--kv-events-config`: Configuration for publishing KV Cache events. This must be a valid JSON string or individually passed JSON keys. Example value:

  `{"enable_kv_cache_events":true,"publisher":"zmq","endpoint":"tcp://epp-default-qwen-inference-pool.envoy-gateway-system.svc.cluster.local:5557","topic":"kv@${POD_IP}@Qwen3-32B"}`

  Details:
  - `endpoint`: The ZMQ server endpoint of the inference extension. The naming convention is `tcp://epp-<InferencePool_namespace>-<InferencePool_name>.envoy-gateway-system.<cluster_domain>:5557`. You must replace `<InferencePool_namespace>` and `<InferencePool_name>` with the actual namespace and name defined in your `inference-policy.yaml`. In this example, the endpoint is `tcp://epp-default-qwen-inference-pool.envoy-gateway-system.svc.cluster.local:5557`.
  - `topic`: The naming convention is `kv@${POD_IP}@<served_model_name>`. In this example, the topic is `kv@${POD_IP}@Qwen3-32B`.
- `--prefix-caching-hash-algo`: The hashing algorithm for KV Cache prefix blocks. This must be set to `sha256_cbor_64bit`.
- `--block-size`: The number of tokens per KV Cache prefix block. In this example, it is set to `64`.
- `PYTHONHASHSEED`: The seed used by Python for hash algorithms. It must be set to a non-zero value. In this example, it is set to `42`.

Deploy the vLLM inference service.
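This topic does not reproduce the full `vllm.yaml`. The following is a hedged sketch of what such a Deployment might look like, wiring together the parameters described above. The image tag, replica count, mount paths, and GPU resources are assumptions that you must adapt to your environment; only the arguments and environment variables described above are taken from this topic.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: qwen3
  labels:
    app: qwen3
spec:
  replicas: 2
  selector:
    matchLabels:
      app: qwen3
  template:
    metadata:
      labels:
        app: qwen3 # Must match the selector in the InferencePool.
    spec:
      volumes:
        - name: model
          persistentVolumeClaim:
            claimName: llm-model
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.10.0 # Placeholder; use a vLLM v0.10.0 or later image from your registry.
          env:
            - name: POD_IP # Referenced in the --kv-events-config topic.
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
            - name: PYTHONHASHSEED
              value: "42" # Must match hashSeed in the InferenceTrafficPolicy.
          args:
            - --model=/models/Qwen3-32B # Illustrative mount path.
            - --served-model-name=Qwen3-32B
            - --block-size=64
            - --prefix-caching-hash-algo=sha256_cbor_64bit
            - --kv-events-config={"enable_kv_cache_events":true,"publisher":"zmq","endpoint":"tcp://epp-default-qwen-inference-pool.envoy-gateway-system.svc.cluster.local:5557","topic":"kv@${POD_IP}@Qwen3-32B"}
          ports:
            - containerPort: 8000 # Must match targetPortNumber in the InferencePool.
          volumeMounts:
            - name: model
              mountPath: /models/Qwen3-32B
          resources:
            limits:
              nvidia.com/gpu: "2" # Illustrative; size to your instance or card type.
```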
```shell
kubectl create -f vllm.yaml
```
Deploy the inference route
Step 1: Deploy the inference routing policy
Create a file named `inference-policy.yaml`. It defines the `InferencePool` and the `InferenceTrafficPolicy`, which enables the tracking mode for KV Cache awareness.

```yaml
# The InferencePool declares that inference routing is enabled for the workload.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
spec:
  targetPortNumber: 8000
  selector:
    app: qwen3
---
# InferenceTrafficPolicy specifies the traffic policy applied to the InferencePool.
apiVersion: inferenceextension.alibabacloud.com/v1alpha1
kind: InferenceTrafficPolicy
metadata:
  name: inference-policy
spec:
  poolRef:
    name: qwen-inference-pool
  profile:
    single: # Specifies that the backend inference service is a single-node vLLM deployment.
      trafficPolicy: # Specifies the load balancing policy for the inference service.
        prefixCache:
          mode: tracking # Enables tracking-based KV Cache-aware load balancing.
          trackingConfig:
            indexerConfig:
              tokenProcessorConfig:
                blockSize: 64 # Must be consistent with the --block-size startup parameter of vLLM.
                hashSeed: 42 # Must be consistent with the PYTHONHASHSEED environment variable of vLLM.
              model: Qwen/Qwen3-32B # Specifies the official ModelScope name of the model for the inference service.
```

Deploy the inference routing policy.
```shell
kubectl apply -f inference-policy.yaml
```
Step 2: Deploy the Gateway and routing rules
Create an `inference-gateway.yaml` file. This file defines the Gateway, its HTTPRoute, and a BackendTrafficPolicy for timeouts.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway
spec:
  gatewayClassName: ack-gateway
  listeners:
    - name: http-llm
      protocol: HTTP
      port: 8080
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: inference-route
spec:
  parentRefs:
    - name: inference-gateway
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /v1
      backendRefs:
        - name: qwen-inference-pool
          kind: InferencePool
          group: inference.networking.x-k8s.io
---
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: backend-timeout
spec:
  timeout:
    http:
      requestTimeout: 24h
  targetRef:
    group: gateway.networking.k8s.io
    kind: Gateway
    name: inference-gateway
```

Deploy the Gateway configuration.
```shell
kubectl apply -f inference-gateway.yaml
```
Step 3: Verify the route
Create `round1.txt` and `round2.txt`. Both text files contain the same initial `content` block. Send `round1.txt` and `round2.txt` as the bodies of LLM requests, then check the inference extension logs to verify whether KV Cache-aware load balancing is triggered.

round1.txt:

```shell
echo '{"max_tokens":24,"messages":[{"content":"Hi, here'\''s some system prompt: hi hi hi hi hi hi hi hi hi hi.For user 3, here are some other context: hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi.I would like to test your intelligence. for this purpose I would like you to play zork. you can interact with the game by typing in commands. I will forward these commands to the game and type in any response. are you ready?","role":"user"}],"model":"Qwen3-32B","stream":true,"stream_options":{"include_usage":true},"temperature":0}' > round1.txt
```

round2.txt:

```shell
echo '{"max_tokens":3,"messages":[{"content":"Hi, here'\''s some system prompt: hi hi hi hi hi hi hi hi hi hi.For user 3, here are some other context: hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi.I would like to test your intelligence. for this purpose I would like you to play zork. you can interact with the game by typing in commands. I will forward these commands to the game and type in any response. are you ready?","role":"user"},{"content":"Hi there! It looks like you'\''re setting up a fun test. I'\''m ready to play Zork! 
You can","role":"assistant"},{"content":"% zork\nWelcome to Dungeon. This version created 11-MAR-91.\nYou are in an open field west of a big white house with a boarded\nfront door.\nThere is a small mailbox here.\n>","role":"user"},{"content":"Great!","role":"assistant"},{"content":"Opening the mailbox reveals:\n A leaflet.\n>","role":"user"}],"model":"Qwen3-32B","stream":true,"stream_options":{"include_usage":true},"temperature":0}' > round2.txt
```

Get the public IP address of the Gateway.

```shell
export GATEWAY_IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}')
```

Send two requests to simulate a multi-turn conversation.

```shell
curl -X POST $GATEWAY_IP:8080/v1/chat/completions -H 'Content-Type: application/json' -d @./round1.txt
curl -X POST $GATEWAY_IP:8080/v1/chat/completions -H 'Content-Type: application/json' -d @./round2.txt
```

Check the logs to verify that KV Cache-aware load balancing is working.

```shell
kubectl logs deploy/epp-default-qwen-inference-pool -n envoy-gateway-system | grep "handled"
```

Expected output:

```
2025-08-19T10:16:12Z LEVEL(-2) requestcontrol/director.go:278 Request handled {"x-request-id": "00d5c24e-b3c8-461d-9848-7bb233243eb9", "model": "Qwen3-32B", "resolvedTargetModel": "Qwen3-32B", "criticality": "Critical", "model": "Qwen3-32B", "targetModel": "Qwen3-32B", "endpoint": "{NamespacedName:default/qwen3-779c54544f-9c4vz Address:10.0.0.5 Labels:map[app:qwen3 pod-template-hash:779c54544f]}"}
2025-08-19T10:16:19Z LEVEL(-2) requestcontrol/director.go:278 Request handled {"x-request-id": "401925f5-fe65-46e3-8494-5afd83921ba5", "model": "Qwen3-32B", "resolvedTargetModel": "Qwen3-32B", "criticality": "Critical", "model": "Qwen3-32B", "targetModel": "Qwen3-32B", "endpoint": "{NamespacedName:default/qwen3-779c54544f-9c4vz Address:10.0.0.5 Labels:map[app:qwen3 pod-template-hash:779c54544f]}"}
```

The logs should show `Request handled` events with a `resolvedTargetModel`, indicating successful routing. If both requests were routed to the same workload (same `endpoint` address), KV Cache-aware load balancing is working correctly.
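To check programmatically that both requests landed on the same pod, you can compare the `Address` values in the `endpoint` fields of the two log lines. A minimal sketch, where the regular expression and helper name are illustrative:

```python
import re

def endpoint_address(log_line):
    """Extract the pod IP from the endpoint field of a 'Request handled' log line."""
    match = re.search(r"Address:([0-9.]+)", log_line)
    return match.group(1) if match else None

# Abbreviated log lines in the format shown in the expected output above.
line1 = 'Request handled {"endpoint": "{NamespacedName:default/qwen3-779c54544f-9c4vz Address:10.0.0.5 Labels:map[app:qwen3]}"}'
line2 = 'Request handled {"endpoint": "{NamespacedName:default/qwen3-779c54544f-9c4vz Address:10.0.0.5 Labels:map[app:qwen3]}"}'

# Same address for both requests means KV Cache-aware routing reused the warm pod.
assert endpoint_address(line1) == endpoint_address(line2) == "10.0.0.5"
```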