
Container Service for Kubernetes: Implement KV Cache-aware load balancing for vLLM using Gateway with Inference Extension

Last Updated: Jan 05, 2026

KV Cache-aware load balancing is designed for generative AI inference. It significantly improves the efficiency of large language model (LLM) services by dynamically routing requests to optimal compute nodes. This topic shows you how to use the Gateway with Inference Extension component to implement a KV Cache-aware load balancing policy.

Concepts

vLLM

vLLM is a framework for building efficient, easy-to-use LLM inference services. It supports a wide range of large language models, including Qwen, and optimizes inference efficiency through techniques such as PagedAttention, continuous batching (dynamic batch inference), and model quantization.

KV cache

During inference, the model caches the keys and values it generates so that the contextual information of previous requests can be accessed quickly. This improves the efficiency of text generation: a KV cache avoids redundant computation, accelerates inference, and reduces response latency.

vLLM automatic prefix caching

vLLM utilizes automatic prefix caching (APC) to speed up inference. By reusing the KV cache for requests that share the same prefix, the system avoids unnecessary re-computation and reduces latency.
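To build intuition for how prefix caching identifies shared prefixes, the following sketch shows a chained block-hashing scheme over token IDs. This is an illustrative simplification (the function name and hashing details are mine, not vLLM's actual implementation): each block's hash covers all preceding blocks, so two requests share a cache block only if their entire prefix up to that block is identical.

```python
import hashlib
import struct

def prefix_block_hashes(token_ids, block_size=64, seed=42):
    """Split a token sequence into fixed-size blocks and compute a
    chained hash per block. Each hash depends on the parent block's
    hash, so a block can be reused only when the full prefix matches."""
    hashes = []
    parent = struct.pack(">q", seed)  # seed the chain (cf. PYTHONHASHSEED)
    # Only complete blocks are cacheable; a trailing partial block is skipped.
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        block = token_ids[start:start + block_size]
        digest = hashlib.sha256(parent + str(block).encode("utf-8")).digest()[:8]
        hashes.append(digest.hex())
        parent = digest
    return hashes

# Two requests with identical first 64 tokens share the first block hash,
# but diverge from the first block where their tokens differ.
a = prefix_block_hashes(list(range(128)))
b = prefix_block_hashes(list(range(64)) + list(range(1000, 1064)))
assert a[0] == b[0] and a[1] != b[1]
```

This mirrors why the later configuration insists on a fixed block size, hash algorithm, and hash seed: the gateway can only match a request against a pod's cached blocks if both sides compute identical hashes.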

KV Cache-aware versus prefix-aware load balancing

A KV Cache-aware load balancing policy works as follows:

Each vLLM workload reports its KV cache block information to the Gateway with Inference Extension component via event messages. Using this data, the Gateway routes new requests to the workload with the highest cache hit ratio based on the request content. Such routing maximizes the prefix cache hit ratio and reduces response time, making this policy ideal for scenarios with a high volume of requests that share common prefixes.

Similar to prefix-aware load balancing, KV Cache-aware load balancing also aims to maximize the prefix cache hit ratio by leveraging the prefix caching mechanism of the inference service framework.

  • KV Cache-aware load balancing: Directly receives KV Cache block distribution via events, allowing for precise cache-hit maximization. This policy requires vLLM v0.10.0 or later.

  • Prefix-aware load balancing: Decoupled from the inference engine (enabling broader compatibility), but cannot accurately detect the actual KV Cache distribution on individual pods.

Important

To use KV Cache-aware load balancing, your inference service must use vLLM v0.10.0 or later. You must also configure specific startup arguments to enable KV Cache event reporting.

Prerequisites

Deploy the model service

Step 1: Prepare the Qwen3-32B model files

  1. Download the Qwen3-32B model from ModelScope.

    Make sure that the git-lfs plugin is installed. If it is not installed, you can run yum install git-lfs or apt-get install git-lfs to install it. For more information about installation methods, see Installing Git Large File Storage.
    git lfs install
    GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/Qwen/Qwen3-32B.git
    cd Qwen3-32B/
    git lfs pull
  2. Create a folder in OSS and upload the model files to it.

    For information about how to install and use ossutil, see Install ossutil.
    ossutil mkdir oss://<YOUR-BUCKET-NAME>/Qwen3-32B
    ossutil cp -r ./Qwen3-32B oss://<YOUR-BUCKET-NAME>/Qwen3-32B
  3. Create a persistent volume (PV) named llm-model and a persistent volume claim (PVC) for the target cluster. For more information, see Use ossfs 1.0 to create a statically provisioned volume.

    1. Create an llm-model.yaml file. This file contains the configurations for a Secret, a statically provisioned PV, and a statically provisioned PVC.

      apiVersion: v1
      kind: Secret
      metadata:
        name: oss-secret
      stringData:
        akId: <YOUR-OSS-AK> # The AccessKey ID used to access OSS
        akSecret: <YOUR-OSS-SK> # The AccessKey secret used to access OSS
      ---
      apiVersion: v1
      kind: PersistentVolume
      metadata:
        name: llm-model
        labels:
          alicloud-pvname: llm-model
      spec:
        capacity:
          storage: 30Gi
        accessModes:
          - ReadOnlyMany
        persistentVolumeReclaimPolicy: Retain
        csi:
          driver: ossplugin.csi.alibabacloud.com
          volumeHandle: llm-model
          nodePublishSecretRef:
            name: oss-secret
            namespace: default
          volumeAttributes:
            bucket: <YOUR-BUCKET-NAME> # The name of the bucket.
            url: <YOUR-BUCKET-ENDPOINT> # The Endpoint information, such as oss-cn-hangzhou-internal.aliyuncs.com.
            otherOpts: "-o umask=022 -o max_stat_cache_size=0 -o allow_other"
            path: <YOUR-MODEL-PATH> # In this example, the path is /Qwen3-32B/.
      ---
      apiVersion: v1
      kind: PersistentVolumeClaim
      metadata:
        name: llm-model
      spec:
        accessModes:
          - ReadOnlyMany
        resources:
          requests:
            storage: 30Gi
        selector:
          matchLabels:
            alicloud-pvname: llm-model
    2. Create the Secret, the statically provisioned PV, and the statically provisioned PVC.

      kubectl create -f llm-model.yaml

Step 2: Deploy the vLLM inference service

  1. Create a vllm.yaml file.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        app: qwen3
      name: qwen3
    spec:
      progressDeadlineSeconds: 600
      replicas: 3
      revisionHistoryLimit: 10
      selector:
        matchLabels:
          app: qwen3
      strategy:
        rollingUpdate:
          maxSurge: 25%
          maxUnavailable: 25%
        type: RollingUpdate
      template:
        metadata:
          annotations:
            prometheus.io/path: /metrics
            prometheus.io/port: '8000'
            prometheus.io/scrape: 'true'
          labels:
            app: qwen3
        spec:
          containers:
            - command:
                - sh
                - '-c'
                - >-
                  vllm serve /models/Qwen3-32B --served-model-name Qwen3-32B
                  --trust-remote-code --port=8000 --max-model-len 8192
                  --gpu-memory-utilization 0.95 --enforce-eager --kv-events-config
                  "{\"enable_kv_cache_events\":true,\"publisher\":\"zmq\",\"endpoint\":\"tcp://epp-default-qwen-inference-pool.envoy-gateway-system.svc.cluster.local:5557\",\"topic\":\"kv@${POD_IP}@Qwen3-32B\"}"
                  --prefix-caching-hash-algo sha256_cbor_64bit --block-size 64
              env:
                - name: POD_IP
                  valueFrom:
                    fieldRef:
                      apiVersion: v1
                      fieldPath: status.podIP
                - name: PYTHONHASHSEED
                  value: '42'
              image: 'registry-cn-hangzhou.ack.aliyuncs.com/dev/vllm:0.10.0'
              imagePullPolicy: IfNotPresent
              name: vllm
              ports:
                - containerPort: 8000
                  name: restful
                  protocol: TCP
              readinessProbe:
                failureThreshold: 3
                initialDelaySeconds: 30
                periodSeconds: 30
                successThreshold: 1
                tcpSocket:
                  port: 8000
                timeoutSeconds: 1
              resources:
                limits:
                  nvidia.com/gpu: '1'
                requests:
                  nvidia.com/gpu: '1'
              terminationMessagePath: /dev/termination-log
              terminationMessagePolicy: File
              volumeMounts:
                - mountPath: /models/Qwen3-32B
                  name: model
                - mountPath: /dev/shm
                  name: dshm
          dnsPolicy: ClusterFirst
          restartPolicy: Always
          schedulerName: default-scheduler
          securityContext: {}
          terminationGracePeriodSeconds: 30
          volumes:
            - name: model
              persistentVolumeClaim:
                claimName: llm-model
            - emptyDir:
                medium: Memory
                sizeLimit: 30Gi
              name: dshm
    ---
    apiVersion: v1
    kind: Service
    metadata:
      labels:
        app: qwen3
      name: qwen3
    spec:
      ports:
        - name: http-serving
          port: 8000
          protocol: TCP
          targetPort: 8000
      selector:
        app: qwen3
      type: ClusterIP

    The following table describes some of the startup parameters and environment variables.

    --kv-events-config

      The configuration for publishing KV Cache events. The value must be a valid JSON string or individually passed JSON keys. Example value:

      {"enable_kv_cache_events":true,"publisher":"zmq","endpoint":"tcp://epp-default-qwen-inference-pool.envoy-gateway-system.svc.cluster.local:5557","topic":"kv@${POD_IP}@Qwen3-32B"}

      • endpoint: The ZMQ server endpoint of the inference extension. The naming convention is tcp://epp-<InferencePool_namespace>-<InferencePool_name>.envoy-gateway-system.<cluster_domain>:5557. Replace <InferencePool_namespace> and <InferencePool_name> with the namespace and name defined in your inference-policy.yaml. In this example, the endpoint is tcp://epp-default-qwen-inference-pool.envoy-gateway-system.svc.cluster.local:5557.

      • topic: The naming convention is kv@${POD_IP}@<served_model_name>. In this example, the topic is kv@${POD_IP}@Qwen3-32B.

    --prefix-caching-hash-algo

      The hashing algorithm for KV Cache prefix blocks. This must be set to sha256_cbor_64bit.

    --block-size

      The number of tokens per KV Cache prefix block. In this example, it is set to 64.

    PYTHONHASHSEED

      The seed used by Python for hash algorithms. It must be set to a non-zero value. In this example, it is set to 42.
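The endpoint and topic naming conventions above can be sketched as small string builders. This is an illustrative helper (the function names are mine); it only reproduces the documented conventions so you can derive the values for your own namespace, pool name, and model name.

```python
def epp_endpoint(pool_namespace, pool_name,
                 cluster_domain="svc.cluster.local", port=5557):
    """Build the ZMQ endpoint of the inference extension from the
    InferencePool's namespace and name, per the convention
    tcp://epp-<ns>-<name>.envoy-gateway-system.<cluster_domain>:<port>."""
    return (f"tcp://epp-{pool_namespace}-{pool_name}"
            f".envoy-gateway-system.{cluster_domain}:{port}")

def kv_topic(pod_ip, served_model_name):
    """Build the KV event topic per the convention kv@<pod_ip>@<model>."""
    return f"kv@{pod_ip}@{served_model_name}"

print(epp_endpoint("default", "qwen-inference-pool"))
# tcp://epp-default-qwen-inference-pool.envoy-gateway-system.svc.cluster.local:5557
print(kv_topic("${POD_IP}", "Qwen3-32B"))
# kv@${POD_IP}@Qwen3-32B
```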

  2. Deploy the vLLM inference service.

    kubectl create -f vllm.yaml

Deploy the inference route

Step 1: Deploy the inference routing policy

  1. Create a file named inference-policy.yaml. This file defines the InferencePool and an InferenceTrafficPolicy that enables tracking mode for KV Cache awareness.

    # The InferencePool declares that inference routing is enabled for the workload.
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      name: qwen-inference-pool
    spec:
      targetPortNumber: 8000
      selector:
        app: qwen3
    ---
    # InferenceTrafficPolicy specifies the traffic policy applied to the InferencePool.
    apiVersion: inferenceextension.alibabacloud.com/v1alpha1
    kind: InferenceTrafficPolicy
    metadata:
      name: inference-policy
    spec:
      poolRef:
        name: qwen-inference-pool
      profile: 
        single: # Specifies that the backend inference service is a single-node vLLM deployment.
          trafficPolicy: # Specifies the load balancing policy for the inference service.
            prefixCache:
              mode: tracking # Enables tracking-based KV Cache-aware load balancing.
              trackingConfig:
                indexerConfig:
                  tokenProcessorConfig:
                    blockSize: 64 # Must be consistent with the --block-size startup parameter of vLLM.
                    hashSeed: 42  # Must be consistent with the PYTHONHASHSEED environment variable of vLLM.
                    model: Qwen/Qwen3-32B # Specifies the official ModelScope name of the model for the inference service.
  2. Deploy the inference routing policy.

    kubectl apply -f inference-policy.yaml
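Tracking mode only works when the gateway's indexer reproduces the block hashes that vLLM computes, so the consistency rules noted in the YAML comments (blockSize matches --block-size, hashSeed matches PYTHONHASHSEED) are worth checking before you deploy. The sketch below is a hypothetical pre-deployment check, not part of any component:

```python
import shlex

def check_policy_consistency(vllm_command, vllm_env, token_processor):
    """Cross-check the vLLM startup command and environment against the
    InferenceTrafficPolicy tokenProcessorConfig. Returns a list of
    mismatch descriptions; an empty list means the configs agree."""
    args = shlex.split(vllm_command)
    def flag(name):
        return args[args.index(name) + 1]  # value following the flag
    problems = []
    if int(flag("--block-size")) != token_processor["blockSize"]:
        problems.append("--block-size != tokenProcessorConfig.blockSize")
    if int(vllm_env.get("PYTHONHASHSEED", "0")) != token_processor["hashSeed"]:
        problems.append("PYTHONHASHSEED != tokenProcessorConfig.hashSeed")
    if flag("--prefix-caching-hash-algo") != "sha256_cbor_64bit":
        problems.append("--prefix-caching-hash-algo must be sha256_cbor_64bit")
    return problems

cmd = ("vllm serve /models/Qwen3-32B --block-size 64 "
       "--prefix-caching-hash-algo sha256_cbor_64bit")
env = {"PYTHONHASHSEED": "42"}
policy = {"blockSize": 64, "hashSeed": 42}
assert check_policy_consistency(cmd, env, policy) == []
```

If any mismatch is reported, the gateway would compute different block hashes than vLLM and the tracking index would never match, silently degrading routing to cache-unaware behavior.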

Step 2: Deploy the Gateway and routing rules

  1. Create an inference-gateway.yaml file. This file defines the Gateway, its HTTPRoute, and a BackendTrafficPolicy for timeouts.

    apiVersion: gateway.networking.k8s.io/v1
    kind: Gateway
    metadata:
      name: inference-gateway
    spec:
      gatewayClassName: ack-gateway
      listeners:
      - name: http-llm
        protocol: HTTP
        port: 8080
    ---
    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: inference-route
    spec:
      parentRefs:
      - name: inference-gateway
      rules:
      - matches:
        - path:
            type: PathPrefix
            value: /v1
        backendRefs:
        - name: qwen-inference-pool
          kind: InferencePool
          group: inference.networking.x-k8s.io
    ---
    apiVersion: gateway.envoyproxy.io/v1alpha1
    kind: BackendTrafficPolicy
    metadata:
      name: backend-timeout
    spec:
      timeout:
        http:
          requestTimeout: 24h
      targetRef:
        group: gateway.networking.k8s.io
        kind: Gateway
        name: inference-gateway
  2. Deploy the Gateway configuration.

    kubectl apply -f inference-gateway.yaml

Step 3: Verify the route

  1. Create round1.txt and round2.txt. The two files share the same initial content block, simulating the shared prefix of a multi-turn conversation. You will send them as the bodies of two LLM requests and then check the inference extension logs to verify whether KV Cache-aware load balancing is triggered.

    round1.txt:

    echo '{"max_tokens":24,"messages":[{"content":"Hi, here'\''s some system prompt: hi hi hi hi hi hi hi hi hi hi.For user 3, here are some other context: hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi.I would like to test your intelligence. for this purpose I would like you to play zork. you can interact with the game by typing in commands. I will forward these commands to the game and type in any response. are you ready?","role":"user"}],"model":"Qwen3-32B","stream":true,"stream_options":{"include_usage":true},"temperature":0}' > round1.txt

    round2.txt:

    echo '{"max_tokens":3,"messages":[{"content":"Hi, here'\''s some system prompt: hi hi hi hi hi hi hi hi hi hi.For user 3, here are some other context: hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi.I would like to test your intelligence. for this purpose I would like you to play zork. you can interact with the game by typing in commands. I will forward these commands to the game and type in any response. are you ready?","role":"user"},{"content":"Hi there! It looks like you're setting up a fun test. I'm ready to play Zork! You can","role":"assistant"},{"content":"% zork\nWelcome to Dungeon. This version created 11-MAR-91.\nYou are in an open field west of a big white house with a boarded\nfront door.\nThere is a small mailbox here.\n>","role":"user"},{"content":"Great!","role":"assistant"},{"content":"Opening the mailbox reveals:\n A leaflet.\n>","role":"user"}],"model":"Qwen3-32B","stream":true,"stream_options":{"include_usage":true},"temperature":0}' > round2.txt
  2. Get the public IP address of the Gateway.

    export GATEWAY_IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}')
  3. Send two requests to simulate a multi-turn conversation.

    curl -X POST $GATEWAY_IP:8080/v1/chat/completions -H 'Content-Type: application/json' -d @./round1.txt
    curl -X POST $GATEWAY_IP:8080/v1/chat/completions -H 'Content-Type: application/json' -d @./round2.txt
  4. Check the logs to verify that KV Cache-aware load balancing is working.

    kubectl logs deploy/epp-default-qwen-inference-pool -n envoy-gateway-system | grep "handled"

    Expected output:

    2025-08-19T10:16:12Z	LEVEL(-2)	requestcontrol/director.go:278	Request handled	{"x-request-id": "00d5c24e-b3c8-461d-9848-7bb233243eb9", "model": "Qwen3-32B", "resolvedTargetModel": "Qwen3-32B", "criticality": "Critical", "model": "Qwen3-32B", "targetModel": "Qwen3-32B", "endpoint": "{NamespacedName:default/qwen3-779c54544f-9c4vz Address:10.0.0.5 Labels:map[app:qwen3 pod-template-hash:779c54544f]}"}
    2025-08-19T10:16:19Z	LEVEL(-2)	requestcontrol/director.go:278	Request handled	{"x-request-id": "401925f5-fe65-46e3-8494-5afd83921ba5", "model": "Qwen3-32B", "resolvedTargetModel": "Qwen3-32B", "criticality": "Critical", "model": "Qwen3-32B", "targetModel": "Qwen3-32B", "endpoint": "{NamespacedName:default/qwen3-779c54544f-9c4vz Address:10.0.0.5 Labels:map[app:qwen3 pod-template-hash:779c54544f]}"}

    The logs should show a Request handled event for each request. If both requests were routed to the same workload (the same endpoint address), KV Cache-aware load balancing is working correctly.
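When there are many log lines, the same-endpoint check can be automated by extracting the Address field from each Request handled entry. The sketch below is an illustrative log parser (the function name and regex are mine, tied to the log format shown above):

```python
import re

def routed_endpoints(log_text):
    """Extract the backend pod address from each 'Request handled' log
    line. With KV Cache-aware routing, follow-up turns of a conversation
    should land on the same address as the first turn."""
    pattern = re.compile(r'Request handled.*?Address:([\d.]+)')
    return pattern.findall(log_text)

# Abbreviated sample in the format of the expected output above.
logs = """\
... Request handled {... Address:10.0.0.5 Labels:map[app:qwen3]}
... Request handled {... Address:10.0.0.5 Labels:map[app:qwen3]}
"""
addrs = routed_endpoints(logs)
assert len(set(addrs)) == 1  # both requests routed to the same vLLM pod
```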