
Container Service for Kubernetes: Implement KV Cache-aware load balancing for vLLM using Gateway with Inference Extension

Last Updated: Jan 05, 2026

KV Cache-aware load balancing is designed for generative AI inference. It significantly improves the efficiency of large language model (LLM) services by dynamically routing requests to optimal compute nodes. This topic shows you how to use the Gateway with Inference Extension component to implement a KV Cache-aware load balancing policy.

Concepts

vLLM

vLLM is a framework for building efficient, easy-to-use LLM inference services. It supports a wide range of large language models, including Qwen, and optimizes inference efficiency through techniques such as PagedAttention, continuous batching (dynamic batch inference), and model quantization.

KV cache

During inference, the model caches the keys and values it generates so that the contextual information of previous requests can be accessed quickly. This improves the efficiency of text generation: a KV cache avoids redundant computation, accelerates inference, and reduces response latency.

vLLM automatic prefix caching

vLLM utilizes automatic prefix caching (APC) to speed up inference. By reusing the KV cache for requests that share the same prefix, the system avoids unnecessary re-computation and reduces latency.
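To build intuition for how prefix caching identifies shared prefixes, the following sketch shows a chained block-hashing scheme over token IDs. This is an illustrative simplification (the function name and hashing details are mine, not vLLM's actual implementation): each block's hash covers all preceding blocks, so two requests share a cache block only if their entire prefix up to that block is identical.

```python
import hashlib
import struct

def prefix_block_hashes(token_ids, block_size=64, seed=42):
    """Split a token sequence into fixed-size blocks and compute a
    chained hash per block. Each hash depends on the parent block's
    hash, so a block can be reused only when the full prefix matches."""
    hashes = []
    parent = struct.pack(">q", seed)  # seed the chain (cf. PYTHONHASHSEED)
    # Only complete blocks are cacheable; a trailing partial block is skipped.
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        block = token_ids[start:start + block_size]
        digest = hashlib.sha256(parent + str(block).encode("utf-8")).digest()[:8]
        hashes.append(digest.hex())
        parent = digest
    return hashes

# Two requests with identical first 64 tokens share the first block hash,
# but diverge from the first block where their tokens differ.
a = prefix_block_hashes(list(range(128)))
b = prefix_block_hashes(list(range(64)) + list(range(1000, 1064)))
assert a[0] == b[0] and a[1] != b[1]
```

This mirrors why the later configuration insists on a fixed block size, hash algorithm, and hash seed: the gateway can only match a request against a pod's cached blocks if both sides compute identical hashes.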

KV Cache-aware versus prefix-aware load balancing

A KV Cache-aware load balancing policy works as follows:

Each vLLM workload reports its KV cache block information to the Gateway with Inference Extension component via event messages. Using this data, the Gateway routes new requests to the workload with the highest cache hit ratio based on the request content. Such routing maximizes the prefix cache hit ratio and reduces response time, making this policy ideal for scenarios with a high volume of requests that share common prefixes.

Similar to prefix-aware load balancing, KV Cache-aware load balancing also aims to maximize the prefix cache hit ratio by leveraging the prefix caching mechanism of the inference service framework.

  • KV Cache-aware load balancing: Directly receives KV Cache block distribution via events, allowing for precise cache-hit maximization. This policy requires vLLM v0.10.0 or later.

  • Prefix-aware load balancing: Decoupled from the inference engine (enabling broader compatibility), but cannot accurately detect the actual KV Cache distribution on individual pods.

Important

To use KV Cache-aware load balancing, your inference service must use vLLM v0.10.0 or later. You must also configure specific startup arguments to enable KV Cache event reporting.

Prerequisites

Deploy the model service

Step 1: Prepare the Qwen3-32B model files

  1. Download the Qwen3-32B model from ModelScope.

    Make sure that the git-lfs plugin is installed. If it is not installed, you can run yum install git-lfs or apt-get install git-lfs to install it. For more information about installation methods, see Installing Git Large File Storage.
    git lfs install
    GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/Qwen/Qwen3-32B.git
    cd Qwen3-32B/
    git lfs pull
  2. Create a folder in OSS and upload the model files to it.

    For information about how to install and use ossutil, see Install ossutil.
    ossutil mkdir oss://<YOUR-BUCKET-NAME>/Qwen3-32B
    ossutil cp -r ./Qwen3-32B oss://<YOUR-BUCKET-NAME>/Qwen3-32B
  3. Create a persistent volume (PV) named llm-model and a persistent volume claim (PVC) for the target cluster. For more information, see Use ossfs 1.0 to create a statically provisioned volume.

    1. Create an llm-model.yaml file. This file contains the configurations for a Secret, a statically provisioned PV, and a statically provisioned PVC.

      apiVersion: v1
      kind: Secret
      metadata:
        name: oss-secret
      stringData:
        akId: <YOUR-OSS-AK> # The AccessKey ID used to access OSS
        akSecret: <YOUR-OSS-SK> # The AccessKey secret used to access OSS
      ---
      apiVersion: v1
      kind: PersistentVolume
      metadata:
        name: llm-model
        labels:
          alicloud-pvname: llm-model
      spec:
        capacity:
          storage: 30Gi
        accessModes:
          - ReadOnlyMany
        persistentVolumeReclaimPolicy: Retain
        csi:
          driver: ossplugin.csi.alibabacloud.com
          volumeHandle: llm-model
          nodePublishSecretRef:
            name: oss-secret
            namespace: default
          volumeAttributes:
            bucket: <YOUR-BUCKET-NAME> # The name of the bucket.
            url: <YOUR-BUCKET-ENDPOINT> # The Endpoint information, such as oss-cn-hangzhou-internal.aliyuncs.com.
            otherOpts: "-o umask=022 -o max_stat_cache_size=0 -o allow_other"
            path: <YOUR-MODEL-PATH> # In this example, the path is /Qwen3-32B/.
      ---
      apiVersion: v1
      kind: PersistentVolumeClaim
      metadata:
        name: llm-model
      spec:
        accessModes:
          - ReadOnlyMany
        resources:
          requests:
            storage: 30Gi
        selector:
          matchLabels:
            alicloud-pvname: llm-model
    2. Create the Secret, the statically provisioned PV, and the statically provisioned PVC.

      kubectl create -f llm-model.yaml

Step 2: Deploy the vLLM inference service

  1. Create a vllm.yaml file.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        app: qwen3
      name: qwen3
    spec:
      progressDeadlineSeconds: 600
      replicas: 3
      revisionHistoryLimit: 10
      selector:
        matchLabels:
          app: qwen3
      strategy:
        rollingUpdate:
          maxSurge: 25%
          maxUnavailable: 25%
        type: RollingUpdate
      template:
        metadata:
          annotations:
            prometheus.io/path: /metrics
            prometheus.io/port: '8000'
            prometheus.io/scrape: 'true'
          labels:
            app: qwen3
        spec:
          containers:
            - command:
                - sh
                - '-c'
                - >-
                  vllm serve /models/Qwen3-32B --served-model-name Qwen3-32B
                  --trust-remote-code --port=8000 --max-model-len 8192
                  --gpu-memory-utilization 0.95 --enforce-eager --kv-events-config
                  "{\"enable_kv_cache_events\":true,\"publisher\":\"zmq\",\"endpoint\":\"tcp://epp-default-qwen-inference-pool.envoy-gateway-system.svc.cluster.local:5557\",\"topic\":\"kv@${POD_IP}@Qwen3-32B\"}"
                  --prefix-caching-hash-algo sha256_cbor_64bit --block-size 64
              env:
                - name: POD_IP
                  valueFrom:
                    fieldRef:
                      apiVersion: v1
                      fieldPath: status.podIP
                - name: PYTHONHASHSEED
                  value: '42'
              image: 'registry-cn-hangzhou.ack.aliyuncs.com/dev/vllm:0.10.0'
              imagePullPolicy: IfNotPresent
              name: vllm
              ports:
                - containerPort: 8000
                  name: restful
                  protocol: TCP
              readinessProbe:
                failureThreshold: 3
                initialDelaySeconds: 30
                periodSeconds: 30
                successThreshold: 1
                tcpSocket:
                  port: 8000
                timeoutSeconds: 1
              resources:
                limits:
                  nvidia.com/gpu: '1'
                requests:
                  nvidia.com/gpu: '1'
              terminationMessagePath: /dev/termination-log
              terminationMessagePolicy: File
              volumeMounts:
                - mountPath: /models/Qwen3-32B
                  name: model
                - mountPath: /dev/shm
                  name: dshm
          dnsPolicy: ClusterFirst
          restartPolicy: Always
          schedulerName: default-scheduler
          securityContext: {}
          terminationGracePeriodSeconds: 30
          volumes:
            - name: model
              persistentVolumeClaim:
                claimName: llm-model
            - emptyDir:
                medium: Memory
                sizeLimit: 30Gi
              name: dshm
    ---
    apiVersion: v1
    kind: Service
    metadata:
      labels:
        app: qwen3
      name: qwen3
    spec:
      ports:
        - name: http-serving
          port: 8000
          protocol: TCP
          targetPort: 8000
      selector:
        app: qwen3
      type: ClusterIP

    The following table describes some of the startup parameters and environment variables.

    --kv-events-config

      The configuration for publishing KV Cache events. The value must be a valid JSON string or individually passed JSON keys. Example value:

      {"enable_kv_cache_events":true,"publisher":"zmq","endpoint":"tcp://epp-default-qwen-inference-pool.envoy-gateway-system.svc.cluster.local:5557","topic":"kv@${POD_IP}@Qwen3-32B"}

      • endpoint: The ZMQ server endpoint of the inference extension. The naming convention is tcp://epp-<InferencePool_namespace>-<InferencePool_name>.envoy-gateway-system.<cluster_domain>:5557. Replace <InferencePool_namespace> and <InferencePool_name> with the namespace and name defined in your inference-policy.yaml. In this example, the endpoint is tcp://epp-default-qwen-inference-pool.envoy-gateway-system.svc.cluster.local:5557.

      • topic: The naming convention is kv@${POD_IP}@<served_model_name>. In this example, the topic is kv@${POD_IP}@Qwen3-32B.

    --prefix-caching-hash-algo

      The hashing algorithm for KV Cache prefix blocks. This must be set to sha256_cbor_64bit.

    --block-size

      The number of tokens per KV Cache prefix block. In this example, it is set to 64.

    PYTHONHASHSEED

      The seed used by Python for hash algorithms. It must be set to a non-zero value. In this example, it is set to 42.
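The endpoint and topic naming conventions above can be sketched as small string builders. This is an illustrative helper (the function names are mine); it only reproduces the documented conventions so you can derive the values for your own namespace, pool name, and model name.

```python
def epp_endpoint(pool_namespace, pool_name,
                 cluster_domain="svc.cluster.local", port=5557):
    """Build the ZMQ endpoint of the inference extension from the
    InferencePool's namespace and name, per the convention
    tcp://epp-<ns>-<name>.envoy-gateway-system.<cluster_domain>:<port>."""
    return (f"tcp://epp-{pool_namespace}-{pool_name}"
            f".envoy-gateway-system.{cluster_domain}:{port}")

def kv_topic(pod_ip, served_model_name):
    """Build the KV event topic per the convention kv@<pod_ip>@<model>."""
    return f"kv@{pod_ip}@{served_model_name}"

print(epp_endpoint("default", "qwen-inference-pool"))
# tcp://epp-default-qwen-inference-pool.envoy-gateway-system.svc.cluster.local:5557
print(kv_topic("${POD_IP}", "Qwen3-32B"))
# kv@${POD_IP}@Qwen3-32B
```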

  2. Deploy the vLLM inference service.

    kubectl create -f vllm.yaml

Deploy the inference route

Step 1: Deploy the inference routing policy

  1. Create a file named inference-policy.yaml. This file defines the InferencePool and an InferenceTrafficPolicy that enables tracking mode for KV Cache awareness.

    # The InferencePool declares that inference routing is enabled for the workload.
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      name: qwen-inference-pool
    spec:
      targetPortNumber: 8000
      selector:
        app: qwen3
    ---
    # InferenceTrafficPolicy specifies the traffic policy applied to the InferencePool.
    apiVersion: inferenceextension.alibabacloud.com/v1alpha1
    kind: InferenceTrafficPolicy
    metadata:
      name: inference-policy
    spec:
      poolRef:
        name: qwen-inference-pool
      profile: 
        single: # Specifies that the backend inference service is a single-node vLLM deployment.
          trafficPolicy: # Specifies the load balancing policy for the inference service.
            prefixCache:
              mode: tracking # Enables tracking-based KV Cache-aware load balancing.
              trackingConfig:
                indexerConfig:
                  tokenProcessorConfig:
                    blockSize: 64 # Must be consistent with the --block-size startup parameter of vLLM.
                    hashSeed: 42  # Must be consistent with the PYTHONHASHSEED environment variable of vLLM.
                    model: Qwen/Qwen3-32B # Specifies the official ModelScope name of the model for the inference service.
  2. Deploy the inference routing policy.

    kubectl apply -f inference-policy.yaml
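Tracking mode only works when the gateway's indexer reproduces the block hashes that vLLM computes, so the consistency rules noted in the YAML comments (blockSize matches --block-size, hashSeed matches PYTHONHASHSEED) are worth checking before you deploy. The sketch below is a hypothetical pre-deployment check, not part of any component:

```python
import shlex

def check_policy_consistency(vllm_command, vllm_env, token_processor):
    """Cross-check the vLLM startup command and environment against the
    InferenceTrafficPolicy tokenProcessorConfig. Returns a list of
    mismatch descriptions; an empty list means the configs agree."""
    args = shlex.split(vllm_command)
    def flag(name):
        return args[args.index(name) + 1]  # value following the flag
    problems = []
    if int(flag("--block-size")) != token_processor["blockSize"]:
        problems.append("--block-size != tokenProcessorConfig.blockSize")
    if int(vllm_env.get("PYTHONHASHSEED", "0")) != token_processor["hashSeed"]:
        problems.append("PYTHONHASHSEED != tokenProcessorConfig.hashSeed")
    if flag("--prefix-caching-hash-algo") != "sha256_cbor_64bit":
        problems.append("--prefix-caching-hash-algo must be sha256_cbor_64bit")
    return problems

cmd = ("vllm serve /models/Qwen3-32B --block-size 64 "
       "--prefix-caching-hash-algo sha256_cbor_64bit")
env = {"PYTHONHASHSEED": "42"}
policy = {"blockSize": 64, "hashSeed": 42}
assert check_policy_consistency(cmd, env, policy) == []
```

If any mismatch is reported, the gateway would compute different block hashes than vLLM and the tracking index would never match, silently degrading routing to cache-unaware behavior.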

Step 2: Deploy the Gateway and routing rules

  1. Create an inference-gateway.yaml file. This file defines the Gateway, its HTTPRoute, and a BackendTrafficPolicy for timeouts.

    apiVersion: gateway.networking.k8s.io/v1
    kind: Gateway
    metadata:
      name: inference-gateway
    spec:
      gatewayClassName: ack-gateway
      listeners:
      - name: http-llm
        protocol: HTTP
        port: 8080
    ---
    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: inference-route
    spec:
      parentRefs:
      - name: inference-gateway
      rules:
      - matches:
        - path:
            type: PathPrefix
            value: /v1
        backendRefs:
        - name: qwen-inference-pool
          kind: InferencePool
          group: inference.networking.x-k8s.io
    ---
    apiVersion: gateway.envoyproxy.io/v1alpha1
    kind: BackendTrafficPolicy
    metadata:
      name: backend-timeout
    spec:
      timeout:
        http:
          requestTimeout: 24h
      targetRef:
        group: gateway.networking.k8s.io
        kind: Gateway
        name: inference-gateway
  2. Deploy the Gateway configuration.

    kubectl apply -f inference-gateway.yaml

Step 3: Verify the route

  1. Create round1.txt and round2.txt. The two files share the same initial content block, simulating the shared prefix of a multi-turn conversation. You will send them as the bodies of two LLM requests and then check the inference extension logs to verify whether KV Cache-aware load balancing is triggered.

    round1.txt:

    echo '{"max_tokens":24,"messages":[{"content":"Hi, here'\''s some system prompt: hi hi hi hi hi hi hi hi hi hi.For user 3, here are some other context: hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi.I would like to test your intelligence. for this purpose I would like you to play zork. you can interact with the game by typing in commands. I will forward these commands to the game and type in any response. are you ready?","role":"user"}],"model":"Qwen3-32B","stream":true,"stream_options":{"include_usage":true},"temperature":0}' > round1.txt

    round2.txt:

    echo '{"max_tokens":3,"messages":[{"content":"Hi, here'\''s some system prompt: hi hi hi hi hi hi hi hi hi hi.For user 3, here are some other context: hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi.I would like to test your intelligence. for this purpose I would like you to play zork. you can interact with the game by typing in commands. I will forward these commands to the game and type in any response. are you ready?","role":"user"},{"content":"Hi there! It looks like you're setting up a fun test. I'm ready to play Zork! You can","role":"assistant"},{"content":"% zork\nWelcome to Dungeon. This version created 11-MAR-91.\nYou are in an open field west of a big white house with a boarded\nfront door.\nThere is a small mailbox here.\n>","role":"user"},{"content":"Great!","role":"assistant"},{"content":"Opening the mailbox reveals:\n A leaflet.\n>","role":"user"}],"model":"Qwen3-32B","stream":true,"stream_options":{"include_usage":true},"temperature":0}' > round2.txt
  2. Get the public IP address of the Gateway.

    export GATEWAY_IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}')
  3. Send two requests to simulate a multi-turn conversation.

    curl -X POST $GATEWAY_IP:8080/v1/chat/completions -H 'Content-Type: application/json' -d @./round1.txt
    curl -X POST $GATEWAY_IP:8080/v1/chat/completions -H 'Content-Type: application/json' -d @./round2.txt
  4. Check the logs to verify that KV Cache-aware load balancing is working.

    kubectl logs deploy/epp-default-qwen-inference-pool -n envoy-gateway-system | grep "handled"

    Expected output:

    2025-08-19T10:16:12Z	LEVEL(-2)	requestcontrol/director.go:278	Request handled	{"x-request-id": "00d5c24e-b3c8-461d-9848-7bb233243eb9", "model": "Qwen3-32B", "resolvedTargetModel": "Qwen3-32B", "criticality": "Critical", "model": "Qwen3-32B", "targetModel": "Qwen3-32B", "endpoint": "{NamespacedName:default/qwen3-779c54544f-9c4vz Address:10.0.0.5 Labels:map[app:qwen3 pod-template-hash:779c54544f]}"}
    2025-08-19T10:16:19Z	LEVEL(-2)	requestcontrol/director.go:278	Request handled	{"x-request-id": "401925f5-fe65-46e3-8494-5afd83921ba5", "model": "Qwen3-32B", "resolvedTargetModel": "Qwen3-32B", "criticality": "Critical", "model": "Qwen3-32B", "targetModel": "Qwen3-32B", "endpoint": "{NamespacedName:default/qwen3-779c54544f-9c4vz Address:10.0.0.5 Labels:map[app:qwen3 pod-template-hash:779c54544f]}"}

    The logs should show a Request handled event for each request. If both requests were routed to the same workload (the same endpoint address), KV Cache-aware load balancing is working correctly.
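When there are many log lines, the same-endpoint check can be automated by extracting the Address field from each Request handled entry. The sketch below is an illustrative log parser (the function name and regex are mine, tied to the log format shown above):

```python
import re

def routed_endpoints(log_text):
    """Extract the backend pod address from each 'Request handled' log
    line. With KV Cache-aware routing, follow-up turns of a conversation
    should land on the same address as the first turn."""
    pattern = re.compile(r'Request handled.*?Address:([\d.]+)')
    return pattern.findall(log_text)

# Abbreviated sample in the format of the expected output above.
logs = """\
... Request handled {... Address:10.0.0.5 Labels:map[app:qwen3]}
... Request handled {... Address:10.0.0.5 Labels:map[app:qwen3]}
"""
addrs = routed_endpoints(logs)
assert len(set(addrs)) == 1  # both requests routed to the same vLLM pod
```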