
Container Service for Kubernetes: Prefix Cache-Aware Routing in Precise Mode

Last Updated: Mar 25, 2026

Prefix cache-aware routing in precise mode routes each request to the vLLM replica that already holds the longest matching KV cache prefix, which maximizes the cache hit ratio and reduces response latency for LLM workloads with shared prefixes. Each vLLM replica publishes information about its cached KV blocks to the Gateway with Inference Extension through ZeroMQ event messages, and the gateway selects the replica with the highest cache hit probability for each incoming request.

Key concepts

KV cache

During inference, the model generates key-value pairs for each token in the context. Caching these pairs lets the model skip redundant computation for tokens it has already processed, which speeds up inference and lowers response latency.

Automatic Prefix Caching (APC)

vLLM's APC stores the KV cache of previously computed requests. When a new request shares a prefix with a cached request, vLLM reuses the existing KV cache for the shared prefix, skipping recomputation.
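
The reuse described above can be pictured as matching hashes of fixed-size token blocks. The following is a minimal sketch under stated assumptions, not vLLM's implementation: the function names `block_hashes` and `shared_prefix_blocks` are hypothetical, and plain SHA-256 over a text encoding stands in for vLLM's internal hashing.

```python
import hashlib
from typing import List


def block_hashes(token_ids: List[int], block_size: int) -> List[str]:
    """Hash each full block of tokens, chaining in the previous block's hash.

    Hypothetical helper for illustration only. Chaining means a block hash
    identifies the whole prefix up to that block, not just the block itself.
    """
    hashes: List[str] = []
    parent = ""
    # Only full blocks are hashed; a trailing partial block is ignored.
    full = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        payload = parent + ":" + ",".join(map(str, block))
        parent = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(parent)
    return hashes


def shared_prefix_blocks(a: List[str], b: List[str]) -> int:
    """Count how many leading block hashes two requests have in common."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count
```

In this sketch, two requests that share their first 128 tokens with a block size of 64 would share two block hashes, and those two blocks are the portion whose KV cache could be reused.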

Precise mode vs. estimated mode

| | Precise mode | Estimated mode |
| --- | --- | --- |
| Cache monitoring | Receives KV cache block distribution directly from each vLLM replica | Infers cache state without direct reporting |
| Cache hit accuracy | Higher: routes based on actual cache state | Lower: cannot precisely track KV cache distribution |
| Requirements | vLLM v0.10.0 or later with KV cache event reporting enabled at startup | No additional vLLM configuration required |
| Best for | Workloads with many shared-prefix requests | Scenarios where vLLM version constraints prevent precise mode |

Use precise mode when your inference service runs vLLM v0.10.0 or later and your workload includes requests with shared system prompts or conversation history.
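
To make the idea of routing on actual cache state concrete, the following minimal sketch picks the replica whose reported blocks cover the longest prefix of a request. It is an illustration under assumptions, not the gateway's scheduler: `matched_prefix_len` and `pick_replica` are hypothetical names, the index is a plain dictionary, and a real scheduler may weigh other signals that this sketch ignores.

```python
from typing import Dict, List, Optional, Set


def matched_prefix_len(request_blocks: List[str], cached: Set[str]) -> int:
    """Count leading request blocks that a replica reports as cached."""
    count = 0
    for block_hash in request_blocks:
        if block_hash not in cached:
            break  # a prefix match stops at the first missing block
        count += 1
    return count


def pick_replica(
    request_blocks: List[str], index: Dict[str, Set[str]]
) -> Optional[str]:
    """Return the replica with the longest cached prefix, or None if empty.

    `index` maps a replica address to the block hashes it has reported.
    Ties are broken by address order so the choice is deterministic.
    """
    best: Optional[str] = None
    best_len = -1
    for replica in sorted(index):
        length = matched_prefix_len(request_blocks, index[replica])
        if length > best_len:
            best, best_len = replica, length
    return best
```

For example, if one replica has reported the first two blocks of a conversation and another only the first, the follow-up turn is sent to the replica holding two blocks.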

Prerequisites

Before you begin, ensure that you have:

Deploy the model service

Step 1: Prepare the Qwen3-32B model files

  1. Install git-lfs if it is not already installed.

    # On RHEL/CentOS-based systems
    yum install git-lfs
    
    # On Debian/Ubuntu-based systems
    apt-get install git-lfs

    For other installation methods, see Installing Git Large File Storage.

  2. Download the Qwen3-32B model from ModelScope.

    git lfs install
    GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/Qwen/Qwen3-32B.git
    cd Qwen3-32B/
    git lfs pull
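  3. Create an OSS folder and upload the model files. Run the upload from the parent directory of the cloned Qwen3-32B folder so that the ./Qwen3-32B path resolves. For instructions on installing ossutil, see Install ossutil.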

    ossutil mkdir oss://<YOUR-BUCKET-NAME>/Qwen3-32B
    ossutil cp -r ./Qwen3-32B oss://<YOUR-BUCKET-NAME>/Qwen3-32B
  4. Create llm-model.yaml to define an OSS-backed Secret, persistent volume (PV), and persistent volume claim (PVC). For background, see Use ossfs 1.0 to create a statically provisioned volume.

    apiVersion: v1
    kind: Secret
    metadata:
      name: oss-secret
    stringData:
      akId: <YOUR-OSS-AK>       # AccessKey ID for OSS access
      akSecret: <YOUR-OSS-SK>   # AccessKey Secret for OSS access
    ---
    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: llm-model
      labels:
        alicloud-pvname: llm-model
    spec:
      capacity:
        storage: 30Gi
      accessModes:
        - ReadOnlyMany
      persistentVolumeReclaimPolicy: Retain
      csi:
        driver: ossplugin.csi.alibabacloud.com
        volumeHandle: llm-model
        nodePublishSecretRef:
          name: oss-secret
          namespace: default
        volumeAttributes:
          bucket: <YOUR-BUCKET-NAME>      # OSS bucket name
          url: <YOUR-BUCKET-ENDPOINT>     # OSS endpoint, e.g., oss-cn-hangzhou-internal.aliyuncs.com
          otherOpts: "-o umask=022 -o max_stat_cache_size=0 -o allow_other"
          path: <YOUR-MODEL-PATH>         # Path to model files, e.g., /Qwen3-32B/
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: llm-model
    spec:
      accessModes:
        - ReadOnlyMany
      resources:
        requests:
          storage: 30Gi
      selector:
        matchLabels:
          alicloud-pvname: llm-model
  5. Apply the manifest.

    kubectl create -f llm-model.yaml

Step 2: Deploy the vLLM inference service

  1. Create vllm.yaml.


    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        app: qwen3
      name: qwen3
    spec:
      progressDeadlineSeconds: 600
      replicas: 3
      revisionHistoryLimit: 10
      selector:
        matchLabels:
          app: qwen3
      strategy:
        rollingUpdate:
          maxSurge: 25%
          maxUnavailable: 25%
        type: RollingUpdate
      template:
        metadata:
          annotations:
            prometheus.io/path: /metrics
            prometheus.io/port: '8000'
            prometheus.io/scrape: 'true'
          labels:
            app: qwen3
        spec:
          containers:
            - command:
                - sh
                - '-c'
                - >-
                  vllm serve /models/Qwen3-32B --served-model-name Qwen3-32B
                  --trust-remote-code --port=8000 --max-model-len 8192
                  --gpu-memory-utilization 0.95 --enforce-eager --kv-events-config
                  "{\"enable_kv_cache_events\":true,\"publisher\":\"zmq\",\"endpoint\":\"tcp://epp-default-qwen-inference-pool.envoy-gateway-system.svc.cluster.local:5557\",\"topic\":\"kv@${POD_IP}@Qwen3-32B\"}"
                  --prefix-caching-hash-algo sha256_cbor_64bit --block-size 64
              env:
                - name: POD_IP
                  valueFrom:
                    fieldRef:
                      apiVersion: v1
                      fieldPath: status.podIP
                - name: PYTHONHASHSEED
                  value: '42'
              image: 'registry-cn-hangzhou.ack.aliyuncs.com/dev/vllm:0.10.0'
              imagePullPolicy: IfNotPresent
              name: vllm
              ports:
                - containerPort: 8000
                  name: restful
                  protocol: TCP
              readinessProbe:
                failureThreshold: 3
                initialDelaySeconds: 30
                periodSeconds: 30
                successThreshold: 1
                tcpSocket:
                  port: 8000
                timeoutSeconds: 1
              resources:
                limits:
                  nvidia.com/gpu: '1'
                requests:
                  nvidia.com/gpu: '1'
              terminationMessagePath: /dev/termination-log
              terminationMessagePolicy: File
              volumeMounts:
                - mountPath: /models/Qwen3-32B
                  name: model
                - mountPath: /dev/shm
                  name: dshm
          dnsPolicy: ClusterFirst
          restartPolicy: Always
          schedulerName: default-scheduler
          securityContext: {}
          terminationGracePeriodSeconds: 30
          volumes:
            - name: model
              persistentVolumeClaim:
                claimName: llm-model
            - emptyDir:
                medium: Memory
                sizeLimit: 30Gi
              name: dshm
    ---
    apiVersion: v1
    kind: Service
    metadata:
      labels:
        app: qwen3
      name: qwen3
    spec:
      ports:
        - name: http-serving
          port: 8000
          protocol: TCP
          targetPort: 8000
      selector:
        app: qwen3
      type: ClusterIP

    The following table describes the startup parameters and environment variables that enable precise-mode routing. The --block-size and PYTHONHASHSEED values must match the corresponding fields in the InferenceTrafficPolicy you create in the next section.

    | Parameter / variable | Description |
    | --- | --- |
    | --kv-events-config | KV cache event publishing configuration. Set enable_kv_cache_events to true and publisher to zmq. For endpoint, use the naming convention tcp://epp-<InferencePool namespace>-<InferencePool name>.envoy-gateway-system.<cluster local domain>:5557. For topic, use kv@${POD_IP}@<served model name>. In this example, the InferencePool named qwen-inference-pool is in the default namespace and the model name is Qwen3-32B, so the values are tcp://epp-default-qwen-inference-pool.envoy-gateway-system.svc.cluster.local:5557 and kv@${POD_IP}@Qwen3-32B. |
    | --prefix-caching-hash-algo | Hash algorithm for KV cache prefix blocks. Must be sha256_cbor_64bit. |
    | --block-size | Number of tokens per KV cache prefix block. Must match blockSize in InferenceTrafficPolicy. In this example: 64. |
    | PYTHONHASHSEED | Python hash seed. Must be a non-zero value and must match hashSeed in InferenceTrafficPolicy. In this example: 42. |
  2. Deploy the vLLM inference service.

    kubectl create -f vllm.yaml
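
The endpoint and topic naming conventions for --kv-events-config are easy to mistype, so it can help to generate the JSON value. The following minimal sketch assembles it from the conventions described in the parameter table. The function name `kv_events_config` and its parameters are hypothetical, and the default `cluster_domain` of svc.cluster.local is taken from this example rather than from any cluster lookup.

```python
import json


def kv_events_config(
    pool_namespace: str,
    pool_name: str,
    served_model_name: str,
    pod_ip: str = "${POD_IP}",
    cluster_domain: str = "svc.cluster.local",  # assumption: matches this example
    port: int = 5557,
) -> str:
    """Build the JSON string passed to --kv-events-config (hypothetical helper)."""
    endpoint = (
        f"tcp://epp-{pool_namespace}-{pool_name}"
        f".envoy-gateway-system.{cluster_domain}:{port}"
    )
    topic = f"kv@{pod_ip}@{served_model_name}"
    config = {
        "enable_kv_cache_events": True,
        "publisher": "zmq",
        "endpoint": endpoint,
        "topic": topic,
    }
    # Compact separators keep the value on one line for use in a command.
    return json.dumps(config, separators=(",", ":"))


if __name__ == "__main__":
    print(kv_events_config("default", "qwen-inference-pool", "Qwen3-32B"))
```

Leaving `pod_ip` as the literal ${POD_IP} keeps the placeholder in the output so that the shell in the container can expand it at startup, as in the manifest.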

Deploy inference routing

Step 1: Deploy the inference routing policy

  1. Create inference-policy.yaml. The blockSize and hashSeed values must match the --block-size and PYTHONHASHSEED values in your vLLM deployment.

    # InferencePool selects the vLLM workload pods for routing
    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      name: qwen-inference-pool
    spec:
      targetPortNumber: 8000
      selector:
        app: qwen3
    ---
    # InferenceTrafficPolicy configures KV cache-aware load balancing for the pool
    apiVersion: inferenceextension.alibabacloud.com/v1alpha1
    kind: InferenceTrafficPolicy
    metadata:
      name: inference-policy
    spec:
      poolRef:
        name: qwen-inference-pool
      profile:
        single:                  # Backend is a single-model vLLM deployment
          trafficPolicy:
            prefixCache:
              mode: tracking     # Enables KV cache-aware load balancing (precise mode)
              trackingConfig:
                indexerConfig:
                  tokenProcessorConfig:
                    blockSize: 64            # Must match vLLM --block-size
                    hashSeed: 42             # Must match vLLM PYTHONHASHSEED
                    model: Qwen/Qwen3-32B   # Official ModelScope model name
  2. Apply the policy.

    kubectl apply -f inference-policy.yaml
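
Because blockSize and hashSeed must agree with the vLLM startup settings, a small pre-deployment check can catch mismatches early. The following is a minimal sketch, assuming the vLLM command line, the container environment, and the tokenProcessorConfig values have already been loaded into Python objects. The names `flag_value` and `check_consistency` are hypothetical.

```python
import shlex
from typing import Dict, List, Optional


def flag_value(command: str, flag: str) -> Optional[str]:
    """Return the value of a CLI flag given as '--flag value' or '--flag=value'."""
    tokens = shlex.split(command)
    for i, token in enumerate(tokens):
        if token == flag and i + 1 < len(tokens):
            return tokens[i + 1]
        if token.startswith(flag + "="):
            return token.split("=", 1)[1]
    return None


def check_consistency(
    command: str, env: Dict[str, str], token_processor_config: Dict[str, object]
) -> List[str]:
    """Compare vLLM settings with the policy and return a list of problems."""
    problems: List[str] = []

    block_size = flag_value(command, "--block-size")
    if block_size is None or int(block_size) != token_processor_config.get("blockSize"):
        problems.append("--block-size does not match blockSize")

    seed = env.get("PYTHONHASHSEED")
    if seed is None or int(seed) != token_processor_config.get("hashSeed"):
        problems.append("PYTHONHASHSEED does not match hashSeed")
    elif int(seed) == 0:
        problems.append("PYTHONHASHSEED must be non-zero")

    algo = flag_value(command, "--prefix-caching-hash-algo")
    if algo != "sha256_cbor_64bit":
        problems.append("--prefix-caching-hash-algo must be sha256_cbor_64bit")

    return problems
```

An empty list means the three settings covered by this sketch agree. It does not validate anything else in either manifest.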

Step 2: Deploy the gateway and routing rules

  1. Create inference-gateway.yaml with the Gateway, HTTPRoute, and a backend timeout policy.

    apiVersion: gateway.networking.k8s.io/v1
    kind: Gateway
    metadata:
      name: inference-gateway
    spec:
      gatewayClassName: ack-gateway
      listeners:
      - name: http-llm
        protocol: HTTP
        port: 8080
    ---
    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: inference-route
    spec:
      parentRefs:
      - name: inference-gateway
      rules:
      - matches:
        - path:
            type: PathPrefix
            value: /v1
        backendRefs:
        - name: qwen-inference-pool
          kind: InferencePool
          group: inference.networking.x-k8s.io
    ---
    apiVersion: gateway.envoyproxy.io/v1alpha1
    kind: BackendTrafficPolicy
    metadata:
      name: backend-timeout
    spec:
      timeout:
        http:
          requestTimeout: 24h
      targetRef:
        group: gateway.networking.k8s.io
        kind: Gateway
        name: inference-gateway
  2. Apply the manifest.

    kubectl apply -f inference-gateway.yaml

Step 3: Verify routing

To confirm precise-mode routing is working, send two requests that share the same prefix and check that both are routed to the same vLLM replica.

  1. Create the two request payloads. Both share the same content segment in the first message.

    echo '{"max_tokens":24,"messages":[{"content":"Hi, here'\''s some system prompt: hi hi hi hi hi hi hi hi hi hi.For user 3, here are some other context: hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi.I would like to test your intelligence. for this purpose I would like you to play zork. you can interact with the game by typing in commands. I will forward these commands to the game and type in any response. are you ready?","role":"user"}],"model":"Qwen3-32B","stream":true,"stream_options":{"include_usage":true},"temperature":0}' > round1.txt
    echo '{"max_tokens":3,"messages":[{"content":"Hi, here'\''s some system prompt: hi hi hi hi hi hi hi hi hi hi.For user 3, here are some other context: hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi.I would like to test your intelligence. for this purpose I would like you to play zork. you can interact with the game by typing in commands. I will forward these commands to the game and type in any response. are you ready?","role":"user"},{"content":"Hi there! It looks like you'\''re setting up a fun test. I'\''m ready to play Zork! You can","role":"assistant"},{"content":"% zork\nWelcome to Dungeon. This version created 11-MAR-91.\nYou are in an open field west of a big white house with a boarded\nfront door.\nThere is a small mailbox here.\n>","role":"user"},{"content":"Great!","role":"assistant"},{"content":"Opening the mailbox reveals:\n A leaflet.\n>","role":"user"}],"model":"Qwen3-32B","stream":true,"stream_options":{"include_usage":true},"temperature":0}' > round2.txt
  2. Get the gateway's public IP address.

    export GATEWAY_IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}')
  3. Send both requests to simulate a multi-turn conversation.

    curl -X POST $GATEWAY_IP:8080/v1/chat/completions -H 'Content-Type: application/json' -d @./round1.txt
    curl -X POST $GATEWAY_IP:8080/v1/chat/completions -H 'Content-Type: application/json' -d @./round2.txt
  4. Check the inference extension logs and confirm that both entries show the same endpoint.Address value. If the addresses match, both requests were routed to the same vLLM replica, confirming that prefix cache-aware routing in precise mode is working.

    kubectl logs deploy/epp-default-qwen-inference-pool -n envoy-gateway-system | grep "handled"

    Expected output:

    2025-08-19T10:16:12Z	LEVEL(-2)	requestcontrol/director.go:278	Request handled	{"x-request-id": "00d5c24e-b3c8-461d-9848-7bb233243eb9", "model": "Qwen3-32B", "resolvedTargetModel": "Qwen3-32B", "criticality": "Critical", "model": "Qwen3-32B", "targetModel": "Qwen3-32B", "endpoint": "{NamespacedName:default/qwen3-779c54544f-9c4vz Address:10.0.0.5 Labels:map[app:qwen3 pod-template-hash:779c54544f]}"}
    2025-08-19T10:16:19Z	LEVEL(-2)	requestcontrol/director.go:278	Request handled	{"x-request-id": "401925f5-fe65-46e3-8494-5afd83921ba5", "model": "Qwen3-32B", "resolvedTargetModel": "Qwen3-32B", "criticality": "Critical", "model": "Qwen3-32B", "targetModel": "Qwen3-32B", "endpoint": "{NamespacedName:default/qwen3-779c54544f-9c4vz Address:10.0.0.5 Labels:map[app:qwen3 pod-template-hash:779c54544f]}"}

    In this example, both requests show Address:10.0.0.5, confirming they were routed to the same pod.
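
Comparing addresses by eye becomes tedious with more than a few requests. The following minimal sketch extracts the Address field from "Request handled" log lines and checks that all of them match. It assumes the log format shown in the expected output above. The names `routed_addresses` and `same_replica` are hypothetical.

```python
import re
from typing import List

# Matches the Address field inside the endpoint value of a log entry.
ADDRESS_RE = re.compile(r"Address:(\S+)")


def routed_addresses(log_text: str) -> List[str]:
    """Return the pod address of every 'Request handled' entry, in order."""
    addresses: List[str] = []
    for line in log_text.splitlines():
        if "Request handled" not in line:
            continue
        match = ADDRESS_RE.search(line)
        if match:
            addresses.append(match.group(1))
    return addresses


def same_replica(log_text: str) -> bool:
    """True if at least two requests were handled and all hit one address."""
    addresses = routed_addresses(log_text)
    return len(addresses) >= 2 and len(set(addresses)) == 1
```

Passing the text returned by the kubectl logs command to `same_replica` yields True when every handled request reached the same pod.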