
Caching is Efficiency: Achieving Precise LLM Cache Hits with Alibaba Cloud ACK GIE

This article introduces ACK GIE's precise-mode prefix cache-aware routing, which maximizes KV-Cache hit rates for distributed LLM inference.

By Yinhang

In the world of Large Language Model (LLM) inference, there is a metric that is often underestimated yet absolutely critical: KV-Cache hit rate. This technical indicator directly determines whether your AI application delivers a silky-smooth experience or a sluggish, bottlenecked performance—and whether your operational costs remain under control or spiral out of budget.

Alibaba Cloud ACK Gateway with Inference Extension (ACK GIE) recently introduced precise-mode prefix cache-aware routing. This capability not only pushes vLLM’s KV-Cache hit rate to its limit but also delivers tangible business value by cutting latency and resource waste.

KV-Cache: The Invisible Engine of LLM Inference

Before diving into our solution, let's understand why KV-Cache is so important.

The Core Bottleneck of Transformer Architecture

LLMs are built on the Transformer architecture, centered around the self-attention mechanism. During inference, the model must calculate attention weights between every new token and all preceding tokens, so the attention cost of processing a prompt grows quadratically with its length. Consequently, for long prompts the prefill stage (processing the entire input prompt) becomes the most time-consuming part of the inference cycle.

KV-Cache is the definitive solution to this bottleneck. It stores previously computed Key and Value vectors in GPU memory. When generating subsequent tokens, the model retrieves these values directly from the cache rather than recomputing them. This drastically reduces redundant computation and boosts inference efficiency.
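To make the saving concrete, here is a minimal counting sketch in Python. It does not run a model; it only counts how many per-token Key/Value projections a decoder would perform with and without a cache. The function names and the "one projection per token" cost model are assumptions made for illustration.

```python
def kv_projections_without_cache(prompt_len: int, new_tokens: int) -> int:
    """Without a cache, every decode step re-projects K/V for the whole context."""
    total = 0
    for step in range(new_tokens):
        # Context at this step = prompt + tokens generated so far.
        total += prompt_len + step
    return total


def kv_projections_with_cache(prompt_len: int, new_tokens: int) -> int:
    """With a cache, the prompt is projected once (prefill); each later
    decode step only projects the single token produced by the previous step."""
    if new_tokens < 1:
        return 0
    return prompt_len + (new_tokens - 1)


uncached = kv_projections_without_cache(6000, 100)
cached = kv_projections_with_cache(6000, 100)
print(uncached, cached)  # 604950 6099
```

Under this simplified model, the cached run does roughly one hundredth of the K/V work for a 6,000-token prompt and 100 generated tokens.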

Automatic Prefix Caching (APC) for vLLM

As an industry-leading LLM inference framework, vLLM optimizes KV-Cache utilization through its APC feature. APC intelligently identifies shared prefixes across different requests and reuses existing KV-Cache blocks.

Consider a customer service bot: every interaction begins with the same system prompt: "You are a professional customer service assistant; please answer user questions in a friendly tone..." Without APC, every request recomputes the KV-Cache for this prompt. With APC, the system reuses the cached blocks for the second user, so prefill is skipped for the shared prefix and only the new, user-specific tokens need to be processed before the decode (generation) stage begins.

In single-instance environments, the results are staggering. Our tests show that for a long prompt of 10,000 tokens, the Time to First Token (TTFT) for the second request drops from 4.3 seconds to 0.6 seconds—a performance leap of more than 7 times.
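The shared-prefix reuse described above can be sketched with chained block hashes. This is a simplified illustration, not vLLM's actual implementation: the tiny block size, the hashing payload format, and the helper names `block_hashes` and `cached_prefix_blocks` are all assumptions.

```python
import hashlib
from typing import List, Set

BLOCK_SIZE = 4  # tokens per block; deliberately tiny for readability


def block_hashes(token_ids: List[int], block_size: int = BLOCK_SIZE) -> List[str]:
    """Hash each full block together with its parent's hash, so that a
    block hash identifies the entire prefix up to and including that block."""
    hashes: List[str] = []
    parent = ""
    full = len(token_ids) - len(token_ids) % block_size  # ignore the partial tail
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        payload = parent + "|" + ",".join(map(str, block))
        parent = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(parent)
    return hashes


def cached_prefix_blocks(request_hashes: List[str], cache: Set[str]) -> int:
    """Count how many leading blocks of the request are already cached."""
    hits = 0
    for h in request_hashes:
        if h not in cache:
            break
        hits += 1
    return hits


system_prompt = list(range(100, 112))      # 12 tokens -> 3 full blocks
user_a = system_prompt + [1, 2, 3, 4, 5]   # 17 tokens -> 4 full blocks
user_b = system_prompt + [9, 8, 7, 6, 5]

cache: Set[str] = set(block_hashes(user_a))  # first request populates the cache
hits = cached_prefix_blocks(block_hashes(user_b), cache)
print(hits)  # 3: only the system-prompt blocks are reused
```

Because each hash depends on its parent, two requests share a block hash only if they share the whole prefix up to that block, which is what makes a simple set lookup sufficient.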

The Achilles Heel of Distributed Deployment

While single-machine optimization is powerful, production environments require multiple replicas to handle high concurrency. This introduces a fatal flaw: KV-Cache fragmentation.

In a distributed environment, each vLLM instance maintains its own independent KV-Cache. Traditional load balancers (e.g., Round Robin, Least Connections) are blind to the state of these caches; they distribute requests mechanically. This leads to:

• Requests with shared prefixes being scattered across different instances.

• Every instance being forced to recompute the same prefill.

• The total loss of KV-Cache reuse advantages.

• A significant drop in overall system performance.

It’s like having an elite commando unit where every soldier acts in isolation without a central command—synergy is impossible.

Consider a typical agent workflow:

  1. Request 1: A request with a 6,000-token context is routed to Pod A. Pod A computes and caches the K/V vectors.
  2. Request 2: The next iteration of the same agent (6,000-token context + 100 new tokens) is randomly routed to Pod B by the load balancer.


At this point, things go wrong:

• Cache miss: Pod B has nothing in its cache for the 6,000-token context.

• Redundant computation: Pod B is forced to re-run the full 6,100-token prefill, wasting expensive GPU cycles.

• Latency spike: User TTFT jumps from milliseconds to seconds or even tens of seconds.

• Resource waste: GPU capacity that could have served new requests is consumed by redundant work, lowering total system throughput.

In high-concurrency environments, this "cache thrashing" occurs continuously, forcing the system to spend most of its time on redundant prefill work rather than on decode (token generation).
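The cost of the misrouted request in the agent workflow above can be worked out with a small Python sketch. It assumes that each request's context extends the previous one and that a pod retains the longest context it has already served for this agent; `prefill_tokens` is a hypothetical helper, not part of any library.

```python
from typing import Dict, List


def prefill_tokens(context_lengths: List[int], pods: List[str]) -> int:
    """Total tokens that must be prefilled when request i goes to pods[i].

    A pod only needs to prefill the part of the context that is not
    already in its own cache."""
    cached: Dict[str, int] = {}
    total = 0
    for ctx_len, pod in zip(context_lengths, pods):
        total += ctx_len - cached.get(pod, 0)
        cached[pod] = ctx_len
    return total


contexts = [6000, 6100]  # Request 1, then Request 2 with 100 new tokens

scattered = prefill_tokens(contexts, ["pod-a", "pod-b"])  # load balancer picks Pod B
sticky = prefill_tokens(contexts, ["pod-a", "pod-a"])     # cache-aware routing

print(scattered, sticky)  # 12100 6100
```

In this two-request example, the misrouted second request costs 6,100 prefill tokens instead of 100.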

Precise Mode Prefix Cache-Aware Routing: ACK GIE’s Winning Strategy

The precise-mode prefix cache-aware routing of ACK GIE was designed to solve this problem. By establishing a global view of the KV-Cache across the cluster, it enables intelligent, cache-aware request scheduling.

Technical Principle: From Fragmented Views to Global Vision

The core of precise mode is real-time perception of the KV-Cache state on every vLLM instance.

1. KV event reporting: Each vLLM instance (v0.10.0+) publishes KV-Cache block events, such as block creation, update, and deletion, to ACK GIE over ZeroMQ in real time.

2. Global index construction: ACK GIE consumes these events and builds a global KV-Cache index. For every KV block, the index records its hash value, the instance that holds it, and its storage medium (GPU/CPU).

3. Intelligent routing decisions: When a new request arrives, ACK GIE:

  • Calculates the KV block hash sequence for the request prefix
  • Queries the global index to find which instances host the most relevant KV blocks
  • Selects the optimal target instance by combining cache data with real-time load metrics (queue length, GPU utilization, etc.)

Because requests that share a prefix are steered to the instance that already holds the matching blocks whenever load allows, this mechanism maximizes the KV-Cache hit rate.
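The three steps above can be sketched as a small in-memory index plus a scoring function. This is an illustrative sketch only: the class and method names, the pod names, and in particular the scoring formula (cached-prefix fraction minus a weighted, normalized queue length) are assumptions. The article states only that cache data is combined with real-time load metrics, not how.

```python
from collections import defaultdict
from typing import Dict, List, Set


class GlobalKVIndex:
    """Maps block hash -> set of pods that currently hold that block."""

    def __init__(self) -> None:
        self._pods_by_block: Dict[str, Set[str]] = defaultdict(set)

    def on_block_stored(self, pod: str, block_hash: str) -> None:
        self._pods_by_block[block_hash].add(pod)

    def on_block_removed(self, pod: str, block_hash: str) -> None:
        self._pods_by_block[block_hash].discard(pod)

    def prefix_hits(self, request_hashes: List[str], pods: List[str]) -> Dict[str, int]:
        """For each pod, count consecutive leading request blocks it holds."""
        hits: Dict[str, int] = {}
        for pod in pods:
            n = 0
            for h in request_hashes:
                if pod not in self._pods_by_block.get(h, ()):
                    break
                n += 1
            hits[pod] = n
        return hits


def pick_pod(index: GlobalKVIndex, request_hashes: List[str],
             queue_len: Dict[str, int], load_weight: float = 0.3) -> str:
    """Pick the pod with the best (hypothetical) cache-vs-load score."""
    hits = index.prefix_hits(request_hashes, list(queue_len))
    total_blocks = max(len(request_hashes), 1)
    max_queue = max(max(queue_len.values()), 1)

    def score(pod: str) -> float:
        return hits[pod] / total_blocks - load_weight * queue_len[pod] / max_queue

    return max(sorted(queue_len), key=score)  # sorted() makes ties deterministic


index = GlobalKVIndex()
for block in ("b1", "b2", "b3"):
    index.on_block_stored("pod-a", block)
index.on_block_stored("pod-b", "b1")

request = ["b1", "b2", "b3", "b4"]
queues = {"pod-a": 4, "pod-b": 1, "pod-c": 0}

first_choice = pick_pod(index, request, queues)   # pod-a: 3 of 4 blocks cached
index.on_block_removed("pod-a", "b2")             # eviction event arrives
second_choice = pick_pod(index, request, queues)  # pod-b: pod-a's prefix is broken
```

The second decision shows why event-driven updates matter: once `b2` is evicted from `pod-a`, its usable prefix shrinks to one block, and its longer queue no longer pays off.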

Precise Mode vs. Estimation Mode

ACK GIE also provides prefix cache-aware routing in estimation mode, but the two differ fundamentally:

Estimation mode: Estimates KV-Cache distribution based on historical routing records. It requires no inference engine support but has limited accuracy.

Precise mode: Directly receives KV-Cache events for 100% accuracy. It delivers much higher hit rates but requires specific inference engine versions.

In actual testing, precise mode delivered a 57x improvement in P90 TTFT over estimation mode in high-concurrency scenarios.
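One way to picture the difference is a sketch in which a router-side guess goes stale after an eviction, while an event-driven view does not. Both classes below are hypothetical simplifications keyed by a single prefix identifier; they are not ACK GIE's data structures.

```python
from typing import Dict, Optional


class EstimatedIndex:
    """Router-side guess built only from past routing decisions."""

    def __init__(self) -> None:
        self._last_pod: Dict[str, str] = {}

    def record_route(self, prefix_key: str, pod: str) -> None:
        self._last_pod[prefix_key] = pod

    def lookup(self, prefix_key: str) -> Optional[str]:
        return self._last_pod.get(prefix_key)


class EventDrivenIndex:
    """View built from store/remove events reported by the engines."""

    def __init__(self) -> None:
        self._holder: Dict[str, str] = {}

    def on_stored(self, prefix_key: str, pod: str) -> None:
        self._holder[prefix_key] = pod

    def on_removed(self, prefix_key: str, pod: str) -> None:
        if self._holder.get(prefix_key) == pod:
            del self._holder[prefix_key]

    def lookup(self, prefix_key: str) -> Optional[str]:
        return self._holder.get(prefix_key)


estimated, tracked = EstimatedIndex(), EventDrivenIndex()

# A request for tenant-1 is routed to pod-a, which caches its prefix.
estimated.record_route("tenant-1", "pod-a")
tracked.on_stored("tenant-1", "pod-a")

# Later, pod-a evicts that prefix under memory pressure and reports it.
tracked.on_removed("tenant-1", "pod-a")

stale_guess = estimated.lookup("tenant-1")  # still "pod-a"
actual_view = tracked.lookup("tenant-1")    # None: no pod holds it any more
```

The estimated index keeps sending tenant-1 traffic to a pod that would have to recompute the prefix, which is one plausible source of the accuracy gap described above.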

Hands-on: Deploying Precise Mode in ACK

Prerequisites

• An ACK managed cluster (GPU node pools are recommended)
• Gateway with Inference Extension v1.4.0-apsara.3 or later
• vLLM v0.10.0 or later

Step 1: Prepare Model Files

Using Qwen3-32B as an example, store the model files in OSS:

# Download the model
git lfs install
GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/Qwen/Qwen3-32B.git
cd Qwen3-32B/
git lfs pull

# Upload to OSS (return to the parent directory so that ./Qwen3-32B resolves)
cd ..
ossutil mkdir oss://<your-bucket-name>/Qwen3-32B
ossutil cp -r ./Qwen3-32B oss://<your-bucket-name>/Qwen3-32B

Step 2: Configure Storage

Create a PV and a PVC to mount the model in OSS:

# llm-model.yaml
apiVersion: v1
kind: Secret
metadata:
  name: oss-secret
stringData:
  akId: <your-oss-ak>
  akSecret: <your-oss-sk>
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-model
  labels:
    alicloud-pvname: llm-model
spec:
  capacity:
    storage: 30Gi 
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: llm-model
    nodePublishSecretRef:
      name: oss-secret
      namespace: default
    volumeAttributes:
      bucket: <your-bucket-name>
      url: <your-bucket-endpoint>
      otherOpts: "-o umask=022 -o max_stat_cache_size=0 -o allow_other"
      path: /Qwen3-32B/
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-model
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 30Gi
  selector:
    matchLabels:
      alicloud-pvname: llm-model

Step 3: Deploy the vLLM Service

The key is to configure the KV event reporting parameters:

# vllm.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: qwen3
  name: qwen3
spec:
  replicas: 3
  selector:
    matchLabels:
      app: qwen3
  template:
    metadata:
      labels:
        app: qwen3
    spec:
      containers:
        - name: vllm
          image: 'registry-cn-hangzhou.ack.aliyuncs.com/dev/vllm:0.10.0'
          command:
            - sh
            - '-c'
            - >-
              vllm serve /models/Qwen3-32B --served-model-name Qwen3-32B
              --trust-remote-code --port=8000 --max-model-len 8192
              --gpu-memory-utilization 0.95 --enforce-eager --kv-events-config
              "{\"enable_kv_cache_events\":true,\"publisher\":\"zmq\",\"endpoint\":\"tcp://epp-default-qwen-inference-pool.envoy-gateway-system.svc.cluster.local:5557\",\"topic\":\"kv@${POD_IP}@Qwen3-32B\"}"
              --prefix-caching-hash-algo sha256_cbor_64bit --block-size 64
          env:
            - name: POD_IP
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: status.podIP
            - name: PYTHONHASHSEED
              value: '42'
          ports:
            - containerPort: 8000
              name: restful
          resources:
            limits:
              nvidia.com/gpu: '1'
            requests:
              nvidia.com/gpu: '1'
          volumeMounts:
            - mountPath: /models/Qwen3-32B
              name: model
            - mountPath: /dev/shm
              name: dshm
      volumes:
        - name: model
          persistentVolumeClaim:
            claimName: llm-model
        - emptyDir:
            medium: Memory
            sizeLimit: 30Gi
          name: dshm
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: qwen3
  name: qwen3
spec:
  ports:
    - name: http-serving
      port: 8000
      targetPort: 8000
  selector:
    app: qwen3
  type: ClusterIP

Key parameter descriptions:

--kv-events-config: Enables KV event reporting and specifies the ZeroMQ endpoint and topic.

--prefix-caching-hash-algo: Must be set to sha256_cbor_64bit.

--block-size: The KV block size. We recommend setting this parameter to 64.

PYTHONHASHSEED: The hash seed, which must be consistent with hashSeed in the routing policy configuration (42 in this example).
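Hand-escaping the JSON passed to --kv-events-config is error-prone. As a hedged convenience, the following Python sketch builds the same JSON string from its four fields. The keys and values are taken from the manifest above; the helper name `kv_events_config` and the sample pod IP are assumptions.

```python
import json


def kv_events_config(pod_ip: str, model_name: str, endpoint: str) -> str:
    """Render the JSON value for --kv-events-config in compact form."""
    config = {
        "enable_kv_cache_events": True,
        "publisher": "zmq",
        "endpoint": endpoint,
        # Topic layout used in the manifest: kv@<pod IP>@<served model name>
        "topic": f"kv@{pod_ip}@{model_name}",
    }
    return json.dumps(config, separators=(",", ":"))


rendered = kv_events_config(
    pod_ip="10.0.0.5",  # sample value; the manifest injects ${POD_IP}
    model_name="Qwen3-32B",
    endpoint="tcp://epp-default-qwen-inference-pool.envoy-gateway-system.svc.cluster.local:5557",
)
print(rendered)
```

Generating the string and then parsing it back with `json.loads` is a cheap way to catch quoting mistakes before they reach the pod spec.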

Step 4: Configure a Routing Policy for the Precise Mode

# inference-policy.yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
spec:
  targetPortNumber: 8000
  selector:
    app: qwen3
---
apiVersion: inferenceextension.alibabacloud.com/v1alpha1
kind: InferenceTrafficPolicy
metadata:
  name: inference-policy
spec:
  poolRef:
    name: qwen-inference-pool
  profile: 
    single:
      trafficPolicy:
        prefixCache:
          mode: tracking
          trackingConfig:
            indexerConfig:
              tokenProcessorConfig:
                blockSize: 64
                hashSeed: 42
                model: Qwen/Qwen3-32B
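Because the routing policy must agree with the vLLM launch settings, a quick pre-deployment check can help. The Python sketch below compares values copied by hand from the two manifests. The helper names are hypothetical, and treating a block-size mismatch as an error is an assumption based on both manifests using 64.

```python
import shlex
from typing import Dict, List


def flag_value(args: List[str], flag: str) -> str:
    """Return the value of `flag`, accepting `--flag value` and `--flag=value`."""
    for i, arg in enumerate(args):
        if arg == flag:
            return args[i + 1]
        if arg.startswith(flag + "="):
            return arg.split("=", 1)[1]
    raise KeyError(flag)


def check_consistency(vllm_command: str, env: Dict[str, str],
                      policy: Dict[str, int]) -> List[str]:
    """Return a list of human-readable mismatches (empty if none)."""
    args = shlex.split(vllm_command)
    problems: List[str] = []
    if int(flag_value(args, "--block-size")) != policy["blockSize"]:
        problems.append("block size mismatch")
    if int(env["PYTHONHASHSEED"]) != policy["hashSeed"]:
        problems.append("hash seed mismatch")
    if flag_value(args, "--prefix-caching-hash-algo") != "sha256_cbor_64bit":
        problems.append("unexpected hash algorithm")
    return problems


command = ("vllm serve /models/Qwen3-32B --served-model-name Qwen3-32B "
           "--prefix-caching-hash-algo sha256_cbor_64bit --block-size 64")
env = {"PYTHONHASHSEED": "42"}
policy = {"blockSize": 64, "hashSeed": 42}

problems = check_consistency(command, env, policy)
print(problems)  # []
```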

Step 5: Deploy the Gateway

# inference-gateway.yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway
spec:
  gatewayClassName: ack-gateway
  listeners:
  - name: http-llm
    protocol: HTTP
    port: 8080
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: inference-route
spec:
  parentRefs:
  - name: inference-gateway
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /v1
    backendRefs:
    - name: qwen-inference-pool
      kind: InferencePool
      group: inference.networking.x-k8s.io
---
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: backend-timeout
spec:
  timeout:
    http:
      requestTimeout: 24h
  targetRef:
    group: gateway.networking.k8s.io
    kind: Gateway
    name: inference-gateway

Verifying Deployment

Create two requests with identical prefixes:

# round1.txt
echo '{"max_tokens":24,"messages":[{"content":"Hi, here'\''s some system prompt: hi hi hi hi hi hi hi hi hi hi.For user 3, here are some other context: hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi.I would like to test your intelligence. for this purpose I would like you to play zork. you can interact with the game by typing in commands. I will forward these commands to the game and type in any response. are you ready?","role":"user"}],"model":"Qwen3-32B","stream":true,"stream_options":{"include_usage":true},"temperature":0}' > round1.txt

# round2.txt  
echo '{"max_tokens":3,"messages":[{"content":"Hi, here'\''s some system prompt: hi hi hi hi hi hi hi hi hi hi.For user 3, here are some other context: hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi hi.I would like to test your intelligence. for this purpose I would like you to play zork. you can interact with the game by typing in commands. I will forward these commands to the game and type in any response. are you ready?","role":"user"},{"content":"Hi there! It looks like you'\''re setting up a fun test. I'\''m ready to play Zork! You can","role":"assistant"},{"content":"% zork\nWelcome to Dungeon. This version created 11-MAR-91.\nYou are in an open field west of a big white house with a boarded\nfront door.\nThere is a small mailbox here.\n>","role":"user"},{"content":"Great!","role":"assistant"},{"content":"Opening the mailbox reveals:\n A leaflet.\n>","role":"user"}],"model":"Qwen3-32B","stream":true,"stream_options":{"include_usage":true},"temperature":0}' > round2.txt

Send both requests and verify the routing:

export GATEWAY_IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}')
curl -X POST $GATEWAY_IP:8080/v1/chat/completions -H 'Content-Type: application/json' -d @./round1.txt
curl -X POST $GATEWAY_IP:8080/v1/chat/completions -H 'Content-Type: application/json' -d @./round2.txt

# View the logs
kubectl logs deploy/epp-default-qwen-inference-pool -n envoy-gateway-system|grep "handled"

The expected output will show both requests being routed to the same instance:

2025-08-19T10:16:12Z    LEVEL(-2)    requestcontrol/director.go:278    Request handled    {"x-request-id": "00d5c24e-b3c8-461d-9848-7bb233243eb9", "model": "Qwen3-32B", "endpoint": "{NamespacedName:default/qwen3-779c54544f-9c4vz Address:10.0.0.5 ...}"}
2025-08-19T10:16:19Z    LEVEL(-2)    requestcontrol/director.go:278    Request handled    {"x-request-id": "401925f5-fe65-46e3-8494-5afd83921ba5", "model": "Qwen3-32B", "endpoint": "{NamespacedName:default/qwen3-779c54544f-9c4vz Address:10.0.0.5 ...}"}
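Checking the endpoint by eye works for two requests. For longer runs, a small Python sketch like the one below can extract the pod name from each "Request handled" line. The regular expression is written against the log format shown above and is an assumption about its stability across versions.

```python
import re
from typing import List

# Matches the endpoint field as it appears in the sample log lines above.
ENDPOINT_RE = re.compile(r"NamespacedName:(\S+) Address:(\S+)")


def handled_endpoints(log_text: str) -> List[str]:
    """Return the namespaced pod name for every 'Request handled' line."""
    pods: List[str] = []
    for line in log_text.splitlines():
        if "Request handled" not in line:
            continue
        match = ENDPOINT_RE.search(line)
        if match:
            pods.append(match.group(1))
    return pods


sample_log = r'''
2025-08-19T10:16:12Z    LEVEL(-2)    requestcontrol/director.go:278    Request handled    {"x-request-id": "00d5c24e-b3c8-461d-9848-7bb233243eb9", "model": "Qwen3-32B", "endpoint": "{NamespacedName:default/qwen3-779c54544f-9c4vz Address:10.0.0.5 ...}"}
2025-08-19T10:16:19Z    LEVEL(-2)    requestcontrol/director.go:278    Request handled    {"x-request-id": "401925f5-fe65-46e3-8494-5afd83921ba5", "model": "Qwen3-32B", "endpoint": "{NamespacedName:default/qwen3-779c54544f-9c4vz Address:10.0.0.5 ...}"}
'''

pods = handled_endpoints(sample_log)
same_pod = len(set(pods)) == 1
print(pods, same_pod)
```

In practice, the output of the `kubectl logs` command above could be piped into a script like this instead of using the embedded sample.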

Performance Benchmarking

We conducted a benchmark on a cluster with 8 vLLM instances to verify the real-world impact of precise mode.

Test Scenario

Simulated B2B SaaS case:

• 150 enterprise customers
• Each customer has a unique 6,000-token context (system prompt)
• 5 concurrent users per customer
• User query length: 1,200 tokens
• QPS gradually increased from 3 to 60
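A few derived numbers help show why this workload is sensitive to routing. The Python arithmetic below uses only the figures listed above, plus two assumptions: that the benchmark ran with the --block-size 64 and --max-model-len 8192 settings from Step 3, and that only full blocks are reusable.

```python
CUSTOMERS = 150
CONTEXT_TOKENS = 6000
USERS_PER_CUSTOMER = 5
QUERY_TOKENS = 1200
BLOCK_SIZE = 64        # assumed: --block-size from Step 3
MAX_MODEL_LEN = 8192   # assumed: --max-model-len from Step 3

total_users = CUSTOMERS * USERS_PER_CUSTOMER        # 750 concurrent users
unique_context_tokens = CUSTOMERS * CONTEXT_TOKENS  # 900,000 tokens of distinct context
prompt_tokens = CONTEXT_TOKENS + QUERY_TOKENS       # 7,200 tokens per request
shared_fraction = CONTEXT_TOKENS / prompt_tokens    # about 0.83 of each prompt is shared

full_context_blocks = CONTEXT_TOKENS // BLOCK_SIZE  # 93 full blocks per customer context
reusable_tokens = full_context_blocks * BLOCK_SIZE  # 5,952 tokens reusable on a hit
output_budget = MAX_MODEL_LEN - prompt_tokens       # 992 tokens left for generation

print(total_users, unique_context_tokens, prompt_tokens, round(shared_fraction, 2))
```

With roughly five sixths of every prompt shared within a customer, whether a request lands on the pod that holds that customer's context largely determines its prefill cost.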

Test Results

Scheduling Strategy | Output Throughput (tokens/s) | P90 TTFT (s) | Avg TTFT (s) | Avg Wait Queue
Prefix cache-aware routing (precise mode) | 8730.0 | 0.542 | 0.298 | 0.1
Prefix cache-aware routing (estimation mode) | 6944.4 | 31.083 | 13.316 | 8.1
Random scheduling | 4428.7 | 92.551 | 46.987 | 27.3

Key Findings

  1. 170x P90 TTFT improvement: Precise mode achieved a P90 TTFT of just 0.542 seconds, compared to over 92 seconds for random scheduling.
  2. Nearly doubled throughput: Precise mode's output throughput (8730.0 tokens/s) is about 1.97x that of random scheduling (4428.7 tokens/s).
  3. System stability: Precise mode almost eliminates the waiting queue, keeping the system in a healthy, low-latency state.
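The headline ratios can be reproduced directly from the results table. The short Python sketch below does the division; the dictionary layout and the `ratio` helper are just one convenient way to hold the published numbers.

```python
results = {
    "precise":    {"throughput": 8730.0, "p90_ttft": 0.542,  "avg_ttft": 0.298},
    "estimation": {"throughput": 6944.4, "p90_ttft": 31.083, "avg_ttft": 13.316},
    "random":     {"throughput": 4428.7, "p90_ttft": 92.551, "avg_ttft": 46.987},
}


def ratio(metric: str, numerator: str, denominator: str) -> float:
    """Ratio of one strategy's metric to another's."""
    return results[numerator][metric] / results[denominator][metric]


p90_vs_random = ratio("p90_ttft", "random", "precise")             # about 170x
p90_vs_estimation = ratio("p90_ttft", "estimation", "precise")     # about 57x
throughput_vs_random = ratio("throughput", "precise", "random")    # about 1.97x

print(f"{p90_vs_random:.1f} {p90_vs_estimation:.1f} {throughput_vs_random:.2f}")
```

Both TTFT multipliers quoted in this article are P90 figures; the average-TTFT ratios computed the same way are somewhat lower.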

These metrics prove that precise-mode prefix cache-aware routing is not just a theoretical optimization—it is a performance powerhouse for production environments.

Conclusion

In the race for LLM inference efficiency and cost reduction, every percentage point of optimization yields massive business value. ACK GIE co-developed and contributed the precise-mode prefix cache-aware routing technology to the llm-d upstream community. For details, see https://llm-d.ai/blog/kvcache-wins-you-can-see

By establishing a global KV-Cache view, ACK GIE transforms inference from "operating in the dark" to "full-spectrum visibility", allowing distributed vLLM clusters to function as a unified, high-performance engine. Whether you are building enterprise AI apps or operating a large-scale AI service platform, this technology is your key to superior performance and cost optimization.

