Global context cache

Long-context workloads re-process the same prompt prefix on every request—conversation history, a document uploaded by the user, or a large codebase shared across calls. Global context cache stores those reusable prefixes in a distributed Key-Value (KV) store so subsequent requests skip recomputation, cutting time to first token and compute cost.

When to use global context cache

Use global context cache when requests repeatedly share a long, stable prefix:

Scenario	How caching helps
Multi-turn conversations	Conversation history grows with each turn. The cache stores the accumulated history so only new tokens are computed per turn.
Long-form document analysis	A user uploads a 50-page manual and asks multiple questions. The document is cached after the first request; follow-up questions skip reprocessing.
Code generation	A large codebase or specification is prepended to every request. Caching the fixed portion cuts per-request compute cost.

Note

If requests share little or no common prefix, or if the model spends most of its time generating long answers rather than processing input, cache hit rates will be low and the benefit minimal.

Benefits

Reduces computational overhead: Avoids recomputing cached tokens.
Lowers response latency: Reuses cached results to reduce time to first token.
Improves resource utilization: Supports more concurrent requests through tiered, pooled caching.

How it works

The cache is a tiered system with three components:

Component	Role
LLM intelligent router	Receives requests, analyzes request characteristics, queries Redis for prefix metadata, and routes each request to the Pod most likely to already hold the cached prefix (cache-aware request scheduling).
Tiered GPU/CPU cache (per Pod)	GPU memory is checked first for the lowest-latency hit. On a GPU miss, the Pod checks its local CPU cache or pulls KV data from a remote Pod that it locates through the metadata in Redis.
Shared Redis instance	Stores cache metadata to coordinate prefix lookups across Pods.
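
As a rough illustration of cache-aware request scheduling, the sketch below hashes a request's token prefix in fixed-size blocks and picks the Pod that holds the longest run of leading blocks. The block size, the chained-hash scheme, the `metadata` dictionary (standing in for Redis), and the names `block_hashes` and `pick_pod` are all assumptions made for this example; the router's actual algorithm is not described here.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per block; hypothetical value chosen for readability


def block_hashes(tokens, block_size=BLOCK_SIZE):
    """Return one chained hash per full block, so each hash identifies a whole prefix."""
    hashes = []
    prev = ""
    full = len(tokens) - len(tokens) % block_size  # ignore a trailing partial block
    for start in range(0, full, block_size):
        block = tokens[start:start + block_size]
        payload = prev + "|" + ",".join(str(t) for t in block)
        prev = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(prev)
    return hashes


def pick_pod(tokens, metadata, pods):
    """Pick the Pod with the longest cached run of leading blocks.

    `metadata` maps block hash -> set of Pod names (a stand-in for Redis).
    Falls back to the first Pod when nothing matches.
    """
    hashes = block_hashes(tokens)
    best_pod, best_len = pods[0], 0
    for pod in pods:
        matched = 0
        for h in hashes:
            if pod not in metadata.get(h, set()):
                break
            matched += 1
        if matched > best_len:
            best_pod, best_len = pod, matched
    return best_pod, best_len


# Usage: pod-b has already served a request with the same first 8 tokens.
shared_prefix = [1, 2, 3, 4, 5, 6, 7, 8]
metadata = {h: {"pod-b"} for h in block_hashes(shared_prefix)}
pod, matched_blocks = pick_pod(shared_prefix + [9, 10, 11, 12], metadata, ["pod-a", "pod-b"])
print(pod, matched_blocks)
```

Chaining each block hash to the previous one means a match on block N implies a match on every earlier block, which is what makes a simple left-to-right scan sufficient in this sketch.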

Request flow:

1. The router receives the request and queries Redis for prefix metadata.
2. The router forwards the request to the Pod most likely to have a cache hit.
3. The Pod searches its internal cache:
- GPU cache hit — Returns cached KV directly from GPU memory.
- CPU cache or remote Pod hit — Redis metadata indicates the prefix exists; the Pod pulls cached KV from local CPU cache or a remote Pod.
- Cache miss — The Pod processes the full prompt, generates a new KV cache entry, and stores it.
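
The lookup order inside a Pod can be sketched as a chain of fallbacks. This is a simplified model, not the service's implementation: plain dictionaries stand in for GPU memory, CPU memory, Redis metadata, and remote Pods, and `compute_kv` is a hypothetical placeholder for the prefill step. Storing a newly computed entry in the GPU tier is also an assumption.

```python
def compute_kv(prefix_key):
    """Placeholder for the expensive prefill computation (hypothetical)."""
    return f"kv({prefix_key})"


def lookup(prefix_key, gpu_cache, cpu_cache, redis_meta, remote_pods):
    """Return (tier, kv), checking tiers in the order the request flow describes."""
    if prefix_key in gpu_cache:                      # lowest-latency tier
        return "gpu", gpu_cache[prefix_key]
    if prefix_key in cpu_cache:                      # local CPU tier
        return "cpu", cpu_cache[prefix_key]
    owner = redis_meta.get(prefix_key)               # metadata names a remote Pod
    if owner is not None and prefix_key in remote_pods.get(owner, {}):
        return "remote", remote_pods[owner][prefix_key]
    kv = compute_kv(prefix_key)                      # full miss: compute and store
    gpu_cache[prefix_key] = kv                       # assumption: new entries land in GPU
    return "miss", kv


gpu, cpu = {}, {"doc-1": "kv(doc-1)"}
redis_meta = {"doc-2": "pod-b"}
remote_pods = {"pod-b": {"doc-2": "kv(doc-2)"}}

for key in ["doc-1", "doc-2", "doc-3", "doc-3"]:
    tier, _ = lookup(key, gpu, cpu, redis_meta, remote_pods)
    print(key, tier)
```

In this run, the second request for `doc-3` is served from the GPU tier because the first request stored the entry after the miss.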

Cache behavior:

Property	Value
Eviction policy	Least Recently Used (LRU)
Time to Live (TTL)	None — entries remain valid until evicted
Hit guarantee	Best-effort; cache hits are not guaranteed
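
To make the LRU-without-TTL behavior concrete, here is a minimal sketch built on Python's `collections.OrderedDict`. The `LRUCache` class is illustrative only, and it counts capacity in entries for simplicity, whereas a real KV cache is bounded by memory.

```python
from collections import OrderedDict


class LRUCache:
    """Least Recently Used cache with no TTL: entries leave only by eviction."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()  # oldest access first, newest last

    def get(self, key):
        if key not in self._entries:
            return None                     # miss: hits are best-effort
        self._entries.move_to_end(key)      # a hit refreshes recency
        return self._entries[key]

    def put(self, key, value):
        self._entries[key] = value
        self._entries.move_to_end(key)
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict the least recently used


cache = LRUCache(capacity=2)
cache.put("prefix-a", "kv-a")
cache.put("prefix-b", "kv-b")
cache.get("prefix-a")            # prefix-a is now the most recently used
cache.put("prefix-c", "kv-c")    # over capacity, so prefix-b is evicted
print(cache.get("prefix-a"), cache.get("prefix-b"), cache.get("prefix-c"))
```

Because there is no expiry, an entry that keeps being read can stay cached indefinitely, while an idle entry is removed as soon as newer entries need the space.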

Limitations

Global context cache requires the LLM Deployment option under Scenario-based Model Deployment.

Requirement	Supported value
Resource type	Lingjun resources only
Inference engine	vLLM only
Model architecture	Multi-Head Attention (MHA) models only (e.g., Qwen)

Prerequisites

Before you begin, make sure you have:

Access to the PAI console
A workspace with Lingjun resources available
A supported MHA model (e.g., Qwen3-8B)

Enable global context cache

Deployment completes in under five minutes.

1. Log in to the PAI console. Select a region, then select your workspace and click Elastic Algorithm Service (EAS).
2. On the Inference Service tab, click Deploy Service. Under Scenario-based Model Deployment, click LLM Deployment.
3. On the deployment configuration page:
- Select a model (e.g., Qwen3-8B) and a deployment template (e.g., Single Machine).
- Set Inference Engine to vLLM.
- Enable Global Context Cache.
Enabling Global Context Cache displays configuration tabs for three sub-services: LLM inference service, LLM intelligent router, and Redis instance.
Important
When configuring the LLM inference service:
- Set Deployment Resources to Lingjun resources.
- Set Context cache capacity carefully. This field defines the memory reserved for KV cache. If the reserved memory is insufficient, the service may fail to start or be interrupted during operation.
4. Under Network Information, select a VPC, vSwitch, and security group.
5. Click Deploy.
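
When choosing a value for Context cache capacity, a back-of-the-envelope estimate of KV cache size per token can help. The sketch below uses the standard transformer accounting (one key and one value tensor per layer); the model dimensions are hypothetical placeholders, not the configuration of any specific model, and the estimate ignores engine overhead. The unit and exact semantics of the console field are not stated here, so treat the result only as a starting point.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache one token occupies (dtype_bytes=2 assumes 16-bit values)."""
    # Factor 2: each layer stores one key tensor and one value tensor.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_cache_gib(num_tokens, **model_dims):
    """GiB needed to hold the KV cache for `num_tokens` cached tokens."""
    return num_tokens * kv_bytes_per_token(**model_dims) / 2**30


# Hypothetical model dimensions, chosen only to keep the arithmetic easy to follow.
dims = dict(num_layers=32, num_kv_heads=8, head_dim=128)

print(kv_bytes_per_token(**dims))   # 131072 bytes (128 KiB) per token
print(kv_cache_gib(32768, **dims))  # 4.0 GiB for 32,768 cached tokens
```

Read the real values for your model from its published configuration before relying on the estimate.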

Best practices

Structure prompts to maximize cache hits

Place stable, reusable content at the beginning of the prompt—system instructions, role definitions, or fixed document context. Keep this shared prefix unchanged across requests. Content that changes per request (user input, dynamic parameters) goes at the end.

Example prompt structure:

[System instructions — stable, cached]
[Document or codebase context — stable, cached]
[User query — changes per request]
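
The structure above can be sketched as a small prompt builder. The strings and the function names `build_prompt` and `build_prompt_unstable` are hypothetical, and the comparison is done on characters for simplicity, whereas the cache operates on tokens. The second builder shows how a per-request value placed first shrinks the shared prefix to almost nothing.

```python
import os

SYSTEM_INSTRUCTIONS = "You are a support assistant. Answer only from the manual."
DOCUMENT_CONTEXT = "MANUAL: (full text of the uploaded manual goes here)"
STABLE_PREFIX = f"{SYSTEM_INSTRUCTIONS}\n\n{DOCUMENT_CONTEXT}"


def build_prompt(user_query):
    """Stable content first, per-request content last."""
    return f"{STABLE_PREFIX}\n\nQuestion: {user_query}"


def build_prompt_unstable(user_query, request_id):
    """Anti-pattern: a per-request value at the start changes the prefix every time."""
    return f"Request {request_id}\n{STABLE_PREFIX}\n\nQuestion: {user_query}"


good = [build_prompt("How do I reset the device?"),
        build_prompt("What is the warranty period?")]
bad = [build_prompt_unstable("How do I reset the device?", 1),
       build_prompt_unstable("What is the warranty period?", 2)]

# os.path.commonprefix compares strings character by character.
shared_good = os.path.commonprefix(good)
shared_bad = os.path.commonprefix(bad)
print(len(shared_good), len(shared_bad))
```

With the stable content first, the two prompts share the entire instruction and document block; with the request ID first, they share only the literal text `Request `.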

Send similar requests close together

Under LRU eviction, the least recently accessed entries are removed first. To keep high-value prefixes in cache:

Send requests that share the same prefix in close succession rather than spreading them out.
For batch jobs, group or sort requests by prefix similarity before submission. This maximizes cache reuse within a batch and reduces the chance that a prefix is evicted before related requests arrive.
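
For batch jobs, the grouping step might look like the following sketch. It assumes each request already carries an identifier for its shared prefix (here a hypothetical `prefix_id` string such as a document name); how you derive that identifier depends on your workload.

```python
from itertools import groupby
from operator import itemgetter

# Each request is (prefix_id, query); the values are illustrative.
batch = [
    ("manual-A", "How do I reset the device?"),
    ("spec-B", "List the required fields."),
    ("manual-A", "What is the warranty period?"),
    ("spec-B", "Which fields are optional?"),
]


def group_by_prefix(requests):
    """Return [(prefix_id, [queries])] with same-prefix requests adjacent."""
    # sorted() is stable, so queries keep their original order within a prefix.
    ordered = sorted(requests, key=itemgetter(0))
    return [(prefix_id, [query for _, query in group])
            for prefix_id, group in groupby(ordered, key=itemgetter(0))]


for prefix_id, queries in group_by_prefix(batch):
    print(prefix_id, queries)
```

Submitting the groups one after another keeps each prefix recently used while its related requests are still arriving.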