Long-context workloads re-process the same prompt prefix on every request—conversation history, a document uploaded by the user, or a large codebase shared across calls. Global context cache stores those reusable prefixes in a distributed Key-Value (KV) store so subsequent requests skip recomputation, cutting time to first token and compute cost.
When to use global context cache
Use global context cache when requests repeatedly share a long, stable prefix:
| Scenario | How caching helps |
|---|---|
| Multi-turn conversations | Conversation history grows with each turn. The cache stores the accumulated history so only new tokens are computed per turn. |
| Long-form document analysis | A user uploads a 50-page manual and asks multiple questions. The document is cached after the first request; follow-up questions skip reprocessing. |
| Code generation | A large codebase or specification is prepended to every request. Caching the fixed portion cuts per-request compute cost. |
Note: If requests share little or no common prefix, or if the model spends most of its time generating long answers rather than processing input, cache hit rates will be low and the benefit minimal.
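To gauge whether a workload fits this pattern, you can estimate how much of the input is a shared prefix. The following is a minimal sketch, not part of the product: `common_prefix_len` and `shared_prefix_ratio` are hypothetical helper names, and the token lists stand in for whatever tokenization your model uses.

```python
def common_prefix_len(token_lists):
    """Number of leading tokens that are identical across all requests."""
    n = 0
    for column in zip(*token_lists):
        if any(tok != column[0] for tok in column):
            break
        n += 1
    return n


def shared_prefix_ratio(token_lists):
    """Fraction of all input tokens covered by the prefix common to every request."""
    total = sum(len(tokens) for tokens in token_lists)
    if total == 0:
        return 0.0
    return common_prefix_len(token_lists) * len(token_lists) / total
```

A ratio close to 1 suggests most input tokens could be served from cache; a ratio near 0 suggests caching will help little.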
Benefits
Reduces computational overhead: Avoids recomputing cached tokens.
Lowers response latency: Reuses cached results to reduce time to first token.
Improves resource utilization: Supports more concurrent requests through tiered, pooled caching.
How it works
The cache is a tiered system with three components:
| Component | Role |
|---|---|
| LLM intelligent router | Receives requests, analyzes request characteristics, queries Redis for prefix metadata, and routes each request to the Pod most likely to hold the cached prefix (cache-aware request scheduling). |
| Tiered GPU/CPU cache (per Pod) | GPU memory is checked first for the lowest-latency hit. On a GPU miss, the Pod checks its local CPU cache or pulls KV data from a remote Pod that it locates through the Redis metadata. |
| Shared Redis instance | Stores cache metadata to coordinate prefix lookups across Pods. |
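The router's cache-aware scheduling can be pictured with the sketch below. It is an illustration under stated assumptions, not the router's actual implementation: a plain dictionary stands in for the Redis metadata, prefixes are identified by chained block hashes, and `BLOCK_SIZE`, `block_hashes`, and `pick_pod` are hypothetical names.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per block; hypothetical value for illustration


def block_hashes(tokens, block_size=BLOCK_SIZE):
    """Chained hashes: each hash identifies the whole prefix up to that block."""
    hashes, prev = [], ""
    usable = len(tokens) - len(tokens) % block_size  # ignore a trailing partial block
    for i in range(0, usable, block_size):
        block = ",".join(map(str, tokens[i:i + block_size]))
        prev = hashlib.sha256((prev + "|" + block).encode()).hexdigest()
        hashes.append(prev)
    return hashes


def pick_pod(tokens, metadata, pods):
    """Return the Pod holding the longest cached prefix; fall back to the first Pod."""
    matched = {pod: 0 for pod in pods}
    for depth, h in enumerate(block_hashes(tokens), start=1):
        holders = metadata.get(h, set())
        if not holders:
            break  # no Pod caches a prefix this long
        for pod in holders:
            if matched.get(pod) == depth - 1:  # count contiguous matches only
                matched[pod] = depth
    return max(pods, key=lambda pod: matched[pod])
```

Here `metadata` maps each block hash to the set of Pods that hold it. A production router would also weigh Pod load, which this sketch leaves out.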
Request flow:
The router receives the request and queries Redis for prefix metadata.
The router forwards the request to the Pod most likely to have a cache hit.
The Pod searches its internal cache:
GPU cache hit — Returns cached KV directly from GPU memory.
CPU cache or remote Pod hit — Redis metadata indicates the prefix exists; the Pod pulls cached KV from local CPU cache or a remote Pod.
Cache miss — The Pod processes the full prompt, generates a new KV cache entry, and stores it.
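The lookup order inside a Pod can be sketched as follows. This is a simplified model rather than the service's code: dictionaries stand in for the GPU cache, CPU cache, and Redis metadata, and `lookup_kv`, `fetch_remote`, and `compute_kv` are hypothetical names. Whether entries are promoted between tiers after a hit is not covered here.

```python
def lookup_kv(prefix_key, gpu_cache, cpu_cache, remote_index, fetch_remote, compute_kv):
    """Return (kv, source), trying GPU, then local CPU, then a remote Pod, then recomputing."""
    if prefix_key in gpu_cache:
        return gpu_cache[prefix_key], "gpu"
    if prefix_key in cpu_cache:
        return cpu_cache[prefix_key], "cpu"
    owner = remote_index.get(prefix_key)  # stand-in for the Redis metadata lookup
    if owner is not None:
        return fetch_remote(owner, prefix_key), "remote"
    kv = compute_kv(prefix_key)  # cache miss: process the full prompt
    gpu_cache[prefix_key] = kv   # store the new entry for later requests
    return kv, "miss"
```

The second element of the returned tuple records which tier served the request, which is convenient for tracking hit rates in a test harness.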
Cache behavior:
| Property | Value |
|---|---|
| Eviction policy | Least Recently Used (LRU) |
| Time to Live (TTL) | None — entries remain valid until evicted |
| Hit guarantee | Best-effort; cache hits are not guaranteed |
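The properties in the table above can be illustrated with a small LRU cache that has no TTL. This is a generic sketch built on Python's `OrderedDict`, not the service's implementation. The class name `LRUCache` is hypothetical, and capacity is counted in entries for simplicity, whereas a real KV cache is bounded by memory.

```python
from collections import OrderedDict


class LRUCache:
    """Entries never expire; they leave only when capacity forces an eviction."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()

    def get(self, key):
        if key not in self._entries:
            return None  # best-effort: a miss is always possible
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key]

    def put(self, key, value):
        self._entries[key] = value
        self._entries.move_to_end(key)
        while len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict the least recently used entry
```

Because reads refresh recency, a prefix that is requested often stays cached, while one that goes unused is the first to be evicted when space runs out.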
Limitations
Global context cache requires the LLM Deployment option under Scenario-based Model Deployment.
| Requirement | Supported value |
|---|---|
| Resource type | Lingjun resources only |
| Inference engine | vLLM only |
| Model architecture | Multi-Head Attention (MHA) models only (e.g., Qwen) |
Prerequisites
Before you begin, make sure you have:
Access to the PAI console
A workspace with Lingjun resources available
A supported MHA model (e.g., Qwen3-8B)
Enable global context cache
Deployment completes in under five minutes.
Log in to the PAI console. Select a region, then select your workspace and click Elastic Algorithm Service (EAS).
On the Inference Service tab, click Deploy Service. Under Scenario-based Model Deployment, click LLM Deployment.
On the deployment configuration page:
Select a model (e.g., Qwen3-8B) and a deployment template (e.g., Single Machine).
Set Inference Engine to vLLM.
Enable Global Context Cache.
Enabling Global Context Cache displays configuration tabs for three sub-services: LLM inference service, LLM intelligent router, and Redis instance.
Important: When configuring the LLM inference service:
Set Deployment Resources to Lingjun resources.
Set Context cache capacity carefully. This field defines the memory reserved for KV cache. If the reserved memory is insufficient, the service may fail to start or be interrupted during operation.
Under Network Information, select a VPC, vSwitch, and security group.
Click Deploy.
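When choosing a value for Context cache capacity in the procedure above, a rough estimate of KV cache size per token can help. The sketch below uses the standard per-token formula for transformer KV caches (one key and one value tensor per layer). The function names and the model shape in the usage note are illustrative assumptions; read the real values from your model's configuration, and check the console for the unit the field expects.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache per token; bytes_per_value=2 assumes 16-bit values."""
    # Factor 2: one key tensor and one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def kv_gib_for_tokens(num_tokens, **model_shape):
    """Approximate GiB needed to keep num_tokens of KV cache resident."""
    return num_tokens * kv_bytes_per_token(**model_shape) / 2**30
```

For a hypothetical model with 32 layers, 32 KV heads, and a head dimension of 128, this gives 524,288 bytes (0.5 MiB) per token, so caching 100,000 tokens would need roughly 48.8 GiB.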
Best practices
Structure prompts to maximize cache hits
Place stable, reusable content at the beginning of the prompt—system instructions, role definitions, or fixed document context. Keep this shared prefix unchanged across requests. Content that changes per request (user input, dynamic parameters) goes at the end.
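In application code, this ordering might look like the sketch below. The constants and the `build_prompt` helper are hypothetical, and the placeholder strings stand in for your own instructions and document.

```python
SYSTEM_INSTRUCTIONS = "You are a support assistant for the product manual."
DOCUMENT_CONTEXT = "<full manual text goes here>"


def build_prompt(user_query, system=SYSTEM_INSTRUCTIONS, document=DOCUMENT_CONTEXT):
    """Stable content first, per-request content last, so the prefix is byte-identical."""
    return f"{system}\n\n{document}\n\n{user_query}"
```

Avoid placing values that change per request, such as timestamps or request IDs, ahead of the stable content: any difference early in the prompt changes the prefix and prevents reuse of everything after it.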
Example prompt structure:
[System instructions: stable, cached]
[Document or codebase context: stable, cached]
[User query: changes per request]
Send similar requests close together
Under LRU eviction, the least recently accessed entries are removed first. To keep high-value prefixes in cache:
Send requests that share the same prefix in close succession rather than spreading them out.
For batch jobs, group or sort requests by prefix similarity before submission. This maximizes cache reuse within a batch and reduces the chance that a prefix is evicted before related requests arrive.
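For batch jobs, the grouping step can be as simple as a stable sort on the prefix. The sketch below is one possible approach, assuming each request is a dictionary with hypothetical `prefix` and `query` fields; `group_by_prefix` is likewise an illustrative name.

```python
from itertools import groupby


def group_by_prefix(requests, prefix_of=lambda r: r["prefix"]):
    """Order requests so those sharing a prefix are adjacent, then split into groups."""
    ordered = sorted(requests, key=prefix_of)  # stable: keeps original order within a prefix
    return [list(group) for _, group in groupby(ordered, key=prefix_of)]
```

Submitting one group at a time keeps each prefix recently used while its related requests are in flight. For very long prefixes, sorting on a hash of the prefix instead of the full text would reduce comparison cost.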