Configure global context cache to accelerate LLM inference for long-context tasks through a tiered, pooled caching system built on a distributed KV store.
Overview
Global context cache is ideal for multi-turn conversations, code generation, and long-form document analysis. It accelerates subsequent inference by caching fixed prefixes such as conversation history, code snippets, or document content. Benefits include:
- Reduces computational overhead: Avoids recomputing cached tokens.
- Lowers response latency: Reuses cached results to reduce time to first token.
- Improves resource utilization: Supports more concurrent requests through tiered, pooled caching.
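To make the saving concrete, the following minimal sketch counts how many leading tokens a second request shares with a first one. The token IDs and the `shared_prefix_len` helper are hypothetical and only illustrate the idea of prefix reuse; they are not part of the service.

```python
from typing import Sequence


def shared_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Return the number of leading tokens that two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Hypothetical token IDs: turn 2 repeats the history of turn 1 and appends new tokens.
turn_1 = [101, 7, 7, 42, 9, 13]
turn_2 = turn_1 + [55, 8, 21]

reused = shared_prefix_len(turn_1, turn_2)   # tokens whose KV cache could be reused
to_compute = len(turn_2) - reused            # tokens that still need to be processed
print(f"reused={reused}, to_compute={to_compute}")  # reused=6, to_compute=3
```

In this illustration, only the three new tokens of the second turn need fresh computation if the prefix is found in the cache.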
How it works
Global context cache is a tiered system. Its core components are the LLM intelligent router, a tiered cache (GPU/CPU) within each inference instance (Pod), and a shared Redis instance that stores cache metadata. The workflow is as follows:
- A user request reaches the LLM intelligent router.
- The router analyzes the request characteristics and consults the metadata in Redis. It performs cache-aware scheduling to forward the request to the optimal inference instance.
- The inference instance (Pod) receives the request and searches its tiered cache in the following order:
  - Pod GPU cache: The service first queries the GPU memory of the current Pod, which provides the fastest access.
  - Redis metadata: On a GPU cache miss, the service queries the shared Redis instance. If metadata for the prefix exists in Redis, the service pulls the cached data from the local CPU cache or from a remote Pod.
  - Cache miss: If the prefix is not found in Redis, the service processes the full prompt. During inference, it generates a new KV cache and stores it according to the configured policy.
Note:
- Cache policy: The cache uses a Least Recently Used (LRU) policy to automatically evict the least recently accessed entries.
- Cache validity: Entries have no Time to Live (TTL) and remain valid until they are evicted.
- Best-effort basis: Global context cache is a best-effort optimization and does not guarantee cache hits.
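The lookup order and the LRU eviction described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the service's implementation: a plain dictionary stands in for the Redis metadata, the `LRUTier` and `Pod` classes are hypothetical, a remote Pod is assumed to serve data from its CPU tier, and a cache miss is assumed to store the new entry in both local tiers.

```python
from collections import OrderedDict


class LRUTier:
    """Fixed-capacity cache tier that evicts the least recently used entry."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)  # mark as most recently used
        return self.entries[key]

    def put(self, key, value):
        self.entries[key] = value
        self.entries.move_to_end(key)
        while len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used entry


class Pod:
    """Hypothetical inference instance with a GPU tier and a CPU tier."""

    def __init__(self, name, gpu_capacity, cpu_capacity, metadata, peers):
        self.name = name
        self.gpu = LRUTier(gpu_capacity)
        self.cpu = LRUTier(cpu_capacity)
        self.metadata = metadata  # stand-in for Redis: prefix -> name of the Pod holding it
        self.peers = peers        # Pod name -> Pod, used for remote pulls

    def lookup(self, prefix):
        # 1. Pod GPU cache: fastest path.
        kv = self.gpu.get(prefix)
        if kv is not None:
            return "gpu_hit", kv

        # 2. Shared metadata: locate the prefix in the local CPU tier or on a remote Pod.
        owner = self.metadata.get(prefix)
        if owner is not None:
            source = self if owner == self.name else self.peers.get(owner)
            kv = source.cpu.get(prefix) if source is not None else None
            if kv is not None:
                self.gpu.put(prefix, kv)  # promote to the GPU tier (assumption)
                return ("cpu_hit" if source is self else "remote_hit"), kv

        # 3. Cache miss: process the full prompt and store the new KV cache.
        kv = f"kv({prefix})"  # placeholder for the KV cache produced during inference
        self.gpu.put(prefix, kv)
        self.cpu.put(prefix, kv)
        self.metadata[prefix] = self.name
        return "miss", kv


metadata, peers = {}, {}
pod_a = Pod("pod-a", gpu_capacity=1, cpu_capacity=4, metadata=metadata, peers=peers)
pod_b = Pod("pod-b", gpu_capacity=1, cpu_capacity=4, metadata=metadata, peers=peers)
peers.update({"pod-a": pod_a, "pod-b": pod_b})

print(pod_a.lookup("system-prompt")[0])  # miss
print(pod_a.lookup("system-prompt")[0])  # gpu_hit
print(pod_b.lookup("system-prompt")[0])  # remote_hit
print(pod_a.lookup("doc-1")[0])          # miss (evicts "system-prompt" from pod-a's GPU tier)
print(pod_a.lookup("system-prompt")[0])  # cpu_hit
```

Because the metadata in this sketch is not updated on eviction, a stale entry simply falls through to the miss path, which matches the best-effort nature of the cache.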
Limitations
Global context cache is available only through the LLM Deployment option in Scenario-based Model Deployment. The following requirements apply:
- Resource type: Only Lingjun resources are supported.
- Inference engine: Only the vLLM engine is supported.
- Model architecture: Only Multi-Head Attention (MHA) models (such as Qwen) are supported.
Procedure
Deploy an LLM service with global context cache enabled. Deployment completes in under five minutes.
- Log on to the PAI console. Select a region at the top of the page. Then, select the desired workspace and click Elastic Algorithm Service (EAS).
- On the Inference Service tab, click Deploy Service. In Scenario-based Model Deployment, click LLM Deployment.
- On the deployment configuration page, select a model (such as Qwen3-8B) and a deployment template (such as Single Machine). For Inference Engine, select vLLM, then enable Global Context Cache.
- Enabling global context cache displays configuration tabs for three sub-services: the LLM inference service, the LLM intelligent router, and the Redis instance.
  Important: When configuring the LLM inference service:
  - Deployment Resources: Use Lingjun resources.
  - Context cache capacity: Defines the memory size for the KV cache. Reserve sufficient memory for model inference. If the reserved memory is insufficient, the service may fail to start or experience interruptions.
- In Network Information, select a VPC, a vSwitch, and a security group.
- After completing the configuration, click Deploy.
Best practices
To improve the cache hit ratio:
- Optimize the prompt structure
  - Place stable, reusable content such as system instructions or role definitions at the beginning.
  - Keep the shared prefix stable and avoid frequent changes.
- Optimize request patterns
  - Send requests with similar prefixes in close succession.
  - For batch processing, sort or group requests by prefix similarity.
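As a rough illustration of both practices, the sketch below builds each prompt with the stable content first and then orders a batch so that prompts with a common prefix are adjacent. The `build_prompt` and `order_for_cache_reuse` helpers, the prompt layout, and the sample texts are hypothetical; adapt them to the prompt format your application uses.

```python
SYSTEM = "You are a contract analysis assistant."  # stable, reusable instructions


def build_prompt(system, document, question):
    # Stable content first, then the shared document, then the part that varies.
    return f"{system}\n\n{document}\n\nQuestion: {question}"


def order_for_cache_reuse(prompts):
    # Lexicographic sorting places prompts that share a prefix next to each other.
    return sorted(prompts)


batch = [
    ("doc A text", "What is the termination clause?"),
    ("doc B text", "Who are the parties?"),
    ("doc A text", "What is the governing law?"),
]

prompts = [build_prompt(SYSTEM, doc, q) for doc, q in batch]
ordered = order_for_cache_reuse(prompts)

for p in ordered:
    # Show the document and question of each prompt in the sorted batch.
    print(p.split("\n\n")[1], "|", p.split("\n\n")[2])
```

After sorting, the two prompts about "doc A text" are sent back to back, so the second one is more likely to find the shared prefix still cached.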