Enable Global Context Cache to Speed Up Long-Context LLM Inference - Platform For AI

Overview

Global context cache is ideal for multi-turn conversations, code generation, and long-form document analysis. It accelerates subsequent inference by caching fixed prefixes such as conversation history, code snippets, or document content. Benefits include:

Reduces computational overhead: Avoids recomputing cached tokens.
Lowers response latency: Reuses cached results to reduce time to first token.
Improves resource utilization: Supports more concurrent requests through tiered, pooled caching.

How it works

Global context cache is a tiered system. Its core components are the LLM intelligent router, a tiered cache (GPU/CPU) within each inference instance (Pod), and a shared Redis instance that stores cache metadata. The workflow is as follows:

A user request reaches the LLM intelligent router.
The router analyzes the request characteristics and consults the metadata in Redis. It then performs cache-aware scheduling and forwards the request to the optimal inference instance.
The inference instance (Pod) receives the request and searches its tiered cache in the following order:
1. Pod GPU cache: The service first queries the GPU memory of the current Pod, which offers the fastest access.
2. Redis metadata: On a GPU cache miss, the service queries the shared Redis instance. If metadata for the prefix exists in Redis, the service pulls the cached data from the local CPU cache or from a remote Pod.
3. Cache miss: If the prefix is not found in Redis, the service processes the full prompt. During inference, it generates a new KV cache and stores it according to the configured policy.
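
The lookup order above can be sketched in a few lines of Python. This is only an illustration of the described flow, not the service's actual implementation: the `Pod` class, the `lookup_prefix` function, the plain dictionaries standing in for the GPU cache, CPU cache, and Redis metadata, and the fallback to a full prefill when metadata is stale are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Pod:
    """Hypothetical stand-in for one inference instance."""
    name: str
    gpu_cache: dict = field(default_factory=dict)  # prefix -> KV data
    cpu_cache: dict = field(default_factory=dict)  # prefix -> KV data


def lookup_prefix(prefix, pod, redis_meta, pods):
    """Return (source, kv) following the three-step lookup order.

    redis_meta maps prefix -> name of the Pod that holds it (assumed layout).
    pods maps Pod name -> Pod.
    """
    # 1. Pod GPU cache: fastest path.
    if prefix in pod.gpu_cache:
        return "gpu", pod.gpu_cache[prefix]

    # 2. Redis metadata: find out which Pod holds the prefix.
    owner = redis_meta.get(prefix)
    if owner is not None:
        if owner == pod.name and prefix in pod.cpu_cache:
            return "local_cpu", pod.cpu_cache[prefix]
        remote = pods.get(owner)
        if remote is not None and remote is not pod:
            kv = remote.gpu_cache.get(prefix, remote.cpu_cache.get(prefix))
            if kv is not None:
                return "remote_pod", kv
        # Stale metadata: fall through and treat it as a miss (assumption).

    # 3. Cache miss: process the full prompt, then store and register the result.
    kv = f"kv({prefix})"  # placeholder for the real prefill computation
    pod.gpu_cache[prefix] = kv
    redis_meta[prefix] = pod.name
    return "miss", kv
```

In this sketch, the first request for a prefix on one Pod returns `"miss"`, a repeat on the same Pod returns `"gpu"`, and the same prefix requested from a different Pod returns `"remote_pod"`.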

Note:

Cache policy: The cache uses a Least Recently Used (LRU) policy and automatically evicts the least recently accessed entries when space is needed.
Cache validity: Entries do not have a Time to Live (TTL) and remain valid until evicted.
Best-effort basis: Global context cache is a best-effort optimization and does not guarantee cache hits.
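
To make the eviction behavior concrete, here is a minimal LRU sketch in Python. It is a generic illustration of LRU without a TTL, not the service's own cache code; the class name `LRUPrefixCache` and the entry-count capacity are assumptions chosen for brevity (a real KV cache would be bounded by memory size).

```python
from collections import OrderedDict


class LRUPrefixCache:
    """Hypothetical LRU cache: no TTL, entries live until evicted."""

    def __init__(self, capacity):
        self.capacity = capacity          # max number of entries (assumed unit)
        self._entries = OrderedDict()     # oldest access first, newest last

    def get(self, prefix):
        if prefix not in self._entries:
            return None                   # best-effort: a miss is a normal outcome
        self._entries.move_to_end(prefix) # mark as most recently used
        return self._entries[prefix]

    def put(self, prefix, kv):
        self._entries[prefix] = kv
        self._entries.move_to_end(prefix)
        while len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict the least recently used


cache = LRUPrefixCache(capacity=2)
cache.put("system-prompt", "kv-1")
cache.put("doc-A", "kv-2")
cache.get("system-prompt")       # touching an entry protects it from eviction
cache.put("doc-B", "kv-3")       # evicts "doc-A", the least recently used
```

Because there is no TTL, an entry that keeps being accessed can stay cached indefinitely, while an unused entry disappears as soon as space is needed.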

Limitations

Global context cache requires the LLM Deployment option in Scenario-based Model Deployment. Requirements:

Resource type: Only Lingjun resources are supported.
Inference engine: Only the vLLM engine is supported.
Model architecture: Only Multi-Head Attention (MHA) models (such as Qwen) are supported.

Procedure

Deploy an LLM service with global context cache enabled. Deployment completes in under five minutes.

Log on to the PAI console. Select a region at the top of the page. Then, select the desired workspace and click Elastic Algorithm Service (EAS).
On the Inference Service tab, click Deploy Service. In Scenario-based Model Deployment, click LLM Deployment.
On the deployment configuration page, select a model (such as Qwen3-8B) and a deployment template (such as Single Machine). For Inference Engine, select vLLM, then enable Global Context Cache.
Enabling global context cache displays configuration tabs for three sub-services: LLM inference service, LLM intelligent router, and Redis instance.
Important
When configuring the LLM inference service:
- Deployment Resources: Use Lingjun resources.
- Context cache capacity: Defines the memory size allocated to the KV cache. Reserve sufficient memory for model inference. If the reserved memory is insufficient, the service may fail to start or may be interrupted.
In Network Information, select a VPC, a vSwitch, and a security group.
After completing configuration, click Deploy.
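
When choosing a value for Context cache capacity, a rough estimate of KV cache size per token can help. The sketch below uses the standard formula for transformer KV caches (keys plus values, per layer, per KV head). The model dimensions are illustrative assumptions, not the parameters of any specific model, and how the console setting maps to usable cache space is not covered here, so treat the result as an order-of-magnitude guide only.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_element=2):
    """Bytes of KV cache one token occupies: 2 tensors (K and V) per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


def tokens_that_fit(capacity_gib, bytes_per_token):
    """How many cached tokens a given capacity (in GiB) can hold."""
    return (capacity_gib * 1024**3) // bytes_per_token


# Assumed example model: 32 layers, 32 KV heads, head dimension 128, FP16.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
print(per_token)                       # 524288 bytes (0.5 MiB per token)
print(tokens_that_fit(16, per_token))  # 32768 tokens in 16 GiB
```

Under these assumptions, a 16 GiB cache holds about 32,000 tokens of prefix in total, which shows why long documents or many concurrent conversations can fill the cache quickly.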

Best practices

To improve the cache hit ratio:

Optimize the prompt structure
- Place stable, reusable content such as system instructions or role definitions at the beginning.
- Keep the shared prefix stable and avoid frequent changes.
Optimize request patterns
- Send requests with similar prefixes in close succession.
- For batch processing, sort or group requests by prefix similarity.
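
The practices above can be sketched as follows. The prompt template, the `SYSTEM_PROMPT` text, and the helper names are hypothetical and chosen only for illustration. The idea is to put stable content first so prompts share a long prefix, and to order a batch so that requests with the same prefix are sent back to back.

```python
from os.path import commonprefix

# Stable, reusable instructions go first (assumed example text).
SYSTEM_PROMPT = "You are a helpful assistant for contract review."


def build_prompt(document, question):
    """Stable content first, variable content last."""
    return f"{SYSTEM_PROMPT}\n\n{document}\n\nQuestion: {question}"


def order_by_prefix(requests):
    """Sort (document, question) pairs so requests on the same document are adjacent.

    sorted() is stable, so questions about one document keep their original order.
    """
    return sorted(requests, key=lambda r: r[0])


def shared_prefix_len(a, b):
    """Length of the common leading text of two prompts."""
    return len(commonprefix([a, b]))


batch = [("docB", "q1"), ("docA", "q2"), ("docB", "q3"), ("docA", "q4")]
prompts = [build_prompt(d, q) for d, q in order_by_prefix(batch)]

# Adjacent prompts about the same document share more leading text
# than adjacent prompts about different documents.
print(shared_prefix_len(prompts[0], prompts[1]) > shared_prefix_len(prompts[1], prompts[2]))
```

Placing the question at the end matters: if the variable part came first, the prompts would diverge at the first character and no shared prefix could be reused.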