All Products
Search
Document Center

Platform For AI:Configure global context cache

Last Updated:Apr 01, 2026

Long-context workloads re-process the same prompt prefix on every request—conversation history, a document uploaded by the user, or a large codebase shared across calls. Global context cache stores those reusable prefixes in a distributed Key-Value (KV) store so subsequent requests skip recomputation, cutting time to first token and compute cost.

When to use global context cache

Use global context cache when requests repeatedly share a long, stable prefix:

ScenarioHow caching helps
Multi-turn conversationsConversation history grows with each turn. The cache stores the accumulated history so only new tokens are computed per turn.
Long-form document analysisA user uploads a 50-page manual and asks multiple questions. The document is cached after the first request; follow-up questions skip reprocessing.
Code generationA large codebase or specification is prepended to every request. Caching the fixed portion cuts per-request compute cost.
Note

Note: If requests share little or no common prefix, or if the model spends most of its time generating long answers rather than processing input, cache hit rates will be low and the benefit minimal.

Benefits

  • Reduces computational overhead: Avoids recomputing cached tokens.

  • Lowers response latency: Reuses cached results to reduce time to first token.

  • Improves resource utilization: Supports more concurrent requests through tiered, pooled caching.

How it works

The cache is a tiered system with three components:

ComponentRole
LLM intelligent routerReceives requests, analyzes request characteristics, queries Redis for prefix metadata, and routes each request to the Pod most likely to hold a cache hit (cache-aware request scheduling).
Tiered GPU/CPU cache (per Pod)GPU memory is checked first for the lowest-latency hit. On a GPU miss, the Pod checks its local CPU cache or pulls KV data from a remote Pod via Redis metadata.
Shared Redis instanceStores cache metadata to coordinate prefix lookups across Pods.

Request flow:

  1. The router receives the request and queries Redis for prefix metadata.

  2. The router forwards the request to the Pod most likely to have a cache hit.

  3. The Pod searches its internal cache:

    • GPU cache hit — Returns cached KV directly from GPU memory.

    • CPU cache or remote Pod hit — Redis metadata indicates the prefix exists; the Pod pulls cached KV from local CPU cache or a remote Pod.

    • Cache miss — The Pod processes the full prompt, generates a new KV cache entry, and stores it.

Cache behavior:

PropertyValue
Eviction policyLeast Recently Used (LRU)
Time to Live (TTL)None — entries remain valid until evicted
Hit guaranteeBest-effort; cache hits are not guaranteed

Limitations

Global context cache requires the LLM Deployment option under Scenario-based Model Deployment.

RequirementSupported value
Resource typeLingjun resources only
Inference enginevLLM only
Model architectureMulti-Head Attention (MHA) models only (e.g., Qwen)

Prerequisites

Before you begin, make sure you have:

  • Access to the PAI console

  • A workspace with Lingjun resources available

  • A supported MHA model (e.g., Qwen3-8B)

Enable global context cache

Deployment completes in under five minutes.

  1. Log in to the PAI console. Select a region, then select your workspace and click Elastic Algorithm Service (EAS).

  2. On the Inference Service tab, click Deploy Service. Under Scenario-based Model Deployment, click LLM Deployment.

  3. On the deployment configuration page:

    • Select a model (e.g., Qwen3-8B) and a deployment template (e.g., Single Machine).

    • Set Inference Engine to vLLM.

    • Enable Global Context Cache.

  4. Enabling Global Context Cache displays configuration tabs for three sub-services: LLM inference service, LLM intelligent router, and Redis instance.

    Important

    When configuring the LLM inference service:

    • Set Deployment Resources to Lingjun resources.

    • Set Context cache capacity carefully. This field defines the memory reserved for KV cache. If the reserved memory is insufficient, the service may fail to start or be interrupted during operation.

  5. Under Network Information, select a VPC, vSwitch, and security group.

  6. Click Deploy.

Best practices

Structure prompts to maximize cache hits

Place stable, reusable content at the beginning of the prompt—system instructions, role definitions, or fixed document context. Keep this shared prefix unchanged across requests. Content that changes per request (user input, dynamic parameters) goes at the end.

Example prompt structure:

[System instructions — stable, cached]
[Document or codebase context — stable, cached]
[User query — changes per request]

Send similar requests close together

The LRU eviction policy evicts the least recently accessed entries first. To keep high-value prefixes in cache:

  • Send requests that share the same prefix in close succession rather than spreading them out.

  • For batch jobs, group or sort requests by prefix similarity before submission. This maximizes cache reuse within a batch and reduces the chance that a prefix is evicted before related requests arrive.