When you call a model, different requests can have overlapping input, such as in multi-turn conversations or when you ask multiple questions about the same document. The context cache feature saves the common prefix of these requests to reduce redundant computations. It improves response speed and lowers your costs without affecting the response quality.
To accommodate different scenarios, context cache provides two modes. Choose a mode based on your requirements for convenience, certainty, and cost:
Implicit cache: An automatic mode that requires no extra configuration and cannot be disabled. It is suitable for general scenarios where convenience is a priority. The system automatically detects and caches the common prefix of the request content, but the cache hit rate is not guaranteed. The cached portion is billed at 20% of the standard price for input tokens.
Explicit cache: A mode that you must enable manually. You create a cache for specific content, and requests that match it within its 5-minute validity period are guaranteed to hit. Tokens that are used to create the cache are billed at 125% of the standard input token price, but subsequent hits are billed at only 10% of the standard price.
| Item | Implicit cache | Explicit cache |
| --- | --- | --- |
| Affects response quality | No impact | No impact |
| Hit probability | Not guaranteed. The system determines the actual hit probability. | Guaranteed hit |
| Tokens used to create the cache | 100% of the standard input token price | 125% of the standard input token price |
| Cached input tokens | 20% of the standard input token price | 10% of the standard input token price |
| Minimum tokens for caching | 256 | 1,024 |
| Cache validity period | Not guaranteed. The system periodically purges cached data that has not been used for a long time. | 5 minutes (resets on each hit) |
Implicit cache and explicit cache are mutually exclusive. A single request can use only one mode.
Implicit cache
Model availability
Singapore region
Text generation models
Qwen Max: qwen3-max, qwen-max
Qwen Plus: qwen-plus
Qwen Flash: qwen-flash
Qwen Turbo: qwen-turbo
Qwen-Coder: qwen3-coder-plus, qwen3-coder-flash
Visual understanding models
Qwen VL: qwen3-vl-plus, qwen3-vl-flash, qwen-vl-max, qwen-vl-plus
Beijing region
Text generation models
Qwen Max: qwen3-max, qwen-max
Qwen Plus: qwen-plus
Qwen Flash: qwen-flash
Qwen Turbo: qwen-turbo
Qwen-Coder: qwen3-coder-plus, qwen3-coder-flash
Visual understanding models
Qwen VL: qwen3-vl-plus, qwen3-vl-flash, qwen-vl-max, qwen-vl-plus
Snapshot and latest models are not currently supported.
How it works
When you send a request to a model that supports implicit cache, this feature works automatically:
Search: After the system receives a request, it uses prefix matching to determine whether a common prefix of the content in the messages array exists in the cache.
Decision:
If a cache hit occurs, the system uses the cached result for generation.
If a cache miss occurs, the system processes the request normally and stores the prefix of the current prompt in the cache for future requests.
The system periodically purges cached data that has not been used for a long time. The cache hit rate is not guaranteed to be 100%. A cache miss can occur even for identical request contexts. The system determines the actual hit probability.
Content with fewer than 256 tokens is not cached.
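The following is a minimal sketch of this behavior. It assumes the OpenAI-compatible endpoint, a DASHSCOPE_API_KEY environment variable, and a placeholder document string; it sends two requests that share a long prefix and prints the cached token count reported in the second response.
from openai import OpenAI
import os

client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    # For the Singapore region, use https://dashscope-intl.aliyuncs.com/compatible-mode/v1
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

# Placeholder long document; the shared prefix must exceed 256 tokens to be cached.
long_document = "<Long Document Content> " * 400

def ask(question):
    return client.chat.completions.create(
        model="qwen-plus",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": long_document + question},
        ],
    )

first = ask("Summarize this document.")
second = ask("List the key terms in this document.")
# cached_tokens is typically 0 for the first request; a nonzero value on the
# second request indicates an implicit cache hit (a hit is not guaranteed).
details = second.usage.prompt_tokens_details
print(details.cached_tokens if details else 0)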
Increase the hit rate
An implicit cache hit occurs when the system determines that different requests share a common prefix. To increase the hit rate, place recurring content at the beginning of the prompt and variable content at the end.
Text models: Assume the system has cached "ABCD". A request for "ABE" might hit the "AB" portion, but a request for "BCD" will not hit.
Visual understanding models:
For multiple questions about the same image or video: Place the image or video before the text to increase the hit probability.
For the same question about different images or videos: Place the text before the image or video to increase the hit probability.
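As a sketch of the two orderings above, assuming the OpenAI-compatible message format for visual understanding models (the image URLs and questions are placeholders):
# Multiple questions about the same image: place the image first so it forms the shared prefix.
messages_same_image = [
    {"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/same_image.jpg"}},
        {"type": "text", "text": "What is the dog in the image doing?"},  # only this part changes
    ]}
]

# The same question about different images: place the text first so it forms the shared prefix.
messages_same_question = [
    {"role": "user", "content": [
        {"type": "text", "text": "Describe this image in one sentence."},  # shared prefix
        {"type": "image_url", "image_url": {"url": "https://example.com/another_image.jpg"}},
    ]}
]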
Billing
There is no extra charge to enable implicit cache mode.
When a request hits the cache, the matched input tokens are billed as cached_tokens at 20% of the standard input_token price. Unmatched input tokens are billed at the standard input_token price. Output tokens are billed at the standard price.
Example: A request contains 10,000 input tokens, and 5,000 of them hit the cache. The costs are calculated as follows:
Unmatched tokens (5,000): Billed at 100% of the standard price.
Matched tokens (5,000): Billed at 20% of the standard price.
The total input cost is equivalent to 60% of the cost in a non-cached mode: (50% × 100%) + (50% × 20%) = 60%.
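As a minimal sketch of this arithmetic (the unit price is a hypothetical placeholder):
# Hypothetical standard input price per 1,000 tokens
unit_price_per_1k = 1.0

prompt_tokens = 10_000
cached_tokens = 5_000  # billed at 20% of the standard input price

uncached_cost = (prompt_tokens - cached_tokens) / 1000 * unit_price_per_1k
cached_cost = cached_tokens / 1000 * unit_price_per_1k * 0.2
print(uncached_cost + cached_cost)  # 6.0, versus 10.0 without caching (60% of the non-cached cost)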

You can retrieve the number of cached tokens from the cached_tokens attribute in the response.
OpenAI compatible batch methods are not eligible for cache discounts.
Cache hit examples
Text generation models
OpenAI compatible
When using an OpenAI compatible method and an implicit cache is triggered, you receive a response similar to the following. View the number of cached tokens in usage.prompt_tokens_details.cached_tokens. This value is part of usage.prompt_tokens.
{
"choices": [
{
"message": {
"role": "assistant",
"content": "I am a large-scale language model developed by Alibaba Cloud. My name is Qwen."
},
"finish_reason": "stop",
"index": 0,
"logprobs": null
}
],
"object": "chat.completion",
"usage": {
"prompt_tokens": 3019,
"completion_tokens": 104,
"total_tokens": 3123,
"prompt_tokens_details": {
"cached_tokens": 2048
}
},
"created": 1735120033,
"system_fingerprint": null,
"model": "qwen-plus",
"id": "chatcmpl-6ada9ed2-7f33-9de2-8bb0-78bd4035025a"
}
DashScope
When using the DashScope Python SDK or an HTTP request and an implicit cache is triggered, you receive a response similar to the following. View the number of cached tokens in usage.prompt_tokens_details.cached_tokens. This value is part of usage.input_tokens.
{
"status_code": 200,
"request_id": "f3acaa33-e248-97bb-96d5-cbeed34699e1",
"code": "",
"message": "",
"output": {
"text": null,
"finish_reason": null,
"choices": [
{
"finish_reason": "stop",
"message": {
"role": "assistant",
"content": "I am a large-scale language model from Alibaba Cloud. My name is Qwen. I can generate various types of text, such as articles, stories, and poems, and can adapt and expand based on different scenarios and needs. Additionally, I can answer various questions and provide help and solutions. If you have any questions or need assistance, feel free to let me know, and I will do my best to provide support. Please note that repeatedly sending the same content may not yield more detailed responses. It is recommended that you provide more specific information or vary your questions so I can better understand your needs."
}
}
]
},
"usage": {
"input_tokens": 3019,
"output_tokens": 101,
"prompt_tokens_details": {
"cached_tokens": 2048
},
"total_tokens": 3120
}
}
Visual understanding models
OpenAI compatible
When using an OpenAI compatible method and an implicit cache is triggered, you receive a response similar to the following. View the number of cached tokens in usage.prompt_tokens_details.cached_tokens. This value is part of usage.prompt_tokens.
{
"id": "chatcmpl-3f3bf7d0-b168-9637-a245-dd0f946c700f",
"choices": [
{
"finish_reason": "stop",
"index": 0,
"logprobs": null,
"message": {
"content": "This image shows a heartwarming scene of a woman and a dog interacting on a beach. The woman is wearing a plaid shirt and sitting on the sand, smiling as she interacts with the dog. The dog is a large, light-colored breed wearing a colorful collar, with its front paw raised as if to shake hands or give a high-five to the woman. The background is a vast ocean and sky, with sunlight shining from the right side of the frame, adding a warm and peaceful atmosphere to the entire scene.",
"refusal": null,
"role": "assistant",
"audio": null,
"function_call": null,
"tool_calls": null
}
}
],
"created": 1744956927,
"model": "qwen-vl-max",
"object": "chat.completion",
"service_tier": null,
"system_fingerprint": null,
"usage": {
"completion_tokens": 93,
"prompt_tokens": 1316,
"total_tokens": 1409,
"completion_tokens_details": null,
"prompt_tokens_details": {
"audio_tokens": null,
"cached_tokens": 1152
}
}
}
DashScope
When using the DashScope Python SDK or an HTTP request and an implicit cache is triggered, the number of cached tokens is included in the total input tokens (usage.input_tokens). The specific field varies by region and model:
Beijing region:
For qwen-vl-max and qwen-vl-plus, view the value in usage.prompt_tokens_details.cached_tokens.
For qwen3-vl-plus and qwen3-vl-flash, view the value in usage.cached_tokens.
Singapore region: For all models, view the value in usage.cached_tokens.
Models that currently use usage.cached_tokens will be upgraded to use usage.prompt_tokens_details.cached_tokens in the future.
{
"status_code": 200,
"request_id": "06a8f3bb-d871-9db4-857d-2c6eeac819bc",
"code": "",
"message": "",
"output": {
"text": null,
"finish_reason": null,
"choices": [
{
"finish_reason": "stop",
"message": {
"role": "assistant",
"content": [
{
"text": "This image shows a heartwarming scene of a woman and a dog interacting on a beach. The woman is wearing a plaid shirt and sitting on the sand, smiling as she interacts with the dog. The dog is a large breed wearing a colorful collar, with its front paw raised as if to shake hands or give a high-five to the woman. The background is a vast ocean and sky, with sunlight shining from the right side of the frame, adding a warm and peaceful atmosphere to the entire scene."
}
]
}
}
]
},
"usage": {
"input_tokens": 1292,
"output_tokens": 87,
"input_tokens_details": {
"text_tokens": 43,
"image_tokens": 1249
},
"total_tokens": 1379,
"output_tokens_details": {
"text_tokens": 87
},
"image_tokens": 1249,
"cached_tokens": 1152
}
}
Typical scenarios
If your requests have the same prefix, context cache can improve inference speed, reduce inference costs, and lower first-packet latency. The following are typical application scenarios:
Q&A based on long text
This is suitable for scenarios that require sending multiple requests for a fixed long text, such as a novel, textbook, or legal document.
Message array for the first request
messages = [{"role": "system","content": "You are a language teacher who can help students with reading comprehension."}, {"role": "user","content": "<Article Content> What feelings and thoughts does the author express in this text?"}]Array of messages used in subsequent requests
messages = [{"role": "system","content": "You are a language teacher who can help students with reading comprehension."}, {"role": "user","content": "<Article Content> Please analyze the third paragraph of this text."}]Although the questions are different, they are all based on the same article. The identical system prompt and article content form a large, repetitive prefix, which results in a high probability of a cache hit.
Code auto-completion
In code auto-completion scenarios, the model completes code based on the existing code in the context. As the user writes more code, the earlier code remains an unchanged prefix, so context cache can cache it to improve completion speed, as sketched below.
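A rough sketch of this pattern (the file content and partially typed code are placeholders):
existing_code = "<Existing File Content>"  # unchanged between requests, forms the shared prefix

# First completion request
messages_1 = [
    {"role": "system", "content": "You are a code completion assistant."},
    {"role": "user", "content": existing_code + "\ndef parse_config("},
]

# A later request: the user has written more code, but the original file content is still the
# prefix, so the request has a high probability of an implicit cache hit.
messages_2 = [
    {"role": "system", "content": "You are a code completion assistant."},
    {"role": "user", "content": existing_code + "\ndef parse_config(path):\n    data = "},
]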
Multi-turn conversation
To implement a multi-turn conversation, you can add the conversation history from each turn to the messages array. Therefore, each new request contains the previous turns as a prefix, resulting in a high probability of a cache hit.
Message array for the first turn
messages=[{"role": "system","content": "You are a helpful assistant."}, {"role": "user","content": "Who are you?"}]Message array for the second turn
messages=[{"role": "system","content": "You are a helpful assistant."}, {"role": "user","content": "Who are you?"}, {"role": "assistant","content": "I am Qwen, developed by Alibaba Cloud."}, {"role": "user","content": "What can you do?"}]As the number of conversation turns increases, caching provides more significant advantages for inference speed and cost.
Role-playing or few-shot learning
In role-playing or few-shot learning scenarios, you typically need to include a large amount of information in the prompt to guide the output format of the model. This results in a large amount of repetitive prefix information between different requests.
For example, to have the model act as a marketing expert, the system prompt contains a large amount of text. The following are message examples for two requests:
system_prompt = """You are an experienced marketing expert. Please provide detailed marketing suggestions for different products in the following format: 1. Target audience: xxx 2. Key selling points: xxx 3. Marketing channels: xxx ... 12. Long-term development strategy: xxx Please ensure your suggestions are specific, actionable, and highly relevant to the product features.""" # First request's user message asks about a smartwatch messages_1=[ {"role": "system", "content": system_prompt}, {"role": "user", "content": "Please provide marketing suggestions for a newly launched smartwatch."} ] # Second request's user message asks about a laptop. Because the system_prompt is the same, there is a high probability of a cache hit. messages_2=[ {"role": "system", "content": system_prompt}, {"role": "user", "content": "Please provide marketing suggestions for a newly launched laptop."} ]With context cache, even if the user frequently changes the type of product they are asking about, such as from a smartwatch to a laptop, the system can respond quickly after the cache is triggered.
Video understanding
In video understanding scenarios, if you ask multiple questions about the same video, placing the video before the text increases the hit rate. If you ask the same question about different videos, placing the text before the video increases the hit rate. The following are message examples for two requests about the same video:
# The user message for the first request asks about the content of this video
messages1 = [
    {"role": "system", "content": [{"text": "You are a helpful assistant."}]},
    {"role": "user", "content": [
        {"video": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250328/eepdcq/phase_change_480p.mov"},
        {"text": "What is the content of this video?"}
    ]}
]

# The user message for the second request asks about video timestamps. Because the question is about the same video, placing the video before the text gives a high probability of a cache hit.
messages2 = [
    {"role": "system", "content": [{"text": "You are a helpful assistant."}]},
    {"role": "user", "content": [
        {"video": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250328/eepdcq/phase_change_480p.mov"},
        {"text": "Please describe the series of events in the video. Output the start time (start_time), end time (end_time), and event (event) in JSON format. Do not include the ```json``` code segment."}
    ]}
]
Explicit cache
Compared to implicit cache, explicit cache requires you to manually create a cache and incurs a corresponding overhead. However, it provides a higher cache hit rate and lower access latency.
Usage
Add a "cache_control": {"type": "ephemeral"} marker to the messages. The system then attempts a cache hit by tracing back up to 20 content blocks from the position of each cache_control marker.
A single request supports up to 4 cache markers.
Cache miss
The system creates a new cache block using the content from the beginning of the messages array up to the cache_control marker. The validity period is 5 minutes.
Cache creation occurs after the model responds. Attempt to hit the cache only after the creation request is complete.
The minimum content length for a cache block is 1,024 tokens.
Cache hit
The longest matching prefix is selected as the hit cache block, and the validity period of the cache block is reset to 5 minutes.
The following is an example:
Initiate the first request: Send a system message that contains text A with over 1,024 tokens and add a cache marker:
[{"role": "system", "content": [{"type": "text", "text": A, "cache_control": {"type": "ephemeral"}}]}]The system creates the first cache block, which is referred to as cache block A.
Initiate the second request: Send a request with the following structure:
[ {"role": "system", "content": A}, <other messages> {"role": "user","content": [{"type": "text", "text": B, "cache_control": {"type": "ephemeral"}}]} ]If there are no more than 20 "other messages", cache block A is hit, and its validity period is reset to 5 minutes. At the same time, the system creates a new cache block based on the content of A, the other messages, and B.
If there are more than 20 "other messages", cache block A is not hit. The system still creates a new cache block based on the full context (A + other messages + B).
Model availability
Qwen Max: qwen3-max
Qwen Plus: qwen-plus
Qwen Flash: qwen-flash
Qwen-Coder: qwen3-coder-plus, qwen3-coder-flash
The models listed above support the explicit cache feature in both the mainland China and international regions.
Snapshot and latest models are not currently supported.
Getting started
The following examples show the creation and hit mechanism of cache blocks in the OpenAI compatible API and the DashScope protocol.
OpenAI compatible
from openai import OpenAI
import os
client = OpenAI(
# If you have not exported an environment variable, replace the following line with api_key="sk-xxx"
api_key=os.getenv("DASHSCOPE_API_KEY"),
# If using a model from the Singapore region, replace the base_url with https://dashscope-intl.aliyuncs.com/compatible-mode/v1
base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)
# Simulated code repository content. The minimum prompt length for caching is 1,024 tokens.
long_text_content = "<Your Code Here>" * 400
# Function to make a request
def get_completion(user_input):
messages = [
{
"role": "system",
"content": [
{
"type": "text",
"text": long_text_content,
# Place the cache_control marker here to create a cache block from the beginning of the messages array to the current content position.
"cache_control": {"type": "ephemeral"},
}
],
},
# The question content is different each time
{
"role": "user",
"content": user_input,
},
]
completion = client.chat.completions.create(
# Select a model that supports explicit cache
model="qwen3-coder-plus",
messages=messages,
)
return completion
# First request
first_completion = get_completion("What is the content of this code?")
print(f"First request created cache tokens: {first_completion.usage.prompt_tokens_details.cache_creation_input_tokens}")
print(f"First request hit cache tokens: {first_completion.usage.prompt_tokens_details.cached_tokens}")
print("=" * 20)
# Second request, the code content is the same, only the question is changed
second_completion = get_completion("How can this code be optimized?")
print(f"Second request created cache tokens: {second_completion.usage.prompt_tokens_details.cache_creation_input_tokens}")
print(f"Second request hit cache tokens: {second_completion.usage.prompt_tokens_details.cached_tokens}")DashScope
import os
from dashscope import Generation
# If using a model from the Singapore region, uncomment the following two lines.
# import dashscope
# dashscope.base_http_api_url = "https://dashscope-intl.aliyuncs.com/api/v1"
# Simulated code repository content. The minimum prompt length for caching is 1,024 tokens.
long_text_content = "<Your Code Here>" * 400
# Function to make a request
def get_completion(user_input):
messages = [
{
"role": "system",
"content": [
{
"type": "text",
"text": long_text_content,
# Place the cache_control marker here to create a cache block from the beginning of the messages array to the current content position.
"cache_control": {"type": "ephemeral"},
}
],
},
# The question content is different each time
{
"role": "user",
"content": user_input,
},
]
response = Generation.call(
# If the environment variable is not configured, replace this line with your Model Studio API key: api_key = "sk-xxx",
api_key=os.getenv("DASHSCOPE_API_KEY"),
model="qwen3-coder-plus",
messages=messages,
result_format="message"
)
return response
# First request
first_completion = get_completion("What is the content of this code?")
print(f"First request created cache tokens: {first_completion.usage.prompt_tokens_details["cache_creation_input_tokens"]}")
print(f"First request hit cache tokens: {first_completion.usage.prompt_tokens_details["cached_tokens"]}")
print("=" * 20)
# Second request, the code content is the same, only the question is changed
second_completion = get_completion("How can this code be optimized?")
print(f"Second request created cache tokens: {second_completion.usage.prompt_tokens_details["cache_creation_input_tokens"]}")
print(f"Second request hit cache tokens: {second_completion.usage.prompt_tokens_details["cached_tokens"]}")// The minimum Java SDK version is 2.21.6
import com.alibaba.dashscope.aigc.generation.Generation;
import com.alibaba.dashscope.aigc.generation.GenerationParam;
import com.alibaba.dashscope.aigc.generation.GenerationResult;
import com.alibaba.dashscope.common.Message;
import com.alibaba.dashscope.common.MessageContentText;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.InputRequiredException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import java.util.Arrays;
import java.util.Collections;
public class Main {
private static final String MODEL = "qwen3-coder-plus";
// Simulate code repository content (repeat 400 times to ensure it exceeds 1024 tokens)
private static final String LONG_TEXT_CONTENT = generateLongText(400);
private static String generateLongText(int repeatCount) {
StringBuilder sb = new StringBuilder();
for (int i = 0; i < repeatCount; i++) {
sb.append("<Your Code Here>");
}
return sb.toString();
}
private static GenerationResult getCompletion(String userQuestion)
throws NoApiKeyException, ApiException, InputRequiredException {
// If using a model in the Singapore region, replace https://dashscope.aliyuncs.com/api/v1 with https://dashscope-intl.aliyuncs.com/api/v1
Generation gen = new Generation();
// Build a system message with cache control
MessageContentText systemContent = MessageContentText.builder()
.type("text")
.text(LONG_TEXT_CONTENT)
.cacheControl(MessageContentText.CacheControl.builder()
.type("ephemeral") // Set the cache type
.build())
.build();
Message systemMsg = Message.builder()
.role(Role.SYSTEM.getValue())
.contents(Collections.singletonList(systemContent))
.build();
Message userMsg = Message.builder()
.role(Role.USER.getValue())
.content(userQuestion)
.build();
// Build request parameters
GenerationParam param = GenerationParam.builder()
.model(MODEL)
.messages(Arrays.asList(systemMsg, userMsg))
.resultFormat(GenerationParam.ResultFormat.MESSAGE)
.build();
return gen.call(param);
}
private static void printCacheInfo(GenerationResult result, String requestLabel) {
System.out.printf("%s created cache tokens: %d%n", requestLabel, result.getUsage().getPromptTokensDetails().getCacheCreationInputTokens());
System.out.printf("%s hit cache tokens: %d%n", requestLabel, result.getUsage().getPromptTokensDetails().getCachedTokens());
}
public static void main(String[] args) {
try {
// First request
GenerationResult firstResult = getCompletion("What is the content of this code?");
printCacheInfo(firstResult, "First request");
System.out.println(new String(new char[20]).replace('\0', '='));
// Second request
GenerationResult secondResult = getCompletion("How can this code be optimized?");
printCacheInfo(secondResult, "Second request");
} catch (NoApiKeyException | ApiException | InputRequiredException e) {
System.err.println("API call failed: " + e.getMessage());
e.printStackTrace();
}
}
}
The simulated code repository enables explicit cache by adding the cache_control marker. For subsequent requests that query this code repository, the system can reuse this cache block without recalculation. This results in faster responses and lower costs.
First request created cache tokens: 1605
First request hit cache tokens: 0
====================
Second request created cache tokens: 0
Second request hit cache tokens: 1605
Use multiple cache markers for fine-grained control
In complex scenarios, a prompt often consists of multiple parts with different reuse frequencies. You can use multiple cache markers for fine-grained control.
For example, the prompt for an intelligent customer service agent typically includes:
System persona: Highly stable and rarely changes.
External knowledge: Semi-stable. This content is retrieved from a knowledge base or tool queries and might remain unchanged during a continuous conversation.
Conversation history: Grows dynamically.
Current question: Different each time.
If the entire prompt is cached as a single unit, any minor change, such as an update to the external knowledge, can cause a cache miss.
Instead, you can set up to four cache markers in a request to create separate cache blocks for different parts of the prompt. This improves the hit rate and allows for fine-grained control.
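A minimal sketch of this layout (the persona, knowledge, history, and question texts are placeholders; note that each marked cache block must still meet the 1,024-token minimum):
messages = [
    # System persona: highly stable, cached as its own block
    {"role": "system", "content": [
        {"type": "text", "text": "<System Persona>", "cache_control": {"type": "ephemeral"}},
    ]},
    # External knowledge: semi-stable; a separate marker so persona hits survive knowledge updates
    {"role": "user", "content": [
        {"type": "text", "text": "<Retrieved Knowledge>", "cache_control": {"type": "ephemeral"}},
    ]},
    # Conversation history: grows each turn; mark the latest stable turn
    {"role": "assistant", "content": [
        {"type": "text", "text": "<Previous Assistant Reply>", "cache_control": {"type": "ephemeral"}},
    ]},
    # Current question: changes every request, so it is left unmarked
    {"role": "user", "content": "<Current Question>"},
]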
Billing
Explicit cache affects only the billing method for input tokens. The rules are as follows:
Cache creation: Newly created cache content is billed at 125% of the standard input price. If the cache content of a new request includes an existing cache as a prefix, only the new part is billed. This is calculated as the number of new cache tokens minus the number of existing cache tokens.
For example, if there is an existing cache A of 1200 tokens, and a new request needs to cache 1500 tokens of content AB, the first 1200 tokens are billed as a cache hit (10% of the standard price), and the new 300 tokens are billed for cache creation (125% of the standard price).
View the number of tokens used for cache creation in the cache_creation_input_tokens parameter.
Cache hit: Billed at 10% of the standard input price.
View the number of cached tokens in the cached_tokens parameter.
Other tokens: Tokens that are not hit and for which a cache is not created are billed at the standard price.
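For the 1,200/1,500-token example above, a back-of-the-envelope calculation (with a hypothetical unit price) looks like this:
unit_price_per_token = 1.0 / 1000   # hypothetical standard input price per token
hit_tokens = 1200                   # existing cache block A, billed at 10%
new_cache_tokens = 1500 - 1200      # newly cached suffix, billed at 125%

cost = hit_tokens * 0.10 * unit_price_per_token + new_cache_tokens * 1.25 * unit_price_per_token
print(cost)  # 0.495, versus 1.5 for the same 1,500 input tokens without any caching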
Cacheable content
Only the following message types in the messages array support adding cache markers:
System message
User message
Assistant message
Tool message (the result after tool execution)
If the request includes the tools parameter, adding a cache marker in messages also caches the tool description information (see the sketch at the end of this section).
Taking a system message as an example, you can change the content field to an array format and add the cache_control field:
{
"role": "system",
"content": [
{
"type": "text",
"text": "<Specified Prompt>",
"cache_control": {
"type": "ephemeral"
}
}
]
}
This structure also applies to other message types in the messages array.
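For a request that includes the tools parameter, a sketch (the tool definition and question are hypothetical) might look like this:
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Query the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

messages = [
    {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": "<Specified Prompt>",
                # Because the request also carries tools, this marker caches the tool descriptions as well.
                "cache_control": {"type": "ephemeral"},
            }
        ],
    },
    {"role": "user", "content": "What is the weather in Singapore?"},
]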
Limitations
The minimum prompt length is 1,024 tokens.
The cache uses a backward prefix matching strategy. The system automatically checks the last 20 content blocks. A cache hit does not occur if the content to be matched is separated from the message that contains the cache_control marker by more than 20 content blocks.
Only setting type to ephemeral is supported. This provides a validity period of 5 minutes.
You can add up to 4 cache markers in a single request. If more than 4 markers are set, only the last four take effect.
FAQ
Q: How do I disable implicit cache?
A: You cannot. Implicit cache is enabled for all applicable requests because it does not affect response quality. It reduces costs and improves response speed when a cache hit occurs.
Q: Why did a cache miss occur after I created an explicit cache?
A: Possible reasons include the following:
The cache was not hit within 5 minutes. The system purges the cache block after its validity period expires.
A cache hit does not occur if the last content is separated from the existing cache block by more than 20 content blocks. We recommend that you create a new cache block.
Q: Does a hit on an explicit cache reset its validity period?
A: Yes. Each hit resets the validity period of the cache block to 5 minutes.
Q: Are explicit caches shared between different accounts?
A: No. Both implicit and explicit cache data is isolated at the account level.
Q: If the same account uses different models, are their explicit caches shared?
A: No. Cache data is isolated between models.
Q: Why is usage.input_tokens not equal to the sum of cache_creation_input_tokens and cached_tokens?
A: To ensure model performance, the backend service appends a small number of tokens (usually fewer than 10) to the user-provided prompt. These tokens are added after the cache_control marker. Therefore, they are not counted for cache creation or reads, but they are included in the total input_tokens.