Cache common input prefixes across requests to reduce inference latency and costs without affecting response quality.
The cache modes support different scenarios:
Explicit cache: Manually enable to create caches for specific content. Guarantees cache hits. Valid for 5 minutes. Creation: 125% of standard input price. Hits: 10%.
Implicit cache: Automatic mode requiring no configuration (cannot be disabled). System automatically identifies and caches common prefixes, but hit rates are not guaranteed. Hits: 20% of standard input price.
| Item | Explicit cache | Implicit cache |
| --- | --- | --- |
| Affects response quality | No impact | No impact |
| Billing for tokens used to create the cache | 125% of the standard input token price | 100% of the standard input token price |
| Billing for cached input tokens that are hit | 10% of the standard input token price | 20% of the standard input token price |
| Minimum tokens for caching | 1024 | 256 |
| Cache validity period | 5 minutes (resets on hit) | Not guaranteed. The system periodically clears unused cached data. |
Explicit cache and implicit cache are mutually exclusive (one per request).
This topic applies to the OpenAI Chat Completions and DashScope APIs. For the Responses API, use the session cache instead. For more information, see Session cache.
Explicit cache
Explicit cache requires manual setup and higher upfront cost but provides guaranteed hits and lower latency than implicit cache.
Usage
Add "cache_control": {"type": "ephemeral"} to a content block in the `messages` array. The system searches backward up to 20 content blocks from each cache_control marker to match existing caches.
A single request supports a maximum of four cache markers.
Cache miss
The system creates a new cache block that spans from the start of the `messages` array to the cache_control marker (valid for 5 minutes). Cache creation occurs after the model responds, so wait for creation to complete before sending subsequent requests.
A cache block must contain at least 1024 tokens.
Cache hit
System selects the longest matching prefix and resets validity to 5 minutes.
The following example shows how to use this feature:
Send the first request: Send a system message that contains text A with more than 1024 tokens and add a cache marker.
[{"role": "system", "content": [{"type": "text", "text": A, "cache_control": {"type": "ephemeral"}}]}]
The system creates the first cache block, which is referred to as cache block A.
Send the second request: Send a request with the following structure:
[
  {"role": "system", "content": A},
  <other messages>,
  {"role": "user", "content": [{"type": "text", "text": B, "cache_control": {"type": "ephemeral"}}]}
]
If "other messages" spans at most 20 content blocks, cache block A is hit and its validity period resets to 5 minutes. The system also creates a new cache block based on A + other messages + B.
If "other messages" spans more than 20 content blocks, cache block A is not hit. The system creates a new cache block based on the full context (A + other messages + B).
Supported models
International
Qwen-Max: qwen3-max
Qwen-Plus: qwen3.5-plus, qwen-plus
Qwen-Flash: qwen3.5-flash, qwen-flash
Qwen-Coder: qwen3-coder-plus, qwen3-coder-flash
Qwen-VL: qwen3-vl-plus
DeepSeek: deepseek-v3.2
Global
Qwen-Max: qwen3-max
Qwen-Plus: qwen3.5-plus, qwen-plus
Qwen-Flash: qwen3.5-flash, qwen-flash
Qwen-Coder: qwen3-coder-plus, qwen3-coder-flash
Qwen-VL: qwen3-vl-plus
The above models support explicit cache only in the Germany (Frankfurt) region.
Chinese mainland
Qwen-Max: qwen3-max
Qwen-Plus: qwen3.5-plus, qwen-plus
Qwen-Flash: qwen3.5-flash, qwen-flash
Qwen-Coder: qwen3-coder-plus, qwen3-coder-flash
Qwen-VL: qwen3-vl-plus
DeepSeek: deepseek-v3.2
Kimi: kimi-k2.5
Hong Kong (China)
Qwen-Max: qwen3-max
Qwen-Plus: qwen-plus
Qwen-Flash: qwen3.5-flash
Qwen-VL: qwen3-vl-plus
EU
In the EU deployment mode, endpoint and data storage are located in Germany (Frankfurt), and model inference computing resources are limited to the EU.
Qwen-Max: qwen3-max
Qwen-Plus: qwen-plus
Qwen-Flash: qwen3.5-flash
Qwen-VL: qwen3-vl-plus
Getting started
The following examples show cache block creation and hits for the OpenAI-compatible interface and the DashScope protocol.
OpenAI compatible
from openai import OpenAI
import os
client = OpenAI(
# If the environment variable is not set, replace the following line with: api_key="sk-xxx"
api_key=os.getenv("DASHSCOPE_API_KEY"),
# If you use a model in the Beijing region, replace the base_url with: https://dashscope.aliyuncs.com/compatible-mode/v1
base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)
# Mock code repository content. The minimum cacheable prompt length is 1024 tokens.
long_text_content = "<Your Code Here>" * 400
# Function to send the request
def get_completion(user_input):
messages = [
{
"role": "system",
"content": [
{
"type": "text",
"text": long_text_content,
# Place the cache_control marker here to create a cache block containing all content from the beginning of the messages array to this content block.
"cache_control": {"type": "ephemeral"},
}
],
},
# The question content is different for each request.
{
"role": "user",
"content": user_input,
},
]
completion = client.chat.completions.create(
# Select a model that supports explicit cache.
model="qwen3-coder-plus",
messages=messages,
)
return completion
# First request
first_completion = get_completion("What is the content of this code?")
print(f"First request cache creation tokens: {first_completion.usage.prompt_tokens_details.cache_creation_input_tokens}")
print(f"First request cached hit tokens: {first_completion.usage.prompt_tokens_details.cached_tokens}")
print("=" * 20)
# Second request, the code content is the same, only the question is changed.
second_completion = get_completion("How can this code be optimized?")
print(f"Second request cache creation tokens: {second_completion.usage.prompt_tokens_details.cache_creation_input_tokens}")
print(f"Second request cached hit tokens: {second_completion.usage.prompt_tokens_details.cached_tokens}")
DashScope
import os
import dashscope
from dashscope import Generation
# If you use a model in the Beijing region, replace the base_url with: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = "https://dashscope-intl.aliyuncs.com/api/v1"
# Mock code repository content. The minimum cacheable prompt length is 1024 tokens.
long_text_content = "<Your Code Here>" * 400
# Function to send the request
def get_completion(user_input):
messages = [
{
"role": "system",
"content": [
{
"type": "text",
"text": long_text_content,
# Place the cache_control marker here to create a cache block containing all content from the beginning of the messages array to this content block.
"cache_control": {"type": "ephemeral"},
}
],
},
# The question content is different for each request.
{
"role": "user",
"content": user_input,
},
]
response = Generation.call(
# If the environment variable is not set, replace the following line with your Model Studio API key: api_key = "sk-xxx",
api_key=os.getenv("DASHSCOPE_API_KEY"),
model="qwen3-coder-plus",
messages=messages,
result_format="message"
)
return response
# First request
first_completion = get_completion("What is the content of this code?")
print(f"First request cache creation tokens: {first_completion.usage.prompt_tokens_details['cache_creation_input_tokens']}")
print(f"First request cached hit tokens: {first_completion.usage.prompt_tokens_details['cached_tokens']}")
print("=" * 20)
# Second request, the code content is the same, only the question is changed.
second_completion = get_completion("How can this code be optimized?")
print(f"Second request cache creation tokens: {second_completion.usage.prompt_tokens_details['cache_creation_input_tokens']}")
print(f"Second request cached hit tokens: {second_completion.usage.prompt_tokens_details['cached_tokens']}")
// The minimum Java SDK version is 2.21.6
import com.alibaba.dashscope.aigc.generation.Generation;
import com.alibaba.dashscope.aigc.generation.GenerationParam;
import com.alibaba.dashscope.aigc.generation.GenerationResult;
import com.alibaba.dashscope.common.Message;
import com.alibaba.dashscope.common.MessageContentText;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.InputRequiredException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import java.util.Arrays;
import java.util.Collections;
public class Main {
private static final String MODEL = "qwen3-coder-plus";
// Mock code repository content (repeated 400 times to ensure it exceeds 1024 tokens)
private static final String LONG_TEXT_CONTENT = generateLongText(400);
private static String generateLongText(int repeatCount) {
StringBuilder sb = new StringBuilder();
for (int i = 0; i < repeatCount; i++) {
sb.append("<Your Code Here>");
}
return sb.toString();
}
private static GenerationResult getCompletion(String userQuestion)
throws NoApiKeyException, ApiException, InputRequiredException {
// If you use a model in the Beijing region, replace the base_url with: https://dashscope.aliyuncs.com/api/v1
Generation gen = new Generation("http", "https://dashscope-intl.aliyuncs.com/api/v1");
// Build the system message with cache control
MessageContentText systemContent = MessageContentText.builder()
.type("text")
.text(LONG_TEXT_CONTENT)
.cacheControl(MessageContentText.CacheControl.builder()
.type("ephemeral") // Set the cache type
.build())
.build();
Message systemMsg = Message.builder()
.role(Role.SYSTEM.getValue())
.contents(Collections.singletonList(systemContent))
.build();
Message userMsg = Message.builder()
.role(Role.USER.getValue())
.content(userQuestion)
.build();
// Build the request parameters
GenerationParam param = GenerationParam.builder()
.model(MODEL)
.messages(Arrays.asList(systemMsg, userMsg))
.resultFormat(GenerationParam.ResultFormat.MESSAGE)
.build();
return gen.call(param);
}
private static void printCacheInfo(GenerationResult result, String requestLabel) {
System.out.printf("%s cache creation tokens: %d%n", requestLabel, result.getUsage().getPromptTokensDetails().getCacheCreationInputTokens());
System.out.printf("%s cached hit tokens: %d%n", requestLabel, result.getUsage().getPromptTokensDetails().getCachedTokens());
}
public static void main(String[] args) {
try {
// First request
GenerationResult firstResult = getCompletion("What is the content of this code?");
printCacheInfo(firstResult, "First request");
System.out.println(new String(new char[20]).replace('\0', '='));
// Second request
GenerationResult secondResult = getCompletion("How can this code be optimized?");
printCacheInfo(secondResult, "Second request");
} catch (NoApiKeyException | ApiException | InputRequiredException e) {
System.err.println("API call failed: " + e.getMessage());
e.printStackTrace();
}
}
}
This example caches mock code repository content with the cache_control marker. Subsequent requests asking about the same code reuse the cache, reducing response time and costs.
First request cache creation tokens: 1605
First request cached hit tokens: 0
====================
Second request cache creation tokens: 0
Second request cached hit tokens: 1605
Use multiple cache markers for fine-grained control
In complex scenarios, prompts often have multiple parts with different reuse frequencies. Use multiple cache markers for fine-grained control.
For example, the prompt for a smart customer service agent typically includes:
System settings: Highly stable and almost never changes.
External knowledge: Semi-stable. It is retrieved from a knowledge base or by calling a tool and may remain unchanged during a continuous conversation.
Conversation history: Grows dynamically.
Current question: Different each time.
Caching the entire prompt as a single unit invalidates the cache on any minor change (such as changed external knowledge).
Set up to four cache markers per request to create separate cache blocks for different prompt parts, improving hit rate and control.
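As a sketch of this layout, the following messages array places one marker per stable prompt part. The cache_control format follows the examples in this topic; the part contents are placeholders.

```python
# A sketch of a messages array with multiple cache markers, one per stable
# prompt part. The part contents are placeholders.
messages = [
    {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": "<System settings: role, rules, output format>",
                # Marker 1: caches the highly stable system settings on their own.
                "cache_control": {"type": "ephemeral"},
            }
        ],
    },
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "<External knowledge retrieved for this conversation>",
                # Marker 2: caches system settings + external knowledge together.
                "cache_control": {"type": "ephemeral"},
            },
            # The current question changes every request, so it carries no marker.
            {"type": "text", "text": "<Current question>"},
        ],
    },
]
```

With this layout, a change to the external knowledge invalidates only the second cache block; the system-settings block created by marker 1 can still be hit.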
Billing
Explicit cache affects only input token pricing:
Cache creation: 125% of standard input price. If a new cache contains an existing cache as a prefix, only the increment is billed.
Example: existing cache A = 1200 tokens, new cache AB = 1500 tokens. First 1200 tokens billed as cache hit (10%), remaining 300 as cache creation (125%).
Check the number of tokens used for cache creation in the cache_creation_input_tokens parameter.
Cache hit: 10% of the standard input price.
Check the number of hit cache tokens in the cached_tokens parameter.
Other tokens: Tokens that do not match any cache and are not used for cache creation are billed at the standard input price.
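As a sanity check on the incremental-billing example above, the cost can be computed with the two multipliers from the pricing rules. The function name is illustrative.

```python
def explicit_cache_input_cost(hit_tokens, creation_tokens, unit_price):
    """Relative input cost under explicit cache: hits bill at 10% of the
    standard input price, cache creation at 125%."""
    return hit_tokens * unit_price * 0.10 + creation_tokens * unit_price * 1.25

# Example from above: cache A (1200 tokens) exists, new cache AB is 1500 tokens.
# The 1200-token prefix bills as a hit; only the 300-token increment bills as creation.
cost = explicit_cache_input_cost(hit_tokens=1200, creation_tokens=300, unit_price=1.0)
print(cost)  # 1200*0.10 + 300*1.25 = 495.0
```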
Cacheable content
Supported message types for cache markers:
System message
User message
When you use the qwen3-vl-plus model to create a cache, the cache_control marker can be placed after multimodal content or text. Its position does not affect the caching of the entire user message.
Assistant message
Tool message (the result after a tool is executed)
If a request includes the tools parameter, adding a cache marker in messages also caches the tool descriptions defined in the request.
For example, for a system message, change the content field to an array and add the cache_control field:
{
"role": "system",
"content": [
{
"type": "text",
"text": "<Your specified prompt>",
"cache_control": {
"type": "ephemeral"
}
}
]
}
This structure also applies to other message types in the messages array.
Cache limits
Minimum cacheable prompt: 1024 tokens.
Cache uses backward prefix matching, checking the last 20 content blocks. If content is separated from the cache_control marker by more than 20 blocks, the cache is not hit.
Only type: ephemeral is supported (validity: 5 minutes).
A single request can have a maximum of four cache markers. If there are more than four, only the last four take effect.
Implicit cache
Supported models
Global
In the Global deployment mode, endpoint and data storage are located in the US (Virginia) region or Germany (Frankfurt) region, and model inference computing resources are dynamically scheduled globally.
Text generation models
Qwen-Max: qwen3-max
Qwen-Plus: qwen-plus
Qwen-Flash: qwen-flash
Qwen-Coder: qwen3-coder-plus, qwen3-coder-flash
Vision understanding models
Qwen-VL: qwen3-vl-plus, qwen3-vl-flash
International
In the International deployment mode, endpoint and data storage are located in the Singapore region, while model inference computing resources are dynamically scheduled globally (excluding Chinese Mainland).
Text generation models
Qwen-Max: qwen3-max, qwen3-max-preview, qwen-max
Qwen-Plus: qwen-plus
Qwen-Flash: qwen-flash
Qwen-Turbo: qwen-turbo
Qwen-Coder: qwen3-coder-plus, qwen3-coder-flash
DeepSeek: deepseek-v3.2
Vision understanding models
Qwen-VL: qwen3-vl-plus, qwen3-vl-flash, qwen-vl-max, qwen-vl-plus
Industry-specific models
Role playing: qwen-plus-character, qwen-flash-character, qwen-plus-character-ja
US
In the US deployment mode, endpoint and data storage are located in the US (Virginia) region, and model inference computing resources are limited to the United States.
Text generation models
Qwen-Plus: qwen-plus-us
Qwen-Flash: qwen-flash-us
Vision understanding models
Qwen-VL: qwen3-vl-flash-us
Chinese mainland
In the Chinese Mainland deployment mode, endpoint and data storage are located in the Beijing region, and model inference computing resources are limited to Chinese Mainland.
Text generation models
Qwen-Max: qwen3-max, qwen-max
Qwen-Plus: qwen-plus
Qwen-Flash: qwen-flash
Qwen-Turbo: qwen-turbo
Qwen-Coder: qwen3-coder-plus, qwen3-coder-flash
DeepSeek: deepseek-v3.2, deepseek-v3.1, deepseek-v3, deepseek-r1
Kimi: kimi-k2.5, kimi-k2-thinking, Moonshot-Kimi-K2-Instruct
GLM: glm-5, glm-4.7, glm-4.6
MiniMax: MiniMax-M2.5
Vision understanding models
Qwen-VL: qwen3-vl-plus, qwen3-vl-flash, qwen-vl-max, qwen-vl-plus
Industry-specific models
Role playing: qwen-plus-character
Hong Kong (China)
In the China (Hong Kong) deployment mode, endpoint and data storage are located in China (Hong Kong), and model inference computing resources are limited to China (Hong Kong).
Vision understanding models
Qwen-VL: qwen3-vl-plus
EU
In the EU deployment mode, endpoint and data storage are located in Germany (Frankfurt), and model inference computing resources are limited to the EU.
Vision understanding models
Qwen-VL: qwen3-vl-plus, qwen3-vl-flash
Snapshot and latest models are not supported.
How it works
Implicit cache automatically activates for supported models:
Find: The system checks the cache for a common prefix in the request's messages array using prefix matching.
Evaluate:
Cache hit: The system uses the cached result for inference.
Cache miss: The system processes the request normally and stores the prompt prefix for future requests.
The system periodically clears unused cache. Hit rates are not guaranteed; misses can occur even with identical context, and the actual rate is determined by the system.
Content with fewer than 256 tokens will not be cached.
Increase hit rate
Place static content first and variable content last to increase the hit rate.
Text-only: If the system has cached "ABCD", a request for "ABE" can match the "AB" prefix, while a request for "BCD" will not match any cache.
Visual understanding:
When asking multiple questions about the same image or video: Place the image or video before the text to increase the hit rate.
When asking the same question about different images or videos: Place the text before the image or video to increase the hit rate.
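The two orderings can be sketched as message arrays in the OpenAI-compatible multimodal format. The image URLs and questions are placeholders.

```python
# Placeholder image block in the OpenAI-compatible multimodal format.
image = {"type": "image_url", "image_url": {"url": "https://example.com/photo-1.png"}}

# Multiple questions about the same image: the stable image comes first,
# so the shared prefix covers the token-heavy image content.
messages_q1 = [{"role": "user", "content": [image, {"type": "text", "text": "What is in this image?"}]}]
messages_q2 = [{"role": "user", "content": [image, {"type": "text", "text": "Describe the background."}]}]

# Same question about different images: the stable text comes first,
# so the shared prefix covers the question instead.
question = {"type": "text", "text": "Is there a cat in this image?"}
image_2 = {"type": "image_url", "image_url": {"url": "https://example.com/photo-2.png"}}
messages_img1 = [{"role": "user", "content": [question, image]}]
messages_img2 = [{"role": "user", "content": [question, image_2]}]
```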
Billing
There are no additional fees.
On cache hit, matched input tokens are billed as cached_token at 20% of input_token unit price. Non-hit input tokens are billed at standard input_token price. Output tokens are billed at standard price.
Example: 10,000 input tokens with 5,000 hitting the cache:
Non-hit tokens (5,000): Billed at 100% of the unit price
Hit tokens (5,000): Billed at 20% of the unit price
Total input cost = 60% of cost without cache: (50% × 100%) + (50% × 20%) = 60%.
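The arithmetic above can be expressed as a small helper. The function name is illustrative; the 20% multiplier comes from the implicit-cache pricing described in this topic.

```python
def implicit_cache_input_cost(total_input_tokens, cached_tokens, unit_price):
    """Relative input cost under implicit cache: cached tokens bill at 20%
    of the standard input price, all other input tokens at 100%."""
    uncached = total_input_tokens - cached_tokens
    return cached_tokens * unit_price * 0.20 + uncached * unit_price

# Example from above: 10,000 input tokens, 5,000 of them hitting the cache.
print(implicit_cache_input_cost(10_000, 5_000, 1.0))  # 6000.0, i.e. 60% of the uncached cost
```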

You can retrieve the number of hit cache tokens from the cached_tokens attribute of the returned result.
The OpenAI-compatible Batch (file input) method is not eligible for cache discounts.
Cache hit examples
Text generation
OpenAI compatible
When you call a model using the OpenAI compatible method and trigger the implicit cache, you can retrieve the following result. In usage.prompt_tokens_details.cached_tokens, check the number of hit cache tokens. This value is part of usage.prompt_tokens.
{
"choices": [
{
"message": {
"role": "assistant",
"content": "I am a super-large language model developed by Alibaba Cloud. My name is Qwen."
},
"finish_reason": "stop",
"index": 0,
"logprobs": null
}
],
"object": "chat.completion",
"usage": {
"prompt_tokens": 3019,
"completion_tokens": 104,
"total_tokens": 3123,
"prompt_tokens_details": {
"cached_tokens": 2048
}
},
"created": 1735120033,
"system_fingerprint": null,
"model": "qwen-plus",
"id": "chatcmpl-6ada9ed2-7f33-9de2-8bb0-78bd4035025a"
}
DashScope
When you call a model using the DashScope Python SDK or HTTP method and trigger the implicit cache, you can retrieve the following result. In usage.prompt_tokens_details.cached_tokens, check the number of hit cache tokens. This value is part of usage.input_tokens.
{
"status_code": 200,
"request_id": "f3acaa33-e248-97bb-96d5-cbeed34699e1",
"code": "",
"message": "",
"output": {
"text": null,
"finish_reason": null,
"choices": [
{
"finish_reason": "stop",
"message": {
"role": "assistant",
"content": "I am a large-scale language model from Alibaba Cloud. My name is Qwen. I can generate various types of text, such as articles, stories, poems, and stories, and can transform and expand them according to different scenarios and needs. In addition, I can answer various questions, provide help and solutions. If you have any questions or need help, please feel free to let me know, and I will do my best to provide support. Please note that continuously repeating the same content may not yield more detailed answers. We recommend that you provide more specific information or vary your questions so that I can better understand your needs."
}
}
]
},
"usage": {
"input_tokens": 3019,
"output_tokens": 101,
"prompt_tokens_details": {
"cached_tokens": 2048
},
"total_tokens": 3120
}
}
Visual understanding
OpenAI compatible
When you call a model using the OpenAI compatible method and trigger the implicit cache, you can retrieve the following result. In usage.prompt_tokens_details.cached_tokens, check the number of hit cache tokens. This token count is part of usage.prompt_tokens.
{
"id": "chatcmpl-3f3bf7d0-b168-9637-a245-dd0f946c700f",
"choices": [
{
"finish_reason": "stop",
"index": 0,
"logprobs": null,
"message": {
"content": "This image shows a heartwarming scene of a woman and a dog interacting on a beach. The woman is wearing a plaid shirt and sitting on the sand, smiling as she interacts with the dog. The dog is a large, light-colored breed with a colorful collar, and its front paw is raised as if to shake hands or give a high-five to the woman. The background is a vast ocean and sky, with sunlight shining from the right side of the frame, adding a warm and peaceful atmosphere to the whole scene.",
"refusal": null,
"role": "assistant",
"audio": null,
"function_call": null,
"tool_calls": null
}
}
],
"created": 1744956927,
"model": "qwen-vl-max",
"object": "chat.completion",
"service_tier": null,
"system_fingerprint": null,
"usage": {
"completion_tokens": 93,
"prompt_tokens": 1316,
"total_tokens": 1409,
"completion_tokens_details": null,
"prompt_tokens_details": {
"audio_tokens": null,
"cached_tokens": 1152
}
}
}
DashScope
When you call a model using the DashScope Python SDK or HTTP method and trigger the implicit cache, the number of hit cache tokens is included in the total input tokens (usage.input_tokens). The specific location to view this information varies by region and model:
Beijing region:
qwen-vl-max, qwen-vl-plus: View in usage.prompt_tokens_details.cached_tokens
qwen3-vl-plus, qwen3-vl-flash: View in usage.cached_tokens
Singapore region: For all models, view in usage.cached_tokens
Models that currently use usage.cached_tokens will be upgraded to usage.prompt_tokens_details.cached_tokens in the future.
{
"status_code": 200,
"request_id": "06a8f3bb-d871-9db4-857d-2c6eeac819bc",
"code": "",
"message": "",
"output": {
"text": null,
"finish_reason": null,
"choices": [
{
"finish_reason": "stop",
"message": {
"role": "assistant",
"content": [
{
"text": "This image shows a heartwarming scene of a woman and a dog interacting on a beach. The woman is wearing a plaid shirt and sitting on the sand, smiling as she interacts with the dog. The dog is a large breed with a colorful collar, and its front paw is raised as if to shake hands or give a high-five to the woman. The background is a vast ocean and sky, with sunlight shining from the right side of the frame, adding a warm and peaceful atmosphere to the whole scene."
}
]
}
}
]
},
"usage": {
"input_tokens": 1292,
"output_tokens": 87,
"input_tokens_details": {
"text_tokens": 43,
"image_tokens": 1249
},
"total_tokens": 1379,
"output_tokens_details": {
"text_tokens": 87
},
"image_tokens": 1249,
"cached_tokens": 1152
}
}
Typical scenarios
Context cache improves inference speed, reduces costs, and lowers time to first token for requests sharing prefix content. Typical scenarios:
Q&A based on long text
This applies to multiple requests about fixed long text like novels, textbooks, or legal documents.
Message array for the first request
messages = [{"role": "system", "content": "You are a language teacher and you can help students with reading comprehension."}, {"role": "user", "content": "<Article content> What is the author's main idea in this text?"}]
Message array for subsequent requests
messages = [{"role": "system", "content": "You are a language teacher and you can help students with reading comprehension."}, {"role": "user", "content": "<Article content> Please analyze the third paragraph of this text."}]
The questions differ but reference the same article. The system prompt and article content remain unchanged, so each request shares a large overlapping prefix, increasing the cache hit probability.
Code auto-completion
In code auto-completion, the model completes code based on the existing context. As the user continues coding, the prefix portion remains unchanged, so context cache can cache the preceding code to improve completion speed.
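The shared-prefix property can be sketched with two successive completion requests. The helper and prompts are illustrative, not part of any SDK.

```python
# As the user types, each completion request extends the previous one's code
# context, so successive requests share a growing common prefix.
def completion_messages(code_so_far):
    return [
        {"role": "system", "content": "You are a code completion engine. Continue the code."},
        {"role": "user", "content": code_so_far},
    ]

m1 = completion_messages("def quicksort(arr):\n    if len(arr) <= 1:\n")
m2 = completion_messages("def quicksort(arr):\n    if len(arr) <= 1:\n        return arr\n")

# The second request's user content starts with the first one's, so the
# earlier prefix is a candidate for an implicit cache hit.
assert m2[1]["content"].startswith(m1[1]["content"])
```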
Multi-turn conversation
In a multi-turn conversation, the conversation history from previous turns is included in the messages array. Each turn's request shares the same prefix as the previous turn, increasing the cache hit probability.
Message array for the first turn of conversation
messages = [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Who are you?"}]
Message array for the second turn of conversation
messages = [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Who are you?"}, {"role": "assistant", "content": "I am Qwen, developed by Alibaba Cloud."}, {"role": "user", "content": "What can you do?"}]
As the number of conversation turns increases, the benefits of caching (faster inference and lower cost) become more pronounced.
Role playing or few-shot learning
Role playing or few-shot learning typically includes a large amount of information in prompt to guide model output format, creating a large shared prefix across requests.
For example, if you want the model to act as a marketing expert, the system prompt contains a large amount of text information. The following are message examples for two requests:
system_prompt = """You are an experienced marketing expert. Please provide detailed marketing suggestions for different products in the following format: 1. Target audience: xxx 2. Main selling points: xxx 3. Marketing channels: xxx ... 12. Long-term development strategy: xxx Please ensure that your suggestions are specific, actionable, and highly relevant to the product features."""

# User message for the first request, asking about a smartwatch
messages_1 = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "Please provide marketing suggestions for a newly launched smartwatch."}
]

# User message for the second request, asking about a laptop. Because the system_prompt is the same, there is a high probability of hitting the cache.
messages_2 = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "Please provide marketing suggestions for a newly launched laptop."}
]

Context cache enables quick responses on a cache hit, even when the user frequently changes the product type (for example, from smartwatch to laptop).
Video understanding
In video understanding scenarios, if you ask multiple questions about the same video, placing the video before the text increases the probability of a cache hit. If you ask the same question about different videos, placing the text before the video increases the probability of a cache hit. The following is a message example for two requests about the same video:

# User message for the first request, asking about the content of this video
messages1 = [
    {"role": "system", "content": [{"text": "You are a helpful assistant."}]},
    {"role": "user", "content": [
        {"video": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250328/eepdcq/phase_change_480p.mov"},
        {"text": "What is the content of this video?"}
    ]}
]

# User message for the second request, asking about video timestamps. Because the question is about the same video, placing the video before the text gives a high probability of hitting the cache.
messages2 = [
    {"role": "system", "content": [{"text": "You are a helpful assistant."}]},
    {"role": "user", "content": [
        {"video": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250328/eepdcq/phase_change_480p.mov"},
        {"text": "Please describe the series of events in the video, and output the start time (start_time), end time (end_time), and event (event) in JSON format. Do not output the ```json``` code segment."}
    ]}
]
FAQ
Q: Can I disable implicit cache?
A: No. Implicit cache is always enabled for supported models. It has no impact on response quality, reduces costs, and improves speed.
Q: Why was the explicit cache not hit after I created it?
A: Possible reasons:
The cache expired (it was not hit within 5 minutes).
The last content block is separated from the existing cache by more than 20 blocks, so the cache is not hit. In this case, the system creates a new cache block instead.
Q: Does hitting the explicit cache reset its validity period?
A: Yes. Each hit resets the cache block's validity period to 5 minutes.
Q: Is the explicit cache shared between different accounts?
A: No. Both implicit and explicit cache data are isolated at the account level and are not shared across accounts.
Q: Is the explicit cache shared between different models under the same account?
A: No. Cache data is isolated between models and is not shared.
Q: Why does input_tokens in usage not equal the sum of cache_creation_input_tokens and cached_tokens?
A: To ensure model output quality, the backend service appends a small number of tokens (usually fewer than 10) after the user-provided prompt. These tokens are placed after the cache_control marker, so they are not counted toward cache creation or hits but are included in the total input_tokens.
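This accounting can be checked from a returned usage object. The figures below are hypothetical and chosen only to illustrate the relationship; the field names follow the examples in this topic.

```python
# Hypothetical usage figures illustrating the accounting described above:
# input_tokens = cached_tokens + cache_creation_input_tokens + remainder,
# where the remainder covers tokens the backend appends after the
# cache_control marker (usually fewer than 10) plus any uncached user tokens.
usage = {
    "input_tokens": 1613,
    "cache_creation_input_tokens": 0,
    "cached_tokens": 1605,
}
remainder = usage["input_tokens"] - usage["cache_creation_input_tokens"] - usage["cached_tokens"]
print(remainder)  # 8
```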