This topic describes how to use explicit cache and its best practices. By adding cache markers to your requests, explicit cache guarantees deterministic cache hits for identical input content, significantly reducing cost and latency.
When to use explicit cache
You need guaranteed cache hits: Explicit cache delivers 100% deterministic hits regardless of backend resource scheduling. If your application requires stable content reuse, explicit cache is the right choice.
You frequently reuse the same prompt: When identical or highly consistent prompts are submitted repeatedly, explicit cache significantly reduces costs. Creating the cache incurs only a 25% surcharge over the standard input price, while each subsequent hit saves 90%. A single hit is enough to break even.
You manage long contexts in production Agents: In Agent applications, common mechanisms like compression, recap, and system reminders cause the context to change continuously. Explicit cache lets you pin and reuse key context segments so they remain cached even as the surrounding context evolves.
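The break-even claim above can be checked with a little arithmetic. The following is a minimal sketch, not a billing calculator: `relative_input_cost` is a hypothetical helper, it assumes the entire prompt is cached and that every later request hits within the cache lifetime, and it ignores output tokens.

```python
def relative_input_cost(requests: int, cached: bool) -> float:
    """Input cost of sending the same prompt `requests` times, measured in
    units of one uncached request (1.0 = standard input price)."""
    if requests < 1:
        raise ValueError("requests must be >= 1")
    if not cached:
        return float(requests)
    creation = 1.25  # first request: standard price plus the 25% surcharge
    hit = 0.10       # each later request: 90% off the standard price
    return creation + hit * (requests - 1)


for n in (1, 2, 10):
    print(n, relative_input_cost(n, cached=False),
          round(relative_input_cost(n, cached=True), 2))
# 1 1.0 1.25
# 2 2.0 1.35
# 10 10.0 2.15
```

With a single request the cached path costs more (1.25 vs 1.0); from the second request onward it is cheaper, which is the "one hit breaks even" statement.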
Agent and coding tools
The following Agent and coding tools connect to Model Studio through the Anthropic protocol and natively support explicit cache. Configure them following their respective documentation, and they will automatically leverage explicit cache to optimize context management.
The examples below use the Singapore endpoint. For other regions, replace the base URL with the corresponding regional endpoint.
Claude Code
Claude Code v2.x and later automatically includes cache_control markers in requests (system, env, and most recent user message). No additional configuration is needed after connecting to Model Studio's Anthropic-compatible endpoint.
Configuration
Create or edit ~/.claude/settings.json (Windows: C:\Users\<username>\.claude\settings.json) with the appropriate plan settings. Alternatively, connect via environment variables:
export ANTHROPIC_BASE_URL="https://dashscope-intl.aliyuncs.com/apps/anthropic"
export ANTHROPIC_AUTH_TOKEN="${DASHSCOPE_API_KEY}"
export ANTHROPIC_MODEL="qwen3.7-max"
claude
Set the Anthropic protocol endpoint:
Token Plan (Team):
https://token-plan.ap-southeast-1.maas.aliyuncs.com/apps/anthropic
Coding Plan:
https://coding-intl.dashscope.aliyuncs.com/apps/anthropic
Pay-as-you-go:
https://dashscope-intl.aliyuncs.com/apps/anthropic
For details, see Claude Code.
Optional: Improve cross-session hit rate
By default, Claude Code includes dynamic information in the system prompt (current directory, date, git status), which may reduce cross-session cache hit rates. Add the following flag at startup to move dynamic sections to user messages:
claude --exclude-dynamic-system-prompt-sections
OpenCode
When OpenCode connects to Model Studio's Anthropic-compatible endpoint via @ai-sdk/anthropic, it automatically injects cache_control on the system message and the most recent non-system message.
Installation
npm install -g opencode-ai
Configuration
Create the configuration file ~/.config/opencode/opencode.json (Windows: C:\Users\<username>\.config\opencode\opencode.json):
{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "bailian": {
      "npm": "@ai-sdk/anthropic",
      "name": "Alibaba Cloud Model Studio",
      "options": {
        "baseURL": "https://dashscope-intl.aliyuncs.com/apps/anthropic/v1",
        "apiKey": "{env:DASHSCOPE_API_KEY}"
      },
      "models": {
        "qwen3.7-max": { "name": "qwen3.7-max" }
      }
    }
  }
}
The baseURL must end with /v1.
export DASHSCOPE_API_KEY=sk-xxxxx
opencode run -m "bailian/qwen3.7-max" "..."
Other plan base URLs:
Token Plan (Team):
https://token-plan.ap-southeast-1.maas.aliyuncs.com/apps/anthropic/v1
Coding Plan:
https://coding-intl.dashscope.aliyuncs.com/apps/anthropic/v1
For details, see OpenCode.
OpenClaw
When using the Anthropic-compatible endpoint, OpenClaw automatically injects cache_control markers on the system prompt and the most recent user message. No additional configuration is needed: as long as the provider's base URL points to /apps/anthropic, explicit cache is enabled automatically.
Installation
npm install -g openclaw
# or
curl -fsSL https://openclaw.ai/install.sh | bash
Configuration
Edit the configuration file ~/.openclaw/openclaw.json. Set "api" to "anthropic-messages" and set the base URL:
Token Plan (Team):
https://token-plan.ap-southeast-1.maas.aliyuncs.com/apps/anthropic/v1
Coding Plan:
https://coding-intl.dashscope.aliyuncs.com/apps/anthropic/v1
Pay-as-you-go:
https://dashscope-intl.aliyuncs.com/apps/anthropic/v1
For details, see OpenClaw.
Optional: Custom cache boundary
If your system prompt contains both stable template content and dynamic content (timestamps, CWD, etc.), insert <!-- OPENCLAW_CACHE_BOUNDARY --> between them. OpenClaw will apply cache_control only to the stable prefix before the boundary, improving cross-session hit rates:
You are a Python engineer following these conventions:
- type hints required
- docstrings in Google format
<!-- OPENCLAW_CACHE_BOUNDARY -->
Current time: 2026-05-25 18:42
Working directory: /Users/.../project
Without this boundary, OpenClaw applies cache_control to the entire system prompt using its built-in strategy, still benefiting from explicit cache.
Hermes
Configure using the hermes config set command. Set the base URL:
Token Plan (Team):
https://token-plan.ap-southeast-1.maas.aliyuncs.com/apps/anthropic/v1
Coding Plan:
https://coding-intl.dashscope.aliyuncs.com/apps/anthropic/v1
Pay-as-you-go:
https://dashscope-intl.aliyuncs.com/apps/anthropic/v1
For details, see Hermes Agent.
API integration
Key points
Add "cache_control": {"type": "ephemeral"} to the message content you want to cache. All content from the beginning of the messages array up to that marker will be cached as a block.
Cached content must be at least 1024 tokens.
A single request supports up to 4 cache markers.
Cache TTL is 5 minutes, automatically renewed on each hit.
Tool definitions are part of the system prompt for caching purposes. If the tools change, the cache will not hit.
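To make the TTL rule concrete, here is a toy timeline model, offered as a sketch rather than a description of the service's internals. `simulate_hits` is a hypothetical helper; it assumes all requests carry an identical cached prefix, that a miss immediately creates a fresh cache, and that a request arriving exactly at the expiry instant counts as a miss.

```python
TTL_SECONDS = 5 * 60  # cache lifetime from the key points; renewed on each hit


def simulate_hits(request_times: list[float], ttl: float = TTL_SECONDS) -> list[bool]:
    """For each request time (seconds, ascending), report whether a cache
    written or renewed by an earlier request would still be alive."""
    hits = []
    expires_at = None  # no cache exists before the first request
    for t in request_times:
        hits.append(expires_at is not None and t < expires_at)
        # A miss creates a new cache; a hit renews the existing one.
        expires_at = t + ttl
    return hits


# Requests at 0 s, 4 min, 8 min, and 15 min
print(simulate_hits([0, 240, 480, 900]))
# [False, True, True, False]
```

The third request still hits at 8 minutes because the hit at 4 minutes renewed the lifetime; the last one misses because more than 5 minutes passed since the previous hit.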
Quick start
The following example demonstrates the basic workflow: the first request creates a cache, and the second request hits it.
OpenAI-compatible
from openai import OpenAI
import os

client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

# Long text to cache (must exceed 1024 tokens)
long_text_content = "<Your Long Text Here>" * 400

def get_completion(user_input):
    messages = [
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": long_text_content,
                    # Cache marker: content from the start of messages to this point will be cached
                    "cache_control": {"type": "ephemeral"},
                }
            ],
        },
        {"role": "user", "content": user_input},
    ]
    completion = client.chat.completions.create(
        model="qwen3.7-max",
        messages=messages,
        extra_body={"enable_thinking": False},
    )
    return completion

# First request: creates cache
first = get_completion("Summarize the key points of this document")
print(f"Cache created: {first.usage.prompt_tokens_details.cache_creation_input_tokens}")
print(f"Cache hit: {first.usage.prompt_tokens_details.cached_tokens}")

# Second request: same system content, different question, so it hits the cache
second = get_completion("What precautions are mentioned in the document?")
print(f"Cache created: {second.usage.prompt_tokens_details.cache_creation_input_tokens}")
print(f"Cache hit: {second.usage.prompt_tokens_details.cached_tokens}")

Anthropic-compatible
import anthropic
import os

client = anthropic.Anthropic(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope-intl.aliyuncs.com/apps/anthropic",
)

# Long text to cache (must exceed 1024 tokens)
long_text_content = "<Your Long Text Here>" * 400

def get_completion(user_input):
    response = client.messages.create(
        model="qwen3.7-max",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": long_text_content,
                # Cache marker
                "cache_control": {"type": "ephemeral"},
            }
        ],
        messages=[
            {"role": "user", "content": user_input},
        ],
    )
    return response

# First request: creates cache
first = get_completion("Summarize the key points of this document")
print(f"Cache created: {first.usage.cache_creation_input_tokens}")
print(f"Cache hit: {first.usage.cache_read_input_tokens}")

# Second request: hits cache
second = get_completion("What precautions are mentioned in the document?")
print(f"Cache created: {second.usage.cache_creation_input_tokens}")
print(f"Cache hit: {second.usage.cache_read_input_tokens}")

Expected output:
Cache created: 2005
Cache hit: 0
Cache created: 0
Cache hit: 2005

The first request creates a cache block. The second request hits the cache because the system prompt content is identical. Cached tokens are billed at only 10% of the standard input price.
Verify cache status
Check the usage field in the response to confirm cache behavior:
cache_creation_input_tokens: Number of tokens for which a new cache was created. A value greater than 0 means a new cache block was created.
cached_tokens (OpenAI-compatible) or cache_read_input_tokens (Anthropic-compatible): Number of tokens that hit the cache. A value greater than 0 means the cache was successfully hit.
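If you log cache behavior from both protocols, it can help to normalize the two response shapes. The sketch below works on a plain-dict form of the usage object; `cache_stats` is a hypothetical helper, and the field layout is taken from the quick start examples (for the OpenAI-compatible shape the counters sit under `prompt_tokens_details`). How you obtain the dict from the SDK object is left to you.

```python
def cache_stats(usage: dict) -> dict:
    """Normalize cache counters from either response shape into
    {"created": int, "hit": int}. Missing or null fields count as 0."""
    details = usage.get("prompt_tokens_details")
    if details is not None:
        # OpenAI-compatible shape: counters nested under prompt_tokens_details
        created = details.get("cache_creation_input_tokens")
        hit = details.get("cached_tokens")
    else:
        # Anthropic-compatible shape: counters at the top level of usage
        created = usage.get("cache_creation_input_tokens")
        hit = usage.get("cache_read_input_tokens")
    return {"created": created or 0, "hit": hit or 0}


openai_style = {"prompt_tokens_details": {"cache_creation_input_tokens": 0, "cached_tokens": 2005}}
anthropic_style = {"cache_creation_input_tokens": 2005, "cache_read_input_tokens": 0}
print(cache_stats(openai_style))     # {'created': 0, 'hit': 2005}
print(cache_stats(anthropic_style))  # {'created': 2005, 'hit': 0}
```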
Best practices by scenario
Multi-turn conversations
Characteristics:
Users interact with the model over multiple turns, each request carrying the full conversation history
Typical use cases: customer service, knowledge Q&A, code assistants
Best practice: Add a cache_control marker to the last message in each request. Each turn hits the cache created by the previous turn (the conversation history), while creating a new cache that includes the current turn for the next round.
Example:
from openai import OpenAI
import os

client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

# System prompt: product manual (must exceed 1024 tokens)
product_manual = """You are the support assistant for "BaiLian SmartHome" smart home controller. Here is the complete product manual:
## Product Overview
BaiLian SmartHome is a whole-home smart controller supporting voice control, scene automation, and energy management...
## Installation Guide
1. Install at a central location with good WiFi coverage...
2. Connect the power adapter (5V/2A)...
## FAQ
Q: Cannot connect to WiFi? A: Make sure your router supports 2.4GHz...
""" * 80  # Repeat to exceed 1024 tokens

messages = [{"role": "system", "content": product_manual}]

def chat(user_input):
    # Key: add cache_control to the last user message
    messages.append({
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": user_input,
                "cache_control": {"type": "ephemeral"},
            }
        ],
    })
    completion = client.chat.completions.create(
        model="qwen3.7-max",
        messages=messages,
        extra_body={"enable_thinking": False},
    )
    assistant_msg = completion.choices[0].message.content
    messages.append({"role": "assistant", "content": assistant_msg})
    usage = completion.usage
    created = usage.prompt_tokens_details.cache_creation_input_tokens
    cached = usage.prompt_tokens_details.cached_tokens
    print(f" [Cache] Created: {created} tokens, Hit: {cached} tokens")
    return assistant_msg

# Simulate multi-turn conversation
print("User: What voice assistants does BaiLian SmartHome support?")
print(f"Agent: {chat('What voice assistants does BaiLian SmartHome support?')[:80]}...\n")
print("User: What if I cannot connect to WiFi?")
print(f"Agent: {chat('What if I cannot connect to WiFi?')[:80]}...\n")
print("User: How many devices can it control simultaneously?")
print(f"Agent: {chat('How many devices can it control simultaneously?')[:80]}...")

Expected output:
User: What voice assistants does BaiLian SmartHome support?
[Cache] Created: 8739 tokens, Hit: 0 tokens
Agent: BaiLian SmartHome supports Tmall Genie, XiaoAi, Siri, and other voice assistants...
User: What if I cannot connect to WiFi?
[Cache] Created: 151 tokens, Hit: 8739 tokens
Agent: For WiFi connectivity issues, try the following: 1. Confirm your router supports 2.4GHz...
User: How many devices can it control simultaneously?
[Cache] Created: 101 tokens, Hit: 8890 tokens
Agent: BaiLian SmartHome can control up to 256 smart devices simultaneously...

Starting from the second turn, each request hits the cache from the previous turn (the conversation history), while creating a new cache that includes the current turn. The more turns in the conversation, the greater the savings.
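One thing to watch in longer conversations: in the example above every appended user message keeps its marker, so a conversation with more than four user turns would carry more than the four markers a request supports. A possible way to handle this, sketched below, is to keep the marker only on the final message when building each request. `with_marker_on_last` and `count_markers` are hypothetical helpers, and the sketch assumes that a prefix cached at an earlier marker position can still be matched after that marker is removed; check the usage counters to confirm this for your model.

```python
import copy


def with_marker_on_last(messages: list[dict]) -> list[dict]:
    """Return a copy of `messages` in which only the final message carries a
    cache_control marker. System messages are left untouched."""
    result = copy.deepcopy(messages)
    for msg in result:
        if msg["role"] == "system" or not isinstance(msg["content"], list):
            continue
        for block in msg["content"]:
            block.pop("cache_control", None)  # drop markers from earlier turns
    last = result[-1]
    if isinstance(last["content"], str):
        # Markers require array-form content
        last["content"] = [{"type": "text", "text": last["content"]}]
    last["content"][-1]["cache_control"] = {"type": "ephemeral"}
    return result


def count_markers(messages: list[dict]) -> int:
    """Count cache_control markers across all array-form messages."""
    return sum(
        1
        for msg in messages
        if isinstance(msg["content"], list)
        for block in msg["content"]
        if "cache_control" in block
    )


# Five user turns, each appended with its own marker as in the example above
history = [{"role": "system", "content": "product manual..."}]
for turn in range(1, 6):
    history.append({
        "role": "user",
        "content": [{"type": "text", "text": f"question {turn}",
                     "cache_control": {"type": "ephemeral"}}],
    })
    if turn < 5:
        history.append({"role": "assistant", "content": f"answer {turn}"})

print(count_markers(history))                       # 5
print(count_markers(with_marker_on_last(history)))  # 1
```

The stored history is not modified; the trimmed copy is what you would pass as `messages` in the request.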
Production Agent (multiple cache markers)
Characteristics:
Long multi-turn conversations comprising: system prompt + skills/tools definitions + project context + user messages / tool calls
Different sections change at different frequencies
Typical use cases: AI coding assistants (Claude Code, OpenClaw), RAG-based Q&A systems
Best practice: Use multiple cache markers (up to 4) to pin content at different stability levels. Each marker must be on a separate message (different role) to serve as an independent breakpoint:
System prompt — one marker (rarely changes)
Skills/tools definitions — one marker (may change in combination)
Project context — one marker (may switch or compress)
User messages / tool calls — one marker (grows each turn)
Example: This example simulates a typical Agent architecture with 3 cache markers that pin the system persona and tools (marker 1), the knowledge base (marker 2), and the conversation history (marker 3). The knowledge base is placed in a user message so that it gets its own cache breakpoint: multiple system messages are merged internally and cannot serve as separate breakpoints.
from openai import OpenAI
import os

client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

# Layer 1: System persona (rarely changes)
system_persona = """You are the senior AI support agent for "Model Studio Electronics". Your guidelines:
1. Answer questions based on the knowledge base
2. For information not in the knowledge base, say "Let me transfer you to a human agent"
3. Maintain a professional and friendly tone
4. If the user is unhappy, apologize first then resolve the issue
Below is your complete service specification and script guide:
""" + "Detailed service specification..." * 200  # Ensure > 1024 tokens

# Layer 2: Tools/skills definitions (changes occasionally, e.g. when new features launch)
tools_description = """### Available Tools
- search_product(query): Search product information
- check_inventory(sku, color): Check stock status
- create_ticket(type, description): Create a support ticket
- transfer_to_human(reason): Transfer to a human agent
### Tool Usage Rules
1. When user asks about product details, use search_product first
2. When user asks about stock/shipping, use check_inventory
3. When user requests return/exchange, use create_ticket
4. When a tool returns an error, apologize and transfer_to_human
""" + "Detailed tool usage examples..." * 150  # Ensure > 1024 tokens

# Layer 3: Project knowledge base (semi-stable, changes when user switches products)
knowledge_base_product_a = """### Current product: Model Studio Pro Max Wireless Earbuds
- SKU: BL-PM-2024
- Price: CNY 599
- Colors: Night Black / Nebula White / Ice Blue
- Battery: 8 hours (ANC on), 12 hours (ANC off)
- Water resistance: IPX5
- Warranty: 1 year, 7-day no-questions-asked return
- Stock: Night Black (in stock) / Nebula White (low) / Ice Blue (out of stock)
""" * 50  # Ensure > 1024 tokens

def ask_agent(user_question, history=None):
    if history is None:
        history = []
    messages = [
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": system_persona + "\n\n" + tools_description,
                    "cache_control": {"type": "ephemeral"},  # Marker 1: system persona + tools
                }
            ],
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": f"Here is the knowledge base for the current product:\n{knowledge_base_product_a}",
                    "cache_control": {"type": "ephemeral"},  # Marker 2: knowledge base
                }
            ],
        },
        {"role": "assistant", "content": "Got it. I have the product details ready. How can I help you?"},
    ]
    messages.extend(history)
    # Add current question with marker 3
    messages.append({
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": user_question,
                "cache_control": {"type": "ephemeral"},  # Marker 3: conversation history
            }
        ],
    })
    completion = client.chat.completions.create(
        model="qwen3.7-max",
        messages=messages,
        extra_body={"enable_thinking": False},
    )
    usage = completion.usage
    print(f" Created: {usage.prompt_tokens_details.cache_creation_input_tokens}, "
          f"Hit: {usage.prompt_tokens_details.cached_tokens}")
    return completion.choices[0].message.content

# First request
print("Q1: Is the Ice Blue color available?")
a1 = ask_agent("Is the Ice Blue color available?")
print(f"A1: {a1}\n")

# Second request: same product (persona + tools + knowledge base all hit)
history = [
    {"role": "user", "content": "Is the Ice Blue color available?"},
    {"role": "assistant", "content": a1},
]
print("Q2: When will it be back in stock?")
a2 = ask_agent("When will it be back in stock?", history)
print(f"A2: {a2}")

Expected output:
Q1: Is the Ice Blue color available?
Created: 7659, Hit: 0
A1: I'm sorry, but the Ice Blue color... is currently out of stock...
Q2: When will it be back in stock?
Created: 73, Hit: 7659
A2: I don't have access to specific restock dates... Let me transfer you to a human agent...

In Q2, the prefix up to marker 2 (persona + tools + knowledge base = 7,659 tokens) is unchanged, resulting in a full cache hit. Only the new content after marker 2 (conversation history + new question) requires processing.
How multi-marker caching works:
User continues asking about the same product: Persona, tools, and knowledge base are all unchanged, hitting the cache at marker 2 (longest prefix match) for maximum savings.
More conversation turns: Earlier content (persona + tools + knowledge base + history) hits the previous turn's cache; only the new content requires a fresh cache.
Arrange content from most stable to least stable: place content that changes least at the beginning (e.g., system persona) and content that changes most at the end (e.g., current conversation) to maximize cache hit rates.
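The longest-prefix behavior described above can be pictured with a small local simulation. This is a toy model built on assumptions, not the service's implementation: `ToyPrefixCache`, `msg`, and the simplified message shape (`role`, `text`, `marked`) are all hypothetical, and the model counts messages rather than tokens and ignores TTL and the 1024-token minimum.

```python
import hashlib
import json


class ToyPrefixCache:
    """Remembers a fingerprint of the message prefix ending at each marked
    message, and reports the longest marked prefix already seen."""

    def __init__(self) -> None:
        self._known: set[str] = set()

    @staticmethod
    def _fingerprint(prefix: list[dict]) -> str:
        # Only role and text influence the match, not the marker itself
        payload = json.dumps([[m["role"], m["text"]] for m in prefix])
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def request(self, messages: list[dict]) -> int:
        """Return how many leading messages are served from cache."""
        marker_ends = [i + 1 for i, m in enumerate(messages) if m["marked"]]
        hit = 0
        for end in reversed(marker_ends):  # try the longest prefix first
            if self._fingerprint(messages[:end]) in self._known:
                hit = end
                break
        for end in marker_ends:  # every marked prefix is stored for later
            self._known.add(self._fingerprint(messages[:end]))
        return hit


def msg(role: str, text: str, marked: bool = False) -> dict:
    return {"role": role, "text": text, "marked": marked}


persona = msg("system", "persona + tools", marked=True)
kb_a = msg("user", "knowledge base: product A", marked=True)
kb_b = msg("user", "knowledge base: product B", marked=True)
ack = msg("assistant", "Got it.")

cache = ToyPrefixCache()
first = cache.request([persona, kb_a, ack, msg("user", "Q1", marked=True)])
same_product = cache.request(
    [persona, kb_a, ack, msg("user", "Q1"), msg("assistant", "A1"),
     msg("user", "Q2", marked=True)]
)
new_product = cache.request([persona, kb_b, ack, msg("user", "Q1", marked=True)])
print(first, same_product, new_product)  # 0 2 1
```

In this model the follow-up question on the same product reuses the prefix through the knowledge base (2 messages), while switching to a different knowledge base falls back to the persona prefix only (1 message). That fallback is why the most stable content should come first.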
Batch processing (task completion)
Characteristics:
Single-turn requests, no context memory needed
Fixed long system prompt (task instructions) + variable user input (data to process)
Typical use cases: text classification, intent recognition, data extraction, content moderation
Best practice: Add the cache_control marker only on the system prompt. All subsequent requests hit the cache as long as the system prompt remains unchanged.
Example:
from openai import OpenAI
import os

client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

# Long system prompt: detailed classification rules (must exceed 1024 tokens)
classification_prompt = """You are a product review classifier. Classify each review into one of these categories:
- Positive
- Negative
- Neutral
- Question
- Complaint
Output only the category name, nothing else.
Detailed classification rules and examples:
""" + """Rules:
1. Positive: Contains positive sentiment words (e.g., "great", "excellent", "recommend"), or expresses satisfaction.
2. Negative: Contains negative sentiment words (e.g., "terrible", "disappointed", "return"), or expresses dissatisfaction.
3. Neutral: No clear sentiment, merely states facts.
4. Question: Phrased as a question asking for product information.
5. Complaint: Expresses suggestions for improvement or lodges a complaint.
""" * 100

# Reviews to classify (simulating batch processing)
reviews = [
    "This product is amazing, great quality, highly recommended!",
    "Shipping took a week and the packaging was damaged",
    "Does this come in red? Does it run large or small?",
    "You should add more size options, medium is too big for me",
    "It's okay I guess, nothing special, does what it says",
]

print("=== Batch Classification (Explicit Cache) ===")
for i, review in enumerate(reviews):
    completion = client.chat.completions.create(
        model="qwen3.7-max",
        messages=[
            {
                "role": "system",
                "content": [
                    {
                        "type": "text",
                        "text": classification_prompt,
                        "cache_control": {"type": "ephemeral"},  # Cache classification rules
                    }
                ],
            },
            {"role": "user", "content": review},
        ],
    )
    result = completion.choices[0].message.content
    cached = completion.usage.prompt_tokens_details.cached_tokens
    created = completion.usage.prompt_tokens_details.cache_creation_input_tokens
    print(f"Review {i+1}: \"{review[:40]}...\" -> {result}")
    print(f" Created: {created}, Hit: {cached}")

Expected output:
Review 1: "This product is amazing, great quality, ..." -> Positive
Created: 10353, Hit: 0
Review 2: "Shipping took a week and the packaging w..." -> Negative
Created: 0, Hit: 10353
Review 3: "Does this come in red? Does it run large..." -> Question
Created: 0, Hit: 10353
Review 4: "You should add more size options, medium..." -> Complaint
Created: 0, Hit: 10353
Review 5: "It's okay I guess, nothing special, does..." -> Neutral
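Created: 0, Hit: 10353

After the first request creates the cache, all subsequent requests hit it. When processing 1,000 items, the remaining 999 requests pay only 10% of the standard input price for the cached system prompt tokens, provided each request arrives within the 5-minute TTL of the previous hit. The per-item user input is not cached and is billed at the standard price.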
Function Calling with cached tool definitions
Characteristics:
Using Function Calling with a long list of tool definitions
Tool definitions remain unchanged across requests
Best practice: The tools parameter content is part of the system prompt for caching. Ensure tool definitions are exactly identical across requests (same order, same field order, same structure), and add a cache_control marker to the message content.
from openai import OpenAI
import os

client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

# Long text to meet the 1024-token minimum
long_text_content = "<Your Code Here>" * 400

# Tool definitions: must be exactly identical across requests
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a given city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
                },
                "required": ["city"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "search_flights",
            "description": "Search flights between two cities",
            "parameters": {
                "type": "object",
                "properties": {
                    "origin": {"type": "string", "description": "Departure city"},
                    "destination": {"type": "string", "description": "Destination city"},
                    "date": {"type": "string", "description": "Departure date in YYYY-MM-DD format"}
                },
                "required": ["origin", "destination", "date"]
            }
        }
    }
]

def ask(user_input):
    messages = [
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": long_text_content,
                    # cache_control can only be added to message content, not to tools
                    "cache_control": {"type": "ephemeral"},
                }
            ],
        },
        {"role": "user", "content": user_input},
    ]
    completion = client.chat.completions.create(
        model="qwen3.7-max",
        messages=messages,
        tools=tools,
        extra_body={"enable_thinking": False},
    )
    usage = completion.usage
    print(f" Created: {usage.prompt_tokens_details.cache_creation_input_tokens}, "
          f"Hit: {usage.prompt_tokens_details.cached_tokens}")
    tool_calls = completion.choices[0].message.tool_calls
    if tool_calls:
        print(f" Tools called: {[t.function.name for t in tool_calls]}")
    return completion

# First request: creates cache (includes tool definitions)
print("Q1: What's the weather in Beijing today?")
ask("What's the weather in Beijing today?")

# Second request: hits cache
print("\nQ2: Find flights from Shanghai to Beijing tomorrow")
ask("Find flights from Shanghai to Beijing tomorrow")

Expected output:
Q1: What's the weather in Beijing today?
Created: 1995, Hit: 0
Tools called: ['get_weather']
Q2: Find flights from Shanghai to Beijing tomorrow
Created: 0, Hit: 1995
Tools called: ['search_flights']

Keys to maximizing Function Calling cache hits:
Consistent tool order: Keep the same ordering of tools in the tools array.
Consistent field order: Keep JSON field ordering the same within each tool definition.
Consistent structure: Do not add, remove, or reorder fields between requests, even if they are optional or empty.
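One way to enforce the three rules above is to canonicalize the tool list once before every request and to fingerprint it so drift shows up in logs. This is a sketch under assumptions: `canonical_tools` and `tools_fingerprint` are hypothetical helpers, and the approach relies on Python dicts keeping insertion order when they are serialized to JSON.

```python
import hashlib
import json


def canonical_tools(tools: list[dict]) -> list[dict]:
    """Return the tools sorted by function name, with the keys of every
    nested dict in sorted order. List values (e.g. "required") keep their order."""
    ordered = sorted(tools, key=lambda tool: tool["function"]["name"])
    # A JSON round trip with sort_keys=True fixes key order at every depth
    return json.loads(json.dumps(ordered, sort_keys=True))


def tools_fingerprint(tools: list[dict]) -> str:
    """Hash of the exact serialized form; changes if order or fields change."""
    payload = json.dumps(tools, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Same two tools, written with different tool order and field order
tools_v1 = [
    {"type": "function", "function": {"name": "search_flights", "description": "Search flights"}},
    {"type": "function", "function": {"name": "get_weather", "description": "Get weather"}},
]
tools_v2 = [
    {"function": {"description": "Get weather", "name": "get_weather"}, "type": "function"},
    {"function": {"description": "Search flights", "name": "search_flights"}, "type": "function"},
]

print(tools_fingerprint(tools_v1) == tools_fingerprint(tools_v2))  # False
print(tools_fingerprint(canonical_tools(tools_v1))
      == tools_fingerprint(canonical_tools(tools_v2)))             # True
```

Passing `canonical_tools(tools)` as the `tools` argument keeps the serialized definitions stable even when different code paths assemble them in a different order.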
Important notes
Content format requirement: When adding cache_control, the content field must be in array form. String-form content does not support cache markers.
Cache marker granularity: Qwen3.5 and later models only support message-level cache breakpoints. Placing multiple cache_control markers within a single message's content array does not create separate breakpoints; the system only stores cache at the last marker position within that message and cannot truncate-match at intermediate content blocks. Additionally, multiple system messages are merged internally into a single segment and cannot serve as separate breakpoints. To create multiple independent breakpoints, distribute cache_control markers across messages with different roles (e.g., one on system, one on user). Models prior to Qwen3.5 support content-level (intra-message) breakpoints.
Mutually exclusive with implicit cache: A request can only use one caching mode. If the request contains a cache_control marker, explicit cache is used; otherwise, the system automatically uses implicit cache.
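The format and granularity notes above lend themselves to a small pre-flight check. The sketch below is illustrative only: `add_cache_marker` and `check_markers` are hypothetical helpers, the four-marker limit comes from the key points, and the "one marker per message" check reflects the message-level behavior described for Qwen3.5 and later.

```python
MAX_MARKERS = 4  # per-request limit stated in the key points


def add_cache_marker(message: dict) -> dict:
    """Return a copy of `message` in array form with a single marker on its
    last content block."""
    content = message["content"]
    if isinstance(content, str):
        # String-form content cannot carry a marker, so convert it
        blocks = [{"type": "text", "text": content}]
    else:
        # Copy blocks and drop any existing markers
        blocks = [{k: v for k, v in block.items() if k != "cache_control"}
                  for block in content]
    blocks[-1]["cache_control"] = {"type": "ephemeral"}
    return {**message, "content": blocks}


def check_markers(messages: list[dict]) -> list[str]:
    """Return a list of problems found (an empty list means none)."""
    problems = []
    total = 0
    for index, message in enumerate(messages):
        content = message["content"]
        if not isinstance(content, list):
            continue  # string-form content has no markers
        count = sum(1 for block in content if "cache_control" in block)
        total += count
        if count > 1:
            problems.append(f"message {index}: {count} markers in one message")
    if total > MAX_MARKERS:
        problems.append(f"request: {total} markers, limit is {MAX_MARKERS}")
    return problems


marked = add_cache_marker({"role": "user", "content": "hello"})
print(marked["content"])
# [{'type': 'text', 'text': 'hello', 'cache_control': {'type': 'ephemeral'}}]
print(check_markers([marked]))  # []
```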
Supported models
For the list of models that support explicit cache, see Context cache.