
We present Qwen3-Max-Thinking, our latest flagship reasoning model. By scaling up model parameters and leveraging substantial computational resources for reinforcement learning, Qwen3-Max-Thinking achieves significant performance improvements across multiple dimensions, including factual knowledge, complex reasoning, instruction following, alignment with human preferences, and agent capabilities. On 19 established benchmarks, it demonstrates performance comparable to leading models such as GPT-5.2-Thinking, Claude-Opus-4.5, and Gemini 3 Pro.
We further enhance Qwen3-Max-Thinking with two key innovations: (1) adaptive tool-use capabilities that enable on-demand retrieval and code interpreter invocation, now available at chat.qwen.ai; and (2) advanced test-time scaling techniques that significantly boost reasoning performance, surpassing Gemini 3 Pro on key reasoning benchmarks.

The table below presents a more comprehensive set of evaluation scores.
| Capability | Benchmark | GPT-5.2-Thinking | Claude-Opus-4.5 | Gemini 3 Pro | DeepSeek V3.2 | Qwen3-Max-Thinking |
| --- | --- | --- | --- | --- | --- | --- |
| Knowledge | MMLU-Pro | 87.4 | 89.5 | 89.8 | 85.0 | 85.7 |
| | MMLU-Redux | 95.0 | 95.6 | 95.9 | 94.5 | 92.8 |
| | C-Eval | 90.5 | 92.2 | 93.4 | 92.9 | 93.7 |
| STEM | GPQA | 92.4 | 87.0 | 91.9 | 82.4 | 87.4 |
| | HLE¹ | 35.5 | 30.8 | 37.5 | 25.1 | 30.2 |
| Reasoning | LiveCodeBench v6 | 87.7 | 84.8 | 90.7 | 80.8 | 85.9 |
| | HMMT Feb 25 | 99.4 | - | 97.5 | 92.5 | 98.0 |
| | HMMT Nov 25 | - | - | 93.3 | 90.2 | 94.7 |
| | IMO-AnswerBench | 86.3 | 84.0 | 83.3 | 78.3 | 83.9 |
| Agentic Coding | SWE Verified | 80.0 | 80.9 | 76.2 | 73.1 | 75.3 |
| Agentic Search | HLE (w/ tools)² | 45.5 | 43.2 | 45.8 | 40.8 | 49.8 |
| Instruction Following & Alignment | IFBench | 75.4 | 58.0 | 70.4 | 60.7 | 70.9 |
| | MultiChallenge | 57.9 | 54.2 | 64.2 | 47.3 | 63.3 |
| | Arena-Hard v2³ | 80.6 | 76.7 | 81.7 | 66.5 | 90.2 |
| Tool Use | Tau² Bench⁴ | 80.9 | 85.7 | 85.4 | 80.3 | 82.1 |
| | BFCL-V4⁵ | 63.1 | 77.5 | 72.5 | 61.2 | 67.7 |
| | Vita Bench | 38.2 | 56.3 | 51.6 | 44.1 | 40.9 |
| | Deep Planning⁶ | 44.6 | 33.9 | 23.3 | 21.6 | 28.7 |
| Long Context | AA-LCR | 72.7 | 74.0 | 70.7 | 65.0 | 68.7 |
Unlike earlier approaches that required users to manually select tools before each task, Qwen3-Max-Thinking autonomously selects and leverages its built-in Search, Memory, and Code Interpreter capabilities during conversations. This capability emerges from a focused training process: after initial fine-tuning for tool use, the model underwent further training on diverse tasks using both rule-based and model-based feedback. Empirically, we observe that the Search and Memory tools effectively mitigate hallucinations, provide access to real-time information, and enable more personalized responses. The Code Interpreter allows users to execute code snippets and apply computational reasoning to solve complex problems. Together, these features deliver a seamless and capable conversational experience.
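To make the adaptive tool-use behavior concrete, here is a minimal, illustrative sketch of a tool-routing loop. Everything in it (`decide`, the stub `search` and `code_interpreter` helpers, and the routing heuristics) is hypothetical, invented for illustration only: the production model makes this decision internally as a learned behavior, not via hand-written rules.

```python
def search(query: str) -> str:
    # Stub: a real implementation would call a retrieval backend.
    return f"search results for: {query}"

def code_interpreter(snippet: str):
    # Stub: a real implementation would execute code in a sandbox.
    return eval(snippet)

def decide(message: str):
    # Hypothetical routing policy: choose a tool only when the request
    # appears to need one; otherwise answer directly.
    if message.strip().endswith("?") and any(c.isdigit() for c in message):
        return "code_interpreter"
    if "latest" in message or "today" in message:
        return "search"
    return None

def answer(message: str):
    # Dispatch to the selected built-in tool, mirroring how the model
    # autonomously invokes Search or the Code Interpreter mid-conversation.
    tool = decide(message)
    if tool == "search":
        return search(message)
    if tool == "code_interpreter":
        return code_interpreter(message.rstrip("?"))
    return "answered directly"
```

For example, `answer("2+2?")` routes to the interpreter stub, while a request mentioning "latest" routes to retrieval; the real model learns this routing end-to-end from rule-based and model-based feedback.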
Test-time scaling refers to techniques that allocate additional computation during inference to improve model performance. We propose an experience-cumulative, multi-round test-time scaling strategy for the heavy mode. Instead of simply increasing parallel trajectories N, which often yields redundant reasoning, we limit N and redirect saved computation to iterative self-reflection guided by a “take-experience” mechanism. This mechanism distills key insights from past rounds, allowing the model to avoid re-deriving known conclusions and focus on unresolved uncertainties. Crucially, it achieves higher context efficiency than naively referencing raw trajectories, enabling richer integration of historical information within the same context window. This approach consistently outperforms standard parallel sampling and aggregation with roughly the same token consumption: GPQA (90.3 → 92.8), HLE (34.1 → 36.5), LiveCodeBench v6 (88.0 → 91.4), IMO-AnswerBench (89.5 → 91.5), and HLE (w/ tools) (55.8 → 58.3).
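The control flow of this strategy can be sketched as follows. This is a toy illustration under stated assumptions: `solve` and `distill` are hypothetical stand-ins (the real "take-experience" mechanism is the model's own distillation of insights from prior rounds, not these stubs), and the scoring is fake.

```python
def solve(problem, experience, seed):
    # Stub solver: in this toy, answer quality grows with accumulated insights.
    return {"answer": problem + seed, "score": len(experience) + seed}

def distill(trajectories):
    # Compress a round's trajectories into short reusable insights instead of
    # carrying raw reasoning text into the next round (context efficiency).
    return [f"insight from answer {t['answer']}" for t in trajectories]

def heavy_mode(problem, rounds=3, n_parallel=2):
    experience = []  # distilled insights carried across rounds
    best = None
    for _ in range(rounds):
        # Keep the number of parallel trajectories N small; spend the saved
        # budget on iterative self-reflection across rounds.
        trajectories = [solve(problem, experience, s) for s in range(n_parallel)]
        round_best = max(trajectories, key=lambda t: t["score"])
        if best is None or round_best["score"] > best["score"]:
            best = round_best
        experience.extend(distill(trajectories))
    return best
```

The key contrast with standard parallel sampling is the `experience` list: later rounds condition on distilled conclusions from earlier ones, so the model revisits only unresolved uncertainties rather than re-deriving known results.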
Qwen3-Max-Thinking is now available in Qwen Chat, where users can interact with the model and its adaptive tool-use capabilities. The Qwen3-Max-Thinking API (model name: qwen3-max-2026-01-23) is also available. To get started, register an Alibaba Cloud account, activate the Alibaba Cloud Model Studio service, then navigate to the console and create an API key.
Since the Qwen APIs are OpenAI-compatible, you can follow the standard practice of using the OpenAI SDK. Below is an example of using Qwen3-Max-Thinking in Python:
```python
from openai import OpenAI
import os

client = OpenAI(
    # The API key created in the Alibaba Cloud Model Studio console.
    api_key=os.getenv("API_KEY"),
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
    model="qwen3-max-2026-01-23",
    messages=[
        {"role": "user", "content": "Give me a short introduction to large language models."}
    ],
    # Enable the model's thinking mode via the DashScope-specific parameter.
    extra_body={"enable_thinking": True},
)
print(completion.choices[0].message)
```
The Qwen APIs are also compatible with the Anthropic API protocol, enabling Qwen3-Max-Thinking to work seamlessly with Claude Code. Simply use the API key created in your Alibaba Cloud account and install Claude Code to elevate your coding experience. Below is the quick-start script.
```shell
# Install Claude Code
npm install -g @anthropic-ai/claude-code

# Configure environment variables
export ANTHROPIC_MODEL="qwen3-max-2026-01-23"
export ANTHROPIC_SMALL_FAST_MODEL="qwen3-max-2026-01-23"
export ANTHROPIC_BASE_URL="https://dashscope.aliyuncs.com/apps/anthropic"
export ANTHROPIC_AUTH_TOKEN="your-dashscope-apikey"

# Launch Claude Code
claude
```
Feel free to cite the following article if you find Qwen3-Max-Thinking helpful.
```bibtex
@misc{qwen3maxthinking,
  title  = {Pushing Qwen3-Max-Thinking Beyond its Limits},
  url    = {https://qwen.ai/blog?id=qwen3-max-thinking},
  author = {Qwen Team},
  month  = {January},
  year   = {2026}
}
```
Alibaba Cloud Community - January 27, 2026