Qwen3.5: Towards Native Multimodal Agents

We are delighted to announce the official release of Qwen3.5 and the open weights of the first model in the series, Qwen3.5-397B-A17B. As a native vision-language model, Qwen3.5-397B-A17B demonstrates outstanding results across a full range of benchmark evaluations, including reasoning, coding, agent capabilities, and multimodal understanding, empowering developers and enterprises to achieve significantly greater productivity. Built on a hybrid architecture that combines linear attention (via Gated Delta Networks) with a sparse mixture-of-experts, the model is highly efficient at inference: of its 397 billion total parameters, only 17 billion are activated per forward pass, reducing latency and cost without sacrificing capability. We have also expanded language and dialect support from 119 to 201, making the model accessible to more users around the world.
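
To make the sparsity figure concrete, the following back-of-the-envelope sketch computes the share of parameters active per forward pass from the counts quoted above. The helper name `active_fraction` is ours, and the Qwen3-235B-A22B counts are read off the model name on the assumption that it follows the same total/active naming convention.

```python
def active_fraction(total_params_b: float, active_params_b: float) -> float:
    """Return the fraction of parameters activated per forward pass."""
    if total_params_b <= 0 or not 0 <= active_params_b <= total_params_b:
        raise ValueError("expected total > 0 and 0 <= active <= total")
    return active_params_b / total_params_b


# Parameter counts in billions. The Qwen3.5 numbers are quoted in this post;
# the Qwen3 numbers are inferred from the model name (assumption).
qwen35 = active_fraction(397, 17)
qwen3 = active_fraction(235, 22)

print(f"Qwen3.5-397B-A17B: {qwen35:.1%} of parameters active")  # 4.3%
print(f"Qwen3-235B-A22B:   {qwen3:.1%} of parameters active")   # 9.4%
```

Under these assumptions, roughly one parameter in 23 takes part in each forward pass.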

Qwen3.5-Plus is the hosted model available via Alibaba Cloud Model Studio, featuring:

  • a 1M context window by default
  • official built-in tools and adaptive tool use

Performance

Below we present a comprehensive evaluation of our model against frontier models across a wide range of tasks and modalities.

Language

 

| Category | Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-Max-Thinking | K2.5-1T-A32B | Qwen3.5-397B-A17B |
|---|---|---|---|---|---|---|---|
| Knowledge | MMLU-Pro | 87.4 | 89.5 | 89.8 | 85.7 | 87.1 | 87.8 |
| | MMLU-Redux | 95.0 | 95.6 | 95.9 | 92.8 | 94.5 | 94.9 |
| | SuperGPQA | 67.9 | 70.6 | 74.0 | 67.3 | 69.2 | 70.4 |
| | C-Eval | 90.5 | 92.2 | 93.4 | 93.7 | 94.0 | 93.0 |
| Instruction Following | IFEval | 94.8 | 90.9 | 93.5 | 93.4 | 93.9 | 92.6 |
| | IFBench | 75.4 | 58.0 | 70.4 | 70.9 | 70.2 | 76.5 |
| | MultiChallenge | 57.9 | 54.2 | 64.2 | 63.3 | 62.7 | 67.6 |
| Long Context | AA-LCR | 72.7 | 74.0 | 70.7 | 68.7 | 70.0 | 68.7 |
| | LongBench v2 | 54.5 | 64.4 | 68.2 | 60.6 | 61.0 | 63.2 |
| STEM | GPQA | 92.4 | 87.0 | 91.9 | 87.4 | 87.6 | 88.4 |
| | HLE | 35.5 | 30.8 | 37.5 | 30.2 | 30.1 | 28.7 |
| | HLE-Verified¹ | 43.3 | 38.8 | 48 | 37.6 | -- | 37.6 |
| Reasoning | LiveCodeBench v6 | 87.7 | 84.8 | 90.7 | 85.9 | 85.0 | 83.6 |
| | HMMT Feb 25 | 99.4 | 92.9 | 97.3 | 98.0 | 95.4 | 94.8 |
| | HMMT Nov 25 | 100 | 93.3 | 93.3 | 94.7 | 91.1 | 92.7 |
| | IMOAnswerBench | 86.3 | 84.0 | 83.3 | 83.9 | 81.8 | 80.9 |
| | AIME26 | 96.7 | 93.3 | 90.6 | 93.3 | 93.3 | 91.3 |
| General Agent | BFCL-V4 | 63.1 | 77.5 | 72.5 | 67.7 | 68.3 | 72.9 |
| | TAU2-Bench | 87.1 | 91.6 | 85.4 | 84.6 | 77.0 | 86.7 |
| | VITA-Bench | 38.2 | 56.3 | 51.6 | 40.9 | 41.9 | 49.7 |
| | DeepPlanning | 44.6 | 33.9 | 23.3 | 28.7 | 14.5 | 34.3 |
| | Tool Decathlon | 43.8 | 43.5 | 36.4 | 18.8 | 27.8 | 38.3 |
| | MCP-Mark | 57.5 | 42.3 | 53.9 | 33.5 | 29.5 | 46.1 |
| Search Agent | HLE w/ tool | 45.5 | 43.4 | 45.8 | 49.8 | 50.2 | 48.3 |
| | BrowseComp | 65.8 | 67.8 | 59.2 | 53.9 | --/74.9 | 69.0/78.6 |
| | BrowseComp-zh | 76.1 | 62.4 | 66.8 | 60.9 | -- | 70.3 |
| | WideSearch | 76.8 | 76.4 | 68.0 | 57.9 | 72.7 | 74.0 |
| | Seal-0 | 45.0 | 47.7 | 45.5 | 46.9 | 57.4 | 46.9 |
| Multilingualism | MMMLU | 89.5 | 90.1 | 90.6 | 84.4 | 86.0 | 88.5 |
| | MMLU-ProX | 83.7 | 85.7 | 87.7 | 78.5 | 82.3 | 84.7 |
| | NOVA-63 | 54.6 | 56.7 | 56.7 | 54.2 | 56.0 | 59.1 |
| | INCLUDE | 87.5 | 86.2 | 90.5 | 82.3 | 83.3 | 85.6 |
| | Global PIQA | 90.9 | 91.6 | 93.2 | 86.0 | 89.3 | 89.8 |
| | PolyMATH | 62.5 | 79.0 | 81.6 | 64.7 | 43.1 | 73.3 |
| | WMT24++ | 78.8 | 79.7 | 80.7 | 77.6 | 77.6 | 78.9 |
| | MAXIFE | 88.4 | 79.2 | 87.5 | 84.0 | 72.8 | 88.2 |
| Coding Agent | SWE-bench Verified | 80.0 | 80.9 | 76.2 | 75.3 | 76.8 | 76.4 |
| | SWE-bench Multilingual | 72.0 | 77.5 | 65.0 | 66.7 | 73.0 | 69.3 |
| | SecCodeBench | 68.7 | 68.6 | 62.4 | 57.5 | 61.3 | 68.3 |
| | Terminal Bench 2 | 54.0 | 59.3 | 54.2 | 22.5 | 50.8 | 52.5 |

  • HLE-Verified: a verified and revised version of Humanity’s Last Exam (HLE), accompanied by a transparent, component-wise verification protocol and a fine-grained error taxonomy. We open-source the dataset at https://huggingface.co/datasets/skylenage/HLE-Verified.
  • TAU2-Bench: we follow the official setup except for the airline domain, where all models are evaluated by applying the fixes proposed in the Claude Opus 4.5 system card.
  • MCP-Mark: GitHub MCP server uses v0.30.3 from api.githubcopilot.com; Playwright tool responses are truncated at 32k tokens.
  • Search Agent: most search agents built on our model adopt a simple context-folding strategy (256k): once the cumulative Tool Response length reaches a preset threshold, earlier Tool Responses are pruned from the history to keep the context within limits.
  • BrowseComp: we tested two strategies: simple context-folding achieved a score of 69.0, while the same discard-all strategy used by DeepSeek-V3.2 and Kimi K2.5 achieved 78.6.
  • WideSearch: we use a 256k context window without any context management.
  • MMLU-ProX: we report the averaged accuracy on 29 languages.
  • WMT24++: a harder subset of WMT24 after difficulty labeling and rebalancing; we report the averaged scores on 55 languages using XCOMET-XXL.
  • MAXIFE: we report the accuracy on English + multilingual original prompts (23 settings in total).
  • Empty cells (--) indicate scores not yet available or not applicable.
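
The context-folding strategy described in the Search Agent note can be sketched as follows. This is a minimal illustration under our own assumptions, not the evaluation harness: the whitespace tokenizer `count_tokens`, the function name `fold_context`, and the choice to replace pruned responses with a placeholder (rather than delete the messages) are all hypothetical.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def count_tokens(text: str) -> int:
    """Crude stand-in tokenizer: counts whitespace-separated words (assumption)."""
    return len(text.split())


def fold_context(
    messages: List[Message],
    max_tool_tokens: int,
    counter: Callable[[str], int] = count_tokens,
    placeholder: str = "[tool response pruned]",
) -> List[Message]:
    """Prune the oldest tool responses once their total length reaches the budget."""
    folded = [dict(m) for m in messages]  # copy; the caller's history is untouched
    tool_idx = [i for i, m in enumerate(folded) if m["role"] == "tool"]
    total = sum(counter(folded[i]["content"]) for i in tool_idx)
    # Walk from oldest to newest; the most recent tool response is always kept.
    for i in tool_idx[:-1]:
        if total < max_tool_tokens:
            break
        total += counter(placeholder) - counter(folded[i]["content"])
        folded[i]["content"] = placeholder
    return folded


history = [
    {"role": "user", "content": "Find the answer."},
    {"role": "tool", "content": "result " * 50},
    {"role": "tool", "content": "result " * 50},
    {"role": "tool", "content": "the final search result"},
]
folded = fold_context(history, max_tool_tokens=100)
print([m["content"][:22] for m in folded])
```

In this toy history only the oldest tool response is replaced, because pruning it already brings the cumulative tool-response length under the threshold.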

Vision Language

 

| Category | Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-VL-235B-A22B | K2.5-1T-A32B | Qwen3.5-397B-A17B |
|---|---|---|---|---|---|---|---|
| STEM and Puzzle | MMMU | 86.7 | 80.7 | 87.2 | 80.6 | 84.3 | 85.0 |
| | MMMU-Pro | 79.5 | 70.6 | 81.0 | 69.3 | 78.5 | 79.0 |
| | MathVision | 83.0 | 74.3 | 86.6 | 74.6 | 84.2 | 88.6 |
| | Mathvista(mini) | 83.1 | 80.0 | 87.9 | 85.8 | 90.1 | 90.3 |
| | We-Math | 79.0 | 70.0 | 86.9 | 74.8 | 84.7 | 87.9 |
| | DynaMath | 86.8 | 79.7 | 85.1 | 82.8 | 84.4 | 86.3 |
| | ZEROBench | 9 | 3 | 10 | 4 | 9 | 12 |
| | ZEROBench_sub | 33.2 | 28.4 | 39.0 | 28.4 | 33.5 | 41.0 |
| | BabyVision | 34.4 | 14.2 | 49.7 | 22.2 | 36.5 | 52.3/43.3 |
| General VQA | RealWorldQA | 83.3 | 77.0 | 83.3 | 81.3 | 81.0 | 83.9 |
| | MMStar | 77.1 | 73.2 | 83.1 | 78.7 | 80.5 | 83.8 |
| | HallusionBench | 65.2 | 64.1 | 68.6 | 66.7 | 69.8 | 71.4 |
| | MMBenchEN-DEV-v1.1 | 88.2 | 89.2 | 93.7 | 89.7 | 94.2 | 93.7 |
| | SimpleVQA | 55.8 | 65.7 | 73.2 | 61.3 | 71.2 | 67.1 |
| Text Recognition and Document Understanding | OmniDocBench1.5 | 85.7 | 87.7 | 88.5 | 84.5 | 88.8 | 90.8 |
| | CharXiv(RQ) | 82.1 | 68.5 | 81.4 | 66.1 | 77.5 | 80.8 |
| | MMLongBench-Doc | -- | 61.9 | 60.5 | 56.2 | 58.5 | 61.5 |
| | CC-OCR | 70.3 | 76.9 | 79.0 | 81.5 | 79.7 | 82.0 |
| | AI2D_TEST | 92.2 | 87.7 | 94.1 | 89.2 | 90.8 | 93.9 |
| | OCRBench | 80.7 | 85.8 | 90.4 | 87.5 | 92.3 | 93.1 |
| Spatial Intelligence | ERQA | 59.8 | 46.8 | 70.5 | 52.5 | -- | 67.5 |
| | CountBench | 91.9 | 90.6 | 97.3 | 93.7 | 94.1 | 97.2 |
| | RefCOCO(avg) | -- | -- | 84.1 | 91.1 | 87.8 | 92.3 |
| | ODInW13 | -- | -- | 46.3 | 43.2 | -- | 47.0 |
| | EmbSpatialBench | 81.3 | 75.7 | 61.2 | 84.3 | 77.4 | 84.5 |
| | RefSpatialBench | -- | -- | 65.5 | 69.9 | -- | 73.6 |
| | LingoQA | 68.8 | 78.8 | 72.8 | 66.8 | 68.2 | 81.6 |
| | V* | 75.9 | 67.0 | 88.0 | 85.9 | 77.0 | 95.8/91.1 |
| | Hypersim | -- | -- | -- | 11.0 | -- | 12.5 |
| | SUNRGBD | -- | -- | -- | 34.9 | -- | 38.3 |
| | Nuscene | -- | -- | -- | 13.9 | -- | 16.0 |
| Video Understanding | VideoMME(w sub.) | 86 | 77.6 | 88.4 | 83.8 | 87.4 | 87.5 |
| | VideoMME(w/o sub.) | 85.8 | 81.4 | 87.7 | 79.0 | 83.2 | 83.7 |
| | VideoMMMU | 85.9 | 84.4 | 87.6 | 80.0 | 86.6 | 84.7 |
| | MLVU (M-Avg) | 85.6 | 81.7 | 83.0 | 83.8 | 85.0 | 86.7 |
| | MVBench | 78.1 | 67.2 | 74.1 | 75.2 | 73.5 | 77.6 |
| | LVBench | 73.7 | 57.3 | 76.2 | 63.6 | 75.9 | 75.5 |
| | MMVU | 80.8 | 77.3 | 77.5 | 71.1 | 80.4 | 75.4 |
| Visual Agent | ScreenSpot Pro | -- | 45.7 | 72.7 | 62.0 | -- | 65.6 |
| | OSWorld-Verified | 38.2 | 66.3 | -- | 38.1 | 63.3 | 62.2 |
| | AndroidWorld | -- | -- | -- | 63.7 | -- | 66.8 |
| Medical VQA | SLAKE | 76.9 | 76.4 | 81.3 | 54.7 | 81.6 | 79.9 |
| | PMC-VQA | 58.9 | 59.9 | 62.3 | 41.2 | 63.3 | 64.2 |
| | MedXpertQA-MM | 73.3 | 63.6 | 76.0 | 47.6 | 65.3 | 70.0 |

  • MathVision: our model’s score is evaluated using a fixed prompt, e.g., “Please reason step by step, and put your final answer within \boxed{}.” For other models, we report the higher score between runs with and without the \boxed{} formatting.
  • BabyVision: our model’s score is reported with CI (Code Interpreter) enabled; without CI, the result is 43.3.
  • V*: our model’s score is reported with CI (Code Interpreter) enabled; without CI, the result is 91.1.
  • Empty cells (--) indicate scores not yet available or not applicable.

Compared to the Qwen3 series, the post-training performance gains in Qwen3.5 primarily stem from our extensive scaling of virtually all RL tasks and environments we could conceive. Our approach placed strong emphasis on increasing the difficulty and generalizability of RL environments, rather than optimizing for specific metrics or narrow categories of queries. Below, we illustrate the improvements in general agent capabilities resulting from this RL environment scaling. The overall score is each model's average rank across the following benchmarks: BFCL-V4, VITA-Bench, DeepPlanning, Tool Decathlon, and MCP-Mark. Additional scaling results across a broader range of tasks will be detailed in our upcoming technical report.

[Figure: general agent performance improvements from RL environment scaling]
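
As an illustration of the average-rank metric, the sketch below ranks models on each of the five benchmarks (1 = best) and averages the ranks. It reuses the General Agent scores from the Language table above purely as sample input; the models compared in the figure may differ, and the post does not say how ties are handled, so this version simply assumes there are none.

```python
# Scores copied from the Language table above (higher is better).
MODELS = [
    "GPT5.2", "Claude 4.5 Opus", "Gemini-3 Pro",
    "Qwen3-Max-Thinking", "K2.5-1T-A32B", "Qwen3.5-397B-A17B",
]
SCORES = {
    "BFCL-V4":        [63.1, 77.5, 72.5, 67.7, 68.3, 72.9],
    "VITA-Bench":     [38.2, 56.3, 51.6, 40.9, 41.9, 49.7],
    "DeepPlanning":   [44.6, 33.9, 23.3, 28.7, 14.5, 34.3],
    "Tool Decathlon": [43.8, 43.5, 36.4, 18.8, 27.8, 38.3],
    "MCP-Mark":       [57.5, 42.3, 53.9, 33.5, 29.5, 46.1],
}


def average_ranks(models, scores):
    """Rank models per benchmark (1 = best) and average the ranks.

    Ties are not treated specially; the sample data contains none.
    """
    totals = {m: 0 for m in models}
    for values in scores.values():
        order = sorted(range(len(models)), key=lambda i: values[i], reverse=True)
        for rank, i in enumerate(order, start=1):
            totals[models[i]] += rank
    return {m: totals[m] / len(scores) for m in models}


ranks = average_ranks(MODELS, SCORES)
for model, avg in sorted(ranks.items(), key=lambda kv: kv[1]):
    print(f"{model:20s} {avg:.1f}")
```

On this sample input a lower value is better: a model that placed first on every benchmark would score 1.0.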

Pretraining

Qwen3.5 advances pretraining across three dimensions—power, efficiency, and versatility:

Power: Trained on a significantly larger scale of visual-text tokens compared to Qwen3, with enriched Chinese/English, multilingual, STEM, and reasoning data under stricter filtering. This enables cross-generation parity: Qwen3.5-397B-A17B matches the >1T-parameter Qwen3-Max-Base.

Efficiency: Built on the Qwen3-Next architecture, which combines a higher-sparsity MoE, hybrid Gated DeltaNet + Gated Attention layers, stability optimizations, and multi-token prediction. At 32k/256k context lengths, the decoding throughput of Qwen3.5-397B-A17B is 8.6x/19.0x that of Qwen3-Max, with comparable performance. The decoding throughput of Qwen3.5-397B-A17B is 3.5x/7.2x that of Qwen3-235B-A22B.

Versatility: Natively multimodal via early text-vision fusion and expanded visual/STEM/video data, outperforming Qwen3-VL at similar scales. Multilingual coverage grows from 119 to 201 languages/dialects; a 250k vocabulary (vs. 150k) boosts encoding/decoding efficiency by 10–60% across most languages.
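
The post names Gated DeltaNet without giving its update rule. As a rough illustration, here is a naive single-head sketch of the gated delta rule as it is commonly written in the Gated Delta Networks literature, S_t = α_t · S_{t-1}(I − β_t k_t k_tᵀ) + β_t v_t k_tᵀ with output o_t = S_t q_t. It is not Qwen3.5's implementation: production kernels are chunked and hardware-optimized, and the function and variable names here are ours.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def gated_delta_step(
    S: Matrix, q: Vector, k: Vector, v: Vector, alpha: float, beta: float
) -> Tuple[Matrix, Vector]:
    """One recurrent step: S' = alpha * S (I - beta k k^T) + beta v k^T, o = S' q."""
    # S k: the value the current state would retrieve for key k.
    Sk = [sum(s * kj for s, kj in zip(row, k)) for row in S]
    S_new = [
        [alpha * (S[i][j] - beta * Sk[i] * k[j]) + beta * v[i] * k[j]
         for j in range(len(k))]
        for i in range(len(v))
    ]
    # Read out with the query.
    o = [sum(s * qj for s, qj in zip(row, q)) for row in S_new]
    return S_new, o


# Toy usage: a 2x2 state, writing two values under the same unit-norm key.
S = [[0.0, 0.0], [0.0, 0.0]]
key = [1.0, 0.0]
S, _ = gated_delta_step(S, q=key, k=key, v=[2.0, 3.0], alpha=1.0, beta=1.0)
S, out = gated_delta_step(S, q=key, k=key, v=[5.0, 7.0], alpha=1.0, beta=1.0)
print(out)  # [5.0, 7.0]: with alpha = beta = 1 the second write replaces the first
```

The state `S` has a fixed size no matter how many tokens have been processed, so the per-token decoding cost of such a layer does not grow with context length. This is one reason linear-attention layers tend to help long-context throughput.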

Below we present the performance of the base models.

 

| Category | Benchmark | Qwen3-235B-A22B | GLM-4.5-355B-A32B | DeepSeek-V3.2-671B-A37B | K2-1T-A32B | Qwen3.5-397B-A17B |
|---|---|---|---|---|---|---|
| General Knowledge & Multilingual | MMLU | 87.33 | 86.56 | 88.11 | 87.38 | 88.61 |
| | MMLU-Pro | 67.73 | 65.00 | 62.82 | 67.64 | 76.01 |
| | MMLU-Redux | 87.44 | 86.86 | 87.29 | 86.65 | 89.09 |
| | SuperGPQA | 42.84 | 44.56 | 43.46 | 44.86 | 57.96 |
| | C-Eval | 91.82 | 85.50 | 90.48 | 91.82 | 91.82 |
| | MMMLU | 81.27 | 82.26 | 83.20 | 82.26 | 85.82 |
| | Include | 75.26 | 73.41 | 76.52 | 72.05 | 79.27 |
| | Nova | 66.52 | 60.96 | 60.40 | 61.44 | 67.55 |
| Reasoning & STEM | BBH | 87.95 | 87.68 | 86.03 | 89.11 | 90.98 |
| | KoRBench | 50.80 | 52.80 | 54.00 | 53.84 | 54.08 |
| | GPQA | 47.47 | 44.63 | 44.16 | 46.78 | 54.64 |
| | MATH | 71.84 | 61.84 | 64.40 | 71.50 | 74.14 |
| | GSM8K | 91.17 | 89.31 | 89.12 | 92.12 | 93.71 |
| Coding | Evalplus | 77.60 | 69.49 | 62.68 | 71.77 | 79.32 |
| | MultiPLE | 65.94 | 62.51 | 61.88 | 70.64 | 79.39 |
| | SWE-agentless | 31.77 | 29.23 | 34.67 | 28.54 | 43.26 |
| | CRUX-I | 64.25 | 67.63 | 63.25 | 70.50 | 71.13 |
| | CRUX-O | 78.88 | 77.13 | 73.88 | 77.13 | 82.38 |

Infrastructure

Qwen3.5 enables efficient native multimodal training via a heterogeneous infrastructure that decouples the parallelism strategies of the vision and language components, avoiding the inefficiencies of a uniform approach. By exploiting sparse activations to overlap computation across components, it reaches nearly 100% of the training throughput of pure-text baselines on mixed text-image-video data. Complementing this, a native FP8 pipeline applies low precision to activations, MoE routing, and GEMM operations, while runtime monitoring keeps sensitive layers in BF16. This yields a ~50% reduction in activation memory and a >10% speedup while scaling stably to tens of trillions of tokens.

To continuously unleash the power of reinforcement learning, we built a scalable asynchronous RL framework that supports Qwen3.5 models of all sizes, spanning text, multimodal, and multi-turn settings. By adopting a fully disaggregated training-inference architecture, the framework achieves significantly improved hardware utilization, dynamic load balancing, and fine-grained fault recovery. It further optimizes throughput and enhances train–infer consistency via techniques such as FP8 end-to-end training, rollout router replay, speculative decoding, and multi-turn rollout locking. Through tight system-algorithm co-design, the framework effectively bounds gradient staleness and mitigates data skewness, preserving both training stability and performance. Moreover, it natively supports agentic workflows, facilitating seamless multi-turn interactions without framework-induced interruptions. This decoupled design enables the system to accommodate million-scale agent scaffolds and environments, substantially boosting model generalization. Collectively, these optimizations yield a 3×–5× end-to-end speedup, demonstrating superior stability, efficiency, and scalability.
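
The post does not describe how gradient staleness is bounded. One common approach in asynchronous RL, shown here only as a hypothetical sketch, is to tag each rollout with the policy version that generated it and discard rollouts that lag the trainer by more than a fixed number of versions. The class and field names below are ours, not part of the Qwen3.5 framework.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List


@dataclass
class Rollout:
    policy_version: int  # version of the weights that generated this rollout
    payload: str         # stand-in for the trajectory data


class StalenessBoundedBuffer:
    """FIFO rollout queue that only yields samples at most `max_lag` versions old."""

    def __init__(self, max_lag: int) -> None:
        if max_lag < 0:
            raise ValueError("max_lag must be non-negative")
        self.max_lag = max_lag
        self._queue: Deque[Rollout] = deque()

    def put(self, rollout: Rollout) -> None:
        self._queue.append(rollout)

    def get_batch(self, trainer_version: int, batch_size: int) -> List[Rollout]:
        """Pop up to `batch_size` rollouts, dropping those that are too stale."""
        batch: List[Rollout] = []
        while self._queue and len(batch) < batch_size:
            rollout = self._queue.popleft()
            if trainer_version - rollout.policy_version <= self.max_lag:
                batch.append(rollout)
            # Otherwise the rollout is silently discarded as too stale.
        return batch


buffer = StalenessBoundedBuffer(max_lag=1)
for version in (0, 1, 2):
    buffer.put(Rollout(policy_version=version, payload=f"trajectory-{version}"))
batch = buffer.get_batch(trainer_version=2, batch_size=8)
print([r.policy_version for r in batch])  # [1, 2]
```

A hard cutoff like this trades some wasted rollouts for a guaranteed bound on how off-policy each training batch can be.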

Play with Qwen3.5

Chat with Qwen3.5

Feel free to try Qwen3.5 on Qwen Chat. We provide three modes for users to choose from: Auto, Thinking, and Fast. In Auto mode, the model thinks adaptively and can use tools, including search and the code interpreter. In Thinking mode, the model thinks deeply about hard problems. In Fast mode, the model answers instantly without spending tokens on thinking.

Model Studio

Users can try our flagship model, Qwen3.5-Plus, through Alibaba Cloud Model Studio. To enable advanced capabilities such as reasoning, web search, and Code Interpreter, pass the following parameters:

  • enable_thinking: Activates reasoning mode (chain-of-thought processing)
  • enable_search: Enables web search and Code Interpreter functionality

Example code is provided below:

"""
Environment variables (per official docs):
  DASHSCOPE_API_KEY: Your API key from https://bailian.console.aliyun.com
  DASHSCOPE_MODEL: (optional) Model name; override to use a different model.
  DASHSCOPE_BASE_URL: (optional) Base URL for the compatible-mode API. Regional endpoints:
    - Beijing: https://dashscope.aliyuncs.com/compatible-mode/v1
    - Singapore: https://dashscope-intl.aliyuncs.com/compatible-mode/v1
    - US (Virginia): https://dashscope-us.aliyuncs.com/compatible-mode/v1
"""
from openai import OpenAI
import os

api_key = os.environ.get("DASHSCOPE_API_KEY")
if not api_key:
    raise ValueError(
        "DASHSCOPE_API_KEY is required. "
        "Set it via: export DASHSCOPE_API_KEY='your-api-key'"
    )

client = OpenAI(
    api_key=api_key,
    base_url=os.environ.get(
        "DASHSCOPE_BASE_URL",
        "https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
    ),
)

messages = [{"role": "user", "content": "Introduce Qwen3.5."}]

model = os.environ.get(
    "DASHSCOPE_MODEL",
    "qwen3.5-plus",
)
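completion = client.chat.completions.create(
    model=model,
    messages=messages,
    extra_body={
        "enable_thinking": True,
        "enable_search": False,
    },
    stream=True,
    # Request a final usage-only chunk (empty `choices`) so the usage branch below runs.
    stream_options={"include_usage": True},
)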

reasoning_content = ""  # Full reasoning trace
answer_content = ""  # Full response
is_answering = False  # Whether we have entered the answer phase
print("\n" + "=" * 20 + "Reasoning" + "=" * 20 + "\n")

for chunk in completion:
    if not chunk.choices:
        print("\nUsage:")
        print(chunk.usage)
        continue

    delta = chunk.choices[0].delta

    # Collect reasoning content only
    if hasattr(delta, "reasoning_content") and delta.reasoning_content is not None:
        if not is_answering:
            print(delta.reasoning_content, end="", flush=True)
        reasoning_content += delta.reasoning_content

    # Received content, start answer phase
    if hasattr(delta, "content") and delta.content:
        if not is_answering:
            print("\n" + "=" * 20 + "Answer" + "=" * 20 + "\n")
            is_answering = True
        print(delta.content, end="", flush=True)
        answer_content += delta.content

You can integrate the Bailian API with third-party coding tools such as Qwen Code, Claude Code, Cline, OpenClaw, and OpenCode for a seamless “vibe coding” experience.

Summary and Future Work

Qwen3.5 provides a strong foundation for universal digital agents through its efficient hybrid architecture and native multimodal reasoning. The next leap requires shifting from model scaling to system integration: building agents with persistent memory for cross-session learning, embodied interfaces for real-world interaction, self-directed improvement mechanisms, and economic awareness to operate within practical constraints. The goal is coherent systems that function autonomously over time, transforming today’s task-bound assistants into persistent, trustworthy partners capable of executing complex, multi-day objectives with human-aligned judgment.

Citation

Feel free to cite the following article if you find Qwen3.5 helpful:

@misc{qwen35blog,
    title = {Qwen3.5: Accelerating Productivity with Native Multimodal Agents},
    url = {https://qwen.ai/blog?id=qwen3.5},
    author = {Qwen Team},
    month = {February},
    year = {2026}
}

