
Qwen3.7-Plus: Multimodal Agent Intelligence



Today we introduce Qwen3.7-Plus — a multimodal agent model that unifies vision and language into a single, versatile agent foundation. Building on Qwen3.7’s strong text backbone, Qwen3.7-Plus delivers a comprehensive upgrade in vision-language capabilities while retaining full agentic strength in coding, tool use, and productivity workflows.

What sets Qwen3.7-Plus apart is its ability to operate as a multimodal interactive hybrid agent. It perceives real-world scenes, reads screens and operates GUIs, writes code from visual references, navigates mobile apps end-to-end, and answers visual questions grounded in web knowledge — seamlessly blending GUI and CLI interactions within a single agent loop. As a versatile coding agent and productivity assistant, it handles the full spectrum from frontend prototyping to complex software engineering and multi-step workflow automation with full-modality input. It generalizes across agent scaffolds, performing consistently whether deployed through Claude Code, OpenClaw, Qwen Code, or other frameworks.

  • Qwen3.7-Plus is now available via API on Alibaba Cloud Model Studio:

    • Multimodal interactive hybrid agent: unified GUI & CLI operation across visual and text tasks
    • Versatile coding agent & productivity assistant with full-modality input
    • Visual Agent: perception, reasoning, grounding, and search-augmented QA
    • Cross-harness generalization across diverse agent frameworks

Performance

[Figure: Qwen3.7-Plus benchmark score overview]

Text Benchmarks

[Table 1: text benchmark results]

* Terminal-Bench 2.0: Harbor/Terminus-2 harness; 5h timeout, 12 CPU/24 GB RAM; temp=1.0, top_p=0.95, top_k=20, max_tokens=80K, 256K ctx; avg of 5 runs. All experiments prepend a token at each turn, allowing the model to decide whether to engage extended thinking.
* SWE-bench series: Internal agent scaffold (bash + file-edit tools); temp=1.0, top_p=0.95, 200K context window.
* SWE-bench Pro: Problematic tasks corrected and all baselines evaluated on the refined benchmark.
* QwenClawBench: a real-user-distribution Claw agent benchmark; open-source: https://github.com/SKYLENAGE-AI/QwenClawBench.
* CoWorkBench: an internal cowork benchmark; long-horizon tasks across computer science, finance, law, medical, and other productivity domains.
* SkillsBench: Evaluated via OpenCode on 78 tasks (excluding 9 external API-dependent tasks); avg of 5 runs.
* MCP-Mark: GitHub MCP v0.30.3; Playwright responses truncated at 32K tokens.
* MCP-Atlas: Public set score; judged by gemini-2.5-pro.
* VITA-Bench: Avg subdomain scores; judged by claude-4.5-sonnet, as the older official judges are no longer available.
* Kernel Bench L3: Metrics reported: median of per-problem speedup over PyTorch eager reference / fraction of problems faster than torch.compile, across 50 problems. Each test sample runs in an isolated Docker container with one H100 80GB GPU, with internet access restricted to the CUTLASS codebase and official CUDA documentation, limited to 500 tool calls with early stopping after 100 non-improving turns. GPT-5.4 (xhigh) is applied to detect potential hacking behaviors. CUPTI is used for kernel-level timing.
* Reasoning scenarios: Recommended system prompt: "Reasoning effort is set to xhigh. Please think carefully through the task, validate key assumptions, consider plausible alternatives, and prioritize correctness, consistency, and clarity in the final answer."
* WMT24++: Harder WMT24 subset; avg scores on 55 languages via XCOMET-XXL.
* MAXIFE: Accuracy on EN + multilingual prompts (23 settings total).
* MMLU-ProX: Avg accuracy across 29 languages.
* Empty cells (--) indicate scores not yet available.
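The Kernel Bench L3 note above reports two numbers per model: the median per-problem speedup over the PyTorch eager reference, and the fraction of problems on which the generated kernel beats torch.compile. A minimal sketch of that arithmetic follows. The field names (`eager_ms`, `compile_ms`, `kernel_ms`) and the demo timings are illustrative assumptions, not the benchmark's actual schema or results.

```python
from statistics import median


def kernel_bench_summary(results):
    """Summarize per-problem timings (milliseconds, lower is better).

    Each entry is a dict with hypothetical keys:
      eager_ms   - PyTorch eager reference runtime
      compile_ms - torch.compile runtime
      kernel_ms  - runtime of the model-written kernel
    Returns (median speedup over eager, fraction faster than torch.compile).
    """
    speedups = [r["eager_ms"] / r["kernel_ms"] for r in results]
    faster = sum(1 for r in results if r["kernel_ms"] < r["compile_ms"])
    return median(speedups), faster / len(results)


# Made-up timings for three problems, purely for illustration.
demo = [
    {"eager_ms": 10.0, "compile_ms": 8.0, "kernel_ms": 5.0},
    {"eager_ms": 12.0, "compile_ms": 6.0, "kernel_ms": 8.0},
    {"eager_ms": 9.0, "compile_ms": 9.0, "kernel_ms": 3.0},
]
median_speedup, frac_faster = kernel_bench_summary(demo)
print(f"{median_speedup:.2f}x / {frac_faster:.0%}")
```

Using the median rather than the mean keeps a single very large speedup from dominating the headline number.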

Qwen3.7-Plus delivers competitive text performance that approaches Max-tier models across the board. In coding agents, it performs strongly on Terminal-Bench 2.0, the SWE-bench series, and SciCode, handling both real-world software engineering and scientific programming tasks effectively. In general-purpose agents, it demonstrates robust tool-use and planning capabilities on MCP-Mark, Deep-Planning, and Kernel Bench L3, with particular strength in complex multi-step planning and GPU kernel optimization. Its reasoning performance on GPQA Diamond, HMMT, and IMOAnswerBench places it among the strongest Plus-tier models on hard STEM benchmarks. In instruction following and multilingual tasks, it delivers consistent quality on IFBench, WMT24++, and PolyMATH, with strong coverage across diverse languages.

Multimodal Benchmarks

[Table 2: multimodal benchmark results]

* Multimodal Search & Knowledge QA: All models evaluated with search augmentation enabled.
* BabyVision and CharXiv(RQ): Scores are reported as "with CI / without CI".
* VideoMME (w/ sub.): Scores are reported with subtitles.
* BC-VL and MMBC: Scores are reported with the recommended presence penalty 1.5 in BC tasks.
* ScreenSpot Pro and OSWorld-Verified: Scores are reported with "enable_thinking=False".
* Empty cells (--) indicate the scores are not yet available.
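The notes above mention two request-level settings: a presence penalty of 1.5 for the BC tasks, and "enable_thinking=False" for ScreenSpot Pro and OSWorld-Verified. As a rough sketch, these could be expressed as keyword arguments for an OpenAI-compatible chat completions call. The helper name and prompts are hypothetical, passing `enable_thinking` through `extra_body` mirrors the API example later in this post, and whether thinking was enabled for the BC tasks is not stated, so the value used here is an assumption.

```python
def eval_request_kwargs(model, messages, *, thinking, presence_penalty=None):
    """Build keyword arguments for client.chat.completions.create(**kwargs).

    Hypothetical helper: it only assembles a dict and sends nothing.
    """
    kwargs = {
        "model": model,
        "messages": messages,
        # Non-standard switches go through extra_body in the OpenAI SDK.
        "extra_body": {"enable_thinking": thinking},
    }
    if presence_penalty is not None:
        # presence_penalty is a standard Chat Completions parameter.
        kwargs["presence_penalty"] = presence_penalty
    return kwargs


# GUI grounding style request with thinking disabled.
gui_kwargs = eval_request_kwargs(
    "qwen3.7-plus",
    [{"role": "user", "content": "Locate the Save button."}],
    thinking=False,
)

# BC style request with the recommended presence penalty
# (thinking=True is an assumption, not taken from the notes).
bc_kwargs = eval_request_kwargs(
    "qwen3.7-plus",
    [{"role": "user", "content": "Who designed this building?"}],
    thinking=True,
    presence_penalty=1.5,
)
print(gui_kwargs["extra_body"], bc_kwargs["presence_penalty"])
```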

Qwen3.7-Plus’s multimodal improvements are not limited to isolated gains in visual understanding. Instead, they reflect a systematic enhancement of the core capabilities required by multimodal agents: understanding complex visual inputs, reasoning over visual information, using tools to solve problems, and ultimately executing tasks in code or GUI environments.

In Multimodal Reasoning, Qwen3.7-Plus delivers strong performance on challenging visual reasoning benchmarks such as BabyVision, MathVision, HiPhO, ERQA, and VisFactor. These results demonstrate the model’s ability to integrate fine-grained visual perception, spatial relationships, physical commonsense, and multi-step logical reasoning. In particular, its significant improvement on BabyVision over Qwen3.6-Plus suggests stronger generalization on tasks that are closer to early human visual cognition and spatial reasoning.

In Visual Agent & Coding, Qwen3.7-Plus shows substantial gains on ScreenSpot Pro, OSWorld-Verified, and AndroidWorld. This indicates that the model can not only recognize screen content, but also localize key UI elements, understand task intent, and complete multi-step interactions. On QwenVision2Code, the model also demonstrates strong vision-to-code generation capabilities, turning images, videos, and design references into executable code. These capabilities form the foundation for multimodal agents to move from “understanding interfaces” to “operating interfaces” and even “building interfaces.”

In Multimodal Search & Knowledge QA, Qwen3.7-Plus achieves clear improvements on SimpleVQA, WorldVQA, MMSearchPlus, BC-VL, and MMBC. The model can combine visual inputs with external knowledge retrieval to answer questions that cannot be solved from image content alone. This makes it better suited for real-world tasks, where users do not simply ask “what is in the image,” but expect the model to combine visual evidence, commonsense, and up-to-date knowledge to provide reliable answers.

In General Visual Understanding, Qwen3.7-Plus maintains strong performance across real-world scenes, document parsing, chart understanding, OCR, counting, and spatial localization. It performs strongly on tasks such as RealWorldQA, CountQA, OmniDocBench, CharXiv, and OCR-Bench-V2. These capabilities are essential for robustly handling real business inputs, including screenshots, receipts, tables, reports, posters, product images, and complex UI pages.

Beyond images, Qwen3.7-Plus further strengthens video understanding and driving-scene understanding. On video benchmarks such as VideoMMMU, MLVU, TVBench, and LVBench, it can reason over events, actions, temporal dynamics, and semantic relationships in both short and long videos. On driving-related evaluations such as LingoQA, Ego3D-Bench, SURDS, and VLADBench, it also demonstrates strong understanding of dynamic scenes, traffic participants, and spatial relationships. These capabilities lay an important foundation for real-world multimodal agents, autonomous driving understanding, and embodied AI scenarios.

Build with Qwen3.7-Plus

Qwen3.7-Plus is now available through Alibaba Cloud Model Studio.

API Usage

As a multimodal model, Qwen3.7-Plus accepts text, image, and video inputs. It also supports the preserve_thinking option, which keeps the thinking content from all preceding turns in the message history and is recommended for agentic tasks.
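For image input, a user message can carry a list of content parts instead of a plain string. The sketch below assumes the `image_url` content-part format from OpenAI's Chat Completions specification, which the compatible-mode endpoint is expected to accept. The helper names are hypothetical, and video input is left out because its exact field names are not given in this post.

```python
import base64


def to_data_url(image_bytes, mime="image/png"):
    """Encode raw image bytes as a data URL for inline upload."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def image_message(prompt, image_url):
    """Build one user message with an image part and a text part.

    image_url may be an https:// link or a data URL from to_data_url().
    """
    return {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": image_url}},
            {"type": "text", "text": prompt},
        ],
    }


# Placeholder bytes stand in for a real screenshot file.
msg = image_message("What does this screen show?", to_data_url(b"abc"))
messages = [msg]
print(msg["content"][0]["image_url"]["url"])
```

The resulting `messages` list can be passed to a chat completions call in place of a text-only message list.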

Alibaba Cloud Model Studio

Alibaba Cloud Model Studio supports industry-standard protocols, including chat completions and responses APIs compatible with OpenAI’s specification.

"""
Environment variables:
  DASHSCOPE_API_KEY: Your API Key from https://modelstudio.console.alibabacloud.com
  DASHSCOPE_BASE_URL: (optional) Base URL for compatible-mode API.
    - Beijing: https://dashscope.aliyuncs.com/compatible-mode/v1
    - Singapore: https://dashscope-intl.aliyuncs.com/compatible-mode/v1
    - US (Virginia): https://dashscope-us.aliyuncs.com/compatible-mode/v1
"""
from openai import OpenAI
import os

api_key = os.environ.get("DASHSCOPE_API_KEY")
if not api_key:
    raise ValueError(
        "DASHSCOPE_API_KEY is required. "
        "Set it via: export DASHSCOPE_API_KEY='your-api-key'"
    )

client = OpenAI(
    api_key=api_key,
    base_url=os.environ.get(
        "DASHSCOPE_BASE_URL",
        "https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
    ),
)

messages = [{"role": "user", "content": "Write a Python function to merge two sorted linked lists."}]

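completion = client.chat.completions.create(
    model="qwen3.7-plus",
    messages=messages,
    extra_body={
        "enable_thinking": True,
        # "preserve_thinking": True,
    },
    stream=True,
    # Request a final usage chunk so the "Usage" branch below can fire.
    stream_options={"include_usage": True},
)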

reasoning_content = ""
answer_content = ""
is_answering = False
print("\n" + "=" * 20 + "Reasoning" + "=" * 20 + "\n")

for chunk in completion:
    # A chunk without choices carries only token usage statistics.
    if not chunk.choices:
        print("\nUsage:")
        print(chunk.usage)
        continue

    delta = chunk.choices[0].delta

    # Thinking tokens arrive in reasoning_content before the answer starts.
    if hasattr(delta, "reasoning_content") and delta.reasoning_content is not None:
        if not is_answering:
            print(delta.reasoning_content, end="", flush=True)
        reasoning_content += delta.reasoning_content

    # The first non-empty content delta marks the switch to the answer.
    if hasattr(delta, "content") and delta.content:
        if not is_answering:
            print("\n" + "=" * 20 + "Answer" + "=" * 20 + "\n")
            is_answering = True
        print(delta.content, end="", flush=True)
        answer_content += delta.content

For more information, please visit the API doc.
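For multi-turn agentic use with preserve_thinking, the history sent on the next turn needs to include the earlier thinking content. The sketch below shows one way this might look. Storing the thinking under a `reasoning_content` key on assistant messages is an assumption borrowed from the streaming field name above, not a confirmed request schema, and the helper name and sample strings are hypothetical. Check the API doc for the exact format.

```python
def append_assistant_turn(messages, answer, reasoning=None):
    """Return a new history with the assistant's reply appended.

    Hypothetical helper. `answer` and `reasoning` correspond to the
    answer_content and reasoning_content strings collected while streaming.
    """
    turn = {"role": "assistant", "content": answer}
    if reasoning:
        # Assumption: thinking is carried back under the streamed field name.
        turn["reasoning_content"] = reasoning
    return messages + [turn]


history = [{"role": "user", "content": "Write a Python function to merge two sorted linked lists."}]
history = append_assistant_turn(
    history,
    answer="def merge(a, b): ...",
    reasoning="Compare the two heads and advance the smaller one.",
)
history.append({"role": "user", "content": "Now add unit tests."})

# Options for the follow-up call, matching the commented-out flag above.
next_extra_body = {"enable_thinking": True, "preserve_thinking": True}
print(len(history), sorted(history[1]))
```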

Multimodal Interactive Hybrid Agent

Qwen3.7-Plus features multimodal hybrid-agent capabilities designed for closed-loop execution of real-world tasks. It can not only understand visual interfaces, perceive on-screen content, and perform both GUI interactions and CLI operations, but also leverage environmental feedback for code generation, application manipulation, testing, validation, and iterative optimization. By integrating the full workflow of “see, think, write, act, and verify” into a unified agent loop, it enables end-to-end automation of complex software tasks from initial understanding to final delivery.
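The "see, think, write, act, and verify" loop described above can be pictured as a small control loop. The following is a toy sketch under stated assumptions, not the Hybrid-Agent implementation: every name is hypothetical, and the four callables stand in for screen capture, model calls, GUI/CLI execution, and test validation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LoopResult:
    done: bool
    steps: int
    log: List[str] = field(default_factory=list)


def run_agent_loop(observe, decide, act, verify, max_steps=10):
    """Repeat observe -> decide -> act -> verify until verify() succeeds."""
    log = []
    for step in range(1, max_steps + 1):
        observation = observe()             # see: screenshot, terminal output
        action = decide(observation, log)   # think + write: plan or code
        feedback = act(action)              # act: GUI click or CLI command
        log.append(f"{action} -> {feedback}")
        if verify(feedback):                # verify: tests or checks pass
            return LoopResult(True, step, log)
    return LoopResult(False, max_steps, log)


# Toy environment: each action fixes one failing test; done at 3 passing.
state = {"passing": 0}


def fix_one(action):
    state["passing"] += 1
    return state["passing"]


result = run_agent_loop(
    observe=lambda: state["passing"],
    decide=lambda seen, log: f"fix failing test #{seen + 1}",
    act=fix_one,
    verify=lambda passing: passing >= 3,
)
print(result.done, result.steps)
```

Capping the loop with `max_steps` is a design choice for the sketch: it keeps a stuck agent from running indefinitely when verification never succeeds.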

We built the Hybrid-Agent system on Qwen3.7, tightly integrating the code generation capabilities of large language models with GUI automation to cover the full app development chain, from requirement analysis to version iteration. The agent ran continuously and stably for over 11 hours, fully automating the R&D cycle of an English vocabulary learning app. It generated more than 10,000 lines of code, made over 1,000 agent calls, and covered the core stages of the software development lifecycle: requirement document generation, automated coding, installation and deployment, test case creation, GUI-based automated testing, multi-scenario parallelized testing, automatic product documentation updates, and autonomous version evolution.

For professional desktop applications, the Hybrid-Agent system combines the model’s GUI perception and code generation capabilities to replicate an application autonomously from a single request. The agent completed a high-fidelity recreation of the native macOS Stocks app, covering the full pipeline from requirement understanding to delivery validation. It interacted with the native app to learn its UI layout and feature details, generated SwiftUI source code from the interaction records, integrated the LongBridge market API for live data, and automatically compiled and launched the recreated app. Finally, it ran 10 functional verification tests on its own, including real-time quote loading, stock selection and switching, multi-period view toggling, search filtering, and detailed stats panel display, and all 10 passed. The delivered application faithfully reproduces the native Stocks app’s dark theme, split-view layout, real-time market data, and full interactivity.

Visual Agent

Qwen3.7-Plus can serve as a powerful visual agent, combining visual understanding with tool use to solve complex visual tasks. Through integration with a code interpreter, it can analyze images to spot differences, complete missing puzzle pieces, solve sliding-block puzzles, navigate mazes, and assemble jigsaw puzzles—all by autonomously generating and executing code. With search augmentation, it can also leverage web knowledge to reason over real-world visual questions and provide multimodal answers across single-image, multi-image, and video inputs.

Below, we showcase several examples that demonstrate the multimodal agent capabilities of Qwen3.7-Plus.

Multimodal Reasoning

For multimodal reasoning, we introduce code execution to further enhance the model’s problem-solving ability. The model first understands the structure and constraints in the visual input, then transforms the visual task into a computable representation, and finally writes and executes code to solve, search, or verify the answer.

In tasks such as spot-the-difference, missing-block completion, sliding-block puzzles, mazes, and jigsaw puzzles, the model needs to go beyond recognizing visual content. It must also perform spatial modeling, path search, state simulation, and result verification. These examples highlight Qwen3.7-Plus’s ability to move from visual perception to programmatic problem solving.
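As an illustration of turning a visual task into a computable representation, suppose a maze image has already been transcribed into a character grid. A breadth-first search can then find a shortest path, and a second function can verify it. This is a minimal sketch of the kind of code such a workflow might produce. The grid encoding ('#' wall, 'S' start, 'E' exit) is an assumption for the example.

```python
from collections import deque


def solve_maze(grid):
    """Return a shortest path from 'S' to 'E' as (row, col) cells, or None."""
    rows, cols = len(grid), len(grid[0])
    start = next(
        (r, c) for r in range(rows) for c in range(cols) if grid[r][c] == "S"
    )
    queue = deque([start])
    parent = {start: None}  # doubles as the visited set
    while queue:
        cell = queue.popleft()
        r, c = cell
        if grid[r][c] == "E":
            path = []
            while cell is not None:  # walk parents back to the start
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] != "#" and (nr, nc) not in parent:
                parent[(nr, nc)] = cell
                queue.append((nr, nc))
    return None


def is_valid_path(grid, path):
    """Verify that consecutive cells are adjacent and none is a wall."""
    steps_ok = all(
        abs(r1 - r2) + abs(c1 - c2) == 1
        for (r1, c1), (r2, c2) in zip(path, path[1:])
    )
    return steps_ok and all(grid[r][c] != "#" for r, c in path)


maze = [
    "S.#",
    ".##",
    "..E",
]
path = solve_maze(maze)
print(path, is_valid_path(maze, path))
```

Separating the solver from the checker mirrors the "solve, search, or verify" split: the answer is validated against the grid independently of how it was found.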

Multimodal Search

In search-augmented visual question answering, Qwen3.7-Plus can combine image, video, or multi-image inputs with web search to answer real-world knowledge questions. The model first extracts key entities, scenes, text, and contextual clues from the visual input, then retrieves external knowledge through search, and finally synthesizes visual evidence with retrieved information to produce the answer.

This enables the model to handle a wide range of open-world questions, such as identifying locations, understanding the background of events, analyzing products or objects, and answering visual questions that depend on up-to-date knowledge.

Visual Coding

Qwen3.7-Plus demonstrates strong vision-to-code generation capabilities. It can transform images, videos, UI screenshots, and design references into executable code, covering a broad range of scenarios from SVG reconstruction to full webpage generation.

Image/Video to SVG

In image/video-to-SVG tasks, the model needs to understand geometric structures, colors, layouts, hierarchical relationships, and dynamic changes in visual content, and then express these elements precisely in code. This requires not only visual understanding, but also structured representation and code generation.

For icons, illustrations, animations, graphic design, and information visualization, this capability can significantly reduce the cost of turning visual references into editable code assets.
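To make "structured representation" concrete, the sketch below serializes a list of shape descriptions into an SVG document and parses it back to confirm it is well-formed XML. The shape schema (a `tag` plus an `attrs` dict) is a hypothetical intermediate format chosen for illustration, not something the model is documented to emit.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

SVG_NS = "http://www.w3.org/2000/svg"


def shapes_to_svg(shapes, width, height):
    """Serialize shape dicts into a standalone SVG string."""
    parts = [
        f'<svg xmlns="{SVG_NS}" width="{width}" height="{height}" '
        f'viewBox="0 0 {width} {height}">'
    ]
    for shape in shapes:
        # quoteattr escapes each value and wraps it in quotes.
        attrs = " ".join(
            f"{name}={quoteattr(str(value))}"
            for name, value in shape["attrs"].items()
        )
        parts.append(f'  <{shape["tag"]} {attrs}/>')
    parts.append("</svg>")
    return "\n".join(parts)


# A dark square background with a yellow circle, as a stand-in icon.
icon = [
    {"tag": "rect", "attrs": {"x": 0, "y": 0, "width": 64, "height": 64, "fill": "#1e1e1e"}},
    {"tag": "circle", "attrs": {"cx": 32, "cy": 32, "r": 20, "fill": "#ffcc00"}},
]
svg = shapes_to_svg(icon, 64, 64)
root = ET.fromstring(svg)  # raises ParseError if the markup is malformed
print(svg)
```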

Vision-Driven Web Design

In vision-driven web design, Qwen3.7-Plus can generate complete interactive webpages based on visual references, video materials, or design intent. The model can also use generation tools to produce assets for webpage design.

It not only reproduces the visual style of a reference page, but also organizes layout, writes frontend code, handles interaction logic, and integrates multimodal assets into the final page. This demonstrates the potential of Qwen3.7-Plus as a visual coding assistant: moving from “given a reference image” to “generate a runnable web prototype.”

Browser Agent

The browser Agent, built on Qwen3.7-Plus, is demonstrated through Qwen for Chrome, a Chrome browser extension. Users can interact with Qwen directly from the browser sidebar and, with authorization, switch it into Agent mode. In this mode, Qwen can perceive the current webpage, understand the user’s task, plan the next steps, and operate as a Browser Agent to perform clicks, typing, navigation, configuration, and verification directly in the real browser environment.

With this setup, the Qwen3.7 browser Agent integrates page understanding, task planning, and GUI automation to operate inside real web-based work environments. Given a non-technical user’s request to purchase the cheapest ECS server, the Agent can navigate the cloud console, compare instance options, select a low-cost configuration, set up images, storage, security groups, and order details, while dynamically adjusting its strategy when prices change, inventory is limited, or purchase constraints arise. In the follow-up task, the Agent further handles instance scaling and maintenance, completing shutdown, configuration updates, disk expansion, service recovery, and final verification. This scenario covers the real cloud workflow from server purchase to upgrade, turning a complex console-based process into a continuous, efficient, and deliverable browser automation task.

Real-world Perception & Reasoning

Qwen3.7-Plus also shows strong performance in real-world perception and multimodal reasoning. Real-world scenes are often much more complex than standard visual question answering. They may involve occlusion, cluttered backgrounds, small objects, relationships among multiple entities, cross-image comparison, and implicit physical commonsense.

To answer these questions reliably, the model must first identify visual details robustly, then combine them with spatial relationships, commonsense knowledge, and logical reasoning.

Coding Assistants

Qwen3.7-Plus integrates seamlessly with popular agent frameworks and coding assistants:

Claude Code

Qwen APIs support the Anthropic API protocol, enabling direct use with Claude Code:

npm install -g @anthropic-ai/claude-code

export ANTHROPIC_MODEL="qwen3.7-plus"
export ANTHROPIC_SMALL_FAST_MODEL="qwen3.7-plus"
export ANTHROPIC_BASE_URL=https://dashscope-intl.aliyuncs.com/apps/anthropic
export ANTHROPIC_AUTH_TOKEN=<your_api_key>

claude

OpenClaw

Connect to OpenClaw via Model Studio:

curl -fsSL https://molt.bot/install.sh | bash
export DASHSCOPE_API_KEY=<your_api_key>
openclaw dashboard

Configure ~/.openclaw/openclaw.json:

{
  "models": {
    "mode": "merge",
    "providers": {
      "modelstudio": {
        "baseUrl": "https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
        "apiKey": "DASHSCOPE_API_KEY",
        "api": "openai-completions",
        "models": [
          {
            "id": "qwen3.7-plus",
            "name": "qwen3.7-plus",
            "reasoning": true,
            "input": ["text"],
            "contextWindow": 1000000,
            "maxTokens": 65536
          }
        ]
      }
    }
  },
  "agents": {
    "defaults": {
      "model": {
        "primary": "modelstudio/qwen3.7-plus"
      }
    }
  }
}

Qwen Code

Qwen Code is deeply optimized for the Qwen series:

npm install -g @qwen-code/qwen-code@latest
qwen

Summary

Qwen3.7-Plus is our most capable multimodal agent model, unifying vision understanding and language reasoning into a versatile agent foundation. It operates as a multimodal interactive hybrid agent — perceiving real-world scenes, operating graphical interfaces, writing code from visual references, and completing end-to-end tasks across both GUI and CLI environments. As a versatile coding agent and productivity assistant, it handles the full range of tasks from frontend prototyping to complex software engineering and multi-step workflow automation. It generalizes across agent scaffolds, performing consistently whether deployed through Claude Code, OpenClaw, Qwen Code, or other frameworks. We welcome community feedback and look forward to seeing what you build.

Citation

@misc{qwen37plus,
    title = {{Qwen3.7-Plus}: Multimodal Agent Intelligence},
    url = {https://qwen.ai/blog?id=qwen3.7-plus},
    author = {{Qwen Team}},
    month = {May},
    year = {2026}
}