Qwen3.7: The Agent Frontier

Today we introduce Qwen3.7-Max, our latest proprietary model designed for the agent era. Qwen3.7-Max is built to be a versatile agent foundation — equally capable of writing and debugging code, automating office workflows, and sustaining autonomous execution across hundreds or thousands of steps.

What sets Qwen3.7-Max apart is the breadth and depth of its agent capabilities. It excels as a coding agent, from frontend prototyping to complex multi-file engineering. It serves as a reliable office and productivity assistant through MCP integrations and multi-agent orchestration. It sustains coherent reasoning across extremely long horizons — as demonstrated by a 35-hour, fully autonomous kernel optimization run comprising over 1,000 tool calls. It generalizes across agent scaffolds, performing consistently whether deployed through Claude Code, OpenClaw, Qwen Code, or other frameworks.

Qwen3.7-Max, available soon via Alibaba Cloud Model Studio:

* frontier coding agent: from frontend prototyping to complex software engineering
* office productivity and workflow automation via MCP and multi-agent orchestration
* sustained autonomous execution across long-horizon tasks
* cross-scaffold generalization across diverse agent frameworks

Call via API on Alibaba Cloud Model Studio (coming soon).

Performance

Qwen3_7_Max_Score

table

* Terminal-Bench 2.0: Harbor/Terminus-2 harness; 5h timeout, 12 CPU/24 GB RAM; temp=1.0, top_p=0.95, top_k=20, max_tokens=80K, 256K ctx; avg of 5 runs. All experiments prepend a token at each turn, allowing the model to decide whether to engage extended thinking.
* SWE-Bench Series: Internal agent scaffold (bash + file-edit tools); temp=1.0, top_p=0.95, 200K context window.
* SWE-bench Pro: Problematic tasks corrected and all baselines evaluated on the refined benchmark.
* NL2Repo: Evaluated via Claude Code. Bash commands that attempt to fetch the target repository, such as pip download, pip install, and git clone, are disabled.
* QwenWebDev: Internal front-end code generation benchmark; bilingual (EN/CN), 7 categories; auto-render + multimodal judge; BT/Elo rating.
* QwenClawBench: a real-user-distribution Claw agent benchmark; open-source: https://github.com/SKYLENAGE-AI/QwenClawBench
* CoWorkBench: an internal cowork benchmark; long-horizon tasks across computer science, finance, law, medical, and other productivity domains.
* SkillsBench: Evaluated via OpenCode on 78 tasks (excluding 9 external API-dependent tasks); avg of 5 runs.
* MCP-Mark: GitHub MCP v0.30.3; Playwright responses truncated at 32K tokens.
* MCP-Atlas: Public set score; gemini-2.5-pro as judge.
* VITA-Bench: Avg subdomain scores; claude-4.5-sonnet is used as judge because the older official judges are no longer available.
* KernelBench L3: Metrics reported as median per-problem speedup over the PyTorch eager reference / fraction of problems faster than torch.compile, across 50 problems. Each test sample runs in an isolated Docker container with one H100 80GB GPU, with internet access restricted to the CUTLASS codebase and official CUDA documentation, and is limited to 500 tool calls with early stopping after 100 non-improving turns. GPT-5.4 (xhigh) is used to detect potential hacking behaviors. CUPTI is used for kernel-level timing.
* QwenWorldBench: Internal benchmark for evaluating LLMs as world models for simulating agentic environments; 7 domains (Terminal, SWE, MCP, Search, OS, Android, Web); open-ended 5-dim rubric judge grounded in real-environment feedback.
* Reasoning scenarios: Recommended system prompt: "Reasoning effort is set to xhigh. Please think carefully through the task, validate key assumptions, consider plausible alternatives, and prioritize correctness, consistency, and clarity in the final answer."
* MRCR-v2: Uses the 128K-context subset with 8 needles; evaluation protocol adopted from https://github.com/google-deepmind/eval_hub/tree/master/eval_hub/mrcr_v2
* WMT24++: Harder WMT24 subset; avg scores on 55 langs via XCOMET-XXL.
* MAXIFE: Accuracy on EN + multilingual prompts (23 settings total).
* MMLU-ProX: Avg accuracy across 29 languages.
* Empty cells (--) indicate scores not yet available.
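
The two KernelBench L3 numbers described in the notes above (median per-problem speedup over eager PyTorch, and the fraction of problems faster than torch.compile) could be computed from per-problem timings roughly as follows. This is a minimal sketch rather than the evaluation harness: the `ProblemTiming` record, its field names, and the demo timings are assumptions made up for illustration.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class ProblemTiming:
    """Hypothetical per-problem timing record (milliseconds)."""
    eager_ms: float      # PyTorch eager reference
    compiled_ms: float   # torch.compile baseline
    generated_ms: float  # model-generated kernel


def kernel_metrics(timings):
    """Return (median speedup over eager, fraction faster than torch.compile)."""
    if not timings:
        raise ValueError("need at least one problem")
    speedups = [t.eager_ms / t.generated_ms for t in timings]
    wins = sum(1 for t in timings if t.generated_ms < t.compiled_ms)
    return median(speedups), wins / len(timings)


# Illustrative values only, not benchmark data.
demo = [
    ProblemTiming(eager_ms=10.0, compiled_ms=6.0, generated_ms=5.0),
    ProblemTiming(eager_ms=8.0, compiled_ms=4.0, generated_ms=4.0),
    ProblemTiming(eager_ms=9.0, compiled_ms=5.0, generated_ms=3.0),
]
median_speedup, win_rate = kernel_metrics(demo)
print(f"median speedup {median_speedup:.2f}x, win rate {win_rate:.0%}")
```

Note that a tie with torch.compile does not count as a win in this sketch; the actual harness may treat ties differently.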

In coding agents, Qwen3.7-Max performs strongly on SWE-Pro (60.6), SWE-Multilingual (78.3), SciCode (53.5), and QwenSVG (1608). On Terminal Bench 2.0-Terminus (69.7), it outperforms DS-V4-Pro Max (67.9). On SWE-Verified (80.4), it is on par with Opus-4.6 Max (80.8) and DS-V4-Pro Max (80.6).

In general-purpose agents, improvements are even more pronounced. Qwen3.7-Max performs exceptionally well on MCP-Mark (60.8 vs. GLM-5.1’s 57.5), MCP-Atlas (76.4 vs. Opus-4.6’s 75.8), and SkillsBench (59.2 vs. K2.6’s 56.2), and demonstrates strong GPU kernel optimization capabilities on KernelBench L3 (1.98x median speedup, 96% win rate). It also scores highly on BFCL-V4 (75.0), QwenClawBench (64.3), and ClawEval (65.2), closely approaching Opus-4.6 Max. On the office automation benchmark SpreadSheetBench-v1, it achieves a top-tier score of 87.

In reasoning, Qwen3.7-Max achieves leading results on GPQA Diamond (92.4 vs. Opus-4.6’s 91.3), HLE (41.4 vs. Opus-4.6’s 40), HMMT 2026 Feb (97.1 vs. Opus-4.6’s 96.2), IMOAnswerBench (90 vs. DS-V4-Pro’s 89.8), and Apex (44.5 vs. DS-V4-Pro’s 38.3), demonstrating exceptional strength on the hardest reasoning benchmarks.

In general capabilities and multilingualism, Qwen3.7-Max stands out on IFBench (79.1 vs. DS-V4-Pro’s 77.0), demonstrating precise instruction following. It achieves leading scores on WMT24++ (85.8) and MAXIFE (89.2), confirming top-tier multilingual understanding and translation quality. It also delivers strong results on SuperGPQA (73.6) and QwenWorldBench (57.3).

Notably, these scores are drawn from a wide variety of agent scaffolds. Rather than optimizing for any single framework, Qwen3.7-Max delivers consistently across Claude Code, OpenClaw, Qwen Code, and custom tool-use frameworks, making it a reliable drop-in backbone for any agent system.

Cowork Productivity Assistant

Qwen3.7-Max serves as your advanced coworker for real-world productivity. Its powerful agent capabilities fundamentally streamline professional workflows — synthesizing complex information, performing in-depth data analysis and modeling, and generating publication-ready documents and visualizations — to reliably handle high-complexity enterprise workloads.

Qwen3.7-Max features native compatibility with mainstream agent harnesses. For long-horizon tasks, it supports autonomous planning and continuous execution across multi-hour sessions. Through thousands of tool calls and dozens of refinement iterations, it steadily improves output quality. Complex projects that typically require one to two weeks of specialized team effort can now be completed end-to-end within hours, delivering measurable productivity gains.

Agent Scaling

Building on the environment scaling approach introduced in Qwen3.5, we have continued to aggressively expand both the quality and diversity of agentic training environments in Qwen3.7. Just as language models generalize from diverse pretraining text, we find that agentic capabilities generalize from diverse training environments.

As shown in the figure below, this environment scaling produces a clear and consistent improvement trajectory, with Qwen3.7-Max achieving a top-3 average ranking that approaches Claude-4.6-Opus-Max. Crucially, all benchmarks in our evaluation feature entirely unseen, out-of-domain environments that were never present in training.

We also observe a striking predictability in the scaling behavior: performance gains across any subset of benchmarks are highly consistent and can reliably predict the relative gains on the remaining benchmarks or the overall average, suggesting that environment scaling drives genuine capability generalization rather than benchmark-specific improvement. Further analysis of the scaling dynamics and methodology will be detailed in our upcoming technical report.

agent_scaling

Cross-Harness Generalization

Our rollout environment infrastructure decouples each training instance into three orthogonal components (Task, Harness, and Verifier) that can be freely recombined. We support a wide range of harnesses and their evolving versions, and ground our environments in real-world settings rather than synthetic proxies. This decoupled design enables combinatorial scaling: the same task is paired with diverse harnesses (across types and versions) and verifiers at minimal marginal cost. More critically, it enables cross-harness and cross-verifier RL training, where the model encounters identical tasks under varying harness configurations, forcing it to learn generalizable problem-solving strategies rather than harness-specific shortcuts.

Across QwenClawBench and CoWorkBench, Qwen3.7-Max delivers strong, consistent performance regardless of the harness used at evaluation time, indicating that the model has learned to solve tasks rather than to exploit particular harnesses.

harness_generalization
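
The Task / Harness / Verifier decoupling described above can be pictured as a Cartesian product over three independent pools. The following is a minimal sketch of that idea only, not the actual rollout infrastructure: the `RolloutInstance` class, the `combine` helper, and every component name are hypothetical.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class RolloutInstance:
    """One training instance: a task run under a harness, scored by a verifier."""
    task: str
    harness: str
    verifier: str


def combine(tasks, harnesses, verifiers):
    """Pair every task with every harness and every verifier."""
    return [RolloutInstance(t, h, v) for t, h, v in product(tasks, harnesses, verifiers)]


# Hypothetical component names, for illustration only.
tasks = ["fix-failing-test", "reformat-thesis"]
harnesses = ["harness-a@v1", "harness-a@v2", "harness-b@v1"]
verifiers = ["unit-tests", "rubric-judge"]

instances = combine(tasks, harnesses, verifiers)
print(len(instances))  # 2 * 3 * 2 = 12

# Cross-harness view: the same task under every harness configuration.
same_task = [i for i in instances if i.task == "fix-failing-test"]
print(sorted({i.harness for i in same_task}))
```

Adding one more harness version here grows the pool by `len(tasks) * len(verifiers)` instances without writing any new task, which is the "minimal marginal cost" property. A real system would presumably also filter out pairings that do not make sense, such as a verifier that cannot score a given task.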

Self-Evolving in the Wild

Extend Attention is a production-grade, variable-length multi-head attention operator in SGLang. In our test scenario, it computes attention scores between newly generated tokens and a prefix KV-cache of up to 32K entries with multi-token prediction (MTP). It is a memory-bound, latency-critical kernel in LLM serving. The reference implementation is SGLang’s official Triton implementation.

We tasked Qwen3.7-Max with optimizing this kernel on an ECS instance equipped with T-Head ZW-M890 PPUs — a hardware platform never seen during training. The model had no prior profiling data, no hardware documentation, and no example kernels for this architecture. It started from an empty workspace containing only a task description, the existing SGLang implementation, and an evaluation script.

Over ~35 hours of continuous autonomous execution, the model performed 432 kernel evaluations across 1,158 tool calls. It wrote, compiled, profiled, and iteratively improved the Extend Attention kernel entirely on its own: diagnosing compilation failures, fixing correctness bugs, identifying performance bottlenecks through runtime profiling, and redesigning the kernel architecture multiple times.

The final result: 10.0x geometric mean speedup over the Triton reference, measured across multiple workloads. The optimization trajectory shows sustained, non-trivial progress far beyond the first few hours: the model was still finding meaningful improvements after 30+ hours, demonstrating that long-horizon autonomous optimization is not just feasible but productive.
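
For readers unfamiliar with the metric, a geometric mean speedup across workloads can be computed as sketched below. This is an illustration of the metric only, not the evaluation script used in the run; the `geomean_speedup` helper and the timings are made up.

```python
from statistics import geometric_mean


def geomean_speedup(reference_ms, optimized_ms):
    """Geometric mean of per-workload speedups (reference time / optimized time)."""
    if len(reference_ms) != len(optimized_ms) or not reference_ms:
        raise ValueError("need matching, non-empty timing lists")
    return geometric_mean([r / o for r, o in zip(reference_ms, optimized_ms)])


# Illustrative timings only: per-workload speedups of 4x, 10x, and 25x.
reference = [8.0, 20.0, 50.0]
optimized = [2.0, 2.0, 2.0]
print(round(geomean_speedup(reference, optimized), 2))  # 10.0
```

The geometric mean is the usual choice for averaging ratios because a single outlier workload pulls it less than it would an arithmetic mean (which would be 13.0 for the values above).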

Key structural transitions in the optimization trajectory

We also ran the same task with several other models under identical conditions. GLM 5.1 reached 7.3x; Kimi K2.6 reached 5.0x; DeepSeek V4 Pro reached 3.3x; Qwen3.6-Plus reached 1.1x. Models that stopped early did so because the agent issued no tool calls for five consecutive rounds — the model concluded it could no longer make progress and voluntarily ended the session.
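
The stopping rule described above (end the session after five consecutive rounds with no tool calls) can be sketched as a small loop. This is an assumed reconstruction for illustration: `step` is a hypothetical callable standing in for one agent round, and the `max_rounds` safety cap is not mentioned in the text.

```python
def run_until_idle(step, max_idle_rounds=5, max_rounds=10_000):
    """Call step() once per round; stop after max_idle_rounds consecutive idle rounds.

    step is a hypothetical callable that returns the number of tool calls
    the agent issued in that round. Returns (rounds_run, total_tool_calls).
    """
    idle = 0
    total_calls = 0
    for round_index in range(1, max_rounds + 1):
        calls = step()
        total_calls += calls
        idle = idle + 1 if calls == 0 else 0  # any tool call resets the streak
        if idle >= max_idle_rounds:
            return round_index, total_calls
    return max_rounds, total_calls


# Scripted stand-in for an agent: works, pauses once, then goes quiet.
script = iter([2, 1, 0, 3, 0, 0, 0, 0, 0, 4])
rounds, calls = run_until_idle(lambda: next(script))
print(rounds, calls)  # 9 6
```

The single idle round at position three does not end the session because the streak resets as soon as the agent issues another tool call.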

Beyond its kernel generation results on PPUs, Qwen3.7-Max also generates high-quality, production-grade kernels across a variety of NVIDIA GPUs. For example, on KernelBench L3, Qwen3.7-Max produces accelerated kernels (faster than torch.compile) for 96% of the problems, compared to 98% for Opus-4.6, 78% for GLM 5.1, 80% for Kimi K2.6, 54% for DeepSeek V4 Pro, and 48% for Qwen3.6-Plus.

This result highlights two properties of Qwen3.7-Max as a foundation model powering long-horizon autonomous agents. The first is sustained long-horizon reasoning: the model maintains a coherent optimization strategy across more than a thousand tool calls without losing context or regressing. The second is strong in-context generalization: it produces competitive kernels for an architecture it has never encountered, relying on runtime feedback rather than memorized hardware knowledge.

Reward Hacking Monitoring for Long-Horizon Training

We integrated Qwen3.7-Max into reinforcement learning (RL) monitoring for software engineering (SWE) tasks, building a framework for reward hacking self-monitoring and rule self-evolution. During RL experiments exceeding 80 hours, the model autonomously retrieved and replayed training trajectories, executing over 10,000 calls. The system systematically identified candidate hacking patterns (such as attempts to bypass constraints to access ground-truth answers on GitHub) while performing rule verification, counter-example mining, and iterative optimization.

As a result, Qwen3.7-Max achieved multiple rounds of rule self-evolution, adding 13 new heuristic rules and accurately flagging 1,618 hacking cases. This not only ensured the stability of RL rewards but also facilitated the continuous self-improvement of the model as a sophisticated software engineering agent.

autonomous_hacking_detect
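
To make the idea of heuristic hacking rules concrete, the sketch below flags trajectories whose tool-call commands match simple patterns. It is a toy illustration under stated assumptions, not the monitoring framework itself: the `Rule` class, the regex-over-commands representation, and both example rules are hypothetical, and the actual rules are not described in this post.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """A hypothetical heuristic rule: a name plus a regex over tool-call commands."""
    name: str
    pattern: str


def flag_trajectory(commands, rules):
    """Return the sorted names of rules matched by any command in a trajectory."""
    compiled = [(r.name, re.compile(r.pattern)) for r in rules]
    return sorted({name for cmd in commands for name, rx in compiled if rx.search(cmd)})


# Illustrative rules only.
rules = [
    Rule("fetch-upstream-repo", r"\bgit\s+clone\b.*github\.com"),
    Rule("download-from-github", r"\b(curl|wget)\b.*github\.com"),
]

trajectory = [
    "pytest tests/test_parser.py",
    "curl -sL https://github.com/example/project/pull/1.diff",
]
print(flag_trajectory(trajectory, rules))     # ['download-from-github']
print(flag_trajectory(["pytest -q"], rules))  # []
```

In a setup like this, rule verification and counter-example mining would amount to replaying known-clean and known-hacked trajectories through `flag_trajectory` and tightening any pattern that produces false positives or misses.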

Long-Horizon Planning and Execution in Startup Management

Within the framework of Dynamic Cumulative Survival Games, we have scaled the temporal complexity of training tasks to specifically reinforce long-horizon planning and execution capabilities. This advancement enhances the agent’s policy consistency throughout sequential decision-making trajectories exceeding a thousand steps, enabling it to continuously construct hypotheses, dynamically adjust strategies based on environmental feedback, and accumulate long-term experience and memory. Consequently, the agent maintains a stable execution cadence over vast time horizons, remaining resilient to the common pitfalls of context rot and instruction drift.

In YC-Bench, a benchmark simulating the full year-long lifecycle of a startup, the agent must navigate hundreds of decision-making rounds ranging from personnel management and contract screening to malicious client identification, all while maintaining a profit margin against rising labor costs. Qwen3.7-Max completed 237 tasks and achieved a total revenue of 2.08M USD, roughly double that of Qwen3.6-Plus (1.05M USD) and 5.9 times that of Qwen3.5-Plus (352K USD). Beyond the metrics, the model demonstrated strategic evolution across context windows: it actively explored potential clients, identified and blacklisted malicious traps, prioritized reliable revenue streams, and autonomously recovered from mid-term crises, eventually converging on a stable, high-efficiency execution loop.
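
The revenue ratios quoted above can be checked directly from the rounded figures in the text. The short sketch below only reproduces that arithmetic; the dictionary is a convenience for the example.

```python
# Revenue figures as quoted in the text (rounded, in USD).
revenue_usd = {
    "Qwen3.7-Max": 2_080_000,
    "Qwen3.6-Plus": 1_050_000,
    "Qwen3.5-Plus": 352_000,
}

best = revenue_usd["Qwen3.7-Max"]
ratios = {
    name: best / revenue_usd[name]
    for name in ("Qwen3.6-Plus", "Qwen3.5-Plus")
}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
# Qwen3.6-Plus: 1.98x
# Qwen3.5-Plus: 5.91x
```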

Build with Qwen3.7

Qwen3.7-Max will be available soon through Alibaba Cloud Model Studio. You can integrate it with popular agent frameworks and coding assistants.

API Usage

Qwen3.7-Max supports the preserve_thinking option, which keeps the thinking content from all preceding turns in the messages. It is recommended for agentic tasks.
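
As a rough illustration of a multi-turn request with preserve_thinking turned on, the sketch below assembles the keyword arguments for an OpenAI-compatible chat completions call without sending anything. The `build_request` helper is hypothetical. Passing the flag through `extra_body` mirrors the commented-out line in the Model Studio example in this post. Carrying earlier thinking in a `reasoning_content` field on assistant messages is an assumption, so check the API doc for the exact message format.

```python
def build_request(history, user_message, preserve_thinking=True):
    """Build keyword arguments for client.chat.completions.create().

    history is a list of earlier messages. Keeping a "reasoning_content"
    field on assistant turns is an ASSUMPTION about how prior thinking is
    passed back to the API; it is not confirmed by this post.
    """
    messages = list(history) + [{"role": "user", "content": user_message}]
    return {
        "model": "qwen3.7-max",
        "messages": messages,
        "extra_body": {
            "enable_thinking": True,
            "preserve_thinking": preserve_thinking,
        },
    }


# Made-up conversation for illustration.
history = [
    {"role": "user", "content": "List the files in the repo."},
    {
        "role": "assistant",
        "content": "I found 3 files.",
        "reasoning_content": "I should call the file-listing tool first.",
    },
]
request = build_request(history, "Now open the largest one.")
print(request["extra_body"])
print(len(request["messages"]))  # 3
```

The returned dictionary could then be unpacked into the client call, for example `client.chat.completions.create(**request)`.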

Alibaba Cloud Model Studio

Alibaba Cloud Model Studio supports industry-standard protocols, including chat completions and responses APIs compatible with OpenAI’s specification, as well as an API interface compatible with Anthropic.

"""
Environment variables:
  DASHSCOPE_API_KEY: Your API Key from https://modelstudio.console.alibabacloud.com
  DASHSCOPE_BASE_URL: (optional) Base URL for compatible-mode API.
    - Beijing: https://dashscope.aliyuncs.com/compatible-mode/v1
    - Singapore: https://dashscope-intl.aliyuncs.com/compatible-mode/v1
    - US (Virginia): https://dashscope-us.aliyuncs.com/compatible-mode/v1
"""
from openai import OpenAI
import os

api_key = os.environ.get("DASHSCOPE_API_KEY")
if not api_key:
    raise ValueError(
        "DASHSCOPE_API_KEY is required. "
        "Set it via: export DASHSCOPE_API_KEY='your-api-key'"
    )

client = OpenAI(
    api_key=api_key,
    base_url=os.environ.get(
        "DASHSCOPE_BASE_URL",
        "https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
    ),
)

messages = [{"role": "user", "content": "Write a Python function to merge two sorted linked lists."}]

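completion = client.chat.completions.create(
    model="qwen3.7-max",
    messages=messages,
    extra_body={
        "enable_thinking": True,
        # "preserve_thinking": True,
    },
    stream=True,
    # Ask for a final usage-only chunk (empty choices), which the loop below prints.
    stream_options={"include_usage": True},
)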

reasoning_content = ""
answer_content = ""
is_answering = False
print("\n" + "=" * 20 + "Reasoning" + "=" * 20 + "\n")

for chunk in completion:
    # A chunk with no choices carries only usage statistics.
    if not chunk.choices:
        print("\nUsage:")
        print(chunk.usage)
        continue

    delta = chunk.choices[0].delta

    # Thinking tokens arrive in reasoning_content, before the answer starts.
    if hasattr(delta, "reasoning_content") and delta.reasoning_content is not None:
        if not is_answering:
            print(delta.reasoning_content, end="", flush=True)
        reasoning_content += delta.reasoning_content

    # Answer tokens arrive in content; print the header once, on the first one.
    if hasattr(delta, "content") and delta.content:
        if not is_answering:
            print("\n" + "=" * 20 + "Answer" + "=" * 20 + "\n")
            is_answering = True
        print(delta.content, end="", flush=True)
        answer_content += delta.content

For more information, please visit the API doc.

Frontend Coding

Qwen3.7-Max can generate rich interactive web applications from a single prompt — including Three.js 3D scenes, Canvas animations, full page layouts, and dynamic SVG.

Office Assistant

Qwen3.7-Max can act as an intelligent office assistant through tool integration. In this example, it reads a university thesis formatting specification and automatically reformats a messy draft — fixing page layout, heading styles, fonts, margins, table of contents, and reference formatting — all through autonomous office-cli tool calls. (The sample thesis is AI-generated for demonstration purposes.)

LLM-Powered Physical-World Navigation Agent

One more thing: Qwen3.7-Max can now operate a robot dog through tool calls, performing physical understanding, planning, memory, and decision-making in physical environments. It is powered by our robotics agent harness Qwen-RobotClaw, the navigation foundation model Qwen-RobotNav, and several vision tools built on the Qwen-plus model. In the demo below, the left panel shows the agent’s tool-call interaction flow over a 20-minute run in the physical world, the center shows the quadruped robot’s first-person view along its trajectory, and the right shows the agent’s long-term memory.

Coding Assistants

Qwen3.7-Max integrates seamlessly with popular agent frameworks and coding assistants:

Claude Code

Qwen APIs support the Anthropic API protocol, enabling direct use with Claude Code:

npm install -g @anthropic-ai/claude-code

export ANTHROPIC_MODEL="qwen3.7-max"
export ANTHROPIC_SMALL_FAST_MODEL="qwen3.7-max"
export ANTHROPIC_BASE_URL=https://dashscope-intl.aliyuncs.com/apps/anthropic
export ANTHROPIC_AUTH_TOKEN=<your_api_key>

claude

OpenClaw

Connect to OpenClaw via Model Studio:

curl -fsSL https://molt.bot/install.sh | bash
export DASHSCOPE_API_KEY=<your_api_key>
openclaw dashboard

Configure ~/.openclaw/openclaw.json:

{
  "models": {
    "mode": "merge",
    "providers": {
      "modelstudio": {
        "baseUrl": "https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
        "apiKey": "DASHSCOPE_API_KEY",
        "api": "openai-completions",
        "models": [
          {
            "id": "qwen3.7-max",
            "name": "qwen3.7-max",
            "reasoning": true,
            "input": ["text"],
            "contextWindow": 1000000,
            "maxTokens": 65536
          }
        ]
      }
    }
  },
  "agents": {
    "defaults": {
      "model": {
        "primary": "modelstudio/qwen3.7-max"
      }
    }
  }
}
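
Before launching, it can help to check that the primary model in the config actually points at a declared model. The sketch below does this for a trimmed copy of the config above. It assumes the `primary` value follows a `provider/model id` pattern, which is inferred from the example rather than taken from OpenClaw documentation, and `primary_model_is_defined` is a hypothetical helper.

```python
import json

# Trimmed copy of the config above, embedded so the sketch is self-contained.
CONFIG_TEXT = """
{
  "models": {
    "providers": {
      "modelstudio": {
        "models": [{"id": "qwen3.7-max"}]
      }
    }
  },
  "agents": {"defaults": {"model": {"primary": "modelstudio/qwen3.7-max"}}}
}
"""


def primary_model_is_defined(config):
    """Check that agents.defaults.model.primary names a declared provider and model id."""
    primary = config["agents"]["defaults"]["model"]["primary"]
    provider_name, _, model_id = primary.partition("/")
    provider = config["models"]["providers"].get(provider_name, {})
    return any(m.get("id") == model_id for m in provider.get("models", []))


config = json.loads(CONFIG_TEXT)
print(primary_model_is_defined(config))  # True
```

To check a real file, replace `json.loads(CONFIG_TEXT)` with `json.load(open(path))` for the path to your openclaw.json.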

Qwen Code

Qwen Code is deeply optimized for the Qwen series:

npm install -g @qwen-code/qwen-code@latest
qwen

Summary

Qwen3.7-Max is our most versatile and capable model for agent-driven workflows. From coding and office automation to long-horizon autonomous tasks, it combines frontier-level reasoning with robust cross-scaffold generalization and the ability to sustain productive execution over extended periods — providing a powerful foundation for building the next generation of AI agents. We welcome community feedback and look forward to seeing what you build.

Citation

@misc{qwen37,
    title = {{Qwen3.7}: The Agent Frontier},
    url = {https://qwen.ai/blog?id=qwen3.7},
    author = {{Qwen Team}},
    month = {May},
    year = {2026}
}
