
Over the past few years, multimodal large language models have become increasingly capable of understanding images, videos, and real-world scenes. They can recognize objects, reason about spatial relationships, answer visual questions, and solve complex multimodal reasoning tasks.
But for embodied intelligence, understanding the world is only the first step. A truly embodied agent also needs to understand task goals, take actions in the physical world, and generalize across different robot embodiments, environments, and tasks.
This is the motivation behind Qwen-VLA.
Qwen-VLA is a general-purpose Vision-Language-Action model. Built upon the Qwen multimodal backbone, it extends visual perception, language understanding, and spatial reasoning into continuous action generation and trajectory prediction. In other words, it allows the model to not only see and think, but also begin to act.
Traditional embodied AI systems are often highly specialized: one model for tabletop manipulation, another for navigation, and yet another for a specific robot platform. This approach can work well for individual tasks, but it does not scale easily to broader tasks, diverse environments, or different robot embodiments.
Qwen-VLA explores a more unified direction:
Can a single generalist policy model support robotic manipulation, vision-language navigation, and cross-embodiment control at the same time?
In Qwen-VLA, robotic manipulation and vision-language navigation are formulated under the same framework: given visual observations, language instructions, and embodiment-specific conditions, the model predicts the next action or trajectory. The Qwen multimodal backbone understands the visual and language inputs, while an action decoder generates continuous actions.
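To make this shared formulation concrete, here is a minimal sketch of what such an interface could look like. It is not the Qwen-VLA API: `Observation`, `ZeroPolicy`, `run_episode`, the embodiment tags, and the action dimensions are all hypothetical names and values chosen for illustration, and the placeholder policy simply returns zeros where a real model would run the backbone and action decoder.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Observation:
    """One policy input. All field names are hypothetical."""
    images: List[bytes]   # encoded camera frames, one per view
    instruction: str      # natural-language manipulation or navigation goal
    embodiment: str       # embodiment condition, e.g. "aloha_dual_arm"


class ZeroPolicy:
    """Placeholder for a VLA policy: returns all-zero action chunks."""

    # Assumed action sizes per embodiment, for illustration only.
    ACTION_DIMS = {"aloha_dual_arm": 14, "nav_base": 3}

    def __init__(self, horizon: int = 4) -> None:
        self.horizon = horizon

    def predict(self, obs: Observation) -> List[List[float]]:
        """Return a chunk of continuous actions: [horizon][action_dim]."""
        dim = self.ACTION_DIMS[obs.embodiment]
        return [[0.0] * dim for _ in range(self.horizon)]


StepFn = Callable[[Observation, List[float]], Tuple[Observation, bool]]


def run_episode(policy: ZeroPolicy, env_step: StepFn,
                obs: Observation, max_steps: int = 8) -> int:
    """Closed loop: execute a chunk, then re-query with the new observation."""
    executed = 0
    while executed < max_steps:
        for action in policy.predict(obs):
            obs, done = env_step(obs, action)
            executed += 1
            if done or executed >= max_steps:
                return executed
    return executed


# The same call shape serves manipulation and navigation; only the
# instruction and the embodiment condition change.
policy = ZeroPolicy(horizon=4)
arm_obs = Observation([b""], "pick up the green ball", "aloha_dual_arm")
nav_obs = Observation([b""], "walk past the sofa and stop at the door", "nav_base")
steps = run_episode(policy, lambda o, a: (o, False), arm_obs, max_steps=8)
```

The design point is that the task type never appears in the function signature: a manipulation episode and a navigation episode differ only in the data passed in.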

The core of Qwen-VLA is not simply attaching an action head to a multimodal model. More importantly, it builds a joint training system that covers diverse tasks, environments, and robot embodiments. The full training pipeline progresses through four stages, from language priors to closed-loop control.
The pretraining data spans five major sources.
The key idea: first learn to generate action structures from language, then learn to adapt those actions to the visual environment.

The experimental results show the potential of Qwen-VLA as a generalist policy model. A single model can cover multiple manipulation benchmarks, including LIBERO, Simpler, RoboCasa, and RoboTwin, while approaching or surpassing specialized policy models on several tasks.
| Benchmark | Best Specialist Model | Qwen-VLA |
|---|---|---|
| LIBERO | ABot-M0 98.6% | 97.9% |
| RoboCasa-GR1 | ABot-M0 58.3% | 56.7% |
| Simpler-WidowX | StarVLA-OFT 64.6% | 73.7% |
| RoboTwin-Easy / Hard | ABot-M0 86.0% / 85.0% | 86.1% / 87.2% |
On robotic manipulation benchmarks, Qwen-VLA-Instruct achieves 97.9% on LIBERO, 73.7% on Simpler-WidowX, and 86.1% / 87.2% on RoboTwin-Easy / Hard. Many of the compared methods are specialist models fine-tuned for individual benchmarks, while Qwen-VLA is a unified generalist model trained under a single framework.
On vision-language navigation in continuous environments (VLN-CE), Qwen-VLA-Instruct achieves 69.0% Oracle Success Rate (OSR) and 57.5% Success Rate (SR) on R2R Val-Unseen, and 59.6% SR and 47.8% SPL (Success weighted by Path Length) on the more challenging RxR Val-Unseen, surpassing all open-source baselines.
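For readers less familiar with the navigation metrics, the sketch below shows how SR, Oracle SR, and SPL are conventionally computed over a set of episodes. It follows the standard definitions of these metrics and is not the evaluation code behind the numbers above; the `Episode` fields and the toy episodes are made up for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Episode:
    """Per-episode navigation outcome (hypothetical field names)."""
    success: bool          # agent stopped within the success radius of the goal
    oracle_success: bool   # any point on the trajectory was within the radius
    shortest_path: float   # shortest start-to-goal distance, in meters
    path_length: float     # distance the agent actually traveled, in meters


def success_rate(episodes: Sequence[Episode]) -> float:
    return sum(e.success for e in episodes) / len(episodes)


def oracle_success_rate(episodes: Sequence[Episode]) -> float:
    return sum(e.oracle_success for e in episodes) / len(episodes)


def spl(episodes: Sequence[Episode]) -> float:
    """Success weighted by Path Length: successes are discounted by detours."""
    total = 0.0
    for e in episodes:
        if e.success:
            total += e.shortest_path / max(e.path_length, e.shortest_path)
    return total / len(episodes)


# Toy data: two successes (one with a 2x detour) and two failures.
episodes = [
    Episode(True, True, 10.0, 10.0),
    Episode(True, True, 10.0, 20.0),
    Episode(False, True, 10.0, 15.0),
    Episode(False, False, 10.0, 5.0),
]
print(success_rate(episodes), oracle_success_rate(episodes), spl(episodes))
```

SPL can never exceed SR, because each successful episode contributes at most 1. This is why the reported SPL sits below the reported SR on the same split.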
In real-world ALOHA dual-arm experiments, the pretrained Qwen-VLA model achieves 83.6% average in-domain success and 76.9% average out-of-distribution (OOD) success, substantially outperforming training from scratch (48.5% / 36.2%) and π0.5 (71.6% / 41.5%).
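One way to read these numbers is to look at how much success each policy loses when moving from in-domain to OOD settings. The short snippet below does that arithmetic using only the figures quoted above; the dictionary keys and the `ood_gap` helper are illustrative names.

```python
# (in-domain success %, OOD success %) as reported in the paragraph above.
results = {
    "Qwen-VLA pretrained": (83.6, 76.9),
    "From scratch": (48.5, 36.2),
    "pi0.5": (71.6, 41.5),
}


def ood_gap(in_domain: float, ood: float) -> float:
    """Percentage points of success lost when moving to OOD settings."""
    return round(in_domain - ood, 1)


gaps = {name: ood_gap(ind, ood) for name, (ind, ood) in results.items()}
for name, gap in gaps.items():
    print(f"{name}: {gap} points lost from in-domain to OOD")
```

The gaps come out to 6.7 points for the pretrained Qwen-VLA model, 12.3 for training from scratch, and 30.1 for π0.5, so the pretrained model has both the highest absolute success and the smallest drop under distribution shift.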
We also care about how Qwen-VLA generalizes on real robots.
In real-world ALOHA dual-arm robot experiments, Qwen-VLA demonstrates generalization to unseen colors, objects, backgrounds, positions, and language instructions. Compared with policies trained from scratch, models pretrained with Qwen-VLA show clear improvements under real-world out-of-distribution settings.
This part is best shown through videos. The following demonstrations are tested with the Qwen-VLA-Base model. When asked to “pick up the green ball” or “pick up the blue ball,” the model can correctly act based on color-specific instructions. When presented with unseen objects such as toys, vegetables, or sunglasses, it can still follow language commands to grasp or move them. When the background, lighting, and tabletop layout change, the model remains relatively stable. For compositional tasks such as “tidy up the table,” it can identify multiple targets and execute multi-step operations.
Compared with tables alone, these videos better illustrate the core value of Qwen-VLA:
The model is not merely memorizing action templates in a fixed environment. It is learning to understand goals and act under real-world variations.
Beyond static tabletop manipulation, Qwen-VLA also shows zero-shot generalization in dynamic manipulation tasks.
On the DOMINO dynamic manipulation benchmark, Qwen-VLA-Instruct is not specifically fine-tuned for the benchmark, yet it still achieves a 26.6% success rate and a 39.5 manipulation score, outperforming a range of standard VLA baselines and even some specialist models for dynamic manipulation.
This suggests that the model is not only learning grasping templates in static scenes, but also acquiring a more transferable action prior from spatial understanding to motion control. Given visual observations, language goals, and its action generation capability, the model can directly produce coherent action sequences and complete tasks within dynamic interaction windows.
Qwen-VLA is a natural extension of Qwen’s multimodal capabilities toward embodied intelligence.
In the past, multimodal models mainly focused on understanding the world. With Qwen-VLA, we further explore how models can generate actions in the physical world based on vision and language.
Qwen-VLA unifies robotic manipulation, vision-language navigation, and cross-embodiment control. It connects Qwen’s visual understanding and spatial reasoning capabilities to continuous action generation. Through joint pretraining on real robot data, human egocentric data, synthetic simulation data, and general vision-language data, it learns more general embodied experience. It also demonstrates the potential of generalist policy models across manipulation benchmarks, real-world out-of-distribution generalization, and zero-shot dynamic manipulation.
Embodied intelligence is still at an early stage. Long-horizon real-world tasks, failure recovery, continual learning, and more complex human-robot-environment interactions remain challenging. But Qwen-VLA points to a clear next step:
Models should not only understand the world — they should also learn to act in it.
@article{qwenvla,
title={Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments},
author={Qwen Team},
year={2026},
eprint={2605.30280},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2605.30280},
}