Qwen-VLA: From Understanding the World to Acting in It

This article introduces Qwen-VLA, a general-purpose Vision-Language-Action model that extends multimodal perception and reasoning into continuous action generation for embodied intelligence.

Over the past few years, multimodal large language models have become increasingly capable of understanding images, videos, and real-world scenes. They can recognize objects, reason about spatial relationships, answer visual questions, and solve complex multimodal reasoning tasks.

But for embodied intelligence, understanding the world is only the first step. A truly embodied agent also needs to understand task goals, take actions in the physical world, and generalize across different robot embodiments, environments, and tasks.

This is the motivation behind Qwen-VLA.

Qwen-VLA is a general-purpose Vision-Language-Action model. Built upon the Qwen multimodal backbone, it extends visual perception, language understanding, and spatial reasoning into continuous action generation and trajectory prediction. In other words, it allows the model to not only see and think, but also begin to act.

One Model for Multiple Embodied Tasks

Traditional embodied AI systems are often highly specialized: one model for tabletop manipulation, another for navigation, and yet another for a specific robot platform. This approach can work well for individual tasks, but it does not scale easily to broader tasks, diverse environments, or different robot embodiments.

Qwen-VLA explores a more unified direction:

Can a single generalist policy model support robotic manipulation, vision-language navigation, and cross-embodiment control at the same time?

In Qwen-VLA, robotic manipulation and vision-language navigation are formulated under the same framework: given visual observations, language instructions, and embodiment-specific conditions, the model predicts the next action or trajectory. The Qwen multimodal backbone understands the visual and language inputs, while an action decoder generates continuous actions.
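The shared formulation can be sketched as a single entry point that takes observations, an instruction, and an embodiment condition, and returns a chunk of continuous actions. This is a minimal illustration, not the Qwen-VLA interface: every name here (PolicyInput, backbone, action_decoder, predict) is hypothetical, the two functions are toy stand-ins for the real networks, and the action dimensions (7 and 3) and the horizon of 4 are assumed values.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PolicyInput:
    """One request to a generalist policy (all field names are hypothetical)."""
    frames: List[str]   # placeholders for camera images
    instruction: str    # natural-language goal
    embodiment: str     # embodiment-specific condition, e.g. a robot tag
    action_dim: int     # length of one action vector for this embodiment (assumed)


def backbone(inp: PolicyInput) -> List[float]:
    """Toy stand-in for the multimodal backbone: maps the inputs to a feature vector."""
    text = f"{inp.embodiment}|{inp.instruction}|{len(inp.frames)}"
    return [(ord(ch) % 17) / 17.0 for ch in text[:8].ljust(8)]


def action_decoder(features: List[float], action_dim: int, horizon: int) -> List[List[float]]:
    """Toy stand-in for the action decoder: emits `horizon` continuous action vectors."""
    mean = sum(features) / len(features)
    return [[mean * (t + 1) / horizon for _ in range(action_dim)] for t in range(horizon)]


def predict(inp: PolicyInput, horizon: int = 4) -> List[List[float]]:
    """Same entry point for every task; only the inputs and action_dim differ."""
    return action_decoder(backbone(inp), inp.action_dim, horizon)


manipulation = PolicyInput(["wrist.png", "front.png"], "pick up the red cup", "single_arm", 7)
navigation = PolicyInput(["egocentric.png"], "walk to the kitchen door", "mobile_base", 3)

arm_chunk = predict(manipulation)  # 4 steps, 7 values each
nav_chunk = predict(navigation)    # 4 steps, 3 values each
print(len(arm_chunk), len(arm_chunk[0]), len(nav_chunk), len(nav_chunk[0]))
```

The point of the sketch is the call shape: manipulation and navigation differ only in what is passed in and in the size of the action vector, not in the function that is called.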

Training: From Language Priors to Closed-Loop Control

The core of Qwen-VLA is not simply attaching an action head to a multimodal model. More importantly, it builds a joint training system that covers diverse tasks, environments, and robot embodiments. The full training pipeline progresses through four stages, from language priors to closed-loop control.

Data

The pretraining data spans five major sources:

Robot manipulation trajectories form the foundation, covering tabletop, mobile, dual-arm, and dexterous manipulation. The public data totals over 10,000 hours, supplemented by more than 1,000 hours of internal real-robot trajectories and over 8 million synthetic simulation trajectories.
Human egocentric data provides richer object, scene, and hand-action priors from open-world environments. We incorporate Ego4D, EPIC-KITCHENS, EgoDex (829 hours), EgoVerse (1,300+ hours, 1,965 tasks, 240 scenes), and Xperience.
Synthetic simulation data fills long-tail gaps. Vision-conditioned data covers 20 tabletop scenes, 200 configurations, 450 tasks, and 359,848 successful trajectories. Text-to-action data spans 6 templates × 6 single-arm robots, yielding about 7.2 million trajectories and over 14,000 hours.
Vision-language navigation data provides long-horizon trajectory planning and instruction-following capabilities.
General vision-language data preserves multimodal understanding, spatial grounding, and instruction following. We also build around 48,000 fine-grained action descriptions annotated across 13 dimensions, aligning natural language with concrete execution details.
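As a quick sanity check on the synthetic-data figures quoted above, the arithmetic below derives a few per-unit averages. The input numbers come from the text; the even split across template-robot combinations is an assumption made only for illustration, and because the text says "about" and "over", the results are rough estimates.

```python
# Figures quoted in the text above.
templates, robots = 6, 6
t2a_trajectories = 7_200_000   # "about 7.2 million"
t2a_hours = 14_000             # "over 14,000 hours"
vision_trajectories = 359_848
vision_tasks = 450

combos = templates * robots
per_combo = t2a_trajectories / combos              # assumes an even split (not stated)
avg_seconds = t2a_hours * 3600 / t2a_trajectories  # mean trajectory duration
per_task = vision_trajectories / vision_tasks      # mean successes per task

print(f"{combos} template-robot combinations")
print(f"~{per_combo:,.0f} trajectories per combination if evenly split")
print(f"~{avg_seconds:.1f} s per text-to-action trajectory")
print(f"~{per_task:.0f} successful trajectories per vision-conditioned task")
```

Under these assumptions, a text-to-action trajectory averages roughly seven seconds, which is consistent with short, single-instruction motions.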

Four-Stage Training

The key idea: first learn to generate action structures from language, then learn to adapt those actions to the visual environment.

Stage I: T2A (Text-to-Action Pretraining). An instruction like “pick up the red cup” is just a few words, but the corresponding robot action is a high-dimensional continuous trajectory. Qwen-VLA treats this as a form of decompression from language to action. In T2A, we freeze the VLM and train only the action decoder on language and embodiment prompts without any images.
Stage II: CPT (Continual Pretraining). We unfreeze both the VLM and action decoder and jointly train on the full multimodal data mixture. This stage grounds the language-action priors from T2A in concrete visual scenes while adapting the backbone to embodied perception, producing Qwen-VLA-Base.
Stage III: SFT (Supervised Fine-Tuning). Starting from the CPT checkpoint, we branch into two tracks: multi-task SFT jointly fine-tunes on manipulation, navigation, VQA, and spatial grounding; real-robot SFT fine-tunes on in-house teleoperation data for physical deployment.
Stage IV: RL (Reinforcement Learning). Starting from the SFT checkpoint, we use PPO to directly optimize closed-loop task success in simulation, producing the final model Qwen-VLA-Instruct. RL is conducted only in SimplerEnv, yet experiments show its gains transfer to unseen environments and robot embodiments.
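The four stages above can be summarized as a small schedule of which modules are trainable and which checkpoint each stage produces. This is a bookkeeping sketch only: the Stage class, the module keys "vlm" and "action_decoder", and apply_freezing are hypothetical names, and None marks settings the text does not specify (for example, which parameters are updated during SFT and RL).

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Stage:
    name: str
    starts_from: Optional[str]     # preceding stage whose weights are reused
    train_vlm: Optional[bool]      # None means the text does not say
    train_decoder: Optional[bool]
    produces: Optional[str]        # named checkpoint, if the text gives one


PIPELINE = [
    Stage("T2A", None, train_vlm=False, train_decoder=True, produces=None),
    Stage("CPT", "T2A", train_vlm=True, train_decoder=True, produces="Qwen-VLA-Base"),
    Stage("SFT", "CPT", train_vlm=None, train_decoder=None, produces=None),
    Stage("RL", "SFT", train_vlm=None, train_decoder=None, produces="Qwen-VLA-Instruct"),
]


def apply_freezing(stage: Stage, requires_grad: Dict[str, bool]) -> Dict[str, bool]:
    """Return updated trainable flags; unspecified settings keep their previous value."""
    updated = dict(requires_grad)
    if stage.train_vlm is not None:
        updated["vlm"] = stage.train_vlm
    if stage.train_decoder is not None:
        updated["action_decoder"] = stage.train_decoder
    return updated


flags = {"vlm": True, "action_decoder": True}
history = {}
for stage in PIPELINE:
    flags = apply_freezing(stage, flags)
    history[stage.name] = dict(flags)
    print(stage.name, flags, "->", stage.produces)
```

In a real training loop, the same flags would typically drive something like per-parameter gradient toggles; that wiring is omitted here because the text does not describe it.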

Performance

A Single Generalist Model Can Match or Even Surpass Specialist Models

The experimental results show the potential of Qwen-VLA as a generalist policy model. A single model can cover multiple manipulation benchmarks, including LIBERO, Simpler, RoboCasa, and RoboTwin, while approaching or surpassing specialized policy models on several tasks.

Benchmark	Best Specialist Model	Qwen-VLA
LIBERO	ABot-M0 98.6%	97.9%
RoboCasa-GR1	ABot-M0 58.3%	56.7%
Simpler-WidowX	StarVLA-OFT 64.6%	73.7%
RoboTwin-Easy / Hard	ABot-M0 86.0% / 85.0%	86.1% / 87.2%

On robotic manipulation benchmarks, Qwen-VLA-Instruct achieves 97.9% on LIBERO, 73.7% on Simpler-WidowX, and 86.1% / 87.2% on RoboTwin-Easy / Hard. Many of the compared methods are specialist models fine-tuned for individual benchmarks, while Qwen-VLA is a unified generalist model trained under a single framework.
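To make "approaching or surpassing" concrete, the snippet below computes the gap in percentage points between Qwen-VLA and the best specialist for each row of the table. The scores are copied from the table; the variable names are illustrative.

```python
# (benchmark, best specialist, specialist score, Qwen-VLA score), copied from the table.
ROWS = [
    ("LIBERO", "ABot-M0", 98.6, 97.9),
    ("RoboCasa-GR1", "ABot-M0", 58.3, 56.7),
    ("Simpler-WidowX", "StarVLA-OFT", 64.6, 73.7),
    ("RoboTwin-Easy", "ABot-M0", 86.0, 86.1),
    ("RoboTwin-Hard", "ABot-M0", 85.0, 87.2),
]

# Positive values mean Qwen-VLA is ahead of the specialist.
deltas = {name: round(ours - best, 1) for name, _, best, ours in ROWS}
ahead = [name for name, d in deltas.items() if d > 0]

for name, d in deltas.items():
    print(f"{name:<15} {d:+.1f} pts")
```

Qwen-VLA leads on Simpler-WidowX (+9.1) and both RoboTwin settings (+0.1 and +2.2), and trails by 0.7 points on LIBERO and 1.6 points on RoboCasa-GR1.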

On vision-language navigation in continuous environments (VLN-CE), Qwen-VLA-Instruct achieves a 69.0% Oracle Success Rate (OSR) and a 57.5% Success Rate (SR) on R2R Val-Unseen, and a 59.6% SR and 47.8% SPL (Success weighted by Path Length) on the more challenging RxR Val-Unseen, surpassing all open-source baselines.
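For readers unfamiliar with these navigation metrics, here is a minimal sketch of how they are commonly computed. It follows the standard definitions, not code from the Qwen-VLA evaluation: Success Rate counts episodes where the agent stops within a distance threshold of the goal, Oracle Success Rate counts episodes where any point on the path comes within that threshold, and SPL weights each success by the ratio of the shortest path length to the path actually taken. The Episode class, the toy episodes, and the 3.0 m default threshold are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Episode:
    final_dist: float     # distance to goal where the agent stopped (m)
    min_dist: float       # closest distance to goal along the path (m)
    shortest_path: float  # shortest start-to-goal path length (m)
    agent_path: float     # length of the path the agent actually took (m)


def evaluate(episodes: List[Episode], threshold: float = 3.0) -> Dict[str, float]:
    """Compute SR, OSR, and SPL over a list of episodes (threshold is an assumed default)."""
    n = len(episodes)
    success = [ep.final_dist <= threshold for ep in episodes]
    oracle = [ep.min_dist <= threshold for ep in episodes]
    # SPL term: shortest / max(taken, shortest) for successes, 0 for failures.
    spl_terms = [
        ep.shortest_path / max(ep.agent_path, ep.shortest_path) if ok else 0.0
        for ep, ok in zip(episodes, success)
    ]
    return {"SR": sum(success) / n, "OSR": sum(oracle) / n, "SPL": sum(spl_terms) / n}


toy = [
    Episode(final_dist=1.0, min_dist=0.5, shortest_path=10.0, agent_path=12.5),  # success, detour
    Episode(final_dist=5.0, min_dist=2.0, shortest_path=8.0, agent_path=16.0),   # passed the goal
    Episode(final_dist=6.0, min_dist=4.0, shortest_path=10.0, agent_path=5.0),   # never got close
    Episode(final_dist=2.0, min_dist=2.0, shortest_path=10.0, agent_path=8.0),   # success
]
metrics = evaluate(toy)
print(metrics)
```

Because SPL multiplies success by a path-efficiency ratio that is at most 1, it can never exceed SR on the same set of episodes, and OSR can never fall below SR.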

In real-world ALOHA dual-arm experiments, the pretrained Qwen-VLA model achieves 83.6% average in-domain success and 76.9% average out-of-distribution (OOD) success, substantially outperforming training from scratch (48.5% / 36.2%) and π0.5 (71.6% / 41.5%).

Real-World Out-of-Distribution Generalization

We also care about how Qwen-VLA generalizes on real robots.

In real-world ALOHA dual-arm robot experiments, Qwen-VLA demonstrates generalization to unseen colors, objects, backgrounds, positions, and language instructions. Compared with policies trained from scratch, models pretrained with Qwen-VLA show clear improvements under real-world out-of-distribution settings.

This part is best shown through videos. The following demonstrations are tested with the Qwen-VLA-Base model. When asked to “pick up the green ball” or “pick up the blue ball,” the model can correctly act based on color-specific instructions. When presented with unseen objects such as toys, vegetables, or sunglasses, it can still follow language commands to grasp or move them. When the background, lighting, and tabletop layout change, the model remains relatively stable. For compositional tasks such as “tidy up the table,” it can identify multiple targets and execute multi-step operations.

Compared with tables alone, these videos better illustrate the core value of Qwen-VLA:

The model is not merely memorizing action templates in a fixed environment. It is learning to understand goals and act under real-world variations.

Zero-Shot Generalization in Dynamic Scenes

Beyond static tabletop manipulation, Qwen-VLA also shows zero-shot generalization in dynamic manipulation tasks.

On the DOMINO dynamic manipulation benchmark, Qwen-VLA-Instruct is not specifically fine-tuned for the benchmark, yet it still achieves a 26.6% success rate and a 39.5 manipulation score, outperforming a range of standard VLA baselines and even some specialist models for dynamic manipulation.

This suggests that the model is not only learning grasping templates in static scenes, but also acquiring a more transferable action prior from spatial understanding to motion control. Given visual observations, language goals, and its action generation capability, the model can directly produce coherent action sequences and complete tasks within dynamic interaction windows.

From Multimodal Understanding to Embodied Intelligence

Qwen-VLA is a natural extension of Qwen’s multimodal capabilities toward embodied intelligence.

In the past, multimodal models mainly focused on understanding the world. With Qwen-VLA, we further explore how models can generate actions in the physical world based on vision and language.

Qwen-VLA unifies robotic manipulation, vision-language navigation, and cross-embodiment control. It connects Qwen’s visual understanding and spatial reasoning capabilities to continuous action generation. Through joint pretraining on real robot data, human egocentric data, synthetic simulation data, and general vision-language data, it learns more general embodied experience. It also demonstrates the potential of generalist policy models across manipulation benchmarks, real-world out-of-distribution generalization, and zero-shot dynamic manipulation.

Embodied intelligence is still at an early stage. Long-horizon real-world tasks, failure recovery, continual learning, and more complex human-robot-environment interactions remain challenging. But Qwen-VLA points to a clear next step:

Models should not only understand the world — they should also learn to act in it.

Citation

@article{qwenvla,
  title={Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments},
  author={Qwen Team},
  year={2026},
  eprint={2605.30280},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2605.30280}, 
}
