
Qwen-Omni × Qwen-RobotManip — Qwen-Omni observes the scene, randomly proposes manipulation tasks via speech, and judges execution in real time. Each video shows Qwen-RobotManip completing tasks on the fly with no pre-defined task list, demonstrating open-ended instruction following and generalization.
Qwen-RobotManip is validated across various real-robot platforms and tasks, demonstrating strong generalization to novel scenes and unseen language instructions, as well as cross-embodiment transfer.

Foundation models in language and multimodality have achieved remarkable generalization because heterogeneous data sources can be aligned under a unified formulation, and abundant low-cost internet data allows diverse training signals to reinforce one another at scale. But can this scaling recipe be applied to robotic manipulation?
This is challenging. Unlike text or images, manipulation data is heterogeneous by nature, expensive to collect, and narrow in diversity. Aligning representations across different robot embodiments, sensors, and task domains while simultaneously scaling the data has remained an open problem.
Qwen-RobotManip is a generalizable Vision-Language-Action (VLA) foundation model built upon Qwen-VL. It introduces a unified alignment framework across the representation, motion, and behavioral dimensions of manipulation, making large-scale multi-source training coherent rather than conflicting. Using only open-source robotic manipulation datasets and human demonstration videos, without any proprietary data collection, we construct a ~38,100-hour pretraining corpus, and the resulting model already exhibits emergent generalization capabilities.
Without unified cross-embodiment alignment, scaling data produces conflicts; without data diversity, alignment alone cannot generalize. Alignment and scale are tightly coupled prerequisites for robotic foundation models.

Robot manipulation data is scarce and expensive to collect. We introduce a Human-to-Robot synthesis pipeline that converts egocentric human manipulation videos into robot demonstrations across 15 robot embodiments via human-to-robot retargeting, hand removal and inpainting, and depth-guided robot compositing.
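To make the final compositing step concrete, here is a minimal sketch of depth-guided compositing in general terms. It is not the pipeline's actual implementation: the function name `composite_robot`, the array layout, and the rule that a rendered robot pixel wins wherever it is closer to the camera than the scene are all assumptions for illustration.

```python
import numpy as np

def composite_robot(scene_rgb, scene_depth, robot_rgb, robot_depth, robot_mask):
    """Paste a rendered robot into an inpainted scene, respecting occlusion.

    scene_rgb, robot_rgb:     (H, W, 3) uint8 images.
    scene_depth, robot_depth: (H, W) float depth maps in metres.
    robot_mask:               (H, W) bool, True where the render contains the robot.
    All names and conventions here are hypothetical.
    """
    # The robot is visible only where it was rendered AND is closer than the scene.
    visible = robot_mask & (robot_depth < scene_depth)
    out_rgb = np.where(visible[..., None], robot_rgb, scene_rgb)
    out_depth = np.where(visible, robot_depth, scene_depth)
    return out_rgb, out_depth, visible
```

Comparing depths per pixel lets scene objects in front of the robot (for example, a cup between the camera and the gripper) stay in front after compositing, which a plain mask-based paste would not preserve.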
The resulting pretraining corpus totals over 38,100 hours from three complementary sources:

We design a multi-stage curation pipeline to ensure VLA training data quality and annotation correctness. Five state-action filtering stages remove noisy actions, fix temporal misalignment, and verify kinematic consistency. Three cross-modal checks then validate that language instructions match the video content, that visual observations agree with recorded robot states, and that video frames are free of corruption.
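As an illustration of what one state-action filtering stage might look like, the sketch below rejects episodes whose commanded actions jump too sharply between consecutive steps. The function name `passes_jump_check` and both thresholds are hypothetical; the actual stages and their criteria are not specified here.

```python
import numpy as np

def passes_jump_check(actions, max_step=0.05, max_bad_ratio=0.01):
    """Return True if an episode passes a simple action-jump check.

    actions:       (T, D) array of per-step action commands.
    max_step:      largest allowed per-dimension change between steps (assumed).
    max_bad_ratio: tolerated fraction of steps exceeding max_step (assumed).
    """
    actions = np.asarray(actions, dtype=float)
    if len(actions) < 2:
        return False  # too short to evaluate
    # Largest change across action dimensions at each transition.
    jumps = np.abs(np.diff(actions, axis=0)).max(axis=1)
    bad_ratio = float((jumps > max_step).mean())
    return bad_ratio <= max_bad_ratio
```

A check like this only targets noisy actions; temporal misalignment and kinematic consistency would need their own tests against timestamps and the robot model.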

Qwen-RobotManip couples a Qwen3.5-4B vision-language backbone with a flow-matching Diffusion Transformer (DiT) action head. Three design choices enable coherent cross-embodiment training:
Training. Pre-training uses dual-stream co-training with a VLA stream (robot manipulation data) and a VLM stream (vision-language understanding data) at a 9:1 ratio. Post-training adopts generalist SFT on all demonstration data collected for each benchmark. We propose co-training with VL data and VLA data during post-training, which further improves OOD instruction following and generalization.
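The 9:1 co-training ratio can be pictured as a weighted draw between two data streams. The sketch below is one possible reading, not the actual training loader: it mixes per batch, whereas the text does not state whether the ratio applies per sample, per batch, or per token, and the name `mixed_stream` is hypothetical.

```python
import random

def mixed_stream(vla_batches, vlm_batches, vla_weight=9, vlm_weight=1, seed=0):
    """Yield (source, batch) pairs, drawing from the VLA stream with
    probability vla_weight / (vla_weight + vlm_weight)."""
    rng = random.Random(seed)
    vla_iter, vlm_iter = iter(vla_batches), iter(vlm_batches)
    p_vla = vla_weight / (vla_weight + vlm_weight)
    while True:
        if rng.random() < p_vla:
            source, it = "vla", vla_iter
        else:
            source, it = "vlm", vlm_iter
        try:
            yield source, next(it)
        except StopIteration:
            return  # stop as soon as the chosen stream runs dry
```

Sampling the source at random, rather than following a fixed schedule of nine VLA batches then one VLM batch, avoids a periodic pattern in the gradient signal while keeping the long-run ratio at 9:1.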

Qwen-RobotManip is evaluated across 500+ simulation tasks and 80+ real-world tasks spanning various robot embodiments.
A critical finding in our experiments: standard benchmarks systematically fail to capture the quality of pretraining. On in-distribution benchmarks like LIBERO and RoboTwin, models trained from scratch without any large-scale robot pretraining achieve performance comparable to previous SOTA pretrained models. Strong IID scores do not indicate genuine generalization; they can be achieved through pattern matching alone.
The separation only becomes visible under out-of-distribution evaluation: novel scenes and task variations, unseen instructions, and cross-embodiment transfer. This is why we adopt OOD benchmarks as the north star for evaluating robotic foundation models.
On standard benchmarks, Qwen-RobotManip matches or exceeds previous SOTA (success rate in %; RT denotes RoboTwin):
| Model | LIBERO | RT-Easy | RT-Hard |
|---|---|---|---|
| π0 | 94.4 | 65.9 | 58.4 |
| π0.5 | 97.6 | 82.7 | 76.8 |
| StarVLA | 98.0 | 85.7 | 87.3 |
| Abot-M0 | 98.6 | 86.1 | 85.1 |
| Being-H0.7 | 99.2 | 90.2 | 89.6 |
| Qwen-RobotManip-scratch | 98.2 | 88.7 | 88.4 |
| Qwen-RobotManip | 99.1 | 93.4 | 92.5 |
| Qwen-RobotManip-Context | 99.2 | 93.7 | 94.0 |

Qwen-RobotManip substantially outperforms all previous models across three OOD generalization axes: task and scene variations, instruction following, and cross-embodiment transfer.

An important finding: only models with unified cross-embodiment representations exhibit clean log-linear data scaling behavior. Without the alignment framework (UnifiedSpace + UnifiedEEF), adding more data produces erratic or flat scaling curves. This confirms that alignment is the prerequisite for scale, not the other way around.
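One way to test whether a scaling curve is "clean log-linear" is to fit success rate against the logarithm of data volume and inspect the goodness of fit. The sketch below does this with a least-squares line. The helper name `log_linear_fit` is hypothetical, and the numbers are synthetic placeholders chosen to show the two regimes, not results from the report.

```python
import numpy as np

def log_linear_fit(hours, success):
    """Fit success ≈ a + b * log10(hours); return (a, b, r_squared)."""
    x = np.log10(np.asarray(hours, dtype=float))
    y = np.asarray(success, dtype=float)
    b, a = np.polyfit(x, y, 1)  # slope first, then intercept
    residual = y - (a + b * x)
    ss_res = float((residual ** 2).sum())
    ss_tot = float(((y - y.mean()) ** 2).sum())  # assumes success is not constant
    return float(a), float(b), 1.0 - ss_res / ss_tot

# Synthetic placeholder numbers, NOT results from the report.
hours = [100, 1_000, 10_000]
aligned = [20.0, 40.0, 60.0]   # gains 20 points per decade of data
erratic = [30.0, 25.0, 32.0]   # no consistent trend

for name, curve in [("aligned", aligned), ("erratic", erratic)]:
    a, b, r2 = log_linear_fit(hours, curve)
    print(f"{name}: slope per decade = {b:.1f}, R^2 = {r2:.2f}")
```

An R² near 1 with a positive slope corresponds to the clean scaling described above; a low R² or a slope near zero corresponds to the erratic or flat curves.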

In-domain evaluation across 7 tasks spanning basic pick-and-place, deformable object handling, and precision assembly:
| Task | π0.5 | StarVLA | Ours |
|---|---|---|---|
| table-cleanup | 4/5 | 0/5 | 5/5 |
| three-bowl-stacking | 5/5 | 4/5 | 5/5 |
| melon-in-bowl | 2/5 | 0/5 | 5/5 |
| towel-folding | 4/5 | 3/5 | 4/5 |
| block-in-drawer | 0/5 | 0/5 | 5/5 |
| yellow-disc-insertion | 0/5 | 0/5 | 2/5 |
| three-block-stacking | 0/5 | 0/5 | 5/5 |
| Average | 42.9% | 20.0% | 88.6% |
Out-of-domain evaluation with distribution shifts in visual scenes, objects, and instructions:
| Task | OOD Factors | π0.5 | StarVLA | Ours |
|---|---|---|---|---|
| target-object-in-basket | cluttered bg, unseen objects | 8/10 | 0/10 | 10/10 |
| left-right-bowl-stacking | cluttered bg, left-right reference | 1/10 | 0/10 | 10/10 |
| tool-on-towel | unseen small objects, distractors | 0/10 | 0/10 | 6/10 |
| banana-on-towel | dynamic lighting (disco light) | 6/10 | 0/10 | 9/10 |
| Average | | 37.5% | 0.0% | 87.5% |
Qwen-RobotManip achieves 88.6% in-domain and 87.5% OOD success, substantially outperforming π0.5 (42.9% / 37.5%) and StarVLA (20.0% / 0.0%).
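The averages can be reproduced by pooling successes over all trials in each table. The short script below copies the per-task counts from the two tables; the helper name `success_rate` is only for illustration.

```python
# Successes per task, copied from the two tables above.
in_domain = {  # 5 trials per task, 7 tasks
    "π0.5":    [4, 5, 2, 4, 0, 0, 0],
    "StarVLA": [0, 4, 0, 3, 0, 0, 0],
    "Ours":    [5, 5, 5, 4, 5, 2, 5],
}
ood = {  # 10 trials per task, 4 tasks
    "π0.5":    [8, 1, 0, 6],
    "StarVLA": [0, 0, 0, 0],
    "Ours":    [10, 10, 6, 9],
}

def success_rate(successes, trials_per_task):
    """Pooled success rate in percent, rounded to one decimal."""
    total_trials = trials_per_task * len(successes)
    return round(100 * sum(successes) / total_trials, 1)

for model in in_domain:
    print(model, success_rate(in_domain[model], 5), success_rate(ood[model], 10))
# π0.5 42.9 37.5
# StarVLA 20.0 0.0
# Ours 88.6 87.5
```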

Few-shot adaptation. All methods are jointly finetuned on only 130 teleoperated demonstrations across 5 tasks. Qwen-RobotManip outperforms both baselines on 4 of 5 tasks:
| Task | Sub-step | StarVLA | π0.5 | Ours |
|---|---|---|---|---|
| Put Fruits | Place 1 / 2 / 3 | 3/1/0 | 9/5/2 | 9/5/3 |
| | Avg. success | 13.3% | 53.3% | 56.7% |
| Put Blocks | Open / Place1 / Place2 / Close | 1/1/0/0 | 4/2/2/2 | 5/4/3/3 |
| | Avg. success | 5.0% | 25.0% | 37.5% |
| Fold Towel | Fold 1 / Fold 2 | 0/0 | 3/1 | 3/3 |
| | Avg. success | 0.0% | 20.0% | 30.0% |
| Insert Screw | Handover / Insert | 0/0 | 2/0 | 2/0 |
| | Avg. success | 0.0% | 10.0% | 10.0% |
| Unscrew Cap | Grasp / Unscrew / Place | 4/0/0 | 9/2/1 | 9/4/3 |
| | Avg. success | 13.3% | 40.0% | 53.3% |
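The "Avg. success" rows are consistent with averaging the sub-step success counts over 10 trials per task. The trial count is not stated in the table, so it is treated as an inference in the sketch below, which checks the "Ours" column; `avg_success` is a hypothetical helper.

```python
TRIALS = 10  # assumed: inferred from the percentages, not stated in the table

def avg_success(substep_counts, trials=TRIALS):
    """Mean sub-step success rate in percent, rounded to one decimal."""
    return round(100 * sum(substep_counts) / (trials * len(substep_counts)), 1)

# Sub-step success counts for the "Ours" column.
ours = {
    "Put Fruits":   [9, 5, 3],
    "Put Blocks":   [5, 4, 3, 3],
    "Fold Towel":   [3, 3],
    "Insert Screw": [2, 0],
    "Unscrew Cap":  [9, 4, 3],
}

for task, counts in ours.items():
    print(task, avg_success(counts))
# Put Fruits 56.7
# Put Blocks 37.5
# Fold Towel 30.0
# Insert Screw 10.0
# Unscrew Cap 53.3
```

Because later sub-steps can only succeed after earlier ones, this metric gives partial credit for progress rather than measuring full-task completion alone.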
Cross-embodiment skill transfer. A single policy jointly finetuned on 6K CobotMagic and 130 ARX demonstrations is evaluated on 4 novel tasks on ARX, for which ARX has zero training demonstrations:
| Model | Stack Plates | Stack Blocks | Fruits in Plate | Trash in Bucket | Avg. |
|---|---|---|---|---|---|
| w/o UnifiedSpace | 0/10 | 0/10 | 3/10 | 0/10 | 7.5% |
| w/o UnifiedEEF | 0/10 | 0/10 | 5/10 | 0/10 | 12.5% |
| Qwen-RobotManip | 3/10 | 5/10 | 7/10 | 7/10 | 55.0% |
The full unified framework achieves 55.0%, over 4× the best ablation variant, demonstrating that the unified representation enables skill-level transfer across kinematically different embodiments.
In the RoboChallenge Table30 v1 Generalist Track, which spans 30 tasks across 4 robot platforms, Qwen-RobotManip ranks 1st with a 45% success rate and a process score of 59.83, outperforming the runner-up by 20%. On the 8 bimanual coordination tasks, it achieves 40% success versus 21.2% for π0.5.


Bimanual coordination. Among the 30 benchmark tasks, 8 require tight bimanual coordination on the ALOHA platform, where the two arms must jointly stabilize, transport, and manipulate objects. Qwen-RobotManip achieves 40% average success rate, far exceeding π0.5 (21.2%), DM0 (16.2%), GR00T-MULTI (7.5%), and π0 (7.5%). Notably, Qwen-RobotManip is the only model to succeed on “pour fries into plate” (30% vs. 0% for all baselines), a task demanding sequential bimanual steps — stabilizing the fries box with the left arm, opening it with the right arm, picking it up, and pouring the contents onto the plate. We attribute this strong bimanual performance to two factors: (1) our pretraining corpus contains a substantial proportion of bimanual demonstration data, enabling the model to learn coordinated dual-arm control primitives; and (2) the Human-to-Robot synthesis pipeline further expands the effective bimanual pretraining data by synthesizing bimanual robot demonstrations from egocentric human videos.

Robust pick-and-place across embodiments. We identify 12 tasks across all four platforms that center on pick-and-place primitives, ranging from single-object grasping to multi-step sequential manipulation involving 4–5 objects. Qwen-RobotManip achieves 63.3% average success rate on these tasks, surpassing the next-best baseline DM0 (48.3%) by 15.0 percentage points. We attribute this capability to two factors: (1) the large-scale cross-embodiment pretraining data encodes abundant pick-and-place patterns, and (2) the unified action space enables knowledge sharing of fundamental spatial skills across different robot morphologies.

Reactive error recovery. When objects slip during grasping, the model autonomously retries until successful. This behavior emerges from pretraining at scale, not from explicit programming.
Qwen-RobotManip demonstrates that the scaling recipe behind language and multimodal foundation models can be extended to robotic manipulation, but only when alignment and scale work in tandem. The unified cross-embodiment representation is what makes large-scale multi-source training productive rather than conflicting, and the Human-to-Robot synthesis pipeline provides the data diversity that alignment alone cannot supply.
Embodied intelligence is still at an early stage. Contact-rich, long-horizon real-world tasks, failure recovery, continual learning, and complex human-robot-environment interactions remain challenging. Yet Qwen-RobotManip points to a clear path forward:
Alignment unlocks scale, and scale unlocks generalization.
@article{qwenrobotmanip,
title={Qwen-RobotManip Technical Report: Alignment Unlocks Scale for Robotic Manipulation Foundation Models},
author={Qwen Team},
year={2026}
}