
Qwen-RobotManip: Alignment Unlocks Scale for Robotic Manipulation Foundation Models

This article introduces Qwen-RobotManip, a scalable foundation model for generalizable robotic manipulation.


Qwen-Omni × Qwen-RobotManip — Qwen-Omni observes the scene, randomly proposes manipulation tasks via speech, and judges execution in real time. Each video shows Qwen-RobotManip completing tasks on the fly with no pre-defined task list, demonstrating open-ended instruction following and generalization.


Qwen-RobotManip is validated across various real-robot platforms and tasks, demonstrating strong generalization to novel scenes, unseen language instructions, and cross-embodiment transfer.




Foundation models in language and multimodality have achieved remarkable generalization because heterogeneous data sources can be aligned under a unified formulation, and abundant low-cost internet data allows diverse training signals to reinforce one another at scale. But can this scaling recipe be applied to robotic manipulation?

This is challenging. Unlike text or images, manipulation data is heterogeneous by nature, expensive to collect, and narrow in diversity. Aligning representations across different robot embodiments, sensors, and task domains while simultaneously scaling the data has remained an open problem.

Qwen-RobotManip is a generalizable Vision-Language-Action (VLA) foundation model built upon Qwen-VL. It introduces a unified alignment framework across the representation, motion, and behavioral dimensions of manipulation, making large-scale multi-source training coherent rather than conflicting. Using only open-source robotic manipulation datasets and human demonstration videos, without any proprietary data collection, Qwen-RobotManip constructs a pretraining corpus of roughly 38,100 hours and already exhibits emergent generalization capabilities.

Without unified cross-embodiment alignment, scaling data produces conflicts; without data diversity, alignment alone cannot generalize. Alignment and scale are tightly coupled prerequisites for robotic foundation models.

Key Highlights

  • Unified Cross-Embodiment Alignment Framework: a unified 80-dimensional state-action representation accommodates diverse embodiments; camera-frame end-effector delta poses make visually similar motions numerically proximate; and in-context policy adaptation reads execution history as an implicit embodiment identifier. Together, these enable consistent signal extraction across embodiments.
  • Human-to-Robot Synthesis at Scale: a pipeline converts 1,933h of egocentric human video into 24,808h of robot demonstrations across 15 embodiments via action retargeting, hand removal and inpainting, simulated rendering, and depth-guided compositing. A multi-stage curation pipeline ensures data quality.
  • OOD Generalization: LIBERO-Plus 91.4% (+7.0 over π0.5), RoboTwin-C2R Hard 69.4% (+21.5 over π0.5), RoboCasa365 Composite-Unseen 14.9% (3× next best), EBench 45.6% (+18.5 over next best); RoboTwin-IF 72.0% (+22.4 over π0.5) confirming genuine language-conditioned control; 3× next best on RoboTwin-XE showing zero-shot cross-embodiment transfer
  • Strong Real-World Performance: #1 on RoboChallenge Table30 v1 generalist track with 45% SR, sweeping top 2 and leading 3rd place by 20%; validated on real-robot platforms with 2× prior SOTA on in-domain and OOD tasks, few-shot adaptation, and cross-embodiment skill transfer

Scaling Manipulation Data

Human-to-Robot Data Synthesis

[Figure: Human-to-Robot synthesis pipeline]

Robot manipulation data is scarce and expensive to collect. We introduce a Human-to-Robot synthesis pipeline that converts egocentric human manipulation videos into robot demonstrations across 15 robot embodiments via human-to-robot retargeting, hand removal and inpainting, and depth-guided robot compositing.
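The final compositing step can be pictured as a per-pixel depth test. The sketch below is a minimal illustration under assumptions, not the pipeline's actual implementation: the function name `composite_depth_guided`, the nested-list image layout, and the use of `None` for pixels the rendered robot does not cover are all hypothetical.

```python
from typing import List, Optional, Tuple

Pixel = Tuple[int, int, int]  # (r, g, b); hypothetical image layout: rows of pixels


def composite_depth_guided(
    scene_rgb: List[List[Pixel]],
    scene_depth: List[List[float]],
    robot_rgb: List[List[Pixel]],
    robot_depth: List[List[Optional[float]]],
) -> List[List[Pixel]]:
    """Overlay a rendered robot onto an inpainted scene using a depth test.

    A robot pixel replaces the scene pixel only where the robot was rendered
    (depth is not None) and is closer to the camera than the scene surface.
    """
    out = []
    for y, row in enumerate(scene_rgb):
        out_row = []
        for x, scene_px in enumerate(row):
            d_robot = robot_depth[y][x]
            if d_robot is not None and d_robot < scene_depth[y][x]:
                out_row.append(robot_rgb[y][x])
            else:
                out_row.append(scene_px)
        out.append(out_row)
    return out


# Toy 1x3 frame: the robot is in front at x=0, behind the scene at x=1, absent at x=2.
scene = [[(10, 10, 10), (20, 20, 20), (30, 30, 30)]]
scene_d = [[1.0, 0.5, 1.0]]
robot = [[(255, 0, 0), (255, 0, 0), (0, 0, 0)]]
robot_d = [[0.8, 0.9, None]]
frame = composite_depth_guided(scene, scene_d, robot, robot_d)
```

Only the first pixel takes the robot color here; the second keeps the scene because the scene surface is nearer, which is how occlusion by foreground objects would be preserved.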

Data Sources

The resulting pretraining corpus totals over 38,100 hours from three complementary sources:

  • Robot data (~11,420h): Open-source robotic datasets covering single-arm, dual-arm, and mobile manipulation.
  • Egocentric human data (~1,933h): Human manipulation videos collected from open-world environments, providing rich object-interaction and scene priors.
  • Human-to-Robot synthesized data (~24,808h): Generated from the egocentric data above across 15 robot platforms, serving as the primary scaling engine.
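As a quick check of how the three sources add up, the snippet below sums the reported hours and derives each source's share. It uses only the approximate figures listed above; the variable names are illustrative.

```python
# Hours per source, as reported above (approximate figures).
corpus_hours = {
    "robot": 11_420,
    "egocentric_human": 1_933,
    "human_to_robot_synth": 24_808,
}

total_hours = sum(corpus_hours.values())
shares = {name: hours / total_hours for name, hours in corpus_hours.items()}
# Hours of synthesized robot data obtained per hour of egocentric human video.
expansion = corpus_hours["human_to_robot_synth"] / corpus_hours["egocentric_human"]

print(total_hours)
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
print(f"expansion: {expansion:.1f}x")
```

The total comes to 38,161 hours, with synthesized data making up about 65% of the corpus and each hour of human video yielding roughly 12.8 hours of robot demonstrations.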

Data Curation

[Figure: data curation pipeline]

We design a multi-stage curation pipeline to ensure VLA training data quality and annotation correctness. Five state-action filtering stages remove noisy actions, fix temporal misalignment, and verify kinematic consistency. Three cross-modal checks then validate that language instructions match the video content, that visual observations agree with recorded robot states, and that video frames are free of corruption.
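A staged filter of this kind can be sketched as a list of predicates applied to each episode. The example below is a simplified illustration, not the actual curation code: the episode layout, the three checks, and the thresholds are all assumptions, and it only rejects bad episodes rather than repairing misalignment as the real pipeline is described to do.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical episode layout: {"states": [...], "actions": [...]} with delta actions.
Episode = Dict[str, List[Sequence[float]]]


def has_matching_lengths(ep: Episode) -> bool:
    """Temporal sanity check: exactly one action per state transition."""
    return len(ep["actions"]) == len(ep["states"]) - 1


def no_action_spikes(ep: Episode, max_step: float = 0.1) -> bool:
    """Reject episodes containing implausibly large single-step actions."""
    return all(abs(ai) <= max_step for a in ep["actions"] for ai in a)


def is_kinematically_consistent(ep: Episode, tol: float = 1e-2) -> bool:
    """Check that state[t] + action[t] is close to state[t+1] in every dimension."""
    for s, a, s_next in zip(ep["states"], ep["actions"], ep["states"][1:]):
        if any(abs(si + ai - ni) > tol for si, ai, ni in zip(s, a, s_next)):
            return False
    return True


FILTERS: List[Callable[[Episode], bool]] = [
    has_matching_lengths,
    no_action_spikes,
    is_kinematically_consistent,
]


def curate(episodes: List[Episode]) -> List[Episode]:
    """Keep only episodes that pass every filter stage."""
    return [ep for ep in episodes if all(f(ep) for f in FILTERS)]


good = {"states": [[0.0, 0.0], [0.05, 0.0], [0.05, 0.02]],
        "actions": [[0.05, 0.0], [0.0, 0.02]]}
spiky = {"states": [[0.0, 0.0], [0.9, 0.0]], "actions": [[0.9, 0.0]]}
drifting = {"states": [[0.0, 0.0], [0.05, 0.0]], "actions": [[0.0, 0.05]]}
kept = curate([good, spiky, drifting])
```

Ordering cheap checks first lets `all(...)` short-circuit before the more expensive consistency pass runs.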

Qwen-RobotManip Model Design

[Figure: Qwen-RobotManip model design]

Qwen-RobotManip couples a Qwen3.5-4B vision-language backbone with a flow-matching Diffusion Transformer (DiT) action head. Three design choices enable coherent cross-embodiment training:

  • Canonical State-Action Representation. All robot states and actions are mapped to a unified 80-dimensional vector covering single-arm, dual-arm, dexterous hand, and mobile base configurations. A per-dimension binary mask lets gradients flow only through populated slots, so different embodiments share the same representation without conflict.
  • Camera-Frame Delta Pose. End-effector actions are expressed as deltas in the camera coordinate frame rather than the robot base frame, making visually similar actions numerically proximate across embodiments. Camera extrinsics are injected via Camera Positional Encoding (CaPE) in the cross-attention layers, while intrinsics are encoded into visual tokens for field-of-view awareness. The DiT is further conditioned on end-effector type embeddings for embodiment-aware action denoising.
  • In-Context Policy Adaptation. The model conditions action prediction on a structured embodiment prompt (specifying robot platform, execution speed, and FPS) together with a historical observation-action chunk, enabling on-the-fly adaptation to different embodiments and behavior patterns. A stochastic context sampling strategy during training prevents action-copy shortcuts and forces genuine policy learning.
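The first two design choices can be illustrated with a small sketch. Only the 80-dimensional size comes from the text; the slot layout, the embodiment names, the helper functions, and the rotation matrices are hypothetical, and only the translational part of the delta pose is shown.

```python
from typing import Dict, List, Tuple

CANONICAL_DIM = 80  # from the text; the slot layout below is hypothetical

# Hypothetical slot assignment: which canonical indices each embodiment fills.
SLOTS: Dict[str, range] = {
    "single_arm_gripper": range(0, 7),   # 6-DoF end-effector delta + gripper
    "dual_arm_gripper": range(0, 14),    # two arms, 7 slots each
}


def to_canonical(embodiment: str, action: List[float]) -> Tuple[List[float], List[int]]:
    """Scatter a native action into the 80-dim vector and build its binary mask."""
    slots = SLOTS[embodiment]
    if len(action) != len(slots):
        raise ValueError("action length does not match the embodiment's slots")
    vec, mask = [0.0] * CANONICAL_DIM, [0] * CANONICAL_DIM
    for idx, value in zip(slots, action):
        vec[idx], mask[idx] = value, 1
    return vec, mask


def masked_mse(pred: List[float], target: List[float], mask: List[int]) -> float:
    """Mean squared error over populated slots only; empty slots contribute nothing."""
    return sum(m * (p - t) ** 2 for p, t, m in zip(pred, target, mask)) / sum(mask)


def rotate(R: List[List[float]], v: List[float]) -> List[float]:
    """Apply a 3x3 rotation matrix to a 3-vector."""
    return [sum(R[i][j] * v[j] for j in range(3)) for i in range(3)]


# Two robots make the same visible motion, but robot B's base is yawed by
# 90 degrees relative to the camera, so its base-frame delta differs from A's.
R_cam_from_base_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
R_cam_from_base_b = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
delta_base_a = [0.02, 0.0, 0.0]
delta_base_b = [0.0, -0.02, 0.0]
delta_cam_a = rotate(R_cam_from_base_a, delta_base_a)
delta_cam_b = rotate(R_cam_from_base_b, delta_base_b)

target, mask = to_canonical("single_arm_gripper", delta_cam_a + [0.0, 0.0, 0.0, 1.0])
pred = [0.5] * CANONICAL_DIM  # junk in unpopulated slots must not affect the loss
pred[:7] = target[:7]
loss = masked_mse(pred, target, mask)
```

In the base frame the two deltas look unrelated, while in the camera frame they coincide. The masked loss is zero despite arbitrary values in the 73 unused slots, which is the property that keeps gradients confined to populated dimensions.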

Training. Pre-training uses dual-stream co-training with a VLA stream (robot manipulation data) and a VLM stream (vision-language understanding data) at a 9:1 ratio. Post-training applies generalist supervised fine-tuning (SFT) on all demonstration data collected for each benchmark. We propose co-training on VL and VLA data during post-training as well, which further improves OOD instruction following and generalization.
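The text gives the 9:1 ratio but not its granularity (per step, per sample, or per token). As one hedged reading, the sketch below samples which stream feeds each training step with 9:1 odds; `stream_schedule` and its parameters are hypothetical names.

```python
import random
from collections import Counter
from typing import Iterator


def stream_schedule(num_steps: int, vla_weight: int = 9, vlm_weight: int = 1,
                    seed: int = 0) -> Iterator[str]:
    """Yield which stream feeds each training step, sampled at vla:vlm odds."""
    rng = random.Random(seed)
    for _ in range(num_steps):
        yield rng.choices(["vla", "vlm"], weights=[vla_weight, vlm_weight])[0]


counts = Counter(stream_schedule(10_000))
vla_fraction = counts["vla"] / 10_000
print(f"VLA fraction: {vla_fraction:.3f}")
```

Over many steps the VLA fraction settles near 0.9. A fixed schedule (every tenth step from the VLM stream) would give the same ratio without sampling noise.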

Evaluation

[Figure: overall evaluation setting]

Qwen-RobotManip is evaluated across 500+ simulation tasks and 80+ real-world tasks spanning various robot embodiments.

Why OOD Evaluation Matters

A critical finding in our experiments: standard benchmarks systematically fail to capture the quality of pretraining. On in-distribution benchmarks like LIBERO and RoboTwin, models trained from scratch without any large-scale robot pretraining achieve performance comparable to previous SOTA pretrained models. Strong IID scores do not indicate genuine generalization; they can be achieved through pattern matching alone.

The separation only becomes visible under out-of-distribution evaluation: novel scenes and task variations, unseen instructions, and cross-embodiment transfer. This is why Qwen-RobotManip adopts OOD benchmarks as the north star for evaluating robotic foundation models.

In-Distribution Results

On standard benchmarks, Qwen-RobotManip matches or exceeds previous SOTA.

Model                     LIBERO   RT-Easy   RT-Hard
π0                        94.4     65.9      58.4
π0.5                      97.6     82.7      76.8
StarVLA                   98.0     85.7      87.3
Abot-M0                   98.6     86.1      85.1
Being-H0.7                99.2     90.2      89.6
Qwen-RobotManip-scratch   98.2     88.7      88.4
Qwen-RobotManip           99.1     93.4      92.5
Qwen-RobotManip-Context   99.2     93.7      94.0

Out-of-Distribution Generalization

[Figure: summary of OOD results]

Qwen-RobotManip substantially outperforms all previous models across three OOD generalization axes: task and scene variations, instruction following, and cross-embodiment transfer.

Data Scaling: Alignment Enables Scale

[Figure: downstream data scaling curves]

An important finding: only models with unified cross-embodiment representations exhibit clean log-linear data scaling behavior. Without the alignment framework (UnifiedSpace + UnifiedEEF), adding more data produces erratic or flat scaling curves. This confirms that alignment is the prerequisite for scale, not the other way around.
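Log-linear here means performance grows roughly linearly in the logarithm of data volume. The sketch below shows one way such a trend could be checked with an ordinary least-squares fit. The data points are synthetic placeholders chosen for illustration, not results from this work, and `fit_log_linear` and `r_squared` are hypothetical helper names.

```python
import math
from typing import List, Tuple


def fit_log_linear(hours: List[float], scores: List[float]) -> Tuple[float, float]:
    """Least-squares fit of score = a + b * log10(hours); returns (a, b)."""
    xs = [math.log10(h) for h in hours]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def r_squared(hours: List[float], scores: List[float], a: float, b: float) -> float:
    """Fraction of score variance explained by the fitted line."""
    mean_y = sum(scores) / len(scores)
    ss_res = sum((y - (a + b * math.log10(h))) ** 2 for h, y in zip(hours, scores))
    ss_tot = sum((y - mean_y) ** 2 for y in scores)
    return 1.0 - ss_res / ss_tot


# SYNTHETIC placeholder points, chosen only to illustrate the two regimes.
toy_hours = [100.0, 1_000.0, 10_000.0]
toy_clean = [30.0, 40.0, 50.0]    # gains the same amount per 10x data
toy_erratic = [30.0, 45.0, 38.0]  # more data does not help consistently

a_clean, b_clean = fit_log_linear(toy_hours, toy_clean)
r2_clean = r_squared(toy_hours, toy_clean, a_clean, b_clean)
a_err, b_err = fit_log_linear(toy_hours, toy_erratic)
r2_erratic = r_squared(toy_hours, toy_erratic, a_err, b_err)
```

A clean log-linear curve yields a stable slope per tenfold increase in data and an R² close to 1, whereas an erratic curve leaves most of the variance unexplained.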

Real-World Experiments

Generalization to Novel Scenes and Instructions

[Figure: real-world CobotMagic setup]

In-domain evaluation across 7 tasks spanning basic pick-and-place, deformable object handling, and precision assembly:

Task                    π0.5    StarVLA   Ours
table-cleanup           4/5     0/5       5/5
three-bowl-stacking     5/5     4/5       5/5
melon-in-bowl           2/5     0/5       5/5
towel-folding           4/5     3/5       4/5
block-in-drawer         0/5     0/5       5/5
yellow-disc-insertion   0/5     0/5       2/5
three-block-stacking    0/5     0/5       5/5
Average                 42.9%   20.0%     88.6%

Out-of-domain evaluation with distribution shifts in visual scenes, objects, and instructions:

Task                       OOD Factors                                  π0.5    StarVLA   Ours
target-object-in-basket    cluttered background, unseen objects         8/10    0/10      10/10
left-right-bowl-stacking   cluttered background, left-right reference   1/10    0/10      10/10
tool-on-towel              unseen small objects, distractors            0/10    0/10      6/10
banana-on-towel            dynamic lighting (disco light)               6/10    0/10      9/10
Average                                                                 37.5%   0.0%      87.5%

Qwen-RobotManip achieves 88.6% in-domain and 87.5% OOD success, substantially outperforming π0.5 (42.9% / 37.5%) and StarVLA (20.0% / 0.0%).
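The averages are pooled success rates, that is, total successes divided by total trials. The short check below reproduces them from the per-task counts in the two tables above; the data structure and function name are illustrative.

```python
# (successes, trials) per task, copied from the two tables above.
in_domain = {
    "π0.5":    [(4, 5), (5, 5), (2, 5), (4, 5), (0, 5), (0, 5), (0, 5)],
    "StarVLA": [(0, 5), (4, 5), (0, 5), (3, 5), (0, 5), (0, 5), (0, 5)],
    "Ours":    [(5, 5), (5, 5), (5, 5), (4, 5), (5, 5), (2, 5), (5, 5)],
}
out_of_domain = {
    "π0.5":    [(8, 10), (1, 10), (0, 10), (6, 10)],
    "StarVLA": [(0, 10), (0, 10), (0, 10), (0, 10)],
    "Ours":    [(10, 10), (10, 10), (6, 10), (9, 10)],
}


def average_success(results):
    """Pooled success rate in percent: total successes over total trials."""
    successes = sum(s for s, _ in results)
    trials = sum(t for _, t in results)
    return 100.0 * successes / trials


for name in in_domain:
    print(f"{name}: {average_success(in_domain[name]):.1f}% in-domain, "
          f"{average_success(out_of_domain[name]):.1f}% OOD")
```

For example, Qwen-RobotManip succeeds in 31 of 35 in-domain trials (88.6%) and 35 of 40 OOD trials (87.5%).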

Data-Efficient Skill Transfer Across Embodiments

[Figure: ARX evaluation setup]

Few-shot adaptation. All methods are jointly finetuned on only 130 teleoperated demonstrations across 5 tasks. Qwen-RobotManip outperforms both baselines on 4 of 5 tasks and ties π0.5 on the remaining one (Insert Screw):

Task           Sub-steps                        StarVLA   π0.5      Ours
Put Fruits     Place 1 / 2 / 3                  3/1/0     9/5/2     9/5/3
               Avg. success                     13.3%     53.3%     56.7%
Put Blocks     Open / Place1 / Place2 / Close   1/1/0/0   4/2/2/2   5/4/3/3
               Avg. success                     5.0%      25.0%     37.5%
Fold Towel     Fold 1 / Fold 2                  0/0       3/1       3/3
               Avg. success                     0.0%      20.0%     30.0%
Insert Screw   Handover / Insert                0/0       2/0       2/0
               Avg. success                     0.0%      10.0%     10.0%
Unscrew Cap    Grasp / Unscrew / Place          4/0/0     9/2/1     9/4/3
               Avg. success                     13.3%     40.0%     53.3%
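The per-task averages in the table are consistent with 10 trials per sub-step, which is an inference from the numbers rather than something the text states. Under that assumption, the sketch below recomputes the averages and counts the tasks on which Qwen-RobotManip strictly beats both baselines.

```python
TRIALS = 10  # assumption: inferred from the percentages, not stated in the text

# Sub-step success counts per task, in the table's column order.
few_shot = {
    "Put Fruits":   {"StarVLA": [3, 1, 0],    "π0.5": [9, 5, 2],    "Ours": [9, 5, 3]},
    "Put Blocks":   {"StarVLA": [1, 1, 0, 0], "π0.5": [4, 2, 2, 2], "Ours": [5, 4, 3, 3]},
    "Fold Towel":   {"StarVLA": [0, 0],       "π0.5": [3, 1],       "Ours": [3, 3]},
    "Insert Screw": {"StarVLA": [0, 0],       "π0.5": [2, 0],       "Ours": [2, 0]},
    "Unscrew Cap":  {"StarVLA": [4, 0, 0],    "π0.5": [9, 2, 1],    "Ours": [9, 4, 3]},
}


def avg_substep_success(counts, trials=TRIALS):
    """Mean per-sub-step success rate, in percent."""
    return 100.0 * sum(counts) / (trials * len(counts))


# Tasks where "Ours" is strictly better than both baselines.
wins = [
    task for task, r in few_shot.items()
    if avg_substep_success(r["Ours"]) > max(avg_substep_success(r["StarVLA"]),
                                            avg_substep_success(r["π0.5"]))
]
print(wins)
```

Insert Screw is the one task missing from `wins`: no method completes the insertion sub-step, so Qwen-RobotManip and π0.5 tie at 10.0%.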

Cross-embodiment skill transfer. A single policy jointly finetuned on 6K CobotMagic and 130 ARX demonstrations is evaluated on 4 novel tasks on ARX, for which ARX has zero training demonstrations:

Model              Stack Plates   Stack Blocks   Fruits in Plate   Trash in Bucket   Avg.
w/o UnifiedSpace   0/10           0/10           3/10              0/10              7.5%
w/o UnifiedEEF     0/10           0/10           5/10              0/10              12.5%
Qwen-RobotManip    3/10           5/10           7/10              7/10              55.0%

The full unified framework achieves 55.0%, over 4× the best ablation variant, demonstrating that the unified representation enables skill-level transfer across kinematically different embodiments.

Complex Multi-Step Tasks and Emergent Recovery

In the RoboChallenge Table30 v1 Generalist Track, which spans 30 tasks across 4 robot platforms, Qwen-RobotManip ranks 1st with a 45% success rate and a process score of 59.83, outperforming the runner-up by 20%. On the 8 bimanual coordination tasks, it achieves 40% success versus 21.2% for π0.5.

[Figure: RoboChallenge results]
[Figure: bimanual and pick-and-place results (bar chart)]

Bimanual coordination. Among the 30 benchmark tasks, 8 require tight bimanual coordination on the ALOHA platform, where the two arms must jointly stabilize, transport, and manipulate objects. Qwen-RobotManip achieves 40% average success rate, far exceeding π0.5 (21.2%), DM0 (16.2%), GR00T-MULTI (7.5%), and π0 (7.5%). Notably, Qwen-RobotManip is the only model to succeed on “pour fries into plate” (30% vs. 0% for all baselines), a task demanding sequential bimanual steps — stabilizing the fries box with the left arm, opening it with the right arm, picking it up, and pouring the contents onto the plate. We attribute this strong bimanual performance to two factors: (1) our pretraining corpus contains a substantial proportion of bimanual demonstration data, enabling the model to learn coordinated dual-arm control primitives; and (2) the Human-to-Robot synthesis pipeline further expands the effective bimanual pretraining data by synthesizing bimanual robot demonstrations from egocentric human videos.

[Figure: bimanual case study]

Robust pick-and-place across embodiments. We identify 12 tasks across all four platforms that center on pick-and-place primitives, ranging from single-object grasping to multi-step sequential manipulation involving 4–5 objects. Qwen-RobotManip achieves 63.3% average success rate on these tasks, surpassing the next-best baseline DM0 (48.3%) by 15.0 percentage points. We attribute this capability to two factors: (1) the large-scale cross-embodiment pretraining data encodes abundant pick-and-place patterns, and (2) the unified action space enables knowledge sharing of fundamental spatial skills across different robot morphologies.

[Figure: retry case study]

Reactive error recovery. When objects slip during grasping, the model autonomously retries until successful. This behavior emerges from pretraining at scale, not from explicit programming.

Towards Scalable Robotic Foundation Models

Qwen-RobotManip demonstrates that the scaling recipe behind language and multimodal foundation models can be extended to robotic manipulation, but only when alignment and scale work in tandem. The unified cross-embodiment representation is what makes large-scale multi-source training productive rather than conflicting, and the Human-to-Robot synthesis pipeline provides the data diversity that alignment alone cannot supply.

Embodied intelligence is still at an early stage. Contact-rich, long-horizon real-world tasks, failure recovery, continual learning, and complex human-robot-environment interactions remain challenging. Yet Qwen-RobotManip points to a clear path forward:

Alignment unlocks scale, and scale unlocks generalization.

Citation

@article{qwenrobotmanip,
  title={Qwen-RobotManip Technical Report: Alignment Unlocks Scale for Robotic Manipulation Foundation Models},
  author={Qwen Team},
  year={2026}
}