

Embodied intelligence requires agents to perceive, reason about, and act within physical environments. World models offer a scalable path forward — but current approaches face a fundamental tension. General video generation models learn rich visual priors but lack the ability to model embodied physics. Domain-specific embodied models are tailored to individual scenarios and cannot generalize across embodiments.
Qwen-RobotWorld bridges this gap by treating natural language as a universal action interface. A single instruction like "pick up the red cup and place it on the shelf" implicitly encodes the complete action sequence, goal state, and physical constraints — no robot-specific control interface needed. This allows manipulation, autonomous driving, and indoor navigation to be trained jointly, with each domain's physical knowledge reinforcing the others.
Language unifies the action space: world knowledge and embodied knowledge reinforce each other within a single model, enabling cross-scenario, cross-task physical generalization.

Qwen-RobotWorld adopts a dual-stream Multimodal Diffusion Transformer (MMDiT).
The two streams interact via joint attention at every layer, enabling bidirectional cross-modal fusion throughout the denoising process.
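The joint attention described above can be illustrated with a minimal single-head sketch. It assumes each stream keeps its own query/key/value projections and that the two token sequences are concatenated before attention, as is common in MMDiT-style designs; the function names, the dictionary layout of the projections, and the labels "text" and "video" for the two streams are illustrative assumptions, not the released implementation.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(w, x):
    # Multiply matrix w (list of rows) by vector x.
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def joint_attention(text_tokens, video_tokens, text_proj, video_proj):
    """Single-head joint attention over two token streams (hypothetical sketch).

    Each *_proj is a dict with stream-specific 'q', 'k', 'v' weight matrices.
    """
    def project(tokens, proj):
        return ([matvec(proj["q"], t) for t in tokens],
                [matvec(proj["k"], t) for t in tokens],
                [matvec(proj["v"], t) for t in tokens])

    tq, tk, tv = project(text_tokens, text_proj)
    vq, vk, vv = project(video_tokens, video_proj)

    # Concatenate both streams so every token can attend to every other token.
    q, k, v = tq + vq, tk + vk, tv + vv
    scale = 1.0 / math.sqrt(len(k[0]))

    out = []
    for qi in q:
        weights = softmax([scale * sum(a * b for a, b in zip(qi, kj)) for kj in k])
        out.append([sum(w * vj[d] for w, vj in zip(weights, v))
                    for d in range(len(v[0]))])

    # Split the fused sequence back into the two streams.
    n_text = len(text_tokens)
    return out[:n_text], out[n_text:]
```

Because attention runs over the concatenated sequence, a change in either stream affects the outputs of both, which is the bidirectional fusion the text refers to.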
Using an MLLM as the action encoder — rather than lightweight encoders like T5 or CLIP — provides two key advantages: (1) deep language understanding accurately parses complex, compositional instructions into precise condition signals; (2) internalized world knowledge (e.g., that robot arms are rigid bodies with fixed joint constraints) implicitly constrains physically plausible transitions, preventing common failure modes like object deformation across frames.
Scene2Robot enables cross-embodiment video editing: human demonstrations are retargeted to 14 robot morphologies via a multi-segment conditioning mechanism, where joint attention allows the generation to simultaneously attend to scene appearance and robot motion trajectory. This capability both serves as a data scaling engine during training and enables human-to-robot transfer at inference time.

Single-camera observation inevitably occludes critical contact and spatial details. Qwen-RobotWorld generates 2–4 synchronized camera streams — main view, wrist-mounted views, and third-person views — with geometrically consistent object identity and motion across all viewpoints. During training, synchronized frames from multiple cameras are spatially concatenated into a single input; the model generates all views simultaneously, with asymmetric 3D RoPE providing spatial encoding and attention layers naturally establishing cross-view correspondence — without any architectural modification. This cross-view consistency further acts as a geometric regularizer, teaching the model object shape, depth, and spatial layout.
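As a rough illustration of the spatial concatenation step, the sketch below joins synchronized views side by side into one frame per time step and splits them apart again. The concatenation axis (width) and the nested-list frame representation are assumptions made for the example; the post does not specify the layout, and `concat_views` and `split_views` are hypothetical helper names.

```python
def concat_views(views):
    """Concatenate synchronized videos side by side along the width axis.

    views: list of videos; each video is a list of frames; each frame is a
    list of rows (H x W). All views must share frame count and height.
    """
    n_frames = len(views[0])
    height = len(views[0][0])
    for video in views:
        if len(video) != n_frames or any(len(f) != height for f in video):
            raise ValueError("views must have matching frame count and height")
    return [
        [[px for video in views for px in video[t][r]] for r in range(height)]
        for t in range(n_frames)
    ]

def split_views(canvas, widths):
    """Invert concat_views given the width of each original view."""
    views, start = [], 0
    for w in widths:
        views.append([[row[start:start + w] for row in frame] for frame in canvas])
        start += w
    return views
```

Since the model sees one wide frame, ordinary attention over its patches already spans all cameras, which is consistent with the claim that no architectural change is needed.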

The Embodied World Knowledge (EWK) dataset is organized along four complementary axes, each targeting a distinct source of physical variation.
The central challenge in building a universal world model is representational heterogeneity: manipulation uses joint angles, driving uses steering commands, navigation uses heading vectors — each requiring a separate model. Our action-language mapping framework resolves this by projecting all action signals onto a shared natural language space, so that videos from a Franka gripper, an autonomous vehicle, and a navigation agent all become instances of the same language-conditioned video generation task.
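To make the idea of a shared language space concrete, here is a toy sketch that renders three heterogeneous action signals as plain sentences. The templates, thresholds, and sign conventions (positive steering angle means left; x is forward and y is left for headings) are all assumptions for illustration. The actual captions in the post come from an annotation pipeline, not from fixed templates like these.

```python
import math

def describe_joint_deltas(deltas_rad):
    # Manipulation: mapping of joint name -> angle change in radians.
    parts = [f"the {name} joint by {math.degrees(d):+.0f} degrees"
             for name, d in deltas_rad.items() if abs(d) > 1e-3]
    if not parts:
        return "The robot arm holds its pose."
    return "The robot arm rotates " + ", ".join(parts) + "."

def describe_steering(angle_deg, speed_mps):
    # Driving: assumed convention that a positive angle steers left.
    if abs(angle_deg) < 2.0:
        motion = "drives straight"
    elif angle_deg > 0:
        motion = "steers left"
    else:
        motion = "steers right"
    return f"The vehicle {motion} at about {speed_mps:.0f} m/s."

def describe_heading(dx, dy):
    # Navigation: heading vector with x forward and y to the left (assumed).
    angle = math.degrees(math.atan2(dy, dx))
    if abs(angle) < 15.0:
        motion = "moves forward"
    elif angle > 0:
        motion = "turns left"
    else:
        motion = "turns right"
    return f"The agent {motion}."

def to_instruction(domain, signal):
    # Every domain ends up as the same kind of object: a text condition.
    if domain == "manipulation":
        return describe_joint_deltas(signal)
    if domain == "driving":
        return describe_steering(*signal)
    if domain == "navigation":
        return describe_heading(*signal)
    raise ValueError(f"unknown domain: {domain}")
```

Whatever the source signal, the output is a string, so a single language-conditioned generator can consume all three domains.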
A hierarchical five-layer annotation pipeline ensures caption quality and precision:
1. Task Goal: High-level intent (what should change between states)
2. Action Detail: Spatio-temporal trajectories with explicit viewpoint declaration
3. Physical Feedback: Observable consequences on the environment
4. Comprehensive Caption: Full description for precise prediction
5. Concise Caption: Essential elements for brief task-level commands
During training, comprehensive and concise descriptions are sampled with equal probability, so the model handles both detailed trajectory specifications and brief task-level commands.
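The equal-probability caption sampling can be sketched as follows. The `ClipAnnotation` record mirrors the five annotation layers listed above; its field names and the `sample_caption` helper are hypothetical and chosen only for this example.

```python
import random
from dataclasses import dataclass

@dataclass
class ClipAnnotation:
    # One field per annotation layer (names are illustrative).
    task_goal: str
    action_detail: str
    physical_feedback: str
    comprehensive_caption: str
    concise_caption: str

def sample_caption(annotation, rng):
    # Pick the comprehensive or the concise caption with probability 0.5 each.
    if rng.random() < 0.5:
        return annotation.comprehensive_caption
    return annotation.concise_caption
```

Passing an explicit `random.Random` instance keeps the sampling reproducible across runs, which helps when comparing training configurations.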
Training follows a general-to-expert progressive curriculum:
| Stage | Phase | Data Mix | Objective |
|---|---|---|---|
| Pretraining | T2I / T2V / TI2V joint | General data | Build foundational visual priors |
| Pretraining | Human interaction | Ego4D, EPIC-Kitchen, etc. | Grasping & tool-use priors |
| SFT | Phase 1: Single-view manipulation | Embodied + general | Core manipulation physics |
| SFT | Phase 2: Multi-view expansion | Embodied + general | Broaden viewpoint coverage |
| SFT | Phase 3: Multi-view concatenation | Embodied + general | Cross-view geometric consistency |
| SFT | Phase 4: Complex cross-domain | Embodied + general | Long-horizon & cross-scenario |
Pretraining on general data and human interaction videos (Ego4D, EPIC-Kitchen) builds broad visual priors — the T2I task specifically anchors object geometry that transfers to video generation through the shared backbone. SFT then progressively deepens embodied expertise across four phases while keeping general data in every batch, ensuring both capabilities advance together rather than trade off.
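One simple way to keep general data in every batch is to reserve a fixed share of each batch for it. The sketch below does this under stated assumptions: the post gives no mixing ratio, so `general_fraction` is a hypothetical parameter, and `mixed_batch` is an illustrative helper rather than the authors' data loader.

```python
import random

def mixed_batch(embodied_pool, general_pool, batch_size, general_fraction, rng):
    """Draw one batch that always contains at least one general sample.

    general_fraction is a hypothetical knob; the source does not state a ratio.
    """
    n_general = max(1, round(batch_size * general_fraction))
    n_general = min(n_general, batch_size)
    # Sample with replacement so small pools do not raise errors.
    batch = rng.choices(general_pool, k=n_general)
    batch += rng.choices(embodied_pool, k=batch_size - n_general)
    rng.shuffle(batch)
    return batch
```

For example, `mixed_batch(embodied, general, 8, 0.25, random.Random(0))` returns eight items, two of which come from the general pool.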
We evaluate against general video generation models (Sora2, Veo3, Wan2.6, Kling, LTX-2) and embodied world models (Cosmos, LVP, GigaWorld, Vidar, Wow) across four benchmarks.

@article{qwenrobot-world,
title={Qwen-RobotWorld Technical Report: Unifying Embodied World Modeling through Language-Conditioned Video Generation},
author={Qwen Team},
year={2026}
}