The Qwen and Wan model series in Alibaba Cloud Model Studio are self-developed models trained on datasets containing trillions of tokens across text, image, video, and audio. The earliest dataset used during model development pre-dates January 2022.
Data sources
Training data includes four categories:
- Publicly available internet data -- web content accessible to the general public.
- Third-party partner data -- non-public data provided by partners under agreement.
- Labeled and contracted data -- data from data-labeling services and paid contractors.
- Synthetic data -- data generated by proprietary Alibaba Cloud models to supplement real-world data.
Datasets may include copyrighted, trademarked, or patented content, as well as public domain content.
Synthetic data
Synthetic data generated by proprietary models addresses four needs:
- Fill gaps where real-world training data is limited
- Handle complex tasks that require specialized training examples
- Improve model generalization and perceptual performance
- Strengthen model safety and robustness
Core capabilities
Training data underpins these inference service capabilities:
- General language understanding
- Advanced reasoning
- Multimodal interaction
- Long-context processing
- Visual generation
Training data also supports task alignment, modality fusion, and capability refinement, enabling accurate and secure handling of diverse, specialized, and multimodal requests.
Customer data policy
Alibaba Cloud Model Studio does not use customer business data to develop or improve models without explicit consent.
Data governance
A data governance framework covers the full training pipeline (pre-training through post-training) with measures for data cleaning, processing, and structural optimization.
Pre-training
Raw training data undergoes cleaning and filtering:
- Automated content safety screening removes harmful or sensitive content.
- Human review supplements automated filters as an additional safeguard.
- Preprocessing filters out personal information to reduce its presence in training datasets.
These measures help the model identify and reduce bias during inference, improving fairness and impartiality in outputs.
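The cleaning steps above can be sketched as a minimal preprocessing pass. The regex patterns, blocklist, and two-stage order below are illustrative assumptions for the sketch, not Model Studio's actual rules:

```python
import re

# Illustrative PII patterns (assumptions, not the production rule set).
PII_PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),        # email addresses
    re.compile(r"\b\d{3}[- ]?\d{4}[- ]?\d{4}\b"),  # phone-like numbers
]

# Placeholder blocklist standing in for an automated safety classifier.
BLOCKLIST = {"example-harmful-term"}

def scrub_pii(text: str) -> str:
    """Replace matched personal information with a neutral placeholder."""
    for pattern in PII_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text

def passes_safety_screen(text: str) -> bool:
    """Drop documents containing blocklisted terms; real pipelines use
    trained classifiers plus human review rather than a term list."""
    lowered = text.lower()
    return not any(term in lowered for term in BLOCKLIST)

def preprocess(corpus: list[str]) -> list[str]:
    """Safety-screen first, then scrub PII from the surviving documents."""
    return [scrub_pii(doc) for doc in corpus if passes_safety_screen(doc)]
```

In practice the safety screen would be a model-based classifier and the PII step a dedicated detection system; the sketch only shows how the two filters compose.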
Post-training
Language model data augmentation
High-quality data is selected using a fine-grained annotation framework across five dimensions:
| Dimension | Purpose |
|---|---|
| Educational value | Prioritize data with strong learning signals |
| Domain coverage | Cover breadth across knowledge areas |
| Language types | Support multilingual understanding |
| Reasoning complexity | Build advanced reasoning capabilities |
| Safety levels | Filter based on content safety criteria |
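A per-sample annotation along these five dimensions might be represented as below. The 0-1 scales, field types, and thresholds are assumptions made for illustration; the published criteria are not specified in this document:

```python
from dataclasses import dataclass

@dataclass
class SampleAnnotation:
    """Scores for one training sample along the five dimensions
    (illustrative representation, not the actual schema)."""
    educational_value: float    # 0-1, strength of the learning signal
    domain_coverage: float      # 0-1, contribution to knowledge breadth
    language_type: str          # e.g. "en", "zh"
    reasoning_complexity: float # 0-1, depth of reasoning required
    safety_level: float         # 0-1, content safety score

def keep_for_training(a: SampleAnnotation,
                      min_educational: float = 0.6,
                      min_safety: float = 0.9) -> bool:
    # Safety acts as a hard filter; educational value as a quality bar.
    # Thresholds here are hypothetical.
    return a.safety_level >= min_safety and a.educational_value >= min_educational
```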
Synthetic data from specialized models, including Qwen-Math and Qwen-Coder, further improves performance in multilingual understanding, complex reasoning, and long-context modeling.
Visual generative model processing
Multimodal data undergoes targeted preprocessing:
- High-precision OCR and document structural parsing for text extraction from images and documents
- 2D/3D spatial semantic annotation for spatial understanding
- Explicit temporal alignment between video frames and text for video comprehension
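The temporal-alignment step can be illustrated as pairing each sampled video frame with the caption whose time interval contains it. The data shapes and interval-containment rule are assumptions for the sketch:

```python
from dataclasses import dataclass

@dataclass
class Frame:
    timestamp: float  # seconds from video start

@dataclass
class Caption:
    start: float  # caption interval start, seconds
    end: float    # caption interval end, seconds
    text: str

def align_frames_to_text(frames: list[Frame],
                         captions: list[Caption]) -> list[tuple[float, str]]:
    """Pair each frame with the caption whose [start, end) interval
    contains its timestamp (illustrative alignment rule)."""
    pairs = []
    for frame in frames:
        for cap in captions:
            if cap.start <= frame.timestamp < cap.end:
                pairs.append((frame.timestamp, cap.text))
                break
    return pairs
```

The resulting (timestamp, text) pairs are the kind of explicitly aligned supervision a video-language model can train on.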
Large-scale multimodal synthetic datasets strengthen cross-modal alignment between vision and language. They support complex document and long-form video understanding, and provide a training foundation for visual generation and agent-based interactions.
Safety alignment
Dedicated safety datasets are used during safety alignment to strengthen the model's built-in safeguards.
Objective
These processes improve data quality and task alignment, maintain model safety and compliance, and achieve target performance across general-purpose, specialized, and multimodal inference.