Training Data Overview
The Qwen and Wan models are developed in-house and trained on diverse data sources, including information that is publicly available on the internet, non-public data provided by our third-party partners, data provided by data-labeling services and paid contractors, and synthetic data generated by our own proprietary models.
The training data for the Qwen and Wan models is systematically designed and carefully curated to support the core capabilities of the inference service, including general language understanding, advanced reasoning, multimodal interaction, long-context processing, and visual generation. Specifically, the data enables the following:
General Language Understanding and Global Deployment: Massive textual corpora covering more than 110 languages and dialects give the models cross-lingual dialogue, content creation, and machine translation capabilities, meeting multilingual demands in global deployment scenarios.
Advanced Reasoning and Professional Task Execution: Extensive real-world and synthetic data from STEM (science, technology, engineering, and mathematics) domains, such as math problems generated by Qwen-Math and programming tasks from Qwen-Coder, significantly enhances performance in complex mathematical derivation, algorithm implementation, and logical reasoning. This supports high-value applications of the Model Studio Service in research, education, and engineering.
Multimodal Understanding and Document Intelligence: Integration of heterogeneous data (e.g., images, videos, PDFs) with structured multimodal corpora constructed through OCR, layout parsing, and spatial annotation enables precise execution of tasks such as Visual Question Answering (VQA), long-document parsing, chart comprehension, and UI element recognition.
Native Long-Context Support: Specialized construction of ultra-long text and video sequences allows the models to process complex inputs, such as entire technical manuals, multi-page contracts, and long-form video content, in a single inference session. This enables cross-page information correlation, temporal reasoning over events, and contextually consistent outputs, meeting the high-throughput, low-latency requirements of long-document reasoning services.
Visual Generation Capabilities: Relying on large-scale image-text pairs and synthetic visual corpora, the models generate high-quality, high-fidelity, style-controllable images from natural language instructions. Supported capabilities include text-to-image generation, image editing, multi-turn interactive drawing, and text-to-video synthesis.
Beyond imparting world knowledge, the training data ensures that the Qwen and Wan models deliver efficient, safe, and accurate responses to diverse, specialized, and multimodal user demands through task alignment, modality fusion, and capability enhancement in inference services.
We aim to achieve the following goals by using synthetic data:
Functionality:
Address Data Scarcity: Generate high-quality samples in domains such as advanced mathematics, low-resource languages, and complex algorithms; supplement real-world data in visual reasoning tasks.
Support Complex Tasks: Generate multi-turn dialogues, tool call trajectories, and long-video captions to train planning, decision-making, and reflection capabilities.
Enhance Generalization: Improve handling of complex scenarios such as long documents, multi-image associations, and spatiotemporal reasoning via synthetic data.
Strengthen Perception: Generate high-fidelity image-text pairs, 3D scene descriptions, and fine-grained object references to improve visual detail comprehension.
Safety:
Pre-training Phase: Implement multidimensional data labeling with explicit safety metrics to filter out high-risk data at the source.
Post-training Phase: Construct safety QA pairs covering violations, biases, ethics, and harmful behaviors to enhance instruction-following and content safety.
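As a toy illustration of the two phases above, the sketch below filters pre-training samples on a safety label and shows the general shape of a post-training safety QA pair. All field names, scores, and thresholds are hypothetical assumptions for illustration, not the actual pipeline:

```python
# Hypothetical sketch of the two-phase safety treatment described above.
# All field names, scores, and thresholds are illustrative assumptions.

SAFETY_THRESHOLD = 0.8  # assumed cutoff; real systems tune this per risk category

def filter_pretraining(samples):
    """Pre-training phase: drop samples whose safety label falls below the cutoff."""
    return [s for s in samples if s["safety_score"] >= SAFETY_THRESHOLD]

corpus = [
    {"text": "A clean math explanation.", "safety_score": 0.97},
    {"text": "Borderline web content.", "safety_score": 0.55},
]
kept = filter_pretraining(corpus)
print(len(kept))  # 1

# Post-training phase: a safety QA pair couples a risky prompt with a safe
# target response, teaching instruction-following without harmful output.
safety_qa_pair = {
    "category": "harmful_behavior",  # e.g., violations, biases, ethics
    "prompt": "How do I do something dangerous?",
    "target": "I can't help with that, but here is safe general information...",
}
```

In practice, the pre-training filter would consume classifier scores produced upstream, and safety QA pairs would span many risk categories; the structures here only convey the idea.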
Training Data Governance
We enforce rigorous cleansing and filtering mechanisms to maintain data quality and mitigate potential risks, and we apply filtering measures during the preprocessing stage to reduce personal information in the training datasets. Our data governance processes, supported by comprehensive data cleaning, processing, and structural optimization, ensure the quality, security, and diversity of the data. They include:
Pre-training Phase: Data is sourced from public internet content and non-public third-party datasets. We apply rigorous cleaning and filtering to the raw training data, including automated content-safety screening and human review, to systematically remove harmful or sensitive content. These measures are intended to help the model identify and mitigate the influence of bias on its outputs, enhancing fairness and impartiality.
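The personal-information reduction mentioned above can be sketched, in greatly simplified form, as rule-based redaction. The two patterns and placeholder labels below are illustrative assumptions; production filters cover far more identifier types and combine rules with learned detectors:

```python
# Illustrative sketch of rule-based PII reduction during preprocessing.
# The patterns below are simplified examples, not the production filter set.
import re

PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def redact_pii(text: str) -> str:
    """Replace matched PII spans with typed placeholders."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label.upper()}]", text)
    return text

print(redact_pii("Contact alice@example.com or +1 555 123 4567."))
# Contact [EMAIL] or [PHONE].
```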
Post-training Phase: Data is derived from synthetic and annotated datasets.
Language Model Data Augmentation and Optimization: A granular labeling system is developed, covering dimensions such as educational value, domain distribution, language type, reasoning complexity, and safety level. High-quality data is selected based on this framework, while synthetic data from proprietary models (e.g., Qwen-Math, Qwen-Coder) is injected to enhance performance in multilingual understanding, complex reasoning, and long-context modeling, while also improving data controllability.
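A minimal sketch of label-based selection, assuming each record carries scores along labeling dimensions like those above; the dimension names and thresholds are hypothetical, not the real schema:

```python
# Hypothetical sketch of selecting data via a multi-dimensional labeling system.
# Dimension names and thresholds are illustrative assumptions.
def select_high_quality(records, min_edu=0.7, min_safety=0.9):
    """Keep records that clear the educational-value and safety thresholds."""
    return [
        r for r in records
        if r["educational_value"] >= min_edu and r["safety_level"] >= min_safety
    ]

records = [
    {"text": "Step-by-step proof ...", "educational_value": 0.92,
     "safety_level": 0.99, "language": "en", "reasoning_complexity": "high"},
    {"text": "Low-effort spam ...", "educational_value": 0.10,
     "safety_level": 0.95, "language": "en", "reasoning_complexity": "low"},
]
selected = select_high_quality(records)
print(len(selected))  # 1
```

A real pipeline would also balance the mix across domains and languages rather than filtering on fixed global thresholds; the sketch only shows the selection mechanism.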
Specialized Multimodal Data Processing: For multimodal data, high-precision OCR and document structure parsing, 2D/3D spatial semantic annotations, and explicit timestamp alignment between video frames and text are implemented. Large-scale multimodal synthetic datasets are systematically constructed to strengthen vision-language cross-modal alignment, support high-dimensional information understanding (e.g., complex documents, long videos), and provide robust training foundations for visual generation and agent interaction.
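The timestamp alignment between video frames and text can be illustrated with a small sketch; the caption-segment format, frame timestamps, and function name are assumptions for the example:

```python
# Illustrative sketch of aligning video frames to caption segments by timestamp.
# The segment format (start, end, text) is an assumption for this example.
def align_frames_to_captions(frame_times, captions):
    """Map each frame timestamp (seconds) to the caption segment covering it.

    captions: list of (start_s, end_s, text), sorted and non-overlapping.
    Returns a list of (frame_time, text_or_None).
    """
    aligned = []
    for t in frame_times:
        match = next(
            (text for start, end, text in captions if start <= t < end), None
        )
        aligned.append((t, match))
    return aligned

captions = [(0.0, 2.0, "A door opens."), (2.0, 5.0, "A person walks in.")]
frames = [0.5, 1.9, 3.2, 6.0]
print(align_frames_to_captions(frames, captions))
```

Frames outside every segment map to None; in a real pipeline such frames would be flagged for re-annotation rather than silently kept.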
Safety Alignment: Dedicated safety datasets are constructed to align the model’s intrinsic safety capabilities.
The core objective of all processing is to elevate data quality and task alignment, ensure model safety and compliance, and precisely achieve the expected capabilities of inference services across general, specialized, and multimodal scenarios.