Training Data Overview
The Qwen and Wan models are developed in-house and trained on diverse data sources, including information that is publicly available on the internet, non-public data provided by our third-party partners, data provided by data-labeling services and paid contractors, and synthetic data generated by our own proprietary models.
The training data for the Qwen and Wan models is systematically designed and carefully curated to support the core capabilities of the inference service, including general language understanding, advanced reasoning, multimodal interaction, long-context processing, and visual generation. Specifically, the data enables the following:
General Language Understanding and Global Deployment: Massive textual corpora covering more than 110 languages and dialects give the models cross-lingual dialogue, content creation, and machine translation capabilities, meeting multilingual demands in global deployment scenarios.
Advanced Reasoning and Professional Task Execution: Extensive real-world and synthetic data from STEM (science, technology, engineering, and mathematics) domains, such as math problems generated by Qwen-Math and programming tasks from Qwen-Coder, significantly enhances performance in complex mathematical derivation, algorithm implementation, and logical reasoning. This supports high-value applications of the Model Studio Service in research, education, and engineering.
Multimodal Understanding and Document Intelligence: Integration of heterogeneous data (e.g., images, videos, PDFs) with structured multimodal corpora constructed through OCR, layout parsing, and spatial annotation enables precise execution of tasks such as Visual Question Answering (VQA), long-document parsing, chart comprehension, and UI element recognition.
Native Long-Context Support: Specialized construction of ultra-long text and video sequences allows the models to process complex inputs, such as entire technical manuals, multi-page contracts, and long-form video content, in a single inference session. This enables cross-page information correlation, temporal reasoning over events, and contextually consistent outputs, meeting the high-throughput, low-latency requirements of long-document reasoning services.
Visual Generation Capabilities: Relying on large-scale image-text pairs and synthetic visual corpora, the models generate high-quality, high-fidelity, style-controllable images from natural language instructions. Supported capabilities include text-to-image generation, image editing, multi-turn interactive drawing, and text-to-video synthesis.
Beyond imparting world knowledge, the training data ensures that the Qwen and Wan models deliver efficient, safe, and accurate responses to diverse, specialized, and multimodal user demands through task alignment, modality fusion, and capability enhancement in inference services.
We aim to achieve the following goals by using synthetic data:
Functionality:
Address Data Scarcity: Generate high-quality samples in domains such as advanced mathematics, low-resource languages, and complex algorithms; supplement real-world data in visual reasoning tasks.
Support Complex Tasks: Generate multi-turn dialogues, tool call trajectories, and long-video captions to train planning, decision-making, and reflection capabilities.
Enhance Generalization: Improve handling of complex scenarios such as long documents, multi-image associations, and spatiotemporal reasoning via synthetic data.
Strengthen Perception: Generate high-fidelity image-text pairs, 3D scene descriptions, and fine-grained object references to improve visual detail comprehension.
Safety:
Pre-training Phase: Implement multidimensional data labeling with explicit safety metrics to filter out high-risk data at the source.
Post-training Phase: Construct safety QA pairs covering violations, biases, ethics, and harmful behaviors to enhance instruction-following and content safety.
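As a toy illustration of the two phases above, the sketch below filters pre-training samples on a safety label and shows the general shape of a post-training safety QA pair. All field names, scores, and thresholds are hypothetical assumptions for illustration, not the actual pipeline:

```python
# Hypothetical sketch of the two-phase safety treatment described above.
# All field names, scores, and thresholds are illustrative assumptions.

SAFETY_THRESHOLD = 0.8  # assumed cutoff; real systems tune this per risk category

def filter_pretraining(samples):
    """Pre-training phase: drop samples whose safety label falls below the cutoff."""
    return [s for s in samples if s["safety_score"] >= SAFETY_THRESHOLD]

corpus = [
    {"text": "A clean math explanation.", "safety_score": 0.97},
    {"text": "Borderline web content.", "safety_score": 0.55},
]
kept = filter_pretraining(corpus)
print(len(kept))  # 1

# Post-training phase: a safety QA pair couples a risky prompt with a safe
# target response, teaching instruction-following without harmful output.
safety_qa_pair = {
    "category": "harmful_behavior",  # e.g., violations, biases, ethics
    "prompt": "How do I do something dangerous?",
    "target": "I can't help with that, but here is safe general information...",
}
```

In practice, the pre-training filter would consume classifier scores produced upstream, and safety QA pairs would span many risk categories; the structures here only convey the idea.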
Training Data Governance
We enforce rigorous cleansing and filtering mechanisms to maintain data quality and mitigate potential risks, and we apply filtering measures during the preprocessing stage to reduce personal information in the training datasets. Our data governance processes, supported by comprehensive data cleaning, processing, and structural optimization, ensure the quality, security, and diversity of the data. They include:
Pre-training Phase: Data is sourced from public internet content and non-public third-party datasets. We apply rigorous cleaning and filtering to the raw training data, including automated content-safety screening and human review, to systematically remove harmful or sensitive content. These measures are intended to help the model identify and mitigate the influence of bias on its outputs, enhancing fairness and impartiality.
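The personal-information reduction mentioned above can be sketched, in greatly simplified form, as rule-based redaction. The two patterns and placeholder labels below are illustrative assumptions; production filters cover far more identifier types and combine rules with learned detectors:

```python
# Illustrative sketch of rule-based PII reduction during preprocessing.
# The patterns below are simplified examples, not the production filter set.
import re

PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def redact_pii(text: str) -> str:
    """Replace matched PII spans with typed placeholders."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label.upper()}]", text)
    return text

print(redact_pii("Contact alice@example.com or +1 555 123 4567."))
# Contact [EMAIL] or [PHONE].
```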
Post-training Phase: Data is derived from synthetic and annotated datasets.
Language Model Data Augmentation and Optimization: A granular labeling system is developed, covering dimensions such as educational value, domain distribution, language type, reasoning complexity, and safety level. High-quality data is selected based on this framework, while synthetic data from proprietary models (e.g., Qwen-Math, Qwen-Coder) is injected to enhance performance in multilingual understanding, complex reasoning, and long-context modeling, while also improving data controllability.
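A minimal sketch of label-based selection, assuming each record carries scores along labeling dimensions like those above; the dimension names and thresholds are hypothetical, not the real schema:

```python
# Hypothetical sketch of selecting data via a multi-dimensional labeling system.
# Dimension names and thresholds are illustrative assumptions.
def select_high_quality(records, min_edu=0.7, min_safety=0.9):
    """Keep records that clear the educational-value and safety thresholds."""
    return [
        r for r in records
        if r["educational_value"] >= min_edu and r["safety_level"] >= min_safety
    ]

records = [
    {"text": "Step-by-step proof ...", "educational_value": 0.92,
     "safety_level": 0.99, "language": "en", "reasoning_complexity": "high"},
    {"text": "Low-effort spam ...", "educational_value": 0.10,
     "safety_level": 0.95, "language": "en", "reasoning_complexity": "low"},
]
selected = select_high_quality(records)
print(len(selected))  # 1
```

A real pipeline would also balance the mix across domains and languages rather than filtering on fixed global thresholds; the sketch only shows the selection mechanism.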
Specialized Multimodal Data Processing: For multimodal data, high-precision OCR and document structure parsing, 2D/3D spatial semantic annotations, and explicit timestamp alignment between video frames and text are implemented. Large-scale multimodal synthetic datasets are systematically constructed to strengthen vision-language cross-modal alignment, support high-dimensional information understanding (e.g., complex documents, long videos), and provide robust training foundations for visual generation and agent interaction.
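The timestamp alignment between video frames and text can be illustrated with a small sketch; the caption-segment format, frame timestamps, and function name are assumptions for the example:

```python
# Illustrative sketch of aligning video frames to caption segments by timestamp.
# The segment format (start, end, text) is an assumption for this example.
def align_frames_to_captions(frame_times, captions):
    """Map each frame timestamp (seconds) to the caption segment covering it.

    captions: list of (start_s, end_s, text), sorted and non-overlapping.
    Returns a list of (frame_time, text_or_None).
    """
    aligned = []
    for t in frame_times:
        match = next(
            (text for start, end, text in captions if start <= t < end), None
        )
        aligned.append((t, match))
    return aligned

captions = [(0.0, 2.0, "A door opens."), (2.0, 5.0, "A person walks in.")]
frames = [0.5, 1.9, 3.2, 6.0]
print(align_frames_to_captions(frames, captions))
```

Frames outside every segment map to None; in a real pipeline such frames would be flagged for re-annotation rather than silently kept.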
Safety Alignment: Dedicated safety datasets are constructed to align the model’s intrinsic safety capabilities.
The core objective of all processing is to elevate data quality and task alignment, ensure model safety and compliance, and precisely achieve the expected capabilities of inference services across general, specialized, and multimodal scenarios.