
Alibaba Cloud Model Studio - Training data summary

Last Updated: Mar 15, 2026

The Qwen and Wan model series in Alibaba Cloud Model Studio are self-developed models trained on datasets containing trillions of tokens across the text, image, video, and audio modalities. The earliest dataset used during model development predates January 2022.

Data sources

Training data comes from four categories:

  • Publicly available internet data -- web content accessible to the general public.

  • Third-party partner data -- non-public data provided by partners under agreement.

  • Labeled and contracted data -- data from data-labeling services and paid contractors.

  • Synthetic data -- data generated by proprietary Alibaba Cloud models to supplement real-world data.

Datasets may include copyrighted, trademarked, or patented content, as well as public domain content.

Synthetic data

Synthetic data generated by proprietary models addresses four needs:

  • Fill gaps where real-world training data is limited

  • Handle complex tasks that require specialized training examples

  • Improve model generalization and perceptual performance

  • Strengthen model safety and robustness

Core capabilities

The training data underpins the following inference service capabilities:

  • General language understanding

  • Advanced reasoning

  • Multimodal interaction

  • Long-context processing

  • Visual generation

Training data also supports task alignment, modality fusion, and capability refinement, enabling accurate and secure handling of diverse, specialized, and multimodal requests.

Customer data policy

Alibaba Cloud Model Studio does not use customer business data to develop or improve models without explicit consent.

Data governance

A data governance framework covers the full training pipeline (pre-training through post-training) with measures for data cleaning, processing, and structural optimization.

Pre-training

Raw training data undergoes cleaning and filtering:

  • Automated content safety screening removes harmful or sensitive content.

  • Human review supplements automated filters as an additional safeguard.

  • Personal information filtering removes or reduces personal data in training datasets during preprocessing.

These measures help the model identify and reduce bias during inference, improving fairness and impartiality in outputs.

Post-training

Language model data augmentation

High-quality data is selected using a fine-grained annotation framework across five dimensions:

  • Educational value -- prioritize data with strong learning signals

  • Domain coverage -- ensure breadth across knowledge areas

  • Language types -- support multilingual understanding

  • Reasoning complexity -- build advanced reasoning capabilities

  • Safety levels -- filter based on content safety criteria

Synthetic data from specialized models, including Qwen-Math and Qwen-Coder, further improves performance in multilingual understanding, complex reasoning, and long-context modeling.
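The five-dimension selection above can be sketched as a simple scoring filter. The dimension names mirror the annotation framework; the score ranges, thresholds, and the `Annotation` structure are assumptions made for illustration only.

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    """Hypothetical per-sample annotation across the five dimensions."""
    educational_value: float    # 0..1, strength of learning signal
    domain_coverage: float      # 0..1, contribution to knowledge breadth
    language_type: str          # e.g. "en", "zh"
    reasoning_complexity: float # 0..1, difficulty of reasoning required
    safety_level: float         # 0..1, where 1.0 is fully safe

def select(samples, min_quality=0.5, min_safety=0.9):
    """Keep samples clearing quality and safety thresholds (illustrative)."""
    kept = []
    for text, ann in samples:
        quality = (ann.educational_value + ann.domain_coverage
                   + ann.reasoning_complexity) / 3
        if quality >= min_quality and ann.safety_level >= min_safety:
            kept.append(text)
    return kept
```

A real framework would weight the dimensions and apply per-language and per-domain quotas rather than a single averaged score.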

Visual generative model processing

Multimodal data undergoes targeted preprocessing:

  • High-precision OCR and document structural parsing for text extraction from images and documents

  • 2D/3D spatial semantic annotation for spatial understanding

  • Explicit temporal alignment between video frames and text for video comprehension

Large-scale multimodal synthetic datasets strengthen cross-modal alignment between vision and language. They support complex document and long-form video understanding, and provide a training foundation for visual generation and agent-based interactions.
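The explicit temporal alignment mentioned above can be sketched as pairing each sampled video frame with the caption whose time interval covers it. The data layout here (timestamped caption intervals, frame timestamps in seconds) is an assumption for illustration, not Model Studio's actual format.

```python
import bisect

def align_frames_to_captions(frame_times, captions):
    """Pair each frame timestamp with the caption covering it.

    captions: list of (start_sec, end_sec, text), sorted by start_sec.
    Returns one (frame_time, caption_text_or_None) pair per frame.
    """
    starts = [start for start, _, _ in captions]
    pairs = []
    for t in frame_times:
        # Index of the last caption starting at or before t.
        i = bisect.bisect_right(starts, t) - 1
        if i >= 0 and t < captions[i][1]:
            pairs.append((t, captions[i][2]))
        else:
            pairs.append((t, None))  # frame falls in a caption gap
    return pairs
```

Aligned (frame, text) pairs of this kind give a video-language model an explicit temporal grounding signal, which is what supports the long-form video comprehension described above.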

Safety alignment

Dedicated safety datasets are used during safety alignment to strengthen the model's built-in safeguards.

Objective

These processes improve data quality and task alignment, maintain model safety and compliance, and achieve target performance across general-purpose, specialized, and multimodal inference.