Platform For AI: Create and manage datasets

Last Updated: Dec 25, 2025

Before you process data or train a model, you must prepare your datasets. The dataset management feature of PAI Asset Management lets you create and manage datasets and their versions. With dataset version management, you can accurately reproduce experiments, track data versions, and record data lineage. If a new version has issues, you can quickly revert to an earlier version, which helps maintain business continuity.

Overview

The dataset management feature supports both basic and labeled datasets. Basic datasets typically contain a large volume of raw data and are primarily used for model pre-training, where a model learns broad features and patterns. Labeled datasets contain data with explicit, manually added labels. These datasets are mainly used for model fine-tuning and evaluation to improve performance on specific tasks.

| Item | Basic dataset | Labeled dataset |
| --- | --- | --- |
| Definition | Unlabeled raw data | Data manually annotated with labels |
| Data processing | Data cleaning, deduplication, and more | Data annotation, validation, and more |
| Scenarios | Unsupervised learning; model pre-training to capture broad features | Supervised learning and model evaluation; model fine-tuning to improve task-specific performance |

Go to the Datasets page

  1. Log on to the PAI console.

  2. In the upper-left corner, select the required region.

  3. In the navigation pane on the left, choose Workspaces and click the name of the target workspace.

  4. In the navigation pane on the left, choose AI Asset Management > Datasets.

Create a basic dataset

On the Custom Datasets tab, click Create Dataset and select Basic for Data Type. The supported Storage Type options are Object Storage Service (OSS) and file storage. Supported file storage types include General-purpose NAS file system, Extreme NAS file system, CPFS, and Intelligent Computing CPFS. The key parameter settings are as follows:

Storage type is Object Storage Service (OSS)

Parameter

Description

Content Type

The type of data. Valid values: Image, Text, Audio, Video, Table, and General. If you select a specific type, the system helps you filter datasets in later annotation scenarios.

Owner

Select the dataset owner. This parameter can only be set by workspace administrators.

Import Format/OSS Path

  • If you set Import Format to File, you must specify a file for OSS Path. The created dataset corresponds to this file. This is often used to create datasets for iTAG.

  • If you set Import Format to Folder, you must specify a folder path for OSS Path. This path can be mounted in a container. This is often used for datasets in DSW, DLC, or EAS.

Default Mount Path

The default path where the data is mounted. This is often used in DSW and DLC:

  • In DSW, you can mount a created file system to this path when you create an instance.

  • In DLC, the system searches for files in this directory when it runs code. For example, python /root/data/file.py.

Enable Version Acceleration

If you set Import Format to Folder, you can enable dataset version acceleration. The key parameters are described as follows:

  • Maximum Capacity: The capacity of the dataset acceleration slot. This value must be greater than or equal to the dataset size. You can set this parameter based on the size of the dataset that you want to accelerate.

  • Accelerated Mount Target: An internal mount target is used by default. You can also use an existing accelerated mount target or create one.

    Note

    When you use Lingjun resources, if you set Accelerated Mount Target to Create Mount Target, you must set Mount Target Type to VPC. The selected VPC and vSwitch must be the same as those of the Lingjun resources.

  • Accelerated Version Default Mount Path: The default mount path for the accelerated dataset version.
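The constraints in the table above (acceleration is available only when Import Format is Folder, and Maximum Capacity must be at least the dataset size) can be checked before you fill in the console form. The following is a minimal sketch, not part of any PAI SDK: the class `OssDatasetSettings`, the function `validate`, the field names, and the GiB unit are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class OssDatasetSettings:
    """Hypothetical container mirroring the console fields (not a PAI SDK type)."""
    import_format: str                        # "File" or "Folder"
    oss_path: str                             # placeholder, e.g. "oss://example-bucket/train/"
    dataset_size_gib: float                   # unit is an assumption for this sketch
    enable_acceleration: bool = False
    max_capacity_gib: Optional[float] = None  # only meaningful when acceleration is on


def validate(settings: OssDatasetSettings) -> List[str]:
    """Return a list of rule violations; an empty list means the settings pass."""
    problems = []
    if settings.import_format not in ("File", "Folder"):
        problems.append("Import Format must be File or Folder")
    if settings.enable_acceleration:
        # Version acceleration is offered only for folder imports.
        if settings.import_format != "Folder":
            problems.append("version acceleration requires Import Format = Folder")
        # The acceleration slot must be able to hold the whole dataset.
        if settings.max_capacity_gib is None:
            problems.append("Maximum Capacity is required when acceleration is enabled")
        elif settings.max_capacity_gib < settings.dataset_size_gib:
            problems.append("Maximum Capacity must be >= dataset size")
    return problems
```

For example, calling `validate` on a File import with acceleration enabled and a capacity smaller than the dataset reports both violations, so they can be corrected in one pass.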

Storage type is file storage

Parameter

Description

Content Type

The type of data. Valid values: Image, Text, Audio, Video, Table, and General. If you select a specific type, the system helps you filter datasets in later annotation scenarios.

Owner

Select the dataset owner. You must be a workspace administrator to configure this parameter.

File System

Select the file system that corresponds to the Storage Type.

File System Mount Target

Configure a mount target to access the NAS file system.

File System Path

Configure an existing storage path in the NAS file system. For example, /.

Default Mount Path

The default path where the data is mounted. This is often used in DSW and DLC:

  • In DSW, you can mount a created file system to this path when you create an instance.

  • In DLC, the system searches for files in this directory when it runs code. For example, python /root/data/file.py.

Enable Version Acceleration

If the Storage Type is General-purpose NAS, Extreme NAS, or CPFS, you can enable dataset version acceleration. The key parameters are described as follows:

  • Maximum Capacity: The capacity of the dataset acceleration slot. This value must be greater than or equal to the dataset size. You can set this parameter based on the size of the dataset that you want to accelerate.

  • Accelerated Version Default Mount Path: The default mount path for the accelerated dataset version.
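Once a dataset is mounted, code in a DSW instance or DLC job reads it as an ordinary directory under the Default Mount Path. The sketch below shows one way to enumerate the mounted files using only the Python standard library. The function name `list_dataset_files` is hypothetical, and the path `/root/data` in the usage note is an assumption taken from the example path in the table, not a value PAI guarantees.

```python
from pathlib import Path
from typing import List


def list_dataset_files(mount_path: str, suffix: str = "") -> List[str]:
    """Return sorted, POSIX-style relative paths of regular files under mount_path.

    If suffix is given (for example ".csv"), only matching file names are returned.
    """
    root = Path(mount_path)
    if not root.is_dir():
        # Fail early with a clear message if the dataset was not mounted.
        raise FileNotFoundError(f"dataset mount path not found: {mount_path}")
    return sorted(
        p.relative_to(root).as_posix()
        for p in root.rglob("*")
        if p.is_file() and p.name.endswith(suffix)
    )
```

In a job, you might call `list_dataset_files("/root/data", ".csv")` at startup, assuming the dataset is mounted at `/root/data`. Raising an error when the directory is missing makes a misconfigured mount visible immediately instead of producing an empty training run.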

Create a basic dataset version

On the Custom Datasets tab, click Create Version in the Actions column for the target dataset.

Note:

  • The dataset name, storage type, and data type are inherited from version V1 and cannot be changed.

  • The dataset version is generated by the system and cannot be changed.

  • For information about other key parameters, see the parameter descriptions in Create a basic dataset.
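The versioning rules in this note can be illustrated with a small in-memory model. This is a sketch for intuition only and does not call PAI: the `DatasetVersion` class and `create_version` function are hypothetical, and the sequential integer version number is an assumption (the source states only that the system generates the version).

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class DatasetVersion:
    """Hypothetical record of one dataset version (not a PAI SDK type)."""
    name: str
    storage_type: str
    data_type: str
    version: int          # assumed to be sequential: 1, 2, 3, ...
    source_path: str
    description: str = ""


def create_version(history: List[DatasetVersion], source_path: str,
                   description: str = "") -> List[DatasetVersion]:
    """Return a new history with one more version appended.

    name, storage_type, and data_type are copied from the first version (V1);
    the version number is generated here rather than chosen by the caller.
    """
    if not history:
        raise ValueError("create the dataset (V1) before adding versions")
    first = history[0]
    new_version = replace(
        first,
        version=history[-1].version + 1,
        source_path=source_path,
        description=description,
    )
    return history + [new_version]
```

Because earlier entries stay in the history unchanged, reverting to an older version amounts to selecting that entry again, which is what makes experiments reproducible against a fixed data snapshot.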

View public datasets

PAI provides multiple built-in public datasets, such as MMLU, CMMLU, and GSM8K. On the Public Datasets tab, you can click a dataset name to view its basic information.

Manage datasets

For custom datasets, you can view the version list, create a new version, make a dataset public, or delete it. For labeled datasets, you can view data, make a dataset public, or delete it.

Note:

  • For a dataset with its Visibility set to Visible only to dataset owner, you can click Make Dataset Public to share the dataset within the workspace. This allows all workspace members to view the dataset. Once public, a dataset cannot be made private again. Proceed with caution.

  • If a Resource Access Management (RAM) user does not have the permissions required to view dataset data, see the documentation on granting permissions to the RAM user.

  • Deleting a dataset may affect running tasks. Once deleted, a dataset cannot be recovered. Proceed with caution.