Before you process data or train a model, you must prepare your datasets. PAI Asset Management provides a dataset management feature for creating and managing datasets and their versions. With dataset version management, you can accurately reproduce experiments, track data versions, and record data lineage. If a new version has issues, you can quickly revert to an older version, which helps maintain business continuity.
Overview
The dataset management feature supports both basic and labeled datasets. Basic datasets typically contain a large volume of raw data and are primarily used to pre-train models so that they capture broad features and patterns. Labeled datasets contain data with explicit, manually added labels. They are mainly used for model fine-tuning and evaluation to improve performance on specific tasks.
Item | Basic dataset | Labeled dataset |
Definition | Unlabeled raw data | Data manually annotated with labels |
Data processing | Data cleaning, deduplication, and more | Data annotation, validation, and more |
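Scenarios | Capturing broad features and patterns from large volumes of raw data | Model fine-tuning and evaluation for specific tasks |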
Go to the Datasets page
Log on to the PAI console.
In the upper-left corner, select the required region.
In the navigation pane on the left, choose Workspaces and click the name of the target workspace.
In the navigation pane on the left, choose AI Asset Management > Datasets.
Create a basic dataset
On the Custom Datasets tab, click Create Dataset and select Basic for Data Type. The supported Storage Type options are Object Storage Service (OSS) and file storage. Supported file storage types include General-purpose NAS file system, Extreme NAS file system, CPFS, and Intelligent Computing CPFS. The key parameter settings are as follows:
Storage type is Object Storage Service (OSS)
Parameter | Description |
Content Type | The type of data. Valid values: Image, Text, Audio, Video, Table, and General. If you select a specific type, the system helps you filter datasets in later annotation scenarios. |
Owner | Select the dataset owner. This parameter can only be set by workspace administrators. |
Import Format/OSS Path |
|
Default Mount Path | The default path where the data is mounted. This is often used in DSW and DLC:
|
Enable Version Acceleration | If you set Import Format to Folder, you can enable dataset version acceleration. The key parameters are described as follows:
|
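Once a dataset is mounted in a DSW instance or DLC job, its contents appear as ordinary files under the mount path. The following is a minimal sketch of how code might enumerate those files. The function name `list_dataset_files` and the placeholder path are hypothetical and do not come from PAI. Substitute the mount path configured for your dataset.

```python
from pathlib import Path
from typing import List, Tuple


def list_dataset_files(mount_path: str, suffixes: Tuple[str, ...] = ()) -> List[Path]:
    """Return all regular files under mount_path, sorted by path.

    suffixes is an optional tuple of lowercase extensions such as (".jpg",).
    An empty tuple means no filtering.
    """
    root = Path(mount_path)
    if not root.is_dir():
        # Fail early if the dataset is not mounted where the code expects it.
        raise FileNotFoundError(f"Dataset mount path not found: {root}")
    files = [p for p in root.rglob("*") if p.is_file()]
    if suffixes:
        files = [p for p in files if p.suffix.lower() in suffixes]
    return sorted(files)


# Example call (placeholder path, not a PAI default):
# images = list_dataset_files("/path/to/mount", (".jpg", ".png"))
```

Failing early on a missing directory makes a misconfigured mount visible before training starts rather than as an empty dataset later.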
Storage type is file storage
Parameter | Description |
Content Type | The type of data. Valid values: Image, Text, Audio, Video, Table, and General. If you select a specific type, the system helps you filter datasets in later annotation scenarios. |
Owner | Select the dataset owner. You must be a workspace administrator to configure this parameter. |
File System | Select the file system that corresponds to the Storage Type. |
File System Mount Target | Configure a mount target to access the NAS file system. |
File System Path | Configure an existing storage path in the NAS file system. For example, |
Default Mount Path | The default path where the data is mounted. This is often used in DSW and DLC:
|
Enable Version Acceleration | If the Storage Type is General-purpose NAS, Extreme NAS, or CPFS, you can enable dataset version acceleration. The key parameters are described as follows:
|
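The conditions for enabling dataset version acceleration differ between OSS and file storage. The sketch below restates them as a single check, which may be useful in a pre-flight script. The function name and the string labels are illustrative and are not identifiers from any PAI API. Treating Intelligent Computing CPFS as ineligible is an inference from its absence in the list above, not a documented rule.

```python
from typing import Optional

# Labels copied from the parameter descriptions above (illustrative only).
ACCELERABLE_FILE_STORAGE = {"General-purpose NAS", "Extreme NAS", "CPFS"}


def can_enable_version_acceleration(storage_type: str,
                                    import_format: Optional[str] = None) -> bool:
    """Return True if the described conditions for version acceleration hold."""
    if storage_type == "OSS":
        # For OSS, acceleration is tied to the Folder import format.
        return import_format == "Folder"
    # For file storage, only the listed file system types qualify.
    return storage_type in ACCELERABLE_FILE_STORAGE


print(can_enable_version_acceleration("OSS", "Folder"))
print(can_enable_version_acceleration("Extreme NAS"))
```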
Create a basic dataset version
On the Custom Datasets tab, click Create Version in the Actions column for the target dataset.

Note:
The dataset name, storage type, and data type are inherited from version V1 and cannot be changed.
The dataset version is generated by the system and cannot be changed.
For information about other key parameters, see the parameter descriptions in Create a basic dataset.
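The versioning rules in the note can be pictured with a small in-memory model. This is a sketch for illustration only: the `Dataset` and `DatasetVersion` classes are hypothetical and not PAI types, the OSS URIs are placeholders, and the assumption that labels continue as V2, V3, and so on goes beyond the text, which only names V1.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class DatasetVersion:
    """One immutable snapshot of a dataset (hypothetical model)."""
    name: str
    storage_type: str
    data_type: str
    version: str
    uri: str


@dataclass
class Dataset:
    name: str
    storage_type: str
    data_type: str
    versions: List[DatasetVersion] = field(default_factory=list)

    def create_version(self, uri: str) -> DatasetVersion:
        # The label is generated here, never supplied by the caller.
        # Assumption: labels continue V1, V2, ...
        label = f"V{len(self.versions) + 1}"
        # Name, storage type, and data type are copied from the dataset,
        # so every version inherits them unchanged.
        snapshot = DatasetVersion(self.name, self.storage_type,
                                  self.data_type, label, uri)
        self.versions.append(snapshot)
        return snapshot

    def get_version(self, label: str) -> DatasetVersion:
        for snapshot in self.versions:
            if snapshot.version == label:
                return snapshot
        raise KeyError(f"{self.name} has no version {label}")


dataset = Dataset(name="demo-images", storage_type="OSS", data_type="Basic")
dataset.create_version("oss://example-bucket/images/batch-1/")
v2 = dataset.create_version("oss://example-bucket/images/batch-2/")
print(v2.version, v2.storage_type)
```

Pinning an experiment to a specific label, for example `dataset.get_version("V1")`, is what makes a run reproducible and gives you an earlier version to fall back to if a newer one has problems.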
View public datasets
PAI provides multiple built-in public datasets, such as MMLU, CMMLU, and GSM8K. On the Public Datasets tab, you can click a dataset name to view its basic information.

Manage datasets
For custom datasets, you can view the version list, create a new version, make a dataset public, or delete it. For labeled datasets, you can view data, make a dataset public, or delete it.

Note:
For a dataset whose Visibility is set to Visible only to dataset owner, you can click Make Dataset Public to share it within the workspace. All workspace members can then view the dataset. Once public, a dataset cannot be made private again. Proceed with caution.
If a Resource Access Management (RAM) user lacks the permissions required to view dataset data, see Grant permissions to the RAM user.
Deleting a dataset may affect running tasks. Once deleted, a dataset cannot be recovered. Proceed with caution.