Platform For AI: Create and manage datasets

Last Updated: Dec 26, 2025

To process data or train a model, you must first prepare a dataset. AI Asset Management lets you create and manage datasets. With its version management, you can reproduce experiments precisely, track data versions, and record data lineage. If a new version causes problems, you can roll back to a previous version to maintain business continuity.

Overview

AI Asset Management lets you manage basic and labeled datasets. A basic dataset typically contains large volumes of raw information and is primarily used to pre-train models to capture broad features and patterns. A labeled dataset contains human-annotated data with specific labels and is primarily used for model fine-tuning and evaluation to improve model performance on specific tasks.

| Item | Basic dataset | Labeled dataset |
| --- | --- | --- |
| Definition | Raw, unlabeled data | Human-annotated data |
| Data processing | Data cleaning, deduplication, and more | Data labeling, validation, and more |
| Scenarios | Unsupervised learning; pre-training models to capture broad features | Supervised learning and model evaluation; fine-tuning models to improve performance on specific tasks |

Go to the Datasets page

  1. Log on to the PAI console.

  2. In the upper-left corner, select the region where your workspace is located.

  3. In the left-side navigation pane, choose Workspaces. Click the name of the workspace that you want to open.

  4. In the left-side navigation pane, choose AI Asset Management > Datasets.

Create a basic dataset

On the Custom Datasets tab, click Create Dataset and select Basic for Data Type. You can create a dataset from Object Storage Service (OSS) or File Storage (General-purpose NAS, Extreme NAS, CPFS, and AI-CPFS).

Storage type is Object Storage Service (OSS)

Parameter

Description

Content Type

Select the data type, such as image, text, audio, video, table, or general. Specifying a type allows the system to filter datasets for future labeling tasks.

Owner

Select the dataset owner. Only workspace administrators can configure this parameter.

Import Format/OSS Path

  • File: Set OSS Path to the path of a single file in OSS. This format is commonly used to create datasets for iTAG.

  • Folder: Set OSS Path to the path of a folder in OSS. The folder can then be mounted in a container. This format is commonly used for datasets in DSW, DLC, or EAS.

Default Mount Path

The default path for mounting the data. This is often used in DSW and DLC:

  • In DSW, you can mount an existing file system to this path when you create an instance.

  • In DLC, your code can access files under this path. For example, if the mount path is /root/data, you can run python /root/data/file.py.

Enable Version Acceleration

You can enable dataset version acceleration when you set Import Format to Folder. Key settings include:

  • Maximum Capacity: The capacity of the acceleration slot. This value must be greater than or equal to the dataset size.

  • Accelerated Mount Target: By default, an internal mount target is used. You can also select an existing accelerated mount target or create a new one.

    Note

    If you use Lingjun Intelligent Computing Resources and select Create Mount Target for Accelerated Mount Target, you must set Mount Target Type to VPC. The selected VPC and vSwitch must match those used by the Lingjun resources.

  • Accelerated Version Default Mount Path: The default mount path for the accelerated dataset version.
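
As a rough illustration of how code in a DLC job or DSW instance might read from a mounted folder dataset, the sketch below lists the files under a mount path. The path /root/data comes from the example in the Default Mount Path description. The function name list_dataset_files and the DATASET_MOUNT_PATH environment variable are hypothetical names introduced for this example only and are not part of PAI.

```python
import os
from pathlib import Path

# Assumption: the dataset is mounted at /root/data, as in the example above.
# DATASET_MOUNT_PATH is a hypothetical override, not a variable set by PAI.
DEFAULT_MOUNT_PATH = "/root/data"


def list_dataset_files(mount_path=None, suffixes=None):
    """Return sorted relative paths of the files under a mounted dataset folder.

    suffixes is an optional set of lowercase extensions such as {".jpg"}.
    """
    root = Path(mount_path or os.environ.get("DATASET_MOUNT_PATH", DEFAULT_MOUNT_PATH))
    if not root.is_dir():
        raise FileNotFoundError(f"Dataset mount path not found: {root}")
    files = []
    for path in root.rglob("*"):
        # Keep regular files only, optionally filtered by extension.
        if path.is_file() and (suffixes is None or path.suffix.lower() in suffixes):
            files.append(path.relative_to(root).as_posix())
    return sorted(files)
```

Failing early when the mount path is missing makes a misconfigured mount easier to spot than a training loop that silently sees zero files.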

Storage type is file system

Parameter

Description

Content Type

Select the data type, such as image, text, audio, video, table, or general. Specifying a type allows the system to filter datasets for future labeling tasks.

Owner

Select the dataset owner. Only workspace administrators can configure this parameter.

File System

Select a file system that corresponds to the Storage Type.

Mount Target

Configure a mount target to access the file system.

File System Path

Specify the path to your data in the file system. For example, /.

Default Mount Path

The default path for mounting the data. This is often used in DSW and DLC:

  • In DSW, you can mount an existing file system to this path when you create an instance.

  • In DLC, your code can access files under this path. For example, if the mount path is /root/data, you can run python /root/data/file.py.

Enable Version Acceleration

If the Storage Type is General-purpose NAS, Extreme NAS, or CPFS, you can enable dataset version acceleration. The key parameters are described as follows:

  • Maximum Capacity: The capacity of the acceleration slot. This value must be greater than or equal to the dataset size.

  • Accelerated Version Default Mount Path: The default mount path for the accelerated dataset version.
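
Because Maximum Capacity must be greater than or equal to the dataset size, it can help to measure the data before you fill in the form. The following is a minimal sketch that sums file sizes under a locally accessible copy of the dataset folder. The function names are hypothetical, and the GiB unit is an assumption made for this example; check the unit shown in the console.

```python
import os

GIB = 1024 ** 3  # Assumption: capacity is compared in GiB for this sketch.


def folder_size_bytes(folder):
    """Sum the sizes of all regular files under folder (symlinks are skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(folder):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def capacity_is_sufficient(dataset_bytes, max_capacity_gib):
    """Apply the documented rule: Maximum Capacity >= dataset size."""
    return max_capacity_gib * GIB >= dataset_bytes
```

For example, capacity_is_sufficient(folder_size_bytes("/path/to/data"), 100) checks a hypothetical 100 GiB slot against the measured size.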

Create a basic dataset version

On the Custom Datasets tab, click Create Version in the Actions column for the target dataset.

Note:

  • The dataset name, storage type, and data type are inherited from version V1 and cannot be changed.

  • The dataset version number is generated automatically by the system and is read-only.

  • For other parameter settings, see the descriptions in the Create a basic dataset section.
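
Dataset versions are most useful for reproducing experiments when each run records the version it used. The sketch below shows one possible way to do that on the client side by writing a small JSON manifest next to the run outputs. Everything here is hypothetical and separate from PAI's own lineage tracking: the DatasetRef fields, the manifest layout, and the "v2" style version label are assumptions for illustration.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class DatasetRef:
    """Hypothetical record of the dataset version a run consumed."""
    name: str
    version: str      # Version label as shown in the console (format assumed).
    mount_path: str   # Where the dataset was mounted for the run.


def write_run_manifest(path, run_id, dataset):
    """Write a JSON manifest tying a run ID to a dataset version."""
    manifest = {"run_id": run_id, "dataset": asdict(dataset)}
    with open(path, "w", encoding="utf-8") as f:
        json.dump(manifest, f, indent=2, sort_keys=True)
    return manifest


def read_dataset_ref(path):
    """Load the dataset reference back from a manifest file."""
    with open(path, encoding="utf-8") as f:
        return DatasetRef(**json.load(f)["dataset"])
```

To rerun an experiment, read the manifest and mount the same dataset version before you start the job.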

View public datasets

The system provides a variety of built-in public datasets, such as MMLU, CMMLU, and GSM8K. On the Public Datasets tab, you can click a dataset name to view its basic information.

Manage datasets

For custom datasets, you can view the list of versions, create a new version, make the dataset public, or delete the dataset. For labeled datasets, you can view the data, make the dataset public, or delete the dataset.

Note:

  • For a dataset with its Visibility set to Visible only to dataset owner, you can click Make Dataset Public to share the dataset within the workspace. This allows all workspace members to view the dataset. Once public, a dataset cannot be made private again. Proceed with caution.

  • If a RAM user receives an access denied error when trying to view dataset data, you must grant permissions to the RAM user.

  • Deleting a dataset is irreversible and might affect running tasks that depend on it. Proceed with caution.