
Platform For AI: Create and manage datasets

Last Updated: Apr 03, 2024

High-quality datasets are essential to high-precision models. The goal of data preparation is to create high-quality datasets. Platform for AI (PAI) provides a dataset management module. This module allows you to create datasets based on all types of data, including data stored on on-premises machines and data stored in Alibaba Cloud storage services. This module also allows you to scan Object Storage Service (OSS) folders to generate index datasets that can be used in intelligent labeling and model training. This topic describes how to create and manage datasets.

Background information

The dataset module provides multiple methods that allow you to create datasets. You can select one of the following methods based on the data source and scenario in which the data is used:

  • Create a dataset based on data that is stored in an Alibaba Cloud storage service

    Create a dataset based on the data that is stored in OSS or Apsara File Storage NAS. You can use the dataset in subsequent data processing or modeling.

  • Create a dataset by scanning a folder

    Scan a folder that is stored in OSS to generate an index file whose extension is .manifest and use the index file as a dataset. You can use the dataset in scenarios in which iTAG is used.

  • Create a dataset by registering a public dataset

    The public datasets available in PAI are open source datasets provided by Alibaba Cloud. The public datasets are stored in the public storage of Alibaba Cloud. You can register the public datasets without the need to create replicas in your storage. After you register a public dataset, you can use the dataset in subsequent data processing and modeling.

Prerequisites

An AI workspace is created. The datasets that you want to register are added to the AI workspace.

Limits

  • In the China (Ulanqab) region, you can create datasets only by using data from an Alibaba Cloud storage service or by scanning a folder.

  • You can create CPFS for Lingjun datasets only in the China (Ulanqab) region. Cloud Parallel File System (CPFS) datasets are not supported in the China (Ulanqab) region.

Account and permission requirements

  • Alibaba Cloud account: You can use an Alibaba Cloud account to complete all operations without additional authorization.

  • RAM user: You must grant the following permissions to the RAM user.

    • Dataset-related permissions

      You must add the RAM user as a workspace member and assign the required roles to the member. For information about the permissions of each role, go to the Roles and Permissions page. For information about how to add a RAM user as a workspace member, see Manage workspace members.

    • Permissions to view and use OSS buckets when you use an OSS dataset.

      Use the following script to create a policy and attach the policy to the RAM user. For information about how to create a policy, see Create custom policies. For information about how to grant permissions to a RAM user, see Grant permissions to a RAM user.

      {
          "Version": "1",
          "Statement": [
              {
                  "Effect": "Allow",
                  "Action": [
                      "oss:ListBuckets",
                      "oss:GetBucketStat",
                      "oss:GetBucketInfo",
                      "oss:GetBucketTagging",
                      "oss:GetBucketLifecycle",
                      "oss:GetBucketWorm",
                      "oss:GetBucketVersioning",
                      "oss:GetBucketAcl"
                  ],
                  "Resource": "acs:oss:*:*:*"
              },
              {
                  "Effect": "Allow",
                  "Action": [
                      "oss:ListObjects",
                      "oss:GetBucketAcl"
                  ],
                  "Resource": "acs:oss:*:*:mybucket"
              },
              {
                  "Effect": "Allow",
                  "Action": [
                      "oss:GetObject",
                      "oss:GetObjectAcl"
                  ],
                  "Resource": "acs:oss:*:*:mybucket/*"
              }
          ]
      }
    • Permissions to view and use NAS file systems when you use a NAS or CPFS dataset. These include the permissions to query file systems and, for CPFS only, to query protocol service information.

      Use the following script to create a policy and attach the policy to the RAM user. For information about how to create a policy, see Create custom policies. For information about how to grant permissions to a RAM user, see Grant permissions to a RAM user. For a quick way to check that the OSS and NAS permissions take effect, see the sketch after this list.

      {
          "Version": "1",
          "Statement": [
              {
                  "Effect": "Allow",
                  "Action": [
                      "nas:DescribeFileSystems",
                      "nas:DescribeProtocolMountTarget",
                      "nas:DescribeProtocolService"
                  ],
                  "Resource": "acs:nas:*:*:filesystem/*"
              }
          ]
      }
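
The following Python sketch shows one way to quickly check that the RAM user's AccessKey pair can actually perform the operations granted by the two policies above before you create the dataset. It is only an illustrative sketch: the bucket name mybucket comes from the sample policy, while the endpoint, region ID, and environment variable names are placeholders that you must replace with your own values. The oss2 package is the OSS Python SDK, and aliyun-python-sdk-core provides the generic API client that is used for the NAS call.

    # Illustrative sketch: verify the OSS and NAS permissions granted above by
    # calling the corresponding APIs with the RAM user's AccessKey pair.
    # "mybucket" comes from the sample policy; the endpoint, region ID, and
    # environment variable names are placeholders.
    import itertools
    import json
    import os

    import oss2
    from aliyunsdkcore.client import AcsClient
    from aliyunsdkcore.request import CommonRequest

    access_key_id = os.environ["ALIBABA_CLOUD_ACCESS_KEY_ID"]
    access_key_secret = os.environ["ALIBABA_CLOUD_ACCESS_KEY_SECRET"]

    # OSS check: list buckets and the first objects of the dataset bucket
    # (requires oss:ListBuckets and oss:ListObjects).
    endpoint = "https://oss-cn-hangzhou.aliyuncs.com"  # OSS endpoint of your region
    auth = oss2.Auth(access_key_id, access_key_secret)
    service = oss2.Service(auth, endpoint)
    print("buckets:", [b.name for b in oss2.BucketIterator(service)])

    bucket = oss2.Bucket(auth, endpoint, "mybucket")
    for obj in itertools.islice(oss2.ObjectIterator(bucket), 10):
        print("object:", obj.key, obj.size)

    # NAS check: query file systems (requires nas:DescribeFileSystems).
    client = AcsClient(access_key_id, access_key_secret, "cn-hangzhou")  # your region ID
    request = CommonRequest()
    request.set_domain("nas.cn-hangzhou.aliyuncs.com")  # NAS endpoint of your region
    request.set_version("2017-06-26")                   # NAS API version
    request.set_action_name("DescribeFileSystems")
    request.set_method("POST")
    response = json.loads(client.do_action_with_exception(request))
    for fs in response.get("FileSystems", {}).get("FileSystem", []):
        print("file system:", fs["FileSystemId"], fs.get("FileSystemType"))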

Create a dataset based on data that is stored in an Alibaba Cloud storage service

  1. Go to the Dataset management page.

    1. Log on to the PAI console.

    2. In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace that you want to manage.

    3. In the left-side navigation pane, choose AI Computing Asset Management > Datasets.

  2. On the Dataset management page, click Create dataset.

  3. In the Create dataset panel, select From Alibaba Cloud in the Create Dataset section and configure the required parameters as shown in the following figure. For other parameters, you can follow the on-screen instructions and complete configuration.

    You can set the Select data store parameter to Alibaba Cloud Object Storage Service (OSS), General-purpose NAS file system, Extreme NAS file system, Cloud Parallel File System (CPFS), or CPFS for Lingjun. The following tables describe the parameters that you need to configure for each storage service.

    Alibaba Cloud Object Storage Service (OSS)

    The following table describes the parameters when you set Select data store to Alibaba Cloud Object Storage Service (OSS).

    Parameter

    Description

    Property

    Select the type of the task. Valid values:

    • Document: Select a file. If the dataset that you want to create is used in iTAG, we recommend that you select Document. The path of the generated dataset is the same as the path of the selected file.

    • Folder: Select a folder path. If the dataset that you want to create is used for jobs related to Data Science Workshop (DSW), Deep Learning Containers (DLC), or Elastic Algorithm Service (EAS), we recommend that you select Folder. The folder can be mounted to a container.

    Visible range

    The visibility scope of the dataset. Valid values:

    • Only visible to oneself: The dataset is visible to only you and the administrator of the current workspace.

    • Publicly visible in the workspace: The dataset is visible to all users of the current workspace.

    Dataset type

    The type of data that you register. Valid values:

    • Picture

    • Text

    • Audio

    • Video

    • General

    The Dataset type parameter is optional. Default value: General. When you select a specific dataset type, the system displays datasets of the specified type in subsequent labeling scenarios.

    Create a dataset that is stored in Alibaba Cloud storage

    Click the folder icon to select the OSS path of the file. In the Select OSS file dialog box, you can select an existing file in the OSS path or perform the following steps to upload a file from your on-premises machine.

    Note

    If no OSS buckets are available in the current region, click Create Bucket to create an OSS bucket.

    • The region of the bucket must be the same as that of PAI.

    • You cannot change the region of a bucket after it is created.

    1. In the Select OSS file dialog box, click Upload file.

    2. Click View local files and select a file that you want to upload from your on-premises machine, or directly drag the file to the blank area.

    Default Mount path

    You can use the default mount path in DLC and DSW.

    • When you create an instance in DSW, you can mount the file system that you create to the default mount path.

    • When you run code in DLC, the system searches for files in the default mount path. Example: python /root/data/file.py. For a minimal sketch of reading data from the default mount path, see the example after these steps.

    Enable Dataset Acceleration

    This parameter is available if you set Property to Folder. You must also configure the relevant parameters to enable the dataset acceleration feature.

    Parameters:

    • Dataset Accelerator: Select an existing dataset accelerator instance.

    • Maximum Capacity: Specify the capacity of the slot. The slot capacity must be greater than or equal to the dataset capacity.

    • Accelerated Mount Target: An internal mount target is used by default. You can use an existing mount target or create a mount target.

      Note

      If you use Lingjun intelligent computing resources, you must set the Mount Target Type of the Accelerated Mount Target to VPC. The VPC and vSwitch must be the same as those of the Lingjun resources that you use.

    • Accelerated Dataset Default Mount Path: Specify the default mount path of the data.

    Apsara File Storage NAS (NAS)/Cloud Parallel File System (CPFS)

    The following table describes the parameters when you set Select data store to a NAS file system or Cloud Parallel File System (CPFS).

    Note

    You can mount only general-purpose NAS datasets for EAS jobs.

    Parameter

    Description

    Visible range

    The visibility scope of the dataset. Valid values:

    • Only visible to oneself: The dataset is visible to only you and the administrator of the current workspace.

    • Publicly visible in the workspace: The dataset is visible to all users of the current workspace.

    Dataset type

    The type of data that you register. Valid values:

    • Picture

    • Text

    • Audio

    • Video

    • General

    The Dataset type parameter is optional. Default value: General. When you select a specific dataset type, the system displays datasets of the specified type in subsequent labeling scenarios.

    Select File System

    You can follow the on-screen instructions to select one of the following types of NAS file systems in the current region.

    Note
    • You can mount only general-purpose NAS datasets for EAS jobs.

    • You can create CPFS for Lingjun datasets only in the China (Ulanqab) region.

    • You can mount NAS file systems that have encrypted transmission configured for DLC and DSW jobs.

    • General-purpose NAS

    • Extreme NAS

    • CPFS

    • CPFS for Lingjun

    Mount Target

    Specify the mount target to access the NAS file system.

    File System Path

    Specify an existing path in the NAS file system. Example: /.

    Default Mount path

    You can use the default mount path in DLC and DSW.

    • When you create an instance in DSW, you can mount the file system that you create to the default mount path.

    • When you run code in DLC, the system searches for files in the default mount path. Example: python /root/data/file.py.

    Enable Dataset Acceleration

    This parameter is available if you set Select File System to CPFS. You must also configure the relevant parameters to enable the dataset acceleration feature.

    Parameters:

    • Dataset Accelerator: Select an existing dataset accelerator instance.

    • Maximum Capacity: Specify the capacity of the slot. The slot capacity must be greater than or equal to the dataset capacity.

    • Accelerated Mount Target: An internal mount target is used by default. You can use an existing mount target or create a mount target.

      Note

      If you use Lingjun intelligent computing resources, you must set the Mount Target Type of the Accelerated Mount Target to VPC. The VPC and vSwitch must be the same as those of the Lingjun resources that you use.

    • Accelerated Dataset Default Mount Path: Specify the default mount path of the data.

  4. Click Submit.
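
After the dataset is mounted, code that runs in a DSW instance or a DLC job reads the data through the default mount path as if it were a local directory. The following is a minimal sketch, assuming the default mount path /root/data used in the examples above; the file name train.csv is hypothetical and used only for illustration.

    # Minimal sketch: read a mounted dataset inside a DSW instance or DLC job.
    # /root/data is the default mount path from the examples above; train.csv is
    # a hypothetical file name used only for illustration.
    import os

    MOUNT_PATH = "/root/data"

    # The mounted OSS or NAS dataset appears as a regular local directory.
    for name in sorted(os.listdir(MOUNT_PATH)):
        print(name)

    # Read a file from the dataset in the same way as a local file.
    with open(os.path.join(MOUNT_PATH, "train.csv")) as f:
        header = f.readline().strip()
        print("columns:", header)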

Create a dataset by scanning a folder

Select an OSS folder in the current region. Then, the system scans the files in the specified folder and generates an index file whose extension is .manifest. You can use the index file in data labeling scenarios. Perform the following steps to create a dataset by scanning a folder.

  1. Go to the Dataset management page.

    1. Log on to the PAI console.

    2. In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace that you want to manage.

    3. In the left-side navigation pane, choose AI Computing Asset Management > Datasets.

  2. On the Dataset management page, click Create dataset.

  3. In the Create dataset panel, configure the parameters.

    Parameter

    Description

    Create Dataset

    Select Scan folders to create datasets.

    Name

    The name of the custom dataset.

    Visible range

    The visibility scope of the dataset. Valid values:

    • Only visible to oneself: The dataset is visible to only you and the administrator of the current workspace.

    • Publicly visible in the workspace: The dataset is visible to all users of the current workspace.

    Dataset type

    The type of data that you register. Valid values:

    • Picture

    • Text

    • Audio

    • Video

    • General

    The Dataset type parameter is optional. Default value: General. When you select a specific dataset type, the system displays datasets of the specified type in subsequent labeling scenarios.

    Scan folder path

    Select an OSS folder in the current region. If no OSS buckets are available in the current region, click Create Bucket to create an OSS bucket.

    Note
    • The region of the bucket must be the same as that of PAI.

    • You cannot change the region of a bucket after it is created.

    Path wildcard

    Specify a wildcard based on your business requirements.

    • If you want to scan all files in the specified OSS folder, set Path wildcard to *.

    • If you want to scan only JPG files in the specified OSS folder, set Path wildcard to *.jpg.

    • If you want to scan only WAV files in the specified OSS folder, set Path wildcard to */*.wav.

    Note

    The system can scan up to 100,000 files in an OSS folder.

    Preview

    After you click Start scanning, the system scans the files in the specified OSS folder based on the specified wildcard, writes the matched files to an index file whose extension is .manifest, and displays a preview of the result. For a local approximation of this scan, see the sketch after these steps.

    Save path of scan result file

    The OSS path where the dataset_xxx.manifest index file that is generated by the system is stored. You can change the name of the index file.

  4. Click Submit.
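
If you want to preview locally which objects a given wildcard would match before you run the scan in the console, you can approximate the scan with the OSS Python SDK (oss2) and the fnmatch module from the Python standard library. This is only an illustrative sketch: the bucket name, folder prefix, and wildcard are placeholders, and the JSON-lines layout that is written here is a simplified stand-in for the generated .manifest file, not necessarily its exact schema.

    # Illustrative sketch: approximate the folder scan locally by listing OSS
    # objects under a folder and filtering them with a wildcard. The bucket name,
    # folder prefix, and wildcard are placeholders; the JSON-lines output is a
    # simplified stand-in for the generated .manifest file, not its exact schema.
    import fnmatch
    import json
    import os

    import oss2

    auth = oss2.Auth(os.environ["OSS_ACCESS_KEY_ID"], os.environ["OSS_ACCESS_KEY_SECRET"])
    bucket = oss2.Bucket(auth, "https://oss-cn-hangzhou.aliyuncs.com", "mybucket")

    folder = "datasets/images/"   # the OSS folder to scan (placeholder)
    wildcard = "*.jpg"            # same meaning as the Path wildcard parameter

    # List objects under the folder and keep the ones that match the wildcard.
    matched = [
        obj.key
        for obj in oss2.ObjectIterator(bucket, prefix=folder)
        if fnmatch.fnmatch(obj.key[len(folder):], wildcard)
    ]
    print(f"{len(matched)} objects match {wildcard!r}")

    # Write a simple JSON-lines index of the matched objects for inspection.
    with open("scan_preview.manifest", "w") as f:
        for key in matched:
            f.write(json.dumps({"source": f"oss://mybucket/{key}"}) + "\n")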

Create a dataset by registering a public dataset

  1. Go to the Dataset management page.

    1. Log on to the PAI console.

    2. In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace that you want to manage.

    3. In the left-side navigation pane, choose AI Computing Asset Management > Datasets.

  2. On the Dataset management page, click Create dataset.

  3. In the Create dataset panel, set Create Dataset to Public dataset.

  4. Select a public dataset and click Submit.

    The public datasets available in PAI are open source datasets provided by Alibaba Cloud. The public datasets are stored in the public storage of Alibaba Cloud. You can register the public datasets without the need to create replicas in your storage. After you select a public dataset, the system automatically obtains the OSS folder that stores the public dataset.

Manage datasets

You can view all datasets that you have permissions to manage on the Datasets page of the PAI console and perform operations on the datasets. For example, you can view the details of a dataset or delete a dataset.

  • You can find the dataset that you want to manage and click View datasets to go to the OSS path of the dataset and view the dataset details. You can also click Delete to delete the dataset.

    Note

    If you click View datasets and the system prompts you that you do not have permission to access OSS, log on to the console by using your Alibaba Cloud account and grant the AliyunOSSFullAccess permission to your RAM user. For more information, see Step 2: Grant permissions to the RAM user.

  • For a dataset whose visibility scope is Only visible to oneself, you can click Public data set to make the dataset visible to all users in the workspace.

    Important

    After you set the visibility scope of a dataset to Publicly visible in the workspace, you can no longer set the visibility scope of the dataset to Only visible to oneself. Proceed with caution.

  • You can add labels to datasets and then filter datasets based on label keys or label values.

  • You can click the column filter icon in the upper-right part of the Dataset management page to specify the columns that you want to display in the dataset list.