High-quality datasets are essential for training high-precision models, and producing them is the core goal of data preparation. The dataset manager module lets you register datasets from source data in Object Storage Service (OSS) buckets or from on-premises CSV or manifest files. You can then manage all your data in Machine Learning Platform for AI (PAI) in one place and prepare it for data labeling or model training. This topic describes how to register and export datasets.

Background information

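You can use one of the following methods to register a dataset:
    • Create a dataset from source data that is stored in OSS. For more information, see Register a dataset by creating a dataset.
    • Import an on-premises CSV or manifest file. For more information, see Register a dataset by importing a dataset file.

In the dataset manager module, all registered datasets are managed by using manifest files. You can export registered datasets as manifest files for business purposes. For more information, see Export a dataset.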

Register a dataset by creating a dataset

If your source data, such as image, text, video, and audio files, is stored in OSS buckets, you can create a dataset in the PAI console. The system scans all files of the specified type in the specified OSS folder and generates a manifest file in the specified OSS path.
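The scan described above can be pictured with a small sketch. It is not the PAI implementation: it assumes a plain Python list of object keys instead of a real OSS listing, and the function name `build_manifest_lines`, the bucket name, and the default field name `picUrl` (borrowed from the sample manifest in this topic) are illustrative choices.

```python
import json
from pathlib import PurePosixPath

# Extensions listed for the Image data type in this topic.
IMAGE_EXTENSIONS = {".jpeg", ".jpg", ".png", ".webp"}


def build_manifest_lines(bucket, keys, extensions=IMAGE_EXTENSIONS, field="picUrl"):
    """Return one JSON line per object key whose extension is in `extensions`.

    `bucket` and `keys` are hypothetical inputs; a real scan would list the
    objects under the OSS folder instead of receiving them as a list.
    """
    lines = []
    for key in sorted(keys):
        # Compare extensions case-insensitively so "A.PNG" is also matched.
        if PurePosixPath(key).suffix.lower() in extensions:
            record = {"data": {field: f"oss://{bucket}/{key}"}}
            lines.append(json.dumps(record, separators=(",", ":")))
    return lines


if __name__ == "__main__":
    sample_keys = ["pics/fruit/apple-1.jpg", "pics/fruit/readme.txt"]
    for line in build_manifest_lines("example-bucket", sample_keys):
        print(line)
```

Files whose extension does not match the selected data type, such as `readme.txt` in the sample, are skipped.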

  1. Go to the Dataset Manager page.
    1. Log on to the Machine Learning Platform for AI (PAI) console.
    2. In the left-side navigation pane, choose AI Computing Asset Management > Dataset Manager.
  2. On the Dataset Manager page, click Register Dataset.
  3. On the Register Dataset page, set the parameters.
    Parameter: Description
    Dataset Name: The name must be 1 to 30 characters in length and can contain underscores (_) and hyphens (-). It must start with a letter or a digit.
    Method: Set the Method parameter to New Dataset.
    Data Type: The data type. Valid values:
      • Image: JPEG, JPG, PNG, and WebP formats are supported.
      • Text: CSV and TXT formats are supported. Data entries in the dataset are separated by line breaks.
      • Video: MP4, OGG, and WebM formats are supported.
    Storage Type: Only OSS is supported. You cannot change the value. If the current account is not authorized to access OSS, you can click Authorize below the field to authorize the current account to access OSS.
    Path: Set the Path parameter to the OSS folder where your source data is stored.
    Tags: You can add one or more tags to each dataset to help search for or classify datasets. Each tag can contain underscores (_) and hyphens (-). It must start with a letter or a digit.
  4. Click Submit. The system generates a manifest file. The following example shows the content of a manifest file.
    {"data":{"picUrl":"oss://****/pics/fruit/apple-1.jpg"}}
    {"data":{"picUrl":"oss://****/pics/fruit/apple-10.jpg"}}
    {"data":{"picUrl":"oss://****/pics/fruit/apple-11.jpg"}}
    ...
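Each line of the sample manifest is a standalone JSON object, so the file can be read line by line. The following sketch assumes that layout and the `data` / `picUrl` field names from the sample; the function name `read_manifest` is illustrative and is not part of PAI.

```python
import json


def read_manifest(text):
    """Parse manifest text in which every non-empty line is one JSON object."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # Ignore blank lines between records.
        records.append(json.loads(line))
    return records


if __name__ == "__main__":
    # Masked paths copied from the sample manifest in this topic.
    sample = "\n".join([
        '{"data":{"picUrl":"oss://****/pics/fruit/apple-1.jpg"}}',
        '{"data":{"picUrl":"oss://****/pics/fruit/apple-10.jpg"}}',
    ])
    for record in read_manifest(sample):
        print(record["data"]["picUrl"])
```

A manifest generated for text or video data may use different field names, so check the keys of the first record before relying on `picUrl`.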

Register a dataset by importing a dataset file

If you have an on-premises CSV file or manifest file, you can register a dataset by importing the dataset file. If you import a CSV file, the system automatically converts it to a manifest file.
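This topic does not describe the exact conversion that PAI applies to a CSV file. As a rough illustration only, the sketch below assumes that the CSV file has a header row and that each data row becomes one JSON line whose `data` object maps column names to values, which mirrors the sample manifest shown earlier. The function name `csv_to_manifest` and the sample columns are hypothetical.

```python
import csv
import io
import json


def csv_to_manifest(csv_text):
    """Convert CSV text with a header row into a list of manifest-style JSON lines.

    Assumption: one output line per data row, wrapped in a "data" object.
    The real conversion in PAI may differ.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        json.dumps({"data": dict(row)}, separators=(",", ":"))
        for row in reader
    ]


if __name__ == "__main__":
    sample_csv = "picUrl,label\noss://example-bucket/pics/apple-1.jpg,apple\n"
    for line in csv_to_manifest(sample_csv):
        print(line)
```

If the file is intended for a labeling job, the column names in the header row matter, because the field names must match the data schema of the labeling template.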

  1. Go to the Dataset Manager page.
    1. Log on to the Machine Learning Platform for AI (PAI) console.
    2. In the left-side navigation pane, choose AI Computing Asset Management > Dataset Manager.
  2. On the Dataset Manager page, click Register Dataset.
  3. On the Register Dataset page, set the parameters.
    Parameter: Description
    Dataset Name: The name must be 1 to 30 characters in length and can contain underscores (_) and hyphens (-). It must start with a letter or a digit.
    Method: Set the Method parameter to Import Dataset.
    Data Type: The data type. Valid values:
      • Image: JPEG, JPG, PNG, and WebP formats are supported.
      • Text: CSV and TXT formats are supported. Data entries in the dataset are separated by line breaks.
      • Video: MP4, OGG, and WebM formats are supported.
    Storage Type: Only OSS is supported. You cannot change the value. If the current account is not authorized to access OSS, you can click Authorize below the field to authorize the current account to access OSS.
    Upload: Drag an on-premises CSV or manifest file to the Upload field.
      Note: If the imported file is used for a labeling job, the names of the fields in the file must comply with the data schema of the template that is used to create the labeling job. For more information, see Labeling templates for images.
    Path: The OSS path to which the file is uploaded.
    Tags: You can add one or more tags to each dataset to help search for or classify datasets. Each tag can contain underscores (_) and hyphens (-). It must start with a letter or a digit.
  4. Click Submit.
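Before you fill in the Dataset Name parameter, you can check a candidate name locally. The sketch below encodes the stated rule (1 to 30 characters, starts with a letter or a digit, may contain underscores and hyphens). It assumes that only ASCII letters and digits are allowed besides those two symbols, which this topic does not confirm, and the function name `is_valid_dataset_name` is illustrative.

```python
import re

# First character: letter or digit. Up to 29 more: letters, digits, "_" or "-".
# Assumption: only ASCII letters and digits are accepted.
_DATASET_NAME = re.compile(r"[A-Za-z0-9][A-Za-z0-9_-]{0,29}")


def is_valid_dataset_name(name):
    """Return True if `name` satisfies the naming rule described in this topic."""
    return _DATASET_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["fruit_images-1", "_fruit", "a" * 31]:
        print(candidate, is_valid_dataset_name(candidate))
```

The console remains the authority on which names are accepted; the check only helps you catch obvious violations early.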

Export a dataset

PAI allows you to export registered datasets as manifest files to your computer for business purposes.

  1. Go to the Dataset Manager page.
    1. Log on to the Machine Learning Platform for AI (PAI) console.
    2. In the left-side navigation pane, choose AI Computing Asset Management > Dataset Manager.
  2. On the Dataset Manager page, find the dataset that you want to export and click Export Dataset in the Actions column. The dataset is exported to your computer as a manifest file.