High-quality datasets are essential to high-precision models, and producing them is the
core goal of data preparation. You can use the dataset manager module to register datasets
from source data in Object Storage Service (OSS) buckets or from on-premises CSV or
manifest files. Registered datasets are then managed in one place in
Machine Learning Platform for AI (PAI), ready for data labeling or model training.
This topic describes how to register and export datasets.
Background information
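You can use one of the following methods to register a dataset:
- Create a dataset: if your source data is stored in OSS buckets, the system scans the specified OSS folder and generates a manifest file. For more information, see Register a dataset by creating a dataset.
- Import a dataset file: if you have an on-premises CSV or manifest file, upload the file to register the dataset. For more information, see Register a dataset by importing a dataset file.
In the dataset manager module, all registered datasets are managed by using
manifest files. You can export registered datasets as
manifest files for business purposes. For more information about how to export a dataset,
see Export a dataset.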
Register a dataset by creating a dataset
If your source data, such as image, text, video, and audio files, is stored in OSS
buckets, you can create a dataset in the PAI console. The system scans all files of the specified type in the specified
OSS folder and generates a manifest file in the specified OSS path.
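The scan-and-generate behavior described above can be pictured with a minimal sketch. It is not the PAI implementation: it assumes the object keys have already been listed, handles only the Image data type, and uses a hypothetical bucket name. The `picUrl` field name follows the manifest example shown later in this section.

```python
import json
from pathlib import PurePosixPath

# Extensions taken from the Image data type described in this topic.
IMAGE_EXTENSIONS = {".jpeg", ".jpg", ".png", ".webp"}


def build_image_manifest(bucket: str, keys: list[str]) -> list[str]:
    """Return one JSON line per image object, mirroring the picUrl layout."""
    lines = []
    for key in sorted(keys):
        # Keep only objects whose extension matches the selected data type.
        if PurePosixPath(key).suffix.lower() in IMAGE_EXTENSIONS:
            record = {"data": {"picUrl": f"oss://{bucket}/{key}"}}
            lines.append(json.dumps(record, separators=(",", ":")))
    return lines


# Hypothetical bucket name and object keys, used only for illustration.
example_keys = [
    "pics/fruit/apple-1.jpg",
    "pics/fruit/notes.txt",
    "pics/fruit/apple-10.JPG",
]
manifest_lines = build_image_manifest("example-bucket", example_keys)
print("\n".join(manifest_lines))
```

In this sketch the text file is skipped because its extension is not in the image list, so only the two image objects produce manifest lines.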
- Go to the Dataset Manager page.
  - Log on to the Machine Learning Platform for AI (PAI) console.
  - In the left-side navigation pane, choose .
- On the Dataset Manager page, click Register Dataset.
- On the Register Dataset page, set the parameters.
Parameter | Description
Dataset Name | The name must be 1 to 30 characters in length and can contain letters, digits, underscores (_), and hyphens (-). It must start with a letter or a digit.
Method | Set the Method parameter to New Dataset.
Data Type | The data type. Valid values: Image (JPEG, JPG, PNG, and WebP formats are supported), Text (CSV and TXT formats are supported; data entries in the dataset are separated by line breaks), and Video (MP4, OGG, and WebM formats are supported).
Storage Type | Only OSS is supported. You cannot change the value. If the current account is not authorized to access OSS, click Authorize below the field to grant access.
Path | Set the Path parameter to the OSS folder where your source data is stored.
Tags | You can add one or more tags to each dataset to help search for or classify datasets. Each tag can contain letters, digits, underscores (_), and hyphens (-). It must start with a letter or a digit.
- Click Submit. The system generates a manifest file in which each line is a JSON object that describes one data entry. The following code provides an example of the
manifest file.
{"data":{"picUrl":"oss://****/pics/fruit/apple-1.jpg"}}
{"data":{"picUrl":"oss://****/pics/fruit/apple-10.jpg"}}
{"data":{"picUrl":"oss://****/pics/fruit/apple-11.jpg"}}
...
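The naming rule for the Dataset Name parameter can be checked locally before you submit the form. The following is a minimal sketch, not a PAI API. It assumes that "letter" and "digit" mean ASCII letters and digits, and the function name is hypothetical.

```python
import re

# Assumption: only ASCII letters and digits count as "letter" and "digit".
# First character: letter or digit. Up to 29 more characters: letters,
# digits, underscores, or hyphens, for a total length of 1 to 30.
DATASET_NAME_PATTERN = re.compile(r"[A-Za-z0-9][A-Za-z0-9_-]{0,29}")


def is_valid_dataset_name(name: str) -> bool:
    """Return True if the name satisfies the rule described in this topic."""
    return DATASET_NAME_PATTERN.fullmatch(name) is not None


print(is_valid_dataset_name("fruit_images-01"))  # starts with a letter
print(is_valid_dataset_name("_fruit"))           # starts with an underscore
```

The first call returns True and the second returns False, because a name cannot start with an underscore.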
Register a dataset by importing a dataset file
If you have an on-premises CSV file or manifest file, you can register a dataset by importing the dataset file. If you import a CSV file, the system automatically converts it to a manifest file.
- Go to the Dataset Manager page.
  - Log on to the Machine Learning Platform for AI (PAI) console.
  - In the left-side navigation pane, choose .
- On the Dataset Manager page, click Register Dataset.
- On the Register Dataset page, set the parameters.
Parameter | Description
Dataset Name | The name must be 1 to 30 characters in length and can contain letters, digits, underscores (_), and hyphens (-). It must start with a letter or a digit.
Method | Set the Method parameter to Import Dataset.
Data Type | The data type. Valid values: Image (JPEG, JPG, PNG, and WebP formats are supported), Text (CSV and TXT formats are supported; data entries in the dataset are separated by line breaks), and Video (MP4, OGG, and WebM formats are supported).
Storage Type | Only OSS is supported. You cannot change the value. If the current account is not authorized to access OSS, click Authorize below the field to grant access.
Upload | Drag an on-premises CSV or manifest file to the Upload field. Note: If the imported file is used for a labeling job, the names of the fields in the file must comply with the data schema of the template that is used to create the labeling job. For more information, see Labeling templates for images.
Path | The OSS path to which the file is uploaded.
Tags | You can add one or more tags to each dataset to help search for or classify datasets. Each tag can contain letters, digits, underscores (_), and hyphens (-). It must start with a letter or a digit.
- Click Submit.
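This topic does not specify how PAI converts an imported CSV file to a manifest file. As an illustration only, the sketch below assumes that the CSV header row holds the field names and that each row becomes one JSON line under a `data` key, matching the manifest example in the previous section. The function name and the sample CSV content are hypothetical.

```python
import csv
import io
import json


def csv_to_manifest_lines(csv_text: str) -> list[str]:
    """Turn each CSV row into one manifest-style JSON line.

    Assumption: the first CSV row is a header whose column names become
    the field names inside "data". The real conversion may differ.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [json.dumps({"data": row}, separators=(",", ":")) for row in reader]


# Hypothetical CSV content with a single picUrl column.
sample_csv = "picUrl\noss://example-bucket/pics/fruit/apple-1.jpg\n"
for line in csv_to_manifest_lines(sample_csv):
    print(line)
```

Because the header names are carried over unchanged, a CSV file that is intended for a labeling job would need column names that match the data schema of the labeling template.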
Export a dataset
PAI allows you to export registered datasets as manifest files to your computer for business purposes.
- Go to the Dataset Manager page.
  - Log on to the Machine Learning Platform for AI (PAI) console.
  - In the left-side navigation pane, choose .
- On the Dataset Manager page, find the dataset that you want to export and click Export Dataset in the Actions column. The dataset is exported to your computer as a manifest file.
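After the export, the manifest file can be processed with any JSON library, because each line is a standalone JSON object. The following minimal sketch assumes an image dataset whose entries use the `picUrl` field shown in the manifest example of this topic. The bucket name and function name are hypothetical.

```python
import json


def read_manifest(text: str) -> list[dict]:
    """Parse manifest content: one JSON object per non-empty line."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if line:  # skip blank lines
            records.append(json.loads(line))
    return records


# Hypothetical manifest content, used only for illustration.
sample_manifest = (
    '{"data":{"picUrl":"oss://example-bucket/pics/fruit/apple-1.jpg"}}\n'
    "\n"
    '{"data":{"picUrl":"oss://example-bucket/pics/fruit/apple-10.jpg"}}\n'
)
pic_urls = [record["data"]["picUrl"] for record in read_manifest(sample_manifest)]
print(pic_urls)
```

To read a real exported file, pass its content to `read_manifest`, for example the result of `open(path, encoding="utf-8").read()`.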