Platform for AI: Create a dataset for a labeling job

Last Updated: Mar 27, 2024

When you create a labeling job, you must select a dataset. This topic describes how to create a dataset for a labeling job and the format requirements for the dataset.

Background information

Before you create a labeling job in iTAG, you must create a dataset from the files that you want to label. iTAG of Platform for AI (PAI) lets you create a labeling job from a common template or a custom template. The data preparation steps and the dataset creation method depend on the template that the labeling job uses.

  • Common templates

    iTAG provides the following types of common templates: image, text, video, and audio. For more information about how to create a dataset for a labeling job that uses a common template and the format requirements for the dataset, see Create a text dataset and Create an image dataset, a video dataset, or an audio dataset.

  • Custom templates

    Custom templates help you label data in a flexible manner. For example, you can label multiple types of samples such as images and text in a labeling job. For more information about how to create a dataset for a labeling job that uses a custom template and the format requirements for the dataset, see Create a custom dataset.

Prerequisites

Object Storage Service (OSS) is activated. For more information, see Get started by using the OSS console.

Create a text dataset

You can create a text dataset by using one of the following two methods. The methods differ in procedure, file name extension, and file format.

Method 1: Use data stored in an Alibaba Cloud storage service

Procedure

  1. Create a .manifest or .txt file on your on-premises machine based on the file format requirements.

  2. Upload the .manifest or .txt file that you create to OSS. For more information, see Simple upload.

  3. Create a dataset based on the data that is stored in an Alibaba Cloud storage service. For more information, see Create a dataset based on data that is stored in an Alibaba Cloud storage service.

Method 2: Upload data from an on-premises machine

Procedure

  1. Create a .csv or .xlsx file on your on-premises machine based on the file format requirements.

  2. Go to the iTAG page.

    1. Log on to the PAI console.

    2. In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace that you want to manage.

    3. In the left-side navigation pane of the page that appears, choose Data Preparation > iTAG.

  3. On the iTAG page, click Go to Task Center or Go to Management Page.

  4. On the page that appears, click the Data Management tab. In the upper-right corner of the Data Management tab, click Create Original Dataset.

  5. In the Create Original Dataset dialog box, configure the parameters.

    • Select Local Upload for Import Data.

    • Select File for Import Format.

    • Configure the OSS Bucket and OSS File Path parameters.

    • Click Upload File and select the .csv or .xlsx file that you create.

  6. Click Create.

File name extension

  • Method 1: a .manifest or .txt file.

  • Method 2: a .csv or .xlsx file.

File format

Method 1: Each line of the .manifest or .txt file is a JSON object in the following format:

{"data":{"source":"text sample 1"}}
{"data":{"source":"text sample 2"}}
{"data":{"source":"text sample 3"}}

In this format, source specifies the sample content that you want to label. Replace the value of source with the text that you want to label.
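As an illustration of the .manifest layout, the following minimal Python sketch turns a list of text samples into one JSON object per line. The function names and the output path are hypothetical and are not part of iTAG; only the {"data":{"source":...}} structure comes from this topic.

```python
import json


def text_manifest_lines(texts):
    """Return one JSON line per text sample in the {"data": {"source": ...}} layout."""
    lines = []
    for text in texts:
        record = {"data": {"source": text}}
        # ensure_ascii=False keeps non-ASCII text readable in the file.
        # separators=(",", ":") produces the compact form used in this topic.
        # json.dumps escapes line breaks, so each sample stays on a single line.
        lines.append(json.dumps(record, ensure_ascii=False, separators=(",", ":")))
    return lines


def write_text_manifest(path, texts):
    """Write the samples to a .manifest (or .txt) file, one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for line in text_manifest_lines(texts):
            f.write(line + "\n")
```

For example, write_text_manifest("textDemo1.manifest", ["text sample 1", "text sample 2"]) would write two lines in the format shown above. You would then upload the file to OSS as described in the procedure.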

Method 2: A column in the .csv or .xlsx file can contain the text content that you want to label or an image URL.
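A .csv file for local upload could be produced along the following lines. This is a sketch under assumptions: the column name text is a placeholder, and this topic does not state whether iTAG expects a header row, so check the result against the demo file before you upload it.

```python
import csv
import io


def text_csv(texts, column_name="text"):
    """Return CSV content with a single column that holds the text to label.

    The header row and its name are assumptions made for this example.
    """
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    writer.writerow([column_name])
    for text in texts:
        # csv.writer quotes values that contain commas, quotes, or line breaks.
        writer.writerow([text])
    return buffer.getvalue()


def write_text_csv(path, texts, column_name="text"):
    """Write the CSV content to a local file for upload."""
    # newline="" stops Python from translating the line endings csv produced.
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(text_csv(texts, column_name))
```

Using the csv module instead of joining strings by hand keeps samples that contain commas or quotation marks in a single cell.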

File demo

  • Method 1: textDemo1.manifest

  • Method 2: textDemo2.csv

Create an image dataset, a video dataset, or an audio dataset

This section describes how to create an image dataset. The procedure for creating a video dataset or an audio dataset is the same as the procedure for creating an image dataset.

You can create an image dataset by using one of the following two methods.

Method 1: Scan a folder

Procedure

  1. Upload the image file that you want to label to an OSS bucket and obtain the path of the OSS bucket. For more information, see Simple upload.

  2. Create a dataset by scanning a folder. A .manifest file is automatically generated. For more information, see Create and manage datasets.

Method 2: Upload data from an on-premises machine

Procedure

  1. Create a folder on your on-premises machine that contains the image files that you want to label.

  2. Go to the iTAG page.

    1. Log on to the PAI console.

    2. In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace that you want to manage.

    3. In the left-side navigation pane of the page that appears, choose Data Preparation > iTAG.

  3. On the iTAG page, click Go to Task Center or Go to Management Page.

  4. On the page that appears, click the Data Management tab. In the upper-right corner of the Data Management tab, click Create Original Dataset. In the Create Original Dataset dialog box, configure the parameters.

    • Select Local Upload for Import Data.

    • Select Folder for Import Format.

    • Configure the OSS Bucket and OSS File Path parameters.

    • Click Upload Folder to upload the folder that you create.

  5. Click Create.

File content

{"data":{"source":"oss://****.oss-cn-hangzhou.aliyuncs.com/iTAG/pic/1.jpg"}}
{"data":{"source":"oss://****.oss-cn-hangzhou.aliyuncs.com/iTAG/pic/10.jpg"}}
{"data":{"source":"oss://****.oss-cn-hangzhou.aliyuncs.com/iTAG/pic/11.jpg"}}

source specifies the sample content that you want to label. Replace the value of source with the OSS path of the image that you want to label.
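To make the file content concrete, the following sketch builds lines in the same layout from a bucket name, an endpoint, and a list of object keys. It is illustrative only: when you scan a folder, the .manifest file is generated automatically. The helper names and the bucket name my-bucket are hypothetical.

```python
import json


def oss_uri(bucket, endpoint, key):
    """Build an oss://<bucket>.<endpoint>/<key> path like the ones shown above."""
    return f"oss://{bucket}.{endpoint}/{key.lstrip('/')}"


def image_manifest_lines(bucket, endpoint, keys):
    """Return one {"data": {"source": ...}} JSON line per object key."""
    return [
        json.dumps({"data": {"source": oss_uri(bucket, endpoint, key)}},
                   separators=(",", ":"))
        for key in keys
    ]


# Hypothetical bucket name; the endpoint matches the region used in the example above.
lines = image_manifest_lines(
    "my-bucket",
    "oss-cn-hangzhou.aliyuncs.com",
    ["iTAG/pic/1.jpg", "iTAG/pic/10.jpg"],
)
```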

File demo

Create a custom dataset

To create a custom dataset, use data stored in an Alibaba Cloud storage service.

Procedure

  1. Create a .manifest or .txt file on your on-premises machine based on the file format requirements.

  2. Upload the .manifest or .txt file that you create to OSS. For more information, see Simple upload.

  3. Create a dataset based on the data that is stored in an Alibaba Cloud storage service. For more information, see Create a dataset based on data that is stored in an Alibaba Cloud storage service.

File name extension

A .manifest or .txt file.

File format

{"data":{"picture_url":"oss://****.oss-cn-hangzhou.aliyuncs.com/iTAG/pic/1.jpg","text":"Jack Ma established Alibaba Group in an apartment in Hangzhou with 18 founders. The first website of Alibaba Group is Alibaba.com, which is an English website that focuses on the global wholesale trade market."}}
{"data":{"picture_url":"oss://****.oss-cn-hangzhou.aliyuncs.com/iTAG/pic/10.jpg","text":"Alibaba Group held the first West Lake Cybersecurity Conference. During the conference, commercial and opinion leaders of the Internet industry came together to discuss major issues of the industry."}}
{"data":{"picture_url":"oss://****.oss-cn-hangzhou.aliyuncs.com/iTAG/pic/11.jpg","text":"Alibaba Group raised USD 82 million from multiple investment agencies. This event became the largest private equity financing in the China Internet industry at that time."}} 

Each row starts with "data" and represents one labeling job. A labeling job can include multiple types of samples. Separate the sample entries, such as picture_url and text, with commas (,).

The following example includes an image sample and a text sample in one labeling job. The storage path of the sample image is oss://****.oss url 01, and the sample text is text sample1.

{"data":{"picture_url":"oss://****.oss url 01","text":"text sample1"}}
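A row like this one can be generated and sanity-checked before upload with a short Python sketch. The function names are hypothetical, and the check only confirms that a line is a JSON object with a non-empty "data" object. It does not reproduce any validation that iTAG itself performs.

```python
import json


def custom_manifest_line(picture_url, text):
    """Return one row that pairs an image path with a text sample."""
    record = {"data": {"picture_url": picture_url, "text": text}}
    return json.dumps(record, ensure_ascii=False, separators=(",", ":"))


def check_manifest_line(line):
    """Return the samples of one row, or raise ValueError if the row is malformed."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        raise ValueError(f"row is not valid JSON: {exc}") from exc
    data = record.get("data") if isinstance(record, dict) else None
    if not isinstance(data, dict) or not data:
        raise ValueError('row must be a JSON object with a non-empty "data" object')
    return data


row = custom_manifest_line("oss://****.oss url 01", "text sample1")
samples = check_manifest_line(row)
```

Running the check on every line of a .manifest file before you upload it to OSS catches a missing quotation mark or brace early.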

File demo

multiModal.manifest

What to do next

After you create a dataset, you can create a labeling job based on the dataset. For more information, see Create a labeling job.