Platform For AI: Audio labeling templates

Last Updated: Feb 27, 2024

iTAG of Machine Learning Platform for AI (PAI) provides labeling templates for audio classification, audio segmentation, and Automatic Speech Recognition (ASR). When you create an audio labeling job, you can select the template that matches your business scenario. This topic describes the scenarios for each audio labeling template and the structure of its input and output data.

Background information

iTAG provides audio labeling templates that support the following features:

Audio classification

Audio classification selects, from a predefined set of labels, one or more labels that match the input audio and adds them to the audio. This template supports single-label and multi-label audio classification.

  • Scenarios

    This labeling template applies to scenarios such as environmental sound classification.

  • Data structures

    • Input data

      Each row in the input .manifest file describes one object to label. Each row must contain the source field.

      {"data":{"source":"oss://examplebucket.oss-cn-hangzhou.aliyuncs.com/iTAG/audio/1.wav"}}
      ...
    • Output data

      Each row in the output .manifest file contains an object and the labeling results for that object. The following example shows the JSON string in a row:

      {
          "data": {
              "source": "oss://itag.oss-cn-hangzhou.aliyuncs.com/examplebucket/6.wav"
          },
          "label-1432993193909231616": {
              "results": [
                  {
                      "questionId": "1", 
                      "data": "Label 1", 
                      "markTitle": "Single-choice", 
                      "type": "survey/value"
                  }
              ]
          }
      }
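
The input and output rows above can be produced and read with a few lines of Python. The following is a minimal sketch, not an official iTAG client: the function names make_input_row and classification_labels are hypothetical, and the code assumes that labeling results always sit under a top-level key that starts with "label-", as in the example row.

```python
import json


def make_input_row(source):
    """Serialize one input manifest row for the given audio path."""
    return json.dumps({"data": {"source": source}})


def classification_labels(row):
    """Return (source, {markTitle: data}) for one output manifest row."""
    record = json.loads(row)
    source = record["data"]["source"]
    answers = {}
    for key, value in record.items():
        # Assumption: result keys look like "label-<id>", as in the sample.
        if not key.startswith("label-"):
            continue
        for result in value.get("results", []):
            answers[result["markTitle"]] = result["data"]
    return source, answers


# The sample row from this topic, written on a single line.
sample_row = json.dumps({
    "data": {"source": "oss://itag.oss-cn-hangzhou.aliyuncs.com/examplebucket/6.wav"},
    "label-1432993193909231616": {
        "results": [
            {
                "questionId": "1",
                "data": "Label 1",
                "markTitle": "Single-choice",
                "type": "survey/value",
            }
        ]
    },
})

source, answers = classification_labels(sample_row)
print(source, answers)
```

Matching on the "label-" prefix rather than on a fixed key keeps the sketch usable across rows whose numeric suffix differs.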

Audio segmentation

Audio segmentation divides an audio file into several clips and labels each clip. You can use the waveform display to decide where to split the audio.

  • Scenarios

    This labeling template applies to scenarios such as dialogue analysis.

  • Data structures

    • Input data

      Each row in the input .manifest file describes one object to label. Each row must contain the source field.

      {"data":{"source":"oss://examplebucket.oss-cn-hangzhou.aliyuncs.com/iTAG/audio/1.wav"}}
      ...
    • Output data

      Each row in the output .manifest file contains an object and the labeling results for that object. The following example shows the JSON string in a row:

      {
          "data": {
              "source": "oss://itag.oss-cn-hangzhou.aliyuncs.com/examplebucket/21.wav"
          }, 
          "label-1435480301706092544": {
              "results": [
                  {
                      "duration": 0, 
                      "objects": [
                          {
                              "result": {
                                  "Audio segmentation result": "Result 1", 
                                  "Single-choice": "Label 1"
                              }, 
                              "color": null, 
                              "id": "wavesurfer_ei0aet9uvp8", 
                              "start": 2.3886218302094817, 
                              "end": 4.635545755237045
                          }, 
                          {
                              "result": {
                                  "Audio segmentation result": "Result 2", 
                                  "Single-choice": "Label 2"
                              }, 
                              "color": null, 
                              "id": "wavesurfer_kl39gnlb2k", 
                              "start": 5.698280044101433, 
                              "end": 7.348048511576626
                          }
                      ], 
                      "empty": false
                  }
              ]
          }
      }
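
To work with the clips in a segmentation row, it can help to flatten the nested objects list into one record per clip. The sketch below is illustrative only: the function name extract_clips is hypothetical, the "label-" key prefix is assumed from the example, and start and end are treated as offsets in seconds, which this topic does not state explicitly.

```python
import json


def extract_clips(row):
    """Flatten one segmentation output row into a list of clip records."""
    record = json.loads(row)
    clips = []
    for key, value in record.items():
        # Assumption: result keys look like "label-<id>", as in the sample.
        if not key.startswith("label-"):
            continue
        for result in value.get("results", []):
            for obj in result.get("objects", []):
                clips.append({
                    "id": obj["id"],
                    "start": obj["start"],
                    "end": obj["end"],
                    # Assumption: start and end share one time unit.
                    "length": obj["end"] - obj["start"],
                    "labels": obj["result"],
                })
    # Order clips by where they begin in the audio.
    return sorted(clips, key=lambda clip: clip["start"])


# A shortened copy of the sample row from this topic.
sample_row = json.dumps({
    "data": {"source": "oss://itag.oss-cn-hangzhou.aliyuncs.com/examplebucket/21.wav"},
    "label-1435480301706092544": {
        "results": [{
            "duration": 0,
            "objects": [
                {
                    "result": {"Audio segmentation result": "Result 2", "Single-choice": "Label 2"},
                    "color": None,
                    "id": "wavesurfer_kl39gnlb2k",
                    "start": 5.698280044101433,
                    "end": 7.348048511576626,
                },
                {
                    "result": {"Audio segmentation result": "Result 1", "Single-choice": "Label 1"},
                    "color": None,
                    "id": "wavesurfer_ei0aet9uvp8",
                    "start": 2.3886218302094817,
                    "end": 4.635545755237045,
                },
            ],
            "empty": False,
        }]
    },
})

clips = extract_clips(sample_row)
for clip in clips:
    print(clip["id"], clip["start"], clip["end"], clip["labels"]["Single-choice"])
```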

ASR

ASR transcribes the speech in audio into text and labels the text.

  • Scenarios

    This labeling template applies to scenarios such as dialect recognition.

  • Data structures

    • Input data

      Each row in the input .manifest file describes one object to label. Each row must contain the source field.

      {"data":{"source":"oss://examplebucket.oss-cn-hangzhou.aliyuncs.com/iTAG/audio/1.wav"}}
      ...
    • Output data

      Each row in the output .manifest file contains an object and the labeling results for that object. The following example shows the JSON string in a row:

      {
          "data": {
              "source": "oss://itag.oss-cn-hangzhou.aliyuncs.com/examplebucket/14.wav"
          }, 
          "label-1435448359497441280": {
              "results": [
                  {
                      "questionId": "1", 
                      "data": "ASR result", 
                      "markTitle": "ASR result", 
                      "type": "survey/value"
                  }, 
                  {
                      "questionId": "3", 
                      "data": [
                          "Label 1", 
                          "Label 2"
                      ], 
                      "markTitle": "Multiple-choice", 
                      "type": "survey/multivalue"
                  }
              ]
          }
      }
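
An ASR row mixes two kinds of results: a text value (type "survey/value") and a list of choices (type "survey/multivalue"). The sketch below separates them by type. It is a hedged illustration: the function name split_asr_results is hypothetical, the "label-" key prefix is assumed from the example, and only the two type values that appear in this topic are handled.

```python
import json


def split_asr_results(row):
    """Return (text_fields, choice_fields) for one ASR output row.

    text_fields maps markTitle to a string; choice_fields maps
    markTitle to a list of selected labels.
    """
    record = json.loads(row)
    text_fields = {}
    choice_fields = {}
    for key, value in record.items():
        # Assumption: result keys look like "label-<id>", as in the sample.
        if not key.startswith("label-"):
            continue
        for result in value.get("results", []):
            if result["type"] == "survey/multivalue":
                choice_fields[result["markTitle"]] = list(result["data"])
            elif result["type"] == "survey/value":
                text_fields[result["markTitle"]] = result["data"]
            # Other type values are not shown in this topic and are skipped.
    return text_fields, choice_fields


# The sample row from this topic, written on a single line.
sample_row = json.dumps({
    "data": {"source": "oss://itag.oss-cn-hangzhou.aliyuncs.com/examplebucket/14.wav"},
    "label-1435448359497441280": {
        "results": [
            {
                "questionId": "1",
                "data": "ASR result",
                "markTitle": "ASR result",
                "type": "survey/value",
            },
            {
                "questionId": "3",
                "data": ["Label 1", "Label 2"],
                "markTitle": "Multiple-choice",
                "type": "survey/multivalue",
            },
        ]
    },
})

text_fields, choice_fields = split_asr_results(sample_row)
print(text_fields)
print(choice_fields)
```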