
Platform for AI: Use EasyASR for speech classification

Last Updated: Oct 20, 2023

Machine Learning Platform for AI (PAI) provides EasyASR, an enhanced algorithm framework for speech intelligence that supports both model training and prediction. You can use EasyASR to train and apply speech intelligence models for your speech applications. For example, you can train a model for background music detection. This topic describes how to use EasyASR for speech classification in Data Science Workshop (DSW).

Prerequisites

A DSW instance is created and the software version requirements are met. For more information, see Create and manage DSW instances and Limits.

Note

We recommend that you use a GPU-accelerated DSW instance.

Limits

Take note of the following items that are related to software versions (a quick way to check the installed versions is shown after the list):

  • Python 3.6 is supported.

  • TensorFlow 1.12 is supported.

  • PyTorch is not supported.

  • We recommend that you use the tensorflow:1.12PAI-gpu-py36-cu101-ubuntu18.04 image of DSW.
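
To confirm that your DSW instance meets these requirements, you can run the following commands in Terminal. These are generic Python commands, not EasyASR commands:

  python --version
  python -c "import tensorflow as tf; print(tf.__version__)"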

Procedure

To use EasyASR for speech classification in DSW, perform the following steps:

  1. Step 1: Prepare data

    Download the data for model training.

  2. Step 2: Build a dataset and train a model

    Convert the training data to TFRecord files and train a speech classification model.

  3. Step 3: Evaluate and export the model

    After the training is complete, evaluate the classification precision of the model. If you are satisfied with the model, export the model as a SavedModel file that you can use to perform distributed batch predictions.

  4. Step 4: Perform predictions

    Use the exported SavedModel file to perform predictions.

Step 1: Prepare data

In this example, a demo dataset is used to train a speech classification model. In actual scenarios, we recommend that you use your own data to train models.

  1. Go to the development environment of Data Science Workshop (DSW).

    1. Log on to the Machine Learning Platform for AI console.

    2. In the left-side navigation pane, click Workspaces. On the Workspace list page, click the name of the workspace that you want to manage.

    3. In the upper-left corner of the page, select the region where you want to use the service.

    4. In the left-side navigation pane, choose Model Training > Notebook Service (DSW).

    5. Optional: On the Interactive Modeling (DSW) page, enter the name of a DSW instance or a keyword in the search box to search for the DSW instance.

    6. Find the DSW instance and click Launch in the Actions column.

  2. In the development environment of DSW, click Terminal in the Other section to launch Terminal.

  3. Download data.

    1. In the toolbar in the upper-left corner, click the Create folder icon to create a project folder. In this example, the folder is named asr_test.

    2. Run the following commands in Terminal. The cd command is used to go to the folder that you create, and the wget commands are used to download the demo dataset that is used to train a speech classification model.

      cd asr_test
      wget https://pai-audio-open-modelzoo.oss-cn-zhangjiakou.aliyuncs.com/dsw_sample_data/demo_data.tar.gz
      wget https://pai-audio-open-modelzoo.oss-cn-zhangjiakou.aliyuncs.com/dsw_sample_data/sample_asr_cls_data.csv
    3. Run the following commands in Terminal to create a subfolder named data and decompress the demo dataset to the subfolder:

      mkdir data
      tar xvzf demo_data.tar.gz -C data
    4. Download the configuration file that is used for model training.

      PAI provides a time-delay neural network (TDNN)-based configuration file that you can use for model training. To download the configuration file, run the following command in Terminal:

      wget https://pai-audio-open-modelzoo.oss-cn-zhangjiakou.aliyuncs.com/public_model_zoo/audio_cls/w2lplus_cls.py
  4. View the subfolders and files in the project folder asr_test.

    The project folder contains the following subfolders and files:

    • data: the subfolder that stores the speech files that are used for model training. Generally, a speech file for model training is a mono WAV file that is up to 15 seconds in length and has a sampling rate of 16,000 Hz.

    • sample_asr_cls_data.csv: the file that stores the paths and category labels of all WAV files. A quick way to inspect this file is shown after this list.

    • w2lplus_cls.py: the configuration file of the speech classification model to be trained.

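    To inspect the labeling format, you can print the first few rows of the CSV file in Terminal:

      head -n 3 sample_asr_cls_data.csv

    Conceptually, each row pairs the path of a WAV file with its category label. The exact delimiter and column order are defined by the downloaded sample file.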

Step 2: Build a dataset and train a model

  1. Convert the data that you prepare to TFRecord files by using the data conversion feature of EasyASR. To do so, run the following command in the asr_test folder:

    easyasr_create_dataset --input_path='sample_asr_cls_data.csv' --output_prefix='tfrecords/'

    The command contains the following parameters:

    • input_path: the name of the CSV file that specifies the training data. The file contains the paths and category labels of all WAV files to be used for the training.

    • output_prefix: the prefix of the path of the output TFRecord files. In this example, all TFRecord files are exported to the tfrecords folder. You can modify this parameter as required.

      Note

      Do not omit the forward slash (/) at the end of the path.
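
    After the conversion is complete, you can list the generated files in Terminal. Based on the train_data pattern that is used in the next step, you should see files that match train_*.tfrecord:

      ls tfrecords/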

  2. Run the following command in Terminal to train a speech classification model:

    easyasr_train --config_file='w2lplus_cls.py' --log_dir='model_dir' --num_audio_features=80 --label_set='0,1' --train_data='tfrecords/train_*.tfrecord'

    The command contains the following parameters:

    • config_file: the configuration file of the speech classification model to be trained. In this example, the TDNN-based configuration file w2lplus_cls.py is used. You can modify this parameter as required.

    • log_dir: the path of the output model checkpoints. You can modify this parameter as required.

    • num_audio_features: the audio feature dimension of the model. You can modify this parameter as required.

    • label_set: a set of labels used for speech classification. The specified labels must be separated by commas (,). You can modify this parameter as required.

    • train_data: the TFRecord files to be used for the training. The value of this parameter is a file path pattern that can contain wildcard characters, such as tfrecords/train_*.tfrecord. You can modify this parameter as required.
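
    During the training, checkpoints are written to the directory that is specified by log_dir. You can list the directory in Terminal to find the checkpoint ID that you want to reference in the next step. For example, the evaluation command in Step 3 references model.ckpt-100:

      ls model_dir/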

Step 3: Evaluate and export the model

After the training is complete, you can evaluate the classification precision of the model. You can divide your data into a training dataset and an evaluation dataset as needed. In this example, the training data is reused for evaluation. The following section provides an example on how to evaluate and export a model.

  1. Run the following command in Terminal to evaluate the classification precision of the model:

    easyasr_eval --config_file='w2lplus_cls.py' --checkpoint='model_dir/model.ckpt-100' --num_audio_features=80 --label_set='0,1' --eval_data='tfrecords/train_*.tfrecord'

    The command contains the following parameters:

    • config_file: the configuration file of the model. In this example, the TDNN-based configuration file w2lplus_cls.py is used. You can modify this parameter as required.

    • checkpoint: the path of the checkpoints of the model to be evaluated and exported. Multiple checkpoints are saved during the training. You can modify this parameter as required.

    • num_audio_features: the audio feature dimension of the model. You can modify this parameter as required.

    • label_set: a set of labels used for speech classification. The specified labels must be separated by commas (,). You can modify this parameter as required.

    • eval_data: the TFRecord files to be used to evaluate the model. The value format of this parameter is the same as that of the train_data parameter.

  2. Export the trained model as a SavedModel file so that you can use the file to perform distributed batch predictions. To do so, run the following command in Terminal to export the model:

    easyasr_export --config_file='w2lplus_cls.py' --checkpoint='model_dir/model.ckpt-100' --num_audio_features=80 --label_set='0,1'  --cls  --mode='interactive_infer'

    The command contains the following parameters:

    • config_file: the configuration file of the model. In this example, the TDNN-based configuration file w2lplus_cls.py is used. You can modify this parameter as required.

    • num_audio_features: the audio feature dimension of the model. You can modify this parameter as required.

    • label_set: a set of labels used for speech classification. The specified labels must be separated by commas (,). You can modify this parameter as required.

    • cls: specifies that the current model is a speech classification model. You must specify --cls if you want to export a speech classification model.

    • mode: the mode in which the model is to be exported. The current version of EasyASR supports only the interactive_infer mode. You do not need to change the parameter value used in the sample command.

    You can view the exported model in the asr_test folder. The exported SavedModel file is stored in the export_dir subfolder.
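
    If you want to inspect the signatures of the exported SavedModel file, you can use the saved_model_cli tool that is included with TensorFlow. This is a generic TensorFlow utility, not an EasyASR command:

      saved_model_cli show --dir export_dir --all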

Step 4: Perform predictions

You can use the exported SavedModel file to perform predictions. If you use EasyASR in DSW, input and output data are stored in CSV files.

  1. Run the following commands in Terminal to install FFmpeg, which is used for audio decoding:

    sudo apt update
    sudo apt install ffmpeg
    Note

    In this example, Ubuntu is used. If you use another OS, you only need to install FFmpeg on that OS. If FFmpeg is already installed, skip this step.
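
    To verify the installation, you can print the FFmpeg version:

      ffmpeg -version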

  2. Run the following command in the asr_test folder in Terminal to download the sample input file:

    wget https://pai-audio-open-modelzoo.oss-cn-zhangjiakou.aliyuncs.com/dsw_sample_data/input_predict.csv

    Each row in the input file indicates the URL of an audio file.
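
    Because each row is simply the URL of an audio file, the content of the input file resembles the following excerpt. The URLs shown here are placeholders for illustration only; the downloaded sample file contains real audio URLs:

      https://example.com/audio_01.wav
      https://example.com/audio_02.wav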

  3. Run the following command in Terminal to perform predictions on the input file by using the model that you have trained:

    easyasr_predict --input_csv='input_predict.csv' --output_csv='output_predict.csv' --num_features=80 --use_model='cls' --num_audio_features=80 --label_set='0,1' --seg_time_in_seconds=10 --export_dir='export_dir' --num_predict_process=3  --num_preproces=3

    The command contains the following parameters:

    • input_csv: the name of the input file that contains the URLs of audio files. You can modify this parameter as required.

    • output_csv: the name of the output file to be generated for the predictions. You can enter a custom name without the need to create a file with the name in advance.

    • num_features: the acoustic feature dimension of the model.

    • use_model: the type of the model. In this example, this parameter is set to "cls" because the model is a speech classification model.

    • num_audio_features: the audio feature dimension of the model. You can modify this parameter as required.

    • label_set: a set of labels used for speech classification. The specified labels must be separated by commas (,). You can modify this parameter as required.

    • seg_time_in_seconds: the length of the audio segment on which a prediction is performed at a time. You can modify this parameter as required. For example, if you set this parameter to 10, a 15-second audio file is split into a 10-second segment and a 5-second segment, and a prediction is performed on each segment.

    • export_dir: the path of the exported SavedModel file. You can modify this parameter as required.

    • num_predict_process: the number of threads to be used to perform predictions. You can modify this parameter as required.

    • num_preproces: the number of threads to be used to download and preprocess audio files. You can modify this parameter as required.
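
    After the predictions are complete, the results are written to the file that is specified by output_csv. You can inspect the first few rows in Terminal. The exact output columns depend on the EasyASR version:

      head -n 5 output_predict.csv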