
Platform for AI: Use EasyASR for speech recognition

Last Updated: Feb 18, 2024

Platform for AI (PAI) provides EasyASR, which is an enhanced algorithm framework for speech intelligence. EasyASR provides a variety of features for model training and prediction. You can use EasyASR to train and apply speech recognition models for your speech recognition applications. This topic describes how to use EasyASR for speech recognition in Data Science Workshop (DSW).

Prerequisites

A DSW instance is created and the requirements for software versions are met. For more information, see Create and manage DSW instances and Limits.

Note

We recommend that you use a GPU-accelerated DSW instance.

Background information

In this example, the pre-trained wav2letter-small model is used. PAI also provides the pre-trained wav2letter-base, transformer-small, and transformer-base models for automatic speech recognition (ASR). To use a specific pre-trained model, click the corresponding file names in the following table to download the model files and adjust the code provided in this topic as needed.

| Model | Vocabulary | Configuration file | Model file | Description |
| --- | --- | --- | --- | --- |
| wav2letter-small | alphabet4k.txt | w2lplus-small.py | | The wav2letter series is suitable for scenarios in which low precision is acceptable but high inference speed is required. The wav2letter-base model has more parameters than the wav2letter-small model. |
| wav2letter-base | alphabet4k.txt | w2lplus-base.py | | |
| transformer-small | alphabet6k.txt | transformer-jca-small.py | | The transformer series is suitable for scenarios in which low inference speed is acceptable but high precision is required. The transformer-base model has more parameters than the transformer-small model. |
| transformer-base | alphabet6k.txt | transformer-jca-base.py | | |

Limits

Take note of the following items that are related to software versions:

  • Python 3.6 is supported.

  • TensorFlow 1.12 and PAI-TensorFlow V1.15 are supported.

  • PyTorch is not supported, regardless of version.

  • We recommend that you use the tensorflow:1.12-gpu-py36-cu101-ubuntu18.04 or tensorflow:1.15PAI-gpu-py36-cu100-ubuntu18.04 image of DSW.
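As a quick illustration of the version limits above, the following sketch shows a small helper of our own (it is not part of EasyASR) that checks whether a TensorFlow version string falls in one of the supported lines:

```python
# Illustrative helper (not part of EasyASR): check whether a TensorFlow
# version string matches one of the supported lines listed above.
# PAI-TensorFlow V1.15 is treated here as the 1.15 line.

SUPPORTED_TF_LINES = {"1.12", "1.15"}

def tf_version_supported(version):
    """Return True if the major.minor part of `version` is a supported line."""
    major_minor = ".".join(version.split(".")[:2])
    return major_minor in SUPPORTED_TF_LINES

print(tf_version_supported("1.12.0"))  # True
print(tf_version_supported("2.4.1"))   # False
```

You could compare the result of `tensorflow.__version__` against this check before starting a training job.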

Procedure

To use EasyASR for speech recognition in DSW, perform the following steps:

  1. Step 1: Prepare data

    Download the training data for speech recognition.

  2. Step 2: Build a dataset and train the ASR model

    Convert the training data to TFRecord files and train an ASR model.

  3. Step 3: Evaluate and export the ASR model

    After the training is complete, evaluate the recognition precision of the model. If you are satisfied with the model, export the model as a SavedModel file and use the file to perform distributed batch predictions.

  4. Step 4: Perform predictions

    Use the exported SavedModel file to perform predictions.

Step 1: Prepare data

In this example, the wav2letter-small model, a pre-trained ASR model from the EasyASR public model zoo, is fine-tuned on a subset of THCHS-30, a public Chinese speech dataset. We recommend that you use your own data to train models.

  1. Go to the development environment of Data Science Workshop (DSW).

    1. Log on to the PAI console.

    2. In the left-side navigation pane, click Workspaces. On the Workspace list page, click the name of the workspace that you want to manage.

    3. In the upper-left corner of the page, select the region where you want to use the service.

    4. In the left-side navigation pane, choose Model Training > Notebook Service (DSW).

    5. Optional: On the Interactive Modeling (DSW) page, enter the name of a DSW instance or a keyword in the search box to search for the DSW instance.

    6. Find the DSW instance and click Launch in the Actions column.

  2. In the development environment of DSW, click Notebook in the top navigation bar.

  3. Download data.

    1. In the toolbar in the upper-left corner, click the Create Folder icon to create a project folder. In this example, the folder is named asr_test.

    2. In the DSW development environment, click Terminal in the top navigation bar. On the Terminal tab, click Create Terminal.

    3. Run the following commands in Terminal. The cd command goes to the folder that you created, and the wget commands download the demo dataset that is used to train the ASR model.

      cd asr_test
      wget https://pai-audio-open-modelzoo.oss-cn-zhangjiakou.aliyuncs.com/dsw_sample_data/demo_data.tar.gz
      wget https://pai-audio-open-modelzoo.oss-cn-zhangjiakou.aliyuncs.com/dsw_sample_data/sample_asr_data.csv
    4. Run the following commands in Terminal to create a subfolder named data and decompress the demo dataset to the subfolder:

      mkdir data
      tar xvzf demo_data.tar.gz -C data
    5. Download an ASR model.

      Four pre-trained ASR models, wav2letter-small, wav2letter-base, transformer-small, and transformer-base, are provided in the EasyASR public model zoo. The two wav2letter models provide higher inference speed, whereas the two transformer models provide higher precision. In this example, a wav2letter model is used. Run the following commands in Terminal to download the wav2letter-small model:

      mkdir wav2letter-small
      wget -P wav2letter-small https://pai-audio-open-modelzoo.oss-cn-zhangjiakou.aliyuncs.com/public_model_zoo/w2lplus-small/model.ckpt.index
      wget -P wav2letter-small https://pai-audio-open-modelzoo.oss-cn-zhangjiakou.aliyuncs.com/public_model_zoo/w2lplus-small/model.ckpt.meta
      wget -P wav2letter-small https://pai-audio-open-modelzoo.oss-cn-zhangjiakou.aliyuncs.com/public_model_zoo/w2lplus-small/model.ckpt.data-00000-of-00001
      wget https://pai-audio-open-modelzoo.oss-cn-zhangjiakou.aliyuncs.com/public_model_zoo/w2lplus-small/alphabet4k.txt
      wget https://pai-audio-open-modelzoo.oss-cn-zhangjiakou.aliyuncs.com/public_model_zoo/w2lplus-small/w2lplus-small.py
  4. View the subfolders and files in the project folder asr_test.

    The project folder contains the following subfolders and files:

    • data: the subfolder that stores the speech files that are used for model training. Generally, a speech file for model training is a mono audio file in the WAV format with a length of up to 15 seconds and a sampling rate of 16,000 Hz.

    • wav2letter-small: the subfolder that stores the pre-training checkpoints of the model.

    • alphabet4k.txt: the file that stores the 4K Chinese character vocabulary for the model.

    • sample_asr_data.csv: the file that stores the paths and annotations of all WAV files. If you want to use custom data, you must separate characters with spaces and sentences with semicolons (;) in an annotation. Each character must be in the vocabulary. If a character is not in the vocabulary, replace the character with an asterisk (*).

    • w2lplus-small.py: the configuration file of the model.

    You can go to the wav2letter-small folder to view the pre-training checkpoints of the model.
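The audio requirements above can also be checked programmatically. The following sketch is our own helper, using only the Python standard library, and verifies that a WAV file is mono, sampled at 16,000 Hz, and at most 15 seconds long:

```python
# Hedged sketch (not part of EasyASR): verify that a training audio file
# meets the requirements stated above: mono WAV, 16,000 Hz sampling rate,
# length of up to 15 seconds.

import wave

def check_training_wav(path):
    """Return True if the WAV file at `path` is usable for training."""
    with wave.open(path, "rb") as w:
        channels = w.getnchannels()
        rate = w.getframerate()
        duration = w.getnframes() / float(rate)
    return channels == 1 and rate == 16000 and duration <= 15.0
```

You could run this over every file in the data subfolder before building the dataset, and re-encode any file that fails the check.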

Step 2: Build a dataset and train the ASR model

  1. Convert the data that you prepare to TFRecord files by using the data conversion feature of EasyASR. To do so, run the following command in the asr_test folder:

    easyasr_create_dataset --input_path='sample_asr_data.csv' --output_prefix='tfrecords/'

    The command contains the following parameters:

    • input_path: the name of the CSV file that specifies the training data. The file contains the paths and annotations of all WAV files to be used for the training.

    • output_prefix: the prefix of the path of the output TFRecord files. In this example, all TFRecord files are exported in the tfrecords folder. You can modify this parameter as required.

      Important

      Do not omit the forward slash (/) at the end of the path.

  2. Run the following command in Terminal to train the ASR model:

    easyasr_train --config_file='w2lplus-small.py' --log_dir='model_dir' --load_model_ckpt='wav2letter-small/model.ckpt' --vocab_file='alphabet4k.txt' --train_data='tfrecords/train_*.tfrecord'

    The command contains the following parameters:

    • config_file: the configuration file of the model. In this example, the configuration file of the wav2letter-small model, w2lplus-small.py, is used. You can modify this parameter as required.

    • log_dir: the path of the output model checkpoints. You can modify this parameter as required.

    • load_model_ckpt: the pre-training checkpoints of the model. In this example, the pre-training checkpoints of the wav2letter-small model are loaded. If you do not specify this parameter, the model is to be trained from scratch.

    • vocab_file: the Chinese character vocabulary for the model. If you use a pre-trained wav2letter model, set this parameter to alphabet4k.txt and keep the TXT file unchanged. If you use a pre-trained transformer model, set this parameter to alphabet6k.txt and keep the TXT file unchanged.

    • train_data: the TFRecord files to be used for the training. The value of this parameter can be a wildcard pattern that matches multiple files, such as tfrecords/train_*.tfrecord. You can modify this parameter as required.
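The Important note in substep 1 exists because, under the assumption implied there, output_prefix acts as a literal string prefix for the generated file names rather than a directory path. A minimal illustration of why the trailing slash matters:

```python
# Illustrative only (an assumption about how --output_prefix is applied):
# if the prefix is joined to file names by plain string concatenation,
# the trailing slash decides whether output lands inside a folder
# or is written next to it with a mangled name.

def output_path(prefix, name):
    """Join a prefix and a file name by concatenation, adding no separator."""
    return prefix + name

print(output_path("tfrecords/", "train_0.tfrecord"))  # tfrecords/train_0.tfrecord
print(output_path("tfrecords", "train_0.tfrecord"))   # tfrecordstrain_0.tfrecord
```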

Step 3: Evaluate and export the ASR model

After the training is complete, you can evaluate the recognition precision of the model. You can divide a dataset into a training dataset and a prediction dataset as needed. The following section provides an example on how to evaluate and export a model.

  1. Run the following command in Terminal to evaluate the recognition precision of the model:

    easyasr_eval --config_file='w2lplus-small.py' --checkpoint='model_dir/model.ckpt-1000' --vocab_file='alphabet4k.txt' --eval_data='tfrecords/train_*.tfrecord'

    The command contains the following parameters:

    • config_file: the configuration file of the model. In this example, the configuration file of the wav2letter-small model, w2lplus-small.py, is used. You can modify this parameter as required.

    • checkpoint: the path of the checkpoints of the model to be evaluated and exported. Multiple checkpoints are saved during the training. You can modify this parameter as required.

    • vocab_file: the Chinese character vocabulary for the model.

      Important

      You must use the same vocabulary to train and evaluate a model.

    • eval_data: the TFRecord files to be used to evaluate the model. The value format of this parameter is the same as that of the train_data parameter.

  2. Export the trained model as a SavedModel file and use the file to perform distributed batch predictions. To do so, run the following command in Terminal to export the model:

    easyasr_export --config_file='w2lplus-small.py' --checkpoint='model_dir/model.ckpt-1000' --vocab_file='alphabet4k.txt'  --mode='interactive_infer'

    The command contains the following parameters:

    • config_file: the configuration file of the model. In this example, the configuration file of the wav2letter-small model, w2lplus-small.py, is used. You can modify this parameter as required.

    • checkpoint: the path of the checkpoints of the model to be evaluated and exported. Multiple checkpoints are saved during the training. You can modify this parameter as required.

    • vocab_file: the Chinese character vocabulary for the model.

    • mode: the mode in which the model is to be exported. The current version of EasyASR supports only the interactive_infer mode. You do not need to change the parameter value used in the sample command.

    You can view the exported model in the asr_test folder. The exported SavedModel file is stored in the export_dir subfolder.

Step 4: Perform predictions

You can use the exported SavedModel file to perform predictions. If you use EasyASR in DSW, input and output data are stored in CSV files.

  1. Run the following commands in Terminal to install FFmpeg, which is used for audio decoding:

    sudo apt update
    sudo apt install ffmpeg
    Note

    In this example, Ubuntu is used. If you use another OS, you only need to install FFmpeg for that OS. If FFmpeg is already installed, skip this step.

  2. Run the following command in the asr_test folder in Terminal to download the sample input file:

    wget https://pai-audio-open-modelzoo.oss-cn-zhangjiakou.aliyuncs.com/dsw_sample_data/input_predict.csv

    Each row in the input file indicates the URL of an audio file.

  3. Run the following command in Terminal to perform predictions on the input file by using the ASR model that you have trained:

    easyasr_predict --input_csv='input_predict.csv' --output_csv='output_predict.csv' --num_features=64 --use_model='w2l' --vocab_file='alphabet4k.txt' --export_dir='export_dir' --num_predict_process=3  --num_preproces=3

    The command contains the following parameters:

    • input_csv: the name of the input file that contains the URLs of audio files. You can modify this parameter as required.

    • output_csv: the name of the output file to be generated for the predictions. You can enter a custom name without the need to create a file with the name in advance.

    • num_features: the acoustic feature dimension of the model. If you use the pre-trained wav2letter-small or wav2letter-base model, set this parameter to 64. If you use the pre-trained transformer-small or transformer-base model, set this parameter to 80. You can modify this parameter as required.

    • use_model: the type of the model. Valid values:

      • w2l: a wav2letter model.

      • transformer: a transformer model.

      In this example, this parameter is set to w2l because the wav2letter-small model is used to perform predictions.

    • vocab_file: the Chinese character vocabulary for the model.

    • export_dir: the path of the exported SavedModel file. You can modify this parameter as required.

    • num_predict_process: the number of threads to be used to perform predictions. You can modify this parameter as required.

    • num_preproces: the number of threads to be used to download and preprocess audio files. You can modify this parameter as required.
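Because the input and output of this step are plain CSV files, they are easy to create and inspect with standard tooling. The following sketch shows two helpers of our own: one writes an input file in the format described above (one audio URL per row; the URLs shown are placeholders), and one reads an output file back, assuming each row pairs an audio URL with its recognized text. Adjust the reader to the actual columns that easyasr_predict produces.

```python
# Hedged sketch (not part of EasyASR): helpers around the prediction step.
# The URLs are placeholders, and the assumed output layout (URL, text)
# should be checked against the real output_predict.csv.

import csv

def write_input_csv(path, urls):
    """Write one audio URL per row, as expected by --input_csv."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for url in urls:
            writer.writerow([url])

def read_predictions(path):
    """Read all non-empty rows from a prediction output file."""
    with open(path, newline="") as f:
        return [row for row in csv.reader(f) if row]

write_input_csv("input_predict.csv", [
    "https://example.com/audio/clip_001.wav",
    "https://example.com/audio/clip_002.wav",
])
```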