Model Hub of Machine Learning Platform for AI (PAI) provides trained automatic speech recognition (ASR) models. You can deploy the models to make online service calls. This topic describes the input and output formats of these models and provides test examples.

Background information

One of the most important goals of AI is to enable machines to understand human language. To achieve this goal, the first important step is to transcribe human language to text. ASR is an important technology that integrates the disciplines of AI, linguistics, and acoustics. This technology automatically transcribes the audio input of human language to text.

On the basis of ASR, you can perform speech understanding, where AI technologies are used to analyze audio features for a deep understanding of the input speech. PAI allows you to deploy ASR and speech understanding models. It provides the following speech understanding models for online use:
  • General Chinese speech recognition model (Express): Automatically recognizes Chinese speech from the input audio or video in common scenarios.
  • General Chinese speech recognition model (Transformer): An end-to-end Transformer-based Chinese speech recognition model for general use. This model automatically recognizes Chinese speech from the input audio or video and converts it into text.
  • Chinese speech recognition model for E-commerce live streaming (Express): Automatically recognizes Chinese speech from the input audio or video in Chinese E-commerce live streaming scenarios.
  • Chinese speech recognition model for E-commerce live streaming (Transformer): An end-to-end Transformer-based Chinese speech recognition model fine-tuned for E-commerce live streaming scenarios. Compared to the general Chinese speech recognition model, this model is optimized for Chinese E-commerce live streaming scenarios.
  • Chinese speech vectorization model: Recognizes Chinese speech from the input audio or video, uses self-supervised learning to vectorize the speech data, and then exports the vectorized results.
  • English speech vectorization model: Recognizes English speech from the input audio or video, uses self-supervised learning to vectorize the speech data, and then exports the vectorized results.
  • Classification model for Chinese speakers based on speech attributes: Classifies speakers based on the speech attributes that are recognized in Chinese audio or video clips.
  • Chinese speech detection model: Detects whether the input audio or video contains Chinese speech.
  • Background music detection model: Detects whether the input audio or video contains background music.

Go to Model Hub

To go to Model Hub, perform the following steps:
  1. Log on to the Machine Learning Platform for AI console.
  2. In the left-side navigation pane, choose AI Computing Asset Management > ModelHub.

General Chinese speech recognition model (Express)

  • Overview
    The general Chinese speech recognition model (Express) is an end-to-end Wav2Letter model provided by PAI. This model can automatically recognize Chinese speech from the input audio or video in common scenarios. (Figure: structure of the speech recognition model.)
  • Input format
    The input data must be in the JSON format. It contains the url and play_duration fields. The value of the url field is the URL of the input audio or video. The value of the play_duration field specifies the length, in microseconds, of the leading portion of the input audio or video to be processed. If the play_duration field is not specified, the entire audio or video is processed. The following code provides an example of the input data:
    {
      "input": {
        "url": "URL of the input audio or video",
        "play_duration": "Length of the input audio or video to be processed"
      }
    }
  • Output format
    The output data consists of key-value pairs in the JSON format. Each key indicates the start timestamp, in microseconds, of the input audio or video clip that is processed. Each value indicates the output text that is transcribed by the ASR model. The model supports about 4,000 common Chinese characters in the output text. If the result of the model contains a Chinese character that is not included in the supported Chinese character list, the character is replaced by an asterisk (*). Short sentences are separated by semicolons (;). The following code provides an example of the output data:
    {
      "0": "Text 1 obtained after transcription",
      "500000000": "Text 2 obtained after transcription",
      "1000000000": "Text 3 obtained after transcription",
    }
  • Example
    The following code provides an example of the input data of the model:
    {
      "input": {
        "url": "http://pai-vision-data-sh.oss-cn-shanghai-internal.aliyuncs.com/tmp/5000563****.mp4",
        "play_duration": "39000000"
      }
    }
    PAI displays information similar to the following output:
    {
      "0": ";\u5206\u6d3b\u7845\u85fb\u571f\u9020\u8131;\u6709\u5b54\u901f\u5ea6\u5927\u5438\u6536\u6027\u5f3a\u51c0\u5316\u7a7a\u6c14\u7684\u7279\u70b9;\u53ef\u653e\u7f6e\u624b\u5de5\u6d01\u9762\u9020\u5f62\u517d\u9020\u4e0d\u6613\u9020\u7b49;\u80a5\u7682\u653e\u7f6e\u540e\u51e0\u79d2\u5185\u5c31\u80fd\u77ac\u95f4\u5438\u6536",
      "20031996": "\u7531\u4e8e\u80a5\u7682\u4f7f\u7528\u540e\u6709\u53d8\u8f6f\u7684\u7279\u8d28;\u6240\u4ee5\u7845\u85fb\u4e3b\u9020\u79d1\u80fd\u5b8c\u6574\u4fdd\u62a4\u80a5\u7682;\u4e14\u4e0d\u7528\u62c5\u5fc3\u7682\u6c34\u5916\u6d41;\u666e\u901a\u6ca5\u6c34\u9020\u51fa\u5e95\u90e8\u7684\u683c\u81ea\u4f1a\u5bfc\u81f4\u80a5\u7682\u53d8\u5f62\u53d8\u5c0f;\u7682\u6c34\u8fd8\u4f1a\u7559\u7684\u5012\u4f4f\u6ce5"
    }
    The output Unicode data can be decoded into Chinese characters in downstream applications.
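    To call a deployed instance of this model, you can send the input JSON to the service over HTTP. The following Python sketch is a minimal example, assuming a token-authenticated HTTP endpoint such as an EAS online service; the endpoint URL, service token, and input URL are placeholders that you must replace with your own values.
    import json
    import urllib.request

    # Placeholders: replace with the endpoint and token of your deployed service.
    SERVICE_URL = "http://<your-eas-endpoint>/api/predict/<service-name>"
    TOKEN = "<your-service-token>"

    payload = {
        "input": {
            "url": "http://example.com/sample.mp4",  # URL of the input audio or video
            "play_duration": "39000000"              # first 39 seconds, in microseconds
        }
    }

    request = urllib.request.Request(
        SERVICE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": TOKEN, "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        # json.loads decodes the \uXXXX escapes into Chinese characters.
        result = json.loads(response.read().decode("utf-8"))

    for start_us, text in result.items():
        print(f"{int(start_us) / 1_000_000:.1f}s: {text}")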

General Chinese speech recognition model (Transformer)

  • Overview
    The general Chinese speech recognition model (Transformer) is an end-to-end Transformer-based speech recognition model provided by PAI for common scenarios. This model can automatically recognize Chinese speech from the input audio or video and convert it into text. Compared to the Wav2Letter model, this model is slower but more accurate. (Figure: model structure of the general Chinese speech recognition model (Transformer).)
  • Input format
    The input data must be in the JSON format. It contains the url and play_duration fields. The value of the url field is the URL of the input audio or video. The value of the play_duration field specifies the length, in microseconds, of the leading portion of the input audio or video to be processed. If the play_duration field is not specified, the entire audio or video is processed. The following code provides an example of the input data:
    {
      "input": {
        "url": "URL of the input audio or video",
        "play_duration": "Length of the input audio or video clip to be processed"
      }
    }
  • Output format
    The output data consists of key-value pairs in the JSON format. Each key indicates the start timestamp, in microseconds, of the input audio or video clip that is processed. Each value indicates the output text that is transcribed by the ASR model. The model supports about 4,000 common Chinese characters in the output text. If the result of the model contains a Chinese character that is not included in the supported Chinese character list, the character is replaced by an asterisk (*). Short sentences are separated by semicolons (;). The following code provides an example of the output data:
    {
      "0": "Text 1 after transcription",
      "500000000": "Text 2 after transcription",
      "1000000000": "Text 3 after transcription"
    }
  • Example
    The following code provides an example of the input data of the model:
    {
      "input": {
        "url": "http://pai-vision-data-sh.oss-cn-shanghai-internal.aliyuncs.com/tmp/5000563****.mp4",
        "play_duration": "39000000"
      }
    }
    PAI displays information similar to the following output:
    {
      "0": ";\u5206\u6d3b\u7845\u85fb\u571f\u9020\u8131;\u6709\u5b54\u901f\u5ea6\u5927\u5438\u6536\u6027\u5f3a\u51c0\u5316\u7a7a\u6c14\u7684\u7279\u70b9;\u53ef\u653e\u7f6e\u624b\u5de5\u6d01\u9762\u9020\u5f62\u517d\u9020\u4e0d\u6613\u9020\u7b49;\u80a5\u7682\u653e\u7f6e\u540e\u51e0\u79d2\u5185\u5c31\u80fd\u77ac\u95f4\u5438\u6536",
      "20031996": "\u7531\u4e8e\u80a5\u7682\u4f7f\u7528\u540e\u6709\u53d8\u8f6f\u7684\u7279\u8d28;\u6240\u4ee5\u7845\u85fb\u4e3b\u9020\u79d1\u80fd\u5b8c\u6574\u4fdd\u62a4\u80a5\u7682;\u4e14\u4e0d\u7528\u62c5\u5fc3\u7682\u6c34\u5916\u6d41;\u666e\u901a\u6ca5\u6c34\u9020\u51fa\u5e95\u90e8\u7684\u683c\u81ea\u4f1a\u5bfc\u81f4\u80a5\u7682\u53d8\u5f62\u53d8\u5c0f;\u7682\u6c34\u8fd8\u4f1a\u7559\u7684\u5012\u4f4f\u6ce5"
    }
    The output Unicode data can be decoded into Chinese characters in downstream applications.
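    Because play_duration is a string that expresses the length in microseconds, it is easy to get the unit wrong when you build the request. The following Python sketch shows a hypothetical helper, build_asr_request, that converts a duration in seconds into the expected request body; it is an illustration, not part of the service API.
    import json

    def build_asr_request(url, duration_seconds=None):
        """Build the JSON request body for the ASR service.

        duration_seconds is converted to microseconds. If it is omitted,
        play_duration is left out and the entire file is processed.
        """
        payload = {"input": {"url": url}}
        if duration_seconds is not None:
            payload["input"]["play_duration"] = str(int(duration_seconds * 1_000_000))
        return json.dumps(payload)

    # 39 seconds -> "39000000" microseconds
    print(build_asr_request("http://example.com/sample.mp4", 39))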

Chinese speech recognition model for E-commerce live streaming (Express)

  • Overview

    The Chinese speech recognition model for E-commerce live streaming (Express) is an end-to-end Wav2Letter model provided by PAI. This model can automatically recognize Chinese speech from the input audio or video in E-commerce live streaming scenarios. It shares the same structure as the general Chinese speech recognition model (Express) but is optimized for Chinese E-commerce live streaming scenarios.

  • Input format
    The input data must be in the JSON format. It contains the url and play_duration fields. The value of the url field is the URL of the input audio or video. The value of the play_duration field specifies the length, in microseconds, of the leading portion of the input audio or video to be processed. If the play_duration field is not specified, the entire audio or video is processed. The following code provides an example of the input data:
    {
      "input": {
        "url": "URL of the input audio or video",
        "play_duration": "Length of the input audio or video to be processed"
      }
    }
  • Output format
    The output data consists of key-value pairs in the JSON format. Each key indicates the start timestamp, in microseconds, of the input audio or video clip that is processed. Each value indicates the output text that is transcribed by the ASR model. The model supports about 6,000 common Chinese characters in the output text, more than the approximately 4,000 characters supported by the general Chinese speech recognition model (Express). If the result of the model contains a Chinese character that is not included in the supported character list, the character is replaced by an asterisk (*). Short sentences are separated by semicolons (;). The following code provides an example of the output data:
    {
      "0": "Text 1 after transcription",
      "500000000": "Text 2 after transcription",
      "1000000000": "Text 3 after transcription"
    }
  • Example
    The following code provides an example of the input data of the model:
    {
      "input": {
        "url": "https://pai-vision-data-sh.oss-cn-shanghai-internal.aliyuncs.com/chengyu.wcy/tblive_sample/example1.wav",
      }
    }
    PAI displays information similar to the following output:
    {
      "0": "\u800c\u4e14\u8fdb\u4e00\u6b65\u5f3a\u5316\u4e86\u4ea7\u54c1\u7684\u4e00\u4e2a\u4fee\u590d\u7279\u6548;\u4fee\u590d\u529f\u6548\u4f1a\u66f4\u597d;\u800c\u4e14\u5b83\u6bd4\u91d1\u80f6\u7684\u8bdd\u662f\u66f4\u52a0\u6e29\u548c;\u76ae\u80a4\u4e0d\u8010\u53d7\u79ef\u7387\u63a5\u8fd1\u4e3a\u96f6\u4e5f\u5c31\u662f\u8bf4;\u554a\u4eca\u5929\u665a\u4eca\u5929\u51cc\u6668\u53d1\u8d27"
    }
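    Because short sentences in each transcribed value are separated by semicolons, downstream code often splits the value before further processing. A minimal Python sketch, using an illustrative decoded transcription rather than real service output:
    # Illustrative decoded response; keys are start timestamps in microseconds.
    response = {
        "0": "句子一;句子二;句子三"
    }

    for start_us, text in response.items():
        # Split on semicolons and drop empty segments caused by a leading ";".
        sentences = [s for s in text.split(";") if s]
        for sentence in sentences:
            print(f"[{start_us} us] {sentence}")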

Chinese speech recognition model for E-commerce live streaming (Transformer)

  • Overview

    The Chinese speech recognition model for E-commerce live streaming (Transformer) is an end-to-end Transformer-based model provided by PAI. This model can automatically recognize Chinese speech from the input audio or video in E-commerce live streaming scenarios. It shares the same structure as the general Chinese speech recognition model (Transformer) but is optimized for Chinese E-commerce live streaming scenarios.

  • Input format
    The input data must be in the JSON format. It contains the url and play_duration fields. The value of the url field is the URL of the input audio or video. The value of the play_duration field specifies the length, in microseconds, of the leading portion of the input audio or video to be processed. If the play_duration field is not specified, the entire audio or video is processed. The following code provides an example of the input data:
    {
      "input": {
        "url": "URL of the input audio or video",
        "play_duration": "Length of the input audio or video clip to be processed"
      }
    }
  • Output format
    The output data consists of key-value pairs in the JSON format. Each key indicates the start timestamp, in microseconds, of the input audio or video clip that is processed. Each value indicates the output text that is transcribed by the ASR model. The model supports about 6,000 common Chinese characters in the output text, more than the approximately 4,000 characters supported by the general Chinese speech recognition models. If the result of the model contains a Chinese character that is not included in the supported character list, the character is replaced by an asterisk (*). Short sentences are separated by semicolons (;). The following code provides an example of the output data:
    {
      "0": "Text 1 after transcription",
      "500000000": "Text 2 after transcription",
      "1000000000": "Text 3 after transcription"
    }
  • Example
    The following code provides an example of the input data of the model:
    {
      "input": {
        "url": "http://pai-vision-data-sh.oss-cn-shanghai-internal.aliyuncs.com/tmp/5000563****.mp4",
      }
    }
    PAI displays information similar to the following output:
    {
      "0": "\u5438\u6536\u6027\u5f3a;\u51c0\u5316\u7a7a\u6c14\u7684\u7279\u70b9;\u53ef\u653e\u7f6e\u624b\u5de5\u6d01\u9762\u7682;\u5438\u6536\u6027\u5f3a;\u51c0\u5316\u7a7a\u6c14\u7684\u7279\u70b9;\u53ef\u653e\u7f6e\u624b\u5de5\u6d01\u9762\u7682;\u5438\u6536\u7682;\u6c90\u6d74\u7682",
      "20031996": "\u7531\u4e8e\u80a5\u7682\u4f7f\u7528\u540e\u6709\u53d8\u8f6f\u7684\u7279\u8d28;\u6240\u4ee5;\u7845\u85fb\u571f\u7682\u79d1\u80fd\u5b8c\u6574\u4fdd\u62a4\u80a5\u7682;\u4e14\u4e0d\u7528\u62c5\u5fc3\u7682\u6c34\u5916\u6d41;\u666e\u901a\u5229\u6c34\u7682\u6258\u5e95\u90e8\u7684\u683c\u5b50\u4f1a\u5bfc\u81f4\u80a5\u7682",
      "40063991": "\u9020\u4f5c\u529f\u80fd\u591a\u6837\u53ef\u5f53\u5bc6\u5c01\u76d6\u673a\u4f4d\u53a8\u5e08\u4e5f\u53ef\u5f53\u676f\u57ab\u53ef\u9694\u70ed\u5feb\u901f\u5438\u6c34\u5012\u6389\u5806\u79ef\u7684\u9020\u4f5c\u53ea\u4f7f\u7528\u8fdc\u6b65\u6cbe\u53d6\u9002\u91cf\u767d\u918b\u6216\u7802\u7eb8\u64e6\u62ed"
    }
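    The keys of the output object are strings, so iterating over them in string order can scramble the clips: for example, "1000000000" sorts before "500000000" as a string. The following Python sketch reassembles the full transcript in playback order by sorting the timestamp keys numerically; the response content is illustrative.
    # Illustrative response with out-of-order string keys.
    response = {
        "20031996": "second clip",
        "0": "first clip",
        "40063991": "third clip",
    }

    # Sort keys numerically, not lexicographically, before joining the text.
    transcript = ";".join(response[k] for k in sorted(response, key=int))
    print(transcript)  # first clip;second clip;third clip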

Chinese speech vectorization model

  • Overview
    The Chinese speech vectorization model is an end-to-end Mockingjay model provided by PAI. This model uses self-supervised learning to analyze Chinese speech in the input audio to meet personalized requirements. It recognizes Chinese speech from the input audio or video, vectorizes the speech data, and then exports the vectorized results. (Figure: model structure.)
  • Input format
    The input data must be in the JSON format. It contains only the url field. The value of the url field is the URL of the input audio or video. The following code provides an example of the input data:
    {
      "input": {
        "url": "URL of the input audio or video",
      }
    }
  • Output format
    The output data is a string that consists of vector features. The features are separated by commas (,). The following code provides an example of the output data:
    "Vector feature 1, Vector feature 2, Vector feature 3, ..., Vector feature N"
  • Example
    The following code provides an example of the input data of the model:
    {
      "input": {
        "url": "http://pai-vision-data-sh.oss-cn-shanghai-internal.aliyuncs.com/tmp/5000563****.mp4",
      }
    }
    PAI displays information similar to the following output:
    "0.5291504,-0.47187772,-0.7588605,...,-0.48115134,1.7070293"

English speech vectorization model

  • Overview

    The English speech vectorization model is an end-to-end Mockingjay model provided by PAI. This model uses self-supervised learning to analyze English speech in the input audio to meet personalized requirements. It recognizes English speech from the input audio or video, vectorizes the speech data, and then exports the vectorized results. This model shares the same structure as the Chinese speech vectorization model. For more information, see Chinese speech vectorization model.

  • Input format
    The input data must be in the JSON format. It contains only the url field. The value of the url field is the URL of the input audio or video. The following code provides an example of the input data:
    {
      "input": {
        "url": "URL of the input audio or video",
      }
    }
  • Output format
    The output data is a string that consists of vector features. The features are separated by commas (,). The following code provides an example of the output data:
    "Vector feature 1, Vector feature 2, Vector feature 3, ..., Vector feature N"
  • Example
    The following code provides an example of the input data of the model:
    {
      "input": {
        "url": "http://pai-vision-data-sh.oss-cn-shanghai-internal.aliyuncs.com/tmp/5000563****.mp4",
      }
    }
    PAI displays information similar to the following output:
    "0.29688737,0.78769636,0.4556097,...,0.8212023,0.5032284"

Classification model for Chinese speakers based on speech attributes

  • Overview
    The classification model for Chinese speakers based on speech attributes is an end-to-end time delay neural network (TDNN) model provided by PAI. This model can classify Chinese speakers based on the speech attributes that are recognized in Chinese audio or video clips. (Figure: structure of the classification model for Chinese speakers based on speech attributes.)
  • Input format
    The input data must be in the JSON format. It contains the url and play_duration fields. The value of the url field is the URL of the input audio or video. The value of the play_duration field specifies the length, in microseconds, of the leading portion of the input audio or video to be processed. If the play_duration field is not specified, the entire audio or video is processed. The following code provides an example of the input data:
    {
      "input": {
        "url": "URL of the input audio or video",
        "play_duration": "Length of the input audio or video to be processed"
      }
    }
  • Output format
    The output data consists of key-value pairs in the JSON format. Each key indicates the start timestamp, in microseconds, of the input audio or video clip that is processed. Each value indicates a classification result, including the speaker label predicted by the model. The following code provides an example of the output data:
    {
      "0": "{\"class\":\"Predicted label\"}\n",
      "500000000": "{\"class\":\"Predicted label\"}\n",
      "1000000000": "{\"class\":\"Predicted label\"}\n",
    }
  • Example
    The following code provides an example of the input data of the model:
    {
      "input": {
        "url": "http://pai-vision-data-sh.oss-cn-shanghai-internal.aliyuncs.com/tmp/5000563****.mp4",
        "play_duration": "39000000"
      }
    }
    PAI displays information similar to the following output:
    {
      "0": "{\"class\":\"Predicted label\"}\n",  
      "20031996": "{\"class\":\"Predicted label\"}\n",
    }
    The predicted label in the classification result is displayed in the Unicode format. The output Unicode data can be decoded into Chinese characters in downstream applications.
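    Note that each value in the response is itself a JSON string that ends with a newline character, so it must be parsed a second time to extract the label. A minimal Python sketch; the label values here are hypothetical, because the label set is not listed in this topic.
    import json

    # Illustrative response; the inner strings are JSON documents with hypothetical labels.
    response = {
        "0": "{\"class\":\"\\u7537\"}\n",
        "20031996": "{\"class\":\"\\u5973\"}\n",
    }

    for start_us, value in response.items():
        # The inner json.loads also decodes the \uXXXX escapes.
        label = json.loads(value)["class"]
        print(f"{int(start_us) / 1_000_000:.1f}s: {label}")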

Chinese speech detection model

  • Overview

    The Chinese speech detection model is an end-to-end TDNN model provided by PAI. This model can detect whether the input audio or video contains Chinese speech. This model shares the same structure as the classification model for Chinese speakers based on speech attributes.

  • Input format
    The input data must be in the JSON format. It contains the url and play_duration fields. The value of the url field is the URL of the input audio or video. The value of the play_duration field specifies the length, in microseconds, of the leading portion of the input audio or video to be processed. If the play_duration field is not specified, the entire audio or video is processed. The following code provides an example of the input data:
    {
      "input": {
        "url": "URL of the input audio or video",
        "play_duration": "Length of the input audio or video to be processed"
      }
    }
  • Output format
    The output data consists of key-value pairs in the JSON format. Each key indicates the start timestamp, in microseconds, of the input audio or video clip that is processed. Each value indicates a detection result. The predicted label in the detection result can be Yes or No. The value Yes indicates that the audio or video clip contains Chinese speech. The following code provides an example of the output data:
    {
      "0": "{\"class\":\"Predicted label\"}\n",
      "500000000": "{\"class\":\"Predicted label\"}\n",
      "1000000000": "{\"class\":\"Predicted label\"}\n",
    }
  • Example
    The following code provides an example of the input data of the model:
    {
      "input": {
        "url": "http://pai-vision-data-sh.oss-cn-shanghai-internal.aliyuncs.com/tmp/5000563****.mp4",
      }
    }
    PAI displays information similar to the following output:
    {
      "0": "{\"class\":\"u662f\"}\n", 
      "20031996": "{\"class\":\"u662f\"}\n", 
      "40063991": "{\"class\":\"\u5426\"}\n"
    }
    The predicted label in the detection result is displayed in the Unicode format. The output Unicode data can be decoded into Chinese characters in downstream applications.
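    For example, to keep only the clips that contain Chinese speech, parse each value and compare the decoded label with 是 (yes). A minimal Python sketch based on the output above:
    import json

    response = {
        "0": "{\"class\":\"\\u662f\"}\n",
        "20031996": "{\"class\":\"\\u662f\"}\n",
        "40063991": "{\"class\":\"\\u5426\"}\n",
    }

    # \u662f decodes to 是 ("yes"); collect the start timestamps of clips with speech.
    speech_clips = [int(k) for k, v in response.items() if json.loads(v)["class"] == "是"]
    print(speech_clips)  # [0, 20031996]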

Background music detection model

  • Overview

    The background music detection model is an end-to-end TDNN model provided by PAI. This model can detect whether the input audio or video contains background music. This model shares the same structure as the classification model for Chinese speakers based on speech attributes.

  • Input format
    The input data must be in the JSON format. It contains the url and play_duration fields. The value of the url field is the URL of the input audio or video. The value of the play_duration field specifies the length, in microseconds, of the leading portion of the input audio or video to be processed. If the play_duration field is not specified, the entire audio or video is processed. The following code provides an example of the input data:
    {
      "input": {
        "url": "URL of the input audio or video",
        "play_duration": "Length of the input audio or video to be processed"
      }
    }
  • Output format
    The output data consists of key-value pairs in the JSON format. Each key indicates the start timestamp, in microseconds, of the input audio or video clip that is processed. Each value indicates a detection result. The predicted label in the detection result can be Yes or No. The value Yes indicates that the audio or video clip contains background music. The following code provides an example of the output data:
    {
      "0": "{\"class\":\"Predicted label\"}\n",
      "500000000": "{\"class\":\"Predicted label\"}\n",
      "1000000000": "{\"class\":\"Predicted label\"}\n",
    }
  • Example
    The following code provides an example of the input data of the model:
    {
      "input": {
        "url": "http://pai-vision-data-sh.oss-cn-shanghai-internal.aliyuncs.com/tmp/5000563****.mp4",
      }
    }
    PAI displays information similar to the following output:
    {
      "0": "{\"class\":\"u662f\"}\n", 
      "20031996": "{\"class\":\"u662f\"}\n", 
      "40063991": "{\"class\":\"\u662f\"}\n"
    }
    The predicted label in the detection result is displayed in the Unicode format. The output Unicode data can be decoded into Chinese characters in downstream applications.
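    As with the Chinese speech detection model, the per-clip labels can be aggregated, for example to estimate how much of a file contains background music. A minimal Python sketch based on the output above:
    import json

    response = {
        "0": "{\"class\":\"\\u662f\"}\n",
        "20031996": "{\"class\":\"\\u662f\"}\n",
        "40063991": "{\"class\":\"\\u662f\"}\n",
    }

    # \u662f decodes to 是 ("yes"); count the clips that contain background music.
    labels = [json.loads(v)["class"] for v in response.values()]
    with_music = sum(1 for label in labels if label == "是")
    print(f"{with_music}/{len(labels)} clips contain background music")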