Man VS Machine: The Secrets Behind Alibaba Cloud’s Speech Recognition Technology - Alibaba Cloud Developer Forums: Cloud Discussion Forums

Posted time: May 9, 2016 18:45 PM

In the previous article, we described how the contest between artificial intelligence and a gold medal stenographer played out, and told the story behind the Man VS Machine competition at the annual Alibaba Cloud meeting. Are there any curious technology geeks out there? What was the on-site real-time transcription system? What is at the core of a speech recognition system? How does it work? Why is the Alibaba Cloud iDST speech recognition system so accurate? What's the secret? This article answers each of these questions.

The Man VS Machine Competition

The photo above shows the annual Alibaba Cloud meeting. The screen on the left displays the speech recognition program and the screen on the right shows the human stenographer.
The image above is a screenshot from a video of the man-machine competition at the annual Alibaba Cloud meeting. Sun Quan, head of Alibaba Cloud, is giving a speech on stage while the automatic speech recognition system and a stenographer each record the text of the speech. Meanwhile, the voting screen shows the accuracy of the two sides in the on-site contest. The screen on the left shows the results produced by the speech recognition system, displayed as subtitles on the live video. The screen on the other side shows the shorthand text of the speech written by Mr. Jiang Yi, a runner-up at the world stenography championship. His results are shown as black text on a white background.
Now you may be wondering: how can the Alibaba Cloud iDST system perform real-time transcription and present the speech recognition results as subtitles? Let’s take a closer look at the science behind it all.

Real-time Transcription System Architecture

The software/hardware structure of the demonstrated system is shown below.

  1. Audio Solution: The speaker talks into a wireless microphone, which transmits the voice to a USB sound card. To support speech recognition and live sound at the same time, the USB sound card has two output channels. One channel sends the voice to the mixer and then to the on-site speakers, which broadcast the speech to the audience. The other channel feeds PC software that captures the audio and sends the captured audio data to the speech recognition program running on the Alibaba Cloud speech server. The text produced by the speech recognition program is then returned in real time as the source for the real-time subtitles.
  2. Video Solution: The on-site video equipment has two video input channels. The first is the image transmitted back to the center console from the camera recording the speech. The second is the streaming speech recognition result text, which is rendered as rolling subtitles on a green screen image. Finally, the center console uses screen keying software to superimpose the rolling subtitles on the image of the speech to produce real-time subtitles.
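The capture step in the audio path can be pictured as cutting the microphone stream into small fixed-duration chunks and handing each one to the recognizer as soon as it is available. The sketch below is only an illustration under stated assumptions: the 16 kHz, 16-bit mono format, the 100 ms chunk length, and the `send` callback are all hypothetical and are not taken from the demonstrated system.

```python
from typing import Callable, Iterator

# Assumed audio format and chunk length (illustrative values only).
SAMPLE_RATE_HZ = 16_000   # samples per second
BYTES_PER_SAMPLE = 2      # 16-bit PCM, mono
CHUNK_MS = 100            # duration of each chunk sent to the recognizer


def chunk_size_bytes(sample_rate_hz: int = SAMPLE_RATE_HZ,
                     bytes_per_sample: int = BYTES_PER_SAMPLE,
                     chunk_ms: int = CHUNK_MS) -> int:
    """Number of bytes that hold `chunk_ms` milliseconds of audio."""
    return sample_rate_hz * bytes_per_sample * chunk_ms // 1000


def iter_chunks(pcm: bytes, size: int) -> Iterator[bytes]:
    """Yield consecutive slices of `pcm`; the last one may be shorter."""
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]


def stream_chunks(pcm: bytes, send: Callable[[bytes], None]) -> int:
    """Pass every chunk to `send` (a hypothetical uploader) and count them."""
    count = 0
    for chunk in iter_chunks(pcm, chunk_size_bytes()):
        send(chunk)
        count += 1
    return count


# One second of silence is split into ten 3200-byte chunks.
received = []
print(stream_chunks(bytes(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE), received.append))
```

Smaller chunks lower the delay before text can appear as subtitles, at the cost of more requests per second.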
In the entire system, the most essential part of the algorithm is the speech recognition service, which transcribes the text of the speech in real time.
This leads to another question: how does speech recognition work?

Overview of Speech Recognition Technology

Speech recognition is a technology that converts speech to text. This technology has evolved over several decades and has become a very promising application in the AI field. Well, what are the basic principles behind this mysterious speech recognition technology? For reasons of space, here, we can only briefly explain the basic principles of speech recognition.

Currently, many mainstream speech recognition systems are built using statistical machine learning methods. A typical speech recognition system consists of the following modules:
  1. Voice Acquisition Module: In this module, the microphone audio input is acquired and expressed as a digitized voice signal. For example, a 16-bit digital voice signal with a 16 kHz sampling rate means that each second of speech is represented as 16,000 16-bit integers.
  2. Feature Extraction Module: This module is primarily tasked with converting the digital voice signal into feature vectors and supplying them to the acoustic model for processing.
  3. Acoustic Model: The acoustic model scores how closely the acoustic features of the speech match the text units hypothesized by the recognizer. Traditionally, speech recognition used the Hidden Markov Model-Gaussian Mixture Model (HMM-GMM), but in recent years, many systems have used the Hidden Markov Model-Deep Neural Network (HMM-DNN) or other improved models.
  4. Pronunciation Dictionary: The pronunciation dictionary contains all the words that a speech recognition system can handle, together with their pronunciations. In effect, it provides the mapping between the acoustic modeling units and the language modeling units.
  5. Language Model: The language model models the target language of the system and is used to assess the fluency of the recognized text. Currently, the most widely used language models are based on the N-gram model and its variants.
  6. Decoder: As the core of the speech recognition system, the decoder receives the input feature vectors and, based on the acoustic and language models, searches for the word string most likely to have produced them. Generally, this search is carried out with a Viterbi algorithm that uses beam search pruning.
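To make the decoder's search more concrete, here is a minimal sketch of Viterbi decoding with beam pruning over a toy two-state HMM. The state names, probabilities, and function names are invented for illustration. A real decoder searches a far larger graph built from the acoustic model, pronunciation dictionary, and language model, and nothing here reflects how the iDST decoder is implemented.

```python
import math


def viterbi_beam(frames, log_init, log_trans, beam=10.0):
    """Return (best_path, best_log_score) for a toy HMM.

    frames:    list of dicts, state -> log p(frame features | state)
    log_init:  dict, state -> log initial probability
    log_trans: dict, (prev_state, next_state) -> log transition probability
    beam:      hypotheses scoring more than `beam` below the best are dropped
    """
    # Each active entry keeps the best score and path ending in that state.
    active = {s: (lp + frames[0][s], [s]) for s, lp in log_init.items()}
    for frame in frames[1:]:
        best = max(score for score, _ in active.values())
        # Beam pruning: keep only hypotheses close to the current best.
        survivors = {s: v for s, v in active.items() if v[0] >= best - beam}
        expanded = {}
        for (prev, nxt), lt in log_trans.items():
            if prev not in survivors:
                continue
            score, path = survivors[prev]
            cand = score + lt + frame[nxt]
            # Viterbi recursion: keep only the best path into each state.
            if nxt not in expanded or cand > expanded[nxt][0]:
                expanded[nxt] = (cand, path + [nxt])
        active = expanded
    final_score, final_path = max(active.values(), key=lambda v: v[0])
    return final_path, final_score


def log_table(probs):
    """Convert a dict of probabilities to log probabilities."""
    return {k: math.log(p) for k, p in probs.items()}


# Toy model with two hypothetical states and three frames.
LOG_INIT = log_table({"s1": 0.6, "s2": 0.4})
LOG_TRANS = log_table({("s1", "s1"): 0.7, ("s1", "s2"): 0.3,
                       ("s2", "s1"): 0.4, ("s2", "s2"): 0.6})
FRAMES = [log_table({"s1": 0.5, "s2": 0.1}),
          log_table({"s1": 0.4, "s2": 0.3}),
          log_table({"s1": 0.1, "s2": 0.6})]

path, score = viterbi_beam(FRAMES, LOG_INIT, LOG_TRANS)
print(path, math.exp(score))  # best path is s1, s1, s2 (probability about 0.01512)
```

A narrower beam discards more hypotheses per frame, which saves computation but risks pruning the path that would have won in the end.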
These are the basic principles behind most speech recognition systems, as described in much popular science writing on the subject. In fact, the iDST speech recognition system also falls within the basic framework described above. But if this is the case, why is the iDST speech recognition system so accurate? What is the secret?

iDST Speech Recognition System
One minute on the stage requires ten years of preparation. At the Alibaba Cloud annual meeting, the iDST speech recognition system demonstrated extremely high accuracy thanks to the iDST speech team's wealth of industry experience and the results of their hard work over the past year. Here, we provide a brief introduction to the unique features of the iDST speech recognition system.

  1. Industry-leading BLSTM Acoustic Modeling Technology
For the speech recognition acoustic model, the iDST team uses industry-leading BLSTM (bidirectional long short-term memory) modeling technology. By modeling speech as a sequence, the technology can use both historical information and "future" information in the speech time series at the same time. This improves acoustic modeling accuracy and, in turn, the accuracy of speech recognition. BLSTM increases performance by 15%-20% compared to the previous generation of DNN-based modeling methods. To support BLSTM technology, the iDST speech team came up with original solutions to BLSTM's training efficiency problem and to its latency problem in actual deployment. They were the first in the world to build BLSTM technology into a real-time industrial system.
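The idea of using both past and "future" frames can be sketched with a small NumPy example: one LSTM pass reads the frames left to right, a second reads them right to left, and the two outputs are concatenated per frame. This is a generic textbook-style bidirectional LSTM with random, untrained weights. The dimensions, parameter layout, and function names are assumptions for illustration, not the iDST model.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_pass(x, W, U, b, reverse=False):
    """Run one LSTM direction over x of shape (T, D); returns (T, H).

    W: (4H, D) input weights, U: (4H, H) recurrent weights, b: (4H,) bias.
    With reverse=True the frames are processed from last to first.
    """
    T, H = x.shape[0], U.shape[1]
    h, c = np.zeros(H), np.zeros(H)
    out = np.zeros((T, H))
    steps = range(T - 1, -1, -1) if reverse else range(T)
    for t in steps:
        z = W @ x[t] + U @ h + b
        i, f, o, g = np.split(z, 4)          # input, forget, output, candidate
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        out[t] = h
    return out


def blstm(x, fwd_params, bwd_params):
    """Concatenate forward (past context) and backward (future context)."""
    forward = lstm_pass(x, *fwd_params)
    backward = lstm_pass(x, *bwd_params, reverse=True)
    return np.concatenate([forward, backward], axis=1)


def random_params(rng, input_dim, hidden_dim):
    """Random, untrained weights used only to exercise the code."""
    return (rng.normal(scale=0.5, size=(4 * hidden_dim, input_dim)),
            rng.normal(scale=0.5, size=(4 * hidden_dim, hidden_dim)),
            np.zeros(4 * hidden_dim))


rng = np.random.default_rng(0)
fwd, bwd = random_params(rng, 3, 4), random_params(rng, 3, 4)
features = rng.normal(size=(5, 3))           # 5 frames of 3-dim features
print(blstm(features, fwd, bwd).shape)       # (5, 8)
```

Because the backward pass needs frames that have not been spoken yet, a bidirectional model cannot emit output for a frame until later audio arrives, which is the source of the latency problem mentioned above.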
  2. Industry-leading Very Large Scale Language Modeling Technology

As discussed in the previous section, the language model is used to assess the fluency of sentences in the speech recognition system. How well the model matches a test text (expressed as "perplexity", where lower is better) is a core measure of a language model. In actual applications, the more closely a language model matches the target field and the more corpus data it covers, the better the recognition results. Broad language model coverage is therefore needed to ensure recognition performance across various fields. Benefiting from Alibaba Cloud's computing advantages, iDST uses a network-wide corpus as training data, together with a self-developed parallel language model training tool based on MaxCompute.
This training produces an extremely large language model (the model file can reach several hundred GB) with tens of billions of N-gram entries. It is this large-scale language model that allows the iDST speech recognition system to recognize many uncommon words and phrases, such as trendy phrases like "The Actress Playing Aunt Kui in The Legend of Mi Yue", ancient poetry, and scientific terms.
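As a small illustration of N-gram modeling and perplexity, the sketch below trains a bigram model with add-one smoothing on two toy sentences and compares the perplexity of a fluent sentence with that of a scrambled one. The corpus, the smoothing choice, and the function names are assumptions made for this example. A production model of the size described above would use far more data and more refined smoothing.

```python
import math
from collections import Counter

BOS, EOS, UNK = "<s>", "</s>", "<unk>"


def tokenize(sentence, vocab=None):
    """Split on whitespace, map unseen words to UNK, add boundary tokens."""
    words = sentence.split()
    if vocab is not None:
        words = [w if w in vocab else UNK for w in words]
    return [BOS] + words + [EOS]


def train_bigram(sentences):
    """Count bigrams and their left contexts; return the counts and vocab."""
    context_counts, bigram_counts = Counter(), Counter()
    vocab = {EOS, UNK}
    for s in sentences:
        toks = tokenize(s)
        vocab.update(toks[1:])
        context_counts.update(toks[:-1])
        bigram_counts.update(zip(toks[:-1], toks[1:]))
    return context_counts, bigram_counts, vocab


def bigram_prob(prev, word, model):
    """Add-one smoothed estimate of p(word | prev)."""
    context_counts, bigram_counts, vocab = model
    return (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + len(vocab))


def perplexity(sentences, model):
    """exp of the average negative log probability per predicted token."""
    log_sum, n = 0.0, 0
    for s in sentences:
        toks = tokenize(s, model[2])
        for prev, word in zip(toks[:-1], toks[1:]):
            log_sum += math.log(bigram_prob(prev, word, model))
            n += 1
    return math.exp(-log_sum / n)


model = train_bigram(["the cat sat", "the dog sat"])
print(perplexity(["the cat sat"], model))      # fluent: lower perplexity
print(perplexity(["sat the dog cat"], model))  # scrambled: higher perplexity
```

The same measure explains why coverage matters: a sentence full of words or word pairs the model has never seen receives low probability and therefore high perplexity.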
  3. Industry-leading Speech Recognition Decoding Technology

The speech recognition decoder is the core component of an industrial speech recognition system. Although the basic principle of a decoder is well known (Viterbi search) and a toy decoder prototype can work with only about 200 lines of code, developing a true industry-level decoder is the greatest challenge in speech recognition technology.
Speech recognition decoding is both computation-intensive and memory-intensive. The first challenge in decoder development is that acoustic model (DNN) scoring is a typical computation-intensive process (matrix multiplication). To make this process efficient, the iDST team performed various algorithm optimizations (quantization decoding, model compression, etc.) and instruction-level optimizations (for various hardware platforms), reducing the amount of computation required.
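One common way to cut the cost of DNN scoring is to run the matrix multiplications on 8-bit integers instead of 32-bit floats. The sketch below shows a simple symmetric per-tensor int8 scheme in NumPy. The source does not say which quantization method the iDST team used, so the scheme and the function names here are assumptions chosen only to illustrate the general idea.

```python
import numpy as np


def quantize_int8(x):
    """Symmetric per-tensor quantization: x is approximated by scale * q."""
    scale = np.abs(x).max() / 127.0
    if scale == 0:
        scale = 1.0                      # all-zero input: any scale works
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale


def quantized_matmul(a, b):
    """Approximate a @ b using int8 operands and int32 accumulation."""
    qa, sa = quantize_int8(a)
    qb, sb = quantize_int8(b)
    # Accumulate in int32 so sums of int8 products cannot overflow.
    acc = qa.astype(np.int32) @ qb.astype(np.int32)
    return acc.astype(np.float32) * (sa * sb)


rng = np.random.default_rng(0)
activations = rng.normal(size=(8, 64)).astype(np.float32)
weights = rng.normal(size=(64, 16)).astype(np.float32)

exact = activations @ weights
approx = quantized_matmul(activations, weights)
# Worst-case error relative to the largest exact value.
print(np.abs(approx - exact).max() / np.abs(exact).max())
```

In NumPy this is not faster than the float version. The benefit appears only when the integer product runs on hardware instructions designed for it, which is where platform-specific optimization comes in.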

An even greater challenge facing the decoder is how to use very large scale language models in speech recognition. A massive model can create a memory bottleneck in the decoding process, and the repeated language model lookups during decoding produce a computation bottleneck. To bring a very large scale language model into the decoding process, the iDST team extensively customized the storage format of the language model, the decoder's core algorithms, and the way the decoder interacts with the language model. This reduces the memory consumed by the language model during decoding and makes full use of the information available in the decoding process to reduce the number of language model computations, making it possible to use the decoder online. In fact, only a few speech recognition systems in the world can use a language model of this size for single-pass decoding.
In the future, the iDST team will write an article explaining this part in more detail.
  4. Rapid Model Iteration and Training

Another contributor to the system's outstanding performance is rapid model iteration and training. In speech recognition, the acoustic and language models must learn from massive amounts of data, so rapid iteration on massive data volumes is crucial. In addition to the large-scale language model training tool mentioned above, the iDST team uses a GPU cluster parallel deep learning system built on Alibaba Cloud's infrastructure for acoustic model training (for details, refer to the GPU Training article). This system makes rapid iteration possible.

  5. Robust Computing

As discussed earlier, speech recognition is itself computation-intensive. To ensure optimal system performance on the day of the demonstration, Alibaba Cloud used HPC, its next-gen high-performance computing platform with GPU acceleration, at the annual meeting. This platform provides a single-node computing capability of up to 16 TFLOPS, which, coupled with algorithm optimization, enabled the speech recognition system to respond in real time.
These are all the secrets that may interest tech geeks. If you would like to learn more about this technology, please see the Alibaba iDST account in the Yunqi Community.
[Cloudy edited the post at May 11, 2016 17:47 PM]