Man VS Machine: The Secrets Behind Alibaba Cloud’s Speech Recognition Technology
Posted: May 9, 2016, 18:45
In the previous article, we described the Artificial Intelligence PK Gold Medal Stenography Competition and told the story behind the Man VS Machine competition at the Alibaba Cloud annual meeting. Curious technology geeks may be wondering: What was the on-site real-time transcription system? What is the core of a speech recognition system, and how does it work? Why is the Alibaba Cloud iDST speech recognition system so accurate? What is the secret? This article answers each of these questions.
The Man VS Machine Competition
The photo above shows the annual Alibaba Cloud meeting. The screen on the left displays the speech recognition program and the screen on the right shows the human stenographer.
The image above is a screenshot from a video of the annual Alibaba Cloud meeting's man-machine competition. Head of Alibaba Cloud Sun Quan is giving a speech on the stage while the automatic speech recognition system and a stenographer each transcribe the speech; meanwhile, the voting screen shows the on-site PK accuracy. The screen on the left shows the results produced by the speech recognition system, displayed as subtitles over the live video. The screen on the right shows the shorthand transcript produced by Mr. Jiang Yi, a runner-up at the world stenography championship, displayed as black text on a white background.
Now you may be wondering: how can the Alibaba Cloud iDST system perform real-time transcription and present the speech recognition results as subtitles? Let's take a closer look at the science behind it all.
Real-time Transcription System Architecture
The software/hardware structure of the demonstrated system is shown below.
This leads to another question: how does speech recognition work?
Overview of Speech Recognition Technology
Speech recognition is a technology that converts speech to text. It has evolved over several decades and has become one of the most promising applications in the AI field. So what are the basic principles behind this seemingly mysterious technology? For reasons of space, we can only explain them briefly here.
Currently, most mainstream speech recognition systems are built using statistical machine learning methods. A typical speech recognition system consists of the following modules:
- a front end that extracts acoustic features from the audio signal;
- an acoustic model that scores how well candidate sounds match those features;
- a language model that scores how fluent a candidate word sequence is;
- a decoder that searches for the word sequence with the best combined score.
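Conceptually, these modules combine by searching for the word sequence W that maximizes the acoustic score P(O|W) times the language-model score P(W) for the observed audio O. The toy sketch below illustrates only this decision rule; both scoring functions are hypothetical stand-ins, not real models.

```python
# Toy sketch of the statistical speech recognition decision rule: pick the
# word sequence W maximizing log P(O|W) + log P(W). Both scorers below are
# made-up stand-ins for a real acoustic model and language model.

def acoustic_score(observation, words):
    # Stand-in for log P(O|W): pretend each word explains ~3 observation units.
    return -abs(len(observation) - 3 * len(words))  # hypothetical

def language_score(words):
    # Stand-in for log P(W): mildly favour shorter hypotheses.
    return -0.5 * len(words)  # hypothetical

def decode(observation, hypotheses):
    # argmax over candidate word sequences of the combined score.
    return max(hypotheses,
               key=lambda w: acoustic_score(observation, w) + language_score(w))

hyps = [["hello", "world"], ["hollow", "word"], ["hello", "walled", "duo"]]
best = decode("abcdef", hyps)  # 6 observation units fit a 2-word hypothesis
print(best)
```

A real decoder, of course, searches an enormous hypothesis space rather than a fixed list, which is exactly the challenge discussed in the decoding section below.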
The above are the basic principles behind most speech recognition systems, as described in many popular science articles on the subject. In fact, the iDST speech recognition system also falls within this basic framework. If that is the case, why is it so accurate? What is the secret?
The iDST Speech Recognition System
One minute on the stage requires ten years of preparation. At the Alibaba Cloud annual meeting, the iDST speech recognition system
demonstrated extremely high accuracy thanks to the iDST speech team's wealth of industry experience and the results of their hard work over the past year. Here, we will provide a simple introduction to the unique features of the iDST speech recognition system.
1. Industry-leading BLSTM Acoustic Modeling Technology
For the speech recognition acoustic model, the iDST team boasts industry-leading BLSTM (bi-directional long short-term memory) modeling technology. Using sequence modeling, the technology can simultaneously exploit historical information and "future" information in the speech time series. This ensures optimal acoustic modeling accuracy and effectively increases the accuracy of speech recognition: BLSTM improves performance by 15%-20% over the previous generation of DNN-based modeling methods. To put BLSTM into practice, the iDST speech team devised an original solution to the BLSTM latency problem in both training efficiency and actual deployment, and was the first in the world to build BLSTM technology into a real-time industrial system.
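To make the "historical and future information" point concrete, the sketch below runs a minimal numpy LSTM over a feature sequence twice, forward and backward, and concatenates the two hidden states per frame. This is only an illustration of the bidirectional idea with fixed random weights, not the iDST team's trained model.

```python
# Minimal numpy sketch of bidirectional LSTM (BLSTM) feature processing.
# Weights are random and untrained; this only illustrates the structure.
import numpy as np

def lstm_pass(x_seq, W, U, b, hidden):
    # Standard LSTM cell run over a sequence; gates packed as [i, f, o, g].
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    outs = []
    for x in x_seq:
        z = W @ x + U @ h + b
        i, f, o, g = np.split(z, 4)
        i, f, o = (1.0 / (1.0 + np.exp(-v)) for v in (i, f, o))  # sigmoid gates
        c = f * c + i * np.tanh(g)   # cell state carries long-term memory
        h = o * np.tanh(c)           # hidden state emitted per frame
        outs.append(h)
    return outs

rng = np.random.default_rng(0)
T, d, hidden = 5, 3, 4                      # frames, input dim, hidden dim
x_seq = [rng.standard_normal(d) for _ in range(T)]
params = lambda: (rng.standard_normal((4 * hidden, d)) * 0.1,
                  rng.standard_normal((4 * hidden, hidden)) * 0.1,
                  np.zeros(4 * hidden))

fwd = lstm_pass(x_seq, *params(), hidden)              # sees past context
bwd = lstm_pass(x_seq[::-1], *params(), hidden)[::-1]  # sees "future" context
blstm_out = [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
print(len(blstm_out), blstm_out[0].shape)  # 5 frames, each 2*hidden = 8 dims
```

Because the backward pass cannot start until the whole utterance (or a chunk of it) has arrived, a naive BLSTM adds latency, which is precisely the deployment problem mentioned above.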
2. Industry-leading Very Large Scale Language Modeling Technology
As discussed in the previous section, the language model assesses the fluency of sentences in the speech recognition system. The degree of match between the model and a test text (expressed as "perplexity") is a core measure of language model quality. In practical applications, the better a language model matches a field and the more of the corpus it covers, the better the recognition results; to increase coverage, recognition performance must be ensured across the various fields. Benefiting from Alibaba Cloud's computing advantages, the iDST team uses a network-wide corpus as training data, together with a self-developed parallel language model training tool built on MaxCompute.
This training produces an extremely large language model (the model file can reach several hundred GB) with tens of billions of n-gram entries. It is precisely this large-scale language model that allows the iDST speech recognition system to recognize many uncommon words, such as trendy phrases like "The Actress Playing Aunt Kui in The Legend of Mi Yue", lines of ancient poetry, and scientific terms.
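The perplexity measure mentioned above can be illustrated on a toy scale. The sketch below trains a bigram model with add-one smoothing on a tiny made-up corpus and shows that fluent text gets lower perplexity than scrambled text; a production model differs only in scale (tens of billions of n-grams) and in using better smoothing.

```python
# Toy bigram language model with add-one (Laplace) smoothing, plus the
# perplexity measure. Corpus and sentences are made up for illustration.
import math
from collections import Counter

train = "the cat sat on the mat . the dog sat on the rug .".split()
vocab = set(train)
unigrams = Counter(train)
bigrams = Counter(zip(train, train[1:]))

def bigram_prob(w1, w2):
    # Add-one smoothing so unseen word pairs still get non-zero probability.
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + len(vocab))

def perplexity(words):
    # exp of the average negative log-probability per predicted word.
    log_p = sum(math.log(bigram_prob(a, b)) for a, b in zip(words, words[1:]))
    return math.exp(-log_p / (len(words) - 1))

fluent = "the cat sat on the rug .".split()
odd = "rug the on sat mat cat the".split()
print(perplexity(fluent), perplexity(odd))  # fluent text scores lower
```

Lower perplexity means the model finds the text less "surprising", which is why a better-matched, broader-coverage model yields better recognition.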
3. Industry-leading Speech Recognition Decoding Technology
The speech recognition decoder is the core component of an industrial speech recognition system. Although the basic principle of a decoder is well known (Viterbi search) and a mock decoder prototype can be written in only about 200 lines of code, developing a true industry-level decoder is one of the greatest challenges in speech recognition technology.
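To show the basic principle on a toy scale, here is a Viterbi search over a tiny two-state HMM (silence vs. speech). All probabilities are invented for illustration; an industrial decoder searches a graph many orders of magnitude larger.

```python
# Toy Viterbi search over a tiny HMM: find the most likely state sequence
# for a sequence of observations. All probabilities are made up.
import math

states = ["sil", "speech"]
start = {"sil": 0.8, "speech": 0.2}
trans = {"sil": {"sil": 0.7, "speech": 0.3},
         "speech": {"sil": 0.2, "speech": 0.8}}
emit = {"sil": {"quiet": 0.9, "loud": 0.1},
        "speech": {"quiet": 0.2, "loud": 0.8}}

def viterbi(obs):
    # best[s] = (log-prob of best path ending in state s, that path)
    best = {s: (math.log(start[s] * emit[s][obs[0]]), [s]) for s in states}
    for o in obs[1:]:
        best = {s: max(((lp + math.log(trans[p][s] * emit[s][o]), path + [s])
                        for p, (lp, path) in best.items()),
                       key=lambda t: t[0])
                for s in states}
    return max(best.values(), key=lambda t: t[0])[1]

print(viterbi(["quiet", "quiet", "loud", "loud", "quiet"]))
```

The dynamic-programming trick is that at each frame only the best path into each state needs to be kept, so the search is linear in the utterance length.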
Speech recognition decoding is a truly computation-intensive and memory-intensive process. The first challenge in decoder development is that acoustic model (DNN) scoring is a typical computation-intensive task (matrix multiplication). To ensure the efficiency of this step, the iDST team performed various algorithm optimizations (quantized decoding, model compression, etc.) and instruction-level optimizations (for various hardware platforms), reducing the amount of computation required.
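The quantization idea behind "quantized decoding" can be sketched as follows: store the weight matrix as int8 with a scale factor, then dequantize after the multiply. This is a generic illustration of the technique, not the iDST team's actual scheme, and the layer sizes are arbitrary.

```python
# Sketch of symmetric per-tensor int8 quantization of a matrix multiply:
# 4x smaller weights at the cost of a small numerical error.
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((64, 32)).astype(np.float32)  # hypothetical DNN layer
x = rng.standard_normal(32).astype(np.float32)        # one feature frame

scale = np.abs(W).max() / 127.0            # map max |weight| to int8 range
W_q = np.round(W / scale).astype(np.int8)  # 4x smaller than float32

y_full = W @ x                                 # reference float result
y_quant = (W_q.astype(np.int32) @ x) * scale   # dequantize after multiply

err = np.abs(y_full - y_quant).max()
print(W_q.nbytes, W.nbytes, err)  # 2048 vs 8192 bytes, small error
```

Smaller weights mean less memory traffic, and integer arithmetic is typically faster on the target hardware, which is where the speedup comes from.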
An even greater challenge facing the decoder is how to use very large scale language models during recognition: a massive model can create a memory bottleneck in the decoding process, and repeated lookups into the language model during decoding produce a computation bottleneck. To bring a very large scale language model into decoding, the iDST team extensively customized the language model's storage representation, the decoder's core algorithms, and the way the decoder interacts with the language model. This reduced the language model's memory consumption during decoding and fully exploited the information available in the decoding process to cut the number of language model lookups, making it possible to use the decoder online. In fact, only a few speech recognition systems in the world can use a language model of this size for single-pass decoding.
In the future, the iDST team will write an article explaining the mysteries of this section.
4. Rapid Model Iteration and Training
Another contributor to the system's outstanding performance is rapid model iteration and training. In speech recognition, both the acoustic model and the language model must learn from massive data, making rapid iteration on massive volumes of data crucial. In addition to the large-scale language model training tool mentioned above, the iDST team used a GPU-cluster parallel deep learning system built on Alibaba Cloud's infrastructure for acoustic model training (for details, refer to the GPU Training article). This system makes rapid iteration possible.
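One common pattern behind such GPU-cluster training (a generic illustration, not necessarily the exact scheme used by the iDST team) is synchronous data parallelism: each worker computes gradients on its shard of a batch, and the averaged gradient equals the full-batch gradient. The sketch below simulates this with numpy on a toy linear model.

```python
# Sketch of synchronous data-parallel training: the average of per-worker
# gradients on equal shards equals the full-batch gradient (toy linear
# least-squares model; "workers" are simulated, no real GPUs involved).
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((8, 3))
y = X @ np.array([1.0, -2.0, 0.5])   # synthetic targets
w = np.zeros(3)                      # model parameters being trained

def grad(w, Xb, yb):
    # Mean-squared-error gradient on one shard of the batch.
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

# Single "device" gradient on the full batch...
g_single = grad(w, X, y)
# ...equals the average of per-worker gradients on 4 equal shards.
shards = np.split(np.arange(8), 4)
g_parallel = np.mean([grad(w, X[s], y[s]) for s in shards], axis=0)

print(np.allclose(g_single, g_parallel))  # True
```

Because each step is mathematically equivalent to large-batch training, adding workers shortens iteration time without changing what the model learns, which is what makes rapid iteration on massive data practical.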
5. Robust Computing
As discussed earlier, speech recognition is itself computation intensive. To ensure optimal system performance on the day of the demonstration, Alibaba Cloud ran the system on HPC, its next-generation high-performance computing platform with GPU acceleration, at the annual meeting. The platform provides single-node computing capability of up to 16 TFLOPS, which, coupled with algorithm optimization, enabled the speech recognition system to respond in real time.
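"Responding in real time" has a standard quantitative criterion: the real-time factor (RTF), i.e. processing time divided by audio duration, must stay below 1.0. The numbers below are made up purely to illustrate the calculation.

```python
# The real-time factor (RTF) criterion for live transcription:
# a system keeps up with live speech when RTF < 1.0.
audio_seconds = 60.0        # one minute of speech
processing_seconds = 12.0   # hypothetical time to transcribe it
rtf = processing_seconds / audio_seconds
print(rtf, rtf < 1.0)
```

For subtitles at a live event, the effective RTF must in fact sit well below 1.0, since the system also needs headroom for bursts of fast speech.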
These are all the secrets that may interest tech geeks. If you would like to learn more about this technology, please follow the Alibaba iDST account in the Yunqi Community.