Cloudy
Administrator

Artificial Intelligence vs Gold Medal Stenographer

Posted time: May 9, 2016, 18:31
Introduction
At the Alibaba Cloud annual meeting on March 23, 2016, over 2,000 attendees witnessed a contest between man and machine. During the on-site presentation and sharing sessions, the Alibaba Cloud iDST team's real-time speech recognition system challenged Mr. Jiang Yi, the world stenography championship runner-up and Jack Ma's designated gold medal stenographer. With his formidable short-term memory, the stenographer can type at ultra-fast speeds with exceptional accuracy. Can technology go one better and beat the stenographer in a competition of accuracy? Let’s take a look.
1: Data is King
As many of us know, a standard machine learning process goes like this:

data --> training --> testing --> tuning --> re-training --> more data --> repeat



Although we have built up technical expertise and a large volume of online data, the free speaking style of the annual meeting presentations means that speakers may speak either passionately or in a whisper. Speaking speed, clarity of articulation, and accent differ widely (this is especially true in places with a rich diversity of dialects, like China). Data therefore has to be collected as broadly as possible. However, data collection is only the first step. Next, the data has to be cleaned, filtered, and tagged. We will skip the details here, but those familiar with the dirty work of machine learning are well aware of what this entails.


In the end, we used thousands of hours of basic data, coupled with nearly a hundred hours of field data, for acoustic model training. The language model was trained largely on network-wide data. Rapid iteration on the models was made possible by the speech recognition pipeline and the GPU multi-host, multi-card Machine Learning Middleware. In this experiment, we could run 30 or more groups of training jobs concurrently, with each model iteration taking only 2 days.

2: From Sloth to Rabbit: Faster! Faster! Faster!


To ensure good recognition results, we used large-scale models and the latest BLSTM (Bidirectional Long Short-Term Memory) acoustic models. However, the larger and better the model, the higher the computation cost and the worse the real-time performance. Previously, we had only used large-scale BLSTM-DNN hybrid acoustic models in quasi-real-time customer service systems. To maximize performance, we also used a language model (trained on network-wide data) with over 10 billion grammar points. So how did we manage to minimize the recognition latency for such massive and complex models? We relied on our high-performance speech recognition decoder and Alibaba Cloud's High Performance Computing (HPC) service. Technical details are given in the follow-up article: Annual Alibaba Cloud Meeting's Man Machine Competition - Technical Secrets


3: Alibaba Strives for Perfection
After the above preparations, and with perfection in mind, we began to explore ways of making the presentation more attractive. After all, we did not want the audience at a major annual meeting to see only a big white screen filled with dense black text. So we designed a streaming return format for the recognition results, which also let the audience see how quickly the machine can correct its own recognition errors. The text was presented as rolling subtitles, superimposed on the live video using green-screen compositing.
Even for the font, we thought hard about whether to use black on white or white on black.
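The post does not describe the actual stream return format, so the following is only a minimal sketch of the general idea: each streamed update carries the current best text for a segment, and a later update for the same segment overwrites the earlier one, which is how an on-screen correction would appear. The names `RecognitionUpdate` and `SubtitleBuffer`, their fields, and the sample strings are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class RecognitionUpdate:
    """One streamed message from the recognizer (hypothetical format)."""
    segment_id: int   # which segment of speech the text belongs to
    text: str         # current best hypothesis for that segment
    is_final: bool    # True once the recognizer will no longer revise it


@dataclass
class SubtitleBuffer:
    """Keeps the latest hypothesis per segment so revisions replace old text."""
    segments: dict = field(default_factory=dict)

    def apply(self, update: RecognitionUpdate) -> None:
        # A later update for the same segment overwrites the earlier one.
        self.segments[update.segment_id] = update

    def render(self, max_lines: int = 2) -> str:
        # Show only the most recent segments, like rolling subtitles.
        latest_ids = sorted(self.segments)[-max_lines:]
        return "\n".join(self.segments[i].text for i in latest_ids)


# Invented sample stream: segment 0 is revised before segment 1 begins.
buffer = SubtitleBuffer()
for update in [
    RecognitionUpdate(0, "hello word", False),
    RecognitionUpdate(0, "hello world", True),
    RecognitionUpdate(1, "this is", False),
]:
    buffer.apply(update)

print(buffer.render())
```

Keying the buffer by segment means the display layer never needs to know whether a message is a first guess or a correction; it simply redraws the latest text.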



A Superhuman Opponent
Our opponent in this competition was the world stenography championship runner-up, Mr. Jiang Yi. This stellar stenographer seems able to defy natural human limits. When a speaker talks at 300 Chinese characters per minute, he can listen and type at the same time, turning spoken words into text. In other words, he types 5 characters a second (not including punctuation, line breaks, deletions, and comments). Given that a single character takes an average of 4 keystrokes, he hits 20 keys each second! Despite various kinds of disturbance, he achieves an accuracy of more than 90%!
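The typing-rate figures above follow from simple arithmetic. A short check, using only the two numbers quoted in the text (the variable names are ours):

```python
chars_per_minute = 300      # speaking rate quoted in the text
keystrokes_per_char = 4     # average keystrokes per character, from the text

chars_per_second = chars_per_minute / 60
keystrokes_per_second = chars_per_second * keystrokes_per_char

print(chars_per_second)       # 5.0
print(keystrokes_per_second)  # 20.0
```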

While studying the opposition, we realized that we had little to learn from him (unlike in the AlphaGo man vs. machine showdown, where AlphaGo needed to learn from its opponent's playing style). Instead, we provided back-office support for him, such as a stable audio source (even in a noisy environment, the stenographer has to hear clearly and work undisturbed), a comfortable work environment (a desk), and an optimized display of the stenography results. When everything was ready, we eagerly awaited the results.
A Real Master Cares Little About Victory or Defeat
The actual results of the competition were revealed at the event. Both parties performed excellently during the 7 minute 50 second speech, but in the end the Alibaba Cloud iDST real-time speech recognition system outperformed Mr. Jiang Yi by a margin of 0.67%. Those present may have had a different impression of the result. Our on-site proofreaders checked the text repeatedly to make sure the results were error-free before they were passed to the moderator. Still, everyone was wondering: "Why?" In response, we would cite Alibaba Cloud's Mr. Li Jin: there is a difference between atomic correctness and faithfulness. Let's look at a few examples:

Speaker: “…entered into the overall Baba [i.e., Alibaba] system.”
Machine: …entered into the overall father [homophone for Baba in Chinese] system
Stenographer: …entered into the overall Alibaba system

Speaker: “…achieved unified marketing and unified management.”
Machine: …achieved agreement [homophone in Chinese] marketing and unified management
Stenographer: …achieved unified marketing and unified management

As can be seen from the examples above, the machine still sometimes stumbles on the odd homophone here and there. In contrast, humans can use their semantic understanding to better process ambiguous words in the context of a speech.

Speaker: “Alibaba Cloud”
Machine: Zero
Stenographer: Alibaba Cloud

During the annual meeting, attentive attendees would have noticed that the machine often mistook "Alibaba Cloud" for "Zero". Why? In fact, this is related to the "mysterious current" phenomenon explained in the next section.

Speaker: “It is a fiscal year with very many challenges.”
Machine: It is a fiscal year with very many challenges
Stenographer: It is a fiscal year that is very challenging

Speaker: “For the first time, we achieved continuous growth for three quarters.”
Machine: For the first time, we achieved continuous growth for three quarters
Stenographer: For the first time we achieved growth for three quarters

The machine aims to transcribe the speech without missing a single word. The stenographer, on the other hand, can choose to ignore some content (repetitions, fillers, etc.) as long as the content is largely correct. The machine often makes semantic errors, but stenographers rarely do. In terms of readability, the human produces results more faithful to the original. However, in terms of objective statistics for atomic accuracy, the machine wins by a slim margin.
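The post does not say exactly how atomic accuracy was scored. One common way to score a transcript character by character is to take the edit distance between it and a reference text, so the sketch below assumes that definition; the function names and the toy strings are ours, not the competition's. It shows why a transcript that drops words can score lower than one with a visible wrong character.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: fewest character insertions, deletions, substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, hypothesis: str) -> float:
    """1 - (edit distance / reference length); assumes a non-empty reference."""
    return 1.0 - edit_distance(reference, hypothesis) / len(reference)


# Toy strings, invented for illustration only.
reference = "abcdefghij"
one_wrong_character = "abcxefghij"   # one substitution, nothing omitted
two_dropped_characters = "abcfghij"  # readable summary style, two characters omitted

print(round(character_accuracy(reference, one_wrong_character), 2))
print(round(character_accuracy(reference, two_dropped_characters), 2))
```

Under this kind of metric every omitted character counts as an error, regardless of whether the omission changes the meaning.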

A stenographer must maintain a high degree of concentration at work, as the job is mentally and physically demanding. Over longer periods, it is hard for a human to sustain this level of mental effort. A machine, however, never complains. Once connected to a power source, it can work 24/7 (much to the satisfaction of bosses).
 
Going forward, we need to further improve the system's adaptability to accents, noise, and new data. We hope that the system will perform better and better, and we will look to the cloud to help us with these improvements.
Currently, many products combine "cloud", "network", and "terminal": a user holds a client (the terminal) and transmits data to the cloud through a network (in the narrow sense), and the cloud then provides data services. Hardware connections for the terminals and the stability of the network and cloud services may all affect the end user's experience.

 
 
Text taken from the Alibaba Cloud Yunqi Community website. If reprinted, retain the author and source (Alibaba Cloud Yunqi Community) and send an email notification to Yunqi at (yqeditor@list.alibaba-inc.com).

[Cloudy edited the post at May 11, 2016 17:51 PM]