Ten Alibaba speech AI papers accepted at INTERSPEECH 2020, the top speech conference - Alibaba Cloud Developer Community

Source: Alibaba Speech AI official account

Ten papers from Alibaba Speech AI were accepted at INTERSPEECH 2020, the leading speech conference. They cover speech recognition, speech synthesis, speaker recognition, speech enhancement, and signal processing. Detailed write-ups of some of these papers will follow, so stay tuned.

1) Speech recognition

• Zhifu Gao, Shiliang Zhang, Ming Lei, Ian McLoughlin, SAN-M: Memory Equipped Self-Attention for End-to-End Speech Recognition.

• Shiliang Zhang, Zhifu Gao, Haoneng Luo, Ming Lei, Jie Gao, Zhijie Yan, Lei Xie, Streaming Chunk-Aware Multihead Attention for Online End-to-End Speech Recognition

• Yingzhu Zhao, Chongjia Ni, Cheung-Chi LEUNG, Shafiq Joty, Eng Siong Chng and Bin Ma, Cross Attention with Monotonic Alignment for Speech Transformer

• Yingzhu Zhao, Chongjia Ni, Cheung-Chi LEUNG, Shafiq Joty, Eng Siong Chng and Bin Ma, Speech Transformer with Speaker Aware Persistent Memory

• Yingzhu Zhao, Chongjia Ni, Cheung-Chi LEUNG, Shafiq Joty, Eng Siong Chng and Bin Ma, Universal Speech Transformer

2) Speech synthesis

• Shengkui Zhao, Trung Hieu Nguyen, Hao Wang and Bin Ma, Towards Natural Bilingual and Code-Switched Speech Synthesis Based on Mix of Monolingual Recordings and Cross-Lingual Voice Conversion

3) Speaker recognition

• Siqi Zheng, Yun Lei, Hongbin Suo, Phonetically-Aware Coupled Network For Short Duration Text-independent Speaker Verification.

4) Speech enhancement

• Zhihao Du, Ming Lei, Jiqing Han, Shiliang Zhang, Self-supervised Adversarial Multi-task Learning for Vocoder-based Monaural Speech Enhancement

5) Signal processing

• Weilong Huang and Jinwei Feng, Differential Beamforming for Uniform Circular Array with Directional Microphones

• Ziteng Wang, Yueyue Na, Zhang Liu, Yun Li, Biao Tian and Qiang Fu, A Semi-blind Source Separation Approach for Speech Dereverberation.

Introduction to the DAMO Academy Speech Lab

The lab is dedicated to research on the basic theories, key technologies, and application systems of next-generation human-computer speech interaction, including speech recognition, speech synthesis, voice wake-up, acoustic design and signal processing, voiceprint recognition, and audio event detection. It has built products and solutions for e-commerce, new retail, justice, transportation, manufacturing, and other industries, providing high-quality voice interaction services to consumers, enterprises, and governments.

Main research directions

Speech recognition and wake-up

Targeting complex scenarios such as home, in-vehicle, office, and public spaces, under strong noise and in both near-field and far-field conditions, the lab studies multilingual, multimodal, device-cloud integrated speech recognition and wake-up technology. The platform also gives developers extensive self-learning capabilities for model customization, so that businesses can customize their own speech models.

Speech synthesis

The lab studies high-quality, expressive speech synthesis, personalized speech synthesis, and voice conversion. These are mainly applied in scenarios such as voice interaction, information broadcasting, and text reading.

Acoustics and signal processing

The lab researches the design of acoustic devices, structures, and hardware solutions, sound source localization based on physical modeling and machine learning, speech enhancement and separation, and multimodal and distributed signal processing.

Voiceprint recognition and audio event detection

The lab researches text-dependent and text-independent voiceprint recognition, dynamic passwords, voiceprint recognition in near-field and far-field environments, gender and age profiling, large-scale voiceprint retrieval, language and dialect identification, audio fingerprint retrieval, and audio event analysis.

Spoken language understanding and dialogue systems

Building on natural language understanding technology, the lab develops spoken language understanding and dialogue systems for voice interaction scenarios, providing developers with self-correction and dialogue customization capabilities.

Voice interaction platform

It combines atomic capabilities such as acoustics, signal processing, wake-up, recognition, understanding, dialogue, and synthesis to build a distributed speech interaction platform that is full-link, cross-platform, low-cost, highly replicable, and device-cloud integrated, allowing third parties to implement scalable, customized scenarios.

Multimodal human-computer interaction

It is the first technology in the industry to achieve wake-word-free far-field voice interaction in noisy public places. Combining technologies such as streaming multi-turn, multi-intent spoken language understanding and adaptive business knowledge graphs, it provides natural speech interaction for real, complex scenarios in public spaces.

Link to the official website of intelligent speech interaction:

