Source: Alibaba Speech AI WeChat public account
Ten papers from Alibaba's speech AI team have been accepted at Interspeech 2020. Their research covers speech recognition, speech synthesis, speaker recognition, speech enhancement, and signal processing. Detailed interpretations of selected papers will follow; stay tuned.
1) Speech recognition
• Zhifu Gao, Shiliang Zhang, Ming Lei, Ian McLoughlin, SAN-M: Memory Equipped Self-Attention for End-to-End Speech Recognition.
• Shiliang Zhang, Zhifu Gao, Haoneng Luo, Ming Lei, Jie Gao, Zhijie Yan, Lei Xie, Streaming Chunk-Aware Multihead Attention for Online End-to-End Speech Recognition
• Yingzhu Zhao, Chongjia Ni, Cheung-Chi LEUNG, Shafiq Joty, Eng Siong Chng and Bin Ma, Cross Attention with Monotonic Alignment for Speech Transformer
• Yingzhu Zhao, Chongjia Ni, Cheung-Chi LEUNG, Shafiq Joty, Eng Siong Chng and Bin Ma, Speech Transformer with Speaker Aware Persistent Memory
• Yingzhu Zhao, Chongjia Ni, Cheung-Chi LEUNG, Shafiq Joty, Eng Siong Chng and Bin Ma, Universal Speech Transformer
2) Speech synthesis
3) Speaker recognition
• Siqi Zheng, Yun Lei, Hongbin Suo, Phonetically-Aware Coupled Network For Short Duration Text-independent Speaker Verification.
4) Speech enhancement
• Zhihao Du, Ming Lei, Jiqing Han, Shiliang Zhang, Self-supervised Adversarial Multi-task Learning for Vocoder-based Monaural Speech Enhancement
5) Signal processing
• Weilong Huang and Jinwei Feng, Differential Beamforming for Uniform Circular Array with Directional Microphones
Introduction to the Speech Lab of DAMO Academy
The lab is committed to research on next-generation fundamental theories, key technologies, and application systems for human-computer speech interaction, including speech recognition, speech synthesis, speech wake-up, acoustic design and signal processing, voiceprint recognition, and audio event detection. It has built products and solutions covering e-commerce, new retail, justice, transportation, manufacturing, and other industries, providing high-quality voice interaction services to consumers, enterprises, and governments.
Speech recognition and wake-up
Targeting complex scenarios such as home, in-vehicle, office, public space, strong noise, and near/far field, the lab researches multilingual, multimodal, device-cloud integrated speech recognition and wake-up technologies. The platform provides broad self-learning capabilities that let developers customize models, enabling businesses to tailor voice models to their needs.
Speech synthesis
The lab researches high-quality, expressive speech synthesis, personalized speech synthesis, and voice conversion technologies, applied mainly to scenarios such as voice interaction, information broadcasting, and text-to-speech reading.
Acoustics and signal processing
The lab researches acoustic devices, structural and hardware design, sound source localization based on physical modeling and machine learning, speech enhancement and separation, and multimodal and distributed signal processing.
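To make the beamforming idea above concrete, here is a minimal narrowband delay-and-sum beamformer for a uniform circular array. This is an illustrative sketch only: the Interspeech paper listed earlier develops differential beamforming, a different and more involved design, and all function names and parameters here are our own assumptions, not the lab's implementation.

```python
import numpy as np

def uca_steering_vector(n_mics, radius, angle, freq, c=343.0):
    """Far-field steering vector for a uniform circular array (UCA)
    in the horizontal plane. `angle` is the look direction in radians.
    """
    mic_angles = 2 * np.pi * np.arange(n_mics) / n_mics
    # Time advance of each microphone relative to the array centre
    delays = radius * np.cos(angle - mic_angles) / c
    return np.exp(2j * np.pi * freq * delays)

def delay_and_sum(snapshots, n_mics, radius, look_angle, freq):
    """Steer the array toward `look_angle` and average the aligned channels.

    snapshots: (n_mics, n_frames) complex STFT values for one frequency bin.
    """
    w = uca_steering_vector(n_mics, radius, look_angle, freq) / n_mics
    return w.conj() @ snapshots

# Toy check: a plane wave arriving from the look direction passes with unit gain.
n_mics, radius, freq = 6, 0.05, 1000.0
angle = np.deg2rad(60.0)
x = np.outer(uca_steering_vector(n_mics, radius, angle, freq),
             np.ones(4))  # 4 identical snapshots of the target wavefront
y = delay_and_sum(x, n_mics, radius, angle, freq)
print(np.allclose(y, 1.0))  # unit gain in the look direction
```

Delay-and-sum simply compensates each microphone's propagation delay before summing; differential designs instead exploit the pressure differences between closely spaced microphones to obtain frequency-invariant directional patterns.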
Voiceprint recognition and audio event detection
The lab researches text-dependent/text-independent voiceprint recognition, dynamic passwords, near-field/far-field voiceprint recognition, gender and age profiling, large-scale voiceprint retrieval, language and dialect identification, audio fingerprint retrieval, and audio event analysis.
Spoken language understanding and dialogue systems
Building on natural language understanding technology, the lab develops spoken language understanding and dialogue systems for voice interaction scenarios, providing developers with self-correction and dialogue customization capabilities.
Distributed speech interaction platform
The lab combines atomic capabilities such as acoustics, signal processing, wake-up, recognition, understanding, dialogue, and synthesis to build a full-link, cross-platform, low-cost, highly replicable, device-cloud integrated distributed speech interaction platform that allows third parties to build scalable, customized scenarios.
Multimodal human-computer interaction
The lab was the first in the industry to achieve wake-word-free, far-field voice interaction in noisy public spaces. It combines streaming multi-turn, multi-intent spoken language understanding with adaptive business knowledge graphs to provide natural voice interaction for real, complex public-space scenarios.
Link to the official website of intelligent speech interaction: