Automatic Speech Recognition


Gain insights with our speech-to-text service powered by AI

Overview

Alibaba Cloud’s Automatic Speech Recognition (ASR) is an enterprise-level speech-to-text solution powered by our state-of-the-art AI technology. It offers high-precision capabilities to convert speech from various types of audio and video files to text in complex environments, with a Character Error Rate (CER) lower than 15%. This solution accurately identifies English, Mandarin, and Cantonese speech across multilingual contexts, with more languages to come. You can deploy this solution in a Virtual Private Cloud (VPC) network or a hybrid cloud environment to meet your business needs and ensure data privacy, security, and compliance. You can also add these capabilities to your applications through our APIs.
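For readers unfamiliar with the metric, CER is commonly computed as the character-level edit distance between a reference transcript and the recognized text, divided by the reference length. The following is a minimal sketch of that common definition, not the evaluation procedure used for this service. The function names and the choice to ignore spaces are assumptions made for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """CER = edit distance / number of reference characters.

    Assumption: spaces are ignored, which is a common but not universal convention.
    """
    ref = list(reference.replace(" ", ""))
    hyp = list(hypothesis.replace(" ", ""))
    if not ref:
        raise ValueError("reference transcript must not be empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # One wrong character out of six reference characters.
    print(character_error_rate("今天天气很好", "今天天汽很好"))
```

Under this definition, a CER below 15% means fewer than 15 character edits per 100 reference characters.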

Highlights

Multilingual and Versatile Speech Transcription

Transcribe mixed English, Mandarin, and Cantonese speech from audio and video files in various formats, and process nine major video formats efficiently without pre-conversion

High-Accuracy and Robust Speech Recognition

Elevate your speech recognition accuracy with our Large Audio Language Models tailored for Mandarin, English, and Cantonese and with advanced Voice Activity Detection (VAD) technology, reducing the CER to below 15% for exceptional clarity and reliability

Quality Measurement Tailored for Customer Service Scenarios

Revolutionize ASR quality control with our dual-validation system that combines LLM-powered inspection with a regular expression (regex) engine to significantly improve precision and coverage
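One way to picture a dual-validation check is sketched below. This is an illustration under stated assumptions, not the product's implementation: the rule names, the patterns, the `dual_validate` function, and the policy of marking a finding "confirmed" only when both checkers agree are all hypothetical. The LLM inspector is passed in as a plain callable so that no real model API is implied.

```python
import re
from typing import Callable, Dict, Iterable, Set

# Hypothetical quality rules for customer service transcripts.
REGEX_RULES = {
    "guaranteed_promise": re.compile(r"\bguarantee[sd]?\b", re.IGNORECASE),
    "rude_language": re.compile(r"\b(?:shut up|stupid)\b", re.IGNORECASE),
}


def regex_inspect(transcript: str) -> Set[str]:
    """Return the names of all regex rules that match the transcript."""
    return {name for name, pattern in REGEX_RULES.items() if pattern.search(transcript)}


def dual_validate(transcript: str,
                  llm_inspect: Callable[[str], Iterable[str]]) -> Dict[str, str]:
    """Combine regex findings with findings from an LLM-based inspector.

    Assumed policy: the union of both result sets is reported (coverage);
    a rule flagged by both checkers is "confirmed" (precision), and a rule
    flagged by only one is queued as "needs_review".
    """
    regex_hits = regex_inspect(transcript)
    llm_hits = set(llm_inspect(transcript))
    return {
        rule: "confirmed" if rule in regex_hits and rule in llm_hits else "needs_review"
        for rule in sorted(regex_hits | llm_hits)
    }


if __name__ == "__main__":
    # Placeholder standing in for a real LLM call.
    def stub_llm(text: str):
        return ["rude_language"] if "stupid" in text.lower() else []

    print(dual_validate("That is a stupid question, but we guarantee a refund.", stub_llm))
```

Passing the inspector as a callable keeps the sketch testable with a stub and leaves the choice of model open.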

Enhanced Performance for High-Concurrency Tasks

Achieve industry-leading throughput (a processing-to-audio time ratio better than 1:15) with a GPU-accelerated architecture validated on T4 GPU benchmarks: speaker diarization and audio transcription for a 15-minute audio clip complete within one minute
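The 1:15 figure can be read as one unit of processing time per 15 units of audio. The short sketch below works through that arithmetic. The function names are illustrative, and the ratio is taken from the statement above rather than from any measurement made here.

```python
def processing_time_budget_seconds(audio_seconds, ratio=15.0):
    """Longest processing time that still meets a 1:ratio processing-to-audio target."""
    return audio_seconds / ratio


def meets_ratio(audio_seconds, processing_seconds, ratio=15.0):
    """True if the audio was processed at least `ratio` times faster than real time."""
    return processing_seconds * ratio <= audio_seconds


if __name__ == "__main__":
    clip = 15 * 60  # a 15-minute clip, in seconds
    print(processing_time_budget_seconds(clip))  # 60.0 seconds, i.e. one minute
    print(meets_ratio(clip, processing_seconds=55))
```

At this ratio, a single worker processing jobs back to back would cover roughly 15 hours of audio per hour of compute.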

Comprehensive Security and Privacy Protection

Protect your data and privacy throughout the full data lifecycle with algorithm and model encryption, data transmission encryption, storage isolation, and identity authentication services

Flexible and Secure Deployment Options

Choose the deployment option that suits your business: in your VPC network or on our Apsara Stack clusters (for isolation of resources and cloud management), with support for multiple GPU hardware configurations and customizable business priority settings

Architecture

Our ASR solution architecture is designed to address three challenges: mixed-language speech processing in a global context, the poor format compatibility and unstable recognition rates of traditional ASR systems, and data sovereignty compliance in sectors such as government and finance. The architecture integrates multiple modules and algorithms to provide mixed-language recognition and multi-modal compatibility. It supports real-time and offline mixed-language ASR for English, Mandarin, and Cantonese, with more languages to come. Private deployment options safeguard sensitive data and help meet strict regulatory requirements. A quality inspection engine with customizable rules reduces manual review costs by over 50%. The system also supports prioritization across multiple business scenarios to optimize resource utilization. With support for algorithm images and API outputs, it gives partners the flexibility to deploy the solution according to their specific needs.
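Prioritization across business scenarios can be pictured as a priority queue of transcription jobs. The sketch below is only an illustration of that idea: the `PriorityJobQueue` class, the scenario names, and the numeric priority scheme (a lower number runs first) are hypothetical and do not describe the product's scheduler.

```python
import heapq
import itertools


class PriorityJobQueue:
    """Serve jobs by scenario priority, first-in first-out within a scenario."""

    def __init__(self, priorities):
        # Maps a scenario name to a priority; a lower number is served first.
        self._priorities = dict(priorities)
        self._heap = []
        # A monotonically increasing counter breaks ties in submission order.
        self._counter = itertools.count()

    def submit(self, scenario, audio_path):
        priority = self._priorities[scenario]  # KeyError for an unknown scenario
        heapq.heappush(self._heap, (priority, next(self._counter), scenario, audio_path))

    def next_job(self):
        """Return (scenario, audio_path) for the most urgent job, or None if empty."""
        if not self._heap:
            return None
        _, _, scenario, audio_path = heapq.heappop(self._heap)
        return scenario, audio_path


if __name__ == "__main__":
    queue = PriorityJobQueue({"realtime_call": 0, "batch_archive": 10})
    queue.submit("batch_archive", "archive_001.wav")
    queue.submit("realtime_call", "call_042.wav")
    print(queue.next_job())  # the real-time call is served before the archive job
```

A binary heap keeps both submission and retrieval logarithmic in the number of queued jobs, which suits high-concurrency workloads.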