All Products
Search
Document Center

Alibaba Cloud Model Studio:Speech-to-text models

Last Updated:Apr 23, 2026

Choose a model for real-time speech recognition or audio file transcription.

Answer two questions to find the right model:

  1. Do you need real-time results as users speak, or can you process recorded audio files in batches?

  2. Does your audio contain domain-specific terminology?

Real-time or non-real-time?

Real-time

These models use the WebSocket protocol to process streaming audio input and return streaming text output. Ideal for live captioning, voice assistants, and meeting transcription.

Model

Series

Advantage

fun-asr-realtime

Fun-ASR

Hotword, dialect support, and multilingual mixed-language recognition

qwen3-asr-flash-realtime

Qwen3-ASR

Emotion recognition

qwen3.5-omni-plus-realtime

Qwen3.5-Omni

Prompt context injection, semantic interruption, and 113 languages

qwen3.5-omni-flash-realtime

Qwen3.5-Omni

Lightweight and cost-effective

qwen3-omni-flash-realtime

Qwen3-Omni (previous generation)

Prompt context injection

Non-real-time

Submit audio files and poll for results. These models support files up to 12 hours or 2 GB, suitable for call center recordings, podcasts, and interviews.

Model

Series

Advantage

fun-asr

Fun-ASR

Speaker diarization, hotword, and multilingual mixed-language recognition

qwen3-asr-flash-filetrans

Qwen3-ASR

Emotion recognition

qwen3.5-omni-plus

Qwen3.5-Omni

Prompt context injection, 113 languages, and an OpenAI-compatible API

qwen3.5-omni-flash

Qwen3.5-Omni

Lightweight, cost-effective, and an OpenAI-compatible HTTP API

qwen3-omni-flash

Qwen3-Omni-Flash (previous generation)

Prompt context injection, multimodal, and an OpenAI-compatible API

Near-real-time alternative

The non-real-time API also accepts short audio clips. You can submit 5-second audio chunks for near-real-time results without WebSocket. For latency-sensitive applications, use a WebSocket-based real-time model to avoid connection overhead.

Handling terminology

Two options, from most to least flexible:

  1. Prompt context injection (Qwen3.5-Omni): Describe your domain-specific context in the system prompt. The model adapts per request without pre-configuration. The tradeoff is higher latency than dedicated ASR models.

  2. Hotword (Fun-ASR): Provide a weighted vocabulary list. This method is best for stable, infrequently changing vocabulary lists.

Note

Qwen3.5-Omni is not a traditional ASR (Automatic Speech Recognition) model; it is a language model that understands audio. You inject context through a prompt, and the model adapts without hotword lists.

Speaker diarization

Only the non-real-time Fun-ASR models, such as fun-asr and fun-asr-mtl, support speaker diarization. Use these models if you need to distinguish speakers.

Emotion recognition

qwen3-asr-flash-realtime, qwen3-asr-flash-filetrans, and Qwen3.5-Omni series models support emotion recognition during transcription.

Full comparison

Model

Mode

API

Accuracy enhancement

Emotion

Speaker diarization

Language

Max duration

fun-asr-realtime

Real-time

WebSocket

Hotword (Chinese mainland only)

Not supported

Not supported

Chinese, English, Japanese, and dialects

Streaming

fun-asr

Non-real-time

Asynchronous REST

Hotword

Not supported

Supported

Chinese, English, Japanese, and dialects

12 hours / 2 GB

qwen3-asr-flash-realtime

Real-time

WebSocket

--

Supported

Not supported

26 languages

Streaming

qwen3-asr-flash-filetrans

Non-real-time

Asynchronous REST

--

Supported

Not supported

26 languages

12 hours / 2 GB

paraformer-realtime-v2

Real-time

WebSocket

Hotword

Not supported

Not supported

Chinese, English, Japanese, Korean, German, French, and Russian

Streaming

paraformer-v2

Non-real-time

Asynchronous REST

Hotword

Not supported

Supported

Chinese, English, Japanese, Korean, German, French, and Russian

12 hours / 2 GB

paraformer-realtime-8k-v2

Real-time

WebSocket

Hotword

Supported

Not supported

Chinese

Streaming

paraformer-8k-v2

Non-real-time

Asynchronous REST

Hotword

Not supported

Not supported

Chinese

12 hours / 2 GB

qwen3.5-omni-plus

Non-real-time

HTTP (OpenAI-compatible)

Prompt context

Supported

Not supported

113 languages

Per-request limit

qwen3.5-omni-flash

Non-real-time

HTTP (OpenAI-compatible)

Prompt context

Supported

Not supported

113 languages

Per-request limit

qwen3.5-omni-plus-realtime

Real-time

WebSocket

Prompt context

Supported

Not supported

113 languages

120 minutes

qwen3.5-omni-flash-realtime

Real-time

WebSocket

Prompt context

Supported

Not supported

113 languages

120 minutes

qwen3-omni-flash (previous generation)

Non-real-time

HTTP (OpenAI-compatible)

Prompt context

Supported

Not supported

Chinese, English, Japanese, Korean, German, French, Italian, Spanish, Portuguese, and Russian; Chinese dialects: Sichuanese, Shanghainese, Cantonese, Min Nan, Shaanxi, Nanjing, Tianjin, and Beijing

Per-request limit

qwen3-omni-flash-realtime (previous generation)

Real-time

WebSocket

Prompt context

Supported

Not supported

Chinese, English, Japanese, Korean, German, French, Italian, Spanish, Portuguese, and Russian; Chinese dialects: Sichuanese, Shanghainese, Cantonese, Min Nan, Shaanxi, Nanjing, Tianjin, and Beijing

120 minutes

Note

All models support common audio formats such as WAV, MP3, and AAC.

Availability

Check which models are available in your API key's region.

International

Use an API key from the Singapore region to access the following models.

Model series

Mode

Available models

Fun-ASR

Real-time

fun-asr-realtime

Non-real-time

fun-asr, fun-asr-mtl

Qwen3-ASR

Real-time

qwen3-asr-flash-realtime

Non-real-time

qwen3-asr-flash-filetrans, qwen3-asr-flash

Qwen3.5-Omni

Qwen3-Omni

Real-time / Non-real-time

qwen3.5-omni-plus-realtime, qwen3.5-omni-flash-realtime, qwen3.5-omni-plus, qwen3.5-omni-flash, qwen3-omni-flash-realtime (previous generation), qwen3-omni-flash (previous generation)

Chinese mainland

Use an API key from the China (Beijing) region to access the following models.

Model series

Mode

Type

Available models

Fun-ASR

Real-time

Recommended

fun-asr-realtime, fun-asr-flash-8k-realtime, fun-asr-mtl-realtime

Non-real-time

Recommended

fun-asr, fun-asr-mtl

Qwen3-ASR

Real-time

Recommended

qwen3-asr-flash-realtime

Non-real-time

Recommended

qwen3-asr-flash-filetrans, qwen3-asr-flash

Qwen3.5-Omni

Qwen3-Omni

Real-time / Non-real-time

Recommended

qwen3.5-omni-plus-realtime, qwen3.5-omni-flash-realtime, qwen3.5-omni-plus, qwen3.5-omni-flash, qwen3-omni-flash-realtime (previous generation), qwen3-omni-flash (previous generation)

Legacy

Real-time

Legacy

gummy-realtime-v1, gummy-chat-v1, paraformer-realtime-v2, paraformer-realtime-v1, paraformer-realtime-8k-v2, paraformer-realtime-8k-v1

Non-real-time

paraformer-v2, paraformer-8k-v2, paraformer-v1, paraformer-8k-v1, paraformer-mtl-v1

Non-real-time

sensevoice-v1 (to be deprecated)

Note

The US region also supports qwen3-asr-flash-us (non-real-time), which requires an API key from the US region.