Speech-to-text models - Alibaba Cloud Model Studio - Alibaba Cloud Documentation Center

Choose a model for real-time speech recognition or audio file transcription.

Answer two questions to find the right model:

Do you need real-time results as users speak, or can you process recorded audio files in batches?
Does your audio contain domain-specific terminology?

Real-time or non-real-time?

Real-time

These models use the WebSocket protocol to process streaming audio input and return streaming text output. Ideal for live captioning, voice assistants, and meeting transcription.

Model	Series	Advantage
`fun-asr-realtime`	Fun-ASR	Hotword, dialect support, and multilingual mixed-language recognition
`qwen3-asr-flash-realtime`	Qwen3-ASR	Emotion recognition
`qwen3.5-omni-plus-realtime`	Qwen3.5-Omni	Prompt context injection, semantic interruption, and 113 languages
`qwen3.5-omni-flash-realtime`	Qwen3.5-Omni	Lightweight and cost-effective
`qwen3-omni-flash-realtime`	Qwen3-Omni (previous generation)	Prompt context injection

Non-real-time

Submit audio files and poll for results. These models support files up to 12 hours or 2 GB, suitable for call center recordings, podcasts, and interviews.

Model	Series	Advantage
`fun-asr`	Fun-ASR	Speaker diarization, hotword, and multilingual mixed-language recognition
`qwen3-asr-flash-filetrans`	Qwen3-ASR	Emotion recognition
`qwen3.5-omni-plus`	Qwen3.5-Omni	Prompt context injection, 113 languages, and an OpenAI-compatible API
`qwen3.5-omni-flash`	Qwen3.5-Omni	Lightweight, cost-effective, and an OpenAI-compatible HTTP API
`qwen3-omni-flash`	Qwen3-Omni-Flash (previous generation)	Prompt context injection, multimodal, and an OpenAI-compatible API

Near-real-time alternative

The non-real-time API also accepts short audio clips. You can submit 5-second audio chunks for near-real-time results without WebSocket. For latency-sensitive applications, use a WebSocket-based real-time model to avoid connection overhead.

Handling terminology

Two options, from most to least flexible:

Prompt context injection (Qwen3.5-Omni): Describe your domain-specific context in the system prompt. The model adapts per request without pre-configuration. The tradeoff is higher latency than dedicated ASR models.
Hotword (Fun-ASR): Provide a weighted vocabulary list. This method is best for stable, infrequently changing vocabulary lists.

Note

Qwen3.5-Omni is not a traditional ASR (Automatic Speech Recognition) model; it is a language model that understands audio. You inject context through a prompt, and the model adapts without hotword lists.

Speaker diarization

Only the non-real-time Fun-ASR models, such as fun-asr and fun-asr-mtl, support speaker diarization. Use these models if you need to distinguish speakers.

Emotion recognition

qwen3-asr-flash-realtime, qwen3-asr-flash-filetrans, and Qwen3.5-Omni series models support emotion recognition during transcription.

Full comparison

Model	Mode	API	Accuracy enhancement	Emotion	Speaker diarization	Language	Max duration
`fun-asr-realtime`	Real-time	WebSocket	Hotword (Chinese mainland only)	Not supported	Not supported	Chinese, English, Japanese, and dialects	Streaming
`fun-asr`	Non-real-time	Asynchronous REST	Hotword	Not supported	Supported	Chinese, English, Japanese, and dialects	12 hours / 2 GB
`qwen3-asr-flash-realtime`	Real-time	WebSocket	--	Supported	Not supported	26 languages	Streaming
`qwen3-asr-flash-filetrans`	Non-real-time	Asynchronous REST	--	Supported	Not supported	26 languages	12 hours / 2 GB
`paraformer-realtime-v2`	Real-time	WebSocket	Hotword	Not supported	Not supported	Chinese, English, Japanese, Korean, German, French, and Russian	Streaming
`paraformer-v2`	Non-real-time	Asynchronous REST	Hotword	Not supported	Supported	Chinese, English, Japanese, Korean, German, French, and Russian	12 hours / 2 GB
`paraformer-realtime-8k-v2`	Real-time	WebSocket	Hotword	Supported	Not supported	Chinese	Streaming
`paraformer-8k-v2`	Non-real-time	Asynchronous REST	Hotword	Not supported	Not supported	Chinese	12 hours / 2 GB
`qwen3.5-omni-plus`	Non-real-time	HTTP (OpenAI-compatible)	Prompt context	Supported	Not supported	113 languages	Per-request limit
`qwen3.5-omni-flash`	Non-real-time	HTTP (OpenAI-compatible)	Prompt context	Supported	Not supported	113 languages	Per-request limit
`qwen3.5-omni-plus-realtime`	Real-time	WebSocket	Prompt context	Supported	Not supported	113 languages	120 minutes
`qwen3.5-omni-flash-realtime`	Real-time	WebSocket	Prompt context	Supported	Not supported	113 languages	120 minutes
`qwen3-omni-flash` (previous generation)	Non-real-time	HTTP (OpenAI-compatible)	Prompt context	Supported	Not supported	Chinese, English, Japanese, Korean, German, French, Italian, Spanish, Portuguese, and Russian; Chinese dialects: Sichuanese, Shanghainese, Cantonese, Min Nan, Shaanxi, Nanjing, Tianjin, and Beijing	Per-request limit
`qwen3-omni-flash-realtime` (previous generation)	Real-time	WebSocket	Prompt context	Supported	Not supported	Chinese, English, Japanese, Korean, German, French, Italian, Spanish, Portuguese, and Russian; Chinese dialects: Sichuanese, Shanghainese, Cantonese, Min Nan, Shaanxi, Nanjing, Tianjin, and Beijing	120 minutes

Note

All models support common audio formats such as WAV, MP3, and AAC.

Availability

Check which models are available in your API key's region.

International

Use an API key from the Singapore region to access the following models.

Model series	Mode	Available models
Fun-ASR	Real-time	`fun-asr-realtime`
Fun-ASR	Non-real-time	`fun-asr`, `fun-asr-mtl`
Qwen3-ASR	Real-time	`qwen3-asr-flash-realtime`
Qwen3-ASR	Non-real-time	`qwen3-asr-flash-filetrans`, `qwen3-asr-flash`
Qwen3.5-Omni Qwen3-Omni	Real-time / Non-real-time	`qwen3.5-omni-plus-realtime`, `qwen3.5-omni-flash-realtime`, `qwen3.5-omni-plus`, `qwen3.5-omni-flash`, `qwen3-omni-flash-realtime` (previous generation), `qwen3-omni-flash` (previous generation)

Chinese mainland

Use an API key from the China (Beijing) region to access the following models.

Model series	Mode	Type	Available models
Fun-ASR	Real-time	Recommended	`fun-asr-realtime`, `fun-asr-flash-8k-realtime`, `fun-asr-mtl-realtime`
Fun-ASR	Non-real-time	Recommended	`fun-asr`, `fun-asr-mtl`
Qwen3-ASR	Real-time	Recommended	`qwen3-asr-flash-realtime`
Qwen3-ASR	Non-real-time	Recommended	`qwen3-asr-flash-filetrans`, `qwen3-asr-flash`
Qwen3.5-Omni Qwen3-Omni	Real-time / Non-real-time	Recommended	`qwen3.5-omni-plus-realtime`, `qwen3.5-omni-flash-realtime`, `qwen3.5-omni-plus`, `qwen3.5-omni-flash`, `qwen3-omni-flash-realtime` (previous generation), `qwen3-omni-flash` (previous generation)
Legacy	Real-time	Legacy	`gummy-realtime-v1`, `gummy-chat-v1`, `paraformer-realtime-v2`, `paraformer-realtime-v1`, `paraformer-realtime-8k-v2`, `paraformer-realtime-8k-v1`
	Non-real-time		`paraformer-v2`, `paraformer-8k-v2`, `paraformer-v1`, `paraformer-8k-v1`, `paraformer-mtl-v1`
	Non-real-time		`sensevoice-v1` (to be deprecated)

Note

The US region also supports qwen3-asr-flash-us (non-real-time), which requires an API key from the US region.