Choose a model for real-time speech recognition or audio file transcription.
Answer two questions to find the right model:
Do you need real-time results as users speak, or can you process recorded audio files in batches?
Does your audio contain domain-specific terminology?
Real-time or non-real-time?
Real-time
These models use the WebSocket protocol to process streaming audio input and return streaming text output. Ideal for live captioning, voice assistants, and meeting transcription.
Model | Series | Advantage |
| Fun-ASR | Hotword, dialect support, and multilingual mixed-language recognition |
| Qwen3-ASR | Emotion recognition |
| Qwen3.5-Omni | Prompt context injection, semantic interruption, and 113 languages |
| Qwen3.5-Omni | Lightweight and cost-effective |
| Qwen3-Omni (previous generation) | Prompt context injection |
Non-real-time
Submit audio files and poll for results. These models support files up to 12 hours or 2 GB, suitable for call center recordings, podcasts, and interviews.
Model | Series | Advantage |
| Fun-ASR | Speaker diarization, hotword, and multilingual mixed-language recognition |
| Qwen3-ASR | Emotion recognition |
| Qwen3.5-Omni | Prompt context injection, 113 languages, and an OpenAI-compatible API |
| Qwen3.5-Omni | Lightweight, cost-effective, and an OpenAI-compatible HTTP API |
| Qwen3-Omni-Flash (previous generation) | Prompt context injection, multimodal, and an OpenAI-compatible API |
Near-real-time alternative
The non-real-time API also accepts short audio clips. You can submit 5-second audio chunks for near-real-time results without WebSocket. For latency-sensitive applications, use a WebSocket-based real-time model to avoid connection overhead.
Handling terminology
Two options, from most to least flexible:
Prompt context injection (Qwen3.5-Omni): Describe your domain-specific context in the system prompt. The model adapts per request without pre-configuration. The tradeoff is higher latency than dedicated ASR models.
Hotword (Fun-ASR): Provide a weighted vocabulary list. This method is best for stable, infrequently changing vocabulary lists.
Qwen3.5-Omni is not a traditional ASR (Automatic Speech Recognition) model; it is a language model that understands audio. You inject context through a prompt, and the model adapts without hotword lists.
Speaker diarization
Only the non-real-time Fun-ASR models, such as fun-asr and fun-asr-mtl, support speaker diarization. Use these models if you need to distinguish speakers.
Emotion recognition
qwen3-asr-flash-realtime, qwen3-asr-flash-filetrans, and Qwen3.5-Omni series models support emotion recognition during transcription.
Full comparison
Model | Mode | API | Accuracy enhancement | Emotion | Speaker diarization | Language | Max duration |
| Real-time | WebSocket | Hotword (Chinese mainland only) | Not supported | Not supported | Chinese, English, Japanese, and dialects | Streaming |
| Non-real-time | Asynchronous REST | Hotword | Not supported | Supported | Chinese, English, Japanese, and dialects | 12 hours / 2 GB |
| Real-time | WebSocket | -- | Supported | Not supported | 26 languages | Streaming |
| Non-real-time | Asynchronous REST | -- | Supported | Not supported | 26 languages | 12 hours / 2 GB |
| Real-time | WebSocket | Hotword | Not supported | Not supported | Chinese, English, Japanese, Korean, German, French, and Russian | Streaming |
| Non-real-time | Asynchronous REST | Hotword | Not supported | Supported | Chinese, English, Japanese, Korean, German, French, and Russian | 12 hours / 2 GB |
| Real-time | WebSocket | Hotword | Supported | Not supported | Chinese | Streaming |
| Non-real-time | Asynchronous REST | Hotword | Not supported | Not supported | Chinese | 12 hours / 2 GB |
| Non-real-time | HTTP (OpenAI-compatible) | Prompt context | Supported | Not supported | 113 languages | Per-request limit |
| Non-real-time | HTTP (OpenAI-compatible) | Prompt context | Supported | Not supported | 113 languages | Per-request limit |
| Real-time | WebSocket | Prompt context | Supported | Not supported | 113 languages | 120 minutes |
| Real-time | WebSocket | Prompt context | Supported | Not supported | 113 languages | 120 minutes |
| Non-real-time | HTTP (OpenAI-compatible) | Prompt context | Supported | Not supported | Chinese, English, Japanese, Korean, German, French, Italian, Spanish, Portuguese, and Russian; Chinese dialects: Sichuanese, Shanghainese, Cantonese, Min Nan, Shaanxi, Nanjing, Tianjin, and Beijing | Per-request limit |
| Real-time | WebSocket | Prompt context | Supported | Not supported | Chinese, English, Japanese, Korean, German, French, Italian, Spanish, Portuguese, and Russian; Chinese dialects: Sichuanese, Shanghainese, Cantonese, Min Nan, Shaanxi, Nanjing, Tianjin, and Beijing | 120 minutes |
All models support common audio formats such as WAV, MP3, and AAC.
Availability
Check which models are available in your API key's region.
International
Use an API key from the Singapore region to access the following models.
Model series | Mode | Available models |
Fun-ASR | Real-time |
|
Non-real-time |
| |
Qwen3-ASR | Real-time |
|
Non-real-time |
| |
Qwen3.5-Omni Qwen3-Omni | Real-time / Non-real-time |
|
Chinese mainland
Use an API key from the China (Beijing) region to access the following models.
Model series | Mode | Type | Available models |
Fun-ASR | Real-time | Recommended |
|
Non-real-time | Recommended |
| |
Qwen3-ASR | Real-time | Recommended |
|
Non-real-time | Recommended |
| |
Qwen3.5-Omni Qwen3-Omni | Real-time / Non-real-time | Recommended |
|
Legacy | Real-time | Legacy |
|
Non-real-time |
| ||
Non-real-time |
|
The US region also supports qwen3-asr-flash-us (non-real-time), which requires an API key from the US region.