Select a model for use cases like conversational speech and speech translation.
S2S (speech-to-speech) vs. pipeline
There are two approaches to building voice applications:
| S2S | Pipeline (ASR + LLM + TTS) |
Latency | Low -- single-model stream processing | Higher -- three-stage serial processing |
Audio understanding | End-to-end -- perceives tone and emotion and responds accordingly | Converts to text before processing, losing subtle audio cues |
Voice customization | Selection of preset voices via a system prompt | Voice cloning and voice design (CosyVoice) |
Use S2S for low latency, audio-aware responses, and interactive conversation.
Use a pipeline when you need to customize voices or select best-in-class ASR, LLM, and TTS models for each stage.
Real-time or file mode?
Real-time (WebSocket): For real-time voice interaction such as voice assistants, call centers, and simultaneous interpretation. Supports streaming audio input and speech output. Model names contain -realtime.
File mode (HTTP): Trades higher latency for better results, ideal for video dubbing, podcast translation, and offline content processing. Supports function calling (Qwen3.5-Omni, Qwen3-Omni-Flash), web search (Qwen3.5-Omni), and thinking mode (Qwen3-Omni-Flash). Qwen3-Livetranslate is also available in file mode.
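The naming rule above can be turned into a small routing helper. This is a minimal sketch that assumes only the stated convention that real-time model names contain `-realtime`; the helper name `transport_for` and the model names in the usage lines are hypothetical placeholders, not actual model IDs.

```python
def transport_for(model_name: str) -> str:
    """Return the API transport implied by a model name.

    Assumes the convention described above: real-time models contain
    "-realtime" in their name and are served over WebSocket; every
    other model is treated as file mode over HTTP.
    """
    return "websocket" if "-realtime" in model_name.lower() else "http"


# Hypothetical placeholder names, for illustration only.
print(transport_for("example-omni-realtime"))
print(transport_for("example-omni"))
```

Deriving the transport from the name keeps the choice in one place, so switching between real-time and file mode only requires changing the model name.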
Function calling
Allows the model to perform actions—such as querying a knowledge base, checking a schedule, or triggering a workflow—based on audio and visual input. Supported by Qwen3.5-Omni (in WebSocket and HTTP modes) and Qwen3-Omni (in HTTP mode).
Not supported by Qwen3-Omni real-time (WebSocket) models or the Livetranslate model.
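On the application side, function calling usually comes down to mapping a tool name chosen by the model to local code. The sketch below illustrates that dispatch step only. The tool names, the `handle_tool_call` helper, and the assumption that a tool call arrives as a name plus a JSON-encoded argument string are all hypothetical and are not taken from the API reference.

```python
import json
from typing import Any, Callable, Dict


# Hypothetical local tools; names and return shapes are illustrative only.
def check_schedule(date: str) -> Dict[str, Any]:
    return {"date": date, "events": []}


def query_knowledge_base(query: str) -> Dict[str, Any]:
    return {"query": query, "hits": []}


TOOLS: Dict[str, Callable[..., Dict[str, Any]]] = {
    "check_schedule": check_schedule,
    "query_knowledge_base": query_knowledge_base,
}


def handle_tool_call(name: str, arguments_json: str) -> str:
    """Run the requested tool and return its result as a JSON string.

    Assumes the model's tool call has already been reduced to a tool name
    and a JSON-encoded argument object (an assumption, not a documented format).
    """
    if name not in TOOLS:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        kwargs = json.loads(arguments_json or "{}")
        return json.dumps(TOOLS[name](**kwargs))
    except (json.JSONDecodeError, TypeError) as exc:
        # Malformed JSON or arguments that do not match the tool signature.
        return json.dumps({"error": str(exc)})


print(handle_tool_call("check_schedule", '{"date": "2025-01-01"}'))
```

Returning errors as JSON instead of raising lets the application pass the failure back to the model as a tool result, so the conversation can continue.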
Web search
Allows the model to retrieve real-time information to answer questions about current events, stock prices, and weather. Qwen3.5-Omni-Plus and -Flash (in WebSocket and HTTP modes) support web search. The model autonomously decides whether to search.
Not supported by Qwen3-Omni-Flash or the Livetranslate model.
Thinking mode
Use Qwen3-Omni (HTTP mode) when answer quality outweighs latency. The model performs step-by-step reasoning before replying, ideal for video analysis and batch labeling.
Voice generation is not supported in thinking mode.
Speech translation
All three model series support speech translation:
Qwen3-Livetranslate: Supports 17 languages and 5 Chinese dialects with a latency of approximately 3 seconds. In file mode, it uses video input to provide more accurate context-aware translations. For 7 of these languages, the output is text-only (no speech).
Qwen3.5-Omni: Supports 29 output languages and 8 Chinese dialects. Offers audio and video understanding and web search. Use a system prompt to inject terminology and domain context. Supports both real-time and file modes.
Qwen3-Omni-Flash: Supports 11 output languages and 8 Chinese dialects. Use a system prompt to inject terminology and domain context. Supports both real-time and file modes. More cost-effective.
We recommend Qwen3-Livetranslate for quickly building translation applications, Qwen3.5-Omni for the highest quality and broadest language coverage, and Qwen3-Omni-Flash for cost-sensitive scenarios.
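The recommendation above can be written down as a simple lookup. This is a minimal sketch: the `Priority` enum and the function name are hypothetical, and the values are model series names from this page, not callable model IDs.

```python
from enum import Enum


class Priority(Enum):
    """Hypothetical labels for the three scenarios named above."""
    FAST_SETUP = "fast_setup"
    QUALITY_AND_COVERAGE = "quality_and_coverage"
    LOW_COST = "low_cost"


# Series names as recommended on this page.
RECOMMENDED_SERIES = {
    Priority.FAST_SETUP: "Qwen3-Livetranslate",
    Priority.QUALITY_AND_COVERAGE: "Qwen3.5-Omni",
    Priority.LOW_COST: "Qwen3-Omni-Flash",
}


def recommend_translation_series(priority: Priority) -> str:
    """Return the model series recommended for a translation priority."""
    return RECOMMENDED_SERIES[priority]


print(recommend_translation_series(Priority.LOW_COST))
```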
Recommended models
Model | API | Input | Function calling | Web search | Thinking mode |
| WebSocket | Text, audio, image | Supported | Supported | -- |
| HTTP | Text, audio, image, video | Supported | Supported | -- |
| WebSocket | Text, audio, image | Supported | Supported | -- |
| HTTP | Text, audio, image, video | Supported | Supported | -- |
| WebSocket | Text, audio, image, video | -- | -- | -- |
| HTTP | Text, audio, image, video | Supported | -- | Supported |
| WebSocket | Audio, image | -- | -- | -- |
| HTTP | Audio, video | -- | -- | -- |
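The feature support described in the sections above can be condensed into a lookup keyed by model series and API mode. This is a minimal sketch: the feature labels, the `CAPABILITIES` table layout, and the `candidates` helper are hypothetical, and the entries use series names (as in the feature sections) rather than specific model IDs.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

# (series, api) -> supported features, following the feature sections above.
CAPABILITIES: Dict[Tuple[str, str], FrozenSet[str]] = {
    ("Qwen3.5-Omni", "websocket"): frozenset({"function_calling", "web_search"}),
    ("Qwen3.5-Omni", "http"): frozenset({"function_calling", "web_search"}),
    ("Qwen3-Omni-Flash", "websocket"): frozenset(),
    ("Qwen3-Omni-Flash", "http"): frozenset({"function_calling", "thinking"}),
    ("Qwen3-Livetranslate", "websocket"): frozenset(),
    ("Qwen3-Livetranslate", "http"): frozenset(),
}


def candidates(required: Set[str]) -> List[Tuple[str, str]]:
    """Return every (series, api) pair that supports all required features."""
    return sorted(key for key, caps in CAPABILITIES.items() if required <= caps)


# Which options support both function calling and web search?
print(candidates({"function_calling", "web_search"}))
# Which options support thinking mode?
print(candidates({"thinking"}))
```

Filtering on a set of required features makes conflicts visible early: for example, no entry supports both web search and thinking mode, so that query returns an empty list.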
All models
Qwen3.5-Omni
The following models are available in international and Chinese mainland service regions.
Model | API | Input | Function calling | Web search | Thinking mode |
| WebSocket | Text, audio, image, video | Supported | Supported | -- |
| WebSocket | Text, audio, image, video | Supported | Supported | -- |
| HTTP | Text, audio, image, video | Supported | Supported | -- |
| HTTP | Text, audio, image, video | Supported | Supported | -- |
| WebSocket | Text, audio, image, video | Supported | Supported | -- |
| WebSocket | Text, audio, image, video | Supported | Supported | -- |
| HTTP | Text, audio, image, video | Supported | Supported | -- |
| HTTP | Text, audio, image, video | Supported | Supported | -- |
Qwen3-Omni
The following models are available in international and Chinese mainland service regions.
Model | API | Input | Function calling | Web search | Thinking mode |
| WebSocket | Text, audio, image, video | -- | -- | -- |
| WebSocket | Text, audio, image, video | -- | -- | -- |
| WebSocket | Text, audio, image, video | -- | -- | -- |
| HTTP | Text, audio, image, video | Supported | -- | Supported |
| HTTP | Text, audio, image, video | Supported | -- | Supported |
| HTTP | Text, audio, image, video | Supported | -- | Supported |
Qwen3-Livetranslate
The following models are available in international and Chinese mainland service regions.
Model | API | Input | Languages |
| WebSocket | Audio | 18 |
| WebSocket | Audio | 18 |
| HTTP | Audio, video | 18 |
| HTTP | Audio, video | 18 |
Legacy models
These models are no longer updated. For new projects, use Qwen3.5-Omni.
Model | Input | API |
| Text, audio, image, video | HTTP |
| Text, audio, image, video | HTTP |
| Text, audio, image, video | HTTP |
| Text, audio, image, video | HTTP |
| Text, audio | WebSocket |
| Text, audio | WebSocket |
| Text, audio | WebSocket |