Alibaba Cloud Model Studio: Speech-to-speech models

Last Updated: Apr 23, 2026

Select a model for use cases like conversational speech and speech translation.

S2S (speech-to-speech) vs. pipeline

There are two approaches to building voice applications:

| Aspect | S2S | Pipeline (ASR + LLM + TTS) |
| --- | --- | --- |
| Latency | Low: single-model stream processing | Higher: three-stage serial processing |
| Audio understanding | End-to-end: perceives tone and emotion and responds accordingly | Converts speech to text before processing, losing subtle audio cues |
| Voice customization | Selection of preset voices via a system prompt | Voice cloning and voice design (CosyVoice) |

  • Use S2S for low latency, audio-aware responses, and interactive conversation.

  • Use a pipeline when you need to customize voices or select best-in-class ASR, LLM, and TTS models for each stage.

Real-time or file mode?

  • Real-time (WebSocket): For real-time voice interaction such as voice assistants, call centers, and simultaneous interpretation. Supports streaming audio input and speech output. Model names contain -realtime.

  • File mode (HTTP): Trades higher latency for better results; ideal for video dubbing, podcast translation, and offline content processing. Supports function calling (Qwen3.5-Omni, Qwen3-Omni-Flash), web search (Qwen3.5-Omni), and thinking mode (Qwen3-Omni-Flash). Qwen3-Livetranslate is also available in file mode.
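The naming rule above can be turned into a small helper. This is a minimal sketch, assuming that every real-time model name contains "-realtime" as stated; the function name api_mode is hypothetical and not part of any SDK.

```python
def api_mode(model_name: str) -> str:
    """Return the transport implied by a model name.

    Assumption: real-time models contain "-realtime" in their name and
    use WebSocket; every other model is a file-mode model served over HTTP.
    """
    return "WebSocket" if "-realtime" in model_name else "HTTP"


for name in ("qwen3.5-omni-plus-realtime", "qwen3-livetranslate-flash"):
    print(name, "->", api_mode(name))
```

Dated snapshots such as qwen3-omni-flash-realtime-2025-09-15 are handled the same way, because the check looks for the substring rather than a suffix.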

Function calling

Allows the model to perform actions (such as querying a knowledge base, checking a schedule, or triggering a workflow) based on audio and visual input. Supported by Qwen3.5-Omni (in WebSocket and HTTP modes) and Qwen3-Omni-Flash (in HTTP mode only).

Not supported by the Qwen3-Omni-Flash real-time models or the Livetranslate models.

Web search

Allows the model to retrieve real-time information to answer questions about current events, stock prices, and weather. Qwen3.5-Omni-Plus and -Flash (in WebSocket and HTTP modes) support web search. The model autonomously decides whether to search.

Not supported by Qwen3-Omni-Flash or the Livetranslate models.

Thinking mode

Use Qwen3-Omni-Flash (HTTP mode) when answer quality outweighs latency. The model reasons step by step before replying, which suits video analysis and batch labeling.

Voice generation is not supported in thinking mode.
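Because thinking mode and voice generation are mutually exclusive, a client may want to check the combination before sending a request. The following is a minimal sketch: the function name and the modality labels "text" and "audio" are illustrative assumptions, not confirmed API parameter values.

```python
from typing import List


def output_modalities(thinking: bool, want_speech: bool) -> List[str]:
    """Pick output modalities, rejecting speech output in thinking mode.

    The labels "text" and "audio" are placeholders for illustration only.
    """
    if thinking and want_speech:
        raise ValueError(
            "Thinking mode returns text only; disable thinking or drop speech output."
        )
    return ["text", "audio"] if want_speech else ["text"]


print(output_modalities(thinking=True, want_speech=False))
print(output_modalities(thinking=False, want_speech=True))
```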

Speech translation

All three model series support speech translation:

  • Qwen3-Livetranslate: Supports 17 languages and 5 Chinese dialects with a latency of approximately 3 seconds. In file mode, it uses video input to provide more accurate context-aware translations. For 7 of these languages, the output is text-only (no speech).

  • Qwen3.5-Omni: Supports 29 output languages and 8 Chinese dialects. Offers audio and video understanding and web search. Use a system prompt to inject terminology and domain context. Supports both real-time and file modes.

  • Qwen3-Omni-Flash: Supports 11 output languages and 8 Chinese dialects. Use a system prompt to inject terminology and domain context. Supports both real-time and file modes. More cost-effective.

Note

We recommend Qwen3-Livetranslate for quickly building translation applications, Qwen3.5-Omni for the highest quality and broadest language coverage, and Qwen3-Omni-Flash for cost-sensitive scenarios.

Supported languages

| Language | Qwen3-Livetranslate | Qwen3.5-Omni | Qwen3-Omni-Flash |
| --- | --- | --- | --- |
| English | Supported | Supported | Supported |
| Chinese (Mandarin) | Supported | Supported | Supported |
| Cantonese | Supported | Supported | Supported |
| Sichuan dialect | Supported | Supported | Supported |
| Shanghainese | Supported | Supported | Supported |
| Beijing dialect | Supported | Supported | Supported |
| Tianjin dialect | Supported | Supported | Supported |
| Nanjing dialect | -- | Supported | Supported |
| Shaanxi dialect | -- | Supported | Supported |
| Minnan dialect | -- | Supported | Supported |
| French | Supported | Supported | Supported |
| German | Supported | Supported | Supported |
| Russian | Supported | Supported | Supported |
| Italian | Supported | Supported | Supported |
| Spanish | Supported | Supported | Supported |
| Portuguese | Supported | Supported | Supported |
| Japanese | Supported | Supported | Supported |
| Korean | Supported | Supported | Supported |
| Thai | Text-only | Supported | Supported |
| Indonesian | Text-only | Supported | -- |
| Vietnamese | Text-only | Supported | -- |
| Arabic | Text-only | Supported | -- |
| Hindi | Text-only | Supported | -- |
| Turkish | Text-only | Supported | -- |
| Finnish | -- | Supported | -- |
| Polish | -- | Supported | -- |
| Dutch | -- | Supported | -- |
| Czech | -- | Supported | -- |
| Urdu | -- | Supported | -- |
| Tagalog | -- | Supported | -- |
| Swedish | -- | Supported | -- |
| Danish | -- | Supported | -- |
| Hebrew | -- | Supported | -- |
| Icelandic | -- | Supported | -- |
| Malay | -- | Supported | -- |
| Norwegian | -- | Supported | -- |
| Persian | -- | Supported | -- |
| Greek | Text-only | -- | -- |

"Supported" means the model provides both speech and text output. "Text-only" means the model provides text output but no speech.

Qwen3.5-Omni supports 113 input languages and dialects.

The legacy qwen-omni-turbo model supports only Chinese and English.
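The language table above can be encoded as a lookup to find which model series cover a target output language. This is a sketch only: the data is transcribed from the table on this page, while the names SERIES, SUPPORT, and series_for are hypothetical helpers, not part of any SDK.

```python
SERIES = ("Qwen3-Livetranslate", "Qwen3.5-Omni", "Qwen3-Omni-Flash")


def _rows(flags, names):
    # One flag per series, in SERIES order:
    # S = speech and text, T = text-only, - = not supported.
    return {name: dict(zip(SERIES, flags)) for name in names}


SUPPORT = {
    **_rows("SSS", ["English", "Chinese (Mandarin)", "Cantonese",
                    "Sichuan dialect", "Shanghainese", "Beijing dialect",
                    "Tianjin dialect", "French", "German", "Russian",
                    "Italian", "Spanish", "Portuguese", "Japanese", "Korean"]),
    **_rows("-SS", ["Nanjing dialect", "Shaanxi dialect", "Minnan dialect"]),
    **_rows("TSS", ["Thai"]),
    **_rows("TS-", ["Indonesian", "Vietnamese", "Arabic", "Hindi", "Turkish"]),
    **_rows("-S-", ["Finnish", "Polish", "Dutch", "Czech", "Urdu", "Tagalog",
                    "Swedish", "Danish", "Hebrew", "Icelandic", "Malay",
                    "Norwegian", "Persian"]),
    **_rows("T--", ["Greek"]),
}


def series_for(language, need_speech=True):
    """List the series that can output `language`.

    With need_speech=True only speech-capable series are returned;
    otherwise text-only support also counts.
    """
    wanted = {"S"} if need_speech else {"S", "T"}
    flags = SUPPORT.get(language, {})
    return [s for s in SERIES if flags.get(s) in wanted]


print(series_for("Thai"))
print(series_for("Greek", need_speech=False))
```

An unknown language name returns an empty list rather than raising, which keeps the helper safe to call with user input.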

Recommended models

| Model | API | Input | Function calling | Web search | Thinking mode |
| --- | --- | --- | --- | --- | --- |
| qwen3.5-omni-plus-realtime | WebSocket | Text, audio, image | Supported | Supported | -- |
| qwen3.5-omni-plus | HTTP | Text, audio, image, video | Supported | Supported | -- |
| qwen3.5-omni-flash-realtime | WebSocket | Text, audio, image | Supported | Supported | -- |
| qwen3.5-omni-flash | HTTP | Text, audio, image, video | Supported | Supported | -- |
| qwen3-omni-flash-realtime | WebSocket | Text, audio, image, video | -- | -- | -- |
| qwen3-omni-flash | HTTP | Text, audio, image, video | Supported | -- | Supported |
| qwen3-livetranslate-flash-realtime | WebSocket | Audio, image | -- | -- | -- |
| qwen3-livetranslate-flash | HTTP | Audio, video | -- | -- | -- |
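The recommended-models table can be read as a capability matrix and filtered by requirement. The sketch below transcribes the API, function calling, web search, and thinking mode columns from the table; the Model class and the candidates function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    api: str
    function_calling: bool
    web_search: bool
    thinking: bool


# Values copied from the recommended-models table on this page.
RECOMMENDED = [
    Model("qwen3.5-omni-plus-realtime", "WebSocket", True, True, False),
    Model("qwen3.5-omni-plus", "HTTP", True, True, False),
    Model("qwen3.5-omni-flash-realtime", "WebSocket", True, True, False),
    Model("qwen3.5-omni-flash", "HTTP", True, True, False),
    Model("qwen3-omni-flash-realtime", "WebSocket", False, False, False),
    Model("qwen3-omni-flash", "HTTP", True, False, True),
    Model("qwen3-livetranslate-flash-realtime", "WebSocket", False, False, False),
    Model("qwen3-livetranslate-flash", "HTTP", False, False, False),
]


def candidates(api=None, function_calling=False, web_search=False, thinking=False):
    """Return model names that satisfy every requested capability.

    A capability left as False means "not required", not "must be absent".
    """
    return [
        m.name
        for m in RECOMMENDED
        if (api is None or m.api == api)
        and (not function_calling or m.function_calling)
        and (not web_search or m.web_search)
        and (not thinking or m.thinking)
    ]


print(candidates(api="WebSocket", function_calling=True))
print(candidates(thinking=True))
```

Requesting web search together with thinking mode returns an empty list, which matches the table: no recommended model offers both.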

All models

Qwen3.5-Omni

The following models are available in international and Chinese mainland service regions.

| Model | API | Input | Function calling | Web search | Thinking mode |
| --- | --- | --- | --- | --- | --- |
| qwen3.5-omni-plus-realtime | WebSocket | Text, audio, image, video | Supported | Supported | -- |
| qwen3.5-omni-plus-realtime-2026-03-15 | WebSocket | Text, audio, image, video | Supported | Supported | -- |
| qwen3.5-omni-plus | HTTP | Text, audio, image, video | Supported | Supported | -- |
| qwen3.5-omni-plus-2026-03-15 | HTTP | Text, audio, image, video | Supported | Supported | -- |
| qwen3.5-omni-flash-realtime | WebSocket | Text, audio, image, video | Supported | Supported | -- |
| qwen3.5-omni-flash-realtime-2026-03-15 | WebSocket | Text, audio, image, video | Supported | Supported | -- |
| qwen3.5-omni-flash | HTTP | Text, audio, image, video | Supported | Supported | -- |
| qwen3.5-omni-flash-2026-03-15 | HTTP | Text, audio, image, video | Supported | Supported | -- |

Qwen3-Omni

The following models are available in international and Chinese mainland service regions.

| Model | API | Input | Function calling | Web search | Thinking mode |
| --- | --- | --- | --- | --- | --- |
| qwen3-omni-flash-realtime | WebSocket | Text, audio, image, video | -- | -- | -- |
| qwen3-omni-flash-realtime-2025-12-01 | WebSocket | Text, audio, image, video | -- | -- | -- |
| qwen3-omni-flash-realtime-2025-09-15 | WebSocket | Text, audio, image, video | -- | -- | -- |
| qwen3-omni-flash | HTTP | Text, audio, image, video | Supported | -- | Supported |
| qwen3-omni-flash-2025-12-01 | HTTP | Text, audio, image, video | Supported | -- | Supported |
| qwen3-omni-flash-2025-09-15 | HTTP | Text, audio, image, video | Supported | -- | Supported |

Qwen3-Livetranslate

The following models are available in international and Chinese mainland service regions.

| Model | API | Input | Languages |
| --- | --- | --- | --- |
| qwen3-livetranslate-flash-realtime | WebSocket | Audio | 18 |
| qwen3-livetranslate-flash-realtime-2025-09-22 | WebSocket | Audio | 18 |
| qwen3-livetranslate-flash | HTTP | Audio, video | 18 |
| qwen3-livetranslate-flash-2025-12-01 | HTTP | Audio, video | 18 |

Legacy models

These models are no longer updated. For new projects, use Qwen3.5-Omni.

| Model | Input | API |
| --- | --- | --- |
| qwen2.5-omni-7b | Text, audio, image, video | HTTP |
| qwen-omni-turbo | Text, audio, image, video | HTTP |
| qwen-omni-turbo-latest | Text, audio, image, video | HTTP |
| qwen-omni-turbo-2025-03-26 | Text, audio, image, video | HTTP |
| qwen-omni-turbo-realtime | Text, audio | WebSocket |
| qwen-omni-turbo-realtime-latest | Text, audio | WebSocket |
| qwen-omni-turbo-realtime-2025-05-08 | Text, audio | WebSocket |