Speech-to-speech models

Choose a model for voice conversation, speech translation, or simultaneous interpretation.

Migrate from closed-source models

Map your current OpenAI Realtime or Gemini Live setup to an equivalent Bailian model.

Use case	Closed-source examples	Bailian recommendation
Real-time conversation	OpenAI GPT Realtime, Gemini 3.1 Live	`qwen3.5-omni-plus-realtime`
Cost-sensitive conversation	OpenAI gpt-4o-mini Realtime	`qwen3.5-omni-flash-realtime`
Real-time translation	Gemini 3.1 Live	`qwen3.5-livetranslate-flash-realtime`
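The mapping above can be kept in code as a plain lookup so a migration script fails loudly on an unmapped setup. This is a minimal sketch: the dictionary keys and the `bailian_equivalent` helper are hypothetical labels chosen for illustration, and only the model names on the right come from the table.

```python
# Keys are hypothetical labels for the closed-source setups in the table;
# values are the Bailian model names listed in the right-hand column.
MIGRATION_MAP = {
    "openai-gpt-realtime": "qwen3.5-omni-plus-realtime",
    "gemini-live-conversation": "qwen3.5-omni-plus-realtime",
    "openai-gpt-4o-mini-realtime": "qwen3.5-omni-flash-realtime",
    "gemini-live-translation": "qwen3.5-livetranslate-flash-realtime",
}


def bailian_equivalent(current_setup: str) -> str:
    """Return the recommended Bailian model for a known closed-source setup."""
    try:
        return MIGRATION_MAP[current_setup]
    except KeyError:
        known = ", ".join(sorted(MIGRATION_MAP))
        raise ValueError(f"No mapping for {current_setup!r}; known setups: {known}")


print(bailian_equivalent("openai-gpt-4o-mini-realtime"))
```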

This page covers speech-to-speech. For visual understanding, audio/video analysis, or content moderation, see the Omni-modal documentation.

S2S (speech-to-speech) vs. pipeline

There are two approaches to building voice applications:

Aspect	S2S	Pipeline (ASR + LLM + TTS)
Latency	Low -- single-model stream processing	Higher -- three-stage serial processing
Audio understanding	End-to-end -- perceives tone and emotion and responds accordingly	Converts to text before processing, losing subtle audio cues
Voice customization	Selection of preset voices via a system prompt	Voice cloning and voice design (CosyVoice)

Use S2S for low latency, audio-aware responses, and interactive conversation.
Use a pipeline when you need voice customization or want to select ASR, LLM, and TTS models independently.
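The two rules above reduce to a small decision function. The sketch below is only an illustration of that reasoning; the `Requirements` fields and the `choose_approach` name are hypothetical and not part of any SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirements:
    # Hypothetical flags describing what the application needs.
    voice_customization: bool = False      # voice cloning or voice design
    independent_components: bool = False   # pick ASR, LLM, and TTS separately


def choose_approach(req: Requirements) -> str:
    """Return "pipeline" when either pipeline-only need applies, else "s2s"."""
    if req.voice_customization or req.independent_components:
        return "pipeline"
    # Default: low latency, audio-aware responses, interactive conversation.
    return "s2s"


print(choose_approach(Requirements(voice_customization=True)))
print(choose_approach(Requirements()))
```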

This page covers the S2S single-model approach (Omni and Livetranslate series). For the pipeline approach, select each component separately:

ASR (speech recognition): Speech-to-text
LLM (large language model): Text generation
TTS (text-to-speech): Speech synthesis

Real-time or file mode?

Real-time (WebSocket): Streams audio input and speech output. Ideal for voice assistants, call centers, and simultaneous interpretation. Model names contain `-realtime`.
File mode (HTTP): Higher latency but better quality. Ideal for video dubbing, podcast translation, and offline processing. Also supports function calling, web search, thinking mode, and video context (see Companion capabilities below).
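Because real-time model names contain `-realtime`, the transport can be inferred from the name alone. The helper below is a minimal sketch of that naming rule; `api_for_model` is a hypothetical name, and the rule is assumed to hold only for the models listed on this page.

```python
def api_for_model(model: str) -> str:
    """Infer the API type from the naming convention used on this page."""
    # Real-time models carry "-realtime" in the name; all others use file mode.
    return "WebSocket" if "-realtime" in model else "HTTP"


for name in ("qwen3.5-omni-plus-realtime", "qwen3-livetranslate-flash"):
    print(name, "->", api_for_model(name))
```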

Choose a model by use case (S2S single-model approach)

All use cases below use the S2S single-model approach. For the pipeline approach, use the ASR, LLM, and TTS guides linked above.

Use case	Recommended model	API
Voice assistants and customer-service conversations	`qwen3.5-omni-plus-realtime`	WebSocket
Cost-sensitive conversations	`qwen3.5-omni-flash-realtime`	WebSocket
Simultaneous interpretation and live translation	`qwen3.5-livetranslate-flash-realtime`	WebSocket
Video dubbing and podcast translation	`qwen3-livetranslate-flash`	HTTP
Video analysis and batch labeling (requires thinking mode)	`qwen3-omni-flash`	HTTP
Semantic VAD voice assistants and smart customer service (with function calling support)	`qwen-audio-3.0-realtime-plus`	WebSocket
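The use-case table can also be queried programmatically, for example to list every recommendation that mentions a keyword and optionally restrict it to one API type. This is a sketch only: the rows are transcribed from the table above, and the `recommend` helper is a hypothetical name.

```python
from typing import List, Optional, Tuple

# (use case, recommended model, API), transcribed from the table above.
USE_CASES = [
    ("Voice assistants and customer-service conversations", "qwen3.5-omni-plus-realtime", "WebSocket"),
    ("Cost-sensitive conversations", "qwen3.5-omni-flash-realtime", "WebSocket"),
    ("Simultaneous interpretation and live translation", "qwen3.5-livetranslate-flash-realtime", "WebSocket"),
    ("Video dubbing and podcast translation", "qwen3-livetranslate-flash", "HTTP"),
    ("Video analysis and batch labeling", "qwen3-omni-flash", "HTTP"),
    ("Semantic VAD voice assistants and smart customer service", "qwen-audio-3.0-realtime-plus", "WebSocket"),
]


def recommend(keyword: str, api: Optional[str] = None) -> List[Tuple[str, str]]:
    """Return (model, API) pairs whose use case mentions the keyword."""
    needle = keyword.lower()
    return [
        (model, model_api)
        for use_case, model, model_api in USE_CASES
        if needle in use_case.lower() and (api is None or model_api == api)
    ]


print(recommend("translation"))
print(recommend("translation", api="HTTP"))
```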

Companion capabilities of the S2S single-model approach

Qwen3.5-Omni and Qwen3-Omni provide these capabilities natively. With a pipeline, equivalent functionality comes from individual components (typically the LLM).

Function calling

The model can query knowledge bases, check schedules, or trigger workflows based on what it hears and sees. Use Qwen3.5-Omni (WebSocket or HTTP) or Qwen3-Omni (HTTP only).

Not supported by Qwen3-Omni real-time (WebSocket) models or Livetranslate models. Qwen-Audio Realtime (WebSocket) supports function calling.

Web search

Retrieves real-time information for current events, stock prices, weather, and similar queries. Available in Qwen3.5-Omni (WebSocket or HTTP, both Plus and Flash). The model decides autonomously whether to search.

Not supported by Qwen3-Omni-Flash or Livetranslate models.

Thinking mode

Use Qwen3-Omni (HTTP) when answer quality outweighs latency. The model reasons step by step before replying, which makes it well suited to video analysis and batch labeling.

Voice generation is not supported in thinking mode.
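A client can check a planned feature combination before sending a request. The sketch below encodes a subset of the capability columns from the model tables on this page, plus the rule that voice generation is unavailable in thinking mode. The `Capabilities` class and `check_request` function are hypothetical helpers, not part of any SDK.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Capabilities:
    function_calling: bool
    web_search: bool
    thinking: bool


# Subset of the model tables on this page, transcribed for illustration.
CAPABILITIES = {
    "qwen3.5-omni-plus": Capabilities(True, True, False),
    "qwen3-omni-flash": Capabilities(True, False, True),
    "qwen3-omni-flash-realtime": Capabilities(False, False, False),
    "qwen3.5-livetranslate-flash-realtime": Capabilities(False, False, False),
    "qwen-audio-3.0-realtime-plus": Capabilities(True, False, False),
}


def check_request(model: str, *, function_calling: bool = False,
                  web_search: bool = False, thinking: bool = False,
                  audio_output: bool = False) -> List[str]:
    """Return a list of problems; an empty list means the combination looks valid."""
    caps = CAPABILITIES[model]
    problems = []
    if function_calling and not caps.function_calling:
        problems.append(f"{model} does not support function calling")
    if web_search and not caps.web_search:
        problems.append(f"{model} does not support web search")
    if thinking and not caps.thinking:
        problems.append(f"{model} does not support thinking mode")
    if thinking and audio_output:
        problems.append("voice generation is not supported in thinking mode")
    return problems


print(check_request("qwen3-omni-flash", thinking=True, audio_output=True))
```

Returning a list rather than raising on the first problem lets the caller report every conflict at once.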

Speech translation

The following model series support speech translation:

Qwen3.5-Livetranslate: 60 languages (29 with audio+text output, 31 text-only). Covers Chinese, English, French, German, Russian, Japanese, Korean, Spanish, Portuguese, Arabic, and more.
Qwen3-Livetranslate: 18 languages and 5 Chinese dialects (~3 s latency). File mode accepts video input for context-aware translations. 7 languages produce text-only output.
Qwen3.5-Omni: 29 output languages and 8 Chinese dialects. Strong audio/video understanding and web search. Inject terminology and domain context via system prompt. Real-time and file modes.
Qwen3-Omni-Flash: 11 output languages and 8 Chinese dialects. Inject terminology and domain context via system prompt. Real-time and file modes, at lower cost.

Note

Quick start: Livetranslate series. Best quality and language coverage: Qwen3.5-Omni. Cost-sensitive: Qwen3-Omni-Flash.

Supported languages

Language	Qwen3.5-Livetranslate	Qwen3-Livetranslate	Qwen3.5-Omni	Qwen3-Omni-Flash
English	Supported	Supported	Supported	Supported
Chinese (Mandarin)	Supported	Supported	Supported	Supported
Cantonese	Text-only	Supported	Supported	Supported
Sichuan dialect	Supported	Supported	Supported	Supported
Shanghainese	Supported	Supported	Supported	Supported
Beijing dialect	Supported	Supported	Supported	Supported
Tianjin dialect	Supported	Supported	Supported	Supported
Nanjing dialect	--	--	Supported	Supported
Shaanxi dialect	--	--	Supported	Supported
Minnan dialect	--	--	Supported	Supported
French	Supported	Supported	Supported	Supported
German	Supported	Supported	Supported	Supported
Russian	Supported	Supported	Supported	Supported
Italian	Supported	Supported	Supported	Supported
Spanish	Supported	Supported	Supported	Supported
Portuguese	Supported	Supported	Supported	Supported
Japanese	Supported	Supported	Supported	Supported
Korean	Supported	Supported	Supported	Supported
Thai	Supported	Text-only	Supported	Supported
Indonesian	Supported	Text-only	Supported	--
Vietnamese	Supported	Text-only	Supported	--
Arabic	Supported	Text-only	Supported	--
Hindi	Supported	Text-only	Supported	--
Turkish	Supported	Text-only	Supported	--
Finnish	Supported	--	Supported	--
Polish	Supported	--	Supported	--
Dutch	Supported	--	Supported	--
Czech	Supported	--	Supported	--
Urdu	Supported	--	Supported	--
Tagalog	Supported	--	Supported	--
Swedish	Supported	--	Supported	--
Danish	Supported	--	Supported	--
Hebrew	Supported	--	Supported	--
Icelandic	Supported	--	Supported	--
Malay	Supported	--	Supported	--
Norwegian	Supported	--	Supported	--
Persian	Supported	--	Supported	--
Greek	Text-only	Text-only	--	--
Afrikaans	Text-only	--	--	--
Asturian	Text-only	--	--	--
Belarusian	Text-only	--	--	--
Bulgarian	Text-only	--	--	--
Bengali	Text-only	--	--	--
Bosnian	Text-only	--	--	--
Catalan	Text-only	--	--	--
Cebuano	Text-only	--	--	--
Estonian	Text-only	--	--	--
Galician	Text-only	--	--	--
Gujarati	Text-only	--	--	--
Croatian	Text-only	--	--	--
Hungarian	Text-only	--	--	--
Javanese	Text-only	--	--	--
Kazakh	Text-only	--	--	--
Kannada	Text-only	--	--	--
Kyrgyz	Text-only	--	--	--
Latvian	Text-only	--	--	--
Macedonian	Text-only	--	--	--
Malayalam	Text-only	--	--	--
Marathi	Text-only	--	--	--
Punjabi	Text-only	--	--	--
Romanian	Text-only	--	--	--
Slovak	Text-only	--	--	--
Slovenian	Text-only	--	--	--
Swahili	Text-only	--	--	--
Tajik	Text-only	--	--	--
Azerbaijani	Text-only	--	--	--
Ukrainian	Text-only	--	--	--

"Supported" = speech + text output. "Text-only" = text output only, no speech.

Qwen3.5-Omni supports 113 input languages and dialects.

Qwen3.5-Livetranslate supports 60 languages (29 with audio and text, 31 text only).

The legacy `qwen-omni-turbo` model supports only Chinese and English.
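To pick a series for a target language, the table can be read as a lookup from language to output mode per series. The sketch below transcribes four rows as an excerpt. The series keys and the `series_with_speech_output` helper are hypothetical names, and a series missing from a row corresponds to "--" in the table.

```python
from typing import Dict, List

SPEECH = "speech+text"   # "Supported" in the table
TEXT = "text-only"       # "Text-only" in the table

# Excerpt of the supported-languages table; absent series means "--".
OUTPUT_SUPPORT: Dict[str, Dict[str, str]] = {
    "Cantonese": {"qwen3.5-livetranslate": TEXT, "qwen3-livetranslate": SPEECH,
                  "qwen3.5-omni": SPEECH, "qwen3-omni-flash": SPEECH},
    "Thai": {"qwen3.5-livetranslate": SPEECH, "qwen3-livetranslate": TEXT,
             "qwen3.5-omni": SPEECH, "qwen3-omni-flash": SPEECH},
    "Arabic": {"qwen3.5-livetranslate": SPEECH, "qwen3-livetranslate": TEXT,
               "qwen3.5-omni": SPEECH},
    "Greek": {"qwen3.5-livetranslate": TEXT, "qwen3-livetranslate": TEXT},
}


def series_with_speech_output(language: str) -> List[str]:
    """Return the series that produce speech (not just text) for a language."""
    row = OUTPUT_SUPPORT.get(language, {})
    return sorted(series for series, mode in row.items() if mode == SPEECH)


print(series_with_speech_output("Arabic"))
print(series_with_speech_output("Greek"))
```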

Recommended models

The table lists the entry-point model in each series. To pin a dated version for regression testing or stability, see All models below.

Model	API	Input	Function calling	Web search	Thinking mode	Translation
`qwen-audio-3.0-realtime-plus`	WebSocket	Audio, text	Supported	--	--	--
`qwen-audio-3.0-realtime-flash`	WebSocket	Audio, text	Supported	--	--	--
`qwen3.5-omni-flash-realtime`	WebSocket	Text, audio, image	Supported	Supported	--	29 languages
`qwen3.5-omni-flash`	HTTP	Text, audio, image, video	Supported	Supported	--	29 languages
`qwen3-omni-flash-realtime`	WebSocket	Text, audio, image, video	--	--	--	11 languages
`qwen3-omni-flash`	HTTP	Text, audio, image, video	Supported	--	Supported	11 languages
`qwen3.5-livetranslate-flash-realtime`	WebSocket	Audio, image	--	--	--	60 languages
`qwen3-livetranslate-flash`	HTTP	Audio, video	--	--	--	18 languages
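Dated model names on this page appear to follow the pattern of an entry-point alias plus a `-YYYY-MM-DD` suffix, for example `qwen3-omni-flash-2025-12-01`. Under that assumption, a regression-test harness can separate the alias from the pinned date. The `split_snapshot` helper below is a hypothetical sketch, not an official utility.

```python
import re
from datetime import date
from typing import Optional, Tuple

# Assumed naming pattern: <alias>-YYYY-MM-DD for pinned versions.
_SNAPSHOT = re.compile(r"^(?P<alias>.+)-(?P<y>\d{4})-(?P<m>\d{2})-(?P<d>\d{2})$")


def split_snapshot(model: str) -> Tuple[str, Optional[date]]:
    """Split a model name into (alias, snapshot date); date is None if unpinned."""
    match = _SNAPSHOT.match(model)
    if match is None:
        return model, None
    snapshot = date(int(match["y"]), int(match["m"]), int(match["d"]))
    return match["alias"], snapshot


print(split_snapshot("qwen3-omni-flash-2025-12-01"))
print(split_snapshot("qwen3.5-omni-plus-realtime"))
```

Logging both parts alongside test results makes it clear whether a run used a pinned version or the moving entry-point name.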

All models

Qwen3.5-Omni

Model	API	Input	Function calling	Web search	Thinking mode
`qwen3.5-omni-plus-realtime`	WebSocket	Text, audio, image, video	Supported	Supported	--
`qwen3.5-omni-plus-realtime-2026-03-15`	WebSocket	Text, audio, image, video	Supported	Supported	--
`qwen3.5-omni-plus`	HTTP	Text, audio, image, video	Supported	Supported	--
`qwen3.5-omni-plus-2026-03-15`	HTTP	Text, audio, image, video	Supported	Supported	--
`qwen3.5-omni-flash-realtime`	WebSocket	Text, audio, image, video	Supported	Supported	--
`qwen3.5-omni-flash-realtime-2026-03-15`	WebSocket	Text, audio, image, video	Supported	Supported	--
`qwen3.5-omni-flash`	HTTP	Text, audio, image, video	Supported	Supported	--
`qwen3.5-omni-flash-2026-03-15`	HTTP	Text, audio, image, video	Supported	Supported	--

Qwen3-Omni

Model	API	Input	Function calling	Web search	Thinking mode
`qwen3-omni-flash-realtime`	WebSocket	Text, audio, image, video	--	--	--
`qwen3-omni-flash-realtime-2025-12-01`	WebSocket	Text, audio, image, video	--	--	--
`qwen3-omni-flash-realtime-2025-09-15`	WebSocket	Text, audio, image, video	--	--	--
`qwen3-omni-flash`	HTTP	Text, audio, image, video	Supported	--	Supported
`qwen3-omni-flash-2025-12-01`	HTTP	Text, audio, image, video	Supported	--	Supported
`qwen3-omni-flash-2025-09-15`	HTTP	Text, audio, image, video	Supported	--	Supported

Qwen3.5-Livetranslate

Model	API	Input	Languages
`qwen3.5-livetranslate-flash-realtime`	WebSocket	Audio	60
`qwen3.5-livetranslate-flash-realtime-2026-05-19`	WebSocket	Audio	60

Qwen3-Livetranslate

Model	API	Input	Languages
`qwen3-livetranslate-flash-realtime`	WebSocket	Audio	18
`qwen3-livetranslate-flash-realtime-2025-09-22`	WebSocket	Audio	18
`qwen3-livetranslate-flash`	HTTP	Audio, video	18
`qwen3-livetranslate-flash-2025-12-01`	HTTP	Audio, video	18

Qwen-Audio

Model	API	Input	Function calling	Web search	Thinking mode	Translation
`qwen-audio-3.0-realtime-plus`	WebSocket	Audio, text	Supported	--	--	--
`qwen-audio-3.0-realtime-flash`	WebSocket	Audio, text	Supported	--	--	--

Legacy models

These models are no longer updated. For new projects, use Qwen3.5-Omni.

Model	Input	API
`qwen2.5-omni-7b`	Text, audio, image, video	HTTP
`qwen-omni-turbo`	Text, audio, image, video	HTTP
`qwen-omni-turbo-latest`	Text, audio, image, video	HTTP
`qwen-omni-turbo-2025-03-26`	Text, audio, image, video	HTTP
`qwen-omni-turbo-realtime`	Text, audio	WebSocket
`qwen-omni-turbo-realtime-latest`	Text, audio	WebSocket
`qwen-omni-turbo-realtime-2025-05-08`	Text, audio	WebSocket

What's next

API documentation by model series:

Qwen3.5-Omni and Qwen3-Omni (WebSocket, real-time): Qwen-Omni-Realtime
Qwen3.5-Omni and Qwen3-Omni (HTTP, file): Non-real-time (Qwen-Omni)
Qwen3.5-Livetranslate (WebSocket, real-time): Real-time audio and video translation - Qwen
Qwen3-Livetranslate (HTTP, file): Audio and video file translation (Qwen)
Qwen-Audio Realtime (WebSocket, real-time voice conversation): Realtime Audio Chat (Qwen-Audio-Realtime)