Speech synthesis models - TTS, voice cloning, voice design - Alibaba Cloud Model Studio

Standard speech synthesis or custom voices?

TTS models convert text to natural-sounding speech. Decide whether built-in voices or custom voices fit your needs:

	Standard speech synthesis	Custom voices
Voice source	Built-in voice library, ready to use	Cloned from an audio sample or created from a text description
Getting started	No extra setup required — select a model and voice to start synthesizing	Provide an audio sample or text description to create a voice
Use cases	Customer service bots, audiobook narration, news broadcasts, e-commerce live streaming	Brand-specific voices, virtual streamers, game character dubbing
Recommended models	`cosyvoice-v3-plus`	`cosyvoice-v3.5-plus` (voice cloning + voice design)

Use standard speech synthesis when built-in voices meet your needs and you want zero-configuration setup.
Use custom voices when you need a brand-exclusive voice, want to replicate a specific speaker, or need to create a new character voice.

Voice cloning or voice design?

Custom voices offer two creation methods:

	Voice cloning	Voice design
Input	An audio sample from the target speaker	A text description of the desired voice (for example, "warm, low-pitched female voice")
Result	Synthesized speech closely resembles the original speaker	A brand-new voice generated from scratch based on the description
Use cases	Reusing a brand spokesperson's or streamer's voice, virtual streamers, personalized voice assistants	Brand voice customization (no recordings available), game or animation character dubbing, creative content production
Recommended models	`cosyvoice-v3.5-plus`, `cosyvoice-v3.5-flash`	`cosyvoice-v3.5-plus`, `cosyvoice-v3.5-flash`
Voice management service	`voice-enrollment` (register and manage voices)	`voice-enrollment` (register and manage voices)

Use voice cloning when you have a recording of the target speaker and want to reproduce that voice.
Use voice design when no recording is available and you want to create a voice from a text description.
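The two decisions above (standard or custom, then cloning or design) can be sketched as one small helper. This is only an illustration of the selection logic: the function name, its parameters, and the returned labels are hypothetical and not part of any SDK.

```python
def choose_voice_approach(builtin_voice_ok: bool, has_recording: bool) -> str:
    """Return 'standard', 'voice_cloning', or 'voice_design'.

    Hypothetical helper that mirrors the guidance in this section:
    built-in voices first, cloning when a recording of the target
    speaker exists, design when only a text description is available.
    """
    if builtin_voice_ok:
        return "standard"
    if has_recording:
        return "voice_cloning"
    return "voice_design"


if __name__ == "__main__":
    # A brand voice is required, but no recordings of a speaker exist.
    print(choose_voice_approach(builtin_voice_ok=False, has_recording=False))
```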

WebSocket or HTTP?

WebSocket: Bidirectional streaming that supports streaming input and output. Audio is returned as it is synthesized, providing the lowest latency. Best for real-time scenarios: customer service bots, voice assistants, and call centers.
HTTP: Accepts full text input with streaming audio output delivered in segments. Best for audiobook narration, content generation, and offline production.

CosyVoice models share one model name for both WebSocket and HTTP. Qwen model IDs that include -realtime (for example, qwen3-tts-flash-realtime, or the dated qwen3-tts-flash-realtime-2025-11-27) use WebSocket; IDs without it use HTTP.
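The naming rule for Qwen models can be checked programmatically. The sketch below assumes the rule holds for every Qwen ID listed on this page; the function name is hypothetical. Because dated snapshots place the date after the marker, it looks for a "realtime" segment anywhere in the ID rather than only at the end.

```python
from typing import Optional


def transport_from_model_id(model_id: str) -> Optional[str]:
    """Infer the API transport from a Qwen TTS model ID.

    Returns None for CosyVoice IDs, because one CosyVoice model name
    covers both transports and the name alone does not decide it.
    """
    if model_id.startswith("cosyvoice-"):
        return None
    if not model_id.startswith("qwen"):
        raise ValueError(f"unrecognized model family: {model_id}")
    # Split on hyphens so "realtime" is matched as a whole segment.
    return "WebSocket" if "realtime" in model_id.split("-") else "HTTP"


if __name__ == "__main__":
    for mid in ("qwen3-tts-flash", "qwen3-tts-flash-realtime-2025-11-27"):
        print(mid, "->", transport_from_model_id(mid))
```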

CosyVoice and Qwen WebSocket models can be accessed through the DashScope SDK (Java, Python). Other models require direct calls using the corresponding WebSocket or HTTP protocol.

For WebSocket access, see Real-time speech synthesis. For HTTP access, see Non-real-time speech synthesis.

Instruction control

Use natural-language instructions to control speech rate, emotion, and style per request — for example, "speak gently at a slightly slower pace" or "use an excited broadcast style." Ideal for emotionally expressive content, professional broadcasts, and audiobook narration.

Supported models: CosyVoice (cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-flash) and Qwen3-TTS (qwen3-tts-instruct-flash-realtime, qwen3-tts-instruct-flash). For details, see Real-time speech synthesis > Instruction control.

Recommended models

The following table lists the recommended models and the capabilities each one supports. Visit the Model Gallery for a full catalog.

Model ID	Series	API	Voice cloning	Voice design	Instruction control
`cosyvoice-v3.5-plus`	CosyVoice	WebSocket	Supported	Supported	Supported
`cosyvoice-v3-plus`	CosyVoice	WebSocket	Supported	Supported	Unsupported

All models

CosyVoice

Some CosyVoice models support SSML markup and reading LaTeX formulas aloud.

Model ID	API	Voice cloning	Voice design	Instruction control
`cosyvoice-v3.5-plus`	WebSocket	Supported	Supported	Supported
`cosyvoice-v3.5-flash`	WebSocket	Supported	Supported	Supported
`cosyvoice-v3-plus`	WebSocket	Supported	Supported	Unsupported
`cosyvoice-v3-flash`	WebSocket	Supported	Supported	Supported
`cosyvoice-v2`	WebSocket	Supported	Unsupported	Unsupported
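When choosing among CosyVoice models in code, the capability table can be held as data and filtered. The following is a minimal sketch that transcribes the table above; the names Capabilities, COSYVOICE, and models_with are illustrative and not part of the DashScope SDK.

```python
from typing import NamedTuple


class Capabilities(NamedTuple):
    voice_cloning: bool
    voice_design: bool
    instruction_control: bool


# Transcribed from the CosyVoice table on this page.
COSYVOICE = {
    "cosyvoice-v3.5-plus": Capabilities(True, True, True),
    "cosyvoice-v3.5-flash": Capabilities(True, True, True),
    "cosyvoice-v3-plus": Capabilities(True, True, False),
    "cosyvoice-v3-flash": Capabilities(True, True, True),
    "cosyvoice-v2": Capabilities(True, False, False),
}


def models_with(**required: bool) -> list[str]:
    """Return model IDs whose capabilities match every keyword given."""
    return [
        model_id
        for model_id, caps in COSYVOICE.items()
        if all(getattr(caps, name) == wanted for name, wanted in required.items())
    ]


if __name__ == "__main__":
    print(models_with(instruction_control=True))
```

Passing False for a capability selects models that lack it, for example models_with(voice_design=False).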

Supported languages (by version):

cosyvoice-v3.5-plus and cosyvoice-v3.5-flash (no system voices):
- Voice cloning: Chinese (Mandarin; Cantonese, Northeastern, Gansu, Guizhou, Henan, Hubei, Jiangxi, Min Nan, Ningxia, Shanxi, Shaanxi, Shandong, Shanghainese, Sichuanese, Tianjin, and Yunnan dialects via instruction control), English, French, German, Japanese, Korean, Russian, Portuguese, Thai, Indonesian, and Vietnamese
- Voice design: Mandarin Chinese and English
cosyvoice-v3-plus:
- System voices: Mandarin Chinese and English (varies by voice)
- Voice cloning: Chinese (Mandarin; Cantonese, Northeastern, Gansu, Guizhou, Henan, Hubei, Jiangxi, Min Nan, Ningxia, Shanxi, Shaanxi, Shandong, Shanghainese, Sichuanese, Tianjin, and Yunnan dialects via instruction control), English, French, German, Japanese, Korean, and Russian
- Voice design: Mandarin Chinese and English
cosyvoice-v3-flash:
- System voices (varies by voice): Mandarin Chinese (with Cantonese, Northeastern, Henan, Hunan, Shaanxi, Shandong, Sichuanese, Anhui, and Min Nan dialects — some directly supported by system voices, others configurable via instruction control), English
- Voice cloning: Chinese (Mandarin; Cantonese, Northeastern, Gansu, Guizhou, Henan, Hubei, Jiangxi, Min Nan, Ningxia, Shanxi, Shaanxi, Shandong, Shanghainese, Sichuanese, Tianjin, and Yunnan dialects via instruction control), English, French, German, Japanese, Korean, Russian, Portuguese, Thai, Indonesian, and Vietnamese
- Voice design: Mandarin Chinese and English
cosyvoice-v2 (no voice design):
- System voices: Mandarin Chinese (with Cantonese, Northeastern, Min Nan, and Shaanxi dialects), English, Japanese, and Korean (varies by voice)
- Voice cloning: Mandarin Chinese and English
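The voice cloning language lists differ between versions, which matters when a target language such as Thai is required. The sketch below encodes only the voice cloning lists above as sets; it collapses Mandarin and the dialects into a single "Chinese" entry for simplicity, and all names in it are hypothetical.

```python
# Languages shared by the v3-family voice cloning lists on this page.
BASE = {"Chinese", "English", "French", "German", "Japanese", "Korean", "Russian"}
# Additional languages listed for v3.5-plus, v3.5-flash, and v3-flash.
EXTENDED = BASE | {"Portuguese", "Thai", "Indonesian", "Vietnamese"}

CLONING_LANGUAGES = {
    "cosyvoice-v3.5-plus": EXTENDED,
    "cosyvoice-v3.5-flash": EXTENDED,
    "cosyvoice-v3-plus": BASE,
    "cosyvoice-v3-flash": EXTENDED,
    "cosyvoice-v2": {"Chinese", "English"},
}


def cloning_models_for(language: str) -> list[str]:
    """Return CosyVoice model IDs whose voice cloning list includes the language."""
    return [m for m, langs in CLONING_LANGUAGES.items() if language in langs]


if __name__ == "__main__":
    print(cloning_models_for("Thai"))
```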

Qwen3-TTS

Model ID	API	Voice cloning	Voice design	Instruction control
`qwen3-tts-flash`	HTTP	Unsupported	Unsupported	Unsupported
`qwen3-tts-flash-2025-11-27`	HTTP	Unsupported	Unsupported	Unsupported
`qwen3-tts-flash-2025-09-18`	HTTP	Unsupported	Unsupported	Unsupported
`qwen3-tts-flash-realtime`	WebSocket	Unsupported	Unsupported	Unsupported
`qwen3-tts-flash-realtime-2025-11-27`	WebSocket	Unsupported	Unsupported	Unsupported
`qwen3-tts-flash-realtime-2025-09-18`	WebSocket	Unsupported	Unsupported	Unsupported
`qwen3-tts-instruct-flash`	HTTP	Unsupported	Unsupported	Supported
`qwen3-tts-instruct-flash-2026-01-26`	HTTP	Unsupported	Unsupported	Supported
`qwen3-tts-instruct-flash-realtime`	WebSocket	Unsupported	Unsupported	Supported
`qwen3-tts-instruct-flash-realtime-2026-01-22`	WebSocket	Unsupported	Unsupported	Supported
`qwen3-tts-vc-2026-01-22`	HTTP	Supported	Unsupported	Unsupported
`qwen3-tts-vc-realtime-2026-01-15`	WebSocket	Supported	Unsupported	Unsupported
`qwen3-tts-vc-realtime-2025-11-27`	WebSocket	Supported	Unsupported	Unsupported
`qwen3-tts-vd-2026-01-26`	HTTP	Unsupported	Supported	Unsupported
`qwen3-tts-vd-realtime-2026-01-15`	WebSocket	Unsupported	Supported	Unsupported
`qwen3-tts-vd-realtime-2025-12-16`	WebSocket	Unsupported	Supported	Unsupported
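In the table above, each Qwen3-TTS capability lines up with a segment of the model ID: vc for voice cloning, vd for voice design, and instruct for instruction control. The sketch below relies on that observed pattern. It is an assumption drawn from this table rather than a documented guarantee, and the function name is hypothetical.

```python
def qwen3_capabilities(model_id: str) -> dict[str, bool]:
    """Infer Qwen3-TTS capabilities from the segments of a model ID.

    Assumes the naming pattern visible in the table on this page.
    """
    if not model_id.startswith("qwen3-tts-"):
        raise ValueError(f"not a Qwen3-TTS model ID: {model_id}")
    parts = model_id.split("-")
    return {
        "voice_cloning": "vc" in parts,
        "voice_design": "vd" in parts,
        "instruction_control": "instruct" in parts,
    }


if __name__ == "__main__":
    print(qwen3_capabilities("qwen3-tts-vc-realtime-2026-01-15"))
```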

Supported languages (by series):

Qwen3-TTS-Flash series (system voices) — qwen3-tts-flash, qwen3-tts-flash-2025-11-27, qwen3-tts-flash-2025-09-18, qwen3-tts-flash-realtime, qwen3-tts-flash-realtime-2025-11-27, qwen3-tts-flash-realtime-2025-09-18: Chinese (Mandarin; Beijing, Shanghai, Sichuan, Nanjing, Shaanxi, Min Nan, Tianjin, and Cantonese dialects, varies by voice), English, German, Italian, Portuguese, Spanish, Japanese, Korean, French, and Russian
Qwen3-TTS-Instruct-Flash series (system voices) — qwen3-tts-instruct-flash, qwen3-tts-instruct-flash-2026-01-26, qwen3-tts-instruct-flash-realtime, qwen3-tts-instruct-flash-realtime-2026-01-22: Mandarin Chinese, English, German, Italian, Portuguese, Spanish, Japanese, Korean, French, and Russian
Qwen3-TTS-VC series (voice cloning) — qwen3-tts-vc-2026-01-22, qwen3-tts-vc-realtime-2026-01-15, qwen3-tts-vc-realtime-2025-11-27: Mandarin Chinese, English, German, Italian, Portuguese, Spanish, Japanese, Korean, French, and Russian
Qwen3-TTS-VD series (voice design) — qwen3-tts-vd-2026-01-26, qwen3-tts-vd-realtime-2026-01-15, qwen3-tts-vd-realtime-2025-12-16: Mandarin Chinese, English, German, Italian, Portuguese, Spanish, Japanese, Korean, French, and Russian

Qwen-TTS (legacy, token-based billing)

These legacy Qwen-TTS models are billed by token. For new projects, or when migrating to Qwen3-TTS, use the models listed earlier.

Model ID	API	Description
`qwen-tts`	HTTP	Non-streaming synthesis, billed by token
`qwen-tts-latest`	HTTP	Non-streaming synthesis, billed by token
`qwen-tts-2025-05-22`	HTTP	Snapshot version, billed by token
`qwen-tts-2025-04-10`	HTTP	Snapshot version, billed by token
`qwen-tts-realtime`	WebSocket	Streaming synthesis, billed by token
`qwen-tts-realtime-latest`	WebSocket	Streaming synthesis, billed by token
`qwen-tts-realtime-2025-07-15`	WebSocket	Snapshot version, streaming synthesis, billed by token
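Snapshot versions in the table end with a date in YYYY-MM-DD form, while IDs such as qwen-tts and qwen-tts-latest do not. A small helper can separate the two, for instance to log which snapshot a deployment pins. This is a sketch based only on the ID format shown here; the function name is hypothetical.

```python
import re
from datetime import date
from typing import Optional

# A trailing "-YYYY-MM-DD" marks a snapshot version in the tables on this page.
_SNAPSHOT = re.compile(r"-(\d{4}-\d{2}-\d{2})$")


def snapshot_date(model_id: str) -> Optional[date]:
    """Return the snapshot date encoded in a model ID, or None if there is none."""
    match = _SNAPSHOT.search(model_id)
    return date.fromisoformat(match.group(1)) if match else None


if __name__ == "__main__":
    for mid in ("qwen-tts-latest", "qwen-tts-realtime-2025-07-15"):
        print(mid, "->", snapshot_date(mid))
```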

Supported languages (by series):

Qwen-TTS series (system voices) — qwen-tts, qwen-tts-latest, qwen-tts-2025-05-22, qwen-tts-2025-04-10: Chinese (Mandarin; Beijing, Shanghai, and Sichuan dialects, varies by voice), English, German, Italian, Portuguese, Spanish, Japanese, Korean, French, and Russian
Qwen-TTS-Realtime series (system voices) — qwen-tts-realtime, qwen-tts-realtime-latest, qwen-tts-realtime-2025-07-15: Mandarin Chinese, English, German, Italian, Portuguese, Spanish, Japanese, Korean, French, and Russian