Choose the right model for speech synthesis, voice cloning, and voice design.
This page lists models for speech synthesis and voice services, including previous versions. Answer two questions to narrow your selection:
Do you need a custom voice, or will a built-in one be sufficient?
Do you need real-time streaming output, or is non-streaming acceptable?
Standard voice synthesis or a custom voice?
Standard voice synthesis
Use built-in voices without extra configuration. Select a model and a voice to start synthesis.
International
Model | Series | Key advantage |
| CosyVoice | High quality, with a rich voice library |
| CosyVoice | Fast synthesis |
| Qwen3-TTS | Low latency, high quality |
| Qwen3-TTS | Low latency, high quality (snapshot version) |
| Qwen3-TTS | Low latency, high quality (snapshot version) |
| Qwen3-TTS | Real-time streaming output, low latency |
| Qwen3-TTS | Real-time streaming output, low latency (snapshot version) |
| Qwen3-TTS | Real-time streaming output, low latency (snapshot version) |
| Qwen3-TTS | Instruction control (speech rate, emotion, and style) |
| Qwen3-TTS | Instruction control (speech rate, emotion, and style) (snapshot version) |
| Qwen3-TTS | Real-time streaming output and instruction control (speech rate, emotion, and style) |
| Qwen3-TTS | Real-time streaming output and instruction control (speech rate, emotion, and style) (snapshot version) |
Chinese mainland
Model | Series | Key advantage |
| CosyVoice | High quality, with a continuously updated voice library |
| CosyVoice | Fast synthesis |
| CosyVoice | High quality, with a rich voice library |
| CosyVoice | Fast synthesis |
| CosyVoice | Legacy high-quality synthesis |
| CosyVoice | Legacy basic synthesis |
| Qwen3-TTS | Low latency, high quality |
| Qwen3-TTS | Low latency, high quality (snapshot version) |
| Qwen3-TTS | Low latency, high quality (snapshot version) |
| Qwen3-TTS | Real-time streaming output, low latency |
| Qwen3-TTS | Real-time streaming output, low latency (snapshot version) |
| Qwen3-TTS | Real-time streaming output, low latency (snapshot version) |
| Qwen3-TTS | Instruction control (speech rate, emotion, and style) |
| Qwen3-TTS | Instruction control (speech rate, emotion, and style) (snapshot version) |
| Qwen3-TTS | Real-time streaming output and instruction control (speech rate, emotion, and style) |
| Qwen3-TTS | Real-time streaming output and instruction control (speech rate, emotion, and style) (snapshot version) |
| MiniMax | High-fidelity speech synthesis |
| MiniMax | High-fidelity speech synthesis |
| MiniMax | Low-latency, fast synthesis |
| MiniMax | Low-latency, fast synthesis |
Custom voice
Create unique voices from audio samples or text descriptions.
International
Model | Series | Key advantage |
| Qwen3-TTS | Voice cloning from audio samples |
| Qwen3-TTS | Real-time voice cloning |
| Qwen3-TTS | Real-time voice cloning |
| Qwen3-TTS | Voice design from text descriptions |
| Qwen3-TTS | Real-time voice design |
| Qwen3-TTS | Real-time voice design |
| Qwen Voice Cloning | Voice cloning (voice enrollment and management) |
| Qwen Voice Design | Voice design (creating voices from text) |
Voice cloning vs. voice design: Voice cloning duplicates a specific voice from audio samples. Voice design creates a new voice from a text description, such as "a warm, low-pitched female voice". Use voice cloning when you have a target voice. Use voice design when you want to create a voice from scratch.
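The choice described above can be sketched as a small decision helper. This is only an illustration of the rule of thumb, not part of any SDK: the function name `choose_custom_voice_method` and its two flags are hypothetical, and the tie-break (prefer cloning when both inputs exist) is an assumption.

```python
def choose_custom_voice_method(has_target_audio: bool, has_text_description: bool) -> str:
    """Pick a custom-voice approach from the inputs you have on hand.

    Hypothetical helper: the names and return strings are illustrative only.
    """
    # Audio samples of a specific target voice point to voice cloning.
    if has_target_audio:
        return "voice cloning"
    # With only a text description (for example "a warm, low-pitched
    # female voice"), create the voice from scratch with voice design.
    if has_text_description:
        return "voice design"
    # Neither input is available, so no custom voice can be created.
    raise ValueError("a custom voice needs audio samples or a text description")


if __name__ == "__main__":
    print(choose_custom_voice_method(has_target_audio=False, has_text_description=True))
```

When both inputs are available, this sketch prefers cloning, since an audio sample pins down a specific voice more precisely than a description does.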
Control voice expression
Three options are available, ordered by flexibility:
Instruction control (qwen3-tts-instruct-flash, qwen3-tts-instruct-flash-realtime): Use natural language to describe the desired expression style and control speech rate, emotion, and style on demand.
Voice design (qwen3-tts-vd-*): Creates a custom voice from a text description. Ideal for creating a brand voice without audio samples.
Voice cloning (qwen3-tts-vc-*): Copies an existing voice from an audio sample. Suitable for replicating a specific person's voice.
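The model name patterns in this list can be matched in code to tell which expression option a given model ID belongs to. The following is a minimal sketch using Python's standard `fnmatch` module. The function `expression_option` and the lookup table are hypothetical helpers, and the `-example` suffixes in the usage lines are placeholders, not real model names.

```python
from fnmatch import fnmatchcase
from typing import Optional

# Patterns copied from the list above. The table itself is an
# illustrative helper, not an official catalog.
EXPRESSION_OPTIONS = [
    ("instruction control", ["qwen3-tts-instruct-flash",
                             "qwen3-tts-instruct-flash-realtime"]),
    ("voice design", ["qwen3-tts-vd-*"]),
    ("voice cloning", ["qwen3-tts-vc-*"]),
]


def expression_option(model_id: str) -> Optional[str]:
    """Return the expression option a model ID falls under, or None."""
    for option, patterns in EXPRESSION_OPTIONS:
        # fnmatchcase does case-sensitive shell-style matching,
        # so "*" stands for any suffix.
        if any(fnmatchcase(model_id, pattern) for pattern in patterns):
            return option
    return None


if __name__ == "__main__":
    # The "-example" suffixes are placeholders for illustration only.
    for model_id in ("qwen3-tts-instruct-flash",
                     "qwen3-tts-vd-example",
                     "qwen3-tts-vc-example"):
        print(model_id, "->", expression_option(model_id))
```

The two instruction-control names are matched exactly here, so snapshot versions of those models would need their own entries in the table.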
Full comparison
Model | Series | Streaming | Custom voice | Instruction control |
| CosyVoice | Supported | Not supported | Not supported |
| CosyVoice | Supported | Not supported | Not supported |
| CosyVoice | Supported | Not supported | Not supported |
| CosyVoice | Supported | Not supported | Not supported |
| CosyVoice | Supported | Not supported | Not supported |
| Qwen3-TTS | Supported | Not supported | Not supported |
| Qwen3-TTS | Supported | Not supported | Not supported |
| Qwen3-TTS | Supported | Not supported | Not supported |
| Qwen3-TTS | Supported | Not supported | Not supported |
| Qwen3-TTS | Supported | Not supported | Not supported |
| Qwen3-TTS | Supported | Not supported | Not supported |
| Qwen3-TTS | Supported | Not supported | Supported |
| Qwen3-TTS | Supported | Not supported | Supported |
| Qwen3-TTS | Supported | Not supported | Supported |
| Qwen3-TTS | Supported | Not supported | Supported |
| Voice cloning | Not supported | Supported | Not supported |
| Voice cloning | Supported | Supported | Not supported |
| Voice cloning | Supported | Supported | Not supported |
| Voice design | Not supported | Supported | Not supported |
| Voice design | Supported | Supported | Not supported |
| Voice design | Supported | Supported | Not supported |
| Qwen-TTS (Legacy) | Not supported (full-passage generation) | Not supported | Not supported |
| Qwen-TTS (Legacy) | Not supported (full-passage generation) | Not supported | Not supported |
| Qwen-TTS (Legacy) | Not supported (full-passage generation) | Not supported | Not supported |
| Qwen-TTS (Legacy) | Not supported (full-passage generation) | Not supported | Not supported |
| Qwen-TTS (Legacy) | Supported | Not supported | Not supported |
| Qwen-TTS (Legacy) | Supported | Not supported | Not supported |
| Qwen-TTS (Legacy) | Supported | Not supported | Not supported |
| Voice service | N/A | Supported (voice enrollment) | Not supported |
| Voice service | N/A | Supported (voice design) | Not supported |
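The two selection questions from the top of this page can be applied as a filter over the comparison table. The sketch below encodes one row per distinct capability combination, keyed by series only, because model names are not reproduced here. The `Row` class and `matching_series` function are hypothetical, and the Voice service rows are left out since their streaming value is N/A.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Row:
    series: str
    streaming: bool
    custom_voice: bool
    instruction_control: bool


# One entry per distinct capability combination in the table above.
ROWS = [
    Row("CosyVoice", True, False, False),
    Row("Qwen3-TTS", True, False, False),
    Row("Qwen3-TTS", True, False, True),
    Row("Voice cloning", False, True, False),
    Row("Voice cloning", True, True, False),
    Row("Voice design", False, True, False),
    Row("Voice design", True, True, False),
    Row("Qwen-TTS (Legacy)", False, False, False),
    Row("Qwen-TTS (Legacy)", True, False, False),
]


def matching_series(*, streaming: bool, custom_voice: bool,
                    instruction_control: bool = False) -> List[str]:
    """List series that have at least one row meeting the requirements."""
    found: List[str] = []
    for row in ROWS:
        if streaming and not row.streaming:
            continue  # streaming is required but this row lacks it
        if row.custom_voice != custom_voice:
            continue  # built-in and custom voices are separate rows
        if instruction_control and not row.instruction_control:
            continue  # instruction control is required but missing
        if row.series not in found:
            found.append(row.series)
    return found


if __name__ == "__main__":
    print(matching_series(streaming=True, custom_voice=True))
```

Passing `streaming=False` means streaming is not required, so rows with and without streaming support both qualify.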
Legacy models (Qwen-TTS, token-based billing)
Legacy Qwen-TTS models use token-based billing and are accessible over HTTP or WebSocket. If you have migrated to Qwen3-TTS, use the standard speech synthesis models above.
International
Model | Access method | Description |
| HTTP | Non-streaming synthesis, token-based billing |
| HTTP | Non-streaming synthesis, token-based billing |
| HTTP | Snapshot version, token-based billing |
| HTTP | Snapshot version, token-based billing |
| WebSocket | Streaming synthesis, token-based billing |
| WebSocket | Streaming synthesis, token-based billing |
| WebSocket | Snapshot version, streaming synthesis, token-based billing |