Speech synthesis, also known as text-to-speech (TTS), is a technology that converts text into natural-sounding speech. It uses machine learning algorithms to learn the prosody, intonation, and pronunciation rules of a language from numerous audio samples, enabling it to generate human-like, natural speech from text input.
Core features
Generates high-fidelity audio in real time and supports natural-sounding speech in multiple languages, including Chinese and English.
Provides a voice cloning feature to quickly create custom voices.
Supports streaming input and output for low-latency responses in real-time interactive scenarios.
Allows fine-grained control over speech performance by adjusting speech rate, pitch, volume, and bitrate.
Compatible with major audio formats and supports an audio sampling rate of up to 48 kHz.
Availability
Supported regions: This service is available only in the China (Beijing) region. Use an API key from the China (Beijing) region.
Supported models: cosyvoice-v3-plus, cosyvoice-v3-flash, cosyvoice-v2
Supported voices: See CosyVoice voice list.
Model selection
| Scenario | Recommended model | Reason | Notes |
| --- | --- | --- | --- |
| Brand voice customization / personalized voice cloning | cosyvoice-v3-plus | Features the strongest voice cloning capability and supports 48 kHz high-quality audio output. The combination of high-quality audio and voice cloning helps create a human-like brand voiceprint. | Higher cost ($0.286706 per 10,000 characters). Recommended for core scenarios. |
| Smart customer service / voice assistant | cosyvoice-v3-flash | The lowest-cost model ($0.14335 per 10,000 characters). Supports streaming interaction and emotional expression, provides fast response times, and is highly cost-effective. | |
| Dialect broadcasting system | cosyvoice-v3-flash, cosyvoice-v3-plus | Supports various dialects, such as Northeastern Mandarin and Minnan, and is suitable for local content broadcasting. | cosyvoice-v3-plus has a higher cost ($0.286706 per 10,000 characters). |
| Educational applications (including formula reading) | cosyvoice-v2, cosyvoice-v3-flash, cosyvoice-v3-plus | Supports converting LaTeX formulas to speech, suitable for explaining math, physics, and chemistry lessons. | cosyvoice-v2 and cosyvoice-v3-plus have a higher cost ($0.286706 per 10,000 characters). |
| Structured voice announcements (news / bulletins) | cosyvoice-v3-plus, cosyvoice-v3-flash, cosyvoice-v2 | Supports SSML to control speech rate, pauses, pronunciation, and more, enhancing the professionalism of broadcasts. | Requires additional development for SSML generation logic. Does not support setting emotions. |
| Precise audio-text alignment (caption generation, lecture playback, dictation training) | cosyvoice-v3-flash, cosyvoice-v3-plus, cosyvoice-v2 | Supports timestamp output, which keeps synthesized audio synchronized with the original text. | The timestamp feature is disabled by default and must be explicitly enabled. |
| Multilingual products for global markets | cosyvoice-v3-flash, cosyvoice-v3-plus | Supports multiple languages. | Sambert does not support streaming input and is more expensive than cosyvoice-v3-flash. |
For more information, see Feature comparison.
Getting started
The following examples show how to call the API. For more code examples that cover common scenarios, see our repository on GitHub.
You must obtain an API key and set it as an environment variable. If you call the service through an SDK, you must also install the DashScope SDK.
The CosyVoice examples cover two common scenarios, each available in Python and Java: saving synthesized audio to a file, and converting text generated by an LLM into speech in real time and playing it through a speaker. The real-time example plays the streaming text output of the Qwen large language model (qwen-turbo) on a local device; before running the Python version, install a third-party audio playback library with pip. Illustrative Python sketches of both scenarios follow.
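The first sketch saves synthesized audio to a file. It is a minimal outline that assumes the DashScope Python SDK's dashscope.audio.tts_v2.SpeechSynthesizer interface and uses an illustrative voice name (longxiaochun_v2); confirm the exact class, parameters, and available voices against the API reference and the CosyVoice voice list.

```python
# Minimal sketch: synthesize one sentence and save the result to an MP3 file.
# Assumes the DashScope Python SDK (pip install dashscope) and that the
# DASHSCOPE_API_KEY environment variable is set. The voice name is
# illustrative; pick one from the CosyVoice voice list.
from dashscope.audio.tts_v2 import SpeechSynthesizer

synthesizer = SpeechSynthesizer(model="cosyvoice-v2", voice="longxiaochun_v2")
audio = synthesizer.call("今天天气怎么样？")  # returns the synthesized audio as bytes

with open("output.mp3", "wb") as f:
    f.write(audio)
```

The second sketch outlines the real-time scenario: incremental text from qwen-turbo is fed to the synthesizer as it arrives, and each synthesized audio chunk is played through the local sound card with the third-party pyaudio library. The callback hook names, the streaming_call/streaming_complete methods, and the PCM_22050HZ_MONO_16BIT format value are assumptions to verify against the API reference.

```python
# Sketch: stream text from qwen-turbo into CosyVoice and play the audio live.
# Assumes the DashScope Python SDK and pyaudio (pip install dashscope pyaudio);
# the callback hooks, streaming methods, and audio format value below are
# assumptions to verify against the API reference.
import dashscope
import pyaudio
from dashscope.audio.tts_v2 import AudioFormat, ResultCallback, SpeechSynthesizer


class PlayerCallback(ResultCallback):
    """Plays each synthesized PCM chunk through the local sound card."""

    def on_open(self):
        self._player = pyaudio.PyAudio()
        self._stream = self._player.open(
            format=pyaudio.paInt16, channels=1, rate=22050, output=True
        )

    def on_data(self, data: bytes):
        self._stream.write(data)

    def on_close(self):
        self._stream.stop_stream()
        self._stream.close()
        self._player.terminate()


synthesizer = SpeechSynthesizer(
    model="cosyvoice-v2",
    voice="longxiaochun_v2",  # illustrative voice name
    format=AudioFormat.PCM_22050HZ_MONO_16BIT,
    callback=PlayerCallback(),
)

# Stream incremental text from qwen-turbo and forward each chunk to the synthesizer.
responses = dashscope.Generation.call(
    model="qwen-turbo",
    messages=[{"role": "user", "content": "请简单介绍一下语音合成技术。"}],
    result_format="message",
    stream=True,
    incremental_output=True,
)
for response in responses:
    chunk = response.output.choices[0].message.content
    if chunk:
        synthesizer.streaming_call(chunk)

synthesizer.streaming_complete()  # flush remaining audio and close the session
```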
API reference
Feature comparison
| Feature | cosyvoice-v3-plus | cosyvoice-v3-flash | cosyvoice-v2 |
| --- | --- | --- | --- |
| Supported languages | Varies by voice: Chinese (Mandarin, Cantonese, Northeastern, Gansu, Guizhou, Henan, Hubei, Jiangxi, Minnan, Ningxia, Shanxi, Shaanxi, Shandong, Shanghai, Sichuan, Tianjin, Yunnan), English, French, German, Japanese, Korean, Russian | Varies by voice: same languages as cosyvoice-v3-plus | Varies by voice: Chinese, English (British, American), Korean, Japanese |
| Audio formats | pcm, wav, mp3, opus | pcm, wav, mp3, opus | pcm, wav, mp3, opus |
| Audio sampling rates | 8 kHz, 16 kHz, 22.05 kHz, 24 kHz, 44.1 kHz, 48 kHz | 8 kHz, 16 kHz, 22.05 kHz, 24 kHz, 44.1 kHz, 48 kHz | 8 kHz, 16 kHz, 22.05 kHz, 24 kHz, 44.1 kHz, 48 kHz |
| Voice cloning | Supported. See CosyVoice voice cloning API. | Supported. See CosyVoice voice cloning API. | Supported. See CosyVoice voice cloning API. |
| SSML | Supported. See SSML markup language overview. Applies to cloned voices and the system voices marked as supported in the voice list. | Supported. See SSML markup language overview. Applies to cloned voices and the system voices marked as supported in the voice list. | Supported. See SSML markup language overview. Applies to cloned voices and the system voices marked as supported in the voice list. |
| LaTeX | Supported | Supported | Supported |
| Volume adjustment | Supported | Supported | Supported |
| Speech rate adjustment | Supported | Supported | Supported |
| Pitch adjustment | Supported | Supported | Supported |
| Bitrate adjustment | Supported. Available only for the opus format. | Supported. Available only for the opus format. | Supported. Available only for the opus format. |
| Timestamp | Supported. Disabled by default; must be enabled explicitly. Applies to cloned voices and the system voices marked as supported in the voice list. | Supported. Disabled by default; must be enabled explicitly. Applies to cloned voices and the system voices marked as supported in the voice list. | Supported. Disabled by default; must be enabled explicitly. Applies to cloned voices and the system voices marked as supported in the voice list. |
| Instruction control (Instruct) | Supported. Applies to cloned voices and the system voices marked as supported in the voice list. | Supported. Applies to cloned voices and the system voices marked as supported in the voice list. | Supported. Applies to cloned voices and the system voices marked as supported in the voice list. |
| Streaming input | Supported | Supported | Supported |
| Streaming output | Supported | Supported | Supported |
| Rate limit (RPS) | 3 | 3 | 3 |
| Connection types | Java/Python SDK, WebSocket API | Java/Python SDK, WebSocket API | Java/Python SDK, WebSocket API |
| Price | $0.286706 per 10,000 characters | $0.14335 per 10,000 characters | $0.286706 per 10,000 characters |
FAQ
Q: What should I do if the synthesized speech mispronounces a word? How can I control the pronunciation of polyphonic characters?
A: Use either of the following approaches:
Replace the polyphonic character with a homophone to quickly correct the pronunciation.
Use Speech Synthesis Markup Language (SSML) to control the pronunciation explicitly, as shown in the sketch below.
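The following is a hedged sketch of the SSML approach. It assumes the SpeechSynthesizer interface shown in Getting started and that SSML text wrapped in <speak> tags can be passed directly to the call; the <phoneme> tag and its pinyin notation are illustrative assumptions, so confirm the exact markup, how SSML input is enabled, and which voices accept SSML against the SSML markup language overview and the voice list.

```python
# Hedged sketch: pin the pronunciation of the polyphonic character 重 with SSML.
# The <speak>/<phoneme> markup and pinyin notation are assumptions; check the
# SSML markup language overview for the supported tags, and note that SSML only
# applies to cloned voices and the system voices marked as supported.
from dashscope.audio.tts_v2 import SpeechSynthesizer

ssml_text = (
    "<speak>"
    '这个字念<phoneme alphabet="py" ph="chong2">重</phoneme>，表示重复。'
    "</speak>"
)

synthesizer = SpeechSynthesizer(model="cosyvoice-v2", voice="longxiaochun_v2")  # illustrative voice
audio = synthesizer.call(ssml_text)

with open("pronunciation.mp3", "wb") as f:
    f.write(audio)
```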