Speech synthesis, also known as Text-to-Speech (TTS), learns the rhythm, intonation, and pronunciation patterns of a language, and generates human-like speech from text input.
Core features
- Generates high-fidelity speech in real time with support for multiple languages, including Chinese and English.
- Offers two voice customization methods: voice cloning and voice design.
- Supports streaming input and output with low latency, ideal for real-time interactive scenarios.
- Allows adjustment of speech rate, pitch, volume, and bitrate for fine-grained control over voice output.
- Compatible with mainstream audio formats with output sample rates up to 48 kHz.
Availability
Supported models:
International
In the international deployment mode, access points and data storage are located in the Singapore region. Model inference computing resources are dynamically scheduled worldwide, excluding the Chinese mainland.
When you invoke the following models, select an API key for the Singapore region:
- CosyVoice: cosyvoice-v3-plus, cosyvoice-v3-flash
Chinese mainland
In the Chinese mainland deployment mode, access points and data storage are located in the Beijing region. Model inference computing resources are limited to the Chinese mainland.
When you invoke the following models, select an API key for the Beijing region:
- CosyVoice: cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-plus, cosyvoice-v3-flash, cosyvoice-v2
For more information, see the Model list.
Model selection
| Scenario | Recommended | Reason | Notes |
| --- | --- | --- | --- |
| Voice customization for brand identity, an exclusive voice, or extended system voices (based on a text description) | cosyvoice-v3.5-plus | Supports voice design, allowing you to create customized voices from text descriptions without audio samples. Ideal for designing brand-exclusive voices from scratch. | cosyvoice-v3.5-plus is available only in the Beijing region and does not support system voices. |
| Voice customization for brand identity, an exclusive voice, or extended system voices (based on audio samples) | cosyvoice-v3.5-plus | Supports voice cloning, enabling you to quickly clone voices from real audio samples to create human-like brand voiceprints with high fidelity and consistency. | cosyvoice-v3.5-plus is available only in the Beijing region and does not support system voices. |
| Intelligent customer service / voice assistant | cosyvoice-v3-flash, cosyvoice-v3.5-flash | Lower cost than the plus models, with support for streaming interaction and emotional expression, delivering fast responses at an affordable price point. | cosyvoice-v3.5-flash is available only in the Beijing region and does not support system voices. |
| Regional dialect broadcasting | cosyvoice-v3.5-plus | Supports multiple Chinese dialects, such as Northeastern Mandarin and Minnan, making it ideal for localized content broadcasting. | cosyvoice-v3.5-plus is available only in the Beijing region and does not support system voices. |
| Educational applications (including formula reading) | cosyvoice-v2, cosyvoice-v3-flash, cosyvoice-v3-plus | Supports LaTeX formula-to-speech conversion, ideal for mathematics, physics, and chemistry instruction. | cosyvoice-v2 and cosyvoice-v3-plus have higher costs ($0.286706 per 10,000 characters). |
| Structured voice broadcasting (news/announcements) | cosyvoice-v3-plus, cosyvoice-v3-flash, cosyvoice-v2 | Supports SSML for controlling speech rate, pauses, and pronunciation to enhance broadcast professionalism. | You must implement the SSML generation logic yourself. These models do not support emotion settings. |
| Precise speech-text alignment for scenarios such as caption generation, lesson playback, and dictation practice | cosyvoice-v3-flash, cosyvoice-v3-plus, cosyvoice-v2 | Supports timestamp output to synchronize the synthesized speech with the original text. | You must enable the timestamp feature manually. |
| Multilingual international products | cosyvoice-v3-flash, cosyvoice-v3-plus | Supports multiple languages. | Capabilities vary by region and model. Before selecting a model, review the Compare models section. |
Getting started
The following examples demonstrate how to invoke the API. For more code examples covering common scenarios, see GitHub.
Get an API key and export it as an environment variable. If you use an SDK to make calls, install the DashScope SDK.
Use system voices
The following example demonstrates how to perform speech synthesis using a system voice and save the synthesized audio to a file. For more information, see the Voice list.
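Below is a minimal Python sketch, assuming the DashScope Python SDK's tts_v2 SpeechSynthesizer interface; the model and voice names are illustrative, so choose ones available in your region from the Model list and Voice list.

```python
# Minimal sketch: synthesize speech with a system voice and save it to a file.
# Assumes `pip install dashscope` and that DASHSCOPE_API_KEY is exported.
# The model and voice names are illustrative; pick them from the Model list and Voice list.
from dashscope.audio.tts_v2 import SpeechSynthesizer

synthesizer = SpeechSynthesizer(model="cosyvoice-v2", voice="longxiaochun_v2")
audio = synthesizer.call("Hello! This sentence is synthesized by CosyVoice.")  # returns audio bytes

with open("output.mp3", "wb") as f:
    f.write(audio)
print("Saved synthesized audio to output.mp3")
```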
Convert LLM-generated text to speech in real time and play it through speakers
The following example plays text returned in real time from the Qwen large language model (qwen-turbo) through the local speakers. Before you run the Python example, install a third-party audio playback library using pip.
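Below is a minimal Python sketch of this pipeline, assuming the DashScope SDK's streaming tts_v2 interface and pyaudio for playback; the model, voice, and audio format are illustrative.

```python
# Minimal sketch: stream qwen-turbo output into CosyVoice and play it through
# the local speakers as it arrives. Assumes `pip install dashscope pyaudio`
# and DASHSCOPE_API_KEY; the model, voice, and format below are illustrative.
import pyaudio
from dashscope import Generation
from dashscope.audio.tts_v2 import AudioFormat, ResultCallback, SpeechSynthesizer


class SpeakerCallback(ResultCallback):
    """Plays each synthesized PCM chunk on the default audio device."""

    def on_open(self):
        self._player = pyaudio.PyAudio()
        self._stream = self._player.open(
            format=pyaudio.paInt16, channels=1, rate=22050, output=True
        )

    def on_data(self, data: bytes) -> None:
        self._stream.write(data)

    def on_complete(self):
        pass

    def on_error(self, message: str):
        print("Speech synthesis failed:", message)

    def on_close(self):
        self._stream.stop_stream()
        self._stream.close()
        self._player.terminate()


synthesizer = SpeechSynthesizer(
    model="cosyvoice-v2",
    voice="longxiaochun_v2",
    format=AudioFormat.PCM_22050HZ_MONO_16BIT,
    callback=SpeakerCallback(),
)

# Stream the LLM reply and forward each incremental text fragment to TTS.
responses = Generation.call(
    model="qwen-turbo",
    messages=[{"role": "user", "content": "Introduce yourself in two sentences."}],
    result_format="message",
    stream=True,
    incremental_output=True,
)
for response in responses:
    text_chunk = response.output.choices[0].message.content
    synthesizer.streaming_call(text_chunk)

synthesizer.streaming_complete()  # flush the remaining audio and close the session
```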
Use cloned voices
Voice cloning and speech synthesis are two separate but related steps that follow a "create then use" workflow: first register the cloned voice, then reference the returned voice ID during synthesis.
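A minimal Python sketch of both steps follows, assuming the DashScope SDK's VoiceEnrollmentService interface; the model name, prefix, and sample URL are illustrative placeholders.

```python
# Minimal sketch of the clone-then-use workflow. Assumes `pip install dashscope`
# and DASHSCOPE_API_KEY; the model, prefix, and sample URL are placeholders.
from dashscope.audio.tts_v2 import SpeechSynthesizer, VoiceEnrollmentService

target_model = "cosyvoice-v2"  # must match the model used for synthesis below

# Step 1 (create): register a cloned voice from a publicly accessible recording
# that meets the input audio requirements described in this topic.
service = VoiceEnrollmentService()
voice_id = service.create_voice(
    target_model=target_model,
    prefix="demo",                         # custom prefix used in the generated voice ID
    url="https://example.com/sample.wav",  # placeholder URL for your recording
)
print("Cloned voice ID:", voice_id)

# Step 2 (use): synthesize with the cloned voice, reusing the same model name.
synthesizer = SpeechSynthesizer(model=target_model, voice=voice_id)
audio = synthesizer.call("This sentence is spoken with the cloned voice.")
with open("cloned_output.mp3", "wb") as f:
    f.write(audio)
```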
Use designed voices
Voice design and speech synthesis are two separate but related steps that follow a "create then use" workflow: first create a voice from a text description, then reference the returned voice ID during synthesis, in the same way as for cloned voices.
Voice cloning: Input audio format
Not supported in the Singapore region.
High-quality input audio is the foundation for achieving excellent cloning results.
| Item | Requirements |
| --- | --- |
| Supported formats | WAV (16-bit), MP3, M4A |
| Audio duration | Recommended: 10 to 20 seconds. Maximum: 60 seconds. |
| File size | ≤ 10 MB |
| Sample rate | ≥ 16 kHz |
| Sound channels | Mono or stereo. For stereo audio, only the first channel is processed, so make sure that the first channel contains a clear human voice. |
| Content | The audio must contain at least 5 seconds of continuous, clear speech without background sound, and any pauses elsewhere must be short (≤ 2 seconds). The entire segment should be free of background music, noise, and other voices so that the core speech content stays clean. Use normal spoken audio as input; do not upload songs or singing, which degrade the accuracy and usability of the cloned voice. |
Voice design: Write high-quality voice descriptions
Not supported in the Singapore region.
Limitations
When writing voice descriptions (voice_prompt), follow these technical constraints:
- Length limit: The content of voice_prompt must not exceed 500 characters.
- Supported languages: The description text supports only Chinese and English.
Core principles
A high-quality voice description (voice_prompt) is essential for creating your ideal voice. It serves as the blueprint for voice design and directly guides the model to generate sounds with specific characteristics.
Follow these core principles when describing voices:
- Be specific, not vague: Use words that describe concrete sound qualities, such as "deep," "crisp," or "fast-paced." Avoid subjective, uninformative terms such as "nice-sounding" or "ordinary."
- Be multidimensional, not single-dimensional: Excellent descriptions typically combine multiple dimensions, such as gender, age, and emotion. Single-dimensional descriptions, such as "female voice," are too broad to generate distinctive voices.
- Be objective, not subjective: Focus on the physical and perceptual characteristics of the sound itself, not your personal preferences. For example, use "high-pitched with energetic delivery" instead of "my favorite voice."
- Be original, not imitative: Describe sound characteristics rather than requesting imitation of specific individuals, such as celebrities or actors. Such requests pose copyright risks, and the model does not support direct imitation.
- Be concise, not redundant: Ensure every word adds meaning. Avoid repeating synonyms or using meaningless intensifiers, such as "very very nice voice."
Dimension examples
| Dimension | Examples |
| --- | --- |
| Gender | Male, female, neutral |
| Age | Child (5-12 years), teenager (13-18 years), young adult (19-35 years), middle-aged (36-55 years), senior (55+ years) |
| Pitch | High, medium, low, slightly high, slightly low |
| Speech rate | Fast, medium, slow, slightly fast, slightly slow |
| Emotion | Cheerful, calm, gentle, serious, lively, cool, soothing |
| Characteristics | Magnetic, crisp, raspy, mellow, sweet, rich, powerful |
| Purpose | News broadcasting, advertisement voice-over, audiobooks, animated characters, voice assistants, documentary narration |
Example comparison
✅ Good cases
- "Young and lively female voice, fast speech rate with noticeable rising intonation, suitable for introducing fashion products."
  Analysis: This description combines age, personality, speech rate, and intonation, and specifies the use case, creating a clear voice profile.
- "Calm middle-aged male, slow speech rate, deep and magnetic voice quality, suitable for reading news or documentary narration."
  Analysis: This description clearly defines gender, age range, speech rate, voice quality, and intended use.
- "Cute child's voice, approximately 8-year-old girl, slightly childish speech, suitable for animated character dubbing."
  Analysis: This description pinpoints the specific age and voice quality (childishness) and has a clear purpose.
- "Gentle and intellectual female, around 30 years old, calm tone, suitable for audiobook narration."
  Analysis: This description effectively conveys voice emotion and style through terms such as "intellectual" and "calm."
❌ Bad cases and suggestions
| Bad case | Main issue | Improvement suggestion |
| --- | --- | --- |
| "Nice-sounding voice" | Too vague and subjective; lacks actionable detail. | Add specific dimensions, such as "Clear-toned young female voice with gentle intonation." |
| "Voice like a celebrity" | Poses a copyright risk. The model does not support direct imitation. | Extract the voice characteristics for the description, such as "Mature, magnetic, steady-paced male voice." |
| "Very very very nice female voice" | Redundant. Repeating words does not help define the voice. | Remove repetitions and add effective descriptions, such as "A 20- to 24-year-old female voice with a light, cheerful tone, lively pitch, and sweet quality." |
| 123456 | Invalid input that cannot be parsed as voice characteristics. | Provide a meaningful text description. For more information, see the recommended examples above. |
API reference
Compare models
International
In the international deployment mode, access points and data storage are located in the Singapore region. Model inference computing resources are dynamically scheduled worldwide, excluding the Chinese mainland.
| Feature | cosyvoice-v3-plus | cosyvoice-v3-flash |
| --- | --- | --- |
| Supported languages | Varies by system voice: Chinese (Mandarin, Northeastern, Minnan, Shaanxi), English, Japanese, Korean | Varies by system voice: Chinese (Mandarin), English |
| Audio format | pcm, wav, mp3, opus | pcm, wav, mp3, opus |
| Audio sample rate | 8 kHz, 16 kHz, 22.05 kHz, 24 kHz, 44.1 kHz, 48 kHz | 8 kHz, 16 kHz, 22.05 kHz, 24 kHz, 44.1 kHz, 48 kHz |
| Voice cloning | Not supported in the Singapore region | Not supported in the Singapore region |
| Voice design | Not supported in the Singapore region | Not supported in the Singapore region |
| SSML | Applies to cloned voices and system voices marked as supporting SSML in the Voice list. For usage instructions, see SSML. | Same as cosyvoice-v3-plus |
| LaTeX | For usage instructions, see LaTeX formula-to-speech. | Same as cosyvoice-v3-plus |
| Volume adjustment | See the corresponding request parameter. | Same as cosyvoice-v3-plus |
| Speech rate adjustment | See the corresponding request parameter. | Same as cosyvoice-v3-plus |
| Pitch adjustment | See the corresponding request parameter. | Same as cosyvoice-v3-plus |
| Bitrate adjustment | Supported only for the opus audio format. See the corresponding request parameter. | Same as cosyvoice-v3-plus |
| Timestamp | Disabled by default but can be enabled. Applies to cloned voices and system voices marked as supporting timestamps in the Voice list. See the corresponding request parameter. | Same as cosyvoice-v3-plus |
| Instruction control (Instruct) | Applies to system voices marked as supporting Instruct in the Voice list. See the corresponding request parameter. | Same as cosyvoice-v3-plus |
| Streaming input | Supported | Supported |
| Streaming output | Supported | Supported |
| Rate limiting (RPS) | 3 | 3 |
| Connection type | Java/Python SDK, WebSocket API | Java/Python SDK, WebSocket API |
| Price | $0.26 per 10,000 characters | $0.13 per 10,000 characters |
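For the volume, speech rate, and pitch adjustment rows above, the following is an illustrative Python sketch. The volume, speech_rate, and pitch_rate keyword arguments are assumptions about the SDK's parameter names; confirm the exact request parameters in the API reference.

```python
# Illustrative sketch of fine-grained output control. The volume, speech_rate,
# and pitch_rate keyword arguments below are assumptions about the Python SDK's
# parameter names; confirm the exact request parameters in the API reference.
from dashscope.audio.tts_v2 import SpeechSynthesizer

synthesizer = SpeechSynthesizer(
    model="cosyvoice-v3-flash",   # illustrative model choice
    voice="longxiaochun_v2",      # illustrative voice; pick one from the Voice list
    volume=80,                    # assumed parameter: louder than the default
    speech_rate=1.1,              # assumed parameter: slightly faster than normal
    pitch_rate=1.0,               # assumed parameter: default pitch
)
audio = synthesizer.call("Testing volume, speech rate, and pitch adjustments.")
with open("tuned_output.mp3", "wb") as f:
    f.write(audio)
```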
Chinese mainland
In the Chinese mainland deployment mode, access points and data storage are located in the Beijing region. Model inference computing resources are limited to the Chinese mainland.
The following list compares the Chinese mainland models (cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-plus, cosyvoice-v3-flash, cosyvoice-v2) feature by feature:
- Supported languages:
  - cosyvoice-v3.5-plus and cosyvoice-v3.5-flash: No system voices. Cloned voices support Chinese (Mandarin, Cantonese, Henan, Hubei, Minnan, Ningxia, Shaanxi, Shandong, Shanghai, Sichuan), English, French, German, Japanese, Korean, Russian, Portuguese, Thai, Indonesian, and Vietnamese. Designed voices support Chinese (Mandarin) and English.
  - cosyvoice-v3-plus: System voices (varies by voice): Chinese (Mandarin, Northeastern, Minnan, Shaanxi), English, Japanese, Korean. Cloned voices: Chinese (Mandarin), English, French, German, Japanese, Korean, and Russian.
  - cosyvoice-v3-flash: System voices (varies by voice): Chinese (Mandarin), English. Cloned voices: Chinese (Mandarin, Cantonese, Northeastern, Gansu, Guizhou, Henan, Hubei, Jiangxi, Minnan, Ningxia, Shanxi, Shaanxi, Shandong, Shanghai, Sichuan, Tianjin, Yunnan), English, French, German, Japanese, Korean, Russian, Portuguese, Thai, Indonesian, and Vietnamese.
  - cosyvoice-v2: System voices (varies by voice): Chinese (Mandarin), English, Korean, Japanese. Cloned voices: Chinese (Mandarin) and English.
- Audio format (all models): pcm, wav, mp3, opus.
- Audio sample rate (all models): 8 kHz, 16 kHz, 22.05 kHz, 24 kHz, 44.1 kHz, 48 kHz.
- Voice cloning (all models): For usage instructions, see the CosyVoice voice cloning/design API. The following languages are supported for voice cloning:
  - cosyvoice-v2: Chinese (Mandarin) and English.
  - cosyvoice-v3-flash: Chinese (Mandarin, Cantonese, Northeastern, Gansu, Guizhou, Henan, Hubei, Jiangxi, Minnan, Ningxia, Shanxi, Shaanxi, Shandong, Shanghai, Sichuan, Tianjin, Yunnan), English, French, German, Japanese, Korean, Russian, Portuguese, Thai, Indonesian, and Vietnamese.
  - cosyvoice-v3-plus: Chinese (Mandarin), English, French, German, Japanese, Korean, and Russian.
  - cosyvoice-v3.5-plus and cosyvoice-v3.5-flash: Chinese (Mandarin, Cantonese, Henan, Hubei, Minnan, Ningxia, Shaanxi, Shandong, Shanghai, Sichuan), English, French, German, Japanese, Korean, Russian, Portuguese, Thai, Indonesian, and Vietnamese.
- Voice design: For usage instructions, see the CosyVoice voice cloning/design API. The following languages are supported for voice design: Chinese and English.
- SSML (all models): Applies to cloned voices and system voices marked as supporting SSML in the Voice list. For usage instructions, see SSML.
- LaTeX (all models): For usage instructions, see LaTeX formula-to-speech.
- Volume adjustment (all models): See the corresponding request parameter.
- Speech rate adjustment (all models): See the corresponding request parameter.
- Pitch adjustment (all models): See the corresponding request parameter.
- Bitrate adjustment (all models): Supported only for the opus audio format. See the corresponding request parameter.
- Timestamp (all models): Disabled by default but can be enabled. Applies to cloned voices and system voices marked as supporting timestamps in the Voice list. See the corresponding request parameter.
- Instruction control (Instruct): Applies to cloned voices and system voices marked as supporting Instruct in the Voice list. Suitable for scenarios that require exaggerated expressiveness, such as video dubbing and audiobook narration. If you want to preserve the original timbre and prosody, you do not need to enable this feature. Instruct commands may not take effect if they conflict with the inherent style of the voice; for example, applying a sad instruction to a cheerful voice may not produce the expected result. See the corresponding request parameter.
- Streaming input and streaming output (all models): Supported.
- Rate limiting (all models): 3 RPS.
- Connection type (all models): Java/Python SDK, WebSocket API.
- Price:
  - cosyvoice-v3.5-plus: $0.22 per 10,000 characters
  - cosyvoice-v3.5-flash: $0.116 per 10,000 characters
  - cosyvoice-v3-plus: $0.286706 per 10,000 characters
  - cosyvoice-v3-flash: $0.14335 per 10,000 characters
  - cosyvoice-v2: $0.286706 per 10,000 characters
System voices
FAQ
Q: What should I do if speech synthesis produces incorrect pronunciations? How can I control the pronunciation of characters with multiple pronunciations?
- Replace characters that have multiple pronunciations with homophones to quickly resolve pronunciation issues.
- Use Speech Synthesis Markup Language (SSML) to control pronunciation.
Q: How do I troubleshoot silent audio output from a cloned voice?
1. Confirm the voice status.
   Call the CosyVoice voice cloning/design API and check whether the voice status is OK.
2. Check model version consistency.
   Ensure that the target_model parameter used for voice cloning exactly matches the model parameter used for speech synthesis. For example, if you clone with cosyvoice-v3-plus, also use cosyvoice-v3-plus for synthesis.
3. Verify the source audio quality.
   Check whether the source audio used for voice cloning meets the requirements specified in the CosyVoice voice cloning/design API:
   - Audio duration: 10 to 20 seconds
   - Clear audio quality
   - No background noise
4. Check the request parameters.
   Confirm that the voice parameter in the speech synthesis request is set to the ID of the cloned voice.
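These checks can also be scripted. The following sketch assumes the DashScope SDK exposes a VoiceEnrollmentService with a list_voices method; the method signature and the returned fields are assumptions, so verify them against the CosyVoice voice cloning/design API reference.

```python
# Illustrative troubleshooting sketch: confirm that the cloned voice is ready and
# that the same model string is used for cloning and synthesis. The list_voices
# call and its returned fields are assumptions; check the cloning/design API reference.
from dashscope.audio.tts_v2 import SpeechSynthesizer, VoiceEnrollmentService

MODEL = "cosyvoice-v3-plus"  # one constant, so target_model and model cannot diverge

service = VoiceEnrollmentService()
for voice in service.list_voices(prefix="demo"):        # assumed signature
    print(voice.get("voice_id"), voice.get("status"))   # expect status "OK" before use

synthesizer = SpeechSynthesizer(model=MODEL, voice="demo-xxxx-voice-id")  # placeholder voice ID
```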
Q: What should I do if the synthesis effect is unstable or the speech is incomplete after voice cloning?
If the synthesized speech after voice cloning has the following issues:
- Incomplete playback where only part of the text is spoken
- Inconsistent synthesis quality
- Abnormal pauses or silent segments in the speech
Possible cause: The source audio quality does not meet the requirements.
Solution: Check whether the source audio meets the following requirements. If not, consider re-recording the audio by following the Recording operation guide:
- Check audio continuity: Ensure the source audio contains uninterrupted speech with no pauses or silent segments longer than 2 seconds. If the audio contains significant silent gaps, the model may treat the silence or noise as part of the voice profile, degrading output quality.
- Check the speech activity ratio: Ensure active speech comprises at least 60% of the total audio duration. Excessive background noise or non-speech segments can interfere with voice feature extraction.
- Verify the audio quality details:
  - Audio duration: 10 to 20 seconds (15 seconds is recommended)
  - Clear pronunciation and a stable speech rate
  - No background noise, echo, or static
  - Consistent speech levels with no long silent gaps