Speech synthesis (Text-to-Speech, TTS) converts text into natural speech. This document covers supported models, call methods, and parameter configurations for real-time speech synthesis.
Core features
Generates high-fidelity speech in real-time, supporting natural pronunciation in multiple languages including Chinese and English.
Provides voice cloning to quickly customize personalized timbres.
Supports streaming input and output with low-latency response for real-time interaction.
Adjusts speech rate, pitch, volume, and bitrate for fine-grained control.
Supports mainstream audio formats with up to 48 kHz sample rate output.
Supported models
International
In International deployment mode, access points and data storage are located in the Singapore region. Model inference computing resources are dynamically scheduled globally (excluding Mainland China).
When invoking the following models, select an API key for the Singapore region:
CosyVoice: cosyvoice-v3-plus, cosyvoice-v3-flash
Mainland China
In Mainland China deployment mode, access points and data storage are located in the Beijing region. Model inference computing resources are limited to Mainland China.
When invoking the following models, select an API key for the Beijing region:
CosyVoice: cosyvoice-v3-plus, cosyvoice-v3-flash, cosyvoice-v2
See Model list.
Model selection
| Scenario | Recommended models | Reason | Notes |
| --- | --- | --- | --- |
| Brand voice customization / personalized voice cloning | cosyvoice-v3-plus | Strongest voice cloning capability and 48 kHz high-quality audio output, which together create a human-like brand voiceprint. | Higher cost ($0.286706/10,000 characters). Use for core scenarios. |
| Intelligent customer service / voice assistant | cosyvoice-v3-flash | Lowest cost ($0.14335/10,000 characters). Supports streaming interaction and emotional expression, with fast response and high cost-effectiveness. | |
| Dialect broadcast system | cosyvoice-v3-flash, cosyvoice-v3-plus | Supports multiple dialects, such as Northeastern Mandarin and Minnan, suitable for local content broadcasting. | cosyvoice-v3-plus has a higher cost ($0.286706/10,000 characters). |
| Educational applications (including formula reading) | cosyvoice-v2, cosyvoice-v3-flash, cosyvoice-v3-plus | Supports LaTeX formula-to-speech, suitable for explaining math, physics, and chemistry courses. | cosyvoice-v2 and cosyvoice-v3-plus have higher costs ($0.286706/10,000 characters). |
| Structured voice broadcasting (news/announcements) | cosyvoice-v3-plus, cosyvoice-v3-flash, cosyvoice-v2 | Supports SSML to control speech rate, pauses, and pronunciation, enhancing broadcast professionalism. | Requires additional development for SSML generation logic. Does not support emotion settings. |
| Precise speech-text alignment (caption generation, teaching playback, dictation training) | cosyvoice-v3-flash, cosyvoice-v3-plus, cosyvoice-v2 | Supports timestamp output for synchronizing synthesized speech with the original text. | The timestamp feature is disabled by default and must be explicitly enabled. |
| Multilingual overseas products | cosyvoice-v3-flash, cosyvoice-v3-plus | Supports multiple languages. | Sambert does not support streaming input and is more expensive than cosyvoice-v3-flash. |
Capabilities vary across regions and models. Review Model feature comparison before selecting a model.
Getting started
Sample code for calling an API is provided below. For more code examples, see GitHub.
You must obtain an API key and set it as an environment variable. If you use an SDK to make calls, you must also install the DashScope SDK.
CosyVoice examples:
Save synthesized audio to a file (Python, Java).
Convert text from an LLM to speech in real time and play it through a speaker (Python, Java): the example plays real-time text from the Qwen large language model (qwen-turbo) on a local device. Before running the Python version, install a third-party audio playback library using pip.
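For instance, here is a minimal sketch of saving synthesized audio to a file, assuming the DashScope Python SDK's tts_v2 interface; the model and voice names are examples, so confirm what is available in your region against the API reference:

```python
import os

import dashscope
from dashscope.audio.tts_v2 import SpeechSynthesizer

# The SDK reads the API key from the DASHSCOPE_API_KEY environment variable;
# it is set explicitly here to make the dependency visible.
dashscope.api_key = os.environ["DASHSCOPE_API_KEY"]

# "longxiaochun" is an example system voice; replace the model and voice with
# ones available in your region (see Supported system voices).
synthesizer = SpeechSynthesizer(model="cosyvoice-v2", voice="longxiaochun")

# call() performs non-streaming synthesis and returns the audio as bytes
# (mp3 is assumed to be the default output format).
audio = synthesizer.call("Hello, this is a speech synthesis test.")
with open("output.mp3", "wb") as f:
    f.write(audio)
```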
API reference
Model feature comparison
International
In the international deployment mode, the endpoint and data storage are both located in the Singapore region. Model inference computing resources are dynamically scheduled worldwide, excluding Mainland China.
| Feature | cosyvoice-v3-plus | cosyvoice-v3-flash |
| --- | --- | --- |
| Supported languages | Varies by system voice: Chinese (Mandarin, Northeastern, Minnan, Shaanxi), English, Japanese, Korean | Varies by system voice: Chinese (Mandarin), English |
| Audio formats | pcm, wav, mp3, opus | pcm, wav, mp3, opus |
| Audio sample rates | 8 kHz, 16 kHz, 22.05 kHz, 24 kHz, 44.1 kHz, 48 kHz | 8 kHz, 16 kHz, 22.05 kHz, 24 kHz, 44.1 kHz, 48 kHz |
| Voice cloning | Supported | Supported |
| SSML | Available for cloned voices and system voices marked as SSML-compatible in the voice list. See Introduction to SSML. | Available for cloned voices and system voices marked as SSML-compatible in the voice list. See Introduction to SSML. |
| LaTeX | Supported. See Convert LaTeX formulas to speech. | Supported. See Convert LaTeX formulas to speech. |
| Volume adjustment | Supported. See the corresponding request parameter. | Supported. See the corresponding request parameter. |
| Speech rate adjustment | Supported. See the corresponding request parameter (named differently in the Java SDK). | Supported. See the corresponding request parameter (named differently in the Java SDK). |
| Pitch adjustment | Supported. See the corresponding request parameter (named differently in the Java SDK). | Supported. See the corresponding request parameter (named differently in the Java SDK). |
| Bitrate adjustment | Supported only for audio in opus format. See the corresponding request parameter (named differently in the Java SDK). | Supported only for audio in opus format. See the corresponding request parameter (named differently in the Java SDK). |
| Timestamp | Disabled by default; can be enabled. Available for cloned voices and system voices marked as timestamp-compatible in the voice list. See the corresponding request parameter (named differently in the Java SDK). | Disabled by default; can be enabled. Available for cloned voices and system voices marked as timestamp-compatible in the voice list. See the corresponding request parameter (named differently in the Java SDK). |
| Instruction control (Instruct) | Available for cloned voices and system voices marked as Instruct-compatible in the voice list. See the corresponding request parameter. | Available for cloned voices and system voices marked as Instruct-compatible in the voice list. See the corresponding request parameter. |
| Streaming input | Supported | Supported |
| Streaming output | Supported | Supported |
| Rate limit (RPS) | 3 | 3 |
| Connection type | Java/Python SDK, WebSocket API | Java/Python SDK, WebSocket API |
| Price | $0.26/10,000 characters | $0.13/10,000 characters |
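The volume, speech rate, pitch, and format rows above correspond to request parameters on the synthesizer. The sketch below illustrates the idea; the keyword names (volume, speech_rate, pitch_rate) and the AudioFormat value are assumptions based on the Python SDK, so verify the exact names and value ranges against the API reference:

```python
from dashscope.audio.tts_v2 import AudioFormat, SpeechSynthesizer

# The keyword arguments below (volume, speech_rate, pitch_rate) and the
# AudioFormat value are assumptions based on the Python SDK; check the API
# reference for the exact names and ranges.
synthesizer = SpeechSynthesizer(
    model="cosyvoice-v3-flash",
    voice="longxiaochun",                       # example voice; replace as needed
    format=AudioFormat.WAV_24000HZ_MONO_16BIT,  # output format and sample rate
    volume=80,                                  # louder than the default
    speech_rate=1.1,                            # slightly faster than normal
    pitch_rate=1.0,                             # default pitch
)
audio = synthesizer.call("Parameter adjustment example.")
with open("tuned.wav", "wb") as f:
    f.write(audio)
```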
Mainland China
In the Mainland China deployment mode, the endpoint and data storage are both located in the Beijing region. Model inference computing resources are restricted to Mainland China.
| Feature | cosyvoice-v3-plus | cosyvoice-v3-flash | cosyvoice-v2 |
| --- | --- | --- | --- |
| Supported languages | System voices (varies by voice): Chinese (Mandarin, Northeastern, Minnan, Shaanxi), English, Japanese, Korean. Cloned voices: Chinese (Mandarin, Cantonese, Northeastern, Gansu, Guizhou, Henan, Hubei, Jiangxi, Minnan, Ningxia, Shanxi, Shaanxi, Shandong, Shanghai, Sichuan, Tianjin, Yunnan), English, French, German, Japanese, Korean, Russian | System voices (varies by voice): Chinese (Mandarin), English. Cloned voices: same as cosyvoice-v3-plus | System voices (varies by voice): Chinese (Mandarin), English, Korean, Japanese. Cloned voices: Chinese (Mandarin), English |
| Audio formats | pcm, wav, mp3, opus | pcm, wav, mp3, opus | pcm, wav, mp3, opus |
| Audio sample rates | 8 kHz, 16 kHz, 22.05 kHz, 24 kHz, 44.1 kHz, 48 kHz | 8 kHz, 16 kHz, 22.05 kHz, 24 kHz, 44.1 kHz, 48 kHz | 8 kHz, 16 kHz, 22.05 kHz, 24 kHz, 44.1 kHz, 48 kHz |
| Voice cloning | Supported. See CosyVoice Voice Cloning API. Cloning languages: Chinese (Mandarin, Cantonese, Northeastern, Gansu, Guizhou, Henan, Hubei, Jiangxi, Minnan, Ningxia, Shanxi, Shaanxi, Shandong, Shanghai, Sichuan, Tianjin, Yunnan), English, French, German, Japanese, Korean, Russian | Supported. See CosyVoice Voice Cloning API. Cloning languages: same as cosyvoice-v3-plus | Supported. See CosyVoice Voice Cloning API. Cloning languages: Chinese (Mandarin), English |
| SSML | Available for cloned voices and system voices marked as SSML-compatible in the voice list. See Introduction to SSML. | Available for cloned voices and system voices marked as SSML-compatible in the voice list. See Introduction to SSML. | Available for cloned voices and system voices marked as SSML-compatible in the voice list. See Introduction to SSML. |
| LaTeX | Supported. See Convert LaTeX formulas to speech. | Supported. See Convert LaTeX formulas to speech. | Supported. See Convert LaTeX formulas to speech. |
| Volume adjustment | Supported. See the corresponding request parameter. | Supported. See the corresponding request parameter. | Supported. See the corresponding request parameter. |
| Speech rate adjustment | Supported. See the corresponding request parameter (named differently in the Java SDK). | Supported. See the corresponding request parameter (named differently in the Java SDK). | Supported. See the corresponding request parameter (named differently in the Java SDK). |
| Pitch adjustment | Supported. See the corresponding request parameter (named differently in the Java SDK). | Supported. See the corresponding request parameter (named differently in the Java SDK). | Supported. See the corresponding request parameter (named differently in the Java SDK). |
| Bitrate adjustment | Supported only for audio in opus format. See the corresponding request parameter (named differently in the Java SDK). | Supported only for audio in opus format. See the corresponding request parameter (named differently in the Java SDK). | Supported only for audio in opus format. See the corresponding request parameter (named differently in the Java SDK). |
| Timestamp | Disabled by default; can be enabled. Available for cloned voices and system voices marked as timestamp-compatible in the voice list. See the corresponding request parameter (named differently in the Java SDK). | Disabled by default; can be enabled. Available for cloned voices and system voices marked as timestamp-compatible in the voice list. See the corresponding request parameter (named differently in the Java SDK). | Disabled by default; can be enabled. Available for cloned voices and system voices marked as timestamp-compatible in the voice list. See the corresponding request parameter (named differently in the Java SDK). |
| Instruction control (Instruct) | Available for cloned voices and system voices marked as Instruct-compatible in the voice list. See the corresponding request parameter. | Available for cloned voices and system voices marked as Instruct-compatible in the voice list. See the corresponding request parameter. | Available for cloned voices and system voices marked as Instruct-compatible in the voice list. See the corresponding request parameter. |
| Streaming input | Supported | Supported | Supported |
| Streaming output | Supported | Supported | Supported |
| Rate limit (RPS) | 3 | 3 | 3 |
| Connection type | Java/Python SDK, WebSocket API | Java/Python SDK, WebSocket API | Java/Python SDK, WebSocket API |
| Price | $0.286706/10,000 characters | $0.14335/10,000 characters | $0.286706/10,000 characters |
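Streaming input and output, listed in both tables above, let you feed text in fragments (for example, as tokens arrive from an LLM) and receive audio chunks as they are synthesized. A minimal sketch, assuming the Python SDK's callback-based streaming interface (ResultCallback, streaming_call, and streaming_complete are names to verify against the API reference):

```python
from dashscope.audio.tts_v2 import ResultCallback, SpeechSynthesizer

class SaveChunks(ResultCallback):
    """Receives synthesized audio chunks as they are produced."""
    def __init__(self):
        self.file = open("stream.mp3", "wb")
    def on_open(self) -> None:
        pass  # connection to the service established
    def on_data(self, data: bytes) -> None:
        self.file.write(data)  # write (or play) each chunk immediately
    def on_complete(self) -> None:
        print("synthesis complete")
    def on_error(self, message) -> None:
        print("synthesis failed:", message)
    def on_close(self) -> None:
        self.file.close()
    def on_event(self, message) -> None:
        pass  # other service events

synthesizer = SpeechSynthesizer(
    model="cosyvoice-v3-flash",
    voice="longxiaochun",  # example voice; replace as needed
    callback=SaveChunks(),
)

# Text can be sent in fragments, e.g. as tokens arrive from an LLM stream.
for fragment in ["Streaming synthesis ", "produces audio ", "with low latency."]:
    synthesizer.streaming_call(fragment)
synthesizer.streaming_complete()  # flush and wait for the remaining audio
```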
Supported system voices
FAQ
Q: What if speech synthesis mispronounces words? How do I control the pronunciation of homographs?
You can fix pronunciation in either of two ways:
Replace polyphonic characters with homophones to quickly fix pronunciation problems.
Use SSML markup to control pronunciation precisely, as in the sketch below.
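For illustration only, here is a sketch of passing SSML-annotated text to the synthesizer. The <speak> and <phoneme> tags follow generic SSML conventions; the exact tag set and attributes each model accepts are defined in Introduction to SSML:

```python
from dashscope.audio.tts_v2 import SpeechSynthesizer

# Illustrative only: <speak> and <phoneme> follow generic SSML conventions;
# the tags and attributes CosyVoice accepts are defined in Introduction to
# SSML. 乐 is polyphonic (yue4/le4), so its reading is pinned here.
ssml_text = (
    "<speak>"
    '这首<phoneme alphabet="py" ph="yue4 qu3">乐曲</phoneme>非常动听。'
    "</speak>"
)

# Use a model and voice marked as SSML-compatible in the voice list.
synthesizer = SpeechSynthesizer(model="cosyvoice-v2", voice="longxiaochun")
audio = synthesizer.call(ssml_text)
with open("ssml.mp3", "wb") as f:
    f.write(audio)
```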
Q: How do I troubleshoot if audio generated with a cloned voice has no sound?
1. Check voice status
Call the Query a specific voice interface to check whether the voice model's `status` is `OK`.
2. Check model version consistency
Ensure that the `target_model` parameter used for voice cloning is identical to the `model` parameter used for speech synthesis. For example, if you use cosyvoice-v3-plus for cloning, you must also use cosyvoice-v3-plus for synthesis.
3. Verify source audio quality
Verify that the source audio used for voice cloning meets the audio requirements:
Audio duration: 10-20 seconds
Clear sound quality
No background noise
4. Check request parameters
Confirm that the speech synthesis request parameter `voice` is set to the cloned voice ID.
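A sketch of this checklist in code, assuming the Python SDK exposes the voice cloning API as a VoiceEnrollmentService (the class and method names are assumptions to verify against the CosyVoice Voice Cloning API reference):

```python
from dashscope.audio.tts_v2 import SpeechSynthesizer, VoiceEnrollmentService

TARGET_MODEL = "cosyvoice-v3-plus"  # must match between cloning and synthesis
VOICE_ID = "your-cloned-voice-id"   # placeholder for the ID returned by cloning

# 1. Check voice status: the voice should report OK before it is used.
#    query_voice() is an assumed method name; see the voice cloning API docs.
service = VoiceEnrollmentService()
print("voice status:", service.query_voice(VOICE_ID))

# 2 and 4. Synthesize with the same model as target_model, and pass the
# cloned voice ID through the voice parameter.
synthesizer = SpeechSynthesizer(model=TARGET_MODEL, voice=VOICE_ID)
audio = synthesizer.call("Testing the cloned voice.")
if audio:
    with open("cloned.mp3", "wb") as f:
        f.write(audio)
```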
Q: How do I handle unstable synthesis effects or incomplete speech after voice cloning?
If the synthesized speech after voice cloning has these issues:
Incomplete speech playback, where only part of the text is read.
Unstable synthesis effect or inconsistent quality.
Speech contains abnormal pauses or silent segments.
Possible reason: The source audio quality does not meet requirements.
Solution: Check whether the source audio meets these requirements. We recommend re-recording by following the Recording guide.
Check audio continuity: Ensure continuous speech content in the source audio. Avoid long pauses or silent segments (over 2 seconds). If the audio contains obvious blank segments, the model may interpret silence or noise as part of the voice characteristics, affecting the generation quality.
Check speech activity ratio: Ensure effective speech accounts for more than 60% of the total audio duration. Excessive background noise or non-speech segments can interfere with voice characteristic extraction.
Verify audio quality details:
Audio duration: 10-20 seconds (15 seconds recommended)
Clear pronunciation, stable speech rate
No background noise, echo, or static
Concentrated speech energy, no long silent segments
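The duration, speech-ratio, and silence checks above can be approximated in code before you submit a recording. A rough sketch using only the Python standard library; the energy threshold is an arbitrary illustration to tune per recording setup, while the 10-20 second, 60%, and 2-second limits come from the guidelines above:

```python
import audioop  # stdlib; deprecated in Python 3.11 and removed in 3.13
import wave

FRAME_MS = 30           # analysis window length
ENERGY_THRESHOLD = 500  # arbitrary RMS threshold; tune for your recordings

def check_source_audio(path: str) -> None:
    """Rough duration, speech-ratio, and silence checks for a mono WAV file."""
    with wave.open(path, "rb") as wav:
        rate, width = wav.getframerate(), wav.getsampwidth()
        frames_per_window = rate * FRAME_MS // 1000
        duration = wav.getnframes() / rate
        voiced = windows = 0
        silence = longest_silence = 0.0
        while chunk := wav.readframes(frames_per_window):
            windows += 1
            if audioop.rms(chunk, width) >= ENERGY_THRESHOLD:
                voiced += 1
                silence = 0.0
            else:
                silence += FRAME_MS / 1000
                longest_silence = max(longest_silence, silence)
    print(f"duration: {duration:.1f}s (target: 10-20 s, ideally ~15 s)")
    print(f"speech ratio: {voiced / max(windows, 1):.0%} (target: >60%)")
    print(f"longest silence: {longest_silence:.1f}s (target: <2 s)")
```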