This topic describes the parameters and interface details of the CosyVoice Python SDK for speech synthesis.
To use a model in the China (Beijing) region, obtain your API key from the API key page for the China (Beijing) region.
User guide: For more information about the models and guidance on model selection, see Real-time speech synthesis - CosyVoice/Sambert.
Prerequisites
You have activated Model Studio and created an API key. To prevent security risks, export the API key as an environment variable instead of hard-coding it in your code.
NoteTo grant temporary access permissions to third-party applications or users, or if you want to strictly control high-risk operations such as accessing or deleting sensitive data, we recommend that you use a temporary authentication token.
Compared with long-term API keys, temporary authentication tokens are more secure because they are short-lived (60 seconds). They are suitable for temporary call scenarios and can effectively reduce the risk of API key leakage.
To use a temporary token, replace the API key used for authentication in your code with the temporary authentication token.
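The examples in this topic assume that your API key is available to the SDK, typically through the DASHSCOPE_API_KEY environment variable, which the DashScope SDK usually reads automatically. The minimal sketch below shows an alternative in which the key is read from a custom environment variable (MY_DASHSCOPE_API_KEY is an example name) and assigned in code:

import os
import dashscope

# If DASHSCOPE_API_KEY is already exported, the SDK picks it up automatically
# and this assignment is unnecessary. The variable name below is an example.
dashscope.api_key = os.getenv("MY_DASHSCOPE_API_KEY")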
Models and pricing
Model | Price | Free quota (Note) |
cosyvoice-v3-plus | $0.286706 per 10,000 characters | No free quota |
cosyvoice-v3-flash | $0.14335 per 10,000 characters | |
cosyvoice-v2 | $0.286706 per 10,000 characters | |
Text and format limitations
Text length limits
For a non-streaming call or a unidirectional streaming call, the text in a single request cannot exceed 20,000 characters.
For a bidirectional streaming call, the text in a single request cannot exceed 20,000 characters. The cumulative length of all text sent cannot exceed 200,000 characters.
Character counting rules
A Chinese character, including simplified and traditional Chinese characters, Japanese kanji, and Korean hanja, is counted as 2 characters. All other characters, such as punctuation marks, letters, numbers, Japanese kana, and Korean hangul, are counted as 1 character each.
SSML tags are not included in the text length calculation.
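As a convenience, the following Python sketch (not part of the SDK) estimates the billable character count locally under these rules. The CJK ideograph ranges and the SSML-stripping regex are approximations; the server-side count is authoritative. The worked examples below can be reproduced with it.

import re

# Rough CJK ideograph ranges (simplified/traditional Chinese, Japanese kanji, Korean hanja).
CJK_IDEOGRAPH = re.compile(r"[\u4e00-\u9fff\u3400-\u4dbf\uf900-\ufaff]")
SSML_TAG = re.compile(r"<[^>]+>")

def estimate_billable_characters(text: str) -> int:
    text = SSML_TAG.sub("", text)  # SSML tags are not counted
    return sum(2 if CJK_IDEOGRAPH.match(ch) else 1 for ch in text)

print(estimate_billable_characters("中A文123"))            # 8
print(estimate_billable_characters("<speak>你好</speak>"))  # 4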
Examples:
"你好" → 2(你) + 2(好) = 4 characters
"中A文123" → 2(中) + 1(A) + 2(文) + 1(1) + 1(2) + 1(3) = 8 characters
"中文。" → 2(中) + 2(文) + 1(。) = 5 characters
"中 文。" → 2(中) + 1(space) + 2(文) + 1(。) = 6 characters
"<speak>你好</speak>" → 2(你) + 2(好) = 4 characters
Encoding format
Use UTF-8 encoding.
Support for mathematical expressions
The mathematical expression parsing feature is currently available only for the cosyvoice-v2, cosyvoice-v3-flash, and cosyvoice-v3-plus models. This feature supports common mathematical expressions from primary and secondary school, such as basic arithmetic, algebra, and geometry.
For more information, see LaTeX Formula to Speech.
SSML support
The Speech Synthesis Markup Language (SSML) feature is currently available only for cloned voices of the cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 models, and for system voices that are indicated as supported in the voice list. The following conditions must be met:
You must use DashScope SDK version 1.23.4 or later.
This feature supports only non-streaming calls and unidirectional streaming calls (using the call method of the SpeechSynthesizer class). It does not support bidirectional streaming calls (using the streaming_call method of the SpeechSynthesizer class).
The usage is the same as for standard speech synthesis. Pass the text that contains SSML to the call method of the SpeechSynthesizer class.
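As a hedged illustration of this usage, the sketch below passes SSML text to the call method. The model and voice are example values and the voice must support SSML; the <break> tag is one common SSML element, see Introduction to SSML for the tags the service supports.

from dashscope.audio.tts_v2 import SpeechSynthesizer

ssml_text = '<speak>你好,<break time="500ms"/>欢迎使用语音合成服务。</speak>'

# Example values: use a model and a cloned or system voice that supports SSML.
synthesizer = SpeechSynthesizer(model="cosyvoice-v2", voice="longxiaochun_v2")
audio = synthesizer.call(ssml_text)
with open("ssml_output.mp3", "wb") as f:
    f.write(audio)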
Getting started
The SpeechSynthesizer class is the primary class for speech synthesis and supports the following invocation methods:
Non-streaming call: A blocking call that sends the complete text at once and returns the complete audio directly. This method is suitable for short text synthesis scenarios.
Unidirectional streaming call: A non-blocking call that sends the complete text at once and uses a callback function to receive audio data, which may be delivered in chunks. This method is suitable for short text synthesis scenarios that require low latency.
Bidirectional streaming call: A non-blocking call that sends text in fragments and uses a callback function to receive the synthesized audio stream incrementally in real time. This method is suitable for long text synthesis scenarios that require low latency.
Non-streaming call
This method submits a single speech synthesis task without using a callback function. The synthesis does not stream intermediate results. Instead, the complete result is returned at once.
You can instantiate the SpeechSynthesizer class, attach the request parameters, and call the call method to synthesize the text and retrieve the binary audio data.
The text that you send cannot be longer than 2,000 characters. For more information, see the call method of the SpeechSynthesizer class.
Before each call to the call method, you must create a new SpeechSynthesizer instance.
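A minimal non-streaming sketch might look like the following. The model, voice, and output file name are example values; replace them with the model and voice you are entitled to use.

from dashscope.audio.tts_v2 import SpeechSynthesizer

# Example values: replace the model and voice with your own choices.
synthesizer = SpeechSynthesizer(model="cosyvoice-v2", voice="longxiaochun_v2")

# call() blocks and returns the complete audio as binary data.
# This sketch assumes MP3 output; set the format parameter explicitly if needed.
audio = synthesizer.call("今天天气怎么样?")
with open("output.mp3", "wb") as f:
    f.write(audio)

print("request id:", synthesizer.get_last_request_id())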
Unidirectional streaming call
This method submits a single speech synthesis task. The synthesis results are streamed back in real time through the callback methods of the ResultCallback interface.
You can instantiate the SpeechSynthesizer class, attach the request parameters and the ResultCallback interface, and call the call method to perform the synthesis. You can then retrieve the real-time synthesis results through the on_data method of the ResultCallback interface.
The length of the text to send cannot exceed 2,000 characters. For more information, see the call method of the SpeechSynthesizer class.
Before each call to the call method, you must create a new SpeechSynthesizer instance.
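A minimal unidirectional streaming sketch, using the same example model and voice values as above, might look like this. Audio chunks arrive through on_data and are appended to a file here.

from dashscope.audio.tts_v2 import SpeechSynthesizer, ResultCallback

class Callback(ResultCallback):
    def on_open(self):
        # Called when the connection to the server is established.
        self.file = open("output.mp3", "wb")

    def on_data(self, data: bytes) -> None:
        # Called each time the server returns a chunk of synthesized audio.
        self.file.write(data)

    def on_complete(self):
        print("synthesis complete")
        self.file.close()

    def on_error(self, message: str):
        print("synthesis failed:", message)

    def on_event(self, message: str):
        pass

    def on_close(self):
        pass

# Example values: replace the model and voice with your own choices.
synthesizer = SpeechSynthesizer(model="cosyvoice-v2", voice="longxiaochun_v2", callback=Callback())
synthesizer.call("今天天气怎么样?")  # audio is delivered through on_data instead of the return value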
Bidirectional streaming call
This method lets you submit text in multiple parts within a single speech synthesis task and receive the synthesis results in real time through a callback.
To stream input, call the streaming_call method multiple times to submit text fragments in order. The server automatically segments the text fragments into sentences after it receives them:
Complete sentences are synthesized immediately.
Incomplete sentences are buffered and synthesized after they are complete.
When you call the streaming_complete method, the server synthesizes all received but unprocessed text fragments, including incomplete sentences.
The interval between sending text fragments cannot exceed 23 seconds. Otherwise, a "request timeout after 23 seconds" exception occurs.
If you have no more text to send, you must call the streaming_complete method to end the task.
The server enforces a 23-second timeout. The client cannot modify this configuration.
Instantiate the SpeechSynthesizer class and attach the request parameters and the ResultCallback callback interface.
Streaming data
Stream data by calling the streaming_call method of the SpeechSynthesizer class multiple times. This sends the text to be synthesized to the server in segments.
While you send text, the server uses the on_data method of the ResultCallback interface to return the synthesized results to the client in real time.
The length of the text segment (the text parameter) sent in each call to the streaming_call method cannot exceed 2,000 characters. The cumulative length of all text that you send cannot exceed 200,000 characters.
Complete the processing
End the process by calling the streaming_complete method of the SpeechSynthesizer class to end the speech synthesis task.
This method blocks the current thread until the on_complete or on_error method of the ResultCallback interface is triggered.
You must call this method. Otherwise, the end of the text may not be successfully synthesized.
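Putting these steps together, a bidirectional streaming sketch (again using example model and voice values) might look like this:

import time
from dashscope.audio.tts_v2 import SpeechSynthesizer, ResultCallback

class Callback(ResultCallback):
    def on_open(self):
        self.file = open("output.mp3", "wb")

    def on_data(self, data: bytes) -> None:
        self.file.write(data)

    def on_complete(self):
        print("synthesis complete")
        self.file.close()

    def on_error(self, message: str):
        print("synthesis failed:", message)

    def on_event(self, message: str):
        pass

    def on_close(self):
        pass

text_fragments = ["流式文本语音合成,", "可以将输入的文本", "合成为语音二进制数据。"]

# Example values: replace the model and voice with your own choices.
synthesizer = SpeechSynthesizer(model="cosyvoice-v2", voice="longxiaochun_v2", callback=Callback())

for fragment in text_fragments:
    synthesizer.streaming_call(fragment)  # send text fragments in order
    time.sleep(0.1)

# Required: synthesizes any buffered text and blocks until on_complete or on_error fires.
synthesizer.streaming_complete()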
Request parameters
You can set the request parameters in the constructor of the SpeechSynthesizer class.
Parameter | Type | Required | Description |
model | str | Yes | The speech synthesis model. Different models require corresponding voices:
|
voice | str | Yes | The voice to use for speech synthesis. System voices and cloned voices are supported:
|
format | enum | No | Specifies the audio encoding format and sample rate. If this parameter is not specified, the default format and sample rate are used. Note The default sample rate is the optimal rate for the current voice. By default, the output uses this sample rate. Downsampling and upsampling are also supported. The following audio encoding formats and sample rates can be specified:
|
volume | int | No | The volume. Default value: 50. Value range: [0, 100]. A value of 50 is the standard volume. The volume has a linear relationship with this value. 0 is mute and 100 is the maximum volume. Important This field differs in various versions of the DashScope SDK:
|
speech_rate | float | No | The speech rate. Default value: 1.0. Value range: [0.5, 2.0]. A value of 1.0 is the standard rate. Values less than 1.0 slow down the speech, and values greater than 1.0 speed it up. |
pitch_rate | float | No | The pitch. This value is a multiplier for pitch adjustment. The relationship between this value and the perceived pitch is not strictly linear or logarithmic. Test different values to find the best one. Default value: 1.0. Value range: [0.5, 2.0]. A value of 1.0 is the natural pitch of the voice. Values greater than 1.0 raise the pitch, and values less than 1.0 lower it. |
bit_rate | int | No | The audio bitrate in kbps. If the audio format is Opus, you can adjust the bitrate using this parameter. Default value: 32. Value range: [6, 510]. Note
|
word_timestamp_enabled | bool | No | Specifies whether to enable word-level timestamps. Default value: False.
This feature applies only to cloned voices of the cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 models, and system voices in the Voice list that are marked as supported. Timestamp results can only be retrieved through the callback interface. Note
|
seed | int | No | The random number seed used during generation, which varies the synthesis effect. If the model version, text, voice, and other parameters are the same, using the same seed reproduces the same synthesis result. Default value: 0. Value range: [0, 65535]. |
language_hints | list[str] | No | Specifies the target language for speech synthesis to improve the synthesis effect. Use this parameter when the pronunciation of numbers, abbreviations, or symbols, or when the synthesis effect for non-Chinese languages, does not meet expectations. For example:
Valid values:
Note: Although this parameter is an array, the current version processes only the first element. Therefore, you must pass only one value. Important This parameter specifies the target language for speech synthesis. This setting is independent of the language of the sample audio used for voice cloning. To set the source language for a voice cloning task, see CosyVoice voice cloning API. |
instruction | str | No | Sets an instruction to guide speech synthesis. This feature is available only for cloned voices of the cosyvoice-v3-flash and cosyvoice-v3-plus models, and system voices marked as supported in the Voice List. No default value. This parameter has no effect if it is not set. The instruction can influence speech synthesis in the following ways:
|
enable_aigc_tag | bool | No | Specifies whether to add an invisible AIGC identifier to the generated audio. If set to True, the invisible identifier is embedded into the audio for supported formats (WAV, MP3, and Opus). Default value: False. This feature is supported only by the cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 models. Note
|
aigc_propagator | str | No | Sets the Default value: Alibaba Cloud UID. This feature is supported only by the cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 models. Note
|
aigc_propagate_id | str | No | Sets the Default value: The Request ID of the current speech synthesis request. This feature is supported only by the cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 models. Note
|
callback | ResultCallback | No | The callback interface instance. Set this parameter to receive synthesis results through callbacks for unidirectional and bidirectional streaming calls.
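As an illustration of how these parameters are passed, the following sketch sets several of them in the constructor. The model, voice, and AudioFormat member are example values; verify the enum members available in your SDK version.

from dashscope.audio.tts_v2 import SpeechSynthesizer, AudioFormat

# Example values throughout: adjust model, voice, and format to your needs.
synthesizer = SpeechSynthesizer(
    model="cosyvoice-v2",
    voice="longxiaochun_v2",
    format=AudioFormat.WAV_22050HZ_MONO_16BIT,  # example enum member; check your SDK version
    volume=50,          # [0, 100], 50 is the standard volume
    speech_rate=1.0,    # [0.5, 2.0], 1.0 is the standard rate
    pitch_rate=1.0,     # [0.5, 2.0], 1.0 is the natural pitch
)
audio = synthesizer.call("你好,欢迎使用语音合成服务。")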
Key interfaces
SpeechSynthesizer class
The SpeechSynthesizer class is the main interface for speech synthesis. You can import this class using from dashscope.audio.tts_v2 import *.
Method | Parameters | Return value | Description |
|
| Returns binary audio data if no ResultCallback is set; otherwise, returns None and the audio is delivered through the callback. | Transforms an entire segment of text into speech. The text can be plain text or contain SSML. When you create a
Important Before each call to the |
|
| None | Streams the text to synthesize. Text that contains SSML is not supported. You can call this interface multiple times to send the text to be synthesized to the server in segments. The synthesis result is retrieved through the For more information, see Bidirectional streaming call. |
|
| None | Ends the streaming speech synthesis. This method blocks the current thread for the duration specified by By default, the wait stops if the wait time exceeds 10 minutes. For more information, see Bidirectional streaming call. Important When making bidirectional streaming calls, call this method. Otherwise, the synthesized speech may be incomplete. |
| None | The request ID of the last task. | Gets the request ID of the last task. |
| None | First-packet latency | Gets the first-packet latency. The latency is typically about 500 ms. First-packet latency is the time in milliseconds from when you send the text to when you receive the first audio packet. Check the latency after the task is complete. When you send text for the first time, a WebSocket connection must be established. Therefore, the first-packet latency includes the time required to establish the connection. |
| None | The last message | Gets the last message, which is in JSON format. You can use this to retrieve task-failed errors. |
Callback interface (ResultCallback)
For a unidirectional streaming call or a bidirectional streaming call, the server returns key process information and data to the client through a callback. You must implement the callback methods to process the returned information and data.
You can import it using from dashscope.audio.tts_v2 import *.
Method | Parameters | Return value | Description |
| None | None | This method is called immediately after a connection is established with the server. |
|
| None | This method is called when the service sends a response. The |
| None | None | This method is called after all synthesized data is returned and the speech synthesis is complete. |
|
| None | This method is called when an exception occurs. |
|
| None | This method is called when the server returns synthesized audio. Combine the binary audio data into a complete audio file for playback, or play it in real time with a player that supports streaming playback. Important
|
| None | None | This method is called after the service has closed the connection. |
Response
The server returns binary audio data:
Non-streaming call: You can process the binary audio data returned by the call method of the SpeechSynthesizer class.
Unidirectional streaming call or bidirectional streaming call: You can process the parameter (byte data) of the on_data method of the ResultCallback callback interface.
Error codes
For troubleshooting information, see Error messages.
More examples
For more examples, see GitHub.
FAQ
Features, billing, and rate limiting
Q: What can I do to fix inaccurate pronunciation?
You can use SSML to customize the speech synthesis output.
Q: Speech synthesis is billed based on the number of text characters. How can I view or obtain the text length for each synthesis?
This depends on whether logging is enabled:
Logging is disabled.
For a non-streaming call, you can calculate the number of characters according to the character counting rules.
Alternatively, you can retrieve the information from the message parameter of the on_event method of the ResultCallback callback interface. The message is a JSON string that you can parse to retrieve the number of billable characters for the current request from the characters parameter. Use the last message that you receive.
Logging is enabled.
If logging is enabled, the console prints a log that contains the characters parameter. This parameter indicates the number of billable characters for the request. Use the value from the last log entry for the request.
2025-08-27 11:02:09,429 - dashscope - speech_synthesizer.py - on_message - 454 - DEBUG - <<<recv {"header":{"task_id":"62ebb7d6cb0a4080868f0edb######","event":"result-generated","attributes":{}},"payload":{"output":{"sentence":{"words":[]}},"usage":{"characters":15}}}
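For example, inside the on_event method of your ResultCallback implementation you could parse the message JSON and read the characters value. The field path below follows the log line above; verify it against the messages you actually receive.

import json

def billable_characters(message: str):
    # Parse the `characters` value from an on_event message (a JSON string).
    payload = json.loads(message).get("payload", {})
    usage = payload.get("usage") or {}
    return usage.get("characters")

# In your ResultCallback subclass, keep the value from the last message received:
#     def on_event(self, message):
#         chars = billable_characters(message)
#         if chars is not None:
#             self.characters = chars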
Troubleshooting
If you encounter a code error, refer to Error codes to troubleshoot the issue.
Q: How do I get the Request ID?
You can retrieve it in one of the following two ways:
Parse the message JSON string in the on_event method of the ResultCallback callback interface.
Call the get_last_request_id method of the SpeechSynthesizer class.
Q: Why does the SSML feature fail?
Check the following:
Ensure that the model and voice you use are within the supported scope for SSML.
Ensure that you have installed the latest version of the DashScope SDK.
Ensure that you are using the correct interface. SSML is supported only by the call method of the SpeechSynthesizer class.
Ensure that the text for synthesis is in plain text and meets the required format. For more information, see Introduction to SSML.
Q: Why can't the audio be played?
Troubleshoot this issue based on the following scenarios:
The audio is saved as a complete file, such as an .mp3 file.
Audio format consistency: Ensure that the audio format specified in the request parameters matches the file extension. For example, playback might fail if the audio format is set to WAV in the request parameters but the file has an .mp3 extension.
Player compatibility: Confirm that your player supports the format and sample rate of the audio file. For example, some players might not support high sample rates or specific audio encodings.
The audio is played in streaming mode.
Save the audio stream as a complete file and try to play it. If the file fails to play, see the troubleshooting steps for the first scenario.
If the file plays correctly, the issue might be with the streaming playback implementation. Confirm that your player supports streaming playback.
Common tools and libraries that support streaming playback include ffmpeg, pyaudio (Python), AudioFormat (Java), and MediaSource (JavaScript).
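For instance, with pyaudio you can play raw PCM chunks as they arrive. This sketch assumes you requested 22050 Hz, 16-bit mono PCM audio from the service; the sample rate is an example and must match the format you set in the request parameters.

import pyaudio

# Open a PCM output stream that matches the audio format requested from the service.
player = pyaudio.PyAudio()
stream = player.open(format=pyaudio.paInt16, channels=1, rate=22050, output=True)

# In your ResultCallback subclass, write each chunk to the stream as it arrives:
#     def on_data(self, data: bytes) -> None:
#         stream.write(data)

# When synthesis is complete, release the audio device.
stream.stop_stream()
stream.close()
player.terminate()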
Q: Why does the audio playback stutter?
Troubleshoot this issue based on the following scenarios:
Check the text sending speed: Ensure that the text sending interval is reasonable. Avoid delays in sending the next text segment after the audio for the previous segment has finished playing.
Check the callback function performance:
Check whether the callback function contains excessive business logic that could cause it to block.
The callback function runs in the WebSocket thread. If this thread is blocked, it can interfere with the WebSocket's ability to receive network packets, resulting in audio stuttering.
To avoid blocking the WebSocket thread, write the audio data to a separate audio buffer and then use another thread to read and process it, as shown in the sketch after this list.
Check network stability: Ensure that your network connection is stable to prevent audio transmission interruptions or delays caused by network fluctuations.
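One possible decoupling pattern is sketched below with a standard-library queue and a worker thread; play_chunk is a placeholder for your own playback logic.

import queue
import threading

audio_buffer = queue.Queue()

def play_chunk(chunk: bytes) -> None:
    # Placeholder: replace with a real player, such as writing to a pyaudio stream.
    pass

def playback_worker():
    # Runs in its own thread so the WebSocket callback thread is never blocked.
    while True:
        chunk = audio_buffer.get()
        if chunk is None:  # sentinel pushed from on_complete
            break
        play_chunk(chunk)

threading.Thread(target=playback_worker, daemon=True).start()

# In your ResultCallback subclass:
#     def on_data(self, data: bytes) -> None:
#         audio_buffer.put(data)   # cheap and non-blocking
#     def on_complete(self):
#         audio_buffer.put(None)   # signal the worker to stop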
Q: Why is speech synthesis slow (long synthesis time)?
Perform the following troubleshooting steps:
Check the input interval
If you are using streaming speech synthesis, check whether the text sending interval is too long. For example, a delay of several seconds before sending the next segment will increase the total synthesis time.
Analyze performance metrics
First packet delay: This is typically around 500 ms.
Real-Time Factor (RTF): This is calculated as Total Synthesis Time / Audio Duration. The RTF is normally less than 1.0.
Q: How do I handle incorrect pronunciation in the synthesized speech?
Use the <phoneme> tag of SSML to specify the correct pronunciation.
Q: Why is no speech returned? Why is the end of the text not successfully converted to speech? (Missing synthesized speech)
Check whether you called the streaming_complete method of the SpeechSynthesizer class. The server caches text and begins synthesis only after it has received enough text. If you do not call the streaming_complete method, the text remaining in the cache may not be synthesized.
Q: How do I handle an SSL certificate verification failure?
Install the system root certificate.
sudo yum install -y ca-certificates
sudo update-ca-trust enable
Add the following content to your code.
import os
os.environ["SSL_CERT_FILE"] = "/etc/ssl/certs/ca-bundle.crt"
Q: What causes the "SSL: CERTIFICATE_VERIFY_FAILED" exception on macOS? (websocket closed due to [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1000))
When connecting to a WebSocket, you may encounter an OpenSSL certificate verification failure with a message indicating that the certificate cannot be found. This usually occurs because of an incorrect certificate configuration in the Python environment. Follow these steps to manually locate and fix the certificate issue:
Export the system certificate and set the environment variable. Run the following commands to export all certificates from your macOS system to a file and set this file as the default certificate path for Python and its related libraries:
security find-certificate -a -p > ~/all_mac_certs.pem
export SSL_CERT_FILE=~/all_mac_certs.pem
export REQUESTS_CA_BUNDLE=~/all_mac_certs.pem
Create a symbolic link to fix Python's OpenSSL configuration. If Python's OpenSSL configuration is missing certificates, run the following command to create a symbolic link. Make sure to replace the path in the command with the actual installation path of your local Python version:
# 3.9 is a sample version number. Adjust the path according to your locally installed Python version.
ln -s /etc/ssl/* /Library/Frameworks/Python.framework/Versions/3.9/etc/openssl
Restart the terminal and clear the cache. After you complete the preceding steps, close and reopen the terminal to ensure that the environment variables take effect. Clear any cache that might exist and try to connect to the WebSocket again.
These steps should resolve connection issues caused by incorrect certificate configurations. If the problem persists, check whether the certificate configuration on the target server is correct.
Q: What causes the "AttributeError: module 'websocket' has no attribute 'WebSocketApp'. Did you mean: 'WebSocket'?" error when running the code?
This error occurs because the websocket-client package is not installed or its version is incompatible. Run the following commands to resolve the issue:
pip uninstall websocket-client
pip uninstall websocket
pip install websocket-client
Permissions and authentication
Q: I want my API key to be used only for the CosyVoice speech synthesis service, not for other Model Studio models (permission isolation). What should I do?
You can create a workspace and authorize only specific models to limit the scope of the API key. For more information, see Manage workspaces.
More questions
For more information, see the Q&A on GitHub.