This topic describes the parameters and interfaces of the CosyVoice speech synthesis Python SDK.
User guide: For model overviews and selection suggestions, see Real-time speech synthesis - CosyVoice.
Prerequisites
- You have activated Model Studio and created an API key. Export the API key as an environment variable rather than hard-coding it, to reduce security risks.
Note: For temporary access or strict control over high-risk operations (such as accessing or deleting sensitive data), use a temporary authentication token instead. Compared with long-term API keys, temporary tokens are more secure (60-second lifespan) and reduce the risk of API key leakage. To use one, replace the API key in your authentication code with the temporary authentication token.
Models and pricing
Text and format limitations
Text length limits
- Non-streaming or unidirectional streaming calls: The text length per request must not exceed 20,000 characters.
- Bidirectional streaming calls: The text length per request must not exceed 20,000 characters, and the cumulative text length across all requests must not exceed 200,000 characters.
Character counting rules
- Chinese characters (simplified and traditional Chinese, Japanese Kanji, and Korean Hanja) count as two characters each. All other characters (punctuation, letters, digits, Kana, and Hangul) count as one.
- SSML tags are not included when calculating the text length.
- Examples:
  - "你好" → 2 (Chinese character) + 2 (Chinese character) = 4 characters
  - "中A文123" → 2 (Chinese character) + 1 (A) + 2 (Chinese character) + 1 (1) + 1 (2) + 1 (3) = 8 characters
  - "中文。" → 2 (Chinese character) + 2 (Chinese character) + 1 (。) = 5 characters
  - "中 文。" → 2 (Chinese character) + 1 (space) + 2 (Chinese character) + 1 (。) = 6 characters
  - "<speak>你好</speak>" → 2 (Chinese character) + 2 (Chinese character) = 4 characters
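As a rough sketch, the counting rules above can be reproduced in Python (assumptions: only the basic CJK Unified Ideographs block is treated as a "Chinese character", and SSML tags are stripped with a simple regex; the service's actual implementation may differ in edge cases):

```python
import re

def billed_length(text: str) -> int:
    # SSML tags are excluded from the count.
    text = re.sub(r"<[^>]+>", "", text)
    # CJK ideographs (Chinese characters, Kanji, Hanja) count as 2;
    # everything else (letters, digits, punctuation, Kana, Hangul) as 1.
    return sum(2 if "\u4e00" <= ch <= "\u9fff" else 1 for ch in text)

print(billed_length("你好"))                 # 4
print(billed_length("中A文123"))             # 8
print(billed_length("<speak>你好</speak>"))  # 4
```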
Encoding format
Use UTF-8 encoding.
Support for mathematical expressions
Mathematical expression parsing (v3.5-flash, v3.5-plus, v3-flash, v3-plus, and v2 only): Supports primary and secondary school mathematics, including basic operations, algebra, and geometry.
This feature only supports Chinese.
See Convert LaTeX formulas to speech (Chinese language only).
SSML support
SSML is available for custom voices (voice design or cloning) with v3.5-flash, v3.5-plus, v3-flash, v3-plus, and v2, and for system voices marked as supported in the voice list. Requirements:
- Use DashScope SDK version 1.23.4 or later.
- Non-streaming and unidirectional streaming calls (the call method of the SpeechSynthesizer class) are supported; bidirectional streaming calls (the streaming_call method of the SpeechSynthesizer class) are not supported.
- Pass SSML-formatted text to the call method of the SpeechSynthesizer class, just as for regular speech synthesis.
Getting started
The SpeechSynthesizer class provides core speech synthesis interfaces and supports the following invocation methods:
- Non-streaming: A blocking call that sends the full text at once and returns the complete audio. Suitable for short text.
- Unidirectional streaming: A non-blocking call that sends the full text at once and receives audio via callback. Suitable for short text with low latency.
- Bidirectional streaming: A non-blocking call that sends text fragments incrementally and receives audio via callback in real time. Suitable for long text with low latency.
Non-streaming
Submit a single speech synthesis task and receive the complete audio result in one response (no streaming, no callbacks).
Instantiate the SpeechSynthesizer class, bind request parameters, and call the call method to synthesize and retrieve binary audio data.
The text length must not exceed 20,000 characters.
Before each call to the call method, you must re-initialize the SpeechSynthesizer instance.
Unidirectional streaming
Submit a single speech synthesis task and stream results in real time through the ResultCallback interface.
Instantiate the SpeechSynthesizer class, bind request parameters and the callback interface (ResultCallback), and call the call method to synthesize and retrieve results in real time through the on_data method of the callback interface (ResultCallback).
The text length must not exceed 20,000 characters.
Before each call to the call method, you must re-initialize the SpeechSynthesizer instance.
Bidirectional streaming
Submit text in multiple parts within a single speech synthesis task and receive synthesis results in real time through callbacks.
- During streaming input, call streaming_call multiple times to submit text fragments in order. After receiving a fragment, the server automatically splits it into sentences:
  - Complete sentences are synthesized immediately.
  - Incomplete sentences are cached until they are complete, then synthesized.
  When you call streaming_complete, the server forces synthesis of all received but unprocessed fragments, including incomplete sentences.
- The interval between text fragment submissions must not exceed 23 seconds; otherwise, the system throws a "request timeout after 23 seconds" error. If no more text remains to send, call streaming_complete promptly to end the task. The server enforces a fixed 23-second timeout that clients cannot modify.
- Instantiate the SpeechSynthesizer class: Create a SpeechSynthesizer instance and bind the request parameters and the callback interface (ResultCallback).
- Streaming: Call the streaming_call method of the SpeechSynthesizer class multiple times to submit text fragments to the server in parts. While you send text, the server returns synthesis results to the client in real time through the on_data method of the callback interface (ResultCallback). Each text fragment (the text parameter) sent via streaming_call must not exceed 20,000 characters, and the cumulative text length across all fragments must not exceed 200,000 characters.
- End processing: Call the streaming_complete method of the SpeechSynthesizer class to end speech synthesis. This method blocks the current thread until the on_complete or on_error callback of the callback interface (ResultCallback) triggers. Always call this method; otherwise, trailing text may fail to convert to speech.
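The sentence splitting and caching behavior described above can be illustrated with a small local simulation (an explanatory sketch only; the class and method names here are hypothetical, and the real splitting logic runs server-side):

```python
SENTENCE_ENDINGS = "。！？.!?"

class SentenceBuffer:
    """Local stand-in for the server-side cache: complete sentences
    are emitted immediately, incomplete ones are held until finished."""

    def __init__(self):
        self._cache = ""

    def feed(self, fragment: str) -> list:
        # Analogous to streaming_call: append the fragment, then emit
        # every complete sentence accumulated so far.
        self._cache += fragment
        sentences, start = [], 0
        for i, ch in enumerate(self._cache):
            if ch in SENTENCE_ENDINGS:
                sentences.append(self._cache[start:i + 1])
                start = i + 1
        self._cache = self._cache[start:]
        return sentences

    def flush(self) -> list:
        # Analogous to streaming_complete: force out whatever remains,
        # even if it is not a complete sentence.
        remainder, self._cache = self._cache, ""
        return [remainder] if remainder else []

buf = SentenceBuffer()
print(buf.feed("今天天气"))    # [] (incomplete, cached)
print(buf.feed("很好。明天"))  # ['今天天气很好。']
print(buf.flush())             # ['明天']
```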
Request parameters
Set request parameters using the constructor of the SpeechSynthesizer class.
|
Parameter |
Type |
Required |
Description |
|
model |
str |
Yes |
Speech synthesis model. Each model version requires compatible voices:
|
|
voice |
str |
Yes |
The voice used for speech synthesis. Supported voice types:
|
|
format |
enum |
No |
Audio encoding format and sample rate. If you do not specify this parameter, the default format and sample rate of the selected voice are used. Note
The default sample rate represents the optimal rate for the selected voice. Output uses this rate by default, but downsampling and upsampling are supported. Supported audio encoding formats and sample rates include the following:
|
|
volume |
int |
No |
The volume. Default: 50. Valid range: [0, 100]. Values scale linearly: 0 is silent, 50 is the default, and 100 is the maximum. Important
This field differs across DashScope SDK versions:
|
|
speech_rate |
float |
No |
The speech rate. Default value: 1.0. Valid values: [0.5, 2.0]. A value of 1.0 is the standard speech rate. A value less than 1.0 slows down the speech, and a value greater than 1.0 speeds it up. |
|
pitch_rate |
float |
No |
Pitch multiplier. The relationship to perceived pitch is neither linear nor logarithmic—test to find suitable values. Default value: 1.0. Valid values: [0.5, 2.0]. A value of 1.0 is the natural pitch of the voice. A value greater than 1.0 raises the pitch, and a value less than 1.0 lowers it. |
|
bit_rate |
int |
No |
The audio bitrate in kbps. If the audio format is Opus, adjust the bitrate by using this parameter. Default value: 32. Valid values: [6, 510]. Note
Set
|
|
word_timestamp_enabled |
bool |
No |
Enable word-level timestamps. Default: False.
This feature supports only cloned voices of the cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 models, and system voices marked as supported in the voice list. Timestamps are available only through the callback interface. Note
Set
|
|
seed |
int |
No |
The random seed used during generation. Different seeds produce different synthesis results. If the model, text, voice, and other parameters are identical, using the same seed reproduces the same output. Default value: 0. Valid values: [0, 65535]. |
|
language_hints |
list[str] |
No |
Specifies the target language for speech synthesis to improve the synthesis effect. Use when pronunciation or synthesis quality is poor for numbers, abbreviations, symbols, or less common languages:
Valid values:
Note: This parameter is an array, but the current version only processes the first element. Therefore, we recommend passing only one value. Important
This parameter specifies the target language for speech synthesis. This setting is independent of the language of the sample audio used for voice cloning. To set the source language for a cloning task, see CosyVoice Voice Cloning/Design API. |
|
instruction |
str |
No |
Sets an instruction to control synthesis effects such as dialect, emotion, or speaking style. This feature is available only for cloned voices of the cosyvoice-v3.5-flash, cosyvoice-v3.5-plus, and cosyvoice-v3-flash models, and for system voices marked as supporting Instruct in the voice list. Length limit: 100 characters. A Chinese character (including simplified and traditional Chinese, Japanese Kanji, and Korean Hanja) is counted as two characters. All other characters, such as punctuation marks, letters, numbers, and Japanese/Korean Kana/Hangul, are counted as one character. Usage requirements (vary by model):
|
|
enable_aigc_tag |
bool |
No |
Add an invisible AIGC identifier to generated audio. When set to True, the identifier is embedded in supported audio formats (WAV, MP3, OPUS). Default: False. Only cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 support this feature. Note
Set
|
|
aigc_propagator |
str |
No |
Set the Default: Alibaba Cloud UID. Only cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 support this feature. Note
Set
|
|
aigc_propagate_id |
str |
No |
Set the Default: Request ID of this speech synthesis request. Only cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 support this feature. Note
Set
|
|
hot_fix |
dict |
No |
Configuration for text hotpatching. Allows you to customize the pronunciation of specific words or replace text before synthesis. This feature is available only for cloned voices of cosyvoice-v3-flash. Parameters:
Example:
|
|
enable_markdown_filter |
bool |
No |
Specifies whether to enable Markdown filtering. When enabled, the system automatically removes Markdown symbols from the input text before synthesizing speech, preventing them from being read aloud. This feature is available only for cloned voices of cosyvoice-v3-flash. Default: False. Values:
Note
Set
|
|
callback |
ResultCallback |
No |
The callback interface (ResultCallback) used to receive synthesis results in real time during unidirectional or bidirectional streaming calls. |
Key interfaces
SpeechSynthesizer class
Import the SpeechSynthesizer class using from dashscope.audio.tts_v2 import *. It provides core speech synthesis interfaces.
|
Method |
Parameters |
Return value |
Description |
|
|
Returns binary audio data if no ResultCallback is specified; otherwise, returns None |
Convert the entire text (whether plain text or SSML) to speech. Two cases exist when creating a SpeechSynthesizer instance: without a ResultCallback, the call method blocks and returns the complete audio (non-streaming call); with a ResultCallback, it returns immediately and delivers results through the callback (unidirectional streaming call). Important
Before each call to the call method, you must re-initialize the SpeechSynthesizer instance. |
|
|
None |
Stream text fragments for synthesis (SSML is not supported). Call this method multiple times to send text fragments to the server. Retrieve synthesis results through the on_data method of the callback interface (ResultCallback). |
|
|
None |
End streaming speech synthesis. This method blocks the current thread until the task finishes or the wait times out. By default, waiting stops after 10 minutes. Important
In bidirectional streaming calls, always call this method. Otherwise, synthesized speech may be missing. |
|
None |
Request ID of the previous task |
Get the request ID of the previous task. |
|
None |
First-package delay |
Returns first-packet latency in milliseconds (time from sending text to receiving first audio). Call after task completes. Factors affecting first-packet latency:
Typical latency:
If latency consistently exceeds 2,000 ms:
|
|
None |
Last message |
Get the last message (JSON-formatted data), useful for detecting task-failed errors. |
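As a sketch of how the last message can be checked for failures (a minimal example; the message layout is assumed from the header/payload structure used by this service, and the task-failed event name comes from the description above):

```python
import json

def is_task_failed(message: str) -> bool:
    # The event name lives under header.event in the message JSON.
    header = json.loads(message).get("header", {})
    return header.get("event") == "task-failed"

# Hypothetical last message returned after a failed task:
sample = '{"header":{"task_id":"abc","event":"task-failed","attributes":{}}}'
print(is_task_failed(sample))  # True
```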
Callback interface (ResultCallback)
During unidirectional streaming calls or bidirectional streaming calls, the server returns key process information and data to the client via callbacks. Implement callback methods to handle server responses.
Import using from dashscope.audio.tts_v2 import *.
|
Method |
Parameters |
Return value |
Description |
|
None |
None |
Called immediately after the client connects to the server. |
|
|
None |
Called when the server sends a message. |
|
None |
None |
Called after all synthesis data is returned (speech synthesis complete). |
|
|
None |
Called when an error occurs. |
|
|
None |
Called when audio data arrives. Combine segments into a complete file or stream to a compatible player. Important
|
|
None |
None |
Called after the server closes the connection. |
Response
The server returns binary audio data:
- Non-streaming: Process the binary audio data returned by the call method of the SpeechSynthesizer class.
- Unidirectional or bidirectional streaming: Process the data parameter (bytes) passed to the on_data method of the callback interface (ResultCallback).
Error codes
If an error occurs, see Error messages for troubleshooting.
More examples
For more examples, see GitHub.
FAQ
Features, billing, and rate limiting
Q: What can I do if the pronunciation is inaccurate?
Use SSML to fix pronunciation.
Q: Speech synthesis is billed by character count. How do I view or get the character count for each synthesis?
How you retrieve the count depends on whether logging is enabled:
- Logging disabled
  - Non-streaming: Calculate manually using the character counting rules.
  - Other call types: Retrieve the count from the message parameter of the on_event method of the callback interface (ResultCallback). message is a JSON string; parse it to get the billed character count (the characters field). Use the last message received.
- Logging enabled
  The console prints logs like the following. characters is the billed character count for this request. Use the last log printed.
  2025-08-27 11:02:09,429 - dashscope - speech_synthesizer.py - on_message - 454 - DEBUG - <<<recv {"header":{"task_id":"62ebb7d6cb0a4080868f0edb######","event":"result-generated","attributes":{}},"payload":{"output":{"sentence":{"words":[]}},"usage":{"characters":15}}}
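The same characters value can also be extracted programmatically from the message JSON, for example (a sketch; the field path follows the log format shown above, with a shortened task_id):

```python
import json

def billed_characters(message: str):
    # usage.characters sits under the payload object of the message.
    payload = json.loads(message).get("payload", {})
    return payload.get("usage", {}).get("characters")

# Message body modeled on the log sample above (task_id shortened):
sample = ('{"header":{"task_id":"62ebb7d6","event":"result-generated",'
          '"attributes":{}},"payload":{"output":{"sentence":{"words":[]}},'
          '"usage":{"characters":15}}}')
print(billed_characters(sample))  # 15
```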
Troubleshooting
If your code throws errors, troubleshoot using the information in Error codes.
Q: How do I get the request ID?
Retrieve it in either of two ways:
- Parse the JSON string message in the on_event method of the callback interface (ResultCallback).
- Call the get_last_request_id method of the SpeechSynthesizer class.
Q: Why does SSML fail?
Troubleshoot step by step:
- Verify that you comply with the limitations and constraints.
- Ensure you use the correct interface: only the call method of the SpeechSynthesizer class supports SSML.
- Ensure the text to synthesize is plain text that meets the formatting requirements. See SSML overview.
Q: Why does the audio duration of TTS speech synthesis differ from the WAV file's displayed duration? For example, a WAV file shows 7 seconds but the actual audio is less than 5 seconds?
TTS uses a streaming synthesis mechanism, which means it synthesizes and returns data progressively. As a result, the WAV file header contains an estimated value, which may have some margin of error. If you require precise duration, you can set the format to PCM and manually add the WAV header information after obtaining the complete synthesis result. This will give you a more accurate duration.
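Following the suggestion above, a complete PCM result can be wrapped in an accurate WAV header with the standard library wave module (a sketch; the sample rate, channel count, and 16-bit sample width below are assumptions that must match your request parameters):

```python
import io
import wave

def pcm_to_wav(pcm: bytes, sample_rate: int = 22050,
               channels: int = 1, sample_width: int = 2) -> bytes:
    # Write a WAV header that exactly matches the PCM payload length,
    # so the file's reported duration is accurate.
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(channels)
        w.setsampwidth(sample_width)   # 2 bytes = 16-bit samples
        w.setframerate(sample_rate)
        w.writeframes(pcm)
    return buf.getvalue()

# One second of 16-bit mono silence at 22050 Hz:
wav_bytes = pcm_to_wav(b"\x00\x00" * 22050)
with wave.open(io.BytesIO(wav_bytes)) as r:
    print(r.getnframes() / r.getframerate())  # 1.0
```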
Q: Why can't the audio be played?
Check the following scenarios one by one:
- The audio is saved as a complete file (such as xx.mp3):
  - Format consistency: Verify that the requested format matches the file extension (for example, WAV audio saved as .wav, not .mp3).
  - Player compatibility: Verify that your player supports the format and sample rate of the audio file. Some players may not support high sample rates or specific audio encodings.
- The audio is played as a stream:
  - Save the audio stream as a complete file and try to play it with a player. If the file cannot be played, see the troubleshooting method for the first scenario.
  - If the file plays normally, the problem may lie in your streaming playback implementation. Verify that your player supports streaming playback. Common tools and libraries that support streaming playback include FFmpeg, PyAudio (Python), AudioFormat (Java), and MediaSource (JavaScript).
Q: Why does the audio playback stutter?
Check the following scenarios one by one:
- Check the text sending speed: Make sure the interval between text segments is reasonable. Avoid situations where the next segment is not sent promptly after the previous audio segment finishes playing.
- Check the callback function performance:
  - Avoid heavy business logic in the callback function; it can cause blocking.
  - Callbacks run in the WebSocket thread. Blocking them prevents timely packet reception and causes audio playback to stutter.
  - We recommend writing audio data to a separate buffer and processing it in another thread to avoid blocking the WebSocket thread.
- Check network stability: Ensure your network connection is stable to avoid audio transmission interruptions or delays caused by network fluctuations.
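The buffer-and-thread recommendation above can be sketched as follows (illustrative only; audio_q and writer are hypothetical names, and the byte strings stand in for real audio packets delivered to on_data):

```python
import queue
import threading

audio_q = queue.Queue()
received = bytearray()

def writer():
    # Runs outside the WebSocket thread: drain the queue and write
    # audio (here into a buffer; in practice, to a file or player).
    while True:
        chunk = audio_q.get()
        if chunk is None:   # sentinel sent from on_complete/on_error
            break
        received.extend(chunk)

worker = threading.Thread(target=writer)
worker.start()

# Inside on_data(self, data), only enqueue; never block the callback:
for packet in (b"\x01\x02", b"\x03\x04", b"\x05\x06"):
    audio_q.put(packet)
audio_q.put(None)  # end of stream
worker.join()
print(bytes(received))  # b'\x01\x02\x03\x04\x05\x06'
```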
Q: Why does speech synthesis take a long time?
Follow these steps to troubleshoot:
- Check the input interval: If you use streaming speech synthesis, verify whether the interval between sending text segments is too long (for example, a delay of several seconds). A long interval increases the total synthesis time.
- Analyze performance metrics:
  - First-packet latency: Normally around 500 ms.
  - RTF (RTF = total synthesis time / audio duration): Normally less than 1.0.
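For reference, the RTF formula above can be computed from your own timing measurements (a trivial sketch with made-up numbers):

```python
def rtf(total_synthesis_seconds: float, audio_seconds: float) -> float:
    # Real-time factor: below 1.0 means synthesis is faster than
    # real-time playback.
    return total_synthesis_seconds / audio_seconds

# Hypothetical measurement: 3 s to synthesize 10 s of audio.
print(rtf(3.0, 10.0))  # 0.3
```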
Q: How do I handle incorrect pronunciation in the synthesized speech?
Use the <phoneme> tag of SSML to specify the correct pronunciation.
Q: Why is no speech returned? Why is part of the text at the end not converted to speech? (Missing speech)
Confirm you called the streaming_complete method of the SpeechSynthesizer class. During synthesis, the server waits until it has enough cached text before starting synthesis. If you omit streaming_complete, trailing text in the cache may never synthesize.
Q: How do I fix SSL certificate verification failure?
- Install the system root certificates:
  sudo yum install -y ca-certificates
  sudo update-ca-trust enable
- Add the following to your code:
  import os
  os.environ["SSL_CERT_FILE"] = "/etc/ssl/certs/ca-bundle.crt"
Q: Why do I get “SSL: CERTIFICATE_VERIFY_FAILED” on macOS? (websocket closed due to [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1000))
OpenSSL certificate verification may fail during WebSocket connection due to incorrect Python certificate configuration. Fix it manually:
- Export system certificates and set environment variables. Run these commands to export all macOS certificates to a file and set it as the default certificate path for Python and related libraries:
  security find-certificate -a -p > ~/all_mac_certs.pem
  export SSL_CERT_FILE=~/all_mac_certs.pem
  export REQUESTS_CA_BUNDLE=~/all_mac_certs.pem
- Create a symbolic link to fix Python's OpenSSL configuration. If Python's OpenSSL config lacks certificates, create a symbolic link. Replace the path with your local Python version:
  # 3.9 is an example version number. Adjust to your installed Python version.
  ln -s /etc/ssl/* /Library/Frameworks/Python.framework/Versions/3.9/etc/openssl
- Restart your terminal and clear caches: Close and reopen your terminal to apply the environment variables, clear caches, and retry the WebSocket connection.
These steps resolve connection issues caused by certificate misconfiguration. If problems persist, check the server’s certificate configuration.
Q: Why do I get “AttributeError: module 'websocket' has no attribute 'WebSocketApp'. Did you mean: 'WebSocket'?”
This happens when websocket-client is not installed or the version is incompatible. Run these commands:
pip uninstall websocket-client
pip uninstall websocket
pip install websocket-client
Permissions and authentication
Q: How can I restrict my API key to the CosyVoice speech synthesis service only (permission isolation)?
Create a workspace and grant authorization only to specific models to limit the API key scope. For more information, see Manage workspaces.
More questions
See the QA on GitHub.