This topic describes the parameters and interfaces of the CosyVoice speech synthesis Python SDK.
This document applies only to the China (Beijing) region. To use the model, you must use an API key from the China (Beijing) region.
User guide: For more information about the models and guidance on model selection, see Speech synthesis - CosyVoice/Sambert.
Prerequisites
You have activated Model Studio and created an API key. To prevent security risks, export the API key as an environment variable instead of hard-coding it in your code.
Note: To grant temporary access permissions to third-party applications or users, or if you want to strictly control high-risk operations such as accessing or deleting sensitive data, we recommend that you use a temporary authentication token.
Compared with long-term API keys, temporary authentication tokens are more secure because they are short-lived (60 seconds). They are suitable for temporary call scenarios and can effectively reduce the risk of API key leakage.
To use a temporary token, replace the API key used for authentication in your code with the temporary authentication token.
Models and pricing
Model | Unit price |
cosyvoice-v3-plus | $0.286706/10,000 characters |
cosyvoice-v2 | $0.286706/10,000 characters |
Character billing rule: One Chinese character is counted as two characters. English letters, punctuation, and spaces are each counted as one character.
For more information, see Throttling.
Text and format limitations
Text length limits
For non-streaming calls (synchronous call or asynchronous invocation), the text sent in a single request cannot exceed 2,000 characters in length.
For streaming calls, the text sent in a single request cannot exceed 2,000 characters in length, and the total length of all text sent cannot exceed 200,000 characters.
Character calculation rules
Chinese characters: 2 characters each
English letters, numbers, punctuation, and spaces: 1 character each
The content of SSML tags is included when calculating the text length.
Examples:
"Hello"→ 4 distinct characters"ChineseA123"→ 2 + 1 + 2 + 1 + 1 + 1 = 8 characters"Chinese."→ 2 + 2 + 1 = 5 characters"Chinese."→ 2+1+2+1=6 characters"<speak>你好</speak>"→ 7 + 4 + 8 = 19 characters
Encoding format
Use UTF-8 encoding.
Support for mathematical expressions
The mathematical expression parsing feature is available only for the cosyvoice-v2 model. It supports common mathematical expressions, such as those in primary and secondary school curricula, including basic arithmetic, algebra, and geometry.
For more information, see Convert LaTeX formulas to speech.
SSML support
The Speech Synthesis Markup Language (SSML) feature is available only for some voices of the cosyvoice-v2 model. Check the voice list to confirm whether a voice supports SSML. To use SSML, the following conditions must be met:
You must use DashScope SDK 1.23.4 or later.
Only synchronous calls and asynchronous invocations are supported. This means you must use the call method of the SpeechSynthesizer class. Streaming calls, which use the streaming_call method of the SpeechSynthesizer class, are not supported.
The usage is the same as for standard speech synthesis: pass the text that contains SSML to the call method of the SpeechSynthesizer class.
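As a minimal sketch, SSML text is passed to the call method like plain text. The voice longxiaochun_v2 and the `<break>` tag are illustrative choices; check the voice list for SSML support and the SSML reference for supported tags, and make sure the DASHSCOPE_API_KEY environment variable is set and DashScope SDK 1.23.4 or later is installed.

```python
from dashscope.audio.tts_v2 import SpeechSynthesizer

# SSML must be wrapped in a <speak> element; <break> is used here for illustration.
ssml = '<speak>你好，<break time="500ms"/>欢迎使用语音合成服务。</speak>'

synthesizer = SpeechSynthesizer(model="cosyvoice-v2", voice="longxiaochun_v2")
audio = synthesizer.call(ssml)  # returns binary audio data

with open("ssml_output.mp3", "wb") as f:
    f.write(audio)
```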
Getting started
The SpeechSynthesizer class provides the key interfaces for speech synthesis and supports the following call methods:
Synchronous call: After you submit the text, the server immediately processes it and returns the complete synthesized speech. The entire process is blocking. The client must wait for the server to finish processing before it can perform the next operation. This method is suitable for speech synthesis scenarios that involve short text.
Asynchronous invocation: Send the complete text to the server in a single request and receive the synthesized speech in real time as it is generated. You cannot send the text in segments. This method is suitable for speech synthesis scenarios that involve short text and require high real-time performance.
Streaming call: Send the text to the server in segments and receive the synthesized speech in real time. The server starts processing as soon as it receives a portion of the text. This method is suitable for speech synthesis scenarios that involve long text and require high real-time performance.
Synchronous call
Submit a single speech synthesis task and obtain the complete result at once without using a callback function for streaming intermediate results.
Instantiate the SpeechSynthesizer class, bind the request parameters, and call the call method to synthesize the speech and obtain the binary audio data.
The length of the sent text cannot exceed 2,000 characters. For more information, see the call method of the SpeechSynthesizer class.
You must re-initialize the SpeechSynthesizer instance before each call to the call method.
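The steps above can be sketched as follows. The voice and sample text are illustrative, and the DASHSCOPE_API_KEY environment variable is assumed to be set.

```python
from dashscope.audio.tts_v2 import SpeechSynthesizer

# Re-initialize the instance before each call to the call method.
synthesizer = SpeechSynthesizer(model="cosyvoice-v2", voice="longxiaochun_v2")

# Blocks until the complete synthesized audio is returned as binary data.
audio = synthesizer.call("今天天气怎么样？")

with open("output.mp3", "wb") as f:
    f.write(audio)
print("request id:", synthesizer.get_last_request_id())
```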
Asynchronous invocation
Submit a single speech synthesis task and receive streaming intermediate results through a callback. The synthesis results are streamed through the callback functions in ResultCallback.
Instantiate the SpeechSynthesizer class, bind the request parameters and the ResultCallback interface, and call the call method to synthesize the speech. The results are retrieved in real time through the on_data method of the ResultCallback interface.
The length of the sent text cannot exceed 2,000 characters. For more information, see the call method of the SpeechSynthesizer class.
You must re-initialize the SpeechSynthesizer instance before each call to the call method.
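A minimal asynchronous-invocation sketch, assuming DASHSCOPE_API_KEY is set; the voice and text are illustrative. The full text is submitted at once, and audio chunks arrive through the ResultCallback methods.

```python
from dashscope.audio.tts_v2 import ResultCallback, SpeechSynthesizer


class Callback(ResultCallback):
    def __init__(self):
        self.file = None

    def on_open(self):
        self.file = open("output.mp3", "wb")

    def on_data(self, data: bytes):
        # Audio chunks arrive in real time; here they are appended to a file.
        self.file.write(data)

    def on_complete(self):
        print("synthesis complete")

    def on_error(self, message: str):
        print("synthesis failed:", message)

    def on_close(self):
        if self.file:
            self.file.close()


synthesizer = SpeechSynthesizer(
    model="cosyvoice-v2", voice="longxiaochun_v2", callback=Callback()
)
# With a callback bound, results are delivered through on_data.
synthesizer.call("今天天气怎么样？")
```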
Streaming call
Submit text in multiple parts within a single speech synthesis task and receive the synthesis results in real time through a callback.
For streaming input, call the streaming_call method multiple times to submit text segments in order. The server automatically splits the received text into sentences:
Complete sentences are synthesized immediately.
Incomplete sentences are cached until they are complete and are then synthesized.
When you call the streaming_complete method, the server forcibly synthesizes all received but unprocessed text segments, including incomplete sentences.
The interval between sending text segments cannot exceed 23 seconds. Otherwise, a "request timeout after 23 seconds" exception is triggered. If you have no more text to send, promptly call the streaming_complete method to end the task. The 23-second timeout is enforced by the server and cannot be changed by the client.
Instantiate the SpeechSynthesizer class
Instantiate the SpeechSynthesizer class and bind the request parameters and the ResultCallback interface.
Stream data
Call the streaming_call method of the SpeechSynthesizer class multiple times to submit the text to be synthesized in segments. While you send the text, the server returns the synthesis results to the client in real time through the on_data method of the ResultCallback interface.
The length of the text segment (text) sent in each streaming_call cannot exceed 2,000 characters, and the total length of all sent text cannot exceed 200,000 characters.
End processing
Call the streaming_complete method of the SpeechSynthesizer class to end the speech synthesis. This method blocks the current thread until the on_complete or on_error callback of the ResultCallback interface is triggered.
You must call this method. Otherwise, the text at the end may not be converted to speech.
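The three steps above can be sketched as follows. The voice and the text segments are illustrative, and DASHSCOPE_API_KEY is assumed to be set.

```python
from dashscope.audio.tts_v2 import ResultCallback, SpeechSynthesizer


class Callback(ResultCallback):
    def __init__(self):
        self.file = None

    def on_open(self):
        self.file = open("output.mp3", "wb")

    def on_data(self, data: bytes):
        self.file.write(data)  # audio arrives while text is still being sent

    def on_complete(self):
        print("synthesis complete")

    def on_error(self, message: str):
        print("synthesis failed:", message)

    def on_close(self):
        if self.file:
            self.file.close()


synthesizer = SpeechSynthesizer(
    model="cosyvoice-v2", voice="longxiaochun_v2", callback=Callback()
)

# Each segment must stay within 2,000 characters, and consecutive segments
# must be sent within 23 seconds of each other.
for segment in ["流式文本语音合成，", "可以将输入的文本", "合成为语音二进制数据。"]:
    synthesizer.streaming_call(segment)

# Blocks until on_complete or on_error fires. Without this call, trailing
# cached text may never be synthesized.
synthesizer.streaming_complete()
```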
Request parameters
Set the request parameters through the constructor of the SpeechSynthesizer class.
Parameter | Type | Default value | Required | Description |
model | str | - | Yes | Specifies the model. Different model versions share the same codec. However, the |
voice | str | - | Yes | Specify the voice to use for speech synthesis. The following voice types are available:
⚠️ When you use a voice cloning model for speech synthesis, use only the custom voice generated by that model. Do not use a default voice. ⚠️ When you use a custom voice for speech synthesis, the speech synthesis model ( |
format | enum | Varies by voice | No | Specifies the audio coding format and sample rate. If Note The default sample rate is the optimal sample rate for the current voice. By default, the output uses this sample rate. Downsampling and upsampling are also supported. The following audio coding formats and sample rates are available:
|
volume | int | 50 | No | The volume of the synthesized audio. Valid values: 0 to 100. Important This field differs in different versions of the DashScope SDK:
|
speech_rate | float | 1.0 | No | The speech rate of the synthesized audio. Valid values: 0.5 to 2.
|
pitch_rate | float | 1.0 | No | The pitch of the synthesized audio. Valid values: 0.5 to 2. |
bit_rate | int | 32 | No | Specifies the bitrate of the audio. Valid values: 6 to 510 kbps. A higher bitrate results in better audio quality and a larger audio file. Only available when the audio format ( Note Set the |
word_timestamp_enabled | bool | False | No | Specifies whether to enable word-level timestamps. The default value is false. This feature is supported only by cosyvoice-v2. Timestamp results can be obtained only through the callback interface. Note Set the |
seed | int | 0 | No | The random number seed used for generation, which changes the synthesis effect. The default value is 0. Valid values: 0 to 65535. |
callback | ResultCallback | - | No |
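The request parameters above are bound in the constructor, as in this sketch. The AudioFormat member shown is an assumption for illustration; check the SDK's AudioFormat enum for the exact values supported by your voice. DASHSCOPE_API_KEY is assumed to be set.

```python
from dashscope.audio.tts_v2 import AudioFormat, SpeechSynthesizer

synthesizer = SpeechSynthesizer(
    model="cosyvoice-v2",
    voice="longxiaochun_v2",
    # Example member; the available formats and sample rates vary by voice.
    format=AudioFormat.MP3_22050HZ_MONO_256KBPS,
    volume=80,         # 0 to 100; default 50
    speech_rate=1.2,   # 0.5 to 2; default 1.0
    pitch_rate=1.0,    # 0.5 to 2; default 1.0
)
audio = synthesizer.call("测试一下请求参数。")
```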
Key interfaces
SpeechSynthesizer class
The SpeechSynthesizer class is imported using "from dashscope.audio.tts_v2 import *" and provides the key interfaces for speech synthesis.
Method | Parameters | Return value | Description |
|
| Returns binary audio data if | Converts a whole text segment (either plain text or text with SSML) into speech. When creating a
Important You must re-initialize the |
|
| None | Streams the text to be synthesized (text with SSML is not supported). Call this interface multiple times to send the text to be synthesized to the server in segments. The synthesis result is obtained through the For usage, see Streaming call. |
|
| None | Ends the streaming speech synthesis. This method blocks the current thread for N milliseconds (the duration is determined by By default, if the waiting time exceeds 10 minutes, the waiting stops. For usage, see Streaming call. Important When making a streaming call, make sure to call this method. Otherwise, parts of the synthesized speech may be missing. |
| None | The request_id of the previous task. | Gets the request_id of the previous task. |
| None | First-packet latency | Gets the first-packet latency (usually around 500 ms). First-packet latency is the time between when the text is sent and when the first audio packet is received, in milliseconds. Use this after the task is completed. A WebSocket connection must be established when sending text for the first time. Therefore, the first-packet latency includes the time to establish the connection. |
| None | The last message. | Gets the last message (in JSON format), which can be used to get task-failed errors. |
ResultCallback interface
When you make an asynchronous invocation or a streaming call, the server returns key process information and data to the client through a callback. You need to implement the callback methods to handle the information or data that is returned by the server.
Import it using "from dashscope.audio.tts_v2 import *".
Method | Parameters | Return value | Description |
| None | None | This method is called immediately after a connection is established with the server. |
|
| None | This method is called when there is a response from the service. The |
| None | None | This method is called after all synthesized data has been returned (speech synthesis is complete). |
|
| None | This method is called when an exception occurs. |
|
| None | This method is called when the server returns synthesized audio. Combine the binary audio data into a complete audio file for playback, or play the data in real time using a player that supports streaming playback. Important
|
| None | None | This method is called after the service has closed the connection. |
Response
The server returns binary audio data:
Synchronous call: Process the binary audio data that is returned by the call method of the SpeechSynthesizer class.
Asynchronous invocation or streaming call: Process the parameter (of type bytes) that is passed to the on_data method of the ResultCallback interface.
Error codes
If an error occurs, see Error messages to troubleshoot the issue.
If the problem persists, you can join the developer group to provide feedback. Include the Request ID for further investigation.
More examples
For more examples, see GitHub.
Voice list
The default voices that are currently supported are listed in the table below. If you need a more personalized voice, customize an exclusive voice for free using the voice cloning feature. For more information, see Use a cloned voice for speech synthesis.
When you perform speech synthesis, the model parameter must match the selected voice. Otherwise, the call fails.
The text to be synthesized (text) must be in the same language as the selected voice. Otherwise, pronunciation errors or unnatural speech may occur.
cosyvoice-v2
Scenario | Voice | Characteristics | Audio sample (Right-click to save) | voice parameter | Language | SSML | Permission requirements |
Telemarketing | Longyingxiao | Sweet-voiced saleswoman | longyingxiao | Chinese, English | ✅ | ✅ Available for direct use | |
Short video voiceover | Longjiqi | Cute robot | longjiqi | Chinese, English | ✅ | ✅ Available for direct use | |
Longhouge | Classic Monkey King | longhouge | Chinese, English | ✅ | ✅ Available for direct use | ||
Longjixin | Sharp-tongued and scheming female | longjixin | Chinese, English | ✅ | ✅ Available for direct use | ||
Longanyue | Lively Cantonese male | longanyue | Chinese, English | ✅ | ✅ Available for direct use | ||
Longgangmei | TVB drama Mandarin female | longgangmei | Chinese, English | ✅ | ✅ Available for direct use | ||
Longshange | Authentic Northern Shaanxi male | longshange | Chinese, English | ✅ | ✅ Available for direct use | ||
Longanmin | Sweet Southern Min female | longanmin | Chinese, English | ✅ | ✅ Available for direct use | ||
Longdaiyu | Delicate and talented female | longdaiyu | Chinese, English | ✅ | ✅ Available for direct use | ||
Longgaoseng | The voice of an enlightened master | longgaoseng | Chinese, English | ✅ | ✅ Available for direct use | ||
Voice assistant | Longanli | Crisp and composed female | longanli | Chinese, English | ✅ | ✅ Available for direct use | |
Longanlang | Fresh and crisp male | longanlang | Chinese, English | ✅ | ✅ Available for direct use | ||
Longanwen | Elegant and intellectual female | longanwen | Chinese, English | ✅ | ✅ Available for direct use | ||
Longanyun | Homely and warm male | longanyun | Chinese, English | ✅ | ✅ Available for direct use | ||
YUMI | Formal young female | longyumi_v2 | Chinese, English | ✅ | ✅ Available for direct use | ||
Longxiaochun | Intellectual and positive female | longxiaochun_v2 | Chinese, English | ✅ | ✅ Available for direct use | ||
Longxiaoxia | Calm and authoritative female | longxiaoxia_v2 | Chinese, English | ✅ | ✅ Available for direct use | ||
Audiobook | Longyichen | Free-spirited and energetic male | longyichen | Chinese, English | ✅ | ✅ Available for direct use | |
Longwanjun | Delicate and gentle female | longwanjun | Chinese, English | ✅ | ✅ Available for direct use | ||
Longlaobo | Weathered old man | longlaobo | Chinese, English | ✅ | ✅ Available for direct use | ||
Longlaoyi | Worldly and composed aunt | longlaoyi | Chinese, English | ✅ | ✅ Available for direct use | ||
Longbaizhi | Wise female narrator | longbaizhi | Chinese, English | ✅ | ✅ Available for direct use | ||
Longsanshu | Calm and textured male | longsanshu | Chinese, English | ✅ | ✅ Available for direct use | ||
Longxiu | Erudite male storyteller | longxiu_v2 | Chinese, English | ✅ | ✅ Available for direct use | ||
Longmiao | Cadenced female | longmiao_v2 | Chinese, English | ✅ | ✅ Available for direct use | ||
Longyue | Warm and magnetic female | longyue_v2 | Chinese, English | ✅ | ✅ Available for direct use | ||
Longnan | Wise young male | longnan_v2 | Chinese, English | ✅ | ✅ Available for direct use | ||
Longyuan | Warm and healing female | longyuan_v2 | Chinese, English | ✅ | ✅ Available for direct use | ||
Social companion | Longanqin | Approachable and lively female | longanqin | Chinese, English | ✅ | ✅ Available for direct use | |
Longanya | Elegant and classy female | longanya | Chinese, English | ✅ | ✅ Available for direct use | ||
Longanshuo | Clean and fresh male | longanshuo | Chinese, English | ✅ | ✅ Available for direct use | ||
Longanling | Agile-minded female | longanling | Chinese, English | ✅ | ✅ Available for direct use | ||
Longanzhi | Wise and mature young male | longanzhi | Chinese, English | ✅ | ✅ Available for direct use | ||
Longanrou | Gentle female best friend | longanrou | Chinese, English | ✅ | ✅ Available for direct use | ||
Longqiang | Romantic and charming female | longqiang_v2 | Chinese, English | ✅ | ✅ Available for direct use | ||
Longhan | Warm and devoted male | longhan_v2 | Chinese, English | ✅ | ✅ Available for direct use | ||
Longxing | Gentle girl-next-door | longxing_v2 | Chinese, English | ✅ | ✅ Available for direct use | ||
Longhua | Energetic and sweet female | longhua_v2 | Chinese, English | ✅ | ✅ Available for direct use | ||
Longwan | Positive and intellectual female | longwan_v2 | Chinese, English | ✅ | ✅ Available for direct use | ||
Longcheng | Intelligent young male | longcheng_v2 | Chinese, English | ✅ | ✅ Available for direct use | ||
Longfeifei | Sweet and delicate female | longfeifei_v2 | Chinese, English | ✅ | ✅ Available for direct use | ||
Longxiaocheng | Magnetic low-pitched male | longxiaocheng_v2 | Chinese, English | ✅ | ✅ Available for direct use | ||
Longzhe | Awkward but warm-hearted male | longzhe_v2 | Chinese, English | ✅ | ✅ Available for direct use | ||
Longyan | Warm and gentle female | longyan_v2 | Chinese, English | ✅ | ✅ Available for direct use | ||
Longtian | Magnetic and rational male | longtian_v2 | Chinese, English | ✅ | ✅ Available for direct use | ||
Longze | Warm and energetic male | longze_v2 | Chinese, English | ✅ | ✅ Available for direct use | ||
Longshao | Positive and ambitious male | longshao_v2 | Chinese, English | ✅ | ✅ Available for direct use | ||
Longhao | Emotional and melancholic male | longhao_v2 | Chinese, English | ✅ | ✅ Available for direct use | ||
Longshen | Talented male singer | kabuleshen_v2 | Chinese, English | ✅ | ✅ Available for direct use | ||
Child's voice (benchmark voice) | Longhuhu | Innocent and lively young girl | longhuhu | Chinese, English | ✅ | ✅ Available for direct use | |
Consumer electronics - Education and training | Longanpei | Young female teacher | longanpei | Chinese, English | ✅ | ✅ Available for direct use | |
Consumer electronics - Child companion | Longwangwang | Taiwanese youth | longwangwang | Chinese, English | ✅ | ✅ Available for direct use | |
Longpaopao | Apsara bubble voice | longpaopao | Chinese, English | ✅ | ✅ Available for direct use | ||
Consumer electronics - Children's audiobooks | Longshanshan | Dramatic child's voice | longshanshan | Chinese, English | ✅ | ✅ Available for direct use | |
Longniuniu | Sunny young boy's voice | longniuniu | Chinese, English | ✅ | ✅ Available for direct use | ||
Customer service | Longyingmu | Elegant and intellectual female | longyingmu | Chinese, English | ✅ | ✅ Available for direct use | |
Longyingxun | Young and inexperienced male | longyingxun | Chinese, English | ✅ | ✅ Available for direct use | ||
Longyingcui | Serious male for collections | longyingcui | Chinese, English | ✅ | ✅ Available for direct use | ||
Longyingda | Cheerful high-pitched female | longyingda | Chinese, English | ✅ | ✅ Available for direct use | ||
Longyingjing | Low-key and calm female | longyingjing | Chinese, English | ✅ | ✅ Available for direct use | ||
Longyingyan | Righteous and stern female | longyingyan | Chinese, English | ✅ | ✅ Available for direct use | ||
Longyingtian | Gentle and sweet female | longyingtian | Chinese, English | ✅ | ✅ Available for direct use | ||
Longyingbing | Sharp and assertive female | longyingbing | Chinese, English | ✅ | ✅ Available for direct use | ||
Longyingtao | Gentle and calm female | longyingtao | Chinese, English | ✅ | ✅ Available for direct use | ||
Longyingling | Gentle and empathetic female | longyingling | Chinese, English | ✅ | ✅ Available for direct use | ||
Livestreaming e-commerce | Longanran | Lively and textured female | longanran | Chinese, English | ✅ | ✅ Available for direct use | |
Longanxuan | Classic female livestreamer | longanxuan | Chinese, English | ✅ | ✅ Available for direct use | ||
Longanchong | Passionate male salesperson | longanchong | Chinese, English | ✅ | ✅ Available for direct use | ||
Longanping | High-pitched female livestreamer | longanping | Chinese, English | ✅ | ✅ Available for direct use | ||
Child's voice | Longjielidou | Sunny and mischievous male | longjielidou_v2 | Chinese, English | ✅ | ✅ Available for direct use | |
Longling | Childish and deadpan female | longling_v2 | Chinese, English | ✅ | ✅ Available for direct use | ||
Longke | Innocent and well-behaved female | longke_v2 | Chinese, English | ✅ | ✅ Available for direct use | ||
Longxian | Bold and cute female | longxian_v2 | Chinese, English | ✅ | ✅ Available for direct use | ||
Dialect | Longlaotie | Forthright Northeastern male | longlaotie_v2 | Chinese (Northeastern), English | ✅ | ✅ Available for direct use | |
Longjiayi | Intellectual Cantonese female | longjiayi_v2 | Chinese (Cantonese), English | ✅ | ✅ Available for direct use | ||
Longtao | Positive Cantonese female | longtao_v2 | Chinese (Cantonese), English | ✅ | ✅ Available for direct use | ||
Poetry recitation | Longfei | Passionate and magnetic male | longfei_v2 | Chinese, English | ✅ | ✅ Available for direct use | |
Libai | Ancient male poet | libai_v2 | Chinese, English | ✅ | ✅ Available for direct use | ||
Longjin | Elegant and gentle male | longjin_v2 | Chinese, English | ✅ | ✅ Available for direct use | ||
News broadcast | Longshu | Calm young male | longshu_v2 | Chinese, English | ✅ | ✅ Available for direct use | |
Bella2.0 | Precise and capable female | loongbella_v2 | Chinese, English | ✅ | ✅ Available for direct use | ||
Longshuo | Erudite and capable male | longshuo_v2 | Chinese, English | ✅ | ✅ Available for direct use | ||
Longxiaobai | Calm female announcer | longxiaobai_v2 | Chinese, English | ✅ | ✅ Available for direct use | ||
Longjing | Typical female announcer | longjing_v2 | Chinese, English | ✅ | ✅ Available for direct use | ||
loongstella | Confident and crisp female | loongstella_v2 | Chinese, English | ✅ | ✅ Available for direct use | ||
Overseas marketing | loongyuuna | Energetic Japanese female | loongyuuna_v2 | Japanese | ✅ | ✅ Available for direct use | |
loongyuuma | Capable Japanese male | loongyuuma_v2 | Japanese | ✅ | ✅ Available for direct use | ||
loongjihun | Sunny Korean male | loongjihun_v2 | Korean | ✅ | ✅ Available for direct use | ||
loongeva | Intellectual British English female | loongeva_v2 | British English | ❌ | ✅ Available for direct use | ||
loongbrian | Calm British English male | loongbrian_v2 | British English | ❌ | ✅ Available for direct use | ||
loongluna | British English female | loongluna_v2 | British English | ❌ | ✅ Available for direct use | ||
loongluca | British English male | loongluca_v2 | British English | ❌ | ✅ Available for direct use | ||
loongemily | British English female | loongemily_v2 | British English | ❌ | ✅ Available for direct use | ||
loongeric | British English male | loongeric_v2 | British English | ❌ | ✅ Available for direct use | ||
loongabby | American English female | loongabby_v2 | American English | ❌ | ✅ Available for direct use | ||
loongannie | American English female | loongannie_v2 | American English | ❌ | ✅ Available for direct use | ||
loongandy | American English male | loongandy_v2 | American English | ❌ | ✅ Available for direct use | ||
loongava | American English female | loongava_v2 | American English | ❌ | ✅ Available for direct use | ||
loongbeth | American English female | loongbeth_v2 | American English | ❌ | ✅ Available for direct use | ||
loongbetty | American English female | loongbetty_v2 | American English | ❌ | ✅ Available for direct use | ||
loongcindy | American English female | loongcindy_v2 | American English | ❌ | ✅ Available for direct use | ||
loongcally | American English female | loongcally_v2 | American English | ❌ | ✅ Available for direct use | ||
loongdavid | American English male | loongdavid_v2 | American English | ❌ | ✅ Available for direct use | ||
loongdonna | American English female | loongdonna_v2 | American English | ❌ | ✅ Available for direct use | ||
loongkyong | Korean female | loongkyong_v2 | Korean | ❌ | ✅ Available for direct use | ||
loongtomoka | Japanese female | loongtomoka_v2 | Japanese | ❌ | ✅ Available for direct use | ||
loongtomoya | Japanese male | loongtomoya_v2 | Japanese | ❌ | ✅ Available for direct use |
FAQ
Features, billing, and rate limiting
Q: Where can I find information about the features, billing, and throttling of CosyVoice?
A: For more information, see CosyVoice.
Q: What can I do if the pronunciation is inaccurate?
A: You can use SSML to customize the speech synthesis results.
Q: The current requests per second (RPS) cannot meet my business requirements. What should I do? How can I increase the limit? Is there a fee?
A: You can submit an Alibaba Cloud ticket or join the developer group to request a scale-out. The scale-out is free of charge.
Q: How do I specify the language of the speech to be synthesized?
A: You cannot specify the language of the speech to be synthesized through request parameters. If you want to synthesize speech in a specific language, see the voice list and select a voice based on its language.
Q: Speech synthesis is billed based on the number of text characters. How do I check or get the text length for each synthesis task?
A: The method for retrieving the text length depends on whether logging is enabled:
If logging is disabled
For synchronous calls, calculate the length based on the character counting rules.
For other call methods, retrieve the length from the message parameter of the on_event method in the ResultCallback interface. The message is a JSON string. Parse the string and read the number of billable characters for the request from the characters field. Use the value from the last message that you receive.
If logging is enabled
The following log is printed to the console. The characters field shows the number of billable characters for the request. Use the value from the last log entry that is printed.
2025-08-27 11:02:09,429 - dashscope - speech_synthesizer.py - on_message - 454 - DEBUG - <<<recv {"header":{"task_id":"62ebb7d6cb0a4080868f0edb######","event":"result-generated","attributes":{}},"payload":{"output":{"sentence":{"words":[]}},"usage":{"characters":15}}}
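A message with this structure can be parsed like so. The sample string below mirrors the JSON shown in the log line above; in your code, the string would come from the on_event callback.

```python
import json

# Sample on_event payload, following the structure in the log above.
message = (
    '{"header":{"task_id":"62ebb7d6cb0a4080868f0edb######",'
    '"event":"result-generated","attributes":{}},'
    '"payload":{"output":{"sentence":{"words":[]}},'
    '"usage":{"characters":15}}}'
)

data = json.loads(message)
usage = data.get("payload", {}).get("usage") or {}
# Use the value from the last message received for the task.
print(usage.get("characters"))  # 15
```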
Troubleshooting
If you encounter a code error, troubleshoot the issue based on the information in Error codes.
Q: How do I get a Request ID?
A: Get it in either of the following ways:
In the on_event method of the ResultCallback interface, parse the JSON string message.
Call the get_last_request_id method of the SpeechSynthesizer class.
Q: Why does the SSML feature fail?
Perform the following troubleshooting operations:
Ensure that the current voice supports the SSML feature. Personalized voices do not support SSML.
Ensure that the model parameter is set to cosyvoice-v2.
Ensure that you use the correct interface. Only the call method of the SpeechSynthesizer class supports SSML.
Ensure that the text to be synthesized is in plain text format and meets the format requirements. For more information, see Introduction to the SSML markup language.
Q: Why can't the audio be played?
A: Troubleshoot the issue based on the following scenarios:
The audio is saved as a complete file, such as xx.mp3
Audio format consistency: Make sure that the audio format set in the request parameters matches the file extension. For example, if the audio format is set to WAV in the request parameters but the file extension is .mp3, playback may fail.
Player compatibility: Confirm whether your player supports the format and sample rate of the audio file. For example, some players may not support high sample rates or specific audio encodings.
The audio is played in a stream
Save the audio stream as a complete file and try to play it with a player. If the file cannot be played, see the troubleshooting method for the first scenario.
If the file can be played normally, the problem may be with the streaming playback implementation. Confirm whether your player supports streaming playback.
Common tools and libraries that support streaming playback include FFmpeg, PyAudio (Python), AudioFormat (Java), and MediaSource (JavaScript).
Q: Why does the audio playback stutter?
A: Troubleshoot the issue based on the following scenarios:
The audio is saved as a complete file, such as xx.mp3
Join the developer group and provide the Request ID so that we can troubleshoot the issue for you.
The audio is played in a stream
Check the text sending speed: Make sure that the interval for sending text is reasonable. Avoid situations where the next sentence is not sent promptly after the previous audio segment has finished playing.
Check the callback function performance:
Check whether there is too much business logic in the callback function, which may cause blocking.
The callback function runs in the WebSocket thread. If it is blocked, it may affect the WebSocket's ability to receive network packets, which can cause stuttering when receiving the audio stream.
Write the audio data to a separate audio buffer and then read and process it in other threads. This avoids blocking the WebSocket thread.
Check network stability: Make sure that the network connection is stable to avoid audio transmission interruptions or delays due to network fluctuations.
Further troubleshooting: If the preceding methods do not resolve the issue, join the developer group and provide the Request ID so that we can further investigate the issue for you.
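The buffering advice above can be sketched without a real player: on_data only enqueues data, and a separate consumer thread drains the queue so the WebSocket thread is never blocked. Here the consumer just concatenates the chunks; a real implementation would feed them to a streaming-capable player such as PyAudio.

```python
import queue
import threading

audio_buffer = queue.Queue()  # holds bytes chunks; None is the end sentinel
received = bytearray()

def on_data(data: bytes) -> None:
    """Runs in the WebSocket thread: do no heavy work here, just enqueue."""
    audio_buffer.put(data)

def playback_worker() -> None:
    """Runs in its own thread, so slow playback never blocks the WebSocket."""
    while True:
        chunk = audio_buffer.get()
        if chunk is None:  # stream finished
            break
        received.extend(chunk)  # a real player would write to the audio device

player = threading.Thread(target=playback_worker)
player.start()

for chunk in [b"\x00\x01", b"\x02\x03"]:  # simulated audio chunks
    on_data(chunk)
audio_buffer.put(None)  # signal end of stream
player.join()
print(bytes(received))  # b'\x00\x01\x02\x03'
```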
Q: Why is speech synthesis slow (long synthesis time)?
A: Troubleshoot the issue as follows:
Check the input interval: If you are using streaming speech synthesis, check whether the interval between sending text segments is too long. For example, a delay of several seconds between segments will increase the total synthesis time.
Analyze performance metrics: If the first-packet latency does not meet the following requirements, submit the request ID to the technical team for assistance.
First-packet latency: Typically around 500 ms.
Real-Time Factor (RTF): Typically around 0.3. RTF = Total synthesis time / Audio duration.
Q: How do I handle pronunciation errors in the synthesized speech?
If you are using the cosyvoice-v1 model, we recommend using cosyvoice-v2, which delivers better results and supports SSML.
If the current model is cosyvoice-v2, use the SSML <phoneme> tag to specify the correct pronunciation.
Q: Why is no audio returned, or why is the synthesized audio incomplete?
Check whether you called the streaming_complete method of the SpeechSynthesizer class. During speech synthesis, the server starts the process only after it caches enough text. If you do not call the streaming_complete method, the last part of the text in the cache might not be synthesized into audio.
Q: What do I do if SSL certificate verification fails?
Install the system root certificates.
sudo yum install -y ca-certificates
sudo update-ca-trust enable
You can also add the following lines to your code:
import os
os.environ["SSL_CERT_FILE"] = "/etc/ssl/certs/ca-bundle.crt"
Q: Why do I get an "SSL: CERTIFICATE_VERIFY_FAILED" error on macOS? (websocket closed due to [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1000))
When you connect to a WebSocket, you might encounter an OpenSSL certificate authentication failure that reports that the certificate cannot be found. This issue usually occurs because the certificate configuration in your Python environment is incorrect. You can follow these steps to manually locate and fix the certificate issue:
Export system certificates and set environment variables. Run the following commands to export all certificates from your macOS system to a file and set that file as the default certificate path for Python and its libraries:
security find-certificate -a -p > ~/all_mac_certs.pem
export SSL_CERT_FILE=~/all_mac_certs.pem
export REQUESTS_CA_BUNDLE=~/all_mac_certs.pem
Create a symbolic link to fix the Python OpenSSL configuration. If your Python OpenSSL configuration is missing a certificate, run the following command. Replace the path with the actual installation folder for your Python version:
# 3.9 is an example. Adjust the path to your installed Python version.
ln -s /etc/ssl/* /Library/Frameworks/Python.framework/Versions/3.9/etc/openssl
Restart the terminal and clear the cache. After you complete the steps, close and reopen your terminal to apply the environment variables. Then, clear any caches and retry the WebSocket connection.
These steps can resolve connection issues that are caused by incorrect certificate configuration. If the issue persists, check the certificate configuration on the target server.
Q: Why do I get the error "AttributeError: module 'websocket' has no attribute 'WebSocketApp'. Did you mean: 'WebSocket'?"
This error occurs because websocket-client is not installed or there is a version mismatch. You can run the following commands in order:
pip uninstall websocket-client
pip uninstall websocket
pip install websocket-client
Permissions and authentication
Q: I want my API key to be used only for the CosyVoice speech synthesis service and not for other Model Studio models (permission isolation). How can I do this?
A: You can limit the scope of an API key by creating a new workspace and authorizing only specific models. For more information, see Workspace Management.
More questions
For more information, see the GitHub QA.