To use a model in the China (Beijing) region, go to the API key page for the China (Beijing) region to obtain your API key.
This document describes how to access the CosyVoice speech synthesis service using a WebSocket connection.
The DashScope SDK currently supports only Java and Python. If you want to develop CosyVoice speech synthesis applications in other programming languages, you must use a WebSocket connection to communicate with the service.
User guide: For more information about the models and guidance on model selection, see Real-time speech synthesis - CosyVoice.
WebSocket is a network protocol that supports full-duplex communication. The client and server establish a persistent connection with a single handshake, which allows both parties to actively push data to each other. This provides significant advantages in real-time performance and efficiency.
For common programming languages, many ready-to-use WebSocket libraries and examples are available, such as:
Go: gorilla/websocket
PHP: Ratchet
Node.js: ws
Familiarize yourself with the basic principles and technical details of WebSocket before you begin development.
Prerequisites
You have activated Model Studio and created an API key. To prevent security risks, export the API key as an environment variable instead of hard-coding it in your code.
To grant temporary access permissions to third-party applications or users, or if you want to strictly control high-risk operations such as accessing or deleting sensitive data, we recommend that you use a temporary authentication token.
Compared with long-term API keys, temporary authentication tokens are more secure because they are short-lived (60 seconds). They are suitable for temporary call scenarios and can effectively reduce the risk of API key leakage.
To use a temporary token, replace the API key used for authentication in your code with the temporary authentication token.
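For example, assuming the same bearer scheme shown in the Headers section below, the request header with a temporary token would look like the following (the token value is a placeholder):
{
    "Authorization": "bearer <your_temporary_auth_token>" // Replace with the temporary authentication token.
}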
Models and pricing
| Model | Price |
| --- | --- |
| cosyvoice-v3-plus | $0.286706 per 10,000 characters |
| cosyvoice-v3-flash | $0.14335 per 10,000 characters |
| cosyvoice-v2 | $0.286706 per 10,000 characters |
Text and format limitations
Text length limits
The text sent in a single continue-task instruction must not exceed 2,000 characters. The total length of text sent across multiple continue-task instructions must not exceed 200,000 characters.
Character counting rules
A Chinese character (simplified or traditional), Japanese kanji, or Korean hanja is counted as 2 characters. All other characters, such as punctuation marks, letters, numbers, and Japanese kana or Korean hangul, are counted as 1 character.
SSML tags are not included in the text length calculation.
Examples:
"你好"→ 2(你) + 2(好) = 4 characters"中A文123"→ 2(中) + 1(A) + 2(文) + 1(1) + 1(2) + 1(3) = 8 characters"中文。"→ 2(中) + 2(文) + 1(。) = 5 characters"中 文。"→ 2(中) + 1(space) + 2(文) + 1(。) = 6 characters"<speak>你好</speak>"→ 2(你) + 2(好) = 4 characters
Encoding format
Use UTF-8 encoding.
Support for mathematical expressions
The mathematical expression parsing feature is currently available only for the cosyvoice-v2, cosyvoice-v3-flash, and cosyvoice-v3-plus models. This feature supports common mathematical expressions from primary and secondary school, such as basic arithmetic, algebra, and geometry.
For more information, see Convert LaTeX formulas to speech.
SSML support
The Speech Synthesis Markup Language (SSML) feature is currently available only for cloned voices of the cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 models, and for system voices that are indicated as supported in the voice list.
To use this feature:
When you send the run-task instruction, set the enable_ssml parameter to true to enable SSML support.
Then, send the text that contains SSML using the continue-task instruction.
After you enable SSML support by setting the enable_ssml parameter to true, you must submit the complete text for synthesis in a single continue-task instruction. Multiple submissions are not supported.
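For reference, the following sketch shows the two instructions with SSML enabled. The task_id and voice values are placeholders, and the voice must support SSML:
// run-task: enable SSML in parameters
{
    "header": {
        "action": "run-task",
        "task_id": "<task_id>", // Random UUID
        "streaming": "duplex"
    },
    "payload": {
        "task_group": "audio",
        "task": "tts",
        "function": "SpeechSynthesizer",
        "model": "cosyvoice-v2",
        "parameters": {
            "text_type": "PlainText",
            "voice": "<your_voice>", // A voice that supports SSML
            "enable_ssml": true
        },
        "input": {}
    }
}
// continue-task: send the complete SSML text in a single instruction
{
    "header": {
        "action": "continue-task",
        "task_id": "<task_id>",
        "streaming": "duplex"
    },
    "payload": {
        "input": {
            "text": "<speak>你好</speak>"
        }
    }
}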
Interaction flow
Messages sent from the client to the server are called instructions. The server returns two types of messages to the client: JSON-formatted events and binary audio streams.
The following is the interaction flow between the client and the server:
Establish a connection: The client establishes a WebSocket connection with the server.
Start a task:
The client sends a run-task instruction to start the task.
The client receives the task-started event from the server. This event indicates that the task has started successfully and that you can proceed to the next steps.
Send text for synthesis:
The client sends one or more continue-task instructions that contain the text for synthesis to the server in sequence. After the server receives a complete sentence, it returns an audio stream. The text length is constrained; for more information, see the description of the text field in the continue-task instruction.
Note: You can send multiple continue-task instructions to submit text fragments in sequence. The server automatically segments the text after it receives the fragments:
Complete sentences are synthesized immediately, and the client receives the corresponding audio from the server.
Incomplete sentences are cached until they are complete, and then synthesized. The server does not return audio for incomplete sentences.
When you send a finish-task instruction, the server synthesizes all cached content.
Notify the server to end the task:
After all the text is sent, the client sends a finish-task instruction to the server to end the task. The client continues to receive the audio stream from the server. You must perform this step. Otherwise, you may not receive the complete audio stream.
Task ends:
The client receives a task-finished event from the server. This event indicates that the task has ended.
Close the connection: The client closes the WebSocket connection.
URL
The WebSocket URL is fixed:
wss://dashscope.aliyuncs.com/api-ws/v1/inference

Headers
Include the following parameters in the request headers:
{
"Authorization": "bearer <your_dashscope_api_key>", // Required. Replace <your_dashscope_api_key> with your API key.
"user-agent": "your_platform_info", // Optional.
"X-DashScope-WorkSpace": workspace, // Optional. The ID of your Model Studio workspace.
"X-DashScope-DataInspection": "enable"
}

Instructions (client to server)
Instructions are messages that the client sends to the server in JSON format as text frames. These instructions control the start and end of a task and identify task boundaries.
To avoid task failure, you must send instructions in the following sequence:
Send the run-task instruction
Starts the speech synthesis task.
The task_id used here must also be used in the subsequent continue-task and finish-task instructions.
Send the continue-task instruction
Sends the text for synthesis.
Send this instruction only after you receive the task-started event from the server.
Send the finish-task instruction
Ends the speech synthesis task.
Send this instruction after all continue-task instructions have been sent.
1. run-task instruction: Start a task
This instruction starts a speech synthesis task and lets you set request parameters, such as the voice and sample rate.
When to send: After the WebSocket connection is established.
Do not send text for synthesis: Do not include the text for synthesis in this instruction. Including text can make troubleshooting difficult.
Example:
{
"header": {
"action": "run-task",
"task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx", // Random UUID
"streaming": "duplex"
},
"payload": {
"task_group": "audio",
"task": "tts",
"function": "SpeechSynthesizer",
"model": "cosyvoice-v3-flash",
"parameters": {
"text_type": "PlainText",
"voice": "longanyang", // Voice
"format": "mp3", // Audio format
"sample_rate": 22050, // Sample rate
"volume": 50, // Volume
"rate": 1, // Speech rate
"pitch": 1 // Pitch
},
"input": {// The input field cannot be omitted. Otherwise, an error is reported.
}
}
}header parameters
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| header.action | string | Yes | The instruction type. For this instruction, the value is fixed as "run-task". |
| header.task_id | string | Yes | The ID of the current task: a UUID composed of 32 randomly generated letters and digits, with or without hyphens. The subsequent continue-task and finish-task instructions must use the same task_id as the run-task instruction. |
| header.streaming | string | Yes | Fixed string: "duplex". |
payload parameters
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| payload.task_group | string | Yes | Fixed string: "audio". |
| payload.task | string | Yes | Fixed string: "tts". |
| payload.function | string | Yes | Fixed string: "SpeechSynthesizer". |
| payload.model | string | Yes | The speech synthesis model. Different models require corresponding voices; see the voice parameter. |
| payload.input | object | Yes | Fixed format: {}. The input field cannot be omitted. Otherwise, an error is reported. |
| payload.parameters | | | The synthesis parameters. The following rows describe its fields. |
| text_type | string | Yes | Fixed string: "PlainText". |
| voice | string | Yes | The voice to use for speech synthesis. System voices and cloned voices are supported. |
| format | string | No | The audio encoding format. Supported formats are pcm, wav, mp3 (default), and opus. When the format is opus, you can adjust the bitrate using the bit_rate parameter. |
| sample_rate | integer | No | The audio sample rate in Hz. Default value: 22050. Valid values: 8000, 16000, 22050, 24000, 44100, 48000. Note: The default sample rate is the optimal rate for the current voice, and the output uses it by default. Downsampling and upsampling are also supported. |
| volume | integer | No | The volume. Default value: 50. Value range: [0, 100]. A value of 50 is the standard volume; the volume has a linear relationship with this value. 0 is mute and 100 is the maximum volume. |
| rate | float | No | The speech rate. Default value: 1.0. Value range: [0.5, 2.0]. A value of 1.0 is the standard rate; values less than 1.0 slow down the speech, and values greater than 1.0 speed it up. |
| pitch | float | No | The pitch, expressed as a multiplier. The relationship between this value and the perceived pitch is not strictly linear or logarithmic, so test different values to find the best one. Default value: 1.0. Value range: [0.5, 2.0]. A value of 1.0 is the natural pitch of the voice; values greater than 1.0 raise the pitch, and values less than 1.0 lower it. |
| enable_ssml | boolean | No | Specifies whether to enable the SSML feature. If this parameter is set to true, you must submit the complete text for synthesis in a single continue-task instruction. For more information, see SSML support. |
| bit_rate | integer | No | The audio bitrate in kbps. You can adjust the bitrate with this parameter only when the audio format is opus. Default value: 32. Value range: [6, 510]. |
| word_timestamp_enabled | boolean | No | Specifies whether to enable character-level timestamps. Default value: false. This feature is available only for cloned voices of the cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 models, and for system voices marked as supported in the voice list. |
| seed | integer | No | The random number seed used during generation, which varies the synthesis effect. If the model version, text, voice, and other parameters are the same, using the same seed reproduces the same synthesis result. Default value: 0. Value range: [0, 65535]. |
| language_hints | array[string] | No | Provides language hints. Only cosyvoice-v3-flash and cosyvoice-v3-plus support this feature. No default value; the parameter has no effect if it is not set. Avoid specifying a language hint that clearly does not match the text content. Note: This parameter is an array, but the current version processes only the first element, so pass only one value. |
| instruction | string | No | Sets a synthesis instruction (prompt). This feature is available only for cloned voices of the cosyvoice-v3-flash and cosyvoice-v3-plus models, and for system voices marked as supported in the voice list. The prompt must be in Chinese and follow specific patterns. No default value; the parameter has no effect if it is not set. |
| enable_aigc_tag | boolean | No | Specifies whether to add an invisible AIGC identifier to the generated audio. When set to true, an invisible identifier is embedded into the audio in supported formats (WAV, MP3, and Opus). Default value: false. This feature is available only for the cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 models. |
| aigc_propagator | string | No | Sets the propagator field of the AIGC identifier. Default value: your Alibaba Cloud UID. This feature is available only for the cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 models. |
| aigc_propagate_id | string | No | Sets the propagation ID of the AIGC identifier. Default value: the request ID of the current speech synthesis request. This feature is available only for the cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 models. |
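For example, a parameters object that requests Opus output at a custom bitrate might look like the following (the values are illustrative):
"parameters": {
    "text_type": "PlainText",
    "voice": "longanyang",
    "format": "opus", // Opus supports a configurable bitrate
    "bit_rate": 64, // kbps; valid range [6, 510]
    "sample_rate": 48000
}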
2. continue-task instruction: Send text for synthesis
This instruction is used to send the text for synthesis.
You can send the text for synthesis all at once in a single continue-task instruction, or you can segment the text and send it sequentially in multiple continue-task instructions.
When to send: Send this instruction after you receive the task-started event.
The interval between sending text fragments must not exceed 23 seconds. If this interval is exceeded, a "request timeout after 23 seconds" exception is triggered.
If you have no text to send, promptly send a finish-task instruction to end the task.
The server enforces a 23-second timeout. The client cannot change this setting.
Example:
{
"header": {
"action": "continue-task",
"task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx", // Random UUID
"streaming": "duplex"
},
"payload": {
"input": {
"text": "A bright moonbeam shines before my bed, I wonder if it's frost upon the ground."
}
}
}

header parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| header.action | string | Yes | The instruction type. For this instruction, the value is fixed as "continue-task". |
| header.task_id | string | Yes | The ID of the current task. This must be the same as the task_id used in the run-task instruction. |
| header.streaming | string | Yes | Fixed string: "duplex". |
payload parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| input.text | string | Yes | The text for synthesis. |
3. finish-task instruction: End a task
This instruction ends the speech synthesis task.
You must send this instruction to ensure that no part of the synthesized speech is missing.
After the client sends this instruction, the server converts any remaining text into speech. When the synthesis is complete, the server returns a task-finished event to the client.
When to send: Send this instruction after all continue-task instructions have been sent.
Example:
{
"header": {
"action": "finish-task",
"task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
"streaming": "duplex"
},
"payload": {
"input": {}// The input field cannot be omitted. Otherwise, an error is reported.
}
}header parameters
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| header.action | string | Yes | The instruction type. For this instruction, the value is fixed as "finish-task". |
| header.task_id | string | Yes | The ID of the current task. This must be the same as the task_id used in the run-task instruction. |
| header.streaming | string | Yes | Fixed string: "duplex". |
payload parameters
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| payload.input | object | Yes | Fixed format: {}. |
Events (server to client)
Events are messages that the server sends to the client. These events are in JSON format and represent different processing stages.
The binary audio that the server returns to the client is not included in any event and must be received separately.
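In practice, this means the client must branch on the frame type in its message handler. The following is a minimal sketch in Python, assuming the third-party websocket-client package, where binary frames arrive as bytes; audio_buffer and handle_event are hypothetical placeholders for your own buffering and event-handling logic:

import json

def on_message(ws, message):
    if isinstance(message, (bytes, bytearray)):  # binary frame: audio data
        audio_buffer.write(message)              # hypothetical sink for audio
    else:                                        # text frame: JSON event
        handle_event(json.loads(message))        # hypothetical event dispatcher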
1. task-started event: Task has started
The task-started event from the server indicates that the task has started successfully. You must receive this event before you send a continue-task instruction or a finish-task instruction to the server. Otherwise, the task fails.
The payload of the task-started event is empty.
Example:
{
"header": {
"task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
"event": "task-started",
"attributes": {}
},
"payload": {}
}

header parameters

| Parameter | Type | Description |
| --- | --- | --- |
| header.event | string | The event type. For this event, the value is fixed as "task-started". |
| header.task_id | string | The task_id generated by the client. |
2. result-generated event
When the client sends continue-task and finish-task instructions, the server continuously returns result-generated events.
In the CosyVoice service, the result-generated event is reserved for the protocol. It encapsulates information, such as the Request ID, and can be ignored.
Example:
{
"header": {
"task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
"event": "result-generated",
"attributes": {
"request_uuid": "0a9dba9e-d3a6-45a4-be6d-xxxxxxxxxxxx"
}
},
"payload": {}
}

header parameters

| Parameter | Type | Description |
| --- | --- | --- |
| header.event | string | The event type. For this event, the value is fixed as "result-generated". |
| header.task_id | string | The task_id generated by the client. |
| header.attributes.request_uuid | string | The Request ID. |
payload parameters

| Parameter | Type | Description |
| --- | --- | --- |
| payload.usage.characters | integer | The number of billable characters in the current request so far. In a single task, this value is cumulative; the value in the last event that you receive is the final count. |
3. task-finished event: Task has ended
The task-finished event from the server indicates that the task has ended.
After the task finishes, you can close the WebSocket connection to terminate the program. You can also reuse the WebSocket connection and send another run-task instruction to start the next task. For more information, see Connection overhead and reuse.
Example:
{
"header": {
"task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
"event": "task-finished",
"attributes": {
"request_uuid": "0a9dba9e-d3a6-45a4-be6d-xxxxxxxxxxxx"
}
},
"payload": {
"output": {
"sentence": {
"words": []
}
},
"usage": {
"characters": 13
}
}
}

header parameters

| Parameter | Type | Description |
| --- | --- | --- |
| header.event | string | The event type. For this event, the value is fixed as "task-finished". |
| header.task_id | string | The task_id generated by the client. |
| header.attributes.request_uuid | string | The Request ID. You can provide this to CosyVoice developers to locate issues. |
payload parameters
| Parameter | Type | Description |
| --- | --- | --- |
| payload.usage.characters | integer | The number of billable characters in the current request so far. In a single task, this value is cumulative; the value in the last event that you receive is the final count. |
| payload.output.sentence.index | integer | The sentence number, starting from 0. The sentence fields are returned only when character-level timestamps are enabled by setting word_timestamp_enabled to true. |
| payload.output.sentence.words[k].text | string | The text of the character. |
| payload.output.sentence.words[k].begin_index | integer | The start position index of the character in the sentence, starting from 0. |
| payload.output.sentence.words[k].end_index | integer | The end position index of the character in the sentence, starting from 1. |
| payload.output.sentence.words[k].begin_time | integer | The start timestamp of the audio corresponding to the character, in milliseconds. |
| payload.output.sentence.words[k].end_time | integer | The end timestamp of the audio corresponding to the character, in milliseconds. |
If you enable character-level timestamps using word_timestamp_enabled, the response includes timestamp information. The following is an example:
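The structure below is assembled from the field descriptions above; the timestamp values are placeholders for illustration, not actual service output.
{
    "header": {
        "task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
        "event": "task-finished",
        "attributes": {
            "request_uuid": "0a9dba9e-d3a6-45a4-be6d-xxxxxxxxxxxx"
        }
    },
    "payload": {
        "output": {
            "sentence": {
                "index": 0,
                "words": [
                    { "text": "你", "begin_index": 0, "end_index": 1, "begin_time": 0, "end_time": 250 },
                    { "text": "好", "begin_index": 1, "end_index": 2, "begin_time": 250, "end_time": 520 }
                ]
            }
        },
        "usage": {
            "characters": 4
        }
    }
}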
4. task-failed event: Task has failed
If you receive a task-failed event, this indicates that the task has failed. You must then close the WebSocket connection and handle the error. Analyze the error message to determine the cause. If the failure is due to a programming issue, you can modify your code to fix it.
Example:
{
"header": {
"task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
"event": "task-failed",
"error_code": "InvalidParameter",
"error_message": "[tts:]Engine return error code: 418",
"attributes": {}
},
"payload": {}
}

header parameters

| Parameter | Type | Description |
| --- | --- | --- |
| header.event | string | The event type. For this event, the value is fixed as "task-failed". |
| header.task_id | string | The task_id generated by the client. |
| header.error_code | string | The error type description. |
| header.error_message | string | The specific cause of the error. |
Connection overhead and reuse
The WebSocket service supports connection reuse. This improves resource utilization and reduces the overhead from establishing new connections.
The server starts a new task after it receives a run-task instruction from the client. When the client sends a finish-task instruction, the server completes the task and returns a task-finished event. After the task ends, the WebSocket connection can be reused. This allows the client to send another run-task instruction to start the next task.
Each task in a reused connection requires a unique task_id.
If a task fails during execution, the service returns a task-failed event and closes the connection. In this case, the connection cannot be reused.
If a new task is not started within 60 seconds after the previous task ends, the connection automatically times out and closes.
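A minimal sketch of reuse, assuming ws is an open connection whose previous task has finished and build_run_task is a hypothetical helper that builds the run-task JSON shown earlier:

import json
import uuid

# Start the next task on the same open connection within 60 seconds.
next_task_id = str(uuid.uuid4())  # each task needs a unique task_id
ws.send(json.dumps(build_run_task(next_task_id)))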
Sample code
The sample code provides a basic implementation for calling the service. You must adapt it to your specific business scenarios.
When you write WebSocket client code, asynchronous programming is typically used so that messages can be sent and received simultaneously.
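The following is a minimal, runnable sketch in Python, assuming the third-party websocket-client package (pip install websocket-client) and an API key in the DASHSCOPE_API_KEY environment variable. It runs one task end to end and saves the audio to an MP3 file; it is not production code, and a real client would add error handling, pacing of continue-task instructions, and streaming playback:

import json
import os
import uuid

import websocket  # pip install websocket-client

URL = "wss://dashscope.aliyuncs.com/api-ws/v1/inference"
TASK_ID = str(uuid.uuid4())
audio_file = open("result.mp3", "wb")

def on_open(ws):
    # Step 1: start the task after the connection is established.
    ws.send(json.dumps({
        "header": {"action": "run-task", "task_id": TASK_ID, "streaming": "duplex"},
        "payload": {
            "task_group": "audio", "task": "tts", "function": "SpeechSynthesizer",
            "model": "cosyvoice-v3-flash",
            "parameters": {"text_type": "PlainText", "voice": "longanyang", "format": "mp3"},
            "input": {},
        },
    }))

def on_message(ws, message):
    if isinstance(message, (bytes, bytearray)):
        audio_file.write(message)  # binary frame: audio stream
        return
    event = json.loads(message)["header"]["event"]
    if event == "task-started":
        # Step 2: send text only after task-started is received.
        ws.send(json.dumps({
            "header": {"action": "continue-task", "task_id": TASK_ID, "streaming": "duplex"},
            "payload": {"input": {"text": "A bright moonbeam shines before my bed."}},
        }))
        # Step 3: end the task so that all cached text is synthesized.
        ws.send(json.dumps({
            "header": {"action": "finish-task", "task_id": TASK_ID, "streaming": "duplex"},
            "payload": {"input": {}},
        }))
    elif event in ("task-finished", "task-failed"):
        if event == "task-failed":
            print("Task failed:", message)
        audio_file.close()
        ws.close()  # Step 4: close (or reuse) the connection.

ws = websocket.WebSocketApp(
    URL,
    header={"Authorization": "bearer " + os.environ["DASHSCOPE_API_KEY"]},
    on_open=on_open,
    on_message=on_message,
)
ws.run_forever()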
Error codes
For troubleshooting information, see Error messages.
FAQ
Features, billing, and rate limiting
Q: What can I do to fix inaccurate pronunciation?
You can use SSML to customize the speech synthesis output.
Q: Why use the WebSocket protocol instead of HTTP/HTTPS? Why isn't a RESTful API provided?
The Voice Service uses WebSocket instead of HTTP, HTTPS, or RESTful APIs because it requires full-duplex communication. WebSocket allows both the server and client to send data. This is necessary for features such as pushing the progress of real-time speech synthesis or recognition. In contrast, a RESTful API is based on HTTP and supports only a one-way, client-initiated request-response model, which cannot meet the requirements for real-time interaction.
Q: Speech synthesis is billed based on the number of text characters. How can I view or obtain the text length for each synthesis?
You can obtain the character count from the payload.usage.characters parameter in the result-generated event that is returned by the server. The value from the last result-generated event that you receive is the final count.
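For example, a client might track the count as follows. This is a sketch; handle_event stands for whatever function processes parsed JSON events in your client:

billable_characters = 0

def handle_event(event: dict) -> None:
    """Update the running character count from a parsed JSON event."""
    global billable_characters
    usage = (event.get("payload") or {}).get("usage") or {}
    if "characters" in usage:
        # The value grows as the task progresses; the last value received is final.
        billable_characters = usage["characters"]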
Troubleshooting
If a code error occurs, check that the instructions sent to the server are correct. You can print the instruction content to check for incorrect formats or missing required parameters. If the instructions are correct, troubleshoot the issue using the information in Error codes.
Q: How do I get the Request ID?
You can obtain it in one of the following two ways:
Parse the information that the server returns in the result-generated event.
Parse the information that the server returns in the task-finished event.
Q: Why does the SSML feature fail?
Troubleshoot the issue by following these steps:
Make sure that the scope of application is correct.
Make sure that you are calling the feature correctly. For more information, see SSML support.
Make sure that the text for synthesis is in plain text and meets the formatting requirements. For more information, see Introduction to SSML.
Q: Why can't the audio be played?
Troubleshoot this issue based on the following scenarios:
The audio is saved as a complete file, such as an .mp3 file.
Audio format consistency: Ensure that the audio format specified in the request parameters matches the file extension. For example, playback might fail if the audio format is set to WAV in the request parameters but the file has an .mp3 extension.
Player compatibility: Confirm that your player supports the format and sample rate of the audio file. For example, some players might not support high sample rates or specific audio encodings.
The audio is played in streaming mode.
Save the audio stream as a complete file and try to play it. If the file fails to play, see the troubleshooting steps for the first scenario.
If the file plays correctly, the issue might be with the streaming playback implementation. Confirm that your player supports streaming playback.
Common tools and libraries that support streaming playback include FFmpeg, pyaudio (Python), AudioFormat (Java), and MediaSource (JavaScript).
Q: Why does the audio playback stutter?
Troubleshoot this issue based on the following scenarios:
Check the text sending speed: Ensure that the text sending interval is reasonable. Do not wait until the audio for the previous segment has finished playing before you send the next text segment.
Check the callback function performance:
Check whether the callback function contains excessive business logic that could cause it to block.
The callback function runs in the WebSocket thread. If this thread is blocked, it can interfere with the WebSocket's ability to receive network packets, resulting in audio stuttering.
To avoid blocking the WebSocket thread, write the audio data to a separate audio buffer and then use another thread to read and process it, as shown in the sketch after this list.
Check network stability: Ensure that your network connection is stable to prevent audio transmission interruptions or delays caused by network fluctuations.
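A minimal sketch of the buffering pattern in Python, assuming the websocket-client package; play is a hypothetical placeholder for feeding your decoder or audio device:

import queue
import threading

audio_queue = queue.Queue()  # buffer between the WebSocket thread and playback

def on_message(ws, message):
    if isinstance(message, (bytes, bytearray)):
        audio_queue.put(bytes(message))  # enqueue only; never block this thread

def playback_worker():
    while True:
        chunk = audio_queue.get()
        play(chunk)  # hypothetical: decode and feed your audio device here

threading.Thread(target=playback_worker, daemon=True).start()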
Q: Why is speech synthesis slow (long synthesis time)?
Perform the following troubleshooting steps:
Check the input interval
If you are using streaming speech synthesis, check whether the text sending interval is too long. For example, a delay of several seconds before sending the next segment will increase the total synthesis time.
Analyze performance metrics
First packet delay: This is typically around 500 ms.
Real-Time Factor (RTF): This is calculated as Total Synthesis Time / Audio Duration. The RTF is normally less than 1.0.
Q: How do I handle incorrect pronunciation in the synthesized speech?
Use the <phoneme> tag of SSML to specify the correct pronunciation.
Q: Why is no speech returned? Why is the end of the text not converted into speech? (Missing synthesized speech)
Check if you forgot to send the finish-task instruction. During speech synthesis, the server starts synthesizing only after it has cached enough text. If you forget to send the finish-task instruction, the text at the end of the cache may not be synthesized.
Q: Why is the order of the returned audio stream incorrect, causing chaotic playback?
Troubleshoot this issue by checking the following:
Make sure that the run-task, continue-task, and finish-task instructions for the same synthesis task use the same task_id.
Check whether an asynchronous operation is causing the audio data to be written to the file in a different order from the order in which it was received.
Permissions and authentication
Q: I want my API key to be used only for the CosyVoice speech synthesis service, not for other Model Studio models (permission isolation). What should I do?
You can create a workspace and authorize only specific models to limit the scope of the API key. For more information, see Manage workspaces.
More questions
For more information, see the Q&A on GitHub.