To use a model in the China (Beijing) region, obtain an API key from the API key page for the China (Beijing) region.
This topic describes how to access the CosyVoice speech synthesis service using a WebSocket connection.
The DashScope SDK currently supports only Java and Python. For other programming languages, you can develop CosyVoice speech synthesis applications by communicating with the service directly through a WebSocket connection.
User guide: For more information about the models and guidance on model selection, see Real-time speech synthesis - CosyVoice.
WebSocket is a network protocol that supports full-duplex communication. The client and server establish a persistent connection with a single handshake, which allows both parties to actively push data to each other. This provides significant advantages in real-time performance and efficiency.
For common programming languages, many ready-to-use WebSocket libraries and examples are available, such as:
Go: gorilla/websocket
PHP: Ratchet
Node.js: ws
Familiarize yourself with the basic principles and technical details of WebSocket before you begin development.
Prerequisites
You have activated Alibaba Cloud Model Studio and created an API key. To prevent security risks, export the API key as an environment variable instead of hard-coding it in your code.
To grant temporary access permissions to third-party applications or users, or if you want to strictly control high-risk operations such as accessing or deleting sensitive data, we recommend that you use a temporary authentication token.
Compared with long-term API keys, temporary authentication tokens are more secure because they are short-lived (60 seconds). They are suitable for temporary call scenarios and can effectively reduce the risk of API key leakage.
To use a temporary token, replace the API key used for authentication in your code with the temporary authentication token.
Models and pricing
| Model | Unit price |
| --- | --- |
| cosyvoice-v3-plus | $0.286706 per 10,000 characters |
| cosyvoice-v3-flash | $0.14335 per 10,000 characters |
| cosyvoice-v2 | $0.286706 per 10,000 characters |
Text limits and format specifications for speech synthesis
Text length limits
The text sent in a single continue-task instruction cannot exceed 2,000 characters, and the total text sent across all continue-task instructions in a task cannot exceed 200,000 characters.
Character counting rules
A Chinese character, including simplified or traditional Chinese, Japanese kanji, and Korean hanja, is counted as 2 characters. All other characters, such as punctuation marks, letters, numbers, and Japanese or Korean kana or hangul, are counted as 1 character.
SSML tags are not included in the text length calculation.
Examples:
"你好"→ 2(你) + 2(好) = 4 characters"中A文123"→ 2(中) + 1(A) + 2(文) + 1(1) + 1(2) + 1(3) = 8 characters"中文。"→ 2(中) + 2(文) + 1(。) = 5 characters"中 文。"→ 2(中) + 1(space) + 2(文) + 1(。) = 6 characters"<speak>你好</speak>"→ 2(你) + 2(好) = 4 characters
Encoding format
Use UTF-8 encoding.
Support for mathematical expressions
The mathematical expression parsing feature is currently available only for the cosyvoice-v2, cosyvoice-v3-flash, and cosyvoice-v3-plus models. This feature supports common mathematical expressions from primary and secondary school, such as basic arithmetic, algebra, and geometry.
For more information, see Convert LaTeX formulas to speech.
SSML support
The Speech Synthesis Markup Language (SSML) feature is currently available only for cloned voices of the cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 models, and for system voices that are marked as supported in the voice list.
To use this feature:
1. When you send the run-task instruction, set the enable_ssml parameter to true to enable SSML support.
2. Send the text that contains SSML using the continue-task instruction.
After you enable SSML support by setting the enable_ssml parameter to true, you can submit the complete text for synthesis in only one continue-task instruction. Multiple submissions are not supported.
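For illustration, the relevant fragments might look as follows (the voice value is a placeholder, and the SSML text is a minimal example; see Speech Synthesis Markup Language for supported tags):

In the run-task instruction:

"parameters": {
  "text_type": "PlainText",
  "voice": "your_cloned_voice",
  "enable_ssml": true
}

Then, in the single continue-task instruction:

"payload": {
  "input": {
    "text": "<speak>你好，世界</speak>"
  }
}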
Interaction flow
Messages sent from the client to the server are called instructions. The server returns two types of messages to the client: events in JSON format and binary audio streams.
The interaction flow between the client and the server is as follows:
Establish a connection: The client establishes a WebSocket connection with the server.
Start a task:
The client sends a run-task instruction to start a task.
The client receives a task-started event from the server. This indicates that the task has successfully started and you can proceed to the next steps.
Send text for synthesis:
The client sends one or more continue-task instructions to the server in sequence. These instructions contain the text to be synthesized. Once the server receives a complete sentence, it returns an audio stream. The length of the text is limited. For more information, see the description of the text field in the continue-task instruction.
Note: You can send multiple continue-task instructions to submit text fragments sequentially. The server automatically segments sentences after receiving the text fragments:
Complete sentences are synthesized immediately. The client can then receive the audio returned by the server.
Incomplete sentences are cached until they are complete, and then synthesized. The server does not return audio for incomplete sentences.
When you send a finish-task instruction, the server forces the synthesis of all cached content.
Notify the server to end the task:
After all text is sent, the client sends a finish-task instruction to notify the server to end the task and continues to receive the audio stream from the server. This step is crucial. Otherwise, you may not receive the complete audio.
End the task:
The client receives a task-finished event from the server, which indicates that the task is finished.
Close the connection: The client closes the WebSocket connection.
URL
The WebSocket URL is fixed as follows:
wss://dashscope.aliyuncs.com/api-ws/v1/inference

Headers
Add the following information to the request header:
{
"Authorization": "bearer <your_dashscope_api_key>", // Required. Replace <your_dashscope_api_key> with your API key.
"user-agent": "your_platform_info", // Optional.
"X-DashScope-WorkSpace": workspace, // Optional. The ID of your workspace in Alibaba Cloud Model Studio.
"X-DashScope-DataInspection": "enable"
}

Instructions (client to server)
Instructions are messages sent from the client to the server. They are in JSON format, sent as Text Frames, and are used to control the start and end of a task and identify task boundaries.
Send instructions in the following strict sequence. Otherwise, the task may fail.
Send a run-task instruction
Starts the speech synthesis task.
The task_id used in the run-task instruction must be reused, unchanged, in all subsequent continue-task instructions and the final finish-task instruction.
Send a continue-task instruction
Sends the text to be synthesized.
You can send this instruction only after you receive the task-started event from the server.
Send a finish-task instruction
Ends the speech synthesis task.
Send this instruction after all continue-task instructions have been sent.
1. run-task instruction: Start a task
This instruction starts a speech synthesis task. You can set request parameters such as the voice and sample rate in this instruction.
When to send: After the WebSocket connection is established.
Do not send text for synthesis in this instruction. Sending text here makes troubleshooting difficult.
Example:
{
"header": {
"action": "run-task",
"task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx", // Random UUID.
"streaming": "duplex"
},
"payload": {
"task_group": "audio",
"task": "tts",
"function": "SpeechSynthesizer",
"model": "cosyvoice-v2",
"parameters": {
"text_type": "PlainText",
"voice": "longxiaochun_v2", // Voice
"format": "mp3", // Audio format
"sample_rate": 22050, // Sample rate
"volume": 50, // Volume
"rate": 1, // Speech rate
"pitch": 1 // Pitch
},
"input": {// The input field cannot be omitted. Otherwise, an error is reported.
}
}
}header parameters:
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| header.action | string | Yes | The instruction type. For this instruction, the value is fixed as "run-task". |
| header.task_id | string | Yes | The ID of the current task: a universally unique identifier (UUID) consisting of 32 randomly generated hexadecimal characters. Hyphens are allowed (for example, "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx"). The continue-task and finish-task instructions that follow must use the same task_id as the run-task instruction. |
| header.streaming | string | Yes | Fixed string: "duplex". |
payload parameters:

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| payload.task_group | string | Yes | Fixed string: "audio". |
| payload.task | string | Yes | Fixed string: "tts". |
| payload.function | string | Yes | Fixed string: "SpeechSynthesizer". |
| payload.model | string | Yes | The speech synthesis model. Different models require matching voices; see the voice parameter. |
| payload.input | object | Yes | Fixed format: {}. The input field cannot be omitted. Otherwise, an error is reported. |
| payload.parameters | object | Yes | The synthesis parameters. See the following table. |

payload.parameters:

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| text_type | string | Yes | Fixed string: "PlainText". |
| voice | string | Yes | The voice to use for speech synthesis. Both system voices and cloned voices are supported. |
| format | string | No | The audio encoding format. Supported formats are pcm, wav, mp3 (default), and opus. When the format is opus, you can adjust the bitrate using the bit_rate parameter. |
| sample_rate | integer | No | The audio sample rate in Hz. Default value: 22050. Valid values: 8000, 16000, 22050, 24000, 44100, 48000. The default sample rate is the optimal rate for the current voice and is used by default; downsampling and upsampling are also supported. |
| volume | integer | No | The volume. Default value: 50. Value range: [0, 100]. A value of 50 is the standard volume, 0 is mute, and 100 is the maximum volume; the volume has a linear relationship with this value. |
| rate | float | No | The speech rate. Default value: 1.0. Value range: [0.5, 2.0]. A value of 1.0 is the standard rate; values less than 1.0 slow the speech down, and values greater than 1.0 speed it up. |
| pitch | float | No | The pitch multiplier. Default value: 1.0. Value range: [0.5, 2.0]. A value of 1.0 is the natural pitch of the voice; values greater than 1.0 raise the pitch, and values less than 1.0 lower it. The relationship between this value and the perceived pitch is not strictly linear or logarithmic, so test different values to find the best one. |
| enable_ssml | boolean | No | Specifies whether to enable the SSML feature. If set to true, you must submit the complete text in a single continue-task instruction; multiple submissions are not supported. See SSML support. |
| bit_rate | integer | No | The audio bitrate in kbps. Takes effect only when the audio format is opus. Default value: 32. Value range: [6, 510]. |
| word_timestamp_enabled | boolean | No | Specifies whether to enable character-level timestamps. Default value: false. Available only for cloned voices of the cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 models, and for system voices marked as supported in the voice list. |
| seed | integer | No | The random number seed used during generation, which varies the synthesis result. With the same model version, text, voice, and other parameters, the same seed reproduces the same synthesis result. Default value: 0. Value range: [0, 65535]. |
| language_hints | array[string] | No | Provides language hints for synthesis. Only cosyvoice-v3-flash and cosyvoice-v3-plus support this feature. No default value; the parameter has no effect if it is not set. If the specified language hint clearly does not match the text content, the hint may not take effect. Note: This parameter is an array, but the current version processes only the first element, so pass only one value. |
| instruction | string | No | Sets a synthesis instruction. Available only for cloned voices of the cosyvoice-v3-flash and cosyvoice-v3-plus models, and for system voices marked as supported in the voice list. No default value; the parameter has no effect if it is not set. |
| enable_aigc_tag | boolean | No | Specifies whether to add an invisible AIGC identifier to the generated audio. When set to true, an invisible identifier is embedded into audio in supported formats (WAV, MP3, and Opus). Default value: false. Available only for the cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 models. |
| aigc_propagator | string | No | Sets the propagator information recorded in the AIGC identifier. Default value: your Alibaba Cloud UID. Available only for the cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 models. |
| aigc_propagate_id | string | No | Sets the propagation ID recorded in the AIGC identifier. Default value: the request ID of the current speech synthesis request. Available only for the cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 models. |
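For example, in Python a compliant header.task_id can be generated with the standard uuid module:

import uuid

task_id = uuid.uuid4().hex   # 32 random hexadecimal characters, no hyphens
# str(uuid.uuid4()) also works; hyphens are allowed in task_id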
2. continue-task instruction
This instruction is used exclusively to send the text to be synthesized.
You can send the text to be synthesized all at once in a single continue-task instruction, or you can segment the text and send it sequentially in multiple continue-task instructions.
When to send: After you receive the task-started event.
The interval between sending text fragments cannot exceed 23 seconds. Otherwise, a "request timeout after 23 seconds" exception is triggered.
If there is no more text to send, promptly send a finish-task instruction to end the task.
The server enforces a 23-second timeout. The client cannot modify this configuration.
Example:
{
"header": {
"action": "continue-task",
"task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx", // Random UUID.
"streaming": "duplex"
},
"payload": {
"input": {
"text": "A quiet night thought, I see the moonlight before my bed"
}
}
}

header parameters:
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| header.action | string | Yes | The instruction type. For this instruction, the value is fixed as "continue-task". |
| header.task_id | string | Yes | The ID of the current task. This must be the same as the task_id used in the run-task instruction. |
| header.streaming | string | Yes | Fixed string: "duplex". |
payload parameters:
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| input.text | string | Yes | The text to be synthesized. The text in a single instruction cannot exceed 2,000 characters, and the total text across all continue-task instructions in a task cannot exceed 200,000 characters. See Text length limits. |
3. finish-task instruction: End a task
This instruction ends a speech synthesis task.
You must send this instruction. Otherwise, the synthesized speech may be incomplete.
After this instruction is sent, the server converts the remaining text into speech. After the speech synthesis is complete, the server returns a task-finished event to the client.
When to send: After all continue-task instructions have been sent.
Example:
{
"header": {
"action": "finish-task",
"task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
"streaming": "duplex"
},
"payload": {
"input": {}// The input field cannot be omitted. Otherwise, an error is reported.
}
}header parameters:
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| header.action | string | Yes | The instruction type. For this instruction, the value is fixed as "finish-task". |
| header.task_id | string | Yes | The ID of the current task. This must be the same as the task_id used in the run-task instruction. |
| header.streaming | string | Yes | Fixed string: "duplex". |
payload parameters:
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| payload.input | object | Yes | Fixed format: {}. |
Events (server to client)
Events are messages returned from the server to the client. They are in JSON format and represent different processing stages.
The binary audio returned from the server to the client is not included in any event and must be received separately.
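For example, with the Python websocket-client library, binary frames arrive as bytes and JSON events as str, so the client can dispatch on the message type (handle_event and audio_buffer are names introduced here for illustration):

import json

audio_buffer = bytearray()

def on_message(ws, message):
    if isinstance(message, bytes):
        audio_buffer.extend(message)       # binary frame: audio stream
    else:
        handle_event(json.loads(message))  # text frame: JSON event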
1. task-started event: Task has started
When you receive the task-started event from the server, it indicates that the task has started successfully. You can send a continue-task instruction or a finish-task instruction to the server only after receiving this event. Otherwise, the task will fail.
The task-started event's payload is empty.
Example:
{
"header": {
"task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
"event": "task-started",
"attributes": {}
},
"payload": {}
}

header parameters:
| Parameter | Type | Description |
| --- | --- | --- |
| header.event | string | The event type. For this event, the value is fixed as "task-started". |
| header.task_id | string | The task_id generated by the client. |
2. result-generated event
While the client sends continue-task instructions and a finish-task instruction, the server continuously returns result-generated events.
In the CosyVoice service, the result-generated event is a reserved interface in the protocol. It carries information such as the request ID and the billable character count so far, and in most cases it can be ignored.
Example:
{
"header": {
"task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
"event": "result-generated",
"attributes": {
"request_uuid": "0a9dba9e-d3a6-45a4-be6d-xxxxxxxxxxxx"
}
},
"payload": {}
}

header parameters:
| Parameter | Type | Description |
| --- | --- | --- |
| header.event | string | The event type. For this event, the value is fixed as "result-generated". |
| header.task_id | string | The task_id generated by the client. |
| header.attributes.request_uuid | string | The request ID. |
payload parameters:
| Parameter | Type | Description |
| --- | --- | --- |
| payload.usage.characters | integer | The number of billable characters in the current request so far. Within a single task, this value is cumulative; use the value from the last result-generated event that you receive as the final count. |
3. task-finished event: Task has finished
When you receive the task-finished event from the server, it indicates that the task is finished.
After the task is finished, you can close the WebSocket connection to end the program, or you can reuse the WebSocket connection and send another run-task instruction to start the next task. For more information, see Connection overhead and reuse.
Example:
{
"header": {
"task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
"event": "task-finished",
"attributes": {
"request_uuid": "0a9dba9e-d3a6-45a4-be6d-xxxxxxxxxxxx"
}
},
"payload": {
"output": {
"sentence": {
"words": []
}
},
"usage": {
"characters": 13
}
}
}

header parameters:
| Parameter | Type | Description |
| --- | --- | --- |
| header.event | string | The event type. For this event, the value is fixed as "task-finished". |
| header.task_id | string | The task_id generated by the client. |
| header.attributes.request_uuid | string | The request ID. You can provide this to CosyVoice developers to help locate issues. |
payload parameters:
| Parameter | Type | Description |
| --- | --- | --- |
| payload.usage.characters | integer | The total number of billable characters used in the current request. Within a task, the characters value is cumulative, and the task-finished event carries the final total. |
| payload.output.sentence.index | integer | The number of the sentence, starting from 0. This field and the following fields are returned only when word-level timestamps are enabled using word_timestamp_enabled. |

payload.output.sentence.words[k]:

| Parameter | Type | Description |
| --- | --- | --- |
| text | string | The text of the word. |
| begin_index | integer | The start position index of the word in the sentence, starting from 0. |
| end_index | integer | The end position index of the word in the sentence, starting from 1. |
| begin_time | integer | The start timestamp of the audio corresponding to the word, in milliseconds. |
| end_time | integer | The end timestamp of the audio corresponding to the word, in milliseconds. |
After you enable word-level timestamps using word_timestamp_enabled, timestamp information is returned in the task-finished event. The following is an illustrative example with hypothetical values:
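{
  "payload": {
    "output": {
      "sentence": {
        "index": 0,
        "words": [
          { "text": "你", "begin_index": 0, "end_index": 1, "begin_time": 0, "end_time": 180 },
          { "text": "好", "begin_index": 1, "end_index": 2, "begin_time": 180, "end_time": 360 }
        ]
      }
    },
    "usage": { "characters": 4 }
  }
}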
4. task-failed event: Task has failed
If you receive a task-failed event, it indicates that the task has failed. At this point, close the WebSocket connection and handle the error. You can analyze the error message to identify and fix programming issues in your code.
Example:
{
"header": {
"task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
"event": "task-failed",
"error_code": "InvalidParameter",
"error_message": "[tts:]Engine return error code: 418",
"attributes": {}
},
"payload": {}
}

header parameters:
| Parameter | Type | Description |
| --- | --- | --- |
| header.event | string | The event type. For this event, the value is fixed as "task-failed". |
| header.task_id | string | The task_id generated by the client. |
| header.error_code | string | A description of the error type. |
| header.error_message | string | The specific reason for the error. |
Connection overhead and reuse
The WebSocket service supports connection reuse to improve resource utilization and avoid connection establishment overhead.
After the server receives a run-task instruction from the client, it starts a new task. After the client sends a finish-task instruction, the server returns a task-finished event to end the task when it is complete. After a task is finished, the WebSocket connection can be reused. The client can send another run-task instruction to start the next task.
Different tasks in a reused connection must use different task_ids.
If a task fails during execution, the service returns a task-failed event and closes the connection. The failed connection cannot be reused.
If there are no new tasks for 60 seconds after a task is finished, the connection will time out and be automatically disconnected.
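As a sketch, reusing one connection for several tasks might look like the following (run_task_message, send_text_and_finish, and wait_for_task_finished are hypothetical helpers; see the sample code below for a complete client):

import json
import uuid

for text in ["First paragraph.", "Second paragraph."]:
    task_id = uuid.uuid4().hex  # each task on the reused connection needs a new task_id
    ws.send(json.dumps(run_task_message(task_id)))
    send_text_and_finish(ws, task_id, text)   # continue-task instructions + finish-task
    wait_for_task_finished(ws)                # the connection stays open between tasks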
Sample code
The sample code provides a basic implementation to demonstrate how the service works. You must adapt the code for your specific business scenarios.
When you write a WebSocket client, asynchronous programming is typically used to send and receive messages simultaneously. You can structure your program around the interaction flow described above: establish the WebSocket connection, send the run-task instruction, wait for the task-started event, send one or more continue-task instructions followed by a finish-task instruction, receive JSON events and binary audio until the task-finished event arrives, and then close the connection.
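The following is a minimal, illustrative Python client built on the third-party websocket-client library (pip install websocket-client). It assumes that the DASHSCOPE_API_KEY environment variable is set; the model and voice values are examples, and error handling is omitted for brevity.

import json
import os
import uuid

import websocket  # third-party package: websocket-client

URL = "wss://dashscope.aliyuncs.com/api-ws/v1/inference"
TASK_ID = str(uuid.uuid4())
audio = bytearray()

def on_open(ws):
    # Step 1: start the task as soon as the connection is established.
    ws.send(json.dumps({
        "header": {"action": "run-task", "task_id": TASK_ID, "streaming": "duplex"},
        "payload": {
            "task_group": "audio", "task": "tts", "function": "SpeechSynthesizer",
            "model": "cosyvoice-v2",  # example model
            "parameters": {"text_type": "PlainText", "voice": "longxiaochun_v2", "format": "mp3"},
            "input": {},  # required even though it is empty
        },
    }))

def on_message(ws, message):
    if isinstance(message, bytes):
        audio.extend(message)  # binary frames carry the audio stream
        return
    event = json.loads(message)["header"]["event"]
    if event == "task-started":
        # Step 2: send text only after task-started, then end the task.
        ws.send(json.dumps({
            "header": {"action": "continue-task", "task_id": TASK_ID, "streaming": "duplex"},
            "payload": {"input": {"text": "Hello, CosyVoice."}},
        }))
        ws.send(json.dumps({
            "header": {"action": "finish-task", "task_id": TASK_ID, "streaming": "duplex"},
            "payload": {"input": {}},
        }))
    elif event in ("task-finished", "task-failed"):
        # Step 3: the task is over; close the connection (or reuse it for a new task).
        ws.close()

ws = websocket.WebSocketApp(
    URL,
    header={"Authorization": "bearer " + os.environ["DASHSCOPE_API_KEY"]},
    on_open=on_open,
    on_message=on_message,
)
ws.run_forever()

with open("result.mp3", "wb") as f:
    f.write(bytes(audio))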
Error codes
For troubleshooting information, see Error messages.
FAQ
Features, billing, and limits
Q: What can I do to fix inaccurate pronunciation?
You can use SSML to customize the speech synthesis output.
Q: Why use the WebSocket protocol instead of the HTTP/HTTPS protocol? Why not provide a RESTful API?
Voice Service uses WebSocket instead of HTTP, HTTPS, or RESTful because it requires full-duplex communication. WebSocket allows the server and client to actively exchange data in both directions, such as pushing real-time speech synthesis or recognition progress. In contrast, HTTP-based RESTful APIs only support a one-way, client-initiated request-response model, which is unsuitable for real-time interaction.
Q: Speech synthesis is billed based on the number of characters. How can I find the character count for each synthesis task?
You can obtain the character count from the payload.usage.characters parameter of the result-generated event returned by the server. Use the value from the last result-generated event that you receive.
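For example, a small Python tracker (a sketch; event is a JSON event already parsed into a dict):

last_billable_characters = None

def track_usage(event: dict):
    # Remember the latest cumulative count; the last result-generated event holds the final value.
    global last_billable_characters
    if event.get("header", {}).get("event") == "result-generated":
        usage = event.get("payload", {}).get("usage") or {}
        if "characters" in usage:
            last_billable_characters = usage["characters"]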
Troubleshooting
If an error occurs in your code, check whether the instruction sent to the server is correct. You can print the content of the instruction to check for formatting errors or missing required parameters. If the instruction is correct, troubleshoot the issue based on the information in the error code.
Q: How do I get the Request ID?
You can obtain the Request ID in two ways:
Parse the result-generated event returned by the server; the request ID is in header.attributes.request_uuid.
Parse the task-finished event returned by the server; the request ID is in header.attributes.request_uuid.
Q: Why is the SSML feature failing?
Follow these steps to troubleshoot:
Ensure that the scope of application is correct.
Ensure that you call the service correctly. For more information, see SSML support.
Ensure that the text to be synthesized is in plain text format and meets the formatting requirements. For more information, see Speech Synthesis Markup Language.
Q: Why can't the audio be played?
Troubleshoot this issue based on the following scenarios:
The audio is saved as a complete file, such as an .mp3 file.
Audio format consistency: Ensure that the audio format specified in the request parameters matches the file extension. For example, playback might fail if the audio format is set to WAV in the request parameters but the file has an .mp3 extension.
Player compatibility: Confirm that your player supports the format and sample rate of the audio file. For example, some players might not support high sample rates or specific audio encodings.
The audio is played in streaming mode.
Save the audio stream as a complete file and try to play it. If the file fails to play, see the troubleshooting steps for the first scenario.
If the file plays correctly, the issue might be with the streaming playback implementation. Confirm that your player supports streaming playback.
Common tools and libraries that support streaming playback include FFmpeg, pyaudio (Python), AudioFormat (Java), and MediaSource (JavaScript).
Q: Why does the audio playback stutter?
Troubleshoot this issue based on the following scenarios:
Check the text sending speed: Ensure that the text sending interval is reasonable. Do not wait for the audio of the previous segment to finish playing before you send the next text segment.
Check the callback function performance:
Check whether the callback function contains excessive business logic that could cause it to block.
The callback function runs in the WebSocket thread. If this thread is blocked, it can interfere with the WebSocket's ability to receive network packets, resulting in audio stuttering.
To avoid blocking the WebSocket thread, write the audio data to a separate audio buffer and use another thread to read and process it, as shown in the sketch after this list.
Check network stability: Ensure that your network connection is stable to prevent audio transmission interruptions or delays caused by network fluctuations.
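A minimal Python sketch of this producer-consumer pattern (play_chunk is a hypothetical playback call, for example one backed by pyaudio):

import queue
import threading

audio_q: "queue.Queue[bytes]" = queue.Queue()

def on_binary_audio(data: bytes):
    # Runs in the WebSocket thread: only enqueue, never block.
    audio_q.put(data)

def playback_worker():
    while True:
        chunk = audio_q.get()
        if chunk is None:  # sentinel marking the end of the stream
            break
        play_chunk(chunk)  # hypothetical playback call

threading.Thread(target=playback_worker, daemon=True).start()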
Q: Why is speech synthesis slow (long synthesis time)?
Perform the following troubleshooting steps:
Check the input interval
If you are using streaming speech synthesis, check whether the text sending interval is too long. For example, a delay of several seconds before sending the next segment will increase the total synthesis time.
Analyze performance metrics
First packet delay: This is typically around 500 ms.
Real-Time Factor (RTF): This is calculated as Total Synthesis Time / Audio Duration. The RTF is normally less than 1.0.
Q: How do I handle incorrect pronunciation in the synthesized speech?
Use the <phoneme> tag of SSML to specify the correct pronunciation.
Q: Why is some audio missing? Why is the end of my text not synthesized into speech?
Ensure that you send the finish-task instruction. During the speech synthesis process, the server starts synthesis only after it caches a sufficient amount of text. If you forget to send the finish-task instruction, the last part of the text in the cache may not be synthesized into speech.
Q: Why are the returned audio stream segments out of order, causing jumbled playback?
Check the following two points:
Ensure that the run-task, continue-task, and finish-task instructions for the same synthesis task use the same task_id.
Check whether your asynchronous operations write audio data in a different order than it was received.
Permissions and authentication
Q: I want my API key to be used only for the CosyVoice speech synthesis service, not for other Model Studio models (permission isolation). What should I do?
You can create a workspace and authorize only specific models to limit the scope of the API key. For more information, see Manage workspaces.
More questions
For more information, see the Q&A on GitHub.