To use a model in the China (Beijing) region, obtain an API key from the API key page for the China (Beijing) region.
This topic describes how to access the CosyVoice speech synthesis service using a WebSocket connection.
The DashScope SDK currently supports only Java and Python. For other programming languages, you can develop CosyVoice speech synthesis applications by communicating with the service directly through a WebSocket connection.
User guide: For more information about the models and guidance on model selection, see Real-time speech synthesis - CosyVoice.
WebSocket is a network protocol that supports full-duplex communication. The client and server establish a persistent connection with a single handshake, which allows both parties to actively push data to each other. This provides significant advantages in real-time performance and efficiency.
For common programming languages, many ready-to-use WebSocket libraries and examples are available, such as:
Go: gorilla/websocket
PHP: Ratchet
Node.js: ws
Familiarize yourself with the basic principles and technical details of WebSocket before you begin development.
Prerequisites
You have activated Alibaba Cloud Model Studio and created an API key. To prevent security risks, export the API key as an environment variable instead of hard-coding it in your code.
To grant temporary access permissions to third-party applications or users, or if you want to strictly control high-risk operations such as accessing or deleting sensitive data, we recommend that you use a temporary authentication token.
Compared with long-term API keys, temporary authentication tokens are more secure because they are short-lived (60 seconds). They are suitable for temporary call scenarios and can effectively reduce the risk of API key leakage.
To use a temporary token, replace the API key used for authentication in your code with the temporary authentication token.
Models and pricing
| Model | Unit price |
| --- | --- |
| cosyvoice-v3-plus | $0.286706 per 10,000 characters |
| cosyvoice-v3-flash | $0.14335 per 10,000 characters |
| cosyvoice-v2 | $0.286706 per 10,000 characters |
Text limits and format specifications for speech synthesis
Text length limits
The text sent in a single continue-task instruction cannot exceed 2,000 characters, and the total text sent across all continue-task instructions in a task cannot exceed 200,000 characters.
Character counting rules
A Chinese character, including simplified or traditional Chinese, Japanese kanji, and Korean hanja, is counted as 2 characters. All other characters, such as punctuation marks, letters, numbers, and Japanese or Korean kana or hangul, are counted as 1 character.
SSML tags are not included in the text length calculation.
Examples:
"你好"→ 2(你) + 2(好) = 4 characters"中A文123"→ 2(中) + 1(A) + 2(文) + 1(1) + 1(2) + 1(3) = 8 characters"中文。"→ 2(中) + 2(文) + 1(。) = 5 characters"中 文。"→ 2(中) + 1(space) + 2(文) + 1(。) = 6 characters"<speak>你好</speak>"→ 2(你) + 2(好) = 4 characters
Encoding format
Use UTF-8 encoding.
Support for mathematical expressions
The mathematical expression parsing feature is currently available only for the cosyvoice-v2, cosyvoice-v3-flash, and cosyvoice-v3-plus models. This feature supports common mathematical expressions from primary and secondary school, such as basic arithmetic, algebra, and geometry.
For more information, see Convert LaTeX formulas to speech.
SSML support
The Speech Synthesis Markup Language (SSML) feature is currently available only for cloned voices of the cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 models, and for system voices that are marked as supported in the voice list.
To use this feature:
1. When you send the run-task instruction, set the enable_ssml parameter to true to enable SSML support.
2. Send the text that contains SSML using the continue-task instruction.
After you enable SSML support by setting the enable_ssml parameter to true, you can submit the complete text for synthesis in only one continue-task instruction. Multiple submissions are not supported.
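For illustration, the relevant fragments might look as follows (the voice value is a placeholder, and the SSML text is a minimal example; see Speech Synthesis Markup Language for supported tags):

In the run-task instruction:

"parameters": {
  "text_type": "PlainText",
  "voice": "your_cloned_voice",
  "enable_ssml": true
}

Then, in the single continue-task instruction:

"payload": {
  "input": {
    "text": "<speak>你好，世界</speak>"
  }
}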
Interaction flow
Messages sent from the client to the server are called instructions. The server returns two types of messages to the client: events in JSON format and binary audio streams.
The interaction flow between the client and the server is as follows:
Establish a connection: The client establishes a WebSocket connection with the server.
Start a task:
The client sends a run-task instruction to start a task.
The client receives a task-started event from the server. This indicates that the task has successfully started and you can proceed to the next steps.
Send text for synthesis:
The client sends one or more continue-task instructions to the server in sequence. These instructions contain the text to be synthesized. Once the server receives a complete sentence, it returns an audio stream. The length of the text is limited. For more information, see the description of the text field in the continue-task instruction.
Note: You can send multiple continue-task instructions to submit text fragments sequentially. The server automatically segments sentences after receiving the text fragments:
Complete sentences are synthesized immediately. The client can then receive the audio returned by the server.
Incomplete sentences are cached until they are complete, and then synthesized. The server does not return audio for incomplete sentences.
When you send a finish-task instruction, the server forces the synthesis of all cached content.
Notify the server to end the task:
After all text is sent, the client sends a finish-task instruction to notify the server to end the task and continues to receive the audio stream from the server. This step is crucial. Otherwise, you may not receive the complete audio.
End the task:
The client receives a task-finished event from the server, which indicates that the task is finished.
Close the connection: The client closes the WebSocket connection.
URL
The WebSocket URL is fixed as follows:
wss://dashscope.aliyuncs.com/api-ws/v1/inference

Headers
Add the following information to the request header:
{
"Authorization": "bearer <your_dashscope_api_key>", // Required. Replace <your_dashscope_api_key> with your API key.
"user-agent": "your_platform_info", // Optional.
"X-DashScope-WorkSpace": workspace, // Optional. The ID of your workspace in Alibaba Cloud Model Studio.
"X-DashScope-DataInspection": "enable"
}

Instructions (client to server)
Instructions are messages sent from the client to the server. They are in JSON format, sent as Text Frames, and are used to control the start and end of a task and identify task boundaries.
Send instructions in the following strict sequence. Otherwise, the task may fail.
Send a run-task instruction
Starts the speech synthesis task.
The task_id used in the run-task instruction must be reused, unchanged, in all subsequent continue-task instructions and the final finish-task instruction.
Send a continue-task instruction
Sends the text to be synthesized.
You can send this instruction only after you receive the task-started event from the server.
Send a finish-task instruction
Ends the speech synthesis task.
Send this instruction after all continue-task instructions have been sent.
1. run-task instruction: Start a task
This instruction starts a speech synthesis task. You can set request parameters such as the voice and sample rate in this instruction.
When to send: After the WebSocket connection is established.
Do not send text for synthesis in this instruction. Sending text here makes troubleshooting difficult.
Example:
{
"header": {
"action": "run-task",
"task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx", // Random UUID.
"streaming": "duplex"
},
"payload": {
"task_group": "audio",
"task": "tts",
"function": "SpeechSynthesizer",
"model": "cosyvoice-v2",
"parameters": {
"text_type": "PlainText",
"voice": "longxiaochun_v2", // Voice
"format": "mp3", // Audio format
"sample_rate": 22050, // Sample rate
"volume": 50, // Volume
"rate": 1, // Speech rate
"pitch": 1 // Pitch
},
"input": {// The input field cannot be omitted. Otherwise, an error is reported.
}
}
}header parameters:
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| header.action | string | Yes | The instruction type. For this instruction, the value is fixed as "run-task". |
| header.task_id | string | Yes | The ID of the current task: a universally unique identifier (UUID) consisting of 32 randomly generated hexadecimal characters. Hyphens are allowed (for example, "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx"). The continue-task and finish-task instructions that follow must use the same task_id as the run-task instruction. |
| header.streaming | string | Yes | Fixed string: "duplex". |
payload parameters:

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| payload.task_group | string | Yes | Fixed string: "audio". |
| payload.task | string | Yes | Fixed string: "tts". |
| payload.function | string | Yes | Fixed string: "SpeechSynthesizer". |
| payload.model | string | Yes | The speech synthesis model. Different models require matching voices; see the voice parameter. |
| payload.input | object | Yes | Fixed format: {}. The input field cannot be omitted. Otherwise, an error is reported. |
| payload.parameters | object | Yes | The synthesis parameters. See the following table. |

payload.parameters:

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| text_type | string | Yes | Fixed string: "PlainText". |
| voice | string | Yes | The voice to use for speech synthesis. Both system voices and cloned voices are supported. |
| format | string | No | The audio encoding format. Supported formats are pcm, wav, mp3 (default), and opus. When the format is opus, you can adjust the bitrate using the bit_rate parameter. |
| sample_rate | integer | No | The audio sample rate in Hz. Default value: 22050. Valid values: 8000, 16000, 22050, 24000, 44100, 48000. The default sample rate is the optimal rate for the current voice and is used by default; downsampling and upsampling are also supported. |
| volume | integer | No | The volume. Default value: 50. Value range: [0, 100]. A value of 50 is the standard volume, 0 is mute, and 100 is the maximum volume; the volume has a linear relationship with this value. |
| rate | float | No | The speech rate. Default value: 1.0. Value range: [0.5, 2.0]. A value of 1.0 is the standard rate; values less than 1.0 slow the speech down, and values greater than 1.0 speed it up. |
| pitch | float | No | The pitch multiplier. Default value: 1.0. Value range: [0.5, 2.0]. A value of 1.0 is the natural pitch of the voice; values greater than 1.0 raise the pitch, and values less than 1.0 lower it. The relationship between this value and the perceived pitch is not strictly linear or logarithmic, so test different values to find the best one. |
| enable_ssml | boolean | No | Specifies whether to enable the SSML feature. If set to true, you must submit the complete text in a single continue-task instruction; multiple submissions are not supported. See SSML support. |
| bit_rate | integer | No | The audio bitrate in kbps. Takes effect only when the audio format is opus. Default value: 32. Value range: [6, 510]. |
| word_timestamp_enabled | boolean | No | Specifies whether to enable character-level timestamps. Default value: false. Available only for cloned voices of the cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 models, and for system voices marked as supported in the voice list. |
| seed | integer | No | The random number seed used during generation, which varies the synthesis result. With the same model version, text, voice, and other parameters, the same seed reproduces the same synthesis result. Default value: 0. Value range: [0, 65535]. |
| language_hints | array[string] | No | Provides language hints for synthesis. Only cosyvoice-v3-flash and cosyvoice-v3-plus support this feature. No default value; the parameter has no effect if it is not set. If the specified language hint clearly does not match the text content, the hint may not take effect. Note: This parameter is an array, but the current version processes only the first element, so pass only one value. |
| instruction | string | No | Sets a synthesis instruction. Available only for cloned voices of the cosyvoice-v3-flash and cosyvoice-v3-plus models, and for system voices marked as supported in the voice list. No default value; the parameter has no effect if it is not set. |
| enable_aigc_tag | boolean | No | Specifies whether to add an invisible AIGC identifier to the generated audio. When set to true, an invisible identifier is embedded into audio in supported formats (WAV, MP3, and Opus). Default value: false. Available only for the cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 models. |
| aigc_propagator | string | No | Sets the propagator information recorded in the AIGC identifier. Default value: your Alibaba Cloud UID. Available only for the cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 models. |
| aigc_propagate_id | string | No | Sets the propagation ID recorded in the AIGC identifier. Default value: the request ID of the current speech synthesis request. Available only for the cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 models. |
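For example, in Python a compliant header.task_id can be generated with the standard uuid module:

import uuid

task_id = uuid.uuid4().hex   # 32 random hexadecimal characters, no hyphens
# str(uuid.uuid4()) also works; hyphens are allowed in task_id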
2. continue-task instruction
This instruction is used exclusively to send the text to be synthesized.
You can send the text to be synthesized all at once in a single continue-task instruction, or you can segment the text and send it sequentially in multiple continue-task instructions.
When to send: After you receive the task-started event.
The interval between sending text fragments cannot exceed 23 seconds. Otherwise, a "request timeout after 23 seconds" exception is triggered.
If there is no more text to send, promptly send a finish-task instruction to end the task.
The server enforces a 23-second timeout. The client cannot modify this configuration.
Example:
{
"header": {
"action": "continue-task",
"task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx", // Random UUID.
"streaming": "duplex"
},
"payload": {
"input": {
"text": "A quiet night thought, I see the moonlight before my bed"
}
}
}

header parameters:
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| header.action | string | Yes | The instruction type. For this instruction, the value is fixed as "continue-task". |
| header.task_id | string | Yes | The ID of the current task. This must be the same as the task_id used in the run-task instruction. |
| header.streaming | string | Yes | Fixed string: "duplex". |
payload parameters:
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| input.text | string | Yes | The text to be synthesized. The text in a single instruction cannot exceed 2,000 characters, and the total text across all continue-task instructions in a task cannot exceed 200,000 characters. See Text length limits. |
3. finish-task instruction: End a task
This instruction ends a speech synthesis task.
You must send this instruction. Otherwise, the synthesized speech may be incomplete.
After this instruction is sent, the server converts the remaining text into speech. After the speech synthesis is complete, the server returns a task-finished event to the client.
When to send: After all continue-task instructions have been sent.
Example:
{
"header": {
"action": "finish-task",
"task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
"streaming": "duplex"
},
"payload": {
"input": {}// The input field cannot be omitted. Otherwise, an error is reported.
}
}header parameters:
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| header.action | string | Yes | The instruction type. For this instruction, the value is fixed as "finish-task". |
| header.task_id | string | Yes | The ID of the current task. This must be the same as the task_id used in the run-task instruction. |
| header.streaming | string | Yes | Fixed string: "duplex". |
payload parameters:
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| payload.input | object | Yes | Fixed format: {}. |
Events (server to client)
Events are messages returned from the server to the client. They are in JSON format and represent different processing stages.
The binary audio returned from the server to the client is not included in any event and must be received separately.
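For example, with the Python websocket-client library, binary frames arrive as bytes and JSON events as str, so the client can dispatch on the message type (handle_event and audio_buffer are names introduced here for illustration):

import json

audio_buffer = bytearray()

def on_message(ws, message):
    if isinstance(message, bytes):
        audio_buffer.extend(message)       # binary frame: audio stream
    else:
        handle_event(json.loads(message))  # text frame: JSON event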
1. task-started event: Task has started
When you receive the task-started event from the server, it indicates that the task has started successfully. You can send a continue-task instruction or a finish-task instruction to the server only after receiving this event. Otherwise, the task will fail.
The task-started event's payload is empty.
Example:
{
"header": {
"task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
"event": "task-started",
"attributes": {}
},
"payload": {}
}

header parameters:
| Parameter | Type | Description |
| --- | --- | --- |
| header.event | string | The event type. For this event, the value is fixed as "task-started". |
| header.task_id | string | The task_id generated by the client. |
2. result-generated event
While the client sends continue-task instructions and a finish-task instruction, the server continuously returns result-generated events.
In the CosyVoice service, the result-generated event is a reserved interface in the protocol. It carries information such as the request ID and the billable character count so far, and in most cases it can be ignored.
Example:
{
"header": {
"task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
"event": "result-generated",
"attributes": {
"request_uuid": "0a9dba9e-d3a6-45a4-be6d-xxxxxxxxxxxx"
}
},
"payload": {}
}

header parameters:
| Parameter | Type | Description |
| --- | --- | --- |
| header.event | string | The event type. For this event, the value is fixed as "result-generated". |
| header.task_id | string | The task_id generated by the client. |
| header.attributes.request_uuid | string | The request ID. |
payload parameters:
| Parameter | Type | Description |
| --- | --- | --- |
| payload.usage.characters | integer | The number of billable characters in the current request so far. Within a single task, this value is cumulative; use the value from the last result-generated event that you receive as the final count. |
3. task-finished event: Task has finished
When you receive the task-finished event from the server, it indicates that the task is finished.
After the task is finished, you can close the WebSocket connection to end the program, or you can reuse the WebSocket connection and send another run-task instruction to start the next task. For more information, see Connection overhead and reuse.
Example:
{
"header": {
"task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
"event": "task-finished",
"attributes": {
"request_uuid": "0a9dba9e-d3a6-45a4-be6d-xxxxxxxxxxxx"
}
},
"payload": {
"output": {
"sentence": {
"words": []
}
},
"usage": {
"characters": 13
}
}
}

header parameters:
| Parameter | Type | Description |
| --- | --- | --- |
| header.event | string | The event type. For this event, the value is fixed as "task-finished". |
| header.task_id | string | The task_id generated by the client. |
| header.attributes.request_uuid | string | The request ID. You can provide this to CosyVoice developers to help locate issues. |
payload parameters:
| Parameter | Type | Description |
| --- | --- | --- |
| payload.usage.characters | integer | The total number of billable characters used in the current request. Within a task, the characters value is cumulative, and the task-finished event carries the final total. |
| payload.output.sentence.index | integer | The number of the sentence, starting from 0. This field and the following fields are returned only when word-level timestamps are enabled using word_timestamp_enabled. |

payload.output.sentence.words[k]:

| Parameter | Type | Description |
| --- | --- | --- |
| text | string | The text of the word. |
| begin_index | integer | The start position index of the word in the sentence, starting from 0. |
| end_index | integer | The end position index of the word in the sentence, starting from 1. |
| begin_time | integer | The start timestamp of the audio corresponding to the word, in milliseconds. |
| end_time | integer | The end timestamp of the audio corresponding to the word, in milliseconds. |
After you enable word-level timestamps using word_timestamp_enabled, timestamp information is returned in the task-finished event. The following is an illustrative example with hypothetical values:
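{
  "payload": {
    "output": {
      "sentence": {
        "index": 0,
        "words": [
          { "text": "你", "begin_index": 0, "end_index": 1, "begin_time": 0, "end_time": 180 },
          { "text": "好", "begin_index": 1, "end_index": 2, "begin_time": 180, "end_time": 360 }
        ]
      }
    },
    "usage": { "characters": 4 }
  }
}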
4. task-failed event: Task has failed
If you receive a task-failed event, it indicates that the task has failed. At this point, close the WebSocket connection and handle the error. You can analyze the error message to identify and fix programming issues in your code.
Example:
{
"header": {
"task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
"event": "task-failed",
"error_code": "InvalidParameter",
"error_message": "[tts:]Engine return error code: 418",
"attributes": {}
},
"payload": {}
}

header parameters:
| Parameter | Type | Description |
| --- | --- | --- |
| header.event | string | The event type. For this event, the value is fixed as "task-failed". |
| header.task_id | string | The task_id generated by the client. |
| header.error_code | string | A description of the error type. |
| header.error_message | string | The specific reason for the error. |
Connection overhead and reuse
The WebSocket service supports connection reuse to improve resource utilization and avoid connection establishment overhead.
After the server receives a run-task instruction from the client, it starts a new task. After the client sends a finish-task instruction, the server returns a task-finished event to end the task when it is complete. After a task is finished, the WebSocket connection can be reused. The client can send another run-task instruction to start the next task.
Different tasks in a reused connection must use different task_ids.
If a task fails during execution, the service returns a task-failed event and closes the connection. The failed connection cannot be reused.
If there are no new tasks for 60 seconds after a task is finished, the connection will time out and be automatically disconnected.
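As a sketch, reusing one connection for several tasks might look like the following (run_task_message, send_text_and_finish, and wait_for_task_finished are hypothetical helpers; see the sample code below for a complete client):

import json
import uuid

for text in ["First paragraph.", "Second paragraph."]:
    task_id = uuid.uuid4().hex  # each task on the reused connection needs a new task_id
    ws.send(json.dumps(run_task_message(task_id)))
    send_text_and_finish(ws, task_id, text)   # continue-task instructions + finish-task
    wait_for_task_finished(ws)                # the connection stays open between tasks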
Sample code
The sample code provides a basic implementation to demonstrate how the service works. You must adapt the code for your specific business scenarios.
When you write a WebSocket client, asynchronous programming is typically used to send and receive messages simultaneously. You can structure your program around the interaction flow described above: establish the WebSocket connection, send the run-task instruction, wait for the task-started event, send one or more continue-task instructions followed by a finish-task instruction, receive JSON events and binary audio until the task-finished event arrives, and then close the connection.
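The following is a minimal, illustrative Python client built on the third-party websocket-client library (pip install websocket-client). It assumes that the DASHSCOPE_API_KEY environment variable is set; the model and voice values are examples, and error handling is omitted for brevity.

import json
import os
import uuid

import websocket  # third-party package: websocket-client

URL = "wss://dashscope.aliyuncs.com/api-ws/v1/inference"
TASK_ID = str(uuid.uuid4())
audio = bytearray()

def on_open(ws):
    # Step 1: start the task as soon as the connection is established.
    ws.send(json.dumps({
        "header": {"action": "run-task", "task_id": TASK_ID, "streaming": "duplex"},
        "payload": {
            "task_group": "audio", "task": "tts", "function": "SpeechSynthesizer",
            "model": "cosyvoice-v2",  # example model
            "parameters": {"text_type": "PlainText", "voice": "longxiaochun_v2", "format": "mp3"},
            "input": {},  # required even though it is empty
        },
    }))

def on_message(ws, message):
    if isinstance(message, bytes):
        audio.extend(message)  # binary frames carry the audio stream
        return
    event = json.loads(message)["header"]["event"]
    if event == "task-started":
        # Step 2: send text only after task-started, then end the task.
        ws.send(json.dumps({
            "header": {"action": "continue-task", "task_id": TASK_ID, "streaming": "duplex"},
            "payload": {"input": {"text": "Hello, CosyVoice."}},
        }))
        ws.send(json.dumps({
            "header": {"action": "finish-task", "task_id": TASK_ID, "streaming": "duplex"},
            "payload": {"input": {}},
        }))
    elif event in ("task-finished", "task-failed"):
        # Step 3: the task is over; close the connection (or reuse it for a new task).
        ws.close()

ws = websocket.WebSocketApp(
    URL,
    header={"Authorization": "bearer " + os.environ["DASHSCOPE_API_KEY"]},
    on_open=on_open,
    on_message=on_message,
)
ws.run_forever()

with open("result.mp3", "wb") as f:
    f.write(bytes(audio))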
Error codes
For troubleshooting information, see Error messages.
FAQ
Features, billing, and limits
Q: What can I do to fix inaccurate pronunciation?
You can use SSML to customize the speech synthesis output.
Q: Why use the WebSocket protocol instead of the HTTP/HTTPS protocol? Why not provide a RESTful API?
Voice Service uses WebSocket instead of HTTP, HTTPS, or RESTful because it requires full-duplex communication. WebSocket allows the server and client to actively exchange data in both directions, such as pushing real-time speech synthesis or recognition progress. In contrast, HTTP-based RESTful APIs only support a one-way, client-initiated request-response model, which is unsuitable for real-time interaction.
Q: Speech synthesis is billed based on the number of characters. How can I find the character count for each synthesis task?
You can obtain the character count from the payload.usage.characters parameter of the result-generated event returned by the server. Use the value from the last result-generated event that you receive.
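For example, a small Python tracker (a sketch; event is a JSON event already parsed into a dict):

last_billable_characters = None

def track_usage(event: dict):
    # Remember the latest cumulative count; the last result-generated event holds the final value.
    global last_billable_characters
    if event.get("header", {}).get("event") == "result-generated":
        usage = event.get("payload", {}).get("usage") or {}
        if "characters" in usage:
            last_billable_characters = usage["characters"]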
Troubleshooting
If an error occurs in your code, check whether the instruction sent to the server is correct. You can print the content of the instruction to check for formatting errors or missing required parameters. If the instruction is correct, troubleshoot the issue based on the information in the error code.
Q: How do I get the Request ID?
You can obtain the Request ID in two ways:
Parse the result-generated event returned by the server; the request ID is in header.attributes.request_uuid.
Parse the task-finished event returned by the server; the request ID is in header.attributes.request_uuid.
Q: Why is the SSML feature failing?
Follow these steps to troubleshoot:
Ensure that the scope of application is correct.
Ensure that you call the service correctly. For more information, see SSML support.
Ensure that the text to be synthesized is in plain text format and meets the formatting requirements. For more information, see Speech Synthesis Markup Language.
Q: Why can't the audio be played?
Troubleshoot this issue based on the following scenarios:
The audio is saved as a complete file, such as an .mp3 file.
Audio format consistency: Ensure that the audio format specified in the request parameters matches the file extension. For example, playback might fail if the audio format is set to WAV in the request parameters but the file has an .mp3 extension.
Player compatibility: Confirm that your player supports the format and sample rate of the audio file. For example, some players might not support high sample rates or specific audio encodings.
The audio is played in streaming mode.
Save the audio stream as a complete file and try to play it. If the file fails to play, see the troubleshooting steps for the first scenario.
If the file plays correctly, the issue might be with the streaming playback implementation. Confirm that your player supports streaming playback.
Common tools and libraries that support streaming playback include FFmpeg, pyaudio (Python), AudioFormat (Java), and MediaSource (JavaScript).
Q: Why does the audio playback stutter?
Troubleshoot this issue based on the following scenarios:
Check the text sending speed: Ensure that the text sending interval is reasonable. Do not wait for the audio of the previous segment to finish playing before you send the next text segment.
Check the callback function performance:
Check whether the callback function contains excessive business logic that could cause it to block.
The callback function runs in the WebSocket thread. If this thread is blocked, it can interfere with the WebSocket's ability to receive network packets, resulting in audio stuttering.
To avoid blocking the WebSocket thread, write the audio data to a separate audio buffer and use another thread to read and process it, as shown in the sketch after this list.
Check network stability: Ensure that your network connection is stable to prevent audio transmission interruptions or delays caused by network fluctuations.
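A minimal Python sketch of this producer-consumer pattern (play_chunk is a hypothetical playback call, for example one backed by pyaudio):

import queue
import threading

audio_q: "queue.Queue[bytes]" = queue.Queue()

def on_binary_audio(data: bytes):
    # Runs in the WebSocket thread: only enqueue, never block.
    audio_q.put(data)

def playback_worker():
    while True:
        chunk = audio_q.get()
        if chunk is None:  # sentinel marking the end of the stream
            break
        play_chunk(chunk)  # hypothetical playback call

threading.Thread(target=playback_worker, daemon=True).start()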
Q: Why is speech synthesis slow (long synthesis time)?
Perform the following troubleshooting steps:
Check the input interval
If you are using streaming speech synthesis, check whether the text sending interval is too long. For example, a delay of several seconds before sending the next segment will increase the total synthesis time.
Analyze performance metrics
First packet delay: This is typically around 500 ms.
Real-Time Factor (RTF): This is calculated as Total Synthesis Time / Audio Duration. The RTF is normally less than 1.0.
Q: How do I handle incorrect pronunciation in the synthesized speech?
Use the <phoneme> tag of SSML to specify the correct pronunciation.
Q: Why is some audio missing? Why is the end of my text not synthesized into speech?
Ensure that you send the finish-task instruction. During the speech synthesis process, the server starts synthesis only after it caches a sufficient amount of text. If you forget to send the finish-task instruction, the last part of the text in the cache may not be synthesized into speech.
Q: Why are the returned audio stream segments out of order, causing jumbled playback?
Check the following two points:
Ensure that the run-task, continue-task, and finish-task instructions for the same synthesis task use the same task_id.
Check whether your asynchronous operations write audio data in a different order than it was received.
Permissions and authentication
Q: I want my API key to be used only for the CosyVoice speech synthesis service, not for other Model Studio models (permission isolation). What should I do?
You can create a workspace and authorize only specific models to limit the scope of the API key. For more information, see Manage workspaces.
More questions
For more information, see the Q&A on GitHub.