This topic explains how to access the CosyVoice speech synthesis service using a WebSocket connection.
The DashScope SDK supports only Java and Python. To build a CosyVoice speech synthesis application in another programming language, use a WebSocket connection to communicate with the service.
User guide: For model overviews and selection recommendations, see Real-time Speech Synthesis—CosyVoice/Sambert.
WebSocket is a network protocol that supports full-duplex communication. The client and server establish a persistent connection with a single handshake, which allows both parties to actively push data to each other. This provides significant advantages in real-time performance and efficiency.
For common programming languages, many ready-to-use WebSocket libraries and examples are available, such as:
Go: gorilla/websocket
PHP: Ratchet
Node.js: ws
Familiarize yourself with the basic principles and technical details of WebSocket before you begin development.
CosyVoice models support only WebSocket connections and do not support HTTP REST APIs. If you call the service using an HTTP request (such as POST), it returns an InvalidParameter or URL error.
Prerequisites
You have obtained an API key.
Models and pricing
For more information, see Real-time Speech Synthesis—CosyVoice/Sambert.
Text limits and format requirements for speech synthesis
Text length limits
When you send text to synthesize using the continue-task instruction, the text must be no longer than 20,000 characters. The total length of text sent across multiple calls to the continue-task instruction must be no longer than 200,000 characters.
Character counting rules
Each Chinese character (including simplified, traditional, Japanese kanji, and Korean hanja) counts as two characters. All other characters—including punctuation, letters, digits, Japanese kana, and Korean Hangul—count as one character each.
SSML tag content is excluded from character counts.
Examples:
"你好"→ 你 (2) + 好 (2) = 4 characters"中A文123"→ 2 (Chinese characters) + 1 (A) + 2 (Chinese characters) + 1 (1) + 1 (2) + 1 (3) = 8 characters"中文。"→ 2 (中) + 2 (文) + 1 (。) = 5 characters"中 文。"→ 2 (for "中") + 1 (for the space) + 2 (for "文") + 1 (for "。") = 6 characters"<speak>你好</speak>"→ 4 characters (SSML tags are not counted, 2 Chinese characters × 2)
Encoding format
Use UTF-8 encoding.
Mathematical expression support
The math expression parsing feature works only with the cosyvoice-v2, cosyvoice-v3-flash, and cosyvoice-v3-plus models. It supports common math expressions used in primary and secondary education, including basic arithmetic, algebra, and geometry.
For more information, see Convert LaTeX formulas to speech.
SSML markup language support
To use SSML, meet all the following conditions:
Model support: Only the cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 models support SSML.
Voice support: You must use a voice that supports SSML. Voices that support SSML include the following:
All cloned voices (custom voices created using the Voice Cloning API)
System voices marked as supporting SSML in the voice list
Note: If you use a system voice that does not support SSML (such as some basic voices), you get the error "SSML text is not supported at the moment!" even if you set the enable_ssml parameter to true.
Parameter settings: In the run-task instruction, set the enable_ssml parameter to true.
After meeting these conditions, send the text that includes SSML using the continue-task instruction. For a complete example, see QuickStart.
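The following fragments, written as Python dicts, show where these settings belong. They are a hedged sketch, not a complete request; the voice ID is a placeholder for a cloned voice of your own. enable_ssml goes into the run-task parameters, and the SSML-marked text goes into the continue-task input:
# In the run-task instruction (payload.parameters); the voice ID is a placeholder.
run_task_parameters = {
    "text_type": "PlainText",
    "voice": "your-cloned-voice-id",  # must be a voice that supports SSML
    "enable_ssml": True,
}

# In a continue-task instruction (payload.input); SSML tags are not billed.
continue_task_input = {
    "text": "<speak>你好</speak>",
}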
Interaction flow
A message sent from the client to the server is called an instruction. Messages returned from the server to the client fall into two categories: JSON-formatted events and binary audio streams.
The interaction flow between the client and server, in chronological order, is as follows:
Establish a connection: The client establishes a WebSocket connection with the server.
Start a task: The client sends the run-task instruction to start a task.
Wait for confirmation: The client receives the task-started event from the server. This event signals that the task started successfully and that you can proceed to the next step.
Send text to synthesize:
The client sends one or more continue-task instructions, each containing text to synthesize, to the server in sequence. After receiving a complete sentence, the server returns a result-generated event and an audio stream. (Text length constraints apply. For details, see the description of the text field in the continue-task instruction.)
Note: You can send the continue-task instruction multiple times to submit text fragments in sequence. After receiving text fragments, the server automatically splits them into sentences:
Complete sentences are synthesized immediately. At this point, the client receives audio from the server.
Incomplete sentences are cached until they become complete. The server does not return audio for incomplete sentences.
When you send the finish-task instruction, the server synthesizes all cached content.
Receive audio: Receive the audio stream over the binary channel.
Notify the server to end the task: After sending all text, the client sends the finish-task instruction to notify the server to end the task. Continue receiving audio from the server. (Do not skip this step. Otherwise, you might not receive audio or might miss the final part of the audio.)
End the task:
The client receives the task-finished event from the server. This event signals that the task ended.
Close the connection: The client closes the WebSocket connection.
To improve resource utilization, reuse a WebSocket connection to handle multiple tasks instead of creating a new connection for each task. See Connection overhead and connection reuse.
The task_id must remain consistent throughout: For a single speech synthesis task, the run-task, all continue-task, and finish-task instructions must use the same task_id.
Consequences of errors: Using different task_ids causes the following issues:
The server cannot associate the requests, so the audio stream order becomes scrambled.
Text content is incorrectly assigned to different tasks, resulting in misaligned speech content.
Task status becomes abnormal, possibly preventing receipt of the task-finished event.
Billing fails, leading to inaccurate usage statistics.
Correct approach:
Generate a unique task_id (for example, using UUID) when sending the run-task instruction.
Store the task_id in a variable.
Use this task_id for all subsequent continue-task and finish-task instructions.
After the task ends (after receiving task-finished), generate a new task_id for a new task.
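A minimal Python sketch of this approach follows. It only builds the instruction payloads in the formats shown later in this topic, to illustrate reusing one task_id across instructions; the model and voice names are taken from the examples in this topic:
import json
import uuid

# One task_id per synthesis task, reused by every instruction of that task.
task_id = str(uuid.uuid4())

def build_instruction(action: str, payload: dict) -> str:
    """Wrap an action and payload in the header/payload envelope used by this API."""
    return json.dumps({
        "header": {
            "action": action,
            "task_id": task_id,  # same value for run-task, continue-task, finish-task
            "streaming": "duplex",
        },
        "payload": payload,
    })

run_task = build_instruction("run-task", {
    "task_group": "audio",
    "task": "tts",
    "function": "SpeechSynthesizer",
    "model": "cosyvoice-v3-flash",
    "parameters": {"text_type": "PlainText", "voice": "longanyang", "format": "mp3"},
    "input": {},  # required; keep it empty in run-task
})
continue_task = build_instruction("continue-task", {"input": {"text": "Hello."}})
finish_task = build_instruction("finish-task", {"input": {}})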
Client implementation considerations
When implementing a WebSocket client—especially on Flutter, web, or mobile platforms—you must clearly define responsibilities between the server and client to ensure the integrity and stability of speech synthesis tasks.
Server and client responsibilities
Server responsibilities
The server guarantees that it returns a complete audio stream in order. You do not need to worry about the order or completeness of audio data. The server generates and pushes all audio chunks in the order of the input text.
Client responsibilities
The client must handle the following key tasks:
Read and concatenate all audio chunks
The server pushes audio as multiple binary frames. The client must receive all frames completely and concatenate them in the order received to form the final audio file. Example code follows:
# Python example: Concatenate audio chunks
with open("output.mp3", "ab") as f:  # Append mode
    f.write(audio_chunk)  # audio_chunk is each received binary audio chunk

// JavaScript example: Concatenate audio chunks
const audioChunks = [];
ws.onmessage = (event) => {
  if (event.data instanceof Blob) {
    audioChunks.push(event.data);  // Collect all audio chunks
  }
};
// Merge audio after task completes
const audioBlob = new Blob(audioChunks, { type: 'audio/mp3' });
Maintain a complete WebSocket lifecycle
During the entire speech synthesis task—from sending the run-task instruction to receiving the task-finished event—do not disconnect the WebSocket connection prematurely. Common mistakes include the following:
Closing the connection before all audio chunks are received, resulting in incomplete audio.
Forgetting to send the finish-task instruction, leaving text in the server cache unprocessed.
Failing to handle WebSocket keepalive properly when the page navigates away or the app moves to the background.
Important: Mobile apps (such as Flutter, iOS, and Android) require special attention to network connection management when entering the background. We recommend maintaining the WebSocket connection in a background task or service, or checking the task status and reestablishing the connection when returning to the foreground.
Text integrity in ASR→LLM→TTS workflows
In ASR→LLM→TTS workflows, ensure the text passed to TTS is complete and not truncated mid-process. For example:
Wait for the LLM to generate a complete sentence or paragraph before sending the continue-task instruction, rather than streaming character by character.
If you need streaming synthesis (generate and play simultaneously), send text in batches based on natural sentence boundaries (such as periods or question marks); a minimal sketch follows this list.
After the LLM finishes generating output, send the finish-task instruction to avoid missing trailing content.
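The following Python sketch illustrates this batching approach. The splitting rule is deliberately naive (it splits on common sentence-ending punctuation and would mis-handle text such as decimal numbers), and send_continue_task / send_finish_task are hypothetical helpers standing in for your own code that sends the corresponding instructions over the WebSocket:
import re

# Naive sentence-boundary pattern: Chinese and Western terminators.
SENTENCE_END = re.compile(r"(.+?[。！？.!?])", re.S)

buffer = ""

def on_llm_delta(delta: str, send_continue_task) -> None:
    """Accumulate streamed LLM output and forward only complete sentences."""
    global buffer
    buffer += delta
    while True:
        match = SENTENCE_END.match(buffer)
        if not match:
            break
        send_continue_task(match.group(1))  # complete sentence -> continue-task
        buffer = buffer[match.end():]

def on_llm_done(send_continue_task, send_finish_task) -> None:
    """Flush trailing text, then end the task so cached text is synthesized."""
    global buffer
    if buffer.strip():
        send_continue_task(buffer)
        buffer = ""
    send_finish_task()  # do not skip this, or trailing audio is lost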
Platform-specific tips
Flutter: When using the web_socket_channel package, close the connection correctly in the dispose method to prevent memory leaks. Also, handle app lifecycle events (such as AppLifecycleState.paused) to manage background transitions.
Web (browser): Some browsers limit the number of WebSocket connections. Reuse a single connection for multiple tasks. Use the beforeunload event to close the connection explicitly before the page closes, avoiding lingering connections.
Mobile (iOS/Android native): When the app enters the background, the OS may pause or terminate network connections. Use a background task or foreground service to keep the WebSocket active, or reinitialize the task when returning to the foreground.
URL
The WebSocket URL is fixed as follows:
International
In international deployment mode, both the endpoint and data storage are located in the Singapore region. Model inference computing resources are dynamically scheduled globally (excluding the China mainland).
WebSocket URL: wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference
China mainland
In China mainland deployment mode, both the endpoint and data storage are located in the Beijing region. Model inference computing resources are available only in the China mainland.
WebSocket URL: wss://dashscope.aliyuncs.com/api-ws/v1/inference
Common URL configuration errors:
Error: Using a URL that starts with http:// or https:// → Correct: You must use the wss:// protocol.
Error: Placing the Authorization parameter in the URL query string (for example, ?Authorization=bearer <your_api_key>) → Correct: Set Authorization in the HTTP handshake headers (see Headers).
Error: Adding the model name or other path parameters to the end of the URL → Correct: The URL remains fixed. Specify the model using the payload.model parameter in the run-task instruction.
Headers
Add the following information to the request header:
Parameter | Type | Required | Description |
Authorization | string | Yes | Authentication token in the format bearer <your_api_key>. |
user-agent | string | No | Client identifier to help the server track the source. |
X-DashScope-WorkSpace | string | No | Alibaba Cloud Model Studio workspace ID. |
X-DashScope-DataInspection | string | No | Whether to enable data compliance inspection. Default is not to pass this parameter or set it to |
Timing and common errors for authentication validation
Authentication validation occurs during the WebSocket handshake, not when you send the run-task instruction. If the Authorization header is missing or the API key is invalid, the server rejects the handshake and returns HTTP 401 or 403. Most client libraries report this as a WebSocketBadStatus exception.
Troubleshooting authentication failures
If the WebSocket connection fails, troubleshoot using the following steps:
Check the API key format: Confirm the Authorization header is formatted as bearer <your_api_key>, with a space between bearer and the API key.
Verify the API key validity: In the Model Studio console, confirm the API key is not deleted or disabled and has permission to call CosyVoice models.
Check header settings: Confirm the Authorization header is set correctly during the WebSocket handshake. Different programming languages set headers differently:
Python (websockets library): extra_headers={"Authorization": f"bearer {api_key}"}
JavaScript: The standard WebSocket API does not support custom headers. Use a server-side proxy or another library (such as ws).
Go (gorilla/websocket): header.Add("Authorization", fmt.Sprintf("bearer %s", apiKey))
Test network connectivity: Use curl or Postman to test whether the API key is valid (using another DashScope API that supports HTTP).
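The following minimal Python sketch shows where the Authorization header is set during the handshake. It assumes the API key is stored in the DASHSCOPE_API_KEY environment variable and uses the websockets package (older releases accept extra_headers, as in the example above; newer releases rename it to additional_headers):
import asyncio
import os
import websockets  # pip install websockets

API_KEY = os.environ["DASHSCOPE_API_KEY"]
URL = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference"  # or the China mainland URL

async def connect():
    # Authentication happens during this handshake; a missing or invalid key
    # is rejected here (HTTP 401/403) before any instruction is sent.
    async with websockets.connect(
        URL,
        extra_headers={"Authorization": f"bearer {API_KEY}"},
    ) as ws:
        print("Handshake succeeded; the connection is ready for run-task.")

asyncio.run(connect())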
Using WebSocket in browser environments
When using WebSocket in browser environments (such as Vue3 or React), note the following limitation: the native browser new WebSocket(url) API does not support setting custom request headers (such as Authorization) during the handshake. This is a security restriction imposed by browsers, so you cannot authenticate with an API key directly in frontend code.
Solution: Use a backend proxy
Set up a WebSocket connection from your backend service (Node.js, Java, Python, etc.) to the CosyVoice service. Your backend can set the Authorization header correctly.
Have the frontend connect via WebSocket to your backend service. Your backend acts as a proxy, forwarding messages to CosyVoice.
Benefits: Your API key stays hidden from the frontend, improving security. You can add extra business logic (such as authentication, logging, or rate limiting) in your backend.
Do not hardcode your API key in frontend code or send it directly from the browser. Leaking your API key could lead to account compromise, unexpected charges, or data breaches.
Example code:
If you need an implementation in another programming language, adapt the logic shown in the examples, or use AI tools to convert the examples to your target language.
Frontend (native web) + Backend (Node.js Express): cosyvoiceNodeJs_en.zip
Frontend (native web) + Backend (Python Flask): cosyvoiceFlask_en.zip
Instructions (client → server)
Instructions are JSON-formatted messages sent from the client to the server. They use Text Frame format and control task start, stop, and boundaries.
Send instructions in strict chronological order. Otherwise, the task may fail:
Send the run-task instruction
Starts a speech synthesis task.
The task_id specified here must be used in subsequent continue-task and finish-task instructions. Keep it consistent.
Send the continue-task instruction
Sends text to synthesize.
You can send this instruction only after receiving the task-started event from the server.
Send the finish-task instruction
Ends the speech synthesis task.
Send this instruction after all continue-task instructions are sent.
1. run-task instruction: Start a task
This instruction starts a speech synthesis task. You can configure request parameters such as voice and sample rate in this instruction.
Timing: Send after establishing the WebSocket connection.
Do not send text to synthesize: Sending text in the run-task instruction makes troubleshooting difficult. Avoid sending text here. Send text using the continue-task instruction.
The input field is required: The payload must contain the input field (formatted as {}). Omitting it triggers the error "task can not be null".
Example:
{
"header": {
"action": "run-task",
"task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx", // Random UUID
"streaming": "duplex"
},
"payload": {
"task_group": "audio",
"task": "tts",
"function": "SpeechSynthesizer",
"model": "cosyvoice-v3-flash",
"parameters": {
"text_type": "PlainText",
"voice": "longanyang", // Voice
"format": "mp3", // Audio format
"sample_rate": 22050, // Sample rate
"volume": 50, // Volume
"rate": 1, // Speech rate
"pitch": 1 // Pitch
},
"input": {// input cannot be omitted, or else an error occurs
}
}
}
header parameter description:
Parameter | Type | Required | Description |
header.action | string | Yes | Instruction type. For this instruction, it is always "run-task". |
header.task_id | string | Yes | ID for this task. A 32-character universally unique identifier (UUID), composed of 32 randomly generated letters and digits. It can include hyphens (for example, 2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx). When sending subsequent continue-task and finish-task instructions, use the same task_id as in the run-task instruction. |
header.streaming | string | Yes | Fixed string: "duplex" |
payload parameter description:
Parameter | Type | Required | Description |
payload.task_group | string | Yes | Fixed string: "audio". |
payload.task | string | Yes | Fixed string: "tts". |
payload.function | string | Yes | Fixed string: "SpeechSynthesizer". |
payload.model | string | Yes | Speech synthesis model. Different model versions require corresponding voice versions:
|
payload.input | object | Yes | The input field is required in the run-task instruction (cannot be omitted), but do not send text to synthesize here; use an empty object {}. Important: A common error is omitting the input field or including unexpected fields (such as mode or content), which causes the server to reject the request and return "InvalidParameter: task can not be null" or close the connection (WebSocket code 1007). |
payload.parameters | |||
text_type | string | Yes | Fixed string: "PlainText". |
voice | string | Yes | Voice used for speech synthesis. Supported voices include system voices and cloned voices:
|
format | string | No | Audio coding format. Supported formats: pcm, wav, mp3 (default), and opus. When the audio format is opus, use the bit_rate parameter to adjust the bitrate. |
sample_rate | integer | No | Audio sampling rate (unit: Hz). Default: 22050. Valid values: 8000, 16000, 22050, 24000, 44100, 48000. Note The default sample rate represents the optimal sample rate for the selected voice. Output uses this rate by default, but downsampling and upsampling are supported. |
volume | integer | No | Volume. Default: 50. Range: [0, 100]. 50 is normal volume. Volume scales linearly with this value: 0 is mute, 100 is maximum. |
rate | float | No | Speech rate. Default: 1.0. Range: [0.5, 2.0]. 1.0 is normal speed. Values less than 1.0 slow speech; values greater than 1.0 speed it up. |
pitch | float | No | Pitch. This value multiplies pitch, but perceived pitch change isn't strictly linear or logarithmic. Test to find suitable values. Default: 1.0. Range: [0.5, 2.0]. 1.0 is natural pitch. Values above 1.0 raise pitch; values below 1.0 lower it. |
enable_ssml | boolean | No | Whether to enable SSML. When set to true, the text to synthesize can include SSML markup (see SSML markup language support). |
bit_rate | int | No | Audio bitrate (kbps). For the OPUS format, adjust the bitrate using this parameter. Default: 32. Range: [6, 510]. |
word_timestamp_enabled | boolean | No | Enable word-level timestamps. Default: false.
This feature works only with cloned voices for cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2, and with system voices marked as timestamp-supported in the voice list. For more information, see Best practices for timestamp data extraction. |
seed | int | No | Random number seed used during synthesis to vary output. With identical model version, text, voice, and other parameters, the same seed produces identical results. Default: 0. Range: [0, 65535]. |
language_hints | array[string] | No | Specify the target language for speech synthesis to improve synthesis quality. Use this parameter when the pronunciation of numbers, abbreviations, symbols, or the synthesis quality of minor languages does not meet expectations, such as:
Value range:
Note: This parameter is an array, but the current version only processes the first element. Therefore, pass only one value. Important This parameter specifies the target language for speech synthesis. This setting is unrelated to the language of the sample audio during voice cloning. To set the source language for a cloning task, see the CosyVoice Voice Cloning API. |
instruction | string | No | Set instructions to control synthesis effects such as dialect, emotion, or role. This feature applies only to cloned voices of the cosyvoice-v3-flash models, and to system voices marked as supporting Instruct in the Voice List. Requirements:
Supported features:
|
enable_aigc_tag | boolean | No | Adds an invisible AIGC identifier to the generated audio. If set to true, the invisible identifier is embedded in audio files of supported formats (WAV/MP3/Opus). Default value: false. Only cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 support this feature. |
aigc_propagator | string | No | Set the Default value: Your Alibaba Cloud UID. Only cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 support this feature. |
aigc_propagate_id | string | No | Set the Default value: The Request ID of the current speech synthesis request. This feature is supported only by cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2. |
2. continue-task instruction: Send text to synthesize
This instruction sends text to synthesize.
You can send all text in one continue-task instruction, or split the text and send it in multiple continue-task instructions in sequence.
Timing: Send after receiving the task-started event.
Do not wait longer than 23 seconds between sending text fragments. Otherwise, the "request timeout after 23 seconds" error occurs.
If no more text remains to send, send the finish-task instruction to end the task.
The server enforces a 23-second timeout. Clients cannot modify this setting.
Example:
{
"header": {
"action": "continue-task",
"task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx", // Random UUID
"streaming": "duplex"
},
"payload": {
"input": {
"text": "Before my bed, moonlight gleams, like frost upon the ground."
}
}
}
header parameter description:
Parameter | Type | Required | Description |
header.action | string | Yes | Instruction type. For this instruction, it is always "continue-task". |
header.task_id | string | Yes | ID for this task. Must match the task_id used in the run-task instruction. |
header.streaming | string | Yes | Fixed string: "duplex" |
payload parameter description:
Parameter | Type | Required | Description |
input.text | string | Yes | Text to synthesize. |
3. finish-task instruction: End a task
This instruction ends a speech synthesis task.
Make sure to send this instruction. Otherwise, you may encounter the following issues:
Incomplete audio: The server will not force-synthesize incomplete sentences held in its cache, causing missing audio at the end.
Connection timeout: If you do not send finish-task within 23 seconds after the last continue-task instruction, the connection times out and closes.
Billing anomalies: Tasks that do not end normally may not return accurate usage information.
Timing: Send immediately after sending all continue-task instructions. Do not wait for audio to finish returning or delay sending. Otherwise, the timeout may trigger.
Example:
{
"header": {
"action": "finish-task",
"task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
"streaming": "duplex"
},
"payload": {
"input": {}// input cannot be omitted, or else an error occurs
}
}
header parameter description:
Parameter | Type | Required | Description |
header.action | string | Yes | Instruction type. For this instruction, it is always "finish-task". |
header.task_id | string | Yes | ID for this task. Must match the task_id used in the run-task instruction. |
header.streaming | string | Yes | Fixed string: "duplex" |
payload parameter description:
Parameter | Type | Required | Description |
payload.input | object | Yes | Fixed format: {}. |
Events (server → client)
Events are JSON-formatted messages returned from the server to the client. Each event represents a different processing stage.
The server returns binary audio separately. It is not included in any event.
1. task-started event: Task started
When you receive the task-started event from the server, the task has started successfully. You can send continue-task or finish-task instructions to the server only after receiving this event. Otherwise, the task fails.
The task-started event's payload is empty.
Example:
{
"header": {
"task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
"event": "task-started",
"attributes": {}
},
"payload": {}
}
header parameter description:
Parameter | Type | Description |
header.event | string | Event type. For this event, it is always "task-started". |
header.task_id | string | task_id generated by the client |
2. result-generated event
While the client sends continue-task and finish-task instructions, the server returns result-generated events continuously.
To link audio data to its corresponding text, the server returns sentence metadata along with audio data in the result-generated event. The server automatically splits input text into sentences. Each sentence's synthesis process includes three sub-events:
sentence-begin: Marks the start of a sentence and returns the text to synthesize.
sentence-synthesis: Marks an audio data chunk. Each event is followed immediately by an audio data frame over the WebSocket binary channel.
Multiple sentence-synthesis events occur per sentence, each corresponding to one audio data chunk. The client must receive these audio data chunks in order and append them to the same file.
Each sentence-synthesis event corresponds one-to-one with the audio data frame that follows it. No misalignment occurs.
sentence-end: Marks the end of a sentence and returns the sentence text and cumulative billed character count.
Use the payload.output.type field to distinguish sub-event types.
Example:
sentence-begin
{
"header": {
"task_id": "3f2d5c86-0550-45c0-801f-xxxxxxxxxx",
"event": "result-generated",
"attributes": {}
},
"payload": {
"output": {
"sentence": {
"index": 0,
"words": []
},
"type": "sentence-begin",
"original_text": "Before my bed, moonlight gleams,"
}
}
}
sentence-synthesis
{
"header": {
"task_id": "3f2d5c86-0550-45c0-801f-xxxxxxxxxx",
"event": "result-generated",
"attributes": {}
},
"payload": {
"output": {
"sentence": {
"index": 0,
"words": []
},
"type": "sentence-synthesis"
}
}
}
sentence-end
{
"header": {
"task_id": "3f2d5c86-0550-45c0-801f-xxxxxxxxxx",
"event": "result-generated",
"attributes": {}
},
"payload": {
"output": {
"sentence": {
"index": 0,
"words": []
},
"type": "sentence-end",
"original_text": "Before my bed, moonlight gleams,"
},
"usage": {
"characters": 11
}
}
}
header parameter description:
Parameter | Type | Description |
header.event | string | Event type. For this event, it is always "result-generated". |
header.task_id | string | task_id generated by the client. |
header.attributes | object | Additional attributes, usually an empty object. |
payload parameter description:
Parameter | Type | Description |
payload.output.type | string | Sub-event type. Valid values: sentence-begin, sentence-synthesis, and sentence-end.
Full event flow: For each sentence to synthesize, the server returns events in the following order: sentence-begin → one or more sentence-synthesis (each followed by a binary audio frame) → sentence-end. |
payload.output.sentence.index | integer | Sentence number, starting from 0. |
payload.output.sentence.words | array | Character-level information array. Typically empty. |
payload.output.original_text | string | Sentence content after splitting the user's input text. The last sentence may not include this field. |
payload.usage.characters | integer | This value represents the number of billable characters for the current request. In a task, the value accumulates across result-generated events; use the value from the last event as the total. |
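A minimal sketch of reading this field follows, assuming msg is the parsed JSON of a result-generated event. Because the value is cumulative, keeping only the latest value yields the total for the task, as also noted in the FAQ:
# Keep only the most recent cumulative value from result-generated events.
latest_characters = 0

def on_result_generated(msg: dict) -> None:
    global latest_characters
    usage = msg.get("payload", {}).get("usage") or {}
    if "characters" in usage:
        latest_characters = usage["characters"]  # last value = total billable characters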
3. task-finished event: Task finished
When you receive the task-finished event from the server, the task has ended.
After ending the task, you can close the WebSocket connection and exit the program. Or you can reuse the WebSocket connection to start a new task by sending another run-task instruction (see Connection overhead and connection reuse).
Example:
{
"header": {
"task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
"event": "task-finished",
"attributes": {
"request_uuid": "0a9dba9e-d3a6-45a4-be6d-xxxxxxxxxxxx"
}
},
"payload": {
"output": {
"sentence": {
"words": []
}
},
"usage": {
"characters": 13
}
}
}
header parameter description:
Parameter | Type | Description |
header.event | string | Event type. For this event, it is always "task-finished". |
header.task_id | string | task_id generated by the client. |
header.attributes.request_uuid | string | Request ID. Provide this to CosyVoice developers to help diagnose issues. |
payload parameter description:
Parameter | Type | Description |
payload.usage.characters | integer | The number of billable characters in the current request so far. In a single task, this value accumulates across events. |
payload.output.sentence.index | integer | Sentence number, starting from 0. This field and the following fields require enabling word-level timestamps using word_timestamp_enabled. |
payload.output.sentence.words[k] | ||
text | string | Text of the word. |
begin_index | integer | Starting position index of the word in the sentence, starting from 0. |
end_index | integer | Ending position index of the word in the sentence, starting from 1. |
begin_time | integer | Start timestamp of the audio for this word, in milliseconds. |
end_time | integer | End timestamp of the audio for this word, in milliseconds. |
Best practices for timestamp data extraction
After enabling word_timestamp_enabled, timestamp information is returned in the task-finished event. Example:
{
"header": {
"task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
"event": "task-finished",
"attributes": {"request_uuid": "0a9dba9e-d3a6-45a4-be6d-xxxxxxxxxxxx"}
},
"payload": {
"output": {
"sentence": {
"index": 0,
"words": [
{
"text": "How",
"begin_index": 0,
"end_index": 1,
"begin_time": 80,
"end_time": 280
},
{
"text": "is",
"begin_index": 1,
"end_index": 2,
"begin_time": 300,
"end_time": 400
},
{
"text": "the",
"begin_index": 2,
"end_index": 3,
"begin_time": 420,
"end_time": 520
},
{
"text": "weather",
"begin_index": 3,
"end_index": 4,
"begin_time": 540,
"end_time": 840
},
{
"text": "today",
"begin_index": 4,
"end_index": 5,
"begin_time": 860,
"end_time": 1160
},
{
"text": "?",
"begin_index": 5,
"end_index": 6,
"begin_time": 1180,
"end_time": 1320
}
]
}
},
"usage": {"characters": 25}
}
}
Correct extraction method:
Extract complete timestamps only from the task-finished event: Complete sentence timestamp data is returned only at task completion (in the task-finished event), in the payload.output.sentence.words array.
The result-generated event does not contain timestamps: The result-generated event indicates audio stream progress but does not include word-level timestamp information.
Example event filtering (Python):
def on_event(message):
    event_type = message["header"]["event"]
    # Extract timestamps only from the task-finished event
    if event_type == "task-finished":
        words = message["payload"]["output"]["sentence"]["words"]
        for word in words:
            print(f"Text: {word['text']}, Start: {word['begin_time']}ms, End: {word['end_time']}ms")
    # Process audio streams in result-generated events
    elif event_type == "result-generated":
        # Handle audio stream, do not extract timestamps
        pass
If you extract timestamp data from multiple events, duplicates occur. Ensure you extract timestamps only from the task-finished event.
4. task-failed event: Task failed
If you receive the task-failed event, the task failed. Close the WebSocket connection and handle the error. If the failure was due to a coding issue, adjust your code accordingly.
Example:
{
"header": {
"task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
"event": "task-failed",
"error_code": "InvalidParameter",
"error_message": "[tts:]Engine return error code: 418",
"attributes": {}
},
"payload": {}
}
header parameter description:
Parameter | Type | Description |
header.event | string | Event type. For this event, it is always "task-failed". |
header.task_id | string | task_id generated by the client. |
header.error_code | string | Error type description. |
header.error_message | string | Detailed error cause. |
Task interruption methods
During streaming synthesis, to terminate the current task early (for example, user cancels playback or interrupts a real-time conversation), use one of the following methods:
Interruption method | Server behavior | Use case |
Close connection directly | | Immediate interruption: User cancels playback, switches content, or exits the app.
Send finish-task | The server synthesizes the cached text, returns the remaining audio, and then returns the task-finished event. | Elegant termination: Stop sending new text but still receive audio for cached content.
Initiate a new run-task | | Task switching: In real-time conversations, users interrupt and switch immediately to new content.
Connection overhead and connection reuse
The WebSocket service supports connection reuse to improve resource efficiency and avoid connection overhead.
After the server receives the client's run-task instruction, it starts a new task. After the client sends the finish-task instruction, the server returns the task-finished event when the task completes. After the task ends, the WebSocket connection can be reused. To start a new task, simply send another run-task instruction.
Each task using a reused connection must use a different task_id.
If a task fails during execution, the server returns the task-failed event and closes the connection. At that point, the connection cannot be reused.
If no new task starts within 60 seconds after the previous task ends, the connection times out and closes automatically.
Performance metrics and concurrency limits
Concurrency limits
See Rate limiting for details.
To increase your concurrency quota (such as supporting more concurrent connections), contact customer support. Quota adjustments may require review and usually take 1–3 business days to complete.
Best practice: To improve resource utilization, reuse a WebSocket connection for multiple tasks instead of creating a new connection for each task. See Connection overhead and connection reuse.
Connection performance and latency
Normal connection time:
Clients in the China mainland: WebSocket connection establishment (from new WebSocket to onOpen) typically takes 200–1000 milliseconds.
Cross-border connections (such as Hong Kong or international regions): Connection latency may reach 1–3 seconds, and occasionally up to 10–30 seconds.
Troubleshooting long connection times:
If WebSocket connection establishment takes longer than 30 seconds, possible causes include the following:
Network issues: High network latency between the client and server (such as cross-border connections or ISP quality problems).
Slow DNS resolution: DNS resolution for dashscope.aliyuncs.com takes too long. Try using a public DNS (such as 8.8.8.8) or configuring your local hosts file.
Slow TLS handshake: The client uses an outdated TLS version or certificate validation takes too long. Use TLS 1.2 or later.
Proxy or firewall: Corporate networks may restrict WebSocket connections or require proxy usage.
Troubleshooting tools:
Use Wireshark or tcpdump to analyze TCP handshake, TLS handshake, and WebSocket Upgrade timing.
Test HTTP connection latency with curl:
curl -w "@curl-format.txt" -o /dev/null -s https://dashscope.aliyuncs.com
The CosyVoice WebSocket API is deployed in the Beijing region of the China mainland. If your client is in another region (such as Hong Kong or overseas), consider using a nearby relay server or CDN to accelerate the connection.
Audio generation performance
Synthesis speed:
Real-time factor (RTF): CosyVoice models typically synthesize audio at 0.1–0.5× real-time (i.e., generating 1 second of audio takes 0.1–0.5 seconds). Actual speed depends on model version, text length, and server load.
First packet latency: From sending the continue-task instruction to receiving the first audio chunk, latency is typically 200–800 milliseconds.
Sample code
Sample code provides only basic functionality to verify service connectivity. You must develop additional code for real-world business scenarios.
When writing WebSocket client code, use asynchronous programming so that you can send and receive messages simultaneously. A minimal Python sketch follows.
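The sketch below walks the full flow (run-task → task-started → continue-task → finish-task → audio frames → task-finished) for one short text. It is a hedged illustration, not the official SDK: it assumes the websockets package (older releases accept extra_headers; newer ones use additional_headers), the DASHSCOPE_API_KEY environment variable, and the model and voice names from the examples in this topic:
import asyncio
import json
import os
import uuid

import websockets  # pip install websockets

API_KEY = os.environ["DASHSCOPE_API_KEY"]
URL = "wss://dashscope.aliyuncs.com/api-ws/v1/inference"  # use the intl URL if applicable

async def synthesize(text: str, out_path: str = "output.mp3") -> None:
    task_id = str(uuid.uuid4())  # one task_id, reused by every instruction below

    def instruction(action: str, payload: dict) -> str:
        return json.dumps({
            "header": {"action": action, "task_id": task_id, "streaming": "duplex"},
            "payload": payload,
        })

    async with websockets.connect(
        URL, extra_headers={"Authorization": f"bearer {API_KEY}"}
    ) as ws:
        # 1. Start the task (no text here; input stays an empty object).
        await ws.send(instruction("run-task", {
            "task_group": "audio",
            "task": "tts",
            "function": "SpeechSynthesizer",
            "model": "cosyvoice-v3-flash",
            "parameters": {
                "text_type": "PlainText",
                "voice": "longanyang",
                "format": "mp3",
                "sample_rate": 22050,
            },
            "input": {},
        }))

        with open(out_path, "wb") as audio_file:
            async for message in ws:
                # Binary frames are audio chunks; append them in arrival order.
                if isinstance(message, (bytes, bytearray)):
                    audio_file.write(message)
                    continue
                msg = json.loads(message)
                event = msg["header"]["event"]
                if event == "task-started":
                    # 2. Send the text, then end the task right away.
                    await ws.send(instruction("continue-task", {"input": {"text": text}}))
                    await ws.send(instruction("finish-task", {"input": {}}))
                elif event == "task-finished":
                    break  # all audio received; the connection could now be reused
                elif event == "task-failed":
                    raise RuntimeError(msg["header"].get("error_message"))

asyncio.run(synthesize("Before my bed, moonlight gleams, like frost upon the ground."))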
Error codes
To troubleshoot an error, see Error messages.
FAQ
Features, billing, and rate limiting
Q: How can I fix inaccurate pronunciation?
You can customize the speech synthesis output using Speech Synthesis Markup Language (SSML).
Q: Why use WebSocket instead of HTTP/HTTPS? Why not provide a RESTful API?
The speech service chooses WebSocket over HTTP/HTTPS/RESTful because it relies on full-duplex communication. WebSocket allows both the server and client to actively transmit data bidirectionally (such as pushing real-time speech synthesis or recognition progress). In contrast, RESTful APIs based on HTTP support only unidirectional client-initiated request-response patterns, which cannot meet real-time interaction requirements.
Q: Speech synthesis is billed per character. How do I view or retrieve the text length for each synthesis?
Get the character count from the payload.usage.characters parameter in the result-generated event returned by the server. Use the value from the last result-generated event.
Troubleshooting
When your code throws an error, check whether the instruction sent to the server is correct. Print the instruction content and check for formatting errors or missing required parameters. If the instruction is correct, troubleshoot using the information in the error codes.
Q: How do I get the Request ID?
You can get it in two ways:
Parse the information returned by the server in the result-generated event.
Parse the information returned by the server in the task-finished event.
Q: Why does SSML fail?
Troubleshoot using the following steps:
Confirm that your usage meets the limitations and constraints.
Ensure you call it correctly. For details, see SSML markup language support.
Ensure the text to synthesize is plain text and meets formatting requirements. For details, see SSML markup language introduction.
Q: Why won't the audio play?
You can troubleshoot based on the following scenarios:
Scenario 1: You save the audio as a complete file (for example, audio.mp3):
Audio format consistency: Make sure the audio format set in the request parameters matches the file extension. For example, if you set the audio format to `wav` but save the file with an `.mp3` extension, playback might fail.
Player compatibility: Check if your player supports the audio file's format and sample rate. For example, some players may not support audio files with high sample rates or specific encodings.
Scenario 2: You play the audio as a stream (streaming playback):
Save the audio stream as a complete file and try to play it. If the file does not play, follow the troubleshooting steps for Scenario 1.
If the file plays correctly, the issue might be with your streaming implementation. Check if your player supports streaming playback.
Common tools and libraries that support streaming playback include FFmpeg, PyAudio (Python), AudioFormat (Java), and MediaSource (JavaScript).
Q: Why is the audio playback stuttering?
You can troubleshoot based on the following scenarios:
Check the text submission speed: Make sure the interval between text submissions is reasonable. This prevents delays where the next text segment is not sent promptly after the previous audio segment finishes playing.
Check callback function performance:
Check for excessive business logic in the callback function that could cause blocking.
The callback function runs on the WebSocket thread. If this thread is blocked, it can interfere with the reception of network packets over WebSocket, which causes audio stuttering.
Write the audio data to a separate audio buffer, then read and process it on a different thread to avoid blocking the WebSocket thread (a minimal sketch follows this list).
Check network stability: Make sure your network connectivity is stable to prevent audio transmission interruptions or delays caused by network fluctuations.
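The following Python sketch illustrates that pattern with a standard-library queue: the WebSocket callback only enqueues chunks, and a separate worker thread consumes them. The file write stands in for your actual streaming player (for example, PyAudio), and the sentinel handling is an assumption of this sketch:
import queue
import threading

audio_queue = queue.Queue()  # holds bytes chunks; None is used as an end-of-task sentinel

def on_binary_frame(chunk: bytes) -> None:
    """Called from the WebSocket thread: only enqueue, never block here."""
    audio_queue.put(chunk)

def playback_worker() -> None:
    """Runs on its own thread; replace the file write with your streaming player."""
    with open("output.mp3", "ab") as f:
        while True:
            chunk = audio_queue.get()
            if chunk is None:  # sentinel: task-finished was received
                break
            f.write(chunk)

threading.Thread(target=playback_worker, daemon=True).start()
# After receiving the task-finished event, signal the worker to stop:
# audio_queue.put(None)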
Q: Why is speech synthesis slow?
You can troubleshoot the issue by following these steps:
Check the input interval
For streaming speech synthesis, check if the interval between sending text segments is too long (for example, waiting several seconds after one segment is sent before sending the next). Long intervals increase the total synthesis time.
Analyze performance metrics
Time to First Byte (TTFB): This is typically around 500 ms.
Real-Time Factor (RTF): The ratio of the total synthesis time to the audio duration. This value should typically be less than 1.0.
Q: How do I fix pronunciation errors in the synthesized speech?
You can use the SSML <phoneme> tag to specify the correct pronunciation.
Q: Why is no audio returned? Why is part of the text at the end not converted to speech? (Missing speech)
Confirm you did not forget to send the finish-task instruction. During speech synthesis, the server waits until it has enough text in its cache before starting synthesis. If you forget to send the finish-task instruction, the text cached at the end may not be synthesized into speech.
Q: Why is the audio stream order scrambled, causing garbled playback?
Troubleshoot from two angles:
Ensure the run-task, continue-task, and finish-task instructions for the same synthesis task use the same task_id.
Check whether asynchronous operations cause audio files to be written in a different order than binary data is received.
Q: How do I handle WebSocket connection errors?
How do I handle WebSocket connection closure (code 1007)?
The WebSocket connection closes immediately after sending the run-task instruction, with close code 1007.
Root cause: The server detects protocol or data format errors and closes the connection. Common causes include the following:
Invalid fields in the run-task instruction's payload (e.g., adding fields other than "input": {}).
JSON format errors (e.g., missing commas or mismatched brackets).
Missing required fields (e.g., task_id, action).
Solution:
Check JSON format: Validate the request body format.
Check required fields: Confirm header.action, header.task_id, header.streaming, payload.task_group, payload.task, payload.function, payload.model, and payload.input are all set correctly.
Remove invalid fields: In the run-task payload.input, allow only an empty object {} or a text field. Do not add other fields.
How do I handle WebSocketBadStatus, 401 Unauthorized, or 403 Forbidden errors?
A WebSocketBadStatus, 401 Unauthorized, or 403 Forbidden error occurs during connection setup.
Root cause: Authentication failure. The server validates the Authorization header during the WebSocket handshake. If the API key is invalid or missing, the connection is rejected.
Solution: See Troubleshooting authentication failures.
Permissions and authentication
Q: How can I restrict my API key to be used only for the CosyVoice speech synthesis service and not for other Model Studio models (permission isolation)?
You can create a new workspace and grant authorization to specific models only to limit the scope of the API key. For more information, see Workspace management.
More questions
For answers to more questions, see the QA on GitHub.