This topic describes the parameters and interaction protocol for WebSocket connections to the CosyVoice speech synthesis service. Use WebSocket for languages other than Java and Python, which have SDK support.
User guide: For model overviews and selection suggestions, see Real-time speech synthesis - CosyVoice.
WebSocket provides full-duplex communication: the client and server establish a persistent connection with a single handshake, allowing both parties to push data to each other, providing better real-time performance.
WebSocket libraries are available for most languages (Go: gorilla/websocket, PHP: Ratchet, Node.js: ws). Familiarize yourself with WebSocket basics before starting.
CosyVoice models support only WebSocket connections—not HTTP REST APIs. HTTP requests return InvalidParameter or URL errors.
Prerequisites
Get an API key.
Models and pricing
Text and format limitations
Text length limits
Send up to 20,000 characters per continue-task instruction, with a total limit of 200,000 characters across all instructions.
Character counting rules
- Chinese characters (simplified and traditional Chinese, Japanese Kanji, and Korean Hanja) count as two characters each. All other characters (punctuation, letters, numbers, Kana, and Hangul) count as one.
- SSML tags are not included when calculating the text length.
- Examples:
  - "你好" → 2 (Chinese character) + 2 (Chinese character) = 4 characters
  - "中A文123" → 2 (Chinese character) + 1 (A) + 2 (Chinese character) + 1 (1) + 1 (2) + 1 (3) = 8 characters
  - "中文。" → 2 (Chinese character) + 2 (Chinese character) + 1 (。) = 5 characters
  - "中 文。" → 2 (Chinese character) + 1 (space) + 2 (Chinese character) + 1 (。) = 6 characters
  - "<speak>你好</speak>" → 2 (Chinese character) + 2 (Chinese character) = 4 characters
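The counting rules above can be sketched in Python. This is a simplified approximation, not the official counter: it treats CJK Unified Ideograph code points as two characters each and strips XML-style tags before counting.

```python
import re

def billed_length(text: str) -> int:
    """Approximate the billed character count per the documented rules."""
    # SSML tags are excluded from the count (assumes simple <tag>...</tag> markup)
    stripped = re.sub(r"<[^>]+>", "", text)
    total = 0
    for ch in stripped:
        # CJK Unified Ideographs (plus Extension A) cover simplified/traditional
        # Chinese, Japanese Kanji, and Korean Hanja: each counts as 2
        if "\u4e00" <= ch <= "\u9fff" or "\u3400" <= ch <= "\u4dbf":
            total += 2
        else:
            # Punctuation, letters, digits, Kana, Hangul, spaces: each counts as 1
            total += 1
    return total

print(billed_length("中A文123"))  # → 8
```

The service's own counter may differ in edge cases (for example, rare ideograph extensions), so treat billing values from the result-generated event as authoritative.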
Encoding format
Use UTF-8 encoding.
Support for mathematical expressions
Mathematical expression parsing (v3.5-flash, v3.5-plus, v3-flash, v3-plus, v2 only): Supports primary and secondary school math—basic operations, algebra, geometry.
This feature only supports Chinese.
See Convert LaTeX formulas to speech (Chinese language only).
SSML support
SSML requires all of the following conditions to be met:
- Model support: Only cosyvoice-v3.5-flash, cosyvoice-v3.5-plus, cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 support SSML.
- Voice support: You must use an SSML-enabled voice. Supported voices include:
  - All cloned voices (created through the Voice Cloning API).
  - System voices marked as SSML-enabled in the voice list.
  Note: System voices that do not support SSML (such as some basic voices) return the error "SSML text is not supported at the moment!" even with enable_ssml enabled.
- Parameter setting: In the run-task instruction, set the enable_ssml parameter to true.
After meeting these conditions, send SSML-formatted text through the continue-task instruction to use SSML. For a complete example, see Getting started.
Interaction flow
Client-to-server messages are instructions. Server-to-client messages are either JSON-formatted events or binary audio streams.
The client-server interaction follows this sequence:
1. Establish connection: The client opens a WebSocket connection to the server.
2. Start task: The client sends the run-task instruction.
3. Wait for confirmation: The client receives the task-started event, confirming that the task has started.
4. Send text to synthesize: The client sends one or more continue-task instructions in order. The server returns a result-generated event and an audio stream after receiving a complete sentence. For text length constraints, see the text field in the continue-task instruction.
   Note: Send multiple continue-task instructions to submit text fragments in order. The server automatically segments text into sentences:
   - Complete sentences are synthesized immediately, and the client receives the audio.
   - Incomplete sentences are buffered until complete. No audio is returned for incomplete sentences. After receiving the finish-task instruction, the server force-synthesizes all buffered content.
5. Receive audio: The client receives the audio stream through the binary channel.
6. Notify the server to end the task: After sending all text, the client sends the finish-task instruction and continues receiving audio. Do not skip this step, or the ending audio may be lost.
7. Task ends: The client receives the task-finished event, marking the end of the task.
8. Close connection: The client closes the WebSocket connection.
To improve resource utilization, reuse a WebSocket connection to handle multiple tasks instead of creating a new connection for each task. See Connection overhead and reuse.
The task_id must remain consistent throughout: Within a single synthesis task, run-task, all continue-task, and finish-task instructions must use the same task_id.
Consequences of mismatched task_ids:
- Disordered audio stream delivery due to unassociated requests.
- Misaligned speech content due to text being assigned to different tasks.
- Abnormal task state, possibly preventing receipt of the task-finished event.
- Billing failures or inaccurate usage statistics.
Correct approach:
- Generate a unique task_id (for example, a UUID) when sending the run-task instruction.
- Store the task_id in a variable.
- Use this task_id for all subsequent continue-task and finish-task instructions.
- After the task ends (after receiving task-finished), generate a new task_id for the next task.
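The approach above can be sketched as follows. Field values follow the examples in this topic; the model name and the voice `longanyang` are placeholders for your own choices.

```python
import json
import uuid

def new_task_id() -> str:
    # Generate one UUID per synthesis task and reuse it for every instruction
    return str(uuid.uuid4())

def build_instruction(action: str, task_id: str, payload: dict) -> str:
    # run-task, continue-task, and finish-task all share this envelope
    return json.dumps({
        "header": {"action": action, "task_id": task_id, "streaming": "duplex"},
        "payload": payload,
    })

task_id = new_task_id()  # store it in a variable...
run = build_instruction("run-task", task_id, {
    "task_group": "audio", "task": "tts", "function": "SpeechSynthesizer",
    "model": "cosyvoice-v3-flash",
    "parameters": {"text_type": "PlainText", "voice": "longanyang"},
    "input": {},  # input must be present, even if empty
})
cont = build_instruction("continue-task", task_id, {"input": {"text": "Hello."}})
fin = build_instruction("finish-task", task_id, {"input": {}})
# ...and pass the SAME task_id to all three instructions, as above
```

After receiving task-finished for this task, call `new_task_id()` again before starting the next one.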
Client implementation considerations
When implementing a WebSocket client, especially on Flutter, web, or mobile platforms, clearly define server and client responsibilities to ensure task integrity and stability.
Server and client responsibilities
Server responsibilities
The server delivers the complete audio stream in order. You do not need to handle audio ordering or completeness.
Client responsibilities
The client must handle the following:
- Read and concatenate all audio chunks
The server delivers audio as multiple binary frames. The client must receive all frames and concatenate them to form the final audio file.
Python example: append each received binary audio chunk to the output file.

# audio_chunk is each received binary audio frame
with open("output.mp3", "ab") as f:  # Append mode
    f.write(audio_chunk)

JavaScript example: collect the chunks, then merge them after the task completes.

const audioChunks = [];
ws.onmessage = (event) => {
  if (event.data instanceof Blob) {
    audioChunks.push(event.data); // Collect all audio chunks
  }
};
// Merge audio after the task completes
const audioBlob = new Blob(audioChunks, { type: 'audio/mp3' });
- Maintain a complete WebSocket lifecycle
Do not disconnect the WebSocket connection prematurely at any point from sending the run-task instruction to receiving the task-finished event. Common mistakes:
  - Closing the connection before all audio chunks are returned, resulting in incomplete audio.
  - Forgetting to send the finish-task instruction, leaving text buffered and unprocessed on the server.
  - Failing to handle WebSocket keepalive properly during page navigation or app backgrounding.
Important: Mobile apps (such as Flutter, iOS, and Android) require special attention to network management when entering the background. Maintain the WebSocket connection in a background task or service, or reinitialize the connection when returning to the foreground.
- Text integrity in ASR→LLM→TTS workflows
In ASR (speech recognition) → LLM (large language model) → TTS (speech synthesis) workflows, ensure the text passed to TTS is complete. For example:
  - Wait for the LLM to generate a full sentence or paragraph before sending the continue-task instruction, rather than streaming character by character.
  - For streaming synthesis, send text in batches at natural sentence boundaries (such as periods or question marks).
  - After the LLM finishes generating, always send the finish-task instruction to avoid missing trailing content.
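A common way to batch streaming LLM output at sentence boundaries is a small buffer that releases text only when a sentence-ending mark arrives. This is a sketch; adapt the delimiter set to your language and punctuation conventions.

```python
SENTENCE_END = "。！？.!?"  # sentence-ending marks (assumption: adjust as needed)

class SentenceBatcher:
    """Buffer streaming LLM tokens; emit only complete sentences."""

    def __init__(self) -> None:
        self._buf: list[str] = []

    def feed(self, token: str) -> list[str]:
        # Returns zero or more complete sentences, each ready to send
        # in a continue-task instruction.
        out = []
        for ch in token:
            self._buf.append(ch)
            if ch in SENTENCE_END:
                out.append("".join(self._buf))
                self._buf.clear()
        return out

    def flush(self) -> str:
        # Call once the LLM finishes, before sending finish-task,
        # so trailing text without end punctuation is not lost.
        rest = "".join(self._buf)
        self._buf.clear()
        return rest
```

Send each string returned by `feed` as one continue-task instruction; send whatever `flush` returns (if non-empty) as a final continue-task before finish-task.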
- Platform-specific tips
  - Flutter: When using the web_socket_channel package, close the connection correctly in the dispose method to prevent memory leaks. Also handle app lifecycle events (such as AppLifecycleState.paused) for background transitions.
  - Web (browser): Some browsers limit the number of WebSocket connections. Reuse a single connection for multiple tasks. Use the beforeunload event to close the connection explicitly before the page closes.
  - Mobile (iOS/Android native): The operating system may pause or terminate network connections when the app enters the background. Use a background task or foreground service to keep the WebSocket active, or reinitialize the task when returning to the foreground.
URL
The WebSocket URL is fixed:
International
In international deployment mode, the access point and data storage are both located in the Singapore region. Model inference computing resources are dynamically scheduled globally, excluding the Chinese mainland.
WebSocket URL: wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference
Chinese mainland
In Chinese mainland deployment mode, the access point and data storage are both located in the Beijing region. Model inference computing resources are restricted to the Chinese mainland.
WebSocket URL: wss://dashscope.aliyuncs.com/api-ws/v1/inference
Common URL configuration errors:
- Error: Using URLs that start with http:// or https://. → Correct: You must use the wss:// protocol.
- Error: Placing the Authorization parameter in the URL query string (for example, ?Authorization=bearer <your_api_key>). → Correct: Set the Authorization parameter in the HTTP handshake headers. See Headers.
- Error: Appending model names or other path parameters to the URL. → Correct: The URL is fixed. Specify the model using the payload.model parameter in the run-task instruction.
Headers
Set the following request headers:
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| Authorization | string | Yes | Authentication token. Format: bearer <your_api_key>. |
| user-agent | string | No | Client identifier for tracking the source. |
| X-DashScope-WorkSpace | string | No | Your Alibaba Cloud Model Studio workspace ID. |
| X-DashScope-DataInspection | string | No | Whether to enable data compliance inspection. |
Authentication timing and common errors
Authentication occurs during the WebSocket handshake, not when sending the run-task instruction. If the Authorization header is missing or invalid, the server rejects the handshake and returns an HTTP 401 or 403 error. Client libraries typically report this as a WebSocketBadStatus exception.
Troubleshoot authentication failures
If the WebSocket connection fails, follow these steps:
1. Check the API key format: Confirm the Authorization header follows bearer <your_api_key>, with a space separating `bearer` and the key.
2. Verify API key validity: In the Model Studio console, confirm the key is active and has CosyVoice model permissions.
3. Check the header settings: Confirm the Authorization header is set during the WebSocket handshake. Configuration differs by language:
   - Python (websockets library): extra_headers={"Authorization": f"bearer {api_key}"}
   - JavaScript: The standard browser WebSocket API does not support custom headers. Use a server-side proxy or another library, such as ws.
   - Go (gorilla/websocket): header.Add("Authorization", fmt.Sprintf("bearer %s", apiKey))
4. Test network connectivity: Use curl or Postman to call other HTTP-based DashScope APIs and confirm that the API key is valid.
Using WebSocket in browser environments
In browser environments such as Vue3 and React, the native new WebSocket(url) API does not support custom request headers (including Authorization) during the handshake. This is a browser security restriction, so you cannot authenticate directly from frontend code.
Solution: Use a backend proxy
- Set up a WebSocket connection from your backend (Node.js, Java, or Python) to the CosyVoice service. The backend can set the Authorization header.
- Have the frontend connect via WebSocket to your backend, which acts as a proxy and forwards messages to CosyVoice.
- Benefits: The API key stays hidden from the frontend. You can also add authentication, logging, or rate limiting on the backend.
Never hardcode your API key in frontend code. Leaking your API key could lead to account compromise, unexpected charges, or data breaches.
Example code:
For other programming languages, implement the same logic or use AI tools to convert these examples.
-
Frontend (native web) + Backend (Node.js Express): cosyvoiceNodeJs_en.zip
-
Frontend (native web) + Backend (Python Flask): cosyvoiceFlask_en.zip
Instructions (client → server)
Instructions are JSON messages sent from the client to the server as WebSocket text frames. They control the task lifecycle.
Send instructions in the following order to prevent task failure:
1. Send the run-task instruction
   - Starts the speech synthesis task.
   - The task_id used here must be reused in all subsequent continue-task and finish-task instructions.
2. Send the continue-task instruction
   - Sends text to synthesize.
   - Send only after receiving the task-started event from the server.
3. Send the finish-task instruction
   - Ends the speech synthesis task.
   - Send after all continue-task instructions have been sent.
1. run-task instruction: Start a task
Starts a speech synthesis task. Configure the voice, sample rate, and other parameters in this instruction.
- Timing: Send this instruction after the WebSocket connection is established.
- Do not send text to synthesize: Sending text in the run-task instruction complicates troubleshooting. Instead, send text using the continue-task instruction.
- The input field is required: The payload must contain the input field, formatted as an empty object {}. If you omit this field, the "task can not be null" error occurs.
Example:
{
"header": {
"action": "run-task",
"task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx", // Random UUID
"streaming": "duplex"
},
"payload": {
"task_group": "audio",
"task": "tts",
"function": "SpeechSynthesizer",
"model": "cosyvoice-v3-flash",
"parameters": {
"text_type": "PlainText",
"voice": "longanyang", // Voice
"format": "mp3", // Audio format
"sample_rate": 22050, // Sample rate
"volume": 50, // Volume
"rate": 1, // Speech rate
"pitch": 1 // Pitch
},
"input": {// input must exist, or it will return an error
}
}
}
header parameter descriptions:
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| header.action | string | Yes | Instruction type. Fixed value: "run-task". |
| header.task_id | string | Yes | Task ID for this operation: a 32-character UUID composed of randomly generated letters and digits. Hyphens are optional. When sending subsequent continue-task and finish-task instructions, use the same task_id as in the run-task instruction. |
| header.streaming | string | Yes | Fixed string: "duplex". |
payload parameter descriptions:

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| payload.task_group | string | Yes | Fixed string: "audio". |
| payload.task | string | Yes | Fixed string: "tts". |
| payload.function | string | Yes | Fixed string: "SpeechSynthesizer". |
| payload.model | string | Yes | Speech synthesis model. Each model version requires compatible voices. |
| payload.input | object | Yes | Required, but must be an empty object {} in the run-task instruction. Common error: omitting the input field, or adding unexpected fields (such as mode or content), causes the server to reject the request with "InvalidParameter: task can not be null" or to close the connection (WebSocket code 1007). |
| payload.parameters | object | Yes | Synthesis parameters. See the following table. |

Fields under payload.parameters:

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| text_type | string | Yes | Fixed string: "PlainText". |
| voice | string | Yes | The voice used for speech synthesis. Both system voices and cloned voices are supported. |
| format | string | No | Audio coding format. Supports pcm, wav, mp3 (default), and opus. When format is opus, adjust the bitrate using the bit_rate parameter. |
| sample_rate | integer | No | Audio sampling rate in Hz. Default: 22050. Valid values: 8000, 16000, 22050, 24000, 44100, 48000. Note: The default sample rate is the optimal rate for the selected voice. Output uses this rate by default, but downsampling and upsampling are supported. |
| volume | integer | No | The volume. Default: 50. Valid range: [0, 100]. Values scale linearly: 0 is silent, 50 is the default, and 100 is the maximum. |
| rate | float | No | The speech rate. Default: 1.0. Valid range: [0.5, 2.0]. A value of 1.0 is the standard rate; values below 1.0 slow the speech down, and values above 1.0 speed it up. |
| pitch | float | No | Pitch multiplier. Default: 1.0. Valid range: [0.5, 2.0]. A value of 1.0 is the voice's natural pitch; values above 1.0 raise the pitch, and values below 1.0 lower it. The relationship to perceived pitch is neither linear nor logarithmic, so test to find suitable values. |
| enable_ssml | boolean | No | Specifies whether to enable SSML. Default: false. To use SSML, set this to true and meet the conditions described in SSML support. |
| bit_rate | integer | No | The audio bitrate in kbps. Takes effect only when format is opus. Default: 32. Valid values: [6, 510]. |
| word_timestamp_enabled | boolean | No | Specifies whether to enable word-level timestamps. Default: false. Available only for cloned voices of cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2, and for system voices marked as supported in the voice list. When enabled, timestamp information appears in the result-generated event. |
| seed | integer | No | The random seed used during generation. Different seeds produce different synthesis results. If the model, text, voice, and other parameters are identical, the same seed reproduces the same output. Default: 0. Valid values: [0, 65535]. |
| language_hints | array[string] | No | Specifies the target language for speech synthesis to improve the result. Use this when pronunciation or synthesis quality is poor for numbers, abbreviations, symbols, or less common languages. Note: Although this parameter is an array, the current version processes only the first element, so pass only one value. Important: This parameter sets the target language for synthesis and is independent of the language of the sample audio used for voice cloning. To set the source language for a cloning task, see CosyVoice Voice Cloning/Design API. |
| instruction | string | No | An instruction that controls synthesis effects such as dialect, emotion, or speaking style. Available only for cloned voices of cosyvoice-v3.5-flash, cosyvoice-v3.5-plus, and cosyvoice-v3-flash, and for system voices marked as supporting Instruct in the voice list. Length limit: 100 characters, counted with the same rules as synthesis text (Chinese characters, Japanese Kanji, and Korean Hanja count as two; all other characters count as one). Usage requirements vary by model. |
| enable_aigc_tag | boolean | No | Specifies whether to embed an invisible AIGC identifier into the generated audio. When true, the identifier is embedded in supported formats (WAV, MP3, and Opus). Default: false. Supported only by cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2. |
| aigc_propagator | string | No | Sets the propagator recorded in the AIGC identifier. Default: your Alibaba Cloud UID. Supported only by cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2. |
| aigc_propagate_id | string | No | Sets the propagation ID recorded in the AIGC identifier. Default: the request ID of the current speech synthesis request. Supported only by cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2. |
| hot_fix | object | No | Configuration for text hotpatching: customize the pronunciation of specific words or replace text before synthesis. Available only for cloned voices of cosyvoice-v3-flash. |
| enable_markdown_filter | boolean | No | Specifies whether to enable Markdown filtering. When enabled, the system removes Markdown symbols from the input text before synthesis so they are not read aloud. Available only for cloned voices of cosyvoice-v3-flash. Default: false. |
2. continue-task instruction
Sends text to synthesize.
- Send all text in one continue-task instruction, or split it across multiple continue-task instructions sent in order.
- When to send: After receiving the task-started event.
- Do not wait longer than 23 seconds between text fragments; otherwise, the "request timeout after 23 seconds" error occurs. The server enforces this 23-second timeout; clients cannot modify it.
- If no more text remains, send the finish-task instruction to end the task.
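One client-side way to guard against this timeout is to track the time since the last instruction was sent and finish the task before the deadline. This is a sketch; the limit itself is fixed by the server and the class name is illustrative.

```python
import time

class SendWatchdog:
    """Track how long remains before the server's 23-second idle timeout."""

    LIMIT = 23.0  # seconds, enforced server-side; not configurable by clients

    def __init__(self) -> None:
        self._last_send = time.monotonic()

    def mark_sent(self) -> None:
        # Call each time a continue-task instruction is sent
        self._last_send = time.monotonic()

    def remaining(self) -> float:
        # Seconds left before the "request timeout after 23 seconds" error.
        # If this approaches 0 and no more text is ready, send finish-task.
        return max(0.0, self.LIMIT - (time.monotonic() - self._last_send))
```

For example, an ASR→LLM→TTS pipeline can check `remaining()` while waiting on the LLM and send finish-task early rather than letting the connection time out.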
Example:
{
"header": {
"action": "continue-task",
"task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx", // Random UUID
"streaming": "duplex"
},
"payload": {
"input": {
"text": "Before my bed, moonlight shines bright, I suspect it's frost upon the ground."
}
}
}
header parameter descriptions:
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| header.action | string | Yes | Instruction type. Fixed value: "continue-task". |
| header.task_id | string | Yes | Task ID for this request. Must match the task_id used in the run-task instruction. |
| header.streaming | string | Yes | Fixed string: "duplex". |

payload parameter descriptions:

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| input.text | string | Yes | Text to synthesize. |
3. finish-task instruction: End task
Ends the speech synthesis task.
Always send this instruction. Otherwise, you may encounter:
- Incomplete audio: The server does not force-synthesize incomplete buffered sentences, resulting in missing endings.
- Connection timeout: If you wait more than 23 seconds after the last continue-task instruction before sending finish-task, the connection times out and closes.
- Billing issues: Tasks that do not end normally may return inaccurate usage information.
When to send: Immediately after all continue-task instructions have been sent. Do not wait for audio to finish or delay sending; this may trigger timeouts.
Example:
{
"header": {
"action": "finish-task",
"task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
"streaming": "duplex"
},
"payload": {
"input": {}//input must exist, or it will return an error
}
}
header parameter descriptions:
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| header.action | string | Yes | Instruction type. Fixed value: "finish-task". |
| header.task_id | string | Yes | Task ID for this request. Must match the task_id used in the run-task instruction. |
| header.streaming | string | Yes | Fixed string: "duplex". |

payload parameter descriptions:

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| payload.input | object | Yes | Fixed format: {}. |
Events (server → client)
Events are JSON messages sent from the server to the client. Each event marks a stage in the task lifecycle.
The server sends binary audio separately—it is not included in any event.
1. task-started event: Task started
The task-started event confirms that the task has started. Send continue-task or finish-task instructions only after receiving this event. Otherwise, the task fails.
The task-started event’s payload contains no content.
Example:
{
"header": {
"task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
"event": "task-started",
"attributes": {}
},
"payload": {}
}
header parameter descriptions:
| Parameter | Type | Description |
| --- | --- | --- |
| header.event | string | Event type. Fixed value: "task-started". |
| header.task_id | string | Task ID generated by the client. |
2. result-generated event
While you send continue-task and finish-task instructions, the server continuously returns result-generated events.
To link audio data to its corresponding text, the server includes sentence metadata in the result-generated event alongside the audio. The server automatically splits the input text into sentences. The synthesis of each sentence consists of three sub-events:
- sentence-begin: Marks the sentence start and returns the text to synthesize.
- sentence-synthesis: Marks an audio data chunk. Each event is followed immediately by an audio data frame over the WebSocket binary channel.
  - One sentence produces multiple sentence-synthesis events, one per audio chunk.
  - The client must receive these audio chunks in order and append them to the same file.
  - Each sentence-synthesis event maps one-to-one with its following audio frame; no misalignment occurs.
- sentence-end: Marks the sentence end and returns the sentence text and the cumulative billed character count.
Use the payload.output.type field to distinguish between sub-event types.
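A client-side dispatcher over this field might look like the following sketch. Field names follow the event examples in this section; audio frames themselves arrive as separate binary messages, not inside the JSON.

```python
def handle_result_generated(event: dict) -> str:
    """Classify a parsed result-generated event by payload.output.type."""
    output = event["payload"]["output"]
    kind = output["type"]
    if kind == "sentence-begin":
        # A new sentence starts; original_text holds the text being synthesized
        return f"begin sentence {output['sentence']['index']}"
    if kind == "sentence-synthesis":
        # The next binary WebSocket frame is this event's audio chunk
        return f"audio chunk for sentence {output['sentence']['index']}"
    if kind == "sentence-end":
        # usage.characters is the cumulative billed count so far
        chars = event["payload"].get("usage", {}).get("characters")
        return f"end sentence {output['sentence']['index']} ({chars} characters billed)"
    raise ValueError(f"unknown sub-event type: {kind}")
```

In a real client, the return values would instead trigger actions such as opening a sentence buffer, appending the next binary frame, or recording usage.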
Example:
sentence-begin
{
"header": {
"task_id": "3f2d5c86-0550-45c0-801f-xxxxxxxxxx",
"event": "result-generated",
"attributes": {}
},
"payload": {
"output": {
"sentence": {
"index": 0,
"words": []
},
"type": "sentence-begin",
"original_text": "Before my bed, moonlight shines bright,"
}
}
}
sentence-synthesis
{
"header": {
"task_id": "3f2d5c86-0550-45c0-801f-xxxxxxxxxx",
"event": "result-generated",
"attributes": {}
},
"payload": {
"output": {
"sentence": {
"index": 0,
"words": []
},
"type": "sentence-synthesis"
}
}
}
sentence-end
{
"header": {
"task_id": "3f2d5c86-0550-45c0-801f-xxxxxxxxxx",
"event": "result-generated",
"attributes": {}
},
"payload": {
"output": {
"sentence": {
"index": 0,
"words": []
},
"type": "sentence-end",
"original_text": "Before my bed, moonlight shines bright,"
},
"usage": {
"characters": 11
}
}
}
header parameter descriptions:
| Parameter | Type | Description |
| --- | --- | --- |
| header.event | string | Event type. Fixed value: "result-generated". |
| header.task_id | string | Task ID generated by the client. |
| header.attributes | object | Additional attributes; usually an empty object. |
payload parameter descriptions:
| Parameter | Type | Description |
| --- | --- | --- |
| payload.output.type | string | Sub-event type. Values: sentence-begin, sentence-synthesis, and sentence-end. For each sentence to synthesize, the server returns events in this order: one sentence-begin, one or more sentence-synthesis (each followed by a binary audio frame), then one sentence-end. |
| payload.output.sentence.index | integer | Sentence number, starting from 0. |
| payload.output.sentence.words | array | An array of word information. |
| payload.output.sentence.words.text | string | Word text. |
| payload.output.sentence.words.begin_index | integer | Starting position index of the word in the sentence, counting from 0. |
| payload.output.sentence.words.end_index | integer | Ending position index of the word in the sentence, counting from 1. |
| payload.output.sentence.words.begin_time | integer | Start timestamp of the word's audio, in milliseconds. |
| payload.output.sentence.words.end_time | integer | End timestamp of the word's audio, in milliseconds. |
| payload.output.original_text | string | Sentence content after splitting the user's input text. The last sentence may omit this field. |
| payload.usage.characters | integer | Total billed characters in this request so far. Within one task, the value is cumulative; use the value from the last received result-generated event. |
3. task-finished event: Task finished
The task-finished event marks the end of the task.
After the task ends, close the WebSocket connection and exit, or reuse the connection to send a new run-task instruction (see Connection overhead and reuse).
Example:
{
"header": {
"task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
"event": "task-finished",
"attributes": {
"request_uuid": "0a9dba9e-d3a6-45a4-be6d-xxxxxxxxxxxx"
}
},
"payload": {
"output": {
"sentence": {
"words": []
}
},
"usage": {
"characters": 13
}
}
}
header parameter descriptions:
| Parameter | Type | Description |
| --- | --- | --- |
| header.event | string | Event type. Fixed value: "task-finished". |
| header.task_id | string | Task ID generated by the client. |
| header.attributes.request_uuid | string | Request ID. Provide this to CosyVoice developers for issue diagnosis. |

payload parameter descriptions:

| Parameter | Type | Description |
| --- | --- | --- |
| payload.usage.characters | integer | Total billed characters in this request so far. Within one task, the value is cumulative. |
4. task-failed event: Task failed
The task-failed event indicates that the task has failed. Close the WebSocket connection and review the error message to identify the cause.
Example:
{
"header": {
"task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
"event": "task-failed",
"error_code": "InvalidParameter",
"error_message": "[tts:]Engine return error code: 418",
"attributes": {}
},
"payload": {}
}
header parameter descriptions:
| Parameter | Type | Description |
| --- | --- | --- |
| header.event | string | Event type. Fixed value: "task-failed". |
| header.task_id | string | Task ID generated by the client. |
| header.error_code | string | Error type description. |
| header.error_message | string | Detailed error reason. |
Task interruption methods
During streaming synthesis, you can interrupt the current task early—for example, if the user cancels playback or interrupts a live conversation—using one of these methods:
| Interrupt mode | Server behavior | Use case |
| --- | --- | --- |
| Close the connection directly |  | Immediate interruption: the user cancels playback, switches content, or exits the app. |
| Send a finish-task instruction |  | Graceful end: you stop sending new text but still receive audio for buffered content. |
Connection overhead and reuse
The WebSocket service supports connection reuse to reduce overhead.
Send a run-task instruction to start a task, then a finish-task instruction to end it. After the task-finished event, reuse the same connection by sending a new run-task instruction.
- A new run-task instruction can be sent only after the server returns a task-finished event.
- Different tasks on a reused connection must use different task_ids.
- If a task fails during execution, the server returns a task-failed event and closes the connection. This connection cannot be reused.
- If no new task starts within 60 seconds after a task ends, the connection times out and closes automatically.
Performance metrics and concurrency limits
Concurrency limits
See Rate limiting.
To increase your concurrency quota, contact customer support. Quota adjustments require a review and are usually completed within 1 to 3 business days.
Best practice: Reuse a WebSocket connection for multiple tasks instead of creating a new connection for each task. See Connection overhead and reuse.
Connection performance and latency
Typical connection time:
- Clients in the Chinese mainland: WebSocket connection establishment (from new WebSocket to onOpen) typically takes 200 to 1000 ms.
- Cross-border connections (such as Hong Kong or international regions): Connection latency may reach 1 to 3 seconds, and in rare cases 10 to 30 seconds.
Troubleshooting long connection times:
If a WebSocket connection takes longer than 30 seconds to establish, check for the following issues:
- Network issues: High network latency between the client and the server, such as latency caused by cross-border routes or poor ISP quality.
- Slow DNS resolution: DNS resolution for dashscope.aliyuncs.com may be slow. Try a public DNS such as 8.8.8.8, or configure your local hosts file.
- Slow TLS handshake: An outdated TLS version on the client or slow certificate validation. Use TLS 1.2 or later.
- Proxy or firewall: Corporate networks may block WebSocket connections or require a proxy.
Troubleshooting tools:
- Use Wireshark or tcpdump to analyze the timing of the TCP handshake, TLS handshake, and WebSocket upgrade phases.
- Test HTTP connection latency with curl:
  curl -w "@curl-format.txt" -o /dev/null -s https://dashscope.aliyuncs.com
The CosyVoice WebSocket API is deployed in the Beijing region of the Chinese mainland. If your client is in another region, such as Hong Kong or an overseas region, you can use a nearby relay server or CDN to accelerate the connection.
Audio generation performance
Synthesis speed:
-
Real-time factor (RTF): CosyVoice models typically synthesize audio at 0.1 to 0.5 times the real-time speed. This means that generating 1 second of audio takes 0.1 to 0.5 seconds. The actual speed depends on the model version, text length, and server load.
-
First packet latency: The latency from sending the continue-task instruction to receiving the first audio chunk is typically 200 to 800 ms.
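Both metrics can be derived from client-side timestamps. A minimal sketch with illustrative numbers (not real measurements):

```python
def first_packet_latency_ms(t_sent: float, t_first_audio: float) -> float:
    """Latency from sending continue-task to receiving the first audio chunk."""
    return (t_first_audio - t_sent) * 1000.0

def real_time_factor(total_synthesis_s: float, audio_duration_s: float) -> float:
    """RTF = total synthesis time / duration of the generated audio.
    Values below 1.0 mean synthesis runs faster than real time."""
    return total_synthesis_s / audio_duration_s

# Illustrative numbers: 0.4 s of wall-clock synthesis produced 2.0 s of audio.
rtf = real_time_factor(0.4, 2.0)                 # within the typical 0.1-0.5 range
latency = first_packet_latency_ms(10.0, 10.35)   # about 350 ms
```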
Example code
This example demonstrates basic service connectivity only. Implement production-ready logic for your specific use case.
When writing WebSocket client code, use asynchronous programming to send and receive messages simultaneously:
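The three instructions can be built as plain JSON. The sketch below uses only the header and payload fields named in this document; the model name and the "duplex" streaming value are placeholders to verify against the request format reference:

```python
import json
import uuid

def run_task(task_id: str, model: str = "cosyvoice-v2") -> str:
    """run-task opens the synthesis task; payload.input must stay an empty object."""
    return json.dumps({
        "header": {"action": "run-task", "task_id": task_id, "streaming": "duplex"},
        "payload": {
            "task_group": "audio",
            "task": "tts",
            "function": "SpeechSynthesizer",
            "model": model,  # placeholder; use the model you are subscribed to
            "input": {},
        },
    })

def continue_task(task_id: str, text: str) -> str:
    """continue-task sends one text segment (same task_id as run-task)."""
    return json.dumps({
        "header": {"action": "continue-task", "task_id": task_id, "streaming": "duplex"},
        "payload": {"input": {"text": text}},
    })

def finish_task(task_id: str) -> str:
    """finish-task flushes buffered text so trailing audio is not lost."""
    return json.dumps({
        "header": {"action": "finish-task", "task_id": task_id, "streaming": "duplex"},
        "payload": {"input": {}},
    })

task_id = uuid.uuid4().hex  # one task_id shared by all three instructions
```

With these builders, an asyncio client (for example, one built on the third-party websockets package) sends run_task, then continue_task per text segment while a second coroutine receives the binary audio frames, and finally finish_task before waiting for the task-finished event.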
Error codes
If an error occurs, see Error messages for troubleshooting.
FAQ
Features, billing, and rate limiting
Q: What can I do if the pronunciation is inaccurate?
Use SSML to fix pronunciation.
Q: Why use WebSocket instead of HTTP/HTTPS? Why not provide a RESTful API?
The Speech Service uses WebSocket instead of HTTP/HTTPS or RESTful APIs because it requires full-duplex communication. WebSocket allows both the server and client to proactively push data, such as real-time progress updates for synthesis or recognition. RESTful APIs over HTTP only support client-initiated request-response cycles and cannot meet real-time interaction requirements.
Q: Speech synthesis is billed per character. How do I check or retrieve the character count for each synthesis?
The character count is available in the payload.usage.characters field of the server’s result-generated event. Use the value from the last received result-generated event.
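For example, you can track the billed count across events and keep the last value. The payload.usage.characters path comes from this document; the assumption that the event name sits under header.event should be checked against your actual event envelope:

```python
import json

def billed_characters(events):
    """Return payload.usage.characters from the last result-generated event."""
    count = None
    for raw in events:
        evt = json.loads(raw)
        if evt.get("header", {}).get("event") == "result-generated":
            usage = evt.get("payload", {}).get("usage") or {}
            if "characters" in usage:
                count = usage["characters"]
    return count

events = [
    json.dumps({"header": {"event": "result-generated"},
                "payload": {"usage": {"characters": 12}}}),
    json.dumps({"header": {"event": "result-generated"},
                "payload": {"usage": {"characters": 27}}}),
]
print(billed_characters(events))  # -> 27 (the last event's value)
```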
Troubleshooting
If your code throws an error, check whether the instruction sent to the server is correct. Print the instruction and verify its format and required fields. If the instruction is correct, refer to the error codes for further diagnosis.
Q: How do I get the Request ID?
Two methods are available:
-
Parse the result-generated event returned by the server.
-
Parse the task-finished event returned by the server.
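Since the exact location of the request ID inside these events can vary, a defensive approach is to search the parsed JSON for a request_id key (the key name and the sample event shape below are assumptions to verify against your payloads):

```python
def find_key(obj, key):
    """Depth-first search for `key` in nested dicts/lists; first match or None."""
    if isinstance(obj, dict):
        if key in obj:
            return obj[key]
        for value in obj.values():
            found = find_key(value, key)
            if found is not None:
                return found
    elif isinstance(obj, list):
        for item in obj:
            found = find_key(item, key)
            if found is not None:
                return found
    return None

event = {"header": {"event": "task-finished",
                    "attributes": {"request_id": "abc-123"}}}  # hypothetical shape
print(find_key(event, "request_id"))  # -> abc-123
```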
Q: Why does SSML fail?
Troubleshoot this issue step by step:
-
Ensure you correctly follow the limitations and constraints.
-
Ensure SSML is called correctly. For details, see SSML support.
-
Ensure the text is plain text and meets formatting requirements. For details, see SSML markup language overview.
Q: Why does the audio duration of TTS speech synthesis differ from the WAV file's displayed duration? For example, a WAV file shows 7 seconds but the actual audio is less than 5 seconds?
TTS uses a streaming synthesis mechanism: audio is synthesized and returned progressively, so the WAV header written at the start of the stream contains only an estimated duration, which may be inaccurate. If you require a precise duration, set the format to PCM and add the WAV header yourself after receiving the complete synthesis result.
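With PCM, the exact duration is len(pcm_bytes) / (sample_rate × channels × bytes_per_sample), and the standard-library wave module writes a correct header on close. A sketch assuming 16-bit mono PCM at 22050 Hz (match these parameters to your actual synthesis settings):

```python
import wave

def pcm_duration_s(pcm: bytes, sample_rate=22050, channels=1, sample_width=2) -> float:
    """Exact duration of raw PCM audio in seconds."""
    return len(pcm) / (sample_rate * channels * sample_width)

def pcm_to_wav(pcm: bytes, path: str, sample_rate=22050, channels=1, sample_width=2):
    """Write raw PCM with a correct WAV header (avoids the estimated-duration issue)."""
    with wave.open(path, "wb") as w:
        w.setnchannels(channels)
        w.setsampwidth(sample_width)   # 2 bytes = 16-bit samples
        w.setframerate(sample_rate)
        w.writeframes(pcm)             # header is finalized with the true length on close

one_second = bytes(22050 * 2)          # 1 s of silence, 16-bit mono
print(pcm_duration_s(one_second))      # -> 1.0
```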
Q: Why can't the audio be played?
Check the following scenarios one by one:
-
The audio is saved as a complete file (such as xx.mp3).
-
Format consistency: Verify that the format requested in the synthesis call matches the file extension (for example, audio returned in WAV format must be saved as .wav, not .mp3).
-
Player compatibility: Verify that your player supports the format and sample rate of the audio file. Some players may not support high sample rates or specific audio encodings.
-
The audio is played in a stream.
-
Save the audio stream as a complete file and try to play it with a player. If the file cannot be played, see the troubleshooting method for scenario 1.
-
If the file plays normally, the problem may be with your streaming playback implementation. Verify that your player supports streaming playback.
Common tools and libraries that support streaming playback include FFmpeg, PyAudio (Python), AudioFormat (Java), and MediaSource (JavaScript).
Q: Why does the audio playback stutter?
Check the following scenarios one by one:
-
Check the text sending speed: Make sure the interval between text segments is reasonable. Avoid situations where the next segment is not sent promptly after the previous audio segment finishes playing.
-
Check the callback function performance:
-
Avoid heavy business logic in the callback function—it can cause blocking.
-
Callbacks run in the WebSocket thread. Blocking prevents timely packet reception and causes audio playback to stutter.
-
We recommend writing audio data to a separate buffer and processing it in another thread to avoid blocking the WebSocket thread.
-
Check network stability: Ensure your network connection is stable to avoid audio transmission interruptions or delays caused by network fluctuations.
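The recommended buffering pattern above — the callback only enqueues, playback runs in its own thread — can be sketched with the standard library (the audio chunks here are dummy bytes, and "playing" just collects them):

```python
import queue
import threading

audio_q = queue.Queue()
played = []

def on_audio(chunk: bytes):
    """WebSocket binary-frame callback: just enqueue, never block."""
    audio_q.put(chunk)

def player():
    """Consumer thread: drain the queue and 'play' each chunk."""
    while True:
        chunk = audio_q.get()
        if chunk is None:          # sentinel: synthesis finished
            break
        played.append(chunk)       # real code would feed an audio device here

t = threading.Thread(target=player)
t.start()
for i in range(3):
    on_audio(b"chunk%d" % i)       # simulate three binary frames arriving
audio_q.put(None)
t.join()
print(len(played))  # -> 3
```

Because queue.Queue is thread-safe, the WebSocket thread returns from the callback immediately and packet reception is never blocked by playback.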
Q: Why does speech synthesis take a long time?
Follow these steps to troubleshoot:
-
Check input interval
If you are using streaming speech synthesis, verify whether the interval between sending text segments is too long (for example, a delay of several seconds). A long interval increases the total synthesis time.
-
Analyze performance metrics.
-
First-packet latency: Normally around 500 ms.
-
RTF (RTF = Total synthesis time / Audio duration): Normally less than 1.0.
Q: How do I handle incorrect pronunciation in the synthesized speech?
Use the <phoneme> tag of SSML to specify the correct pronunciation.
Q: Why is no audio returned? Why is part of the ending text missing from the audio? (Missing audio)
Verify that the finish-task instruction was sent. During synthesis, the server waits until enough text is buffered before processing. Without finish-task, buffered text at the end may never be converted to audio.
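For reference, the closing step is a plain JSON instruction; the "duplex" streaming value below is a placeholder to verify against your run-task message:

```python
import json

def finish_task(task_id: str) -> str:
    """finish-task flushes the server's text buffer; without it, trailing audio is lost."""
    return json.dumps({
        "header": {"action": "finish-task", "task_id": task_id, "streaming": "duplex"},
        "payload": {"input": {}},
    })

# Correct send order for one task:
#   run-task -> continue-task (xN) -> finish-task -> wait for task-finished -> close
msg = json.loads(finish_task("task-001"))
print(msg["header"]["action"])  # -> finish-task
```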
Q: Why is the audio stream order scrambled, causing garbled playback?
Troubleshoot in two areas:
-
Ensure that the run-task, continue-task, and finish-task instructions for one synthesis task all use the same task_id.
-
Check whether asynchronous operations cause audio files to be written in a different order than binary data is received.
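A simple guard against out-of-order writes is to tag each received chunk with its arrival index and reassemble strictly in that order (sketch with in-memory chunks):

```python
def write_in_order(chunks):
    """chunks: iterable of (sequence_index, bytes), possibly out of order.
    Returns the audio reassembled by sequence index."""
    return b"".join(data for _, data in sorted(chunks, key=lambda c: c[0]))

received = [(2, b"CC"), (0, b"AA"), (1, b"BB")]   # order scrambled by async file I/O
print(write_in_order(received))  # -> b'AABBCC'
```

In practice, binary WebSocket frames arrive in order over TCP; scrambling usually comes from concurrent writes on your side, so serializing writes (as above, or through a single writer thread) resolves it.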
Q: How do I handle WebSocket connection errors?
-
How do you handle WebSocket connection closure (code 1007)?
Symptom: A WebSocket connection closes immediately after sending the run-task instruction, with close code 1007.
-
Root cause: The server detects protocol or data format errors and disconnects. Common reasons include the following:
-
Invalid fields in the run-task payload, such as adding fields besides "input": {}.
-
JSON format errors, such as missing commas or mismatched brackets.
-
Missing required fields, such as task_id or action.
-
Solution:
-
Validate JSON format: Check the request body syntax.
-
Verify required fields: Confirm that header.action, header.task_id, header.streaming, payload.task_group, payload.task, payload.function, payload.model, and payload.input are all set.
-
Remove invalid fields: In the run-task payload.input, allow only an empty object {} or a text field. Do not add other fields.
-
How do you handle WebSocketBadStatus, 401 Unauthorized, or 403 Forbidden errors?
Symptom: A WebSocket connection fails with WebSocketBadStatus, 401 Unauthorized, or 403 Forbidden.
-
Root cause: Authentication failure. The server validates the Authorization header during the WebSocket handshake. An invalid or missing API key triggers rejection.
-
Solution: See Authentication failure troubleshooting.
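The required-field checks in the close-code-1007 troubleshooting list above can be automated before sending. The field list is taken from this document; the sample message values are placeholders:

```python
# Required run-task fields, per the troubleshooting checklist in this document.
REQUIRED = [
    ("header", "action"), ("header", "task_id"), ("header", "streaming"),
    ("payload", "task_group"), ("payload", "task"), ("payload", "function"),
    ("payload", "model"), ("payload", "input"),
]

def missing_fields(msg: dict):
    """Return dotted paths of required run-task fields absent from `msg`."""
    missing = []
    for section, field in REQUIRED:
        if field not in msg.get(section, {}):
            missing.append(f"{section}.{field}")
    return missing

msg = {
    "header": {"action": "run-task", "task_id": "t1", "streaming": "duplex"},
    "payload": {"task_group": "audio", "task": "tts",
                "function": "SpeechSynthesizer",
                "model": "cosyvoice-v2",  # placeholder model name
                "input": {}},
}
print(missing_fields(msg))  # -> [] (all required fields present)
```

Running this check before sending run-task catches the missing-field cause of close code 1007 locally, where it is much easier to debug than from a dropped connection.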
Permissions and authentication
Q: How can I restrict my API key to the CosyVoice speech synthesis service only (permission isolation)?
Create a workspace and grant authorization only to specific models to limit the API key scope. For more information, see Manage workspaces.
More questions
See the QA on GitHub.