Alibaba Cloud Model Studio: CosyVoice Speech Synthesis WebSocket API

Last Updated: Feb 11, 2026

This topic explains how to access the CosyVoice speech synthesis service using a WebSocket connection.

The DashScope SDK supports only Java and Python. To build a CosyVoice speech synthesis application in another programming language, use a WebSocket connection to communicate with the service.

User guide: For model overviews and selection recommendations, see Real-time Speech Synthesis—CosyVoice/Sambert.

WebSocket is a network protocol that supports full-duplex communication. The client and server establish a persistent connection with a single handshake, which allows both parties to actively push data to each other. This provides significant advantages in real-time performance and efficiency.

For common programming languages, many ready-to-use WebSocket libraries and examples are available, such as:

  • Go: gorilla/websocket

  • PHP: Ratchet

  • Node.js: ws

Familiarize yourself with the basic principles and technical details of WebSocket before you begin development.

Important

CosyVoice models support only WebSocket connections and do not support HTTP REST APIs. If you call the service using an HTTP request (such as POST), it returns an InvalidParameter or URL error.

Prerequisites

You have obtained an API key.

Models and pricing

For more information, see Real-time Speech Synthesis—CosyVoice/Sambert.

Text limits and format requirements for speech synthesis

Text length limits

When you send text to synthesize using the continue-task instruction, the text must be no longer than 20,000 characters. The total length of text sent across multiple calls to the continue-task instruction must be no longer than 200,000 characters.

Character counting rules

  • Each Chinese character (including simplified, traditional, Japanese kanji, and Korean hanja) counts as two characters. All other characters—including punctuation, letters, digits, Japanese kana, and Korean Hangul—count as one character each.

  • SSML tag content is excluded from character counts.

  • Examples (a Python sketch implementing these rules follows this list):

    • "你好" → 你 (2) + 好 (2) = 4 characters

    • "中A文123" → 2 (Chinese characters) + 1 (A) + 2 (Chinese characters) + 1 (1) + 1 (2) + 1 (3) = 8 characters

    • "中文。" → 2 (中) + 2 (文) + 1 (。) = 5 characters

    • "中 文。" → 2 (for "中") + 1 (for the space) + 2 (for "文") + 1 (for "。") = 6 characters

    • "<speak>你好</speak>" → 4 characters (SSML tags are not counted, 2 Chinese characters × 2)

Encoding format

Use UTF-8 encoding.

Mathematical expression support

The math expression parsing feature works only with the cosyvoice-v2, cosyvoice-v3-flash, and cosyvoice-v3-plus models. It supports common math expressions used in primary and secondary education, including basic arithmetic, algebra, and geometry.

For more information, see Convert LaTeX formulas to speech.

SSML markup language support

To use SSML, meet all the following conditions:

  1. Model support: Only the cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 models support SSML.

  2. Voice support: You must use a voice that supports SSML. Voices that support SSML include the following:

    • All cloned voices (custom voices created using the Voice Cloning API)

    • System voices marked as supporting SSML in the voice list

    Note

    If you use a system voice that does not support SSML (such as some basic voices), you get the error "SSML text is not supported at the moment!" even if you set the enable_ssml parameter to true.

  3. Parameter settings: In the run-task instruction, set the enable_ssml parameter to true.

After meeting these conditions, send text that includes SSML markup using the continue-task instruction. For a complete example, see QuickStart.
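
For orientation, a hedged sketch of the two instructions involved. The voice name and task_id are placeholders, and with enable_ssml set to true only one continue-task instruction is allowed:

{
    "header": { "action": "run-task", "task_id": "<task_id>", "streaming": "duplex" },
    "payload": {
        "task_group": "audio",
        "task": "tts",
        "function": "SpeechSynthesizer",
        "model": "cosyvoice-v2",
        "parameters": { "text_type": "PlainText", "voice": "<ssml-capable-voice>", "enable_ssml": true },
        "input": {}
    }
}
{
    "header": { "action": "continue-task", "task_id": "<task_id>", "streaming": "duplex" },
    "payload": { "input": { "text": "<speak>你好</speak>" } }
}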

Interaction flow

(Figure: interaction flow between the client and the server, as described below.)

A message sent from the client to the server is called an instruction. Messages returned from the server to the client fall into two categories: JSON-formatted events and binary audio streams.

The interaction flow between the client and server, in chronological order, is as follows:

  1. Establish a connection: The client establishes a WebSocket connection with the server.

  2. Start a task: The client sends the run-task instruction to start a task.

  3. Wait for confirmation: The client receives the task-started event from the server. This event signals that the task started successfully and that you can proceed to the next step.

  4. Send text to synthesize:

    The client sends one or more continue-task instructions, each containing text to synthesize, to the server in sequence. After receiving a complete sentence, the server returns a result-generated event and an audio stream. (Text length constraints apply. For details, see the description of the text field in the continue-task instruction.)

    Note

    You can send the continue-task instruction multiple times to submit text fragments in sequence. After receiving text fragments, the server automatically splits them into sentences:

    • Complete sentences are synthesized immediately. At this point, the client receives audio from the server.

    • Incomplete sentences are cached until they become complete. The server does not return audio for incomplete sentences.

    When you send the finish-task instruction, the server synthesizes all cached content.

  5. Receive audio: Receive the audio stream over the binary channel.

  6. Notify the server to end the task:

    After sending all text, the client sends the finish-task instruction to notify the server to end the task. Continue receiving audio from the server. (Do not skip this step. Otherwise, you might not receive audio or might miss the final part of the audio.)

  7. End the task:

    The client receives the task-finished event from the server. This event signals that the task ended.

  8. Close the connection: The client closes the WebSocket connection.

To improve resource utilization, reuse a WebSocket connection to handle multiple tasks instead of creating a new connection for each task. See Connection overhead and connection reuse.

Important

The task_id must remain consistent throughout: For a single speech synthesis task, the run-task, all continue-task, and finish-task instructions must use the same task_id.

Consequences of errors: Using different task_ids causes the following problems:

  • The server cannot associate the requests, so the audio stream order becomes disordered.

  • Text content is assigned to the wrong task, resulting in misaligned speech content.

  • The task status becomes abnormal, possibly preventing receipt of the task-finished event.

  • Billing fails, leading to inaccurate usage statistics.

Correct approach (a minimal sketch follows this list):

  • Generate a unique task_id (for example, using UUID) when sending the run-task instruction.

  • Store the task_id in a variable.

  • Use this task_id for all subsequent continue-task and finish-task instructions.

  • After the task ends (after receiving task-finished), generate a new task_id for a new task.
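
A minimal Python sketch of this pattern. The run-task payload is truncated to keep the focus on task_id reuse; in practice it carries the full payload (task_group, model, parameters, and so on):

import json
import uuid

task_id = uuid.uuid4().hex  # one ID for the whole task

def make_instruction(action: str, payload_input: dict) -> str:
    # All three instructions share the same task_id; only "action" and "input" change.
    return json.dumps({
        "header": {"action": action, "task_id": task_id, "streaming": "duplex"},
        "payload": {"input": payload_input},
    })

run_task = make_instruction("run-task", {})
continue_task = make_instruction("continue-task", {"text": "..."})
finish_task = make_instruction("finish-task", {})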

Client implementation considerations

When implementing a WebSocket client—especially on Flutter, web, or mobile platforms—you must clearly define responsibilities between the server and client to ensure the integrity and stability of speech synthesis tasks.

Server and client responsibilities

Server responsibilities

The server guarantees that it returns a complete audio stream in order. You do not need to worry about the order or completeness of audio data. The server generates and pushes all audio chunks in the order of the input text.

Client responsibilities

The client must handle the following key tasks:

  1. Read and concatenate all audio chunks

    The server pushes audio as multiple binary frames. The client must receive all frames completely and concatenate them in the order received to form the final audio file. Example code follows:

    # Python example: Concatenate audio chunks
    with open("output.mp3", "ab") as f:  # Append mode
        f.write(audio_chunk)  # audio_chunk is each received binary audio chunk

    // JavaScript example: Concatenate audio chunks
    const audioChunks = [];
    ws.onmessage = (event) => {
      if (event.data instanceof Blob) {
        audioChunks.push(event.data);  // Collect all audio chunks
      }
    };
    // Merge audio after task completes
    const audioBlob = new Blob(audioChunks, { type: 'audio/mp3' });
  2. Maintain a complete WebSocket lifecycle

    During the entire speech synthesis task, from sending the run-task instruction to receiving the task-finished event, do not disconnect the WebSocket connection prematurely. Common mistakes include the following:

    • Closing the connection before all audio chunks are received, resulting in incomplete audio.

    • Forgetting to send the finish-task instruction, leaving text in the server cache unprocessed.

    • Failing to handle WebSocket keepalive properly when the page navigates away or the app moves to the background.

    Important

    Mobile apps (such as Flutter, iOS, and Android) require special attention to network connection management when entering the background. We recommend maintaining the WebSocket connection in a background task or service, or checking the task status and reestablishing the connection when returning to the foreground.

  3. Text integrity in ASR→LLM→TTS workflows

    In ASR→LLM→TTS workflows, ensure the text passed to TTS is complete and not truncated mid-process (a batching sketch follows this list). For example:

    • Wait for the LLM to generate a complete sentence or paragraph before sending the continue-task instruction, rather than streaming character by character.

    • If you need streaming synthesis (generate and play simultaneously), send text in batches based on natural sentence boundaries (such as periods or question marks).

    • After the LLM finishes generating output, send the finish-task instruction to avoid missing trailing content.
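
A minimal sketch of sentence-boundary batching, assuming the LLM output arrives as an iterable of text deltas. send_continue_task and send_finish_task are hypothetical helpers wrapping the instructions described in this topic:

def stream_to_tts(llm_stream, send_continue_task, send_finish_task):
    """Batch LLM text deltas into whole sentences before sending continue-task instructions."""
    endings = ("。", "？", "！", ".", "?", "!")
    buffer = ""
    for delta in llm_stream:
        buffer += delta
        if buffer.endswith(endings):  # flush only on a natural sentence boundary
            send_continue_task(buffer)
            buffer = ""
    if buffer:  # trailing text without final punctuation
        send_continue_task(buffer)
    send_finish_task()  # do not skip this step

# Example: prints each flushed sentence instead of sending it.
stream_to_tts(iter(["Hello", " world.", " How are", " you?"]), print, lambda: print("<finish-task>"))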

Platform-specific tips

  • Flutter: When using the web_socket_channel package, close the connection correctly in the dispose method to prevent memory leaks. Also, handle app lifecycle events (such as AppLifecycleState.paused) to manage background transitions.

  • Web (browser): Some browsers limit the number of WebSocket connections. Reuse a single connection for multiple tasks. Use the beforeunload event to close the connection explicitly before the page closes, avoiding lingering connections.

  • Mobile (iOS/Android native): When the app enters the background, the OS may pause or terminate network connections. Use a background task or foreground service to keep the WebSocket active, or reinitialize the task when returning to the foreground.

URL

The WebSocket URL is fixed as follows:

International

In international deployment mode, both the endpoint and data storage are located in the Singapore region. Model inference computing resources are dynamically scheduled globally (excluding the China mainland).

WebSocket URL: wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference

China mainland

In China mainland deployment mode, both the endpoint and data storage are located in the Beijing region. Model inference computing resources are available only in the China mainland.

WebSocket URL: wss://dashscope.aliyuncs.com/api-ws/v1/inference

Important

Common URL configuration errors:

  • Error: Using a URL that starts with http:// or https:// → Correct: You must use the wss:// protocol.

  • Error: Placing the Authorization parameter in the URL query string (for example, ?Authorization=bearer <your_api_key>) → Correct: Set Authorization in the HTTP handshake headers (see Headers).

  • Error: Adding the model name or other path parameters to the end of the URL → Correct: The URL remains fixed. Specify the model using the payload.model parameter in the run-task instruction.

Headers

Add the following information to the request header:

Parameter

Type

Required

Description

Authorization

string

Yes

Authentication token in the format bearer <your_api_key>, matching the code examples in this topic. Replace "<your_api_key>" with your actual API key.

user-agent

string

No

Client identifier to help the server track the source.

X-DashScope-WorkSpace

string

No

Alibaba Cloud Model Studio workspace ID.

X-DashScope-DataInspection

string

No

Whether to enable data compliance inspection. To enable it, set the value to enable; by default, the header is omitted. Do not enable this parameter unless necessary.

Important

Timing and common errors for authentication validation

Authentication validation occurs during the WebSocket handshake, not when you send the run-task instruction. If the Authorization header is missing or the API key is invalid, the server rejects the handshake and returns HTTP 401 or 403. Most client libraries parse this as a WebSocketBadStatus exception.

Troubleshooting authentication failures

If the WebSocket connection fails, troubleshoot using the following steps (a minimal handshake check in Python follows the list):

  1. Check the API key format: Confirm the Authorization header is formatted as bearer <your_api_key>, with a space between bearer and the API key.

  2. Verify the API key validity: In the Model Studio console, confirm the API key is not deleted or disabled and has permission to call CosyVoice models.

  3. Check header settings: Confirm the Authorization header is set correctly during the WebSocket handshake. Different programming languages set headers differently:

    • Python (websockets library): extra_headers={"Authorization": f"bearer {api_key}"}

    • JavaScript: The standard WebSocket API does not support custom headers. Use a server-side proxy or another library (such as ws).

    • Go (gorilla/websocket): header.Add("Authorization", fmt.Sprintf("bearer %s", apiKey))

  4. Test network connectivity: Use curl or Postman to test whether the API key is valid (using another DashScope API that supports HTTP).
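
A minimal handshake check, assuming the Python websockets library (older releases use the extra_headers parameter shown above; newer releases rename it additional_headers):

import asyncio
import os

import websockets  # pip install websockets

async def check_handshake():
    headers = {"Authorization": f"bearer {os.environ['DASHSCOPE_API_KEY']}"}
    try:
        # Use additional_headers=headers instead on newer websockets releases.
        async with websockets.connect(
            "wss://dashscope.aliyuncs.com/api-ws/v1/inference",
            extra_headers=headers,
        ):
            print("Handshake succeeded; the API key is accepted.")
    except Exception as e:
        # A 401/403 rejected handshake typically surfaces here as an invalid-status error.
        print("Handshake failed:", e)

asyncio.run(check_handshake())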

Using WebSocket in browser environments

When using WebSocket in browser environments (such as Vue3 or React), note the following limitation: the native browser new WebSocket(url) API does not support setting custom request headers (such as Authorization) during the handshake. This is a security restriction imposed by browsers. Therefore, you cannot authenticate with an API key directly in frontend code.

Solution: Use a backend proxy (a minimal relay sketch follows these steps)

  1. Set up a WebSocket connection from your backend service (Node.js, Java, Python, etc.) to the CosyVoice service. Your backend can set the Authorization header correctly.

  2. Have the frontend connect via WebSocket to your backend service. Your backend acts as a proxy, forwarding messages to CosyVoice.

  3. Benefits: Your API key stays hidden from the frontend, improving security. You can add extra business logic (such as authentication, logging, or rate limiting) in your backend.
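
A minimal relay sketch using the Python websockets library. Authentication of your own clients, error handling, and reconnection are omitted; on older websockets releases the server handler also receives a path argument, and extra_headers is renamed additional_headers in newer releases:

import asyncio
import os

import websockets  # pip install websockets

UPSTREAM = "wss://dashscope.aliyuncs.com/api-ws/v1/inference"

async def relay(client):
    headers = {"Authorization": f"bearer {os.environ['DASHSCOPE_API_KEY']}"}
    async with websockets.connect(UPSTREAM, extra_headers=headers) as upstream:
        async def pump(src, dst):
            async for message in src:  # forwards text frames (JSON) and binary frames (audio) alike
                await dst.send(message)
        # Relay in both directions until either side closes.
        await asyncio.gather(pump(client, upstream), pump(upstream, client))

async def main():
    # Browsers connect to ws://localhost:8000 without ever seeing the API key.
    async with websockets.serve(relay, "localhost", 8000):
        await asyncio.Future()  # run forever

asyncio.run(main())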

Important

Do not hardcode your API key in frontend code or send it directly from the browser. Leaking your API key could lead to account compromise, unexpected charges, or data breaches.

Example code:

If you need an implementation in another programming language, adapt the logic shown in the examples, or use AI tools to convert them to your target language.

Instructions (client → server)

Instructions are JSON-formatted messages sent from the client to the server. They use Text Frame format and control task start, stop, and boundaries.

Send instructions in strict chronological order. Otherwise, the task may fail:

  1. Send the run-task instruction

  2. Send the continue-task instruction

    • Sends text to synthesize.

    • You can send this instruction only after receiving the task-started event from the server.

  3. Send the finish-task instruction

1. run-task instruction: Start a task

This instruction starts a speech synthesis task. You can configure request parameters such as voice and sample rate in this instruction.

Important
  • Timing: Send after establishing the WebSocket connection.

  • Do not send text to synthesize: Sending text in the run-task instruction makes troubleshooting difficult. Avoid sending text here. Send text using the continue-task instruction.

  • The input field is required: The payload must contain the input field (formatted as {}). Omitting it triggers the error "task can not be null".

Example:

{
    "header": {
        "action": "run-task",
        "task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx", // Random UUID
        "streaming": "duplex"
    },
    "payload": {
        "task_group": "audio",
        "task": "tts",
        "function": "SpeechSynthesizer",
        "model": "cosyvoice-v3-flash",
        "parameters": {
            "text_type": "PlainText",
            "voice": "longanyang",   // Voice
            "format": "mp3",         // Audio format
            "sample_rate": 22050,    // Sample rate
            "volume": 50,            // Volume
            "rate": 1,               // Speech rate
            "pitch": 1               // Pitch
        },
        "input": {}  // input cannot be omitted, or else an error occurs
    }
}

header parameter description:

Parameter

Type

Required

Description

header.action

string

Yes

Instruction type.

For this instruction, it is always "run-task".

header.task_id

string

Yes

ID for this task.

A universally unique identifier (UUID) consisting of 32 randomly generated hexadecimal characters. It can include hyphens (for example, "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx") or omit them (for example, "2bf83b9abaeb4fda8d9axxxxxxxxxxxx"). Most programming languages provide built-in APIs to generate UUIDs. For example, in Python:

import uuid

def generate_task_id():
    # Generate a random UUID without hyphens
    return uuid.uuid4().hex

When sending subsequent continue-task and finish-task instructions, use the same task_id as in the run-task instruction.

header.streaming

string

Yes

Fixed string: "duplex"

payload parameter description:

Parameter

Type

Required

Description

payload.task_group

string

Yes

Fixed string: "audio".

payload.task

string

Yes

Fixed string: "tts".

payload.function

string

Yes

Fixed string: "SpeechSynthesizer".

payload.model

string

Yes

Speech synthesis model.

Different model versions require corresponding voice versions:

  • cosyvoice-v3-flash/cosyvoice-v3-plus: Use voices like longanyang.

  • cosyvoice-v2: Use voices like longxiaochun_v2.

  • See the full voice list in Voice list.

payload.input

object

Yes

The input field is required in the run-task instruction (cannot be omitted), but do not send text to synthesize here (so use an empty object {}). Send text to synthesize using subsequent continue-task instructions to simplify troubleshooting and support streaming synthesis.

The input format is:

"input": {}
Important

Common error: Omitting the input field or including unexpected fields (such as mode or content) causes the server to reject the request and return "InvalidParameter: task can not be null" or close the connection (WebSocket code 1007).

payload.parameters

text_type

string

Yes

Fixed string: "PlainText".

voice

string

Yes

Voice used for speech synthesis.

Supported voices include system voices and cloned voices:

  • System voices: See the voice list.

  • Cloned voices: Customized using the voice cloning (CosyVoice) feature. When using cloned voices, ensure voice cloning and speech synthesis use the same account. For detailed steps, see CosyVoice voice cloning API.

    When using a cloned voice, the model parameter value must exactly match the model version (target_model) used to create that voice.

format

string

No

Audio coding format.

Supported formats: pcm, wav, mp3 (default), and opus.

When the audio format is opus, use the bit_rate parameter to adjust the bitrate.

sample_rate

integer

No

Audio sampling rate (unit: Hz).

Default: 22050.

Valid values: 8000, 16000, 22050, 24000, 44100, 48000.

Note

The default sample rate represents the optimal sample rate for the selected voice. Output uses this rate by default, but downsampling and upsampling are supported.

volume

integer

No

Volume.

Default: 50.

Range: [0, 100]. 50 is normal volume. Volume scales linearly with this value: 0 is mute, 100 is maximum.

rate

float

No

Speech rate.

Default: 1.0.

Range: [0.5, 2.0]. 1.0 is normal speed. Values less than 1.0 slow speech; values greater than 1.0 speed it up.

pitch

float

No

Pitch. This value multiplies pitch, but perceived pitch change isn't strictly linear or logarithmic. Test to find suitable values.

Default: 1.0.

Range: [0.5, 2.0]. 1.0 is natural pitch. Values above 1.0 raise pitch; values below 1.0 lower it.

enable_ssml

boolean

No

Whether to enable SSML.

When set to true, you can send text only once (only one continue-task instruction is allowed).

bit_rate

integer

No

Audio bitrate (kbps). For OPUS format, adjust bitrate using the bit_rate parameter.

Default: 32.

Range: [6, 510].

word_timestamp_enabled

boolean

No

Enable word-level timestamps.

Default: false.

  • true: Enable.

  • false: Disable.

This feature works only with cloned voices for cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2, and with system voices marked as timestamp-supported in the voice list.

For more information, see Best practices for timestamp data extraction.

seed

integer

No

Random number seed used during synthesis to vary output. With identical model version, text, voice, and other parameters, the same seed produces identical results.

Default: 0.

Range: [0, 65535].

language_hints

array[string]

No

Specify the target language for speech synthesis to improve synthesis quality.

Use this parameter when the pronunciation of numbers, abbreviations, symbols, or the synthesis quality of minor languages does not meet expectations, such as:

  • The pronunciation of numbers does not meet expectations. For example, "hello, this is 110" is read as "hello, this is one one zero" instead of "hello, this is yao yao ling".

  • Symbol pronunciation is inaccurate. For example, "@" is read as "ait" instead of "at".

  • Poor synthesis quality for minor languages, resulting in unnatural synthesis.

Value range:

  • zh: Chinese

  • en: English

  • fr: French

  • de: German

  • ja: Japanese

  • ko: Korean

  • ru: Russian

Note: This parameter is an array, but the current version only processes the first element. Therefore, pass only one value.

Important

This parameter specifies the target language for speech synthesis. This setting is unrelated to the language of the sample audio during voice cloning. To set the source language for a cloning task, see the CosyVoice Voice Cloning API.

instruction

string

No

Set instructions to control synthesis effects such as dialect, emotion, or role. This feature applies only to cloned voices of the cosyvoice-v3-flash model, and to system voices marked as supporting Instruct in the Voice List.

Requirements:

  • Use only the fixed instruction format and content (see below).

  • The instruction has no effect if not set (no default value).

Supported features:

  • Specify a dialect

    • Supported voices: cloned voices only

    • Format: Speak in <dialect>. (Include the period at the end. Replace <dialect> with a specific dialect, such as Cantonese.)

    • Example: Speak in Cantonese.

    • Supported dialects: Cantonese, Northeastern Mandarin, Gansu Mandarin, Guizhou Mandarin, Henan Mandarin, Hubei Mandarin, Jiangxi Mandarin, Minnan, Ningxia Mandarin, Shanxi Mandarin, Shaanxi Mandarin, Shandong Mandarin, Shanghai Mandarin, Sichuan Mandarin, Tianjin Mandarin, and Yunnan Mandarin.

  • Specify an emotion

    • Supported voices

      • Cloned voices (timbre replication)

      • System voices marked as Instruct-supported in the Voice List

    • Format:

      • Cloned voices:

        Instruction formats for cloned voices:

        • Speak as loudly as possible.

        • Speak as slowly as possible.

        • Speak as quickly as possible.

        • Speak very softly.

        • Can you speak more slowly?

        • Can you speak much faster?

        • Can you speak much slower?

        • Can you speak faster?

        • Speak angrily.

        • Speak happily.

        • Speak fearfully.

        • Speak sadly.

        • Speak in surprise.

        • Sound as confident as possible.

        • Sound as angry as possible.

        • Try a friendly tone.

        • Speak in a cold tone.

        • Speak in an authoritative tone.

        • I want to hear a natural tone.

        • Show me how you express a threat.

        • Show me how you express wisdom.

        • Show me how you express seduction.

        • Speak in a lively way.

        • Speak with passion.

        • I would like to hear a sample of a calm tone.

        • Speak confidently.

        • Can you talk to me with excitement?

        • Can you show arrogance?

        • Can you show elegance?

        • Can you answer questions happily?

        • Can you demonstrate a gentle emotion?

        • Can you talk to me calmly?

        • Can you answer me in a deep voice?

        • Talk to me with a rugged attitude.

        • Tell me the answer in a sinister voice.

        • Tell me the answer in a resilient voice.

        • Narrate in a natural, casual chat style.

        • Speak like a radio drama podcaster.

      • System voices: The instruction format for emotions differs between system voices and cloned voices. For details, see the Voice List.

  • Specify a scenario, role, or identity

    • Supported voices: System voices in the Voice List that are marked to support Instruct

    • Format: For more information, see the Voice List

enable_aigc_tag

boolean

No

Adds an invisible AIGC identifier to the generated audio. If set to true, the invisible identifier is embedded in audio files of supported formats (WAV/MP3/Opus).

Default value: false.

Only cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 support this feature.

aigc_propagator

string

No

Set the ContentPropagator field in the invisible AIGC identifier to identify the content propagator. This setting takes effect only when enable_aigc_tag is set to true.

Default value: Your Alibaba Cloud UID.

Only cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 support this feature.

aigc_propagate_id

string

No

Set the PropagateID field in the invisible AIGC identifier. This uniquely identifies a specific propagation behavior. This takes effect only when enable_aigc_tag is true.

Default value: The Request ID of the current speech synthesis request.

This feature is supported only by cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2.

2. continue-task instruction

This instruction sends text to synthesize.

You can send all text in one continue-task instruction, or split the text and send it in multiple continue-task instructions in sequence.

Important

Timing: Send after receiving the task-started event.

Note

Do not wait longer than 23 seconds between sending text fragments. Otherwise, the "request timeout after 23 seconds" error occurs.

If no more text remains to send, send the finish-task instruction to end the task.

The server enforces a 23-second timeout. Clients cannot modify this setting.

Example:

{
    "header": {
        "action": "continue-task",
        "task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx", // Random UUID
        "streaming": "duplex"
    },
    "payload": {
        "input": {
            "text": "Before my bed, moonlight gleams, like frost upon the ground."
        }
    }
}

header parameter description:

Parameter

Type

Required

Description

header.action

string

Yes

Instruction type.

For this instruction, it is always "continue-task".

header.task_id

string

Yes

ID for this task.

Must match the task_id used in the run-task instruction.

header.streaming

string

Yes

Fixed string: "duplex"

payload parameter description:

Parameter

Type

Required

Description

input.text

string

Yes

Text to synthesize.

3. finish-task instruction: End a task

This instruction ends a speech synthesis task.

Make sure to send this instruction. Otherwise, you may encounter the following issues:

  • Incomplete audio: The server will not force-synthesize incomplete sentences held in its cache, causing missing audio at the end.

  • Connection timeout: If you do not send finish-task within 23 seconds after the last continue-task instruction, the connection times out and closes.

  • Billing anomalies: Tasks that do not end normally may not return accurate usage information.

Important

Timing: Send immediately after sending all continue-task instructions. Do not wait for audio to finish returning or delay sending. Otherwise, the timeout may trigger.

Example:

{
    "header": {
        "action": "finish-task",
        "task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
        "streaming": "duplex"
    },
    "payload": {
        "input": {}// input cannot be omitted, or else an error occurs
    }
}

header parameter description:

Parameter

Type

Required

Description

header.action

string

Yes

Instruction type.

For this instruction, it is always "finish-task".

header.task_id

string

Yes

ID for this task.

Must match the task_id used in the run-task instruction.

header.streaming

string

Yes

Fixed string: "duplex"

payload parameter description:

Parameter

Type

Required

Description

payload.input

object

Yes

Fixed format: {}.

Events (server → client)

Events are JSON-formatted messages returned from the server to the client. Each event represents a different processing stage.

Note

The server returns binary audio separately. It is not included in any event.

1. task-started event: Task started

When you receive the task-started event from the server, the task has started successfully. You can send continue-task or finish-task instructions to the server only after receiving this event. Otherwise, the task fails.

The task-started event's payload is empty.

Example:

{
    "header": {
        "task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
        "event": "task-started",
        "attributes": {}
    },
    "payload": {}
}

header parameter description:

Parameter

Type

Description

header.event

string

Event type.

For this event, it is always "task-started".

header.task_id

string

task_id generated by the client.

2. result-generated event

While the client sends continue-task and finish-task instructions, the server returns result-generated events continuously.

To link audio data to its corresponding text, the server returns sentence metadata along with audio data in the result-generated event. The server automatically splits input text into sentences. Each sentence's synthesis process includes three sub-events:

  • sentence-begin: Marks the start of a sentence and returns the text to synthesize.

  • sentence-synthesis: Marks an audio data chunk. Each event is followed immediately by an audio data frame over the WebSocket binary channel.

    • Multiple sentence-synthesis events occur per sentence, each corresponding to one audio data chunk.

    • The client must receive these audio data chunks in order and append them to the same file.

    • Each sentence-synthesis event corresponds one-to-one with the audio data frame that follows it. No misalignment occurs.

  • sentence-end: Marks the end of a sentence and returns the sentence text and cumulative billed character count.

Use the payload.output.type field to distinguish sub-event types (a dispatch sketch in Python follows the examples below).

Example:

sentence-begin

{
    "header": {
        "task_id": "3f2d5c86-0550-45c0-801f-xxxxxxxxxx",
        "event": "result-generated",
        "attributes": {}
    },
    "payload": {
        "output": {
            "sentence": {
                "index": 0,
                "words": []
            },
            "type": "sentence-begin",
            "original_text": "Before my bed, moonlight gleams,"
        }
    }
}

sentence-synthesis

{
    "header": {
        "task_id": "3f2d5c86-0550-45c0-801f-xxxxxxxxxx",
        "event": "result-generated",
        "attributes": {}
    },
    "payload": {
        "output": {
            "sentence": {
                "index": 0,
                "words": []
            },
            "type": "sentence-synthesis"
        }
    }
}

sentence-end

{
    "header": {
        "task_id": "3f2d5c86-0550-45c0-801f-xxxxxxxxxx",
        "event": "result-generated",
        "attributes": {}
    },
    "payload": {
        "output": {
            "sentence": {
                "index": 0,
                "words": []
            },
            "type": "sentence-end",
            "original_text": "Before my bed, moonlight gleams,"
        },
        "usage": {
            "characters": 11
        }
    }
}
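
A minimal dispatch sketch for these sub-events, assuming a hypothetical on_message callback that receives each WebSocket frame (bytes for binary audio, str for JSON events):

import json

def on_message(frame):
    # Binary frames are audio chunks; append them in arrival order.
    if isinstance(frame, bytes):
        with open("output.mp3", "ab") as f:
            f.write(frame)
        return
    event = json.loads(frame)
    if event["header"]["event"] != "result-generated":
        return
    output = event["payload"]["output"]
    if output["type"] == "sentence-begin":
        print("Sentence", output["sentence"]["index"], "started:", output.get("original_text"))
    elif output["type"] == "sentence-synthesis":
        pass  # the next binary frame is this sentence's next audio chunk
    elif output["type"] == "sentence-end":
        print("Billed characters so far:", event["payload"]["usage"]["characters"])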

header parameter description:

Parameter

Type

Description

header.event

string

Event type.

For this event, it is always "result-generated".

header.task_id

string

task_id generated by the client.

header.attributes

object

Additional attributes, usually an empty object.

payload parameter description:

Parameter

Type

Description

payload.output.type

string

Sub-event type.

Valid values:

  • sentence-begin: Marks the start of a sentence and returns the text to synthesize.

  • sentence-synthesis: Marks an audio data chunk. Each event is followed immediately by an audio data frame over the WebSocket binary channel.

    • Multiple sentence-synthesis events occur per sentence, each corresponding to one audio data chunk.

    • The client must receive these audio data chunks in order and append them to the same file.

    • Each sentence-synthesis event corresponds one-to-one with the audio data frame that follows it. No misalignment occurs.

  • sentence-end: Marks the end of a sentence and returns the sentence text and cumulative billed character count.

Full event flow

For each sentence to synthesize, the server returns events in the following order:

  1. sentence-begin: Marks the start of a sentence and includes the sentence text (original_text).

  2. sentence-synthesis (multiple times): Each event is followed immediately by a binary audio data frame.

  3. sentence-end: Marks the end of a sentence and includes the sentence text and cumulative billed character count.

payload.output.sentence.index

integer

Sentence number, starting from 0.

payload.output.sentence.words

array

Array of word-level information. Typically empty in this event; word timestamps are returned in the task-finished event when word_timestamp_enabled is set to true.

payload.output.original_text

string

Sentence content after splitting the user's input text. The last sentence may not include this field.

payload.usage.characters

integer

This value represents the number of billable characters for the current request. In a task, usage may appear in a result-generated event or a task-finished event. The returned usage field is a cumulative result. Use the last occurrence as the authoritative value.

3. task-finished event: Task finished

When you receive the task-finished event from the server, the task has ended.

After ending the task, you can close the WebSocket connection and exit the program. Or you can reuse the WebSocket connection to start a new task by sending another run-task instruction (see Connection overhead and connection reuse).

Example:

{
    "header": {
        "task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
        "event": "task-finished",
        "attributes": {
            "request_uuid": "0a9dba9e-d3a6-45a4-be6d-xxxxxxxxxxxx"
        }
    },
    "payload": {
        "output": {
            "sentence": {
                "words": []
            }
        },
        "usage": {
            "characters": 13
        }
    }
}

header parameter description:

Parameter

Type

Description

header.event

string

Event type.

For this event, it is always "task-finished".

header.task_id

string

task_id generated by the client.

header.attributes.request_uuid

string

Request ID. Provide this to CosyVoice developers to help diagnose issues.

payload parameter description:

Parameter

Type

Description

payload.usage.characters

integer

The number of billable characters in the current request so far. In a single task, usage appears in either a result-generated event or a task-finished event. The returned usage field contains a cumulative value. The final value is definitive.

payload.output.sentence.index

integer

Sentence number, starting from 0.

This field and the following fields require enabling word-level timestamps using word_timestamp_enabled.

payload.output.sentence.words[k]

text

string

Text of the word.

begin_index

integer

Starting position index of the word in the sentence, starting from 0.

end_index

integer

Ending position index of the word in the sentence, starting from 1.

begin_time

integer

Start timestamp of the audio for this word, in milliseconds.

end_time

integer

End timestamp of the audio for this word, in milliseconds.

Best practices for timestamp data extraction

After enabling word_timestamp_enabled, timestamp information is returned in the task-finished event. Example:

{
    "header": {
        "task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
        "event": "task-finished",
        "attributes": {"request_uuid": "0a9dba9e-d3a6-45a4-be6d-xxxxxxxxxxxx"}
    },
    "payload": {
        "output": {
            "sentence": {
                "index": 0,
                "words": [
                    {
                        "text": "How",
                        "begin_index": 0,
                        "end_index": 1,
                        "begin_time": 80,
                        "end_time": 280
                    },
                    {
                        "text": "is",
                        "begin_index": 1,
                        "end_index": 2,
                        "begin_time": 300,
                        "end_time": 400
                    },
                    {
                        "text": "the",
                        "begin_index": 2,
                        "end_index": 3,
                        "begin_time": 420,
                        "end_time": 520
                    },
                    {
                        "text": "weather",
                        "begin_index": 3,
                        "end_index": 4,
                        "begin_time": 540,
                        "end_time": 840
                    },
                    {
                        "text": "today",
                        "begin_index": 4,
                        "end_index": 5,
                        "begin_time": 860,
                        "end_time": 1160
                    },
                    {
                        "text": "?",
                        "begin_index": 5,
                        "end_index": 6,
                        "begin_time": 1180,
                        "end_time": 1320
                    }
                ]
            }
        },
        "usage": {"characters": 25}
    }
}

Correct extraction method:

  1. Extract complete timestamps only from the task-finished event: Complete sentence timestamp data is returned only at task completion (in the task-finished event), in the payload.output.sentence.words array.

  2. The result-generated event does not contain timestamps: The result-generated event indicates audio stream progress but does not include word-level timestamp information.

  3. Example event filtering (Python):

    def on_event(message):
        event_type = message["header"]["event"]
        # Extract timestamps only from the task-finished event
        if event_type == "task-finished":
            words = message["payload"]["output"]["sentence"]["words"]
            for word in words:
                print(f"Text: {word['text']}, Start: {word['begin_time']}ms, End: {word['end_time']}ms")
    
        # Process audio streams in result-generated events
        elif event_type == "result-generated":
            # Handle audio stream, do not extract timestamps
            pass
Important

If you extract timestamp data from multiple events, duplicates occur. Ensure you extract timestamps only from the task-finished event.

4. task-failed event: Task failed

If you receive the task-failed event, the task failed. Close the WebSocket connection and handle the error. If the failure was due to a coding issue, adjust your code accordingly.

Example:

{
    "header": {
        "task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
        "event": "task-failed",
        "error_code": "InvalidParameter",
        "error_message": "[tts:]Engine return error code: 418",
        "attributes": {}
    },
    "payload": {}
}

header parameter description:

Parameter

Type

Description

header.event

string

Event type.

For this event, it is always "task-failed".

header.task_id

string

task_id generated by the client.

header.error_code

string

Error type description.

header.error_message

string

Detailed error cause.

Task interruption methods

During streaming synthesis, to terminate the current task early (for example, user cancels playback or interrupts a real-time conversation), use one of the following methods:

Interruption method

Server behavior

Use case

Close connection directly

  • Server stops synthesis immediately.

  • Audio already generated but not yet sent is discarded.

  • Client does not receive the task-finished event.

  • Connection cannot be reused after closing.

Immediate interruption: User cancels playback, switches content, or exits the app.

Send finish-task

  • Server forces synthesis of all cached text.

  • Returns remaining audio chunks.

  • Returns the task-finished event.

  • Connection remains reusable (you can start a new task).

Graceful termination: Stop sending new text but still receive audio for the cached content.

Initiate a new run-task

  • Server automatically terminates the current task.

  • Unfinished audio from the current task is discarded.

  • Synthesis for the new task starts immediately.

  • Connection remains open; no need to rebuild it.

Task switching: In real-time conversations, users interrupt and switch immediately to new content.

Connection overhead and connection reuse

The WebSocket service supports connection reuse to improve resource efficiency and avoid connection overhead.

After the server receives the client's run-task instruction, it starts a new task. After the client sends the finish-task instruction, the server returns the task-finished event when the task completes. After the task ends, the WebSocket connection can be reused: to start a new task, simply send another run-task instruction, as shown in the sketch after the following note.

Important
  1. Each task using a reused connection must use a different task_id.

  2. If a task fails during execution, the server returns the task-failed event and closes the connection. At that point, the connection cannot be reused.

  3. If no new task starts within 60 seconds after the previous task ends, the connection times out and closes automatically.
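
A minimal reuse sketch. send_instruction and wait_for_event are hypothetical helpers wrapping the instructions and events described in this topic:

import uuid

def run_one_task(ws, text):
    """Run one full run-task → continue-task → finish-task cycle on an open connection."""
    task_id = uuid.uuid4().hex  # a fresh task_id for every task on the reused connection
    send_instruction(ws, "run-task", task_id)
    wait_for_event(ws, "task-started")
    send_instruction(ws, "continue-task", task_id, text=text)
    send_instruction(ws, "finish-task", task_id)
    wait_for_event(ws, "task-finished")  # the same connection can now host the next task

# Reuse one connection for several texts; start the next task within 60 seconds:
# for text in ["First paragraph.", "Second paragraph."]:
#     run_one_task(ws, text)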

Performance metrics and concurrency limits

Concurrency limits

See Rate limiting for details.

To increase your concurrency quota (such as supporting more concurrent connections), contact customer support. Quota adjustments may require review and usually take 1–3 business days to complete.

Note

Best practice: To improve resource utilization, reuse a WebSocket connection for multiple tasks instead of creating a new connection for each task. See Connection overhead and connection reuse.

Connection performance and latency

Normal connection time:

  • Clients in the China mainland: WebSocket connection establishment (from new WebSocket to onOpen) typically takes 200–1000 milliseconds.

  • Cross-border connections (such as Hong Kong or international regions): Connection latency may reach 1–3 seconds, and occasionally up to 10–30 seconds.

Troubleshooting long connection times:

If WebSocket connection establishment takes longer than 30 seconds, possible causes include the following:

  1. Network issues: High network latency between the client and server (such as cross-border connections or ISP quality problems).

  2. Slow DNS resolution: DNS resolution for dashscope.aliyuncs.com takes too long. Try using a public DNS (such as 8.8.8.8) or configuring your local hosts file.

  3. Slow TLS handshake: The client uses an outdated TLS version or certificate validation takes too long. Use TLS 1.2 or later.

  4. Proxy or firewall: Corporate networks may restrict WebSocket connections or require proxy usage.

Troubleshooting tools:

  • Use Wireshark or tcpdump to analyze TCP handshake, TLS handshake, and WebSocket Upgrade timing.

  • Test HTTP connection latency with curl: curl -w "@curl-format.txt" -o /dev/null -s https://dashscope.aliyuncs.com (an example curl-format.txt follows this list)
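
The -w "@curl-format.txt" option reads a timing template from a local file. One typical template, using curl's standard write-out variables:

   time_namelookup:  %{time_namelookup}s\n
      time_connect:  %{time_connect}s\n
   time_appconnect:  %{time_appconnect}s\n
time_starttransfer:  %{time_starttransfer}s\n
        time_total:  %{time_total}s\n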

Note

The CosyVoice WebSocket API is deployed in the Beijing region of the China mainland. If your client is in another region (such as Hong Kong or overseas), consider using a nearby relay server or CDN to accelerate the connection.

Audio generation performance

Synthesis speed:

  • Real-time factor (RTF): CosyVoice models typically synthesize audio at 0.1–0.5× real-time (i.e., generating 1 second of audio takes 0.1–0.5 seconds). Actual speed depends on model version, text length, and server load.

  • First packet latency: From sending the continue-task instruction to receiving the first audio chunk, latency is typically 200–800 milliseconds.

Sample code

Sample code provides only basic functionality to verify service connectivity. You must develop additional code for real-world business scenarios.

When writing WebSocket client code, use asynchronous programming to send and receive messages simultaneously. Follow these steps:

  1. Establish a WebSocket connection

    Call the WebSocket library function (implementation varies by language or library) and pass the Headers and URL to establish the WebSocket connection.

  2. Listen for server messages

    Use the callback function (observer pattern) provided by the WebSocket library to listen for messages from the server. Implementation varies by programming language.

    Server messages fall into two categories: binary audio streams and events.

    Listen for events

    Process binary audio streams: The server sends audio streams over the binary channel in frames. The full audio data is split across multiple packets.

    • In streaming speech synthesis, compressed formats such as MP3 or Opus must be played using a streaming-capable player. Do not attempt frame-by-frame playback, as this may cause decoding failures.

      Streaming-capable players include the following: ffmpeg, pyaudio (Python), AudioFormat (Java), MediaSource (JavaScript), and others.

    • When assembling audio data into a complete file, write to the same file in append mode.

    • For WAV or MP3 format audio in streaming speech synthesis, only the first frame contains header information. Subsequent frames contain audio-only data.

  3. Send messages to the server (pay strict attention to timing)

    In a thread separate from listening for server messages (such as the main thread, implementation varies by language), send instructions to the server.

    Send instructions in strict chronological order. Otherwise, the task may fail:

    1. Send the run-task instruction

    2. Send the continue-task instruction

      • Sends text to synthesize.

      • You can send this instruction only after receiving the task-started event from the server.

    3. Send the finish-task instruction

  4. Close the WebSocket connection

    Close the WebSocket connection when the program ends normally, encounters an exception during execution, or receives the task-finished event or task-failed event. Usually call the close function provided by your tool library.

View full examples

Go

package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"os"
	"strings"
	"time"

	"github.com/google/uuid"
	"github.com/gorilla/websocket"
)

const (
	// This is the URL for the Singapore region. For Beijing region models, replace with: wss://dashscope.aliyuncs.com/api-ws/v1/inference/
	wsURL      = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference/"
	outputFile = "output.mp3"
)

func main() {
	// API keys differ between Singapore and Beijing regions. Get your API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
	// If you have not set an environment variable, replace the next line with: apiKey := "sk-xxx"
	apiKey := os.Getenv("DASHSCOPE_API_KEY")

	// Clear output file
	os.Remove(outputFile)
	os.Create(outputFile)

	// Connect to WebSocket
	header := make(http.Header)
	header.Add("X-DashScope-DataInspection", "enable")
	header.Add("Authorization", fmt.Sprintf("bearer %s", apiKey))

	conn, resp, err := websocket.DefaultDialer.Dial(wsURL, header)
	if err != nil {
		if resp != nil {
			fmt.Printf("Connection failed HTTP status code: %d\n", resp.StatusCode)
		}
		fmt.Println("Connection failed:", err)
		return
	}
	defer conn.Close()

	// Generate task ID
	taskID := uuid.New().String()
	fmt.Printf("Generated task ID: %s\n", taskID)

	// Send run-task instruction
	runTaskCmd := map[string]interface{}{
		"header": map[string]interface{}{
			"action":    "run-task",
			"task_id":   taskID,
			"streaming": "duplex",
		},
		"payload": map[string]interface{}{
			"task_group": "audio",
			"task":       "tts",
			"function":   "SpeechSynthesizer",
			"model":      "cosyvoice-v3-flash",
			"parameters": map[string]interface{}{
				"text_type":   "PlainText",
				"voice":       "longanyang",
				"format":      "mp3",
				"sample_rate": 22050,
				"volume":      50,
				"rate":        1,
				"pitch":       1,
				// If enable_ssml is true, only one continue-task instruction is allowed. Otherwise, you get "Text request limit violated, expected 1."
				"enable_ssml": false,
			},
			"input": map[string]interface{}{},
		},
	}

	runTaskJSON, _ := json.Marshal(runTaskCmd)
	fmt.Printf("Sent run-task instruction: %s\n", string(runTaskJSON))

	err = conn.WriteMessage(websocket.TextMessage, runTaskJSON)
	if err != nil {
		fmt.Println("Failed to send run-task:", err)
		return
	}

	textSent := false

	// Process messages
	for {
		messageType, message, err := conn.ReadMessage()
		if err != nil {
			fmt.Println("Failed to read message:", err)
			break
		}

		// Process binary messages
		if messageType == websocket.BinaryMessage {
			fmt.Printf("Received binary message, length: %d\n", len(message))
			file, _ := os.OpenFile(outputFile, os.O_APPEND|os.O_WRONLY|os.O_CREATE, 0644)
			file.Write(message)
			file.Close()
			continue
		}

		// Process text messages
		messageStr := string(message)
		fmt.Printf("Received text message: %s\n", strings.ReplaceAll(messageStr, "\n", ""))

		// Simple JSON parsing to get event type
		var msgMap map[string]interface{}
		if json.Unmarshal(message, &msgMap) == nil {
			if header, ok := msgMap["header"].(map[string]interface{}); ok {
				if event, ok := header["event"].(string); ok {
					fmt.Printf("Event type: %s\n", event)

					switch event {
					case "task-started":
						fmt.Println("=== Received task-started event ===")

						if !textSent {
							// Send continue-task instruction

							texts := []string{"Before my bed, moonlight gleams, like frost upon the ground.", "I lift my eyes to gaze at the bright moon, then bow my head, thinking of home."}

							for _, text := range texts {
								continueTaskCmd := map[string]interface{}{
									"header": map[string]interface{}{
										"action":    "continue-task",
										"task_id":   taskID,
										"streaming": "duplex",
									},
									"payload": map[string]interface{}{
										"input": map[string]interface{}{
											"text": text,
										},
									},
								}

								continueTaskJSON, _ := json.Marshal(continueTaskCmd)
								fmt.Printf("Sent continue-task instruction: %s\n", string(continueTaskJSON))

								err = conn.WriteMessage(websocket.TextMessage, continueTaskJSON)
								if err != nil {
									fmt.Println("Failed to send continue-task:", err)
									return
								}
							}

							textSent = true

							// Delay sending finish-task
							time.Sleep(500 * time.Millisecond)

							// Send finish-task instruction
							finishTaskCmd := map[string]interface{}{
								"header": map[string]interface{}{
									"action":    "finish-task",
									"task_id":   taskID,
									"streaming": "duplex",
								},
								"payload": map[string]interface{}{
									"input": map[string]interface{}{},
								},
							}

							finishTaskJSON, _ := json.Marshal(finishTaskCmd)
							fmt.Printf("Sent finish-task instruction: %s\n", string(finishTaskJSON))

							err = conn.WriteMessage(websocket.TextMessage, finishTaskJSON)
							if err != nil {
								fmt.Println("Failed to send finish-task:", err)
								return
							}
						}

					case "task-finished":
						fmt.Println("=== Task completed ===")
						return

					case "task-failed":
						fmt.Println("=== Task failed ===")
						if header["error_message"] != nil {
							fmt.Printf("Error message: %s\n", header["error_message"])
						}
						return

					case "result-generated":
						fmt.Println("Received result-generated event")
					}
				}
			}
		}
	}
}

C#

using System.Net.WebSockets;
using System.Text;
using System.Text.Json;

class Program {
    // API keys differ between Singapore and Beijing regions. Get your API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
    // If you have not set an environment variable, replace the next line with: private static readonly string ApiKey = "sk-xxx"
    private static readonly string ApiKey = Environment.GetEnvironmentVariable("DASHSCOPE_API_KEY") ?? throw new InvalidOperationException("DASHSCOPE_API_KEY environment variable is not set.");

    // This is the URL for the Singapore region. For Beijing region models, replace with: wss://dashscope.aliyuncs.com/api-ws/v1/inference/
    private const string WebSocketUrl = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference/";
    // Output file path
    private const string OutputFilePath = "output.mp3";

    // WebSocket client
    private static ClientWebSocket _webSocket = new ClientWebSocket();
    // Cancellation token source
    private static CancellationTokenSource _cancellationTokenSource = new CancellationTokenSource();
    // Task ID
    private static string? _taskId;
    // Whether task has started
    private static TaskCompletionSource<bool> _taskStartedTcs = new TaskCompletionSource<bool>();

    static async Task Main(string[] args) {
        try {
            // Clear output file
            ClearOutputFile(OutputFilePath);

            // Connect to WebSocket service
            await ConnectToWebSocketAsync(WebSocketUrl);

            // Start receiving messages task
            Task receiveTask = ReceiveMessagesAsync();

            // Send run-task instruction
            _taskId = GenerateTaskId();
            await SendRunTaskCommandAsync(_taskId);

            // Wait for task-started event
            await _taskStartedTcs.Task;

            // Continuously send continue-task instructions
            string[] texts = {
                "Before my bed, moonlight gleams",
                "like frost upon the ground",
                "I lift my eyes to gaze at the bright moon",
                "then bow my head, thinking of home"
            };
            foreach (string text in texts) {
                await SendContinueTaskCommandAsync(text);
            }

            // Send finish-task instruction
            await SendFinishTaskCommandAsync(_taskId);

            // Wait for receive task to complete
            await receiveTask;

            Console.WriteLine("Task completed, connection closed.");
        } catch (OperationCanceledException) {
            Console.WriteLine("Task canceled.");
        } catch (Exception ex) {
            Console.WriteLine($"Error occurred: {ex.Message}");
        } finally {
            _cancellationTokenSource.Cancel();
            _webSocket.Dispose();
        }
    }

    private static void ClearOutputFile(string filePath) {
        if (File.Exists(filePath)) {
            File.WriteAllText(filePath, string.Empty);
            Console.WriteLine("Output file cleared.");
        } else {
            Console.WriteLine("Output file does not exist, no need to clear.");
        }
    }

    private static async Task ConnectToWebSocketAsync(string url) {
        var uri = new Uri(url);
        if (_webSocket.State == WebSocketState.Connecting || _webSocket.State == WebSocketState.Open) {
            return;
        }

        // Set WebSocket connection headers
        _webSocket.Options.SetRequestHeader("Authorization", $"bearer {ApiKey}");
        _webSocket.Options.SetRequestHeader("X-DashScope-DataInspection", "enable");

        try {
            await _webSocket.ConnectAsync(uri, _cancellationTokenSource.Token);
            Console.WriteLine("Successfully connected to WebSocket service.");
        } catch (OperationCanceledException) {
            Console.WriteLine("WebSocket connection canceled.");
        } catch (Exception ex) {
            Console.WriteLine($"WebSocket connection failed: {ex.Message}");
            throw;
        }
    }

    private static async Task SendRunTaskCommandAsync(string taskId) {
        var command = CreateCommand("run-task", taskId, "duplex", new {
            task_group = "audio",
            task = "tts",
            function = "SpeechSynthesizer",
            model = "cosyvoice-v3-flash",
            parameters = new
            {
                text_type = "PlainText",
                voice = "longanyang",
                format = "mp3",
                sample_rate = 22050,
                volume = 50,
                rate = 1,
                pitch = 1,
                // If enable_ssml is true, only one continue-task instruction is allowed. Otherwise, you get "Text request limit violated, expected 1."
                enable_ssml = false
            },
            input = new { }
        });

        await SendJsonMessageAsync(command);
        Console.WriteLine("Sent run-task instruction.");
    }

    private static async Task SendContinueTaskCommandAsync(string text) {
        if (_taskId == null) {
            throw new InvalidOperationException("Task ID not initialized.");
        }

        var command = CreateCommand("continue-task", _taskId, "duplex", new {
            input = new {
                text
            }
        });

        await SendJsonMessageAsync(command);
        Console.WriteLine("Sent continue-task instruction.");
    }

    private static async Task SendFinishTaskCommandAsync(string taskId) {
        var command = CreateCommand("finish-task", taskId, "duplex", new {
            input = new { }
        });

        await SendJsonMessageAsync(command);
        Console.WriteLine("Sent finish-task instruction.");
    }

    private static async Task SendJsonMessageAsync(string message) {
        var buffer = Encoding.UTF8.GetBytes(message);
        try {
            await _webSocket.SendAsync(new ArraySegment<byte>(buffer), WebSocketMessageType.Text, true, _cancellationTokenSource.Token);
        } catch (OperationCanceledException) {
            Console.WriteLine("Message sending canceled.");
        }
    }

    private static async Task ReceiveMessagesAsync() {
        // Stop once the task finishes or fails (cancellation requested) or the socket closes.
        while (_webSocket.State == WebSocketState.Open && !_cancellationTokenSource.IsCancellationRequested) {
            var response = await ReceiveMessageAsync();
            if (response != null) {
                var eventStr = response.RootElement.GetProperty("header").GetProperty("event").GetString();
                switch (eventStr) {
                    case "task-started":
                        Console.WriteLine("Task started.");
                        _taskStartedTcs.TrySetResult(true);
                        break;
                    case "task-finished":
                        Console.WriteLine("Task completed.");
                        _cancellationTokenSource.Cancel();
                        break;
                    case "task-failed":
                        Console.WriteLine("Task failed: " + response.RootElement.GetProperty("header").GetProperty("error_message").GetString());
                        _cancellationTokenSource.Cancel();
                        break;
                    default:
                        // Process result-generated here
                        break;
                }
            }
        }
    }

    private static async Task<JsonDocument?> ReceiveMessageAsync() {
        var buffer = new byte[1024 * 4];
        var segment = new ArraySegment<byte>(buffer);

        try {
            // A single WebSocket message can span multiple frames; accumulate until EndOfMessage.
            using var messageStream = new MemoryStream();
            WebSocketReceiveResult result;
            do {
                result = await _webSocket.ReceiveAsync(segment, _cancellationTokenSource.Token);

                if (result.MessageType == WebSocketMessageType.Close) {
                    await _webSocket.CloseAsync(WebSocketCloseStatus.NormalClosure, "Closing", _cancellationTokenSource.Token);
                    return null;
                }

                messageStream.Write(buffer, 0, result.Count);
            } while (!result.EndOfMessage);

            if (result.MessageType == WebSocketMessageType.Binary) {
                // Append the binary audio data to the output file
                Console.WriteLine("Received binary data...");
                using (var fileStream = new FileStream(OutputFilePath, FileMode.Append)) {
                    messageStream.WriteTo(fileStream);
                }
                return null;
            }

            string message = Encoding.UTF8.GetString(messageStream.ToArray());
            return JsonDocument.Parse(message);
        } catch (OperationCanceledException) {
            Console.WriteLine("Message receiving canceled.");
            return null;
        }
    }

    private static string GenerateTaskId() {
        // The "N" format already yields a 32-character hex string, so no truncation is needed
        return Guid.NewGuid().ToString("N");
    }

    private static string CreateCommand(string action, string taskId, string streaming, object payload) {
        var command = new {
            header = new {
                action,
                task_id = taskId,
                streaming
            },
            payload
        };

        return JsonSerializer.Serialize(command);
    }
}

PHP

Example code directory structure:

my-php-project/
├── composer.json
├── vendor/
└── index.php

Contents of composer.json (use appropriate versions for your environment):

{
    "require": {
        "react/event-loop": "^1.3",
        "react/socket": "^1.11",
        "react/stream": "^1.2",
        "react/http": "^1.1",
        "ratchet/pawl": "^0.4"
    },
    "autoload": {
        "psr-4": {
            "App\\": "src/"
        }
    }
}

Contents of index.php:

<?php

require __DIR__ . '/vendor/autoload.php';

use Ratchet\Client\Connector;
use React\EventLoop\Loop;
use React\Socket\Connector as SocketConnector;

// API keys differ between Singapore and Beijing regions. Get your API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
// If you have not set an environment variable, replace the next line with: $api_key = "sk-xxx"
$api_key = getenv("DASHSCOPE_API_KEY");
// This is the URL for the Singapore region. For Beijing region models, replace with: wss://dashscope.aliyuncs.com/api-ws/v1/inference/
$websocket_url = 'wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference/'; // WebSocket server address
$output_file = 'output.mp3'; // Output file path

$loop = Loop::get();

if (file_exists($output_file)) {
    // Clear file contents
    file_put_contents($output_file, '');
}

// Create custom connector
$socketConnector = new SocketConnector($loop, [
    'tcp' => [
        'bindto' => '0.0.0.0:0',
    ],
    'tls' => [
        // Peer verification is disabled here to keep the sample simple; enable it in production.
        'verify_peer' => false,
        'verify_peer_name' => false,
    ],
]);

$connector = new Connector($loop, $socketConnector);

$headers = [
    'Authorization' => 'bearer ' . $api_key,
    'X-DashScope-DataInspection' => 'enable'
];

$connector($websocket_url, [], $headers)->then(function ($conn) use ($loop, $output_file) {
    echo "Connected to WebSocket server\n";

    // Generate task ID
    $taskId = generateTaskId();

    // Send run-task instruction
    sendRunTaskMessage($conn, $taskId);

    // Define function to send continue-task instruction
    $sendContinueTask = function() use ($conn, $loop, $taskId) {
        // Text to send
        $texts = ["Before my bed, moonlight gleams", "like frost upon the ground", "I lift my eyes to gaze at the bright moon", "then bow my head, thinking of home"];
        $continueTaskCount = 0;
        foreach ($texts as $text) {
            $continueTaskMessage = json_encode([
                "header" => [
                    "action" => "continue-task",
                    "task_id" => $taskId,
                    "streaming" => "duplex"
                ],
                "payload" => [
                    "input" => [
                        "text" => $text
                    ]
                ]
            ]);
            echo "Preparing to send continue-task instruction: " . $continueTaskMessage . "\n";
            $conn->send($continueTaskMessage);
            $continueTaskCount++;
        }
        echo "Number of continue-task instructions sent: " . $continueTaskCount . "\n";

        // Send finish-task instruction
        sendFinishTaskMessage($conn, $taskId);
    };

    // Flag to track whether task-started event was received
    $taskStarted = false;

    // Listen for messages
    $conn->on('message', function($msg) use ($conn, $sendContinueTask, $loop, &$taskStarted, $taskId, $output_file) {
        if ($msg->isBinary()) {
            // Write binary data to local file
            file_put_contents($output_file, $msg->getPayload(), FILE_APPEND);
        } else {
            // Process non-binary messages
            $response = json_decode($msg, true);

            if (isset($response['header']['event'])) {
                handleEvent($conn, $response, $sendContinueTask, $loop, $taskId, $taskStarted);
            } else {
                echo "Unknown message format\n";
            }
        }
    });

    // Listen for connection close
    $conn->on('close', function($code = null, $reason = null) {
        echo "Connection closed\n";
        if ($code !== null) {
            echo "Close code: " . $code . "\n";
        }
        if ($reason !== null) {
            echo "Close reason: " . $reason . "\n";
        }
    });
}, function ($e) {
    echo "Cannot connect: {$e->getMessage()}\n";
});

$loop->run();

/**
 * Generate task ID
 * @return string
 */
function generateTaskId(): string {
    return bin2hex(random_bytes(16));
}

/**
 * Send run-task instruction
 * @param $conn
 * @param $taskId
 */
function sendRunTaskMessage($conn, $taskId) {
    $runTaskMessage = json_encode([
        "header" => [
            "action" => "run-task",
            "task_id" => $taskId,
            "streaming" => "duplex"
        ],
        "payload" => [
            "task_group" => "audio",
            "task" => "tts",
            "function" => "SpeechSynthesizer",
            "model" => "cosyvoice-v3-flash",
            "parameters" => [
                "text_type" => "PlainText",
                "voice" => "longanyang",
                "format" => "mp3",
                "sample_rate" => 22050,
                "volume" => 50,
                "rate" => 1,
                "pitch" => 1,
                // If enable_ssml is true, only one continue-task instruction is allowed. Otherwise, you get "Text request limit violated, expected 1."
                "enable_ssml" => false
            ],
            "input" => (object) []
        ]
    ]);
    echo "Preparing to send run-task instruction: " . $runTaskMessage . "\n";
    $conn->send($runTaskMessage);
    echo "run-task instruction sent\n";
}

/**
 * Send finish-task instruction
 * @param $conn
 * @param $taskId
 */
function sendFinishTaskMessage($conn, $taskId) {
    $finishTaskMessage = json_encode([
        "header" => [
            "action" => "finish-task",
            "task_id" => $taskId,
            "streaming" => "duplex"
        ],
        "payload" => [
            "input" => (object) []
        ]
    ]);
    echo "Preparing to send finish-task instruction: " . $finishTaskMessage . "\n";
    $conn->send($finishTaskMessage);
    echo "finish-task instruction sent\n";
}

/**
 * Handle event
 * @param $conn
 * @param $response
 * @param $sendContinueTask
 * @param $loop
 * @param $taskId
 * @param $taskStarted
 */
function handleEvent($conn, $response, $sendContinueTask, $loop, $taskId, &$taskStarted) {
    switch ($response['header']['event']) {
        case 'task-started':
            echo "Task started, sending continue-task instructions...\n";
            $taskStarted = true;
            // Send continue-task instructions
            $sendContinueTask();
            break;
        case 'result-generated':
            // Received result-generated event
            break;
        case 'task-finished':
            echo "Task completed\n";
            // Do not close here; the delayed timer below closes the connection
            // so trailing audio frames are not lost.
            break;
        case 'task-failed':
            echo "Task failed\n";
            echo "Error code: " . $response['header']['error_code'] . "\n";
            echo "Error message: " . $response['header']['error_message'] . "\n";
            $conn->close();
            break;
        case 'error':
            echo "Error: " . $response['payload']['message'] . "\n";
            break;
        default:
            echo "Unknown event: " . $response['header']['event'] . "\n";
            break;
    }

    // Close connection if task completed
    if ($response['header']['event'] == 'task-finished') {
        // Wait 1 second to ensure all data is transmitted
        $loop->addTimer(1, function() use ($conn) {
            $conn->close();
            echo "Client closed connection\n";
        });
    }

    // Close connection if task-started not received
    if (!$taskStarted && in_array($response['header']['event'], ['task-failed', 'error'])) {
        $conn->close();
    }
}

Node.js

Install dependencies:

npm install ws
npm install uuid

Example code:

const WebSocket = require('ws');
const fs = require('fs');
const uuid = require('uuid').v4;

// API keys differ between Singapore and Beijing regions. Get your API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
// If you have not set an environment variable, replace the next line with: const apiKey = "sk-xxx"
const apiKey = process.env.DASHSCOPE_API_KEY;
// This is the URL for the Singapore region. For Beijing region models, replace with: wss://dashscope.aliyuncs.com/api-ws/v1/inference/
const url = 'wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference/';
// Output file path
const outputFilePath = 'output.mp3';

// Clear output file
fs.writeFileSync(outputFilePath, '');

// Create WebSocket client
const ws = new WebSocket(url, {
  headers: {
    Authorization: `bearer ${apiKey}`,
    'X-DashScope-DataInspection': 'enable'
  }
});

let taskStarted = false;
let taskId = uuid();

ws.on('open', () => {
  console.log('Connected to WebSocket server');

  // Send run-task instruction
  const runTaskMessage = JSON.stringify({
    header: {
      action: 'run-task',
      task_id: taskId,
      streaming: 'duplex'
    },
    payload: {
      task_group: 'audio',
      task: 'tts',
      function: 'SpeechSynthesizer',
      model: 'cosyvoice-v3-flash',
      parameters: {
        text_type: 'PlainText',
        voice: 'longanyang', // Voice
        format: 'mp3', // Audio format
        sample_rate: 22050, // Sample rate
        volume: 50, // Volume
        rate: 1, // Speech rate
        pitch: 1, // Pitch
        enable_ssml: false // Enable SSML. If true, only one continue-task instruction is allowed. Otherwise, you get "Text request limit violated, expected 1."
      },
      input: {}
    }
  });
  ws.send(runTaskMessage);
  console.log('Sent run-task message');
});

const fileStream = fs.createWriteStream(outputFilePath, { flags: 'a' });
ws.on('message', (data, isBinary) => {
  if (isBinary) {
    // Write binary data to file
    fileStream.write(data);
  } else {
    const message = JSON.parse(data);

    switch (message.header.event) {
      case 'task-started':
        taskStarted = true;
        console.log('Task started');
        // Send continue-task instructions
        sendContinueTasks(ws);
        break;
      case 'task-finished':
        console.log('Task completed');
        ws.close();
        fileStream.end(() => {
          console.log('File stream closed');
        });
        break;
      case 'task-failed':
        console.error('Task failed: ', message.header.error_message);
        ws.close();
        fileStream.end(() => {
          console.log('File stream closed');
        });
        break;
      default:
        // Process result-generated here
        break;
    }
  }
});

function sendContinueTasks(ws) {
  const texts = [
    'Before my bed, moonlight gleams,',
    'like frost upon the ground.',
    'I lift my eyes to gaze at the bright moon,',
    'then bow my head, thinking of home.'
  ];

  texts.forEach((text, index) => {
    setTimeout(() => {
      if (taskStarted) {
        const continueTaskMessage = JSON.stringify({
          header: {
            action: 'continue-task',
            task_id: taskId,
            streaming: 'duplex'
          },
          payload: {
            input: {
              text: text
            }
          }
        });
        ws.send(continueTaskMessage);
        console.log(`Sent continue-task, text: ${text}`);
      }
    }, index * 1000); // Send every second
  });

  // Send finish-task instruction
  setTimeout(() => {
    if (taskStarted) {
      const finishTaskMessage = JSON.stringify({
        header: {
          action: 'finish-task',
          task_id: taskId,
          streaming: 'duplex'
        },
        payload: {
          input: {}
        }
      });
      ws.send(finishTaskMessage);
      console.log('Sent finish-task');
    }
  }, texts.length * 1000 + 1000); // Send 1 second after last continue-task
}

ws.on('close', () => {
  console.log('Disconnected from WebSocket server');
});

Java

If you use Java, we recommend using the Java DashScope SDK. For details, see Java SDK.

Below is a Java WebSocket usage example. Before running the example, import the following dependencies:

  • Java-WebSocket

  • jackson-databind

We recommend managing dependencies with Maven or Gradle. Configuration examples follow:

pom.xml

<dependencies>
    <!-- WebSocket Client -->
    <dependency>
        <groupId>org.java-websocket</groupId>
        <artifactId>Java-WebSocket</artifactId>
        <version>1.5.3</version>
    </dependency>

    <!-- JSON Processing -->
    <dependency>
        <groupId>com.fasterxml.jackson.core</groupId>
        <artifactId>jackson-databind</artifactId>
        <version>2.13.0</version>
    </dependency>
</dependencies>

build.gradle

// Omit other code
dependencies {
  // WebSocket Client
  implementation 'org.java-websocket:Java-WebSocket:1.5.3'
  // JSON Processing
  implementation 'com.fasterxml.jackson.core:jackson-databind:2.13.0'
}
// Omit other code

Java code:

import com.fasterxml.jackson.databind.ObjectMapper;

import org.java_websocket.client.WebSocketClient;
import org.java_websocket.handshake.ServerHandshake;

import java.io.FileOutputStream;
import java.io.IOException;
import java.net.URI;
import java.nio.ByteBuffer;
import java.util.*;

public class TTSWebSocketClient extends WebSocketClient {
    private final String taskId = UUID.randomUUID().toString();
    private final String outputFile = "output_" + System.currentTimeMillis() + ".mp3";
    private volatile boolean taskFinished = false; // volatile: written by the WebSocket thread, polled by main

    public TTSWebSocketClient(URI serverUri, Map<String, String> headers) {
        super(serverUri, headers);
    }

    @Override
    public void onOpen(ServerHandshake serverHandshake) {
        System.out.println("Connection successful");

        // Send run-task instruction
        // If enable_ssml is true, only one continue-task instruction is allowed. Otherwise, you get "Text request limit violated, expected 1."
        String runTaskCommand = "{ \"header\": { \"action\": \"run-task\", \"task_id\": \"" + taskId + "\", \"streaming\": \"duplex\" }, \"payload\": { \"task_group\": \"audio\", \"task\": \"tts\", \"function\": \"SpeechSynthesizer\", \"model\": \"cosyvoice-v3-flash\", \"parameters\": { \"text_type\": \"PlainText\", \"voice\": \"longanyang\", \"format\": \"mp3\", \"sample_rate\": 22050, \"volume\": 50, \"rate\": 1, \"pitch\": 1, \"enable_ssml\": false }, \"input\": {} }}";
        send(runTaskCommand);
    }

    @Override
    public void onMessage(String message) {
        System.out.println("Received server message: " + message);
        try {
            // Parse JSON message
            Map<String, Object> messageMap = new ObjectMapper().readValue(message, Map.class);

            if (messageMap.containsKey("header")) {
                Map<String, Object> header = (Map<String, Object>) messageMap.get("header");

                if (header.containsKey("event")) {
                    String event = (String) header.get("event");

                    if ("task-started".equals(event)) {
                        System.out.println("Received task-started event from server");

                        List<String> texts = Arrays.asList(
                                "Before my bed, moonlight gleams, like frost upon the ground",
                                "I lift my eyes to gaze at the bright moon, then bow my head, thinking of home"
                        );

                        for (String text : texts) {
                            // Send continue-task instruction
                            sendContinueTask(text);
                        }

                        // Send finish-task instruction
                        sendFinishTask();
                    } else if ("task-finished".equals(event)) {
                        System.out.println("Received task-finished event from server");
                        taskFinished = true;
                        closeConnection();
                    } else if ("task-failed".equals(event)) {
                        System.out.println("Task failed: " + message);
                        closeConnection();
                    }
                }
            }
        } catch (Exception e) {
            System.err.println("Exception occurred: " + e.getMessage());
        }
    }

    @Override
    public void onMessage(ByteBuffer message) {
        System.out.println("Received binary audio data size: " + message.remaining());

        try (FileOutputStream fos = new FileOutputStream(outputFile, true)) {
            byte[] buffer = new byte[message.remaining()];
            message.get(buffer);
            fos.write(buffer);
            System.out.println("Audio data written to local file " + outputFile);
        } catch (IOException e) {
            System.err.println("Failed to write audio data to local file: " + e.getMessage());
        }
    }

    @Override
    public void onClose(int code, String reason, boolean remote) {
        System.out.println("Connection closed: " + reason + " (" + code + ")");
    }

    @Override
    public void onError(Exception ex) {
        System.err.println("Error: " + ex.getMessage());
        ex.printStackTrace();
    }

    private void sendContinueTask(String text) {
        // Note: the text is concatenated without JSON escaping; for arbitrary input,
        // build the message with a JSON library such as ObjectMapper instead.
        String command = "{ \"header\": { \"action\": \"continue-task\", \"task_id\": \"" + taskId + "\", \"streaming\": \"duplex\" }, \"payload\": { \"input\": { \"text\": \"" + text + "\" } }}";
        send(command);
    }

    private void sendFinishTask() {
        String command = "{ \"header\": { \"action\": \"finish-task\", \"task_id\": \"" + taskId + "\", \"streaming\": \"duplex\" }, \"payload\": { \"input\": {} }}";
        send(command);
    }

    private void closeConnection() {
        if (!isClosed()) {
            close();
        }
    }

    public static void main(String[] args) {
        try {
            // API keys differ between Singapore and Beijing regions. Get your API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
            // If you have not set an environment variable, replace the next line with: String apiKey = "sk-xxx"
            String apiKey = System.getenv("DASHSCOPE_API_KEY");
            if (apiKey == null || apiKey.isEmpty()) {
                System.err.println("Set DASHSCOPE_API_KEY environment variable");
                return;
            }

            Map<String, String> headers = new HashMap<>();
            headers.put("Authorization", "bearer " + apiKey);
            // This is the URL for the Singapore region. For Beijing region models, replace with: wss://dashscope.aliyuncs.com/api-ws/v1/inference/
            TTSWebSocketClient client = new TTSWebSocketClient(new URI("wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference/"), headers);

            client.connect();

            while (!client.isClosed() && !client.taskFinished) {
                Thread.sleep(1000);
            }
        } catch (Exception e) {
            System.err.println("Failed to connect to WebSocket service: " + e.getMessage());
            e.printStackTrace();
        }
    }
}

Python

If you use Python, we recommend using the Python DashScope SDK. For details, see Python SDK.

Below is a Python WebSocket usage example. Before running the example, install dependencies as follows:

pip uninstall websocket-client
pip uninstall websocket
pip install websocket-client
Important

Do not name your Python file "websocket.py". Doing so causes the error "AttributeError: module 'websocket' has no attribute 'WebSocketApp'. Did you mean: 'WebSocket'?".

import websocket
import json
import uuid
import os
import time


class TTSClient:
    def __init__(self, api_key, uri):
        """
    Initialize a TTSClient instance.

    Parameters:
        api_key (str): API key for authentication.
        uri (str): WebSocket service endpoint.
    """
        self.api_key = api_key  # Replace with your API key
        self.uri = uri  # Replace with your WebSocket endpoint
        self.task_id = str(uuid.uuid4())  # Generate unique task ID
        self.output_file = f"output_{int(time.time())}.mp3"  # Output audio file path
        self.ws = None  # WebSocketApp instance
        self.task_started = False  # Whether task-started was received
        self.task_finished = False  # Whether task-finished or task-failed was received

    def on_open(self, ws):
        """
    Callback triggered when the WebSocket connection opens.
    Sends the run-task instruction to start speech synthesis.
    """
        print("WebSocket connected")

        # Construct run-task instruction
        run_task_cmd = {
            "header": {
                "action": "run-task",
                "task_id": self.task_id,
                "streaming": "duplex"
            },
            "payload": {
                "task_group": "audio",
                "task": "tts",
                "function": "SpeechSynthesizer",
                "model": "cosyvoice-v3-flash",
                "parameters": {
                    "text_type": "PlainText",
                    "voice": "longanyang",
                    "format": "mp3",
                    "sample_rate": 22050,
                    "volume": 50,
                    "rate": 1,
                    "pitch": 1,
                    # If enable_ssml is True, only one continue-task instruction is allowed. Otherwise, you get "Text request limit violated, expected 1."
                    "enable_ssml": False
                },
                "input": {}
            }
        }

        # Send run-task instruction
        ws.send(json.dumps(run_task_cmd))
        print("Sent run-task instruction")

    def on_message(self, ws, message):
        """
    Callback triggered when a message is received.
    Handles both text and binary messages.
    """
        if isinstance(message, str):
            # Process JSON text messages
            try:
                msg_json = json.loads(message)
                print(f"Received JSON message: {msg_json}")

                if "header" in msg_json:
                    header = msg_json["header"]

                    if "event" in header:
                        event = header["event"]

                        if event == "task-started":
                            print("Task started")
                            self.task_started = True

                            # Send continue-task instructions
                            texts = [
                                "Before my bed, moonlight gleams, like frost upon the ground",
                                "I lift my eyes to gaze at the bright moon, then bow my head, thinking of home"
                            ]

                            for text in texts:
                                self.send_continue_task(text)

                            # Send finish-task after all continue-task instructions
                            self.send_finish_task()

                        elif event == "task-finished":
                            print("Task completed")
                            self.task_finished = True
                            self.close(ws)

                        elif event == "task-failed":
                            error_msg = msg_json.get("error_message", "Unknown error")
                            print(f"Task failed: {error_msg}")
                            self.task_finished = True
                            self.close(ws)

            except json.JSONDecodeError as e:
                print(f"JSON parsing failed: {e}")
        else:
            # Process binary messages (audio data)
            print(f"Received binary message, size: {len(message)} bytes")
            with open(self.output_file, "ab") as f:
                f.write(message)
            print(f"Audio data written to local file {self.output_file}")

    def on_error(self, ws, error):
        """Callback triggered on error."""
        print(f"WebSocket error: {error}")

    def on_close(self, ws, close_status_code, close_msg):
        """Callback triggered on connection close."""
        print(f"WebSocket closed: {close_msg} ({close_status_code})")

    def send_continue_task(self, text):
        """Send a continue-task instruction with text to synthesize."""
        cmd = {
            "header": {
                "action": "continue-task",
                "task_id": self.task_id,
                "streaming": "duplex"
            },
            "payload": {
                "input": {
                    "text": text
                }
            }
        }

        self.ws.send(json.dumps(cmd))
        print(f"Sent continue-task instruction, text: {text}")

    def send_finish_task(self):
        """Send a finish-task instruction to end speech synthesis."""
        cmd = {
            "header": {
                "action": "finish-task",
                "task_id": self.task_id,
                "streaming": "duplex"
            },
            "payload": {
                "input": {}
            }
        }

        self.ws.send(json.dumps(cmd))
        print("Sent finish-task instruction")

    def close(self, ws):
        """Close the connection explicitly."""
        if ws and ws.sock and ws.sock.connected:
            ws.close()
            print("Explicitly closed connection")

    def run(self):
        """Start the WebSocket client."""
        # Set request headers (authentication)
        header = {
            "Authorization": f"bearer {self.api_key}",
            "X-DashScope-DataInspection": "enable"
        }

        # Create WebSocketApp instance
        self.ws = websocket.WebSocketApp(
            self.uri,
            header=header,
            on_open=self.on_open,
            on_message=self.on_message,
            on_error=self.on_error,
            on_close=self.on_close
        )

        print("Listening for WebSocket messages...")
        self.ws.run_forever()  # Start long-running connection listener


# Example usage
if __name__ == "__main__":
    # API keys differ between Singapore and Beijing regions. Get your API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
    # If you have not set an environment variable, replace the next line with: API_KEY = "sk-xxx"
    API_KEY = os.environ.get("DASHSCOPE_API_KEY")
    # This is the URL for the Singapore region. For Beijing region models, replace with: wss://dashscope.aliyuncs.com/api-ws/v1/inference/
    SERVER_URI = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference/"  # Replace with your WebSocket endpoint

    client = TTSClient(API_KEY, SERVER_URI)
    client.run()

Error codes

To troubleshoot an error, see Error messages.

FAQ

Features, billing, and rate limiting

Q: How can I fix inaccurate pronunciation?

You can customize the speech synthesis output using Speech Synthesis Markup Language (SSML).

Q: Why use WebSocket instead of HTTP/HTTPS? Why not provide a RESTful API?

The speech service chooses WebSocket over HTTP/HTTPS/RESTful because it relies on full-duplex communication. WebSocket allows both the server and client to actively transmit data bidirectionally (such as pushing real-time speech synthesis or recognition progress). In contrast, RESTful APIs based on HTTP support only unidirectional client-initiated request-response patterns, which cannot meet real-time interaction requirements.

Q: Speech synthesis is billed per character. How do I view or retrieve the text length for each synthesis?

Get the character count from the payload.usage.characters parameter in the result-generated event returned by the server. Use the value from the last result-generated event.
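
For example, building on the Python sample earlier in this topic, a small helper can extract this field from a parsed server message (a sketch; it relies only on the field path described above):

def billed_characters(msg_json: dict):
    """Return payload.usage.characters from a result-generated event, or None.

    Per the answer above, the value carried by the last result-generated
    event is the total character count for the task.
    """
    if msg_json.get("header", {}).get("event") != "result-generated":
        return None
    return (msg_json.get("payload", {}).get("usage") or {}).get("characters")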

Troubleshooting

Important

When your code throws an error, check whether the instruction sent to the server is correct. Print the instruction content and check for formatting errors or missing required parameters. If the instruction is correct, troubleshoot using the information in the error codes.

Q: How do I get the Request ID?

You can get it in two ways:

Q: Why does SSML fail?

Troubleshoot using the following steps:

  1. Confirm that your request meets the SSML limitations and constraints (a supported model and a voice that supports SSML).

  2. Ensure you call it correctly. For details, see SSML markup language support.

  3. Ensure the SSML document is well-formed and that the text it contains meets the formatting requirements. For details, see SSML markup language introduction.

Q: Why won't the audio play?

You can troubleshoot based on the following scenarios:

  1. If you save the audio as a complete file (for example, audio.mp3):

    1. Audio format consistency: Make sure the audio format set in the request parameters matches the file extension. For example, if you set the audio format to `wav` but save the file with an `.mp3` extension, playback might fail.

    2. Player compatibility: Check if your player supports the audio file's format and sample rate. For example, some players may not support audio files with high sample rates or specific encodings.

  2. Streaming audio playback scenarios

    1. Save the audio stream as a complete file and try to play it. If the file does not play, follow the troubleshooting steps for Scenario 1.

    2. If the file plays correctly, the issue might be with your streaming implementation. Check if your player supports streaming playback.

      Common tools and libraries that support streaming playback include FFmpeg, PyAudio (Python), AudioFormat (Java), and MediaSource (JavaScript).
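
For example, PyAudio can play raw audio frames as they arrive. The sketch below assumes you request uncompressed PCM in the run-task parameters (the samples in this topic request "mp3", which would need to be decoded first) at 22050 Hz, 16-bit mono:

import pyaudio

# Sketch: write each binary WebSocket frame straight to the sound card.
# Assumes the service returns raw PCM at 22050 Hz, 16-bit mono.
p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16, channels=1, rate=22050, output=True)

def on_binary_message(chunk: bytes) -> None:
    # Call this from your WebSocket binary-message handler.
    stream.write(chunk)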

Q: Why is the audio playback stuttering?

You can troubleshoot based on the following scenarios:

  1. Check the text submission speed: Send each text segment promptly, so the service is not left waiting for input after the previous audio segment has finished playing.

  2. Check callback function performance:

    • Check for excessive business logic in the callback function that could cause blocking.

    • The callback function runs on the WebSocket thread. If this thread is blocked, it can interfere with the reception of network packets over WebSocket, which causes audio stuttering.

    • Write the audio data to a separate audio buffer, then read and process it on a different thread so the WebSocket thread is never blocked (see the sketch after this list).

  3. Check network stability: Make sure your network connectivity is stable to prevent audio transmission interruptions or delays caused by network fluctuations.
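
A minimal Python sketch of the buffer-and-worker pattern from step 2 (play() is a hypothetical playback function, for example one backed by PyAudio):

import queue
import threading

# Sketch: decouple the WebSocket receive thread from audio processing.
audio_buffer = queue.Queue()

def on_binary_message(chunk: bytes) -> None:
    # Runs on the WebSocket thread: only enqueue; never decode or play here.
    audio_buffer.put(chunk)

def playback_worker() -> None:
    # Runs on its own thread: pull chunks and hand them to the player.
    while True:
        chunk = audio_buffer.get()
        if chunk is None:  # sentinel pushed after task-finished
            break
        play(chunk)  # hypothetical playback function

threading.Thread(target=playback_worker, daemon=True).start()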

Q: Why is speech synthesis slow?

You can troubleshoot the issue by following these steps:

  1. Check the input interval

    For streaming speech synthesis, check if the interval between sending text segments is too long (for example, waiting several seconds after one segment is sent before sending the next). Long intervals increase the total synthesis time.

  2. Analyze performance metrics

    • Time to First Byte (TTFB): This is typically around 500 ms.

    • Real-Time Factor (RTF): The ratio of the total synthesis time to the audio duration. This value should typically be less than 1.0. For example, if synthesizing 10 seconds of audio takes 5 seconds, the RTF is 0.5.

Q: How do I fix pronunciation errors in the synthesized speech?

You can use the SSML <phoneme> tag to specify the correct pronunciation.
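
For example, a hedged sketch of the call pattern (the exact <phoneme> attribute names and phonetic alphabet depend on the voice and are defined in the SSML documentation; the values below are illustrative placeholders):

# Sketch: set "enable_ssml": True in the run-task parameters, then send exactly
# ONE continue-task whose text is a complete <speak> document, followed by
# finish-task. The <phoneme> attributes here are placeholders; check the SSML
# reference for the syntax your voice supports.
ssml_text = (
    '<speak>'
    'The polyphone <phoneme alphabet="py" ph="zhong4">重</phoneme> is read "zhong4" here.'
    '</speak>'
)
# With the Python TTSClient above (and enable_ssml set to True):
#     self.send_continue_task(ssml_text)
#     self.send_finish_task()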

Q: Why is no audio returned? Why is part of the text at the end not converted to speech? (Missing speech)

Confirm you did not forget to send the finish-task instruction. During speech synthesis, the server waits until it has enough text in its cache before starting synthesis. If you forget to send the finish-task instruction, the text cached at the end may not be synthesized into speech.

Q: Why is the audio stream order scrambled, causing garbled playback?

Troubleshoot from two angles:

Q: How do I handle WebSocket connection errors?

  • How do I handle WebSocket connection closure (code 1007)?

    The WebSocket connection closes immediately after sending the run-task instruction, with close code 1007.

    • Root cause: The server detects protocol or data format errors and closes the connection. Common causes include the following:

      • Invalid fields in the run-task instruction's payload (e.g., adding fields other than "input": {}).

      • JSON format errors (e.g., missing commas or mismatched brackets).

      • Missing required fields (e.g., task_id, action).

    • Solution:

      1. Check JSON format: Validate the request body format.

      2. Check required fields: Confirm header.action, header.task_id, header.streaming, payload.task_group, payload.task, payload.function, payload.model, and payload.input are all set correctly.

      3. Remove invalid fields: In the run-task instruction's payload.input, allow only an empty object {} or a text field. Do not add other fields. A valid run-task instruction is shown after this Q&A.

  • How do I handle WebSocketBadStatus, 401 Unauthorized, or 403 Forbidden errors?

    A WebSocketBadStatus, 401 Unauthorized, or 403 Forbidden error occurs during connection setup.

    • Root cause: Authentication failure. The server validates the Authorization header during the WebSocket handshake. If the API key is invalid or missing, the connection is rejected.

    • Solution: See Troubleshooting authentication failures.
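
For reference, a run-task instruction that contains the required fields from the 1007 checklist above, with parameter values taken from the samples earlier in this topic:

{
    "header": {
        "action": "run-task",
        "task_id": "<32-character task ID>",
        "streaming": "duplex"
    },
    "payload": {
        "task_group": "audio",
        "task": "tts",
        "function": "SpeechSynthesizer",
        "model": "cosyvoice-v3-flash",
        "parameters": {
            "text_type": "PlainText",
            "voice": "longanyang",
            "format": "mp3",
            "sample_rate": 22050
        },
        "input": {}
    }
}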

Permissions and authentication

Q: How can I restrict my API key to be used only for the CosyVoice speech synthesis service and not for other Model Studio models (permission isolation)?

You can create a new workspace and grant authorization to specific models only to limit the scope of the API key. For more information, see Workspace management.

More questions

For answers to more questions, see the Q&A on GitHub.