
Alibaba Cloud Model Studio: CosyVoice WebSocket API for speech synthesis

Last Updated: Dec 15, 2025
Important

To use a model in the China (Beijing) region, go to the API key page for the China (Beijing) region.

This topic describes how to access the CosyVoice speech synthesis service using a WebSocket connection.

The DashScope SDK currently supports only Java and Python. For other programming languages, you can develop CosyVoice speech synthesis applications by communicating with the service directly through a WebSocket connection.

User guide: For more information about the models and guidance on model selection, see Real-time speech synthesis - CosyVoice.

WebSocket is a network protocol that supports full-duplex communication. The client and server establish a persistent connection with a single handshake, which allows both parties to actively push data to each other. This provides significant advantages in real-time performance and efficiency.

For common programming languages, many ready-to-use WebSocket libraries and examples are available, such as:

  • Go: gorilla/websocket

  • PHP: Ratchet

  • Node.js: ws

Familiarize yourself with the basic principles and technical details of WebSocket before you begin development.

Prerequisites

You have activated the Model Studio and created an API key. To prevent security risks, export the API key as an environment variable instead of hard-coding it in your code.

Note

To grant temporary access permissions to third-party applications or users, or if you want to strictly control high-risk operations such as accessing or deleting sensitive data, we recommend that you use a temporary authentication token.

Compared with long-term API keys, temporary authentication tokens are more secure because they are short-lived (60 seconds). They are suitable for temporary call scenarios and can effectively reduce the risk of API key leakage.

To use a temporary token, replace the API key used for authentication in your code with the temporary authentication token.

Models and pricing

Model                  Unit price
cosyvoice-v3-plus      $0.286706 per 10,000 characters
cosyvoice-v3-flash     $0.14335 per 10,000 characters
cosyvoice-v2           $0.286706 per 10,000 characters

Text limits and format specifications for speech synthesis

Text length limits

The length of the text sent for synthesis in a single call to the continue-task instruction cannot exceed 2,000 characters, and the total length of the text sent in multiple calls to the continue-task instruction cannot exceed 200,000 characters.

Character counting rules

  • A Chinese character, including simplified or traditional Chinese, Japanese kanji, and Korean hanja, is counted as 2 characters. All other characters, such as punctuation marks, letters, numbers, and Japanese or Korean kana or hangul, are counted as 1 character.

  • SSML tags are not included in the text length calculation.

  • Examples:

    • "你好" → 2(你) + 2(好) = 4 characters

    • "中A文123" → 2(中) + 1(A) + 2(文) + 1(1) + 1(2) + 1(3) = 8 characters

    • "中文。" → 2(中) + 2(文) + 1(。) = 5 characters

    • "中 文。" → 2(中) + 1(space) + 2(文) + 1(。) = 6 characters

    • "<speak>你好</speak>" → 2(你) + 2(好) = 4 characters
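
These rules can be sketched in Python. The helper below is illustrative only; the CJK code-point ranges are an approximation of the "Chinese character" rule, not the service's official counter:

```python
import re

def count_billable_chars(text):
    # SSML tags are excluded from the length calculation.
    text = re.sub(r"<[^>]+>", "", text)
    total = 0
    for ch in text:
        # CJK unified ideographs (simplified/traditional Chinese, Japanese
        # kanji, Korean hanja) count as 2 characters each.
        if "\u4e00" <= ch <= "\u9fff" or "\u3400" <= ch <= "\u4dbf":
            total += 2
        else:
            # Letters, digits, kana, hangul, punctuation, and spaces
            # count as 1 character each.
            total += 1
    return total
```

For example, count_billable_chars("中A文123") returns 8, matching the example above.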

Encoding format

Use UTF-8 encoding.

Support for mathematical expressions

The mathematical expression parsing feature is currently available only for the cosyvoice-v2, cosyvoice-v3-flash, and cosyvoice-v3-plus models. This feature supports common mathematical expressions from primary and secondary school, such as basic arithmetic, algebra, and geometry.

For more information, see Convert LaTeX formulas to speech.

SSML support

The Speech Synthesis Markup Language (SSML) feature is currently available only for cloned voices of the cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 models, and for system voices that are indicated as supported in the voice list.

To use this feature:

  1. When you send the run-task instruction, set the enable_ssml parameter to true to enable SSML support.

  2. Then, send the text that contains SSML using the continue-task instruction.

Important

After you enable SSML support by setting the enable_ssml parameter to true, you can submit the complete text for synthesis in only one continue-task instruction. Multiple submissions are not supported.
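
For illustration (task_id and voice are placeholders, and the SSML tags shown are generic SSML; see the SSML reference for the tags CosyVoice supports), an SSML-enabled exchange might look like this:

```json
// run-task: enable SSML in parameters.
{
    "header": { "action": "run-task", "task_id": "<task_id>", "streaming": "duplex" },
    "payload": {
        "task_group": "audio", "task": "tts", "function": "SpeechSynthesizer",
        "model": "cosyvoice-v2",
        "parameters": { "text_type": "PlainText", "voice": "<voice>", "enable_ssml": true },
        "input": {}
    }
}
// continue-task: the complete SSML text, sent exactly once.
{
    "header": { "action": "continue-task", "task_id": "<task_id>", "streaming": "duplex" },
    "payload": { "input": { "text": "<speak>Hello<break time=\"500ms\"/>world</speak>" } }
}
```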

Interaction flow


Messages sent from the client to the server are called instructions. The server returns two types of messages to the client: events in JSON format and binary audio streams.

The interaction flow between the client and the server is as follows:

  1. Establish a connection: The client establishes a WebSocket connection with the server.

  2. Start a task:

    • The client sends a run-task instruction to start a task.

    • The client receives a task-started event from the server. This indicates that the task has successfully started and you can proceed to the next steps.

  3. Send text for synthesis:

    The client sends one or more continue-task instructions to the server in sequence. These instructions contain the text to be synthesized. Once the server receives a complete sentence, it returns an audio stream. The length of the text is limited. For more information, see the description of the text field in the continue-task instruction.

    Note

    You can send multiple continue-task instructions to submit text fragments sequentially. The server automatically segments sentences after receiving the text fragments:

    • Complete sentences are synthesized immediately. The client can then receive the audio returned by the server.

    • Incomplete sentences are cached until they are complete, and then synthesized. The server does not return audio for incomplete sentences.

    When you send a finish-task instruction, the server forces the synthesis of all cached content.

  4. Notify the server to end the task:

    After all text is sent, the client sends a finish-task instruction to notify the server to end the task and continues to receive the audio stream from the server. This step is crucial. Otherwise, you may not receive the complete audio.

  5. End the task:

    The client receives a task-finished event from the server, which indicates that the task is finished.

  6. Close the connection: The client closes the WebSocket connection.

URL

The WebSocket URL is fixed as follows:

wss://dashscope.aliyuncs.com/api-ws/v1/inference

Headers

Add the following information to the request header:

{
    "Authorization": "bearer <your_dashscope_api_key>", // Required. Replace <your_dashscope_api_key> with your API key.
    "user-agent": "your_platform_info", // Optional.
    "X-DashScope-WorkSpace": workspace, // Optional. The ID of your workspace in Alibaba Cloud Model Studio.
    "X-DashScope-DataInspection": "enable"
}
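
As a sketch, these headers can be assembled in Python before opening the connection (build_headers is an illustrative helper, not part of the API; the exact argument for passing headers varies by WebSocket library):

```python
import os

def build_headers(api_key, workspace=None):
    # Authorization is required; the workspace header is optional.
    headers = {
        "Authorization": "bearer " + api_key,
        "user-agent": "my-tts-client",  # optional platform info
        "X-DashScope-DataInspection": "enable",
    }
    if workspace:
        headers["X-DashScope-WorkSpace"] = workspace
    return headers

# Read the key from an environment variable rather than hard-coding it.
headers = build_headers(os.getenv("DASHSCOPE_API_KEY", ""))
```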

Instructions (client to server)

Instructions are messages sent from the client to the server. They are in JSON format, sent as Text Frames, and are used to control the start and end of a task and identify task boundaries.

Send instructions in the following strict sequence. Otherwise, the task may fail.

  1. Send a run-task instruction

  2. Send a continue-task instruction

    • Sends the text to be synthesized.

    • You can send this instruction only after you receive the task-started event from the server.

  3. Send a finish-task instruction

1. run-task instruction: Start a task

This instruction starts a speech synthesis task. You can set request parameters such as the voice and sample rate in this instruction.

Important
  • When to send: After the WebSocket connection is established.

  • Do not send text for synthesis: Including text in this instruction makes troubleshooting difficult. Send text only in continue-task instructions.

Example:

{
    "header": {
        "action": "run-task",
        "task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx", // Random UUID.
        "streaming": "duplex"
    },
    "payload": {
        "task_group": "audio",
        "task": "tts",
        "function": "SpeechSynthesizer",
        "model": "cosyvoice-v2",
        "parameters": {
            "text_type": "PlainText",
            "voice": "longxiaochun_v2",            // Voice
            "format": "mp3",            // Audio format
            "sample_rate": 22050,       // Sample rate
            "volume": 50,               // Volume
            "rate": 1,                  // Speech rate
            "pitch": 1                  // Pitch
        },
        "input": {// The input field cannot be omitted. Otherwise, an error is reported.
        }
    }
}
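
A minimal Python helper that assembles this instruction (a sketch; the parameter values are the examples from above):

```python
import json
import uuid

def build_run_task(model="cosyvoice-v2", voice="longxiaochun_v2"):
    # One random 32-character task_id; reuse it for every later instruction.
    task_id = uuid.uuid4().hex
    message = {
        "header": {"action": "run-task", "task_id": task_id, "streaming": "duplex"},
        "payload": {
            "task_group": "audio",
            "task": "tts",
            "function": "SpeechSynthesizer",
            "model": model,
            "parameters": {
                "text_type": "PlainText",
                "voice": voice,
                "format": "mp3",
                "sample_rate": 22050,
            },
            "input": {},  # required even when empty
        },
    }
    return task_id, json.dumps(message)
```

Send the returned JSON string as a text frame over the established WebSocket connection.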

header parameters:

Parameter

Type

Required

Description

header.action

string

Yes

The instruction type.

For this instruction, the value is fixed as "run-task".

header.task_id

string

Yes

The ID of the current task.

It is a universally unique identifier (UUID) composed of 32 randomly generated hexadecimal characters. It can include hyphens (for example, "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx") or not (for example, "2bf83b9abaeb4fda8d9axxxxxxxxxxxx"). Most programming languages have built-in APIs to generate UUIDs. For example, in Python:

import uuid

def generate_task_id():
    # Generate a random 32-character UUID without hyphens.
    return uuid.uuid4().hex

When you later send continue-task instructions and finish-task instructions, the task_id used must be the same as the one used when sending the run-task instruction.

header.streaming

string

Yes

Fixed string: "duplex"

payload parameters:

Parameter

Type

Required

Description

payload.task_group

string

Yes

Fixed string: "audio".

payload.task

string

Yes

Fixed string: "tts".

payload.function

string

Yes

Fixed string: "SpeechSynthesizer".

payload.model

string

Yes

The speech synthesis model.

Different models require corresponding voices:

  • cosyvoice-v3-flash/cosyvoice-v3-plus: Use voices such as longanyang.

  • cosyvoice-v2: Use voices such as longxiaochun_v2.

  • For a complete list, see Voice list.

payload.input

object

Yes

  • If you do not send the text to be synthesized at this time, the format of input is:

    "input": {}
  • If you send the text to be synthesized at this time, the format of input is:

    "input": {
      "text": "What is the weather like today?" // Text to be synthesized.
    }

payload.parameters

text_type

string

Yes

Fixed string: "PlainText".

voice

string

Yes

The voice to use for speech synthesis.

System voices and cloned voices are supported:

  • System voices: See Voice list.

  • Cloned voices: Customize voices using the voice cloning feature. When you use a cloned voice, make sure that the same account is used for both voice cloning and speech synthesis. For detailed steps, see CosyVoice Voice Cloning API.

    When you use a cloned voice, the value of the model parameter in the request must be the same as the model version used to create the voice (the target_model parameter).

format

string

No

The audio coding format.

Supported formats are pcm, wav, mp3 (default), and opus.

When the audio format is opus, you can adjust the bitrate using the bit_rate parameter.

sample_rate

integer

No

The audio sample rate (in Hz).

Default value: 22050.

Valid values: 8000, 16000, 22050, 24000, 44100, 48000.

Note

The default sample rate is the optimal rate for the current voice. By default, the output uses this sample rate. Downsampling and upsampling are also supported.

volume

integer

No

The volume.

Default value: 50.

Value range: [0, 100]. A value of 50 is the standard volume. The volume has a linear relationship with this value. 0 is mute and 100 is the maximum volume.

rate

float

No

The speech rate.

Default value: 1.0.

Value range: [0.5, 2.0]. A value of 1.0 is the standard rate. Values less than 1.0 slow down the speech, and values greater than 1.0 speed it up.

pitch

float

No

The pitch. This value is a multiplier for pitch adjustment. The relationship between this value and the perceived pitch is not strictly linear or logarithmic. Test different values to find the best one.

Default value: 1.0.

Value range: [0.5, 2.0]. A value of 1.0 is the natural pitch of the voice. Values greater than 1.0 raise the pitch, and values less than 1.0 lower it.

enable_ssml

boolean

No

Specifies whether to enable the SSML feature.

If this parameter is set to true, you can send the text only once. Plain text or text that contains SSML is supported.

bit_rate

integer

No

The audio bitrate in kbps. If the audio format is Opus, you can adjust the bitrate using the bit_rate parameter.

Default value: 32.

Value range: [6, 510].

word_timestamp_enabled

boolean

No

Specifies whether to enable character-level timestamps.

Default value: false.

  • true

  • false

This feature is available only for cloned voices of the cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 models, and for system voices marked as supported in the voice list.

seed

integer

No

The random number seed used during generation. Different seeds vary the synthesis result. If the model version, text, voice, and other parameters are the same, using the same seed reproduces the same synthesis result.

Default value: 0.

Value range: [0, 65535].

language_hints

array[string]

No

Provides language hints. Only cosyvoice-v3-flash and cosyvoice-v3-plus support this feature.

No default value. This parameter has no effect if it is not set.

This parameter has the following effects in speech synthesis:

  1. Specifies the language for Text Normalization (TN) processing, which affects how numbers, abbreviations, and symbols are read. This is effective only for Chinese and English.

    Value range:

    • zh: Chinese

    • en: English

  2. Specifies the target language for speech synthesis (for cloned voices only) to improve synthesis accuracy. This is effective for English, French, German, Japanese, Korean, and Russian. You do not need to specify Chinese. The value must be consistent with the languageHints/language_hints used during voice cloning.

    Value range:

    • en: English

    • fr: French

    • de: German

    • ja: Japanese

    • ko: Korean

    • ru: Russian

If the specified language hint clearly does not match the text content, for example, setting en for Chinese text, the hint is ignored, and the language is automatically detected based on the text content.

Note: This parameter is an array, but the current version processes only the first element. Therefore, pass only one value.

instruction

string

No

Sets an instruction. This feature is available only for cloned voices of the cosyvoice-v3-flash and cosyvoice-v3-plus models, and for system voices marked as supported in the voice list.

No default value. This parameter has no effect if it is not set.

The instruction has the following effects in speech synthesis:

  1. Specifies a non-Chinese language (for cloned voices only)

    • Format: "You will say it in <language>." (Note: Do not omit the period at the end. Replace "<language>" with a specific language, for example, German.)

    • Example: "You will say it in German."

    • Supported languages: French, German, Japanese, Korean, and Russian.

  2. Specifies a dialect (for cloned voices only)

    • Format: "Say it in <dialect>." (Note: Do not omit the period at the end. Replace "<dialect>" with a specific dialect, for example, Cantonese.)

    • Example: "Say it in Cantonese."

    • Supported dialects: Cantonese, Dongbei, Gansu, Guizhou, Henan, Hubei, Jiangxi, Minnan, Ningxia, Shanxi, Shaanxi, Shandong, Shanghainese, Sichuan, Tianjin, and Yunnan.

  3. Specifies emotion, scenario, role, or identity. Only some system voices support this feature, and it varies by voice. For more information, see Voice list.

enable_aigc_tag

boolean

No

Specifies whether to add an invisible AIGC identifier to the generated audio. When set to true, an invisible identifier is embedded into the audio in supported formats (WAV, MP3, and Opus).

Default value: false.

This feature is available only for the cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 models.

aigc_propagator

string

No

Sets the ContentPropagator field in the AIGC invisible identifier to specify the content propagator. This setting takes effect only when enable_aigc_tag is true.

Default value: Alibaba Cloud UID.

This feature is available only for the cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 models.

aigc_propagate_id

string

No

Sets the PropagateID field in the AIGC invisible identifier to uniquely identify a specific propagation behavior. This field takes effect only when enable_aigc_tag is set to true.

Default value: The request ID of the current speech synthesis request.

This feature is available only for the cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 models.

2. continue-task instruction

This instruction is used exclusively to send the text to be synthesized.

You can send the text to be synthesized all at once in a single continue-task instruction, or you can segment the text and send it sequentially in multiple continue-task instructions.

Important

When to send: After you receive the task-started event.

Note

The interval between sending text fragments cannot exceed 23 seconds. Otherwise, a "request timeout after 23 seconds" exception is triggered.

If there is no more text to send, promptly send a finish-task instruction to end the task.

The server enforces a 23-second timeout. The client cannot modify this configuration.

Example:

{
    "header": {
        "action": "continue-task",
        "task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx", // Random UUID.
        "streaming": "duplex"
    },
    "payload": {
        "input": {
            "text": "A quiet night thought, I see the moonlight before my bed"
        }
    }
}
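
This instruction is sent as a text frame. A sketch of building it for a given task_id (build_continue_task is an illustrative helper):

```python
import json

def build_continue_task(task_id, text):
    # task_id must match the one used in the run-task instruction.
    return json.dumps({
        "header": {"action": "continue-task", "task_id": task_id, "streaming": "duplex"},
        "payload": {"input": {"text": text}},
    })
```

Call it once per text fragment, in order, keeping the interval between fragments under the 23-second timeout.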

header parameters:

Parameter

Type

Required

Description

header.action

string

Yes

The instruction type.

For this instruction, the value is fixed as "continue-task".

header.task_id

string

Yes

The ID of the current task.

This must be the same as the task_id used when sending the run-task instruction.

header.streaming

string

Yes

Fixed string: "duplex"

payload parameters:

Parameter

Type

Required

Description

input.text

string

Yes

The text to be synthesized.

3. finish-task instruction: End a task

This instruction ends a speech synthesis task.

You must send this instruction. Otherwise, the synthesized speech may be incomplete.

After this instruction is sent, the server converts the remaining text into speech. After the speech synthesis is complete, the server returns a task-finished event to the client.

Important

When to send: After all continue-task instructions have been sent.

Example:

{
    "header": {
        "action": "finish-task",
        "task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
        "streaming": "duplex"
    },
    "payload": {
        "input": {}// The input field cannot be omitted. Otherwise, an error is reported.
    }
}

header parameters:

Parameter

Type

Required

Description

header.action

string

Yes

The instruction type.

For this instruction, the value is fixed as "finish-task".

header.task_id

string

Yes

The ID of the current task.

This must be the same as the task_id used when sending the run-task instruction.

header.streaming

string

Yes

Fixed string: "duplex"

payload parameters:

Parameter

Type

Required

Description

payload.input

object

Yes

Fixed format: {}.

Events (server to client)

Events are messages returned from the server to the client. They are in JSON format and represent different processing stages.

Note

The binary audio returned from the server to the client is not included in any event and must be received separately.
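
Because audio arrives as binary frames and events arrive as text frames, a receiver can branch on the frame type. A sketch (classify_message is an illustrative helper):

```python
import json

def classify_message(message):
    # Binary frames carry audio; text frames carry JSON events.
    if isinstance(message, (bytes, bytearray)):
        return "audio", bytes(message)
    event = json.loads(message)
    return event["header"]["event"], event
```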

1. task-started event: Task has started

When you receive the task-started event from the server, it indicates that the task has started successfully. You can send a continue-task instruction or a finish-task instruction to the server only after receiving this event. Otherwise, the task will fail.

The task-started event's payload is empty.

Example:

{
    "header": {
        "task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
        "event": "task-started",
        "attributes": {}
    },
    "payload": {}
}

header parameters:

Parameter

Type

Description

header.event

string

The event type.

For this event, the value is fixed as "task-started".

header.task_id

string

The task_id generated by the client.

2. result-generated event

While the client sends continue-task instructions and a finish-task instruction, the server continuously returns result-generated events.

In the CosyVoice service, the result-generated event is a reserved interface in the protocol. It encapsulates information such as the Request ID and can be ignored.

Example:

{
    "header": {
        "task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
        "event": "result-generated",
        "attributes": {
            "request_uuid": "0a9dba9e-d3a6-45a4-be6d-xxxxxxxxxxxx"
        }
    },
    "payload": {}
}

header parameters:

Parameter

Type

Description

header.event

string

The event type.

For this event, the value is fixed as "result-generated".

header.task_id

string

The task_id generated by the client.

header.attributes.request_uuid

string

The Request ID.

payload parameters:

Parameter

Type

Description

payload.usage.characters

integer

The number of billable characters in the current request so far. In a single task, usage may appear in a result-generated event or a task-finished event. The returned usage field is a cumulative result. Use the value from the last event.

3. task-finished event: Task has finished

When you receive the task-finished event from the server, it indicates that the task is finished.

After the task is finished, you can close the WebSocket connection to end the program, or you can reuse the WebSocket connection and resend the run-task instruction to start the next task. For more information, see Connection establishment overhead and connection reuse.

Example:

{
    "header": {
        "task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
        "event": "task-finished",
        "attributes": {
            "request_uuid": "0a9dba9e-d3a6-45a4-be6d-xxxxxxxxxxxx"
        }
    },
    "payload": {
        "output": {
            "sentence": {
                "words": []
            }
        },
        "usage": {
            "characters": 13
        }
    }
}

header parameters:

Parameter

Type

Description

header.event

string

The event type.

For this event, the value is fixed as "task-finished".

header.task_id

string

The task_id generated by the client.

header.attributes.request_uuid

string

The Request ID. You can provide this to CosyVoice developers to locate issues.

payload parameters:

Parameter

Type

Description

payload.usage.characters

integer

The number of billable characters used in the current request. In a task, the usage field may appear in the result-generated event or the task-finished event. The returned usage value is cumulative. Use the final value that is returned for the task.

payload.output.sentence.index

integer

The sequence number of the sentence, starting from 0.

This field and the following fields are returned only when word-level timestamps are enabled using word_timestamp_enabled.

payload.output.sentence.words[k]

text

string

The text of the word.

begin_index

integer

The start position index of the word in the sentence, starting from 0.

end_index

integer

The end position index of the word in the sentence, starting from 1.

begin_time

integer

The start timestamp of the audio corresponding to the word, in milliseconds.

end_time

integer

The end timestamp of the audio corresponding to the word, in milliseconds.

After you enable word-level timestamps using word_timestamp_enabled, timestamp information is returned. The following is an example:


{
    "header": {
        "task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
        "event": "task-finished",
        "attributes": {"request_uuid": "0a9dba9e-d3a6-45a4-be6d-xxxxxxxxxxxx"}
    },
    "payload": {
        "output": {
            "sentence": {
                "index": 0,
                "words": [
                    {
                        "text": "What",
                        "begin_index": 0,
                        "end_index": 4,
                        "begin_time": 80,
                        "end_time": 200
                    },
                    {
                        "text": "is",
                        "begin_index": 5,
                        "end_index": 7,
                        "begin_time": 240,
                        "end_time": 360
                    },
                    {
                        "text": "the",
                        "begin_index": 8,
                        "end_index": 11,
                        "begin_time": 360,
                        "end_time": 480
                    },
                    {
                        "text": "weather",
                        "begin_index": 12,
                        "end_index": 19,
                        "begin_time": 480,
                        "end_time": 680
                    },
                    {
                        "text": "like",
                        "begin_index": 20,
                        "end_index": 24,
                        "begin_time": 680,
                        "end_time": 800
                    },
                    {
                        "text": "today",
                        "begin_index": 25,
                        "end_index": 30,
                        "begin_time": 800,
                        "end_time": 920
                    },
                    {
                        "text": "?",
                        "begin_index": 30,
                        "end_index": 31,
                        "begin_time": 920,
                        "end_time": 1320
                    }
                ]
            }
        },
        "usage": {"characters": 31}
    }
}
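
A sketch that pulls (text, begin_time, end_time) tuples out of such a payload (extract_word_timings is an illustrative helper):

```python
def extract_word_timings(payload):
    # Returns (text, begin_time_ms, end_time_ms) for each word; empty if
    # word-level timestamps were not enabled.
    sentence = payload.get("output", {}).get("sentence", {})
    return [(w["text"], w["begin_time"], w["end_time"])
            for w in sentence.get("words", [])]
```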

4. task-failed event: Task has failed

If you receive a task-failed event, it indicates that the task has failed. At this point, close the WebSocket connection and handle the error. You can analyze the error message to identify and fix programming issues in your code.

Example:

{
    "header": {
        "task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
        "event": "task-failed",
        "error_code": "InvalidParameter",
        "error_message": "[tts:]Engine return error code: 418",
        "attributes": {}
    },
    "payload": {}
}

header parameters:

Parameter

Type

Description

header.event

string

The event type.

For this event, the value is fixed as "task-failed".

header.task_id

string

The task_id generated by the client.

header.error_code

string

A description of the error type.

header.error_message

string

The specific reason for the error.

Connection overhead and reuse

The WebSocket service supports connection reuse to improve resource utilization and avoid connection establishment overhead.

After the server receives a run-task instruction from the client, it starts a new task. After the client sends a finish-task instruction, the server returns a task-finished event to end the task when it is complete. After a task is finished, the WebSocket connection can be reused. The client can send another run-task instruction to start the next task.

Important
  1. Different tasks in a reused connection must use different task_ids.

  2. If a task fails during execution, the service will still return a task-failed event and close the connection. This connection cannot be reused.

  3. If there are no new tasks for 60 seconds after a task is finished, the connection will time out and be automatically disconnected.

Sample code

The sample code provides a basic implementation to demonstrate how the service works. You must adapt the code for your specific business scenarios.

When you write a WebSocket client, asynchronous programming is typically used to send and receive messages simultaneously. You can write your program by following these steps:

  1. Establish a WebSocket connection

    Call a WebSocket library function (the specific implementation varies by programming language or library) and pass the Headers and URL to establish a WebSocket connection.

  2. Listen for server messages

    You can listen for messages returned by the server using callback functions (observer pattern) provided by the WebSocket library. The specific implementation varies by programming language.

    The server returns two types of messages: binary audio streams and events.

    Listen for events

    • task-started: When you receive a task-started event, it indicates that the task has started successfully. You can send a continue-task instruction or a finish-task instruction to the server only after this event is triggered. Otherwise, the task will fail.

    • result-generated (can be ignored): When the client sends a continue-task instruction or a finish-task instruction, the server may continuously return result-generated events. In the current CosyVoice service, this event is a reserved interface in the protocol and can be ignored.

    • task-finished: When you receive a task-finished event, it indicates that the task is complete. At this point, you can close the WebSocket connection and end the program.

    • task-failed: If you receive a task-failed event, it indicates that the task has failed. Close the WebSocket connection and adjust your code based on the error message to fix the issue.

    Process binary audio streams: The server sends the audio stream in frames through the binary channel. The complete audio data is transmitted in multiple packets.

    • In streaming speech synthesis, for compressed formats such as MP3 and Opus, use a streaming player to play the audio segments. Do not play them frame by frame to avoid decoding failures.

      Players that support streaming playback include ffmpeg, pyaudio (Python), AudioFormat (Java), and MediaSource (JavaScript).
    • When combining audio data into a complete audio file, append the data to the same file.

    • For WAV and MP3 audio formats in streaming speech synthesis, only the first frame contains header information. Subsequent frames contain only audio data.
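
    The append-to-one-file approach can be sketched as follows (append_audio_frame is an illustrative helper):

```python
def append_audio_frame(path, frame):
    # Append each binary frame to the same file. For WAV and MP3, only the
    # first frame contains the header, so plain concatenation yields a
    # playable file.
    with open(path, "ab") as f:
        f.write(frame)
```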

  3. Send messages to the server (pay close attention to the sequence)

    In a thread different from the one listening for server messages (such as the main thread; the specific implementation varies by programming language), send instructions to the server.

    Send instructions in the following strict sequence. Otherwise, the task may fail.

    1. Send a run-task instruction

    2. Send a continue-task instruction

      • Sends the text to be synthesized.

      • You can send this instruction only after you receive the task-started event from the server.

    3. Send a finish-task instruction

  4. Close the WebSocket connection

    When the program ends normally, an exception occurs during runtime, or you receive a task-finished event or a task-failed event, close the WebSocket connection. This is usually done by calling the close function in the utility library.

Complete examples

Important: Different models require corresponding voices:

  • cosyvoice-v3-flash/cosyvoice-v3-plus: Use voices such as longanyang

  • cosyvoice-v2: Use voices such as longxiaochun_v2

  • For a complete list, see Voice list.

Go

package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"os"
	"time"

	"github.com/google/uuid"
	"github.com/gorilla/websocket"
)

const (
	wsURL      = "wss://dashscope.aliyuncs.com/api-ws/v1/inference/" // WebSocket server endpoint.
	outputFile = "output.mp3"                                        // Output file path.
)

func main() {
	// If you have not configured the API key as an environment variable, you can replace the next line with: apiKey := "your_api_key". We do not recommend hard coding the API key in your code in a production environment to reduce the risk of API key leaks.
	apiKey := os.Getenv("DASHSCOPE_API_KEY")
	// Check and clear the output file.
	if err := clearOutputFile(outputFile); err != nil {
		fmt.Println("Failed to clear output file: ", err)
		return
	}

	// Connect to the WebSocket service.
	conn, err := connectWebSocket(apiKey)
	if err != nil {
		fmt.Println("Failed to connect to WebSocket: ", err)
		return
	}
	defer closeConnection(conn)

	// Start a goroutine to receive results.
	done, taskStarted := startResultReceiver(conn)

	// Send the run-task instruction.
	taskID, err := sendRunTaskCmd(conn)
	if err != nil {
		fmt.Println("Failed to send run-task instruction: ", err)
		return
	}

	// Wait for the task-started event.
	// Note: polling a shared bool is a simplification for this example.
	// Production code should use a channel or sync/atomic to avoid a data race.
	for !*taskStarted {
		time.Sleep(100 * time.Millisecond)
	}

	// Send the text to be synthesized.
	if err := sendContinueTaskCmd(conn, taskID); err != nil {
		fmt.Println("Failed to send text for synthesis: ", err)
		return
	}

	// Send the finish-task instruction.
	if err := sendFinishTaskCmd(conn, taskID); err != nil {
		fmt.Println("Failed to send finish-task instruction: ", err)
		return
	}

	// Wait for the goroutine that receives results to complete.
	<-done
}

var dialer = websocket.DefaultDialer

// Define structs to represent JSON data.
type Header struct {
	Action       string                 `json:"action"`
	TaskID       string                 `json:"task_id"`
	Streaming    string                 `json:"streaming"`
	Event        string                 `json:"event"`
	ErrorCode    string                 `json:"error_code,omitempty"`
	ErrorMessage string                 `json:"error_message,omitempty"`
	Attributes   map[string]interface{} `json:"attributes"`
}

type Payload struct {
	TaskGroup  string     `json:"task_group"`
	Task       string     `json:"task"`
	Function   string     `json:"function"`
	Model      string     `json:"model"`
	Parameters Params     `json:"parameters"`
	Resources  []Resource `json:"resources"`
	Input      Input      `json:"input"`
}

type Params struct {
	TextType   string `json:"text_type"`
	Voice      string `json:"voice"`
	Format     string `json:"format"`
	SampleRate int    `json:"sample_rate"`
	Volume     int    `json:"volume"`
	Rate       int    `json:"rate"`
	Pitch      int    `json:"pitch"`
}

type Resource struct {
	ResourceID   string `json:"resource_id"`
	ResourceType string `json:"resource_type"`
}

type Input struct {
	Text string `json:"text"`
}

type Event struct {
	Header  Header  `json:"header"`
	Payload Payload `json:"payload"`
}

// Connect to the WebSocket service.
func connectWebSocket(apiKey string) (*websocket.Conn, error) {
	header := make(http.Header)
	header.Add("X-DashScope-DataInspection", "enable")
	header.Add("Authorization", fmt.Sprintf("bearer %s", apiKey))
	conn, _, err := dialer.Dial(wsURL, header)
	if err != nil {
		fmt.Println("Failed to connect to WebSocket: ", err)
		return nil, err
	}
	return conn, nil
}

// Send the run-task instruction.
func sendRunTaskCmd(conn *websocket.Conn) (string, error) {
	runTaskCmd, taskID, err := generateRunTaskCmd()
	if err != nil {
		return "", err
	}
	err = conn.WriteMessage(websocket.TextMessage, []byte(runTaskCmd))
	return taskID, err
}

// Generate the run-task instruction.
func generateRunTaskCmd() (string, string, error) {
	taskID := uuid.New().String()
	runTaskCmd := Event{
		Header: Header{
			Action:    "run-task",
			TaskID:    taskID,
			Streaming: "duplex",
		},
		Payload: Payload{
			TaskGroup: "audio",
			Task:      "tts",
			Function:  "SpeechSynthesizer",
			Model:     "cosyvoice-v3-flash",
			Parameters: Params{
				TextType:   "PlainText",
				Voice:      "longanyang",
				Format:     "mp3",
				SampleRate: 22050,
				Volume:     50,
				Rate:       1,
				Pitch:      1,
			},
			Input: Input{},
		},
	}
	runTaskCmdJSON, err := json.Marshal(runTaskCmd)
	return string(runTaskCmdJSON), taskID, err
}

// Send the text to be synthesized.
func sendContinueTaskCmd(conn *websocket.Conn, taskID string) error {
	texts := []string{"A quiet night thought,", "I see the moonlight before my bed.", "I lift my head and watch the moon.", "I lower my head and think of home."}

	for _, text := range texts {
		runTaskCmd, err := generateContinueTaskCmd(text, taskID)
		if err != nil {
			return err
		}

		err = conn.WriteMessage(websocket.TextMessage, []byte(runTaskCmd))
		if err != nil {
			return err
		}
	}

	return nil
}

// Generate the continue-task instruction.
func generateContinueTaskCmd(text string, taskID string) (string, error) {
	runTaskCmd := Event{
		Header: Header{
			Action:    "continue-task",
			TaskID:    taskID,
			Streaming: "duplex",
		},
		Payload: Payload{
			Input: Input{
				Text: text,
			},
		},
	}
	runTaskCmdJSON, err := json.Marshal(runTaskCmd)
	return string(runTaskCmdJSON), err
}

// Start a goroutine to receive results.
func startResultReceiver(conn *websocket.Conn) (chan struct{}, *bool) {
	done := make(chan struct{})
	taskStarted := new(bool)
	*taskStarted = false

	go func() {
		defer close(done)
		for {
			msgType, message, err := conn.ReadMessage()
			if err != nil {
				fmt.Println("Failed to read server message: ", err)
				return
			}

			if msgType == websocket.BinaryMessage {
				// Process the binary audio stream.
				if err := writeBinaryDataToFile(message, outputFile); err != nil {
					fmt.Println("Failed to write binary data: ", err)
					return
				}
			} else {
				// Process the text message.
				var event Event
				err = json.Unmarshal(message, &event)
				if err != nil {
					fmt.Println("Failed to parse event: ", err)
					continue
				}
				if handleEvent(conn, event, taskStarted) {
					return
				}
			}
		}
	}()

	return done, taskStarted
}

// Handle events.
func handleEvent(conn *websocket.Conn, event Event, taskStarted *bool) bool {
	switch event.Header.Event {
	case "task-started":
		fmt.Println("Received task-started event")
		*taskStarted = true
	case "result-generated":
		// Ignore the result-generated event.
		return false
	case "task-finished":
		fmt.Println("Task finished")
		return true
	case "task-failed":
		handleTaskFailed(event, conn)
		return true
	default:
		fmt.Printf("Unexpected event: %v\n", event)
	}
	return false
}

// Handle the task-failed event.
func handleTaskFailed(event Event, conn *websocket.Conn) {
	if event.Header.ErrorMessage != "" {
		fmt.Printf("Task failed: %s\n", event.Header.ErrorMessage)
	} else {
		fmt.Println("Task failed for an unknown reason")
	}
}

// Close the connection.
func closeConnection(conn *websocket.Conn) {
	if conn != nil {
		conn.Close()
	}
}

// Write binary data to a file.
func writeBinaryDataToFile(data []byte, filePath string) error {
	file, err := os.OpenFile(filePath, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0644)
	if err != nil {
		return err
	}
	defer file.Close()

	_, err = file.Write(data)
	if err != nil {
		return err
	}

	return nil
}

// Send the finish-task instruction.
func sendFinishTaskCmd(conn *websocket.Conn, taskID string) error {
	finishTaskCmd, err := generateFinishTaskCmd(taskID)
	if err != nil {
		return err
	}
	err = conn.WriteMessage(websocket.TextMessage, []byte(finishTaskCmd))
	return err
}

// Generate the finish-task instruction.
func generateFinishTaskCmd(taskID string) (string, error) {
	finishTaskCmd := Event{
		Header: Header{
			Action:    "finish-task",
			TaskID:    taskID,
			Streaming: "duplex",
		},
		Payload: Payload{
			Input: Input{},
		},
	}
	finishTaskCmdJSON, err := json.Marshal(finishTaskCmd)
	return string(finishTaskCmdJSON), err
}

// Clear the output file.
func clearOutputFile(filePath string) error {
	file, err := os.OpenFile(filePath, os.O_TRUNC|os.O_CREATE|os.O_WRONLY, 0644)
	if err != nil {
		return err
	}
	file.Close()
	return nil
}

C#

using System.Net.WebSockets;
using System.Text;
using System.Text.Json;

class Program {
    // If you have not configured the API key as an environment variable, you can replace the next line with: private const string ApiKey="your_api_key";. We do not recommend hard coding the API key in your code in a production environment to reduce the risk of API key leaks.
    private static readonly string ApiKey = Environment.GetEnvironmentVariable("DASHSCOPE_API_KEY") ?? throw new InvalidOperationException("DASHSCOPE_API_KEY environment variable is not set.");

    // WebSocket server endpoint.
    private const string WebSocketUrl = "wss://dashscope.aliyuncs.com/api-ws/v1/inference/";
    // Output file path.
    private const string OutputFilePath = "output.mp3";

    // WebSocket client.
    private static ClientWebSocket _webSocket = new ClientWebSocket();
    // Cancellation token source.
    private static CancellationTokenSource _cancellationTokenSource = new CancellationTokenSource();
    // Task ID.
    private static string? _taskId;
    // Whether the task has started.
    private static TaskCompletionSource<bool> _taskStartedTcs = new TaskCompletionSource<bool>();

    static async Task Main(string[] args) {
        try {
            // Clear the output file.
            ClearOutputFile(OutputFilePath);

            // Connect to the WebSocket service.
            await ConnectToWebSocketAsync(WebSocketUrl);

            // Start the task to receive messages.
            Task receiveTask = ReceiveMessagesAsync();

            // Send the run-task instruction.
            _taskId = GenerateTaskId();
            await SendRunTaskCommandAsync(_taskId);

            // Wait for the task-started event.
            await _taskStartedTcs.Task;

            // Continuously send continue-task instructions.
            string[] texts = {
                "A quiet night thought,",
                "I see the moonlight before my bed,",
                "I lift my head and watch the moon,",
                "I lower my head and think of home."
            };
            foreach (string text in texts) {
                await SendContinueTaskCommandAsync(text);
            }

            // Send the finish-task instruction.
            await SendFinishTaskCommandAsync(_taskId);

            // Wait for the receiving task to complete.
            await receiveTask;

            Console.WriteLine("Task finished, connection closed.");
        } catch (OperationCanceledException) {
            Console.WriteLine("Task was canceled.");
        } catch (Exception ex) {
            Console.WriteLine($"An error occurred: {ex.Message}");
        } finally {
            _cancellationTokenSource.Cancel();
            _webSocket.Dispose();
        }
    }

    private static void ClearOutputFile(string filePath) {
        if (File.Exists(filePath)) {
            File.WriteAllText(filePath, string.Empty);
            Console.WriteLine("Output file cleared.");
        } else {
            Console.WriteLine("Output file does not exist, no need to clear.");
        }
    }

    private static async Task ConnectToWebSocketAsync(string url) {
        var uri = new Uri(url);
        if (_webSocket.State == WebSocketState.Connecting || _webSocket.State == WebSocketState.Open) {
            return;
        }

        // Set the header information for the WebSocket connection.
        _webSocket.Options.SetRequestHeader("Authorization", $"bearer {ApiKey}");
        _webSocket.Options.SetRequestHeader("X-DashScope-DataInspection", "enable");

        try {
            await _webSocket.ConnectAsync(uri, _cancellationTokenSource.Token);
            Console.WriteLine("Successfully connected to the WebSocket service.");
        } catch (OperationCanceledException) {
            Console.WriteLine("WebSocket connection was canceled.");
        } catch (Exception ex) {
            Console.WriteLine($"WebSocket connection failed: {ex.Message}");
            throw;
        }
    }

    private static async Task SendRunTaskCommandAsync(string taskId) {
        var command = CreateCommand("run-task", taskId, "duplex", new {
            task_group = "audio",
            task = "tts",
            function = "SpeechSynthesizer",
            model = "cosyvoice-v3-flash",
            parameters = new
            {
                text_type = "PlainText",
                voice = "longanyang",
                format = "mp3",
                sample_rate = 22050,
                volume = 50,
                rate = 1,
                pitch = 1
            },
            input = new { }
        });

        await SendJsonMessageAsync(command);
        Console.WriteLine("run-task instruction sent.");
    }

    private static async Task SendContinueTaskCommandAsync(string text) {
        if (_taskId == null) {
            throw new InvalidOperationException("Task ID is not initialized.");
        }

        var command = CreateCommand("continue-task", _taskId, "duplex", new {
            input = new {
                text
            }
        });

        await SendJsonMessageAsync(command);
        Console.WriteLine("continue-task instruction sent.");
    }

    private static async Task SendFinishTaskCommandAsync(string taskId) {
        var command = CreateCommand("finish-task", taskId, "duplex", new {
            input = new { }
        });

        await SendJsonMessageAsync(command);
        Console.WriteLine("finish-task instruction sent.");
    }

    private static async Task SendJsonMessageAsync(string message) {
        var buffer = Encoding.UTF8.GetBytes(message);
        try {
            await _webSocket.SendAsync(new ArraySegment<byte>(buffer), WebSocketMessageType.Text, true, _cancellationTokenSource.Token);
        } catch (OperationCanceledException) {
            Console.WriteLine("Message sending was canceled.");
        }
    }

    private static async Task ReceiveMessagesAsync() {
        while (_webSocket.State == WebSocketState.Open) {
            var response = await ReceiveMessageAsync();
            if (response != null) {
                var eventStr = response.RootElement.GetProperty("header").GetProperty("event").GetString();
                switch (eventStr) {
                    case "task-started":
                        Console.WriteLine("Task has started.");
                        _taskStartedTcs.TrySetResult(true);
                        break;
                    case "task-finished":
                        Console.WriteLine("Task has finished.");
                        _cancellationTokenSource.Cancel();
                        break;
                    case "task-failed":
                        Console.WriteLine("Task failed.");
                        _cancellationTokenSource.Cancel();
                        break;
                    default:
                        // result-generated can be handled here.
                        break;
                }
            }
        }
    }

    private static async Task<JsonDocument?> ReceiveMessageAsync() {
        var buffer = new byte[1024 * 4];

        try {
            // Accumulate fragments until EndOfMessage: a single audio or JSON
            // message can be larger than the receive buffer.
            using var ms = new MemoryStream();
            WebSocketReceiveResult result;
            do {
                result = await _webSocket.ReceiveAsync(new ArraySegment<byte>(buffer), _cancellationTokenSource.Token);

                if (result.MessageType == WebSocketMessageType.Close) {
                    await _webSocket.CloseAsync(WebSocketCloseStatus.NormalClosure, "Closing", _cancellationTokenSource.Token);
                    return null;
                }

                ms.Write(buffer, 0, result.Count);
            } while (!result.EndOfMessage);

            if (result.MessageType == WebSocketMessageType.Binary) {
                Console.WriteLine("Receiving binary data...");

                // Append the complete binary audio frame to the output file.
                using (var fileStream = new FileStream(OutputFilePath, FileMode.Append)) {
                    ms.WriteTo(fileStream);
                }

                return null;
            }

            string message = Encoding.UTF8.GetString(ms.ToArray());
            return JsonDocument.Parse(message);
        } catch (OperationCanceledException) {
            Console.WriteLine("Message reception was canceled.");
            return null;
        }
    }

    private static string GenerateTaskId() {
        // The "N" format already yields exactly 32 hexadecimal characters.
        return Guid.NewGuid().ToString("N");
    }

    private static string CreateCommand(string action, string taskId, string streaming, object payload) {
        var command = new {
            header = new {
                action,
                task_id = taskId,
                streaming
            },
            payload
        };

        return JsonSerializer.Serialize(command);
    }
}

PHP

The directory structure of the sample code is as follows:

my-php-project/
├── composer.json
├── vendor/
└── index.php

The content of composer.json is as follows. Adjust the dependency versions as needed:

{
    "require": {
        "react/event-loop": "^1.3",
        "react/socket": "^1.11",
        "react/stream": "^1.2",
        "react/http": "^1.1",
        "ratchet/pawl": "^0.4"
    },
    "autoload": {
        "psr-4": {
            "App\\": "src/"
        }
    }
}

The content of index.php is as follows:

<?php

require __DIR__ . '/vendor/autoload.php';

use Ratchet\Client\Connector;
use React\EventLoop\Loop;
use React\Socket\Connector as SocketConnector;

# If you have not configured the API key as an environment variable, you can replace the next line with: $api_key="your_api_key";. We do not recommend hard coding the API key in your code in a production environment to reduce the risk of API key leaks.
$api_key = getenv("DASHSCOPE_API_KEY");
$websocket_url = 'wss://dashscope.aliyuncs.com/api-ws/v1/inference/'; // WebSocket server endpoint.
$output_file = 'output.mp3'; // Output file path.

$loop = Loop::get();

if (file_exists($output_file)) {
    // Clear the file content.
    file_put_contents($output_file, '');
}

// Create a custom connector.
$socketConnector = new SocketConnector($loop, [
    'tcp' => [
        'bindto' => '0.0.0.0:0',
    ],
    'tls' => [
        // For simplicity, this example skips certificate verification.
        // Enable peer verification in production environments.
        'verify_peer' => false,
        'verify_peer_name' => false,
    ],
]);

$connector = new Connector($loop, $socketConnector);

$headers = [
    'Authorization' => 'bearer ' . $api_key,
    'X-DashScope-DataInspection' => 'enable'
];

$connector($websocket_url, [], $headers)->then(function ($conn) use ($loop, $output_file) {
    echo "Connected to WebSocket server\n";

    // Generate a task ID.
    $taskId = generateTaskId();

    // Send the run-task instruction.
    sendRunTaskMessage($conn, $taskId);

    // Define the function to send continue-task instructions.
    $sendContinueTask = function() use ($conn, $loop, $taskId) {
        // Text to be sent.
        $texts = ["A quiet night thought,", "I see the moonlight before my bed,", "I lift my head and watch the moon,", "I lower my head and think of home."];
        $continueTaskCount = 0;
        foreach ($texts as $text) {
            $continueTaskMessage = json_encode([
                "header" => [
                    "action" => "continue-task",
                    "task_id" => $taskId,
                    "streaming" => "duplex"
                ],
                "payload" => [
                    "input" => [
                        "text" => $text
                    ]
                ]
            ]);
            echo "Preparing to send continue-task instruction: " . $continueTaskMessage . "\n";
            $conn->send($continueTaskMessage);
            $continueTaskCount++;
        }
        echo "Number of continue-task instructions sent: " . $continueTaskCount . "\n";

        // Send the finish-task instruction.
        sendFinishTaskMessage($conn, $taskId);
    };

    // Flag to indicate if the task-started event has been received.
    $taskStarted = false;

    // Listen for messages.
    $conn->on('message', function($msg) use ($conn, $sendContinueTask, $loop, &$taskStarted, $taskId, $output_file) {
        if ($msg->isBinary()) {
            // Write binary data to a local file.
            file_put_contents($output_file, $msg->getPayload(), FILE_APPEND);
        } else {
            // Process non-binary messages.
            $response = json_decode($msg, true);

            if (isset($response['header']['event'])) {
                handleEvent($conn, $response, $sendContinueTask, $loop, $taskId, $taskStarted);
            } else {
                echo "Unknown message format\n";
            }
        }
    });

    // Listen for connection closure.
    $conn->on('close', function($code = null, $reason = null) {
        echo "Connection closed\n";
        if ($code !== null) {
            echo "Close code: " . $code . "\n";
        }
        if ($reason !== null) {
            echo "Close reason: " . $reason . "\n";
        }
    });
}, function ($e) {
    echo "Could not connect: {$e->getMessage()}\n";
});

$loop->run();

/**
 * Generate a task ID.
 * @return string
 */
function generateTaskId(): string {
    return bin2hex(random_bytes(16));
}

/**
 * Send the run-task instruction.
 * @param $conn
 * @param $taskId
 */
function sendRunTaskMessage($conn, $taskId) {
    $runTaskMessage = json_encode([
        "header" => [
            "action" => "run-task",
            "task_id" => $taskId,
            "streaming" => "duplex"
        ],
        "payload" => [
            "task_group" => "audio",
            "task" => "tts",
            "function" => "SpeechSynthesizer",
            "model" => "cosyvoice-v3-flash",
            "parameters" => [
                "text_type" => "PlainText",
                "voice" => "longanyang",
                "format" => "mp3",
                "sample_rate" => 22050,
                "volume" => 50,
                "rate" => 1,
                "pitch" => 1
            ],
            "input" => (object) []
        ]
    ]);
    echo "Preparing to send run-task instruction: " . $runTaskMessage . "\n";
    $conn->send($runTaskMessage);
    echo "run-task instruction sent\n";
}


/**
 * Send the finish-task instruction.
 * @param $conn
 * @param $taskId
 */
function sendFinishTaskMessage($conn, $taskId) {
    $finishTaskMessage = json_encode([
        "header" => [
            "action" => "finish-task",
            "task_id" => $taskId,
            "streaming" => "duplex"
        ],
        "payload" => [
            "input" => (object) []
        ]
    ]);
    echo "Preparing to send finish-task instruction: " . $finishTaskMessage . "\n";
    $conn->send($finishTaskMessage);
    echo "finish-task instruction sent\n";
}

/**
 * Handle events.
 * @param $conn
 * @param $response
 * @param $sendContinueTask
 * @param $loop
 * @param $taskId
 * @param $taskStarted
 */
function handleEvent($conn, $response, $sendContinueTask, $loop, $taskId, &$taskStarted) {
    switch ($response['header']['event']) {
        case 'task-started':
            echo "Task started, sending continue-task instruction...\n";
            $taskStarted = true;
            // Send the continue-task instruction.
            $sendContinueTask();
            break;
        case 'result-generated':
            // Ignore the result-generated event.
            break;
        case 'task-finished':
            echo "Task finished\n";
            // Do not close immediately; the timer below closes the
            // connection after the remaining audio data arrives.
            break;
        case 'task-failed':
            echo "Task failed\n";
            echo "Error code: " . $response['header']['error_code'] . "\n";
            echo "Error message: " . $response['header']['error_message'] . "\n";
            $conn->close();
            break;
        case 'error':
            echo "Error: " . $response['payload']['message'] . "\n";
            break;
        default:
            echo "Unknown event: " . $response['header']['event'] . "\n";
            break;
    }

    // If the task is finished, close the connection.
    if ($response['header']['event'] == 'task-finished') {
        // Wait for 1 second to ensure all data has been transmitted.
        $loop->addTimer(1, function() use ($conn) {
            $conn->close();
            echo "Client closed connection\n";
        });
    }

    // If the task-started event is not received, close the connection.
    if (!$taskStarted && in_array($response['header']['event'], ['task-failed', 'error'])) {
        $conn->close();
    }
}

Node.js

Install the required dependencies:

npm install ws
npm install uuid

The sample code is as follows:

const WebSocket = require('ws');
const fs = require('fs');
const uuid = require('uuid').v4;

// If you have not configured the API key as an environment variable, you can replace the next line with: apiKey = 'your_api_key'. We do not recommend hard coding the API key in your code in a production environment to reduce the risk of API key leaks.
const apiKey = process.env.DASHSCOPE_API_KEY;
// WebSocket server endpoint.
const url = 'wss://dashscope.aliyuncs.com/api-ws/v1/inference/';
// Output file path.
const outputFilePath = 'output.mp3';

// Clear the output file.
fs.writeFileSync(outputFilePath, '');

// Create a WebSocket client.
const ws = new WebSocket(url, {
  headers: {
    Authorization: `bearer ${apiKey}`,
    'X-DashScope-DataInspection': 'enable'
  }
});

let taskStarted = false;
let taskId = uuid();

ws.on('open', () => {
  console.log('Connected to WebSocket server');

  // Send the run-task instruction.
  const runTaskMessage = JSON.stringify({
    header: {
      action: 'run-task',
      task_id: taskId,
      streaming: 'duplex'
    },
    payload: {
      task_group: 'audio',
      task: 'tts',
      function: 'SpeechSynthesizer',
      model: 'cosyvoice-v3-flash',
      parameters: {
        text_type: 'PlainText',
        voice: 'longanyang', // Voice
        format: 'mp3', // Audio format
        sample_rate: 22050, // Sample rate
        volume: 50, // Volume
        rate: 1, // Speech rate
        pitch: 1 // Pitch
      },
      input: {}
    }
  });
  ws.send(runTaskMessage);
  console.log('run-task message sent');
});

const fileStream = fs.createWriteStream(outputFilePath, { flags: 'a' });
ws.on('message', (data, isBinary) => {
  if (isBinary) {
    // Write binary data to the file.
    fileStream.write(data);
  } else {
    const message = JSON.parse(data);

    switch (message.header.event) {
      case 'task-started':
        taskStarted = true;
        console.log('Task has started');
        // Send continue-task instructions.
        sendContinueTasks(ws);
        break;
      case 'task-finished':
        console.log('Task has finished');
        ws.close();
        fileStream.end(() => {
          console.log('File stream closed');
        });
        break;
      case 'task-failed':
        console.error('Task failed: ', message.header.error_message);
        ws.close();
        fileStream.end(() => {
          console.log('File stream closed');
        });
        break;
      default:
        // You can handle result-generated here.
        break;
    }
  }
});

function sendContinueTasks(ws) {
  const texts = [
    'A quiet night thought,',
    'I see the moonlight before my bed.',
    'I lift my head and watch the moon,',
    'I lower my head and think of home.'
  ];
  
  texts.forEach((text, index) => {
    setTimeout(() => {
      if (taskStarted) {
        const continueTaskMessage = JSON.stringify({
          header: {
            action: 'continue-task',
            task_id: taskId,
            streaming: 'duplex'
          },
          payload: {
            input: {
              text: text
            }
          }
        });
        ws.send(continueTaskMessage);
        console.log(`continue-task sent, text: ${text}`);
      }
    }, index * 1000); // Send every 1 second.
  });

  // Send the finish-task instruction.
  setTimeout(() => {
    if (taskStarted) {
      const finishTaskMessage = JSON.stringify({
        header: {
          action: 'finish-task',
          task_id: taskId,
          streaming: 'duplex'
        },
        payload: {
          input: {}
        }
      });
      ws.send(finishTaskMessage);
      console.log('finish-task sent');
    }
  }, texts.length * 1000 + 1000); // Send 1 second after all continue-task instructions are sent.
}

ws.on('close', () => {
  console.log('Disconnected from WebSocket server');
});

Java

If you use the Java programming language, we recommend that you use the Java DashScope SDK for development. For more information, see Java SDK.

The following is a Java WebSocket call example. Before you run the example, make sure you have imported the following dependencies:

  • Java-WebSocket

  • jackson-databind

We recommend that you use Maven or Gradle to manage dependency packages. The configurations are as follows:

pom.xml

<dependencies>
    <!-- WebSocket Client -->
    <dependency>
        <groupId>org.java-websocket</groupId>
        <artifactId>Java-WebSocket</artifactId>
        <version>1.5.3</version>
    </dependency>

    <!-- JSON Processing -->
    <dependency>
        <groupId>com.fasterxml.jackson.core</groupId>
        <artifactId>jackson-databind</artifactId>
        <version>2.13.0</version>
    </dependency>
</dependencies>

build.gradle

// Other code omitted.
dependencies {
  // WebSocket Client
  implementation 'org.java-websocket:Java-WebSocket:1.5.3'
  // JSON Processing
  implementation 'com.fasterxml.jackson.core:jackson-databind:2.13.0'
}
// Other code omitted.

The Java code is as follows:

import com.fasterxml.jackson.databind.ObjectMapper;

import org.java_websocket.client.WebSocketClient;
import org.java_websocket.handshake.ServerHandshake;

import java.io.FileOutputStream;
import java.io.IOException;
import java.net.URI;
import java.nio.ByteBuffer;
import java.util.*;

public class TTSWebSocketClient extends WebSocketClient {
    private final String taskId = UUID.randomUUID().toString();
    private final String outputFile = "output_" + System.currentTimeMillis() + ".mp3";
    private boolean taskFinished = false;

    public TTSWebSocketClient(URI serverUri, Map<String, String> headers) {
        super(serverUri, headers);
    }

    @Override
    public void onOpen(ServerHandshake serverHandshake) {
        System.out.println("Connection successful");

        // Send the run-task instruction.
        String runTaskCommand = "{ \"header\": { \"action\": \"run-task\", \"task_id\": \"" + taskId + "\", \"streaming\": \"duplex\" }, \"payload\": { \"task_group\": \"audio\", \"task\": \"tts\", \"function\": \"SpeechSynthesizer\", \"model\": \"cosyvoice-v3-flash\", \"parameters\": { \"text_type\": \"PlainText\", \"voice\": \"longanyang\", \"format\": \"mp3\", \"sample_rate\": 22050, \"volume\": 50, \"rate\": 1, \"pitch\": 1 }, \"input\": {} }}";
        send(runTaskCommand);
    }

    @Override
    public void onMessage(String message) {
        System.out.println("Received message from server: " + message);
        try {
            // Parse JSON message
            Map<String, Object> messageMap = new ObjectMapper().readValue(message, Map.class);

            if (messageMap.containsKey("header")) {
                Map<String, Object> header = (Map<String, Object>) messageMap.get("header");

                if (header.containsKey("event")) {
                    String event = (String) header.get("event");

                    if ("task-started".equals(event)) {
                        System.out.println("Received task-started event from server");

                        List<String> texts = Arrays.asList(
                                "A quiet night thought, I see the moonlight before my bed",
                                "I lift my head and watch the moon, I lower my head and think of home"
                        );

                        for (String text : texts) {
                            // Send the continue-task instruction.
                            sendContinueTask(text);
                        }

                        // Send the finish-task instruction.
                        sendFinishTask();
                    } else if ("task-finished".equals(event)) {
                        System.out.println("Received task-finished event from server");
                        taskFinished = true;
                        closeConnection();
                    } else if ("task-failed".equals(event)) {
                        System.out.println("Task failed: " + message);
                        closeConnection();
                    }
                }
            }
        } catch (Exception e) {
            System.err.println("An exception occurred: " + e.getMessage());
        }
    }

    @Override
    public void onMessage(ByteBuffer message) {
        System.out.println("Size of binary audio data received: " + message.remaining());

        try (FileOutputStream fos = new FileOutputStream(outputFile, true)) {
            byte[] buffer = new byte[message.remaining()];
            message.get(buffer);
            fos.write(buffer);
            System.out.println("Audio data has been written to the local file " + outputFile);
        } catch (IOException e) {
            System.err.println("Failed to write audio data to local file: " + e.getMessage());
        }
    }

    @Override
    public void onClose(int code, String reason, boolean remote) {
        System.out.println("Connection closed: " + reason + " (" + code + ")");
    }

    @Override
    public void onError(Exception ex) {
        System.err.println("Error: " + ex.getMessage());
        ex.printStackTrace();
    }

    private void sendContinueTask(String text) {
        String command = "{ \"header\": { \"action\": \"continue-task\", \"task_id\": \"" + taskId + "\", \"streaming\": \"duplex\" }, \"payload\": { \"input\": { \"text\": \"" + text + "\" } }}";
        send(command);
    }

    private void sendFinishTask() {
        String command = "{ \"header\": { \"action\": \"finish-task\", \"task_id\": \"" + taskId + "\", \"streaming\": \"duplex\" }, \"payload\": { \"input\": {} }}";
        send(command);
    }

    private void closeConnection() {
        if (!isClosed()) {
            close();
        }
    }

    public static void main(String[] args) {
        try {
            String apiKey = System.getenv("DASHSCOPE_API_KEY");
            if (apiKey == null || apiKey.isEmpty()) {
                System.err.println("Set the DASHSCOPE_API_KEY environment variable");
                return;
            }

            Map<String, String> headers = new HashMap<>();
            headers.put("Authorization", "bearer " + apiKey);
            TTSWebSocketClient client = new TTSWebSocketClient(new URI("wss://dashscope.aliyuncs.com/api-ws/v1/inference/"), headers);

            client.connect();

            while (!client.isClosed() && !client.taskFinished) {
                Thread.sleep(1000);
            }
        } catch (Exception e) {
            System.err.println("Failed to connect to WebSocket service: " + e.getMessage());
            e.printStackTrace();
        }
    }
}
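
The Java example above assumes the Java-WebSocket and Jackson libraries are on the classpath. If you build with Maven, the dependencies look roughly like the following sketch (the version numbers are illustrative; use the latest stable releases):

```xml
<!-- Assumed dependencies for the Java example; versions are illustrative. -->
<dependency>
    <groupId>org.java-websocket</groupId>
    <artifactId>Java-WebSocket</artifactId>
    <version>1.5.3</version>
</dependency>
<dependency>
    <groupId>com.fasterxml.jackson.core</groupId>
    <artifactId>jackson-databind</artifactId>
    <version>2.15.2</version>
</dependency>
```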

Python

If you use the Python programming language, we recommend that you use the Python DashScope SDK for development. For more information, see Python SDK.

The following is a Python WebSocket call example. Before you run the example, install the websocket-client dependency as follows:

pip uninstall -y websocket-client websocket
pip install websocket-client
Important

Do not name the Python file that runs the sample code "websocket.py". Otherwise, an error is reported: AttributeError: module 'websocket' has no attribute 'WebSocketApp'. Did you mean: 'WebSocket'?

import websocket
import json
import uuid
import os
import time


class TTSClient:
    def __init__(self, api_key, uri):
        """
        Initializes the TTSClient instance.

        Parameters:
            api_key (str): The API key for authentication.
            uri (str): The WebSocket service endpoint.
        """
        self.api_key = api_key  # Replace with your API key.
        self.uri = uri  # Replace with your WebSocket endpoint.
        self.task_id = str(uuid.uuid4())  # Generate a unique task ID.
        self.output_file = f"output_{int(time.time())}.mp3"  # Output audio file path.
        self.ws = None  # WebSocketApp instance.
        self.task_started = False  # Whether task-started has been received.
        self.task_finished = False  # Whether task-finished or task-failed has been received.

    def on_open(self, ws):
        """
        Callback function for when the WebSocket connection is established.
        Sends a run-task instruction to start the speech synthesis task.
        """
        print("WebSocket connected")

        # Construct the run-task instruction.
        run_task_cmd = {
            "header": {
                "action": "run-task",
                "task_id": self.task_id,
                "streaming": "duplex"
            },
            "payload": {
                "task_group": "audio",
                "task": "tts",
                "function": "SpeechSynthesizer",
                "model": "cosyvoice-v3-flash",
                "parameters": {
                    "text_type": "PlainText",
                    "voice": "longanyang",
                    "format": "mp3",
                    "sample_rate": 22050,
                    "volume": 50,
                    "rate": 1,
                    "pitch": 1
                },
                "input": {}
            }
        }

        # Send the run-task instruction.
        ws.send(json.dumps(run_task_cmd))
        print("run-task instruction sent")

    def on_message(self, ws, message):
        """
        Callback function for receiving messages.
        Handles text and binary messages differently.
        """
        if isinstance(message, str):
            # Process JSON text messages.
            try:
                msg_json = json.loads(message)
                print(f"Received JSON message: {msg_json}")

                if "header" in msg_json:
                    header = msg_json["header"]

                    if "event" in header:
                        event = header["event"]

                        if event == "task-started":
                            print("Task has started")
                            self.task_started = True

                            # Send continue-task instructions.
                            texts = [
                                "A quiet night thought, I see the moonlight before my bed",
                                "I lift my head and watch the moon, I lower my head and think of home"
                            ]

                            for text in texts:
                                self.send_continue_task(text)

                            # Send finish-task after all continue-task instructions are sent.
                            self.send_finish_task()

                        elif event == "task-finished":
                            print("Task has finished")
                            self.task_finished = True
                            self.close(ws)

                        elif event == "task-failed":
                            error_msg = header.get("error_message", "Unknown error")
                            print(f"Task failed: {error_msg}")
                            self.task_finished = True
                            self.close(ws)

            except json.JSONDecodeError as e:
                print(f"JSON parsing failed: {e}")
        else:
            # Process binary messages (audio data).
            print(f"Received binary message, size: {len(message)} bytes")
            with open(self.output_file, "ab") as f:
                f.write(message)
            print(f"Audio data has been written to the local file {self.output_file}")

    def on_error(self, ws, error):
        """Callback for when an error occurs."""
        print(f"WebSocket error: {error}")

    def on_close(self, ws, close_status_code, close_msg):
        """Callback for when the connection is closed."""
        print(f"WebSocket closed: {close_msg} ({close_status_code})")

    def send_continue_task(self, text):
        """Sends a continue-task instruction with the text to be synthesized."""
        cmd = {
            "header": {
                "action": "continue-task",
                "task_id": self.task_id,
                "streaming": "duplex"
            },
            "payload": {
                "input": {
                    "text": text
                }
            }
        }

        self.ws.send(json.dumps(cmd))
        print(f"continue-task instruction sent, text: {text}")

    def send_finish_task(self):
        """Sends a finish-task instruction to end the speech synthesis task."""
        cmd = {
            "header": {
                "action": "finish-task",
                "task_id": self.task_id,
                "streaming": "duplex"
            },
            "payload": {
                "input": {}
            }
        }

        self.ws.send(json.dumps(cmd))
        print("finish-task instruction sent")

    def close(self, ws):
        """Actively closes the connection."""
        if ws and ws.sock and ws.sock.connected:
            ws.close()
            print("Connection actively closed")

    def run(self):
        """Starts the WebSocket client."""
        # Set request headers (for authentication).
        header = {
            "Authorization": f"bearer {self.api_key}",
            "X-DashScope-DataInspection": "enable"
        }

        # Create a WebSocketApp instance.
        self.ws = websocket.WebSocketApp(
            self.uri,
            header=header,
            on_open=self.on_open,
            on_message=self.on_message,
            on_error=self.on_error,
            on_close=self.on_close
        )

        print("Listening for WebSocket messages...")
        self.ws.run_forever()  # Start the persistent connection listener.


# Example usage
if __name__ == "__main__":
    API_KEY = os.environ.get("DASHSCOPE_API_KEY")  # If you have not configured the API key as an environment variable, set API_KEY to your API key.
    SERVER_URI = "wss://dashscope.aliyuncs.com/api-ws/v1/inference/"  # Replace with your WebSocket endpoint.

    if not API_KEY:
        raise RuntimeError("Set the DASHSCOPE_API_KEY environment variable")

    client = TTSClient(API_KEY, SERVER_URI)
    client.run()

Error codes

For troubleshooting information, see Error messages.

FAQ

Features, billing, and limits

Q: What can I do to fix inaccurate pronunciation?

You can use SSML to customize the speech synthesis output.

Q: Why use the WebSocket protocol instead of the HTTP/HTTPS protocol? Why not provide a RESTful API?

Voice Service uses WebSocket instead of an HTTP-based RESTful API because it requires full-duplex communication. WebSocket allows the server and client to actively push data to each other, such as real-time speech synthesis or recognition progress. In contrast, an HTTP-based RESTful API supports only a one-way, client-initiated request-response model, which is unsuitable for real-time interaction.

Q: Speech synthesis is billed based on the number of characters. How can I find the character count for each synthesis task?

You can obtain the character count from the payload.usage.characters parameter of the result-generated event returned by the server. Use the value from the last result-generated event that you receive.
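The lookup described above can be sketched as follows. The event message below is a hypothetical example constructed for illustration; only the `payload.usage.characters` path is taken from this topic:

```python
import json

# Hypothetical result-generated event, constructed for illustration only.
event_message = json.dumps({
    "header": {"event": "result-generated", "task_id": "demo"},
    "payload": {"usage": {"characters": 42}}
})

def extract_characters(message: str):
    """Return payload.usage.characters from a result-generated event, or None."""
    msg = json.loads(message)
    if msg.get("header", {}).get("event") != "result-generated":
        return None
    return msg.get("payload", {}).get("usage", {}).get("characters")

print(extract_characters(event_message))  # 42
```

In practice, call this from your text-message callback and keep the value from the last result-generated event you receive.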

Troubleshooting

Important

If an error occurs in your code, check whether the instruction sent to the server is correct. You can print the content of the instruction to check for formatting errors or missing required parameters. If the instruction is correct, troubleshoot the issue based on the information in the error code.

Q: How do I get the Request ID?

You can obtain the Request ID in two ways:

Q: Why is the SSML feature failing?

Follow these steps to troubleshoot:

  1. Ensure that the scope of application is correct.

  2. Ensure that you call the service correctly. For more information, see SSML support.

  3. Ensure that the text to be synthesized is in plain text format and meets the formatting requirements. For more information, see Speech Synthesis Markup Language.

Q: Why can't the audio be played?

Troubleshoot this issue based on the following scenarios:

  1. The audio is saved as a complete file, such as an .mp3 file.

    1. Audio format consistency: Ensure that the audio format specified in the request parameters matches the file extension. For example, playback might fail if the audio format is set to WAV in the request parameters but the file has an .mp3 extension.

    2. Player compatibility: Confirm that your player supports the format and sample rate of the audio file. For example, some players might not support high sample rates or specific audio encodings.

  2. The audio is played in streaming mode.

    1. Save the audio stream as a complete file and try to play it. If the file fails to play, see the troubleshooting steps for the first scenario.

    2. If the file plays correctly, the issue might be with the streaming playback implementation. Confirm that your player supports streaming playback.

      Common tools and libraries that support streaming playback include FFmpeg, pyaudio (Python), AudioFormat (Java), and MediaSource (JavaScript).

Q: Why does the audio playback stutter?

Troubleshoot this issue based on the following scenarios:

  1. Check the text sending speed: Ensure that the text sending interval is reasonable. Do not wait until the audio for the previous segment has finished playing before you send the next text segment.

  2. Check the callback function performance:

    • Check whether the callback function contains excessive business logic that could cause it to block.

    • The callback function runs in the WebSocket thread. If this thread is blocked, it can interfere with the WebSocket's ability to receive network packets, resulting in audio stuttering.

    • To avoid blocking the WebSocket thread, write the audio data to a separate audio buffer and then use another thread to read and process it.

  3. Check network stability: Ensure that your network connection is stable to prevent audio transmission interruptions or delays caused by network fluctuations.
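
The buffering approach in step 2 can be sketched with the standard library: the WebSocket callback only enqueues audio chunks, and a separate thread drains the queue and performs the slow I/O. The chunk bytes and file name below are placeholders:

```python
import os
import queue
import tempfile
import threading

audio_buffer = queue.Queue()  # thread-safe buffer between the WebSocket thread and the writer

def on_binary_message(data: bytes):
    """Called from the WebSocket thread: enqueue only, never do blocking I/O here."""
    audio_buffer.put(data)

def writer_loop(path: str):
    """Runs in a separate thread: drains the buffer and writes to disk (or a player)."""
    with open(path, "wb") as f:
        while True:
            chunk = audio_buffer.get()
            if chunk is None:  # sentinel: synthesis finished
                break
            f.write(chunk)

output_path = os.path.join(tempfile.gettempdir(), "cosyvoice_buffer_demo.mp3")
writer = threading.Thread(target=writer_loop, args=(output_path,), daemon=True)
writer.start()

# Simulate audio chunks arriving on the WebSocket thread:
on_binary_message(b"\x00\x01")
on_binary_message(b"\x02\x03")
audio_buffer.put(None)  # signal the end of the stream
writer.join()
```

Because `on_binary_message` returns immediately, the WebSocket thread stays free to receive the next network packet, which is what prevents the stuttering described above.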

Q: Why is speech synthesis slow (long synthesis time)?

Perform the following troubleshooting steps:

  1. Check the input interval

    If you are using streaming speech synthesis, check whether the text sending interval is too long. For example, a delay of several seconds before sending the next segment will increase the total synthesis time.

  2. Analyze performance metrics

    • First packet delay: This is typically around 500 ms.

    • Real-Time Factor (RTF): This is calculated as Total Synthesis Time / Audio Duration. The RTF is normally less than 1.0.
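
The two metrics above can be computed from timestamps recorded around one synthesis run. The numbers below are made-up sample values for illustration:

```python
# Sample timestamps (in seconds) recorded around one synthesis run; values are made up.
send_time = 0.00         # when run-task was sent
first_audio_time = 0.48  # when the first binary audio frame arrived
finish_time = 2.40       # when task-finished arrived
audio_duration = 3.00    # duration of the synthesized audio

first_packet_delay_ms = (first_audio_time - send_time) * 1000
rtf = (finish_time - send_time) / audio_duration  # Total Synthesis Time / Audio Duration

print(f"first packet delay: {first_packet_delay_ms:.0f} ms")  # 480 ms
print(f"RTF: {rtf:.2f}")  # 0.80 (< 1.0 means faster than real time)
```

If your measured first packet delay is far above roughly 500 ms, or the RTF exceeds 1.0, focus on the input interval and network checks above.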

Q: How do I handle incorrect pronunciation in the synthesized speech?

Use the <phoneme> tag of SSML to specify the correct pronunciation.

Q: Why is some audio missing? Why is the end of my text not synthesized into speech?

Ensure that you send the finish-task instruction. During the speech synthesis process, the server starts synthesis only after it caches a sufficient amount of text. If you forget to send the finish-task instruction, the last part of the text in the cache may not be synthesized into speech.
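
One way to make the finish-task instruction hard to forget is to send it in a `finally` block, so it goes out even if sending a text segment raises. The sketch below uses a stub connection object (hypothetical, for demonstration) in place of a real WebSocket; the instruction payloads match those used in the examples above:

```python
import json

class _StubConnection:
    """Stand-in for a WebSocket connection; just records the frames it sends."""
    def __init__(self):
        self.sent = []

    def send(self, frame):
        self.sent.append(json.loads(frame))

def synthesize(ws, task_id, texts):
    """Send all continue-task instructions, then always send finish-task."""
    try:
        for text in texts:
            ws.send(json.dumps({
                "header": {"action": "continue-task", "task_id": task_id, "streaming": "duplex"},
                "payload": {"input": {"text": text}},
            }))
    finally:
        # Without finish-task, text still buffered on the server is never synthesized.
        ws.send(json.dumps({
            "header": {"action": "finish-task", "task_id": task_id, "streaming": "duplex"},
            "payload": {"input": {}},
        }))

ws = _StubConnection()
synthesize(ws, "demo-task", ["hello", "world"])
print(ws.sent[-1]["header"]["action"])  # finish-task
```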

Q: Why are the returned audio stream segments out of order, causing jumbled playback?

Check the following two points:

Permissions and authentication

Q: I want my API key to be used only for the CosyVoice speech synthesis service, not for other Model Studio models (permission isolation). What should I do?

You can create a workspace and authorize only specific models to limit the scope of the API key. For more information, see Manage workspaces.

More questions

For more information, see the Q&A on GitHub.