Alibaba Cloud Model Studio:CosyVoice speech synthesis WebSocket API

Last Updated: Mar 18, 2026

This topic describes the parameters and interaction protocol for WebSocket connections to CosyVoice speech synthesis. Use WebSocket for languages other than Java and Python, which have SDK support.

User guide: For model overviews and selection suggestions, see Real-time speech synthesis - CosyVoice.

WebSocket provides full-duplex communication: the client and server establish a persistent connection with a single handshake, allowing both parties to push data to each other, providing better real-time performance.

WebSocket libraries are available for most languages (Go: gorilla/websocket, PHP: Ratchet, Node.js: ws). Familiarize yourself with WebSocket basics before starting.

Important

CosyVoice models support only WebSocket connections—not HTTP REST APIs. HTTP requests return InvalidParameter or URL errors.

Prerequisites

Get an API key.

Models and pricing

See Real-time speech synthesis - CosyVoice.

Text and format limitations

Text length limits

Send up to 20,000 characters per continue-task instruction, with a total limit of 200,000 characters across all instructions.

Character counting rules

  • Chinese characters (simplified/traditional Chinese, Japanese Kanji, Korean Hanja) count as two characters. All other characters (punctuation, letters, numbers, Kana, Hangul) count as one.

  • SSML tags are not included when calculating the text length.

  • Examples:

    • "你好" → 2 (Chinese character) + 2 (Chinese character) = 4 characters

    • "中A文123" → 2 (Chinese character) + 1 (A) + 2 (Chinese character) + 1 (1) + 1 (2) + 1 (3) = 8 characters

    • "中文。" → 2 (Chinese character) + 2 (Chinese character) + 1 (.) = 5 characters

    • "中 文。" → 2 (Chinese character) + 1 (space) + 2 (Chinese character) + 1 (.) = 6 characters

    • "<speak>你好</speak>" → 2 (Chinese character) + 2 (Chinese character) = 4 characters
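The counting rule above can be sketched in Python. This is an approximation of the documented rule, not an official billing function: it weights code points in the common CJK ideograph blocks as two characters and strips SSML tags with a simple regex. The function name count_billable_chars is ours.

```python
import re

# Approximate the documented rule: CJK ideographs (Chinese characters,
# Japanese Kanji, Korean Hanja) count as 2; everything else (punctuation,
# letters, digits, Kana, Hangul) counts as 1. SSML tags are excluded.
CJK_RANGES = [
    (0x4E00, 0x9FFF),   # CJK Unified Ideographs
    (0x3400, 0x4DBF),   # CJK Unified Ideographs Extension A
    (0xF900, 0xFAFF),   # CJK Compatibility Ideographs
]

def count_billable_chars(text: str) -> int:
    text = re.sub(r"<[^>]+>", "", text)  # drop SSML tags such as <speak>
    total = 0
    for ch in text:
        cp = ord(ch)
        total += 2 if any(lo <= cp <= hi for lo, hi in CJK_RANGES) else 1
    return total

print(count_billable_chars("你好"))                # 4
print(count_billable_chars("中A文123"))            # 8
print(count_billable_chars("<speak>你好</speak>"))  # 4
```

A helper like this is useful for staying under the 20,000-character per-instruction and 200,000-character per-task limits before sending text.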

Encoding format

Use UTF-8 encoding.

Support for mathematical expressions

Mathematical expression parsing (v3.5-flash, v3.5-plus, v3-flash, v3-plus, v2 only): Supports primary and secondary school math—basic operations, algebra, geometry.

Note

This feature only supports Chinese.

See Convert LaTeX formulas to speech (Chinese language only).

SSML support

SSML requires all of the following conditions to be met:

  1. Model support: Only cosyvoice-v3.5-flash, cosyvoice-v3.5-plus, cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 support SSML.

  2. Voice support: You must use an SSML-enabled voice. Supported voices include the following:

    • All cloned voices (created through the Voice Cloning API).

    • System voices marked as SSML-enabled in the voice list.

    Note

    System voices that do not support SSML (such as some basic voices) return the error “SSML text is not supported at the moment!” even with enable_ssml enabled.

  3. Parameter setting: In the run-task instruction, set the enable_ssml parameter to true.

After meeting these conditions, send SSML-formatted text through the continue-task instruction to use SSML. For a complete example, see Getting started.

Interaction flow

(Diagram: client-server interaction flow)

Client-to-server messages are instructions. Server-to-client messages are either JSON-formatted events or binary audio streams.

The client-server interaction follows this sequence:

  1. Establish connection: Client opens a WebSocket connection to the server.

  2. Start task: Client sends the run-task instruction.

  3. Wait for confirmation: Client receives the task-started event confirming the task has started.

  4. Send text to synthesize:

    Client sends one or more continue-task instructions in order. The server returns a result-generated event and audio stream after receiving a complete sentence. For text length constraints, see the text field in the continue-task instruction.

    Note

    Send multiple continue-task instructions to submit text fragments in order. The server automatically segments text into sentences:

    • Complete sentences are synthesized immediately, and the client receives the audio.

    • Incomplete sentences are buffered until complete. No audio is returned for incomplete sentences.

    After receiving the finish-task instruction, the server force-synthesizes all buffered content.

  5. Receive audio: Receive the audio stream through the binary channel.

  6. Notify the server to end the task:

    After sending all text, client sends the finish-task instruction and continues receiving audio. Do not skip this step, or the ending audio may be lost.

  7. Task ends:

    Client receives the task-finished event, marking the task end.

  8. Close connection: Client closes the WebSocket connection.

To improve resource utilization, reuse a WebSocket connection to handle multiple tasks instead of creating a new connection for each task. See Connection overhead and reuse.
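The instruction sequence above can be sketched as small builder functions that share one task_id. This is our own sketch, not an official client; the actual WebSocket send/receive loop is only outlined in comments, and the model and voice defaults are taken from the examples in this topic.

```python
import json
import uuid

def new_task_id() -> str:
    # One unique task_id per synthesis task, reused by every instruction
    return uuid.uuid4().hex

def run_task(task_id: str, model: str = "cosyvoice-v3-flash",
             voice: str = "longanyang") -> str:
    return json.dumps({
        "header": {"action": "run-task", "task_id": task_id,
                   "streaming": "duplex"},
        "payload": {
            "task_group": "audio", "task": "tts",
            "function": "SpeechSynthesizer", "model": model,
            "parameters": {"text_type": "PlainText", "voice": voice,
                           "format": "mp3", "sample_rate": 22050},
            "input": {},  # required, and must be empty here
        },
    })

def continue_task(task_id: str, text: str) -> str:
    return json.dumps({
        "header": {"action": "continue-task", "task_id": task_id,
                   "streaming": "duplex"},
        "payload": {"input": {"text": text}},
    })

def finish_task(task_id: str) -> str:
    return json.dumps({
        "header": {"action": "finish-task", "task_id": task_id,
                   "streaming": "duplex"},
        "payload": {"input": {}},
    })

# Outline of the flow around a real WebSocket library:
#   1. connect to the wss:// URL with the Authorization header
#   2. send run_task(tid); wait for the task-started event
#   3. send continue_task(tid, ...) one or more times, in order
#   4. send finish_task(tid); keep reading binary audio frames
#   5. stop after the task-finished event, then close the connection
```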

Important

The task_id must remain consistent throughout: Within a single synthesis task, run-task, all continue-task, and finish-task instructions must use the same task_id.

Consequences of mismatched task_ids:

  • Disordered audio stream delivery due to unassociated requests.

  • Misaligned speech content due to text being assigned to different tasks.

  • Abnormal task state, possibly preventing receipt of the task-finished event.

  • Billing failures or inaccurate usage statistics.

Correct approach:

  • Generate a unique task_id (for example, using UUID) when sending the run-task instruction.

  • Store the task_id in a variable.

  • Use this task_id for all subsequent continue-task and finish-task instructions.

  • After the task ends (after receiving task-finished), generate a new task_id for the next task.

Client implementation considerations

When implementing a WebSocket client, especially on Flutter, web, or mobile platforms, clearly define server and client responsibilities to ensure task integrity and stability.

Server and client responsibilities

Server responsibilities

The server delivers the complete audio stream in order. You do not need to handle audio ordering or completeness.

Client responsibilities

The client must handle the following:

  1. Read and concatenate all audio chunks

    The server delivers audio as multiple binary frames. The client must receive all frames and concatenate them to form the final audio file:

    # Python example: Concatenate audio chunks
    with open("output.mp3", "ab") as f:  # Append mode
        f.write(audio_chunk)  # audio_chunk is each received binary audio chunk

    // JavaScript example: Concatenate audio chunks
    const audioChunks = [];
    ws.onmessage = (event) => {
      if (event.data instanceof Blob) {
        audioChunks.push(event.data);  // Collect all audio chunks
      }
    };
    // Merge audio after task completes
    const audioBlob = new Blob(audioChunks, { type: 'audio/mp3' });
  2. Maintain a complete WebSocket lifecycle

    Do not disconnect the WebSocket connection prematurely during the entire task, from sending the run-task instruction to receiving the task-finished event. Common mistakes:

    • Closing the connection before all audio chunks are returned, resulting in incomplete audio.

    • Forgetting to send the finish-task instruction, leaving text buffered on the server and unprocessed.

    • Failing to handle WebSocket keepalive properly during page navigation or app backgrounding.

    Important

    Mobile apps (such as Flutter, iOS, and Android) require special attention to network management when entering the background. Maintain the WebSocket connection in a background task or service, or reinitialize the connection when returning to the foreground.

  3. Text integrity in ASR→LLM→TTS workflows

    In ASR (speech recognition) → LLM (large language model) → TTS (speech synthesis) workflows, ensure the text passed to TTS is complete. For example:

    • Wait for the LLM to generate a full sentence or paragraph before sending the continue-task instruction, rather than streaming character-by-character.

    • For streaming synthesis, send text in batches at natural sentence boundaries (such as periods or question marks).

    • After the LLM finishes generating, always send the finish-task instruction to avoid missing trailing content.
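The batching advice above can be sketched as a small buffering helper: accumulate streaming LLM deltas and emit only complete sentences. The terminator set and the function name drain_sentences are our own assumptions; extend the terminators for your content.

```python
# Sketch: batch streaming LLM output at sentence boundaries before TTS.
# The terminator set is an assumption; adjust it for your content.
TERMINATORS = set("。！？!?.")

def drain_sentences(buffer: str) -> tuple[list[str], str]:
    """Split buffer into complete sentences plus the unfinished remainder."""
    sentences, start = [], 0
    for i, ch in enumerate(buffer):
        if ch in TERMINATORS:
            sentences.append(buffer[start:i + 1])
            start = i + 1
    return sentences, buffer[start:]

buffer = ""
for delta in ["Hello wor", "ld. How are", " you? Fine"]:  # simulated LLM stream
    buffer += delta
    complete, buffer = drain_sentences(buffer)
    for sentence in complete:
        pass  # send a continue-task instruction with this sentence here
# LLM finished: flush whatever remains, then send the finish-task instruction
if buffer:
    pass  # send a final continue-task instruction with the remainder
print(repr(buffer))  # ' Fine' stays buffered until the final flush
```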

Platform-specific tips

  • Flutter: When using the web_socket_channel package, close the connection correctly in the dispose method to prevent memory leaks. Also handle app lifecycle events (such as AppLifecycleState.paused) for background transitions.

  • Web (browser): Some browsers limit the number of WebSocket connections. Reuse a single connection for multiple tasks. Use the beforeunload event to close the connection explicitly before the page closes.

  • Mobile (iOS/Android native): The operating system may pause or terminate network connections when the app enters the background. Use a background task or foreground service to keep the WebSocket active, or reinitialize the task when returning to the foreground.

URL

The WebSocket URL is fixed:

International

In international deployment mode, the access point and data storage are both located in the Singapore region. Model inference computing resources are dynamically scheduled globally, excluding the Chinese mainland.

WebSocket URL: wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference

Chinese mainland

In Chinese mainland deployment mode, the access point and data storage are both located in the Beijing region. Model inference computing resources are restricted to the Chinese mainland.

WebSocket URL: wss://dashscope.aliyuncs.com/api-ws/v1/inference

Important

Common URL configuration errors:

  • Error: Using URLs that start with http:// or https:// → Correct: You must use the wss:// protocol.

  • Error: Placing the Authorization parameter in the URL query string (e.g., ?Authorization=bearer <your_api_key>) → Correct: Set the Authorization parameter in the HTTP handshake headers. See Headers.

  • Error: Adding model names or other path parameters to the end of the URL → Correct: The URL is fixed. Specify the model using the payload.model parameter in the run-task instruction.
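The three configuration errors above can be caught before connecting with a quick sanity check. The helper below is our own, not part of the API; it only inspects the URL string.

```python
from urllib.parse import urlparse

FIXED_PATH = "/api-ws/v1/inference"

def check_ws_url(url: str) -> list[str]:
    """Return a list of problems with a candidate CosyVoice WebSocket URL."""
    problems = []
    parts = urlparse(url)
    if parts.scheme != "wss":
        problems.append("use the wss:// protocol, not http(s)://")
    if "Authorization" in (parts.query or ""):
        problems.append("send Authorization as a handshake header, not in the query string")
    if parts.path != FIXED_PATH:
        problems.append("the URL path is fixed; pass the model in payload.model")
    return problems

print(check_ws_url("wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference"))  # []
print(check_ws_url("https://dashscope.aliyuncs.com/api-ws/v1/inference/cosyvoice-v2"))
```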

Headers

Set the following request headers:

Parameter

Type

Required

Description

Authorization

string

Yes

Authentication token. Format: Bearer <your_api_key>. Replace <your_api_key> with your actual API key.

user-agent

string

No

Client identifier for tracking the source.

X-DashScope-WorkSpace

string

No

Your Alibaba Cloud Model Studio workspace ID.

X-DashScope-DataInspection

string

No

Whether to enable data compliance inspection. Default: enable. Do not set unless necessary.

Important

Authentication timing and common errors

Authentication occurs during the WebSocket handshake, not when sending the run-task instruction. If the Authorization header is missing or invalid, the server rejects the handshake and returns an HTTP 401 or 403 error. Client libraries typically report this as a WebSocketBadStatus exception.

Troubleshoot authentication failures

If the WebSocket connection fails, follow these steps:

  1. Check API key format: Confirm the Authorization header follows bearer <your_api_key> with a space separating `bearer` and the key.

  2. Verify API key validity: In the Model Studio console, confirm the key is active and has CosyVoice model permissions.

  3. Check header settings: Confirm the Authorization header is set during the WebSocket handshake. Configuration differs by language:

    • Python (websockets library): extra_headers={"Authorization": f"bearer {api_key}"}

    • JavaScript: The standard browser WebSocket API does not support custom headers. Use a server-side proxy or another library, such as ws.

    • Go (gorilla/websocket): header.Add("Authorization", fmt.Sprintf("bearer %s", apiKey))

  4. Test network connectivity: Use curl or Postman to test whether the API key is valid by calling other HTTP-supported DashScope APIs.

Using WebSocket in browser environments

In browser environments such as Vue3 and React, the native new WebSocket(url) API does not support custom request headers (including Authorization) during the handshake. This is a browser security restriction, so you cannot authenticate directly from frontend code.

Solution: Use a backend proxy

  1. Set up a WebSocket connection from your backend (Node.js, Java, or Python) to the CosyVoice service. The backend can set the Authorization header.

  2. Have the frontend connect via WebSocket to your backend, which acts as a proxy to forward messages to CosyVoice.

  3. Benefits: The API key stays hidden from the frontend. You can also add authentication, logging, or rate limiting on the backend.
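A backend proxy along these lines can be sketched in Python. This is a minimal sketch, not a production server: it assumes the third-party websockets package, a DASHSCOPE_API_KEY environment variable, and the Chinese mainland endpoint; it forwards frames in both directions without any frontend authentication or rate limiting of its own.

```python
import asyncio
import os

UPSTREAM = "wss://dashscope.aliyuncs.com/api-ws/v1/inference"

def upstream_headers(api_key: str) -> dict:
    # The API key stays on the backend; the browser never sees it.
    return {"Authorization": f"bearer {api_key}"}

async def relay(client):
    # Third-party dependency (assumption): pip install websockets.
    # The header keyword is extra_headers in older releases and
    # additional_headers in newer ones; check your installed version.
    import websockets
    headers = upstream_headers(os.environ["DASHSCOPE_API_KEY"])
    async with websockets.connect(UPSTREAM, additional_headers=headers) as up:
        async def pump(src, dst):
            async for message in src:  # forwards text events and binary audio
                await dst.send(message)
        await asyncio.gather(pump(client, up), pump(up, client))

async def main(host: str = "localhost", port: int = 8765):
    import websockets
    async with websockets.serve(relay, host, port):
        await asyncio.Future()  # serve until cancelled

# To run the proxy:  asyncio.run(main())
```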

Important

Never hardcode your API key in frontend code. Leaking your API key could lead to account compromise, unexpected charges, or data breaches.

Example code:

For other programming languages, implement the same logic or use AI tools to convert these examples.

Instructions (client → server)

Instructions are JSON messages sent from the client to the server as WebSocket text frames. They control the task lifecycle.

Send instructions in the following order to prevent task failure:

  1. Send the run-task instruction

  2. Send the continue-task instruction

    • Sends text to synthesize.

    • Send only after receiving the task-started event from the server.

  3. Send the finish-task instruction

1. run-task instruction: Start a task

Starts a speech synthesis task. Configure the voice, sample rate, and other parameters in this instruction.

Important
  • Timing: Send this instruction after the WebSocket connection is established.

  • Do not send text to synthesize: Do not include text in the run-task instruction; doing so complicates troubleshooting. Send text using the continue-task instruction instead.

  • The input field is required: The payload must contain the input field, which is formatted as {}. If you omit this field, the "task can not be null" error occurs.

Example:

{
    "header": {
        "action": "run-task",
        "task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx", // Random UUID
        "streaming": "duplex"
    },
    "payload": {
        "task_group": "audio",
        "task": "tts",
        "function": "SpeechSynthesizer",
        "model": "cosyvoice-v3-flash",
        "parameters": {
            "text_type": "PlainText",
            "voice": "longanyang",    // Voice
            "format": "mp3",          // Audio format
            "sample_rate": 22050,     // Sample rate
            "volume": 50,             // Volume
            "rate": 1,                // Speech rate
            "pitch": 1                // Pitch
        },
        "input": {}  // input must exist, or it will return an error
    }
}

header parameter descriptions:

Parameter

Type

Required

Description

header.action

string

Yes

Instruction type.

Fixed value: "run-task".

header.task_id

string

Yes

Task ID for this operation.

A 32-character UUID composed of randomly generated letters and digits. Hyphens are optional: include them (for example, "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx") or omit them (for example, "2bf83b9abaeb4fda8d9axxxxxxxxxxxx"). Most programming languages provide built-in UUID APIs. Python example:

import uuid

def generate_task_id():
    # Generate a random 32-character UUID without hyphens
    return uuid.uuid4().hex

When sending subsequent continue-task instructions and finish-task instructions, use the same task_id as in the run-task instruction.

header.streaming

string

Yes

Fixed string: "duplex"

payload parameter descriptions:

Parameter

Type

Required

Description

payload.task_group

string

Yes

Fixed string: "audio".

payload.task

string

Yes

Fixed string: "tts".

payload.function

string

Yes

Fixed string: "SpeechSynthesizer".

payload.model

string

Yes

Speech synthesis model. Each model version requires compatible voices:

  • cosyvoice-v3.5-flash/cosyvoice-v3.5-plus: No system voices are available. Only custom voices from voice design or voice cloning are supported.

  • cosyvoice-v3-flash/cosyvoice-v3-plus: Use voices such as longanyang.

  • cosyvoice-v2: Use voices such as longxiaochun_v2.

  • For a complete list of voices, see Voice list.

payload.input

object

Yes

The input field is required but must be empty in the run-task instruction. Use an empty object {}. Send text using later continue-task instructions for easier troubleshooting and streaming synthesis.

The input format is:

"input": {}
Important

Common error: Omitting the input field or adding unexpected fields (like mode or content) causes the server to reject the request with “InvalidParameter: task can not be null” or close the connection (WebSocket code 1007).

payload.parameters

text_type

string

Yes

Fixed string: "PlainText".

voice

string

Yes

The voice used for speech synthesis.

Supported voice types:

  • System voices: For more information, see Voice list.

  • Cloned voices: Customized using the Voice cloning feature. When using a cloned voice, make sure that the same account is used for voice cloning and speech synthesis.

    For cloned voices, model must match the voice creation model (target_model).

  • Designed voices: Customized using the Voice design feature. When using a designed voice, make sure that the same account is used for voice design and speech synthesis.

    For designed voices, model must match the voice creation model (target_model).

format

string

No

Audio coding format.

Supports pcm, wav, mp3 (default), and opus.

When format is opus, adjust bitrate using the bit_rate parameter.

sample_rate

integer

No

Audio sampling rate (in Hz).

Default: 22050.

Valid values: 8000, 16000, 22050, 24000, 44100, 48000.

Note

The default sample rate represents the optimal rate for the selected voice. Output uses this rate by default, but downsampling and upsampling are supported.

volume

integer

No

The volume.

Default: 50.

Valid range: [0, 100]. Values scale linearly—0 is silent, 50 is default, 100 is maximum.

rate

float

No

The speech rate.

Default value: 1.0.

Valid values: [0.5, 2.0]. A value of 1.0 is the standard speech rate. A value less than 1.0 slows down the speech, and a value greater than 1.0 speeds it up.

pitch

float

No

Pitch multiplier. The relationship to perceived pitch is neither linear nor logarithmic—test to find suitable values.

Default value: 1.0.

Valid values: [0.5, 2.0]. A value of 1.0 is the natural pitch of the voice. A value greater than 1.0 raises the pitch, and a value less than 1.0 lowers it.

enable_ssml

boolean

No

Enable SSML.

When set to true, you may send text only once (only one continue-task instruction allowed).

bit_rate

int

No

The audio bitrate in kbps. If the audio format is Opus, adjust the bitrate by using the bit_rate parameter.

Default value: 32.

Valid values: [6, 510].

word_timestamp_enabled

boolean

No

Specifies whether to enable word-level timestamps.

Default value: false.

  • true

  • false

This feature is available only for cloned voices of the cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 models, and for system voices that are marked as supported in the voice list.

When word_timestamp_enabled is enabled, timestamp information appears in the result-generated event. Example:

{
  "header": {
    "task_id": "3f39be22-efbd-4844-91d5-xxxxxxxxxxxx",
    "event": "result-generated",
    "attributes": {}
  },
  "payload": {
    "output": {
      "sentence": {
        "index": 0,
        "words": [
          {
            "text": "bed",
            "begin_index": 0,
            "end_index": 1,
            "begin_time": 280,
            "end_time": 640
          }
        ]
      },
      "type": "sentence-begin",
      "original_text": "Before my bed, moonlight shines bright,"
    }
  }
}
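Word timings can be pulled out of such events with a small helper. The function below is our own convenience sketch, not part of any SDK; it assumes the event structure shown above and returns an empty list when no words are present.

```python
import json

def extract_word_timings(event_json: str) -> list:
    """Return (text, begin_time_ms, end_time_ms) for each word, if present."""
    event = json.loads(event_json)
    words = (event.get("payload", {})
                  .get("output", {})
                  .get("sentence", {})
                  .get("words", []))
    return [(w["text"], w["begin_time"], w["end_time"]) for w in words]

sample = '''{"payload": {"output": {"sentence": {"words": [
  {"text": "bed", "begin_index": 0, "end_index": 1,
   "begin_time": 280, "end_time": 640}]}}}}'''
print(extract_word_timings(sample))  # [('bed', 280, 640)]
```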

seed

int

No

The random seed used during generation. Different seeds produce different synthesis results. If the model, text, voice, and other parameters are identical, using the same seed reproduces the same output.

Default value: 0.

Valid values: [0, 65535].

language_hints

array[string]

No

Specifies the target language for speech synthesis to improve the synthesis effect.

Use when pronunciation or synthesis quality is poor for numbers, abbreviations, symbols, or less common languages:

  • Numbers are not read aloud as expected. For example, "hello, this is 110" is read as "hello, this is one one zero" rather than "hello, this is yāo yāo líng".

  • The '@' symbol is mispronounced as 'ai te' instead of 'at'.

  • The synthesis quality for less common languages is poor and sounds unnatural.

Valid values:

  • zh: Chinese

  • en: English

  • fr: French

  • de: German

  • ja: Japanese

  • ko: Korean

  • ru: Russian

  • pt: Portuguese

  • th: Thai

  • id: Indonesian

  • vi: Vietnamese

Note: This parameter is an array, but the current version only processes the first element. Therefore, we recommend passing only one value.

Important

This parameter specifies the target language for speech synthesis. This setting is independent of the language of the sample audio used for voice cloning. To set the source language for a cloning task, see CosyVoice Voice Cloning/Design API.

instruction

string

No

Sets an instruction to control synthesis effects such as dialect, emotion, or speaking style. This feature is available only for cloned voices of the cosyvoice-v3.5-flash, cosyvoice-v3.5-plus, and cosyvoice-v3-flash models, and for system voices marked as supporting Instruct in the voice list.

Length limit: 100 characters.

A Chinese character (including simplified and traditional Chinese, Japanese Kanji, and Korean Hanja) is counted as two characters. All other characters, such as punctuation marks, letters, numbers, and Japanese/Korean Kana/Hangul, are counted as one character.

Usage requirements (vary by model):

  • v3.5-flash and v3.5-plus: Any natural language instruction to control emotion, speech rate, etc.

    Important

    cosyvoice-v3.5-flash and cosyvoice-v3.5-plus have no system voices. Only custom voices from voice design or voice cloning are supported.

    Instruction examples:

    Speak in a very excited and high-pitched tone, expressing the ecstasy and excitement of a great success.
    Please maintain a medium-slow speech rate, with an elegant and intellectual tone, giving a sense of calm and reassurance.
    The tone should be full of sorrow and nostalgia, with a slight nasal quality, as if narrating a heartbreaking story.
    Please try to speak in a breathy voice, with a very low volume, creating a sense of intimate and mysterious whispering.
    The tone should be very impatient and annoyed, with a faster speech rate and minimal pauses between sentences.
    Please imitate a kind and gentle elder, with a steady speech rate and a voice full of care and affection.
    The tone should be sarcastic and disdainful, with emphasis on keywords and a slightly rising intonation at the end of sentences.
    Please speak in an extremely fearful and trembling voice.
    The tone should be like a professional news anchor: calm, objective, and articulate, with a neutral emotion.
    The tone should be lively and playful, with a clear smile, making the voice sound energetic and sunny.
  • cosyvoice-v3-flash: The following requirements must be met:

    • Cloned voices: Use any natural language to control the speech synthesis effect.

      Instruction examples:

      Please speak in Cantonese. (Supported dialects: Cantonese, Northeastern, Gansu, Guizhou, Henan, Hubei, Jiangxi, Minnan, Ningxia, Shanxi, Shaanxi, Shandong, Shanghainese, Sichuan, Tianjin, Yunnan.)
      Please say a sentence as loudly as possible.
      Please say a sentence as slowly as possible.
      Please say a sentence as quickly as possible.
      Please say a sentence very softly.
      Can you speak a little slower?
      Can you speak very quickly?
      Can you speak very slowly?
      Can you speak a little faster?
      Please say a sentence very angrily.
      Please say a sentence very happily.
      Please say a sentence very fearfully.
      Please say a sentence very sadly.
      Please say a sentence very surprisedly.
      Please try to sound as firm as possible.
      Please try to sound as angry as possible.
      Please try an approachable tone.
      Please speak in a cold tone.
      Please speak in a majestic tone.
      I want to experience a natural tone.
      I want to see how you express a threat.
      I want to see how you express wisdom.
      I want to see how you express seduction.
      I want to hear you speak in a lively way.
      I want to hear you speak with passion.
      I want to hear you speak in a steady manner.
      I want to hear you speak with confidence.
      Can you talk to me with excitement?
      Can you show an arrogant emotion?
      Can you show an elegant emotion?
      Can you answer the question happily?
      Can you give a gentle emotional demonstration?
      Can you talk to me in a calm tone?
      Can you answer me in a deep way?
      Can you talk to me with a gruff attitude?
      Tell me the answer in a sinister voice.
      Tell me the answer in a resilient voice.
      Narrate in a natural and friendly chat style.
      Speak in the tone of a radio drama podcaster.
    • System voices: The instruction must use a fixed format and content. For more information, see the voice list.

enable_aigc_tag

boolean

No

Specifies whether to add an invisible AIGC identifier to the generated audio. When set to true, an invisible identifier is embedded into the audio in supported formats (WAV, MP3, and Opus).

Default value: false.

Only cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 support this feature.

aigc_propagator

string

No

Sets the ContentPropagator field in the invisible AIGC identifier to identify the content propagator. This parameter takes effect only when enable_aigc_tag is true.

Default value: Alibaba Cloud UID.

Only cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 support this feature.

aigc_propagate_id

string

No

Sets the PropagateID field in the invisible AIGC identifier to uniquely identify a specific propagation behavior. This parameter takes effect only when enable_aigc_tag is true.

Default value: The request ID of the current speech synthesis request.

Only cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 support this feature.

hot_fix

object

No

Configuration for text hotpatching. Allows you to customize the pronunciation of specific words or replace text before synthesis. This feature is available only for cloned voices of cosyvoice-v3-flash.

Parameters:

  • pronunciation: Customize pronunciation. Provide pinyin for words to correct incorrect default pronunciation.

  • replace: Text replacement. Replace specified words with target text before synthesis. The replaced text becomes the actual synthesized content.

Example:

"hot_fix": {
  "pronunciation": [
    {"weather": "tian1 qi4"}
  ],
  "replace": [
    {"today": "jin1 tian1"}
  ]
}

enable_markdown_filter

boolean

No

Specifies whether to enable Markdown filtering. When enabled, the system automatically removes Markdown symbols from the input text before synthesizing speech, preventing them from being read aloud. This feature is available only for cloned voices of cosyvoice-v3-flash.

Default value: false.

Valid values:

  • true

  • false

2. continue-task instruction

Sends text to synthesize.

Send all text in one continue-task instruction, or split it across multiple continue-task instructions, in order.

Important

When to send: After receiving the task-started event.

Note

Do not wait longer than 23 seconds between sending text fragments; otherwise, the 'request timeout after 23 seconds' error occurs.

If no more text remains, send the finish-task instruction to end the task.

The server enforces a 23-second timeout. Clients cannot modify this.

Example:

{
    "header": {
        "action": "continue-task",
        "task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx", // Random UUID
        "streaming": "duplex"
    },
    "payload": {
        "input": {
            "text": "Before my bed, moonlight shines bright, I suspect it's frost upon the ground."
        }
    }
}

header parameter descriptions:

Parameter

Type

Required

Description

header.action

string

Yes

Instruction type.

Fixed value: "continue-task".

header.task_id

string

Yes

Task ID for this request.

The task_id must match the one used in the run-task instruction.

header.streaming

string

Yes

Fixed string: "duplex"

payload parameter descriptions:

Parameter

Type

Required

Description

input.text

string

Yes

Text to synthesize.

3. finish-task instruction: End task

Ends the speech synthesis task.

Always send this instruction. Otherwise, you may face:

  • Incomplete audio: The server will not force-synthesize incomplete cached sentences, which results in missing endings.

  • Connection timeout: If you wait more than 23 seconds after the last continue-task instruction before sending finish-task, the connection times out and closes.

  • Billing issues: Tasks that do not end normally may return inaccurate usage information.

Important

When to send: Send immediately after all continue-task instructions have been sent. Do not wait for audio to finish or delay sending—this may trigger timeouts.

Example:

{
    "header": {
        "action": "finish-task",
        "task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
        "streaming": "duplex"
    },
    "payload": {
        "input": {}  // input must exist, or it will return an error
    }
}

header parameter descriptions:

Parameter

Type

Required

Description

header.action

string

Yes

Instruction type.

Fixed value: "finish-task".

header.task_id

string

Yes

Task ID for this request.

The task_id must match the one used in the run-task instruction.

header.streaming

string

Yes

Fixed string: "duplex"

payload parameter descriptions:

Parameter

Type

Required

Description

payload.input

object

Yes

Fixed format: {}.

Events (server → client)

Events are JSON messages sent from the server to the client. Each event marks a stage in the task lifecycle.

Note

The server sends binary audio separately—it is not included in any event.

1. task-started event: Task started

The task-started event confirms that the task has started. Send continue-task or finish-task instructions only after receiving this event. Otherwise, the task fails.

The task-started event’s payload contains no content.

Example:

{
    "header": {
        "task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
        "event": "task-started",
        "attributes": {}
    },
    "payload": {}
}

header parameter descriptions:

Parameter

Type

Description

header.event

string

Event type.

Fixed value: "task-started".

header.task_id

string

Task ID generated by the client.

2. result-generated event

While you send continue-task and finish-task instructions, the server continuously returns result-generated events.

To link audio data to its corresponding text, the server includes sentence metadata in the result-generated event alongside the audio. The server automatically splits the input text into sentences. The synthesis of each sentence consists of three sub-events:

  • sentence-begin: Marks sentence start and returns the text to synthesize.

  • sentence-synthesis: Marks an audio data chunk. Each event is followed immediately by an audio data frame over the WebSocket binary channel.

    • One sentence produces multiple sentence-synthesis events—one per audio chunk.

    • The client must receive these audio chunks in order and append them to the same file.

    • Each sentence-synthesis event maps one-to-one with its following audio frame—no misalignment occurs.

  • sentence-end: Marks sentence end and returns the sentence text and cumulative billed character count.

Use the payload.output.type field to distinguish between sub-event types.

Example:

sentence-begin

{
    "header": {
        "task_id": "3f2d5c86-0550-45c0-801f-xxxxxxxxxx",
        "event": "result-generated",
        "attributes": {}
    },
    "payload": {
        "output": {
            "sentence": {
                "index": 0,
                "words": []
            },
            "type": "sentence-begin",
            "original_text": "Before my bed, moonlight shines bright,"
        }
    }
}

sentence-synthesis

{
    "header": {
        "task_id": "3f2d5c86-0550-45c0-801f-xxxxxxxxxx",
        "event": "result-generated",
        "attributes": {}
    },
    "payload": {
        "output": {
            "sentence": {
                "index": 0,
                "words": []
            },
            "type": "sentence-synthesis"
        }
    }
}

sentence-end

{
    "header": {
        "task_id": "3f2d5c86-0550-45c0-801f-xxxxxxxxxx",
        "event": "result-generated",
        "attributes": {}
    },
    "payload": {
        "output": {
            "sentence": {
                "index": 0,
                "words": []
            },
            "type": "sentence-end",
            "original_text": "Before my bed, moonlight shines bright,"
        },
        "usage": {
            "characters": 11
        }
    }
}
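The dispatch on payload.output.type can be sketched in Go. The struct below mirrors only the fields used in the examples above; the helper and type names are illustrative, not part of the API:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// resultGenerated mirrors the subset of a result-generated event used here.
type resultGenerated struct {
	Payload struct {
		Output struct {
			Type         string `json:"type"`
			OriginalText string `json:"original_text"`
			Sentence     struct {
				Index int `json:"index"`
			} `json:"sentence"`
		} `json:"output"`
	} `json:"payload"`
}

// subEvent extracts the sub-event type from a result-generated message.
func subEvent(raw []byte) (string, error) {
	var ev resultGenerated
	if err := json.Unmarshal(raw, &ev); err != nil {
		return "", err
	}
	return ev.Payload.Output.Type, nil
}

func main() {
	msg := []byte(`{"payload":{"output":{"sentence":{"index":0},"type":"sentence-end","original_text":"..."}}}`)
	t, err := subEvent(msg)
	if err != nil {
		panic(err)
	}
	fmt.Println(t) // sentence-end

	switch t {
	case "sentence-begin":
		// Sentence started: original_text holds the text to synthesize.
	case "sentence-synthesis":
		// The next binary frame on the connection is this sentence's audio chunk.
	case "sentence-end":
		// Sentence done: payload.usage.characters holds the cumulative billed count.
	}
}
```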

header parameter descriptions:

Parameter

Type

Description

header.event

string

Event type.

Fixed value: "result-generated".

header.task_id

string

Task ID generated by the client.

header.attributes

object

Additional attributes—usually an empty object.

payload parameter descriptions:

Parameter

Type

Description

payload.output.type

string

Sub-event type.

Values:

  • sentence-begin: Marks sentence start and returns the text to synthesize.

  • sentence-synthesis: Marks an audio data chunk. Each event is followed immediately by an audio data frame over the WebSocket binary channel.

    • One sentence produces multiple sentence-synthesis events—one per audio chunk.

    • The client must receive these audio chunks in order and append them to the same file.

    • Each sentence-synthesis event maps one-to-one with its following audio frame—no misalignment occurs.

  • sentence-end: Marks sentence end and returns the sentence text and cumulative billed character count.

Full event flow

For each sentence to synthesize, the server returns events in this order:

  1. sentence-begin: Marks sentence start and includes sentence text (original_text).

  2. sentence-synthesis (multiple times): Each event is followed immediately by a binary audio data frame.

  3. sentence-end: Marks sentence end and includes sentence text and cumulative billed character count.

payload.output.sentence.index

integer

Sentence number, starting from 0.

payload.output.sentence.words

array

An array of word information (text, position, and timing).

payload.output.sentence.words.text

string

Word text.

payload.output.sentence.words.begin_index

integer

Starting position index of the word in the sentence, counting from 0.

payload.output.sentence.words.end_index

integer

Ending position index of the word in the sentence, counting from 1.

payload.output.sentence.words.begin_time

integer

Start timestamp of the word’s audio, in milliseconds.

payload.output.sentence.words.end_time

integer

End timestamp of the word’s audio, in milliseconds.

payload.output.original_text

string

Sentence content after splitting the user’s input text. The last sentence may omit this field.

payload.usage.characters

integer

Total billed characters in this request so far. Within one task, the usage field may appear in either the result-generated event or the task-finished event. The usage field contains the cumulative total. Use the last occurrence.

3. task-finished event: Task finished

The task-finished event marks the end of the task.

After the task ends, close the WebSocket connection and exit, or reuse the connection to send a new run-task instruction (see Connection overhead and reuse).

Example:

{
    "header": {
        "task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
        "event": "task-finished",
        "attributes": {
            "request_uuid": "0a9dba9e-d3a6-45a4-be6d-xxxxxxxxxxxx"
        }
    },
    "payload": {
        "output": {
            "sentence": {
                "words": []
            }
        },
        "usage": {
            "characters": 13
        }
    }
}

header parameter descriptions:

Parameter

Type

Description

header.event

string

Event type.

Fixed value: "task-finished".

header.task_id

string

Task ID generated by the client.

header.attributes.request_uuid

string

Request ID. Provide this to CosyVoice developers for issue diagnosis.

payload parameter descriptions:

Parameter

Type

Description

payload.usage.characters

integer

Total billed characters in this request so far. Within one task, the usage field may appear in either the result-generated event or the task-finished event. The usage field contains the cumulative total. Use the last occurrence.

4. task-failed event: Task failed

The task-failed event indicates that the task has failed. Close the WebSocket connection and review the error message to identify the cause.

Example:

{
    "header": {
        "task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
        "event": "task-failed",
        "error_code": "InvalidParameter",
        "error_message": "[tts:]Engine return error code: 418",
        "attributes": {}
    },
    "payload": {}
}

header parameter descriptions:

Parameter

Type

Description

header.event

string

Event type.

Fixed value: "task-failed".

header.task_id

string

Task ID generated by the client.

header.error_code

string

Error type description.

header.error_message

string

Detailed error reason.

Task interruption methods

During streaming synthesis, you can interrupt the current task early—for example, if the user cancels playback or interrupts a live conversation—using one of these methods:

Interrupt method

Server behavior

Use case

Directly close the connection

  • The server immediately stops synthesis.

  • Any audio already generated but not yet sent is discarded.

  • The client does not receive the task-finished event.

  • The connection cannot be reused after it is closed.

Immediate interruption: The user cancels playback, switches content, or exits the app.

Send a finish-task command

  • The server synthesizes all remaining cached text.

  • It returns any remaining audio chunks.

  • It returns the task-finished event.

  • The connection remains reusable, allowing you to start new tasks.

Graceful stop: You stop sending new text but still receive audio for the cached content.

Connection overhead and reuse

The WebSocket service supports connection reuse to reduce overhead.

Send a run-task instruction to start a task, then a finish-task instruction to end it. After the task-finished event, reuse the same connection by sending a new run-task instruction.

Important
  1. A new run-task instruction can only be sent after the server returns a task-finished event.

  2. Different tasks on a reused connection must use different task_ids.

  3. If a task fails during execution, the server returns a task-failed event and closes the connection. This connection cannot be reused.

  4. If no new task starts within 60 seconds after the task ends, the connection times out and closes automatically.
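The reuse rules above can be condensed into a small guard; canStartNewTask is an illustrative helper, not part of the API:

```go
package main

import "fmt"

// canStartNewTask applies the connection-reuse rules: a new run-task is
// allowed only after task-finished, and only with a task_id not yet used
// on this connection. After task-failed, the server closes the connection.
func canStartNewTask(lastEvent, newTaskID string, usedIDs map[string]bool) bool {
	if lastEvent != "task-finished" {
		return false
	}
	return !usedIDs[newTaskID]
}

func main() {
	used := map[string]bool{"task-001": true}
	fmt.Println(canStartNewTask("task-finished", "task-002", used)) // true
	fmt.Println(canStartNewTask("task-finished", "task-001", used)) // false: task IDs must differ
	fmt.Println(canStartNewTask("task-failed", "task-002", used))   // false: connection is closed
}
```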

Performance metrics and concurrency limits

Concurrency limits

See Rate limiting.

To increase your concurrency quota, contact customer support. Quota adjustments require a review and are usually completed within 1 to 3 business days.

Note

Best practice: Reuse a WebSocket connection for multiple tasks instead of creating a new connection for each task. See Connection overhead and reuse.

Connection performance and latency

Typical connection time:

  • Clients in the Chinese mainland: WebSocket connection establishment (from newWebSocket to onOpen) typically takes 200 to 1000 ms.

  • Cross-border connections (such as Hong Kong or international regions): Connection latency may reach 1 to 3 seconds. In rare cases, it may reach 10 to 30 seconds.

Troubleshooting long connection times:

If a WebSocket connection takes longer than 30 seconds to establish, check for the following issues:

  1. Network issues: High network latency between the client and the server, such as latency caused by cross-border connections or poor ISP quality.

  2. Slow DNS resolution: DNS resolution for dashscope.aliyuncs.com may be slow. Try using a public DNS such as 8.8.8.8, or configure your local hosts file.

  3. Slow TLS handshake: An outdated TLS version on the client or slow certificate validation. Use TLS 1.2 or later.

  4. Proxy or firewall: Corporate networks may block WebSocket connections or require the use of a proxy.

Troubleshooting tools:

  • Use Wireshark or tcpdump to analyze the timing of the TCP handshake, TLS handshake, and WebSocket Upgrade phases.

  • Test HTTP connection latency with curl: curl -w "@curl-format.txt" -o /dev/null -s https://dashscope.aliyuncs.com

Note

The CosyVoice WebSocket API is deployed in the Beijing region of the Chinese mainland. If your client is in another region, such as Hong Kong or an overseas region, you can use a nearby relay server or CDN to accelerate the connection.

Audio generation performance

Synthesis speed:

  • Real-time factor (RTF): CosyVoice models typically synthesize audio with a real-time factor of 0.1 to 0.5, meaning that generating 1 second of audio takes 0.1 to 0.5 seconds. The actual speed depends on the model version, text length, and server load.

  • First packet latency: The latency from sending the continue-task instruction to receiving the first audio chunk is typically 200 to 800 ms.

Example code

This example demonstrates basic service connectivity only. Implement production-ready logic for your specific use case.

When writing WebSocket client code, use asynchronous programming to send and receive messages simultaneously:

  1. Establish the WebSocket connection

    Call your WebSocket library’s connection function and pass the Headers and URL. Implementation varies by language or library.

  2. Listen for server messages

    Use your WebSocket library’s callback function to listen for server messages. Implementation varies by language.

    Server messages fall into two categories: binary audio streams and events.

    Listen for events

    Process the binary audio stream: The server sends audio over the binary channel in frames. Complete audio data is split across multiple packets.

    • In streaming speech synthesis, for compressed formats such as MP3 and Opus, the segmented audio data must be played using a streaming player. Do not play it frame by frame, as this causes decoding to fail.

      Streaming players include FFmpeg, PyAudio (Python), AudioFormat (Java), and MediaSource (JavaScript).
    • When combining audio data into a complete audio file, write to the same file in append mode.

    • For WAV and MP3 audio from streaming speech synthesis, only the first frame contains header information. Subsequent frames contain only audio data.

  3. Send messages to the server (observe timing carefully)

    From a separate thread, send instructions to the server. Implementation varies by language.

    Send instructions in the following order to prevent task failure:

    1. Send the run-task instruction

    2. Send the continue-task instruction

      • Sends text to synthesize.

      • Send only after receiving the task-started event from the server.

    3. Send the finish-task instruction

  4. Close the WebSocket connection

    Close the WebSocket connection when the program ends normally, encounters an exception, or receives the task-finished event or task-failed event. Usually call the library’s close function.

View full examples

Go

package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"os"
	"strings"
	"time"

	"github.com/google/uuid"
	"github.com/gorilla/websocket"
)

const (
	// Use this URL for Singapore region. For Beijing region, replace with: wss://dashscope.aliyuncs.com/api-ws/v1/inference/
	wsURL      = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference/"
	outputFile = "output.mp3"
)

func main() {
	// Get API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
	// If no environment variable is set, replace next line with: apiKey := "sk-xxx"
	apiKey := os.Getenv("DASHSCOPE_API_KEY")

	// Clear output file
	os.Remove(outputFile)
	if f, err := os.Create(outputFile); err == nil {
		f.Close()
	}

	// Connect WebSocket
	header := make(http.Header)
	header.Add("X-DashScope-DataInspection", "enable")
	header.Add("Authorization", fmt.Sprintf("bearer %s", apiKey))

	conn, resp, err := websocket.DefaultDialer.Dial(wsURL, header)
	if err != nil {
		if resp != nil {
			fmt.Printf("Connection failed HTTP status code: %d\n", resp.StatusCode)
		}
		fmt.Println("Connection failed:", err)
		return
	}
	defer conn.Close()

	// Generate task ID
	taskID := uuid.New().String()
	fmt.Printf("Generated task ID: %s\n", taskID)

	// Send run-task instruction
	runTaskCmd := map[string]interface{}{
		"header": map[string]interface{}{
			"action":    "run-task",
			"task_id":   taskID,
			"streaming": "duplex",
		},
		"payload": map[string]interface{}{
			"task_group": "audio",
			"task":       "tts",
			"function":   "SpeechSynthesizer",
			"model":      "cosyvoice-v3-flash",
			"parameters": map[string]interface{}{
				"text_type":   "PlainText",
				"voice":       "longanyang",
				"format":      "mp3",
				"sample_rate": 22050,
				"volume":      50,
				"rate":        1,
				"pitch":       1,
				// If enable_ssml is true, only one continue-task instruction is allowed. Otherwise, it returns "Text request limit violated, expected 1."
				"enable_ssml": false,
			},
			"input": map[string]interface{}{},
		},
	}

	runTaskJSON, _ := json.Marshal(runTaskCmd)
	fmt.Printf("Sent run-task instruction: %s\n", string(runTaskJSON))

	err = conn.WriteMessage(websocket.TextMessage, runTaskJSON)
	if err != nil {
		fmt.Println("Failed to send run-task:", err)
		return
	}

	textSent := false

	// Process messages
	for {
		messageType, message, err := conn.ReadMessage()
		if err != nil {
			fmt.Println("Failed to read message:", err)
			break
		}

		// Process binary message
		if messageType == websocket.BinaryMessage {
			fmt.Printf("Received binary message, length: %d\n", len(message))
			file, _ := os.OpenFile(outputFile, os.O_APPEND|os.O_WRONLY|os.O_CREATE, 0644)
			file.Write(message)
			file.Close()
			continue
		}

		// Process text message
		messageStr := string(message)
		fmt.Printf("Received text message: %s\n", strings.ReplaceAll(messageStr, "\n", ""))

		// Simple JSON parse to get event type
		var msgMap map[string]interface{}
		if json.Unmarshal(message, &msgMap) == nil {
			if header, ok := msgMap["header"].(map[string]interface{}); ok {
				if event, ok := header["event"].(string); ok {
					fmt.Printf("Event type: %s\n", event)

					switch event {
					case "task-started":
						fmt.Println("=== Received task-started event ===")

						if !textSent {
							// Send continue-task instruction

							texts := []string{"Before my bed, moonlight shines bright, I suspect it's frost upon the ground.", "I raise my eyes to gaze at the bright moon, then bow my head, thinking of home."}

							for _, text := range texts {
								continueTaskCmd := map[string]interface{}{
									"header": map[string]interface{}{
										"action":    "continue-task",
										"task_id":   taskID,
										"streaming": "duplex",
									},
									"payload": map[string]interface{}{
										"input": map[string]interface{}{
											"text": text,
										},
									},
								}

								continueTaskJSON, _ := json.Marshal(continueTaskCmd)
								fmt.Printf("Sent continue-task instruction: %s\n", string(continueTaskJSON))

								err = conn.WriteMessage(websocket.TextMessage, continueTaskJSON)
								if err != nil {
									fmt.Println("Failed to send continue-task:", err)
									return
								}
							}

							textSent = true

							// Delay before sending finish-task
							time.Sleep(500 * time.Millisecond)

							// Send finish-task instruction
							finishTaskCmd := map[string]interface{}{
								"header": map[string]interface{}{
									"action":    "finish-task",
									"task_id":   taskID,
									"streaming": "duplex",
								},
								"payload": map[string]interface{}{
									"input": map[string]interface{}{},
								},
							}

							finishTaskJSON, _ := json.Marshal(finishTaskCmd)
							fmt.Printf("Sent finish-task instruction: %s\n", string(finishTaskJSON))

							err = conn.WriteMessage(websocket.TextMessage, finishTaskJSON)
							if err != nil {
								fmt.Println("Failed to send finish-task:", err)
								return
							}
						}

					case "task-finished":
						fmt.Println("=== Task completed ===")
						return

					case "task-failed":
						fmt.Println("=== Task failed ===")
						if header["error_message"] != nil {
							fmt.Printf("Error message: %s\n", header["error_message"])
						}
						return

					case "result-generated":
						fmt.Println("Received result-generated event")
					}
				}
			}
		}
	}
}

C#

using System.Net.WebSockets;
using System.Text;
using System.Text.Json;

class Program {
    // Get API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
    // If no environment variable is set, replace next line with: private static readonly string ApiKey = "sk-xxx";
    private static readonly string ApiKey = Environment.GetEnvironmentVariable("DASHSCOPE_API_KEY") ?? throw new InvalidOperationException("DASHSCOPE_API_KEY environment variable is not set.");

    // Use this URL for Singapore region. For Beijing region, replace with: wss://dashscope.aliyuncs.com/api-ws/v1/inference/
    private const string WebSocketUrl = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference/";
    // Output file path
    private const string OutputFilePath = "output.mp3";

    // WebSocket client
    private static ClientWebSocket _webSocket = new ClientWebSocket();
    // Cancellation token source
    private static CancellationTokenSource _cancellationTokenSource = new CancellationTokenSource();
    // Task ID
    private static string? _taskId;
    // Task started flag
    private static TaskCompletionSource<bool> _taskStartedTcs = new TaskCompletionSource<bool>();

    static async Task Main(string[] args) {
        try {
            // Clear output file
            ClearOutputFile(OutputFilePath);

            // Connect to WebSocket service
            await ConnectToWebSocketAsync(WebSocketUrl);

            // Start receiving messages
            Task receiveTask = ReceiveMessagesAsync();

            // Send run-task instruction
            _taskId = GenerateTaskId();
            await SendRunTaskCommandAsync(_taskId);

            // Wait for task-started event
            await _taskStartedTcs.Task;

            // Send continue-task instructions
            string[] texts = {
                "Before my bed, moonlight shines bright",
                "I suspect it's frost upon the ground",
                "I raise my eyes to gaze at the bright moon",
                "then bow my head, thinking of home"
            };
            foreach (string text in texts) {
                await SendContinueTaskCommandAsync(text);
            }

            // Send finish-task instruction
            await SendFinishTaskCommandAsync(_taskId);

            // Wait for receive task to complete
            await receiveTask;

            Console.WriteLine("Task completed. Connection closed.");
        } catch (OperationCanceledException) {
            Console.WriteLine("Task canceled.");
        } catch (Exception ex) {
            Console.WriteLine($"Error: {ex.Message}");
        } finally {
            _cancellationTokenSource.Cancel();
            _webSocket.Dispose();
        }
    }

    private static void ClearOutputFile(string filePath) {
        if (File.Exists(filePath)) {
            File.WriteAllText(filePath, string.Empty);
            Console.WriteLine("Output file cleared.");
        } else {
            Console.WriteLine("Output file does not exist. No action needed.");
        }
    }

    private static async Task ConnectToWebSocketAsync(string url) {
        var uri = new Uri(url);
        if (_webSocket.State == WebSocketState.Connecting || _webSocket.State == WebSocketState.Open) {
            return;
        }

        // Set WebSocket request headers
        _webSocket.Options.SetRequestHeader("Authorization", $"bearer {ApiKey}");
        _webSocket.Options.SetRequestHeader("X-DashScope-DataInspection", "enable");

        try {
            await _webSocket.ConnectAsync(uri, _cancellationTokenSource.Token);
            Console.WriteLine("Successfully connected to WebSocket service.");
        } catch (OperationCanceledException) {
            Console.WriteLine("WebSocket connection canceled.");
        } catch (Exception ex) {
            Console.WriteLine($"WebSocket connection failed: {ex.Message}");
            throw;
        }
    }

    private static async Task SendRunTaskCommandAsync(string taskId) {
        var command = CreateCommand("run-task", taskId, "duplex", new {
            task_group = "audio",
            task = "tts",
            function = "SpeechSynthesizer",
            model = "cosyvoice-v3-flash",
            parameters = new
            {
                text_type = "PlainText",
                voice = "longanyang",
                format = "mp3",
                sample_rate = 22050,
                volume = 50,
                rate = 1,
                pitch = 1,
                // If enable_ssml is true, only one continue-task instruction is allowed. Otherwise, it returns "Text request limit violated, expected 1."
                enable_ssml = false
            },
            input = new { }
        });

        await SendJsonMessageAsync(command);
        Console.WriteLine("Sent run-task instruction.");
    }

    private static async Task SendContinueTaskCommandAsync(string text) {
        if (_taskId == null) {
            throw new InvalidOperationException("Task ID not initialized.");
        }

        var command = CreateCommand("continue-task", _taskId, "duplex", new {
            input = new {
                text
            }
        });

        await SendJsonMessageAsync(command);
        Console.WriteLine("Sent continue-task instruction.");
    }

    private static async Task SendFinishTaskCommandAsync(string taskId) {
        var command = CreateCommand("finish-task", taskId, "duplex", new {
            input = new { }
        });

        await SendJsonMessageAsync(command);
        Console.WriteLine("Sent finish-task instruction.");
    }

    private static async Task SendJsonMessageAsync(string message) {
        var buffer = Encoding.UTF8.GetBytes(message);
        try {
            await _webSocket.SendAsync(new ArraySegment<byte>(buffer), WebSocketMessageType.Text, true, _cancellationTokenSource.Token);
        } catch (OperationCanceledException) {
            Console.WriteLine("Message send canceled.");
        }
    }

    private static async Task ReceiveMessagesAsync() {
        while (_webSocket.State == WebSocketState.Open) {
            var response = await ReceiveMessageAsync();
            if (response != null) {
                var eventStr = response.RootElement.GetProperty("header").GetProperty("event").GetString();
                switch (eventStr) {
                    case "task-started":
                        Console.WriteLine("Task started.");
                        _taskStartedTcs.TrySetResult(true);
                        break;
                    case "task-finished":
                        Console.WriteLine("Task completed.");
                        _cancellationTokenSource.Cancel();
                        break;
                    case "task-failed":
                        Console.WriteLine("Task failed: " + response.RootElement.GetProperty("header").GetProperty("error_message").GetString());
                        _cancellationTokenSource.Cancel();
                        break;
                    default:
                        // Handle result-generated here
                        break;
                }
            }
        }
    }

    private static async Task<JsonDocument?> ReceiveMessageAsync() {
        var buffer = new byte[1024 * 4]; // Note: an event larger than 4 KB would be truncated; accumulate until result.EndOfMessage in production code.
        var segment = new ArraySegment<byte>(buffer);

        try {
            WebSocketReceiveResult result = await _webSocket.ReceiveAsync(segment, _cancellationTokenSource.Token);

            if (result.MessageType == WebSocketMessageType.Close) {
                await _webSocket.CloseAsync(WebSocketCloseStatus.NormalClosure, "Closing", _cancellationTokenSource.Token);
                return null;
            }

            if (result.MessageType == WebSocketMessageType.Binary) {
                // Process binary data
                Console.WriteLine("Received binary data...");

                // Save binary data to file
                using (var fileStream = new FileStream(OutputFilePath, FileMode.Append)) {
                    fileStream.Write(buffer, 0, result.Count);
                }

                return null;
            }

            string message = Encoding.UTF8.GetString(buffer, 0, result.Count);
            return JsonDocument.Parse(message);
        } catch (OperationCanceledException) {
            Console.WriteLine("Message receive canceled.");
            return null;
        }
    }

    private static string GenerateTaskId() {
        return Guid.NewGuid().ToString("N").Substring(0, 32);
    }

    private static string CreateCommand(string action, string taskId, string streaming, object payload) {
        var command = new {
            header = new {
                action,
                task_id = taskId,
                streaming
            },
            payload
        };

        return JsonSerializer.Serialize(command);
    }
}

PHP

Example code directory structure:

my-php-project/
├── composer.json
├── vendor/
└── index.php

composer.json contents (adjust versions as needed):

{
    "require": {
        "react/event-loop": "^1.3",
        "react/socket": "^1.11",
        "react/stream": "^1.2",
        "react/http": "^1.1",
        "ratchet/pawl": "^0.4"
    },
    "autoload": {
        "psr-4": {
            "App\\": "src/"
        }
    }
}

index.php contents:

<?php

require __DIR__ . '/vendor/autoload.php';

use Ratchet\Client\Connector;
use React\EventLoop\Loop;
use React\Socket\Connector as SocketConnector;

// Get API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
// If no environment variable is set, replace next line with: $api_key = "sk-xxx";
$api_key = getenv("DASHSCOPE_API_KEY");
// Use this URL for Singapore region. For Beijing region, replace with: wss://dashscope.aliyuncs.com/api-ws/v1/inference/
$websocket_url = 'wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference/'; // WebSocket server address
$output_file = 'output.mp3'; // Output file path

$loop = Loop::get();

if (file_exists($output_file)) {
    // Clear file content
    file_put_contents($output_file, '');
}

// Create custom connector
$socketConnector = new SocketConnector($loop, [
    'tcp' => [
        'bindto' => '0.0.0.0:0',
    ],
    'tls' => [
        // Note: disabling certificate verification is insecure; keep verification enabled in production.
        'verify_peer' => false,
        'verify_peer_name' => false,
    ],
]);

$connector = new Connector($loop, $socketConnector);

$headers = [
    'Authorization' => 'bearer ' . $api_key,
    'X-DashScope-DataInspection' => 'enable'
];

$connector($websocket_url, [], $headers)->then(function ($conn) use ($loop, $output_file) {
    echo "Connected to WebSocket server\n";

    // Generate task ID
    $taskId = generateTaskId();

    // Send run-task instruction
    sendRunTaskMessage($conn, $taskId);

    // Define function to send continue-task instruction
    $sendContinueTask = function() use ($conn, $loop, $taskId) {
        // Text to send
        $texts = ["Before my bed, moonlight shines bright", "I suspect it's frost upon the ground", "I raise my eyes to gaze at the bright moon", "then bow my head, thinking of home"];
        $continueTaskCount = 0;
        foreach ($texts as $text) {
            $continueTaskMessage = json_encode([
                "header" => [
                    "action" => "continue-task",
                    "task_id" => $taskId,
                    "streaming" => "duplex"
                ],
                "payload" => [
                    "input" => [
                        "text" => $text
                    ]
                ]
            ]);
            echo "Preparing to send continue-task instruction: " . $continueTaskMessage . "\n";
            $conn->send($continueTaskMessage);
            $continueTaskCount++;
        }
        echo "Number of continue-task instructions sent: " . $continueTaskCount . "\n";

        // Send finish-task instruction
        sendFinishTaskMessage($conn, $taskId);
    };

    // Flag for task-started event
    $taskStarted = false;

    // Listen for messages
    $conn->on('message', function($msg) use ($conn, $sendContinueTask, $loop, &$taskStarted, $taskId, $output_file) {
        if ($msg->isBinary()) {
            // Write binary data to local file
            file_put_contents($output_file, $msg->getPayload(), FILE_APPEND);
        } else {
            // Process non-binary message
            $response = json_decode($msg, true);

            if (isset($response['header']['event'])) {
                handleEvent($conn, $response, $sendContinueTask, $loop, $taskId, $taskStarted);
            } else {
                echo "Unknown message format\n";
            }
        }
    });

    // Listen for connection close
    $conn->on('close', function($code = null, $reason = null) {
        echo "Connection closed\n";
        if ($code !== null) {
            echo "Close code: " . $code . "\n";
        }
        if ($reason !== null) {
            echo "Close reason: " . $reason . "\n";
        }
    });
}, function ($e) {
    echo "Cannot connect: {$e->getMessage()}\n";
});

$loop->run();

/**
 * Generate task ID
 * @return string
 */
function generateTaskId(): string {
    return bin2hex(random_bytes(16));
}

/**
 * Send run-task instruction
 * @param $conn
 * @param $taskId
 */
function sendRunTaskMessage($conn, $taskId) {
    $runTaskMessage = json_encode([
        "header" => [
            "action" => "run-task",
            "task_id" => $taskId,
            "streaming" => "duplex"
        ],
        "payload" => [
            "task_group" => "audio",
            "task" => "tts",
            "function" => "SpeechSynthesizer",
            "model" => "cosyvoice-v3-flash",
            "parameters" => [
                "text_type" => "PlainText",
                "voice" => "longanyang",
                "format" => "mp3",
                "sample_rate" => 22050,
                "volume" => 50,
                "rate" => 1,
                "pitch" => 1,
                // If enable_ssml is true, only one continue-task instruction is allowed. Otherwise, it returns "Text request limit violated, expected 1."
                "enable_ssml" => false
            ],
            "input" => (object) []
        ]
    ]);
    echo "Preparing to send run-task instruction: " . $runTaskMessage . "\n";
    $conn->send($runTaskMessage);
    echo "run-task instruction sent\n";
}

/**
 * Read audio file
 * @param string $filePath
 * @return bool|string
 */
function readAudioFile(string $filePath) {
    $voiceData = file_get_contents($filePath);
    if ($voiceData === false) {
        echo "Cannot read audio file\n";
    }
    return $voiceData;
}

/**
 * Split audio data
 * @param string $data
 * @param int $chunkSize
 * @return array
 */
function splitAudioData(string $data, int $chunkSize): array {
    return str_split($data, $chunkSize);
}

/**
 * Send finish-task instruction
 * @param $conn
 * @param $taskId
 */
function sendFinishTaskMessage($conn, $taskId) {
    $finishTaskMessage = json_encode([
        "header" => [
            "action" => "finish-task",
            "task_id" => $taskId,
            "streaming" => "duplex"
        ],
        "payload" => [
            "input" => (object) []
        ]
    ]);
    echo "Preparing to send finish-task instruction: " . $finishTaskMessage . "\n";
    $conn->send($finishTaskMessage);
    echo "finish-task instruction sent\n";
}

/**
 * Handle event
 * @param $conn
 * @param $response
 * @param $sendContinueTask
 * @param $loop
 * @param $taskId
 * @param $taskStarted
 */
function handleEvent($conn, $response, $sendContinueTask, $loop, $taskId, &$taskStarted) {
    switch ($response['header']['event']) {
        case 'task-started':
            echo "Task started. Sending continue-task instructions...\n";
            $taskStarted = true;
            // Send continue-task instruction
            $sendContinueTask();
            break;
        case 'result-generated':
            // Received result-generated event
            break;
        case 'task-finished':
            echo "Task completed\n";
            // The connection is closed by the timer below, after remaining data arrives
            break;
        case 'task-failed':
            echo "Task failed\n";
            echo "Error code: " . $response['header']['error_code'] . "\n";
            echo "Error message: " . $response['header']['error_message'] . "\n";
            $conn->close();
            break;
        case 'error':
            echo "Error: " . $response['payload']['message'] . "\n";
            break;
        default:
            echo "Unknown event: " . $response['header']['event'] . "\n";
            break;
    }

    // Close connection if task completed
    if ($response['header']['event'] == 'task-finished') {
        // Wait 1 second to ensure all data is transmitted
        $loop->addTimer(1, function() use ($conn) {
            $conn->close();
            echo "Client closed connection\n";
        });
    }

    // Close connection if no task-started event received
    if (!$taskStarted && in_array($response['header']['event'], ['task-failed', 'error'])) {
        $conn->close();
    }
}

Node.js

Install dependencies:

npm install ws
npm install uuid

Example code:

const WebSocket = require('ws');
const fs = require('fs');
const uuid = require('uuid').v4;

// Get API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
// If no environment variable is set, replace next line with: const apiKey = "sk-xxx"
const apiKey = process.env.DASHSCOPE_API_KEY;
// Use this URL for the Singapore region. For the Beijing region, replace with: wss://dashscope.aliyuncs.com/api-ws/v1/inference/
const url = 'wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference/';
// Output file path
const outputFilePath = 'output.mp3';

// Clear output file
fs.writeFileSync(outputFilePath, '');

// Create WebSocket client
const ws = new WebSocket(url, {
  headers: {
    Authorization: `bearer ${apiKey}`,
    'X-DashScope-DataInspection': 'enable'
  }
});

let taskStarted = false;
let taskId = uuid();

ws.on('open', () => {
  console.log('Connected to WebSocket server');

  // Send run-task instruction
  const runTaskMessage = JSON.stringify({
    header: {
      action: 'run-task',
      task_id: taskId,
      streaming: 'duplex'
    },
    payload: {
      task_group: 'audio',
      task: 'tts',
      function: 'SpeechSynthesizer',
      model: 'cosyvoice-v3-flash',
      parameters: {
        text_type: 'PlainText',
        voice: 'longanyang', // Voice
        format: 'mp3', // Audio format
        sample_rate: 22050, // Sample rate
        volume: 50, // Volume
        rate: 1, // Speech rate
        pitch: 1, // Pitch
        enable_ssml: false // Enable SSML. If true, only one continue-task instruction is allowed; sending more than one returns "Text request limit violated, expected 1."
      },
      input: {}
    }
  });
  ws.send(runTaskMessage);
  console.log('Sent run-task message');
});

const fileStream = fs.createWriteStream(outputFilePath, { flags: 'a' });
ws.on('message', (data, isBinary) => {
  if (isBinary) {
    // Write binary data to file
    fileStream.write(data);
  } else {
    const message = JSON.parse(data);

    switch (message.header.event) {
      case 'task-started':
        taskStarted = true;
        console.log('Task started');
        // Send continue-task instruction
        sendContinueTasks(ws);
        break;
      case 'task-finished':
        console.log('Task completed');
        ws.close();
        fileStream.end(() => {
          console.log('File stream closed');
        });
        break;
      case 'task-failed':
        console.error('Task failed: ', message.header.error_message);
        ws.close();
        fileStream.end(() => {
          console.log('File stream closed');
        });
        break;
      default:
        // Handle result-generated here
        break;
    }
  }
});

function sendContinueTasks(ws) {
  const texts = [
    'Before my bed, moonlight shines bright,',
    'I suspect it\'s frost upon the ground.',
    'I raise my eyes to gaze at the bright moon,',
    'then bow my head, thinking of home.'
  ];

  texts.forEach((text, index) => {
    setTimeout(() => {
      if (taskStarted) {
        const continueTaskMessage = JSON.stringify({
          header: {
            action: 'continue-task',
            task_id: taskId,
            streaming: 'duplex'
          },
          payload: {
            input: {
              text: text
            }
          }
        });
        ws.send(continueTaskMessage);
        console.log(`Sent continue-task, text: ${text}`);
      }
    }, index * 1000); // Send every second
  });

  // Send finish-task instruction
  setTimeout(() => {
    if (taskStarted) {
      const finishTaskMessage = JSON.stringify({
        header: {
          action: 'finish-task',
          task_id: taskId,
          streaming: 'duplex'
        },
        payload: {
          input: {}
        }
      });
      ws.send(finishTaskMessage);
      console.log('Sent finish-task');
    }
  }, texts.length * 1000 + 1000); // Send 1 second after last continue-task
}

ws.on('close', () => {
  console.log('Disconnected from WebSocket server');
});

Java

If you use Java, we recommend using the Java DashScope SDK. See Java SDK.

Below is a Java WebSocket usage example. Before running, import these dependencies:

  • Java-WebSocket

  • jackson-databind

We recommend managing dependencies with Maven or Gradle. Configuration examples:

pom.xml

<dependencies>
    <!-- WebSocket Client -->
    <dependency>
        <groupId>org.java-websocket</groupId>
        <artifactId>Java-WebSocket</artifactId>
        <version>1.5.3</version>
    </dependency>

    <!-- JSON Processing -->
    <dependency>
        <groupId>com.fasterxml.jackson.core</groupId>
        <artifactId>jackson-databind</artifactId>
        <version>2.13.0</version>
    </dependency>
</dependencies>

build.gradle

// Omit other code
dependencies {
  // WebSocket Client
  implementation 'org.java-websocket:Java-WebSocket:1.5.3'
  // JSON Processing
  implementation 'com.fasterxml.jackson.core:jackson-databind:2.13.0'
}
// Omit other code

Java code:

import com.fasterxml.jackson.databind.ObjectMapper;

import org.java_websocket.client.WebSocketClient;
import org.java_websocket.handshake.ServerHandshake;

import java.io.FileOutputStream;
import java.io.IOException;
import java.net.URI;
import java.nio.ByteBuffer;
import java.util.*;

public class TTSWebSocketClient extends WebSocketClient {
    private final String taskId = UUID.randomUUID().toString();
    private final String outputFile = "output_" + System.currentTimeMillis() + ".mp3";
    private boolean taskFinished = false;

    public TTSWebSocketClient(URI serverUri, Map<String, String> headers) {
        super(serverUri, headers);
    }

    @Override
    public void onOpen(ServerHandshake serverHandshake) {
        System.out.println("Connection successful");

        // Send run-task instruction
        // If enable_ssml is true, only one continue-task instruction is allowed; sending more than one returns "Text request limit violated, expected 1."
        String runTaskCommand = "{ \"header\": { \"action\": \"run-task\", \"task_id\": \"" + taskId + "\", \"streaming\": \"duplex\" }, \"payload\": { \"task_group\": \"audio\", \"task\": \"tts\", \"function\": \"SpeechSynthesizer\", \"model\": \"cosyvoice-v3-flash\", \"parameters\": { \"text_type\": \"PlainText\", \"voice\": \"longanyang\", \"format\": \"mp3\", \"sample_rate\": 22050, \"volume\": 50, \"rate\": 1, \"pitch\": 1, \"enable_ssml\": false }, \"input\": {} }}";
        send(runTaskCommand);
    }

    @Override
    public void onMessage(String message) {
        System.out.println("Received server message: " + message);
        try {
            // Parse JSON message
            Map<String, Object> messageMap = new ObjectMapper().readValue(message, Map.class);

            if (messageMap.containsKey("header")) {
                Map<String, Object> header = (Map<String, Object>) messageMap.get("header");

                if (header.containsKey("event")) {
                    String event = (String) header.get("event");

                    if ("task-started".equals(event)) {
                        System.out.println("Received task-started event");

                        List<String> texts = Arrays.asList(
                                "Before my bed, moonlight shines bright, I suspect it's frost upon the ground",
                                "I raise my eyes to gaze at the bright moon, then bow my head, thinking of home"
                        );

                        for (String text : texts) {
                            // Send continue-task instruction
                            sendContinueTask(text);
                        }

                        // Send finish-task instruction
                        sendFinishTask();
                    } else if ("task-finished".equals(event)) {
                        System.out.println("Received task-finished event");
                        taskFinished = true;
                        closeConnection();
                    } else if ("task-failed".equals(event)) {
                        System.out.println("Task failed: " + message);
                        closeConnection();
                    }
                }
            }
        } catch (Exception e) {
            System.err.println("Exception occurred: " + e.getMessage());
        }
    }

    @Override
    public void onMessage(ByteBuffer message) {
        System.out.println("Received binary audio data size: " + message.remaining());

        try (FileOutputStream fos = new FileOutputStream(outputFile, true)) {
            byte[] buffer = new byte[message.remaining()];
            message.get(buffer);
            fos.write(buffer);
            System.out.println("Audio data written to local file " + outputFile);
        } catch (IOException e) {
            System.err.println("Failed to write audio data to local file: " + e.getMessage());
        }
    }

    @Override
    public void onClose(int code, String reason, boolean remote) {
        System.out.println("Connection closed: " + reason + " (" + code + ")");
    }

    @Override
    public void onError(Exception ex) {
        System.err.println("Error: " + ex.getMessage());
        ex.printStackTrace();
    }

    private void sendContinueTask(String text) {
        String command = "{ \"header\": { \"action\": \"continue-task\", \"task_id\": \"" + taskId + "\", \"streaming\": \"duplex\" }, \"payload\": { \"input\": { \"text\": \"" + text + "\" } }}";
        send(command);
    }

    private void sendFinishTask() {
        String command = "{ \"header\": { \"action\": \"finish-task\", \"task_id\": \"" + taskId + "\", \"streaming\": \"duplex\" }, \"payload\": { \"input\": {} }}";
        send(command);
    }

    private void closeConnection() {
        if (!isClosed()) {
            close();
        }
    }

    public static void main(String[] args) {
        try {
            // Get API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
            // If no environment variable is set, replace next line with: String apiKey = "sk-xxx"
            String apiKey = System.getenv("DASHSCOPE_API_KEY");
            if (apiKey == null || apiKey.isEmpty()) {
                System.err.println("Set DASHSCOPE_API_KEY environment variable");
                return;
            }

            Map<String, String> headers = new HashMap<>();
            headers.put("Authorization", "bearer " + apiKey);
            // Use this URL for the Singapore region. For the Beijing region, replace with: wss://dashscope.aliyuncs.com/api-ws/v1/inference/
            TTSWebSocketClient client = new TTSWebSocketClient(new URI("wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference/"), headers);

            client.connect();

            while (!client.isClosed() && !client.taskFinished) {
                Thread.sleep(1000);
            }
        } catch (Exception e) {
            System.err.println("Failed to connect to WebSocket service: " + e.getMessage());
            e.printStackTrace();
        }
    }
}

Python

If you use Python, we recommend using the Python DashScope SDK. See Python SDK.

Below is a Python WebSocket usage example. Before running, install dependencies:

pip uninstall websocket-client
pip uninstall websocket
pip install websocket-client
Important

Do not name your Python script "websocket.py", or it will shadow the websocket library and cause: AttributeError: module 'websocket' has no attribute 'WebSocketApp'. Did you mean: 'WebSocket'?

import websocket
import json
import uuid
import os
import time


class TTSClient:
    def __init__(self, api_key, uri):
        """
    Initialize TTSClient instance

    Parameters:
        api_key (str): API key for authentication
        uri (str): WebSocket service address
    """
        self.api_key = api_key  # Replace with your API key
        self.uri = uri  # Replace with your WebSocket address
        self.task_id = str(uuid.uuid4())  # Generate unique task ID
        self.output_file = f"output_{int(time.time())}.mp3"  # Output audio file path
        self.ws = None  # WebSocketApp instance
        self.task_started = False  # Whether task-started was received
        self.task_finished = False  # Whether task-finished or task-failed was received

    def on_open(self, ws):
        """
    Callback when WebSocket connection opens
    Send run-task instruction to start speech synthesis
    """
        print("WebSocket connected")

        # Build run-task instruction
        run_task_cmd = {
            "header": {
                "action": "run-task",
                "task_id": self.task_id,
                "streaming": "duplex"
            },
            "payload": {
                "task_group": "audio",
                "task": "tts",
                "function": "SpeechSynthesizer",
                "model": "cosyvoice-v3-flash",
                "parameters": {
                    "text_type": "PlainText",
                    "voice": "longanyang",
                    "format": "mp3",
                    "sample_rate": 22050,
                    "volume": 50,
                    "rate": 1,
                    "pitch": 1,
                    # If enable_ssml is True, only one continue-task instruction is allowed; sending more than one returns "Text request limit violated, expected 1."
                    "enable_ssml": False
                },
                "input": {}
            }
        }

        # Send run-task instruction
        ws.send(json.dumps(run_task_cmd))
        print("Sent run-task instruction")

    def on_message(self, ws, message):
        """
    Callback when message is received
    Handle text and binary messages separately
    """
        if isinstance(message, str):
            # Handle JSON text message
            try:
                msg_json = json.loads(message)
                print(f"Received JSON message: {msg_json}")

                if "header" in msg_json:
                    header = msg_json["header"]

                    if "event" in header:
                        event = header["event"]

                        if event == "task-started":
                            print("Task started")
                            self.task_started = True

                            # Send continue-task instruction
                            texts = [
                                "Before my bed, moonlight shines bright, I suspect it's frost upon the ground",
                                "I raise my eyes to gaze at the bright moon, then bow my head, thinking of home"
                            ]

                            for text in texts:
                                self.send_continue_task(text)

                            # Send finish-task after all continue-task instructions
                            self.send_finish_task()

                        elif event == "task-finished":
                            print("Task completed")
                            self.task_finished = True
                            self.close(ws)

                        elif event == "task-failed":
                            error_msg = header.get("error_message", "Unknown error")
                            print(f"Task failed: {error_msg}")
                            self.task_finished = True
                            self.close(ws)

            except json.JSONDecodeError as e:
                print(f"JSON parsing failed: {e}")
        else:
            # Handle binary message (audio data)
            print(f"Received binary message, size: {len(message)} bytes")
            with open(self.output_file, "ab") as f:
                f.write(message)
            print(f"Audio data written to local file {self.output_file}")

    def on_error(self, ws, error):
        """Callback on error"""
        print(f"WebSocket error: {error}")

    def on_close(self, ws, close_status_code, close_msg):
        """Callback on close"""
        print(f"WebSocket closed: {close_msg} ({close_status_code})")

    def send_continue_task(self, text):
        """Send continue-task instruction with text to synthesize"""
        cmd = {
            "header": {
                "action": "continue-task",
                "task_id": self.task_id,
                "streaming": "duplex"
            },
            "payload": {
                "input": {
                    "text": text
                }
            }
        }

        self.ws.send(json.dumps(cmd))
        print(f"Sent continue-task instruction, text: {text}")

    def send_finish_task(self):
        """Send finish-task instruction to end speech synthesis"""
        cmd = {
            "header": {
                "action": "finish-task",
                "task_id": self.task_id,
                "streaming": "duplex"
            },
            "payload": {
                "input": {}
            }
        }

        self.ws.send(json.dumps(cmd))
        print("Sent finish-task instruction")

    def close(self, ws):
        """Close connection manually"""
        if ws and ws.sock and ws.sock.connected:
            ws.close()
            print("Manually closed connection")

    def run(self):
        """Start WebSocket client"""
        # Set request headers (authentication)
        header = {
            "Authorization": f"bearer {self.api_key}",
            "X-DashScope-DataInspection": "enable"
        }

        # Create WebSocketApp instance
        self.ws = websocket.WebSocketApp(
            self.uri,
            header=header,
            on_open=self.on_open,
            on_message=self.on_message,
            on_error=self.on_error,
            on_close=self.on_close
        )

        print("Listening for WebSocket messages...")
        self.ws.run_forever()  # Start long-lived connection


# Example usage
if __name__ == "__main__":
    # Get API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
    # If no environment variable is set, replace next line with: API_KEY = "sk-xxx"
    API_KEY = os.environ.get("DASHSCOPE_API_KEY")
    # Use this URL for the Singapore region. For the Beijing region, replace with: wss://dashscope.aliyuncs.com/api-ws/v1/inference/
    SERVER_URI = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference/"  # Replace with your WebSocket address

    client = TTSClient(API_KEY, SERVER_URI)
    client.run()

Error codes

If an error occurs, see Error messages for troubleshooting.

FAQ

Features, billing, and rate limiting

Q: What can I do if the pronunciation is inaccurate?

Use SSML to fix pronunciation.

Q: Why use WebSocket instead of HTTP/HTTPS? Why not provide a RESTful API?

The Speech Service uses WebSocket instead of HTTP/HTTPS or RESTful APIs because it requires full-duplex communication. WebSocket allows both the server and client to proactively push data, such as real-time progress updates for synthesis or recognition. RESTful APIs over HTTP only support client-initiated request-response cycles and cannot meet real-time interaction requirements.

Q: Speech synthesis is billed per character. How do I check or retrieve the character count for each synthesis?

The character count is available in the payload.usage.characters field of the server’s result-generated event. Use the value from the last received result-generated event.
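As an illustrative sketch (event shapes follow the protocol described above; the sample payloads below are made up), the last received usage.characters value can be tracked like this:

```python
import json

def extract_billed_characters(events):
    """Return usage.characters from the last result-generated event, or None."""
    billed = None
    for raw in events:
        msg = json.loads(raw)
        if msg.get("header", {}).get("event") == "result-generated":
            usage = msg.get("payload", {}).get("usage") or {}
            if "characters" in usage:
                billed = usage["characters"]  # later events override earlier ones
    return billed

# Illustrative events; real ones arrive one at a time in the on-message callback
events = [
    '{"header": {"event": "result-generated"}, "payload": {"usage": {"characters": 4}}}',
    '{"header": {"event": "result-generated"}, "payload": {"usage": {"characters": 9}}}',
]
print(extract_billed_characters(events))  # 9
```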

Troubleshooting

Important

If your code throws an error, check whether the instruction sent to the server is correct. Print the instruction and verify its format and required fields. If the instruction is correct, refer to the error codes for further diagnosis.

Q: How do I get the Request ID?

Two methods are available:

Q: Why does SSML fail?

Troubleshoot this issue step by step:

  1. Ensure you correctly follow the limitations and constraints.

  2. Ensure SSML is called correctly. For details, see SSML support.

  3. Ensure the text is plain text and meets formatting requirements. For details, see SSML markup language overview.

Q: Why does the audio duration of TTS speech synthesis differ from the WAV file's displayed duration? For example, a WAV file shows 7 seconds but the actual audio is less than 5 seconds?

TTS uses streaming synthesis: audio is synthesized and returned progressively, so the WAV header is written before the total length is known and contains only an estimated duration, which can differ from the real one. If you need a precise duration, set the format to PCM and add the WAV header yourself after receiving the complete synthesis result.
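A minimal sketch of rewrapping a complete PCM result with an accurate WAV header, using Python's stdlib wave module. Mono, 16-bit samples at 22050 Hz are assumptions matching the sample_rate used in the examples above:

```python
import io
import wave

def pcm_to_wav_bytes(pcm_data, sample_rate=22050, channels=1, sample_width=2):
    """Wrap raw PCM bytes in a WAV container whose header has the exact length."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)       # mono: an assumption for this sketch
        wav.setsampwidth(sample_width)   # 16-bit samples: also an assumption
        wav.setframerate(sample_rate)
        wav.writeframes(pcm_data)        # header sizes are finalized on close
    return buf.getvalue()

# One second of 22050 Hz mono 16-bit PCM is exactly 44100 bytes:
wav_bytes = pcm_to_wav_bytes(b"\x00" * 44100)
```

The resulting header reflects the true data size, so players report the exact duration.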

Q: Why can't the audio be played?

Check the following scenarios one by one:

  1. The audio is saved as a complete file (such as xx.mp3).

    1. Format consistency: Verify request format matches file extension (e.g., WAV with .wav, not .mp3).

    2. Player compatibility: Verify that your player supports the format and sample rate of the audio file. Some players may not support high sample rates or specific audio encodings.

  2. The audio is played in a stream.

    1. Save the audio stream as a complete file and try to play it with a player. If the file cannot be played, see the troubleshooting method for scenario 1.

    2. If the file plays normally, the problem may be with your streaming playback implementation. Verify that your player supports streaming playback.

      Common tools and libraries that support streaming playback include FFmpeg, PyAudio (Python), AudioFormat (Java), and MediaSource (JavaScript).

Q: Why does the audio playback stutter?

Check the following scenarios one by one:

  1. Check the text sending speed: Make sure the interval between text segments is reasonable. Avoid situations where the next segment is not sent promptly after the previous audio segment finishes playing.

  2. Check the callback function performance:

    • Avoid heavy business logic in the callback function—it can cause blocking.

    • Callbacks run in the WebSocket thread. Blocking prevents timely packet reception and causes audio playback to stutter.

    • We recommend writing audio data to a separate buffer and processing it in another thread to avoid blocking the WebSocket thread.

  3. Check network stability: Ensure your network connection is stable to avoid audio transmission interruptions or delays caused by network fluctuations.
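The buffer-plus-worker-thread pattern recommended above can be sketched with the stdlib as follows (writing to a file stands in for playback; the file name is illustrative):

```python
import queue
import threading

audio_buffer = queue.Queue()
SENTINEL = None  # end-of-stream marker

def on_binary_message(chunk):
    """WebSocket callback: only enqueue the chunk; never do slow work here."""
    audio_buffer.put(chunk)

def writer(out_path):
    """Runs on its own thread: drains the buffer and does the slow I/O."""
    with open(out_path, "wb") as f:
        while True:
            chunk = audio_buffer.get()
            if chunk is SENTINEL:
                break
            f.write(chunk)

t = threading.Thread(target=writer, args=("buffered_output.mp3",), daemon=True)
t.start()

# Simulated callbacks; in practice the WebSocket thread calls on_binary_message
for part in (b"\x01\x02", b"\x03\x04"):
    on_binary_message(part)
audio_buffer.put(SENTINEL)  # after task-finished, signal end of stream
t.join()
```

Because the callback only enqueues, the WebSocket thread stays free to receive the next packet immediately.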

Q: Why does speech synthesis take a long time?

Follow these steps to troubleshoot:

  1. Check input interval

    Check the input interval. If you are using streaming speech synthesis, verify whether the interval between sending text segments is too long (for example, a delay of several seconds). A long interval increases the total synthesis time.

  2. Analyze performance metrics.

    • First-packet latency: Normally around 500 ms.

    • RTF (RTF = Total synthesis time / Audio duration): Normally less than 1.0.
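Both metrics can be measured client-side. The sketch below uses a hypothetical stand-in for a real synthesis call; only the timing logic is the point:

```python
import time

def measure(run_synthesis, audio_duration_s):
    """Measure first-packet latency (ms) and RTF for one synthesis call.

    run_synthesis must call on_chunk once per received audio chunk.
    """
    start = time.monotonic()
    first = []

    def on_chunk():
        if not first:
            first.append(time.monotonic())  # record the first packet only

    run_synthesis(on_chunk)
    total = time.monotonic() - start
    latency_ms = (first[0] - start) * 1000 if first else None
    rtf = total / audio_duration_s  # RTF = total synthesis time / audio duration
    return latency_ms, rtf

# Hypothetical task: two chunks arriving over ~50 ms, yielding 5 s of audio
def fake_task(on_chunk):
    time.sleep(0.03)
    on_chunk()
    time.sleep(0.02)
    on_chunk()

latency_ms, rtf = measure(fake_task, audio_duration_s=5.0)
```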

Q: How do I handle incorrect pronunciation in the synthesized speech?

Use the <phoneme> tag of SSML to specify the correct pronunciation.

Q: Why is no audio returned? Why is part of the ending text missing from the audio? (Missing audio)

Verify that the finish-task instruction was sent. During synthesis, the server waits until enough text is buffered before processing. Without finish-task, buffered text at the end may never be converted to audio.

Q: Why is the audio stream order scrambled, causing garbled playback?

Troubleshoot in two areas:

  • Ensure that the run-task, continue-task, and finish-task instructions for one synthesis task all use the same task_id.

  • Check whether asynchronous operations cause audio files to be written in a different order than binary data is received.

Q: How do I handle WebSocket connection errors?

  • How do you handle WebSocket connection closure (code 1007)?

    Symptom: The WebSocket connection closes immediately after sending the run-task instruction, with close code 1007.

    • Root cause: The server detects protocol or data format errors and disconnects. Common reasons include the following:

      • Invalid fields in the run-task payload, such as adding fields besides "input": {}.

      • JSON format errors, such as missing commas or mismatched brackets.

      • Missing required fields, such as task_id or action.

    • Solution:

      1. Validate JSON format: Check the request body syntax.

      2. Verify required fields: Confirm that header.action, header.task_id, header.streaming, payload.task_group, payload.task, payload.function, payload.model, and payload.input are all set.

      3. Remove invalid fields: In the run-task payload.input, allow only an empty object {} or a text field. Do not add other fields.

  • How do you handle WebSocketBadStatus, 401 Unauthorized, or 403 Forbidden errors?

    Symptom: The WebSocket connection fails with WebSocketBadStatus, 401 Unauthorized, or 403 Forbidden.

    • Root cause: Authentication failure. The server validates the Authorization header during the WebSocket handshake. An invalid or missing API key triggers rejection.

    • Solution: See Authentication failure troubleshooting.
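As a pre-flight check for code-1007 closures, the required run-task fields can be validated client-side before sending. This is a sketch: the field list is taken from the checklist above, and the sample values are illustrative:

```python
import json

# Required fields as (section, field) pairs
REQUIRED = [
    ("header", "action"), ("header", "task_id"), ("header", "streaming"),
    ("payload", "task_group"), ("payload", "task"),
    ("payload", "function"), ("payload", "model"), ("payload", "input"),
]

def validate_run_task(message):
    """Return a list of problems with a run-task instruction (empty if OK)."""
    try:
        msg = json.loads(message)
    except json.JSONDecodeError as e:
        return [f"invalid JSON: {e}"]
    return [f"missing {section}.{field}" for section, field in REQUIRED
            if field not in msg.get(section, {})]

run_task = json.dumps({
    "header": {"action": "run-task", "task_id": "t-1", "streaming": "duplex"},
    "payload": {"task_group": "audio", "task": "tts",
                "function": "SpeechSynthesizer", "model": "cosyvoice-v3-flash",
                "parameters": {}, "input": {}},
})
print(validate_run_task(run_task))  # []
```

Logging the validated instruction before sending also makes the server-side 1007 diagnosis much faster.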

Permissions and authentication

Q: How can I restrict my API key to the CosyVoice speech synthesis service only (permission isolation)?

Create a workspace and grant authorization only to specific models to limit the API key scope. For more information, see Manage workspaces.

More questions

See the QA on GitHub.