CosyVoice WebSocket client events - Alibaba Cloud Model Studio

User guide: For model introduction and selection recommendations, see Speech synthesis.

run-task

Description: Starts a speech synthesis task and configures the model, voice, sample rate, and other parameters.

When to send: Immediately after the WebSocket connection is established.

Response event: The server returns a task-started event. Wait for this event before sending subsequent commands.

header object (required)

Properties

action string (required)

The command type. Set to run-task.

task_id string (required)

A client-generated task ID in UUID format. This ID correlates subsequent events and must match the task_id in the continue-task and finish-task commands.

streaming string (required)

Set to duplex.

{
    "header": {
        "action": "run-task",
        "task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
        "streaming": "duplex"
    },
    "payload": {
        "task_group": "audio",
        "task": "tts",
        "function": "SpeechSynthesizer",
        "model": "cosyvoice-v3-flash",
        "parameters": {
            "text_type": "PlainText",
            "voice": "longanyang",
            "format": "mp3",
            "sample_rate": 22050,
            "volume": 50,
            "rate": 1.0,
            "pitch": 1.0,
            "enable_ssml": false
        },
        "input": {}
    }
}
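
The command above can also be assembled in code. The following is a minimal Python sketch that uses only the standard library. build_run_task is a hypothetical helper name chosen for illustration, not part of any SDK. It generates the UUID-format task_id on the client and returns it so that later commands can reuse it.

```python
import json
import uuid


def build_run_task(model: str, voice: str, **parameters) -> tuple[str, str]:
    """Return (task_id, JSON text) for a run-task command.

    Hypothetical helper for illustration; not an SDK function.
    """
    task_id = str(uuid.uuid4())  # client-generated UUID, reused by later commands
    message = {
        "header": {
            "action": "run-task",
            "task_id": task_id,
            "streaming": "duplex",
        },
        "payload": {
            "task_group": "audio",
            "task": "tts",
            "function": "SpeechSynthesizer",
            "model": model,
            # text_type and voice are required; everything else is optional
            "parameters": {"text_type": "PlainText", "voice": voice, **parameters},
            "input": {},  # the text itself is sent later through continue-task
        },
    }
    return task_id, json.dumps(message)


task_id, run_task_json = build_run_task(
    "cosyvoice-v3-flash", "longanyang", format="mp3", sample_rate=22050
)
```

Keep the returned task_id: the continue-task and finish-task commands must carry the same value.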

payload object (required)

Properties

task_group string (required)

The task group. Set to audio.

task string (required)

The task type. Set to tts.

function string (required)

The function type. Set to SpeechSynthesizer.

model string (required)

The model name.

input object (required)

Set to an empty object {}. Send the text to synthesize through the continue-task command.

parameters object (required)

Speech synthesis parameters.

Properties

text_type string (required)

Set to PlainText.

voice string (required)

The voice used for speech synthesis.

System voices: See Voice list
Cloned voices: Custom voices created through voice cloning
Designed voices: Custom voices created through voice design

format string (optional)

The audio encoding format.

Valid values:

pcm
wav
mp3 (default)
opus

sample_rate integer (optional)

The audio sample rate in Hz.

Valid values: 8000, 16000, 22050 (default), 24000, 44100, 48000.

volume integer (optional)

The volume level.

Default value: 50.

Valid values: [0, 100].

rate float (optional)

The speech rate.

Default value: 1.0.

Valid values: [0.5, 2.0].

pitch float (optional)

The pitch.

Default value: 1.0.

Valid values: [0.5, 2.0].

bit_rate integer (optional)

The audio bit rate in kbps. Use this parameter to adjust the bit rate when format is opus.

Default value: 32.

Valid values: [6, 510].
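
The documented values and ranges for format, sample_rate, volume, rate, pitch, and bit_rate can be checked on the client before run-task is sent. The sketch below is one possible approach. check_audio_parameters is a hypothetical name, and how the server handles out-of-range values is not described here, so the check is purely a client-side convenience.

```python
VALID_FORMATS = {"pcm", "wav", "mp3", "opus"}
VALID_SAMPLE_RATES = {8000, 16000, 22050, 24000, 44100, 48000}


def check_audio_parameters(params: dict) -> list[str]:
    """Return a list of problems found in params; an empty list means no problem.

    Missing keys fall back to the documented defaults, which always pass.
    """
    problems = []
    if params.get("format", "mp3") not in VALID_FORMATS:
        problems.append("format must be one of pcm, wav, mp3, opus")
    if params.get("sample_rate", 22050) not in VALID_SAMPLE_RATES:
        problems.append("sample_rate is not a supported value")
    if not 0 <= params.get("volume", 50) <= 100:
        problems.append("volume must be in [0, 100]")
    for name in ("rate", "pitch"):
        if not 0.5 <= params.get(name, 1.0) <= 2.0:
            problems.append(f"{name} must be in [0.5, 2.0]")
    if not 6 <= params.get("bit_rate", 32) <= 510:
        problems.append("bit_rate must be in [6, 510]")
    return problems
```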

enable_ssml boolean (optional)

Specifies whether to enable SSML.

Default value: false.

When set to true, only one continue-task command is allowed.

word_timestamp_enabled boolean (optional)

Specifies whether to enable word-level timestamps.

Default value: false.

This feature is supported for cloned voices of cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2, and for system voices marked as supported in Voice list.

seed integer (optional)

A random seed for controlling variation in the synthesis output. When the model version, text, voice, and other parameters are unchanged, using the same seed produces identical results.

Default value: 0.

Valid values: [0, 65535].

language_hints array[string] (optional)

Important

This parameter is an array, but the current version only processes the first element. Pass a single value.
This parameter specifies the target language for speech synthesis. It's unrelated to the language of the audio sample used in voice cloning. To set the source language for a cloning task, see the voice cloning API reference.

Specifies the target language for speech synthesis to improve output quality.

Use this parameter when digit pronunciation, abbreviation expansion, symbol reading, or minor-language synthesis doesn't meet expectations. For example:

Unexpected digit pronunciation: "hello, this is 110" is read as "hello, this is one one zero" instead of the expected Chinese pronunciation
Inaccurate symbol pronunciation: "@" is read as the Chinese equivalent instead of "at"
Unnatural-sounding synthesis in minor languages

Valid values:

zh: Chinese
en: English
fr: French
de: German
ja: Japanese
ko: Korean
ru: Russian
pt: Portuguese
th: Thai
id: Indonesian
vi: Vietnamese
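
Because only the first array element is processed, a small wrapper can make the single-value convention explicit. This is a sketch only: language_hints_for is a hypothetical helper, and the set of codes is copied from the list above.

```python
SUPPORTED_LANGUAGE_HINTS = {
    "zh", "en", "fr", "de", "ja", "ko", "ru", "pt", "th", "id", "vi",
}


def language_hints_for(code: str) -> list[str]:
    """Wrap one language code in the one-element array that language_hints expects."""
    if code not in SUPPORTED_LANGUAGE_HINTS:
        raise ValueError(f"unsupported language hint: {code!r}")
    return [code]


parameters = {
    "text_type": "PlainText",
    "voice": "longanyang",
    "language_hints": language_hints_for("en"),
}
```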

instruction string (optional)

Sets an instruction to control dialect, emotion, or voice character during synthesis. This feature is only available for cloned voices of cosyvoice-v3.5-flash, cosyvoice-v3.5-plus, and cosyvoice-v3-flash, as well as system voices marked as supporting Instruct in Voice list.

Length limit: 100 characters.

Chinese characters (including simplified and traditional Chinese, Japanese kanji, and Korean hanja) count as 2 characters. All other characters (such as punctuation, letters, digits, Japanese kana, and Korean hangul) count as 1 character.
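
The counting rule above can be approximated on the client before the instruction is sent. In the sketch below, the Unicode ranges treated as Chinese characters are an assumption (the common CJK ideograph blocks); the exact set of code points that the service counts as 2 is not specified here. instruction_length and is_han are hypothetical names.

```python
def is_han(ch: str) -> bool:
    """Return True for CJK ideographs (hanzi, kanji, hanja).

    ASSUMPTION: these blocks approximate what the service counts as 2 characters.
    """
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF        # CJK Unified Ideographs
        or 0x3400 <= cp <= 0x4DBF     # CJK Unified Ideographs Extension A
        or 0xF900 <= cp <= 0xFAFF     # CJK Compatibility Ideographs
        or 0x20000 <= cp <= 0x2FA1F   # supplementary-plane ideographs
    )


def instruction_length(text: str) -> int:
    """Count ideographs as 2 and every other character as 1."""
    return sum(2 if is_han(ch) else 1 for ch in text)


def fits_instruction_limit(text: str, limit: int = 100) -> bool:
    """Check the text against the documented 100-character limit."""
    return instruction_length(text) <= limit
```

For example, instruction_length("请用粤语说") evaluates to 10 because each of the five ideographs counts as 2, while kana, hangul, letters, and punctuation count as 1 each.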

Usage requirements (vary by model):

cosyvoice-v3.5-flash and cosyvoice-v3.5-plus: Accept any instruction to control synthesis effects (such as emotion or speech rate).

Important

cosyvoice-v3.5-flash and cosyvoice-v3.5-plus don't have system voices. Only designed or cloned voices are supported.

Instruction examples:

Speak in a very excited and high-pitched tone, expressing the ecstasy and excitement of a great success.
Please maintain a medium-slow speech rate, with an elegant and intellectual tone, giving a sense of calm and reassurance.
The tone should be full of sorrow and nostalgia, with a slight nasal quality, as if narrating a heartbreaking story.
Please try to speak in a breathy voice, with a very low volume, creating a sense of intimate and mysterious whispering.
The tone should be very impatient and annoyed, with a faster speech rate and minimal pauses between sentences.
Please imitate a kind and gentle elder, with a steady speech rate and a voice full of care and affection.
The tone should be sarcastic and disdainful, with emphasis on keywords and a slightly rising intonation at the end of sentences.
Please speak in an extremely fearful and trembling voice.
The tone should be like a professional news anchor: calm, objective, and articulate, with a neutral emotion.
The tone should be lively and playful, with a clear smile, making the voice sound energetic and sunny.

cosyvoice-v3-flash: Must follow the requirements below.

Cloned voices: Accept any natural-language instruction to control synthesis effects.

Instruction examples:

Please speak in Cantonese. (Supported dialects: Cantonese, Northeastern, Gansu, Guizhou, Henan, Hubei, Jiangxi, Minnan, Ningxia, Shanxi, Shaanxi, Shandong, Shanghainese, Sichuan, Tianjin, Yunnan.)
Please say a sentence as loudly as possible.
Please say a sentence as slowly as possible.
Please say a sentence as quickly as possible.
Please say a sentence very softly.
Can you speak a little slower?
Can you speak very quickly?
Can you speak very slowly?
Can you speak a little faster?
Please say a sentence very angrily.
Please say a sentence very happily.
Please say a sentence very fearfully.
Please say a sentence very sadly.
Please say a sentence very surprisedly.
Please try to sound as firm as possible.
Please try to sound as angry as possible.
Please try an approachable tone.
Please speak in a cold tone.
Please speak in a majestic tone.
I want to experience a natural tone.
I want to see how you express a threat.
I want to see how you express wisdom.
I want to see how you express seduction.
I want to hear you speak in a lively way.
I want to hear you speak with passion.
I want to hear you speak in a steady manner.
I want to hear you speak with confidence.
Can you talk to me with excitement?
Can you show an arrogant emotion?
Can you show an elegant emotion?
Can you answer the question happily?
Can you give a gentle emotional demonstration?
Can you talk to me in a calm tone?
Can you answer me in a deep way?
Can you talk to me with a gruff attitude?
Tell me the answer in a sinister voice.
Tell me the answer in a resilient voice.
Narrate in a natural and friendly chat style.
Speak in the tone of a radio drama podcaster.

System voices: Instructions must follow a fixed format. For details, see Voice list.

enable_aigc_tag boolean (optional)

Specifies whether to embed an AIGC watermark in the generated audio. When set to true, the watermark is embedded in audio files of supported formats (wav, mp3, and opus).

Default value: false.

Only cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 support this feature.

aigc_propagator string (optional)

Sets the ContentPropagator field in the AIGC watermark, identifying the content propagator. Takes effect only when enable_aigc_tag is true.

Default value: Alibaba Cloud UID.

Only cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 support this feature.

aigc_propagate_id string (optional)

Sets the PropagateID field in the AIGC watermark, uniquely identifying a specific propagation action. Takes effect only when enable_aigc_tag is true.

Default value: The request ID of the current speech synthesis request.

Only cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 support this feature.

hot_fix object (optional)

Configures pronunciation corrections and text replacements applied before synthesis. Only cloned voices of cosyvoice-v3-flash support this feature.

Parameters:

pronunciation: Custom pronunciation. Specifies pinyin annotations for words to correct inaccurate default pronunciations.
replace: Text replacement. Replaces specified words with target text before synthesis. The replaced text is used as the actual synthesis input.

Example:

"hot_fix": {
  "pronunciation": [
    {"weather": "tian1 qi4"}
  ],
  "replace": [
    {"today": "gold day"}
  ]
}

enable_markdown_filter boolean (optional)

Important

Only cloned voices of cosyvoice-v3-flash support this feature.

Specifies whether to enable Markdown filtering. When enabled, the system automatically strips Markdown markup symbols from the input text before synthesis, preventing them from being read aloud.

Default value: false.

Valid values:

true: Enable Markdown filtering
false: Disable Markdown filtering

continue-task

Description: Sends the text to synthesize. The text can be sent all at once or in multiple segments.

When to send: After receiving the task-started event from the server.

Limits:

Maximum of 20,000 characters per message
Maximum of 200,000 characters cumulatively
The interval between consecutive continue-task commands must not exceed 23 seconds; otherwise, the connection times out.
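
Long input has to be divided so that each message stays within the per-message limit. A minimal sketch follows, assuming that a character corresponds to one Python string element (one Unicode code point); how the service counts characters for these limits is not stated here. split_text is a hypothetical helper.

```python
MAX_CHARS_PER_MESSAGE = 20_000
MAX_CHARS_TOTAL = 200_000


def split_text(text: str, limit: int = MAX_CHARS_PER_MESSAGE) -> list[str]:
    """Split text into consecutive segments of at most `limit` characters.

    ASSUMPTION: len() matches the service's notion of a character.
    """
    if len(text) > MAX_CHARS_TOTAL:
        raise ValueError("text exceeds the cumulative 200,000-character limit")
    return [text[i:i + limit] for i in range(0, len(text), limit)]
```

This fixed-width split can cut a sentence in the middle. Splitting at sentence or paragraph boundaries is likely a better choice in practice, as long as each segment still respects the limit.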

header object (required)

Properties

action string (required)

The command type. Set to continue-task.

task_id string (required)

The task ID in UUID format. Must match the task_id in run-task.

streaming string (required)

Set to duplex.

{
    "header": {
        "action": "continue-task",
        "task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
        "streaming": "duplex"
    },
    "payload": {
        "input": {
            "text": "Before my bed, moonlight shines bright, I suspect it's frost upon the ground."
        }
    }
}

payload object (required)

Properties

input object (required)

Contains the text to synthesize.

text string (required)

The text to synthesize. Maximum of 20,000 characters per message and 200,000 characters cumulatively.
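
A continue-task command differs from run-task only in its action and payload. The sketch below builds one message per text segment; build_continue_task is a hypothetical helper, and the length check assumes that len() matches how the service counts characters.

```python
import json


def build_continue_task(task_id: str, text: str) -> str:
    """Return the JSON text of a continue-task command for one text segment."""
    if len(text) > 20_000:
        raise ValueError("a single message may carry at most 20,000 characters")
    message = {
        "header": {
            "action": "continue-task",
            "task_id": task_id,  # must match the task_id sent in run-task
            "streaming": "duplex",
        },
        "payload": {"input": {"text": text}},
    }
    # ensure_ascii=False keeps non-ASCII text readable in the serialized JSON
    return json.dumps(message, ensure_ascii=False)
```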

finish-task

Description: Notifies the server that all text has been sent and requests task completion.

When to send: Immediately after all text has been sent.

Response event: The server returns a task-finished event.

header object (required)

Properties

action string (required)

The command type. Set to finish-task.

task_id string (required)

The task ID in UUID format. Must match the task_id in run-task.

streaming string (required)

Set to duplex.

{
    "header": {
        "action": "finish-task",
        "task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
        "streaming": "duplex"
    },
    "payload": {
        "input": {}
    }
}

payload object (required)

Properties

input object (required)

Set to {}.
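
The three commands are sent in a fixed order: run-task, then one or more continue-task commands after task-started arrives, then finish-task. The sketch below shows that ordering without depending on a particular WebSocket library: send and recv are caller-supplied callables. Two details are assumptions, because server events are not described on this page: that the event name appears in header.event of a JSON text frame, and that binary frames carry audio. synthesize is a hypothetical function name.

```python
import json
import uuid
from typing import Callable, Iterable, Union

Frame = Union[str, bytes]


def _command(action: str, task_id: str, payload: dict) -> str:
    header = {"action": action, "task_id": task_id, "streaming": "duplex"}
    return json.dumps({"header": header, "payload": payload}, ensure_ascii=False)


def _wait_for(recv: Callable[[], Frame], event: str,
              on_audio: Callable[[bytes], None]) -> None:
    """Read frames until a text frame reports the given event."""
    while True:
        frame = recv()
        if isinstance(frame, bytes):
            on_audio(frame)  # ASSUMPTION: binary frames carry audio data
            continue
        # ASSUMPTION: the event name is found in header.event
        if json.loads(frame).get("header", {}).get("event") == event:
            return


def synthesize(send: Callable[[str], None], recv: Callable[[], Frame],
               run_task_payload: dict, segments: Iterable[str],
               on_audio: Callable[[bytes], None]) -> str:
    """Run one task in the documented order and return its task_id."""
    task_id = str(uuid.uuid4())
    send(_command("run-task", task_id, run_task_payload))
    _wait_for(recv, "task-started", on_audio)   # wait before sending text
    for text in segments:
        send(_command("continue-task", task_id, {"input": {"text": text}}))
    send(_command("finish-task", task_id, {"input": {}}))
    _wait_for(recv, "task-finished", on_audio)  # collect remaining audio
    return task_id
```

This version sends all segments before it reads again, which keeps the example short. A production client would more likely read frames concurrently with sending, handle error events and closed connections, and keep the interval between continue-task commands under 23 seconds.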