Alibaba Cloud Model Studio: CosyVoice speech synthesis Python SDK

Last Updated: Jan 16, 2026

This topic describes the parameters and interface details of the CosyVoice Python SDK for speech synthesis.

Important

To use a model in the China (Beijing) region, go to the API key page for the China (Beijing) region.

User guide: For more information about the models and guidance on model selection, see Real-time speech synthesis - CosyVoice/Sambert.

Prerequisites

  • You have activated Alibaba Cloud Model Studio and created an API key. To prevent security risks, export the API key as an environment variable instead of hard-coding it in your code (see the configuration sketch after this list).

    Note

    To grant temporary access permissions to third-party applications or users, or if you want to strictly control high-risk operations such as accessing or deleting sensitive data, we recommend that you use a temporary authentication token.

    Compared with long-term API keys, temporary authentication tokens are more secure because they are short-lived (60 seconds). They are suitable for temporary call scenarios and can effectively reduce the risk of API key leakage.

    To use a temporary token, replace the API key used for authentication in your code with the temporary authentication token.

  • Install the latest version of the DashScope SDK.
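
The following is a minimal configuration sketch. It assumes the API key is stored in the DASHSCOPE_API_KEY environment variable, which the DashScope SDK also reads automatically:

# coding=utf-8
import os

import dashscope

# Read the API key from an environment variable instead of hard-coding it.
# DASHSCOPE_API_KEY is the variable name assumed here; the DashScope SDK also
# picks it up automatically when dashscope.api_key is not set explicitly.
api_key = os.environ.get("DASHSCOPE_API_KEY")
if not api_key:
    raise RuntimeError("Set the DASHSCOPE_API_KEY environment variable first.")
dashscope.api_key = api_key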

Models and pricing

Model | Price
cosyvoice-v3-plus | $0.286706 per 10,000 characters
cosyvoice-v3-flash | $0.14335 per 10,000 characters
cosyvoice-v2 | $0.286706 per 10,000 characters

Free quota (Note): No free quota.

Text and format limitations

Text length limits

  • Non-streaming and unidirectional streaming calls (the call method): The text cannot exceed 2,000 characters.

  • Bidirectional streaming calls (the streaming_call method): Each text segment cannot exceed 2,000 characters, and the cumulative length of all segments cannot exceed 200,000 characters.

Character counting rules

  • A Chinese character, which includes simplified or traditional Chinese, Japanese kanji, and Korean hanja, is counted as 2 characters. All other characters, such as punctuation marks, letters, numbers, and Japanese or Korean kana or hangul, are counted as 1 character.

  • SSML tags are not included in the text length calculation.

  • Examples:

    • "你好" → 2(你) + 2(好) = 4 characters

    • "中A文123" → 2(中) + 1(A) + 2(文) + 1(1) + 1(2) + 1(3) = 8 characters

    • "中文。" → 2(中) + 2(文) + 1(。) = 5 characters

    • "中 文。" → 2(中) + 1(space) + 2(文) + 1(。) = 6 characters

    • "<speak>你好</speak>" → 2(你) + 2(好) = 4 characters

Encoding format

Use UTF-8 encoding.

Support for mathematical expressions

The mathematical expression parsing feature is currently available only for the cosyvoice-v2, cosyvoice-v3-flash, and cosyvoice-v3-plus models. This feature supports common mathematical expressions from primary and secondary school, such as basic arithmetic, algebra, and geometry.

For more information, see LaTeX Formula to Speech.

SSML support

The Speech Synthesis Markup Language (SSML) feature is currently available only for cloned voices of the cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 models, and for system voices that are indicated as supported in the voice list.
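
The following is a minimal usage sketch. The voice name is a placeholder; replace it with a cloned voice or a system voice that supports SSML. Note that SSML text can be passed only to the call method:

# coding=utf-8
import dashscope
from dashscope.audio.tts_v2 import SpeechSynthesizer

# If you have not configured the API key as an environment variable, replace "your-api-key" with your API key.
# dashscope.api_key = "your-api-key"

# Placeholder voice: use a cloned voice or an SSML-capable system voice from the voice list.
synthesizer = SpeechSynthesizer(model="cosyvoice-v2", voice="your-ssml-capable-voice")

# SSML tags are not counted toward the billable text length.
audio = synthesizer.call("<speak>你好</speak>")
with open("output.mp3", "wb") as f:
    f.write(audio)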

Getting started

The SpeechSynthesizer class is the primary class for speech synthesis and supports the following invocation methods:

  • Non-streaming call: A blocking call that sends the complete text at once and returns the complete audio directly. This method is suitable for short text synthesis scenarios.

  • Unidirectional streaming call: A non-blocking call that sends the complete text at once and uses a callback function to receive audio data, which may be delivered in chunks. This method is suitable for short text synthesis scenarios that require low latency.

  • Bidirectional streaming call: A non-blocking call that sends text in fragments and uses a callback function to receive the synthesized audio stream incrementally in real time. This method is suitable for long text synthesis scenarios that require low latency.

Non-streaming call

This method submits a single speech synthesis task without using a callback function. The synthesis does not stream intermediate results. Instead, the complete result is returned at once.


You can instantiate the SpeechSynthesizer class, attach the request parameters, and call the call method to synthesize the text and retrieve the binary audio data.

The text that you send cannot be longer than 2,000 characters. For more information, see the call method of the SpeechSynthesizer class.

Important

Before each call to the call method, you must create a new SpeechSynthesizer instance.

Complete sample:

# coding=utf-8

import dashscope
from dashscope.audio.tts_v2 import *

# If you have not configured the API key as an environment variable, replace "your-api-key" with your API key.
# dashscope.api_key = "your-api-key"

# Model
model = "cosyvoice-v3-flash"
# Voice
voice = "longanyang"

# Instantiate SpeechSynthesizer and pass request parameters such as model and voice in the constructor.
synthesizer = SpeechSynthesizer(model=model, voice=voice)
# Send the text to be synthesized and get the binary audio data.
audio = synthesizer.call("What is the weather like today?")
# The first time you send text, a WebSocket connection must be established. Therefore, the first-packet latency includes the time taken for connection establishment.
print('[Metric] Request ID: {}, First-packet latency: {} ms'.format(
    synthesizer.get_last_request_id(),
    synthesizer.get_first_package_delay()))

# Save the audio to a local file.
with open('output.mp3', 'wb') as f:
    f.write(audio)

Unidirectional streaming call

This method submits a single speech synthesis task and streams the synthesized audio back to the client in real time through the ResultCallback callback interface.


You can instantiate the SpeechSynthesizer class, attach the request parameters and the ResultCallback interface, and call the call method to perform the synthesis. You can then retrieve the real-time synthesis results through the on_data method of the ResultCallback interface.

The length of the text to send cannot exceed 2,000 characters. For more information, see the call method of the SpeechSynthesizer class.

Important

Before each call to the call method, you must create a new SpeechSynthesizer instance.

Complete sample:

# coding=utf-8

import dashscope
from dashscope.audio.tts_v2 import *

from datetime import datetime

def get_timestamp():
    now = datetime.now()
    formatted_timestamp = now.strftime("[%Y-%m-%d %H:%M:%S.%f]")
    return formatted_timestamp

# If you have not configured the API key as an environment variable, replace "your-api-key" with your API key.
# dashscope.api_key = "your-api-key"

# Model
model = "cosyvoice-v3-flash"
# Voice
voice = "longanyang"


# Define the callback interface.
class Callback(ResultCallback):
    _player = None
    _stream = None

    def on_open(self):
        self.file = open("output.mp3", "wb")
        print("Connection established: " + get_timestamp())

    def on_complete(self):
        print("Speech synthesis is complete. All results have been received: " + get_timestamp())
        # Call get_first_package_delay to get the first-packet latency only after the task is completed (the on_complete callback is triggered).
        # The first time you send text, a WebSocket connection must be established. Therefore, the first-packet latency includes the time required to establish the connection.
        print('[Metric] Request ID: {}, First-packet latency: {} ms'.format(
            synthesizer.get_last_request_id(),
            synthesizer.get_first_package_delay()))

    def on_error(self, message: str):
        print(f"An error occurred during speech synthesis: {message}")

    def on_close(self):
        print("Connection closed: " + get_timestamp())
        self.file.close()

    def on_event(self, message):
        pass

    def on_data(self, data: bytes) -> None:
        print(get_timestamp() + " Binary audio data length: " + str(len(data)))
        self.file.write(data)


callback = Callback()

# Instantiate SpeechSynthesizer and pass request parameters, such as model and voice, in the constructor.
synthesizer = SpeechSynthesizer(
    model=model,
    voice=voice,
    callback=callback,
)

# Send the text for synthesis and receive the binary audio data in real time in the on_data method of the callback.
synthesizer.call("What is the weather like today?")

Bidirectional streaming call

This method lets you submit text in multiple parts within a single speech synthesis task and receive the synthesis results in real time through a callback.

Note
  • To stream input, call the streaming_call method multiple times to submit text fragments in order. The server automatically segments the text fragments into sentences after it receives them:

    • Complete sentences are synthesized immediately.

    • Incomplete sentences are buffered and synthesized after they are complete.

    When you call the streaming_complete method, the server synthesizes all received but unprocessed text fragments, including incomplete sentences.

  • The interval between sending text fragments cannot exceed 23 seconds. Otherwise, a "request timeout after 23 seconds" exception occurs.

    If you have no more text to send, you must call the streaming_complete method to end the task.

    The server enforces a 23-second timeout. The client cannot modify this configuration.
  1. Instantiate the SpeechSynthesizer class

    Instantiate the SpeechSynthesizer class and attach the request parameters and the ResultCallback callback interface.

  2. Stream the text

    Stream the text by calling the streaming_call method of the SpeechSynthesizer class multiple times. This sends the text to be synthesized to the server in segments.

    While you send text, the server uses the on_data method of the ResultCallback interface to return the synthesized result to the client in real time.

    The length of the text segment (the text parameter) sent in each call to the streaming_call method cannot exceed 2,000 characters. The cumulative length of all text that you send cannot exceed 200,000 characters.

  3. Complete the task

    Call the streaming_complete method of the SpeechSynthesizer class to end the speech synthesis task.

    This method blocks the current thread until the on_complete or on_error method of the ResultCallback interface is triggered.

    You must call this method. Otherwise, the end of the text may not be successfully synthesized.

Complete sample:

# coding=utf-8
#
# pyaudio installation instructions:
# For macOS, run the following commands:
#   brew install portaudio
#   pip install pyaudio
# For Debian/Ubuntu, run the following commands:
#   sudo apt-get install python-pyaudio python3-pyaudio
#   or
#   pip install pyaudio
# For CentOS, run the following commands:
#   sudo yum install -y portaudio portaudio-devel && pip install pyaudio
# For Microsoft Windows, run the following command:
#   python -m pip install pyaudio

import time
import pyaudio
import dashscope
from dashscope.api_entities.dashscope_response import SpeechSynthesisResponse
from dashscope.audio.tts_v2 import *

from datetime import datetime

def get_timestamp():
    now = datetime.now()
    formatted_timestamp = now.strftime("[%Y-%m-%d %H:%M:%S.%f]")
    return formatted_timestamp

# If you have not configured the API key as an environment variable, replace "your-api-key" with your API key.
# dashscope.api_key = "your-api-key"

# Model
model = "cosyvoice-v3-flash"
# Voice
voice = "longanyang"


# Define the callback interface.
class Callback(ResultCallback):
    _player = None
    _stream = None

    def on_open(self):
        print("Connection established: " + get_timestamp())
        self._player = pyaudio.PyAudio()
        self._stream = self._player.open(
            format=pyaudio.paInt16, channels=1, rate=22050, output=True
        )

    def on_complete(self):
        print("Speech synthesis completed. All results have been received: " + get_timestamp())

    def on_error(self, message: str):
        print(f"An error occurred during speech synthesis: {message}")

    def on_close(self):
        print("Connection closed: " + get_timestamp())
        # Stop the player.
        self._stream.stop_stream()
        self._stream.close()
        self._player.terminate()

    def on_event(self, message):
        pass

    def on_data(self, data: bytes) -> None:
        print(get_timestamp() + " Binary audio data length: " + str(len(data)))
        self._stream.write(data)


callback = Callback()

test_text = [
    "The streaming text-to-speech SDK ",
    "can convert input text ",
    "into binary speech data. ",
    "Compared to non-streaming speech synthesis, ",
    "the advantage of streaming synthesis is its lower latency. ",
    "Users can hear nearly synchronous speech output ",
    "while inputting text, ",
    "which greatly improves the interactive experience ",
    "and reduces user waiting time. ",
    "It is suitable for scenarios that call a large ",
    "language model (LLM) to perform ",
    "speech synthesis with streaming text input.",
]

# Instantiate SpeechSynthesizer and pass request parameters such as model and voice in the constructor.
synthesizer = SpeechSynthesizer(
    model=model,
    voice=voice,
    format=AudioFormat.PCM_22050HZ_MONO_16BIT,  
    callback=callback,
)


# Stream the text to be synthesized. Receive the binary audio data in real time in the on_data method of the callback interface.
for text in test_text:
    synthesizer.streaming_call(text)
    time.sleep(0.1)
# End the streaming speech synthesis.
synthesizer.streaming_complete()

# The first time you send text, a WebSocket connection must be established. Therefore, the first-packet latency includes the time taken for connection establishment.
print('[Metric] Request ID: {}, First-packet latency: {} ms'.format(
    synthesizer.get_last_request_id(),
    synthesizer.get_first_package_delay()))

Request parameters

You can set the request parameters in the constructor of the SpeechSynthesizer class.

Parameter

Type

Required

Description

model

str

Yes

The speech synthesis model.

Different models require corresponding voices:

  • cosyvoice-v3-flash/cosyvoice-v3-plus: Use voices such as longanyang.

  • cosyvoice-v2: Use voices such as longxiaochun_v2.

  • For a complete list, see Voice list.

voice

str

Yes

The voice to use for speech synthesis.

System voices and cloned voices are supported:

  • System voices: See Voice list.

  • Cloned voices: Customize voices using the voice cloning feature. When you use a cloned voice, make sure that the same account is used for both voice cloning and speech synthesis. For detailed steps, see CosyVoice voice cloning API.

    When you use a cloned voice, the value of the model parameter in the request must be exactly the same as the model version used to create the voice (the target_model parameter).

format

enum

No

Specifies the audio coding format and sample rate.

If format is not specified, the synthesized audio has a sample rate of 22.05 kHz and is in the MP3 format.

Note

The default sample rate is the optimal rate for the current voice, and the output uses this rate unless you specify otherwise. Downsampling and upsampling are also supported.

The following audio coding formats and sample rates can be specified:

  • Audio coding formats and sample rates supported by all models:

    • AudioFormat.WAV_8000HZ_MONO_16BIT: The audio format is WAV and the sample rate is 8 kHz.

    • AudioFormat.WAV_16000HZ_MONO_16BIT: The audio format is WAV and the sample rate is 16 kHz.

    • AudioFormat.WAV_22050HZ_MONO_16BIT: The audio format is WAV and the sample rate is 22.05 kHz.

    • AudioFormat.WAV_24000HZ_MONO_16BIT: The audio format is WAV and the sample rate is 24 kHz.

    • AudioFormat.WAV_44100HZ_MONO_16BIT: The audio format is WAV and the sample rate is 44.1 kHz.

    • AudioFormat.WAV_48000HZ_MONO_16BIT: The audio format is WAV and the sample rate is 48 kHz.

    • AudioFormat.MP3_8000HZ_MONO_128KBPS: The audio format is MP3 and the sample rate is 8 kHz.

    • AudioFormat.MP3_16000HZ_MONO_128KBPS: The audio format is MP3 and the sample rate is 16 kHz.

    • AudioFormat.MP3_22050HZ_MONO_256KBPS: The audio format is MP3 and the sample rate is 22.05 kHz.

    • AudioFormat.MP3_24000HZ_MONO_256KBPS: The audio format is MP3 and the sample rate is 24 kHz.

    • AudioFormat.MP3_44100HZ_MONO_256KBPS: The audio format is MP3 and the sample rate is 44.1 kHz.

    • AudioFormat.MP3_48000HZ_MONO_256KBPS: The audio format is MP3 and the sample rate is 48 kHz.

    • AudioFormat.PCM_8000HZ_MONO_16BIT: The audio format is PCM and the sample rate is 8 kHz.

    • AudioFormat.PCM_16000HZ_MONO_16BIT: The audio format is PCM and the sample rate is 16 kHz.

    • AudioFormat.PCM_22050HZ_MONO_16BIT: The audio format is PCM and the sample rate is 22.05 kHz.

    • AudioFormat.PCM_24000HZ_MONO_16BIT: The audio format is PCM and the sample rate is 24 kHz.

    • AudioFormat.PCM_44100HZ_MONO_16BIT: The audio format is PCM and the sample rate is 44.1 kHz.

    • AudioFormat.PCM_48000HZ_MONO_16BIT: The audio format is PCM and the sample rate is 48 kHz.

  • When the audio format is Opus, adjust the bitrate using the bit_rate parameter. This feature is available only in DashScope SDK versions 1.24.0 and later.

    • AudioFormat.OGG_OPUS_8KHZ_MONO_32KBPS: The audio format is Opus, the sample rate is 8 kHz, and the bitrate is 32 kbps.

    • AudioFormat.OGG_OPUS_16KHZ_MONO_16KBPS: The audio format is Opus, the sample rate is 16 kHz, and the bitrate is 16 kbps.

    • AudioFormat.OGG_OPUS_16KHZ_MONO_32KBPS: The audio format is Opus, the sample rate is 16 kHz, and the bitrate is 32 kbps.

    • AudioFormat.OGG_OPUS_16KHZ_MONO_64KBPS: The audio format is Opus, the sample rate is 16 kHz, and the bitrate is 64 kbps.

    • AudioFormat.OGG_OPUS_24KHZ_MONO_16KBPS: The audio format is Opus, the sample rate is 24 kHz, and the bitrate is 16 kbps.

    • AudioFormat.OGG_OPUS_24KHZ_MONO_32KBPS: The audio format is Opus, the sample rate is 24 kHz, and the bitrate is 32 kbps.

    • AudioFormat.OGG_OPUS_24KHZ_MONO_64KBPS: The audio format is Opus, the sample rate is 24 kHz, and the bitrate is 64 kbps.

    • AudioFormat.OGG_OPUS_48KHZ_MONO_16KBPS: The audio format is Opus, the sample rate is 48 kHz, and the bitrate is 16 kbps.

    • AudioFormat.OGG_OPUS_48KHZ_MONO_32KBPS: The audio format is Opus, the sample rate is 48 kHz, and the bitrate is 32 kbps.

    • AudioFormat.OGG_OPUS_48KHZ_MONO_64KBPS: The audio format is Opus, the sample rate is 48 kHz, and the bitrate is 64 kbps.

volume

int

No

The volume.

Default value: 50.

Value range: [0, 100]. A value of 50 is the standard volume. The volume has a linear relationship with this value. 0 is mute and 100 is the maximum volume.

Important

This field differs in various versions of the DashScope SDK:

  • SDK versions 1.20.10 and later: volume

  • SDK versions earlier than 1.20.10: volumn

speech_rate

float

No

The speech rate.

Default value: 1.0.

Value range: [0.5, 2.0]. A value of 1.0 is the standard rate. Values less than 1.0 slow down the speech, and values greater than 1.0 speed it up.

pitch_rate

float

No

The pitch. This value is a multiplier for pitch adjustment. The relationship between this value and the perceived pitch is not strictly linear or logarithmic. Test different values to find the best one.

Default value: 1.0.

Value range: [0.5, 2.0]. A value of 1.0 is the natural pitch of the voice. Values greater than 1.0 raise the pitch, and values less than 1.0 lower it.
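
For example, the following sketch assumes that volume, speech_rate, and pitch_rate are passed as keyword arguments to the constructor, in the same way as model and voice:

synthesizer = SpeechSynthesizer(model="cosyvoice-v3-flash",
                                voice="longanyang",
                                volume=60,        # slightly louder than the standard 50
                                speech_rate=0.9,  # slightly slower than the standard 1.0
                                pitch_rate=1.1)   # slightly higher than the natural pitch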

bit_rate

int

No

The audio bitrate in kbps. If the audio format is Opus, you can adjust the bitrate using the bit_rate parameter.

Default value: 32.

Value range: [6, 510].

Note

bit_rate must be set using the additional_params parameter:

synthesizer = SpeechSynthesizer(model="cosyvoice-v3-flash",
                                voice="longanyang",
                                format=AudioFormat.OGG_OPUS_16KHZ_MONO_16KBPS,
                                additional_params={"bit_rate": 32})

word_timestamp_enabled

bool

No

Specifies whether to enable word-level timestamps.

Default value: False.

  • True

  • False

This feature applies only to cloned voices of the cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 models, and system voices in the Voice list that are marked as supported.

Timestamp results can only be retrieved through the callback interface.
Note

word_timestamp_enabled must be set using the additional_params parameter:

synthesizer = SpeechSynthesizer(model="cosyvoice-v3-flash",
                                voice="longyingjing_v3",
                                callback=callback, # Timestamp results can only be retrieved through the callback interface.
                                additional_params={'word_timestamp_enabled': True})

Complete sample code:

# coding=utf-8

import dashscope
from dashscope.audio.tts_v2 import *
import json
from datetime import datetime


def get_timestamp():
    now = datetime.now()
    formatted_timestamp = now.strftime("[%Y-%m-%d %H:%M:%S.%f]")
    return formatted_timestamp


# If you have not configured the API key as an environment variable, replace "your-api-key" with your API key.
# dashscope.api_key = "your-api-key"

# Model
model = "cosyvoice-v3-flash"
# Voice
voice = "longyingjing_v3"


# Define the callback interface.
class Callback(ResultCallback):
    _player = None
    _stream = None

    def on_open(self):
        self.file = open("output.mp3", "wb")
        print("Connection established: " + get_timestamp())

    def on_complete(self):
        print("Speech synthesis complete. All results have been received: " + get_timestamp())

    def on_error(self, message: str):
        print(f"An error occurred during speech synthesis: {message}")

    def on_close(self):
        print("Connection closed: " + get_timestamp())
        self.file.close()

    def on_event(self, message):
        json_data = json.loads(message)
        if json_data['payload'] and json_data['payload']['output'] and json_data['payload']['output']['sentence']:
            sentence = json_data['payload']['output']['sentence']
            print(f'sentence: {sentence}')
            # Get the sentence number.
            # index = sentence['index']
            words = sentence['words']
            if words:
                for word in words:
                    print(f'word: {word}')
                    # Sample value: word: {'text': 'What', 'begin_index': 0, 'end_index': 1, 'begin_time': 80, 'end_time': 200}

    def on_data(self, data: bytes) -> None:
        print(get_timestamp() + " Binary audio data length: " + str(len(data)))
        self.file.write(data)


callback = Callback()

# Instantiate SpeechSynthesizer and pass request parameters such as model and voice in the constructor.
synthesizer = SpeechSynthesizer(
    model=model,
    voice=voice,
    callback=callback,
    additional_params={'word_timestamp_enabled': True}
)

# Send the text to be synthesized and receive the binary audio data in real time in the on_data method of the callback interface.
synthesizer.call("What is the weather like today?")
# The first time you send text, a WebSocket connection must be established. Therefore, the first-packet latency includes the time taken for connection establishment.
print('[Metric] Request ID: {}, First-packet latency: {} ms'.format(
    synthesizer.get_last_request_id(),
    synthesizer.get_first_package_delay()))

seed

int

No

The random number seed used during generation, which varies the synthesis effect. If the model version, text, voice, and other parameters are the same, using the same seed reproduces the same synthesis result.

Default value: 0.

Value range: [0, 65535].

language_hints

list[str]

No

Specifies the target language for speech synthesis to improve the synthesis effect.

Use this parameter when the pronunciation of numbers, abbreviations, or symbols, or when the synthesis effect for non-Chinese languages, does not meet expectations. For example:

  • The number is not read as expected. For example, "hello, this is 110" is read as "hello, this is one one zero" instead of "hello, this is yao yao ling".

  • The symbol is not read correctly. For example, the at sign (@) is read as "ai te" instead of "at".

  • The synthesis effect for non-Chinese languages is poor and sounds unnatural.

Valid values:

  • zh: Chinese

  • en: English

  • fr: French

  • de: German

  • ja: Japanese

  • ko: Korean

  • ru: Russian

Note: Although this parameter is an array, the current version processes only the first element. Therefore, you must pass only one value.

Important

This parameter specifies the target language for speech synthesis. This setting is independent of the language of the sample audio used for voice cloning. To set the source language for a voice cloning task, see CosyVoice voice cloning API.

instruction

str

No

Sets an instruction that adjusts the speech synthesis output. This feature is available only for cloned voices of the cosyvoice-v3-flash and cosyvoice-v3-plus models, and for system voices marked as supported in the Voice list.

No default value. This parameter has no effect if it is not set.

The instruction can have the following effects:

  1. Specifies a dialect (for cloned voices only)

    • Format: "Express in <dialect>." (Note: Do not omit the period (。) at the end. Replace <dialect> with a specific dialect, such as 广东话.)

    • Example: "Please say this in Cantonese."

    • Supported dialects: 广东话 (Cantonese), 东北话 (Dongbei), 甘肃话 (Gansu), 贵州话 (Guizhou), 河南话 (Henan), 湖北话 (Hubei), 江西话 (Jiangxi), 闽南话 (Minnan), 宁夏话 (Ningxia), 山西话 (Shanxi), 陕西话 (Shaanxi), 山东话 (Shandong), 上海话 (Shanghainese), 四川话 (Sichuan), 天津话 (Tianjin), and 云南话 (Yunnan).

  2. Specifies emotion, scenario, role, or identity. Only some system voices support this feature, and it varies by voice. For more information, see Voice list.

enable_aigc_tag

bool

No

Specifies whether to add an invisible AIGC identifier to the generated audio. If set to True, the invisible identifier is embedded into the audio for supported formats (WAV, MP3, and Opus).

Default value: False.

This feature is supported only by the cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 models.

Note

enable_aigc_tag must be set using the additional_params parameter:

synthesizer = SpeechSynthesizer(model="cosyvoice-v3-flash",
                                voice="longanyang",
                                format=AudioFormat.OGG_OPUS_16KHZ_MONO_16KBPS,
                                additional_params={"enable_aigc_tag": True})

aigc_propagator

str

No

Sets the ContentPropagator field in the invisible AIGC identifier to identify the content propagator. This parameter takes effect only when enable_aigc_tag is True.

Default value: Alibaba Cloud UID.

This feature is supported only by the cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 models.

Note

aigc_propagator must be set using the additional_params parameter:

synthesizer = SpeechSynthesizer(model="cosyvoice-v3-flash",
                                voice="longanyang",
                                format=AudioFormat.OGG_OPUS_16KHZ_MONO_16KBPS,
                                additional_params={"enable_aigc_tag": True, "aigc_propagator": "xxxx"})

aigc_propagate_id

str

No

Sets the PropagateID field in the invisible AIGC identifier to uniquely identify a specific propagation action. This parameter takes effect only when enable_aigc_tag is True.

Default value: The Request ID of the current speech synthesis request.

This feature is supported only by the cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 models.

Note

aigc_propagate_id must be set using the additional_params parameter:

synthesizer = SpeechSynthesizer(model="cosyvoice-v3-flash",
                                voice="longanyang",
                                format=AudioFormat.OGG_OPUS_16KHZ_MONO_16KBPS,
                                additional_params={"enable_aigc_tag": True, "aigc_propagate_id": "xxxx"})

callback

ResultCallback

No

ResultCallback interface.

Key interfaces

SpeechSynthesizer class

The SpeechSynthesizer class is the main interface for speech synthesis. You can import this class using from dashscope.audio.tts_v2 import *.

Method

Parameters

Return value

Description

def call(self, text: str, timeout_millis=None)
  • text: The text to synthesize.

  • timeout_millis: The timeout in milliseconds for a blocking thread. This parameter does not take effect if it is not set or is set to 0.

Returns binary audio data if ResultCallback is not specified. Otherwise, it returns None.

Transforms an entire segment of text into speech. The text can be plain text or contain SSML.

When you create a SpeechSynthesizer instance, two scenarios are possible:

  • If you do not specify a ResultCallback, the call method blocks the current thread until speech synthesis is complete and returns the binary audio data. For more information, see non-streaming call.

  • If you specify a ResultCallback, the call method immediately returns None. The speech synthesis result is returned through the on_data method of the ResultCallback interface. For more information, see unidirectional streaming call.

Important

Before each call to the call method, you must create a new SpeechSynthesizer instance.

def streaming_call(self, text: str)

text: The text segment to synthesize.

None

Streams the text to synthesize. Text that contains SSML is not supported.

You can call this interface multiple times to send the text to be synthesized to the server in segments. The synthesis result is retrieved through the on_data method of the ResultCallback interface.

For more information, see Bidirectional streaming call.

def streaming_complete(self, complete_timeout_millis=600000)

complete_timeout_millis: The wait time in milliseconds.

None

Ends the streaming speech synthesis.

This method blocks the current thread for the duration specified by complete_timeout_millis until the task is complete. If complete_timeout_millis is set to 0, the thread waits indefinitely.

By default, the wait stops if the wait time exceeds 10 minutes.

For more information, see Bidirectional streaming call.

Important

When making bidirectional streaming calls, call this method. Otherwise, the synthesized speech may be incomplete.

def get_last_request_id(self)

None

The request ID of the last task.

Gets the request ID of the last task.

def get_first_package_delay(self)

None

First-packet latency

Gets the first-packet latency. The latency is typically about 500 ms.

First-packet latency is the time in milliseconds from when you send the text to when you receive the first audio packet. Check the latency after the task is complete.

When you send text for the first time, a WebSocket connection must be established. Therefore, the first-packet latency includes the time required to establish the connection.

def get_response(self)

None

The last message

Gets the last message, which is in JSON format. You can use this to retrieve task-failed errors.
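
A hedged usage sketch, assuming the imports from the earlier samples (the try/finally pattern here is illustrative, not prescribed by the SDK):

synthesizer = SpeechSynthesizer(model="cosyvoice-v3-flash", voice="longanyang")
try:
    audio = synthesizer.call("What is the weather like today?")
finally:
    # The last message is in JSON format; after a failure it contains
    # the task-failed error details.
    print(synthesizer.get_response())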

Callback interface (ResultCallback)

For a unidirectional streaming call or a bidirectional streaming call, the server returns key process information and data to the client through a callback. You must implement the callback methods to process the returned information and data.

You can import it using from dashscope.audio.tts_v2 import *.

Sample:

class Callback(ResultCallback):
    def on_open(self) -> None:
        print('Connection successful')
    
    def on_data(self, data: bytes) -> None:
        # Implement the logic to receive the synthesized binary audio result.
        pass

    def on_complete(self) -> None:
        print('Synthesis complete')

    def on_error(self, message) -> None:
        print('An exception occurred: ', message)

    def on_close(self) -> None:
        print('Connection closed')


callback = Callback()

Method

Parameters

Return value

Description

def on_open(self) -> None

None

None

This method is called immediately after a connection is established with the server.

def on_event(self, message: str) -> None

message: The information returned by the server.

None

This method is called when the service sends a response. The message is a JSON string. Parse the string to obtain information such as the task ID (the task_id parameter) and the number of billable characters in the request (the characters parameter).

def on_complete(self) -> None

None

None

This method is called after all synthesized data is returned and the speech synthesis is complete.

def on_error(self, message) -> None

message: The error message.

None

This method is called when an exception occurs.

def on_data(self, data: bytes) -> None

data: The binary audio data returned by the server.

None

This method is called when the server returns synthesized audio.

Combine the binary audio data into a complete audio file for playback, or play it in real time with a player that supports streaming playback.

Important
  • In streaming speech synthesis, for compressed formats such as MP3 and Opus, use a streaming player to play the audio segments. Do not decode and play them frame by frame, because this can cause decoding failures.

    Players that support streaming playback include ffmpeg, pyaudio (Python), AudioFormat (Java), and MediaSource (JavaScript).
  • When combining audio data into a complete audio file, append the data to the same file.

  • For WAV and MP3 audio formats in streaming speech synthesis, only the first frame contains header information. Subsequent frames contain only audio data.

def on_close(self) -> None

None

None

This method is called after the service has closed the connection.

Response

The server returns binary audio data.

Error codes

For troubleshooting information, see Error messages.

More examples

For more examples, see GitHub.

FAQ

Features, billing, and rate limiting

Q: What can I do to fix inaccurate pronunciation?

You can use SSML to customize the speech synthesis output.

Q: Speech synthesis is billed based on the number of text characters. How can I view or obtain the text length for each synthesis?

This depends on whether logging is enabled:

  1. Logging is disabled.

    • For a non-streaming call, you can calculate the number of characters according to the character counting rules.

    • Alternatively, you can retrieve the information from the message parameter of the on_event method of the ResultCallback interface. The message is a JSON string that you can parse to read the number of billable characters for the current request from the characters parameter. Use the last message that you receive (see the sketch after this answer).

  2. Logging is enabled.

    If logging is enabled, the console prints a log that contains the characters parameter. This parameter indicates the number of billable characters for the request. Use the value from the last log entry for the request.

    2025-08-27 11:02:09,429 - dashscope - speech_synthesizer.py - on_message - 454 - DEBUG - <<<recv {"header":{"task_id":"62ebb7d6cb0a4080868f0edb######","event":"result-generated","attributes":{}},"payload":{"output":{"sentence":{"words":[]}},"usage":{"characters":15}}}

How to enable logging:

You can enable logging by setting an environment variable on the command line:

  • Windows (PowerShell): $env:DASHSCOPE_LOGGING_LEVEL="debug"

  • Linux/macOS: export DASHSCOPE_LOGGING_LEVEL=debug
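
For the callback-based approach in option 1 above, the following is a minimal sketch of reading the billable character count from the on_event message. It assumes the message structure shown in the log sample, with the count under payload.usage.characters:

# coding=utf-8
import json

from dashscope.audio.tts_v2 import ResultCallback

class BillingCallback(ResultCallback):
    def __init__(self):
        self.characters = None

    def on_event(self, message):
        # The message is a JSON string. Keep the value from the last message,
        # which carries the billable character count for the request.
        data = json.loads(message)
        usage = (data.get("payload") or {}).get("usage") or {}
        if "characters" in usage:
            self.characters = usage["characters"]

    def on_data(self, data: bytes) -> None:
        pass  # Handle the audio data as needed.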

Troubleshooting

If you encounter a code error, refer to Error codes to troubleshoot the issue.

Q: How do I get the Request ID?

You can retrieve it by calling the get_last_request_id method of the SpeechSynthesizer class, as shown in the code samples in this topic.

Q: Why does the SSML feature fail?

Check the following:

  1. Ensure that the model and voice are within the supported scope for SSML. For more information, see SSML support.

  2. Ensure that you have installed the latest version of the DashScope SDK.

  3. Ensure that you are using the correct interface. SSML is supported only by the call method of the SpeechSynthesizer class.

  4. Ensure that the text for synthesis is in plain text and meets the required format. For more information, see Introduction to SSML.

Q: Why can't the audio be played?

Troubleshoot this issue based on the following scenarios:

  1. The audio is saved as a complete file, such as an .mp3 file.

    1. Audio format consistency: Ensure that the audio format specified in the request parameters matches the file extension. For example, playback might fail if the audio format is set to WAV in the request parameters but the file has an .mp3 extension.

    2. Player compatibility: Confirm that your player supports the format and sample rate of the audio file. For example, some players might not support high sample rates or specific audio encodings.

  2. The audio is played in streaming mode.

    1. Save the audio stream as a complete file and try to play it. If the file fails to play, see the troubleshooting steps for the first scenario.

    2. If the file plays correctly, the issue might be with the streaming playback implementation. Confirm that your player supports streaming playback.

      Common tools and libraries that support streaming playback include ffmpeg, pyaudio (Python), AudioFormat (Java), and MediaSource (JavaScript).

Q: Why does the audio playback stutter?

Troubleshoot this issue based on the following scenarios:

  1. Check the text sending speed: Ensure that the interval between text segments is reasonable. If you send the next segment only after the audio for the previous segment has finished playing, gaps occur in playback.

  2. Check the callback function performance:

    • Check whether the callback function contains excessive business logic that could cause it to block.

    • The callback function runs in the WebSocket thread. If this thread is blocked, it can interfere with the WebSocket's ability to receive network packets, resulting in audio stuttering.

    • To avoid blocking the WebSocket thread, write the audio data to a separate audio buffer and then use another thread to read and process it (see the sketch after this list).

  3. Check network stability: Ensure that your network connection is stable to prevent audio transmission interruptions or delays caused by network fluctuations.
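
The following sketch shows the buffering approach described above, using Python's standard queue and threading modules together with pyaudio. It is an illustrative pattern, not part of the SDK:

# coding=utf-8
import queue
import threading

import pyaudio
from dashscope.audio.tts_v2 import ResultCallback

audio_queue = queue.Queue()

player = pyaudio.PyAudio()
stream = player.open(format=pyaudio.paInt16, channels=1, rate=22050, output=True)

def playback_worker():
    # Read audio chunks from the buffer and feed them to the player in a
    # separate thread, so the WebSocket callback thread is never blocked.
    while True:
        chunk = audio_queue.get()
        if chunk is None:  # Sentinel: synthesis has finished.
            break
        stream.write(chunk)

threading.Thread(target=playback_worker, daemon=True).start()

class BufferedCallback(ResultCallback):
    def on_data(self, data: bytes) -> None:
        # Only enqueue the audio here; keep the callback thread free.
        audio_queue.put(data)

    def on_complete(self):
        audio_queue.put(None)  # Signal the playback thread to stop.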

Q: Why is speech synthesis slow (long synthesis time)?

Perform the following troubleshooting steps:

  1. Check the input interval

    If you are using streaming speech synthesis, check whether the text sending interval is too long. For example, a delay of several seconds before sending the next segment will increase the total synthesis time.

  2. Analyze performance metrics

    • First packet delay: This is typically around 500 ms.

    • Real-Time Factor (RTF): This is calculated as Total Synthesis Time / Audio Duration. The RTF is normally less than 1.0.
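
A simple worked example of the RTF calculation:

# Real-Time Factor (RTF) = total synthesis time / audio duration.
total_synthesis_time_s = 1.2  # Example: synthesis took 1.2 seconds.
audio_duration_s = 5.0        # Example: the generated audio lasts 5 seconds.
rtf = total_synthesis_time_s / audio_duration_s
print(f"RTF = {rtf:.2f}")     # 0.24, below the normal threshold of 1.0.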

Q: How do I handle incorrect pronunciation in the synthesized speech?

Use the <phoneme> tag of SSML to specify the correct pronunciation.

Q: Why is no speech returned? Why is the end of the text not successfully converted to speech? (Missing synthesized speech)

Check whether you called the streaming_complete method of the SpeechSynthesizer class. The server caches text and begins synthesis only after it has received enough text. If you do not call the streaming_complete method, the text remaining in the cache may not be synthesized.

Q: How do I handle an SSL certificate verification failure?

  1. Install the system root certificate (the following commands apply to yum-based systems such as CentOS).

    sudo yum install -y ca-certificates
    sudo update-ca-trust enable
  2. Add the following content to your code.

    import os
    os.environ["SSL_CERT_FILE"] = "/etc/ssl/certs/ca-bundle.crt"

Q: What causes the "SSL: CERTIFICATE_VERIFY_FAILED" exception on macOS? (websocket closed due to [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1000))

When connecting to a WebSocket, you may encounter an OpenSSL certificate verification failure with a message indicating that the certificate cannot be found. This usually occurs because of an incorrect certificate configuration in the Python environment. Follow these steps to manually locate and fix the certificate issue:

  1. Export the system certificate and set the environment variable. Run the following commands to export all certificates from your macOS system to a file and set this file as the default certificate path for Python and its related libraries:

    security find-certificate -a -p > ~/all_mac_certs.pem
    export SSL_CERT_FILE=~/all_mac_certs.pem
    export REQUESTS_CA_BUNDLE=~/all_mac_certs.pem
  2. Create a symbolic link to fix Python's OpenSSL configuration. If Python's OpenSSL configuration is missing certificates, run the following command to create a symbolic link. Make sure to replace the path in the command with the actual installation path of your local Python version:

    # 3.9 is a sample version number. Adjust the path according to your locally installed Python version.
    ln -s /etc/ssl/* /Library/Frameworks/Python.framework/Versions/3.9/etc/openssl
  3. Restart the terminal and clear the cache. After you complete the preceding steps, close and reopen the terminal to ensure that the environment variables take effect. Clear any cache that might exist and try to connect to the WebSocket again.

These steps should resolve connection issues caused by incorrect certificate configurations. If the problem persists, check whether the certificate configuration on the target server is correct.

Q: What causes the "AttributeError: module 'websocket' has no attribute 'WebSocketApp'. Did you mean: 'WebSocket'?" error when running the code?

This error occurs because the websocket-client package is not installed or its version is incompatible. Run the following commands to resolve the issue:

pip uninstall websocket-client
pip uninstall websocket
pip install websocket-client

Permissions and authentication

Q: I want my API key to be used only for the CosyVoice speech synthesis service, not for other Model Studio models (permission isolation). What should I do?

You can create a workspace and authorize only specific models to limit the scope of the API key. For more information, see Manage workspaces.

More questions

For more information, see the Q&A on GitHub.