
Alibaba Cloud Model Studio: CosyVoice Speech Synthesis Python SDK

Last Updated: Nov 11, 2025

This topic describes the parameters and interfaces of the CosyVoice speech synthesis Python SDK.

Important

This document applies only to the China (Beijing) region. To use the model, you must use an API key from the China (Beijing) region.

User guide: For more information about the models and guidance on model selection, see Speech synthesis - CosyVoice/Sambert.

Prerequisites

  • You have activated the Model Studio and created an API key. To prevent security risks, export the API key as an environment variable instead of hard-coding it in your code.

    Note

    To grant temporary access permissions to third-party applications or users, or if you want to strictly control high-risk operations such as accessing or deleting sensitive data, we recommend that you use a temporary authentication token.

    Compared with long-term API keys, temporary authentication tokens are more secure because they are short-lived (60 seconds). They are suitable for temporary call scenarios and can effectively reduce the risk of API key leakage.

    To use a temporary token, replace the API key used for authentication in your code with the temporary authentication token.
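
    For example, on Linux or macOS, you can export the API key as an environment variable (the DashScope SDK reads the DASHSCOPE_API_KEY variable automatically):

    export DASHSCOPE_API_KEY="your-api-key"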

  • Install the latest version of the DashScope SDK.
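
    For example, install or upgrade the SDK with pip:

    pip install -U dashscope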

Models and pricing

  • cosyvoice-v3-plus: $0.286706 per 10,000 characters

  • cosyvoice-v2: $0.286706 per 10,000 characters

Character billing rule: One Chinese character is counted as two characters. English letters, numbers, punctuation, and spaces are each counted as one character.

For more information, see Throttling.

Text and format limitations

Text length limits

  • For non-streaming calls (synchronous call or asynchronous invocation), the text sent in a single request cannot exceed 2,000 characters in length.

  • For streaming calls, the text sent in a single request cannot exceed 2,000 characters in length, and the total length of all text sent cannot exceed 200,000 characters.

Character calculation rules

  • Chinese characters: 2 characters each

  • English letters, numbers, punctuation, and spaces: 1 character each

  • The content of SSML tags is included when calculating the text length.

  • Examples:

    • "Hello" → 4 distinct characters

    • "ChineseA123" → 2 + 1 + 2 + 1 + 1 + 1 = 8 characters

    • "Chinese." → 2 + 2 + 1 = 5 characters

    • "Chinese." → 2+1+2+1=6 characters

    • "<speak>你好</speak>" → 7 + 4 + 8 = 19 characters

Encoding format

Use UTF-8 encoding.

Support for mathematical expressions

The mathematical expression parsing feature is available only for the cosyvoice-v2 model. It supports common mathematical expressions, such as those in primary and secondary school curricula, including basic arithmetic, algebra, and geometry.

For more information, see Convert LaTeX formulas to speech.

SSML support

The Speech Synthesis Markup Language (SSML) feature is available only for some voices of the cosyvoice-v2 model. Check the voice list to confirm whether a voice supports SSML. To use SSML, the following conditions must be met: the model parameter must be cosyvoice-v2, the selected voice must support SSML, and the SSML text must be sent through the call method of the SpeechSynthesizer class (the streaming_call method does not support SSML).
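
A minimal hedged sketch of an SSML call; the voice is illustrative, so confirm SSML support in the voice list first:

# coding=utf-8
from dashscope.audio.tts_v2 import *

synthesizer = SpeechSynthesizer(model="cosyvoice-v2", voice="longxiaochun_v2")
# SSML text must go through call(); streaming_call() accepts plain text only.
audio = synthesizer.call("<speak>今天天气怎么样?</speak>")
with open("output.mp3", "wb") as f:
    f.write(audio)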

Getting started

The SpeechSynthesizer class provides the key interfaces for speech synthesis and supports the following call methods:

  • Synchronous call: After you submit the text, the server immediately processes it and returns the complete synthesized speech. The entire process is blocking. The client must wait for the server to finish processing before it can perform the next operation. This method is suitable for speech synthesis scenarios that involve short text.

  • Asynchronous invocation: Send the full text to the server in a single request and receive the synthesized speech in real time. You cannot send the text in segments. This method is suitable for speech synthesis scenarios that involve short text and require high real-time performance.

  • Streaming call: Send the text to the server in segments and receive the synthesized speech in real time. The server starts processing as soon as it receives a portion of the text. This method is suitable for speech synthesis scenarios that involve long text and require high real-time performance.

Synchronous call

Submit a single speech synthesis task and obtain the complete result at once without using a callback function for streaming intermediate results.


Instantiate the SpeechSynthesizer class, bind the request parameters, and call the call method to synthesize the speech and obtain the binary audio data.

The length of the sent text cannot exceed 2,000 characters. For more information, see the call method of the SpeechSynthesizer class.

Important

You must re-initialize the SpeechSynthesizer instance before each call to the call method.
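
For example, to synthesize several texts with the call method, create a fresh instance for each request:

# Each call() requires a newly initialized SpeechSynthesizer instance:
for text in ["First sentence.", "Second sentence."]:
    synthesizer = SpeechSynthesizer(model="cosyvoice-v2", voice="longxiaochun_v2")
    audio = synthesizer.call(text)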

Complete example:

# coding=utf-8

import dashscope
from dashscope.audio.tts_v2 import *

# If the API key is not configured in the environment variables,
# replace "your-api-key" with your own API key.
# dashscope.api_key = "your-api-key"

# Model
model = "cosyvoice-v2"
# Voice
voice = "longxiaochun_v2"

# Instantiate SpeechSynthesizer and pass request parameters
# such as model and voice in the constructor.
synthesizer = SpeechSynthesizer(model=model, voice=voice)
# Send the text to be synthesized and get the binary audio data.
audio = synthesizer.call("What is the weather like today?")
# A WebSocket connection must be established when sending text for the first time.
# Therefore, the first-packet latency includes the time to establish the connection.
print('[Metric] Request ID: {}, First-packet latency: {} ms'.format(
    synthesizer.get_last_request_id(),
    synthesizer.get_first_package_delay()))

# Save the audio to a local file.
with open('output.mp3', 'wb') as f:
    f.write(audio)

Asynchronous invocation

Submit a single speech synthesis task and receive streaming intermediate results through a callback. The synthesis results are streamed through the callback functions in ResultCallback.


Instantiate the SpeechSynthesizer class, bind the request parameters and the ResultCallback interface, and call the call method to synthesize the speech. The results are retrieved in real time through the on_data method of the ResultCallback interface.

The length of the sent text cannot exceed 2,000 characters. For more information, see the call method of the SpeechSynthesizer class.

Important

You must re-initialize the SpeechSynthesizer instance before each call to the call method.

Complete example:

# coding=utf-8

import dashscope
from dashscope.audio.tts_v2 import *

from datetime import datetime

def get_timestamp():
    now = datetime.now()
    formatted_timestamp = now.strftime("[%Y-%m-%d %H:%M:%S.%f]")
    return formatted_timestamp

# If the API key is not configured in the environment variables,
# replace "your-api-key" with your own API key.
# dashscope.api_key = "your-api-key"

# Model
model = "cosyvoice-v2"
# Voice
voice = "longxiaochun_v2"


# Define the callback interface.
class Callback(ResultCallback):

    def on_open(self):
        self.file = open("output.mp3", "wb")
        print("Connection established: " + get_timestamp())

    def on_complete(self):
        print("Speech synthesis completed. All results have been received: " + get_timestamp())

    def on_error(self, message: str):
        print(f"An error occurred during speech synthesis: {message}")

    def on_close(self):
        print("Connection closed: " + get_timestamp())
        self.file.close()

    def on_event(self, message):
        pass

    def on_data(self, data: bytes) -> None:
        print(get_timestamp() + " Binary audio length: " + str(len(data)))
        self.file.write(data)


callback = Callback()

# Instantiate SpeechSynthesizer and pass request parameters
# such as model and voice in the constructor.
synthesizer = SpeechSynthesizer(
    model=model,
    voice=voice,
    callback=callback,
)

# Send the text to be synthesized and get the binary audio in real time
# through the on_data method of the callback interface.
synthesizer.call("What is the weather like today?")
# A WebSocket connection must be established when sending text for the first time.
# Therefore, the first-packet latency includes the time to establish the connection.
print('[Metric] Request ID: {}, First-packet latency: {} ms'.format(
    synthesizer.get_last_request_id(),
    synthesizer.get_first_package_delay()))

Streaming call

Submit text in multiple parts within a single speech synthesis task and receive the synthesis results in real time through a callback.

Note
  • For streaming input, call the streaming_call method multiple times to submit text segments in order. The server automatically splits the text into sentences when it receives the segments:

    • Complete sentences are synthesized immediately.

    • Incomplete sentences are cached until they are complete and are then synthesized.

    When you call the streaming_complete method, the server forcibly synthesizes all received but unprocessed text segments, including incomplete sentences.

  • The interval between sending text segments cannot exceed 23 seconds. Otherwise, a "request timeout after 23 seconds" exception is triggered.

    If you have no more text to send, promptly call the streaming_complete method to end the task.

    The server enforces a 23-second timeout. This configuration cannot be changed by the client.
  1. Instantiate the SpeechSynthesizer class

    Instantiate the SpeechSynthesizer class and bind the request parameters and the ResultCallback interface.

  2. Stream data

    Call the streaming_call method of the SpeechSynthesizer class multiple times to submit the text to be synthesized in segments.

    While you send the text, the server returns the synthesis results to the client in real time through the on_data method of the ResultCallback interface.

    The length of the text segment (text) sent in each streaming_call cannot exceed 2,000 characters. The total length of all sent text cannot exceed 200,000 characters.

  3. End processing

    Call the streaming_complete method of the SpeechSynthesizer class to end the speech synthesis.

    This method blocks the current thread until the on_complete or on_error callback of the ResultCallback interface is triggered. The thread is unblocked only after the callback is triggered.

    You must call this method. Otherwise, the text at the end may not be converted to speech.

Complete example:

# coding=utf-8
#
# pyaudio installation instructions:
# For macOS, run the following commands:
#   brew install portaudio
#   pip install pyaudio
# For Debian/Ubuntu, run the following commands:
#   sudo apt-get install python-pyaudio python3-pyaudio
#   or
#   pip install pyaudio
# For CentOS, run the following commands:
#   sudo yum install -y portaudio portaudio-devel && pip install pyaudio
# For Microsoft Windows, run the following command:
#   python -m pip install pyaudio

import time
import pyaudio
import dashscope
from dashscope.audio.tts_v2 import *

from datetime import datetime

def get_timestamp():
    now = datetime.now()
    formatted_timestamp = now.strftime("[%Y-%m-%d %H:%M:%S.%f]")
    return formatted_timestamp

# If the API key is not configured in the environment variables,
# replace "your-api-key" with your own API key.
# dashscope.api_key = "your-api-key"

# Model
model = "cosyvoice-v2"
# Voice
voice = "longxiaochun_v2"


# Define the callback interface.
class Callback(ResultCallback):
    _player = None
    _stream = None

    def on_open(self):
        print("Connection established: " + get_timestamp())
        self._player = pyaudio.PyAudio()
        self._stream = self._player.open(
            format=pyaudio.paInt16, channels=1, rate=22050, output=True
        )

    def on_complete(self):
        print("Speech synthesis completed. All results have been received: " + get_timestamp())

    def on_error(self, message: str):
        print(f"An error occurred during speech synthesis: {message}")

    def on_close(self):
        print("Connection closed: " + get_timestamp())
        # Stop the player.
        self._stream.stop_stream()
        self._stream.close()
        self._player.terminate()

    def on_event(self, message):
        pass

    def on_data(self, data: bytes) -> None:
        print(get_timestamp() + " Binary audio length: " + str(len(data)))
        self._stream.write(data)


callback = Callback()

test_text = [
    "Streaming text-to-speech SDK, ",
    "can convert input text ",
    "into binary speech data. ",
    "Compared to non-streaming speech synthesis, ",
    "the advantage of streaming synthesis is its real-time performance, ",
    "which is much stronger. Users can hear ",
    "nearly synchronous speech output while inputting text, ",
    "greatly improving the interactive experience ",
    "and reducing user waiting time. ",
    "It is suitable for calling large ",
    "language models (LLMs) to perform ",
    "speech synthesis by ",
    "streaming text input.",
]

# Instantiate SpeechSynthesizer and pass request parameters
# such as model and voice in the constructor.
synthesizer = SpeechSynthesizer(
    model=model,
    voice=voice,
    format=AudioFormat.PCM_22050HZ_MONO_16BIT,  
    callback=callback,
)


# Stream the text to be synthesized. Get the binary audio in real time
# through the on_data method of the callback interface.
for text in test_text:
    synthesizer.streaming_call(text)
    time.sleep(0.1)
# End the streaming speech synthesis.
synthesizer.streaming_complete()

# A WebSocket connection must be established when sending text for the first time.
# Therefore, the first-packet latency includes the time to establish the connection.
print('[Metric] Request ID: {}, First-packet latency: {} ms'.format(
    synthesizer.get_last_request_id(),
    synthesizer.get_first_package_delay()))

Request parameters

Set the request parameters through the constructor of the SpeechSynthesizer class.

Parameter

Type

Default value

Required

Description

model

str

-

Yes

Specifies the model.

Different model versions share the same codec. However, the model and voice must match: each model version can only use its own default or exclusive voices.

voice

str

-

Yes

Specify the voice to use for speech synthesis.

The following voice types are available:

  • Default voices: See the Voice list section.

  • Custom voice: Created using the Voice Cloning feature. When using a custom voice, ensure that Voice Cloning and Speech Synthesis use the same account. For more information, see CosyVoice Voice Cloning API.

⚠️ When you use a voice cloning model for speech synthesis, use only the custom voice generated by that model. Do not use a default voice.

⚠️ When you use a custom voice for speech synthesis, the speech synthesis model (model) must be the same as the voice cloning model (target_model).

format

enum

Varies by voice

No

Specifies the audio coding format and sample rate.

If format is not specified, the synthesized audio has a sample rate of 22.05 kHz and is in MP3 format.

Note

The default sample rate is the optimal sample rate for the current voice. By default, the output uses this sample rate. Downsampling and upsampling are also supported.

The following audio coding formats and sample rates are available:

  • Audio coding formats and sample rates supported by all models:

    • AudioFormat.WAV_8000HZ_MONO_16BIT: WAV format, 8 kHz sample rate

    • AudioFormat.WAV_16000HZ_MONO_16BIT: WAV format, 16 kHz sample rate

    • AudioFormat.WAV_22050HZ_MONO_16BIT: WAV format, 22.05 kHz sample rate

    • AudioFormat.WAV_24000HZ_MONO_16BIT: WAV format, 24 kHz sample rate

    • AudioFormat.WAV_44100HZ_MONO_16BIT: WAV format, 44.1 kHz sample rate

    • AudioFormat.WAV_48000HZ_MONO_16BIT: WAV format, 48 kHz sample rate

    • AudioFormat.MP3_8000HZ_MONO_128KBPS: MP3 format, 8 kHz sample rate

    • AudioFormat.MP3_16000HZ_MONO_128KBPS: MP3 format, 16 kHz sample rate

    • AudioFormat.MP3_22050HZ_MONO_256KBPS: MP3 format, 22.05 kHz sample rate

    • AudioFormat.MP3_24000HZ_MONO_256KBPS: MP3 format, 24 kHz sample rate

    • AudioFormat.MP3_44100HZ_MONO_256KBPS: MP3 format, 44.1 kHz sample rate

    • AudioFormat.MP3_48000HZ_MONO_256KBPS: MP3 format, 48 kHz sample rate

    • AudioFormat.PCM_8000HZ_MONO_16BIT: PCM format, 8 kHz sample rate

    • AudioFormat.PCM_16000HZ_MONO_16BIT: PCM format, 16 kHz sample rate

    • AudioFormat.PCM_22050HZ_MONO_16BIT: PCM format, 22.05 kHz sample rate

    • AudioFormat.PCM_24000HZ_MONO_16BIT: PCM format, 24 kHz sample rate

    • AudioFormat.PCM_44100HZ_MONO_16BIT: PCM format, 44.1 kHz sample rate

    • AudioFormat.PCM_48000HZ_MONO_16BIT: PCM format, 48 kHz sample rate

  • When the audio format is Opus, adjust the bitrate using the bit_rate parameter. This applies only to DashScope SDK 1.24.0 and later.

    • AudioFormat.OGG_OPUS_8KHZ_MONO_32KBPS: Opus format, 8 kHz sample rate, 32 kbps bitrate

    • AudioFormat.OGG_OPUS_16KHZ_MONO_16KBPS: Opus format, 16 kHz sample rate, 16 kbps bitrate

    • AudioFormat.OGG_OPUS_16KHZ_MONO_32KBPS: Opus format, 16 kHz sample rate, 32 kbps bitrate

    • AudioFormat.OGG_OPUS_16KHZ_MONO_64KBPS: Opus format, 16 kHz sample rate, 64 kbps bitrate

    • AudioFormat.OGG_OPUS_24KHZ_MONO_16KBPS: Opus format, 24 kHz sample rate, 16 kbps bitrate

    • AudioFormat.OGG_OPUS_24KHZ_MONO_32KBPS: Opus format, 24 kHz sample rate, 32 kbps bitrate

    • AudioFormat.OGG_OPUS_24KHZ_MONO_64KBPS: Opus format, 24 kHz sample rate, 64 kbps bitrate

    • AudioFormat.OGG_OPUS_48KHZ_MONO_16KBPS: Opus format, 48 kHz sample rate, 16 kbps bitrate

    • AudioFormat.OGG_OPUS_48KHZ_MONO_32KBPS: Opus format, 48 kHz sample rate, 32 kbps bitrate

    • AudioFormat.OGG_OPUS_48KHZ_MONO_64KBPS: Opus format, 48 kHz sample rate, 64 kbps bitrate
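
For example, to request 24 kHz WAV output explicitly:

synthesizer = SpeechSynthesizer(model="cosyvoice-v2",
                                voice="longxiaochun_v2",
                                format=AudioFormat.WAV_24000HZ_MONO_16BIT)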

volume

int

50

No

The volume of the synthesized audio. Valid values: 0 to 100.

Important

This field differs in different versions of the DashScope SDK:

  • SDK 1.20.10 and later: volume

  • SDK versions earlier than 1.20.10: volumn

speech_rate

float

1.0

No

The speech rate of the synthesized audio. Valid values: 0.5 to 2.

  • 0.5: 0.5 times the default speed.

  • 1: The default speed, at which the model outputs synthesized speech at about four characters per second. The exact rate may vary slightly depending on the voice.

  • 2: 2 times the default speed.

pitch_rate

float

1.0

No

The pitch of the synthesized audio. Valid values: 0.5 to 2.
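
A combined sketch using the volume, speech_rate, and pitch_rate parameters with illustrative values:

synthesizer = SpeechSynthesizer(model="cosyvoice-v2",
                                voice="longxiaochun_v2",
                                volume=80,        # valid range: 0 to 100
                                speech_rate=1.2,  # valid range: 0.5 to 2
                                pitch_rate=0.9)   # valid range: 0.5 to 2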

bit_rate

int

32

No

Specifies the audio bitrate, in kbps. Valid values: 6 to 510.

A higher bitrate results in better audio quality and a larger audio file.

Available only when the audio format (format) is Opus.

Note

Set the bit_rate parameter using the additional_params parameter:

synthesizer = SpeechSynthesizer(model="cosyvoice-v2",
                                voice="longxiaochun_v2",
                                format=AudioFormat.OGG_OPUS_16KHZ_MONO_16KBPS,
                                additional_params={"bit_rate": 32})

word_timestamp_enabled

bool

False

No

Specifies whether to enable word-level timestamps. The default value is false.

This feature is supported only by cosyvoice-v2.

Timestamp results can be obtained only through the callback interface.
Note

Set the word_timestamp_enabled parameter using the additional_params parameter:

synthesizer = SpeechSynthesizer(model="cosyvoice-v2",
                                voice="longxiaochun_v2",
                                callback=callback, # Timestamp results can only be obtained through the callback interface.
                                additional_params={'word_timestamp_enabled': True})

Complete example code:

# coding=utf-8

import dashscope
from dashscope.audio.tts_v2 import *
import json
from datetime import datetime


def get_timestamp():
    now = datetime.now()
    formatted_timestamp = now.strftime("[%Y-%m-%d %H:%M:%S.%f]")
    return formatted_timestamp


# If the API key is not configured in the environment variables,
# replace "your-api-key" with your own API key.
# dashscope.api_key = "your-api-key"

# The model must be cosyvoice-v2.
model = "cosyvoice-v2"
# Voice
voice = "longxiaochun_v2"


# Define the callback interface.
class Callback(ResultCallback):

    def on_open(self):
        self.file = open("output.mp3", "wb")
        print("Connection established: " + get_timestamp())

    def on_complete(self):
        print("Speech synthesis completed. All results have been received: " + get_timestamp())

    def on_error(self, message: str):
        print(f"An error occurred during speech synthesis: {message}")

    def on_close(self):
        print("Connection closed: " + get_timestamp())
        self.file.close()

    def on_event(self, message):
        json_data = json.loads(message)
        # Guard against messages that do not carry sentence information.
        sentence = ((json_data.get('payload') or {}).get('output') or {}).get('sentence')
        if sentence:
            print(f'sentence: {sentence}')
            # Get the sentence number.
            # index = sentence['index']
            words = sentence.get('words')
            if words:
                for word in words:
                    print(f'word: {word}')
                    # Example value: word: {'text': 'To', 'begin_index': 0, 'end_index': 1, 'begin_time': 80, 'end_time': 200}

    def on_data(self, data: bytes) -> None:
        print(get_timestamp() + " Binary audio length: " + str(len(data)))
        self.file.write(data)


callback = Callback()

# Instantiate SpeechSynthesizer and pass request parameters
# such as model and voice in the constructor.
synthesizer = SpeechSynthesizer(
    model=model,
    voice=voice,
    callback=callback,
    additional_params={'word_timestamp_enabled': True}
)

# Send the text to be synthesized and get the binary audio in real time
# through the on_data method of the callback interface.
synthesizer.call("What is the weather like today?")
# A WebSocket connection must be established when sending text for the first time.
# Therefore, the first-packet latency includes the time to establish the connection.
print('[Metric] Request ID: {}, First-packet latency: {} ms'.format(
    synthesizer.get_last_request_id(),
    synthesizer.get_first_package_delay()))

seed

int

0

No

The random number seed used for generation, which changes the synthesis effect. The default value is 0. Valid values: 0 to 65535.
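
A sketch, assuming seed is passed to the constructor like the other request parameters in this table:

synthesizer = SpeechSynthesizer(model="cosyvoice-v2",
                                voice="longxiaochun_v2",
                                seed=1234)  # assumption: accepted directly by the constructor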

callback

ResultCallback

-

No

ResultCallback interface.

Key interfaces

SpeechSynthesizer class

The SpeechSynthesizer class is imported using "from dashscope.audio.tts_v2 import *" and provides the key interfaces for speech synthesis.

Method

Parameters

Return value

Description

def call(self, text: str, timeout_millis=None)
  • text: The text to be synthesized.

  • timeout_millis: The timeout period for blocking the thread, in milliseconds. If not set or set to 0, it does not take effect.

Returns binary audio data if ResultCallback is not specified. Otherwise, returns None.

Converts a whole text segment (either plain text or text with SSML) into speech.

When creating a SpeechSynthesizer instance, there are two cases:

  • ResultCallback is not specified: The call method blocks the current thread until speech synthesis is complete and returns the binary audio data. For usage, see Synchronous call.

  • ResultCallback is specified: The call method immediately returns None, and the speech synthesis result is returned through the on_data method of the ResultCallback interface. For usage, see Asynchronous invocation.

Important

You must re-initialize the SpeechSynthesizer instance before each call to the call method.

def streaming_call(self, text: str)

text: The text segment to be synthesized.

None

Streams the text to be synthesized (text with SSML is not supported).

Call this interface multiple times to send the text to be synthesized to the server in segments. The synthesis result is obtained through the on_data method of the ResultCallback interface.

For usage, see Streaming call.

def streaming_complete(self, complete_timeout_millis=600000)

complete_timeout_millis: The waiting time, in milliseconds.

None

Ends the streaming speech synthesis.

This method blocks the current thread until the task ends or the duration specified by complete_timeout_millis elapses. If complete_timeout_millis is set to 0, it waits indefinitely.

By default, if the waiting time exceeds 10 minutes, the waiting stops.

For usage, see Streaming call.

Important

When making a streaming call, make sure to call this method. Otherwise, parts of the synthesized speech may be missing.

def get_last_request_id(self)

None

The request_id of the previous task.

Gets the request_id of the previous task.

def get_first_package_delay(self)

None

First-packet latency

Gets the first-packet latency (usually around 500 ms).

First-packet latency is the time between when the text is sent and when the first audio packet is received, in milliseconds. Use this after the task is completed.

A WebSocket connection must be established when sending text for the first time. Therefore, the first-packet latency includes the time to establish the connection.

def get_response(self)

None

The last message.

Gets the last message (in JSON format), which can be used to get task-failed errors.
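
For example, to inspect the last message after a task fails:

# Retrieve the last service message (JSON format) to check task-failed errors.
print(synthesizer.get_response())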

ResultCallback interface

When you make an asynchronous invocation or a streaming call, the server returns key process information and data to the client through a callback. You need to implement the callback methods to handle the information or data that is returned by the server.

Import it using "from dashscope.audio.tts_v2 import *".

Example:

class Callback(ResultCallback):
    def on_open(self) -> None:
        print('Connection successful')

    def on_data(self, data: bytes) -> None:
        # Implement the logic to receive the synthesized binary audio result.
        pass

    def on_complete(self) -> None:
        print('Synthesis complete')

    def on_error(self, message) -> None:
        print('An exception occurred: ', message)

    def on_close(self) -> None:
        print('Connection closed')


callback = Callback()

Method

Parameters

Return value

Description

def on_open(self) -> None

None

None

This method is called immediately after a connection is established with the server.

def on_event(self, message: str) -> None

message: The information returned by the server.

None

This method is called when there is a response from the service. The message is a JSON string. Parse it to get information such as the Task ID (task_id parameter) and the number of billable characters in the current request (characters parameter).

def on_complete(self) -> None

None

None

This method is called after all synthesized data has been returned (speech synthesis is complete).

def on_error(self, message) -> None

message: The exception information.

None

This method is called when an exception occurs.

def on_data(self, data: bytes) -> None

data: The binary audio data returned by the server.

None

This method is called when the server returns synthesized audio.

Combine the binary audio data into a complete audio file for playback, or play the data in real time using a player that supports streaming playback.

Important
  • In streaming speech synthesis, for compressed formats such as MP3 and Opus, the audio is transmitted in segments and requires a streaming player. Do not play the audio frame by frame, because this can cause decoding to fail.

    Players that support streaming playback include FFmpeg, PyAudio (Python), AudioFormat (Java), and MediaSource (JavaScript).
  • When combining audio data into a complete audio file, write the data to the same file in append mode.

  • For WAV and MP3 audio formats in streaming speech synthesis, only the first frame contains header information. Subsequent frames contain audio-only data.
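
For example, a minimal sketch of the append-mode approach inside on_data (the file name is illustrative):

def on_data(self, data: bytes) -> None:
    # Append each chunk to the same file so the frames stay in order;
    # for WAV and MP3, only the first frame carries header information.
    with open("output.mp3", "ab") as f:
        f.write(data)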

def on_close(self) -> None

None

None

This method is called after the service has closed the connection.

Response

The server returns binary audio data.

Error codes

If an error occurs, see Error messages to troubleshoot the issue.

If the problem persists, you can join the developer group to provide feedback. Include the Request ID for further investigation.

More examples

For more examples, see GitHub.

Voice list

The default voices that are currently supported are listed in the table below. If you need a more personalized voice, customize an exclusive voice for free using the voice cloning feature. For more information, see Use a cloned voice for speech synthesis.

When you perform speech synthesis, the model parameter must match the selected voice. Otherwise, the call fails.

The text to be synthesized (text) must be in the same language as the selected voice. Otherwise, pronunciation errors or unnatural speech may occur.

cosyvoice-v2

Scenario

Voice

Characteristics


voice parameter

Language

SSML

Permission requirements

Telemarketing

Longyingxiao

Sweet-voiced saleswoman

longyingxiao

Chinese, English

✅ Available for direct use

Short video voiceover

Longjiqi

Cute robot

longjiqi

Chinese, English

✅ Available for direct use

Longhouge

Classic Monkey King

longhouge

Chinese, English

✅ Available for direct use

Longjixin

Sharp-tongued and scheming female

longjixin

Chinese, English

✅ Available for direct use

Longanyue

Lively Cantonese male

longanyue

Chinese, English

✅ Available for direct use

Longgangmei

TVB drama Mandarin female

longgangmei

Chinese, English

✅ Available for direct use

Longshange

Authentic Northern Shaanxi male

longshange

Chinese, English

✅ Available for direct use

Longanmin

Sweet Southern Min female

longanmin

Chinese, English

✅ Available for direct use

Longdaiyu

Delicate and talented female

longdaiyu

Chinese, English

✅ Available for direct use

Longgaoseng

The voice of an enlightened master

longgaoseng

Chinese, English

✅ Available for direct use

Voice assistant

Longanli

Crisp and composed female

longanli

Chinese, English

✅ Available for direct use

Longanlang

Fresh and crisp male

longanlang

Chinese, English

✅ Available for direct use

Longanwen

Elegant and intellectual female

longanwen

Chinese, English

✅ Available for direct use

Longanyun

Homely and warm male

longanyun

Chinese, English

✅ Available for direct use

YUMI

Formal young female

longyumi_v2

Chinese, English

✅ Available for direct use

Longxiaochun

Intellectual and positive female

longxiaochun_v2

Chinese, English

✅ Available for direct use

Longxiaoxia

Calm and authoritative female

longxiaoxia_v2

Chinese, English

✅ Available for direct use

Audiobook

Longyichen

Free-spirited and energetic male

longyichen

Chinese, English

✅ Available for direct use

Longwanjun

Delicate and gentle female

longwanjun

Chinese, English

✅ Available for direct use

Longlaobo

Weathered old man

longlaobo

Chinese, English

✅ Available for direct use

Longlaoyi

Worldly and composed aunt

longlaoyi

Chinese, English

✅ Available for direct use

Longbaizhi

Wise female narrator

longbaizhi

Chinese, English

✅ Available for direct use

Longsanshu

Calm and textured male

longsanshu

Chinese, English

✅ Available for direct use

Longxiu

Erudite male storyteller

longxiu_v2

Chinese, English

✅ Available for direct use

Longmiao

Cadenced female

longmiao_v2

Chinese, English

✅ Available for direct use

Longyue

Warm and magnetic female

longyue_v2

Chinese, English

✅ Available for direct use

Longnan

Wise young male

longnan_v2

Chinese, English

✅ Available for direct use

Longyuan

Warm and healing female

longyuan_v2

Chinese, English

✅ Available for direct use

Social companion

Longanqin

Approachable and lively female

longanqin

Chinese, English

✅ Available for direct use

Longanya

Elegant and classy female

longanya

Chinese, English

✅ Available for direct use

Longanshuo

Clean and fresh male

longanshuo

Chinese, English

✅ Available for direct use

Longanling

Agile-minded female

longanling

Chinese, English

✅ Available for direct use

Longanzhi

Wise and mature young male

longanzhi

Chinese, English

✅ Available for direct use

Longanrou

Gentle female best friend

longanrou

Chinese, English

✅ Available for direct use

Longqiang

Romantic and charming female

longqiang_v2

Chinese, English

✅ Available for direct use

Longhan

Warm and devoted male

longhan_v2

Chinese, English

✅ Available for direct use

Longxing

Gentle girl-next-door

longxing_v2

Chinese, English

✅ Available for direct use

Longhua

Energetic and sweet female

longhua_v2

Chinese, English

✅ Available for direct use

Longwan

Positive and intellectual female

longwan_v2

Chinese, English

✅ Available for direct use

Longcheng

Intelligent young male

longcheng_v2

Chinese, English

✅ Available for direct use

Longfeifei

Sweet and delicate female

longfeifei_v2

Chinese, English

✅ Available for direct use

Longxiaocheng

Magnetic low-pitched male

longxiaocheng_v2

Chinese, English

✅ Available for direct use

Longzhe

Awkward but warm-hearted male

longzhe_v2

Chinese, English

✅ Available for direct use

Longyan

Warm and gentle female

longyan_v2

Chinese, English

✅ Available for direct use

Longtian

Magnetic and rational male

longtian_v2

Chinese, English

✅ Available for direct use

Longze

Warm and energetic male

longze_v2

Chinese, English

✅ Available for direct use

Longshao

Positive and ambitious male

longshao_v2

Chinese, English

✅ Available for direct use

Longhao

Emotional and melancholic male

longhao_v2

Chinese, English

✅ Available for direct use

Longshen

Talented male singer

kabuleshen_v2

Chinese, English

✅ Available for direct use

Child's voice (benchmark voice)

Longhuhu

Innocent and lively young girl

longhuhu

Chinese, English

✅ Available for direct use

Consumer electronics - Education and training

Longanpei

Young female teacher

longanpei

Chinese, English

✅ Available for direct use

Consumer electronics - Child companion

Longwangwang

Taiwanese youth

longwangwang

Chinese, English

✅ Available for direct use

Longpaopao

Apsara bubble voice

longpaopao

Chinese, English

✅ Available for direct use

Consumer electronics - Children's audiobooks

Longshanshan

Dramatic child's voice

longshanshan

Chinese, English

✅ Available for direct use

Longniuniu

Sunny young boy's voice

longniuniu

Chinese, English

✅ Available for direct use

Customer service

Longyingmu

Elegant and intellectual female

longyingmu

Chinese, English

✅ Available for direct use

Longyingxun

Young and inexperienced male

longyingxun

Chinese, English

✅ Available for direct use

Longyingcui

Serious male for collections

longyingcui

Chinese, English

✅ Available for direct use

Longyingda

Cheerful high-pitched female

longyingda

Chinese, English

✅ Available for direct use

Longyingjing

Low-key and calm female

longyingjing

Chinese, English

✅ Available for direct use

Longyingyan

Righteous and stern female

longyingyan

Chinese, English

✅ Available for direct use

Longyingtian

Gentle and sweet female

longyingtian

Chinese, English

✅ Available for direct use

Longyingbing

Sharp and assertive female

longyingbing

Chinese, English

✅ Available for direct use

Longyingtao

Gentle and calm female

longyingtao

Chinese, English

✅ Available for direct use

Longyingling

Gentle and empathetic female

longyingling

Chinese, English

✅ Available for direct use

Livestreaming e-commerce

Longanran

Lively and textured female

longanran

Chinese, English

✅ Available for direct use

Longanxuan

Classic female livestreamer

longanxuan

Chinese, English

✅ Available for direct use

Longanchong

Passionate male salesperson

longanchong

Chinese, English

✅ Available for direct use

Longanping

High-pitched female livestreamer

longanping

Chinese, English

✅ Available for direct use

Child's voice

Longjielidou

Sunny and mischievous male

longjielidou_v2

Chinese, English

✅ Available for direct use

Longling

Childish and deadpan female

longling_v2

Chinese, English

✅ Available for direct use

Longke

Innocent and well-behaved female

longke_v2

Chinese, English

✅ Available for direct use

Longxian

Bold and cute female

longxian_v2

Chinese, English

✅ Available for direct use

Dialect

Longlaotie

Forthright Northeastern male

longlaotie_v2

Chinese (Northeastern), English

✅ Available for direct use

Longjiayi

Intellectual Cantonese female

longjiayi_v2

Chinese (Cantonese), English

✅ Available for direct use

Longtao

Positive Cantonese female

longtao_v2

Chinese (Cantonese), English

✅ Available for direct use

Poetry recitation

Longfei

Passionate and magnetic male

longfei_v2

Chinese, English

✅ Available for direct use

Libai

Ancient male poet

libai_v2

Chinese, English

✅ Available for direct use

Longjin

Elegant and gentle male

longjin_v2

Chinese, English

✅ Available for direct use

News broadcast

Longshu

Calm young male

longshu_v2

Chinese, English

✅ Available for direct use

Bella2.0

Precise and capable female

loongbella_v2

Chinese, English

✅ Available for direct use

Longshuo

Erudite and capable male

longshuo_v2

Chinese, English

✅ Available for direct use

Longxiaobai

Calm female announcer

longxiaobai_v2

Chinese, English

✅ Available for direct use

Longjing

Typical female announcer

longjing_v2

Chinese, English

✅ Available for direct use

loongstella

Confident and crisp female

loongstella_v2

Chinese, English

✅ Available for direct use

Overseas marketing

loongyuuna

Energetic Japanese female

loongyuuna_v2

Japanese

✅ Available for direct use

loongyuuma

Capable Japanese male

loongyuuma_v2

Japanese

✅ Available for direct use

loongjihun

Sunny Korean male

loongjihun_v2

Korean

✅ Available for direct use

loongeva

Intellectual British English female

loongeva_v2

British English

✅ Available for direct use

loongbrian

Calm British English male

loongbrian_v2

British English

✅ Available for direct use

loongluna

British English female

loongluna_v2

British English

✅ Available for direct use

loongluca

British English male

loongluca_v2

British English

✅ Available for direct use

loongemily

British English female

loongemily_v2

British English

✅ Available for direct use

loongeric

British English male

loongeric_v2

British English

✅ Available for direct use

loongabby

American English female

loongabby_v2

American English

✅ Available for direct use

loongannie

American English female

loongannie_v2

American English

✅ Available for direct use

loongandy

American English male

loongandy_v2

American English

✅ Available for direct use

loongava

American English female

loongava_v2

American English

✅ Available for direct use

loongbeth

American English female

loongbeth_v2

American English

✅ Available for direct use

loongbetty

American English female

loongbetty_v2

American English

✅ Available for direct use

loongcindy

American English female

loongcindy_v2

American English

✅ Available for direct use

loongcally

American English female

loongcally_v2

American English

✅ Available for direct use

loongdavid

American English male

loongdavid_v2

American English

✅ Available for direct use

loongdonna

American English female

loongdonna_v2

American English

✅ Available for direct use

loongkyong

Korean female

loongkyong_v2

Korean

✅ Available for direct use

loongtomoka

Japanese female

loongtomoka_v2

Japanese

✅ Available for direct use

loongtomoya

Japanese male

loongtomoya_v2

Japanese

✅ Available for direct use

FAQ

Features, billing, and rate limiting

Q: Where can I find information about the features, billing, and throttling of CosyVoice?

A: For more information, see CosyVoice.

Q: What can I do if the pronunciation is inaccurate?

A: You can use SSML to customize the speech synthesis results.

Q: The current requests per second (RPS) cannot meet my business requirements. What should I do? How can I increase the limit? Is there a fee?

A: You can submit an Alibaba Cloud ticket or join the developer group to request a scale-out. The scale-out is free of charge.

Q: How do I specify the language of the speech to be synthesized?

A: You cannot specify the language of the speech to be synthesized through request parameters. If you want to synthesize speech in a specific language, see the voice list and select a voice based on its language.

Q: Speech synthesis is billed based on the number of text characters. How do I check or get the text length for each synthesis task?

A: The method for retrieving the text length depends on whether logging is enabled:

  1. If logging is disabled

    • For synchronous calls, calculate the length based on the character counting rules.

    • For other call methods, retrieve the length from the on_event method's message parameter in the ResultCallback interface. The message is a JSON string. Parse the string to retrieve the number of billable characters for the request from the characters parameter. Use the value from the last message that you receive (see the sketch after this list).

  2. If logging is enabled

    The following log is printed to the console. The characters field shows the number of billable characters for the request. Use the value from the last log entry that is printed.

    2025-08-27 11:02:09,429 - dashscope - speech_synthesizer.py - on_message - 454 - DEBUG - <<<recv {"header":{"task_id":"62ebb7d6cb0a4080868f0edb######","event":"result-generated","attributes":{}},"payload":{"output":{"sentence":{"words":[]}},"usage":{"characters":15}}}
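
The following sketch shows how to read the characters value inside on_event (method 1 above); the structure matches the message shown in the log:

import json
from dashscope.audio.tts_v2 import *

class UsageCallback(ResultCallback):
    def on_event(self, message):
        data = json.loads(message)
        usage = (data.get("payload") or {}).get("usage") or {}
        if "characters" in usage:
            # Keep the value from the last message received.
            self.billable_characters = usage["characters"]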

How to enable logging:

Enable logging by setting an environment variable in the command line:

  • For Windows (PowerShell): $env:DASHSCOPE_LOGGING_LEVEL="debug"

  • For Linux/macOS: export DASHSCOPE_LOGGING_LEVEL=debug

Troubleshooting

If you encounter a code error, troubleshoot the issue based on the information in Error codes.

Q: How do I get a Request ID?

A: Get it in either of the following ways:

  • Call the get_last_request_id method of the SpeechSynthesizer class after submitting a task.

  • Enable logging (see above) and read the request information from the console output.

Q: Why does the SSML feature fail?

Perform the following troubleshooting operations:

  1. Ensure that the current voice supports the SSML feature. Personalized voices do not support SSML.

  2. Ensure that the model parameter is cosyvoice-v2.

  3. Install the latest version of the DashScope SDK.

  4. Ensure that you use the correct interface. Only the call method of the SpeechSynthesizer class supports SSML.

  5. Ensure that the text to be synthesized is in plain text format and meets the format requirements. For more information, see Introduction to the SSML markup language.

Q: Why can't the audio be played?

A: Troubleshoot the issue based on the following scenarios:

  1. The audio is saved as a complete file, such as xx.mp3

    1. Audio format consistency: Make sure that the audio format set in the request parameters matches the file extension. For example, if the audio format is set to WAV in the request parameters but the file extension is .mp3, playback may fail.

    2. Player compatibility: Confirm whether your player supports the format and sample rate of the audio file. For example, some players may not support high sample rates or specific audio encodings.

  2. The audio is played in a stream

    1. Save the audio stream as a complete file and try to play it with a player. If the file cannot be played, see the troubleshooting method for the first scenario.

    2. If the file can be played normally, the problem may be with the streaming playback implementation. Confirm whether your player supports streaming playback.

      Common tools and libraries that support streaming playback include FFmpeg, PyAudio (Python), AudioFormat (Java), and MediaSource (JavaScript).

Q: Why does the audio playback stutter?

A: Troubleshoot the issue based on the following scenarios:

  1. The audio is saved as a complete file, such as xx.mp3

    Join the developer group and provide the Request ID so that we can troubleshoot the issue for you.

  2. The audio is played in a stream

    1. Check the text sending speed: Make sure that the interval for sending text is reasonable. Avoid situations where the next sentence is not sent promptly after the previous audio segment has finished playing.

    2. Check the callback function performance:

      • Check whether there is too much business logic in the callback function, which may cause blocking.

      • The callback function runs in the WebSocket thread. If it is blocked, it may affect the WebSocket's ability to receive network packets, which can cause stuttering when receiving the audio stream.

      • Write the audio data to a separate audio buffer and then read and process it in other threads. This avoids blocking the WebSocket thread.

    3. Check network stability: Make sure that the network connection is stable to avoid audio transmission interruptions or delays due to network fluctuations.

    4. Further troubleshooting: If the preceding methods do not resolve the issue, join the developer group and provide the Request ID so that we can further investigate the issue for you.

Q: Why is speech synthesis slow (long synthesis time)?

A: Troubleshoot the issue as follows:

  1. Check the input interval: If you are using streaming speech synthesis, check whether the interval between sending text segments is too long. For example, a delay of several seconds between segments increases the total synthesis time.

  2. Analyze performance metrics: If the first-packet latency or the Real-Time Factor does not meet the following expectations, submit the Request ID to the technical team for assistance.

    • First-packet latency: Typically around 500 ms.

    • Real-Time Factor (RTF): Typically around 0.3. RTF = Total synthesis time / Audio duration. For example, if synthesizing 10 seconds of audio takes 3 seconds, the RTF is 0.3.

Q: How do I handle pronunciation errors in the synthesized speech?

  • If you are using the cosyvoice-v1 model, we recommend using cosyvoice-v2, which delivers better results and supports SSML.

  • If the current model is cosyvoice-v2, use the SSML <phoneme> tag to specify the correct pronunciation.

Q: Why is no audio returned, or why is the synthesized audio incomplete?

A: Check whether you called the streaming_complete method of the SpeechSynthesizer class. During speech synthesis, the server starts the process only after it caches enough text. If you do not call the streaming_complete method, the last part of the text in the cache might not be synthesized into audio.

Q: What do I do if SSL certificate verification fails?

  1. Install the system root certificates. The following commands apply to yum-based systems such as CentOS:

    sudo yum install -y ca-certificates
    sudo update-ca-trust enable
  2. Add the following lines to your code so that Python uses the installed certificate bundle.

    import os
    os.environ["SSL_CERT_FILE"] = "/etc/ssl/certs/ca-bundle.crt"

Q: Why do I get an "SSL: CERTIFICATE_VERIFY_FAILED" error on macOS? (websocket closed due to [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1000))

When you connect to a WebSocket, you might encounter an OpenSSL certificate authentication failure that reports that the certificate cannot be found. This issue usually occurs because the certificate configuration in your Python environment is incorrect. You can follow these steps to manually locate and fix the certificate issue:

  1. Export system certificates and set environment variables. You can run the following commands to export all certificates from your macOS system to a file. This sets the file as the default certification path for Python and its libraries:

    security find-certificate -a -p > ~/all_mac_certs.pem
    export SSL_CERT_FILE=~/all_mac_certs.pem
    export REQUESTS_CA_BUNDLE=~/all_mac_certs.pem
  2. Create a symbolic link to fix the Python OpenSSL configuration. If your Python OpenSSL configuration is missing a certificate, run the following command to create a symbolic link. Replace the path in the command with the actual installation folder for your Python version:

    # 3.9 is an example. Adjust the path to your installed Python version.
    ln -s /etc/ssl/* /Library/Frameworks/Python.framework/Versions/3.9/etc/openssl
  3. Restart the terminal and clear the cache. After you complete the steps, close and reopen your terminal to apply the environment variables. Then, clear any caches and retry the WebSocket connection.

These steps can resolve connection issues that are caused by incorrect certificate configuration. If the issue persists, check the certificate configuration on the target server.

Q: Why do I get the error "AttributeError: module 'websocket' has no attribute 'WebSocketApp'. Did you mean: 'WebSocket'?"

This error occurs because websocket-client is not installed or there is a version mismatch. You can run the following commands in order:

pip uninstall websocket-client
pip uninstall websocket
pip install websocket-client

Permissions and authentication

Q: I want my API key to be used only for the CosyVoice speech synthesis service and not for other Model Studio models (permission isolation). How can I do this?

A: You can limit the scope of an API key by creating a new workspace and authorizing only specific models. For more information, see Workspace Management.

More questions

For more information, see the GitHub QA.