
Alibaba Cloud Model Studio: CosyVoice speech synthesis Python SDK

Last Updated: Mar 18, 2026

This topic describes the parameters and interfaces of the CosyVoice speech synthesis Python SDK.

User guide: For model overviews and selection suggestions, see Real-time speech synthesis - CosyVoice.

Prerequisites

Models and pricing

See Real-time speech synthesis - CosyVoice.

Text and format limitations

Text length limits

Character counting rules

  • Chinese characters (simplified/traditional Chinese, Japanese Kanji, Korean Hanja) count as two characters. All other characters (punctuation, letters, numbers, Kana, Hangul) count as one.

  • SSML tags are not included when calculating the text length.

  • Examples:

    • "你好" → 2 (Chinese character) + 2 (Chinese character) = 4 characters

    • "中A文123" → 2 (Chinese character) + 1 (A) + 2 (Chinese character) + 1 (1) + 1 (2) + 1 (3) = 8 characters

    • "中文。" → 2 (Chinese character) + 2 (Chinese character) + 1 (.) = 5 characters

    • "中 文。" → 2 (Chinese character) + 1 (space) + 2 (Chinese character) + 1 (.) = 6 characters

    • "<speak>你好</speak>" → 2 (Chinese character) + 2 (Chinese character) = 4 characters
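The counting rules above can be sketched in Python. This is an illustrative approximation, not the service's exact implementation: it treats the CJK Unified Ideograph blocks as two-count characters and strips tags with a simple regex.

```python
import re

def billed_length(text: str) -> int:
    """Approximate the character counting rule: CJK ideographs
    (Chinese characters, Japanese Kanji, Korean Hanja) count as 2;
    everything else (punctuation, letters, digits, Kana, Hangul)
    counts as 1. SSML tags are excluded."""
    # Strip SSML/XML tags before counting
    plain = re.sub(r"<[^>]+>", "", text)
    total = 0
    for ch in plain:
        cp = ord(ch)
        # CJK Unified Ideographs, Extension A, and Compatibility Ideographs
        if 0x4E00 <= cp <= 0x9FFF or 0x3400 <= cp <= 0x4DBF or 0xF900 <= cp <= 0xFAFF:
            total += 2
        else:
            total += 1
    return total
```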

Encoding format

Use UTF-8 encoding.

Support for mathematical expressions

Mathematical expression parsing (v3.5-flash, v3.5-plus, v3-flash, v3-plus, and v2 only): supports primary and secondary school mathematics, including basic operations, algebra, and geometry.

Note

This feature only supports Chinese.

See Convert LaTeX formulas to speech (Chinese language only).

SSML support

SSML is available for custom voices (voice design or cloning) with v3.5-flash, v3.5-plus, v3-flash, v3-plus, and v2, and for system voices marked as supported in the voice list.

Getting started

The SpeechSynthesizer class provides core speech synthesis interfaces and supports the following invocation methods:

  • Non-streaming: A blocking call that sends the full text at once and returns the complete audio. Suitable for short text.

  • Unidirectional streaming: A non-blocking call that sends the full text at once and receives audio via callback. Suitable for short text with low latency.

  • Bidirectional streaming: A non-blocking call that sends text fragments incrementally and receives audio via callback in real time. Suitable for long text with low latency.

Non-streaming

Submit a single speech synthesis task and receive the complete audio result in one response (no streaming, no callbacks).


Instantiate the SpeechSynthesizer class, bind request parameters, and call the call method to synthesize and retrieve binary audio data.

The text length must not exceed 20,000 characters.

Important

Before each call to the call method, you must re-initialize the SpeechSynthesizer instance.

View full example

# coding=utf-8

import dashscope
from dashscope.audio.tts_v2 import *
import os

# API keys differ between the Singapore and Beijing regions. Get your API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
# If you have not configured environment variables, replace the next line with your Model Studio API key: dashscope.api_key = "sk-xxx"
dashscope.api_key = os.environ.get('DASHSCOPE_API_KEY')

# This URL is for the Singapore region. If you use the Beijing region model, replace it with: wss://dashscope.aliyuncs.com/api-ws/v1/inference
dashscope.base_websocket_api_url='wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference'

# Model
model = "cosyvoice-v3-flash"
# Voice
voice = "longanyang"

# Instantiate SpeechSynthesizer and pass request parameters such as model and voice in the constructor
synthesizer = SpeechSynthesizer(model=model, voice=voice)
# Send text for synthesis and get binary audio
audio = synthesizer.call("How is the weather today?")
# The first request establishes a WebSocket connection, so the first-package delay includes connection setup time
print('[Metric] requestId: {}, first-package delay: {} ms'.format(
    synthesizer.get_last_request_id(),
    synthesizer.get_first_package_delay()))

# Save audio to local file
with open('output.mp3', 'wb') as f:
    f.write(audio)

Unidirectional streaming

Submit a single speech synthesis task and stream results in real time through the ResultCallback interface.


Instantiate the SpeechSynthesizer class, bind request parameters and the callback interface (ResultCallback), and call the call method to synthesize and retrieve results in real time through the on_data method of the callback interface (ResultCallback).

The text length must not exceed 20,000 characters.

Important

Before each call to the call method, you must re-initialize the SpeechSynthesizer instance.

View full example

# coding=utf-8

import os
import dashscope
from dashscope.audio.tts_v2 import *

from datetime import datetime

def get_timestamp():
    now = datetime.now()
    formatted_timestamp = now.strftime("[%Y-%m-%d %H:%M:%S.%f]")
    return formatted_timestamp

# API keys differ between the Singapore and Beijing regions. Get your API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
# If you have not configured environment variables, replace the next line with your Model Studio API key: dashscope.api_key = "sk-xxx"
dashscope.api_key = os.environ.get('DASHSCOPE_API_KEY')

# This URL is for the Singapore region. If you use the Beijing region model, replace it with: wss://dashscope.aliyuncs.com/api-ws/v1/inference
dashscope.base_websocket_api_url='wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference'

# Model
model = "cosyvoice-v3-flash"
# Voice
voice = "longanyang"


# Define callback interface
class Callback(ResultCallback):
    _player = None
    _stream = None

    def on_open(self):
        self.file = open("output.mp3", "wb")
        print("Connection established: " + get_timestamp())

    def on_complete(self):
        print("Speech synthesis completed, all results received: " + get_timestamp())
        # Call get_first_package_delay only after on_complete triggers
        # The first request establishes a WebSocket connection, so the first-package delay includes connection setup time
        print('[Metric] requestId: {}, first-package delay: {} ms'.format(
            synthesizer.get_last_request_id(),
            synthesizer.get_first_package_delay()))

    def on_error(self, message: str):
        print(f"Speech synthesis error: {message}")

    def on_close(self):
        print("Connection closed: " + get_timestamp())
        self.file.close()

    def on_event(self, message):
        pass

    def on_data(self, data: bytes) -> None:
        print(get_timestamp() + " Binary audio length: " + str(len(data)))
        self.file.write(data)


callback = Callback()

# Instantiate SpeechSynthesizer and pass request parameters such as model and voice in the constructor
synthesizer = SpeechSynthesizer(
    model=model,
    voice=voice,
    callback=callback,
)

# Send text for synthesis and retrieve binary audio in real time through the on_data method of the callback interface
synthesizer.call("How is the weather today?")

Bidirectional streaming

Submit text in multiple parts within a single speech synthesis task and receive synthesis results in real time through callbacks.

Note
  • During streaming input, call streaming_call multiple times to submit text fragments in order. After receiving a fragment, the server automatically splits it into sentences:

    • Complete sentences are synthesized immediately.

    • Incomplete sentences are cached until complete, then synthesized.

    When you call streaming_complete, the server forces synthesis of all received but unprocessed fragments, including incomplete sentences.

  • The interval between text fragment submissions must not exceed 23 seconds, or the system throws a "request timeout after 23 seconds" error.

    If no more text remains to send, call streaming_complete promptly to end the task.

    The server enforces a fixed 23-second timeout. Clients cannot modify this setting.
  1. Instantiate the SpeechSynthesizer class

    Instantiate the SpeechSynthesizer class, and bind request parameters and the callback interface (ResultCallback).

  2. Streaming

    Call the streaming_call method of the SpeechSynthesizer class multiple times to submit text fragments for synthesis. Send text fragments to the server in parts.

    While sending text, the server returns synthesis results in real time to the client through the on_data method of the callback interface (ResultCallback).

    Each text fragment (the text parameter) sent via streaming_call must not exceed 20,000 characters, and the cumulative text length across all fragments must not exceed 200,000 characters.

  3. End processing

    Call the streaming_complete method of the SpeechSynthesizer class to end speech synthesis.

    This method blocks the current thread until the on_complete or on_error callback of the callback interface (ResultCallback) triggers.

    Always call this method. Otherwise, trailing text may fail to convert to speech.

View full example

# coding=utf-8
#
# PyAudio installation instructions:
# For macOS, run:
#   brew install portaudio
#   pip install pyaudio
# For Debian/Ubuntu, run:
#   sudo apt-get install python-pyaudio python3-pyaudio
#   or
#   pip install pyaudio
# For CentOS, run:
#   sudo yum install -y portaudio portaudio-devel && pip install pyaudio
# For Microsoft Windows, run:
#   python -m pip install pyaudio

import os
import time
import pyaudio
import dashscope
from dashscope.audio.tts_v2 import *

from datetime import datetime

def get_timestamp():
    now = datetime.now()
    formatted_timestamp = now.strftime("[%Y-%m-%d %H:%M:%S.%f]")
    return formatted_timestamp

# API keys differ between the Singapore and Beijing regions. Get your API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
# If you have not configured environment variables, replace the next line with your Model Studio API key: dashscope.api_key = "sk-xxx"
dashscope.api_key = os.environ.get('DASHSCOPE_API_KEY')

# This URL is for the Singapore region. If you use the Beijing region model, replace it with: wss://dashscope.aliyuncs.com/api-ws/v1/inference
dashscope.base_websocket_api_url='wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference'

# Model
model = "cosyvoice-v3-flash"
# Voice
voice = "longanyang"


# Define callback interface
class Callback(ResultCallback):
    _player = None
    _stream = None

    def on_open(self):
        print("Connection established: " + get_timestamp())
        self._player = pyaudio.PyAudio()
        self._stream = self._player.open(
            format=pyaudio.paInt16, channels=1, rate=22050, output=True
        )

    def on_complete(self):
        print("Speech synthesis completed, all results received: " + get_timestamp())

    def on_error(self, message: str):
        print(f"Speech synthesis error: {message}")

    def on_close(self):
        print("Connection closed: " + get_timestamp())
        # Stop player
        self._stream.stop_stream()
        self._stream.close()
        self._player.terminate()

    def on_event(self, message):
        pass

    def on_data(self, data: bytes) -> None:
        print(get_timestamp() + " Binary audio length: " + str(len(data)))
        self._stream.write(data)


callback = Callback()

test_text = [
    "Streaming text-to-speech SDK,",
    "converts input text",
    "into binary audio data.",
    "Compared with non-streaming speech synthesis,",
    "streaming synthesis offers better real-time performance.",
    "Users hear near-synchronous audio output while typing,",
    "greatly improving interaction experience",
    "and reducing wait time.",
    "Ideal for large language model (LLM) integration,",
    "where text is streamed for speech synthesis.",
]

# Instantiate SpeechSynthesizer and pass request parameters such as model and voice in the constructor
synthesizer = SpeechSynthesizer(
    model=model,
    voice=voice,
    format=AudioFormat.PCM_22050HZ_MONO_16BIT,
    callback=callback,
)


# Stream text for synthesis. Retrieve binary audio in real time through the on_data method of the callback interface
for text in test_text:
    synthesizer.streaming_call(text)
    time.sleep(0.1)
# End streaming speech synthesis
synthesizer.streaming_complete()

# The first request establishes a WebSocket connection, so the first-package delay includes connection setup time
print('[Metric] requestId: {}, first-package delay: {} ms'.format(
    synthesizer.get_last_request_id(),
    synthesizer.get_first_package_delay()))

Request parameters

Set request parameters using the constructor of the SpeechSynthesizer class.

Parameter

Type

Required

Description

model

str

Yes

Speech synthesis model. Each model version requires compatible voices:

  • cosyvoice-v3.5-flash/cosyvoice-v3.5-plus: No system voices are available. Only custom voices from voice design or voice cloning are supported.

  • cosyvoice-v3-flash/cosyvoice-v3-plus: Use voices such as longanyang.

  • cosyvoice-v2: Use voices such as longxiaochun_v2.

  • For a complete list of voices, see Voice list.

voice

str

Yes

The voice used for speech synthesis.

Supported voice types:

  • System voices: For more information, see Voice list.

  • Cloned voices: Customized using the Voice cloning feature. When using a cloned voice, make sure that the same account is used for voice cloning and speech synthesis.

    For cloned voices, model must match the voice creation model (target_model).

  • Designed voices: Customized using the Voice design feature. When using a designed voice, make sure that the same account is used for voice design and speech synthesis.

    For designed voices, model must match the voice creation model (target_model).

format

enum

No

Audio encoding format and sample rate.

If you do not specify format, the default is MP3 at a 22.05 kHz sample rate.

Note

The default sample rate represents the optimal rate for the selected voice. Output uses this rate by default, but downsampling and upsampling are supported.

Supported audio encoding formats and sample rates include the following:

  • All models support these formats and sample rates:

    • AudioFormat.WAV_8000HZ_MONO_16BIT: WAV format, 8 kHz sample rate

    • AudioFormat.WAV_16000HZ_MONO_16BIT: WAV format, 16 kHz sample rate

    • AudioFormat.WAV_22050HZ_MONO_16BIT: WAV format, 22.05 kHz sample rate

    • AudioFormat.WAV_24000HZ_MONO_16BIT: WAV format, 24 kHz sample rate

    • AudioFormat.WAV_44100HZ_MONO_16BIT: WAV format, 44.1 kHz sample rate

    • AudioFormat.WAV_48000HZ_MONO_16BIT: WAV format, 48 kHz sample rate

    • AudioFormat.MP3_8000HZ_MONO_128KBPS: MP3 format, 8 kHz sample rate

    • AudioFormat.MP3_16000HZ_MONO_128KBPS: MP3 format, 16 kHz sample rate

    • AudioFormat.MP3_22050HZ_MONO_256KBPS: MP3 format, 22.05 kHz sample rate

    • AudioFormat.MP3_24000HZ_MONO_256KBPS: MP3 format, 24 kHz sample rate

    • AudioFormat.MP3_44100HZ_MONO_256KBPS: MP3 format, 44.1 kHz sample rate

    • AudioFormat.MP3_48000HZ_MONO_256KBPS: MP3 format, 48 kHz sample rate

    • AudioFormat.PCM_8000HZ_MONO_16BIT: PCM format, 8 kHz sample rate

    • AudioFormat.PCM_16000HZ_MONO_16BIT: PCM format, 16 kHz sample rate

    • AudioFormat.PCM_22050HZ_MONO_16BIT: PCM format, 22.05 kHz sample rate

    • AudioFormat.PCM_24000HZ_MONO_16BIT: PCM format, 24 kHz sample rate

    • AudioFormat.PCM_44100HZ_MONO_16BIT: PCM format, 44.1 kHz sample rate

    • AudioFormat.PCM_48000HZ_MONO_16BIT: PCM format, 48 kHz sample rate

  • For OPUS format, adjust bitrate using the bit_rate parameter. Available only in DashScope SDK 1.24.0 and later.

    • AudioFormat.OGG_OPUS_8KHZ_MONO_32KBPS: OPUS format, 8 kHz sample rate, 32 kbps bitrate

    • AudioFormat.OGG_OPUS_16KHZ_MONO_16KBPS: OPUS format, 16 kHz sample rate, 16 kbps bitrate

    • AudioFormat.OGG_OPUS_16KHZ_MONO_32KBPS: OPUS format, 16 kHz sample rate, 32 kbps bitrate

    • AudioFormat.OGG_OPUS_16KHZ_MONO_64KBPS: OPUS format, 16 kHz sample rate, 64 kbps bitrate

    • AudioFormat.OGG_OPUS_24KHZ_MONO_16KBPS: OPUS format, 24 kHz sample rate, 16 kbps bitrate

    • AudioFormat.OGG_OPUS_24KHZ_MONO_32KBPS: OPUS format, 24 kHz sample rate, 32 kbps bitrate

    • AudioFormat.OGG_OPUS_24KHZ_MONO_64KBPS: OPUS format, 24 kHz sample rate, 64 kbps bitrate

    • AudioFormat.OGG_OPUS_48KHZ_MONO_16KBPS: OPUS format, 48 kHz sample rate, 16 kbps bitrate

    • AudioFormat.OGG_OPUS_48KHZ_MONO_32KBPS: OPUS format, 48 kHz sample rate, 32 kbps bitrate

    • AudioFormat.OGG_OPUS_48KHZ_MONO_64KBPS: OPUS format, 48 kHz sample rate, 64 kbps bitrate

volume

int

No

The volume.

Default: 50.

Valid range: [0, 100]. The scale is linear: 0 is silent, 50 is the default, and 100 is the maximum volume.

Important

This field differs across DashScope SDK versions:

  • SDK 1.20.10 and later: volume

  • SDK versions earlier than 1.20.10: volume

speech_rate

float

No

The speech rate.

Default value: 1.0.

Valid values: [0.5, 2.0]. A value of 1.0 is the standard speech rate. A value less than 1.0 slows down the speech, and a value greater than 1.0 speeds it up.

pitch_rate

float

No

Pitch multiplier. The relationship between this value and perceived pitch is neither linear nor logarithmic; test different values to find a suitable one.

Default value: 1.0.

Valid values: [0.5, 2.0]. A value of 1.0 is the natural pitch of the voice. A value greater than 1.0 raises the pitch, and a value less than 1.0 lowers it.

bit_rate

int

No

The audio bitrate in kbps. This parameter applies only to OPUS audio formats.

Default value: 32.

Valid values: [6, 510].

Note

Set bit_rate using the additional_params parameter:

synthesizer = SpeechSynthesizer(model="cosyvoice-v3-flash",
                                voice="longanyang",
                                format=AudioFormat.OGG_OPUS_16KHZ_MONO_16KBPS,
                                additional_params={"bit_rate": 32})

word_timestamp_enabled

bool

No

Enable word-level timestamps.

Default: False.

  • True

  • False

This feature supports only cloned voices of the cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 models, and system voices marked as supported in the voice list.

Timestamps are available only through the callback interface.

Note

Set word_timestamp_enabled using the additional_params parameter:

synthesizer = SpeechSynthesizer(model="cosyvoice-v3-flash",
                                voice="longyingjing_v3",
                                callback=callback, # Timestamps are available only through the callback interface
                                additional_params={'word_timestamp_enabled': True})

View full example code

# coding=utf-8

import dashscope
from dashscope.audio.tts_v2 import *
import json
from datetime import datetime


def get_timestamp():
    now = datetime.now()
    formatted_timestamp = now.strftime("[%Y-%m-%d %H:%M:%S.%f]")
    return formatted_timestamp


# If you have not configured your API key in an environment variable, replace your-api-key with your actual key
# dashscope.api_key = "your-api-key"

model = "cosyvoice-v3-flash"
# Voice
voice = "longyingjing_v3"


# Define callback interface
class Callback(ResultCallback):
    _player = None
    _stream = None

    def on_open(self):
        self.file = open("output.mp3", "wb")
        print("Connection established: " + get_timestamp())

    def on_complete(self):
        print("Speech synthesis completed, all results received: " + get_timestamp())

    def on_error(self, message: str):
        print(f"Speech synthesis error: {message}")

    def on_close(self):
        print("Connection closed: " + get_timestamp())
        self.file.close()

    def on_event(self, message):
        json_data = json.loads(message)
        if json_data['payload'] and json_data['payload']['output'] and json_data['payload']['output']['sentence']:
            sentence = json_data['payload']['output']['sentence']
            print(f'sentence: {sentence}')
            # Get sentence index
            # index = sentence['index']
            words = sentence['words']
            if words:
                for word in words:
                    print(f'word: {word}')
                    # Example: word: {'text': 'T', 'begin_index': 0, 'end_index': 1, 'begin_time': 80, 'end_time': 200}

    def on_data(self, data: bytes) -> None:
        print(get_timestamp() + " Binary audio length: " + str(len(data)))
        self.file.write(data)


callback = Callback()

# Instantiate SpeechSynthesizer and pass request parameters such as model and voice in the constructor
synthesizer = SpeechSynthesizer(
    model=model,
    voice=voice,
    callback=callback,
    additional_params={'word_timestamp_enabled': True}
)

# Send text for synthesis and retrieve binary audio in real time through the on_data method of the callback interface
synthesizer.call("How is the weather today?")
# The first request establishes a WebSocket connection, so the first-package delay includes connection setup time
print('[Metric] requestId: {}, first-package delay: {} ms'.format(
    synthesizer.get_last_request_id(),
    synthesizer.get_first_package_delay()))

seed

int

No

The random seed used during generation. Different seeds produce different synthesis results. If the model, text, voice, and other parameters are identical, using the same seed reproduces the same output.

Default value: 0.

Valid values: [0, 65535].

language_hints

list[str]

No

Specifies the target language for speech synthesis to improve the synthesis effect.

Use when pronunciation or synthesis quality is poor for numbers, abbreviations, symbols, or less common languages:

  • Numbers are not read aloud as expected. For example, "hello, this is 110" is read as "hello, this is one one zero" rather than "hello, this is yāo yāo líng".

  • The '@' symbol is mispronounced as 'ai te' instead of 'at'.

  • The synthesis quality for less common languages is poor and sounds unnatural.

Valid values:

  • zh: Chinese

  • en: English

  • fr: French

  • de: German

  • ja: Japanese

  • ko: Korean

  • ru: Russian

  • pt: Portuguese

  • th: Thai

  • id: Indonesian

  • vi: Vietnamese

Note: This parameter is an array, but the current version only processes the first element. Therefore, we recommend passing only one value.

Important

This parameter specifies the target language for speech synthesis. This setting is independent of the language of the sample audio used for voice cloning. To set the source language for a cloning task, see CosyVoice Voice Cloning/Design API.
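Following the convention stated under Request parameters, language_hints is set in the SpeechSynthesizer constructor. A minimal sketch that pins the output language to Chinese (assuming the constructor accepts language_hints directly; this is not shown elsewhere in this topic):

```python
from dashscope.audio.tts_v2 import SpeechSynthesizer

# language_hints is a list, but only the first element is used,
# so pass a single value
synthesizer = SpeechSynthesizer(
    model="cosyvoice-v3-flash",
    voice="longanyang",
    language_hints=["zh"],  # read numbers and symbols the Chinese way
)
audio = synthesizer.call("你好，这里是110。")
```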

instruction

str

No

Sets an instruction to control synthesis effects such as dialect, emotion, or speaking style. This feature is available only for cloned voices of the cosyvoice-v3.5-flash, cosyvoice-v3.5-plus, and cosyvoice-v3-flash models, and for system voices marked as supporting Instruct in the voice list.

Length limit: 100 characters.

A Chinese character (including simplified and traditional Chinese, Japanese Kanji, and Korean Hanja) is counted as two characters. All other characters, such as punctuation marks, letters, numbers, and Japanese/Korean Kana/Hangul, are counted as one character.

Usage requirements (vary by model):

  • v3.5-flash and v3.5-plus: Any natural language instruction to control emotion, speech rate, etc.

    Important

    cosyvoice-v3.5-flash and cosyvoice-v3.5-plus have no system voices. Only custom voices from voice design or voice cloning are supported.

    Instruction examples:

    Speak in a very excited and high-pitched tone, expressing the ecstasy and excitement of a great success.
    Please maintain a medium-slow speech rate, with an elegant and intellectual tone, giving a sense of calm and reassurance.
    The tone should be full of sorrow and nostalgia, with a slight nasal quality, as if narrating a heartbreaking story.
    Please try to speak in a breathy voice, with a very low volume, creating a sense of intimate and mysterious whispering.
    The tone should be very impatient and annoyed, with a faster speech rate and minimal pauses between sentences.
    Please imitate a kind and gentle elder, with a steady speech rate and a voice full of care and affection.
    The tone should be sarcastic and disdainful, with emphasis on keywords and a slightly rising intonation at the end of sentences.
    Please speak in an extremely fearful and trembling voice.
    The tone should be like a professional news anchor: calm, objective, and articulate, with a neutral emotion.
    The tone should be lively and playful, with a clear smile, making the voice sound energetic and sunny.
  • cosyvoice-v3-flash: The following requirements must be met:

    • Cloned voices: Use any natural language to control the speech synthesis effect.

      Instruction examples:

      Please speak in Cantonese. (Supported dialects: Cantonese, Northeastern, Gansu, Guizhou, Henan, Hubei, Jiangxi, Minnan, Ningxia, Shanxi, Shaanxi, Shandong, Shanghainese, Sichuan, Tianjin, Yunnan.)
      Please say a sentence as loudly as possible.
      Please say a sentence as slowly as possible.
      Please say a sentence as quickly as possible.
      Please say a sentence very softly.
      Can you speak a little slower?
      Can you speak very quickly?
      Can you speak very slowly?
      Can you speak a little faster?
      Please say a sentence very angrily.
      Please say a sentence very happily.
      Please say a sentence very fearfully.
      Please say a sentence very sadly.
      Please say a sentence very surprisedly.
      Please try to sound as firm as possible.
      Please try to sound as angry as possible.
      Please try an approachable tone.
      Please speak in a cold tone.
      Please speak in a majestic tone.
      I want to experience a natural tone.
      I want to see how you express a threat.
      I want to see how you express wisdom.
      I want to see how you express seduction.
      I want to hear you speak in a lively way.
      I want to hear you speak with passion.
      I want to hear you speak in a steady manner.
      I want to hear you speak with confidence.
      Can you talk to me with excitement?
      Can you show an arrogant emotion?
      Can you show an elegant emotion?
      Can you answer the question happily?
      Can you give a gentle emotional demonstration?
      Can you talk to me in a calm tone?
      Can you answer me in a deep way?
      Can you talk to me with a gruff attitude?
      Tell me the answer in a sinister voice.
      Tell me the answer in a resilient voice.
      Narrate in a natural and friendly chat style.
      Speak in the tone of a radio drama podcaster.
    • System voices: The instruction must use a fixed format and content. For more information, see the voice list.
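Like the other request parameters, instruction is passed in the SpeechSynthesizer constructor. A sketch using one of the example instructions above; the voice name is a placeholder for a cloned cosyvoice-v3-flash voice:

```python
from dashscope.audio.tts_v2 import SpeechSynthesizer

synthesizer = SpeechSynthesizer(
    model="cosyvoice-v3-flash",
    voice="your_cloned_voice",  # placeholder: replace with your cloned voice
    instruction="Please speak in Cantonese.",
)
audio = synthesizer.call("今天天气怎么样？")
```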

enable_aigc_tag

bool

No

Add an invisible AIGC identifier to generated audio. When set to True, the identifier is embedded in supported audio formats (WAV, MP3, OPUS).

Default: False.

Only cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 support this feature.

Note

Set enable_aigc_tag using the additional_params parameter:

synthesizer = SpeechSynthesizer(model="cosyvoice-v3-flash",
                                voice="longanyang",
                                format=AudioFormat.OGG_OPUS_16KHZ_MONO_16KBPS,
                                additional_params={"enable_aigc_tag": True})

aigc_propagator

str

No

Set the ContentPropagator field in the AIGC invisible identifier. Only effective when enable_aigc_tag is True.

Default: Alibaba Cloud UID.

Only cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 support this feature.

Note

Set aigc_propagator using the additional_params parameter:

synthesizer = SpeechSynthesizer(model="cosyvoice-v3-flash",
                                voice="longanyang",
                                format=AudioFormat.OGG_OPUS_16KHZ_MONO_16KBPS,
                                additional_params={"enable_aigc_tag": True, "aigc_propagator": "xxxx"})

aigc_propagate_id

str

No

Set the PropagateID field in the AIGC invisible identifier. Only effective when enable_aigc_tag is True.

Default: Request ID of this speech synthesis request.

Only cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 support this feature.

Note

Set aigc_propagate_id using the additional_params parameter:

synthesizer = SpeechSynthesizer(model="cosyvoice-v3-flash",
                                voice="longanyang",
                                format=AudioFormat.OGG_OPUS_16KHZ_MONO_16KBPS,
                                additional_params={"enable_aigc_tag": True, "aigc_propagate_id": "xxxx"})

hot_fix

dict

No

Configuration for text hotpatching. Allows you to customize the pronunciation of specific words or replace text before synthesis. This feature is available only for cloned voices of cosyvoice-v3-flash.

Parameters:

  • pronunciation: Customize pronunciation. Provide pinyin for Chinese words to correct inaccurate default pronunciation.

  • replace: Text replacement. Replace specified words with target text before synthesis. The replaced text becomes the actual synthesis input.

Example:

synthesizer = SpeechSynthesizer(
    model="cosyvoice-v3-flash",
    voice="your_voice", # Replace with a cosyvoice-v3-flash cloned voice
    hot_fix={
        # Read the Chinese word 天气 (weather) with the specified pinyin
        "pronunciation": [{"天气": "tian1 qi4"}],
        # Replace text before synthesis; this mapping is illustrative
        "replace": [{"e.g.": "for example"}]
    }
)

enable_markdown_filter

bool

No

Specifies whether to enable Markdown filtering. When enabled, the system automatically removes Markdown symbols from the input text before synthesizing speech, preventing them from being read aloud. This feature is available only for cloned voices of cosyvoice-v3-flash.

Default: False.

Values:

  • True

  • False

Note

Set enable_markdown_filter using the additional_params parameter:

synthesizer = SpeechSynthesizer(
    model="cosyvoice-v3-flash",
    voice="your_voice", # Replace with a cosyvoice-v3-flash cloned voice
    additional_params={"enable_markdown_filter": True}
)

callback

ResultCallback

No

Callback interface (ResultCallback).

Key interfaces

SpeechSynthesizer class

Import the SpeechSynthesizer class using from dashscope.audio.tts_v2 import *. It provides core speech synthesis interfaces.

Method

Parameters

Return value

Description

def call(self, text: str, timeout_millis=None)
  • text: Text to synthesize

  • timeout_millis: Timeout in milliseconds for blocking the thread. Not effective if unset or set to 0.

Returns binary audio data if no ResultCallback is specified; otherwise returns None.

Convert the entire text (whether plain text or SSML) to speech.

The return behavior depends on whether a callback was bound when creating the SpeechSynthesizer instance: without a ResultCallback, the method blocks and returns the complete audio; with a ResultCallback, it returns None and results arrive through the callback.

Important

Before each call to the call method, you must re-initialize the SpeechSynthesizer instance.

def streaming_call(self, text: str)

text: Text fragment to synthesize

None

Stream text fragments for synthesis (SSML is not supported).

Call this method multiple times to send text fragments to the server. Retrieve synthesis results through the on_data method of the callback interface (ResultCallback).

See Bidirectional streaming.

def streaming_complete(self, complete_timeout_millis=600000)

complete_timeout_millis: Wait time in milliseconds

None

End streaming speech synthesis.

This method blocks the current thread for up to complete_timeout_millis milliseconds until the task ends. If complete_timeout_millis is set to 0, it waits indefinitely.

By default, waiting stops after 10 minutes.

See Bidirectional streaming.

Important

In bidirectional streaming calls, always call this method. Otherwise, synthesized speech may be missing.

def get_last_request_id(self)

None

Request ID of the previous task

Get the request ID of the previous task.

def get_first_package_delay(self)

None

First-package delay

Returns first-packet latency in milliseconds (time from sending text to receiving first audio). Call after task completes.

Factors affecting first-packet latency:

  • Time to establish the WebSocket connection (on the first call)

  • Voice loading time (varies by voice)

  • Service load (queuing may occur during peak hours)

  • Network latency

Typical latency:

  • Reusing a connection with the voice already loaded: about 500 ms

  • First connection or switching voices: may reach 1,500 to 2,000 ms

If latency consistently exceeds 2,000 ms:

  1. Use the connection pool feature for high-concurrency scenarios to establish connections in advance.

  2. Check the quality of your network connection.

  3. Avoid making calls during peak hours.

def get_response(self)

None

Last message

Get the last message (JSON-formatted data), useful for detecting task-failed errors.
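Because the last message is JSON, a task-failed check reduces to inspecting the event field. A minimal sketch, assuming the envelope shape ({"header": {"event": ...}, "payload": ...}) seen in the debug log sample in the FAQ below:

```python
import json

def is_task_failed(message) -> bool:
    """Return True if a server message reports a task-failed event.

    The envelope shape is an assumption based on the debug log sample
    in the FAQ (a "header" object carrying an "event" field).
    """
    if isinstance(message, (str, bytes)):
        message = json.loads(message)
    return (message.get("header") or {}).get("event") == "task-failed"

failed = is_task_failed('{"header": {"event": "task-failed"}}')        # True
ok = not is_task_failed('{"header": {"event": "result-generated"}}')   # True
```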

Callback interface (ResultCallback)

During unidirectional streaming calls or bidirectional streaming calls, the server returns key process information and data to the client via callbacks. Implement callback methods to handle server responses.

Import using from dashscope.audio.tts_v2 import *.

View example

class Callback(ResultCallback):
    def on_open(self) -> None:
        print('Connection successful')
    
    def on_data(self, data: bytes) -> None:
        # Implement logic to handle binary audio data here,
        # e.g. write it to a file or feed it to a streaming player
        pass

    def on_complete(self) -> None:
        print('Synthesis complete')

    def on_error(self, message) -> None:
        print('Error: ', message)

    def on_close(self) -> None:
        print('Connection closed')


callback = Callback()

Method

Parameters

Return value

Description

def on_open(self) -> None

None

None

Called immediately after the client connects to the server.

def on_event(self, message: str) -> None

message: Server message

None

Called when the server sends a message. message is a JSON string. Parse it to get the task ID (task_id) and billed character count (characters) for this request.

def on_complete(self) -> None

None

None

Called after all synthesis data is returned (speech synthesis complete).

def on_error(self, message) -> None

message: Error message

None

Called when an error occurs.

def on_data(self, data: bytes) -> None

data: Binary audio data from the server

None

Called when audio data arrives.

Combine segments into a complete file or stream to a compatible player.

Important
  • In streaming speech synthesis, for compressed formats such as MP3 and Opus, the segmented audio data must be played using a streaming player. Do not play it frame by frame, as this causes decoding to fail.

    Streaming players include FFmpeg, PyAudio (Python), AudioFormat (Java), and MediaSource (JavaScript).
  • When combining audio data into a complete audio file, write to the same file in append mode.

  • For WAV and MP3 audio from streaming speech synthesis, only the first frame contains header information. Subsequent frames contain only audio data.

def on_close(self) -> None

None

None

Called after the server closes the connection.
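The append-mode advice in the note above can be sketched as a minimal on_data-style handler; the file name and chunk contents are illustrative:

```python
class FileWriter:
    """Collects streamed audio chunks into a single file, as an on_data handler would."""

    def __init__(self, path: str):
        self.path = path
        open(self.path, "wb").close()  # truncate once at the start of the task

    def on_data(self, data: bytes) -> None:
        # Append mode ("ab"): each chunk is written after the previous one.
        with open(self.path, "ab") as f:
            f.write(data)

writer = FileWriter("output.mp3")
for chunk in (b"\x01\x02", b"\x03", b"\x04\x05"):
    writer.on_data(chunk)
```

Plain concatenation is sufficient because, as noted above, only the first frame of streamed WAV/MP3 audio carries header information; subsequent frames are audio data only.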

Response

The server returns binary audio data.

Error codes

If an error occurs, see Error messages for troubleshooting.

More examples

For more examples, see GitHub.

FAQ

Features, billing, and rate limiting

Q: What can I do if the pronunciation is inaccurate?

Use SSML to fix pronunciation.

Q: Speech synthesis is billed by character count. How do I view or get the character count for each synthesis?

How you retrieve the count depends on whether logging is enabled:

  1. Logging disabled

    • Non-streaming: Calculate manually using the character counting rules.

    • Other call types: Retrieve the count from the message parameter of the on_event method in the callback interface (ResultCallback). message is a JSON string; parse it to get the billed character count (characters). Use the last message received.

  2. Logging enabled

    The console prints logs like this. characters is the billed character count for this request. Use the last log printed.

    2025-08-27 11:02:09,429 - dashscope - speech_synthesizer.py - on_message - 454 - DEBUG - <<<recv {"header":{"task_id":"62ebb7d6cb0a4080868f0edb######","event":"result-generated","attributes":{}},"payload":{"output":{"sentence":{"words":[]}},"usage":{"characters":15}}}

View how to enable logging

Enable logging by setting an environment variable in the command line:

  • Windows (PowerShell): $env:DASHSCOPE_LOGGING_LEVEL="debug"

  • Linux/macOS: export DASHSCOPE_LOGGING_LEVEL=debug
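Based on the log sample above, the billed count can be pulled out of a server message with standard JSON parsing. A sketch (the field layout is assumed from that sample):

```python
import json

def billed_characters(message: str):
    """Extract payload.usage.characters from a server message, or None if absent."""
    payload = json.loads(message).get("payload") or {}
    usage = payload.get("usage") or {}
    return usage.get("characters")

sample = ('{"header":{"task_id":"xxx","event":"result-generated","attributes":{}},'
          '"payload":{"output":{"sentence":{"words":[]}},"usage":{"characters":15}}}')
count = billed_characters(sample)  # 15
```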

Troubleshooting

If your code throws errors, troubleshoot using the information in Error codes.

Q: How do I get the request ID?

Retrieve it in two ways:

  • Call the get_last_request_id method of the SpeechSynthesizer class after a task has been submitted.

  • Parse the task_id field from the JSON message passed to the on_event callback (or returned by get_response).

Q: Why does SSML fail?

Troubleshoot step by step:

  1. Verify that your model and voice meet the SSML limitations and constraints (see SSML support).

  2. Install the latest DashScope SDK.

  3. Ensure you use the correct interface: Only the call method of the SpeechSynthesizer class supports SSML.

  4. Ensure the text to synthesize is plain text and meets formatting requirements. See SSML overview.

Q: Why does the audio duration of TTS speech synthesis differ from the WAV file's displayed duration? For example, a WAV file shows 7 seconds but the actual audio is less than 5 seconds?

TTS uses a streaming synthesis mechanism, which means it synthesizes and returns data progressively. As a result, the WAV file header contains an estimated value, which may have some margin of error. If you require precise duration, you can set the format to PCM and manually add the WAV header information after obtaining the complete synthesis result. This will give you a more accurate duration.
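The PCM workaround described above can be done with the standard-library wave module: collect the complete PCM stream, then write a WAV header whose length fields match the actual data. The sample rate, channel count, and sample width must match what you requested from the service; the values below are illustrative:

```python
import io
import wave

def pcm_to_wav(pcm: bytes, sample_rate: int = 22050,
               channels: int = 1, sample_width: int = 2) -> bytes:
    """Wrap raw 16-bit PCM in a WAV container with an exact header."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(channels)
        w.setsampwidth(sample_width)   # 2 bytes = 16-bit samples
        w.setframerate(sample_rate)
        w.writeframes(pcm)             # header length fields computed from real data
    return buf.getvalue()

# One second of silence at 22.05 kHz mono, 16-bit:
wav_bytes = pcm_to_wav(b"\x00\x00" * 22050)
```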

Q: Why can't the audio be played?

Check the following scenarios one by one:

  1. The audio is saved as a complete file (such as xx.mp3).

    1. Format consistency: Verify request format matches file extension (e.g., WAV with .wav, not .mp3).

    2. Player compatibility: Verify that your player supports the format and sample rate of the audio file. Some players may not support high sample rates or specific audio encodings.

  2. The audio is played in a stream.

    1. Save the audio stream as a complete file and try to play it with a player. If the file cannot be played, see the troubleshooting method for scenario 1.

    2. If the file plays normally, the problem may be with your streaming playback implementation. Verify that your player supports streaming playback.

      Common tools and libraries that support streaming playback include FFmpeg, PyAudio (Python), AudioFormat (Java), and MediaSource (JavaScript).

Q: Why does the audio playback stutter?

Check the following scenarios one by one:

  1. Check the text sending speed: Keep the interval between text segments reasonable, and send the next segment before the previously synthesized audio finishes playing; otherwise playback pauses while waiting for new audio.

  2. Check the callback function performance:

    • Avoid heavy business logic in the callback function—it can cause blocking.

    • Callbacks run in the WebSocket thread. Blocking prevents timely packet reception and causes audio playback to stutter.

    • We recommend writing audio data to a separate buffer and processing it in another thread to avoid blocking the WebSocket thread.

  3. Check network stability: Ensure your network connection is stable to avoid audio transmission interruptions or delays caused by network fluctuations.
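The buffer-and-worker pattern recommended in step 2 can be sketched with the standard library (names and chunk contents are illustrative): the on_data-style callback only enqueues, and a separate thread does the heavy work.

```python
import queue
import threading

audio_queue = queue.Queue()
processed = []

def on_data(data: bytes) -> None:
    # Runs in the WebSocket thread: only enqueue, never block here.
    audio_queue.put(data)

def consumer() -> None:
    # Runs in its own thread: decoding, playback, or file I/O goes here.
    while True:
        chunk = audio_queue.get()
        if chunk is None:           # sentinel: synthesis finished
            break
        processed.append(chunk)     # stand-in for real processing

worker = threading.Thread(target=consumer, daemon=True)
worker.start()
for piece in (b"\x01", b"\x02", b"\x03"):
    on_data(piece)
audio_queue.put(None)               # signal completion, e.g. from on_complete
worker.join()
```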

Q: Why does speech synthesis take a long time?

Follow these steps to troubleshoot:

  1. Check input interval

    Check the input interval. If you are using streaming speech synthesis, verify whether the interval between sending text segments is too long (for example, a delay of several seconds). A long interval increases the total synthesis time.

  2. Analyze performance metrics.

    • First-packet latency: Normally around 500 ms.

    • RTF (RTF = Total synthesis time / Audio duration): Normally less than 1.0.
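The two metrics above are straightforward to compute yourself; for raw PCM, audio duration follows directly from byte count, sample rate, channel count, and sample width. A sketch with illustrative values:

```python
def pcm_duration_seconds(num_bytes: int, sample_rate: int = 22050,
                         channels: int = 1, sample_width: int = 2) -> float:
    """Duration of raw PCM audio: bytes / (rate * channels * bytes per sample)."""
    return num_bytes / (sample_rate * channels * sample_width)

def rtf(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = total synthesis time / audio duration; below 1.0 is faster than real time."""
    return synthesis_seconds / audio_seconds

# 44100 bytes of 16-bit mono PCM at 22.05 kHz is exactly 1 second of audio:
duration = pcm_duration_seconds(44100)   # 1.0
speed = rtf(0.5, duration)               # 0.5
```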

Q: How do I handle incorrect pronunciation in the synthesized speech?

Use the <phoneme> tag of SSML to specify the correct pronunciation.

Q: Why is no speech returned? Why is part of the text at the end not converted to speech? (Missing speech)

Confirm you called the streaming_complete method of the SpeechSynthesizer class. During synthesis, the server waits until it has enough cached text before starting synthesis. If you omit streaming_complete, trailing text in the cache may never synthesize.

Q: How do I fix SSL certificate verification failure?

  1. Install system root certificates

    sudo yum install -y ca-certificates
    sudo update-ca-trust enable
  2. Add this to your code

    import os
    os.environ["SSL_CERT_FILE"] = "/etc/ssl/certs/ca-bundle.crt"

Q: Why do I get “SSL: CERTIFICATE_VERIFY_FAILED” on macOS? (websocket closed due to [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1000))

OpenSSL certificate verification may fail during WebSocket connection due to incorrect Python certificate configuration. Fix it manually:

  1. Export system certificates and set environment variables: Run this command to export all macOS certificates to a file and set it as the default certificate path for Python and related libraries:

    security find-certificate -a -p > ~/all_mac_certs.pem
    export SSL_CERT_FILE=~/all_mac_certs.pem
    export REQUESTS_CA_BUNDLE=~/all_mac_certs.pem
  2. Create a symbolic link to fix Python’s OpenSSL configuration: If Python’s OpenSSL config lacks certificates, create a symbolic link. Replace the path with your local Python version:

    # 3.9 is an example version number. Adjust to your installed Python version.
    ln -s /etc/ssl/* /Library/Frameworks/Python.framework/Versions/3.9/etc/openssl
  3. Restart your terminal and clear caches: Close and reopen your terminal to apply environment variables. Clear caches and retry the WebSocket connection.

These steps resolve connection issues caused by certificate misconfiguration. If problems persist, check the server’s certificate configuration.

Q: Why do I get “AttributeError: module 'websocket' has no attribute 'WebSocketApp'. Did you mean: 'WebSocket'?”

This happens when websocket-client is not installed or the version is incompatible. Run these commands:

pip uninstall websocket-client
pip uninstall websocket
pip install websocket-client

Permissions and authentication

Q: How can I restrict my API key to the CosyVoice speech synthesis service only (permission isolation)?

Create a workspace and grant authorization only to specific models to limit the API key scope. For more information, see Manage workspaces.

More questions

See the QA on GitHub.