The parameters and key interfaces of the CosyVoice speech synthesis Java SDK.
User guide: For model overviews and selection suggestions, see Real-time speech synthesis - CosyVoice.
Prerequisites
-
You have activated the Model Studio and created an API key. Export it as an environment variable (not hard-coded) to prevent security risks.
Note: For temporary access or strict control over high-risk operations (such as accessing or deleting sensitive data), use a temporary authentication token instead.
Compared with long-term API keys, temporary tokens are more secure (60-second lifespan) and reduce API key leakage risk.
To use a temporary token, replace the API key used for authentication in your code with the temporary authentication token.
Models and pricing
Text and format limitations
Text length limits
-
Non-streaming call, unidirectional streaming call, or Flowable unidirectional streaming call: The maximum is 20,000 characters per request.
-
Bidirectional streaming call or Flowable bidirectional streaming call: The maximum is 20,000 characters per request, with a total limit of 200,000 characters across all requests.
Character counting rules
-
Chinese characters (simplified/traditional Chinese, Japanese Kanji, Korean Hanja) count as two characters. All other characters (punctuation, letters, numbers, Kana, Hangul) count as one.
-
SSML tags are not included when calculating the text length.
-
Examples:
-
"你好" → 2 (Chinese character) + 2 (Chinese character) = 4 characters
-
"中A文123" → 2 (Chinese character) + 1 (A) + 2 (Chinese character) + 1 (1) + 1 (2) + 1 (3) = 8 characters
-
"中文。" → 2 (Chinese character) + 2 (Chinese character) + 1 (。) = 5 characters
-
"中 文。" → 2 (Chinese character) + 1 (space) + 2 (Chinese character) + 1 (。) = 6 characters
-
"<speak>你好</speak>" → 2 (Chinese character) + 2 (Chinese character) = 4 characters (the SSML tags are not counted)
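The counting rules above can be sketched in Java. The Unicode ranges used below to detect "Chinese characters" (the CJK ideograph blocks) are our approximation for illustration, not the service's exact definition:

```java
import java.util.regex.Pattern;

public class CharCounter {
    private static final Pattern SSML_TAG = Pattern.compile("<[^>]+>");

    /** Counts billable characters: CJK ideographs count as 2, everything else as 1; SSML tags are ignored. */
    public static int countBillableChars(String text) {
        String stripped = SSML_TAG.matcher(text).replaceAll("");
        int count = 0;
        for (int i = 0; i < stripped.length(); ) {
            int cp = stripped.codePointAt(i);
            // CJK Unified Ideographs, Extension A, Compatibility Ideographs, and the
            // supplementary ideograph planes count as two characters each.
            boolean isIdeograph = (cp >= 0x4E00 && cp <= 0x9FFF)
                    || (cp >= 0x3400 && cp <= 0x4DBF)
                    || (cp >= 0xF900 && cp <= 0xFAFF)
                    || (cp >= 0x20000 && cp <= 0x2FA1F);
            count += isIdeograph ? 2 : 1;
            i += Character.charCount(cp); // advance by one code point, not one char
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countBillableChars("你好"));                // 4
        System.out.println(countBillableChars("中A文123"));            // 8
        System.out.println(countBillableChars("<speak>你好</speak>")); // 4
    }
}
```

This mirrors the worked examples: punctuation such as 。 falls outside the ideograph ranges and counts as one character.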
Encoding format
Use UTF-8 encoding.
Support for mathematical expressions
Mathematical expression parsing (v3.5-flash, v3.5-plus, v3-flash, v3-plus, v2 only): Supports primary and secondary school math—basic operations, algebra, geometry.
This feature only supports Chinese.
See Convert LaTeX formulas to speech (Chinese language only).
SSML support
SSML is available for custom voices (voice design or cloning) with v3.5-flash, v3.5-plus, v3-flash, v3-plus, and v2, and for system voices marked as supported in the voice list. Requirements:
-
DashScope SDK 2.20.3 or later.
-
Only non-streaming calls and unidirectional streaming calls (that is, the call method of the SpeechSynthesizer class) are supported. Bidirectional streaming calls (that is, the streamingCall method of the SpeechSynthesizer class) and Flowable calls are not supported.
-
The usage is the same as for normal speech synthesis: pass the text containing SSML to the call method of the SpeechSynthesizer class.
Getting started
The SpeechSynthesizer class provides key interfaces for speech synthesis and supports the following call methods:
-
Non-streaming: A blocking call that sends the full text at once and returns the complete audio. Suitable for short text.
-
Unidirectional streaming: A non-blocking call that sends the full text at once and receives audio via callback. Suitable for short text with low latency.
-
Bidirectional streaming: A non-blocking call that sends text fragments incrementally and receives audio via callback in real time. Suitable for long text with low latency.
Non-streaming call
Submits a synthesis task synchronously and returns the complete result.
Instantiate the SpeechSynthesizer class, bind the request parameters, and call the call method to synthesize and get the binary audio data.
The length of the sent text cannot exceed 20,000 characters. For more information, see the call method of the SpeechSynthesizer class.
Before each call to the call method, you must re-initialize the SpeechSynthesizer instance.
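The flow above might look like the following sketch. The model and voice names are placeholders to replace with values from the voice list, and the builder and method names follow the DashScope Java SDK conventions described in this document:

```java
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisParam;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;

import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;

public class NonStreamingDemo {
    public static void main(String[] args) throws IOException {
        SpeechSynthesisParam param = SpeechSynthesisParam.builder()
                .apiKey(System.getenv("DASHSCOPE_API_KEY")) // read the key from an environment variable
                .model("cosyvoice-v2")       // placeholder model; pick one compatible with your voice
                .voice("longxiaochun_v2")    // placeholder voice; replace with a voice from the voice list
                .build();

        // No ResultCallback is bound, so call blocks and returns the complete audio.
        SpeechSynthesizer synthesizer = new SpeechSynthesizer(param, null);
        ByteBuffer audio = synthesizer.call("你好，欢迎使用语音合成服务。");

        try (FileOutputStream fos = new FileOutputStream("output.mp3")) {
            byte[] bytes = new byte[audio.remaining()];
            audio.get(bytes);
            fos.write(bytes);
        }
    }
}
```

Remember that a SpeechSynthesizer instance must be re-created before each subsequent call.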
Unidirectional streaming call
Submits a synthesis task asynchronously and receives audio incrementally via ResultCallback.
Instantiate the SpeechSynthesizer class, bind the request parameters and the ResultCallback interface, and call the call method to synthesize. Get the synthesis result in real time through the onEvent method of the ResultCallback interface.
The length of the sent text cannot exceed 20,000 characters. For more information, see the call method of the SpeechSynthesizer class.
Before each call to the call method, you must re-initialize the SpeechSynthesizer instance.
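A hedged sketch of this pattern follows. The model and voice names are placeholders, and a CountDownLatch is used here only to keep the demo process alive until onComplete fires:

```java
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisParam;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisResult;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;
import com.alibaba.dashscope.common.ResultCallback;

import java.io.FileOutputStream;
import java.io.IOException;
import java.util.concurrent.CountDownLatch;

public class UnidirectionalStreamingDemo {
    public static void main(String[] args) throws IOException, InterruptedException {
        CountDownLatch done = new CountDownLatch(1);
        FileOutputStream fos = new FileOutputStream("output.mp3");

        ResultCallback<SpeechSynthesisResult> callback = new ResultCallback<SpeechSynthesisResult>() {
            @Override
            public void onEvent(SpeechSynthesisResult result) {
                // A message may carry no new audio; check before writing.
                if (result.getAudioFrame() != null) {
                    byte[] bytes = new byte[result.getAudioFrame().remaining()];
                    result.getAudioFrame().get(bytes);
                    try {
                        fos.write(bytes); // append frames in arrival order
                    } catch (IOException e) {
                        throw new RuntimeException(e);
                    }
                }
            }

            @Override
            public void onComplete() { done.countDown(); }

            @Override
            public void onError(Exception e) {
                e.printStackTrace();
                done.countDown();
            }
        };

        SpeechSynthesisParam param = SpeechSynthesisParam.builder()
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("cosyvoice-v2")      // placeholder model
                .voice("longxiaochun_v2")   // placeholder voice from the voice list
                .build();

        // With a ResultCallback bound, call returns immediately; audio arrives via onEvent.
        new SpeechSynthesizer(param, callback).call("你好，欢迎使用语音合成服务。");
        done.await();
        fos.close();
    }
}
```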
Bidirectional streaming call
Send text in multiple chunks and receive audio data incrementally through a registered ResultCallback callback.
-
For streaming input, call streamingCall multiple times to submit text fragments in order. After the server receives the text fragments, it automatically segments them into sentences:
-
Complete sentences are synthesized immediately.
-
Incomplete sentences are buffered and synthesized after they become complete.
When you call streamingComplete, the server forcibly synthesizes all received but unprocessed text fragments, including incomplete sentences.
-
The interval between sending text fragments cannot exceed 23 seconds; otherwise, a timeout exception occurs. Call the streamingComplete method promptly when there is no more text to send. The server enforces a 23-second timeout, and this configuration cannot be modified on the client.
-
Instantiate the SpeechSynthesizer class
Instantiate the SpeechSynthesizer class, and bind the request parameters and the ResultCallback interface.
-
Streaming
Call the streamingCall method of the SpeechSynthesizer class multiple times to submit the text for synthesis in chunks. This sends the text to the server in segments. While you are sending the text, the server returns the synthesis result to the client in real time through the onEvent method of the ResultCallback interface. The length of the text fragment sent in each call to the streamingCall method (the text parameter) cannot exceed 20,000 characters, and the total length of all sent text cannot exceed 200,000 characters.
-
End processing
Call the streamingComplete method of the SpeechSynthesizer class to end the speech synthesis. This method blocks the current thread until the onComplete or onError callback of the ResultCallback interface is triggered; then the thread is unblocked. Ensure that you call this method. Otherwise, text at the end may not be converted to speech.
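The three steps above can be sketched as follows. The model and voice names are placeholders, and the playback logic inside onEvent is omitted:

```java
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisParam;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisResult;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;
import com.alibaba.dashscope.common.ResultCallback;

public class BidirectionalStreamingDemo {
    public static void main(String[] args) {
        // 1. Instantiate with request parameters and a ResultCallback.
        ResultCallback<SpeechSynthesisResult> callback = new ResultCallback<SpeechSynthesisResult>() {
            @Override public void onEvent(SpeechSynthesisResult r) {
                if (r.getAudioFrame() != null) {
                    // Feed each frame to a streaming player or append it to a file.
                }
            }
            @Override public void onComplete() { System.out.println("synthesis finished"); }
            @Override public void onError(Exception e) { e.printStackTrace(); }
        };

        SpeechSynthesisParam param = SpeechSynthesisParam.builder()
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("cosyvoice-v2")     // placeholder model
                .voice("longxiaochun_v2")  // placeholder voice
                .build();

        SpeechSynthesizer synthesizer = new SpeechSynthesizer(param, callback);

        // 2. Streaming: submit fragments in order; keep intervals well under 23 seconds.
        for (String fragment : new String[] {"今天天气", "真不错，", "我们出去", "散步吧。"}) {
            synthesizer.streamingCall(fragment);
        }

        // 3. End processing: blocks until onComplete or onError fires.
        synthesizer.streamingComplete();
    }
}
```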
Call using Flowable
Flowable is the backpressure-aware reactive stream type provided by the open source RxJava library (Apache 2.0 license). See the Flowable API details.
Before using Flowable, make sure you have integrated the RxJava library and understand the basic concepts of reactive programming.
Unidirectional streaming call
The following example shows how to use the blockingForEach interface of a Flowable object to block and get the SpeechSynthesisResult data returned from each stream.
The complete synthesis result is also available through the getAudioData method of the SpeechSynthesizer class after all the streaming data from Flowable has been returned.
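A hedged sketch of this pattern, assuming the Flowable-returning variant is named callAsFlowable and that getAudioData returns the accumulated audio as a ByteBuffer (names may differ by SDK version):

```java
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisParam;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;

import java.nio.ByteBuffer;

public class FlowableUnidirectionalDemo {
    public static void main(String[] args) {
        SpeechSynthesisParam param = SpeechSynthesisParam.builder()
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("cosyvoice-v2")     // placeholder model
                .voice("longxiaochun_v2")  // placeholder voice
                .build();

        SpeechSynthesizer synthesizer = new SpeechSynthesizer(param, null);

        // blockingForEach blocks the current thread and handles each
        // SpeechSynthesisResult as the stream emits it.
        synthesizer.callAsFlowable("你好，欢迎使用语音合成服务。")
                .blockingForEach(result -> {
                    if (result.getAudioFrame() != null) {
                        System.out.println("received " + result.getAudioFrame().remaining() + " bytes");
                    }
                });

        // After the stream completes, the full result is also available in one piece.
        ByteBuffer fullAudio = synthesizer.getAudioData();
        System.out.println("total " + fullAudio.remaining() + " bytes");
    }
}
```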
Bidirectional streaming call
The following example shows how to use a Flowable object as an input parameter to input a text stream. It also shows how to use a Flowable object as a return value and use the blockingForEach interface to block and get the SpeechSynthesisResult data returned from each stream.
The complete synthesis result is also available through the getAudioData method of the SpeechSynthesizer class after all the streaming data from Flowable has been returned.
High-concurrency calls
The DashScope Java SDK uses OkHttp3's connection pool technology to reduce the overhead of repeatedly establishing connections. For more information, see High-concurrency scenarios.
Request parameters
Use the chained methods of SpeechSynthesisParam to configure parameters such as the model and voice. Pass the configured parameter object to the constructor of the SpeechSynthesizer class.
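For example, a typical configuration sketch (the voice name is a placeholder; parameter and parameters are the generic setters referred to in the Note columns of the table, and their exact signatures may vary by SDK version):

```java
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisParam;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;

public class ParamConfigDemo {
    public static void main(String[] args) {
        SpeechSynthesisParam param = SpeechSynthesisParam.builder()
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("cosyvoice-v2")      // required: model
                .voice("longxiaochun_v2")   // required: voice (placeholder name)
                .volume(50)                 // optional: [0, 100]
                .speechRate(1.0f)           // optional: [0.5, 2.0]
                .pitchRate(1.0f)            // optional: [0.5, 2.0]
                .build();

        // Parameters without a dedicated chained method (for example bit_rate for Opus)
        // are set through the generic parameter setter.
        param.parameter("bit_rate", 32);

        // Pass the configured parameter object to the SpeechSynthesizer constructor.
        SpeechSynthesizer synthesizer = new SpeechSynthesizer(param, null);
    }
}
```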
|
Parameter |
Type |
Required |
Description |
|
model |
String |
Yes |
Speech synthesis model. Each model version requires compatible voices:
|
|
voice |
String |
Yes |
The voice used for speech synthesis. Supported voice types:
|
|
format |
enum |
No |
The audio encoding format and sample rate. The default is MP3 format at 22.05 kHz sample rate. Note
The default sample rate represents the optimal rate for the selected voice. Output uses this rate by default, but downsampling and upsampling are supported. The following audio encoding formats and sample rates are supported:
|
|
volume |
int |
No |
The volume. Default: 50. Valid range: [0, 100]. Values scale linearly—0 is silent, 50 is default, 100 is maximum. |
|
speechRate |
float |
No |
The speech rate. Default value: 1.0. Valid values: [0.5, 2.0]. A value of 1.0 is the standard speech rate. A value less than 1.0 slows down the speech, and a value greater than 1.0 speeds it up. |
|
pitchRate |
float |
No |
Pitch multiplier. The relationship to perceived pitch is neither linear nor logarithmic—test to find suitable values. Default value: 1.0. Valid values: [0.5, 2.0]. A value of 1.0 is the natural pitch of the voice. A value greater than 1.0 raises the pitch, and a value less than 1.0 lowers it. |
|
bit_rate |
int |
No |
The audio bitrate in kbps. This parameter takes effect only when the audio format is Opus. Default value: 32. Valid values: [6, 510]. Note
Set this parameter by using the parameter or parameters method, because it has no dedicated chained method.
|
|
enableWordTimestamp |
boolean |
No |
Specifies whether to enable word-level timestamps. Default value: false.
This feature is available only for cloned voices of the cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 models, and for system voices that are marked as supported in the voice list. Timestamp results are only available through the callback interface. |
|
seed |
int |
No |
The random seed used during generation. Different seeds produce different synthesis results. If the model, text, voice, and other parameters are identical, using the same seed reproduces the same output. Default value: 0. Valid values: [0, 65535]. |
|
languageHints |
List<String> |
No |
Specifies the target language for speech synthesis to improve the synthesis effect. Use when pronunciation or synthesis quality is poor for numbers, abbreviations, symbols, or less common languages:
Valid values:
Note: This parameter is an array, but the current version only processes the first element. Therefore, we recommend passing only one value. Important
This parameter specifies the target language for speech synthesis. This setting is independent of the language of the sample audio used for voice cloning. To set the source language for a cloning task, see CosyVoice Voice Cloning/Design API. |
|
instruction |
String |
No |
Sets an instruction to control synthesis effects such as dialect, emotion, or speaking style. This feature is available only for cloned voices of the cosyvoice-v3.5-flash, cosyvoice-v3.5-plus, and cosyvoice-v3-flash models, and for system voices marked as supporting Instruct in the voice list. Length limit: 100 characters. A Chinese character (including simplified and traditional Chinese, Japanese Kanji, and Korean Hanja) is counted as two characters. All other characters, such as punctuation marks, letters, numbers, and Japanese/Korean Kana/Hangul, are counted as one character. Usage requirements (vary by model):
|
|
enable_aigc_tag |
boolean |
No |
Specifies whether to add an invisible AIGC identifier to the generated audio. When set to true, an invisible identifier is embedded into the audio in supported formats (WAV, MP3, and Opus). Default value: false. Only cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 support this feature. Note
Set this parameter by using the parameter or parameters method, because it has no dedicated chained method.
|
|
aigc_propagator |
String |
No |
Sets the propagator information recorded in the AIGC identifier. Default value: your Alibaba Cloud UID. Only cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 support this feature. Note
Set this parameter by using the parameter or parameters method, because it has no dedicated chained method.
|
|
aigc_propagate_id |
String |
No |
Sets the propagation ID recorded in the AIGC identifier. Default value: the request ID of the current speech synthesis request. Only cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 support this feature. Note
Set this parameter by using the parameter or parameters method, because it has no dedicated chained method.
|
|
hotFix |
ParamHotFix |
No |
Configuration for text hotpatching. Allows you to customize the pronunciation of specific words or replace text before synthesis. This feature is available only for cloned voices of cosyvoice-v3-flash. Parameter description:
Example:
|
|
enable_markdown_filter |
boolean |
No |
Specifies whether to enable Markdown filtering. When enabled, the system automatically removes Markdown symbols from the input text before synthesizing speech, preventing them from being read aloud. This feature is available only for cloned voices of cosyvoice-v3-flash. Default value: false. Valid values:
Note
Set this parameter by using the parameter or parameters method, because it has no dedicated chained method.
|
Key interfaces
SpeechSynthesizer class
Import the SpeechSynthesizer class using import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;. It provides the key interfaces for speech synthesis.
|
Interface/Method |
Parameter |
Return value |
Description |
|
|
|
Constructor.
|
|
|
|
Converts text (plain or with SSML) to speech. When you create a SpeechSynthesizer instance without a ResultCallback, this method blocks and returns the complete audio. When you create the instance with a ResultCallback, this method returns immediately and the synthesis result is returned through the callback.
Important
Before each call to the call method, you must re-initialize the SpeechSynthesizer instance. |
|
|
None |
Sends text as a stream. SSML is not supported. Call this interface multiple times to send the text to the server in multiple parts. The synthesis result is returned through the onEvent method of the ResultCallback interface. For a detailed call flow and reference example, see Bidirectional streaming call. |
|
None |
None |
Ends the streaming speech synthesis. This method blocks until one of the following conditions occurs:
-
The onComplete callback of the ResultCallback interface is triggered.
-
The onError callback of the ResultCallback interface is triggered.
For a detailed call flow and reference example, see Bidirectional streaming call. Important
When making a bidirectional streaming call, make sure to call this method to avoid missing parts of the synthesized speech. |
|
|
The synthesis result, encapsulated in |
Converts non-streaming text input (text containing SSML is not supported) into a streaming speech output in real time. The synthesis result is returned in a stream within the Flowable object. For a detailed call flow and reference example, see Call using Flowable. |
|
code: the WebSocket close code. reason: the reason for closing the connection. For information about how to configure these parameters, see The WebSocket Protocol document. |
true |
After a task is complete, you must close the WebSocket connection regardless of whether an exception occurred. This prevents connection leaks. For information about how to reuse connections to improve efficiency, see High-concurrency scenarios. |
|
|
The synthesis result, encapsulated in |
Converts streaming text input (text containing SSML is not supported) into a streaming speech output in real time. The synthesis result is returned in a stream within the Flowable object. For a detailed call flow and reference example, see Call using Flowable. |
|
None |
The request ID of the previous task. |
Gets the request ID of the previous task. Use this method after a task has been started, for example by calling the call or streamingCall method. |
|
None |
The first-packet latency of the current task. |
Returns first-packet latency in milliseconds (time from sending text to receiving first audio). Call after task completes. Factors affecting first-packet latency:
Typical latency:
If latency consistently exceeds 2,000 ms:
|
ResultCallback interface
For streaming calls (unidirectional or bidirectional), get results via ResultCallback. Import: import com.alibaba.dashscope.common.ResultCallback;.
|
Interface/Method |
Parameter |
Return value |
Description |
|
|
None |
Called when the server pushes audio data. Use the getAudioFrame method of the SpeechSynthesisResult parameter to get the binary audio data for the current segment. |
|
None |
None |
Called asynchronously after all synthesis data has been returned and speech synthesis is complete. |
|
|
None |
Called asynchronously when an exception occurs. We recommend implementing complete exception logging and resource cleanup logic in the |
Response
The server returns binary audio data:
-
Non-streaming call: Process the binary audio data returned by the call method of the SpeechSynthesizer class.
-
Unidirectional streaming call or bidirectional streaming call: Process the parameter (of type SpeechSynthesisResult) of the onEvent method of the ResultCallback interface.
The key interfaces of SpeechSynthesisResult are as follows:
Interface/Method
Parameter
Return value
Description
public ByteBuffer getAudioFrame()
None
Binary audio data
Returns the binary audio for the current segment (may be empty if no new data is available). Combine segments into a complete file or stream them to a compatible player.
Important
-
In streaming speech synthesis, for compressed formats such as MP3 and Opus, the segmented audio data must be played using a streaming player. Do not play it frame by frame, as this causes decoding to fail. Streaming players include FFmpeg, PyAudio (Python), AudioFormat (Java), and MediaSource (JavaScript).
-
When combining audio data into a complete audio file, write to the same file in append mode.
-
For WAV and MP3 audio from streaming speech synthesis, only the first frame contains header information. Subsequent frames contain only audio data.
public String getRequestId()
None
The request ID of the task.
Gets the request ID of the task. When you get binary audio data by calling getAudioFrame, the return value of the getRequestId method is null.
public SpeechSynthesisUsage getUsage()
None
SpeechSynthesisUsage: the number of billable characters in the current request so far.
Returns SpeechSynthesisUsage or null. The getCharacters method of SpeechSynthesisUsage returns the number of billable characters in the current request so far. Use the last received SpeechSynthesisUsage as the final value.
public Sentence getTimestamp()
None
Sentence: the timestamped sentence in the current request so far.
Returns Sentence or null. This method requires the enableWordTimestamp word-level timestamp feature to be enabled.
Methods of Sentence:
-
getIndex: Gets the sentence number, starting from 0.
-
getWords: Gets the character array List<Word> that makes up the sentence. Use the last received Sentence as the final value.
Methods of Word:
-
getText: Gets the text of the character.
-
getBeginIndex: Gets the starting position index of the character in the sentence, starting from 0.
-
getEndIndex: Gets the ending position index of the character in the sentence, starting from 1.
-
getBeginTime: Gets the start timestamp of the audio corresponding to the character, in milliseconds.
-
getEndTime: Gets the end timestamp of the audio corresponding to the character, in milliseconds.
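The append-mode guidance above can be sketched as a small helper. The method name appendFrame is our own; it simply writes each non-empty frame to the end of the target file:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class FrameWriter {
    /** Appends one audio frame to the target file, creating the file if needed. */
    public static void appendFrame(Path file, ByteBuffer frame) throws IOException {
        if (frame == null || !frame.hasRemaining()) {
            return; // getAudioFrame may carry no new data
        }
        byte[] bytes = new byte[frame.remaining()];
        frame.get(bytes);
        Files.write(file, bytes, StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    public static void main(String[] args) throws IOException {
        Path out = Files.createTempFile("tts", ".mp3");
        appendFrame(out, ByteBuffer.wrap(new byte[] {1, 2}));
        appendFrame(out, null);                              // frames without audio are skipped
        appendFrame(out, ByteBuffer.wrap(new byte[] {3}));
        System.out.println(Files.readAllBytes(out).length);  // 3
    }
}
```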
Error codes
If an error occurs, see Error messages for troubleshooting.
More examples
For more examples, see GitHub.
FAQ
Features, billing, and rate limiting
Q: What can I do if the pronunciation is inaccurate?
Use SSML to fix pronunciation.
Q: Speech synthesis is billed by character count. How do I check the text length for each synthesis request?
-
Non-streaming call: You need to calculate it yourself according to the character counting rules.
-
Other call methods: Use the getUsage method of the response SpeechSynthesisResult. Use the last response result that you receive as the final value.
Troubleshooting
If a code error occurs, see Error codes for troubleshooting.
Q: How do I get the request ID?
Get it in one of the following ways:
-
In the onEvent method of the ResultCallback interface, call the getRequestId method of SpeechSynthesisResult. The return value of the getRequestId method may be null. For more information, see the description of the getRequestId method in SpeechSynthesisResult.
-
Call the getLastRequestId method of SpeechSynthesizer.
Q: Why does the SSML feature fail?
Troubleshooting:
-
Verify limits and constraints.
-
Make sure you are using the correct interface: only the call method of the SpeechSynthesizer class supports SSML.
-
Make sure the text to be synthesized is in plain text format and meets the format requirements. For more information, see Introduction to the SSML markup language.
Q: Why does the audio duration of TTS speech synthesis differ from the WAV file's displayed duration? For example, a WAV file shows 7 seconds but the actual audio is less than 5 seconds?
TTS uses a streaming synthesis mechanism, which means it synthesizes and returns data progressively. As a result, the WAV file header contains an estimated value, which may have some margin of error. If you require precise duration, you can set the format to PCM and manually add the WAV header information after obtaining the complete synthesis result. This will give you a more accurate duration.
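The header-rewriting step can be sketched as follows: build a standard 44-byte WAV header with the exact PCM data length, then prepend it to the complete synthesis result. This assumes 16-bit mono PCM, which you should adjust if your output differs:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class WavHeader {
    /** Builds a standard 44-byte WAV header for 16-bit mono PCM data of known length. */
    public static byte[] build(int pcmLength, int sampleRate) {
        int channels = 1, bitsPerSample = 16;
        int byteRate = sampleRate * channels * bitsPerSample / 8;
        ByteBuffer b = ByteBuffer.allocate(44).order(ByteOrder.LITTLE_ENDIAN);
        b.put("RIFF".getBytes());
        b.putInt(36 + pcmLength);          // file size minus the 8-byte RIFF preamble
        b.put("WAVE".getBytes());
        b.put("fmt ".getBytes());
        b.putInt(16);                      // PCM fmt chunk size
        b.putShort((short) 1);             // audio format 1 = PCM
        b.putShort((short) channels);
        b.putInt(sampleRate);
        b.putInt(byteRate);
        b.putShort((short) (channels * bitsPerSample / 8)); // block align
        b.putShort((short) bitsPerSample);
        b.put("data".getBytes());
        b.putInt(pcmLength);               // exact data size, so players report the true duration
        return b.array();
    }

    /** Writes a complete, accurately sized WAV file from raw PCM data. */
    public static void writeWav(OutputStream out, byte[] pcm, int sampleRate) throws IOException {
        out.write(build(pcm.length, sampleRate));
        out.write(pcm);
    }
}
```

Because the data-chunk size now matches the real PCM length, the displayed duration agrees with the audible duration.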
Q: Why can't the audio be played?
Check the following scenarios one by one:
-
The audio is saved as a complete file (such as xx.mp3).
-
Format consistency: Verify request format matches file extension (e.g., WAV with .wav, not .mp3).
-
Player compatibility: Verify that your player supports the format and sample rate of the audio file. Some players may not support high sample rates or specific audio encodings.
-
-
The audio is played in a stream.
-
Save the audio stream as a complete file and try to play it with a player. If the file cannot be played, see the troubleshooting method for scenario 1.
-
If the file plays normally, the problem may be with your streaming playback implementation. Verify that your player supports streaming playback.
Common tools and libraries that support streaming playback include FFmpeg, PyAudio (Python), AudioFormat (Java), and MediaSource (JavaScript).
-
Q: Why does the audio playback stutter?
Check the following scenarios one by one:
-
Check the text sending speed: Make sure the interval between text segments is reasonable. Avoid situations where the next segment is not sent promptly after the previous audio segment finishes playing.
-
Check the callback function performance:
-
Avoid heavy business logic in the callback function—it can cause blocking.
-
Callbacks run in the WebSocket thread. Blocking prevents timely packet reception and causes audio playback to stutter.
-
We recommend writing audio data to a separate buffer and processing it in another thread to avoid blocking the WebSocket thread.
-
-
Check network stability: Ensure your network connection is stable to avoid audio transmission interruptions or delays caused by network fluctuations.
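The buffering recommendation above can be sketched with a BlockingQueue handed off to a dedicated player thread. The class and method names are our own; the point is that the WebSocket callback only enqueues and never blocks:

```java
import java.nio.ByteBuffer;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

public class AudioBuffer {
    // Poison pill signaling the end of the stream to the player thread.
    private static final ByteBuffer END = ByteBuffer.allocate(0);
    private final BlockingQueue<ByteBuffer> queue = new LinkedBlockingQueue<>();

    /** Called from the WebSocket callback (onEvent): enqueue and return immediately. */
    public void offer(ByteBuffer frame) { queue.add(frame); }

    /** Called from onComplete or onError to let the player thread finish. */
    public void finish() { queue.add(END); }

    /** Runs on a separate player thread; blocking happens here, never in the callback. */
    public void drain(Consumer<ByteBuffer> player) throws InterruptedException {
        while (true) {
            ByteBuffer frame = queue.take();
            if (frame == END) return;
            player.accept(frame);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        AudioBuffer buffer = new AudioBuffer();
        Thread playerThread = new Thread(() -> {
            try { buffer.drain(f -> System.out.println("play " + f.remaining() + " bytes")); }
            catch (InterruptedException ignored) { }
        });
        playerThread.start();

        // Simulated onEvent calls arriving on the WebSocket thread:
        buffer.offer(ByteBuffer.wrap(new byte[320]));
        buffer.offer(ByteBuffer.wrap(new byte[320]));
        buffer.finish(); // simulated onComplete
        playerThread.join();
    }
}
```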
Q: Why does speech synthesis take a long time?
Follow these steps to troubleshoot:
-
Check input interval
Check the input interval. If you are using streaming speech synthesis, verify whether the interval between sending text segments is too long (for example, a delay of several seconds). A long interval increases the total synthesis time.
-
Analyze performance metrics.
-
First-packet latency: Normally around 500 ms.
-
RTF (RTF = Total synthesis time / Audio duration): Normally less than 1.0.
-
Q: How do I handle incorrect pronunciation in the synthesized speech?
Use the <phoneme> tag of SSML to specify the correct pronunciation.
Q: Why is some text at the end not converted to speech, or why is no speech returned?
Check whether you have called the streamingComplete method of the SpeechSynthesizer class. During speech synthesis, the server begins synthesizing only after caching enough text. If you do not call streamingComplete, text remaining in the buffer may not be synthesized.
Permissions and authentication
Q: How can I restrict my API key to the CosyVoice speech synthesis service only (permission isolation)?
Create a workspace and grant authorization only to specific models to limit the API key scope. For more information, see Manage workspaces.
More questions
See the QA on GitHub.