
Alibaba Cloud Model Studio: CosyVoice Java SDK for speech synthesis

Last Updated: Dec 15, 2025

This topic describes the parameters and interface details of the CosyVoice Java SDK for speech synthesis.

Important

To use a model in the China (Beijing) region, obtain your API key from the API key page for the China (Beijing) region.

User guide: For more information about the models and guidance on model selection, see Real-time speech synthesis - CosyVoice.

Prerequisites

  • You have activated Alibaba Cloud Model Studio and created an API key. To prevent security risks, export the API key as an environment variable instead of hard-coding it in your code.

    Note

    To grant temporary access permissions to third-party applications or users, or if you want to strictly control high-risk operations such as accessing or deleting sensitive data, we recommend that you use a temporary authentication token.

    Compared with long-term API keys, temporary authentication tokens are more secure because they are short-lived (60 seconds). They are suitable for temporary call scenarios and can effectively reduce the risk of API key leakage.

    To use a temporary token, replace the API key used for authentication in your code with the temporary authentication token, as shown in the sketch after this list.

  • Install the latest version of the DashScope SDK.
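
A minimal sketch of the token substitution mentioned above, using the apiKey setter that appears in the examples in this topic. The token value is a placeholder; everything else matches the request-parameter examples below.

SpeechSynthesisParam param = SpeechSynthesisParam.builder()
    // Pass the temporary authentication token in place of a long-term API key.
    .apiKey("your-temporary-token")
    .model("cosyvoice-v3-flash") // Model
    .voice("longanyang") // Voice
    .build();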

Models and pricing

  • cosyvoice-v3-plus: $0.286706 per 10,000 characters

  • cosyvoice-v3-flash: $0.14335 per 10,000 characters

  • cosyvoice-v2: $0.286706 per 10,000 characters

Text and format limitations

Text length limits

  • Non-streaming call (synchronous call, asynchronous call, or Flowable non-streaming call): The text for a single request cannot exceed 2,000 characters.

  • Streaming call (streaming call or Flowable streaming call): Each text segment sent in a single call cannot exceed 2,000 characters, and the total length of all text sent in the task cannot exceed 200,000 characters.

Character counting rules

  • A Chinese character (simplified or traditional), a Japanese kanji, or a Korean hanja character is counted as 2 characters. All other characters, such as punctuation marks, letters, digits, Japanese kana, and Korean hangul, are counted as 1 character each.

  • SSML tags are not included in the text length calculation.

  • Examples:

    • "你好" → 2(你) + 2(好) = 4 characters

    • "中A文123" → 2(中) + 1(A) + 2(文) + 1(1) + 1(2) + 1(3) = 8 characters

    • "中文。" → 2(中) + 2(文) + 1(。) = 5 characters

    • "中 文。" → 2(中) + 1(space) + 2(文) + 1(。) = 6 characters

    • "<speak>你好</speak>" → 2(你) + 2(好) = 4 characters

Encoding format

Use UTF-8 encoding.

Support for mathematical expressions

The mathematical expression parsing feature is currently available only for the cosyvoice-v2, cosyvoice-v3-flash, and cosyvoice-v3-plus models. This feature supports common mathematical expressions from primary and secondary school, such as basic arithmetic, algebra, and geometry.

For more information, see Convert LaTeX formulas to speech.

SSML support

The Speech Synthesis Markup Language (SSML) feature is currently available only for cloned voices of the cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 models, and for system voices that are indicated as supported in the voice list.

Getting started

The SpeechSynthesizer class provides interfaces for speech synthesis and supports the following call methods:

  • Synchronous call: A blocking call that sends the complete text at once and returns the complete audio directly. This method is suitable for short text synthesis scenarios.

  • Asynchronous call: A non-blocking call that sends the complete text at once and uses a callback function to receive audio data, which may be delivered in chunks. This method is suitable for short text synthesis scenarios that require low latency.

  • Streaming call: A non-blocking call that sends text in fragments and uses a callback function to receive the synthesized audio stream incrementally in real time. This method is suitable for long text synthesis scenarios that require low latency.

Synchronous call

Submit a speech synthesis task synchronously to obtain the complete result directly.


Instantiate the SpeechSynthesizer class, set the request parameters, and call the call method to synthesize and obtain the binary audio data.

The text length cannot exceed 2,000 characters. For more information, see the call method of the SpeechSynthesizer class.

Important

Before each call to the call method, you must create a new SpeechSynthesizer instance.

Complete example

import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisParam;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;

public class Main {
    // Model
    private static String model = "cosyvoice-v3-flash";
    // Voice
    private static String voice = "longanyang";

    public static void streamAudioDataToSpeaker() {
        // Request parameters
        SpeechSynthesisParam param =
                SpeechSynthesisParam.builder()
                        // If the API key is not configured as an environment variable, uncomment the following line and replace your-api-key with your API key.
                        // .apiKey("your-api-key")
                        .model(model) // Model
                        .voice(voice) // Voice
                        .build();

        // Synchronous mode: Disable callback (the second parameter is null).
        SpeechSynthesizer synthesizer = new SpeechSynthesizer(param, null);
        ByteBuffer audio = null;
        try {
            // Block until the audio is returned.
            audio = synthesizer.call("What's the weather like today?");
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            // Close the WebSocket connection after the task is complete.
            synthesizer.getDuplexApi().close(1000, "bye");
        }
        if (audio != null) {
            // Save the audio data to a local file named "output.mp3".
            File file = new File("output.mp3");
            // A WebSocket connection must be established when the text is sent for the first time. Therefore, the first-packet latency includes the time required to establish the connection.
            System.out.println(
                    "[Metric] Request ID: "
                            + synthesizer.getLastRequestId()
                            + ", First-packet latency (ms): "
                            + synthesizer.getFirstPackageDelay());
            try (FileOutputStream fos = new FileOutputStream(file)) {
                fos.write(audio.array());
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }
    }

    public static void main(String[] args) {
        streamAudioDataToSpeaker();
        System.exit(0);
    }
}

Asynchronous call

Submit a speech synthesis task asynchronously and receive real-time speech segments frame by frame by registering a ResultCallback callback.


Instantiate the SpeechSynthesizer class, set the request parameters and the ResultCallback interface, and then call the call method to synthesize the audio. The onEvent method of the ResultCallback interface provides the synthesis result in real time.

The text length cannot exceed 2,000 characters. For more information, see the call method of the SpeechSynthesizer class.

Important

Before each call to the call method, you must create a new SpeechSynthesizer instance.

Complete example

import com.alibaba.dashscope.audio.tts.SpeechSynthesisResult;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisParam;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;
import com.alibaba.dashscope.common.ResultCallback;

import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.concurrent.CountDownLatch;

class TimeUtils {
    private static final DateTimeFormatter formatter =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS");

    public static String getTimestamp() {
        return LocalDateTime.now().format(formatter);
    }
}

public class Main {
    // Model
    private static String model = "cosyvoice-v3-flash";
    // Voice
    private static String voice = "longanyang";

    public static void streamAudioDataToSpeaker() {
        CountDownLatch latch = new CountDownLatch(1);

        // Implement the ResultCallback interface.
        ResultCallback<SpeechSynthesisResult> callback = new ResultCallback<SpeechSynthesisResult>() {
            @Override
            public void onEvent(SpeechSynthesisResult result) {
                // System.out.println("Message received: " + result);
                if (result.getAudioFrame() != null) {
                    // Implement the logic to save the audio data to a local file here.
                    System.out.println(TimeUtils.getTimestamp() + " Audio received");
                }
            }

            @Override
            public void onComplete() {
                System.out.println(TimeUtils.getTimestamp() + " 'Complete' received. Speech synthesis is finished.");
                latch.countDown();
            }

            @Override
            public void onError(Exception e) {
                System.out.println("An exception occurred: " + e.toString());
                latch.countDown();
            }
        };

        // Request parameters
        SpeechSynthesisParam param =
                SpeechSynthesisParam.builder()
                        // If you have not configured the API key as an environment variable, uncomment the following line and replace "your-api-key" with your API key.
                        // .apiKey("your-api-key")
                        .model(model) // Model
                        .voice(voice) // Voice
                        .build();
        // Pass the callback as the second parameter to enable asynchronous mode.
        SpeechSynthesizer synthesizer = new SpeechSynthesizer(param, callback);
        // This is a non-blocking call that returns null immediately. The actual result is passed asynchronously through the callback interface. Retrieve the binary audio in real time in the onEvent method of the callback interface.
        try {
            synthesizer.call("What's the weather like today?");
            // Wait for the synthesis to complete.
            latch.await();
            // Wait for all playback threads to finish.
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            // Close the WebSocket connection after the task is complete.
            synthesizer.getDuplexApi().close(1000, "bye");
        }
        // A WebSocket connection must be established when the text is sent for the first time. Therefore, the first-packet latency includes the time required to establish the connection.
        System.out.println(
                "[Metric] Request ID: "
                        + synthesizer.getLastRequestId()
                        + ", First-packet latency (ms): "
                        + synthesizer.getFirstPackageDelay());
    }

    public static void main(String[] args) {
        streamAudioDataToSpeaker();
        System.exit(0);
    }
}

Streaming call

Submit text in fragments and receive real-time speech segments frame by frame by registering a ResultCallback callback.

Note
  • For streaming input, call streamingCall multiple times to submit text fragments sequentially. After the server receives the text fragments, it automatically segments the text into sentences:

    • Complete sentences are synthesized immediately.

    • Incomplete sentences are cached until they are complete and then synthesized.

    When you call streamingComplete, the server synthesizes all received but unprocessed text fragments, including incomplete sentences.

  • The interval between sending text fragments cannot exceed 23 seconds. Otherwise, a "request timeout after 23 seconds" exception occurs.

    If you have no more text to send, call streamingComplete to end the task promptly.

    The server enforces a 23-second timeout, which cannot be modified by the client.
  1. Instantiate the SpeechSynthesizer class.

    Instantiate the SpeechSynthesizer class and set the request parameters and the ResultCallback interface.

  2. Stream data

    Call the streamingCall method of the SpeechSynthesizer class multiple times to submit the text to be synthesized to the server in segments.

    As you send the text, the server returns the synthesis result in real time through the onEvent method of the ResultCallback interface.

    For each call to the streamingCall method, the length of the text segment (that is, text) cannot exceed 2,000 characters. The total length of all text sent cannot exceed 200,000 characters.

  3. End processing

    Call the streamingComplete method of the SpeechSynthesizer class to end the speech synthesis task.

    This method blocks the current thread until the onComplete or onError callback of the ResultCallback interface is triggered, after which the thread is unblocked.

    You must call this method. Otherwise, the final text fragments may not be successfully converted to speech.

Complete example

import com.alibaba.dashscope.audio.tts.SpeechSynthesisResult;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisAudioFormat;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisParam;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;
import com.alibaba.dashscope.common.ResultCallback;

import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.concurrent.CountDownLatch;

class TimeUtils {
    private static final DateTimeFormatter formatter =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS");

    public static String getTimestamp() {
        return LocalDateTime.now().format(formatter);
    }
}


public class Main {
    private static String[] textArray = {"The streaming text-to-speech SDK, ",
            "can convert input text ", "into binary audio data. ", "Compared to non-streaming speech synthesis, ",
            "streaming synthesis offers the advantage of stronger ", "real-time performance. Users can hear ",
            "nearly synchronous speech output while inputting text, ", "greatly enhancing the interactive experience ",
            "and reducing user waiting time. ", "It is suitable for scenarios that call large ", "language models (LLMs) to perform ",
            "speech synthesis by ", "streaming text input."};
    private static String model = "cosyvoice-v3-flash"; // Model
    private static String voice = "longanyang"; // Voice

    public static void streamAudioDataToSpeaker() {
        // Configure the callback function.
        ResultCallback<SpeechSynthesisResult> callback = new ResultCallback<SpeechSynthesisResult>() {
            @Override
            public void onEvent(SpeechSynthesisResult result) {
                // System.out.println("Message received: " + result);
                if (result.getAudioFrame() != null) {
                    // Implement the logic to process audio data here.
                    System.out.println(TimeUtils.getTimestamp() + " Audio received");
                }
            }

            @Override
            public void onComplete() {
                System.out.println(TimeUtils.getTimestamp() + " 'Complete' received. Speech synthesis is finished.");
            }

            @Override
            public void onError(Exception e) {
                System.out.println("An exception occurred: " + e.toString());
            }
        };

        // Request parameters
        SpeechSynthesisParam param =
                SpeechSynthesisParam.builder()
                        // If the API key is not set as an environment variable, uncomment the following line and replace your-api-key with your key.
                        // .apiKey("your-api-key")
                        .model(model)
                        .voice(voice)
                        .format(SpeechSynthesisAudioFormat
                                .PCM_22050HZ_MONO_16BIT) // Streaming synthesis uses PCM or MP3.
                        .build();
        SpeechSynthesizer synthesizer = new SpeechSynthesizer(param, callback);
        // The call method with a callback does not block the current thread.
        try {
            for (String text : textArray) {
                // Send text segments and receive binary audio in real time from the onEvent method of the callback interface.
                synthesizer.streamingCall(text);
            }
            // Wait for the streaming speech synthesis to complete.
            synthesizer.streamingComplete();
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            // After the task is complete, close the WebSocket connection.
            synthesizer.getDuplexApi().close(1000, "bye");
        }

        // A WebSocket connection is established when the first text segment is sent. Therefore, the first-packet latency includes the connection establishment time.
        System.out.println(
                "[Metric] Request ID: "
                        + synthesizer.getLastRequestId()
                        + ", First-packet latency (ms): "
                        + synthesizer.getFirstPackageDelay());
    }

    public static void main(String[] args) {
        streamAudioDataToSpeaker();
        System.exit(0);
    }
}

Call through Flowable

Flowable is a class in the RxJava library that represents a reactive stream with backpressure support. For more information about how to use Flowable, see Flowable API details.

Before you use Flowable, ensure that you have integrated the RxJava library and understand the basic concepts of reactive programming.

Non-streaming call

The following example shows how to use the blockingForEach interface of a Flowable object to block the current thread and retrieve the SpeechSynthesisResult data that is returned in each stream.

You can also obtain the complete synthesis result using the getAudioData method of the SpeechSynthesizer class after the Flowable stream is complete.
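
A minimal sketch of this approach follows. It assumes that getAudioData returns the accumulated binary audio as a ByteBuffer once the stream has been fully consumed (verify the return type against your SDK version); the param object is the same request-parameter object as in the complete example below.

SpeechSynthesizer synthesizer = new SpeechSynthesizer(param, null);
// Drain the stream; the audio is accumulated inside the synthesizer.
synthesizer.callAsFlowable("What's the weather like today?").blockingForEach(result -> {
    // Discard per-frame results here; the complete audio is retrieved below.
});
// Assumption: getAudioData returns the complete binary audio after the stream completes.
ByteBuffer completeAudio = synthesizer.getAudioData();
// Close the WebSocket connection after the task is complete.
synthesizer.getDuplexApi().close(1000, "bye");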

Complete example

import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisParam;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;
import com.alibaba.dashscope.exception.NoApiKeyException;

import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

class TimeUtils {
    private static final DateTimeFormatter formatter =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS");

    public static String getTimestamp() {
        return LocalDateTime.now().format(formatter);
    }
}

public class Main {
    private static String model = "cosyvoice-v3-flash"; // Model
    private static String voice = "longanyang"; // Voice

    public static void streamAudioDataToSpeaker() throws NoApiKeyException {
        // Request parameters
        SpeechSynthesisParam param =
                SpeechSynthesisParam.builder()
                        // If the API key is not configured as an environment variable, uncomment the following line and replace your-api-key with your API key.
                        // .apiKey("your-api-key")
                        .model(model) // Model
                        .voice(voice) // Voice
                        .build();
        SpeechSynthesizer synthesizer = new SpeechSynthesizer(param, null);
        synthesizer.callAsFlowable("What's the weather like today?").blockingForEach(result -> {
            // System.out.println("Received message: " + result);
            if (result.getAudioFrame() != null) {
                // Implement the logic to process audio data here.
                System.out.println(TimeUtils.getTimestamp() + " Audio received");
            }
        });
        // Close the WebSocket connection after the task is complete.
        synthesizer.getDuplexApi().close(1000, "bye");
        // A WebSocket connection must be established when the text is sent for the first time. Therefore, the first-packet latency includes the time required to establish the connection.
        System.out.println(
                "[Metric] Request ID: "
                        + synthesizer.getLastRequestId()
                        + ", First-packet latency (ms): "
                        + synthesizer.getFirstPackageDelay());
    }

    public static void main(String[] args) throws NoApiKeyException {
        streamAudioDataToSpeaker();
        System.exit(0);
    }
}

Streaming call

The following example shows how to use a Flowable object as an input parameter for a text stream. The example also shows how to use a Flowable object as a return value and use the blockingForEach interface to block the current thread and retrieve the SpeechSynthesisResult data that is returned in each stream.

You can also obtain the complete synthesis result using the getAudioData method of the SpeechSynthesizer class after the Flowable stream is complete.

Complete example

import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisParam;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;
import com.alibaba.dashscope.exception.NoApiKeyException;
import io.reactivex.BackpressureStrategy;
import io.reactivex.Flowable;

import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

class TimeUtils {
    private static final DateTimeFormatter formatter =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS");

    public static String getTimestamp() {
        return LocalDateTime.now().format(formatter);
    }
}

public class Main {
    private static String[] textArray = {"The streaming text-to-speech SDK, ",
            "can convert input text ", "into binary audio data. ", "Compared to non-streaming speech synthesis, ",
            "streaming synthesis offers the advantage of stronger ", "real-time performance. Users can hear ",
            "nearly synchronous speech output while inputting text, ", "greatly enhancing the interactive experience ",
            "and reducing user waiting time. ", "It is suitable for scenarios that call large ", "language models (LLMs) to perform ",
            "speech synthesis by ", "streaming text input."};
    private static String model = "cosyvoice-v3-flash";
    private static String voice = "longanyang";

    public static void streamAudioDataToSpeaker() throws NoApiKeyException {
        // Simulate streaming input.
        Flowable<String> textSource = Flowable.create(emitter -> {
            new Thread(() -> {
                for (int i = 0; i < textArray.length; i++) {
                    emitter.onNext(textArray[i]);
                    try {
                        Thread.sleep(1000);
                    } catch (InterruptedException e) {
                        throw new RuntimeException(e);
                    }
                }
                emitter.onComplete();
            }).start();
        }, BackpressureStrategy.BUFFER);

        // Request parameters
        SpeechSynthesisParam param =
                SpeechSynthesisParam.builder()
                        // If the API key is not configured as an environment variable, uncomment the following line and replace yourApikey with your API key.
                        // .apiKey("yourApikey")
                        .model(model) // Model
                        .voice(voice) // Voice
                        .build();
        SpeechSynthesizer synthesizer = new SpeechSynthesizer(param, null);
        synthesizer.streamingCallAsFlowable(textSource).blockingForEach(result -> {
            if (result.getAudioFrame() != null) {
                // Implement the logic to play the audio here.
                System.out.println(
                        TimeUtils.getTimestamp() +
                                " Binary audio size: " + result.getAudioFrame().capacity());
            }
        });
        synthesizer.getDuplexApi().close(1000, "bye");
        // A WebSocket connection must be established when the text is sent for the first time. Therefore, the first-packet latency includes the time required to establish the connection.
        System.out.println(
                "[Metric] Request ID: "
                        + synthesizer.getLastRequestId()
                        + ", First-packet latency (ms): "
                        + synthesizer.getFirstPackageDelay());
    }

    public static void main(String[] args) throws NoApiKeyException {
        streamAudioDataToSpeaker();
        System.exit(0);
    }
}

High-concurrency calls

The DashScope Java SDK uses the connection pool technology of OkHttp3 to reduce the overhead from repeatedly establishing connections. For more information, see High-concurrency scenarios.

Request parameters

Use the chained methods of SpeechSynthesisParam to configure parameters, such as the model and voice, and pass the configured parameter object to the constructor of the SpeechSynthesizer class.

Example

SpeechSynthesisParam param = SpeechSynthesisParam.builder()
    .model("cosyvoice-v3-flash") // Model
    .voice("longanyang") // Voice
    .format(SpeechSynthesisAudioFormat.WAV_8000HZ_MONO_16BIT) // Audio coding format and sample rate
    .volume(50) // Volume. Value range: [0, 100].
    .speechRate(1.0f) // Speech rate. Value range: [0.5, 2].
    .pitchRate(1.0f) // Pitch. Value range: [0.5, 2].
    .build();

The following request parameters are supported.

model (String, required)

The speech synthesis model.

Different models require corresponding voices:

  • cosyvoice-v3-flash/cosyvoice-v3-plus: Use voices such as longanyang.

  • cosyvoice-v2: Use voices such as longxiaochun_v2.

  • For a complete list, see Voice list.

voice (String, required)

The voice to use for speech synthesis.

System voices and cloned voices are supported:

  • System voices: See Voice list.

  • Cloned voices: Customize voices using the voice cloning feature. When you use a cloned voice, make sure that the same account is used for both voice cloning and speech synthesis. For detailed steps, see CosyVoice Voice Cloning API.

    When you use a cloned voice, the value of the model parameter in the request must be the same as the model version used to create the voice (the target_model parameter).

format (enum, optional)

The audio coding format and sample rate.

If you do not specify the format, the synthesized audio has a sample rate of 22.05 kHz and is in MP3 format.

Note

The default sample rate is the optimal rate for the current voice. By default, the output uses this sample rate. Downsampling and upsampling are also supported.

The following audio coding formats and sample rates are available:

  • Audio coding formats and sample rates supported by all models:

    • SpeechSynthesisAudioFormat.WAV_8000HZ_MONO_16BIT: WAV format with a sample rate of 8 kHz

    • SpeechSynthesisAudioFormat.WAV_16000HZ_MONO_16BIT: WAV format with a sample rate of 16 kHz

    • SpeechSynthesisAudioFormat.WAV_22050HZ_MONO_16BIT: WAV format with a sample rate of 22.05 kHz

    • SpeechSynthesisAudioFormat.WAV_24000HZ_MONO_16BIT: WAV format with a sample rate of 24 kHz

    • SpeechSynthesisAudioFormat.WAV_44100HZ_MONO_16BIT: WAV format with a sample rate of 44.1 kHz

    • SpeechSynthesisAudioFormat.WAV_48000HZ_MONO_16BIT: WAV format with a sample rate of 48 kHz

    • SpeechSynthesisAudioFormat.MP3_8000HZ_MONO_128KBPS: MP3 format with a sample rate of 8 kHz

    • SpeechSynthesisAudioFormat.MP3_16000HZ_MONO_128KBPS: MP3 format with a sample rate of 16 kHz

    • SpeechSynthesisAudioFormat.MP3_22050HZ_MONO_256KBPS: MP3 format with a sample rate of 22.05 kHz

    • SpeechSynthesisAudioFormat.MP3_24000HZ_MONO_256KBPS: MP3 format with a sample rate of 24 kHz

    • SpeechSynthesisAudioFormat.MP3_44100HZ_MONO_256KBPS: MP3 format with a sample rate of 44.1 kHz

    • SpeechSynthesisAudioFormat.MP3_48000HZ_MONO_256KBPS: MP3 format with a sample rate of 48 kHz

    • SpeechSynthesisAudioFormat.PCM_8000HZ_MONO_16BIT: PCM format with a sample rate of 8 kHz

    • SpeechSynthesisAudioFormat.PCM_16000HZ_MONO_16BIT: PCM format with a sample rate of 16 kHz

    • SpeechSynthesisAudioFormat.PCM_22050HZ_MONO_16BIT: PCM format with a sample rate of 22.05 kHz

    • SpeechSynthesisAudioFormat.PCM_24000HZ_MONO_16BIT: PCM format with a sample rate of 24 kHz

    • SpeechSynthesisAudioFormat.PCM_44100HZ_MONO_16BIT: PCM format with a sample rate of 44.1 kHz

    • SpeechSynthesisAudioFormat.PCM_48000HZ_MONO_16BIT: PCM format with a sample rate of 48 kHz

  • If the audio format is Opus, you can adjust the bitrate using the bit_rate parameter. This applies only to DashScope 2.21.0 and later versions.

    • SpeechSynthesisAudioFormat.OGG_OPUS_8KHZ_MONO_32KBPS: Opus format with a sample rate of 8 kHz and a bitrate of 32 kbps

    • SpeechSynthesisAudioFormat.OGG_OPUS_16KHZ_MONO_16KBPS: Opus format with a sample rate of 16 kHz and a bitrate of 16 kbps

    • SpeechSynthesisAudioFormat.OGG_OPUS_16KHZ_MONO_32KBPS: Opus format with a sample rate of 16 kHz and a bitrate of 32 kbps

    • SpeechSynthesisAudioFormat.OGG_OPUS_16KHZ_MONO_64KBPS: Opus format with a sample rate of 16 kHz and a bitrate of 64 kbps

    • SpeechSynthesisAudioFormat.OGG_OPUS_24KHZ_MONO_16KBPS: Opus format with a sample rate of 24 kHz and a bitrate of 16 kbps

    • SpeechSynthesisAudioFormat.OGG_OPUS_24KHZ_MONO_32KBPS: Opus format with a sample rate of 24 kHz and a bitrate of 32 kbps

    • SpeechSynthesisAudioFormat.OGG_OPUS_24KHZ_MONO_64KBPS: Opus format with a sample rate of 24 kHz and a bitrate of 64 kbps

    • SpeechSynthesisAudioFormat.OGG_OPUS_48KHZ_MONO_16KBPS: Opus format with a sample rate of 48 kHz and a bitrate of 16 kbps

    • SpeechSynthesisAudioFormat.OGG_OPUS_48KHZ_MONO_32KBPS: Opus format with a sample rate of 48 kHz and a bitrate of 32 kbps

    • SpeechSynthesisAudioFormat.OGG_OPUS_48KHZ_MONO_64KBPS: Opus format with a sample rate of 48 kHz and a bitrate of 64 kbps

volume (int, optional)

The volume.

Default value: 50.

Value range: [0, 100]. A value of 50 is the standard volume. The volume has a linear relationship with this value. 0 is mute and 100 is the maximum volume.

speechRate (float, optional)

The speech rate.

Default value: 1.0.

Value range: [0.5, 2.0]. A value of 1.0 is the standard rate. Values less than 1.0 slow down the speech, and values greater than 1.0 speed it up.

pitchRate (float, optional)

The pitch. This value is a multiplier for pitch adjustment. The relationship between this value and the perceived pitch is not strictly linear or logarithmic. Test different values to find the best one.

Default value: 1.0.

Value range: [0.5, 2.0]. A value of 1.0 is the natural pitch of the voice. Values greater than 1.0 raise the pitch, and values less than 1.0 lower it.

bit_rate (int, optional)

The audio bitrate in kbps. If the audio format is Opus, you can adjust the bitrate using the bit_rate parameter.

Default value: 32.

Value range: [6, 510].

Note

Set the bit_rate parameter using the parameter or parameters method of a SpeechSynthesisParam instance:

Set using parameter

SpeechSynthesisParam param = SpeechSynthesisParam.builder()
    .model("cosyvoice-v3-flash") // Model
    .voice("longanyang") // Voice
    .parameter("bit_rate", 32)
    .build();

Set using parameters

SpeechSynthesisParam param = SpeechSynthesisParam.builder()
    .model("cosyvoice-v3-flash") // Model
    .voice("longanyang") // Voice
    .parameters(Collections.singletonMap("bit_rate", 32))
    .build();

enableWordTimestamp (boolean, optional)

Specifies whether to enable character-level timestamps. Valid values: true and false.

Default value: false.

This feature is available only for cloned voices of the cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 models, and for system voices marked as supported in the voice list.

Timestamp results can be obtained only through the callback interface.

seed (int, optional)

The random number seed used during generation. Different seeds produce slightly different synthesis results. If the model version, text, voice, and other parameters are the same, using the same seed reproduces the same synthesis result.

Default value: 0.

Value range: [0, 65535].

languageHints (List, optional)

Provides language hints. Only cosyvoice-v3-flash and cosyvoice-v3-plus support this feature.

No default value. This parameter has no effect if it is not set.

This parameter has the following effects in speech synthesis:

  1. Specifies the language for Text Normalization (TN) processing, which affects how numbers, abbreviations, and symbols are read. This is effective only for Chinese and English.

    Value range:

    • zh: Chinese

    • en: English

  2. Specifies the target language for speech synthesis (for cloned voices only) to improve synthesis accuracy. This is effective for English, French, German, Japanese, Korean, and Russian. You do not need to specify Chinese. The value must be consistent with the languageHints/language_hints used during voice cloning.

    Value range:

    • en: English

    • fr: French

    • de: German

    • ja: Japanese

    • ko: Korean

    • ru: Russian

If the specified language hint clearly does not match the text content, for example, setting en for Chinese text, the hint is ignored, and the language is automatically detected based on the text content.

Note: This parameter is an array, but the current version processes only the first element. Therefore, pass only one value.
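
The exact setter for this parameter may differ by SDK version. The following sketch is an assumption: it reuses the generic parameter method shown elsewhere in this topic with the raw field name language_hints mentioned above and a single-element list. Verify the field name and setter against the SDK version you use.

SpeechSynthesisParam param = SpeechSynthesisParam.builder()
    .model("cosyvoice-v3-flash") // Model
    .voice("longanyang") // Voice
    // Assumption: raw field name "language_hints"; only the first element is processed.
    .parameter("language_hints", Collections.singletonList("en"))
    .build();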

instruction (String, optional)

Sets an instruction. This feature is available only for cloned voices of the cosyvoice-v3-flash and cosyvoice-v3-plus models, and for system voices marked as supported in the voice list.

No default value. This parameter has no effect if it is not set.

The instruction has the following effects in speech synthesis:

  1. Specifies a non-Chinese language (for cloned voices only)

    • Format: "You will say it in <language>." (Note: Do not omit the period at the end. Replace "<language>" with a specific language, for example, German.)

    • Example: "You will say it in German."

    • Supported languages: French, German, Japanese, Korean, and Russian.

  2. Specifies a dialect (for cloned voices only)

    • Format: "Say it in <dialect>." (Note: Do not omit the period at the end. Replace "<dialect>" with a specific dialect, for example, Cantonese.)

    • Example: "Say it in Cantonese."

    • Supported dialects: Cantonese, Dongbei, Gansu, Guizhou, Henan, Hubei, Jiangxi, Minnan, Ningxia, Shanxi, Shaanxi, Shandong, Shanghainese, Sichuan, Tianjin, and Yunnan.

  3. Specifies emotion, scenario, role, or identity. Only some system voices support this feature, and it varies by voice. For more information, see Voice list.

enable_aigc_tag (boolean, optional)

Specifies whether to add an invisible AIGC identifier to the generated audio. When set to true, an invisible identifier is embedded into the audio in supported formats (WAV, MP3, and Opus).

Default value: false.

This feature is available only for the cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 models.

Note

Set the enable_aigc_tag parameter using the parameter or parameters method of a SpeechSynthesisParam instance:

Set using the parameter method

SpeechSynthesisParam param = SpeechSynthesisParam.builder()
    .model("cosyvoice-v3-flash") // Model
    .voice("longanyang") // Voice
    .parameter("enable_aigc_tag", true)
    .build();

Set using the parameters method

SpeechSynthesisParam param = SpeechSynthesisParam.builder()
    .model("cosyvoice-v3-flash") // Model
    .voice("longanyang") // Voice
    .parameters(Collections.singletonMap("enable_aigc_tag", true))
    .build();

aigc_propagator (String, optional)

Sets the ContentPropagator field in the AIGC invisible identifier to specify the content propagator. This setting takes effect only when enable_aigc_tag is true.

Default value: Alibaba Cloud UID.

This feature is available only for the cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 models.

Note

Set aigc_propagator using the parameter or parameters method of a SpeechSynthesisParam instance:

Set using the parameter method

SpeechSynthesisParam param = SpeechSynthesisParam.builder()
    .model("cosyvoice-v3-flash") // Model
    .voice("longanyang") // Voice
    .parameter("enable_aigc_tag", true)
    .parameter("aigc_propagator", "xxxx")
    .build();

Set using the parameters method

Map<String, Object> map = new HashMap<>();
map.put("enable_aigc_tag", true);
map.put("aigc_propagator", "xxxx");

SpeechSynthesisParam param = SpeechSynthesisParam.builder()
    .model("cosyvoice-v3-flash") // Model
    .voice("longanyang") // Voice
    .parameters(map)
    .build();

aigc_propagate_id (String, optional)

Sets the PropagateID field in the AIGC invisible identifier to uniquely identify a specific propagation behavior. This field takes effect only when enable_aigc_tag is set to true.

Default value: The request ID of the current speech synthesis request.

This feature is available only for the cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 models.

Note

Set aigc_propagate_id using the parameter or parameters method of a SpeechSynthesisParam instance:

Set using the parameter method

SpeechSynthesisParam param = SpeechSynthesisParam.builder()
    .model("cosyvoice-v3-flash") // Model
    .voice("longanyang") // Voice
    .parameter("enable_aigc_tag", true)
    .parameter("aigc_propagate_id", "xxxx")
    .build();

Set using the parameters method

Map<String, Object> map = new HashMap<>();
map.put("enable_aigc_tag", true);
map.put("aigc_propagate_id", "xxxx");

SpeechSynthesisParam param = SpeechSynthesisParam.builder()
    .model("cosyvoice-v3-flash") // Model
    .voice("longanyang") // Voice
    .parameters(map)
    .build();

Key interfaces

SpeechSynthesizer class

The SpeechSynthesizer class provides the primary interfaces for speech synthesis and is imported using import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;.

public SpeechSynthesizer(SpeechSynthesisParam param, ResultCallback<SpeechSynthesisResult> callback)

Constructor. Returns a SpeechSynthesizer instance.

public ByteBuffer call(String text)

Parameter: text (the text to be synthesized, in UTF-8).

Returns: ByteBuffer or null.

Converts a segment of text into speech. The text can be plain text or text that contains SSML.

When you create a SpeechSynthesizer instance, two cases are possible:

  • No ResultCallback is specified: The call method blocks the current thread until speech synthesis is complete. For usage, see Synchronous call.

  • A ResultCallback is specified: The call method returns null immediately. The synthesis result is obtained through the onEvent method of the ResultCallback interface. For usage, see Asynchronous call.

Important

Before each call to the call method, you must create a new SpeechSynthesizer instance.

public void streamingCall(String text)

Parameter: text (the text to be synthesized, in UTF-8).

Returns: none.

Sends the text for synthesis in a stream. Text that contains SSML is not supported.

Call this interface multiple times to send the text for synthesis to the server in parts. The onEvent method of the ResultCallback interface returns the synthesis result.

For a detailed call flow and reference examples, see Streaming call.

public void streamingComplete() throws RuntimeException

Parameters: none. Returns: none.

Ends the streaming speech synthesis.

This method blocks the calling thread until one of the following conditions occurs:

  • The server completes the final audio synthesis (success).

  • The streaming session is abnormally interrupted (failure).

  • The 10 minute timeout threshold is reached (automatic unblocking).

For a detailed call process and reference examples, see Streaming call.

Important

When making a streaming call, call this method to avoid missing parts of the synthesized speech.

public Flowable<SpeechSynthesisResult> callAsFlowable(String text)

Parameter: text (the text to be synthesized, in UTF-8).

Returns: the synthesis result, encapsulated in Flowable<SpeechSynthesisResult>.

Converts non-streaming text input into a streaming speech output in real time. Text containing SSML is not supported. The synthesis result is returned in a stream within the Flowable object.

For a detailed call process and reference examples, see Call through Flowable.

boolean getDuplexApi().close(int code, String reason)

Parameters: code (the WebSocket close code) and reason (the reason for closing the connection). For information about how to set these parameters, see The WebSocket Protocol.

Returns: true.

After a task is complete, close the WebSocket connection, regardless of whether an exception occurred, to avoid connection leaks. For information about how to reuse connections to improve efficiency, see High-concurrency scenarios.

public Flowable<SpeechSynthesisResult> streamingCallAsFlowable(Flowable<String> textStream)

Parameter: textStream (a Flowable instance that encapsulates the text to be synthesized).

Returns: the synthesis result, encapsulated in Flowable<SpeechSynthesisResult>.

Converts streaming text input into a streaming speech output in real time. Text that contains SSML is not supported. The synthesis result is returned as a stream in a Flowable object.

For a detailed call process and reference examples, see Call through Flowable.

public String getLastRequestId()

Parameters: none.

Returns: the request ID of the previous task.

Gets the request ID of the previous task. You can use this method after starting a new task by calling call, streamingCall, callAsFlowable, or streamingCallAsFlowable.

public long getFirstPackageDelay()

Parameters: none.

Returns: the first-packet latency of the current task.

Gets the first packet delay of the current task, which is typically around 500 ms. Use this method after the task is complete.

The first packet delay is the time between when the text starts being sent and when the first audio packet is received, measured in milliseconds.

A WebSocket connection must be established when the text is sent for the first time. Therefore, the first-packet latency includes the time required to establish the connection. If the connection is reused in a high-concurrency scenario, the connection time is not included.

Callback interface (ResultCallback)

When you make an asynchronous call or a streaming call, you can retrieve the synthesis result from the ResultCallback interface. This interface is imported using import com.alibaba.dashscope.common.ResultCallback;.

Example

ResultCallback<SpeechSynthesisResult> callback = new ResultCallback<SpeechSynthesisResult>() {
    @Override
    public void onEvent(SpeechSynthesisResult result) {
        System.out.println("RequestId: " + result.getRequestId());
        // Process audio chunks in real time, such as for playback or writing to a buffer.
    }

    @Override
    public void onComplete() {
        System.out.println("Task completed");
        // Handle post-synthesis logic, such as releasing the player.
    }

    @Override
    public void onError(Exception e) {
        System.out.println("Task failed: " + e.getMessage());
        // Handle exceptions, such as network errors or server-side error codes.
    }
};

public void onEvent(SpeechSynthesisResult result)

Parameter: result (a SpeechSynthesisResult instance).

Returns: none.

This is called back asynchronously when the server pushes speech synthesis data.

Call the getAudioFrame method of SpeechSynthesisResult to get the binary audio data.

Call the getUsage method of SpeechSynthesisResult to get the number of billable characters in the current request so far.

public void onComplete()

Parameters: none. Returns: none.

The callback is invoked asynchronously after all synthesized data has been returned, that is, when speech synthesis is complete.

public void onError(Exception e)

Parameter: e (the exception information).

Returns: none.

This interface is called back asynchronously when an exception occurs.

Implement complete exception logging and resource cleanup logic in the onError method.

Response

The server returns binary audio data:

  • Synchronous call: Process the binary audio data returned by the call method of the SpeechSynthesizer class.

  • Asynchronous call or streaming call: Process the SpeechSynthesisResult parameter of the onEvent method of the ResultCallback interface.

    The key interfaces of SpeechSynthesisResult are as follows:

    public ByteBuffer getAudioFrame()

    Parameters: none.

    Returns: the binary audio data of the current streaming synthesis segment, which may be empty if no new data has arrived.

    Combine the binary audio data into a complete audio file for playback, or play it in real time with a player that supports streaming playback.

    Important
    • In streaming speech synthesis, for compressed formats such as MP3 and Opus, use a streaming player to play the audio segments. Do not play them frame by frame to avoid decoding failures.

      Players that support streaming playback include ffmpeg, pyaudio (Python), AudioFormat (Java), and MediaSource (JavaScript).
    • When combining audio data into a complete audio file, append the data to the same file.

    • For WAV and MP3 audio formats in streaming speech synthesis, only the first frame contains header information. Subsequent frames contain only audio data.

    public String getRequestId()

    Parameters: none.

    Returns: the request ID of the task.

    Gets the request ID of the task. When you call getAudioFrame to get binary audio data, the return value of the getRequestId method is null.

    public SpeechSynthesisUsage getUsage()

    Parameters: none.

    Returns: SpeechSynthesisUsage or null. SpeechSynthesisUsage contains the number of billable characters in the current request so far.

    The getCharacters method of SpeechSynthesisUsage returns the number of billable characters used so far in the current request. Use the last received SpeechSynthesisUsage as the final count.

    public Sentence getTimestamp()

    Parameters: none.

    Returns: Sentence or null. Sentence contains the timestamp information for the sentence synthesized so far in the current request. The enableWordTimestamp character-level timestamp feature must be enabled.

    Methods of Sentence:

    • getIndex: Gets the sentence number, starting from 0.

    • getWords: Gets the character array List<Word> that makes up the sentence. Use the last received Sentence as the final result.

    Methods of Word:

    • getText: Gets the text of the character.

    • getBeginIndex: Gets the start position index of the character in the sentence, starting from 0.

    • getEndIndex: Gets the end position index of the character in the sentence, starting from 1.

    • getBeginTime: Gets the start timestamp of the audio corresponding to the character, in milliseconds.

    • getEndTime: Gets the end timestamp of the audio corresponding to the character, in milliseconds.
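
    The usage and timestamp information described above can be read in the onEvent callback. The following is a minimal sketch; it assumes that enableWordTimestamp is enabled on the request parameters and that the Sentence and Word classes are imported from the DashScope SDK package for your version:

    @Override
    public void onEvent(SpeechSynthesisResult result) {
        // Billable characters so far; use the last received value as the final count.
        if (result.getUsage() != null) {
            System.out.println("Billable characters: " + result.getUsage().getCharacters());
        }
        // Character-level timestamps; requires enableWordTimestamp to be enabled.
        if (result.getTimestamp() != null) {
            Sentence sentence = result.getTimestamp();
            System.out.println("Sentence index: " + sentence.getIndex());
            for (Word word : sentence.getWords()) {
                System.out.println(word.getText()
                        + " [" + word.getBeginTime() + " ms - " + word.getEndTime() + " ms]");
            }
        }
    }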

Error codes

For troubleshooting information, see Error messages.

More examples

For more examples, see GitHub.

FAQ

Features, billing, and rate limiting

Q: What can I do to fix inaccurate pronunciation?

You can use SSML to customize the speech synthesis output.

Q: Speech synthesis is billed based on the number of text characters. How can I view or get the text length for each synthesis?

In an asynchronous or streaming call, call the getUsage method of SpeechSynthesisResult in the onEvent callback. The getCharacters method of the returned SpeechSynthesisUsage object is the number of billable characters in the current request so far. Use the last received value as the final count.

Troubleshooting

If a code error occurs, see Error codes for troubleshooting information.

Q: How do I get the Request ID?

You can retrieve it in one of the following two ways:

  • Call the getLastRequestId method of the SpeechSynthesizer class after a task has been started.

  • In the onEvent method of the ResultCallback interface, call the getRequestId method of the SpeechSynthesisResult parameter.

Q: Why does the SSML feature fail?

Perform the following steps to troubleshoot this issue:

  1. Ensure that your use case meets the conditions described in the scope of application.

  2. Install the latest version of the DashScope SDK.

  3. Ensure that you are using the correct interface. SSML is only supported by the call method of the SpeechSynthesizer class.

  4. Ensure that the text for synthesis is in plain text and meets the format requirements. For more information, see Speech Synthesis Markup Language.

Q: Why can't the audio be played?

Troubleshoot this issue based on the following scenarios:

  1. The audio is saved as a complete file, such as an .mp3 file.

    1. Audio format consistency: Ensure that the audio format specified in the request parameters matches the file extension. For example, playback might fail if the audio format is set to WAV in the request parameters but the file has an .mp3 extension.

    2. Player compatibility: Confirm that your player supports the format and sample rate of the audio file. For example, some players might not support high sample rates or specific audio encodings.

  2. The audio is played in streaming mode.

    1. Save the audio stream as a complete file and try to play it. If the file fails to play, see the troubleshooting steps for the first scenario.

    2. If the file plays correctly, the issue might be with the streaming playback implementation. Confirm that your player supports streaming playback.

      Common tools and libraries that support streaming playback include FFmpeg, pyaudio (Python), AudioFormat (Java), and MediaSource (JavaScript).

Q: Why does the audio playback stutter?

Troubleshoot this issue based on the following scenarios:

  1. Check the text sending speed: Ensure that the text sending interval is reasonable. Avoid delays in sending the next text segment after the audio for the previous segment has finished playing.

  2. Check the callback function performance:

    • Check whether the callback function contains excessive business logic that could cause it to block.

    • The callback function runs in the WebSocket thread. If this thread is blocked, it can interfere with the WebSocket's ability to receive network packets, resulting in audio stuttering.

    • To avoid blocking the WebSocket thread, write the audio data to a separate audio buffer and then use another thread to read and process it, as sketched after this list.

  3. Check network stability: Ensure that your network connection is stable to prevent audio transmission interruptions or delays caused by network fluctuations.
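
A minimal sketch of this producer-consumer pattern, assuming a java.util.concurrent.LinkedBlockingQueue for the hand-off and leaving the actual playback or file-writing logic as a comment (imports are omitted, as in the other short snippets in this topic):

BlockingQueue<ByteBuffer> audioBuffer = new LinkedBlockingQueue<>();

ResultCallback<SpeechSynthesisResult> callback = new ResultCallback<SpeechSynthesisResult>() {
    @Override
    public void onEvent(SpeechSynthesisResult result) {
        if (result.getAudioFrame() != null) {
            // Hand the frame off and return quickly; do not block the WebSocket thread.
            audioBuffer.offer(result.getAudioFrame());
        }
    }

    @Override
    public void onComplete() {
        // Enqueue an empty sentinel frame to tell the consumer to stop.
        audioBuffer.offer(ByteBuffer.allocate(0));
    }

    @Override
    public void onError(Exception e) {
        System.out.println("An exception occurred: " + e);
        audioBuffer.offer(ByteBuffer.allocate(0));
    }
};

// Consumer thread: reads frames from the buffer and plays or writes them.
new Thread(() -> {
    try {
        while (true) {
            ByteBuffer frame = audioBuffer.take();
            if (frame.capacity() == 0) {
                break; // sentinel frame: synthesis finished or failed
            }
            // Play the frame or append it to a file here.
        }
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
    }
}).start();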

Q: Why is speech synthesis slow (long synthesis time)?

Perform the following troubleshooting steps:

  1. Check the input interval

    If you are using streaming speech synthesis, check whether the text sending interval is too long. For example, a delay of several seconds before sending the next segment will increase the total synthesis time.

  2. Analyze performance metrics

    • First packet delay: This is typically around 500 ms.

    • Real-Time Factor (RTF): This is calculated as Total Synthesis Time / Audio Duration. The RTF is normally less than 1.0. For example, if synthesizing 10 seconds of audio takes 2 seconds in total, the RTF is 0.2.

Q: How do I handle incorrect pronunciation in the synthesized speech?

Use the <phoneme> tag of SSML to specify the correct pronunciation.

Q: Why is no speech returned? Why is the text at the end not successfully converted to speech? (Missing synthesized speech)

Check whether you have called the streamingComplete method of the SpeechSynthesizer class. During speech synthesis, the server caches text and begins synthesis only after a sufficient amount of text is cached. If you do not call the streamingComplete method, the text remaining in the cache may not be synthesized.

Permissions and authentication

Q: I want my API key to be used only for the CosyVoice speech synthesis service, not for other Model Studio models (permission isolation). What should I do?

You can create a workspace and authorize only specific models to limit the scope of the API key. For more information, see Manage workspaces.

More questions

For more information, see the Q&A on GitHub.