Alibaba Cloud Model Studio:CosyVoice speech synthesis Java SDK

Last Updated:Mar 18, 2026

This topic describes the parameters and key interfaces of the CosyVoice speech synthesis Java SDK.

User guide: For model overviews and selection suggestions, see Real-time speech synthesis - CosyVoice.

Prerequisites

Models and pricing

See Real-time speech synthesis - CosyVoice.

Text and format limitations

Text length limits

Character counting rules

  • Chinese characters (simplified/traditional Chinese, Japanese Kanji, Korean Hanja) count as two characters. All other characters (punctuation, letters, numbers, Kana, Hangul) count as one.

  • SSML tags are not included when calculating the text length.

  • Examples:

    • "你好" → 2 (Chinese character) + 2 (Chinese character) = 4 characters

    • "中A文123" → 2 (Chinese character) + 1 (A) + 2 (Chinese character) + 1 (1) + 1 (2) + 1 (3) = 8 characters

    • "中文。" → 2 (Chinese character) + 2 (Chinese character) + 1 (。) = 5 characters

    • "中 文。" → 2 (Chinese character) + 1 (space) + 2 (Chinese character) + 1 (。) = 6 characters

    • "<speak>你好</speak>" → 2 (Chinese character) + 2 (Chinese character) = 4 characters
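The counting rules above can be sketched in Java. The helper below is a minimal illustration of the documented rules, not the service's billing implementation: it strips anything that looks like an SSML tag with a naive regex (which would also remove literal angle-bracket text) and treats every Han ideograph, detected through Character.UnicodeScript, as two characters.

```java
public class TtsLength {
    // Count characters per the documented rules: Han ideographs (Chinese
    // characters, Japanese Kanji, Korean Hanja) count as 2; everything else
    // (letters, digits, punctuation, Kana, Hangul) counts as 1.
    static int billedLength(String text) {
        // Naive SSML stripping for illustration; also removes any literal <...> text.
        String stripped = text.replaceAll("<[^>]+>", "");
        int count = 0;
        for (int i = 0; i < stripped.length(); ) {
            int cp = stripped.codePointAt(i);
            count += isHanIdeograph(cp) ? 2 : 1;
            i += Character.charCount(cp);
        }
        return count;
    }

    static boolean isHanIdeograph(int cp) {
        // The Han script covers simplified/traditional Chinese, Kanji, and Hanja.
        return Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN;
    }

    public static void main(String[] args) {
        System.out.println(billedLength("你好"));                // 4
        System.out.println(billedLength("中A文123"));            // 8
        System.out.println(billedLength("<speak>你好</speak>")); // 4
    }
}
```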

Encoding format

Use UTF-8 encoding.
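In practice this means the text sent to the service must be valid UTF-8. A quick check of what that implies for Chinese text:

```java
import java.nio.charset.StandardCharsets;

public class Utf8Check {
    // Returns the UTF-8 byte length of the text, which is what is sent on the wire.
    static int utf8ByteLength(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u6587"; // "中文"
        // Each of the two CJK characters occupies 3 bytes in UTF-8.
        System.out.println(utf8ByteLength(text)); // 6
        // Round-trip: decoding the bytes as UTF-8 restores the original string.
        System.out.println(new String(text.getBytes(StandardCharsets.UTF_8),
                StandardCharsets.UTF_8).equals(text)); // true
    }
}
```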

Support for mathematical expressions

Mathematical expression parsing (v3.5-flash, v3.5-plus, v3-flash, v3-plus, and v2 only): supports primary and secondary school mathematics, including basic operations, algebra, and geometry.

Note

This feature only supports Chinese.

See Convert LaTeX formulas to speech (Chinese language only).

SSML support

SSML is available for custom voices (voice design or cloning) with v3.5-flash, v3.5-plus, v3-flash, v3-plus, and v2, and for system voices marked as supported in the voice list.

Getting started

The SpeechSynthesizer class provides key interfaces for speech synthesis and supports the following call methods:

  • Non-streaming: A blocking call that sends the full text at once and returns the complete audio. Suitable for short text.

  • Unidirectional streaming: A non-blocking call that sends the full text at once and receives audio via callback. Suitable for short text with low latency.

  • Bidirectional streaming: A non-blocking call that sends text fragments incrementally and receives audio via callback in real time. Suitable for long text with low latency.

Non-streaming call

Submits a synthesis task synchronously and returns the complete result.


Instantiate the SpeechSynthesizer class, bind the request parameters, and call the call method to synthesize and get the binary audio data.

The length of the sent text cannot exceed 20,000 characters. For more information, see the call method of the SpeechSynthesizer class.

Important

Before each call to the call method, you must re-initialize the SpeechSynthesizer instance.

Full example:

import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisParam;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;
import com.alibaba.dashscope.utils.Constants;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;

public class Main {
    // Model
    private static String model = "cosyvoice-v3-flash";
    // Voice
    private static String voice = "longanyang";

    public static void streamAudioDataToSpeaker() {
        // Request parameters
        SpeechSynthesisParam param =
                SpeechSynthesisParam.builder()
                        // The API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
                        // If you have not configured environment variables, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                        .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                        .model(model) // Model
                        .voice(voice) // Voice
                        .build();

        // Synchronous mode: Disable the callback (the second parameter is null).
        SpeechSynthesizer synthesizer = new SpeechSynthesizer(param, null);
        ByteBuffer audio = null;
        try {
            // Block until the audio is returned.
            audio = synthesizer.call("What's the weather like today?");
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            // Close the WebSocket connection after the task is complete.
            synthesizer.getDuplexApi().close(1000, "bye");
        }
        if (audio != null) {
            // Save the audio data to a local file named "output.mp3".
            File file = new File("output.mp3");
            // A WebSocket connection must be established when you send text for the first time. Therefore, the first-packet latency includes the time required to establish the connection.
            System.out.println(
                    "[Metric] Request ID: "
                            + synthesizer.getLastRequestId()
                            + ", First-packet latency (ms): "
                            + synthesizer.getFirstPackageDelay());
            try (FileOutputStream fos = new FileOutputStream(file)) {
                fos.write(audio.array());
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }
    }

    public static void main(String[] args) {
        // The following URL is for the Singapore region. If you use a model in the Beijing region, replace the URL with: wss://dashscope.aliyuncs.com/api-ws/v1/inference
        Constants.baseWebsocketApiUrl = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference";
        streamAudioDataToSpeaker();
        System.exit(0);
    }
}

Unidirectional streaming call

Submits a synthesis task asynchronously and receives audio incrementally via ResultCallback.


Instantiate the SpeechSynthesizer class, bind the request parameters and the ResultCallback interface, and call the call method to synthesize. Get the synthesis result in real time through the onEvent method of the ResultCallback interface.

The length of the sent text cannot exceed 20,000 characters. For more information, see the call method of the SpeechSynthesizer class.

Important

Before each call to the call method, you must re-initialize the SpeechSynthesizer instance.

Full example:

import com.alibaba.dashscope.audio.tts.SpeechSynthesisResult;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisParam;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;
import com.alibaba.dashscope.common.ResultCallback;
import com.alibaba.dashscope.utils.Constants;

import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.concurrent.CountDownLatch;

class TimeUtils {
    private static final DateTimeFormatter formatter =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS");

    public static String getTimestamp() {
        return LocalDateTime.now().format(formatter);
    }
}

public class Main {
    // Model
    private static String model = "cosyvoice-v3-flash";
    // Voice
    private static String voice = "longanyang";

    public static void streamAudioDataToSpeaker() {
        CountDownLatch latch = new CountDownLatch(1);

        // Implement the ResultCallback interface.
        ResultCallback<SpeechSynthesisResult> callback = new ResultCallback<SpeechSynthesisResult>() {
            @Override
            public void onEvent(SpeechSynthesisResult result) {
                // System.out.println("Message received: " + result);
                if (result.getAudioFrame() != null) {
                    // Implement the logic for saving audio data to a local file here.
                    System.out.println(TimeUtils.getTimestamp() + " Audio received");
                }
            }

            @Override
            public void onComplete() {
                System.out.println(TimeUtils.getTimestamp() + " Complete received. Speech synthesis finished.");
                latch.countDown();
            }

            @Override
            public void onError(Exception e) {
                System.out.println("An exception occurred: " + e.toString());
                latch.countDown();
            }
        };

        // Request parameters
        SpeechSynthesisParam param =
                SpeechSynthesisParam.builder()
                        // The API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
                        // If you have not configured environment variables, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                        .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                        .model(model) // Model
                        .voice(voice) // Voice
                        .build();
        // Pass the callback as the second parameter to enable asynchronous mode.
        SpeechSynthesizer synthesizer = new SpeechSynthesizer(param, callback);
        // This is a non-blocking call that returns null immediately. The actual result is passed asynchronously through the callback interface. The binary audio is returned in real time in the onEvent method of the callback interface.
        try {
            synthesizer.call("What's the weather like today?");
            // Wait for the synthesis to complete.
            latch.await();
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            // Close the WebSocket connection after the task is complete.
            synthesizer.getDuplexApi().close(1000, "bye");
        }
        // A WebSocket connection must be established when you send text for the first time. Therefore, the first-packet latency includes the time required to establish the connection.
        System.out.println(
                "[Metric] Request ID: "
                        + synthesizer.getLastRequestId()
                        + ", First-packet latency (ms): "
                        + synthesizer.getFirstPackageDelay());
    }

    public static void main(String[] args) {
        // The following URL is for the Singapore region. If you use a model in the Beijing region, replace the URL with: wss://dashscope.aliyuncs.com/api-ws/v1/inference
        Constants.baseWebsocketApiUrl = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference";
        streamAudioDataToSpeaker();
        System.exit(0);
    }
}
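The onEvent callback above only logs that audio arrived. A common next step is appending each frame to a file. The helper below is our own sketch, independent of the SDK: it shows one safe way to drain a ByteBuffer frame into an output stream without assuming the buffer exposes a backing array.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;

public class FrameWriter {
    // Copies the readable bytes of a frame into the stream. Using remaining()
    // and get() works for both heap and direct buffers.
    static void writeFrame(ByteBuffer frame, OutputStream out) throws IOException {
        byte[] bytes = new byte[frame.remaining()];
        frame.get(bytes);
        out.write(bytes);
    }

    public static void main(String[] args) throws IOException {
        // Demo with an in-memory sink; in a real callback you would pass a
        // FileOutputStream opened once before synthesis starts.
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        writeFrame(ByteBuffer.wrap(new byte[]{1, 2, 3}), sink);
        writeFrame(ByteBuffer.wrap(new byte[]{4, 5}), sink);
        System.out.println(sink.size()); // 5
    }
}
```

Inside onEvent, this corresponds to calling writeFrame(result.getAudioFrame(), fos); flush and close the stream in onComplete.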

Bidirectional streaming call

Send text in multiple chunks and receive audio data incrementally through a registered ResultCallback callback.

Note
  • For streaming input, call streamingCall multiple times to submit text fragments in order. After the server receives the text fragments, it automatically segments them into sentences:

    • Complete sentences are synthesized immediately.

    • Incomplete sentences are buffered and synthesized after they are complete.

    When you call streamingComplete, the server forcibly synthesizes all received but unprocessed text fragments, including incomplete sentences.

  • The interval between sending text fragments cannot exceed 23 seconds, or a timeout exception occurs.

    Call the streamingComplete method promptly when there is no more text to send.

    The server enforces a 23-second timeout mechanism. This configuration cannot be modified on the client.
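To stay under the 23-second limit, an application can arm a watchdog after each fragment and finish the stream when input goes idle. The sketch below is our own illustration; names such as IdleGuard are not SDK API, and in practice the onIdle action would call the synthesizer's streamingComplete method.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Runs an action (e.g. synthesizer::streamingComplete) if no text fragment has
// been sent for idleLimitSeconds, so the stream is finished before the
// server-side 23-second timeout fires.
public class IdleGuard {
    private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
    private final long idleLimitSeconds;
    private final Runnable onIdle;
    private ScheduledFuture<?> pending;

    public IdleGuard(long idleLimitSeconds, Runnable onIdle) {
        this.idleLimitSeconds = idleLimitSeconds;
        this.onIdle = onIdle;
    }

    // Call this right after every streamingCall(text) to rearm the watchdog.
    public synchronized void touch() {
        if (pending != null) pending.cancel(false);
        pending = timer.schedule(onIdle, idleLimitSeconds, TimeUnit.SECONDS);
    }

    public void shutdown() { timer.shutdownNow(); }

    public static void main(String[] args) throws Exception {
        CountDownLatch done = new CountDownLatch(1);
        IdleGuard guard = new IdleGuard(1, done::countDown); // 1 s limit for the demo
        guard.touch();   // arm after "sending" a fragment
        done.await();    // fires once input has been idle for the limit
        System.out.println("idle action fired");
        guard.shutdown();
    }
}
```

Choose an idle limit comfortably below 23 seconds (for example, 20) so the completion call reaches the server in time.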
  1. Instantiate the SpeechSynthesizer class

    Instantiate the SpeechSynthesizer class, and bind the request parameters and the ResultCallback interface.

  2. Streaming

    Call the streamingCall method of the SpeechSynthesizer class multiple times to submit the text for synthesis in chunks. This sends the text to the server in segments.

    While you are sending the text, the server returns the synthesis result to the client in real time through the onEvent method of the ResultCallback interface.

    The length of the text fragment sent in each call to the streamingCall method (the text parameter) cannot exceed 20,000 characters. The total length of all sent text cannot exceed 200,000 characters.

  3. End processing

    Call the streamingComplete method of the SpeechSynthesizer class to end the speech synthesis.

    This method blocks the current thread until the onComplete or onError callback of the ResultCallback interface is triggered.

    Ensure you call this method. Otherwise, text at the end may not be converted to speech.

Full example:

import com.alibaba.dashscope.audio.tts.SpeechSynthesisResult;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisAudioFormat;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisParam;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;
import com.alibaba.dashscope.common.ResultCallback;
import com.alibaba.dashscope.utils.Constants;

import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

class TimeUtils {
    private static final DateTimeFormatter formatter =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS");

    public static String getTimestamp() {
        return LocalDateTime.now().format(formatter);
    }
}


public class Main {
    private static String[] textArray = {"The streaming text-to-speech SDK ",
            "can convert input text ", "into binary audio data. ", "Compared to non-streaming synthesis, ",
            "streaming offers better real-time performance. ", "You can hear the audio output almost instantly as you type, ",
            "which greatly improves the user experience ", "and reduces waiting time. ",
            "This is ideal for applications that use large ", "language models (LLMs) ",
            "to synthesize speech from a stream of text."};
    private static String model = "cosyvoice-v3-flash"; // Model
    private static String voice = "longanyang"; // Voice

    public static void streamAudioDataToSpeaker() {
        // Configure the callback function.
        ResultCallback<SpeechSynthesisResult> callback = new ResultCallback<SpeechSynthesisResult>() {
            @Override
            public void onEvent(SpeechSynthesisResult result) {
                // System.out.println("Message received: " + result);
                if (result.getAudioFrame() != null) {
                    // Implement the logic for processing audio data here.
                    System.out.println(TimeUtils.getTimestamp() + " Audio received");
                }
            }

            @Override
            public void onComplete() {
                System.out.println(TimeUtils.getTimestamp() + " Complete received. Speech synthesis finished.");
            }

            @Override
            public void onError(Exception e) {
                System.out.println("An exception occurred: " + e.toString());
            }
        };

        // Request parameters
        SpeechSynthesisParam param =
                SpeechSynthesisParam.builder()
                        // The API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
                        // If you have not configured environment variables, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                        .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                        .model(model)
                        .voice(voice)
                        .format(SpeechSynthesisAudioFormat
                                .PCM_22050HZ_MONO_16BIT) // Use PCM or MP3 for streaming synthesis.
                        .build();
        SpeechSynthesizer synthesizer = new SpeechSynthesizer(param, callback);
        // streamingCall does not block the current thread; audio arrives through the callback.
        try {
            for (String text : textArray) {
                // Send text fragments and get the binary audio in real time in the onEvent method of the callback interface.
                synthesizer.streamingCall(text);
            }
            // Wait for the streaming speech synthesis to complete.
            synthesizer.streamingComplete();
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            // Close the WebSocket connection after the task is complete.
            synthesizer.getDuplexApi().close(1000, "bye");
        }

        // A WebSocket connection must be established when you send text for the first time. Therefore, the first-packet latency includes the time required to establish the connection.
        System.out.println(
                "[Metric] Request ID: "
                        + synthesizer.getLastRequestId()
                        + ", First-packet latency (ms): "
                        + synthesizer.getFirstPackageDelay());
    }

    public static void main(String[] args) {
        // The following URL is for the Singapore region. If you use a model in the Beijing region, replace the URL with: wss://dashscope.aliyuncs.com/api-ws/v1/inference
        Constants.baseWebsocketApiUrl = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference";
        streamAudioDataToSpeaker();
        System.exit(0);
    }
}

Call using Flowable

Flowable is the reactive-streams publisher class from the RxJava library (Apache 2.0 license). The SDK uses it to represent asynchronous text input and audio output streams. See Flowable API details.

Before using Flowable, make sure you have integrated the RxJava library and understand the basic concepts of reactive programming.

Unidirectional streaming call

The following example shows how to use the blockingForEach method of a Flowable object to block and consume the SpeechSynthesisResult data emitted for each stream event.

The complete synthesis result is also available through the getAudioData method of the SpeechSynthesizer class after all the streaming data from Flowable has been returned.

Full example:

import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisParam;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.utils.Constants;

import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

class TimeUtils {
    private static final DateTimeFormatter formatter =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS");

    public static String getTimestamp() {
        return LocalDateTime.now().format(formatter);
    }
}

public class Main {
    private static String model = "cosyvoice-v3-flash"; // Model
    private static String voice = "longanyang"; // Voice

    public static void streamAudioDataToSpeaker() throws NoApiKeyException {
        // Request parameters
        SpeechSynthesisParam param =
                SpeechSynthesisParam.builder()
                        // The API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
                        // If you have not configured environment variables, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                        .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                        .model(model) // Model
                        .voice(voice) // Voice
                        .build();
        SpeechSynthesizer synthesizer = new SpeechSynthesizer(param, null);
        synthesizer.callAsFlowable("What's the weather like today?").blockingForEach(result -> {
            // System.out.println("Message received: " + result);
            if (result.getAudioFrame() != null) {
                // Implement the logic for processing audio data here.
                System.out.println(TimeUtils.getTimestamp() + " Audio received");
            }
        });
        // Close the WebSocket connection after the task is complete.
        synthesizer.getDuplexApi().close(1000, "bye");
        // A WebSocket connection must be established when you send text for the first time. Therefore, the first-packet latency includes the time required to establish the connection.
        System.out.println(
                "[Metric] Request ID: "
                        + synthesizer.getLastRequestId()
                        + ", First-packet latency (ms): "
                        + synthesizer.getFirstPackageDelay());
    }

    public static void main(String[] args) throws NoApiKeyException {
        // The following URL is for the Singapore region. If you use a model in the Beijing region, replace the URL with: wss://dashscope.aliyuncs.com/api-ws/v1/inference
        Constants.baseWebsocketApiUrl = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference";
        streamAudioDataToSpeaker();
        System.exit(0);
    }
}

Bidirectional streaming call

The following example shows how to pass a Flowable object as an input parameter to supply a text stream, and how to consume the returned Flowable by calling its blockingForEach method to block and receive the SpeechSynthesisResult data emitted for each stream event.

The complete synthesis result is also available through the getAudioData method of the SpeechSynthesizer class after all the streaming data from Flowable has been returned.

Full example:

import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisParam;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.utils.Constants;
import io.reactivex.BackpressureStrategy;
import io.reactivex.Flowable;

import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

class TimeUtils {
    private static final DateTimeFormatter formatter =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS");

    public static String getTimestamp() {
        return LocalDateTime.now().format(formatter);
    }
}

public class Main {
    private static String[] textArray = {"The streaming text-to-speech SDK ",
            "can convert input text ", "into binary audio data. ", "Compared to non-streaming synthesis, ",
            "streaming offers better real-time performance. ", "You can hear the audio output almost instantly as you type, ",
            "which greatly improves the user experience ", "and reduces waiting time. ",
            "This is ideal for applications that use large ", "language models (LLMs) ",
            "to synthesize speech from a stream of text."};
    private static String model = "cosyvoice-v3-flash";
    private static String voice = "longanyang";

    public static void streamAudioDataToSpeaker() throws NoApiKeyException {
        // Simulate streaming input.
        Flowable<String> textSource = Flowable.create(emitter -> {
            new Thread(() -> {
                for (int i = 0; i < textArray.length; i++) {
                    emitter.onNext(textArray[i]);
                    try {
                        Thread.sleep(1000);
                    } catch (InterruptedException e) {
                        throw new RuntimeException(e);
                    }
                }
                emitter.onComplete();
            }).start();
        }, BackpressureStrategy.BUFFER);

        // Request parameters
        SpeechSynthesisParam param =
                SpeechSynthesisParam.builder()
                        // The API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
                        // If you have not configured environment variables, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                        .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                        .model(model) // Model
                        .voice(voice) // Voice
                        .build();
        SpeechSynthesizer synthesizer = new SpeechSynthesizer(param, null);
        synthesizer.streamingCallAsFlowable(textSource).blockingForEach(result -> {
            if (result.getAudioFrame() != null) {
                // Implement the logic for playing audio here.
                System.out.println(
                        TimeUtils.getTimestamp() +
                                " Binary audio size: " + result.getAudioFrame().capacity());
            }
        });
        synthesizer.getDuplexApi().close(1000, "bye");
        // A WebSocket connection must be established when you send text for the first time. Therefore, the first-packet latency includes the time required to establish the connection.
        System.out.println(
                "[Metric] Request ID: "
                        + synthesizer.getLastRequestId()
                        + ", First-packet latency (ms): "
                        + synthesizer.getFirstPackageDelay());
    }

    public static void main(String[] args) throws NoApiKeyException {
        // The following URL is for the Singapore region. If you use a model in the Beijing region, replace the URL with: wss://dashscope.aliyuncs.com/api-ws/v1/inference
        Constants.baseWebsocketApiUrl = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference";
        streamAudioDataToSpeaker();
        System.exit(0);
    }
}

High-concurrency calls

The DashScope Java SDK uses OkHttp3's connection pool technology to reduce the overhead of repeatedly establishing connections. For more information, see High-concurrency scenarios.

Request parameters

Use the chained methods of SpeechSynthesisParam to configure parameters such as the model and voice. Pass the configured parameter object to the constructor of the SpeechSynthesizer class.

Example:

SpeechSynthesisParam param = SpeechSynthesisParam.builder()
    .model("cosyvoice-v3-flash") // Model
    .voice("longanyang") // Voice
    .format(SpeechSynthesisAudioFormat.WAV_8000HZ_MONO_16BIT) // Audio encoding format and sample rate
    .volume(50) // Volume. Valid values: [0, 100].
    .speechRate(1.0f) // Speech rate. Valid values: [0.5, 2].
    .pitchRate(1.0f) // Pitch. Valid values: [0.5, 2].
    .build();

The following parameters are supported. Each entry lists the parameter's type and whether it is required, followed by its description.

model (String, required)

Speech synthesis model. Each model version requires compatible voices:

  • cosyvoice-v3.5-flash/cosyvoice-v3.5-plus: No system voices are available. Only custom voices from voice design or voice cloning are supported.

  • cosyvoice-v3-flash/cosyvoice-v3-plus: Use voices such as longanyang.

  • cosyvoice-v2: Use voices such as longxiaochun_v2.

  • For a complete list of voices, see Voice list.

voice (String, required)

The voice used for speech synthesis.

Supported voice types:

  • System voices: For more information, see Voice list.

  • Cloned voices: Customized using the Voice cloning feature. When using a cloned voice, make sure that the same account is used for voice cloning and speech synthesis.

    For cloned voices, model must match the voice creation model (target_model).

  • Designed voices: Customized using the Voice design feature. When using a designed voice, make sure that the same account is used for voice design and speech synthesis.

    For designed voices, model must match the voice creation model (target_model).

format (enum, optional)

The audio encoding format and sample rate.

The default is MP3 format at 22.05 kHz sample rate.

Note

The default sample rate represents the optimal rate for the selected voice. Output uses this rate by default, but downsampling and upsampling are supported.

The following audio encoding formats and sample rates are supported:

  • Audio encoding formats and sample rates supported by all models:

    • SpeechSynthesisAudioFormat.WAV_8000HZ_MONO_16BIT: WAV format, 8 kHz sample rate

    • SpeechSynthesisAudioFormat.WAV_16000HZ_MONO_16BIT: WAV format, 16 kHz sample rate

    • SpeechSynthesisAudioFormat.WAV_22050HZ_MONO_16BIT: WAV format, 22.05 kHz sample rate

    • SpeechSynthesisAudioFormat.WAV_24000HZ_MONO_16BIT: WAV format, 24 kHz sample rate

    • SpeechSynthesisAudioFormat.WAV_44100HZ_MONO_16BIT: WAV format, 44.1 kHz sample rate

    • SpeechSynthesisAudioFormat.WAV_48000HZ_MONO_16BIT: WAV format, 48 kHz sample rate

    • SpeechSynthesisAudioFormat.MP3_8000HZ_MONO_128KBPS: MP3 format, 8 kHz sample rate

    • SpeechSynthesisAudioFormat.MP3_16000HZ_MONO_128KBPS: MP3 format, 16 kHz sample rate

    • SpeechSynthesisAudioFormat.MP3_22050HZ_MONO_256KBPS: MP3 format, 22.05 kHz sample rate

    • SpeechSynthesisAudioFormat.MP3_24000HZ_MONO_256KBPS: MP3 format, 24 kHz sample rate

    • SpeechSynthesisAudioFormat.MP3_44100HZ_MONO_256KBPS: MP3 format, 44.1 kHz sample rate

    • SpeechSynthesisAudioFormat.MP3_48000HZ_MONO_256KBPS: MP3 format, 48 kHz sample rate

    • SpeechSynthesisAudioFormat.PCM_8000HZ_MONO_16BIT: PCM format, 8 kHz sample rate

    • SpeechSynthesisAudioFormat.PCM_16000HZ_MONO_16BIT: PCM format, 16 kHz sample rate

    • SpeechSynthesisAudioFormat.PCM_22050HZ_MONO_16BIT: PCM format, 22.05 kHz sample rate

    • SpeechSynthesisAudioFormat.PCM_24000HZ_MONO_16BIT: PCM format, 24 kHz sample rate

    • SpeechSynthesisAudioFormat.PCM_44100HZ_MONO_16BIT: PCM format, 44.1 kHz sample rate

    • SpeechSynthesisAudioFormat.PCM_48000HZ_MONO_16BIT: PCM format, 48 kHz sample rate

  • If the audio format is Opus, adjust the bitrate by using the bit_rate parameter. This is applicable only to DashScope 2.21.0 and later.

    • SpeechSynthesisAudioFormat.OGG_OPUS_8KHZ_MONO_32KBPS: Opus format, 8 kHz sample rate, 32 kbps bitrate

    • SpeechSynthesisAudioFormat.OGG_OPUS_16KHZ_MONO_16KBPS: Opus format, 16 kHz sample rate, 16 kbps bitrate

    • SpeechSynthesisAudioFormat.OGG_OPUS_16KHZ_MONO_32KBPS: Opus format, 16 kHz sample rate, 32 kbps bitrate

    • SpeechSynthesisAudioFormat.OGG_OPUS_16KHZ_MONO_64KBPS: Opus format, 16 kHz sample rate, 64 kbps bitrate

    • SpeechSynthesisAudioFormat.OGG_OPUS_24KHZ_MONO_16KBPS: Opus format, 24 kHz sample rate, 16 kbps bitrate

    • SpeechSynthesisAudioFormat.OGG_OPUS_24KHZ_MONO_32KBPS: Opus format, 24 kHz sample rate, 32 kbps bitrate

    • SpeechSynthesisAudioFormat.OGG_OPUS_24KHZ_MONO_64KBPS: Opus format, 24 kHz sample rate, 64 kbps bitrate

    • SpeechSynthesisAudioFormat.OGG_OPUS_48KHZ_MONO_16KBPS: Opus format, 48 kHz sample rate, 16 kbps bitrate

    • SpeechSynthesisAudioFormat.OGG_OPUS_48KHZ_MONO_32KBPS: Opus format, 48 kHz sample rate, 32 kbps bitrate

    • SpeechSynthesisAudioFormat.OGG_OPUS_48KHZ_MONO_64KBPS: Opus format, 48 kHz sample rate, 64 kbps bitrate

volume (int, optional)

The volume.

Default: 50.

Valid range: [0, 100]. The scale is linear: 0 is silent, 50 is the default, and 100 is the maximum.

speechRate (float, optional)

The speech rate.

Default value: 1.0.

Valid values: [0.5, 2.0]. A value of 1.0 is the standard speech rate. A value less than 1.0 slows down the speech, and a value greater than 1.0 speeds it up.

pitchRate (float, optional)

The pitch multiplier. The relationship to perceived pitch is neither linear nor logarithmic, so test to find suitable values.

Default value: 1.0.

Valid values: [0.5, 2.0]. A value of 1.0 is the natural pitch of the voice. A value greater than 1.0 raises the pitch, and a value less than 1.0 lowers it.

bit_rate (int, optional)

The audio bitrate in kbps. This parameter applies only when the audio format is Opus.

Default value: 32.

Valid values: [6, 510].

Note

Set the bit_rate parameter by using the parameter or parameters method of the SpeechSynthesisParam builder:

Set using the parameter method

SpeechSynthesisParam param = SpeechSynthesisParam.builder()
    .model("cosyvoice-v3-flash") // Model
    .voice("longanyang") // Voice
    .parameter("bit_rate", 32)
    .build();

Set using the parameters method

SpeechSynthesisParam param = SpeechSynthesisParam.builder()
    .model("cosyvoice-v3-flash") // Model
    .voice("longanyang") // Voice
    .parameters(Collections.singletonMap("bit_rate", 32))
    .build();

enableWordTimestamp

boolean

No

Specifies whether to enable word-level timestamps.

Default value: false.

  • true

  • false

This feature is available only for cloned voices of the cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 models, and for system voices that are marked as supported in the voice list.

Timestamp results are only available through the callback interface.

seed

int

No

The random seed used during generation. Different seeds produce different synthesis results. If the model, text, voice, and other parameters are identical, using the same seed reproduces the same output.

Default value: 0.

Valid values: [0, 65535].

languageHints

List

No

Specifies the target language for speech synthesis to improve the synthesis effect.

Use when pronunciation or synthesis quality is poor for numbers, abbreviations, symbols, or less common languages:

  • Numbers are not read aloud as expected. For example, "hello, this is 110" is read as "hello, this is one one zero" rather than "hello, this is yāo yāo líng".

  • The '@' symbol is mispronounced as 'ai te' instead of 'at'.

  • The synthesis quality for less common languages is poor and sounds unnatural.

Valid values:

  • zh: Chinese

  • en: English

  • fr: French

  • de: German

  • ja: Japanese

  • ko: Korean

  • ru: Russian

  • pt: Portuguese

  • th: Thai

  • id: Indonesian

  • vi: Vietnamese

Note: This parameter is an array, but the current version only processes the first element. Therefore, we recommend passing only one value.

Important

This parameter specifies the target language for speech synthesis. This setting is independent of the language of the sample audio used for voice cloning. To set the source language for a cloning task, see CosyVoice Voice Cloning/Design API.

instruction

String

No

Sets an instruction to control synthesis effects such as dialect, emotion, or speaking style. This feature is available only for cloned voices of the cosyvoice-v3.5-flash, cosyvoice-v3.5-plus, and cosyvoice-v3-flash models, and for system voices marked as supporting Instruct in the voice list.

Length limit: 100 characters.

A Chinese character (including simplified and traditional Chinese, Japanese Kanji, and Korean Hanja) is counted as two characters. All other characters, such as punctuation marks, letters, numbers, and Japanese/Korean Kana/Hangul, are counted as one character.
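The rule above can be approximated client-side to check the length of an instruction (or input text) before sending a request. The following self-contained sketch, with an illustrative class name, counts every Han-script code point as two characters and everything else as one; the server-side count remains authoritative.

```java
public class BillableLength {
    // Approximation of the documented rule: Han ideographs (simplified and
    // traditional Chinese, Japanese Kanji, Korean Hanja) count as 2; all other
    // characters (punctuation, letters, digits, Kana, Hangul) count as 1.
    static int count(String text) {
        int total = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            total += Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN ? 2 : 1;
            i += Character.charCount(cp);
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(count("中A文123")); // 8, matching the counting examples on this page
    }
}
```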

Usage requirements (vary by model):

  • v3.5-flash and v3.5-plus: Any natural language instruction to control emotion, speech rate, etc.

    Important

    cosyvoice-v3.5-flash and cosyvoice-v3.5-plus have no system voices. Only custom voices from voice design or voice cloning are supported.

    Instruction examples:

    Speak in a very excited and high-pitched tone, expressing the ecstasy and excitement of a great success.
    Please maintain a medium-slow speech rate, with an elegant and intellectual tone, giving a sense of calm and reassurance.
    The tone should be full of sorrow and nostalgia, with a slight nasal quality, as if narrating a heartbreaking story.
    Please try to speak in a breathy voice, with a very low volume, creating a sense of intimate and mysterious whispering.
    The tone should be very impatient and annoyed, with a faster speech rate and minimal pauses between sentences.
    Please imitate a kind and gentle elder, with a steady speech rate and a voice full of care and affection.
    The tone should be sarcastic and disdainful, with emphasis on keywords and a slightly rising intonation at the end of sentences.
    Please speak in an extremely fearful and trembling voice.
    The tone should be like a professional news anchor: calm, objective, and articulate, with a neutral emotion.
    The tone should be lively and playful, with a clear smile, making the voice sound energetic and sunny.
  • cosyvoice-v3-flash: The following requirements must be met:

    • Cloned voices: Use any natural language to control the speech synthesis effect.

      Instruction examples:

      Please speak in Cantonese. (Supported dialects: Cantonese, Northeastern, Gansu, Guizhou, Henan, Hubei, Jiangxi, Minnan, Ningxia, Shanxi, Shaanxi, Shandong, Shanghainese, Sichuan, Tianjin, Yunnan.)
      Please say a sentence as loudly as possible.
      Please say a sentence as slowly as possible.
      Please say a sentence as quickly as possible.
      Please say a sentence very softly.
      Can you speak a little slower?
      Can you speak very quickly?
      Can you speak very slowly?
      Can you speak a little faster?
      Please say a sentence very angrily.
      Please say a sentence very happily.
      Please say a sentence very fearfully.
      Please say a sentence very sadly.
      Please say a sentence with great surprise.
      Please try to sound as firm as possible.
      Please try to sound as angry as possible.
      Please try an approachable tone.
      Please speak in a cold tone.
      Please speak in a majestic tone.
      I want to experience a natural tone.
      I want to see how you express a threat.
      I want to see how you express wisdom.
      I want to see how you express seduction.
      I want to hear you speak in a lively way.
      I want to hear you speak with passion.
      I want to hear you speak in a steady manner.
      I want to hear you speak with confidence.
      Can you talk to me with excitement?
      Can you show an arrogant emotion?
      Can you show an elegant emotion?
      Can you answer the question happily?
      Can you give a gentle emotional demonstration?
      Can you talk to me in a calm tone?
      Can you answer me in a deep way?
      Can you talk to me with a gruff attitude?
      Tell me the answer in a sinister voice.
      Tell me the answer in a resilient voice.
      Narrate in a natural and friendly chat style.
      Speak in the tone of a radio drama podcaster.
    • System voices: The instruction must use a fixed format and content. For more information, see the voice list.

enable_aigc_tag

boolean

No

Specifies whether to add an invisible AIGC identifier to the generated audio. When set to true, an invisible identifier is embedded into the audio in supported formats (WAV, MP3, and Opus).

Default value: false.

Only cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 support this feature.

Note

Set the enable_aigc_tag parameter using the parameter method or the parameters method of the SpeechSynthesisParam instance:

Set using the parameter method

SpeechSynthesisParam param = SpeechSynthesisParam.builder()
    .model("cosyvoice-v3-flash") // Model
    .voice("longanyang") // Voice
    .parameter("enable_aigc_tag", true)
    .build();

Set using the parameters method

SpeechSynthesisParam param = SpeechSynthesisParam.builder()
    .model("cosyvoice-v3-flash") // Model
    .voice("longanyang") // Voice
    .parameters(Collections.singletonMap("enable_aigc_tag", true))
    .build();

aigc_propagator

String

No

Sets the ContentPropagator field in the invisible AIGC identifier to identify the content propagator. This parameter takes effect only when enable_aigc_tag is true.

Default value: Alibaba Cloud UID.

Only cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 support this feature.

Note

Set the aigc_propagator parameter using the parameter method or the parameters method of the SpeechSynthesisParam instance:

Set using the parameter method

SpeechSynthesisParam param = SpeechSynthesisParam.builder()
    .model("cosyvoice-v3-flash") // Model
    .voice("longanyang") // Voice
    .parameter("enable_aigc_tag", true)
    .parameter("aigc_propagator", "xxxx")
    .build();

Set using the parameters method

Map<String, Object> map = new HashMap<>();
map.put("enable_aigc_tag", true);
map.put("aigc_propagator", "xxxx");

SpeechSynthesisParam param = SpeechSynthesisParam.builder()
    .model("cosyvoice-v3-flash") // Model
    .voice("longanyang") // Voice
    .parameters(map)
    .build();

aigc_propagate_id

String

No

Sets the PropagateID field in the invisible AIGC identifier to uniquely identify a specific propagation behavior. This parameter takes effect only when enable_aigc_tag is true.

Default value: The request ID of the current speech synthesis request.

Only cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 support this feature.

Note

Set the aigc_propagate_id parameter using the parameter method or the parameters method of the SpeechSynthesisParam instance:

Set using the parameter method

SpeechSynthesisParam param = SpeechSynthesisParam.builder()
    .model("cosyvoice-v3-flash") // Model
    .voice("longanyang") // Voice
    .parameter("enable_aigc_tag", true)
    .parameter("aigc_propagate_id", "xxxx")
    .build();

Set using the parameters method

Map<String, Object> map = new HashMap<>();
map.put("enable_aigc_tag", true);
map.put("aigc_propagate_id", "xxxx");

SpeechSynthesisParam param = SpeechSynthesisParam.builder()
    .model("cosyvoice-v3-flash") // Model
    .voice("longanyang") // Voice
    .parameters(map)
    .build();

hotFix

ParamHotFix

No

Configuration for text hotpatching. Allows you to customize the pronunciation of specific words or replace text before synthesis. This feature is available only for cloned voices of cosyvoice-v3-flash.

Parameter description:

  • ParamHotFix.PronunciationItem: Customizes pronunciation. Specifies the pinyin annotation for a word to correct inaccurate default pronunciations.

  • ParamHotFix.ReplaceItem: Replaces text. Replaces a specified word with a target text before speech synthesis. The replaced text is used for the actual synthesis.

Example:

List<ParamHotFix.PronunciationItem> pronunciationItems = new ArrayList<>();
// Annotate 天气 (weather) with its pinyin, including tone numbers
pronunciationItems.add(new ParamHotFix.PronunciationItem("天气", "tian1 qi4"));

List<ParamHotFix.ReplaceItem> replaceItems = new ArrayList<>();
// Replace 今天 (today) with 金天 before synthesis
replaceItems.add(new ParamHotFix.ReplaceItem("今天", "金天"));

ParamHotFix paramHotFix = new ParamHotFix();
paramHotFix.setPronunciation(pronunciationItems);
paramHotFix.setReplace(replaceItems);

SpeechSynthesisParam param = SpeechSynthesisParam.builder()
    .model("cosyvoice-v3-flash") // Model
    .voice("your_voice") // Replace with a cloned voice for cosyvoice-v3-flash
    .hotFix(paramHotFix)
    .build();

enable_markdown_filter

boolean

No

Specifies whether to enable Markdown filtering. When enabled, the system automatically removes Markdown symbols from the input text before synthesizing speech, preventing them from being read aloud. This feature is available only for cloned voices of cosyvoice-v3-flash.

Default value: false.

Valid values:

  • true

  • false

Note

Set the enable_markdown_filter parameter using the parameter method or the parameters method of the SpeechSynthesisParam instance:

Set using the parameter method

SpeechSynthesisParam param = SpeechSynthesisParam.builder()
    .model("cosyvoice-v3-flash") // Model
    .voice("your_voice") // Replace with a cloned voice for cosyvoice-v3-flash
    .parameter("enable_markdown_filter", true)
    .build();

Set using the parameters method

Map<String, Object> map = new HashMap<>();
map.put("enable_markdown_filter", true);

SpeechSynthesisParam param = SpeechSynthesisParam.builder()
    .model("cosyvoice-v3-flash") // Model
    .voice("your_voice") // Replace with a cloned voice for cosyvoice-v3-flash
    .parameters(map)
    .build();

Key interfaces

SpeechSynthesizer class

Import the SpeechSynthesizer class using import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;. It provides the key interfaces for speech synthesis.

Interface/Method

Parameter

Return value

Description

public SpeechSynthesizer(SpeechSynthesisParam param, ResultCallback<SpeechSynthesisResult> callback)

SpeechSynthesizer instance

Constructor.

  • Streaming calls (unidirectional or bidirectional): Set callback to the ResultCallback interface.

  • Non-streaming or Flowable calls: Set callback to null.

public ByteBuffer call(String text)

text: The text to be synthesized (UTF-8).

ByteBuffer or null

Converts text (plain or with SSML) to speech.

The return value depends on how the SpeechSynthesizer instance was created: if callback is null, the method blocks and returns the complete audio as a ByteBuffer; if a ResultCallback is set, the method returns null and the audio is delivered through the callback.

Important

Before each call to the call method, you must re-initialize the SpeechSynthesizer instance.

public void streamingCall(String text)

text: The text to be synthesized (UTF-8).

None

Sends text as a stream. SSML not supported.

Call this interface multiple times to send the text to the server in multiple parts. The synthesis result is returned through the onEvent method of the ResultCallback interface.

For a detailed call flow and reference example, see Bidirectional streaming call.

public void streamingComplete() throws RuntimeException

None

None

Ends the streaming speech synthesis.

This method blocks until one of the following conditions occurs:

  • The server completes synthesis (success).

  • The session is interrupted (failure).

  • The 10-minute timeout is reached (auto-unblock).

For a detailed call flow and reference example, see Bidirectional streaming call.

Important

When making a bidirectional streaming call, make sure to call this method to avoid missing parts of the synthesized speech.

public Flowable<SpeechSynthesisResult> callAsFlowable(String text)

text: The text to be synthesized, in UTF-8 format.

The synthesis result, encapsulated in Flowable<SpeechSynthesisResult>.

Converts non-streaming text input (text containing SSML is not supported) into a streaming speech output in real time. The synthesis result is returned in a stream within the Flowable object.

For a detailed call flow and reference example, see Call using Flowable.

boolean getDuplexApi().close(int code, String reason)

code: WebSocket Close Code

reason: Reason for closing

For information about how to configure these parameters, see RFC 6455 (The WebSocket Protocol).

true

After a task is complete, you must close the WebSocket connection regardless of whether an exception occurred. This prevents connection leaks. For information about how to reuse connections to improve efficiency, see High-concurrency scenarios.

public Flowable<SpeechSynthesisResult> streamingCallAsFlowable(Flowable<String> textStream)

textStream: A Flowable instance that encapsulates the text to be synthesized.

The synthesis result, encapsulated in Flowable<SpeechSynthesisResult>.

Converts streaming text input (text containing SSML is not supported) into a streaming speech output in real time. The synthesis result is returned in a stream within the Flowable object.

For a detailed call flow and reference example, see Call using Flowable.

public String getLastRequestId()

None

The request ID of the previous task.

Gets the request ID of the previous task. Use this after starting a new task by calling call, streamingCall, callAsFlowable, or streamingCallAsFlowable.

public long getFirstPackageDelay()

None

The first-packet latency of the current task.

Returns the first-packet latency in milliseconds, that is, the time from sending the text to receiving the first audio packet. Call this method after the task completes.

Factors affecting first-packet latency:

  • Time to establish the WebSocket connection (on the first call)

  • Voice loading time (varies by voice)

  • Service load (queuing may occur during peak hours)

  • Network latency

Typical latency:

  • Reusing a connection with the voice already loaded: about 500 ms

  • First connection or switching voices: may reach 1,500 to 2,000 ms

If latency consistently exceeds 2,000 ms:

  1. Use the connection pool feature for high-concurrency scenarios to establish connections in advance.

  2. Check the quality of your network connection.

  3. Avoid making calls during peak hours.

ResultCallback interface

For streaming calls (unidirectional or bidirectional), get results via ResultCallback. Import: import com.alibaba.dashscope.common.ResultCallback;.

Example:

ResultCallback<SpeechSynthesisResult> callback = new ResultCallback<SpeechSynthesisResult>() {
    @Override
    public void onEvent(SpeechSynthesisResult result) {
        System.out.println("Request ID: " + result.getRequestId());
        // Process audio chunks in real time (e.g., play or write to a buffer).
    }

    @Override
    public void onComplete() {
        System.out.println("Task complete");
        // Handle synthesis completion logic (e.g., release the player).
    }

    @Override
    public void onError(Exception e) {
        System.out.println("Task failed: " + e.getMessage());
        // Handle exceptions (network errors or server-side error codes).
    }
};

Interface/Method

Parameter

Return value

Description

public void onEvent(SpeechSynthesisResult result)

result: A SpeechSynthesisResult instance.

None

Called when the server pushes audio data.

Call the getAudioFrame method to retrieve the binary audio.

Call the getUsage method of SpeechSynthesisResult to get the number of billable characters in the current request so far.

public void onComplete()

None

None

Called asynchronously after all synthesis data has been returned and speech synthesis is complete.

public void onError(Exception e)

e: Exception information.

None

Called asynchronously when an exception occurs.

We recommend implementing complete exception logging and resource cleanup logic in the onError method.

Response

The server returns binary audio data:

  • Non-streaming call: Process the binary audio data returned by the call method of the SpeechSynthesizer class.

  • Unidirectional streaming call or bidirectional streaming call: Process the parameter (of type SpeechSynthesisResult) of the onEvent method of the ResultCallback interface.

    The key interfaces of SpeechSynthesisResult are as follows:

    Interface/Method

    Parameter

    Return value

    Description

    public ByteBuffer getAudioFrame()

    None

    Binary audio data

    Returns the binary audio for the current segment (may be empty if no new data is available).

    Combine segments into a complete file or stream to a compatible player.

    Important
    • In streaming speech synthesis, for compressed formats such as MP3 and Opus, the segmented audio data must be played using a streaming player. Do not play it frame by frame, as this causes decoding to fail.

      Streaming players include FFmpeg, PyAudio (Python), AudioFormat (Java), and MediaSource (JavaScript).
    • When combining audio data into a complete audio file, write to the same file in append mode.

    • For WAV and MP3 audio from streaming speech synthesis, only the first frame contains header information. Subsequent frames contain only audio data.
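    The append-mode rule above can be sketched as follows; the appendFrame helper is illustrative (not part of the SDK) and would be called with each ByteBuffer returned by getAudioFrame:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class AppendAudio {
    // Write one audio frame to the end of the same file and return the new
    // file size. Opening in APPEND mode ensures the frames are concatenated.
    static long appendFrame(Path path, ByteBuffer frame) throws IOException {
        try (FileChannel ch = FileChannel.open(path,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
            ch.write(frame);
            return ch.size();
        }
    }

    public static void main(String[] args) throws IOException {
        Path out = Files.createTempFile("tts", ".mp3");
        appendFrame(out, ByteBuffer.wrap(new byte[]{1, 2}));
        System.out.println(appendFrame(out, ByteBuffer.wrap(new byte[]{3, 4}))); // 4
    }
}
```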

    public String getRequestId()

    None

    The request ID of the task.

    Gets the request ID of the task. For result events that carry binary audio (retrieved through getAudioFrame), getRequestId returns null.

    public SpeechSynthesisUsage getUsage()

    None

    SpeechSynthesisUsage: The number of billable characters in the current request so far.

    Returns SpeechSynthesisUsage or null.

    The getCharacters method of SpeechSynthesisUsage returns the number of billable characters in the current request so far. Use the last received SpeechSynthesisUsage as the final value.

    public Sentence getTimestamp()

    None

    Sentence: The timestamp information for the sentences processed so far in the current request.

    Requires word-level timestamps to be enabled through the enableWordTimestamp parameter.

    Returns Sentence or null.

    Methods of Sentence:

    • getIndex: Gets the sentence number, starting from 0.

    • getWords: Gets the character array List<Word> that makes up the sentence. Use the last received Sentence as the final value.

    Methods of Word:

    • getText: Gets the text of the character.

    • getBeginIndex: Gets the starting position index of the character in the sentence, starting from 0.

    • getEndIndex: Gets the ending position index of the character in the sentence, starting from 1.

    • getBeginTime: Gets the start timestamp of the audio corresponding to the character, in milliseconds.

    • getEndTime: Gets the end timestamp of the audio corresponding to the character, in milliseconds.

Error codes

If an error occurs, see Error messages for troubleshooting.

More examples

For more examples, see GitHub.

FAQ

Features, billing, and rate limiting

Q: What can I do if the pronunciation is inaccurate?

Use SSML to fix pronunciation.

Q: Speech synthesis is billed by character count. How do I check the text length for each synthesis request?

Call the getUsage method of SpeechSynthesisResult. Its getCharacters method returns the number of billable characters in the current request so far; use the last received value as the final count.

Troubleshooting

If an error occurs while running your code, see Error codes for troubleshooting.

Q: How do I get the request ID?

Get it in one of the following ways:

  • Call the getLastRequestId method of the SpeechSynthesizer instance after a task has started.

  • In streaming calls, call the getRequestId method of the SpeechSynthesisResult instance received by the onEvent callback.

Q: Why does the SSML feature fail?

Troubleshooting:

  1. Verify limits and constraints.

  2. Install the latest version of the DashScope SDK.

  3. Make sure you are using the correct interface: only the call method of the SpeechSynthesizer class supports SSML.

  4. Make sure the text to be synthesized is in plain text format and meets the format requirements. For more information, see Introduction to the SSML markup language.

Q: Why does the audio duration of TTS speech synthesis differ from the WAV file's displayed duration? For example, a WAV file shows 7 seconds but the actual audio is less than 5 seconds?

TTS uses a streaming synthesis mechanism, which means it synthesizes and returns data progressively. As a result, the WAV file header contains an estimated value, which may have some margin of error. If you require precise duration, you can set the format to PCM and manually add the WAV header information after obtaining the complete synthesis result. This will give you a more accurate duration.
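The PCM-plus-header approach can be sketched with the following self-contained helper (illustrative, not part of the SDK), which builds the canonical 44-byte WAV header for 16-bit mono PCM; prepending it to the complete PCM result yields a file whose reported duration matches the audio exactly:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

public class WavHeader {
    // Canonical 44-byte RIFF/WAVE header for 16-bit mono PCM.
    static byte[] header(int pcmBytes, int sampleRate) {
        ByteBuffer b = ByteBuffer.allocate(44).order(ByteOrder.LITTLE_ENDIAN);
        b.put("RIFF".getBytes(StandardCharsets.US_ASCII));
        b.putInt(36 + pcmBytes);                  // remaining RIFF chunk size
        b.put("WAVE".getBytes(StandardCharsets.US_ASCII));
        b.put("fmt ".getBytes(StandardCharsets.US_ASCII));
        b.putInt(16);                             // fmt chunk size for PCM
        b.putShort((short) 1);                    // audio format: PCM
        b.putShort((short) 1);                    // channels: mono
        b.putInt(sampleRate);                     // sample rate
        b.putInt(sampleRate * 2);                 // byte rate = rate * channels * 2
        b.putShort((short) 2);                    // block align
        b.putShort((short) 16);                   // bits per sample
        b.put("data".getBytes(StandardCharsets.US_ASCII));
        b.putInt(pcmBytes);                       // data chunk size
        return b.array();
    }

    public static void main(String[] args) {
        byte[] h = header(22050 * 2, 22050); // header for 1 second of 22.05 kHz audio
        System.out.println(h.length); // 44
    }
}
```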

Q: Why can't the audio be played?

Check the following scenarios one by one:

  1. The audio is saved as a complete file (such as xx.mp3).

    1. Format consistency: Verify that the requested audio format matches the file extension (for example, save WAV audio with a .wav extension, not .mp3).

    2. Player compatibility: Verify that your player supports the format and sample rate of the audio file. Some players may not support high sample rates or specific audio encodings.

  2. The audio is played in a stream.

    1. Save the audio stream as a complete file and try to play it with a player. If the file cannot be played, see the troubleshooting method for scenario 1.

    2. If the file plays normally, the problem may be with your streaming playback implementation. Verify that your player supports streaming playback.

      Common tools and libraries that support streaming playback include FFmpeg, PyAudio (Python), AudioFormat (Java), and MediaSource (JavaScript).

Q: Why does the audio playback stutter?

Check the following scenarios one by one:

  1. Check the text sending speed: Make sure the interval between text segments is reasonable. Avoid situations where the next segment is not sent promptly after the previous audio segment finishes playing.

  2. Check the callback function performance:

    • Avoid heavy business logic in the callback function—it can cause blocking.

    • Callbacks run in the WebSocket thread. Blocking prevents timely packet reception and causes audio playback to stutter.

    • We recommend writing audio data to a separate buffer and processing it in another thread to avoid blocking the WebSocket thread.

  3. Check network stability: Ensure your network connection is stable to avoid audio transmission interruptions or delays caused by network fluctuations.
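The buffer-and-worker pattern from step 2 can be sketched with a BlockingQueue (class and method names here are illustrative): the callback thread only enqueues frames, while a separate thread drains and processes them.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class AudioBuffer {
    private final BlockingQueue<ByteBuffer> queue = new LinkedBlockingQueue<>();
    private static final ByteBuffer POISON = ByteBuffer.allocate(0); // end-of-stream marker

    // Called from onEvent: enqueue and return immediately, never block the WebSocket thread.
    void offerFrame(ByteBuffer frame) { queue.add(frame); }

    // Called from onComplete/onError so the worker can finish.
    void finish() { queue.add(POISON); }

    // Runs on a separate thread: drain frames and process them (play or write).
    int drain() throws InterruptedException {
        int frames = 0;
        for (ByteBuffer f = queue.take(); f != POISON; f = queue.take()) {
            frames++; // replace with real processing, e.g. writing to a player
        }
        return frames;
    }

    public static void main(String[] args) throws Exception {
        AudioBuffer buf = new AudioBuffer();
        Thread worker = new Thread(() -> {
            try { System.out.println("frames processed: " + buf.drain()); }
            catch (InterruptedException ignored) { }
        });
        worker.start();
        buf.offerFrame(ByteBuffer.allocate(320)); // as if from onEvent
        buf.offerFrame(ByteBuffer.allocate(320));
        buf.finish();                             // as if from onComplete
        worker.join();
    }
}
```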

Q: Why does speech synthesis take a long time?

Follow these steps to troubleshoot:

  1. Check the input interval.

    If you are using streaming speech synthesis, verify whether the interval between sending text segments is too long (for example, a delay of several seconds). A long interval increases the total synthesis time.

  2. Analyze performance metrics.

    • First-packet latency: Normally around 500 ms.

    • RTF (RTF = Total synthesis time / Audio duration): Normally less than 1.0.
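As an illustration of the RTF formula (the helper is hypothetical): for 16-bit mono PCM, the audio duration in seconds is pcmBytes / (sampleRate * 2), so RTF can be computed from the measured synthesis time and the number of PCM bytes received.

```java
public class SynthesisMetrics {
    // RTF = total synthesis time / audio duration.
    // For 16-bit (2-byte) mono PCM: duration (s) = pcmBytes / (sampleRate * 2).
    static double rtf(double synthesisMillis, int pcmBytes, int sampleRate) {
        double audioSeconds = pcmBytes / (sampleRate * 2.0);
        return (synthesisMillis / 1000.0) / audioSeconds;
    }

    public static void main(String[] args) {
        // 2 s of 22.05 kHz audio synthesized in 1.1 s -> RTF 0.55, healthy (< 1.0)
        System.out.println(rtf(1100, 22050 * 2 * 2, 22050));
    }
}
```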

Q: How do I handle incorrect pronunciation in the synthesized speech?

Use the <phoneme> tag of SSML to specify the correct pronunciation.

Q: Why is some text at the end not converted to speech, or why is no speech returned?

Check whether you have called the streamingComplete method of the SpeechSynthesizer class. During speech synthesis, the server begins synthesizing only after caching enough text. If you do not call streamingComplete, text remaining in the buffer may not be synthesized.

Permissions and authentication

Q: How can I restrict my API key to the CosyVoice speech synthesis service only (permission isolation)?

Create a workspace and grant authorization only to specific models to limit the API key scope. For more information, see Manage workspaces.

More questions

See the QA on GitHub.