The parameters and key interfaces of the CosyVoice speech synthesis Java SDK.
User guide: For model overviews and selection suggestions, see Real-time speech synthesis - CosyVoice.
Prerequisites
-
You have activated the Model Studio and created an API key. Export it as an environment variable (not hard-coded) to prevent security risks.
Note: For temporary access or strict control over high-risk operations (such as accessing or deleting sensitive data), use a temporary authentication token instead.
Compared with long-term API keys, temporary tokens are more secure (60-second lifespan) and reduce API key leakage risk.
To use a temporary token, replace the API key used for authentication in your code with the temporary authentication token.
Models and pricing
Text and format limitations
Text length limits
-
Non-streaming call, unidirectional streaming call, or Flowable unidirectional streaming call: The maximum is 20,000 characters per request.
-
Bidirectional streaming call or Flowable bidirectional streaming call: The maximum is 20,000 characters per request, with a total limit of 200,000 characters across all requests.
Character counting rules
-
Chinese characters (simplified/traditional Chinese, Japanese Kanji, Korean Hanja) count as two characters. All other characters (punctuation, letters, numbers, Kana, Hangul) count as one.
-
SSML tags are not included when calculating the text length.
-
Examples:
-
"你好" → 2 (Chinese character) + 2 (Chinese character) = 4 characters
-
"中A文123" → 2 (Chinese character) + 1 (A) + 2 (Chinese character) + 1 (1) + 1 (2) + 1 (3) = 8 characters
-
"中文。" → 2 (Chinese character) + 2 (Chinese character) + 1 (。) = 5 characters
-
"中 文。" → 2 (Chinese character) + 1 (space) + 2 (Chinese character) + 1 (。) = 6 characters
-
"<speak>你好</speak>" → 2 (Chinese character) + 2 (Chinese character) = 4 characters (the SSML tags are not counted)
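The counting rules above can be sketched in Java. The Unicode ranges used below to detect "Chinese characters" (the CJK ideograph blocks) are our approximation for illustration, not the service's exact definition:

```java
import java.util.regex.Pattern;

public class CharCounter {
    private static final Pattern SSML_TAG = Pattern.compile("<[^>]+>");

    /** Counts billable characters: CJK ideographs count as 2, everything else as 1; SSML tags are ignored. */
    public static int countBillableChars(String text) {
        String stripped = SSML_TAG.matcher(text).replaceAll("");
        int count = 0;
        for (int i = 0; i < stripped.length(); ) {
            int cp = stripped.codePointAt(i);
            // CJK Unified Ideographs, Extension A, Compatibility Ideographs, and the
            // supplementary ideograph planes count as two characters each.
            boolean isIdeograph = (cp >= 0x4E00 && cp <= 0x9FFF)
                    || (cp >= 0x3400 && cp <= 0x4DBF)
                    || (cp >= 0xF900 && cp <= 0xFAFF)
                    || (cp >= 0x20000 && cp <= 0x2FA1F);
            count += isIdeograph ? 2 : 1;
            i += Character.charCount(cp); // advance by one code point, not one char
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countBillableChars("你好"));                // 4
        System.out.println(countBillableChars("中A文123"));            // 8
        System.out.println(countBillableChars("<speak>你好</speak>")); // 4
    }
}
```

This mirrors the worked examples: punctuation such as 。 falls outside the ideograph ranges and counts as one character.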
Encoding format
Use UTF-8 encoding.
Support for mathematical expressions
Mathematical expression parsing (v3.5-flash, v3.5-plus, v3-flash, v3-plus, v2 only): Supports primary and secondary school math—basic operations, algebra, geometry.
This feature only supports Chinese.
See Convert LaTeX formulas to speech (Chinese language only).
SSML support
SSML is available for custom voices (voice design or cloning) with v3.5-flash, v3.5-plus, v3-flash, v3-plus, and v2, and for system voices marked as supported in the voice list. Requirements:
-
DashScope SDK 2.20.3 or later.
-
Only non-streaming calls and unidirectional streaming calls (that is, the call method of the SpeechSynthesizer class) are supported. Bidirectional streaming calls (that is, the streamingCall method of the SpeechSynthesizer class) and Flowable calls are not supported.
-
The usage is the same as for normal speech synthesis: pass the text containing SSML to the call method of the SpeechSynthesizer class.
Getting started
The SpeechSynthesizer class provides key interfaces for speech synthesis and supports the following call methods:
-
Non-streaming: A blocking call that sends the full text at once and returns the complete audio. Suitable for short text.
-
Unidirectional streaming: A non-blocking call that sends the full text at once and receives audio via callback. Suitable for short text with low latency.
-
Bidirectional streaming: A non-blocking call that sends text fragments incrementally and receives audio via callback in real time. Suitable for long text with low latency.
Non-streaming call
Submits a synthesis task synchronously and returns the complete result.
Instantiate the SpeechSynthesizer class, bind the request parameters, and call the call method to synthesize and get the binary audio data.
The length of the sent text cannot exceed 20,000 characters. For more information, see the call method of the SpeechSynthesizer class.
Before each call to the call method, you must re-initialize the SpeechSynthesizer instance.
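The flow above might look like the following sketch. The model and voice names are placeholders to replace with values from the voice list, and the builder and method names follow the DashScope Java SDK conventions described in this document:

```java
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisParam;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;

import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;

public class NonStreamingDemo {
    public static void main(String[] args) throws IOException {
        SpeechSynthesisParam param = SpeechSynthesisParam.builder()
                .apiKey(System.getenv("DASHSCOPE_API_KEY")) // read the key from an environment variable
                .model("cosyvoice-v2")       // placeholder model; pick one compatible with your voice
                .voice("longxiaochun_v2")    // placeholder voice; replace with a voice from the voice list
                .build();

        // No ResultCallback is bound, so call blocks and returns the complete audio.
        SpeechSynthesizer synthesizer = new SpeechSynthesizer(param, null);
        ByteBuffer audio = synthesizer.call("你好，欢迎使用语音合成服务。");

        try (FileOutputStream fos = new FileOutputStream("output.mp3")) {
            byte[] bytes = new byte[audio.remaining()];
            audio.get(bytes);
            fos.write(bytes);
        }
    }
}
```

Remember that a SpeechSynthesizer instance must be re-created before each subsequent call.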
Unidirectional streaming call
Submits a synthesis task asynchronously and receives audio incrementally via ResultCallback.
Instantiate the SpeechSynthesizer class, bind the request parameters and the ResultCallback interface, and call the call method to synthesize. Get the synthesis result in real time through the onEvent method of the ResultCallback interface.
The length of the sent text cannot exceed 20,000 characters. For more information, see the call method of the SpeechSynthesizer class.
Before each call to the call method, you must re-initialize the SpeechSynthesizer instance.
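A hedged sketch of this pattern follows. The model and voice names are placeholders, and a CountDownLatch is used here only to keep the demo process alive until onComplete fires:

```java
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisParam;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisResult;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;
import com.alibaba.dashscope.common.ResultCallback;

import java.io.FileOutputStream;
import java.io.IOException;
import java.util.concurrent.CountDownLatch;

public class UnidirectionalStreamingDemo {
    public static void main(String[] args) throws IOException, InterruptedException {
        CountDownLatch done = new CountDownLatch(1);
        FileOutputStream fos = new FileOutputStream("output.mp3");

        ResultCallback<SpeechSynthesisResult> callback = new ResultCallback<SpeechSynthesisResult>() {
            @Override
            public void onEvent(SpeechSynthesisResult result) {
                // A message may carry no new audio; check before writing.
                if (result.getAudioFrame() != null) {
                    byte[] bytes = new byte[result.getAudioFrame().remaining()];
                    result.getAudioFrame().get(bytes);
                    try {
                        fos.write(bytes); // append frames in arrival order
                    } catch (IOException e) {
                        throw new RuntimeException(e);
                    }
                }
            }

            @Override
            public void onComplete() { done.countDown(); }

            @Override
            public void onError(Exception e) {
                e.printStackTrace();
                done.countDown();
            }
        };

        SpeechSynthesisParam param = SpeechSynthesisParam.builder()
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("cosyvoice-v2")      // placeholder model
                .voice("longxiaochun_v2")   // placeholder voice from the voice list
                .build();

        // With a ResultCallback bound, call returns immediately; audio arrives via onEvent.
        new SpeechSynthesizer(param, callback).call("你好，欢迎使用语音合成服务。");
        done.await();
        fos.close();
    }
}
```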
Bidirectional streaming call
Send text in multiple chunks and receive audio data incrementally through a registered ResultCallback callback.
-
For streaming input, call streamingCall multiple times to submit text fragments in order. After the server receives the text fragments, it automatically segments them into sentences:
-
Complete sentences are synthesized immediately.
-
Incomplete sentences are buffered and synthesized after they become complete.
When you call streamingComplete, the server forcibly synthesizes all received but unprocessed text fragments, including incomplete sentences.
-
The interval between sending text fragments cannot exceed 23 seconds; otherwise, a timeout exception occurs. Call the streamingComplete method promptly when there is no more text to send. The server enforces a 23-second timeout, and this configuration cannot be modified on the client.
-
Instantiate the SpeechSynthesizer class
Instantiate the SpeechSynthesizer class, and bind the request parameters and the ResultCallback interface.
-
Streaming
Call the streamingCall method of the SpeechSynthesizer class multiple times to submit the text for synthesis in chunks. This sends the text to the server in segments. While you are sending the text, the server returns the synthesis result to the client in real time through the onEvent method of the ResultCallback interface. The length of the text fragment sent in each call to the streamingCall method (the text parameter) cannot exceed 20,000 characters, and the total length of all sent text cannot exceed 200,000 characters.
-
End processing
Call the streamingComplete method of the SpeechSynthesizer class to end the speech synthesis. This method blocks the current thread until the onComplete or onError callback of the ResultCallback interface is triggered; then the thread is unblocked. Ensure that you call this method. Otherwise, text at the end may not be converted to speech.
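The three steps above can be sketched as follows. The model and voice names are placeholders, and the playback logic inside onEvent is omitted:

```java
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisParam;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisResult;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;
import com.alibaba.dashscope.common.ResultCallback;

public class BidirectionalStreamingDemo {
    public static void main(String[] args) {
        // 1. Instantiate with request parameters and a ResultCallback.
        ResultCallback<SpeechSynthesisResult> callback = new ResultCallback<SpeechSynthesisResult>() {
            @Override public void onEvent(SpeechSynthesisResult r) {
                if (r.getAudioFrame() != null) {
                    // Feed each frame to a streaming player or append it to a file.
                }
            }
            @Override public void onComplete() { System.out.println("synthesis finished"); }
            @Override public void onError(Exception e) { e.printStackTrace(); }
        };

        SpeechSynthesisParam param = SpeechSynthesisParam.builder()
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("cosyvoice-v2")     // placeholder model
                .voice("longxiaochun_v2")  // placeholder voice
                .build();

        SpeechSynthesizer synthesizer = new SpeechSynthesizer(param, callback);

        // 2. Streaming: submit fragments in order; keep intervals well under 23 seconds.
        for (String fragment : new String[] {"今天天气", "真不错，", "我们出去", "散步吧。"}) {
            synthesizer.streamingCall(fragment);
        }

        // 3. End processing: blocks until onComplete or onError fires.
        synthesizer.streamingComplete();
    }
}
```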
Call using Flowable
Flowable is the backpressure-aware reactive stream type provided by the open source RxJava library (Apache 2.0 license). See the Flowable API details.
Before using Flowable, make sure you have integrated the RxJava library and understand the basic concepts of reactive programming.
Unidirectional streaming call
The following example shows how to use the blockingForEach interface of a Flowable object to block and get the SpeechSynthesisResult data returned from each stream.
The complete synthesis result is also available through the getAudioData method of the SpeechSynthesizer class after all the streaming data from Flowable has been returned.
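A hedged sketch of this pattern, assuming the Flowable-returning variant is named callAsFlowable and that getAudioData returns the accumulated audio as a ByteBuffer (names may differ by SDK version):

```java
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisParam;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;

import java.nio.ByteBuffer;

public class FlowableUnidirectionalDemo {
    public static void main(String[] args) {
        SpeechSynthesisParam param = SpeechSynthesisParam.builder()
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("cosyvoice-v2")     // placeholder model
                .voice("longxiaochun_v2")  // placeholder voice
                .build();

        SpeechSynthesizer synthesizer = new SpeechSynthesizer(param, null);

        // blockingForEach blocks the current thread and handles each
        // SpeechSynthesisResult as the stream emits it.
        synthesizer.callAsFlowable("你好，欢迎使用语音合成服务。")
                .blockingForEach(result -> {
                    if (result.getAudioFrame() != null) {
                        System.out.println("received " + result.getAudioFrame().remaining() + " bytes");
                    }
                });

        // After the stream completes, the full result is also available in one piece.
        ByteBuffer fullAudio = synthesizer.getAudioData();
        System.out.println("total " + fullAudio.remaining() + " bytes");
    }
}
```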
Bidirectional streaming call
The following example shows how to use a Flowable object as an input parameter to input a text stream. It also shows how to use a Flowable object as a return value and use the blockingForEach interface to block and get the SpeechSynthesisResult data returned from each stream.
The complete synthesis result is also available through the getAudioData method of the SpeechSynthesizer class after all the streaming data from Flowable has been returned.
High-concurrency calls
The DashScope Java SDK uses OkHttp3's connection pool technology to reduce the overhead of repeatedly establishing connections. For more information, see High-concurrency scenarios.
Request parameters
Use the chained methods of SpeechSynthesisParam to configure parameters such as the model and voice. Pass the configured parameter object to the constructor of the SpeechSynthesizer class.
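For example, a typical configuration sketch (the voice name is a placeholder; parameter and parameters are the generic setters referred to in the Note columns of the table, and their exact signatures may vary by SDK version):

```java
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisParam;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;

public class ParamConfigDemo {
    public static void main(String[] args) {
        SpeechSynthesisParam param = SpeechSynthesisParam.builder()
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("cosyvoice-v2")      // required: model
                .voice("longxiaochun_v2")   // required: voice (placeholder name)
                .volume(50)                 // optional: [0, 100]
                .speechRate(1.0f)           // optional: [0.5, 2.0]
                .pitchRate(1.0f)            // optional: [0.5, 2.0]
                .build();

        // Parameters without a dedicated chained method (for example bit_rate for Opus)
        // are set through the generic parameter setter.
        param.parameter("bit_rate", 32);

        // Pass the configured parameter object to the SpeechSynthesizer constructor.
        SpeechSynthesizer synthesizer = new SpeechSynthesizer(param, null);
    }
}
```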
|
Parameter |
Type |
Required |
Description |
|
model |
String |
Yes |
Speech synthesis model. Each model version requires compatible voices:
|
|
voice |
String |
Yes |
The voice used for speech synthesis. Supported voice types:
|
|
format |
enum |
No |
The audio encoding format and sample rate. The default is MP3 format at 22.05 kHz sample rate. Note
The default sample rate represents the optimal rate for the selected voice. Output uses this rate by default, but downsampling and upsampling are supported. The following audio encoding formats and sample rates are supported:
|
|
volume |
int |
No |
The volume. Default: 50. Valid range: [0, 100]. Values scale linearly—0 is silent, 50 is default, 100 is maximum. |
|
speechRate |
float |
No |
The speech rate. Default value: 1.0. Valid values: [0.5, 2.0]. A value of 1.0 is the standard speech rate. A value less than 1.0 slows down the speech, and a value greater than 1.0 speeds it up. |
|
pitchRate |
float |
No |
Pitch multiplier. The relationship to perceived pitch is neither linear nor logarithmic—test to find suitable values. Default value: 1.0. Valid values: [0.5, 2.0]. A value of 1.0 is the natural pitch of the voice. A value greater than 1.0 raises the pitch, and a value less than 1.0 lowers it. |
|
bit_rate |
int |
No |
The audio bitrate in kbps. This parameter takes effect only when the audio format is Opus. Default value: 32. Valid values: [6, 510]. Note
Set this parameter by using the parameter or parameters method, because it has no dedicated chained method.
|
|
enableWordTimestamp |
boolean |
No |
Specifies whether to enable word-level timestamps. Default value: false.
This feature is available only for cloned voices of the cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 models, and for system voices that are marked as supported in the voice list. Timestamp results are only available through the callback interface. |
|
seed |
int |
No |
The random seed used during generation. Different seeds produce different synthesis results. If the model, text, voice, and other parameters are identical, using the same seed reproduces the same output. Default value: 0. Valid values: [0, 65535]. |
|
languageHints |
List<String> |
No |
Specifies the target language for speech synthesis to improve the synthesis effect. Use when pronunciation or synthesis quality is poor for numbers, abbreviations, symbols, or less common languages:
Valid values:
Note: This parameter is an array, but the current version only processes the first element. Therefore, we recommend passing only one value. Important
This parameter specifies the target language for speech synthesis. This setting is independent of the language of the sample audio used for voice cloning. To set the source language for a cloning task, see CosyVoice Voice Cloning/Design API. |
|
instruction |
String |
No |
Sets an instruction to control synthesis effects such as dialect, emotion, or speaking style. This feature is available only for cloned voices of the cosyvoice-v3.5-flash, cosyvoice-v3.5-plus, and cosyvoice-v3-flash models, and for system voices marked as supporting Instruct in the voice list. Length limit: 100 characters. A Chinese character (including simplified and traditional Chinese, Japanese Kanji, and Korean Hanja) is counted as two characters. All other characters, such as punctuation marks, letters, numbers, and Japanese/Korean Kana/Hangul, are counted as one character. Usage requirements (vary by model):
|
|
enable_aigc_tag |
boolean |
No |
Specifies whether to add an invisible AIGC identifier to the generated audio. When set to true, an invisible identifier is embedded into the audio in supported formats (WAV, MP3, and Opus). Default value: false. Only cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 support this feature. Note
Set this parameter by using the parameter or parameters method, because it has no dedicated chained method.
|
|
aigc_propagator |
String |
No |
Sets the propagator information recorded in the AIGC identifier. Default value: your Alibaba Cloud UID. Only cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 support this feature. Note
Set this parameter by using the parameter or parameters method, because it has no dedicated chained method.
|
|
aigc_propagate_id |
String |
No |
Sets the propagation ID recorded in the AIGC identifier. Default value: the request ID of the current speech synthesis request. Only cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 support this feature. Note
Set this parameter by using the parameter or parameters method, because it has no dedicated chained method.
|
|
hotFix |
ParamHotFix |
No |
Configuration for text hotpatching. Allows you to customize the pronunciation of specific words or replace text before synthesis. This feature is available only for cloned voices of cosyvoice-v3-flash. Parameter description:
Example:
|
|
enable_markdown_filter |
boolean |
No |
Specifies whether to enable Markdown filtering. When enabled, the system automatically removes Markdown symbols from the input text before synthesizing speech, preventing them from being read aloud. This feature is available only for cloned voices of cosyvoice-v3-flash. Default value: false. Valid values:
Note
Set this parameter by using the parameter or parameters method, because it has no dedicated chained method.
|
Key interfaces
SpeechSynthesizer class
Import the SpeechSynthesizer class using import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;. It provides the key interfaces for speech synthesis.
|
Interface/Method |
Parameter |
Return value |
Description |
|
|
|
Constructor.
|
|
|
|
Converts text (plain or with SSML) to speech. When you create a SpeechSynthesizer instance without a ResultCallback, this method blocks and returns the complete audio. When you create the instance with a ResultCallback, this method returns immediately and the synthesis result is returned through the callback.
Important
Before each call to the call method, you must re-initialize the SpeechSynthesizer instance. |
|
|
None |
Sends text as a stream. SSML is not supported. Call this interface multiple times to send the text to the server in multiple parts. The synthesis result is returned through the onEvent method of the ResultCallback interface. For a detailed call flow and reference example, see Bidirectional streaming call. |
|
None |
None |
Ends the streaming speech synthesis. This method blocks until one of the following conditions occurs:
-
The onComplete callback of the ResultCallback interface is triggered.
-
The onError callback of the ResultCallback interface is triggered.
For a detailed call flow and reference example, see Bidirectional streaming call. Important
When making a bidirectional streaming call, make sure to call this method to avoid missing parts of the synthesized speech. |
|
|
The synthesis result, encapsulated in |
Converts non-streaming text input (text containing SSML is not supported) into a streaming speech output in real time. The synthesis result is returned in a stream within the Flowable object. For a detailed call flow and reference example, see Call using Flowable. |
|
code: the WebSocket close code. reason: the reason for closing the connection. For information about how to configure these parameters, see The WebSocket Protocol document. |
true |
After a task is complete, you must close the WebSocket connection regardless of whether an exception occurred. This prevents connection leaks. For information about how to reuse connections to improve efficiency, see High-concurrency scenarios. |
|
|
The synthesis result, encapsulated in |
Converts streaming text input (text containing SSML is not supported) into a streaming speech output in real time. The synthesis result is returned in a stream within the Flowable object. For a detailed call flow and reference example, see Call using Flowable. |
|
None |
The request ID of the previous task. |
Gets the request ID of the previous task. Use this method after a task has been started, for example by calling the call or streamingCall method. |
|
None |
The first-packet latency of the current task. |
Returns first-packet latency in milliseconds (time from sending text to receiving first audio). Call after task completes. Factors affecting first-packet latency:
Typical latency:
If latency consistently exceeds 2,000 ms:
|
ResultCallback interface
For streaming calls (unidirectional or bidirectional), get results via ResultCallback. Import: import com.alibaba.dashscope.common.ResultCallback;.
|
Interface/Method |
Parameter |
Return value |
Description |
|
|
None |
Called when the server pushes audio data. Use the getAudioFrame method of the SpeechSynthesisResult parameter to get the binary audio data for the current segment. |
|
None |
None |
Called asynchronously after all synthesis data has been returned and speech synthesis is complete. |
|
|
None |
Called asynchronously when an exception occurs. We recommend implementing complete exception logging and resource cleanup logic in the |
Response
The server returns binary audio data:
-
Non-streaming call: Process the binary audio data returned by the call method of the SpeechSynthesizer class.
-
Unidirectional streaming call or bidirectional streaming call: Process the parameter (of type SpeechSynthesisResult) of the onEvent method of the ResultCallback interface.
The key interfaces of SpeechSynthesisResult are as follows:
Interface/Method
Parameter
Return value
Description
public ByteBuffer getAudioFrame()
None
Binary audio data
Returns the binary audio for the current segment (may be empty if no new data is available). Combine segments into a complete file or stream them to a compatible player.
Important
-
In streaming speech synthesis, for compressed formats such as MP3 and Opus, the segmented audio data must be played using a streaming player. Do not play it frame by frame, as this causes decoding to fail. Streaming players include FFmpeg, PyAudio (Python), AudioFormat (Java), and MediaSource (JavaScript).
-
When combining audio data into a complete audio file, write to the same file in append mode.
-
For WAV and MP3 audio from streaming speech synthesis, only the first frame contains header information. Subsequent frames contain only audio data.
public String getRequestId()
None
The request ID of the task.
Gets the request ID of the task. When you get binary audio data by calling getAudioFrame, the return value of the getRequestId method is null.
public SpeechSynthesisUsage getUsage()
None
SpeechSynthesisUsage: the number of billable characters in the current request so far.
Returns SpeechSynthesisUsage or null. The getCharacters method of SpeechSynthesisUsage returns the number of billable characters in the current request so far. Use the last received SpeechSynthesisUsage as the final value.
public Sentence getTimestamp()
None
Sentence: the timestamped sentence in the current request so far.
Returns Sentence or null. This method requires the enableWordTimestamp word-level timestamp feature to be enabled.
Methods of Sentence:
-
getIndex: Gets the sentence number, starting from 0.
-
getWords: Gets the character array List<Word> that makes up the sentence. Use the last received Sentence as the final value.
Methods of Word:
-
getText: Gets the text of the character.
-
getBeginIndex: Gets the starting position index of the character in the sentence, starting from 0.
-
getEndIndex: Gets the ending position index of the character in the sentence, starting from 1.
-
getBeginTime: Gets the start timestamp of the audio corresponding to the character, in milliseconds.
-
getEndTime: Gets the end timestamp of the audio corresponding to the character, in milliseconds.
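The append-mode guidance above can be sketched as a small helper. The method name appendFrame is our own; it simply writes each non-empty frame to the end of the target file:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class FrameWriter {
    /** Appends one audio frame to the target file, creating the file if needed. */
    public static void appendFrame(Path file, ByteBuffer frame) throws IOException {
        if (frame == null || !frame.hasRemaining()) {
            return; // getAudioFrame may carry no new data
        }
        byte[] bytes = new byte[frame.remaining()];
        frame.get(bytes);
        Files.write(file, bytes, StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    public static void main(String[] args) throws IOException {
        Path out = Files.createTempFile("tts", ".mp3");
        appendFrame(out, ByteBuffer.wrap(new byte[] {1, 2}));
        appendFrame(out, null);                              // frames without audio are skipped
        appendFrame(out, ByteBuffer.wrap(new byte[] {3}));
        System.out.println(Files.readAllBytes(out).length);  // 3
    }
}
```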
Error codes
If an error occurs, see Error messages for troubleshooting.
More examples
For more examples, see GitHub.
FAQ
Features, billing, and rate limiting
Q: What can I do if the pronunciation is inaccurate?
Use SSML to fix pronunciation.
Q: Speech synthesis is billed by character count. How do I check the text length for each synthesis request?
-
Non-streaming call: You need to calculate it yourself according to the character counting rules.
-
Other call methods: Use the getUsage method of the response SpeechSynthesisResult. Use the last response result that you receive as the final value.
Troubleshooting
If a code error occurs, see Error codes for troubleshooting.
Q: How do I get the request ID?
Get it in one of the following ways:
-
In the onEvent method of the ResultCallback interface, call the getRequestId method of SpeechSynthesisResult. The return value of the getRequestId method may be null. For more information, see the description of the getRequestId method in SpeechSynthesisResult.
-
Call the getLastRequestId method of SpeechSynthesizer.
Q: Why does the SSML feature fail?
Troubleshooting:
-
Verify limits and constraints.
-
Make sure you are using the correct interface: only the call method of the SpeechSynthesizer class supports SSML.
-
Make sure the text to be synthesized is in plain text format and meets the format requirements. For more information, see Introduction to the SSML markup language.
Q: Why does the audio duration of TTS speech synthesis differ from the WAV file's displayed duration? For example, a WAV file shows 7 seconds but the actual audio is less than 5 seconds?
TTS uses a streaming synthesis mechanism, which means it synthesizes and returns data progressively. As a result, the WAV file header contains an estimated value, which may have some margin of error. If you require precise duration, you can set the format to PCM and manually add the WAV header information after obtaining the complete synthesis result. This will give you a more accurate duration.
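The header-rewriting step can be sketched as follows: build a standard 44-byte WAV header with the exact PCM data length, then prepend it to the complete synthesis result. This assumes 16-bit mono PCM, which you should adjust if your output differs:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class WavHeader {
    /** Builds a standard 44-byte WAV header for 16-bit mono PCM data of known length. */
    public static byte[] build(int pcmLength, int sampleRate) {
        int channels = 1, bitsPerSample = 16;
        int byteRate = sampleRate * channels * bitsPerSample / 8;
        ByteBuffer b = ByteBuffer.allocate(44).order(ByteOrder.LITTLE_ENDIAN);
        b.put("RIFF".getBytes());
        b.putInt(36 + pcmLength);          // file size minus the 8-byte RIFF preamble
        b.put("WAVE".getBytes());
        b.put("fmt ".getBytes());
        b.putInt(16);                      // PCM fmt chunk size
        b.putShort((short) 1);             // audio format 1 = PCM
        b.putShort((short) channels);
        b.putInt(sampleRate);
        b.putInt(byteRate);
        b.putShort((short) (channels * bitsPerSample / 8)); // block align
        b.putShort((short) bitsPerSample);
        b.put("data".getBytes());
        b.putInt(pcmLength);               // exact data size, so players report the true duration
        return b.array();
    }

    /** Writes a complete, accurately sized WAV file from raw PCM data. */
    public static void writeWav(OutputStream out, byte[] pcm, int sampleRate) throws IOException {
        out.write(build(pcm.length, sampleRate));
        out.write(pcm);
    }
}
```

Because the data-chunk size now matches the real PCM length, the displayed duration agrees with the audible duration.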
Q: Why can't the audio be played?
Check the following scenarios one by one:
-
The audio is saved as a complete file (such as xx.mp3).
-
Format consistency: Verify request format matches file extension (e.g., WAV with .wav, not .mp3).
-
Player compatibility: Verify that your player supports the format and sample rate of the audio file. Some players may not support high sample rates or specific audio encodings.
-
-
The audio is played in a stream.
-
Save the audio stream as a complete file and try to play it with a player. If the file cannot be played, see the troubleshooting method for scenario 1.
-
If the file plays normally, the problem may be with your streaming playback implementation. Verify that your player supports streaming playback.
Common tools and libraries that support streaming playback include FFmpeg, PyAudio (Python), AudioFormat (Java), and MediaSource (JavaScript).
-
Q: Why does the audio playback stutter?
Check the following scenarios one by one:
-
Check the text sending speed: Make sure the interval between text segments is reasonable. Avoid situations where the next segment is not sent promptly after the previous audio segment finishes playing.
-
Check the callback function performance:
-
Avoid heavy business logic in the callback function—it can cause blocking.
-
Callbacks run in the WebSocket thread. Blocking prevents timely packet reception and causes audio playback to stutter.
-
We recommend writing audio data to a separate buffer and processing it in another thread to avoid blocking the WebSocket thread.
-
-
Check network stability: Ensure your network connection is stable to avoid audio transmission interruptions or delays caused by network fluctuations.
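The buffering recommendation above can be sketched with a BlockingQueue handed off to a dedicated player thread. The class and method names are our own; the point is that the WebSocket callback only enqueues and never blocks:

```java
import java.nio.ByteBuffer;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

public class AudioBuffer {
    // Poison pill signaling the end of the stream to the player thread.
    private static final ByteBuffer END = ByteBuffer.allocate(0);
    private final BlockingQueue<ByteBuffer> queue = new LinkedBlockingQueue<>();

    /** Called from the WebSocket callback (onEvent): enqueue and return immediately. */
    public void offer(ByteBuffer frame) { queue.add(frame); }

    /** Called from onComplete or onError to let the player thread finish. */
    public void finish() { queue.add(END); }

    /** Runs on a separate player thread; blocking happens here, never in the callback. */
    public void drain(Consumer<ByteBuffer> player) throws InterruptedException {
        while (true) {
            ByteBuffer frame = queue.take();
            if (frame == END) return;
            player.accept(frame);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        AudioBuffer buffer = new AudioBuffer();
        Thread playerThread = new Thread(() -> {
            try { buffer.drain(f -> System.out.println("play " + f.remaining() + " bytes")); }
            catch (InterruptedException ignored) { }
        });
        playerThread.start();

        // Simulated onEvent calls arriving on the WebSocket thread:
        buffer.offer(ByteBuffer.wrap(new byte[320]));
        buffer.offer(ByteBuffer.wrap(new byte[320]));
        buffer.finish(); // simulated onComplete
        playerThread.join();
    }
}
```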
Q: Why does speech synthesis take a long time?
Follow these steps to troubleshoot:
-
Check input interval
Check the input interval. If you are using streaming speech synthesis, verify whether the interval between sending text segments is too long (for example, a delay of several seconds). A long interval increases the total synthesis time.
-
Analyze performance metrics.
-
First-packet latency: Normally around 500 ms.
-
RTF (RTF = Total synthesis time / Audio duration): Normally less than 1.0.
-
Q: How do I handle incorrect pronunciation in the synthesized speech?
Use the <phoneme> tag of SSML to specify the correct pronunciation.
Q: Why is some text at the end not converted to speech, or why is no speech returned?
Check whether you have called the streamingComplete method of the SpeechSynthesizer class. During speech synthesis, the server begins synthesizing only after caching enough text. If you do not call streamingComplete, text remaining in the buffer may not be synthesized.
Permissions and authentication
Q: How can I restrict my API key to the CosyVoice speech synthesis service only (permission isolation)?
Create a workspace and grant authorization only to specific models to limit the API key scope. For more information, see Manage workspaces.
More questions
See the QA on GitHub.