This topic describes the parameters and interface details of the CosyVoice Java SDK for speech synthesis.
To use a model in the China (Beijing) region, obtain an API key from the API key page for the China (Beijing) region.
User guide: For more information about the models and guidance on model selection, see Real-time speech synthesis - CosyVoice.
Prerequisites
You have activated Model Studio and created an API key. To prevent security risks, export the API key as an environment variable instead of hard-coding it in your code.
Note: To grant temporary access permissions to third-party applications or users, or if you want to strictly control high-risk operations such as accessing or deleting sensitive data, we recommend that you use a temporary authentication token.
Compared with long-term API keys, temporary authentication tokens are more secure because they are short-lived (60 seconds). They are suitable for temporary call scenarios and can effectively reduce the risk of API key leakage.
To use a temporary token, replace the API key used for authentication in your code with the temporary authentication token.
Models and pricing
Model | Unit price |
cosyvoice-v3-plus | $0.286706 per 10,000 characters |
cosyvoice-v3-flash | $0.14335 per 10,000 characters |
cosyvoice-v2 | $0.286706 per 10,000 characters |
Text and format limitations
Text length limits
Non-streaming call (synchronous call, asynchronous call, or Flowable non-streaming call): The text for a single request cannot exceed 2,000 characters.
Streaming call (streaming call or Flowable streaming call): The text in a single request cannot exceed 2,000 characters, and the total length of the text cannot exceed 200,000 characters.
Character counting rules
A Chinese character, including simplified or traditional Chinese, Japanese kanji, and Korean hanja, is counted as 2 characters. All other characters, such as punctuation marks, letters, numbers, and Japanese or Korean kana or hangul, are counted as 1 character.
SSML tags are not included in the text length calculation.
Examples:
"你好"→ 2(你) + 2(好) = 4 characters"中A文123"→ 2(中) + 1(A) + 2(文) + 1(1) + 1(2) + 1(3) = 8 characters"中文。"→ 2(中) + 2(文) + 1(。) = 5 characters"中 文。"→ 2(中) + 1(space) + 2(文) + 1(。) = 6 characters"<speak>你好</speak>"→ 2(你) + 2(好) = 4 characters
Encoding format
Use UTF-8 encoding.
Support for mathematical expressions
The mathematical expression parsing feature is currently available only for the cosyvoice-v2, cosyvoice-v3-flash, and cosyvoice-v3-plus models. This feature supports common mathematical expressions from primary and secondary school, such as basic arithmetic, algebra, and geometry.
For more information, see Convert LaTeX formulas to speech.
SSML support
The Speech Synthesis Markup Language (SSML) feature is currently available only for cloned voices of the cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 models, and for system voices that are indicated as supported in the voice list. The following conditions must be met:
Use DashScope SDK 2.20.3 or later.
Only synchronous calls and asynchronous calls (which use the call method of the SpeechSynthesizer class) are supported. Streaming calls (which use the streamingCall method of the SpeechSynthesizer class) and Flowable calls are not supported.
The usage is the same as for standard speech synthesis. Pass the text containing SSML to the call method of the SpeechSynthesizer class.
Getting started
The SpeechSynthesizer class provides interfaces for speech synthesis and supports the following call methods:
Synchronous call: A blocking call that sends the complete text at once and returns the complete audio directly. This method is suitable for short text synthesis scenarios.
Asynchronous call: A non-blocking call that sends the complete text at once and uses a callback function to receive audio data, which may be delivered in chunks. This method is suitable for short text synthesis scenarios that require low latency.
Streaming call: A non-blocking call that sends text in fragments and uses a callback function to receive the synthesized audio stream incrementally in real time. This method is suitable for long text synthesis scenarios that require low latency.
Synchronous call
Submit a speech synthesis task synchronously to obtain the complete result directly.
Instantiate the SpeechSynthesizer class, set the request parameters, and call the call method to synthesize and obtain the binary audio data.
The text length cannot exceed 2,000 characters. For more information, see the call method of the SpeechSynthesizer class.
Before each call to the call method, you must create a new SpeechSynthesizer instance.
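A minimal synchronous-call sketch follows. The voice name is a placeholder (pick a voice from the voice list that matches your model), the API key is read from an environment variable, and the import paths follow the DashScope SDK examples; adjust them if your SDK version differs. Make sure the output file extension matches the audio format you request.

```java
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisParam;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;

import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;

public class SyncSynthesisDemo {
    public static void main(String[] args) {
        SpeechSynthesisParam param = SpeechSynthesisParam.builder()
                // Read the API key from an environment variable instead of hard-coding it.
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("cosyvoice-v2")
                .voice("your-voice-name") // placeholder: use a voice that matches the model
                .build();

        // Create a new SpeechSynthesizer instance for each call to call().
        SpeechSynthesizer synthesizer = new SpeechSynthesizer(param, null);
        ByteBuffer audio = synthesizer.call("Text to synthesize, up to 2,000 characters.");

        // Save the audio; make sure the file extension matches the requested format.
        try (FileOutputStream fos = new FileOutputStream("output.mp3")) {
            byte[] bytes = new byte[audio.remaining()];
            audio.get(bytes);
            fos.write(bytes);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```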
Asynchronous call
Submit a speech synthesis task asynchronously and receive real-time speech segments frame by frame by registering a ResultCallback callback.
Instantiate the SpeechSynthesizer class, set the request parameters and the ResultCallback interface, and then call the call method to synthesize the audio. The onEvent method of the ResultCallback interface provides the synthesis result in real time.
The text length cannot exceed 2,000 characters. For more information, see the call method of the SpeechSynthesizer class.
Before each call to the call method, you must create a new SpeechSynthesizer instance.
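A minimal asynchronous-call sketch, under the same assumptions as above (placeholder voice name, API key from an environment variable, import paths per the SDK examples). The CountDownLatch only keeps the sample program alive until onComplete or onError fires.

```java
import com.alibaba.dashscope.audio.tts.SpeechSynthesisResult;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisParam;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;
import com.alibaba.dashscope.common.ResultCallback;

import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.concurrent.CountDownLatch;

public class AsyncSynthesisDemo {
    public static void main(String[] args) throws IOException, InterruptedException {
        CountDownLatch latch = new CountDownLatch(1);
        FileOutputStream fos = new FileOutputStream("output.mp3");

        ResultCallback<SpeechSynthesisResult> callback = new ResultCallback<SpeechSynthesisResult>() {
            @Override
            public void onEvent(SpeechSynthesisResult result) {
                // Audio arrives in chunks; getAudioFrame() may be null when no new data is available.
                ByteBuffer frame = result.getAudioFrame();
                if (frame != null) {
                    try {
                        byte[] bytes = new byte[frame.remaining()];
                        frame.get(bytes);
                        fos.write(bytes); // append each chunk to the same file
                    } catch (IOException e) {
                        throw new RuntimeException(e);
                    }
                }
            }

            @Override
            public void onComplete() {
                latch.countDown(); // all audio received
            }

            @Override
            public void onError(Exception e) {
                e.printStackTrace();
                latch.countDown();
            }
        };

        SpeechSynthesisParam param = SpeechSynthesisParam.builder()
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("cosyvoice-v2")
                .voice("your-voice-name") // placeholder: use a voice that matches the model
                .build();

        // With a ResultCallback registered, call() returns immediately and audio arrives via onEvent().
        SpeechSynthesizer synthesizer = new SpeechSynthesizer(param, callback);
        synthesizer.call("Text to synthesize, up to 2,000 characters.");

        latch.await(); // wait for onComplete or onError
        fos.close();
    }
}
```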
Streaming call
Submit text in fragments and receive real-time speech segments frame by frame by registering a ResultCallback callback.
For streaming input, call streamingCall multiple times to submit text fragments sequentially. After the server receives the text fragments, it automatically segments the text into sentences:
Complete sentences are synthesized immediately.
Incomplete sentences are cached until they are complete and then synthesized.
When you call streamingComplete, the server synthesizes all received but unprocessed text fragments, including incomplete sentences.
The interval between sending text fragments cannot exceed 23 seconds. Otherwise, a "request timeout after 23 seconds" exception occurs.
If you have no more text to send, call streamingComplete to end the task promptly.
The server enforces the 23-second timeout, which cannot be modified by the client.
Instantiate the SpeechSynthesizer class.
Instantiate the SpeechSynthesizer class and set the request parameters and the ResultCallback interface.
Stream data
Call the streamingCall method of the SpeechSynthesizer class multiple times to submit the text to be synthesized to the server in segments.
As you send the text, the server returns the synthesis result in real time through the onEvent method of the ResultCallback interface.
For each call to the streamingCall method, the length of the text segment (that is, text) cannot exceed 2,000 characters. The total length of all text sent cannot exceed 200,000 characters.
End processing
Call the streamingComplete method of the SpeechSynthesizer class to end the speech synthesis task.
This method blocks the current thread until the onComplete or onError callback of the ResultCallback interface is triggered, after which the thread is unblocked.
You must call this method. Otherwise, the final text fragments may not be successfully converted to speech.
Call through Flowable
Flowable is a class in the RxJava reactive programming library (released under the Apache 2.0 license) that represents a stream of data with backpressure support. For more information about how to use Flowable, see Flowable API details.
Before you use Flowable, ensure that you have integrated the RxJava library and understand the basic concepts of reactive programming.
Non-streaming call
The following example shows how to use the blockingForEach interface of a Flowable object to block the current thread and retrieve the SpeechSynthesisResult data that is returned in each stream.
You can also obtain the complete synthesis result using the getAudioData method of the SpeechSynthesizer class after the Flowable stream is complete.
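Because the original example is not reproduced here, the following sketch illustrates the pattern. It assumes the Flowable-returning method is named callAsFlowable, as in the DashScope SDK examples, and that RxJava 2 (io.reactivex.Flowable) is on the classpath; the voice name is a placeholder.

```java
import com.alibaba.dashscope.audio.tts.SpeechSynthesisResult;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisParam;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;
import io.reactivex.Flowable;

public class FlowableSynthesisDemo {
    public static void main(String[] args) {
        SpeechSynthesisParam param = SpeechSynthesisParam.builder()
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("cosyvoice-v2")
                .voice("your-voice-name") // placeholder: use a voice that matches the model
                .build();

        SpeechSynthesizer synthesizer = new SpeechSynthesizer(param, null);

        // Method name assumed from the SDK examples: returns the result as a reactive stream.
        Flowable<SpeechSynthesisResult> stream =
                synthesizer.callAsFlowable("Text to synthesize, up to 2,000 characters.");

        // blockingForEach blocks the current thread and handles each SpeechSynthesisResult as it arrives.
        stream.blockingForEach(result -> {
            if (result.getAudioFrame() != null) {
                System.out.println("Received " + result.getAudioFrame().remaining() + " bytes of audio");
            }
        });

        // After the stream completes, the complete audio is also available from synthesizer.getAudioData().
    }
}
```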
Streaming call
The following example shows how to use a Flowable object as an input parameter for a text stream. The example also shows how to use a Flowable object as a return value and use the blockingForEach interface to block the current thread and retrieve the SpeechSynthesisResult data that is returned in each stream.
You can also obtain the complete synthesis result using the getAudioData method of the SpeechSynthesizer class after the Flowable stream is complete.
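Under the same assumptions, the following sketch shows the streaming variant. The method name streamingCallAsFlowable is assumed from the DashScope SDK examples and each element of the input Flowable is one text fragment.

```java
import com.alibaba.dashscope.audio.tts.SpeechSynthesisResult;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisParam;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;
import io.reactivex.Flowable;

public class FlowableStreamingSynthesisDemo {
    public static void main(String[] args) {
        SpeechSynthesisParam param = SpeechSynthesisParam.builder()
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("cosyvoice-v2")
                .voice("your-voice-name") // placeholder: use a voice that matches the model
                .build();

        SpeechSynthesizer synthesizer = new SpeechSynthesizer(param, null);

        // The text stream is itself a Flowable; each element is one text fragment.
        Flowable<String> textStream =
                Flowable.just("First text fragment, ", "second text fragment, ", "last fragment.");

        // Method name assumed from the SDK examples: consumes the text stream and
        // returns the synthesized audio as a reactive stream.
        synthesizer.streamingCallAsFlowable(textStream)
                .blockingForEach(result -> {
                    if (result.getAudioFrame() != null) {
                        System.out.println("Received " + result.getAudioFrame().remaining() + " bytes of audio");
                    }
                });
    }
}
```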
High-concurrency calls
The DashScope Java SDK uses the connection pool technology of OkHttp3 to reduce the overhead from repeatedly establishing connections. For more information, see High-concurrency scenarios.
Request parameters
Use the chained methods of SpeechSynthesisParam to configure parameters, such as the model and voice, and pass the configured parameter object to the constructor of the SpeechSynthesizer class.
Parameter | Type | Required | Description |
model | String | Yes | The speech synthesis model. Different models require corresponding voices:
|
voice | String | Yes | The voice to use for speech synthesis. System voices and cloned voices are supported:
|
format | enum | No | The audio coding format and sample rate. If you do not specify the format parameter, the default format and sample rate of the current voice are used. Note: The default sample rate is the optimal rate for the current voice. By default, the output uses this sample rate. Downsampling and upsampling are also supported. The following audio coding formats and sample rates are available:
|
volume | int | No | The volume. Default value: 50. Value range: [0, 100]. A value of 50 is the standard volume. The volume has a linear relationship with this value. 0 is mute and 100 is the maximum volume. |
speechRate | float | No | The speech rate. Default value: 1.0. Value range: [0.5, 2.0]. A value of 1.0 is the standard rate. Values less than 1.0 slow down the speech, and values greater than 1.0 speed it up. |
pitchRate | float | No | The pitch. This value is a multiplier for pitch adjustment. The relationship between this value and the perceived pitch is not strictly linear or logarithmic. Test different values to find the best one. Default value: 1.0. Value range: [0.5, 2.0]. A value of 1.0 is the natural pitch of the voice. Values greater than 1.0 raise the pitch, and values less than 1.0 lower it. |
bit_rate | int | No | The audio bitrate in kbps. If the audio format is Opus, you can adjust the bitrate using the bit_rate parameter. Default value: 32. Value range: [6, 510]. Note: Set the bit_rate parameter using the parameter method or the parameters method. |
enableWordTimestamp | boolean | No | Specifies whether to enable character-level timestamps. Default value: false.
This feature is available only for cloned voices of the cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 models, and for system voices marked as supported in the voice list. Timestamp results can be obtained only through the callback interface. |
seed | int | No | The random number seed used during generation, which varies the synthesis effect. If the model version, text, voice, and other parameters are the same, using the same seed reproduces the same synthesis result. Default value: 0. Value range: [0, 65535]. |
languageHints | List | No | Provides language hints. Only cosyvoice-v3-flash and cosyvoice-v3-plus support this feature. No default value. This parameter has no effect if it is not set. This parameter has the following effects in speech synthesis:
If the specified language hint clearly does not match the text content, the synthesis result may be affected. Note: This parameter is an array, but the current version processes only the first element. Therefore, pass only one value. |
instruction | String | No | Sets an instruction. This feature is available only for cloned voices of the cosyvoice-v3-flash and cosyvoice-v3-plus models, and for system voices marked as supported in the voice list. No default value. This parameter has no effect if it is not set. The instruction has the following effects in speech synthesis:
|
enable_aigc_tag | boolean | No | Specifies whether to add an invisible AIGC identifier to the generated audio. When set to true, an invisible identifier is embedded into the audio in supported formats (WAV, MP3, and Opus). Default value: false. This feature is available only for the cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 models. Note: Set the enable_aigc_tag parameter using the parameter method or the parameters method. |
aigc_propagator | String | No | Sets the propagator information recorded in the AIGC identifier. Default value: your Alibaba Cloud UID. This feature is available only for the cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 models. Note: Set the aigc_propagator parameter using the parameter method or the parameters method. |
aigc_propagate_id | String | No | Sets the propagation ID recorded in the AIGC identifier. Default value: the request ID of the current speech synthesis request. This feature is available only for the cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 models. Note: Set the aigc_propagate_id parameter using the parameter method or the parameters method. |
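The following sketch shows the chained configuration described above. Builder method names are assumed to mirror the parameter names in the table, bit_rate is set through the generic parameter method mentioned in the table notes, and the voice name is a placeholder.

```java
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisParam;

public class ParamConfigSketch {
    static SpeechSynthesisParam buildParam() {
        return SpeechSynthesisParam.builder()
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("cosyvoice-v2")        // required: the speech synthesis model
                .voice("your-voice-name")     // required: a voice that matches the model (placeholder)
                .volume(50)                   // optional: [0, 100], default 50
                .speechRate(1.0f)             // optional: [0.5, 2.0], default 1.0
                .pitchRate(1.0f)              // optional: [0.5, 2.0], default 1.0
                .enableWordTimestamp(true)    // optional: character-level timestamps (supported voices only)
                // Parameters without a dedicated chained method (for example bit_rate) are set through
                // the generic parameter/parameters methods, as noted in the table above.
                .parameter("bit_rate", 32)
                .build();
    }
}
```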
Key interfaces
SpeechSynthesizer class
The SpeechSynthesizer class provides the primary interfaces for speech synthesis and is imported using import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;.
Interface/Method | Parameter | Return value | Description |
|
|
| Constructor.
|
|
|
| Converts a segment of text into speech. The text can be plain text or text that contains SSML. When you create a SpeechSynthesizer instance, the behavior of this method depends on whether you pass a ResultCallback: without a callback, the method blocks and returns the complete binary audio data (synchronous call); with a callback, the method returns immediately and the audio is delivered in chunks through the callback (asynchronous call).
Important: Before each call to the call method, you must create a new SpeechSynthesizer instance. |
|
| None | Sends the text for synthesis in a stream. Text that contains SSML is not supported. Call this interface multiple times to send the text for synthesis to the server in parts. For a detailed call flow and reference examples, see Streaming call. |
| None | None | Ends the streaming speech synthesis. This method blocks the calling thread until one of the following conditions occurs:
For a detailed call process and reference examples, see Streaming call. Important When making a streaming call, call this method to avoid missing parts of the synthesized speech. |
|
| The synthesis result, encapsulated in Flowable<SpeechSynthesisResult> | Converts non-streaming text input into a streaming speech output in real time. Text containing SSML is not supported. The synthesis result is returned in a stream within the Flowable object. For a detailed call process and reference examples, see Call through Flowable. |
| code: WebSocket Close Code reason: Shutdown reason For information about how to configure these parameters, see The WebSocket Protocol. | true | After a task is complete, close the WebSocket connection, regardless of whether an exception occurred, to avoid connection leaks. For information about how to reuse connections to improve efficiency, see High-concurrency scenarios. |
|
| The synthesis result, encapsulated in Flowable<SpeechSynthesisResult> | Converts streaming text input into a streaming speech output in real time. Text that contains SSML is not supported. The synthesis result is returned as a stream in a Flowable object. For a detailed call process and reference examples, see Call through Flowable. |
| None | The request ID of the previous task. | Gets the request ID of the previous task. You can call this method after a new task has been started. |
| None | First packet delay for the current task. | Gets the first packet delay of the current task, which is typically around 500 ms. Use this method after the task is complete. The first packet delay is the time between when the text starts being sent and when the first audio packet is received, measured in milliseconds. A WebSocket connection must be established when the text is sent for the first time. Therefore, the first-packet latency includes the time required to establish the connection. If the connection is reused in a high-concurrency scenario, the connection time is not included. |
Callback interface (ResultCallback)
When you make an asynchronous call or a streaming call, you can retrieve the synthesis result from the ResultCallback interface. This interface is imported using import com.alibaba.dashscope.common.ResultCallback;.
Interface/Method | Parameter | Return value | Description |
|
| None | This callback is invoked asynchronously when the server pushes speech synthesis data. Call the getAudioFrame method of the SpeechSynthesisResult parameter to obtain the binary audio data. Call the getUsage and getTimestamp methods to obtain usage and timestamp information. |
| None | None | This callback is invoked asynchronously after all synthesized data has been returned (speech synthesis is complete). |
|
| None | This callback is invoked asynchronously when an exception occurs. Implement complete exception logging and resource cleanup logic in the onError callback. |
Response
The server returns binary audio data:
Synchronous call: Process the binary audio data returned by the call method of the SpeechSynthesizer class.
Asynchronous call or streaming call: Process the SpeechSynthesisResult parameter of the onEvent method of the ResultCallback interface.
The key interfaces of SpeechSynthesisResult are as follows:
Interface/Method | Parameter | Return value | Description |
public ByteBuffer getAudioFrame() | None | Binary audio data | Returns the binary audio data of the current streaming synthesis segment. This may be empty if no new data arrives. Combine the binary audio data into a complete audio file for playback, or play it in real time with a player that supports streaming playback. Important: In streaming speech synthesis, for compressed formats such as MP3 and Opus, use a streaming player to play the audio segments. Do not play them frame by frame, to avoid decoding failures. Players that support streaming playback include ffmpeg, pyaudio (Python), AudioFormat (Java), and MediaSource (JavaScript). When combining audio data into a complete audio file, append the data to the same file. For WAV and MP3 audio formats in streaming speech synthesis, only the first frame contains header information. Subsequent frames contain only audio data. |
public String getRequestId() | None | The request ID of the task. | Gets the request ID of the task. When you call getAudioFrame to get binary audio data, the return value of the getRequestId method is null. |
public SpeechSynthesisUsage getUsage() | None | SpeechSynthesisUsage: the number of billable characters in the current request so far. | Returns SpeechSynthesisUsage or null. The getCharacters method of SpeechSynthesisUsage returns the number of billable characters used so far in the current request. Use the last received SpeechSynthesisUsage as the final count. |
public Sentence getTimestamp() | None | Sentence: the billable sentence in the current request so far. | Returns Sentence or null. You must enable the enableWordTimestamp character-level timestamp feature. Methods of Sentence: getIndex gets the sentence number, starting from 0; getWords gets the character array List<Word> that makes up the sentence. Use the last received Sentence as the final result. Methods of Word: getText gets the text of the character; getBeginIndex gets the start position index of the character in the sentence, starting from 0; getEndIndex gets the end position index of the character in the sentence, starting from 1; getBeginTime gets the start timestamp of the audio corresponding to the character, in milliseconds; getEndTime gets the end timestamp of the audio corresponding to the character, in milliseconds. |
Error codes
For troubleshooting information, see Error messages.
More examples
For more examples, see GitHub.
FAQ
Features, billing, and rate limiting
Q: What can I do to fix inaccurate pronunciation?
You can use SSML to customize the speech synthesis output.
Q: Speech synthesis is billed based on the number of text characters. How can I view or get the text length for each synthesis?
Synchronous call: You must calculate it manually based on the character counting rules.
Other call methods: You can retrieve it using the getUsage method of the SpeechSynthesisResult response. The value in the final response is the final total.
Troubleshooting
If an error code is returned, see Error codes for troubleshooting information.
Q: How do I get the Request ID?
You can retrieve it in one of the following two ways:
In the onEvent method of the ResultCallback, call the getRequestId method of SpeechSynthesisResult.
The return value of the getRequestId method may be null. For more information, see the description of the getRequestId method in SpeechSynthesisResult.
Call the getLastRequestId method of SpeechSynthesizer.
Q: Why does the SSML feature fail?
Perform the following steps to troubleshoot this issue:
Ensure that your use case meets the conditions described in the scope of application.
Ensure that you are using the correct interface. SSML is supported only by the call method of the SpeechSynthesizer class.
Ensure that the text for synthesis is plain text and meets the format requirements. For more information, see Speech Synthesis Markup Language.
Q: Why can't the audio be played?
Troubleshoot this issue based on the following scenarios:
The audio is saved as a complete file, such as an .mp3 file.
Audio format consistency: Ensure that the audio format specified in the request parameters matches the file extension. For example, playback might fail if the audio format is set to WAV in the request parameters but the file has an .mp3 extension.
Player compatibility: Confirm that your player supports the format and sample rate of the audio file. For example, some players might not support high sample rates or specific audio encodings.
The audio is played in streaming mode.
Save the audio stream as a complete file and try to play it. If the file fails to play, see the troubleshooting steps for the first scenario.
If the file plays correctly, the issue might be with the streaming playback implementation. Confirm that your player supports streaming playback.
Common tools and libraries that support streaming playback include FFmpeg, pyaudio (Python), AudioFormat (Java), and MediaSource (JavaScript).
Q: Why does the audio playback stutter?
Troubleshoot this issue based on the following scenarios:
Check the text sending speed: Ensure that the text sending interval is reasonable. Avoid delays in sending the next text segment after the audio for the previous segment has finished playing.
Check the callback function performance:
Check whether the callback function contains excessive business logic that could cause it to block.
The callback function runs in the WebSocket thread. If this thread is blocked, it can interfere with the WebSocket's ability to receive network packets, resulting in audio stuttering.
To avoid blocking the WebSocket thread, write the audio data to a separate audio buffer and then use another thread to read and process it, as shown in the sketch after this list.
Check network stability: Ensure that your network connection is stable to prevent audio transmission interruptions or delays caused by network fluctuations.
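The following sketch illustrates one way to decouple the callback thread from audio processing using a plain BlockingQueue. The class, method, and field names are only for illustration; wire onAudioChunk and onStreamEnd into your own onEvent, onComplete, and onError implementations.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class AudioBufferSketch {
    // Chunks are written by the WebSocket callback thread and consumed by a separate processing thread.
    private static final BlockingQueue<byte[]> audioBuffer = new LinkedBlockingQueue<>();
    private static final byte[] END_OF_STREAM = new byte[0];

    // Call this from onEvent(): it only enqueues the chunk, so the WebSocket thread is never blocked.
    static void onAudioChunk(byte[] chunk) {
        audioBuffer.offer(chunk);
    }

    // Call this from onComplete() or onError() so that the consumer thread can finish.
    static void onStreamEnd() {
        audioBuffer.offer(END_OF_STREAM);
    }

    public static void main(String[] args) {
        Thread consumer = new Thread(() -> {
            try {
                while (true) {
                    byte[] chunk = audioBuffer.take();
                    if (chunk == END_OF_STREAM) {
                        break;
                    }
                    // Feed the chunk to your streaming player or write it to a file here.
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.start();
    }
}
```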
Q: Why is speech synthesis slow (long synthesis time)?
Perform the following troubleshooting steps:
Check the input interval
If you are using streaming speech synthesis, check whether the text sending interval is too long. For example, a delay of several seconds before sending the next segment will increase the total synthesis time.
Analyze performance metrics
First packet delay: This is typically around 500 ms.
Real-Time Factor (RTF): This is calculated as Total Synthesis Time / Audio Duration. The RTF is normally less than 1.0.
Q: How do I handle incorrect pronunciation in the synthesized speech?
Use the <phoneme> tag of SSML to specify the correct pronunciation.
Q: Why is no speech returned? Why is the text at the end not successfully converted to speech? (Missing synthesized speech)
Check whether you have called the streamingComplete method of the SpeechSynthesizer class. During speech synthesis, the server caches text and begins synthesis only after a sufficient amount of text is cached. If you do not call the streamingComplete method, the text remaining in the cache may not be synthesized.
Permissions and authentication
Q: I want my API key to be used only for the CosyVoice speech synthesis service, not for other Model Studio models (permission isolation). What should I do?
You can create a workspace and authorize only specific models to limit the scope of the API key. For more information, see Manage workspaces.
More questions
For more information, see the Q&A on GitHub.