This topic describes the parameters and interfaces of the Paraformer real-time speech recognition Java SDK.
This document applies only to the China (Beijing) region. To use the model, you must use an API key from the China (Beijing) region.
User guide: For model descriptions and selection guidance, see Real-time speech recognition - Fun-ASR/Paraformer.
Prerequisites
You have activated Model Studio and created an API key. To prevent security risks, export the API key as an environment variable instead of hard-coding it in your code.
Note: To grant temporary access permissions to third-party applications or users, or if you want to strictly control high-risk operations such as accessing or deleting sensitive data, we recommend that you use a temporary authentication token.
Compared with long-term API keys, temporary authentication tokens are more secure because they are short-lived (60 seconds). They are suitable for temporary call scenarios and can effectively reduce the risk of API key leakage.
To use a temporary token, replace the API key used for authentication in your code with the temporary authentication token.
Model list
| Feature | paraformer-realtime-v2 | paraformer-realtime-8k-v2 |
| --- | --- | --- |
| Scenarios | Live streaming, meetings, and similar scenarios | Recognition of 8 kHz audio, such as telephone customer service and voicemail |
| Sample rate | Any | 8 kHz |
| Languages | Chinese (including Mandarin and various dialects), English, Japanese, Korean, German, French, and Russian. Supported Chinese dialects: Shanghainese, Wu, Minnan, Northeastern, Gansu, Guizhou, Henan, Hubei, Hunan, Jiangxi, Ningxia, Shanxi, Shaanxi, Shandong, Sichuan, Tianjin, Yunnan, and Cantonese | Chinese |
| Punctuation prediction | ✅ Supported by default. No configuration is required. | ✅ Supported by default. No configuration is required. |
| Inverse Text Normalization (ITN) | ✅ Supported by default. No configuration is required. | ✅ Supported by default. No configuration is required. |
| Custom hotwords | ✅ For more information, see Customize hotwords. | ✅ For more information, see Customize hotwords. |
| Specify recognition language | ✅ Specified by the language_hints parameter | ❌ |
| Emotion recognition | ❌ | ✅ |
Getting started
The Recognition class provides interfaces for synchronous and streaming calls. Select a method based on your requirements:
Synchronous call: Recognizes a local file and returns the complete result in a single response. Suitable for processing pre-recorded audio.
Streaming call: Recognizes an audio stream directly and outputs the results in real time. The audio stream can originate from an external device, such as a microphone, or be read from a local file. Suitable for scenarios that require immediate feedback.
Synchronous call
Submit a real-time speech-to-text task for a local file and receive the transcription result in a single, synchronous response. This is a blocking operation.
Instantiate the Recognition class and call the call method with the request parameters and the file to be recognized. This action performs the recognition and returns the result.
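The following is a minimal sketch of a synchronous call. The model, audio format, sample rate, file name (asr_example.wav), and the environment variable used for the API key are illustrative assumptions; adjust them to your audio and account, and verify the builder usage against your SDK version.

```java
import java.io.File;

import com.alibaba.dashscope.audio.asr.recognition.Recognition;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;

public class SyncRecognitionSketch {
    public static void main(String[] args) throws Exception {
        // Request parameters; the values below are illustrative and must match the audio file.
        RecognitionParam param = RecognitionParam.builder()
                .model("paraformer-realtime-v2")
                .format("wav")
                .sampleRate(16000)
                .apiKey(System.getenv("DASHSCOPE_API_KEY")) // read the key from an environment variable
                .build();

        Recognition recognition = new Recognition();
        // Blocks until the entire local file has been read and recognized.
        String result = recognition.call(param, new File("asr_example.wav"));
        System.out.println(result);
    }
}
```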
Streaming call: Based on callbacks
You can submit a real-time speech-to-text task and receive streaming recognition results by implementing a callback interface.
Follow these steps (a minimal sketch follows the list):

1. Start streaming speech recognition: Instantiate the Recognition class and call the call method with the request parameters and the callback interface (ResultCallback) to start streaming speech recognition.
2. Stream audio: Repeatedly call the sendAudioFrame method of the Recognition class to send segments of a binary audio stream to the server. The audio stream can be read from a local file or a device, such as a microphone. While you send audio data, the server returns recognition results to the client in real time through the onEvent method of the callback interface (ResultCallback). We recommend sending audio segments of about 100 ms each, with a data size between 1 KB and 16 KB.
3. End processing: Call the stop method of the Recognition class to stop speech recognition. This method blocks the current thread until the onComplete or onError callback of the callback interface (ResultCallback) is triggered.
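A minimal end-to-end sketch of the callback-based flow, assuming a 16 kHz, 16-bit mono PCM file named asr_example.pcm as the audio source. The chunk size (3200 bytes, roughly 100 ms at this format), the sleep interval, the import path of ResultCallback, and the result accessors (getSentence, getText) follow the DashScope sample code and should be verified against your SDK version.

```java
import java.io.FileInputStream;
import java.nio.ByteBuffer;
import java.util.Arrays;

import com.alibaba.dashscope.audio.asr.recognition.Recognition;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionResult;
import com.alibaba.dashscope.common.ResultCallback;

public class CallbackRecognitionSketch {
    public static void main(String[] args) throws Exception {
        RecognitionParam param = RecognitionParam.builder()
                .model("paraformer-realtime-v2")
                .format("pcm")
                .sampleRate(16000)
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .build();

        ResultCallback<RecognitionResult> callback = new ResultCallback<RecognitionResult>() {
            @Override
            public void onEvent(RecognitionResult result) {
                // Intermediate and final results arrive here in real time.
                System.out.println(result.getSentence().getText());
            }

            @Override
            public void onComplete() {
                System.out.println("Recognition complete.");
            }

            @Override
            public void onError(Exception e) {
                System.err.println("Recognition error: " + e.getMessage());
            }
        };

        Recognition recognition = new Recognition();
        recognition.call(param, callback); // start streaming recognition

        try (FileInputStream in = new FileInputStream("asr_example.pcm")) {
            byte[] buffer = new byte[3200]; // ~100 ms of 16 kHz, 16-bit mono PCM
            int read;
            while ((read = in.read(buffer)) > 0) {
                // Copy the chunk so the reused buffer cannot be modified after it is sent.
                recognition.sendAudioFrame(ByteBuffer.wrap(Arrays.copyOfRange(buffer, 0, read)));
                Thread.sleep(100); // pace the stream roughly in real time
            }
        }

        recognition.stop(); // blocks until onComplete or onError fires
    }
}
```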
Streaming call: Based on Flowable
You can submit a real-time speech-to-text task and receive streaming recognition results by implementing a Flowable workflow.
Flowable is a stream type from the open-source RxJava library (released under the Apache 2.0 license). It implements the Reactive Streams specification and represents a source of zero or more items with backpressure support. For more information, see the Flowable API reference.
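A sketch of the Flowable-based flow under the same assumptions as the callback example (a 16 kHz, 16-bit mono PCM file streamed in roughly 100 ms frames). The io.reactivex import path, the streamCall return type (a Flowable of RecognitionResult), and the result accessors are assumptions based on the DashScope samples; verify them against the RxJava version bundled with your SDK.

```java
import java.io.FileInputStream;
import java.nio.ByteBuffer;
import java.util.Arrays;

import com.alibaba.dashscope.audio.asr.recognition.Recognition;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;

import io.reactivex.BackpressureStrategy;
import io.reactivex.Flowable;

public class FlowableRecognitionSketch {
    public static void main(String[] args) throws Exception {
        RecognitionParam param = RecognitionParam.builder()
                .model("paraformer-realtime-v2")
                .format("pcm")
                .sampleRate(16000)
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .build();

        // Emit the local file as a stream of ~100 ms audio frames on a separate thread.
        Flowable<ByteBuffer> audioSource = Flowable.create(emitter -> new Thread(() -> {
            try (FileInputStream in = new FileInputStream("asr_example.pcm")) {
                byte[] buffer = new byte[3200]; // ~100 ms of 16 kHz, 16-bit mono PCM
                int read;
                while ((read = in.read(buffer)) > 0) {
                    emitter.onNext(ByteBuffer.wrap(Arrays.copyOfRange(buffer, 0, read)));
                    Thread.sleep(100); // pace the stream roughly in real time
                }
                emitter.onComplete();
            } catch (Exception e) {
                emitter.onError(e);
            }
        }).start(), BackpressureStrategy.BUFFER);

        Recognition recognition = new Recognition();
        // streamCall consumes the audio Flowable and returns a Flowable of recognition results.
        recognition.streamCall(param, audioSource)
                .blockingForEach(result -> System.out.println(result.getSentence().getText()));
    }
}
```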
High-concurrency calls
The DashScope Java SDK uses OkHttp3 connection pooling to reduce the overhead of repeatedly establishing connections. For more information, see Real-time speech recognition in high-concurrency scenarios.
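One pattern that benefits from connection pooling is keeping a single Recognition instance for sequential tasks instead of creating a new one per file. This is a sketch, not the linked guide's prescription; whether an instance can be reused across tasks depends on your SDK version, so treat the reuse shown here as an assumption and consult the high-concurrency guide for the authoritative approach.

```java
import java.io.File;
import java.util.List;

import com.alibaba.dashscope.audio.asr.recognition.Recognition;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;

public class ConnectionReuseSketch {
    // Processes several local files with one Recognition instance so that the
    // underlying pooled connection can be reused between sequential tasks.
    public static void recognizeAll(RecognitionParam param, List<File> files) {
        Recognition recognition = new Recognition();
        for (File file : files) {
            String result = recognition.call(param, file); // one synchronous task per file
            System.out.println(file.getName() + ": " + result);
        }
    }
}
```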
Request parameters
Use the chained methods of RecognitionParam to configure parameters such as the model, sample rate, and audio format. Pass the configured parameter object to the call or streamCall method of the Recognition class.
| Parameter | Type | Default value | Required | Description |
| --- | --- | --- | --- | --- |
| model | String | - | Yes | The model for real-time speech recognition. For more information, see Model list. |
| sampleRate | Integer | - | Yes | The audio sample rate in Hz. This parameter varies by model: paraformer-realtime-v2 supports any sample rate; paraformer-realtime-8k-v2 supports only 8000 Hz. |
| format | String | - | Yes | The format of the audio to be recognized. Supported audio formats: pcm, wav, mp3, opus, speex, aac, and amr. Important: opus and speex must be encapsulated in Ogg; wav must be PCM encoded; amr supports only the AMR-NB type. |
| vocabularyId | String | - | No | The ID of the hotword vocabulary. The vocabulary takes effect only when this parameter is set. Use this field to set the hotword ID for v2 and later models; the hotword information associated with this ID is applied to the speech recognition request. For more information, see Custom vocabulary. |
| disfluencyRemovalEnabled | boolean | false | No | Specifies whether to filter out disfluent words. true: filters out disfluent words. false (default): does not filter out disfluent words. |
| language_hints | String[] | ["zh", "en"] | No | The language codes of the languages to be recognized. If you cannot determine the language in advance, leave this parameter unset and the model automatically detects the language. Supported language codes include zh (Chinese), en (English), ja (Japanese), ko (Korean), de (German), fr (French), and ru (Russian), corresponding to the languages in the Model list. This parameter applies only to models that support multiple languages. Note: Set this parameter by using the parameter or parameters method of RecognitionParam. |
| semantic_punctuation_enabled | boolean | false | No | Specifies whether to enable semantic sentence segmentation. true: uses semantic sentence segmentation. false (default): uses VAD (Voice Activity Detection) sentence segmentation. Semantic sentence segmentation provides higher accuracy and is suitable for meeting transcription scenarios. VAD sentence segmentation has lower latency and is suitable for interactive scenarios. By adjusting this parameter, you can switch between the two segmentation modes. This parameter is effective only for v2 and later models. Note: Set this parameter by using the parameter or parameters method of RecognitionParam. |
| max_sentence_silence | Integer | 800 | No | The silence duration threshold for VAD sentence segmentation, in ms. If the silence after a speech segment exceeds this threshold, the system determines that the sentence has ended. Valid values range from 200 to 6000. The default value is 800. This parameter is effective only when VAD sentence segmentation is used, that is, when semantic_punctuation_enabled is false. Note: Set this parameter by using the parameter or parameters method of RecognitionParam. |
| multi_threshold_mode_enabled | boolean | false | No | If this parameter is set to true, VAD is prevented from producing sentences that are too long. This feature is disabled by default. This parameter is effective only when VAD sentence segmentation is used, that is, when semantic_punctuation_enabled is false. Note: Set this parameter by using the parameter or parameters method of RecognitionParam. |
| punctuation_prediction_enabled | boolean | true | No | Specifies whether to automatically add punctuation to the recognition results. true (default): adds punctuation. false: does not add punctuation. This parameter is effective only for v2 and later models. Note: Set this parameter by using the parameter or parameters method of RecognitionParam. |
| heartbeat | boolean | false | No | Specifies whether to maintain a persistent connection with the server during long periods of silence. true: keeps the connection alive, provided that you continuously send silent audio to the server (see FAQ). false (default): the connection may be closed after prolonged silence. This parameter is effective only for v2 and later models. Note: To use this field, your SDK version must be 2.19.1 or later. Set this parameter by using the parameter or parameters method of RecognitionParam. |
| inverse_text_normalization_enabled | boolean | true | No | Specifies whether to enable Inverse Text Normalization (ITN). This feature is enabled by default (true). When enabled, Chinese numerals are converted to Arabic numerals. This parameter is effective only for v2 and later models. Note: Set this parameter by using the parameter or parameters method of RecognitionParam. |
| apiKey | String | - | No | Your API key. |
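The following sketch shows one way to combine the chained builder methods with the parameter method for the snake_case parameters in the table above. The parameter(key, value) builder method and the specific values are assumptions to verify against your SDK version.

```java
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;

public class RecognitionParamSketch {
    public static RecognitionParam buildParam() {
        return RecognitionParam.builder()
                .model("paraformer-realtime-v2")
                .format("pcm")
                .sampleRate(16000)
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                // Parameters without dedicated chained methods are set by name;
                // the parameter(key, value) builder method is assumed here.
                .parameter("language_hints", new String[] {"zh", "en"})
                .parameter("semantic_punctuation_enabled", false)
                .parameter("max_sentence_silence", 800)
                .build();
    }
}
```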
Key interfaces
Recognition class
Import the Recognition class using the statement import com.alibaba.dashscope.audio.asr.recognition.Recognition;. Its key interfaces are:
| Interface/Method | Parameters | Return value | Description |
| --- | --- | --- | --- |
| call | Request parameters (RecognitionParam) and the callback interface (ResultCallback) | None | Performs streaming real-time recognition based on callbacks. This method does not block the current thread. |
| call | Request parameters (RecognitionParam) and the local file to recognize | Recognition result | Performs a synchronous call based on a local file. This method blocks the current thread until the entire audio file is read. The file must be readable. |
| streamCall | Request parameters (RecognitionParam) and a Flowable audio stream | Flowable of recognition results | Performs streaming real-time recognition based on Flowable. |
| sendAudioFrame | A segment of binary audio data | None | Pushes an audio stream. Each pushed segment should be neither too large nor too small. We recommend that each audio packet has a duration of about 100 ms and a size between 1 KB and 16 KB. The recognition results are obtained through the onEvent method of the callback interface (ResultCallback). |
| stop | None | None | Stops real-time recognition. This method blocks the current thread until the onComplete or onError callback of the callback interface (ResultCallback) is triggered. |
| | code: WebSocket close code. reason: the reason for closing. For information about how to configure these two parameters, see The WebSocket Protocol document. | true | After the task ends, you must close the WebSocket connection to prevent connection leaks, regardless of whether an exception occurs. For more information about how to reuse connections to improve efficiency, see Real-time speech recognition in high-concurrency scenarios. |
| | None | requestId | Gets the request ID of the current task. Use this method after you start a new task with the call or streamCall method. Note: This method is available in SDK versions 2.18.0 and later. |
| | None | First-packet latency | Gets the first-packet latency, which is the delay from sending the first audio packet to receiving the first recognition result. Use this method after the task is complete. Note: This method is available in SDK versions 2.18.0 and later. |
| | None | Last-packet latency | Gets the last-packet latency, which is the time from when audio sending stops (the stop method is called) to when the final recognition result is received. Note: This method is available in SDK versions 2.18.0 and later. |
ResultCallback
When you make a streaming call, the server returns key process information and data to the client through callbacks. Implement callback methods to handle the information and data returned by the server.
To implement the callback methods, extend the ResultCallback abstract class and specify RecognitionResult as the generic type. RecognitionResult encapsulates the data structure returned by the server.
Because the Java SDK supports connection reuse, there are no onOpen or onClose methods.
| Interface/Method | Parameters | Return value | Description |
| --- | --- | --- | --- |
| onEvent | The real-time recognition result (RecognitionResult) | None | Called when the server returns a recognition response. |
| onComplete | None | None | Called when the recognition task is complete. |
| onError | The exception that occurred | None | Called when an exception occurs. |
Response
Real-time recognition result (RecognitionResult)
RecognitionResult represents the result of a single real-time recognition.
| Interface/Method | Parameters | Return value | Description |
| --- | --- | --- | --- |
| getRequestId | None | requestId | Gets the request ID. |
| isSentenceEnd | None | Whether the sentence has ended (a sentence break has occurred) | Checks whether the given sentence has ended. |
| getSentence | None | Sentence information (Sentence) | Gets single-sentence information, including the timestamps and text. |
Sentence information (Sentence)
| Interface/Method | Parameters | Return value | Description |
| --- | --- | --- | --- |
| getBeginTime | None | Sentence start time, in ms | Returns the start time of the sentence. |
| getEndTime | None | Sentence end time, in ms | Returns the end time of the sentence. |
| getText | None | Recognized text | Returns the recognized text. |
| getWords | None | A list of word timestamp information (Word) objects | Returns word timestamp information. |
| | None | Emotion of the current sentence | Returns the emotion of the current sentence. For the models and conditions under which emotion recognition is available, see Model list. |
| | None | Confidence of the recognized emotion for the current sentence | Returns the confidence of the recognized emotion for the current sentence. The value ranges from 0.0 to 1.0; a larger value indicates higher confidence. For the models and conditions under which emotion recognition is available, see Model list. |
Word timestamp information (Word)
| Interface/Method | Parameters | Return value | Description |
| --- | --- | --- | --- |
| getBeginTime | None | Word start time, in ms | Returns the start time of the word. |
| getEndTime | None | Word end time, in ms | Returns the end time of the word. |
| getText | None | Recognized word | Returns the recognized word. |
| getPunctuation | None | Punctuation | Returns the punctuation. |
Error codes
If an error occurs, see Error messages for troubleshooting.
If the problem persists, join the developer group to report the issue. Provide the Request ID to help us investigate the issue.
More examples
For more examples, see GitHub.
FAQ
Features
Q: How do I maintain a persistent connection with the server during long periods of silence?
You can set the heartbeat request parameter to true and continuously send silent audio to the server.
Silent audio is audio content (in a file or a data stream) that contains no sound signal. You can generate silent audio with audio editing software, such as Audacity or Adobe Audition, or with command-line tools, such as FFmpeg.
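As an alternative to pre-generated silence files, you can keep a session alive by sending zero-filled PCM frames directly from the SDK. This is a sketch under the assumptions that the heartbeat request parameter is set to true, that an active Recognition session exists (see the callback example above), and that the frame size and interval shown (3200 bytes, 100 ms) match 16 kHz, 16-bit mono PCM.

```java
import java.nio.ByteBuffer;

import com.alibaba.dashscope.audio.asr.recognition.Recognition;

public class SilenceSender {
    /**
     * Sends zero-filled (silent) 16-bit PCM frames to keep the session alive.
     * Assumes recognition.call(param, callback) has already been invoked and
     * the heartbeat request parameter is set to true.
     */
    public static void sendSilence(Recognition recognition, int seconds) throws InterruptedException {
        byte[] silentFrame = new byte[3200]; // ~100 ms of 16 kHz, 16-bit mono PCM, all zeros
        for (int i = 0; i < seconds * 10; i++) {
            recognition.sendAudioFrame(ByteBuffer.wrap(silentFrame.clone()));
            Thread.sleep(100); // keep roughly real-time pacing
        }
    }
}
```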
Q: How do I convert audio to one of the required formats?
You can use the FFmpeg tool. For more information, see the official FFmpeg website.
# Basic conversion command (universal template)
# -i: Specifies the input file path. Example: audio.wav
# -c:a: Specifies the audio encoder. Examples: aac, libmp3lame, pcm_s16le
# -b:a: Specifies the bit rate (controls audio quality). Examples: 192k, 320k
# -ar: Specifies the sample rate. Examples: 44100 (CD), 48000, 16000
# -ac: Specifies the number of sound channels. Examples: 1 (mono), 2 (stereo)
# -y: Overwrites an existing file (no value needed).
ffmpeg -i input_audio.ext -c:a encoder_name -b:a bit_rate -ar sample_rate -ac num_channels output.ext
# Example: WAV to MP3 (maintain original quality)
ffmpeg -i input.wav -c:a libmp3lame -q:a 0 output.mp3
# Example: MP3 to WAV (16-bit PCM standard format)
ffmpeg -i input.mp3 -c:a pcm_s16le -ar 44100 -ac 2 output.wav
# Example: M4A to AAC (extract/convert Apple audio)
ffmpeg -i input.m4a -c:a copy output.aac # Directly extract without re-encoding
ffmpeg -i input.m4a -c:a aac -b:a 256k output.aac # Re-encode to improve quality
# Example: FLAC lossless to Opus (high compression)
ffmpeg -i input.flac -c:a libopus -b:a 128k -vbr on output.opus

Q: Can I view the time range for each sentence?
Yes, you can. The speech recognition results include the start and end timestamps for each sentence. You can use these timestamps to determine the time range of each sentence.
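A small sketch of reading those timestamps from a streaming result. The accessor names (isSentenceEnd, getSentence, getBeginTime, getEndTime, getText) follow DashScope sample code and should be verified against your SDK version.

```java
import com.alibaba.dashscope.audio.asr.recognition.RecognitionResult;

public class SentenceRangePrinter {
    // Call this from ResultCallback.onEvent to print the time range of each completed sentence.
    public static void printSentenceRange(RecognitionResult result) {
        if (result.isSentenceEnd()) {
            System.out.printf("[%d ms - %d ms] %s%n",
                    result.getSentence().getBeginTime(),
                    result.getSentence().getEndTime(),
                    result.getSentence().getText());
        }
    }
}
```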
Q: How do I recognize a local file (recorded audio file)?
There are two ways to recognize a local file:

- Pass the local file path directly: The complete recognition result is returned after the entire file is processed. This approach is not suitable for scenarios that require immediate feedback. Pass the file path to the call method of the Recognition class to recognize the audio file directly. For more information, see Synchronous call.
- Convert the local file into a binary stream for recognition: Results are returned in real time as the file is streamed and recognized. This approach is suitable for scenarios that require immediate feedback. Use the sendAudioFrame method of the Recognition class to send a binary stream to the server (see Streaming call: Based on callbacks), or use the streamCall method of the Recognition class (see Streaming call: Based on Flowable).
Troubleshooting
Q: Why is there no recognition result?

- Check whether the format and sampleRate/sample_rate request parameters are set correctly and meet the parameter constraints. Common errors include the following:
  - The audio file has a .wav extension but actually contains MP3 data, and the format parameter does not match the actual audio encoding.
  - The audio sample rate is 3600 Hz, but the sampleRate/sample_rate parameter is set to 48000.

  You can use the ffprobe tool to obtain audio information, such as the container, encoding, sample rate, and sound channels:

  ffprobe -v error -show_entries format=format_name -show_entries stream=codec_name,sample_rate,channels -of default=noprint_wrappers=1 input.xxx

- When you use the paraformer-realtime-v2 model, check whether the language set in language_hints matches the actual language of the audio. For example, the audio is in Chinese, but language_hints is set to en (English).
- If the preceding settings are correct but specific words are still recognized incorrectly, use custom hotwords to improve their recognition.