This topic describes the parameters and interfaces of the Fun-ASR real-time speech recognition Java SDK.
User guide: For an introduction to the models and selection recommendations, see Real-time speech recognition - Fun-ASR/Gummy/Paraformer.
Prerequisites
You have activated the service and created an API key. To prevent security risks from code leakage, export the API key as an environment variable instead of hard-coding it in your code.
Model availability
International (Singapore)
Model | Version | Supported languages | Supported sample rates | Scenarios | Supported audio formats | Price | Free quota (Note) |
fun-asr-realtime This model is currently equivalent to fun-asr-realtime-2025-11-07. | Stable | Chinese (Mandarin, Cantonese, Wu, Minnan, Hakka, Gan, Xiang, and Jin), English, and Japanese. This model also supports Mandarin accents from various Chinese regions, including Zhongyuan, Southwest, Ji-Lu, Jianghuai, Lan-Yin, Jiao-Liao, Northeast, Beijing, Hong Kong/Taiwan, Henan, Shaanxi, Hubei, Sichuan, Chongqing, Yunnan, Guizhou, Guangdong, Guangxi, Hebei, Tianjin, Shandong, Anhui, Nanjing, Jiangsu, Hangzhou, Gansu, and Ningxia. | 16 kHz | ApsaraVideo Live, conferences, call centers, and more. | PCM, WAV, MP3, Opus, Speex, AAC, and AMR | $0.00009/second | 36,000 seconds (10 hours) Valid for 90 days |
fun-asr-realtime-2025-11-07 | Snapshot | Same as fun-asr-realtime for the remaining columns. |
China (Beijing)
Model | Version | Supported languages | Supported sample rates | Scenarios | Supported audio formats | Price |
fun-asr-realtime Equivalent to fun-asr-realtime-2025-11-07 | Stable | Chinese (Mandarin, Cantonese, Wu, Minnan, Hakka, Gan, Xiang, and Jin), English, and Japanese. This model also supports Mandarin accents from various Chinese regions and provinces, including Zhongyuan, Southwest, Ji-Lu, Jianghuai, Lan-Yin, Jiao-Liao, Northeast, Beijing, Hong Kong/Taiwan, Henan, Shaanxi, Hubei, Sichuan, Chongqing, Yunnan, Guizhou, Guangdong, Guangxi, Hebei, Tianjin, Shandong, Anhui, Nanjing, Jiangsu, Hangzhou, Gansu, and Ningxia. | 16 kHz | ApsaraVideo Live, conferences, call centers, and more | pcm, wav, mp3, opus, speex, aac, amr | $0.000047/second |
fun-asr-realtime-2025-11-07 (optimized for far-field Voice Activity Detection (VAD); higher recognition accuracy than fun-asr-realtime-2025-09-15) | Snapshot | Same as fun-asr-realtime for the remaining columns. |
fun-asr-realtime-2025-09-15 | Snapshot | Chinese (Mandarin), English | Same as fun-asr-realtime for the remaining columns. |
Getting started
The Recognition class provides interfaces for non-streaming and bidirectional streaming calls. Choose the appropriate calling method based on your requirements:
Non-streaming call: Recognizes a local file and returns the complete result at once. This is suitable for processing pre-recorded audio.
Bidirectional streaming call: Recognizes an audio stream and outputs the results in real time. The audio stream can originate from an external device, such as a microphone, or be read from a local file. This is suitable for scenarios that require immediate feedback.
Non-streaming call
Submit a single real-time speech-to-text task and synchronously receive the transcription result by passing a local file. This call blocks the current thread.
To perform recognition and obtain the result, instantiate the Recognition class, call the call method, and attach the request parameters and the file to be recognized.
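The following is a minimal sketch of a non-streaming call. The file name asr_example.wav and the assumption that the API key is read from the DASHSCOPE_API_KEY environment variable are illustrative only.
import com.alibaba.dashscope.audio.asr.recognition.Recognition;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;

import java.io.File;

public class NonStreamingDemo {
    public static void main(String[] args) {
        // Configure the request parameters for a local 16 kHz WAV file.
        RecognitionParam param = RecognitionParam.builder()
                .model("fun-asr-realtime")
                .format("wav")
                .sampleRate(16000)
                .build();

        Recognition recognition = new Recognition();
        // Blocks the current thread until the whole file has been recognized.
        String result = recognition.call(param, new File("asr_example.wav"));
        System.out.println(result);
    }
}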
Bidirectional streaming call: Callback-based
Submit a single real-time speech-to-text task and receive real-time recognition results in a stream by implementing a callback interface.
1. Start streaming speech recognition
Instantiate the Recognition class and call the call method to bind the request parameters and the callback interface (ResultCallback) to start streaming speech recognition.
2. Stream audio
Repeatedly call the sendAudioFrame method of the Recognition class to send segments of a binary audio stream to the server. The stream can be read from a local file or a device, such as a microphone. While you send audio data, the server returns real-time recognition results to the client through the onEvent method of the ResultCallback callback interface. Each audio segment that you send should be about 100 milliseconds long, with a data size between 1 KB and 16 KB.
3. End processing
Call the stop method of the Recognition class to stop speech recognition. This method blocks the current thread until the onComplete or onError callback of the ResultCallback interface is triggered. A minimal sketch covering all three steps follows.
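The following sketch walks through the three steps above. Reading the audio from a local PCM file in roughly 100 ms chunks stands in for a real capture device; the file name and chunk size are illustrative assumptions.
import com.alibaba.dashscope.audio.asr.recognition.Recognition;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionResult;
import com.alibaba.dashscope.common.ResultCallback;

import java.io.FileInputStream;
import java.nio.ByteBuffer;

public class CallbackStreamingDemo {
    public static void main(String[] args) throws Exception {
        RecognitionParam param = RecognitionParam.builder()
                .model("fun-asr-realtime")
                .format("pcm")
                .sampleRate(16000)
                .build();

        Recognition recognition = new Recognition();

        // Step 1: start streaming recognition and bind the callback.
        recognition.call(param, new ResultCallback<RecognitionResult>() {
            @Override
            public void onEvent(RecognitionResult result) {
                // Intermediate and final results arrive here in real time.
                System.out.println("text: " + result.getSentence().getText());
            }

            @Override
            public void onComplete() {
                System.out.println("recognition complete");
            }

            @Override
            public void onError(Exception e) {
                System.err.println("error: " + e.getMessage());
            }
        });

        // Step 2: stream audio in ~100 ms chunks (3200 bytes of 16 kHz, 16-bit mono PCM).
        try (FileInputStream fis = new FileInputStream("asr_example.pcm")) {
            byte[] buffer = new byte[3200];
            int read;
            while ((read = fis.read(buffer)) > 0) {
                // Copy each chunk so the SDK can process it asynchronously.
                recognition.sendAudioFrame(ByteBuffer.wrap(buffer.clone(), 0, read));
                Thread.sleep(100); // simulate real-time capture
            }
        }

        // Step 3: stop; blocks until onComplete or onError fires.
        recognition.stop();
    }
}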
Bidirectional streaming call: Flowable-based
Submit a single real-time speech-to-text task and receive real-time recognition results in a stream by implementing a Flowable workflow.
In this SDK, Flowable refers to the reactive-streams type (io.reactivex.Flowable) that is used to produce the audio stream and consume the recognition results asynchronously. For more information about how to use Flowable, see Flowable API details.
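The following sketch shows a Flowable-based call. It assumes that streamCall accepts a Flowable<ByteBuffer> of audio frames and returns a Flowable<RecognitionResult> (see the Recognition class table below), and that the RxJava 2 Flowable type (io.reactivex.Flowable) is on the classpath; the file name and chunk size are illustrative.
import com.alibaba.dashscope.audio.asr.recognition.Recognition;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;
import io.reactivex.BackpressureStrategy;
import io.reactivex.Flowable;

import java.io.FileInputStream;
import java.nio.ByteBuffer;
import java.util.Arrays;

public class FlowableStreamingDemo {
    public static void main(String[] args) {
        RecognitionParam param = RecognitionParam.builder()
                .model("fun-asr-realtime")
                .format("pcm")
                .sampleRate(16000)
                .build();

        // Emit a local PCM file as ~100 ms audio frames.
        Flowable<ByteBuffer> audioSource = Flowable.create(emitter -> {
            try (FileInputStream fis = new FileInputStream("asr_example.pcm")) {
                byte[] buffer = new byte[3200];
                int read;
                while ((read = fis.read(buffer)) > 0) {
                    emitter.onNext(ByteBuffer.wrap(Arrays.copyOf(buffer, read)));
                    Thread.sleep(100); // simulate real-time capture
                }
            }
            emitter.onComplete();
        }, BackpressureStrategy.BUFFER);

        Recognition recognition = new Recognition();
        // blockingForEach keeps the main thread alive until the result stream completes.
        recognition.streamCall(param, audioSource)
                .blockingForEach(result ->
                        System.out.println("text: " + result.getSentence().getText()));
    }
}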
High-concurrency calls
The DashScope Java SDK uses the connection pool technology of OkHttp3 to reduce the overhead of repeatedly establishing connections. For more information, see Real-time speech recognition in high-concurrency scenarios.
Request parameters
Use the chain methods of RecognitionParam to configure parameters, such as the model, sample rate, and audio format. Then, pass the configured parameter object to the call or streamCall methods of the Recognition class.
Parameter | Type | Default | Required | Description |
model | String | - | Yes | The model for real-time speech recognition. For more information, see Model list. |
sampleRate | Integer | - | Yes | The sample rate of the audio to be recognized, in Hz. fun-asr-realtime supports a sample rate of 16000 Hz. |
format | String | - | Yes | The format of the audio to be recognized. Supported audio formats: pcm, wav, mp3, opus, speex, aac, amr. Important: opus/speex must use Ogg encapsulation; wav must be PCM encoded; amr supports only AMR-NB. |
vocabularyId | String | - | No | The ID of the custom vocabulary. For more information, see Custom vocabulary. This parameter is not set by default. |
semantic_punctuation_enabled | boolean | false | No | Specifies whether to enable semantic punctuation. true: sentences are segmented based on semantics (semantic punctuation). false (default): sentences are segmented based on Voice Activity Detection (VAD punctuation). Semantic punctuation provides higher accuracy and is suitable for meeting transcription scenarios; VAD punctuation provides lower latency and is suitable for interactive scenarios. Switch between the two modes by adjusting this parameter. Note: Set this parameter using the parameter or parameters method of the RecognitionParam builder. |
max_sentence_silence | Integer | 1300 | No | The silence duration threshold for VAD punctuation, in ms. When the silence after a segment of speech exceeds this threshold, the system determines that the sentence has ended. Valid range: 200 to 6000. Default: 1300. This parameter takes effect only when VAD punctuation is used (semantic_punctuation_enabled is false). Note: Set this parameter using the parameter or parameters method of the RecognitionParam builder. |
multi_threshold_mode_enabled | boolean | false | No | When enabled (true), prevents VAD punctuation from producing overly long sentences. Disabled (false) by default. This parameter takes effect only when VAD punctuation is used (semantic_punctuation_enabled is false). Note: Set this parameter using the parameter or parameters method of the RecognitionParam builder. |
punctuation_prediction_enabled | boolean | true | No | Specifies whether to automatically add punctuation to the recognition result. true (default): punctuation is added. false: punctuation is not added. Note: Set this parameter using the parameter or parameters method of the RecognitionParam builder. |
heartbeat | boolean | false | No | Specifies whether to maintain a persistent connection with the server during long periods of silence. true: the connection is kept alive as long as you continuously send silent audio. false (default): the connection may be closed after a period of silence even if you keep sending silent audio. Note: To use this field, the SDK version must be 2.19.1 or later. Set this parameter using the parameter or parameters method of the RecognitionParam builder. |
apiKey | String | - | No | Your API key. |
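A short sketch of configuring these parameters follows. The dedicated builder methods (model, format, sampleRate, vocabularyId) and the generic parameter method for the snake_case fields are used as described in the table above; the vocabulary ID and the chosen values are placeholders.
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;

public class ParamDemo {
    public static RecognitionParam buildParam() {
        return RecognitionParam.builder()
                .model("fun-asr-realtime")
                .format("pcm")
                .sampleRate(16000)
                // Optional: custom vocabulary ID (placeholder value).
                .vocabularyId("vocab-example-id")
                // Optional: switch to semantic punctuation for meeting transcription.
                .parameter("semantic_punctuation_enabled", true)
                // Optional: keep the connection alive during long silence (SDK 2.19.1 or later).
                .parameter("heartbeat", true)
                .build();
    }
}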
Key interfaces
Recognition class
Import the Recognition class using "import com.alibaba.dashscope.audio.asr.recognition.Recognition;". The key interfaces of this class are described in the following table:
Interface/Method | Parameters | Return value | Description |
call(RecognitionParam param, ResultCallback<RecognitionResult> callback) | param: the request parameters. callback: the callback interface that receives results. | None | Performs callback-based streaming real-time recognition. This method does not block the current thread. |
call(RecognitionParam param, File file) | param: the request parameters. file: the local audio file to be recognized. | Recognition result | Performs a non-streaming call based on a local file. This method blocks the current thread until the entire audio file is read. The file to be recognized must have read permissions. |
streamCall(RecognitionParam param, Flowable<ByteBuffer> audioFrame) | param: the request parameters. audioFrame: the binary audio stream to be recognized. | Flowable<RecognitionResult> | Performs Flowable-based streaming real-time recognition. |
sendAudioFrame(ByteBuffer audioFrame) | audioFrame: a binary audio segment. | None | Pushes an audio stream. Each audio packet should have a duration of about 100 ms and a size between 1 KB and 16 KB. The recognition results are obtained through the onEvent method of the ResultCallback callback. |
stop() | None | None | Stops real-time recognition. This method blocks the current thread until the onComplete or onError callback of the ResultCallback interface is triggered. |
getDuplexApi().close(int code, String reason) | code: the WebSocket close code. reason: the reason for closing. Refer to The WebSocket Protocol to configure these two parameters. | true | You must close the WebSocket connection after a task is complete to prevent connection leaks. This applies even if an exception occurs. To learn how to reuse connections to improve efficiency, see Real-time speech recognition in high-concurrency scenarios. |
getLastRequestId() | None | requestId | Gets the request ID of the current task. Use this method after a new task is started by calling call or streamCall. Note: This method is available only in SDK versions 2.18.0 and later. |
getFirstPackageDelay() | None | First-packet latency | Gets the first-packet latency, which is the delay from when the first audio packet is sent to when the first recognition result is received. Use this method after the task is complete. Note: This method is available only in SDK versions 2.18.0 and later. |
getLastPackageDelay() | None | Last-packet latency | Gets the last-packet latency, which is the time taken from when the last audio packet is sent to when the final recognition result is received. Use this method after the task is complete. Note: This method is available only in SDK versions 2.18.0 and later. |
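As a brief usage note, the request ID and latency metrics from the table above can be read after a task finishes (SDK 2.18.0 or later). The snippet below is a sketch that assumes a Recognition instance whose task has already completed.
import com.alibaba.dashscope.audio.asr.recognition.Recognition;

public class MetricsDemo {
    // Call after stop() has returned (or the result Flowable has completed).
    public static void printMetrics(Recognition recognition) {
        String requestId = recognition.getLastRequestId();       // request ID of the finished task
        long firstPacketMs = recognition.getFirstPackageDelay();  // first-packet latency in ms
        long lastPacketMs = recognition.getLastPackageDelay();    // last-packet latency in ms
        System.out.println(requestId + ": " + firstPacketMs + " ms / " + lastPacketMs + " ms");
    }
}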
Callback interface (ResultCallback)
When you make a bidirectional streaming call, the server uses a callback to return key process information and data to the client. You must implement a callback method to handle the information or data that is returned by the server.
Implement the callback methods by inheriting the ResultCallback abstract class. When you inherit this abstract class, specify the generic type as RecognitionResult. The RecognitionResult object encapsulates the data structure that is returned by the server.
Because the Java SDK reuses connections, the callback interface does not provide onOpen or onClose methods.
Interface/Method | Parameters | Return value | Description |
onEvent(RecognitionResult result) | result: the real-time recognition result. | None | This method is called back when the service responds. |
onComplete() | None | None | This method is called back after the task is complete. |
onError(Exception e) | e: the exception that occurred. | None | This method is called back when an exception occurs. |
Response results
Real-time recognition result (RecognitionResult)
RecognitionResult represents the result of a single real-time recognition.
Interface/Method | Parameters | Return value | Description |
getRequestId() | None | requestId | Gets the request ID. |
isSentenceEnd() | None | Whether the sentence is complete, meaning punctuation has occurred | Determines whether the given sentence has ended. |
getSentence() | None | Sentence information (Sentence) | Gets sentence information, including timestamps and text. |
Sentence information (Sentence)
Interface/Method | Parameters | Return value | Description |
getBeginTime() | None | Sentence start time, in ms | Returns the sentence start time. |
getEndTime() | None | Sentence end time, in ms | Returns the sentence end time. |
getText() | None | Recognized text | Returns the recognized text. |
getWords() | None | A List of word timestamp information (Word) | Returns word timestamp information. |
Word timestamp information (Word)
Interface/Method | Parameters | Return value | Description |
getBeginTime() | None | Word start time, in ms | Returns the word start time. |
getEndTime() | None | Word end time, in ms | Returns the word end time. |
getText() | None | Recognized word | Returns the recognized word. |
getPunctuation() | None | Punctuation | Returns the punctuation. |
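Putting the three result types together, the following sketch shows how a result might be read inside the onEvent callback. The getter names follow the tables above; var is used for the Sentence and Word locals so that no import paths for those classes need to be assumed.
import com.alibaba.dashscope.audio.asr.recognition.RecognitionResult;

public class ResultParsingDemo {
    // Typically called from ResultCallback.onEvent(RecognitionResult result).
    public static void handle(RecognitionResult result) {
        var sentence = result.getSentence();
        if (result.isSentenceEnd()) {
            // Complete sentence: sentence segmentation has occurred.
            System.out.printf("[%d-%d ms] %s%n",
                    sentence.getBeginTime(), sentence.getEndTime(), sentence.getText());
            // Word-level timestamps.
            for (var word : sentence.getWords()) {
                System.out.printf("  %s%s (%d-%d ms)%n",
                        word.getText(), word.getPunctuation(),
                        word.getBeginTime(), word.getEndTime());
            }
        } else {
            // Intermediate result for the sentence still in progress.
            System.out.println("partial: " + sentence.getText());
        }
    }
}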
Error codes
If an error occurs, see Error messages for troubleshooting.
If the problem persists, join the developer group to report the issue. Provide the Request ID to help us investigate the issue.
FAQ
Features
Q: How do I maintain a persistent connection with the server during long periods of silence?
A: Set the heartbeat request parameter to true and continuously send silent audio to the server.
Silent audio refers to content in an audio file or data stream that has no sound signal. Generate silent audio using audio editing software, such as Audacity or Adobe Audition, or using command-line tools, such as FFmpeg.
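As an illustrative sketch only: zero-valued PCM samples are silence, so during quiet periods you can keep sending zero-filled frames while heartbeat is set to true. A 3200-byte frame corresponds to about 100 ms of 16 kHz, 16-bit mono audio; the loop duration is arbitrary.
import com.alibaba.dashscope.audio.asr.recognition.Recognition;

import java.nio.ByteBuffer;

public class SilenceDemo {
    // Sends roughly 5 seconds of silent PCM audio through an active streaming task.
    public static void sendSilence(Recognition recognition) throws InterruptedException {
        byte[] silence = new byte[3200]; // all zeros = ~100 ms of 16 kHz, 16-bit mono silence
        for (int i = 0; i < 50; i++) {
            recognition.sendAudioFrame(ByteBuffer.wrap(silence));
            Thread.sleep(100); // pace the frames in real time
        }
    }
}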
Q: How do I convert an audio format to a supported format?
A: Use the FFmpeg tool. For more information, see the official FFmpeg website.
# Basic conversion command (universal template)
# -i, function: input file path, example: audio.wav
# -c:a, function: audio encoder, examples: aac, libmp3lame, pcm_s16le
# -b:a, function: bit rate (audio quality control), examples: 192k, 320k
# -ar, function: sample rate, examples: 44100 (CD), 48000, 16000
# -ac, function: number of sound channels, examples: 1 (mono), 2 (stereo)
# -y, function: overwrite existing file (no value needed)
ffmpeg -i input_audio.ext -c:a encoder_name -b:a bit_rate -ar sample_rate -ac number_of_channels output.ext
# Example: WAV → MP3 (preserve original quality)
ffmpeg -i input.wav -c:a libmp3lame -q:a 0 output.mp3
# Example: MP3 → WAV (16-bit PCM standard format)
ffmpeg -i input.mp3 -c:a pcm_s16le -ar 44100 -ac 2 output.wav
# Example: M4A → AAC (extract/convert Apple audio)
ffmpeg -i input.m4a -c:a copy output.aac # Directly extract without re-encoding
ffmpeg -i input.m4a -c:a aac -b:a 256k output.aac # Re-encode to improve quality
# Example: FLAC lossless → Opus (high compression)
ffmpeg -i input.flac -c:a libopus -b:a 128k -vbr on output.opus
Q: How do I recognize a local file (recorded audio file)?
A: There are two ways to recognize a local file:
Directly pass the local file path: This method returns the complete recognition result after the recognition is complete. It is not suitable for scenarios that require immediate feedback.
Pass the file path to the call method of the Recognition class to directly recognize the audio file. For more information, see Non-streaming call.
Convert the local file into a binary stream for recognition: This method recognizes the file and streams the recognition results. It is suitable for scenarios that require immediate feedback.
Use the sendAudioFrame method of the Recognition class to send a binary stream to the server for recognition. For more information, see Bidirectional streaming call: Callback-based.
Use the streamCall method of the Recognition class to send a binary stream to the server for recognition. For more information, see Bidirectional streaming call: Flowable-based.
Troubleshooting
Q: Why is speech not recognized (no recognition result)?
A: Check whether the audio format (format) and sample rate (sampleRate or sample_rate) in the request parameters are correctly set and comply with the parameter constraints. The following are examples of common errors:
The audio file has a .wav file name extension but is actually in MP3 format, and the format request parameter is set to wav based on the extension instead of the actual format. This is an incorrect parameter setting.
The audio sample rate is 3600 Hz, but the sampleRate or sample_rate request parameter is set to 48000. This is an incorrect parameter setting.
Use the ffprobe tool to obtain audio information, such as the container, encoding, sample rate, and sound channels:
ffprobe -v error -show_entries format=format_name -show_entries stream=codec_name,sample_rate,channels -of default=noprint_wrappers=1 input.xxx
Check whether the language specified in language_hints is consistent with the actual language of the audio.
For example, the audio is in Chinese, but language_hints is set to en (English).
If all the preceding checks pass, use a custom vocabulary to improve the recognition of specific words.