This topic describes the interfaces and parameters of the Fun-ASR Java SDK for real-time speech recognition.
User guide: For an introduction to the models and selection recommendations, see Real-time speech recognition - Fun-ASR/Paraformer.
Prerequisites
You have activated the service and created an API key. Export the API key as an environment variable instead of hard-coding it to prevent security risks.
Model availability
International
In the international deployment mode, endpoints and data storage are in the Singapore region. Model inference compute resources are dynamically scheduled globally, excluding Chinese Mainland.
| Model | Version | Unit price | Free quota |
| --- | --- | --- | --- |
| fun-asr-realtime (currently fun-asr-realtime-2025-11-07) | Stable | $0.00009/second | 36,000 seconds (10 hours), valid for 90 days |
| fun-asr-realtime-2025-11-07 | Snapshot | $0.00009/second | 36,000 seconds (10 hours), valid for 90 days |
Languages supported: Mandarin, Cantonese, Wu, Minnan, Hakka, Gan, Xiang, and Jin. Also supports Mandarin accents from Zhongyuan, Southwest, Jilu, Jianghuai, Lanyin, Jiaoliao, Northeast, Beijing, and Hong Kong–Taiwan regions—including Henan, Shaanxi, Hubei, Sichuan, Chongqing, Yunnan, Guizhou, Guangdong, Guangxi, Hebei, Tianjin, Shandong, Anhui, Nanjing, Jiangsu, Hangzhou, Gansu, and Ningxia. Also supports English and Japanese.
Sample rates supported: 16 kHz
Audio formats supported: pcm, wav, mp3, opus, speex, aac, amr
Chinese Mainland
In the Chinese Mainland deployment mode, endpoints and data storage are in the Beijing region. Model inference compute resources are limited to Chinese Mainland.
| Model | Version | Unit price | Free quota |
| --- | --- | --- | --- |
| fun-asr-realtime (currently fun-asr-realtime-2025-11-07) | Stable | $0.000047/second | No free quota |
| fun-asr-realtime-2026-02-28 | Snapshot | $0.000047/second | No free quota |
| fun-asr-realtime-2025-11-07 | Snapshot | $0.000047/second | No free quota |
| fun-asr-realtime-2025-09-15 | Snapshot | $0.000047/second | No free quota |
| fun-asr-flash-8k-realtime (currently fun-asr-flash-8k-realtime-2026-01-28) | Stable | $0.000032/second | No free quota |
| fun-asr-flash-8k-realtime-2026-01-28 | Snapshot | $0.000032/second | No free quota |
Languages supported:
fun-asr-realtime, fun-asr-realtime-2026-02-28, fun-asr-realtime-2025-11-07: Chinese (Mandarin, Cantonese, Wu, Minnan, Hakka, Gan, Xiang, and Jin. Also supports Mandarin accents from Zhongyuan, Southwest, Jilu, Jianghuai, Lanyin, Jiaoliao, Northeast, Beijing, and Hong Kong–Taiwan regions—including Henan, Shaanxi, Hubei, Sichuan, Chongqing, Yunnan, Guizhou, Guangdong, Guangxi, Hebei, Tianjin, Shandong, Anhui, Nanjing, Jiangsu, Hangzhou, Gansu, and Ningxia), English, and Japanese.
fun-asr-realtime-2025-09-15: Chinese (Mandarin), English
Sample rates supported:
fun-asr-flash-8k-realtime and fun-asr-flash-8k-realtime-2026-01-28: 8 kHz
All other models: 16 kHz
Audio formats supported: pcm, wav, mp3, opus, speex, aac, amr
Getting started
The Recognition class provides non-streaming and bidirectional streaming interfaces. Choose based on your requirements:
Non-streaming call: Recognizes a local file and returns the complete result at once. Suitable for pre-recorded audio.
Bidirectional streaming call: Recognizes audio streams with real-time output. Audio can originate from a microphone or local file. Suitable for scenarios requiring immediate feedback.
Non-streaming call
Submit a speech-to-text task with a local file and synchronously receive the transcription result. This call blocks the current thread.
To perform recognition, instantiate the Recognition class, and then call the call method with the request parameters and the file to be recognized.
Bidirectional streaming call: Callback-based
Submit a speech-to-text task and receive streaming recognition results by implementing a callback interface.
Start streaming speech recognition
Instantiate the Recognition class, and then call the call method to bind the request parameters and the callback interface (ResultCallback) and start streaming speech recognition.
Stream audio
Repeatedly call the sendAudioFrame method to send binary audio segments to the server. The stream can come from a local file or a microphone. While you send audio data, the server returns real-time recognition results to the client through the onEvent method of the ResultCallback interface. Each audio segment should be about 100 ms long and between 1 KB and 16 KB.
End processing
Call the stop method of the Recognition class to stop speech recognition. This call blocks until the onComplete or onError callback of the ResultCallback triggers.
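The three steps above can be sketched end to end. The FakeRecognizer class below is a hypothetical stand-in for the SDK's Recognition class (it only records the call order), so the example is self-contained; with the real SDK you would bind RecognitionParam and a ResultCallback in call. It also shows the frame-size arithmetic: at 16 kHz, 16-bit mono PCM, 100 ms of audio is 16000 × 2 × 0.1 = 3200 bytes, within the 1 KB to 16 KB limit.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

public class StreamingSketch {
    // Hypothetical stand-in for the SDK's Recognition class:
    // it records the call sequence instead of talking to a server.
    static class FakeRecognizer {
        final List<String> log = new ArrayList<>();
        void call() { log.add("call"); }                          // 1. bind params + callback
        void sendAudioFrame(ByteBuffer frame) {                   // 2. one ~100 ms segment
            log.add("frame:" + frame.remaining());
        }
        void stop() { log.add("stop"); }                          // 3. blocks until onComplete/onError
    }

    public static void main(String[] args) {
        // 100 ms of 16 kHz, 16-bit (2-byte) mono PCM = 16000 * 2 / 10 bytes.
        final int frameBytes = 16000 * 2 / 10;                    // 3200, within 1-16 KB

        byte[] audio = new byte[frameBytes * 3];                  // pretend 300 ms of audio
        FakeRecognizer rec = new FakeRecognizer();

        rec.call();                                               // start recognition
        for (int off = 0; off < audio.length; off += frameBytes) {
            int len = Math.min(frameBytes, audio.length - off);
            rec.sendAudioFrame(ByteBuffer.wrap(audio, off, len)); // stream audio
        }
        rec.stop();                                               // end processing

        System.out.println(rec.log); // [call, frame:3200, frame:3200, frame:3200, stop]
    }
}
```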
Bidirectional streaming call: Flowable-based
Submit a speech-to-text task and receive streaming results through a Flowable workflow.
Flowable is a reactive stream type from the RxJava library supporting backpressure. See Flowable API reference for details.
High-concurrency calls
The DashScope Java SDK uses OkHttp3 connection pooling to reduce connection overhead. See Optimize Paraformer real-time speech recognition for high concurrency for details.
Request parameters
Use the RecognitionParam builder methods to configure request parameters (model, sample rate, audio format), and then pass the RecognitionParam object to the call or streamCall method.
| Parameter | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| model | String | - | Yes | The model for real-time speech recognition. For more information, see Model list. |
| sampleRate | Integer | - | Yes | Audio sample rate in Hz. fun-asr-realtime supports 16000 Hz; fun-asr-flash-8k-realtime supports 8000 Hz. |
| format | String | - | Yes | Audio format. Supported: pcm, wav, mp3, opus, speex, aac, amr. Important: opus and speex must use Ogg encapsulation; wav must be PCM encoded; amr supports only AMR-NB. |
| semantic_punctuation_enabled | boolean | false | No | Enables semantic punctuation. true: semantic punctuation, with higher accuracy, suitable for meeting transcription. false (default): VAD punctuation, with lower latency, suitable for interactive scenarios. Note: Set this field using the parameter or parameters method. |
| max_sentence_silence | Integer | 1300 | No | Silence duration threshold for VAD punctuation, in ms. When the silence after a speech segment exceeds this threshold, the sentence ends. Range: 200-6000. Takes effect only when semantic_punctuation_enabled is false. Note: Set this field using the parameter or parameters method. |
| multi_threshold_mode_enabled | boolean | false | No | When true, prevents VAD from prematurely segmenting long sentences. Takes effect only when semantic_punctuation_enabled is false. Note: Set this field using the parameter or parameters method. |
| punctuation_prediction_enabled | boolean | true | No | Automatically adds punctuation to the recognition result. true (default): enabled. false: disabled. Note: Set this field using the parameter or parameters method. |
| heartbeat | boolean | false | No | Maintains a persistent connection with the server. true: the connection stays open during long periods of silence, as long as you keep sending silent audio. false (default): the connection may be closed after prolonged silence. Note: Requires SDK version 2.19.1 or later. Set this field using the parameter or parameters method. |
| language_hints | String[] | - | No | Language codes for recognition. If the language is unknown in advance, leave this parameter unset and the model identifies it automatically. Only the first value in the array is read; all other values are ignored. Supported codes: zh (Chinese), en (English), ja (Japanese); fun-asr-realtime-2025-09-15 supports only zh and en. Note: Set this field using the parameter or parameters method. |
| speech_noise_threshold | float | - | No | Speech-noise detection threshold that controls VAD sensitivity. Range: [-1.0, 1.0]. The closer the value is to -1.0, the more likely audio is judged as speech (noise may be misrecognized as speech); the closer to 1.0, the more likely audio is judged as noise (speech may be missed). Important: This is an advanced parameter; adjustments can significantly affect recognition quality. Note: Set this field using the parameter or parameters method. |
| apiKey | String | - | No | Your API key. |
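The snake_case fields above are set through the builder's parameter (or parameters) method rather than dedicated setters. Below is a minimal, self-contained sketch of that builder pattern; ParamBuilder is a hypothetical stand-in defined locally, while the real builder is created by the SDK's RecognitionParam.builder().

```java
import java.util.HashMap;
import java.util.Map;

public class ParamSketch {
    // Hypothetical stand-in for the SDK's RecognitionParam builder:
    // dedicated setters for required fields, parameter() for extra fields.
    static class ParamBuilder {
        final Map<String, Object> fields = new HashMap<>();
        ParamBuilder model(String m) { fields.put("model", m); return this; }
        ParamBuilder format(String f) { fields.put("format", f); return this; }
        ParamBuilder sampleRate(int r) { fields.put("sample_rate", r); return this; }
        ParamBuilder parameter(String k, Object v) { fields.put(k, v); return this; }
        Map<String, Object> build() { return fields; }
    }

    public static void main(String[] args) {
        Map<String, Object> param = new ParamBuilder()
                .model("fun-asr-realtime")
                .format("pcm")
                .sampleRate(16000)
                .parameter("semantic_punctuation_enabled", true) // snake_case extra field
                .build();
        System.out.println(param.get("semantic_punctuation_enabled")); // true
    }
}
```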
Key interfaces
Recognition class
Import: import com.alibaba.dashscope.audio.asr.recognition.Recognition;
Key interfaces:
| Interface/Method | Parameters | Return value | Description |
| --- | --- | --- | --- |
| call(RecognitionParam param, ResultCallback<RecognitionResult> callback) | param: request parameters. callback: callback interface. | None | Performs callback-based streaming recognition. Does not block the current thread. |
| call(RecognitionParam param, File file) | param: request parameters. file: local audio file. | Recognition result | Performs non-streaming recognition with a local file. Blocks until the entire audio file is processed. The file must be readable. |
| streamCall(RecognitionParam param, Flowable<ByteBuffer> audioFrame) | param: request parameters. audioFrame: audio stream. | Flowable<RecognitionResult> | Performs Flowable-based streaming real-time recognition. |
| sendAudioFrame(ByteBuffer audioFrame) | audioFrame: one binary audio segment. | None | Sends an audio stream segment. Each packet: ~100 ms of audio, 1-16 KB. Results are returned through the onEvent method of the callback. |
| stop() | None | None | Stops recognition. Blocks until the onComplete or onError callback triggers. |
| getDuplexApi().close(int code, String reason) | code: WebSocket close code. reason: the reason for closing. See The WebSocket Protocol for these two parameters. | true | Closes the WebSocket connection after task completion to prevent connection leaks, even if exceptions occur. See Optimize Paraformer real-time speech recognition for high concurrency for connection reuse. |
| getLastRequestId() | None | requestId | Gets the request ID. Use after calling call or streamCall. Note: Available only in SDK versions 2.18.0 and later. |
| getFirstPackageDelay() | None | First-packet latency | Gets the first-packet latency (the delay from sending the first audio packet to receiving the first recognition result). Use after task completion. Note: Available only in SDK versions 2.18.0 and later. |
| getLastPackageDelay() | None | Last-packet latency | Gets the last-packet latency (the time from sending the stop command to receiving the final recognition result). Use after task completion. Note: Available only in SDK versions 2.18.0 and later. |
Callback interface (ResultCallback)
In bidirectional streaming calls, the server returns results via callbacks. Implement callback methods to handle returned data.
Implement callbacks by extending the ResultCallback abstract class, specifying the generic type as RecognitionResult. The RecognitionResult object encapsulates the data returned by the server.
Because the Java SDK reuses connections, ResultCallback has no onOpen or onClose methods.
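A sketch of what implementing the callback looks like. The ResultCallback and RecognitionResult types below are simplified stand-ins defined locally so the example runs on its own; the real types come from the com.alibaba.dashscope packages and carry more fields.

```java
import java.util.ArrayList;
import java.util.List;

public class CallbackSketch {
    // Simplified stand-in for the SDK's RecognitionResult.
    static class RecognitionResult {
        final String text;
        final boolean sentenceEnd;
        RecognitionResult(String text, boolean sentenceEnd) {
            this.text = text;
            this.sentenceEnd = sentenceEnd;
        }
        boolean isSentenceEnd() { return sentenceEnd; }
    }

    // Simplified stand-in for the SDK's ResultCallback abstract class.
    abstract static class ResultCallback<T> {
        abstract void onEvent(T result);     // server returned a result
        abstract void onComplete();          // task finished successfully
        abstract void onError(Exception e);  // task failed
    }

    public static void main(String[] args) {
        List<String> finals = new ArrayList<>();
        ResultCallback<RecognitionResult> cb = new ResultCallback<RecognitionResult>() {
            @Override void onEvent(RecognitionResult r) {
                // Intermediate results arrive repeatedly; keep only final sentences.
                if (r.isSentenceEnd()) finals.add(r.text);
            }
            @Override void onComplete() { finals.add("<done>"); }
            @Override void onError(Exception e) { finals.add("<error>"); }
        };

        // Simulate the server pushing a partial result, then a final one.
        cb.onEvent(new RecognitionResult("hello", false));
        cb.onEvent(new RecognitionResult("hello world", true));
        cb.onComplete();
        System.out.println(finals); // [hello world, <done>]
    }
}
```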
| Interface/Method | Parameters | Return value | Description |
| --- | --- | --- | --- |
| onEvent(RecognitionResult result) | result: the recognition result | None | Called when the server returns a recognition result. |
| onComplete() | None | None | Called when the recognition task completes successfully. |
| onError(Exception e) | e: the exception | None | Called when an error occurs during recognition. |
Response results
Real-time recognition result (RecognitionResult)
RecognitionResult represents a single recognition result.
| Interface/Method | Parameters | Return value | Description |
| --- | --- | --- | --- |
| getRequestId() | None | requestId | Gets the request ID. |
| isSentenceEnd() | None | Whether sentence segmentation has completed | Returns whether the current sentence has ended (final result). |
| getSentence() | None | Sentence | Gets sentence information, including timestamps and text. |
Sentence information (Sentence)
| Interface/Method | Parameters | Return value | Description |
| --- | --- | --- | --- |
| getBeginTime() | None | Sentence start time, in ms | Returns the sentence start time. |
| getEndTime() | None | Sentence end time, in ms | Returns the sentence end time. |
| getText() | None | Recognized text | Returns the recognized text. |
| getWords() | None | A List of word timestamp information (Word) | Returns word-level timestamp information. |
Word timestamp information (Word)
| Interface/Method | Parameters | Return value | Description |
| --- | --- | --- | --- |
| getBeginTime() | None | Word start time, in ms | Returns the word start time. |
| getEndTime() | None | Word end time, in ms | Returns the word end time. |
| getText() | None | Word | Returns the recognized word. |
| getPunctuation() | None | Punctuation | Returns the punctuation. |
Error codes
If an error occurs, see Error messages for troubleshooting.
If the problem persists, join the developer group to report the issue. Provide the Request ID to help us investigate the issue.
FAQ
Features
Q: How do I maintain a persistent connection with the server during long periods of silence?
Set heartbeat to true and continuously send silent audio.
Silent audio has no sound signal. Generate it using audio editing software (Audacity, Adobe Audition) or command-line tools (FFmpeg).
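Because 16-bit PCM encodes silence as zero-valued samples, silent audio can also be generated directly in code. Below is a minimal sketch that produces 100 ms of silent 16 kHz, 16-bit mono PCM (raw samples, no WAV header), assuming those stream settings; the frame can then be sent periodically with sendAudioFrame.

```java
public class SilenceSketch {
    // Returns `millis` milliseconds of silent 16-bit mono PCM at `sampleRate` Hz.
    // In 16-bit PCM a zero-valued sample is silence, so the buffer stays all zeros.
    static byte[] silentPcm(int sampleRate, int millis) {
        int samples = sampleRate * millis / 1000;
        return new byte[samples * 2]; // 2 bytes per 16-bit sample
    }

    public static void main(String[] args) {
        byte[] frame = silentPcm(16000, 100);
        System.out.println(frame.length); // 3200 bytes: a valid ~100 ms frame
    }
}
```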
Q: How do I convert an audio format to a supported format?
Use FFmpeg to convert audio files. See the official FFmpeg website for details.
# Basic conversion command (universal template)
# -i, function: input file path, example: audio.wav
# -c:a, function: audio encoder, examples: aac, libmp3lame, pcm_s16le
# -b:a, function: bit rate (audio quality control), examples: 192k, 320k
# -ar, function: sample rate, examples: 44100 (CD), 48000, 16000
# -ac, function: number of sound channels, examples: 1 (mono), 2 (stereo)
# -y, function: overwrite existing file (no value needed)
ffmpeg -i input_audio.ext -c:a encoder_name -b:a bit_rate -ar sample_rate -ac number_of_channels output.ext
# Example: WAV → MP3 (preserve original quality)
ffmpeg -i input.wav -c:a libmp3lame -q:a 0 output.mp3
# Example: MP3 → WAV (16-bit PCM standard format)
ffmpeg -i input.mp3 -c:a pcm_s16le -ar 44100 -ac 2 output.wav
# Example: M4A → AAC (extract/convert Apple audio)
ffmpeg -i input.m4a -c:a copy output.aac # Directly extract without re-encoding
ffmpeg -i input.m4a -c:a aac -b:a 256k output.aac # Re-encode to improve quality
# Example: FLAC lossless → Opus (high compression)
ffmpeg -i input.flac -c:a libopus -b:a 128k -vbr on output.opus
Q: How do I recognize a local file (recorded audio file)?
A: There are two ways to recognize a local file:
Pass the local file path directly: returns the complete result after processing. Not suitable for real-time feedback scenarios. Pass the file path to the call method to recognize the audio file. See Non-streaming call for details.
Convert the local file to a binary stream: streams results in real time, suitable for immediate feedback scenarios. Use the sendAudioFrame method of the Recognition class to send a binary stream to the server for recognition (see Bidirectional streaming call: Callback-based), or use the streamCall method of the Recognition class (see Bidirectional streaming call: Flowable-based).
Troubleshooting
Q: Why is speech not recognized (no recognition result)?
Check whether the format and sampleRate/sample_rate request parameters are correct and comply with the constraints. Common errors:
The audio file has a .wav file name extension but is actually in MP3 format, while the format request parameter is set to wav.
The actual audio sample rate is 3600 Hz, but the sampleRate or sample_rate request parameter is set to 48000.
Use ffprobe to get audio info (container, encoding, sample rate, channels):
ffprobe -v error -show_entries format=format_name -show_entries stream=codec_name,sample_rate,channels -of default=noprint_wrappers=1 input.xxx
Check whether language_hints matches the actual audio language. For example, the audio is in Chinese, but language_hints is set to en (English).
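If ffprobe is not available, a WAV file's actual sample rate can also be read from its header: in the canonical RIFF/WAVE layout with the fmt chunk first, the sample rate is stored as a little-endian 32-bit integer at byte offset 24. A minimal sketch under that assumption (non-canonical WAV files with extra chunks before fmt need full chunk parsing):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.file.Files;
import java.nio.file.Path;

public class WavSampleRate {
    // Reads the sample rate from a canonical WAV header (fmt chunk first):
    // bytes 24-27 hold the sample rate as a little-endian 32-bit integer.
    static int sampleRate(byte[] header) {
        return ByteBuffer.wrap(header, 24, 4).order(ByteOrder.LITTLE_ENDIAN).getInt();
    }

    public static void main(String[] args) throws IOException {
        byte[] header;
        if (args.length == 0) {
            // No file given: demonstrate on a synthetic 44-byte header.
            header = new byte[44];
            ByteBuffer.wrap(header, 24, 4).order(ByteOrder.LITTLE_ENDIAN).putInt(16000);
        } else {
            header = Files.readAllBytes(Path.of(args[0]));
        }
        System.out.println("sample rate: " + sampleRate(header) + " Hz");
    }
}
```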