This topic describes the parameters and interfaces of the Fun-ASR real-time speech recognition Python SDK.
User Guide: For an introduction to the models and recommendations for model selection, see Real-time speech recognition - Fun-ASR/Paraformer.
Prerequisites
You have activated the service and created an API key. To prevent security risks from code leakage, export the API key as an environment variable instead of hard-coding it in your code.
Model availability
International (Singapore)
Model | Version | Supported languages | Supported sample rates | Scenarios | Supported audio formats | Price | Free quota (Note) |
fun-asr-realtime (currently equivalent to fun-asr-realtime-2025-11-07) | Stable | Chinese (Mandarin, Cantonese, Wu, Minnan, Hakka, Gan, Xiang, and Jin), English, and Japanese. This model also supports Mandarin accents from various Chinese regions, including Zhongyuan, Southwest, Ji-Lu, Jianghuai, Lan-Yin, Jiao-Liao, Northeast, Beijing, Hong Kong/Taiwan, Henan, Shaanxi, Hubei, Sichuan, Chongqing, Yunnan, Guizhou, Guangdong, Guangxi, Hebei, Tianjin, Shandong, Anhui, Nanjing, Jiangsu, Hangzhou, Gansu, and Ningxia. | 16 kHz | ApsaraVideo Live, conferences, call centers, and more. | PCM, WAV, MP3, Opus, Speex, AAC, and AMR | $0.00009/second | 36,000 seconds (10 hours) Valid for 90 days |
fun-asr-realtime-2025-11-07 | Snapshot |
China (Beijing)
Model | Version | Supported languages | Supported sample rates | Scenarios | Supported audio formats | Price |
fun-asr-realtime (equivalent to fun-asr-realtime-2025-11-07) | Stable | Chinese (Mandarin, Cantonese, Wu, Minnan, Hakka, Gan, Xiang, and Jin), English, and Japanese. This model also supports Mandarin accents from various Chinese regions and provinces, including Zhongyuan, Southwest, Ji-Lu, Jianghuai, Lan-Yin, Jiao-Liao, Northeast, Beijing, Hong Kong/Taiwan, Henan, Shaanxi, Hubei, Sichuan, Chongqing, Yunnan, Guizhou, Guangdong, Guangxi, Hebei, Tianjin, Shandong, Anhui, Nanjing, Jiangsu, Hangzhou, Gansu, and Ningxia. | 16 kHz | ApsaraVideo Live, conferences, call centers, and more | PCM, WAV, MP3, Opus, Speex, AAC, and AMR | $0.000047/second |
fun-asr-realtime-2025-11-07 This version is optimized for far-field Voice Activity Detection (VAD) and provides higher recognition accuracy than fun-asr-realtime-2025-09-15. | Snapshot |
fun-asr-realtime-2025-09-15 | Snapshot | Chinese (Mandarin), English |
Getting started
The Recognition class provides interfaces for non-streaming and bidirectional streaming calls. Select a call method based on your requirements:
Non-streaming call: Recognize a local file and return the complete result at once. This method is suitable for processing pre-recorded audio.
Bidirectional streaming call: Recognize an audio stream and output the results in real time. The audio stream can be from an external device, such as a microphone, or read from a local file. This method is suitable for scenarios that require immediate feedback.
Non-streaming call
Submit a real-time speech-to-text task for a single audio file. The call is synchronous and blocks until the transcription result is returned.
Instantiate the Recognition class, set the request parameters, and call the call method to perform recognition and retrieve the recognition result (RecognitionResult).
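The following is a minimal sketch of a non-streaming call. The file name asr_example.wav is a placeholder for a local 16 kHz audio file, and the API key is assumed to be available in the DASHSCOPE_API_KEY environment variable.
from dashscope.audio.asr import Recognition
# No callback: the call is synchronous and blocks until the file is recognized.
recognition = Recognition(
    model='fun-asr-realtime',  # a model from the tables above
    format='wav',              # must match the actual encoding of the file
    sample_rate=16000,         # fun-asr-realtime supports 16 kHz
    callback=None)
result = recognition.call('asr_example.wav')  # placeholder file path
print('request_id:', result.get_request_id())
print(result.get_sentence())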
Bidirectional streaming call
Submit a real-time speech-to-text task and receive stream results by implementing a callback.
Start stream speech recognition
Instantiate the Recognition class, attach the request parameters and the callback (RecognitionCallback), and call the start method to start stream speech recognition.
Stream audio
Repeatedly call the send_audio_frame method of the Recognition class to send the binary audio stream from a local file or a device, such as a microphone, to the server in segments. Each audio segment should have a duration of about 100 ms and a data size between 1 KB and 16 KB. While audio is being transmitted, the server returns the recognition results to the client in real time through the on_event method of the callback (RecognitionCallback).
End the process
Call the stop method of the Recognition class to stop speech recognition. This method blocks the current thread until the on_complete or on_error method of the RecognitionCallback is triggered.
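The following sketch streams audio from a microphone and follows the three steps above. It assumes the third-party pyaudio package is installed and a 16 kHz capture device is available; the callback only prints each result.
import pyaudio
from dashscope.audio.asr import Recognition, RecognitionCallback, RecognitionResult

class PrintCallback(RecognitionCallback):
    def on_event(self, result: RecognitionResult) -> None:
        print(result.get_sentence())

recognition = Recognition(model='fun-asr-realtime', format='pcm',
                          sample_rate=16000, callback=PrintCallback())
recognition.start()

mic = pyaudio.PyAudio()
stream = mic.open(format=pyaudio.paInt16, channels=1, rate=16000,
                  input=True, frames_per_buffer=1600)
try:
    for _ in range(100):                    # about 10 seconds of audio
        data = stream.read(1600)            # 1600 frames = 100 ms = 3.2 KB of PCM
        recognition.send_audio_frame(data)
finally:
    stream.stop_stream()
    stream.close()
    mic.terminate()
    recognition.stop()                      # blocks until on_complete or on_error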
Request parameters
Set the request parameters in the constructor (__init__ method) of the Recognition class.
Parameter | Type | Default | Required | Description |
model | str | - | Yes | The model used for real-time speech recognition. |
sample_rate | int | - | Yes | The sample rate of the audio to be recognized, in Hz. fun-asr-realtime supports a 16000 Hz sample rate. |
format | str | - | Yes | The format of the audio to be recognized. Supported audio formats: pcm, wav, mp3, opus, speex, aac, and amr. Important: opus and speex must be Ogg-encapsulated; wav must be PCM-encoded; amr supports only AMR-NB. |
vocabulary_id | str | - | No | The vocabulary ID. For more information, see Custom vocabulary. This parameter is not set by default. |
semantic_punctuation_enabled | bool | False | No | Specifies whether to enable semantic punctuation. True: semantic punctuation. False (default): VAD punctuation. Semantic punctuation provides higher accuracy and is suitable for conference transcription. VAD punctuation has lower latency and is suitable for interactive scenarios. Adjust this parameter to switch between the two sentence segmentation methods. |
max_sentence_silence | int | 1300 | No | The silence duration threshold for VAD punctuation, in ms. If the silence after a speech segment exceeds this threshold, the system determines that the sentence has ended. Valid values: 200 to 6000. Default value: 1300. This parameter takes effect only when semantic_punctuation_enabled is set to False (VAD punctuation). |
multi_threshold_mode_enabled | bool | False | No | Specifies whether to prevent VAD punctuation from producing excessively long segments. True: enabled. False (default): disabled. This parameter takes effect only when semantic_punctuation_enabled is set to False (VAD punctuation). |
punctuation_prediction_enabled | bool | True | No | Specifies whether to automatically add punctuation to the recognition results. True (default): punctuation is added. False: punctuation is not added. |
heartbeat | bool | False | No | Specifies whether to maintain a persistent connection with the server. True: the connection is kept alive as long as you keep sending silent audio. False (default): the connection may be closed by the server after a period of silence. To use this parameter, your SDK version must be 1.23.1 or later. |
callback | RecognitionCallback | - | No | The callback interface. For more information, see Callback interface (RecognitionCallback). |
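As a sketch, the optional parameters above are passed directly to the constructor. The vocabulary ID below is a placeholder, and the callback only prints results.
from dashscope.audio.asr import Recognition, RecognitionCallback, RecognitionResult

class PrintCallback(RecognitionCallback):
    def on_event(self, result: RecognitionResult) -> None:
        print(result.get_sentence())

recognition = Recognition(
    model='fun-asr-realtime',
    format='pcm',
    sample_rate=16000,
    vocabulary_id='vocab-xxxx',            # placeholder custom vocabulary ID
    semantic_punctuation_enabled=False,    # False: VAD punctuation (default)
    max_sentence_silence=800,              # effective only with VAD punctuation
    punctuation_prediction_enabled=True,
    heartbeat=True,                        # requires SDK 1.23.1 or later
    callback=PrintCallback())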
Key interfaces
Recognition class
Import the Recognition class using from dashscope.audio.asr import *.
Member method | Method signature | Description |
call | | A non-streaming call based on a local file. This method blocks the current thread until the entire audio file is read. The file must have read permissions. The recognition result is returned as a RecognitionResult object; see Recognition result (RecognitionResult). |
start | | Starts speech recognition. This is a streaming, real-time recognition method based on callbacks. It does not block the current thread. Use it with send_audio_frame and stop. |
send_audio_frame | | Pushes an audio frame. Each audio packet should have a duration of about 100 ms and a size between 1 KB and 16 KB. You can retrieve the recognition results through the on_event method of the RecognitionCallback interface. |
stop | | Stops speech recognition. This method blocks until the service has recognized all received audio and the task is complete. |
get_last_request_id | | Gets the request_id. Use this method after the constructor is called (the object is created). |
get_first_package_delay | | Gets the first-packet latency. This is the delay from sending the first audio packet to receiving the first recognition result packet. Use this method after the task is complete. |
get_last_package_delay | | Gets the last-packet latency. This is the time elapsed from sending the stop command to receiving the last recognition result packet. Use this method after the task is complete. |
get_response | | Gets the last message. Use this to get `task-failed` error messages. |
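As a usage sketch, assuming recognition is the instance from the streaming example above, the helper methods can be read after stop() returns:
print('request id        :', recognition.get_last_request_id())
print('first-packet delay:', recognition.get_first_package_delay())
print('last-packet delay :', recognition.get_last_package_delay())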
Callback interface (RecognitionCallback)
In a bidirectional streaming call, the server returns key information and data to the client through callbacks. You must implement a callback to process the information or data that is returned by the server.
Method | Parameter | Return value | Description |
on_open | None | None | This method is called immediately after a connection is established with the server. |
on_event | result (RecognitionResult) | None | This method is called when the service returns a recognition result. |
on_complete | None | None | This method is called after all recognition results have been returned. |
on_error | result (RecognitionResult) | None | This method is called when an error occurs. |
on_close | None | None | This method is called after the service has closed the connection. |
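The following is a sketch of a callback implementation that covers every method in the table; the print statements are placeholders for your own handling logic.
from dashscope.audio.asr import RecognitionCallback, RecognitionResult

class MyCallback(RecognitionCallback):
    def on_open(self) -> None:
        print('connection established')

    def on_event(self, result: RecognitionResult) -> None:
        print('result:', result.get_sentence())

    def on_complete(self) -> None:
        print('all results received')

    def on_error(self, result: RecognitionResult) -> None:
        print('error, request_id:', result.get_request_id())

    def on_close(self) -> None:
        print('connection closed')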
Response
Recognition result (RecognitionResult)
The RecognitionResult class represents the result of a stream call or a synchronous call.
Member method | Method signature | Description |
get_sentence | | Gets the currently recognized sentence and its timestamp information. In a callback, a single sentence is returned, so the return type of this method is `Dict[str, Any]`. For more information, see Sentence information (Sentence). |
get_request_id | | Gets the request_id of the request. |
is_sentence_end | | Checks whether the given sentence has ended. |
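For example, inside on_event these methods are typically combined as follows; this sketch assumes, as the table implies, that is_sentence_end accepts the dictionary returned by get_sentence.
from dashscope.audio.asr import RecognitionCallback, RecognitionResult

class SentenceCallback(RecognitionCallback):
    def on_event(self, result: RecognitionResult) -> None:
        sentence = result.get_sentence()          # Dict[str, Any] in a callback
        if 'text' in sentence:
            print(sentence['text'])
            if RecognitionResult.is_sentence_end(sentence):
                print('sentence end, request_id:', result.get_request_id())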
Sentence information (Sentence)
The members of the Sentence class are as follows:
Parameter | Type | Description |
begin_time | int | The start time of the sentence, in ms. |
end_time | int | The end time of the sentence, in ms. |
text | str | The recognized text. |
words | A list of Word timestamp information (Word) objects | Word timestamp information. |
Word timestamp information (Word)
The members of the Word class are as follows:
Parameter | Type | Description |
begin_time | int | The start time of the word, in ms. |
end_time | int | The end time of the word, in ms. |
text | str | The word. |
punctuation | str | The punctuation mark. |
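The following sketch walks the fields listed in the two tables above, where sentence is the dictionary returned by get_sentence in a callback.
def print_sentence(sentence: dict) -> None:
    # Sentence-level timestamps and text.
    print(f"[{sentence['begin_time']}-{sentence['end_time']} ms] {sentence['text']}")
    # Word-level timestamps, text, and punctuation.
    for word in sentence.get('words', []):
        print(f"  {word['begin_time']}-{word['end_time']} ms "
              f"'{word['text']}' punctuation='{word['punctuation']}'")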
Error codes
If an error occurs, see Error messages for troubleshooting.
If the problem persists, join the developer group to report the issue. Provide the Request ID to help us investigate the issue.
FAQ
Features
Q: How can I maintain a persistent connection with the server during long periods of silence?
Set the heartbeat request parameter to true and continuously send silent audio to the server.
Silent audio refers to content in an audio file or data stream that contains no sound signal. You can generate silent audio in several ways, for example with audio editing software such as Audacity or Adobe Audition, or with command-line tools such as FFmpeg.
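For 16 kHz, 16-bit, mono PCM, silence is simply a buffer of zero bytes. The following sketch assumes a recognition instance that was started with format='pcm' and sample_rate=16000 and with heartbeat enabled.
import time

SILENT_FRAME = bytes(3200)   # 100 ms of silence: 16000 samples/s * 0.1 s * 2 bytes

def keep_alive(recognition, seconds: int) -> None:
    # Send one 100 ms silent frame every 100 ms to keep the connection active.
    for _ in range(seconds * 10):
        recognition.send_audio_frame(SILENT_FRAME)
        time.sleep(0.1)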
Q: How do I convert an audio file to a supported format?
Use the FFmpeg tool. For more information, see the official FFmpeg website.
# Basic conversion command (universal template)
# -i: Specifies the input file path. Example: audio.wav
# -c:a: Specifies the audio encoder. Examples: aac, libmp3lame, pcm_s16le
# -b:a: Specifies the bit rate (controls audio quality). Examples: 192k, 320k
# -ar: Specifies the sample rate. Examples: 44100 (CD), 48000, 16000
# -ac: Specifies the number of sound channels. Examples: 1 (mono), 2 (stereo)
# -y: Overwrites the output file if it already exists (no value needed).
ffmpeg -i input_audio.ext -c:a encoder_name -b:a bit_rate -ar sample_rate -ac number_of_channels output.ext
# Example: WAV to MP3 (maintaining original quality)
ffmpeg -i input.wav -c:a libmp3lame -q:a 0 output.mp3
# Example: MP3 to WAV (standard 16-bit PCM format)
ffmpeg -i input.mp3 -c:a pcm_s16le -ar 44100 -ac 2 output.wav
# Example: M4A to AAC (extract/convert Apple audio)
ffmpeg -i input.m4a -c:a copy output.aac # Directly extract without re-encoding
ffmpeg -i input.m4a -c:a aac -b:a 256k output.aac # Re-encode to improve quality
# Example: FLAC lossless to Opus (high compression)
ffmpeg -i input.flac -c:a libopus -b:a 128k -vbr on output.opus
Q: How do I recognize a local file (recorded audio file)?
There are two ways to recognize a local file:
Pass the local file path directly. This method returns the complete recognition result after the entire file is processed and is not suitable for scenarios that require immediate feedback. Pass the file path to the call method of the Recognition class to recognize the audio file directly. For more information, see Non-streaming call.
Convert the local file into a binary stream for recognition. This method processes the file and returns recognition results as a stream, which makes it suitable for scenarios that require immediate feedback. Use the send_audio_frame method of the Recognition class to send the binary stream to the server for recognition (a sketch follows this list). For more information, see Bidirectional streaming call.
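The following sketch streams a local file in 100 ms chunks. The file name asr_example.pcm is a placeholder for raw 16 kHz, 16-bit, mono PCM audio.
import time
from dashscope.audio.asr import Recognition, RecognitionCallback, RecognitionResult

class FileCallback(RecognitionCallback):
    def on_event(self, result: RecognitionResult) -> None:
        print(result.get_sentence())

recognition = Recognition(model='fun-asr-realtime', format='pcm',
                          sample_rate=16000, callback=FileCallback())
recognition.start()

with open('asr_example.pcm', 'rb') as f:
    while chunk := f.read(3200):           # 3200 bytes = 100 ms of audio
        recognition.send_audio_frame(chunk)
        time.sleep(0.1)                    # roughly real-time pacing

recognition.stop()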
Troubleshooting
Q: Why is the speech not being recognized (no recognition result)?
Check whether the audio format (format) and sample rate (sampleRate/sample_rate) in the request parameters are set correctly and match the properties of the audio file. The following are common examples of errors:
The audio file is in the WAV format, but the format request parameter is incorrectly set to `mp3`.
The audio sample rate is 3600 Hz, but the sampleRate/sample_rate request parameter is incorrectly set to 48000.
Use the ffprobe tool to obtain information about the audio container, encoding, sample rate, and channels:
ffprobe -v error -show_entries format=format_name -show_entries stream=codec_name,sample_rate,channels -of default=noprint_wrappers=1 input.xxx
If the previous checks pass, you can try a custom vocabulary to improve the recognition of specific words.
ffprobe -v error -show_entries format=format_name -show_entries stream=codec_name,sample_rate,channels -of default=noprint_wrappers=1 input.xxxIf the previous checks pass, you can try customizing vocabulary to improve the recognition of specific words.