This topic describes the parameters and interfaces of the Fun-ASR real-time speech recognition Python SDK.
User guide: For model introductions and selection recommendations, see Real-time speech recognition - Fun-ASR/Paraformer.
Prerequisites
You have activated the service and obtained an API key. Configure your API key as an environment variable instead of hard coding it in your code to prevent security risks from code leaks.
Model availability
International
In the international deployment mode, endpoints and data storage are in the Singapore region. Model inference compute resources are dynamically scheduled globally, excluding Chinese Mainland.
Model | Version | Unit price | Free quota (Note) |
fun-asr-realtime (currently fun-asr-realtime-2025-11-07) | Stable | $0.00009/second | 36,000 seconds (10 hours), valid for 90 days |
fun-asr-realtime-2025-11-07 | Snapshot | $0.00009/second | 36,000 seconds (10 hours), valid for 90 days |
Languages supported: Mandarin, Cantonese, Wu, Minnan, Hakka, Gan, Xiang, and Jin. Also supports Mandarin accents from Zhongyuan, Southwest, Jilu, Jianghuai, Lanyin, Jiaoliao, Northeast, Beijing, and Hong Kong–Taiwan regions—including Henan, Shaanxi, Hubei, Sichuan, Chongqing, Yunnan, Guizhou, Guangdong, Guangxi, Hebei, Tianjin, Shandong, Anhui, Nanjing, Jiangsu, Hangzhou, Gansu, and Ningxia. Also supports English and Japanese.
Sample rates supported: 16 kHz
Audio formats supported: pcm, wav, mp3, opus, speex, aac, amr
Chinese Mainland
In the Chinese Mainland deployment mode, endpoints and data storage are in the Beijing region. Model inference compute resources are limited to Chinese Mainland.
Model | Version | Unit price | Free quota (Note) |
fun-asr-realtime (currently fun-asr-realtime-2025-11-07) | Stable | $0.000047/second | No free quota |
fun-asr-realtime-2026-02-28 | Snapshot | $0.000047/second | No free quota |
fun-asr-realtime-2025-11-07 | Snapshot | $0.000047/second | No free quota |
fun-asr-realtime-2025-09-15 | Snapshot | $0.000047/second | No free quota |
fun-asr-flash-8k-realtime (currently fun-asr-flash-8k-realtime-2026-01-28) | Stable | $0.000032/second | No free quota |
fun-asr-flash-8k-realtime-2026-01-28 | Snapshot | $0.000032/second | No free quota |
Languages supported:
fun-asr-realtime, fun-asr-realtime-2026-02-28, fun-asr-realtime-2025-11-07: Chinese (Mandarin, Cantonese, Wu, Minnan, Hakka, Gan, Xiang, and Jin. Also supports Mandarin accents from Zhongyuan, Southwest, Jilu, Jianghuai, Lanyin, Jiaoliao, Northeast, Beijing, and Hong Kong–Taiwan regions—including Henan, Shaanxi, Hubei, Sichuan, Chongqing, Yunnan, Guizhou, Guangdong, Guangxi, Hebei, Tianjin, Shandong, Anhui, Nanjing, Jiangsu, Hangzhou, Gansu, and Ningxia), English, and Japanese.
fun-asr-realtime-2025-09-15: Chinese (Mandarin), English
Sample rates supported:
fun-asr-flash-8k-realtime and fun-asr-flash-8k-realtime-2026-01-28: 8 kHz
All other models: 16 kHz
Audio formats supported: pcm, wav, mp3, opus, speex, aac, amr
Getting started
The Recognition class provides interfaces for non-streaming and bidirectional streaming calls. Select a method based on your requirements:
Non-streaming call: Recognizes a local file and returns the complete result at once. Suitable for processing pre-recorded audio.
Bidirectional streaming call: Recognizes an audio stream and returns results in real time. The audio stream can come from an external device such as a microphone, or from a local file. Suitable for scenarios that require immediate feedback.
Non-streaming call
Submit a real-time speech-to-text task for a single audio file. The call is synchronous and blocks until the transcription result is returned.
Instantiate the Recognition class, set the request parameters, and call the call method to perform recognition and retrieve the recognition result (RecognitionResult).
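For reference, a minimal sketch of a non-streaming call. It assumes a local 16 kHz WAV file named sample.wav (a hypothetical path) and that the DASHSCOPE_API_KEY environment variable is set:
# A minimal sketch of a non-streaming call.
from http import HTTPStatus
from dashscope.audio.asr import Recognition

recognition = Recognition(
    model='fun-asr-realtime',
    format='wav',
    sample_rate=16000,
    callback=None,  # no callback is needed for a non-streaming call
)
result = recognition.call('sample.wav')  # blocks until the file is processed
if result.status_code == HTTPStatus.OK:
    print(result.get_sentence())  # full recognition result with timestamps
else:
    print('Error:', result.message)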
Bidirectional streaming call
Submit a real-time speech-to-text task and receive streaming results through a callback.
Start recognition
Instantiate the Recognition class, configure the request parameters and the callback (RecognitionCallback), and call the start method to begin streaming speech recognition.
Send audio
Call the send_audio_frame method of the Recognition class repeatedly to send binary audio data, from a local file or a device such as a microphone, to the server in segments. Each audio segment should be about 100 ms long and between 1 KB and 16 KB in size. While audio is being sent, the server returns recognition results to the client in real time through the on_event method of the callback (RecognitionCallback).
Stop recognition
Call the stop method of the Recognition class to stop speech recognition. This method blocks the current thread until the on_complete or on_error method of the RecognitionCallback is triggered.
The three steps are combined in the sketch below.
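A minimal streaming sketch, assuming a local raw PCM file named sample.pcm (16 kHz, 16-bit, mono; a hypothetical path standing in for a microphone stream) and that the DASHSCOPE_API_KEY environment variable is set:
import time
from dashscope.audio.asr import Recognition, RecognitionCallback, RecognitionResult

class Callback(RecognitionCallback):
    def on_open(self) -> None:
        print('connection opened')

    def on_event(self, result: RecognitionResult) -> None:
        print('intermediate result:', result.get_sentence())

    def on_complete(self) -> None:
        print('recognition complete')

    def on_error(self, result: RecognitionResult) -> None:
        print('error:', result.message)

    def on_close(self) -> None:
        print('connection closed')

recognition = Recognition(
    model='fun-asr-realtime',
    format='pcm',
    sample_rate=16000,
    callback=Callback(),
)
recognition.start()
with open('sample.pcm', 'rb') as f:
    # 3200 bytes = 100 ms of 16 kHz, 16-bit, mono PCM
    while chunk := f.read(3200):
        recognition.send_audio_frame(chunk)
        time.sleep(0.1)  # pace the stream roughly in real time
recognition.stop()  # blocks until on_complete or on_error fires
print('request id:', recognition.get_last_request_id())
print('first-packet delay (ms):', recognition.get_first_package_delay())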
Request parameters
Set the request parameters in the constructor (__init__) of the Recognition class.
Parameter | Type | Default | Required | Description |
model | str | - | Yes | The model for real-time speech recognition. |
sample_rate | int | - | Yes | The sample rate of the audio, in Hz. fun-asr-realtime models support 16000 Hz; fun-asr-flash-8k-realtime models support 8000 Hz. |
format | str | - | Yes | The audio format. Supported formats: pcm, wav, mp3, opus, speex, aac, amr. Important: opus and speex must be Ogg-encapsulated; wav must be PCM-encoded; amr supports AMR-NB only. |
semantic_punctuation_enabled | bool | False | No | Specifies whether to enable semantic punctuation. true: sentences are segmented semantically. false (default): sentences are segmented by VAD (voice activity detection). Semantic punctuation provides higher accuracy and is suitable for conference transcription; VAD punctuation has lower latency and is suitable for interactive scenarios. |
max_sentence_silence | int | 1300 | No | The silence duration threshold for VAD-based sentence segmentation, in ms. If the silence after a speech segment exceeds this threshold, the system determines that the sentence has ended. Valid range: 200 to 6000. This parameter takes effect only when semantic_punctuation_enabled is false. |
multi_threshold_mode_enabled | bool | False | No | When set to true, prevents VAD from producing excessively long segments. This parameter takes effect only when semantic_punctuation_enabled is false. |
punctuation_prediction_enabled | bool | True | No | Specifies whether to automatically add punctuation to the recognition results. true (default): punctuation is added. false: punctuation is not added. |
heartbeat | bool | False | No | Controls whether to maintain a persistent connection with the server during long periods of silence. true: the connection is kept open while the client continues to send silent audio. false (default): the connection is closed after a timeout even if silent audio is sent. Requires SDK version 1.23.1 or later. For how to generate silent audio, see the FAQ. |
language_hints | list[str] | - | No | The language codes for recognition. If the language is unknown in advance, leave this parameter unset and the model identifies it automatically. Only the first value in the array is read; all other values are ignored. Supported language codes by model: fun-asr-realtime, fun-asr-realtime-2026-02-28, and fun-asr-realtime-2025-11-07: zh (Chinese), en (English), ja (Japanese); fun-asr-realtime-2025-09-15: zh (Chinese), en (English). |
speech_noise_threshold | float | - | No | Adjusts the speech/noise detection threshold to control VAD sensitivity. Range: [-1.0, 1.0]. Values closer to -1.0 make audio more likely to be judged as speech (higher recall, but more background noise may be transcribed). Values closer to 1.0 make audio more likely to be judged as noise (fewer false detections, but quiet speech may be missed). Important: this is an advanced parameter; adjustments can significantly affect recognition quality. |
callback | RecognitionCallback | - | No | The callback (RecognitionCallback) that receives results in a bidirectional streaming call. Not required for non-streaming calls. |
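A sketch of setting optional parameters at construction time. The values shown are illustrative, not recommendations; the parameter names are those documented in the table above:
from dashscope.audio.asr import Recognition

recognition = Recognition(
    model='fun-asr-realtime',
    format='wav',
    sample_rate=16000,
    callback=None,                        # non-streaming call, no callback
    punctuation_prediction_enabled=True,  # add punctuation automatically
    max_sentence_silence=800,             # end a sentence after 800 ms of silence
    language_hints=['zh'],                # only the first entry is read
)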
Key interfaces
Recognition class
Import the Recognition class using from dashscope.audio.asr import *.
Member method | Description |
call | Performs non-streaming recognition on a local file. Blocks the current thread until the entire file is processed. The file must be readable. The recognition result is returned as a RecognitionResult object. |
start | Starts streaming speech recognition. A callback-based method for real-time recognition that does not block the current thread. Use together with send_audio_frame and stop. |
send_audio_frame | Sends an audio frame. Each packet should contain about 100 ms of audio and be between 1 KB and 16 KB in size. Recognition results are returned through the on_event method of the RecognitionCallback interface. |
stop | Stops speech recognition. Blocks the current thread until the service finishes processing all audio that has been sent. |
get_last_request_id | Returns the request ID. Available after the Recognition object is created. |
get_first_package_delay | Returns the first-packet latency: the time from sending the first audio packet to receiving the first recognition result. Available after the task completes. |
get_last_package_delay | Returns the last-packet latency: the time from sending the stop command to receiving the last recognition result. Available after the task completes. |
get_response | Returns the last message returned by the service. Use this to inspect details of the final response, such as error information. |
Callback interface (RecognitionCallback)
In a bidirectional streaming call, the server returns information and data to the client through callbacks. Implement a callback to process the server responses.
Method | Parameter | Return value | Description |
on_open | None | None | Called when the connection to the server is established. |
on_event | result: RecognitionResult | None | Called when the service returns a recognition result. |
on_complete | None | None | Called after all recognition results have been returned. |
on_error | result: RecognitionResult | None | Called when an error occurs. |
on_close | None | None | Called when the connection is closed. |
Response
Recognition result (RecognitionResult)
The RecognitionResult class represents the result of a streaming call or a synchronous call.
Member method | Description |
get_sentence | Returns the current recognized sentence with timestamp information. In a callback, returns a single sentence as a dict; in a synchronous call, returns all recognized sentences. For more information, see Sentence information (Sentence). |
get_request_id | Returns the request ID. |
is_sentence_end | Checks whether the given sentence has ended. |
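For example, inside on_event you can use is_sentence_end to distinguish intermediate updates from completed sentences. A sketch, assuming the sentence dict carries a text field as described in the Sentence table below:
from dashscope.audio.asr import RecognitionCallback, RecognitionResult

class Callback(RecognitionCallback):
    def on_event(self, result: RecognitionResult) -> None:
        sentence = result.get_sentence()
        if RecognitionResult.is_sentence_end(sentence):
            print('final sentence:', sentence['text'])
        else:
            print('partial update:', sentence['text'])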
Sentence information (Sentence)
Members of the Sentence class:
Parameter | Type | Description |
begin_time | int | The start time of the sentence, in ms. |
end_time | int | The end time of the sentence, in ms. |
text | str | The recognized text. |
words | list[Word] | Word timestamp information. For more information, see Word timestamp information (Word). |
Word timestamp information (Word)
Members of the Word class:
Parameter | Type | Description |
begin_time | int | The start time of the word, in ms. |
end_time | int | The end time of the word, in ms. |
text | str | The word. |
punctuation | str | The punctuation mark. |
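To read word-level timestamps, iterate the words of a completed sentence. A sketch, assuming the dict layout described in the Sentence and Word tables above:
def print_word_timings(sentence: dict) -> None:
    # "sentence" is a value returned by RecognitionResult.get_sentence()
    for word in sentence.get('words', []):
        print(f"{word['begin_time']}-{word['end_time']} ms: "
              f"{word['text']}{word['punctuation']}")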
Error codes
If an error occurs, see Error messages for troubleshooting.
If the problem persists, join the developer group to report the issue. Provide the Request ID to help us investigate the issue.
FAQ
Features
Q: How can I maintain a persistent connection with the server during long periods of silence?
Set the heartbeat request parameter to true and continuously send silent audio to the server.
Silent audio is audio data with no sound signal. You can generate it with audio editing software (such as Audacity or Adobe Audition) or command-line tools (such as FFmpeg).
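For pcm input, silent audio can also be generated directly in code as all-zero samples instead of with an editing tool. A minimal sketch, assuming heartbeat is set to true, 16 kHz 16-bit mono pcm, and an already started Recognition instance:
import time
from dashscope.audio.asr import Recognition

def keep_alive(recognition: Recognition, seconds: float) -> None:
    # 100 ms of silence: 16000 Hz * 2 bytes per sample * 0.1 s = 3200 bytes
    silent_frame = b'\x00' * 3200
    for _ in range(int(seconds / 0.1)):
        recognition.send_audio_frame(silent_frame)
        time.sleep(0.1)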
Q: How do I convert an audio file to a supported format?
Use the FFmpeg tool. For more information, see the official FFmpeg website.
# Basic conversion command (universal template)
# -i: Specifies the input file path. Example: audio.wav
# -c:a: Specifies the audio encoder. Examples: aac, libmp3lame, pcm_s16le
# -b:a: Specifies the bit rate (controls audio quality). Examples: 192k, 320k
# -ar: Specifies the sample rate. Examples: 44100 (CD), 48000, 16000
# -ac: Specifies the number of sound channels. Examples: 1 (mono), 2 (stereo)
# -y: Overwrites the output file if it already exists (no value needed).
ffmpeg -i input_audio.ext -c:a encoder_name -b:a bit_rate -ar sample_rate -ac number_of_channels output.ext
# Example: WAV to MP3 (maintaining original quality)
ffmpeg -i input.wav -c:a libmp3lame -q:a 0 output.mp3
# Example: MP3 to WAV (standard 16-bit PCM format)
ffmpeg -i input.mp3 -c:a pcm_s16le -ar 44100 -ac 2 output.wav
# Example: M4A to AAC (extract/convert Apple audio)
ffmpeg -i input.m4a -c:a copy output.aac # Directly extract without re-encoding
ffmpeg -i input.m4a -c:a aac -b:a 256k output.aac # Re-encode to improve quality
# Example: FLAC lossless to Opus (high compression)
ffmpeg -i input.flac -c:a libopus -b:a 128k -vbr on output.opus

Q: How do I recognize a local file (recorded audio file)?
There are two ways to recognize a local file:
Pass the local file path directly. This returns the complete recognition result after the entire file is processed and is not suitable for scenarios that require immediate feedback. Pass the file path to the call method of the Recognition class to recognize the audio file directly. For more information, see Non-streaming call.
Convert the local file into a binary stream. This processes the file as a stream and returns recognition results in real time, which is suitable for scenarios that require immediate feedback. Use the send_audio_frame method of the Recognition class to send the binary stream to the server in segments. For more information, see Bidirectional streaming call.
Troubleshooting
Q: Why is the speech not being recognized (no recognition result)?
Verify that the audio format (format) and sample rate (sample_rate) match the actual properties of the audio file. Common errors include:
The audio file is in WAV format, but the format request parameter is incorrectly set to mp3.
The audio sample rate is 16000 Hz, but the sample_rate request parameter is incorrectly set to 48000.
Use the ffprobe tool to obtain information about the audio container, encoding, sample rate, and channels:
ffprobe -v error -show_entries format=format_name -show_entries stream=codec_name,sample_rate,channels -of default=noprint_wrappers=1 input.xxx