This topic describes the parameters and interfaces of the Paraformer real-time speech recognition Python SDK.
This document applies only to the China (Beijing) region. To use the models, you must use an API key from the China (Beijing) region.
User guide: For model descriptions and selection guidance, see Real-time speech recognition - Fun-ASR/Gummy/Paraformer.
Prerequisites
You have activated Model Studio and created an API key. To prevent security risks from code leaks, do not hard-code the API key in your code. Instead, export it as an environment variable.
Note: Use a temporary authentication token to grant temporary access to third-party applications or users, or to strictly control high-risk operations like accessing or deleting sensitive data.
A temporary authentication token is more secure than a long-lived API key because it expires in 60 seconds. This short lifespan makes it ideal for temporary call scenarios and significantly reduces the risk of a compromised API key.
To use this method, replace the API key in your code with the temporary authentication token.
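The DashScope Python SDK reads the `DASHSCOPE_API_KEY` environment variable automatically. The sketch below sets the key explicitly from the environment, assuming you exported the variable as described above.

```python
import os

import dashscope

# Read the key from the environment instead of hard-coding it. The SDK also
# picks up DASHSCOPE_API_KEY on its own; setting it here makes that explicit.
dashscope.api_key = os.environ["DASHSCOPE_API_KEY"]
```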
Model list
| paraformer-realtime-v2 | paraformer-realtime-8k-v2 |
Scenarios | Live streaming and meetings | 8 kHz audio recognition in scenarios such as telephone customer service and voicemail |
Sample rate | Any | 8 kHz |
Languages | Chinese (including Mandarin and various dialects), English, Japanese, Korean, German, French, and Russian. Supported Chinese dialects: Shanghainese, Wu, Minnan, Northeastern, Gansu, Guizhou, Henan, Hubei, Hunan, Jiangxi, Ningxia, Shanxi, Shaanxi, Shandong, Sichuan, Tianjin, Yunnan, and Cantonese | Chinese |
Punctuation prediction | ✅ Supported by default. No configuration is required. | ✅ Supported by default. No configuration is required. |
Inverse Text Normalization (ITN) | ✅ Supported by default. No configuration is required. | ✅ Supported by default. No configuration is required. |
Specify recognition language | ✅ Specify the language using the `language_hints` parameter. | ❌ |
Emotion recognition | ❌ | ✅ See the emo_tag and emo_confidence fields in Sentence. |
Getting started
The Recognition class provides methods for non-streaming and bidirectional streaming calls. You can select the appropriate method based on your requirements:
Non-streaming call: Recognizes a local file and returns the complete result at once. This is suitable for processing pre-recorded audio.
Bidirectional streaming call: Recognizes an audio stream and outputs the results in real time. The audio stream can come from an external device, such as a microphone, or be read from a local file. This is suitable for scenarios that require immediate feedback.
Non-streaming call
This method submits a real-time speech-to-text task for a local file and blocks until the complete transcription result is returned.
Instantiate the Recognition class, set the request parameters, and call the call method to perform recognition and obtain the RecognitionResult.
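A minimal sketch of a non-streaming call follows. The file name asr_example.wav, the format, and the sample rate are assumptions; adjust them to match your audio.

```python
from dashscope.audio.asr import Recognition

# Assumed: a local 16 kHz, 16-bit mono WAV file named asr_example.wav.
recognition = Recognition(
    model='paraformer-realtime-v2',
    format='wav',
    sample_rate=16000,
    callback=None,  # no callback is needed for a non-streaming call
)
result = recognition.call('asr_example.wav')
print(result.get_sentence())  # the complete recognition result with timestamps
```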
Bidirectional streaming call
This method submits a real-time speech-to-text task and returns real-time recognition results through a callback interface.
Start streaming speech recognition
Instantiate the Recognition class, set the request parameters and the callback interface (RecognitionCallback), and then call the `start` method.
Stream audio
Repeatedly call the `send_audio_frame` method of the Recognition class to send the binary audio stream from a local file or a device (such as a microphone) to the server in segments. As audio data is sent, the server returns recognition results to the client in real time through the `on_event` method of the callback interface (RecognitionCallback). We recommend that each audio segment represents about 100 milliseconds of audio and is between 1 KB and 16 KB in size.
End processing
Call the `stop` method of the Recognition class to stop speech recognition. This method blocks the current thread until the `on_complete` or `on_error` callback of the callback interface (RecognitionCallback) is triggered.
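Below is a minimal sketch of these three steps, assuming a raw 16 kHz, 16-bit mono PCM file named asr_example.pcm; in a real application the audio could equally come from a microphone.

```python
from dashscope.audio.asr import (Recognition, RecognitionCallback,
                                 RecognitionResult)

class Callback(RecognitionCallback):
    def on_open(self) -> None:
        print('connection opened')

    def on_event(self, result: RecognitionResult) -> None:
        print('partial result:', result.get_sentence())

    def on_complete(self) -> None:
        print('recognition complete')

    def on_error(self, result: RecognitionResult) -> None:
        print('error:', result.message)

    def on_close(self) -> None:
        print('connection closed')

recognition = Recognition(
    model='paraformer-realtime-v2',
    format='pcm',
    sample_rate=16000,
    callback=Callback(),
)
recognition.start()

# 3200 bytes ≈ 100 ms of 16 kHz, 16-bit mono PCM, which keeps each frame
# inside the recommended 1 KB-16 KB window.
with open('asr_example.pcm', 'rb') as f:
    while chunk := f.read(3200):
        recognition.send_audio_frame(chunk)

recognition.stop()
```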
Concurrent calls
In Python, because of the Global Interpreter Lock (GIL), only one thread can execute Python code at a time, although some performance-oriented libraries may remove this limitation. To better utilize the computing resources of a multi-core machine, we recommend that you use multiprocessing or concurrent.futures.ProcessPoolExecutor. Multi-threading can significantly increase SDK call latency under high concurrency.
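The sketch below fans recognition of several files out across worker processes with concurrent.futures.ProcessPoolExecutor. The model, format, sample rate, and file names are assumptions.

```python
from concurrent.futures import ProcessPoolExecutor

from dashscope.audio.asr import Recognition

def transcribe(path: str) -> str:
    # Each worker process builds its own Recognition instance.
    recognition = Recognition(model='paraformer-realtime-v2',
                              format='wav',
                              sample_rate=16000,
                              callback=None)
    return str(recognition.call(path).get_sentence())

if __name__ == '__main__':
    files = ['a.wav', 'b.wav', 'c.wav']  # assumed local audio files
    with ProcessPoolExecutor(max_workers=3) as pool:
        for path, text in zip(files, pool.map(transcribe, files)):
            print(path, '->', text)
```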
Request parameters
Request parameters are set in the constructor (__init__) of the Recognition class.
Parameter | Type | Default | Required | Description |
model | str | - | Yes | The model used for real-time speech recognition. For more information, see Model list. |
sample_rate | int | - | Yes | The sample rate of the audio, in Hz. This parameter varies by model: paraformer-realtime-v2 supports any sample rate, and paraformer-realtime-8k-v2 supports only 8000 Hz. |
format | str | - | Yes | The format of the audio. Supported audio formats: pcm, wav, mp3, opus, speex, aac, and amr. Important: opus and speex must use Ogg encapsulation, wav must be PCM-encoded, and only the AMR-NB variant of amr is supported. |
disfluency_removal_enabled | bool | False | No | Specifies whether to filter out disfluent words: True: filter them out. False (default): keep them. |
language_hints | list[str] | ["zh", "en"] | No | Specifies the language codes for recognition. If this parameter is unset, the model automatically detects the language. Supported language codes: zh (Chinese), en (English), ja (Japanese), yue (Cantonese), ko (Korean), de (German), fr (French), and ru (Russian). This parameter applies only to multilingual models. For more information, see Model list. |
semantic_punctuation_enabled | bool | False | No | Specifies whether to enable semantic sentence segmentation: True: sentences are segmented semantically. False (default): sentences are segmented based on Voice Activity Detection (VAD). Semantic sentence segmentation offers higher accuracy and is ideal for meeting transcription scenarios. VAD-based segmentation has lower latency and is better suited for interactive scenarios. By adjusting the semantic_punctuation_enabled parameter, you can switch between the two segmentation modes. This parameter applies only to v2 and later models. |
max_sentence_silence | int | 800 | No | Specifies the silence duration threshold for VAD-based segmentation, in milliseconds (ms). If a pause in speech exceeds this threshold, the system considers the sentence complete. Valid range: 200 ms to 6000 ms. Default: 800 ms. This parameter applies only to v2 and later models when semantic_punctuation_enabled is False. |
multi_threshold_mode_enabled | bool | False | No | If True, prevents VAD-based segmentation from producing overly long sentences. This parameter applies only to v2 and later models when semantic_punctuation_enabled is False. |
punctuation_prediction_enabled | bool | True | No | Specifies whether to automatically add punctuation to the recognition results: True (default): add punctuation. False: do not add punctuation. This parameter applies only to v2 and later models. |
heartbeat | bool | False | No | Specifies whether to maintain a persistent connection with the server: True: the connection stays open during long silences, provided that you continuously send silent audio. False (default): the connection may time out during long silences. This parameter applies only to v2 and later models. When using this field, the SDK version must be 1.23.1 or later. |
inverse_text_normalization_enabled | bool | True | No | Specifies whether to enable Inverse Text Normalization (ITN). When enabled, Chinese numerals are converted to Arabic numerals. This parameter applies only to v2 and later models. |
callback | RecognitionCallback | - | No | The callback interface (RecognitionCallback) used to receive results during bidirectional streaming calls. |
Key interfaces
Recognition class
The Recognition class is imported using `from dashscope.audio.asr import *`.
Member method | Method signature | Description |
call | | A non-streaming call that uses a local file. This method blocks the current thread until the entire audio file is read. The file must have read permissions. The recognition result is returned as a RecognitionResult object. |
start | | Starts speech recognition. This is a callback-based streaming real-time recognition method that does not block the current thread. It must be used with send_audio_frame and stop. |
send_audio_frame | | Pushes a segment of the audio stream. Each segment should be neither too large nor too small. We recommend that each audio packet has a duration of about 100 ms and a size between 1 KB and 16 KB. You can obtain the recognition results through the on_event method of the callback interface (RecognitionCallback). |
stop | | Stops speech recognition. This method blocks until the service has recognized all received audio and the task is complete. |
get_last_request_id | | Gets the request_id. This method can be called any time after the object is constructed. |
get_first_package_delay | | Gets the first packet delay, which is the latency from sending the first audio packet to receiving the first recognition result packet. Use this after the task is completed. |
get_last_package_delay | | Gets the last packet delay, which is the latency from sending the last audio packet to receiving the final recognition result packet. Use this after the task is completed. |
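For example, after `stop` returns you might log these metrics (a sketch; `recognition` is the instance from the streaming example above):

```python
recognition.stop()  # blocks until the task is complete
print('request id:', recognition.get_last_request_id())
print('first package delay (ms):', recognition.get_first_package_delay())
print('last package delay (ms):', recognition.get_last_package_delay())
```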
Callback interface (RecognitionCallback)
During a bidirectional streaming call, the server uses callbacks to return key process information and data to the client. You must implement a callback method to process the returned information and data.
Method | Parameter | Return value | Description |
on_open | None | None | This method is called immediately after a connection is established with the server. |
on_event | result: RecognitionResult | None | This method is called when the service sends a response, that is, each time a recognition result is returned. |
on_complete | None | None | This method is called after all recognition results have been returned. |
on_error | result: RecognitionResult | None | This method is called when an exception occurs. |
on_close | None | None | This method is called after the service has closed the connection. |
Response results
Recognition result (RecognitionResult)
RecognitionResult represents the recognition result of either a single real-time recognition in a bidirectional streaming call or a non-streaming call.
Member method | Method signature | Description |
get_sentence | | Gets the currently recognized sentence and timestamp information. In a callback, a single sentence is returned, so this method returns a Dict[str, Any] type. For more information, see Sentence. |
get_request_id | | Gets the request_id of the request. |
is_sentence_end | | Determines whether the given sentence has ended. |
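The sketch below shows how these methods combine when handling a result inside a callback; the field names follow the Sentence table below.

```python
from dashscope.audio.asr import RecognitionResult

def handle(result: RecognitionResult) -> None:
    sentence = result.get_sentence()  # a Dict[str, Any] inside a callback
    if 'text' in sentence:
        print('text so far:', sentence['text'])
        if RecognitionResult.is_sentence_end(sentence):
            print('sentence complete, ends at', sentence.get('end_time'), 'ms')
```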
Sentence information (Sentence)
The members of the Sentence class are as follows:
Parameter | Type | Description |
begin_time | int | The start time of the sentence, in ms. |
end_time | int | The end time of the sentence, in ms. |
text | str | The recognized text. |
words | A list of Word timestamp information (Word) | Word timestamp information. |
emo_tag | str | The emotion of the current sentence: positive, negative, or neutral. Emotion recognition has the following constraints: it is supported only by paraformer-realtime-8k-v2, it takes effect only when semantic_punctuation_enabled is False, and the emotion fields are returned only for complete sentences (is_sentence_end is True). |
emo_confidence | float | The confidence level of the recognized emotion for the current sentence. The value ranges from 0.0 to 1.0. A larger value indicates a higher confidence level. Emotion recognition is subject to the same constraints as described for emo_tag. |
Word timestamp information (Word)
The members of the Word class are as follows:
Parameter | Type | Description |
begin_time | int | The start time of the word, in ms. |
end_time | int | The end time of the word, in ms. |
text | str | The word. |
punctuation | str | The punctuation. |
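As a sketch, assuming `result` is a RecognitionResult from a callback, word-level timestamps can be read like this:

```python
sentence = result.get_sentence()
for word in sentence.get('words', []):
    # begin_time and end_time are in milliseconds
    print(f"{word['begin_time']}-{word['end_time']} ms: "
          f"{word['text']}{word['punctuation']}")
```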
Error codes
If an error occurs, see Error Messages for troubleshooting.
If the issue persists, join the Developer Group and provide the Request ID for further investigation.
More examples
For more examples, see GitHub.
FAQ
Features
Q: How do I maintain a persistent connection with the server during long silences?
Set the request parameter heartbeat to true, and continuously send silent audio to the server.
Silent audio is an audio file or data stream with no sound. You can generate silent audio using audio editing software like Audacity or Adobe Audition, or with command-line tools like FFmpeg.
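For PCM input, silence is simply zero-valued samples, so you can also synthesize it in code. A sketch, assuming `recognition` is a started Recognition instance created with heartbeat=True and a 16 kHz, 16-bit mono PCM stream:

```python
import time

silence = bytes(3200)  # 3200 zero bytes ≈ 100 ms of 16 kHz, 16-bit mono PCM

for _ in range(50):  # keep the connection alive for about 5 seconds
    recognition.send_audio_frame(silence)
    time.sleep(0.1)
```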
Q: How do I convert an audio file to a supported format?
You can use the FFmpeg tool. For more information, see the official FFmpeg website.
# Basic conversion command (universal template)
# -i: Input file path. Example: audio.wav
# -c:a: Audio encoder. Examples: aac, libmp3lame, pcm_s16le
# -b:a: Bit rate (controls audio quality). Examples: 192k, 320k
# -ar: Sample rate. Examples: 44100 (CD), 48000, 16000
# -ac: Audio channels. Examples: 1 (mono), 2 (stereo)
# -y: Overwrite existing file (no value required).
ffmpeg -i input_audio.ext -c:a encoder_name -b:a bit_rate -ar sample_rate -ac num_channels output.ext
# Example: Convert WAV to MP3 (maintaining original quality)
ffmpeg -i input.wav -c:a libmp3lame -q:a 0 output.mp3
# Example: Convert MP3 to WAV (16-bit PCM standard format)
ffmpeg -i input.mp3 -c:a pcm_s16le -ar 44100 -ac 2 output.wav
# Example: Convert M4A to AAC (extract/convert Apple audio)
ffmpeg -i input.m4a -c:a copy output.aac # Directly extract without re-encoding
ffmpeg -i input.m4a -c:a aac -b:a 256k output.aac # Re-encode to a 256k bit rate
# Example: Convert FLAC lossless to Opus (high compression)
ffmpeg -i input.flac -c:a libopus -b:a 128k -vbr on output.opus
Q: Can I view the time range for each sentence?
Yes. The speech recognition result includes the start and end timestamp for each sentence, which defines its time range.
Q: How do I recognize a local file (recorded audio file)?
There are two ways to recognize a local file:
Directly pass the local file path: This method returns the complete recognition result after the file is fully processed. It is not suitable for scenarios that require immediate feedback. Pass the file path to the `call` method of the Recognition class to directly recognize the audio file. For more information, see non-streaming call.
Convert the local file into a binary stream for recognition: This method returns recognition results in a stream while the file is being processed. It is suitable for scenarios that require immediate feedback. Use the `send_audio_frame` method of the Recognition class to send a binary stream to the server for recognition. For more information, see bidirectional streaming call.
Troubleshooting
Q: Why am I not getting any recognition results?
Check that the audio format (`format`) and sample rate (`sample_rate`) settings are correct and meet the parameter constraints. The following are common error examples:
The audio file has a .wav extension but is actually in MP3 format, and the request parameter `format` is set to wav based on the extension (an incorrect parameter setting).
The audio sample rate is 3600 Hz, but the request parameter `sample_rate` is set to 48000 (an incorrect parameter setting).
You can use the ffprobe tool to get information about an audio file's properties, such as its container, encoding, sample rate, and channels:
ffprobe -v error -show_entries format=format_name -show_entries stream=codec_name,sample_rate,channels -of default=noprint_wrappers=1 input.xxx
When you use the paraformer-realtime-v2 model, check that the language specified in `language_hints` matches the actual language of the audio. For example: the audio is actually in Chinese, but `language_hints` is set to en (English).