
Alibaba Cloud Model Studio: Python SDK

Last Updated: Mar 24, 2026

This topic describes the parameters and interfaces of the Fun-ASR real-time speech recognition Python SDK.

User guide: For model introductions and selection recommendations, see Real-time speech recognition - Fun-ASR/Paraformer.

Prerequisites

Before you call the SDK, obtain an API key (see https://www.alibabacloud.com/help/en/model-studio/get-api-key) and install the DashScope Python SDK. The heartbeat parameter requires SDK version 1.23.1 or later.

Model availability

International

In the international deployment mode, endpoints and data storage are in the Singapore region. Model inference compute resources are dynamically scheduled globally, excluding Chinese Mainland.

| Model | Version | Unit price | Free quota |
| --- | --- | --- | --- |
| fun-asr-realtime (currently fun-asr-realtime-2025-11-07) | Stable | $0.00009/second | 36,000 seconds (10 hours), valid for 90 days |
| fun-asr-realtime-2025-11-07 | Snapshot | $0.00009/second | 36,000 seconds (10 hours), valid for 90 days |

  • Languages supported: Mandarin, Cantonese, Wu, Minnan, Hakka, Gan, Xiang, and Jin. Also supports Mandarin accents from Zhongyuan, Southwest, Jilu, Jianghuai, Lanyin, Jiaoliao, Northeast, Beijing, and Hong Kong–Taiwan regions—including Henan, Shaanxi, Hubei, Sichuan, Chongqing, Yunnan, Guizhou, Guangdong, Guangxi, Hebei, Tianjin, Shandong, Anhui, Nanjing, Jiangsu, Hangzhou, Gansu, and Ningxia. Also supports English and Japanese.

  • Sample rates supported: 16 kHz

  • Audio formats supported: pcm, wav, mp3, opus, speex, aac, amr

Chinese Mainland

In the Chinese Mainland deployment mode, endpoints and data storage are in the Beijing region. Model inference compute resources are limited to Chinese Mainland.

| Model | Version | Unit price | Free quota |
| --- | --- | --- | --- |
| fun-asr-realtime (currently fun-asr-realtime-2025-11-07) | Stable | $0.000047/second | No free quota |
| fun-asr-realtime-2026-02-28 | Snapshot | $0.000047/second | No free quota |
| fun-asr-realtime-2025-11-07 | Snapshot | $0.000047/second | No free quota |
| fun-asr-realtime-2025-09-15 | Snapshot | $0.000047/second | No free quota |
| fun-asr-flash-8k-realtime (currently fun-asr-flash-8k-realtime-2026-01-28) | Stable | $0.000032/second | No free quota |
| fun-asr-flash-8k-realtime-2026-01-28 | Snapshot | $0.000032/second | No free quota |

  • Languages supported:

    • fun-asr-realtime, fun-asr-realtime-2026-02-28, fun-asr-realtime-2025-11-07: Chinese (Mandarin, Cantonese, Wu, Minnan, Hakka, Gan, Xiang, and Jin. Also supports Mandarin accents from Zhongyuan, Southwest, Jilu, Jianghuai, Lanyin, Jiaoliao, Northeast, Beijing, and Hong Kong–Taiwan regions—including Henan, Shaanxi, Hubei, Sichuan, Chongqing, Yunnan, Guizhou, Guangdong, Guangxi, Hebei, Tianjin, Shandong, Anhui, Nanjing, Jiangsu, Hangzhou, Gansu, and Ningxia), English, and Japanese.

    • fun-asr-realtime-2025-09-15: Chinese (Mandarin), English

  • Sample rates supported:

    • fun-asr-flash-8k-realtime and fun-asr-flash-8k-realtime-2026-01-28: 8 kHz

    • All other models: 16 kHz

  • Audio formats supported: pcm, wav, mp3, opus, speex, aac, amr

Getting started

The Recognition class provides interfaces for non-streaming and bidirectional streaming calls. Select a method based on your requirements:

  • Non-streaming call: Recognizes a local file and returns the complete result at once. Suitable for processing pre-recorded audio.

  • Bidirectional streaming call: Recognizes an audio stream and returns results in real time. The audio stream can come from an external device such as a microphone, or from a local file. Suitable for scenarios that require immediate feedback.

Non-streaming call

Submit a real-time speech-to-text task for a single audio file. The call is synchronous and blocks until the transcription result is returned.

Instantiate the Recognition class, set the request parameters, and call the call method to perform recognition and retrieve the recognition result (RecognitionResult).


The audio file used in the example is: asr_example.wav.

from http import HTTPStatus
import dashscope
from dashscope.audio.asr import Recognition
import os

# The API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
# If you have not configured an environment variable, replace the following line with your Model Studio API key: dashscope.api_key = "sk-xxx"
dashscope.api_key = os.environ.get('DASHSCOPE_API_KEY')

# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: wss://dashscope.aliyuncs.com/api-ws/v1/inference
dashscope.base_websocket_api_url='wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference'

recognition = Recognition(model='fun-asr-realtime',
                          format='wav',
                          sample_rate=16000,
                          callback=None)
result = recognition.call('asr_example.wav')
if result.status_code == HTTPStatus.OK:
    print('Recognition result:')
    print(result.get_sentence())
else:
    print('Error: ', result.message)
    
print(
    '[Metric] requestId: {}, first package delay ms: {}, last package delay ms: {}'
    .format(
        recognition.get_last_request_id(),
        recognition.get_first_package_delay(),
        recognition.get_last_package_delay(),
    ))

Bidirectional streaming call

Submit a real-time speech-to-text task and receive streaming results through a callback.

  1. Start streaming speech recognition

    Instantiate the Recognition class, configure the request parameters and the callback (RecognitionCallback), and call the start method to begin streaming speech recognition.

  2. Send audio

    Call the send_audio_frame method of the Recognition class repeatedly to send binary audio data from a local file or a device such as a microphone to the server in segments.

    During audio transmission, the server returns recognition results to the client in real time through the on_event method of the callback (RecognitionCallback).

    Each audio segment should have a duration of about 100 ms and a data size between 1 KB and 16 KB.

  3. Stop recognition

    Call the stop method of the Recognition class to stop speech recognition.

    This method blocks the current thread until the on_complete or on_error method of the RecognitionCallback is triggered.


Recognize speech from a microphone

import os
import signal  # for keyboard events handling (press "Ctrl+C" to terminate recording)
import sys

import dashscope
import pyaudio
from dashscope.audio.asr import *

mic = None
stream = None

# Set recording parameters
sample_rate = 16000  # sampling rate (Hz)
channels = 1  # mono channel
dtype = 'int16'  # data type
format_pcm = 'pcm'  # the format of the audio data
block_size = 3200  # number of frames per buffer


# Real-time speech recognition callback
class Callback(RecognitionCallback):
    def on_open(self) -> None:
        global mic
        global stream
        print('RecognitionCallback open.')
        mic = pyaudio.PyAudio()
        stream = mic.open(format=pyaudio.paInt16,
                          channels=channels,
                          rate=sample_rate,
                          input=True,
                          frames_per_buffer=block_size)

    def on_close(self) -> None:
        global mic
        global stream
        print('RecognitionCallback close.')
        stream.stop_stream()
        stream.close()
        mic.terminate()
        stream = None
        mic = None

    def on_complete(self) -> None:
        print('RecognitionCallback completed.')  # recognition completed

    def on_error(self, message) -> None:
        global stream
        print('RecognitionCallback task_id: ', message.request_id)
        print('RecognitionCallback error: ', message.message)
        # Stop and close the audio stream if it is running
        if stream is not None and stream.is_active():
            stream.stop_stream()
            stream.close()
        # Forcefully exit the program
        sys.exit(1)

    def on_event(self, result: RecognitionResult) -> None:
        sentence = result.get_sentence()
        if 'text' in sentence:
            print('RecognitionCallback text: ', sentence['text'])
            if RecognitionResult.is_sentence_end(sentence):
                print(
                    'RecognitionCallback sentence end, request_id:%s, usage:%s'
                    % (result.get_request_id(), result.get_usage(sentence)))


def signal_handler(sig, frame):
    print('Ctrl+C pressed, stop recognition ...')
    # Stop recognition
    recognition.stop()
    print('Recognition stopped.')
    print(
        '[Metric] requestId: {}, first package delay ms: {}, last package delay ms: {}'
        .format(
            recognition.get_last_request_id(),
            recognition.get_first_package_delay(),
            recognition.get_last_package_delay(),
        ))
    # Forcefully exit the program
    sys.exit(0)


# main function
if __name__ == '__main__':
    # The API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If you have not configured an environment variable, replace the following line with your Model Studio API key: dashscope.api_key = "sk-xxx"
    dashscope.api_key = os.environ.get('DASHSCOPE_API_KEY')

    # The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: wss://dashscope.aliyuncs.com/api-ws/v1/inference
    dashscope.base_websocket_api_url='wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference'

    # Create the recognition callback
    callback = Callback()

    # Call the recognition service in async mode. You can customize recognition
    # parameters such as model, format, and sample_rate.
    recognition = Recognition(
        model='fun-asr-realtime',
        format=format_pcm,
        # 'pcm', 'wav', 'opus', 'speex', 'aac', 'amr'. You can check the supported formats in the document.
        sample_rate=sample_rate,
        # Supports 8000, 16000.
        semantic_punctuation_enabled=False,
        callback=callback)

    # Start recognition
    recognition.start()

    signal.signal(signal.SIGINT, signal_handler)
    print("Press 'Ctrl+C' to stop recording and recognition...")
    # Read from the microphone and send audio until Ctrl+C is pressed

    while True:
        if stream:
            data = stream.read(block_size, exception_on_overflow=False)
            recognition.send_audio_frame(data)
        else:
            break

    recognition.stop()

Recognize a local audio file

The audio file used in the example is asr_example.wav.

import os
import time
import dashscope
from dashscope.audio.asr import *

# API keys differ between the Singapore and Beijing regions. Get an API key: https://www.alibabacloud.com/help/en/model-studio/get-api-key
# If you have not set an environment variable, replace the next line with your Model Studio API key: dashscope.api_key = "sk-xxx"
dashscope.api_key = os.environ.get('DASHSCOPE_API_KEY')

# The following URL is for the Singapore region. To use the Beijing region model, replace it with: wss://dashscope.aliyuncs.com/api-ws/v1/inference
dashscope.base_websocket_api_url='wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference'

from datetime import datetime


def get_timestamp():
    now = datetime.now()
    formatted_timestamp = now.strftime("[%Y-%m-%d %H:%M:%S.%f]")
    return formatted_timestamp


class Callback(RecognitionCallback):
    def on_complete(self) -> None:
        print(get_timestamp() + ' Recognition completed')  # recognition complete

    def on_error(self, result: RecognitionResult) -> None:
        print('Recognition task_id: ', result.request_id)
        print('Recognition error: ', result.message)
        exit(1)

    def on_event(self, result: RecognitionResult) -> None:
        sentence = result.get_sentence()
        if 'text' in sentence:
            print(get_timestamp() + ' RecognitionCallback text: ', sentence['text'])
        if RecognitionResult.is_sentence_end(sentence):
            print(get_timestamp() +
                  ' RecognitionCallback sentence end, request_id:%s, usage:%s'
                  % (result.get_request_id(), result.get_usage(sentence)))


callback = Callback()

recognition = Recognition(model='fun-asr-realtime',
                          format='wav',
                          sample_rate=16000,
                          callback=callback)

# Read the entire file into a buffer
with open("asr_example.wav", 'rb') as f:
    file_buffer = f.read()
if not file_buffer:
    raise Exception('The supplied file was empty (zero bytes long)')

print("Start Recognition")
recognition.start()

# Send data in chunks of 3200 bytes
chunk_size = 3200
offset = 0

while offset < len(file_buffer):
    # Extract the current chunk from the buffer (the last chunk may be shorter)
    audio_data = file_buffer[offset:offset + chunk_size]

    # Send the audio frame
    recognition.send_audio_frame(audio_data)
    # Update the offset
    offset += len(audio_data)

    # Add a delay to simulate real-time transmission
    time.sleep(0.1)

recognition.stop()

print(
    '[Metric] requestId: {}, first package delay ms: {}, last package delay ms: {}'
    .format(
        recognition.get_last_request_id(),
        recognition.get_first_package_delay(),
        recognition.get_last_package_delay(),
    ))

Request parameters

Set the request parameters in the constructor (__init__) of the Recognition class.

model (str; required)

The model for real-time speech recognition.

sample_rate (int; required)

The sample rate of the audio, in Hz. fun-asr-realtime supports a sample rate of 16000 Hz.

format (str; required)

The audio format. Supported formats: pcm, wav, mp3, opus, speex, aac, amr.

Important

opus/speex: Must be Ogg encapsulated.

wav: Must be PCM encoded.

amr: Only AMR-NB is supported.

semantic_punctuation_enabled (bool; optional; default: False)

Specifies whether to enable semantic punctuation.

  • True: Enables semantic punctuation and disables VAD-based punctuation.

  • False (default): Enables VAD-based punctuation and disables semantic punctuation.

Semantic punctuation provides higher accuracy and is suitable for conference transcription. VAD-based punctuation has lower latency and is suitable for interactive scenarios.

max_sentence_silence (int; optional; default: 1300)

The silence duration threshold for VAD-based sentence segmentation, in ms. If the silence after a speech segment exceeds this threshold, the system determines that the sentence has ended. Valid range: 200 to 6000.

This parameter takes effect only when semantic_punctuation_enabled is set to False (VAD punctuation).

multi_threshold_mode_enabled (bool; optional; default: False)

When set to True, prevents VAD from creating excessively long segments.

This parameter takes effect only when semantic_punctuation_enabled is set to False (VAD punctuation).
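The following sketch shows these VAD-related options together. The values are illustrative, not recommendations, and callback is assumed to be a RecognitionCallback instance as in the examples above:

# Illustrative VAD-punctuation tuning; adjust the values for your audio.
recognition = Recognition(
    model='fun-asr-realtime',
    format='pcm',
    sample_rate=16000,
    semantic_punctuation_enabled=False,  # keep VAD-based punctuation
    max_sentence_silence=800,            # end a sentence after 800 ms of silence
    multi_threshold_mode_enabled=True,   # avoid overly long segments
    callback=callback)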

punctuation_prediction_enabled (bool; optional; default: True)

Specifies whether to automatically add punctuation to the recognition results. Only True is supported; this setting cannot be modified.

heartbeat (bool; optional; default: False)

Controls whether to maintain a persistent connection with the server:

  • True: The connection remains active as long as you continuously send silent audio.

  • False (default): The connection times out and closes after 60 seconds, even if you continuously send silent audio.

Silent audio is audio with no sound signal. Generate it with editing software (such as Audacity or Adobe Audition) or FFmpeg.

To use this parameter, your SDK version must be 1.23.1 or later.
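A minimal keepalive sketch, assuming the session was started with heartbeat=True and 16-bit mono PCM at 16 kHz, so 100 ms of silence is 16000 * 2 * 0.1 = 3200 zero bytes. keep_alive is a hypothetical helper, not part of the SDK:

import time

SILENCE_100MS = b'\x00' * 3200  # 100 ms of 16 kHz, 16-bit mono PCM silence

def keep_alive(recognition, seconds):
    # Hypothetical helper: send silent frames while no real audio is available
    deadline = time.time() + seconds
    while time.time() < deadline:
        recognition.send_audio_frame(SILENCE_100MS)
        time.sleep(0.1)  # pace frames at roughly real time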

language_hints (list[str]; optional)

The language codes for recognition. If the language is unknown in advance, leave this parameter unset and the model identifies the language automatically. The system reads only the first value in the list and ignores the rest.

Supported language codes by model:

  • fun-asr-realtime, fun-asr-realtime-2025-11-07: zh (Chinese), en (English), ja (Japanese)

  • fun-asr-realtime-2025-09-15: zh (Chinese), en (English)
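For example, to hint that the audio is Japanese (illustrative; callback is assumed to be defined as in the examples above):

recognition = Recognition(model='fun-asr-realtime',
                          format='wav',
                          sample_rate=16000,
                          language_hints=['ja'],  # only the first value is read
                          callback=callback)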

speech_noise_threshold (float; optional)

Adjusts the speech-noise detection threshold to control VAD sensitivity. Range: [-1.0, 1.0].

  • Near -1.0: Lowers the noise threshold; more noise may be transcribed as speech.

  • Near +1.0: Raises the noise threshold; some speech may be filtered out as noise.

Important

This is an advanced parameter. Adjustments can significantly affect recognition quality.

  • Test thoroughly before adjusting.

  • Make small adjustments (step size 0.1) based on your audio environment.

callback (RecognitionCallback; optional)

The callback interface (RecognitionCallback).

Key interfaces

Recognition class

Import the Recognition class using from dashscope.audio.asr import *.

call

def call(self, file: str, phrase_id: str = None, **kwargs) -> RecognitionResult

Performs non-streaming recognition on a local file. Blocks the current thread until the entire file is processed. The file must be readable. The recognition result is returned as a RecognitionResult.

start

def start(self, phrase_id: str = None, **kwargs)

Starts streaming speech recognition. A callback-based method for real-time recognition that does not block the current thread. Use it with send_audio_frame and stop.

send_audio_frame

def send_audio_frame(self, buffer: bytes)

Sends an audio frame. Each packet should be about 100 ms in duration and between 1 KB and 16 KB in size. Recognition results are returned through the on_event method of the RecognitionCallback interface.

stop

def stop(self)

Stops speech recognition. Blocks until the service finishes processing all received audio.

get_last_request_id

def get_last_request_id(self)

Returns the request ID. Available after the Recognition object is created.

get_first_package_delay

def get_first_package_delay(self)

Returns the first-packet latency: the time from sending the first audio packet to receiving the first recognition result. Available after the task completes.

get_last_package_delay

def get_last_package_delay(self)

Returns the last-packet latency: the time from sending the stop command to receiving the final recognition result. Available after the task completes.

get_response

def get_response(self)

Returns the last message. Use this to retrieve the error message of a failed task.
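As a sizing sanity check for send_audio_frame packets, the recommended ~100 ms duration works out as follows for uncompressed 16-bit mono PCM (compressed formats differ):

sample_rate = 16000   # Hz
bytes_per_sample = 2  # 16-bit PCM
frame_ms = 100        # recommended packet duration

chunk_size = sample_rate * bytes_per_sample * frame_ms // 1000
print(chunk_size)  # 3200 bytes, within the 1 KB to 16 KB range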

Callback interface (RecognitionCallback)

In a bidirectional streaming call, the server returns information and data to the client through callbacks. Implement a callback to process the server responses.


class Callback(RecognitionCallback):
    def on_open(self) -> None:
        print('Connection successful')

    def on_event(self, result: RecognitionResult) -> None:
        # Implement the logic to receive recognition results
        pass

    def on_complete(self) -> None:
        print('Task complete')

    def on_error(self, result: RecognitionResult) -> None:
        print('An exception occurred: ', result)

    def on_close(self) -> None:
        print('Connection closed')


callback = Callback()

| Method | Parameter | Return value | Description |
| --- | --- | --- | --- |
| def on_open(self) -> None | None | None | Called when a connection to the server is established. |
| def on_event(self, result: RecognitionResult) -> None | result: the recognition result (RecognitionResult) | None | Called when the service returns a recognition result. |
| def on_complete(self) -> None | None | None | Called after all recognition results are returned. |
| def on_error(self, result: RecognitionResult) -> None | result: the recognition result (RecognitionResult) | None | Called when an error occurs. |
| def on_close(self) -> None | None | None | Called when the connection is closed. |

Response

Recognition result (RecognitionResult)

The RecognitionResult class represents the result of a streaming call or a synchronous call.

get_sentence

def get_sentence(self) -> Union[Dict[str, Any], List[Any]]

Returns the current recognized sentence with timestamp information. In a callback, returns a single sentence as Dict[str, Any]. For more information, see Sentence information (Sentence).

get_request_id

def get_request_id(self) -> str

Returns the request ID.

is_sentence_end

@staticmethod
def is_sentence_end(sentence: Dict[str, Any]) -> bool

Checks whether the given sentence has ended.

Sentence information (Sentence)

Members of the Sentence class:

| Parameter | Type | Description |
| --- | --- | --- |
| begin_time | int | The start time of the sentence, in ms. |
| end_time | int | The end time of the sentence, in ms. |
| text | str | The recognized text. |
| words | list of Word objects | Word timestamp information. See Word timestamp information (Word). |

Word timestamp information (Word)

Members of the Word class:

| Parameter | Type | Description |
| --- | --- | --- |
| begin_time | int | The start time of the word, in ms. |
| end_time | int | The end time of the word, in ms. |
| text | str | The word. |
| punctuation | str | The punctuation mark. |
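A sketch of reading these fields inside on_event. Dict-style access with the field names above is assumed, matching the earlier examples:

def on_event(self, result: RecognitionResult) -> None:
    sentence = result.get_sentence()
    if 'text' in sentence and RecognitionResult.is_sentence_end(sentence):
        print('[%s-%s ms] %s' % (sentence['begin_time'],
                                 sentence['end_time'],
                                 sentence['text']))
        # Per-word timestamps, if present
        for word in sentence.get('words', []):
            print('  %s-%s ms: %s%s' % (word['begin_time'],
                                        word['end_time'],
                                        word['text'],
                                        word.get('punctuation', '')))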

Error codes

If an error occurs, see Error messages for troubleshooting.

If the problem persists, join the developer group to report the issue. Provide the Request ID to help us investigate the issue.

FAQ

Features

Q: How can I maintain a persistent connection with the server during long periods of silence?

Set the heartbeat request parameter to true and continuously send silent audio to the server.

Silent audio is audio data with no sound signal. You can generate it with audio editing software (such as Audacity or Adobe Audition) or command-line tools (such as FFmpeg).
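For example, this FFmpeg command generates 60 seconds of 16 kHz mono silent WAV audio (the duration and sample rate are illustrative):

# Generate 60 s of 16 kHz, 16-bit mono silence using the anullsrc source
ffmpeg -f lavfi -i anullsrc=r=16000:cl=mono -t 60 -c:a pcm_s16le silence.wav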

Q: How do I convert an audio file to a supported format?

Use the FFmpeg tool. For more information, see the official FFmpeg website.

# Basic conversion command (universal template)
# -i: Specifies the input file path. Example: audio.wav
# -c:a: Specifies the audio encoder. Examples: aac, libmp3lame, pcm_s16le
# -b:a: Specifies the bit rate (controls audio quality). Examples: 192k, 320k
# -ar: Specifies the sample rate. Examples: 44100 (CD), 48000, 16000
# -ac: Specifies the number of sound channels. Examples: 1 (mono), 2 (stereo)
# -y: Overwrites the output file if it already exists (no value needed).
ffmpeg -i input_audio.ext -c:a encoder_name -b:a bit_rate -ar sample_rate -ac number_of_channels output.ext

# Example: WAV to MP3 (maintaining original quality)
ffmpeg -i input.wav -c:a libmp3lame -q:a 0 output.mp3
# Example: MP3 to WAV (standard 16-bit PCM format)
ffmpeg -i input.mp3 -c:a pcm_s16le -ar 44100 -ac 2 output.wav
# Example: M4A to AAC (extract/convert Apple audio)
ffmpeg -i input.m4a -c:a copy output.aac  # Directly extract without re-encoding
ffmpeg -i input.m4a -c:a aac -b:a 256k output.aac  # Re-encode to improve quality
# Example: FLAC lossless to Opus (high compression)
ffmpeg -i input.flac -c:a libopus -b:a 128k -vbr on output.opus

Q: How do I recognize a local file (recorded audio file)?

There are two ways to recognize a local file:

  • Pass the local file path directly. This returns the complete recognition result after the entire file is processed and is not suitable for scenarios that require immediate feedback.

    You can pass the file path to the call method of the Recognition class to recognize the audio file directly. For more information, see Non-streaming call.

  • Convert the local file into a binary stream for recognition. This processes the file and returns recognition results as a stream, suitable for scenarios that require immediate feedback.

    Use the send_audio_frame method of the Recognition class to send a binary stream to the server for recognition. For more information, see Bidirectional streaming call.

Troubleshooting

Q: Why is the speech not being recognized (no recognition result)?

  1. Verify that the audio format (format) and sample rate (sample_rate) match the actual audio file properties. Common errors include:

    • The audio file is in the WAV format, but the format request parameter is incorrectly set to mp3.

    • The audio sample rate is 3600 Hz, but the sample_rate request parameter is incorrectly set to 48000.

    Use the ffprobe tool to obtain information about the audio container, encoding, sample rate, and channels:

    ffprobe -v error -show_entries format=format_name -show_entries stream=codec_name,sample_rate,channels -of default=noprint_wrappers=1 input.xxx