
Alibaba Cloud Model Studio: Python SDK

Last Updated: Dec 24, 2025

This topic describes the parameters and interfaces of the Fun-ASR real-time speech recognition Python SDK.

User Guide: For an introduction to the models and recommendations for model selection, see Real-time speech recognition - Fun-ASR/Paraformer.

Prerequisites

Model availability

International (Singapore)

fun-asr-realtime (currently equivalent to fun-asr-realtime-2025-11-07)

  • Version: Stable

  • Supported languages: Chinese (Mandarin, Cantonese, Wu, Minnan, Hakka, Gan, Xiang, and Jin), English, and Japanese. The model also supports Mandarin accents from various Chinese regions, including Zhongyuan, Southwest, Ji-Lu, Jianghuai, Lan-Yin, Jiao-Liao, Northeast, Beijing, Hong Kong/Taiwan, Henan, Shaanxi, Hubei, Sichuan, Chongqing, Yunnan, Guizhou, Guangdong, Guangxi, Hebei, Tianjin, Shandong, Anhui, Nanjing, Jiangsu, Hangzhou, Gansu, and Ningxia.

  • Supported sample rates: 16 kHz

  • Scenarios: ApsaraVideo Live, conferences, call centers, and more

  • Supported audio formats: PCM, WAV, MP3, Opus, Speex, AAC, and AMR

  • Price: $0.00009/second

  • Free quota: 36,000 seconds (10 hours), valid for 90 days

fun-asr-realtime-2025-11-07

  • Version: Snapshot

  • Other specifications are the same as fun-asr-realtime.

China (Beijing)

fun-asr-realtime (equivalent to fun-asr-realtime-2025-11-07)

  • Version: Stable

  • Supported languages: Chinese (Mandarin, Cantonese, Wu, Minnan, Hakka, Gan, Xiang, and Jin), English, and Japanese. The model also supports Mandarin accents from various Chinese regions and provinces, including Zhongyuan, Southwest, Ji-Lu, Jianghuai, Lan-Yin, Jiao-Liao, Northeast, Beijing, Hong Kong/Taiwan, Henan, Shaanxi, Hubei, Sichuan, Chongqing, Yunnan, Guizhou, Guangdong, Guangxi, Hebei, Tianjin, Shandong, Anhui, Nanjing, Jiangsu, Hangzhou, Gansu, and Ningxia.

  • Supported sample rates: 16 kHz

  • Scenarios: ApsaraVideo Live, conferences, call centers, and more

  • Supported audio formats: PCM, WAV, MP3, Opus, Speex, AAC, and AMR

  • Price: $0.000047/second

fun-asr-realtime-2025-11-07

  • Version: Snapshot

  • This version is optimized for far-field Voice Activity Detection (VAD) and provides higher recognition accuracy than fun-asr-realtime-2025-09-15.

fun-asr-realtime-2025-09-15

  • Version: Snapshot

  • Supported languages: Chinese (Mandarin), English

Getting started

The Recognition class provides interfaces for non-streaming and bidirectional streaming calls. Select a call method based on your requirements:

  • Non-streaming call: Recognize a local file and return the complete result at once. This method is suitable for processing pre-recorded audio.

  • Bidirectional streaming call: Recognize an audio stream and output the results in real time. The audio stream can be from an external device, such as a microphone, or read from a local file. This method is suitable for scenarios that require immediate feedback.

Non-streaming call

Submit a real-time speech-to-text task for a single audio file. The call is synchronous and blocks until the transcription result is returned.

Instantiate the Recognition class, set the request parameters, and call the call method to perform recognition and retrieve the recognition result (RecognitionResult).

Complete example

The audio file used in the example is: asr_example.wav.

from http import HTTPStatus
import dashscope
from dashscope.audio.asr import Recognition
import os

# The API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
# If you have not configured an environment variable, replace the following line with your Model Studio API key: dashscope.api_key = "sk-xxx"
dashscope.api_key = os.environ.get('DASHSCOPE_API_KEY')

# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: wss://dashscope.aliyuncs.com/api-ws/v1/inference
dashscope.base_websocket_api_url='wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference'

recognition = Recognition(model='fun-asr-realtime',
                          format='wav',
                          sample_rate=16000,
                          callback=None)
result = recognition.call('asr_example.wav')
if result.status_code == HTTPStatus.OK:
    print('Recognition result:')
    print(result.get_sentence())
else:
    print('Error: ', result.message)
    
print(
    '[Metric] requestId: {}, first package delay ms: {}, last package delay ms: {}'
    .format(
        recognition.get_last_request_id(),
        recognition.get_first_package_delay(),
        recognition.get_last_package_delay(),
    ))

Bidirectional streaming call

Submit a real-time speech-to-text task and receive streaming results through a callback.

  1. Start streaming speech recognition

    Instantiate the Recognition class, set the request parameters and the callback (RecognitionCallback), and call the start method to start streaming speech recognition.

  2. Send the audio stream

    Repeatedly call the send_audio_frame method of the Recognition class to send the binary audio stream from a local file or a device, such as a microphone, to the server in segments.

    During audio data transmission, the server returns the recognition results to the client in real time through the on_event method of the callback (RecognitionCallback).

    Each audio segment should have a duration of about 100 ms and a data size between 1 KB and 16 KB (see the sizing sketch after this list).

  3. End the process

    Call the stop method of the Recognition class to stop speech recognition.

    This method blocks the current thread until the on_complete or on_error method of the RecognitionCallback is triggered.
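
The recommended segment size follows directly from the audio parameters. The following is an illustrative calculation, not part of the SDK: for 16 kHz, 16-bit (2-byte), mono PCM, 100 ms of audio is 3,200 bytes, which is why the examples below read and send 3,200-byte chunks.

def pcm_chunk_bytes(sample_rate_hz: int = 16000,
                    sample_width_bytes: int = 2,  # 16-bit samples
                    channels: int = 1,
                    chunk_ms: int = 100) -> int:
    # Bytes of raw PCM audio contained in one chunk of the given duration.
    return sample_rate_hz * sample_width_bytes * channels * chunk_ms // 1000

print(pcm_chunk_bytes())  # 3200, the block size used in the examples below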

Complete examples

Recognize speech from a microphone

import os
import signal  # for keyboard events handling (press "Ctrl+C" to terminate recording)
import sys

import dashscope
import pyaudio
from dashscope.audio.asr import *

mic = None
stream = None

# Set recording parameters
sample_rate = 16000  # sampling rate (Hz)
channels = 1  # mono channel
dtype = 'int16'  # data type
format_pcm = 'pcm'  # the format of the audio data
block_size = 3200  # number of frames per buffer


# Real-time speech recognition callback
class Callback(RecognitionCallback):
    def on_open(self) -> None:
        global mic
        global stream
        print('RecognitionCallback open.')
        mic = pyaudio.PyAudio()
        stream = mic.open(format=pyaudio.paInt16,
                          channels=1,
                          rate=16000,
                          input=True)

    def on_close(self) -> None:
        global mic
        global stream
        print('RecognitionCallback close.')
        stream.stop_stream()
        stream.close()
        mic.terminate()
        stream = None
        mic = None

    def on_complete(self) -> None:
        print('RecognitionCallback completed.')  # recognition completed

    def on_error(self, message) -> None:
        print('RecognitionCallback task_id: ', message.request_id)
        print('RecognitionCallback error: ', message.message)
        # Stop and close the audio stream if it is running
        if stream is not None and stream.is_active():
            stream.stop_stream()
            stream.close()
        # Forcefully exit the program
        sys.exit(1)

    def on_event(self, result: RecognitionResult) -> None:
        sentence = result.get_sentence()
        if 'text' in sentence:
            print('RecognitionCallback text: ', sentence['text'])
            if RecognitionResult.is_sentence_end(sentence):
                print(
                    'RecognitionCallback sentence end, request_id:%s, usage:%s'
                    % (result.get_request_id(), result.get_usage(sentence)))


def signal_handler(sig, frame):
    print('Ctrl+C pressed, stop recognition ...')
    # Stop recognition
    recognition.stop()
    print('Recognition stopped.')
    print(
        '[Metric] requestId: {}, first package delay ms: {}, last package delay ms: {}'
        .format(
            recognition.get_last_request_id(),
            recognition.get_first_package_delay(),
            recognition.get_last_package_delay(),
        ))
    # Forcefully exit the program
    sys.exit(0)


# main function
if __name__ == '__main__':
    # The API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If you have not configured an environment variable, replace the following line with your Model Studio API key: dashscope.api_key = "sk-xxx"
    dashscope.api_key = os.environ.get('DASHSCOPE_API_KEY')

    # The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: wss://dashscope.aliyuncs.com/api-ws/v1/inference
    dashscope.base_websocket_api_url='wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference'

    # Create the recognition callback
    callback = Callback()

    # Call the recognition service in asynchronous (streaming) mode. You can customize
    # recognition parameters such as model, format, and sample_rate.
    recognition = Recognition(
        model='fun-asr-realtime',
        format=format_pcm,
        # 'pcm', 'wav', 'opus', 'speex', 'aac', 'amr'. You can check the supported formats in the document.
        sample_rate=sample_rate,
        # Supports 8000, 16000.
        semantic_punctuation_enabled=False,
        callback=callback)

    # Start recognition
    recognition.start()

    signal.signal(signal.SIGINT, signal_handler)
    print("Press 'Ctrl+C' to stop recording and recognition...")
    # Create a keyboard listener until "Ctrl+C" is pressed

    while True:
        if stream:
            data = stream.read(3200, exception_on_overflow=False)
            recognition.send_audio_frame(data)
        else:
            break

    recognition.stop()

Recognize a local audio file

The audio file used in the example is: asr_example.wav.

import os
import time
import dashscope
from dashscope.audio.asr import *

# The API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
# If you have not configured an environment variable, replace the following line with your Model Studio API key: dashscope.api_key = "sk-xxx"
dashscope.api_key = os.environ.get('DASHSCOPE_API_KEY')

# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: wss://dashscope.aliyuncs.com/api-ws/v1/inference
dashscope.base_websocket_api_url='wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference'

from datetime import datetime

def get_timestamp():
    now = datetime.now()
    formatted_timestamp = now.strftime("[%Y-%m-%d %H:%M:%S.%f]")
    return formatted_timestamp

class Callback(RecognitionCallback):
    def on_complete(self) -> None:
        print(get_timestamp() + ' Recognition completed')  # recognition complete

    def on_error(self, result: RecognitionResult) -> None:
        print('Recognition task_id: ', result.request_id)
        print('Recognition error: ', result.message)
        exit(0)

    def on_event(self, result: RecognitionResult) -> None:
        sentence = result.get_sentence()
        if 'text' in sentence:
            print(get_timestamp() + ' RecognitionCallback text: ', sentence['text'])
            if RecognitionResult.is_sentence_end(sentence):
                print(get_timestamp() +
                    ' RecognitionCallback sentence end, request_id:%s, usage:%s'
                    % (result.get_request_id(), result.get_usage(sentence)))


callback = Callback()

recognition = Recognition(model='fun-asr-realtime',
                          format='wav',
                          sample_rate=16000,
                          callback=callback)

recognition.start()

if not os.path.getsize("asr_example.wav"):
    raise Exception('The supplied file was empty (zero bytes long)')

# Read the file in 3200-byte chunks (about 100 ms of audio) and send each chunk
# to the service to simulate a real-time audio stream.
with open("asr_example.wav", 'rb') as f:
    while True:
        audio_data = f.read(3200)
        if not audio_data:
            break
        recognition.send_audio_frame(audio_data)
        time.sleep(0.1)

recognition.stop()

print(
    '[Metric] requestId: {}, first package delay ms: {}, last package delay ms: {}'
    .format(
        recognition.get_last_request_id(),
        recognition.get_first_package_delay(),
        recognition.get_last_package_delay(),
    ))

Request parameters

Set the request parameters in the constructor (__init__ method) of the Recognition class.

model (str, required)

The model used for real-time speech recognition.

sample_rate (int, required)

The sample rate of the audio to be recognized, in Hz.

fun-asr-realtime supports a 16000 Hz sample rate.

format (str, required)

The format of the audio to be recognized.

Supported audio formats: pcm, wav, mp3, opus, speex, aac, amr.

Important

  • opus/speex: Must be Ogg encapsulated.

  • wav: Must be PCM encoded.

  • amr: Only AMR-NB is supported.

vocabulary_id (str, optional)

The vocabulary ID. For more information, see Custom vocabulary.

This parameter is not set by default.

semantic_punctuation_enabled (bool, optional, default: False)

Specifies whether to enable semantic punctuation:

  • True: Enables semantic punctuation and disables Voice Activity Detection (VAD) based punctuation.

  • False (default): Enables VAD punctuation and disables semantic punctuation.

Semantic punctuation provides higher accuracy and is suitable for conference transcription. VAD punctuation has lower latency and is suitable for interactive scenarios. Adjust this parameter to switch between the two punctuation methods as needed.

max_sentence_silence (int, optional, default: 1300)

The silence duration threshold for VAD punctuation, in ms. If the silence after a speech segment exceeds this threshold, the system determines that the sentence has ended.

Valid values: 200 to 6000.

This parameter takes effect only when semantic_punctuation_enabled is set to False (VAD punctuation).

multi_threshold_mode_enabled (bool, optional, default: False)

If this parameter is set to True, it prevents VAD punctuation from creating excessively long segments.

This parameter takes effect only when semantic_punctuation_enabled is set to False (VAD punctuation).

punctuation_prediction_enabled (bool, optional, default: True)

Specifies whether to automatically add punctuation to the recognition results. The value is fixed to True and cannot be modified.

heartbeat (bool, optional, default: False)

Controls whether to maintain a persistent connection with the server:

  • True: The connection remains active as long as you continuously send silent audio.

  • False (default): The connection times out and closes after 60 seconds, even if you continuously send silent audio.

Silent audio refers to content in an audio file or data stream that has no sound signal. You can generate silent audio with audio editing software such as Audacity or Adobe Audition, or with command-line tools such as FFmpeg.

To use this parameter, your SDK version must be 1.23.1 or later.

callback (RecognitionCallback, optional)

The RecognitionCallback interface.
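
The following sketch shows how several of these optional parameters can be combined when constructing the Recognition object. The parameter values are illustrative assumptions for a low-latency interactive scenario, not required settings, and callback is assumed to be a RecognitionCallback implementation such as the ones shown above.

recognition = Recognition(
    model='fun-asr-realtime',
    format='pcm',                        # raw PCM audio
    sample_rate=16000,                   # fun-asr-realtime supports 16000 Hz
    semantic_punctuation_enabled=False,  # VAD-based punctuation (lower latency)
    max_sentence_silence=800,            # assumed value: end a sentence after 800 ms of silence
    heartbeat=True,                      # keep the connection alive while sending silent audio
    callback=callback)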

Key interfaces

Recognition class

Import the Recognition class using from dashscope.audio.asr import *.

Member method

Method signature

Description

call

def call(self, file: str, phrase_id: str = None, **kwargs) -> RecognitionResult

A non-streaming call based on a local file. This method blocks the current thread until the entire audio file is read. The file must have read permissions.

The recognition result is returned as a RecognitionResult object.

start

def start(self, phrase_id: str = None, **kwargs)

Starts speech recognition.

This is a streaming, real-time recognition method based on callbacks. It does not block the current thread. Use it with send_audio_frame and stop.

send_audio_frame

def send_audio_frame(self, buffer: bytes)

Pushes an audio frame. Each audio packet should have a duration of about 100 ms and a size between 1 KB and 16 KB.

You can retrieve the recognition results through the on_event method of the RecognitionCallback interface.

stop

def stop(self)

Stops speech recognition. This method blocks until the service has recognized all received audio and the task is complete.

get_last_request_id

def get_last_request_id(self)

Gets the request_id. Use this method after the constructor is called (the object is created).

get_first_package_delay

def get_first_package_delay(self)

Gets the first-packet latency. This is the delay from sending the first audio packet to receiving the first recognition result packet. Use this method after the task is complete.

get_last_package_delay

def get_last_package_delay(self)

Gets the last-packet latency. This is the time elapsed from sending the stop command to receiving the last recognition result packet. Use this method after the task is complete.

get_response

def get_response(self)

Gets the last message. Use this to get `task-failed` error messages.
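
For example, after a failed non-streaming call you can inspect the last message for troubleshooting. This is a minimal sketch; the error-handling flow is an assumption, not a required pattern.

from http import HTTPStatus

result = recognition.call('asr_example.wav')
if result.status_code != HTTPStatus.OK:
    # Print the last message returned by the service, such as a task-failed
    # message, to help diagnose the error.
    print('Last service message:', recognition.get_response())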

Callback interface (RecognitionCallback)

In a bidirectional streaming call, the server returns key information and data to the client through callbacks. You must implement a callback to process the information or data that is returned by the server.

Example

class Callback(RecognitionCallback):
    def on_open(self) -> None:
        print('Connection successful')

    def on_event(self, result: RecognitionResult) -> None:
        # Implement the logic to receive recognition results
        pass

    def on_complete(self) -> None:
        print('Task complete')

    def on_error(self, result: RecognitionResult) -> None:
        print('An exception occurred: ', result)

    def on_close(self) -> None:
        print('Connection closed')


callback = Callback()

def on_open(self) -> None

Parameters: none. Return value: none.

This method is called immediately after a connection is established with the server.

def on_event(self, result: RecognitionResult) -> None

Parameters: result, the recognition result (RecognitionResult). Return value: none.

This method is called when the service sends a response.

def on_complete(self) -> None

Parameters: none. Return value: none.

This method is called after all recognition results have been returned.

def on_error(self, result: RecognitionResult) -> None

Parameters: result, the recognition result (RecognitionResult). Return value: none.

This method is called when an error occurs.

def on_close(self) -> None

Parameters: none. Return value: none.

This method is called after the service has closed the connection.

Response

Recognition result (RecognitionResult)

The RecognitionResult class represents the result of a streaming or synchronous call.

Member method

Method signature

Description

get_sentence

def get_sentence(self) -> Union[Dict[str, Any], List[Any]]

Gets the currently recognized sentence and its timestamp information. In a callback, a single sentence is returned, so the return type of this method is `Dict[str, Any]`.

For more information, see Sentence information (Sentence).

get_request_id

def get_request_id(self) -> str

Gets the request_id of the request.

is_sentence_end

@staticmethod
def is_sentence_end(sentence: Dict[str, Any]) -> bool

Checks whether the given sentence has ended.

Sentence information (Sentence)

The members of the Sentence class are as follows:

begin_time (int)

The start time of the sentence, in ms.

end_time (int)

The end time of the sentence, in ms.

text (str)

The recognized text.

words (a list of Word objects)

Word-level timestamp information. For more information, see Word timestamp information (Word).

Word timestamp information (Word)

The members of the Word class are as follows:

begin_time (int)

The start time of the word, in ms.

end_time (int)

The end time of the word, in ms.

text (str)

The word.

punctuation (str)

The punctuation mark.
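
As a sketch of how these fields can be consumed in a callback, an on_event implementation might read the sentence and word timestamps as follows. The printing logic is illustrative; the field names follow the Sentence and Word descriptions above, and optional fields are read defensively with dict.get.

from dashscope.audio.asr import *

class Callback(RecognitionCallback):
    def on_event(self, result: RecognitionResult) -> None:
        sentence = result.get_sentence()
        if 'text' in sentence:
            print('partial text:', sentence['text'])
            if RecognitionResult.is_sentence_end(sentence):
                # Sentence-level timestamps
                print('sentence "%s" (%s-%s ms)' % (
                    sentence['text'], sentence.get('begin_time'), sentence.get('end_time')))
                # Word-level timestamps and punctuation
                for word in sentence.get('words', []):
                    print('  word "%s%s" (%s-%s ms)' % (
                        word.get('text'), word.get('punctuation', ''),
                        word.get('begin_time'), word.get('end_time')))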

Error codes

If an error occurs, see Error messages for troubleshooting.

If the problem persists, join the developer group to report the issue. Provide the Request ID to help us investigate the issue.

FAQ

Features

Q: How can I maintain a persistent connection with the server during long periods of silence?

Set the heartbeat request parameter to true and continuously send silent audio to the server.

Silent audio refers to content in an audio file or data stream that has no sound signal. You can generate silent audio using various methods, such as using audio editing software such as Audacity or Adobe Audition, or using command-line tools such as FFmpeg.
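
Silent audio can also be generated directly in code. The following is a sketch that assumes 16 kHz, 16-bit mono PCM and a Recognition instance named recognition that has already been started with heartbeat=True; it sends all-zero samples in 100 ms frames.

import time

silent_frame = b'\x00' * 3200  # 100 ms of silence: 16000 samples/s * 2 bytes * 0.1 s

for _ in range(50):  # roughly 5 seconds of silence
    recognition.send_audio_frame(silent_frame)
    time.sleep(0.1)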

Q: How do I convert an audio file to a supported format?

Use the FFmpeg tool. For more information, see the official FFmpeg website.

# Basic conversion command (universal template)
# -i: Specifies the input file path. Example: audio.wav
# -c:a: Specifies the audio encoder. Examples: aac, libmp3lame, pcm_s16le
# -b:a: Specifies the bit rate (controls audio quality). Examples: 192k, 320k
# -ar: Specifies the sample rate. Examples: 44100 (CD), 48000, 16000
# -ac: Specifies the number of sound channels. Examples: 1 (mono), 2 (stereo)
# -y: Overwrites the output file if it already exists (no value needed).
ffmpeg -i input_audio.ext -c:a encoder_name -b:a bit_rate -ar sample_rate -ac number_of_channels output.ext

# Example: WAV to MP3 (maintaining original quality)
ffmpeg -i input.wav -c:a libmp3lame -q:a 0 output.mp3
# Example: MP3 to WAV (standard 16-bit PCM format)
ffmpeg -i input.mp3 -c:a pcm_s16le -ar 44100 -ac 2 output.wav
# Example: M4A to AAC (extract/convert Apple audio)
ffmpeg -i input.m4a -c:a copy output.aac  # Directly extract without re-encoding
ffmpeg -i input.m4a -c:a aac -b:a 256k output.aac  # Re-encode to improve quality
# Example: FLAC lossless to Opus (high compression)
ffmpeg -i input.flac -c:a libopus -b:a 128k -vbr on output.opus

Q: How do I recognize a local file (recorded audio file)?

There are two ways to recognize a local file:

  • Pass the local file path directly. This method returns the complete recognition result after the entire file is processed and is not suitable for scenarios that require immediate feedback.

    You can pass the file path to the call method of the Recognition class to directly recognize the audio file. For more information, see Non-streaming call.

  • Convert the local file into a binary stream for recognition. This method processes the file and returns recognition results as a stream, which makes it suitable for scenarios that require immediate feedback.

    Use the send_audio_frame method of the Recognition class to send a binary stream to the server for recognition. For more information, see Bidirectional streaming call.

Troubleshooting

Q: Why is the speech not being recognized (no recognition result)?

  1. Check whether the audio format (format) and sample rate (sample_rate) in the request parameters are set correctly and match the audio file's properties. Common errors include:

    • The audio file is in the WAV format, but the format request parameter is incorrectly set to `mp3`.

    • The audio sample rate is 3600 Hz, but the sample_rate request parameter is incorrectly set to 48000.

    Use the ffprobe tool to obtain information about the audio container, encoding, sample rate, and channels:

    ffprobe -v error -show_entries format=format_name -show_entries stream=codec_name,sample_rate,channels -of default=noprint_wrappers=1 input.xxx
  2. If the previous checks pass, you can try customizing vocabulary to improve the recognition of specific words.
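
    A custom vocabulary is applied by passing its ID through the vocabulary_id request parameter. The following is a minimal sketch; 'vocab-xxxx' is a hypothetical placeholder, and the vocabulary must first be created as described in Custom vocabulary.

    recognition = Recognition(model='fun-asr-realtime',
                              format='wav',
                              sample_rate=16000,
                              vocabulary_id='vocab-xxxx',  # hypothetical placeholder ID
                              callback=None)
    result = recognition.call('asr_example.wav')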