
Alibaba Cloud Model Studio: Python SDK

Last Updated: Mar 24, 2026

This topic describes the parameters and interfaces of the Fun-ASR real-time speech recognition Python SDK.

User guide: For model introductions and selection recommendations, see Real-time speech recognition - Fun-ASR/Paraformer.

Prerequisites

Before you call the SDK, obtain an API key (see https://www.alibabacloud.com/help/en/model-studio/get-api-key) and install the DashScope Python SDK. The heartbeat parameter requires SDK version 1.23.1 or later.

Model availability

International

In the international deployment mode, endpoints and data storage are in the Singapore region. Model inference compute resources are dynamically scheduled globally, excluding Chinese Mainland.

| Model | Version | Unit price | Free quota |
| --- | --- | --- | --- |
| fun-asr-realtime (currently fun-asr-realtime-2025-11-07) | Stable | $0.00009/second | 36,000 seconds (10 hours), valid for 90 days |
| fun-asr-realtime-2025-11-07 | Snapshot | $0.00009/second | 36,000 seconds (10 hours), valid for 90 days |

  • Languages supported: Mandarin, Cantonese, Wu, Minnan, Hakka, Gan, Xiang, and Jin. Also supports Mandarin accents from Zhongyuan, Southwest, Jilu, Jianghuai, Lanyin, Jiaoliao, Northeast, Beijing, and Hong Kong–Taiwan regions—including Henan, Shaanxi, Hubei, Sichuan, Chongqing, Yunnan, Guizhou, Guangdong, Guangxi, Hebei, Tianjin, Shandong, Anhui, Nanjing, Jiangsu, Hangzhou, Gansu, and Ningxia. Also supports English and Japanese.

  • Sample rates supported: 16 kHz

  • Audio formats supported: pcm, wav, mp3, opus, speex, aac, amr

Chinese Mainland

In the Chinese Mainland deployment mode, endpoints and data storage are in the Beijing region. Model inference compute resources are limited to Chinese Mainland.

| Model | Version | Unit price | Free quota |
| --- | --- | --- | --- |
| fun-asr-realtime (currently fun-asr-realtime-2025-11-07) | Stable | $0.000047/second | No free quota |
| fun-asr-realtime-2026-02-28 | Snapshot | $0.000047/second | No free quota |
| fun-asr-realtime-2025-11-07 | Snapshot | $0.000047/second | No free quota |
| fun-asr-realtime-2025-09-15 | Snapshot | $0.000047/second | No free quota |
| fun-asr-flash-8k-realtime (currently fun-asr-flash-8k-realtime-2026-01-28) | Stable | $0.000032/second | No free quota |
| fun-asr-flash-8k-realtime-2026-01-28 | Snapshot | $0.000032/second | No free quota |

  • Languages supported:

    • fun-asr-realtime, fun-asr-realtime-2026-02-28, fun-asr-realtime-2025-11-07: Chinese (Mandarin, Cantonese, Wu, Minnan, Hakka, Gan, Xiang, and Jin. Also supports Mandarin accents from Zhongyuan, Southwest, Jilu, Jianghuai, Lanyin, Jiaoliao, Northeast, Beijing, and Hong Kong–Taiwan regions—including Henan, Shaanxi, Hubei, Sichuan, Chongqing, Yunnan, Guizhou, Guangdong, Guangxi, Hebei, Tianjin, Shandong, Anhui, Nanjing, Jiangsu, Hangzhou, Gansu, and Ningxia), English, and Japanese.

    • fun-asr-realtime-2025-09-15: Chinese (Mandarin), English

  • Sample rates supported:

    • fun-asr-flash-8k-realtime and fun-asr-flash-8k-realtime-2026-01-28: 8 kHz

    • All other models: 16 kHz

  • Audio formats supported: pcm, wav, mp3, opus, speex, aac, amr

Getting started

The Recognition class provides interfaces for non-streaming and bidirectional streaming calls. Select a method based on your requirements:

  • Non-streaming call: Recognizes a local file and returns the complete result at once. Suitable for processing pre-recorded audio.

  • Bidirectional streaming call: Recognizes an audio stream and returns results in real time. The audio stream can come from an external device such as a microphone, or from a local file. Suitable for scenarios that require immediate feedback.

Non-streaming call

Submit a real-time speech-to-text task for a single audio file. The call is synchronous and blocks until the transcription result is returned.

Instantiate the Recognition class, set the request parameters, and call the call method to perform recognition and retrieve the recognition result (RecognitionResult).


The audio file used in the example is: asr_example.wav.

from http import HTTPStatus
import dashscope
from dashscope.audio.asr import Recognition
import os

# The API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
# If you have not configured an environment variable, replace the following line with your Model Studio API key: dashscope.api_key = "sk-xxx"
dashscope.api_key = os.environ.get('DASHSCOPE_API_KEY')

# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: wss://dashscope.aliyuncs.com/api-ws/v1/inference
dashscope.base_websocket_api_url='wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference'

recognition = Recognition(model='fun-asr-realtime',
                          format='wav',
                          sample_rate=16000,
                          callback=None)
result = recognition.call('asr_example.wav')
if result.status_code == HTTPStatus.OK:
    print('Recognition result:')
    print(result.get_sentence())
else:
    print('Error: ', result.message)
    
print(
    '[Metric] requestId: {}, first package delay ms: {}, last package delay ms: {}'
    .format(
        recognition.get_last_request_id(),
        recognition.get_first_package_delay(),
        recognition.get_last_package_delay(),
    ))

Bidirectional streaming call

Submit a real-time speech-to-text task and receive streaming results through a callback.

  1. Start streaming speech recognition

    Instantiate the Recognition class, configure the request parameters and the callback (RecognitionCallback), and call the start method to begin streaming speech recognition.

  2. Send audio

    Call the send_audio_frame method of the Recognition class repeatedly to send binary audio data from a local file or a device such as a microphone to the server in segments.

    During audio transmission, the server returns recognition results to the client in real time through the on_event method of the callback (RecognitionCallback).

    Each audio segment should have a duration of about 100 ms and a data size between 1 KB and 16 KB.

  3. Stop recognition

    Call the stop method of the Recognition class to stop speech recognition.

    This method blocks the current thread until the on_complete or on_error method of the RecognitionCallback is triggered.


Recognize speech from a microphone

import os
import signal  # for keyboard events handling (press "Ctrl+C" to terminate recording)
import sys

import dashscope
import pyaudio
from dashscope.audio.asr import *

mic = None
stream = None

# Set recording parameters
sample_rate = 16000  # sampling rate (Hz)
channels = 1  # mono channel
dtype = 'int16'  # data type
format_pcm = 'pcm'  # the format of the audio data
block_size = 3200  # number of frames per buffer


# Real-time speech recognition callback
class Callback(RecognitionCallback):
    def on_open(self) -> None:
        global mic
        global stream
        print('RecognitionCallback open.')
        mic = pyaudio.PyAudio()
        stream = mic.open(format=pyaudio.paInt16,
                          channels=channels,
                          rate=sample_rate,
                          input=True,
                          frames_per_buffer=block_size)

    def on_close(self) -> None:
        global mic
        global stream
        print('RecognitionCallback close.')
        stream.stop_stream()
        stream.close()
        mic.terminate()
        stream = None
        mic = None

    def on_complete(self) -> None:
        print('RecognitionCallback completed.')  # recognition completed

    def on_error(self, message) -> None:
        global stream
        print('RecognitionCallback task_id: ', message.request_id)
        print('RecognitionCallback error: ', message.message)
        # Stop and close the audio stream if it is running
        if stream is not None and stream.is_active():
            stream.stop_stream()
            stream.close()
        # Forcefully exit the program
        sys.exit(1)

    def on_event(self, result: RecognitionResult) -> None:
        sentence = result.get_sentence()
        if 'text' in sentence:
            print('RecognitionCallback text: ', sentence['text'])
            if RecognitionResult.is_sentence_end(sentence):
                print(
                    'RecognitionCallback sentence end, request_id:%s, usage:%s'
                    % (result.get_request_id(), result.get_usage(sentence)))


def signal_handler(sig, frame):
    print('Ctrl+C pressed, stop recognition ...')
    # Stop recognition
    recognition.stop()
    print('Recognition stopped.')
    print(
        '[Metric] requestId: {}, first package delay ms: {}, last package delay ms: {}'
        .format(
            recognition.get_last_request_id(),
            recognition.get_first_package_delay(),
            recognition.get_last_package_delay(),
        ))
    # Forcefully exit the program
    sys.exit(0)


# main function
if __name__ == '__main__':
    # The API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If you have not configured an environment variable, replace the following line with your Model Studio API key: dashscope.api_key = "sk-xxx"
    dashscope.api_key = os.environ.get('DASHSCOPE_API_KEY')

    # The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: wss://dashscope.aliyuncs.com/api-ws/v1/inference
    dashscope.base_websocket_api_url='wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference'

    # Create the recognition callback
    callback = Callback()

    # Call the recognition service in async mode. You can customize recognition
    # parameters such as model, format, and sample_rate.
    recognition = Recognition(
        model='fun-asr-realtime',
        format=format_pcm,
        # 'pcm', 'wav', 'opus', 'speex', 'aac', 'amr'. You can check the supported formats in the document.
        sample_rate=sample_rate,
        # Supports 8000, 16000.
        semantic_punctuation_enabled=False,
        callback=callback)

    # Start recognition
    recognition.start()

    signal.signal(signal.SIGINT, signal_handler)
    print("Press 'Ctrl+C' to stop recording and recognition...")
    # Read from the microphone and send audio until Ctrl+C is pressed

    while True:
        if stream:
            data = stream.read(block_size, exception_on_overflow=False)
            recognition.send_audio_frame(data)
        else:
            break

    recognition.stop()

Recognize a local audio file

The audio file used in the example is asr_example.wav.

import os
import time
import dashscope
from dashscope.audio.asr import *

# API keys differ between the Singapore and Beijing regions. Get an API key: https://www.alibabacloud.com/help/en/model-studio/get-api-key
# If you have not set an environment variable, replace the next line with your Model Studio API key: dashscope.api_key = "sk-xxx"
dashscope.api_key = os.environ.get('DASHSCOPE_API_KEY')

# The following URL is for the Singapore region. To use the Beijing region model, replace it with: wss://dashscope.aliyuncs.com/api-ws/v1/inference
dashscope.base_websocket_api_url='wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference'

from datetime import datetime


def get_timestamp():
    now = datetime.now()
    formatted_timestamp = now.strftime("[%Y-%m-%d %H:%M:%S.%f]")
    return formatted_timestamp


class Callback(RecognitionCallback):
    def on_complete(self) -> None:
        print(get_timestamp() + ' Recognition completed')  # recognition complete

    def on_error(self, result: RecognitionResult) -> None:
        print('Recognition task_id: ', result.request_id)
        print('Recognition error: ', result.message)
        exit(1)

    def on_event(self, result: RecognitionResult) -> None:
        sentence = result.get_sentence()
        if 'text' in sentence:
            print(get_timestamp() + ' RecognitionCallback text: ', sentence['text'])
        if RecognitionResult.is_sentence_end(sentence):
            print(get_timestamp() +
                  ' RecognitionCallback sentence end, request_id:%s, usage:%s'
                  % (result.get_request_id(), result.get_usage(sentence)))


callback = Callback()

recognition = Recognition(model='fun-asr-realtime',
                          format='wav',
                          sample_rate=16000,
                          callback=callback)

# Read the entire file into a buffer
with open("asr_example.wav", 'rb') as f:
    file_buffer = f.read()
if not file_buffer:
    raise Exception('The supplied file was empty (zero bytes long)')

print("Start Recognition")
recognition.start()

# Send data in chunks of 3200 bytes
chunk_size = 3200
offset = 0

while offset < len(file_buffer):
    # Extract the current chunk from the buffer (the last chunk may be shorter)
    audio_data = file_buffer[offset:offset + chunk_size]

    # Send the audio frame
    recognition.send_audio_frame(audio_data)
    # Update the offset
    offset += len(audio_data)

    # Add a delay to simulate real-time transmission
    time.sleep(0.1)

recognition.stop()

print(
    '[Metric] requestId: {}, first package delay ms: {}, last package delay ms: {}'
    .format(
        recognition.get_last_request_id(),
        recognition.get_first_package_delay(),
        recognition.get_last_package_delay(),
    ))

Request parameters

Set the request parameters in the constructor (__init__) of the Recognition class.

model (str; required)

The model for real-time speech recognition.

sample_rate (int; required)

The sample rate of the audio, in Hz. fun-asr-realtime supports a sample rate of 16000 Hz.

format (str; required)

The audio format. Supported formats: pcm, wav, mp3, opus, speex, aac, amr.

Important

opus/speex: Must be Ogg encapsulated.

wav: Must be PCM encoded.

amr: Only AMR-NB is supported.

semantic_punctuation_enabled (bool; optional; default: False)

Specifies whether to enable semantic punctuation.

  • True: Enables semantic punctuation and disables VAD-based punctuation.

  • False (default): Enables VAD-based punctuation and disables semantic punctuation.

Semantic punctuation provides higher accuracy and is suitable for conference transcription. VAD-based punctuation has lower latency and is suitable for interactive scenarios.

max_sentence_silence (int; optional; default: 1300)

The silence duration threshold for VAD-based sentence segmentation, in ms. If the silence after a speech segment exceeds this threshold, the system determines that the sentence has ended. Valid range: 200 to 6000.

This parameter takes effect only when semantic_punctuation_enabled is set to False (VAD punctuation).

multi_threshold_mode_enabled (bool; optional; default: False)

When set to True, prevents VAD from creating excessively long segments.

This parameter takes effect only when semantic_punctuation_enabled is set to False (VAD punctuation).
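The following sketch shows these VAD-related options together. The values are illustrative, not recommendations, and callback is assumed to be a RecognitionCallback instance as in the examples above:

# Illustrative VAD-punctuation tuning; adjust the values for your audio.
recognition = Recognition(
    model='fun-asr-realtime',
    format='pcm',
    sample_rate=16000,
    semantic_punctuation_enabled=False,  # keep VAD-based punctuation
    max_sentence_silence=800,            # end a sentence after 800 ms of silence
    multi_threshold_mode_enabled=True,   # avoid overly long segments
    callback=callback)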

punctuation_prediction_enabled (bool; optional; default: True)

Specifies whether to automatically add punctuation to the recognition results. Only True is supported; this setting cannot be modified.

heartbeat (bool; optional; default: False)

Controls whether to maintain a persistent connection with the server:

  • True: The connection remains active as long as you continuously send silent audio.

  • False (default): The connection times out and closes after 60 seconds, even if you continuously send silent audio.

Silent audio is audio with no sound signal. Generate it with editing software (such as Audacity or Adobe Audition) or FFmpeg.

To use this parameter, your SDK version must be 1.23.1 or later.
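A minimal keepalive sketch, assuming the session was started with heartbeat=True and 16-bit mono PCM at 16 kHz, so 100 ms of silence is 16000 * 2 * 0.1 = 3200 zero bytes. keep_alive is a hypothetical helper, not part of the SDK:

import time

SILENCE_100MS = b'\x00' * 3200  # 100 ms of 16 kHz, 16-bit mono PCM silence

def keep_alive(recognition, seconds):
    # Hypothetical helper: send silent frames while no real audio is available
    deadline = time.time() + seconds
    while time.time() < deadline:
        recognition.send_audio_frame(SILENCE_100MS)
        time.sleep(0.1)  # pace frames at roughly real time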

language_hints (list[str]; optional)

The language codes for recognition. If the language is unknown in advance, leave this parameter unset and the model identifies the language automatically. The system reads only the first value in the list and ignores the rest.

Supported language codes by model:

  • fun-asr-realtime, fun-asr-realtime-2025-11-07: zh (Chinese), en (English), ja (Japanese)

  • fun-asr-realtime-2025-09-15: zh (Chinese), en (English)
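For example, to hint that the audio is Japanese (illustrative; callback is assumed to be defined as in the examples above):

recognition = Recognition(model='fun-asr-realtime',
                          format='wav',
                          sample_rate=16000,
                          language_hints=['ja'],  # only the first value is read
                          callback=callback)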

speech_noise_threshold (float; optional)

Adjusts the speech-noise detection threshold to control VAD sensitivity. Range: [-1.0, 1.0].

  • Near -1.0: Lowers the noise threshold; more noise may be transcribed as speech.

  • Near +1.0: Raises the noise threshold; some speech may be filtered out as noise.

Important

This is an advanced parameter. Adjustments can significantly affect recognition quality.

  • Test thoroughly before adjusting.

  • Make small adjustments (step size 0.1) based on your audio environment.

callback (RecognitionCallback; optional)

The callback interface (RecognitionCallback).

Key interfaces

Recognition class

Import the Recognition class using from dashscope.audio.asr import *.

call

def call(self, file: str, phrase_id: str = None, **kwargs) -> RecognitionResult

Performs non-streaming recognition on a local file. Blocks the current thread until the entire file is processed. The file must be readable. The recognition result is returned as a RecognitionResult.

start

def start(self, phrase_id: str = None, **kwargs)

Starts streaming speech recognition. A callback-based method for real-time recognition that does not block the current thread. Use it with send_audio_frame and stop.

send_audio_frame

def send_audio_frame(self, buffer: bytes)

Sends an audio frame. Each packet should be about 100 ms in duration and between 1 KB and 16 KB in size. Recognition results are returned through the on_event method of the RecognitionCallback interface.

stop

def stop(self)

Stops speech recognition. Blocks until the service finishes processing all received audio.

get_last_request_id

def get_last_request_id(self)

Returns the request ID. Available after the Recognition object is created.

get_first_package_delay

def get_first_package_delay(self)

Returns the first-packet latency: the time from sending the first audio packet to receiving the first recognition result. Available after the task completes.

get_last_package_delay

def get_last_package_delay(self)

Returns the last-packet latency: the time from sending the stop command to receiving the final recognition result. Available after the task completes.

get_response

def get_response(self)

Returns the last message. Use this to retrieve the error message of a failed task.
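As a sizing sanity check for send_audio_frame packets, the recommended ~100 ms duration works out as follows for uncompressed 16-bit mono PCM (compressed formats differ):

sample_rate = 16000   # Hz
bytes_per_sample = 2  # 16-bit PCM
frame_ms = 100        # recommended packet duration

chunk_size = sample_rate * bytes_per_sample * frame_ms // 1000
print(chunk_size)  # 3200 bytes, within the 1 KB to 16 KB range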

Callback interface (RecognitionCallback)

In a bidirectional streaming call, the server returns information and data to the client through callbacks. Implement a callback to process the server responses.


class Callback(RecognitionCallback):
    def on_open(self) -> None:
        print('Connection successful')

    def on_event(self, result: RecognitionResult) -> None:
        # Implement the logic to receive recognition results
        pass

    def on_complete(self) -> None:
        print('Task complete')

    def on_error(self, result: RecognitionResult) -> None:
        print('An exception occurred: ', result)

    def on_close(self) -> None:
        print('Connection closed')


callback = Callback()

| Method | Parameter | Return value | Description |
| --- | --- | --- | --- |
| def on_open(self) -> None | None | None | Called when a connection to the server is established. |
| def on_event(self, result: RecognitionResult) -> None | result: the recognition result (RecognitionResult) | None | Called when the service returns a recognition result. |
| def on_complete(self) -> None | None | None | Called after all recognition results are returned. |
| def on_error(self, result: RecognitionResult) -> None | result: the recognition result (RecognitionResult) | None | Called when an error occurs. |
| def on_close(self) -> None | None | None | Called when the connection is closed. |

Response

Recognition result (RecognitionResult)

The RecognitionResult class represents the result of a streaming call or a synchronous call.

get_sentence

def get_sentence(self) -> Union[Dict[str, Any], List[Any]]

Returns the current recognized sentence with timestamp information. In a callback, returns a single sentence as Dict[str, Any]. For more information, see Sentence information (Sentence).

get_request_id

def get_request_id(self) -> str

Returns the request ID.

is_sentence_end

@staticmethod
def is_sentence_end(sentence: Dict[str, Any]) -> bool

Checks whether the given sentence has ended.

Sentence information (Sentence)

Members of the Sentence class:

| Parameter | Type | Description |
| --- | --- | --- |
| begin_time | int | The start time of the sentence, in ms. |
| end_time | int | The end time of the sentence, in ms. |
| text | str | The recognized text. |
| words | list of Word objects | Word timestamp information. See Word timestamp information (Word). |

Word timestamp information (Word)

Members of the Word class:

| Parameter | Type | Description |
| --- | --- | --- |
| begin_time | int | The start time of the word, in ms. |
| end_time | int | The end time of the word, in ms. |
| text | str | The word. |
| punctuation | str | The punctuation mark. |
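A sketch of reading these fields inside on_event. Dict-style access with the field names above is assumed, matching the earlier examples:

def on_event(self, result: RecognitionResult) -> None:
    sentence = result.get_sentence()
    if 'text' in sentence and RecognitionResult.is_sentence_end(sentence):
        print('[%s-%s ms] %s' % (sentence['begin_time'],
                                 sentence['end_time'],
                                 sentence['text']))
        # Per-word timestamps, if present
        for word in sentence.get('words', []):
            print('  %s-%s ms: %s%s' % (word['begin_time'],
                                        word['end_time'],
                                        word['text'],
                                        word.get('punctuation', '')))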

Error codes

If an error occurs, see Error messages for troubleshooting.

If the problem persists, join the developer group to report the issue. Provide the Request ID to help us investigate the issue.

FAQ

Features

Q: How can I maintain a persistent connection with the server during long periods of silence?

Set the heartbeat request parameter to true and continuously send silent audio to the server.

Silent audio is audio data with no sound signal. You can generate it with audio editing software (such as Audacity or Adobe Audition) or command-line tools (such as FFmpeg).
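For example, this FFmpeg command generates 60 seconds of 16 kHz mono silent WAV audio (the duration and sample rate are illustrative):

# Generate 60 s of 16 kHz, 16-bit mono silence using the anullsrc source
ffmpeg -f lavfi -i anullsrc=r=16000:cl=mono -t 60 -c:a pcm_s16le silence.wav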

Q: How do I convert an audio file to a supported format?

Use the FFmpeg tool. For more information, see the official FFmpeg website.

# Basic conversion command (universal template)
# -i: Specifies the input file path. Example: audio.wav
# -c:a: Specifies the audio encoder. Examples: aac, libmp3lame, pcm_s16le
# -b:a: Specifies the bit rate (controls audio quality). Examples: 192k, 320k
# -ar: Specifies the sample rate. Examples: 44100 (CD), 48000, 16000
# -ac: Specifies the number of sound channels. Examples: 1 (mono), 2 (stereo)
# -y: Overwrites the output file if it already exists (no value needed).
ffmpeg -i input_audio.ext -c:a encoder_name -b:a bit_rate -ar sample_rate -ac number_of_channels output.ext

# Example: WAV to MP3 (maintaining original quality)
ffmpeg -i input.wav -c:a libmp3lame -q:a 0 output.mp3
# Example: MP3 to WAV (standard 16-bit PCM format)
ffmpeg -i input.mp3 -c:a pcm_s16le -ar 44100 -ac 2 output.wav
# Example: M4A to AAC (extract/convert Apple audio)
ffmpeg -i input.m4a -c:a copy output.aac  # Directly extract without re-encoding
ffmpeg -i input.m4a -c:a aac -b:a 256k output.aac  # Re-encode to improve quality
# Example: FLAC lossless to Opus (high compression)
ffmpeg -i input.flac -c:a libopus -b:a 128k -vbr on output.opus

Q: How do I recognize a local file (recorded audio file)?

There are two ways to recognize a local file:

  • Pass the local file path directly. This returns the complete recognition result after the entire file is processed and is not suitable for scenarios that require immediate feedback.

    You can pass the file path to the call method of the Recognition class to recognize the audio file directly. For more information, see Non-streaming call.

  • Convert the local file into a binary stream for recognition. This processes the file and returns recognition results as a stream, suitable for scenarios that require immediate feedback.

    Use the send_audio_frame method of the Recognition class to send a binary stream to the server for recognition. For more information, see Bidirectional streaming call.

Troubleshooting

Q: Why is the speech not being recognized (no recognition result)?

  1. Verify that the audio format (format) and sample rate (sample_rate) match the actual audio file properties. Common errors include:

    • The audio file is in the WAV format, but the format request parameter is incorrectly set to mp3.

    • The audio sample rate is 3600 Hz, but the sample_rate request parameter is incorrectly set to 48000.

    Use the ffprobe tool to obtain information about the audio container, encoding, sample rate, and channels:

    ffprobe -v error -show_entries format=format_name -show_entries stream=codec_name,sample_rate,channels -of default=noprint_wrappers=1 input.xxx