
Alibaba Cloud Model Studio: Paraformer real-time speech recognition Python SDK

Last Updated: Oct 15, 2025

This topic describes the parameters and interfaces of the Paraformer real-time speech recognition Python SDK.

Important

This document applies only to the China (Beijing) region. To use the models, you must use an API key from the China (Beijing) region.

User Guide: For model descriptions and selection recommendations, see Real-time Speech Recognition.

Prerequisites

  • You have activated the service and obtained an API key. To prevent security risks from code leakage, configure the API key as an environment variable instead of hard-coding it in your code.

    Note

    When you need to provide temporary access permissions to third-party applications or users, or want to strictly control high-risk operations such as accessing or deleting sensitive data, we recommend that you use a temporary authentication token.

    Compared to a long-term API key, a temporary authentication token is short-lived (60 seconds) and more secure. It is suitable for temporary call scenarios and effectively reduces the risk of API key leakage.

    Usage: In your code, replace the API key used for authentication with the temporary authentication token.

  • Install the latest version of the DashScope SDK.
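    For example, assuming you manage packages with pip, the SDK can typically be installed or upgraded as follows:

    pip install -U dashscope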

Model list

paraformer-realtime-v2

  • Scenarios: Live streaming, meetings, and other similar scenarios

  • Sample rate: Any

  • Languages: Chinese (including Mandarin and various dialects), English, Japanese, Korean, German, French, and Russian

    Supported Chinese dialects: Shanghainese, Wu, Minnan, Northeastern, Gansu, Guizhou, Henan, Hubei, Hunan, Jiangxi, Ningxia, Shanxi, Shaanxi, Shandong, Sichuan, Tianjin, Yunnan, and Cantonese

  • Punctuation prediction: ✅ Supported by default. No configuration is required.

  • Inverse Text Normalization (ITN): ✅ Supported by default. No configuration is required.

  • Custom vocabulary: ✅ See Custom hotwords.

  • Specify recognition language: ✅ Specify the language using the language_hints parameter.

paraformer-realtime-8k-v2

  • Scenarios: Recognition of 8 kHz audio, such as telephone customer service and voicemail

  • Sample rate: 8 kHz

  • Languages: Chinese

  • Punctuation prediction: ✅ Supported by default. No configuration is required.

  • Inverse Text Normalization (ITN): ✅ Supported by default. No configuration is required.

  • Custom vocabulary: ✅ See Custom hotwords.

  • Emotion recognition: ✅ Available only for this model, with the following constraints:

    • You must disable semantic punctuation (controlled by the request parameter semantic_punctuation_enabled). Semantic punctuation is disabled by default.

    • The emotion recognition result is returned only when the RecognitionResult object's is_sentence_end method returns True.

    To obtain the emotion detection results, retrieve the emotion and the emotion confidence level of the current sentence from the emo_tag and emo_confidence fields of the single-sentence information (Sentence), respectively.
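A minimal sketch of reading these fields inside an on_event callback, assuming the streaming setup described under Getting started and the paraformer-realtime-8k-v2 model; the field names come from the Sentence description later in this topic:

from dashscope.audio.asr import RecognitionCallback, RecognitionResult

class EmotionCallback(RecognitionCallback):
    def on_event(self, result: RecognitionResult) -> None:
        sentence = result.get_sentence()
        # Emotion fields are populated only for completed sentences.
        if 'text' in sentence and RecognitionResult.is_sentence_end(sentence):
            print('text: ', sentence['text'])
            print('emotion: ', sentence.get('emo_tag'),
                  ' confidence: ', sentence.get('emo_confidence'))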

Getting started

The Recognition class provides methods for synchronous and streaming calls. You can select the appropriate method based on your requirements:

  • Synchronous call: Recognizes a local file and returns the complete result at once. This is suitable for processing pre-recorded audio.

  • Streaming call: Recognizes an audio stream and outputs the results in real time. The audio stream can come from an external device, such as a microphone, or be read from a local file. This is suitable for scenarios that require immediate feedback.

Synchronous call

This method submits a real-time speech-to-text task for a local file. The process is blocked until the complete transcription result is returned.


Instantiate the Recognition class, set the request parameters, and call the call method to perform recognition and obtain the RecognitionResult.


The audio file used in the example is: asr_example.wav.

from http import HTTPStatus
from dashscope.audio.asr import Recognition

# If you have not configured the API key in the environment variable, uncomment the following line of code and replace apiKey with your API key.
# import dashscope
# dashscope.api_key = "apiKey"

recognition = Recognition(model='paraformer-realtime-v2',
                          format='wav',
                          sample_rate=16000,
                          # The "language_hints" parameter is supported only by the paraformer-realtime-v2 model.
                          language_hints=['zh', 'en'],
                          callback=None)
result = recognition.call('asr_example.wav')
if result.status_code == HTTPStatus.OK:
    print('Recognition result:')
    print(result.get_sentence())
else:
    print('Error: ', result.message)
    
print(
    '[Metric] requestId: {}, first package delay ms: {}, last package delay ms: {}'
    .format(
        recognition.get_last_request_id(),
        recognition.get_first_package_delay(),
        recognition.get_last_package_delay(),
    ))

Streaming call

This method submits a real-time speech-to-text task and returns real-time recognition results through a callback interface.

  1. Start streaming speech recognition

    Instantiate the Recognition class, set the request parameters and the callback interface (RecognitionCallback), and then call the start method.

  2. Stream audio

    Repeatedly call the `Recognition` class's send_audio_frame method to send the binary audio stream from a local file or a device (such as a microphone) to the server in segments.

    As audio data is sent, the server uses the RecognitionCallback callback interface's on_event method to return the recognition results to the client in real time.

    We recommend that each audio segment represent about 100 milliseconds of audio and be between 1 KB and 16 KB in size.

  3. End processing

    Call the stop method of the Recognition class to stop speech recognition.

    This method blocks the current thread until the on_complete or on_error callback of the callback interface (RecognitionCallback) is triggered.


Recognize speech from a microphone

import os
import signal  # for keyboard events handling (press "Ctrl+C" to terminate recording)
import sys

import dashscope
import pyaudio
from dashscope.audio.asr import *

mic = None
stream = None

# Set recording parameters
sample_rate = 16000  # sampling rate (Hz)
channels = 1  # mono channel
dtype = 'int16'  # data type
format_pcm = 'pcm'  # the format of the audio data
block_size = 3200  # number of frames per buffer


def init_dashscope_api_key():
    """
        Set your DashScope API-key. More information:
        https://github.com/aliyun/alibabacloud-bailian-speech-demo/blob/master/PREREQUISITES.md
    """

    if 'DASHSCOPE_API_KEY' in os.environ:
        dashscope.api_key = os.environ[
            'DASHSCOPE_API_KEY']  # load API-key from environment variable DASHSCOPE_API_KEY
    else:
        dashscope.api_key = '<your-dashscope-api-key>'  # set API-key manually


# Real-time speech recognition callback
class Callback(RecognitionCallback):
    def on_open(self) -> None:
        global mic
        global stream
        print('RecognitionCallback open.')
        mic = pyaudio.PyAudio()
        stream = mic.open(format=pyaudio.paInt16,
                          channels=1,
                          rate=16000,
                          input=True)

    def on_close(self) -> None:
        global mic
        global stream
        print('RecognitionCallback close.')
        stream.stop_stream()
        stream.close()
        mic.terminate()
        stream = None
        mic = None

    def on_complete(self) -> None:
        print('RecognitionCallback completed.')  # recognition completed

    def on_error(self, message) -> None:
        print('RecognitionCallback task_id: ', message.request_id)
        print('RecognitionCallback error: ', message.message)
        # Stop and close the audio stream if it is running
        if stream is not None and stream.is_active():
            stream.stop_stream()
            stream.close()
        # Forcefully exit the program
        sys.exit(1)

    def on_event(self, result: RecognitionResult) -> None:
        sentence = result.get_sentence()
        if 'text' in sentence:
            print('RecognitionCallback text: ', sentence['text'])
            if RecognitionResult.is_sentence_end(sentence):
                print(
                    'RecognitionCallback sentence end, request_id:%s, usage:%s'
                    % (result.get_request_id(), result.get_usage(sentence)))


def signal_handler(sig, frame):
    print('Ctrl+C pressed, stop recognition ...')
    # Stop recognition
    recognition.stop()
    print('Recognition stopped.')
    print(
        '[Metric] requestId: {}, first package delay ms: {}, last package delay ms: {}'
        .format(
            recognition.get_last_request_id(),
            recognition.get_first_package_delay(),
            recognition.get_last_package_delay(),
        ))
    # Forcefully exit the program
    sys.exit(0)


# main function
if __name__ == '__main__':
    init_dashscope_api_key()
    print('Initializing ...')

    # Create the recognition callback
    callback = Callback()

    # Call the recognition service in streaming (async) mode. You can customize
    # recognition parameters such as model, format, and sample_rate.
    recognition = Recognition(
        model='paraformer-realtime-v2',
        format=format_pcm,
        # 'pcm', 'wav', 'opus', 'speex', 'aac', or 'amr'. You can check the supported formats in the document.
        sample_rate=sample_rate,
        # 8000 or 16000 is supported.
        semantic_punctuation_enabled=False,
        callback=callback)

    # Start recognition
    recognition.start()

    signal.signal(signal.SIGINT, signal_handler)
    print("Press 'Ctrl+C' to stop recording and recognition...")
    # Create a keyboard listener until "Ctrl+C" is pressed

    while True:
        if stream:
            data = stream.read(3200, exception_on_overflow=False)
            recognition.send_audio_frame(data)
        else:
            break

    recognition.stop()

Recognize a local audio file

The audio file used in the example is: asr_example.wav.

import os
import time
from dashscope.audio.asr import *

# If you have not configured the API key in the environment variable, uncomment the following line of code and replace apiKey with your API key.
# import dashscope
# dashscope.api_key = "apiKey"

from datetime import datetime

def get_timestamp():
    now = datetime.now()
    formatted_timestamp = now.strftime("[%Y-%m-%d %H:%M:%S.%f]")
    return formatted_timestamp

class Callback(RecognitionCallback):
    def on_complete(self) -> None:
        print(get_timestamp() + ' Recognition completed')  # recognition complete

    def on_error(self, result: RecognitionResult) -> None:
        print('Recognition task_id: ', result.request_id)
        print('Recognition error: ', result.message)
        exit(0)

    def on_event(self, result: RecognitionResult) -> None:
        sentence = result.get_sentence()
        if 'text' in sentence:
            print(get_timestamp() + ' RecognitionCallback text: ', sentence['text'])
            if RecognitionResult.is_sentence_end(sentence):
                print(get_timestamp() +
                    ' RecognitionCallback sentence end, request_id:%s, usage:%s'
                    % (result.get_request_id(), result.get_usage(sentence)))


callback = Callback()

recognition = Recognition(model='paraformer-realtime-v2',
                          format='wav',
                          sample_rate=16000,
                          # The "language_hints" parameter is supported only by the paraformer-realtime-v2 model.
                          language_hints=['zh', 'en'],
                          callback=callback)

recognition.start()

if not os.path.getsize("asr_example.wav"):
    raise Exception('The supplied file was empty (zero bytes long)')

with open("asr_example.wav", 'rb') as f:
    while True:
        # Read about 100 ms of 16 kHz, 16-bit mono audio per chunk
        audio_data = f.read(3200)
        if not audio_data:
            break
        recognition.send_audio_frame(audio_data)
        # Pace the upload to simulate a real-time stream
        time.sleep(0.1)

recognition.stop()

print(
    '[Metric] requestId: {}, first package delay ms: {}, last package delay ms: {}'
    .format(
        recognition.get_last_request_id(),
        recognition.get_first_package_delay(),
        recognition.get_last_package_delay(),
    ))

Concurrent calls

Because of Python's Global Interpreter Lock (GIL), only one thread can execute Python bytecode at a time, although some performance-oriented libraries can release this limitation. To make better use of a multi-core machine, we recommend multiprocessing or concurrent.futures.ProcessPoolExecutor for concurrent calls; multi-threading can significantly increase SDK call latency under high concurrency.
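The following is a minimal sketch of concurrent synchronous recognition with concurrent.futures.ProcessPoolExecutor. The file names and pool size are placeholders; each worker process creates its own Recognition instance and reads the API key from the environment.

import os
from concurrent.futures import ProcessPoolExecutor
from http import HTTPStatus

from dashscope.audio.asr import Recognition


def transcribe(file_path: str):
    # Each worker process creates its own Recognition instance (synchronous call).
    recognition = Recognition(model='paraformer-realtime-v2',
                              format='wav',
                              sample_rate=16000,
                              callback=None)
    result = recognition.call(file_path)
    if result.status_code == HTTPStatus.OK:
        return file_path, result.get_sentence()
    return file_path, result.message


if __name__ == '__main__':
    # Placeholder file list; replace with your own audio files.
    files = ['asr_example_1.wav', 'asr_example_2.wav']
    with ProcessPoolExecutor(max_workers=os.cpu_count()) as executor:
        for path, output in executor.map(transcribe, files):
            print(path, output)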

Request parameters

Request parameters are set in the constructor (__init__) of the Recognition class. For each parameter, the type, the default value (if any), and whether it is required are shown in parentheses.

model (str, required)

  The model used for real-time speech recognition. For more information, see Model list.

sample_rate (int, required)

  The sample rate of the audio to be recognized, in Hz. It varies by model:

  • paraformer-realtime-v2 supports any sample rate.

  • paraformer-realtime-8k-v2 supports only an 8000 Hz sample rate.

format (str, required)

  The format of the audio to be recognized. Supported audio formats: pcm, wav, mp3, opus, speex, aac, and amr.

  Important

  opus/speex: Must use Ogg encapsulation.

  wav: Must be PCM encoded.

  amr: Only the AMR-NB type is supported.

vocabulary_id (str, optional)

  The vocabulary ID for v2 and later models. If this parameter is not set, it does not take effect.

  In this speech recognition request, the service applies the hotword information that corresponds to the specified vocabulary ID. For more information, see Custom hotwords.

disfluency_removal_enabled (bool, optional; default: False)

  Specifies whether to filter disfluent words:

  • true: Filter disfluent words.

  • false (default): Do not filter disfluent words.

language_hints (list[str], optional; default: ["zh", "en"])

  The language codes for recognition. If you cannot determine the language in advance, you can leave this parameter unset and the model automatically detects the language.

  Currently supported language codes:

  • zh: Chinese

  • en: English

  • ja: Japanese

  • yue: Cantonese

  • ko: Korean

  • de: German

  • fr: French

  • ru: Russian

  This parameter applies only to multilingual models. For more information, see Model list.

semantic_punctuation_enabled (bool, optional; default: False)

  Specifies whether to enable semantic punctuation. It is disabled by default.

  • true: Enable semantic punctuation and disable Voice Activity Detection (VAD) punctuation.

  • false (default): Enable VAD punctuation and disable semantic punctuation.

  Semantic punctuation provides higher accuracy and is suitable for meeting transcription. VAD punctuation has lower latency and is suitable for interactive scenarios. By adjusting this parameter, you can switch the punctuation method to suit different requirements.

  This parameter is effective only for v2 and later models.

max_sentence_silence (int, optional; default: 800)

  The silence duration threshold for VAD punctuation, in ms. When the silence after a speech segment exceeds this threshold, the system determines that the sentence has ended.

  The valid range is 200 ms to 6000 ms. The default value is 800 ms.

  This parameter is effective only when semantic_punctuation_enabled is set to false (VAD punctuation) and only for v2 and later models.

multi_threshold_mode_enabled (bool, optional; default: False)

  When this switch is enabled (true), it prevents VAD from creating excessively long segments. It is disabled by default.

  This parameter is effective only when semantic_punctuation_enabled is set to false (VAD punctuation) and only for v2 and later models.

punctuation_prediction_enabled (bool, optional; default: True)

  Specifies whether to automatically add punctuation to the recognition results:

  • true (default): Add punctuation.

  • false: Do not add punctuation.

  This parameter is effective only for v2 and later models.

heartbeat (bool, optional; default: False)

  Use this switch to maintain a persistent connection with the server:

  • true: If you continuously send silent audio, the connection to the server is not interrupted.

  • false (default): Even if you continuously send silent audio, the connection is closed after 60 seconds due to a timeout.

  Silent audio refers to content in an audio file or data stream that has no sound signal. You can generate silent audio with audio editing software (such as Audacity or Adobe Audition) or command-line tools (such as FFmpeg).

  This parameter is effective only for v2 and later models and requires DashScope SDK version 1.23.1 or later.

inverse_text_normalization_enabled (bool, optional; default: True)

  Specifies whether to enable Inverse Text Normalization (ITN). It is enabled by default. When enabled, Chinese numerals are converted to Arabic numerals.

  This parameter is effective only for v2 and later models.

callback (RecognitionCallback, optional)

  The callback interface (RecognitionCallback) instance.
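For illustration, the following hedged sketch sets several of these optional parameters when constructing a Recognition instance; the values, including the vocabulary ID, are placeholders rather than recommendations.

from dashscope.audio.asr import Recognition, RecognitionCallback


class MyCallback(RecognitionCallback):
    def on_event(self, result) -> None:
        print(result.get_sentence())


# All values below are placeholders for illustration only.
recognition = Recognition(
    model='paraformer-realtime-v2',
    format='pcm',
    sample_rate=16000,
    vocabulary_id='vocab-xxxx',               # hypothetical custom hotword vocabulary ID
    semantic_punctuation_enabled=False,       # keep VAD punctuation (default)
    max_sentence_silence=800,                 # VAD sentence-break threshold, in ms
    punctuation_prediction_enabled=True,      # add punctuation automatically (default)
    inverse_text_normalization_enabled=True,  # convert Chinese numerals to Arabic numerals (default)
    heartbeat=False,                          # do not keep the connection alive with silent audio
    callback=MyCallback())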

Key interfaces

Recognition class

The Recognition class is imported using `from dashscope.audio.asr import *`. Its member methods are as follows:

call

  Signature: def call(self, file: str, phrase_id: str = None, **kwargs) -> RecognitionResult

  A synchronous call that recognizes a local file. This method blocks the current thread until the entire audio file has been read. The file must be readable.

  The recognition result is returned as a RecognitionResult object.

start

  Signature: def start(self, phrase_id: str = None, **kwargs)

  Starts speech recognition.

  This is a callback-based streaming real-time recognition method that does not block the current thread. It must be used together with send_audio_frame and stop.

send_audio_frame

  Signature: def send_audio_frame(self, buffer: bytes)

  Pushes a chunk of the audio stream. Each chunk should be neither too large nor too small. We recommend that each audio packet represent about 100 ms of audio and be between 1 KB and 16 KB in size.

  You obtain the recognition results through the on_event method of the callback interface (RecognitionCallback).

stop

  Signature: def stop(self)

  Stops speech recognition. This method blocks until the service has recognized all received audio and the task is complete.

get_last_request_id

  Signature: def get_last_request_id(self)

  Gets the request_id of the request. This method can be called after the object has been constructed.

get_first_package_delay

  Signature: def get_first_package_delay(self)

  Gets the first-packet delay: the latency from sending the first audio packet to receiving the first recognition result packet. Call this method after the task is completed.

get_last_package_delay

  Signature: def get_last_package_delay(self)

  Gets the last-packet delay: the time from sending the stop instruction to receiving the last recognition result packet. Call this method after the task is completed.

get_response

  Signature: def get_response(self)

  Gets the last message returned by the service. This can be used to retrieve task-failed errors.

Callback interface (RecognitionCallback)

During a streaming call, the server uses callbacks to return key process information and data to the client. You must implement a callback method to process the returned information and data.


class Callback(RecognitionCallback):
    def on_open(self) -> None:
        print('Connection successful')

    def on_event(self, result: RecognitionResult) -> None:
        # Implement the logic for receiving recognition results
        print('Recognition result: ', result.get_sentence())

    def on_complete(self) -> None:
        print('Task completed')

    def on_error(self, result: RecognitionResult) -> None:
        print('An exception occurred: ', result)

    def on_close(self) -> None:
        print('Connection closed')


callback = Callback()

on_open

  Signature: def on_open(self) -> None

  Parameters: none. Return value: none.

  This method is called immediately after a connection is established with the server.

on_event

  Signature: def on_event(self, result: RecognitionResult) -> None

  Parameters: result (RecognitionResult). Return value: none.

  This method is called when the service sends a response.

on_complete

  Signature: def on_complete(self) -> None

  Parameters: none. Return value: none.

  This method is called after all recognition results have been returned.

on_error

  Signature: def on_error(self, result: RecognitionResult) -> None

  Parameters: result (the recognition result). Return value: none.

  This method is called when an exception occurs.

on_close

  Signature: def on_close(self) -> None

  Parameters: none. Return value: none.

  This method is called after the service has closed the connection.

Response results

Recognition result (RecognitionResult)

RecognitionResult represents the recognition result of either a single real-time recognition in a streaming call or a synchronous call.

get_sentence

  Signature: def get_sentence(self) -> Union[Dict[str, Any], List[Any]]

  Gets the currently recognized sentence and its timestamp information. In a callback, a single sentence is returned, so this method returns a Dict[str, Any].

  For more information, see Sentence.

get_request_id

  Signature: def get_request_id(self) -> str

  Gets the request_id of the request.

is_sentence_end

  Signature:

  @staticmethod
  def is_sentence_end(sentence: Dict[str, Any]) -> bool

  Determines whether the given sentence has ended.

Single-sentence information (Sentence)

The members of the Sentence class are as follows:

begin_time (int)

  The start time of the sentence, in ms.

end_time (int)

  The end time of the sentence, in ms.

text (str)

  The recognized text.

words (list of Word)

  Word timestamp information. For more information, see Word timestamp information (Word).

emo_tag (str)

  The emotion of the current sentence:

  • positive: Positive emotion, such as happy or satisfied

  • negative: Negative emotion, such as angry or sad

  • neutral: No obvious emotion

  Emotion recognition has the following constraints:

  • It is available only for the paraformer-realtime-8k-v2 model.

  • You must disable semantic punctuation (controlled by the request parameter semantic_punctuation_enabled). Semantic punctuation is disabled by default.

  • The emotion recognition result is returned only when the RecognitionResult object's is_sentence_end method returns True.

emo_confidence (float)

  The confidence level of the recognized emotion for the current sentence. The value ranges from 0.0 to 1.0. A larger value indicates higher confidence.

  The same constraints as emo_tag apply.

Word timestamp information (Word)

The members of the Word class are as follows:

begin_time (int)

  The start time of the word, in ms.

end_time (int)

  The end time of the word, in ms.

text (str)

  The word.

punctuation (str)

  The punctuation.
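For reference, a minimal sketch of printing word-level timestamps from a completed sentence inside on_event. It assumes that each entry in words exposes the fields above as dictionary keys, mirroring how sentence fields are accessed elsewhere in this topic.

from dashscope.audio.asr import RecognitionCallback, RecognitionResult


class WordTimestampCallback(RecognitionCallback):
    def on_event(self, result: RecognitionResult) -> None:
        sentence = result.get_sentence()
        if 'text' in sentence and RecognitionResult.is_sentence_end(sentence):
            # Assumes each word entry is a dict keyed by the Word fields above.
            for word in sentence.get('words', []):
                print('{}-{} ms: {}{}'.format(word['begin_time'],
                                              word['end_time'],
                                              word['text'],
                                              word.get('punctuation', '')))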

Error codes

If you encounter an error, see Error messages for troubleshooting.

If the issue persists, join the developer group to provide feedback. Include the Request ID to help us investigate the problem.

More examples

For more examples, see GitHub.

FAQ

Features

Q: How can I maintain a persistent connection with the server during long periods of silence?

Set the heartbeat request parameter to true and continuously send silent audio to the server.

Silent audio refers to content in an audio file or data stream that has no sound signal. You can generate silent audio using various methods, such as audio editing software (Audacity and Adobe Audition) or command-line tools (FFmpeg).
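For example, the following FFmpeg command (using the anullsrc source) generates one second of 16 kHz, mono, 16-bit PCM silence; adjust the sample rate and duration to match your stream.

ffmpeg -f lavfi -i anullsrc=r=16000:cl=mono -t 1 -c:a pcm_s16le silence.wav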

Q: How do I convert an audio file to a supported format?

You can use the FFmpeg tool. For more information, see the official FFmpeg website.

# Basic conversion command (universal template)
# -i, function: input file path, example value: audio.wav
# -c:a, function: audio encoder, example values: aac, libmp3lame, pcm_s16le
# -b:a, function: bit rate (controls audio quality), example values: 192k, 320k
# -ar, function: sample rate, example values: 44100 (CD), 48000, 16000
# -ac, function: number of sound channels, example values: 1 (mono), 2 (stereo)
# -y, function: overwrite existing file (no value needed)
ffmpeg -i input_audio.ext -c:a encoder_name -b:a bit_rate -ar sample_rate -ac num_channels output.ext

# Example: WAV → MP3 (maintain original quality)
ffmpeg -i input.wav -c:a libmp3lame -q:a 0 output.mp3
# Example: MP3 → WAV (16-bit PCM standard format)
ffmpeg -i input.mp3 -c:a pcm_s16le -ar 44100 -ac 2 output.wav
# Example: M4A → AAC (extract/convert Apple audio)
ffmpeg -i input.m4a -c:a copy output.aac  # Directly extract without re-encoding
ffmpeg -i input.m4a -c:a aac -b:a 256k output.aac  # Re-encode to improve quality
# Example: FLAC lossless → Opus (high compression)
ffmpeg -i input.flac -c:a libopus -b:a 128k -vbr on output.opus

Q: How do I recognize a local file (recorded audio file)?

There are two ways to recognize a local file:

  • Directly pass the local file path: This method returns the complete recognition result after the file is fully processed. It is not suitable for scenarios that require immediate feedback.

    For more information, see synchronous call. Pass the file path to the call method of the Recognition class to directly recognize the audio file.

  • Convert the local file into a binary stream for recognition: This method returns recognition results in a stream while the file is being processed. It is suitable for scenarios that require immediate feedback.

    For more information, see Streaming call. You can use the send_audio_frame method of the Recognition class to send a binary stream to the server for recognition.

Troubleshooting

Q: Why is the speech not recognized (no recognition result)?

  1. Check whether the audio format (format) and sample rate (sampleRate or sample_rate) in the request parameters are set correctly and meet the parameter constraints. The following are common examples of errors:

    • The audio file has a .wav file name extension but is actually in MP3 format, and the format request parameter is set to mp3 (incorrect parameter setting).

    • The audio sample rate is 3600 Hz, but the sampleRate or sample_rate request parameter is set to 48000 (incorrect parameter setting).

    You can use the ffprobe tool to retrieve information about the audio container, encoding, sample rate, and channels:

    ffprobe -v error -show_entries format=format_name -show_entries stream=codec_name,sample_rate,channels -of default=noprint_wrappers=1 input.xxx
  2. When you use the paraformer-realtime-v2 model, check whether the language set in language_hints matches the actual language of the audio.

    For example, the audio is in Chinese, but language_hints is set to en (English).

  3. If all the preceding checks pass, you can use custom vocabulary to improve the recognition of specific words.