Alibaba Cloud Model Studio: Real-time speech recognition - Fun-ASR/Paraformer

Last Updated: Dec 04, 2025

The real-time speech recognition service converts audio streams into text with punctuation in real time, creating a "speak-and-see" text effect. You can easily transcribe audio from microphones, meeting recordings, or local audio files. This service is widely used in scenarios such as real-time meeting transcription, live streaming captions, voice chat, and intelligent customer service.

Core features

  • Supports real-time speech recognition for multiple languages, including Chinese, English, and various dialects.

  • Supports custom hotwords to improve the recognition accuracy of specific terms.

  • Supports timestamp output to generate structured recognition results.

  • Flexible sample rates and multiple audio formats adapt to different recording environments.

  • Optional Voice Activity Detection (VAD) automatically filters silent segments to improve processing efficiency for long audio.

  • SDK and WebSocket connection types provide a low-latency, stable service.

Availability

  • Supported regions:

    • International (Singapore) region: Requires an API key from the International (Singapore) region.

    • China (Beijing) region: Requires an API key from the China (Beijing) region.

  • Supported models:

    • International (Singapore):

      • Fun-ASR: fun-asr-realtime (stable version, equivalent to fun-asr-realtime-2025-11-07) and fun-asr-realtime-2025-11-07 (snapshot version)

    • China (Beijing):

      • Fun-ASR: fun-asr-realtime (stable version, equivalent to fun-asr-realtime-2025-11-07), fun-asr-realtime-2025-11-07 (snapshot version), and fun-asr-realtime-2025-09-15 (snapshot version)

      • Paraformer: paraformer-realtime-v2, paraformer-realtime-v1, paraformer-realtime-8k-v2, and paraformer-realtime-8k-v1

Model selection

  • Mandarin Chinese recognition (meetings/live streaming): fun-asr-realtime, fun-asr-realtime-2025-11-07, or paraformer-realtime-v2. Compatible with multiple formats, supports high sample rates, and provides stable latency.

  • Multi-language recognition (international conferences): paraformer-realtime-v2. Covers multiple languages.

  • Chinese dialect recognition (customer service/government affairs): fun-asr-realtime-2025-11-07 or paraformer-realtime-v2. Covers multiple local dialects.

  • Mixed Chinese, English, and Japanese recognition (classrooms/speeches): fun-asr-realtime or fun-asr-realtime-2025-11-07. Optimized for Chinese, English, and Japanese recognition.

  • Low-bandwidth telephone recording transcription: paraformer-realtime-8k-v2. Supports 8 kHz audio; emotion recognition is enabled by default.

  • Custom hotword scenarios (brand names/proprietary terms): Paraformer models or the latest Fun-ASR models. Hotwords can be enabled or disabled and are easy to configure and iterate.

For more information, see Model feature comparison.

Getting started

The following are code samples for calling the API. For more code samples for common scenarios, see GitHub.

You must obtain an API key and set it as an environment variable. If you call the service through an SDK, you must also install the DashScope SDK.
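
As an optional sanity check before you run the samples, the following minimal Python sketch verifies that the DashScope SDK is installed and that the API key is visible to your process. The check itself is only an illustration; the service does not require it.

Python

import os

import dashscope  # install with: pip install dashscope

# Read the key from the environment. If you have not configured an environment
# variable, set dashscope.api_key = "sk-xxx" directly instead.
api_key = os.environ.get('DASHSCOPE_API_KEY')
if not api_key:
    raise RuntimeError('The DASHSCOPE_API_KEY environment variable is not set')
dashscope.api_key = api_key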

Fun-ASR

Recognize speech from a microphone

You can use real-time speech recognition to recognize speech from a microphone and output the results, which creates a "speak-and-see" text effect.

Java

import com.alibaba.dashscope.audio.asr.recognition.Recognition;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionResult;
import com.alibaba.dashscope.common.ResultCallback;
import com.alibaba.dashscope.utils.Constants;

import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.TargetDataLine;

import java.nio.ByteBuffer;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class Main {
    public static void main(String[] args) throws InterruptedException {
        // The following URL is for the Singapore region. If you use a model in the Beijing region, replace the URL with wss://dashscope.aliyuncs.com/api-ws/v1/inference.
        Constants.baseWebsocketApiUrl = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference";
        ExecutorService executorService = Executors.newSingleThreadExecutor();
        executorService.submit(new RealtimeRecognitionTask());
        executorService.shutdown();
        executorService.awaitTermination(1, TimeUnit.MINUTES);
        System.exit(0);
    }
}

class RealtimeRecognitionTask implements Runnable {
    @Override
    public void run() {
        RecognitionParam param = RecognitionParam.builder()
                .model("fun-asr-realtime")
                // The API keys for the Singapore and Beijing regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key.
                // If you have not configured an environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .format("wav")
                .sampleRate(16000)
                .build();
        Recognition recognizer = new Recognition();

        ResultCallback<RecognitionResult> callback = new ResultCallback<RecognitionResult>() {
            @Override
            public void onEvent(RecognitionResult result) {
                if (result.isSentenceEnd()) {
                    System.out.println("Final Result: " + result.getSentence().getText());
                } else {
                    System.out.println("Intermediate Result: " + result.getSentence().getText());
                }
            }

            @Override
            public void onComplete() {
                System.out.println("Recognition complete");
            }

            @Override
            public void onError(Exception e) {
                System.out.println("RecognitionCallback error: " + e.getMessage());
            }
        };
        try {
            recognizer.call(param, callback);
            // Create an audio format.
            AudioFormat audioFormat = new AudioFormat(16000, 16, 1, true, false);
            // Match the default recording device based on the format.
            TargetDataLine targetDataLine =
                    AudioSystem.getTargetDataLine(audioFormat);
            targetDataLine.open(audioFormat);
            // Start recording.
            targetDataLine.start();
            ByteBuffer buffer = ByteBuffer.allocate(1024);
            long start = System.currentTimeMillis();
            // Record for 50 seconds and perform real-time transcription.
            while (System.currentTimeMillis() - start < 50000) {
                int read = targetDataLine.read(buffer.array(), 0, buffer.capacity());
                if (read > 0) {
                    buffer.limit(read);
                    // Send the recorded audio data to the streaming recognition service.
                    recognizer.sendAudioFrame(buffer);
                    buffer = ByteBuffer.allocate(1024);
                    // The recording rate is limited. Sleep for a short period to prevent high CPU usage.
                    Thread.sleep(20);
                }
            }
            recognizer.stop();
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // Close the WebSocket connection after the task is complete.
            recognizer.getDuplexApi().close(1000, "bye");
        }

        System.out.println(
                "[Metric] requestId: "
                        + recognizer.getLastRequestId()
                        + ", first package delay ms: "
                        + recognizer.getFirstPackageDelay()
                        + ", last package delay ms: "
                        + recognizer.getLastPackageDelay());
    }
}

Python

Before you run the Python sample, run the pip install pyaudio command to install the third-party audio playback and capture library.

import os
import signal  # for keyboard events handling (press "Ctrl+C" to terminate recording)
import sys

import dashscope
import pyaudio
from dashscope.audio.asr import *

mic = None
stream = None

# Set recording parameters
sample_rate = 16000  # sampling rate (Hz)
channels = 1  # mono channel
dtype = 'int16'  # data type
format_pcm = 'pcm'  # the format of the audio data
block_size = 3200  # number of frames per buffer


# Real-time speech recognition callback
class Callback(RecognitionCallback):
    def on_open(self) -> None:
        global mic
        global stream
        print('RecognitionCallback open.')
        mic = pyaudio.PyAudio()
        stream = mic.open(format=pyaudio.paInt16,
                          channels=1,
                          rate=16000,
                          input=True)

    def on_close(self) -> None:
        global mic
        global stream
        print('RecognitionCallback close.')
        stream.stop_stream()
        stream.close()
        mic.terminate()
        stream = None
        mic = None

    def on_complete(self) -> None:
        print('RecognitionCallback completed.')  # recognition completed

    def on_error(self, message) -> None:
        print('RecognitionCallback task_id: ', message.request_id)
        print('RecognitionCallback error: ', message.message)
        # Stop and close the audio stream if it is still open (pyaudio uses stop_stream/close)
        if stream is not None and stream.is_active():
            stream.stop_stream()
            stream.close()
        # Forcefully exit the program
        sys.exit(1)

    def on_event(self, result: RecognitionResult) -> None:
        sentence = result.get_sentence()
        if 'text' in sentence:
            print('RecognitionCallback text: ', sentence['text'])
            if RecognitionResult.is_sentence_end(sentence):
                print(
                    'RecognitionCallback sentence end, request_id:%s, usage:%s'
                    % (result.get_request_id(), result.get_usage(sentence)))


def signal_handler(sig, frame):
    print('Ctrl+C pressed, stop recognition ...')
    # Stop recognition
    recognition.stop()
    print('Recognition stopped.')
    print(
        '[Metric] requestId: {}, first package delay ms: {}, last package delay ms: {}'
        .format(
            recognition.get_last_request_id(),
            recognition.get_first_package_delay(),
            recognition.get_last_package_delay(),
        ))
    # Forcefully exit the program
    sys.exit(0)


# main function
if __name__ == '__main__':
    # The API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If you have not configured an environment variable, replace the following line with your Model Studio API key: dashscope.api_key = "sk-xxx"
    dashscope.api_key = os.environ.get('DASHSCOPE_API_KEY')

    # The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: wss://dashscope.aliyuncs.com/api-ws/v1/inference
    dashscope.base_websocket_api_url='wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference'

    # Create the recognition callback
    callback = Callback()

    # Call recognition service by async mode, you can customize the recognition parameters, like model, format,
    # sample_rate
    recognition = Recognition(
        model='fun-asr-realtime',
        format=format_pcm,
        # 'pcm', 'wav', 'opus', 'speex', 'aac', 'amr'. You can check the supported formats in the document.
        sample_rate=sample_rate,
        # Supports 8000, 16000.
        semantic_punctuation_enabled=False,
        callback=callback)

    # Start recognition
    recognition.start()

    signal.signal(signal.SIGINT, signal_handler)
    print("Press 'Ctrl+C' to stop recording and recognition...")
    # Create a keyboard listener until "Ctrl+C" is pressed

    while True:
        if stream:
            data = stream.read(3200, exception_on_overflow=False)
            recognition.send_audio_frame(data)
        else:
            break

    recognition.stop()

Recognize a local audio file

You can use real-time speech recognition to recognize a local audio file and output the recognition result. This feature is suitable for near real-time speech recognition scenarios that involve shorter audio, such as dialogues, voice commands, voice input methods, and voice search.

Java

The audio file used in the example is asr_example.wav.

import com.alibaba.dashscope.audio.asr.recognition.Recognition;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionResult;
import com.alibaba.dashscope.common.ResultCallback;
import com.alibaba.dashscope.utils.Constants;

import java.io.FileInputStream;
import java.nio.ByteBuffer;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class TimeUtils {
    private static final DateTimeFormatter formatter =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS");

    public static String getTimestamp() {
        return LocalDateTime.now().format(formatter);
    }
}

public class Main {
    public static void main(String[] args) throws InterruptedException {
        // The following URL is for the Singapore region. If you use a model in the Beijing region, replace the URL with wss://dashscope.aliyuncs.com/api-ws/v1/inference.
        Constants.baseWebsocketApiUrl = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference";
        ExecutorService executorService = Executors.newSingleThreadExecutor();
        executorService.submit(new RealtimeRecognitionTask(Paths.get(System.getProperty("user.dir"), "asr_example.wav")));
        executorService.shutdown();

        // wait for all tasks to complete
        executorService.awaitTermination(1, TimeUnit.MINUTES);
        System.exit(0);
    }
}

class RealtimeRecognitionTask implements Runnable {
    private Path filepath;

    public RealtimeRecognitionTask(Path filepath) {
        this.filepath = filepath;
    }

    @Override
    public void run() {
        RecognitionParam param = RecognitionParam.builder()
                .model("fun-asr-realtime")
                // The API keys for the Singapore and Beijing regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key.
                // If you have not configured an environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .format("wav")
                .sampleRate(16000)
                .build();
        Recognition recognizer = new Recognition();

        String threadName = Thread.currentThread().getName();

        ResultCallback<RecognitionResult> callback = new ResultCallback<RecognitionResult>() {
            @Override
            public void onEvent(RecognitionResult message) {
                if (message.isSentenceEnd()) {

                    System.out.println(TimeUtils.getTimestamp()+" "+
                            "[process " + threadName + "] Final Result:" + message.getSentence().getText());
                } else {
                    System.out.println(TimeUtils.getTimestamp()+" "+
                            "[process " + threadName + "] Intermediate Result: " + message.getSentence().getText());
                }
            }

            @Override
            public void onComplete() {
                System.out.println(TimeUtils.getTimestamp()+" "+"[" + threadName + "] Recognition complete");
            }

            @Override
            public void onError(Exception e) {
                System.out.println(TimeUtils.getTimestamp()+" "+
                        "[" + threadName + "] RecognitionCallback error: " + e.getMessage());
            }
        };

        try {
            recognizer.call(param, callback);
            // Please replace the path with your audio file path
            System.out.println(TimeUtils.getTimestamp()+" "+"[" + threadName + "] Input file_path is: " + this.filepath);
            // Read file and send audio by chunks
            FileInputStream fis = new FileInputStream(this.filepath.toFile());
            // A 3200-byte chunk is 100 ms of 16 kHz, 16-bit mono audio
            byte[] buffer = new byte[3200];
            int bytesRead;
            // Loop to read chunks of the file
            while ((bytesRead = fis.read(buffer)) != -1) {
                ByteBuffer byteBuffer;
                // Handle the last chunk which might be smaller than the buffer size
                System.out.println(TimeUtils.getTimestamp()+" "+"[" + threadName + "] bytesRead: " + bytesRead);
                if (bytesRead < buffer.length) {
                    byteBuffer = ByteBuffer.wrap(buffer, 0, bytesRead);
                } else {
                    byteBuffer = ByteBuffer.wrap(buffer);
                }

                recognizer.sendAudioFrame(byteBuffer);
                buffer = new byte[3200];
                Thread.sleep(100);
            }
            System.out.println(TimeUtils.getTimestamp()+" "+LocalDateTime.now());
            recognizer.stop();
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // Close the WebSocket connection after the task is complete.
            recognizer.getDuplexApi().close(1000, "bye");
        }

        System.out.println(
                "["
                        + threadName
                        + "][Metric] requestId: "
                        + recognizer.getLastRequestId()
                        + ", first package delay ms: "
                        + recognizer.getFirstPackageDelay()
                        + ", last package delay ms: "
                        + recognizer.getLastPackageDelay());
    }
}

Python

The audio file used in the example is: asr_example.wav.

import os
import time
import dashscope
from dashscope.audio.asr import *

# The API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
# If you have not configured an environment variable, replace the following line with your Model Studio API key: dashscope.api_key = "sk-xxx"
dashscope.api_key = os.environ.get('DASHSCOPE_API_KEY')

# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: wss://dashscope.aliyuncs.com/api-ws/v1/inference
dashscope.base_websocket_api_url='wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference'

from datetime import datetime

def get_timestamp():
    now = datetime.now()
    formatted_timestamp = now.strftime("[%Y-%m-%d %H:%M:%S.%f]")
    return formatted_timestamp

class Callback(RecognitionCallback):
    def on_complete(self) -> None:
        print(get_timestamp() + ' Recognition completed')  # recognition complete

    def on_error(self, result: RecognitionResult) -> None:
        print('Recognition task_id: ', result.request_id)
        print('Recognition error: ', result.message)
        exit(0)

    def on_event(self, result: RecognitionResult) -> None:
        sentence = result.get_sentence()
        if 'text' in sentence:
            print(get_timestamp() + ' RecognitionCallback text: ', sentence['text'])
            if RecognitionResult.is_sentence_end(sentence):
                print(get_timestamp() + 
                    'RecognitionCallback sentence end, request_id:%s, usage:%s'
                    % (result.get_request_id(), result.get_usage(sentence)))


callback = Callback()

recognition = Recognition(model='fun-asr-realtime',
                          format='wav',
                          sample_rate=16000,
                          callback=callback)

recognition.start()

try:
    audio_data: bytes = None
    f = open("asr_example.wav", 'rb')
    if os.path.getsize("asr_example.wav"):
        while True:
            audio_data = f.read(3200)
            if not audio_data:
                break
            else:
                recognition.send_audio_frame(audio_data)
            time.sleep(0.1)
    else:
        raise Exception(
            'The supplied file was empty (zero bytes long)')
    f.close()
except Exception as e:
    raise e

recognition.stop()

print(
    '[Metric] requestId: {}, first package delay ms: {}, last package delay ms: {}'
    .format(
        recognition.get_last_request_id(),
        recognition.get_first_package_delay(),
        recognition.get_last_package_delay(),
    ))

Paraformer

Recognize speech from a microphone

You can use real-time speech recognition to recognize speech from a microphone and output the results, which creates a "speak-and-see" text effect.

Java

import com.alibaba.dashscope.audio.asr.recognition.Recognition;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;
import com.alibaba.dashscope.exception.NoApiKeyException;
import io.reactivex.BackpressureStrategy;
import io.reactivex.Flowable;

import java.nio.ByteBuffer;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.TargetDataLine;

public class Main {

    public static void main(String[] args) throws NoApiKeyException {
        // Create a Flowable<ByteBuffer>.
        Flowable<ByteBuffer> audioSource = Flowable.create(emitter -> {
            new Thread(() -> {
                try {
                    // Create an audio format.
                    AudioFormat audioFormat = new AudioFormat(16000, 16, 1, true, false);
                    // Match the default recording device based on the format.
                    TargetDataLine targetDataLine =
                            AudioSystem.getTargetDataLine(audioFormat);
                    targetDataLine.open(audioFormat);
                    // Start recording.
                    targetDataLine.start();
                    ByteBuffer buffer = ByteBuffer.allocate(1024);
                    long start = System.currentTimeMillis();
                    // Record for 300 seconds and perform real-time transcription.
                    while (System.currentTimeMillis() - start < 300000) {
                        int read = targetDataLine.read(buffer.array(), 0, buffer.capacity());
                        if (read > 0) {
                            buffer.limit(read);
                            // Send the recorded audio data to the streaming recognition service.
                            emitter.onNext(buffer);
                            buffer = ByteBuffer.allocate(1024);
                            // The recording rate is limited. Sleep for a short time to prevent high CPU usage.
                            Thread.sleep(20);
                        }
                    }
                    // Notify the end of transcription.
                    emitter.onComplete();
                } catch (Exception e) {
                    emitter.onError(e);
                }
            }).start();
        },
        BackpressureStrategy.BUFFER);

        // Create a Recognizer.
        Recognition recognizer = new Recognition();
        // Create a RecognitionParam and pass the created Flowable<ByteBuffer> to the audioFrames parameter.
        RecognitionParam param = RecognitionParam.builder()
            .model("paraformer-realtime-v2")
            .format("pcm")
            .sampleRate(16000)
            // If you have not configured the API Key as an environment variable, uncomment the following line and replace apiKey with your own API key.
            // .apiKey("apikey")
            .build();

        // Call the interface in streaming mode.
        recognizer.streamCall(param, audioSource)
            // Call the subscribe method of Flowable to subscribe to the results.
            .blockingForEach(
                result -> {
                    if (result.isSentenceEnd()) {
                        // Print the final result for a completed sentence.
                        System.out.println("Final Result: " + result.getSentence().getText());
                    } else {
                        // Print the intermediate (partial) result.
                        System.out.println("Intermediate Result: " + result.getSentence().getText());
                    }
                });
        System.exit(0);
    }
}

Python

Before you run the Python sample, run the pip install pyaudio command to install the third-party audio playback and capture library.

import pyaudio
from dashscope.audio.asr import (Recognition, RecognitionCallback,
                                 RecognitionResult)

# If you have not configured the API Key as an environment variable, uncomment the following line and replace apiKey with your own API key.
# import dashscope
# dashscope.api_key = "apiKey"

mic = None
stream = None


class Callback(RecognitionCallback):
    def on_open(self) -> None:
        global mic
        global stream
        print('RecognitionCallback open.')
        mic = pyaudio.PyAudio()
        stream = mic.open(format=pyaudio.paInt16,
                          channels=1,
                          rate=16000,
                          input=True)

    def on_close(self) -> None:
        global mic
        global stream
        print('RecognitionCallback close.')
        stream.stop_stream()
        stream.close()
        mic.terminate()
        stream = None
        mic = None

    def on_event(self, result: RecognitionResult) -> None:
        print('RecognitionCallback sentence: ', result.get_sentence())


callback = Callback()
recognition = Recognition(model='paraformer-realtime-v2',
                          format='pcm',
                          sample_rate=16000,
                          callback=callback)
recognition.start()

while True:
    if stream:
        data = stream.read(3200, exception_on_overflow=False)
        recognition.send_audio_frame(data)
    else:
        break

recognition.stop()

Recognize a local audio file

You can use real-time speech recognition to recognize a local audio file and output the recognition result. This feature is suitable for near real-time speech recognition scenarios that involve shorter audio, such as dialogues, voice commands, voice input methods, and voice search.

Java

import com.alibaba.dashscope.audio.asr.recognition.Recognition;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class Main {
    public static void main(String[] args) {
        // You can ignore the file download part and directly use a local file to call the API for recognition.
        String exampleWavUrl =
                "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav";
        try {
            InputStream in = new URL(exampleWavUrl).openStream();
            Files.copy(in, Paths.get("asr_example.wav"), StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            System.out.println("error: " + e);
            System.exit(1);
        }

        // Create a Recognition instance.
        Recognition recognizer = new Recognition();
        // Create a RecognitionParam.
        RecognitionParam param =
                RecognitionParam.builder()
                        // If you have not configured the API Key as an environment variable, uncomment the following line and replace apiKey with your own API key.
                        // .apiKey("apikey")
                        .model("paraformer-realtime-v2")
                        .format("wav")
                        .sampleRate(16000)
                        // "language_hints" is only supported by the paraformer-v2 and paraformer-realtime-v2 models.
                        .parameter("language_hints", new String[]{"zh", "en"})
                        .build();

        try {
            System.out.println("Recognition result: " + recognizer.call(param, new File("asr_example.wav")));
        } catch (Exception e) {
            e.printStackTrace();
        }
        System.exit(0);
    }
}

Python

import requests
from http import HTTPStatus
from dashscope.audio.asr import Recognition

# If you have not configured the API Key as an environment variable, uncomment the following line and replace apiKey with your own API key.
# import dashscope
# dashscope.api_key = "apiKey"

# You can ignore the code for downloading the file from the URL and use a local file for recognition directly.
r = requests.get(
    'https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav'
)
with open('asr_example.wav', 'wb') as f:
    f.write(r.content)

recognition = Recognition(model='paraformer-realtime-v2',
                          format='wav',
                          sample_rate=16000,
                          # "language_hints" is only supported by the paraformer-v2 and paraformer-realtime-v2 models.
                          language_hints=['zh', 'en'],
                          callback=None)
result = recognition.call('asr_example.wav')
if result.status_code == HTTPStatus.OK:
    print('Recognition result:')
    print(result.get_sentence())
else:
    print('Error: ', result.message)

Going live

Improve recognition performance

  • Select a model with the correct sample rate: For 8 kHz telephone audio, use an 8 kHz model directly instead of upsampling the audio to 16 kHz for recognition. This method avoids information distortion and provides better results.

  • Use the hotword feature: For business-specific terms, such as proprietary nouns, names, and brand names, you can configure hotwords to significantly improve recognition accuracy. For more information, see Customize hotwords - Paraformer/Fun-ASR.

  • Optimize input audio quality: Use high-quality microphones whenever possible and ensure a recording environment with a high signal-to-noise ratio and no echo. At the application layer, you can integrate algorithms such as denoising (for example, RNNoise) and acoustic echo cancellation (AEC) to pre-process the audio and obtain a cleaner signal.

  • Specify the recognition language: For models that support multiple languages, such as paraformer-realtime-v2, specify the audio language in advance when you can determine it, for example by setting the language_hints parameter to ['zh', 'en']. This helps the model converge, avoids confusion between similarly pronounced languages, and improves accuracy.

  • Filter disfluent words: For Paraformer models, you can enable disfluency filtering by setting the disfluency_removal_enabled parameter. This produces more formal and readable text. Both this parameter and language_hints are combined in the sketch after this list.
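
The following minimal Python sketch shows how these tuning parameters could be combined when recognizing a local file. It assumes the paraformer-realtime-v2 model and the asr_example.wav file from the earlier examples; language_hints and disfluency_removal_enabled are the parameters described above, passed the same way as in the preceding samples.

Python

import os
from http import HTTPStatus

import dashscope
from dashscope.audio.asr import Recognition

# The API keys for the Singapore and Beijing regions are different.
dashscope.api_key = os.environ.get('DASHSCOPE_API_KEY')

recognition = Recognition(model='paraformer-realtime-v2',
                          format='wav',
                          sample_rate=16000,
                          # Constrain language detection to Chinese and English.
                          language_hints=['zh', 'en'],
                          # Filter disfluent words for more readable output.
                          disfluency_removal_enabled=True,
                          callback=None)

result = recognition.call('asr_example.wav')
if result.status_code == HTTPStatus.OK:
    print(result.get_sentence())
else:
    print('Error:', result.message)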

Set a fault tolerance policy

  • Client reconnection: Your client should implement an automatic reconnection mechanism to handle network jitter. For the Python SDK, consider the following suggestions:

    1. Catch exceptions: Implement the on_error method in the Callback class. The dashscope SDK calls this method when it encounters a network error or other issues.

    2. Status notification: When on_error is triggered, set a reconnection signal. In Python, you can use threading.Event, which is a thread-safe flag.

    3. Reconnection loop: Wrap the main logic in a loop that allows a limited number of retries, for example three. When the reconnection signal is detected, interrupt the current recognition, clean up resources, wait briefly, and let the loop create a new connection, as shown in the sketch after this list.

  • Set a heartbeat to prevent disconnection: To maintain a persistent connection with the server, set the heartbeat parameter to true. This ensures the connection to the server is not interrupted even during long periods of silence in the audio.

  • Rate limiting: When you call a model API, be aware of the model's rate limits.
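
The reconnection pattern described above could look roughly like the following Python sketch. It is a minimal illustration, assuming the asr_example.wav file from the earlier examples as a stand-in audio source and a limit of three attempts; adapt the audio source, model, and retry policy to your application.

Python

import threading
import time

from dashscope.audio.asr import Recognition, RecognitionCallback, RecognitionResult

# Thread-safe reconnection signal (step 2 above).
reconnect_event = threading.Event()


class ReconnectingCallback(RecognitionCallback):
    def on_error(self, message) -> None:
        # Step 1: the SDK reports a network or service error; raise the flag.
        print('Recognition error:', message.message)
        reconnect_event.set()

    def on_event(self, result: RecognitionResult) -> None:
        sentence = result.get_sentence()
        if 'text' in sentence:
            print('text:', sentence['text'])


def audio_chunks():
    # Stand-in audio source: 100 ms chunks from a local file. Replace this with
    # your microphone or stream reader.
    with open('asr_example.wav', 'rb') as f:
        while chunk := f.read(3200):
            yield chunk
            time.sleep(0.1)


# Step 3: retry loop with a limited number of attempts, each using a fresh connection.
for attempt in range(3):
    reconnect_event.clear()
    recognition = Recognition(model='paraformer-realtime-v2',
                              format='wav',
                              sample_rate=16000,
                              callback=ReconnectingCallback())
    recognition.start()
    try:
        for chunk in audio_chunks():
            if reconnect_event.is_set():
                break  # an error was reported; abandon this connection
            recognition.send_audio_frame(chunk)
    finally:
        try:
            recognition.stop()  # clean up the current session
        except Exception:
            pass
    if not reconnect_event.is_set():
        break  # finished without errors; no retry needed
    time.sleep(1)  # short delay before reconnecting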

API reference

Model feature comparison

  • Region:

    • fun-asr-realtime, fun-asr-realtime-2025-11-07: Singapore, Beijing

    • fun-asr-realtime-2025-09-15 and all Paraformer models: Beijing

  • Core scenario:

    • fun-asr-realtime, fun-asr-realtime-2025-11-07: live video streaming, meetings, trilingual education, etc.

    • fun-asr-realtime-2025-09-15: live video streaming, meetings, bilingual education, etc.

    • paraformer-realtime-v2, paraformer-realtime-v1: long-form audio stream recognition (meetings, live streaming)

    • paraformer-realtime-8k-v2, paraformer-realtime-8k-v1: telephone customer service, etc.

  • Supported languages:

    • fun-asr-realtime, fun-asr-realtime-2025-11-07: Chinese (Mandarin, Cantonese, Wu, Min Nan, Hakka, Gan, Xiang, Jin; and Mandarin accents from regions such as Central Plains, Southwest, Ji-Lu, Jianghuai, Lan-Yin, Jiao-Liao, Northeast, Beijing, and Hong Kong/Taiwan, including Henan, Shaanxi, Hubei, Sichuan, Chongqing, Yunnan, Guizhou, Guangdong, Guangxi, Hebei, Tianjin, Shandong, Anhui, Nanjing, Jiangsu, Hangzhou, Gansu, and Ningxia), English, Japanese

    • fun-asr-realtime-2025-09-15: Chinese (Mandarin), English

    • paraformer-realtime-v2: Chinese (Mandarin, Cantonese, Wu, Min Nan, Northeast dialect, Gansu dialect, Guizhou dialect, Henan dialect, Hubei dialect, Hunan dialect, Ningxia dialect, Shanxi dialect, Shaanxi dialect, Shandong dialect, Sichuan dialect, Tianjin dialect, Jiangxi dialect, Yunnan dialect, Shanghainese), English, Japanese, Korean, German, French, Russian

    • paraformer-realtime-v1, paraformer-realtime-8k-v2, paraformer-realtime-8k-v1: Chinese (Mandarin)

  • Supported audio formats (all models): pcm, wav, mp3, opus, speex, aac, amr

  • Sample rate:

    • fun-asr-realtime, fun-asr-realtime-2025-11-07, fun-asr-realtime-2025-09-15: 16 kHz

    • paraformer-realtime-v2: any sample rate

    • paraformer-realtime-v1: 16 kHz

    • paraformer-realtime-8k-v2, paraformer-realtime-8k-v1: 8 kHz

  • Channels (all models): mono

  • Input format (all models): binary audio stream

  • Audio size/duration (all models): unlimited

  • Emotion recognition:

    • paraformer-realtime-8k-v2: supported (enabled by default, can be disabled)

    • All other models: not supported

  • Sensitive word filtering (all models): not supported

  • Speaker diarization (all models): not supported

  • Disfluency filtering (all models): supported (disabled by default, can be enabled)

  • Timestamps (all models): supported (always enabled)

  • Punctuation prediction:

    • fun-asr-realtime, fun-asr-realtime-2025-11-07, fun-asr-realtime-2025-09-15: supported (always enabled)

    • paraformer-realtime-v2, paraformer-realtime-8k-v2: supported (enabled by default, can be disabled)

    • paraformer-realtime-v1, paraformer-realtime-8k-v1: supported (always enabled)

  • Hotwords (all models): supported (configurable)

  • ITN (all models): supported (always enabled)

  • VAD (all models): supported (always enabled)

  • Rate limit (all models): 20 RPS

  • Connection method (all models): Java/Python SDK, WebSocket API

  • Price:

    • fun-asr-realtime, fun-asr-realtime-2025-11-07: International (Singapore) $0.00009/second; China (Beijing) $0.000047/second

    • fun-asr-realtime-2025-09-15: China (Beijing) $0.000047/second

    • paraformer-realtime-v2, paraformer-realtime-v1, paraformer-realtime-8k-v2, paraformer-realtime-8k-v1: China (Beijing) $0.000012/second