
Alibaba Cloud Model Studio: Real-time speech recognition - Fun-ASR/Paraformer

Last Updated: Mar 24, 2026

Converts audio streams from a microphone, meeting recordings, or local files into punctuated text in real time. Use it for meeting transcription, live captions, voice chat, or customer service.

Core features

  • Supports real-time recognition for multiple languages (Chinese, English, dialects).

  • Provides timestamp output for structured results.

  • Flexible sample rates and multiple audio formats adapt to different recording environments.

  • Voice Activity Detection (VAD) is optional and filters silent segments in long audio.

  • SDK and WebSocket connection types provide a low-latency, stable service.

Availability

Supported models:

International

In the international deployment mode, the endpoint and data storage are located in the Singapore region. Inference computing resources are dynamically scheduled worldwide, excluding the Chinese mainland.

Use an API key from the Singapore region:

  • Fun-ASR: fun-asr-realtime (stable, equivalent to fun-asr-realtime-2025-11-07), fun-asr-realtime-2025-11-07 (snapshot)

Chinese mainland

In the Chinese mainland deployment mode, the endpoint and data storage are located in the Beijing region. Inference computing resources are limited to the Chinese mainland.

Use an API key from the Beijing region:

  • Fun-ASR: fun-asr-realtime (stable, equivalent to fun-asr-realtime-2025-11-07), fun-asr-realtime-2026-02-28 (latest snapshot), fun-asr-realtime-2025-11-07 (snapshot), fun-asr-realtime-2025-09-15 (snapshot)

    • fun-asr-flash-8k-realtime (stable, equivalent to fun-asr-flash-8k-realtime-2026-01-28), fun-asr-flash-8k-realtime-2026-01-28 (snapshot)

  • Paraformer: paraformer-realtime-v2, paraformer-realtime-v1, paraformer-realtime-8k-v2, paraformer-realtime-8k-v1

For more information, see the model list.
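
The two region endpoints above can be kept behind a small helper. This is only a sketch: the function name websocket_url is not part of the SDK, and the URLs are the ones used in the code samples later on this page.

```python
# Hypothetical helper: map the region of your API key to the matching
# DashScope WebSocket endpoint (URLs taken from the samples on this page).
def websocket_url(region: str) -> str:
    urls = {
        'singapore': 'wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference',
        'beijing': 'wss://dashscope.aliyuncs.com/api-ws/v1/inference',
    }
    return urls[region]
```

With the Python SDK, you would assign the returned URL to dashscope.base_websocket_api_url, as the samples below do.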

Model selection

  • Mandarin Chinese recognition (meetings, live streaming): fun-asr-realtime, fun-asr-realtime-2026-02-28, or paraformer-realtime-v2. Supports multiple formats, high sample rates, and stable latency.

  • Multi-language recognition (international conferences): paraformer-realtime-v2. Covers multiple languages.

  • Chinese dialect recognition (customer service, government affairs): fun-asr-realtime-2026-02-28 or paraformer-realtime-v2. Covers multiple local dialects.

  • Mixed Chinese, English, and Japanese recognition (classrooms, speeches): fun-asr-realtime or fun-asr-realtime-2025-11-07. Optimized for Chinese, English, and Japanese recognition.

  • Low-bandwidth phone recording transcription: fun-asr-flash-8k-realtime. Designed specifically for Chinese-language telephone customer service.

For more information, see Compare models.

Getting started

The following code samples show how to call the API. For more samples covering common scenarios, see GitHub.

Get an API key and export it as an environment variable. If you call the service through an SDK, install the DashScope SDK first.
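
As a quick sanity check before starting a session, you can fail fast when the environment variable is missing. This helper is a sketch, not part of the SDK; the variable name DASHSCOPE_API_KEY matches the samples below.

```python
import os

def require_api_key(env_var: str = 'DASHSCOPE_API_KEY') -> str:
    """Return the API key from the environment, or fail fast with a clear error."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f'Set the {env_var} environment variable before calling the API.')
    return key
```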

Fun-ASR

Recognize speech from a microphone

This feature recognizes microphone input and displays results in real time ("speak-and-see").

Java

import com.alibaba.dashscope.audio.asr.recognition.Recognition;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionResult;
import com.alibaba.dashscope.common.ResultCallback;
import com.alibaba.dashscope.utils.Constants;

import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.TargetDataLine;

import java.nio.ByteBuffer;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class Main {
    public static void main(String[] args) throws InterruptedException {
        // The following URL is for the Singapore region. If you use models in the Beijing region, replace the URL with: wss://dashscope.aliyuncs.com/api-ws/v1/inference
        Constants.baseWebsocketApiUrl = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference";
        ExecutorService executorService = Executors.newSingleThreadExecutor();
        executorService.submit(new RealtimeRecognitionTask());
        executorService.shutdown();
        executorService.awaitTermination(1, TimeUnit.MINUTES);
        System.exit(0);
    }
}

class RealtimeRecognitionTask implements Runnable {
    @Override
    public void run() {
        RecognitionParam param = RecognitionParam.builder()
                .model("fun-asr-realtime")
                // API keys differ between Singapore and Beijing regions. Get an API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
                // If you have not configured an environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .format("pcm")
                .sampleRate(16000)
                .build();
        Recognition recognizer = new Recognition();

        ResultCallback<RecognitionResult> callback = new ResultCallback<RecognitionResult>() {
            @Override
            public void onEvent(RecognitionResult result) {
                if (result.isSentenceEnd()) {
                    System.out.println("Final Result: " + result.getSentence().getText());
                } else {
                    System.out.println("Intermediate Result: " + result.getSentence().getText());
                }
            }

            @Override
            public void onComplete() {
                System.out.println("Recognition complete");
            }

            @Override
            public void onError(Exception e) {
                System.out.println("RecognitionCallback error: " + e.getMessage());
            }
        };
        try {
            recognizer.call(param, callback);
            // Create audio format
            AudioFormat audioFormat = new AudioFormat(16000, 16, 1, true, false);
            // Match the default recording device based on the format
            TargetDataLine targetDataLine =
                    AudioSystem.getTargetDataLine(audioFormat);
            targetDataLine.open(audioFormat);
            // Start recording
            targetDataLine.start();
            ByteBuffer buffer = ByteBuffer.allocate(1024);
            long start = System.currentTimeMillis();
            // Record for 50s and perform real-time transcription
            while (System.currentTimeMillis() - start < 50000) {
                int read = targetDataLine.read(buffer.array(), 0, buffer.capacity());
                if (read > 0) {
                    buffer.limit(read);
                    // Send recorded audio data to the streaming recognition service
                    recognizer.sendAudioFrame(buffer);
                    buffer = ByteBuffer.allocate(1024);
                    // Sleep briefly to prevent high CPU usage due to limited recording speed
                    Thread.sleep(20);
                }
            }
            recognizer.stop();
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // Close the WebSocket connection after the task ends
            recognizer.getDuplexApi().close(1000, "bye");
        }

        System.out.println(
                "[Metric] requestId: "
                        + recognizer.getLastRequestId()
                        + ", first package delay ms: "
                        + recognizer.getFirstPackageDelay()
                        + ", last package delay ms: "
                        + recognizer.getLastPackageDelay());
    }
}

Python

Before running the Python example, install PyAudio: pip install pyaudio

import os
import signal  # for keyboard events handling (press "Ctrl+C" to terminate recording)
import sys

import dashscope
import pyaudio
from dashscope.audio.asr import *

mic = None
stream = None

# Set recording parameters
sample_rate = 16000  # sampling rate (Hz)
channels = 1  # mono channel
dtype = 'int16'  # data type
format_pcm = 'pcm'  # the format of the audio data
block_size = 3200  # number of frames per buffer


# Real-time speech recognition callback
class Callback(RecognitionCallback):
    def on_open(self) -> None:
        global mic
        global stream
        print('RecognitionCallback open.')
        mic = pyaudio.PyAudio()
        stream = mic.open(format=pyaudio.paInt16,
                          channels=1,
                          rate=16000,
                          input=True)

    def on_close(self) -> None:
        global mic
        global stream
        print('RecognitionCallback close.')
        stream.stop_stream()
        stream.close()
        mic.terminate()
        stream = None
        mic = None

    def on_complete(self) -> None:
        print('RecognitionCallback completed.')  # recognition completed

    def on_error(self, message) -> None:
        print('RecognitionCallback task_id: ', message.request_id)
        print('RecognitionCallback error: ', message.message)
        # Stop and close the audio stream if it is still running
        if stream is not None and stream.is_active():
            stream.stop_stream()
            stream.close()
        # Forcefully exit the program
        sys.exit(1)

    def on_event(self, result: RecognitionResult) -> None:
        sentence = result.get_sentence()
        if 'text' in sentence:
            print('RecognitionCallback text: ', sentence['text'])
            if RecognitionResult.is_sentence_end(sentence):
                print(
                    'RecognitionCallback sentence end, request_id:%s, usage:%s'
                    % (result.get_request_id(), result.get_usage(sentence)))


def signal_handler(sig, frame):
    print('Ctrl+C pressed, stop recognition ...')
    # Stop recognition
    recognition.stop()
    print('Recognition stopped.')
    print(
        '[Metric] requestId: {}, first package delay ms: {}, last package delay ms: {}'
        .format(
            recognition.get_last_request_id(),
            recognition.get_first_package_delay(),
            recognition.get_last_package_delay(),
        ))
    # Forcefully exit the program
    sys.exit(0)


# main function
if __name__ == '__main__':
    # API keys differ between the Singapore and Beijing regions. Get an API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
    # If you have not set an environment variable, replace the next line with your Model Studio API key: dashscope.api_key = "sk-xxx"
    dashscope.api_key = os.environ.get('DASHSCOPE_API_KEY')

    # The following URL is for the Singapore region. To use models in the Beijing region, replace it with: wss://dashscope.aliyuncs.com/api-ws/v1/inference
    dashscope.base_websocket_api_url='wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference'

    # Create the recognition callback
    callback = Callback()

    # Call the recognition service in asynchronous mode. You can customize
    # recognition parameters such as model, format, and sample_rate.
    recognition = Recognition(
        model='fun-asr-realtime',
        format=format_pcm,
        # Supported formats: 'pcm', 'wav', 'opus', 'speex', 'aac', 'amr'. See the documentation for the full list.
        sample_rate=sample_rate,
        # Supported sample rates: 8000 Hz and 16000 Hz
        semantic_punctuation_enabled=False,
        callback=callback)

    # Start recognition
    recognition.start()

    signal.signal(signal.SIGINT, signal_handler)
    print("Press 'Ctrl+C' to stop recording and recognition...")
    # Stream microphone audio to the recognizer until Ctrl+C triggers signal_handler

    while True:
        if stream:
            data = stream.read(3200, exception_on_overflow=False)
            recognition.send_audio_frame(data)
        else:
            break

    recognition.stop()

Recognize a local audio file

This feature recognizes and transcribes local audio files. It is ideal for short audio scenarios like voice chats, commands, voice input, or voice search.

Java

Audio used in the example: asr_example.wav.

import com.alibaba.dashscope.api.GeneralApi;
import com.alibaba.dashscope.audio.asr.recognition.Recognition;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionResult;
import com.alibaba.dashscope.base.HalfDuplexParamBase;
import com.alibaba.dashscope.common.GeneralListParam;
import com.alibaba.dashscope.common.ResultCallback;
import com.alibaba.dashscope.protocol.GeneralServiceOption;
import com.alibaba.dashscope.protocol.HttpMethod;
import com.alibaba.dashscope.protocol.Protocol;
import com.alibaba.dashscope.protocol.StreamingMode;
import com.alibaba.dashscope.utils.Constants;

import java.io.FileInputStream;
import java.nio.ByteBuffer;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class TimeUtils {
    private static final DateTimeFormatter formatter =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS");

    public static String getTimestamp() {
        return LocalDateTime.now().format(formatter);
    }
}

public class Main {
    public static void main(String[] args) throws InterruptedException {
        // The following URL is for the Singapore region. If you use models in the Beijing region, replace the URL with: wss://dashscope.aliyuncs.com/api-ws/v1/inference
        Constants.baseWebsocketApiUrl = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference";
        // In actual applications, run this method only once at program startup.
        warmUp();

        ExecutorService executorService = Executors.newSingleThreadExecutor();
        executorService.submit(new RealtimeRecognitionTask(Paths.get(System.getProperty("user.dir"), "asr_example.wav")));
        executorService.shutdown();

        // wait for all tasks to complete
        executorService.awaitTermination(1, TimeUnit.MINUTES);
        System.exit(0);
    }

    public static void warmUp() {
        try {
            // Lightweight GET request to establish connection
            GeneralServiceOption warmupOption = GeneralServiceOption.builder()
                    .protocol(Protocol.HTTP)
                    .httpMethod(HttpMethod.GET)
                    .streamingMode(StreamingMode.OUT)
                    .path("assistants")
                    .build();

            warmupOption.setBaseHttpUrl(Constants.baseHttpApiUrl);
            GeneralApi<HalfDuplexParamBase> api = new GeneralApi<>();
            api.get(GeneralListParam.builder().limit(1L).build(), warmupOption);
        } catch (Exception e) {
            // Ignore warm-up failures; the recognition request establishes its own connection.
        }
    }
}

class RealtimeRecognitionTask implements Runnable {
    private Path filepath;

    public RealtimeRecognitionTask(Path filepath) {
        this.filepath = filepath;
    }

    @Override
    public void run() {
        RecognitionParam param = RecognitionParam.builder()
                .model("fun-asr-realtime")
                // API keys differ between Singapore and Beijing regions. Get an API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
                // If you have not configured an environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .format("wav")
                .sampleRate(16000)
                .build();
        Recognition recognizer = new Recognition();

        String threadName = Thread.currentThread().getName();

        ResultCallback<RecognitionResult> callback = new ResultCallback<RecognitionResult>() {
            @Override
            public void onEvent(RecognitionResult message) {
                if (message.isSentenceEnd()) {

                    System.out.println(TimeUtils.getTimestamp()+" "+
                            "[process " + threadName + "] Final Result:" + message.getSentence().getText());
                } else {
                    System.out.println(TimeUtils.getTimestamp()+" "+
                            "[process " + threadName + "] Intermediate Result: " + message.getSentence().getText());
                }
            }

            @Override
            public void onComplete() {
                System.out.println(TimeUtils.getTimestamp()+" "+"[" + threadName + "] Recognition complete");
            }

            @Override
            public void onError(Exception e) {
                System.out.println(TimeUtils.getTimestamp()+" "+
                        "[" + threadName + "] RecognitionCallback error: " + e.getMessage());
            }
        };

        try {
            recognizer.call(param, callback);
            // Please replace the path with your audio file path
            System.out.println(TimeUtils.getTimestamp()+" "+"[" + threadName + "] Input file_path is: " + this.filepath);
            // Read file and send audio by chunks
            FileInputStream fis = new FileInputStream(this.filepath.toFile());
            byte[] allData = new byte[fis.available()];
            int ret = fis.read(allData);
            fis.close();

            int sendFrameLength = 3200;
            for (int i = 0; i * sendFrameLength < allData.length; i ++) {
                int start = i * sendFrameLength;
                int end = Math.min(start + sendFrameLength, allData.length);
                ByteBuffer byteBuffer = ByteBuffer.wrap(allData, start, end - start);
                recognizer.sendAudioFrame(byteBuffer);
                Thread.sleep(100); // Simulate real-time streaming: 3200 bytes ≈ 100 ms of 16 kHz 16-bit mono audio
            }

            System.out.println(TimeUtils.getTimestamp() + " Finished sending audio");
            recognizer.stop();
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // Close the WebSocket connection after the task ends
            recognizer.getDuplexApi().close(1000, "bye");
        }

        System.out.println(
                "["
                        + threadName
                        + "][Metric] requestId: "
                        + recognizer.getLastRequestId()
                        + ", first package delay ms: "
                        + recognizer.getFirstPackageDelay()
                        + ", last package delay ms: "
                        + recognizer.getLastPackageDelay());
    }
}

Python

The audio file used in this example is asr_example.wav.

import os
import time
import dashscope
from dashscope.audio.asr import *

# API keys differ between the Singapore and Beijing regions. Get an API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
# If you have not set an environment variable, replace the next line with your Model Studio API key: dashscope.api_key = "sk-xxx"
dashscope.api_key = os.environ.get('DASHSCOPE_API_KEY')

# The following URL is for the Singapore region. To use models in the Beijing region, replace it with: wss://dashscope.aliyuncs.com/api-ws/v1/inference
dashscope.base_websocket_api_url='wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference'

from datetime import datetime


def get_timestamp():
    now = datetime.now()
    formatted_timestamp = now.strftime("[%Y-%m-%d %H:%M:%S.%f]")
    return formatted_timestamp


class Callback(RecognitionCallback):
    def on_complete(self) -> None:
        print(get_timestamp() + ' Recognition completed')  # recognition complete

    def on_error(self, result: RecognitionResult) -> None:
        print('Recognition task_id: ', result.request_id)
        print('Recognition error: ', result.message)
        exit(1)  # exit with a nonzero status on error

    def on_event(self, result: RecognitionResult) -> None:
        sentence = result.get_sentence()
        if 'text' in sentence:
            print(get_timestamp() + ' RecognitionCallback text: ', sentence['text'])
        if RecognitionResult.is_sentence_end(sentence):
            print(get_timestamp() +
                  ' RecognitionCallback sentence end, request_id:%s, usage:%s'
                  % (result.get_request_id(), result.get_usage(sentence)))


callback = Callback()

recognition = Recognition(model='fun-asr-realtime',
                          format='wav',
                          sample_rate=16000,
                          callback=callback)

# Read all file data into a buffer at once
with open('asr_example.wav', 'rb') as f:
    file_buffer = f.read()
if not file_buffer:
    raise Exception('The supplied file was empty (zero bytes long)')

print('Start Recognition')
recognition.start()

# Send data from the buffer in chunks of 3200 bytes
buffer_size = len(file_buffer)
offset = 0
chunk_size = 3200

while offset < buffer_size:
    # Size of the current chunk
    current_chunk_size = min(chunk_size, buffer_size - offset)
    # Extract the current chunk from the buffer
    audio_data = file_buffer[offset:offset + current_chunk_size]
    # Send the audio frame
    recognition.send_audio_frame(audio_data)
    offset += current_chunk_size
    # Delay to simulate real-time transmission (3200 bytes ≈ 100 ms of 16 kHz 16-bit mono audio)
    time.sleep(0.1)

recognition.stop()

print(
    '[Metric] requestId: {}, first package delay ms: {}, last package delay ms: {}'
    .format(
        recognition.get_last_request_id(),
        recognition.get_first_package_delay(),
        recognition.get_last_package_delay(),
    ))

Paraformer

Recognize speech from a microphone

This feature recognizes microphone input and displays results in real time ("speak-and-see").

Java

import com.alibaba.dashscope.audio.asr.recognition.Recognition;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;
import com.alibaba.dashscope.exception.NoApiKeyException;
import io.reactivex.BackpressureStrategy;
import io.reactivex.Flowable;

import java.nio.ByteBuffer;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.TargetDataLine;

public class Main {

    public static void main(String[] args) throws NoApiKeyException {
        // Create a Flowable<ByteBuffer>.
        Flowable<ByteBuffer> audioSource = Flowable.create(emitter -> {
            new Thread(() -> {
                try {
                    // Create an audio format.
                    AudioFormat audioFormat = new AudioFormat(16000, 16, 1, true, false);
                    // Match the default recording device based on the format.
                    TargetDataLine targetDataLine =
                            AudioSystem.getTargetDataLine(audioFormat);
                    targetDataLine.open(audioFormat);
                    // Start recording.
                    targetDataLine.start();
                    ByteBuffer buffer = ByteBuffer.allocate(1024);
                    long start = System.currentTimeMillis();
                    // Record for 300s and perform real-time transcription.
                    while (System.currentTimeMillis() - start < 300000) {
                        int read = targetDataLine.read(buffer.array(), 0, buffer.capacity());
                        if (read > 0) {
                            buffer.limit(read);
                            // Send the recorded audio data to the streaming recognition service.
                            emitter.onNext(buffer);
                            buffer = ByteBuffer.allocate(1024);
                            // The recording rate is limited. Sleep for a short time to prevent high CPU usage.
                            Thread.sleep(20);
                        }
                    }
                    // Notify the end of transcription.
                    emitter.onComplete();
                } catch (Exception e) {
                    emitter.onError(e);
                }
            }).start();
        },
        BackpressureStrategy.BUFFER);

        // Create a Recognizer.
        Recognition recognizer = new Recognition();
        // Create a RecognitionParam and pass the created Flowable<ByteBuffer> in the audioFrames parameter.
        RecognitionParam param = RecognitionParam.builder()
            .model("paraformer-realtime-v2")
            .format("pcm")
            .sampleRate(16000)
            // If you have not configured the API key as an environment variable, uncomment the following line of code and replace apiKey with your own API key.
            // .apiKey("apikey")
            .build();

        // Call the interface in a streaming fashion.
        recognizer.streamCall(param, audioSource)
            // Call the subscribe method of Flowable to subscribe to the results.
            .blockingForEach(
                result -> {
                    // Print intermediate and final results.
                    if (result.isSentenceEnd()) {
                        System.out.println("Final Result: " + result.getSentence().getText());
                    } else {
                        System.out.println("Intermediate Result: " + result.getSentence().getText());
                    }
                });
        System.exit(0);
    }
}

Python

Before running the Python example, install PyAudio: pip install pyaudio

import pyaudio
from dashscope.audio.asr import (Recognition, RecognitionCallback,
                                 RecognitionResult)

# If you have not configured the API key as an environment variable, uncomment the following line of code and replace apiKey with your own API key.
# import dashscope
# dashscope.api_key = "apiKey"

mic = None
stream = None


class Callback(RecognitionCallback):
    def on_open(self) -> None:
        global mic
        global stream
        print('RecognitionCallback open.')
        mic = pyaudio.PyAudio()
        stream = mic.open(format=pyaudio.paInt16,
                          channels=1,
                          rate=16000,
                          input=True)

    def on_close(self) -> None:
        global mic
        global stream
        print('RecognitionCallback close.')
        stream.stop_stream()
        stream.close()
        mic.terminate()
        stream = None
        mic = None

    def on_event(self, result: RecognitionResult) -> None:
        print('RecognitionCallback sentence: ', result.get_sentence())


callback = Callback()
recognition = Recognition(model='paraformer-realtime-v2',
                          format='pcm',
                          sample_rate=16000,
                          callback=callback)
recognition.start()

while True:
    if stream:
        data = stream.read(3200, exception_on_overflow=False)
        recognition.send_audio_frame(data)
    else:
        break

recognition.stop()

Recognize a local audio file

This feature recognizes and transcribes local audio files. It is ideal for short audio scenarios like voice chats, commands, voice input, or voice search.

Java

import com.alibaba.dashscope.audio.asr.recognition.Recognition;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class Main {
    public static void main(String[] args) {
        // The download below only fetches a sample file. You can skip it and use your own local file for recognition.
        String exampleWavUrl =
                "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav";
        try {
            InputStream in = new URL(exampleWavUrl).openStream();
            Files.copy(in, Paths.get("asr_example.wav"), StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            System.out.println("error: " + e);
            System.exit(1);
        }

        // Create a Recognition instance.
        Recognition recognizer = new Recognition();
        // Create a RecognitionParam.
        RecognitionParam param =
                RecognitionParam.builder()
                        // If you have not configured the API key as an environment variable, uncomment the following line of code and replace apiKey with your own API key.
                        // .apiKey("apikey")
                        .model("paraformer-realtime-v2")
                        .format("wav")
                        .sampleRate(16000)
                        // "language_hints" is only supported by the paraformer-v2 and paraformer-realtime-v2 models.
                        .parameter("language_hints", new String[]{"zh", "en"})
                        .build();

        try {
            System.out.println("Recognition result: " + recognizer.call(param, new File("asr_example.wav")));
        } catch (Exception e) {
            e.printStackTrace();
        }
        System.exit(0);
    }
}

Python

import requests
from http import HTTPStatus
from dashscope.audio.asr import Recognition

# If you have not configured the API key as an environment variable, uncomment the following line of code and replace apiKey with your own API key.
# import dashscope
# dashscope.api_key = "apiKey"

# The download below only fetches a sample file. You can skip it and use your own local file for recognition.
r = requests.get(
    'https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav'
)
with open('asr_example.wav', 'wb') as f:
    f.write(r.content)

recognition = Recognition(model='paraformer-realtime-v2',
                          format='wav',
                          sample_rate=16000,
                          # "language_hints" is only supported by the paraformer-v2 and paraformer-realtime-v2 models.
                          language_hints=['zh', 'en'],
                          callback=None)
result = recognition.call('asr_example.wav')
if result.status_code == HTTPStatus.OK:
    print('Recognition result:')
    print(result.get_sentence())
else:
    print('Error: ', result.message)

Going live

Improve recognition accuracy

  • Select a model with the correct sample rate: For 8 kHz telephone audio, use an 8 kHz model directly because upsampling to 16 kHz causes distortion.

  • Optimize input audio quality: Use high-quality microphones with high SNR and no echo. At the application level, integrate noise reduction (like RNNoise) or acoustic echo cancellation (AEC) to preprocess audio.

  • Specify the recognition language: For multilingual models such as paraformer-realtime-v2, use the language_hints parameter (for example, ['zh', 'en']) to help the model converge and avoid confusion between similar-sounding languages.

  • Filter disfluent words: For Paraformer, set the disfluency_removal_enabled parameter to true to produce more formal, readable text.
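
The sample-rate guidance above can be encoded as a tiny helper. The mapping below uses the Paraformer model names from this page and is only a sketch; pick_paraformer_model is not an SDK function.

```python
def pick_paraformer_model(sample_rate_hz: int) -> str:
    # 8 kHz telephone audio should go straight to an 8 kHz model:
    # upsampling to 16 kHz adds no information and can distort the signal.
    if sample_rate_hz <= 8000:
        return 'paraformer-realtime-8k-v2'
    # paraformer-realtime-v2 accepts any sample rate, including 16 kHz and above.
    return 'paraformer-realtime-v2'
```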

Set a fault tolerance policy

  • Client-side reconnection: Implement automatic reconnection to handle network jitter. For the Python SDK:

    1. Catch exceptions: Implement on_error in the Callback class. The dashscope SDK calls this method when network errors occur.

    2. Notify status: When on_error triggers, set a reconnection signal using the thread-safe threading.Event flag.

    3. Reconnection loop: Wrap the main logic in a for loop to retry up to 3 times. When the signal triggers, interrupt recognition, clean up resources, wait a few seconds, and reconnect.

  • Set a heartbeat to prevent connection loss: Set the heartbeat parameter to true to maintain the connection during long silence periods.

  • Rate limiting: Follow the model's rate limiting rules.
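
The reconnection steps above can be sketched with standard-library primitives. ReconnectSignal and run_with_retries are illustrative names, not SDK classes; in a real client, start_session would wrap Recognition.start/stop and the callback's on_error would set the signal.

```python
import threading
import time

class ReconnectSignal:
    """Thread-safe flag a callback's on_error can set (sketch, not the real SDK class)."""
    def __init__(self):
        self._event = threading.Event()
    def set(self):
        self._event.set()
    def is_set(self) -> bool:
        return self._event.is_set()

def run_with_retries(start_session, max_retries: int = 3, backoff_s: float = 2.0) -> bool:
    """Run start_session(signal); retry with a delay while it errors or asks to reconnect."""
    for attempt in range(max_retries):
        signal = ReconnectSignal()
        try:
            # Blocks until the session ends; on_error sets the signal on network failures.
            start_session(signal)
            if not signal.is_set():
                return True  # session finished cleanly
        except Exception:
            pass  # treat hard failures like a reconnect signal
        time.sleep(backoff_s)  # wait before reconnecting
    return False  # gave up after max_retries attempts
```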

API reference

Compare models

Supported languages

  • Fun-ASR (varies by model):

    • fun-asr-realtime, fun-asr-realtime-2026-02-28, fun-asr-realtime-2025-11-07: Chinese (Mandarin, Cantonese, Wu, Minnan, Hakka, Gan, Xiang, Jin; and accents from regions such as Central Plains, Southwest, Jilu, Jianghuai, Lanyin, Jiaoliao, Northeast, Beijing, Hong Kong, and Taiwan, including Henan, Shaanxi, Hubei, Sichuan, Chongqing, Yunnan, Guizhou, Guangdong, Guangxi, Hebei, Tianjin, Shandong, Anhui, Nanjing, Jiangsu, Hangzhou, Gansu, and Ningxia), English, Japanese

    • fun-asr-realtime-2025-09-15: Chinese (Mandarin), English

    • fun-asr-flash-8k-realtime, fun-asr-flash-8k-realtime-2026-01-28: Chinese

  • Paraformer (varies by model):

    • paraformer-realtime-v2: Chinese (Mandarin, Cantonese, Wu, Minnan, Northeast dialect, Gansu dialect, Guizhou dialect, Henan dialect, Hubei dialect, Hunan dialect, Ningxia dialect, Shanxi dialect, Shaanxi dialect, Shandong dialect, Sichuan dialect, Tianjin dialect, Jiangxi dialect, Yunnan dialect, Shanghainese), English, Japanese, Korean, German, French, Russian

    • paraformer-realtime-v1, paraformer-realtime-8k-v2, paraformer-realtime-8k-v1: Chinese (Mandarin)

Supported audio formats

  • Both: pcm, wav, mp3, opus, speex, aac, amr

Sample rate

  • Fun-ASR:

    • fun-asr-realtime, fun-asr-realtime-2026-02-28, fun-asr-realtime-2025-11-07, fun-asr-realtime-2025-09-15: 16 kHz

    • fun-asr-flash-8k-realtime, fun-asr-flash-8k-realtime-2026-01-28: 8 kHz

  • Paraformer:

    • paraformer-realtime-v2: any sample rate

    • paraformer-realtime-v1: 16 kHz

    • paraformer-realtime-8k-v2, paraformer-realtime-8k-v1: 8 kHz

Sound channel

  • Both: mono

Input format

  • Both: binary audio stream

Audio size/duration

  • Both: unlimited

Emotion recognition

  • Fun-ASR: not supported

  • Paraformer:

    • paraformer-realtime-v2, paraformer-realtime-v1, paraformer-realtime-8k-v1: not supported

    • paraformer-realtime-8k-v2: supported (enabled by default, can be disabled)

Sensitive word filtering

  • Both: not supported

Speaker diarization

  • Both: not supported

Modal particle filtering

  • Fun-ASR: not supported

  • Paraformer: supported (disabled by default, can be enabled)

Timestamps

  • Both: supported (always enabled)

Punctuation prediction

  • Fun-ASR: supported (always enabled)

  • Paraformer:

    • paraformer-realtime-v2, paraformer-realtime-8k-v2: supported (enabled by default, can be disabled)

    • paraformer-realtime-v1, paraformer-realtime-8k-v1: supported (always enabled)

Hotwords

  • Both: not supported

ITN (inverse text normalization)

  • Both: supported (always enabled)

VAD (voice activity detection)

  • Both: supported (always enabled)

Rate limit (RPS)

  • Both: 20

Connection type

  • Both: Java/Python/Android/iOS SDK, WebSocket API

Price

  • Fun-ASR (varies by model):

    • fun-asr-realtime, fun-asr-realtime-2026-02-28, fun-asr-realtime-2025-11-07: International $0.00009/second; Chinese mainland $0.000047/second

    • fun-asr-realtime-2025-09-15: Chinese mainland $0.000047/second

    • fun-asr-flash-8k-realtime, fun-asr-flash-8k-realtime-2026-01-28: Chinese mainland $0.000032/second

  • Paraformer: Chinese mainland $0.000012/second