This topic describes the parameters and interfaces of the Fun-ASR real-time speech recognition Java SDK.
User guide: For an introduction to the models and selection recommendations, see Real-time speech recognition - Fun-ASR/Gummy/Paraformer.
Prerequisites
You have activated the service and created an API key. To prevent security risks from code leakage, export the API key as an environment variable instead of hard-coding it in your code.
Model availability
International (Singapore)
Model | Version | Supported languages | Supported sample rates | Scenarios | Supported audio formats | Price | Free quota (Note) |
fun-asr-realtime This model is currently equivalent to fun-asr-realtime-2025-11-07. | Stable | Chinese (Mandarin, Cantonese, Wu, Minnan, Hakka, Gan, Xiang, and Jin), English, and Japanese. This model also supports Mandarin accents from various Chinese regions, including Zhongyuan, Southwest, Ji-Lu, Jianghuai, Lan-Yin, Jiao-Liao, Northeast, Beijing, Hong Kong/Taiwan, Henan, Shaanxi, Hubei, Sichuan, Chongqing, Yunnan, Guizhou, Guangdong, Guangxi, Hebei, Tianjin, Shandong, Anhui, Nanjing, Jiangsu, Hangzhou, Gansu, and Ningxia. | 16 kHz | ApsaraVideo Live, conferences, call centers, and more. | PCM, WAV, MP3, Opus, Speex, AAC, and AMR | $0.00009/second | 36,000 seconds (10 hours) Valid for 90 days |
fun-asr-realtime-2025-11-07 | Snapshot | Same as fun-asr-realtime for the remaining columns. |
China (Beijing)
Model | Version | Supported languages | Supported sample rates | Scenarios | Supported audio formats | Price |
fun-asr-realtime Equivalent to fun-asr-realtime-2025-11-07 | Stable | Chinese (Mandarin, Cantonese, Wu, Minnan, Hakka, Gan, Xiang, and Jin), English, and Japanese. This model also supports Mandarin accents from various Chinese regions and provinces, including Zhongyuan, Southwest, Ji-Lu, Jianghuai, Lan-Yin, Jiao-Liao, Northeast, Beijing, Hong Kong/Taiwan, Henan, Shaanxi, Hubei, Sichuan, Chongqing, Yunnan, Guizhou, Guangdong, Guangxi, Hebei, Tianjin, Shandong, Anhui, Nanjing, Jiangsu, Hangzhou, Gansu, and Ningxia. | 16 kHz | ApsaraVideo Live, conferences, call centers, and more | pcm, wav, mp3, opus, speex, aac, amr | $0.000047/second |
fun-asr-realtime-2025-11-07 (optimized for far-field Voice Activity Detection (VAD); higher recognition accuracy than fun-asr-realtime-2025-09-15) | Snapshot | Same as fun-asr-realtime for the remaining columns. |
fun-asr-realtime-2025-09-15 | Snapshot | Chinese (Mandarin), English | Same as fun-asr-realtime for the remaining columns. |
Getting started
The Recognition class provides interfaces for non-streaming and bidirectional streaming calls. Choose the appropriate calling method based on your requirements:
Non-streaming call: Recognizes a local file and returns the complete result at once. This is suitable for processing pre-recorded audio.
Bidirectional streaming call: Recognizes an audio stream and outputs the results in real time. The audio stream can originate from an external device, such as a microphone, or be read from a local file. This is suitable for scenarios that require immediate feedback.
Non-streaming call
Submit a single real-time speech-to-text task and synchronously receive the transcription result by passing a local file. This call blocks the current thread.
To perform recognition and obtain the result, instantiate the Recognition class, call the call method, and attach the request parameters and the file to be recognized.
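The following is a minimal sketch of a non-streaming call. The file name asr_example.wav and the assumption that the API key is read from the DASHSCOPE_API_KEY environment variable are illustrative only.
import com.alibaba.dashscope.audio.asr.recognition.Recognition;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;

import java.io.File;

public class NonStreamingDemo {
    public static void main(String[] args) {
        // Configure the request parameters for a local 16 kHz WAV file.
        RecognitionParam param = RecognitionParam.builder()
                .model("fun-asr-realtime")
                .format("wav")
                .sampleRate(16000)
                .build();

        Recognition recognition = new Recognition();
        // Blocks the current thread until the whole file has been recognized.
        String result = recognition.call(param, new File("asr_example.wav"));
        System.out.println(result);
    }
}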
Bidirectional streaming call: Callback-based
Submit a single real-time speech-to-text task and receive real-time recognition results in a stream by implementing a callback interface.
1. Start streaming speech recognition
Instantiate the Recognition class and call the call method to bind the request parameters and the callback interface (ResultCallback) to start streaming speech recognition.
2. Stream audio
Repeatedly call the sendAudioFrame method of the Recognition class to send segments of a binary audio stream to the server. The stream can be read from a local file or a device, such as a microphone. While you send audio data, the server returns real-time recognition results to the client through the onEvent method of the ResultCallback callback interface. Each audio segment that you send should be about 100 milliseconds long, with a data size between 1 KB and 16 KB.
3. End processing
Call the stop method of the Recognition class to stop speech recognition. This method blocks the current thread until the onComplete or onError callback of the ResultCallback interface is triggered. A minimal sketch covering all three steps follows.
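The following sketch walks through the three steps above. Reading the audio from a local PCM file in roughly 100 ms chunks stands in for a real capture device; the file name and chunk size are illustrative assumptions.
import com.alibaba.dashscope.audio.asr.recognition.Recognition;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionResult;
import com.alibaba.dashscope.common.ResultCallback;

import java.io.FileInputStream;
import java.nio.ByteBuffer;

public class CallbackStreamingDemo {
    public static void main(String[] args) throws Exception {
        RecognitionParam param = RecognitionParam.builder()
                .model("fun-asr-realtime")
                .format("pcm")
                .sampleRate(16000)
                .build();

        Recognition recognition = new Recognition();

        // Step 1: start streaming recognition and bind the callback.
        recognition.call(param, new ResultCallback<RecognitionResult>() {
            @Override
            public void onEvent(RecognitionResult result) {
                // Intermediate and final results arrive here in real time.
                System.out.println("text: " + result.getSentence().getText());
            }

            @Override
            public void onComplete() {
                System.out.println("recognition complete");
            }

            @Override
            public void onError(Exception e) {
                System.err.println("error: " + e.getMessage());
            }
        });

        // Step 2: stream audio in ~100 ms chunks (3200 bytes of 16 kHz, 16-bit mono PCM).
        try (FileInputStream fis = new FileInputStream("asr_example.pcm")) {
            byte[] buffer = new byte[3200];
            int read;
            while ((read = fis.read(buffer)) > 0) {
                // Copy each chunk so the SDK can process it asynchronously.
                recognition.sendAudioFrame(ByteBuffer.wrap(buffer.clone(), 0, read));
                Thread.sleep(100); // simulate real-time capture
            }
        }

        // Step 3: stop; blocks until onComplete or onError fires.
        recognition.stop();
    }
}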
Bidirectional streaming call: Flowable-based
Submit a single real-time speech-to-text task and receive real-time recognition results in a stream by implementing a Flowable workflow.
In this SDK, Flowable refers to the reactive-streams type (io.reactivex.Flowable) that is used to produce the audio stream and consume the recognition results asynchronously. For more information about how to use Flowable, see Flowable API details.
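The following sketch shows a Flowable-based call. It assumes that streamCall accepts a Flowable<ByteBuffer> of audio frames and returns a Flowable<RecognitionResult> (see the Recognition class table below), and that the RxJava 2 Flowable type (io.reactivex.Flowable) is on the classpath; the file name and chunk size are illustrative.
import com.alibaba.dashscope.audio.asr.recognition.Recognition;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;
import io.reactivex.BackpressureStrategy;
import io.reactivex.Flowable;

import java.io.FileInputStream;
import java.nio.ByteBuffer;
import java.util.Arrays;

public class FlowableStreamingDemo {
    public static void main(String[] args) {
        RecognitionParam param = RecognitionParam.builder()
                .model("fun-asr-realtime")
                .format("pcm")
                .sampleRate(16000)
                .build();

        // Emit a local PCM file as ~100 ms audio frames.
        Flowable<ByteBuffer> audioSource = Flowable.create(emitter -> {
            try (FileInputStream fis = new FileInputStream("asr_example.pcm")) {
                byte[] buffer = new byte[3200];
                int read;
                while ((read = fis.read(buffer)) > 0) {
                    emitter.onNext(ByteBuffer.wrap(Arrays.copyOf(buffer, read)));
                    Thread.sleep(100); // simulate real-time capture
                }
            }
            emitter.onComplete();
        }, BackpressureStrategy.BUFFER);

        Recognition recognition = new Recognition();
        // blockingForEach keeps the main thread alive until the result stream completes.
        recognition.streamCall(param, audioSource)
                .blockingForEach(result ->
                        System.out.println("text: " + result.getSentence().getText()));
    }
}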
High-concurrency calls
The DashScope Java SDK uses the connection pool technology of OkHttp3 to reduce the overhead of repeatedly establishing connections. For more information, see Real-time speech recognition in high-concurrency scenarios.
Request parameters
Use the chain methods of RecognitionParam to configure parameters, such as the model, sample rate, and audio format. Then, pass the configured parameter object to the call or streamCall methods of the Recognition class.
Parameter | Type | Default | Required | Description |
model | String | - | Yes | The model for real-time speech recognition. For more information, see Model list. |
sampleRate | Integer | - | Yes | The sample rate of the audio to be recognized, in Hz. fun-asr-realtime supports a sample rate of 16000 Hz. |
format | String | - | Yes | The format of the audio to be recognized. Supported audio formats: pcm, wav, mp3, opus, speex, aac, amr. Important: opus/speex must use Ogg encapsulation; wav must be PCM encoded; amr supports only AMR-NB. |
vocabularyId | String | - | No | The ID of the custom vocabulary. For more information, see Custom vocabulary. This parameter is not set by default. |
semantic_punctuation_enabled | boolean | false | No | Specifies whether to enable semantic punctuation. true: sentences are segmented based on semantics (semantic punctuation). false (default): sentences are segmented based on Voice Activity Detection (VAD punctuation). Semantic punctuation provides higher accuracy and is suitable for meeting transcription scenarios; VAD punctuation provides lower latency and is suitable for interactive scenarios. Switch between the two modes by adjusting this parameter. Note: Set this parameter using the parameter or parameters method of the RecognitionParam builder. |
max_sentence_silence | Integer | 1300 | No | The silence duration threshold for VAD punctuation, in ms. When the silence after a segment of speech exceeds this threshold, the system determines that the sentence has ended. Valid range: 200 to 6000. Default: 1300. This parameter takes effect only when VAD punctuation is used (semantic_punctuation_enabled is false). Note: Set this parameter using the parameter or parameters method of the RecognitionParam builder. |
multi_threshold_mode_enabled | boolean | false | No | When enabled (true), prevents VAD punctuation from producing overly long sentences. Disabled (false) by default. This parameter takes effect only when VAD punctuation is used (semantic_punctuation_enabled is false). Note: Set this parameter using the parameter or parameters method of the RecognitionParam builder. |
punctuation_prediction_enabled | boolean | true | No | Specifies whether to automatically add punctuation to the recognition result. true (default): punctuation is added. false: punctuation is not added. Note: Set this parameter using the parameter or parameters method of the RecognitionParam builder. |
heartbeat | boolean | false | No | Specifies whether to maintain a persistent connection with the server during long periods of silence. true: the connection is kept alive as long as you continuously send silent audio. false (default): the connection may be closed after a period of silence even if you keep sending silent audio. Note: To use this field, the SDK version must be 2.19.1 or later. Set this parameter using the parameter or parameters method of the RecognitionParam builder. |
apiKey | String | - | No | Your API key. |
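A short sketch of configuring these parameters follows. The dedicated builder methods (model, format, sampleRate, vocabularyId) and the generic parameter method for the snake_case fields are used as described in the table above; the vocabulary ID and the chosen values are placeholders.
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;

public class ParamDemo {
    public static RecognitionParam buildParam() {
        return RecognitionParam.builder()
                .model("fun-asr-realtime")
                .format("pcm")
                .sampleRate(16000)
                // Optional: custom vocabulary ID (placeholder value).
                .vocabularyId("vocab-example-id")
                // Optional: switch to semantic punctuation for meeting transcription.
                .parameter("semantic_punctuation_enabled", true)
                // Optional: keep the connection alive during long silence (SDK 2.19.1 or later).
                .parameter("heartbeat", true)
                .build();
    }
}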
Key interfaces
Recognition class
Import the Recognition class using "import com.alibaba.dashscope.audio.asr.recognition.Recognition;". The key interfaces of this class are described in the following table:
Interface/Method | Parameters | Return value | Description |
call(RecognitionParam param, ResultCallback<RecognitionResult> callback) | param: the request parameters. callback: the callback interface that receives results. | None | Performs callback-based streaming real-time recognition. This method does not block the current thread. |
call(RecognitionParam param, File file) | param: the request parameters. file: the local audio file to be recognized. | Recognition result | Performs a non-streaming call based on a local file. This method blocks the current thread until the entire audio file is read. The file to be recognized must have read permissions. |
streamCall(RecognitionParam param, Flowable<ByteBuffer> audioFrame) | param: the request parameters. audioFrame: the binary audio stream to be recognized. | Flowable<RecognitionResult> | Performs Flowable-based streaming real-time recognition. |
sendAudioFrame(ByteBuffer audioFrame) | audioFrame: a binary audio segment. | None | Pushes an audio stream. Each audio packet should have a duration of about 100 ms and a size between 1 KB and 16 KB. The recognition results are obtained through the onEvent method of the ResultCallback callback. |
stop() | None | None | Stops real-time recognition. This method blocks the current thread until the onComplete or onError callback of the ResultCallback interface is triggered. |
getDuplexApi().close(int code, String reason) | code: the WebSocket close code. reason: the reason for closing. Refer to The WebSocket Protocol to configure these two parameters. | true | You must close the WebSocket connection after a task is complete to prevent connection leaks. This applies even if an exception occurs. To learn how to reuse connections to improve efficiency, see Real-time speech recognition in high-concurrency scenarios. |
getLastRequestId() | None | requestId | Gets the request ID of the current task. Use this method after a new task is started by calling call or streamCall. Note: This method is available only in SDK versions 2.18.0 and later. |
getFirstPackageDelay() | None | First-packet latency | Gets the first-packet latency, which is the delay from when the first audio packet is sent to when the first recognition result is received. Use this method after the task is complete. Note: This method is available only in SDK versions 2.18.0 and later. |
getLastPackageDelay() | None | Last-packet latency | Gets the last-packet latency, which is the time taken from when the last audio packet is sent to when the final recognition result is received. Use this method after the task is complete. Note: This method is available only in SDK versions 2.18.0 and later. |
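As a brief usage note, the request ID and latency metrics from the table above can be read after a task finishes (SDK 2.18.0 or later). The snippet below is a sketch that assumes a Recognition instance whose task has already completed.
import com.alibaba.dashscope.audio.asr.recognition.Recognition;

public class MetricsDemo {
    // Call after stop() has returned (or the result Flowable has completed).
    public static void printMetrics(Recognition recognition) {
        String requestId = recognition.getLastRequestId();       // request ID of the finished task
        long firstPacketMs = recognition.getFirstPackageDelay();  // first-packet latency in ms
        long lastPacketMs = recognition.getLastPackageDelay();    // last-packet latency in ms
        System.out.println(requestId + ": " + firstPacketMs + " ms / " + lastPacketMs + " ms");
    }
}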
Callback interface (ResultCallback)
When you make a bidirectional streaming call, the server uses a callback to return key process information and data to the client. You must implement a callback method to handle the information or data that is returned by the server.
Implement the callback methods by inheriting the ResultCallback abstract class. When you inherit this abstract class, specify the generic type as RecognitionResult. The RecognitionResult object encapsulates the data structure that is returned by the server.
Because the Java SDK reuses connections, the callback interface does not provide onOpen or onClose methods.
Interface/Method | Parameters | Return value | Description |
onEvent(RecognitionResult result) | result: the real-time recognition result. | None | This method is called back when the service responds. |
onComplete() | None | None | This method is called back after the task is complete. |
onError(Exception e) | e: the exception that occurred. | None | This method is called back when an exception occurs. |
Response results
Real-time recognition result (RecognitionResult)
RecognitionResult represents the result of a single real-time recognition.
Interface/Method | Parameters | Return value | Description |
getRequestId() | None | requestId | Gets the request ID. |
isSentenceEnd() | None | Whether the sentence is complete, meaning punctuation has occurred | Determines whether the given sentence has ended. |
getSentence() | None | Sentence information (Sentence) | Gets sentence information, including timestamps and text. |
Sentence information (Sentence)
Interface/Method | Parameters | Return value | Description |
getBeginTime() | None | Sentence start time, in ms | Returns the sentence start time. |
getEndTime() | None | Sentence end time, in ms | Returns the sentence end time. |
getText() | None | Recognized text | Returns the recognized text. |
getWords() | None | A List of word timestamp information (Word) | Returns word timestamp information. |
Word timestamp information (Word)
Interface/Method | Parameters | Return value | Description |
getBeginTime() | None | Word start time, in ms | Returns the word start time. |
getEndTime() | None | Word end time, in ms | Returns the word end time. |
getText() | None | Recognized word | Returns the recognized word. |
getPunctuation() | None | Punctuation | Returns the punctuation. |
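Putting the three result types together, the following sketch shows how a result might be read inside the onEvent callback. The getter names follow the tables above; var is used for the Sentence and Word locals so that no import paths for those classes need to be assumed.
import com.alibaba.dashscope.audio.asr.recognition.RecognitionResult;

public class ResultParsingDemo {
    // Typically called from ResultCallback.onEvent(RecognitionResult result).
    public static void handle(RecognitionResult result) {
        var sentence = result.getSentence();
        if (result.isSentenceEnd()) {
            // Complete sentence: sentence segmentation has occurred.
            System.out.printf("[%d-%d ms] %s%n",
                    sentence.getBeginTime(), sentence.getEndTime(), sentence.getText());
            // Word-level timestamps.
            for (var word : sentence.getWords()) {
                System.out.printf("  %s%s (%d-%d ms)%n",
                        word.getText(), word.getPunctuation(),
                        word.getBeginTime(), word.getEndTime());
            }
        } else {
            // Intermediate result for the sentence still in progress.
            System.out.println("partial: " + sentence.getText());
        }
    }
}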
Error codes
If an error occurs, see Error messages for troubleshooting.
If the problem persists, join the developer group to report the issue. Provide the Request ID to help us investigate the issue.
FAQ
Features
Q: How do I maintain a persistent connection with the server during long periods of silence?
A: Set the heartbeat request parameter to true and continuously send silent audio to the server.
Silent audio refers to content in an audio file or data stream that has no sound signal. Generate silent audio using audio editing software, such as Audacity or Adobe Audition, or using command-line tools, such as FFmpeg.
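As an illustrative sketch only: zero-valued PCM samples are silence, so during quiet periods you can keep sending zero-filled frames while heartbeat is set to true. A 3200-byte frame corresponds to about 100 ms of 16 kHz, 16-bit mono audio; the loop duration is arbitrary.
import com.alibaba.dashscope.audio.asr.recognition.Recognition;

import java.nio.ByteBuffer;

public class SilenceDemo {
    // Sends roughly 5 seconds of silent PCM audio through an active streaming task.
    public static void sendSilence(Recognition recognition) throws InterruptedException {
        byte[] silence = new byte[3200]; // all zeros = ~100 ms of 16 kHz, 16-bit mono silence
        for (int i = 0; i < 50; i++) {
            recognition.sendAudioFrame(ByteBuffer.wrap(silence));
            Thread.sleep(100); // pace the frames in real time
        }
    }
}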
Q: How do I convert an audio format to a supported format?
A: Use the FFmpeg tool. For more information, see the official FFmpeg website.
# Basic conversion command (universal template)
# -i, function: input file path, example: audio.wav
# -c:a, function: audio encoder, examples: aac, libmp3lame, pcm_s16le
# -b:a, function: bit rate (audio quality control), examples: 192k, 320k
# -ar, function: sample rate, examples: 44100 (CD), 48000, 16000
# -ac, function: number of sound channels, examples: 1 (mono), 2 (stereo)
# -y, function: overwrite existing file (no value needed)
ffmpeg -i input_audio.ext -c:a encoder_name -b:a bit_rate -ar sample_rate -ac number_of_channels output.ext
# Example: WAV → MP3 (preserve original quality)
ffmpeg -i input.wav -c:a libmp3lame -q:a 0 output.mp3
# Example: MP3 → WAV (16-bit PCM standard format)
ffmpeg -i input.mp3 -c:a pcm_s16le -ar 44100 -ac 2 output.wav
# Example: M4A → AAC (extract/convert Apple audio)
ffmpeg -i input.m4a -c:a copy output.aac # Directly extract without re-encoding
ffmpeg -i input.m4a -c:a aac -b:a 256k output.aac # Re-encode to improve quality
# Example: FLAC lossless → Opus (high compression)
ffmpeg -i input.flac -c:a libopus -b:a 128k -vbr on output.opus
Q: How do I recognize a local file (recorded audio file)?
A: There are two ways to recognize a local file:
Directly pass the local file path: This method returns the complete recognition result after the recognition is complete. It is not suitable for scenarios that require immediate feedback.
Pass the file path to the call method of the Recognition class to directly recognize the audio file. For more information, see Non-streaming call.
Convert the local file into a binary stream for recognition: This method recognizes the file and streams the recognition results. It is suitable for scenarios that require immediate feedback.
Use the sendAudioFrame method of the Recognition class to send a binary stream to the server for recognition. For more information, see Bidirectional streaming call: Callback-based.
Use the streamCall method of the Recognition class to send a binary stream to the server for recognition. For more information, see Bidirectional streaming call: Flowable-based.
Troubleshooting
Q: Why is speech not recognized (no recognition result)?
A: Check whether the audio format (format) and sample rate (sampleRate or sample_rate) in the request parameters are correctly set and comply with the parameter constraints. The following are examples of common errors:
The audio file has a .wav file name extension but is actually in MP3 format, and the format request parameter is set to wav based on the extension instead of the actual format. This is an incorrect parameter setting.
The audio sample rate is 3600 Hz, but the sampleRate or sample_rate request parameter is set to 48000. This is an incorrect parameter setting.
Use the ffprobe tool to obtain audio information, such as the container, encoding, sample rate, and sound channels:
ffprobe -v error -show_entries format=format_name -show_entries stream=codec_name,sample_rate,channels -of default=noprint_wrappers=1 input.xxx
Check whether the language specified in language_hints is consistent with the actual language of the audio.
For example, the audio is in Chinese, but language_hints is set to en (English).
If all the preceding checks pass, use a custom vocabulary to improve the recognition of specific words.