Alibaba Cloud Model Studio: Java SDK

Last Updated: Dec 25, 2025

This topic describes the parameters and interfaces of the Fun-ASR real-time speech recognition Java SDK.

User guide: For an introduction to the models and selection recommendations, see Real-time speech recognition - Fun-ASR/Gummy/Paraformer.

Prerequisites

Model availability

International (Singapore)

fun-asr-realtime (Stable)
fun-asr-realtime-2025-11-07 (Snapshot)

fun-asr-realtime is currently equivalent to fun-asr-realtime-2025-11-07. Both versions share the following properties:

  • Supported languages: Chinese (Mandarin, Cantonese, Wu, Minnan, Hakka, Gan, Xiang, and Jin), English, and Japanese. The model also supports Mandarin accents from various Chinese regions, including Zhongyuan, Southwest, Ji-Lu, Jianghuai, Lan-Yin, Jiao-Liao, Northeast, Beijing, Hong Kong/Taiwan, Henan, Shaanxi, Hubei, Sichuan, Chongqing, Yunnan, Guizhou, Guangdong, Guangxi, Hebei, Tianjin, Shandong, Anhui, Nanjing, Jiangsu, Hangzhou, Gansu, and Ningxia.

  • Supported sample rates: 16 kHz

  • Scenarios: ApsaraVideo Live, conferences, call centers, and more

  • Supported audio formats: PCM, WAV, MP3, Opus, Speex, AAC, and AMR

  • Price: $0.00009/second

  • Free quota: 36,000 seconds (10 hours), valid for 90 days

China (Beijing)

fun-asr-realtime (Stable)

Equivalent to fun-asr-realtime-2025-11-07.

  • Supported languages: Chinese (Mandarin, Cantonese, Wu, Minnan, Hakka, Gan, Xiang, and Jin), English, and Japanese. The model also supports Mandarin accents from various Chinese regions and provinces, including Zhongyuan, Southwest, Ji-Lu, Jianghuai, Lan-Yin, Jiao-Liao, Northeast, Beijing, Hong Kong/Taiwan, Henan, Shaanxi, Hubei, Sichuan, Chongqing, Yunnan, Guizhou, Guangdong, Guangxi, Hebei, Tianjin, Shandong, Anhui, Nanjing, Jiangsu, Hangzhou, Gansu, and Ningxia.

  • Supported sample rates: 16 kHz

  • Scenarios: ApsaraVideo Live, conferences, call centers, and more

  • Supported audio formats: PCM, WAV, MP3, Opus, Speex, AAC, and AMR

  • Price: $0.000047/second

fun-asr-realtime-2025-11-07 (Snapshot)

This version is optimized for far-field Voice Activity Detection (VAD) and provides higher recognition accuracy than fun-asr-realtime-2025-09-15.

fun-asr-realtime-2025-09-15 (Snapshot)

  • Supported languages: Chinese (Mandarin), English

Getting started

The Recognition class provides interfaces for non-streaming and bidirectional streaming calls. Choose the appropriate calling method based on your requirements:

  • Non-streaming call: Recognizes a local file and returns the complete result at once. This is suitable for processing pre-recorded audio.

  • Bidirectional streaming call: Recognizes an audio stream and outputs the results in real time. The audio stream can originate from an external device, such as a microphone, or be read from a local file. This is suitable for scenarios that require immediate feedback.

Non-streaming call

Submit a single real-time speech-to-text task and synchronously receive the transcription result by passing a local file. This call blocks the current thread.

To perform recognition and obtain the result, instantiate the Recognition class, call the call method, and attach the request parameters and the file to be recognized.

Complete example

The audio file used in the example is asr_example.wav.

import com.alibaba.dashscope.audio.asr.recognition.Recognition;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;
import com.alibaba.dashscope.utils.Constants;

import java.io.File;

public class Main {
    public static void main(String[] args) {
        // The following URL is for the Singapore region. If you use a model in the Beijing region, replace the URL with wss://dashscope.aliyuncs.com/api-ws/v1/inference.
        Constants.baseWebsocketApiUrl = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference";
        // Create a Recognition instance.
        Recognition recognizer = new Recognition();
        // Create a RecognitionParam.
        RecognitionParam param =
                RecognitionParam.builder()
                        .model("fun-asr-realtime")
                        // The API keys for the Singapore and Beijing regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key.
                        // If you have not configured an environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                        .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                        .format("wav")
                        .sampleRate(16000)
                        .parameter("language_hints", new String[]{"zh", "en"})
                        .build();

        try {
            System.out.println("Recognition result: " + recognizer.call(param, new File("asr_example.wav")));
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // Close the WebSocket connection after the task is complete.
            recognizer.getDuplexApi().close(1000, "bye");
        }
        System.out.println(
                "[Metric] requestId: "
                        + recognizer.getLastRequestId()
                        + ", first package delay ms: "
                        + recognizer.getFirstPackageDelay()
                        + ", last package delay ms: "
                        + recognizer.getLastPackageDelay());
        System.exit(0);
    }
}

Bidirectional streaming call: Callback-based

Submit a single real-time speech-to-text task and receive real-time recognition results in a stream by implementing a callback interface.

  1. Start streaming speech recognition

    Instantiate the Recognition class and call the call method to bind the request parameters and the callback interface (ResultCallback) to start streaming speech recognition.

  2. Stream audio

    Repeatedly call the sendAudioFrame method of the Recognition class to send segments of a binary audio stream to the server. The stream can be read from a local file or a device, such as a microphone.

    While you send audio data, the server returns real-time recognition results to the client through the onEvent method of the ResultCallback callback interface.

    Each audio segment that you send should be about 100 milliseconds long, with a data size between 1 KB and 16 KB.

  3. End processing

    Call the stop method of the Recognition class to stop speech recognition.

    This method blocks the current thread until the onComplete or onError callback of the ResultCallback interface is triggered.

Complete examples

Recognize speech from a microphone

import com.alibaba.dashscope.audio.asr.recognition.Recognition;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionResult;
import com.alibaba.dashscope.common.ResultCallback;
import com.alibaba.dashscope.utils.Constants;

import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.TargetDataLine;

import java.nio.ByteBuffer;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class Main {
    public static void main(String[] args) throws InterruptedException {
        // The following URL is for the Singapore region. If you use a model in the Beijing region, replace the URL with wss://dashscope.aliyuncs.com/api-ws/v1/inference.
        Constants.baseWebsocketApiUrl = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference";
        ExecutorService executorService = Executors.newSingleThreadExecutor();
        executorService.submit(new RealtimeRecognitionTask());
        executorService.shutdown();
        executorService.awaitTermination(1, TimeUnit.MINUTES);
        System.exit(0);
    }
}

class RealtimeRecognitionTask implements Runnable {
    @Override
    public void run() {
        RecognitionParam param = RecognitionParam.builder()
                .model("fun-asr-realtime")
                // The API keys for the Singapore and Beijing regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key.
                // If you have not configured an environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .format("wav")
                .sampleRate(16000)
                .build();
        Recognition recognizer = new Recognition();

        ResultCallback<RecognitionResult> callback = new ResultCallback<RecognitionResult>() {
            @Override
            public void onEvent(RecognitionResult result) {
                if (result.isSentenceEnd()) {
                    System.out.println("Final Result: " + result.getSentence().getText());
                } else {
                    System.out.println("Intermediate Result: " + result.getSentence().getText());
                }
            }

            @Override
            public void onComplete() {
                System.out.println("Recognition complete");
            }

            @Override
            public void onError(Exception e) {
                System.out.println("RecognitionCallback error: " + e.getMessage());
            }
        };
        try {
            recognizer.call(param, callback);
            // Create an audio format.
            AudioFormat audioFormat = new AudioFormat(16000, 16, 1, true, false);
            // Match the default recording device based on the format.
            TargetDataLine targetDataLine =
                    AudioSystem.getTargetDataLine(audioFormat);
            targetDataLine.open(audioFormat);
            // Start recording.
            targetDataLine.start();
            ByteBuffer buffer = ByteBuffer.allocate(1024);
            long start = System.currentTimeMillis();
            // Record for 50 seconds and perform real-time transcription.
            while (System.currentTimeMillis() - start < 50000) {
                int read = targetDataLine.read(buffer.array(), 0, buffer.capacity());
                if (read > 0) {
                    buffer.limit(read);
                    // Send the recorded audio data to the streaming recognition service.
                    recognizer.sendAudioFrame(buffer);
                    buffer = ByteBuffer.allocate(1024);
                    // The recording rate is limited. Sleep for a short period to prevent high CPU usage.
                    Thread.sleep(20);
                }
            }
            recognizer.stop();
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // Close the WebSocket connection after the task is complete.
            recognizer.getDuplexApi().close(1000, "bye");
        }

        System.out.println(
                "[Metric] requestId: "
                        + recognizer.getLastRequestId()
                        + ", first package delay ms: "
                        + recognizer.getFirstPackageDelay()
                        + ", last package delay ms: "
                        + recognizer.getLastPackageDelay());
    }
}

Recognize a local audio file

The audio file used in the example is asr_example.wav.

import com.alibaba.dashscope.audio.asr.recognition.Recognition;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionResult;
import com.alibaba.dashscope.common.ResultCallback;
import com.alibaba.dashscope.utils.Constants;

import java.io.FileInputStream;
import java.nio.ByteBuffer;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class TimeUtils {
    private static final DateTimeFormatter formatter =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS");

    public static String getTimestamp() {
        return LocalDateTime.now().format(formatter);
    }
}

public class Main {
    public static void main(String[] args) throws InterruptedException {
        // The following URL is for the Singapore region. If you use a model in the Beijing region, replace the URL with wss://dashscope.aliyuncs.com/api-ws/v1/inference.
        Constants.baseWebsocketApiUrl = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference";
        ExecutorService executorService = Executors.newSingleThreadExecutor();
        executorService.submit(new RealtimeRecognitionTask(Paths.get(System.getProperty("user.dir"), "asr_example.wav")));
        executorService.shutdown();

        // wait for all tasks to complete
        executorService.awaitTermination(1, TimeUnit.MINUTES);
        System.exit(0);
    }
}

class RealtimeRecognitionTask implements Runnable {
    private Path filepath;

    public RealtimeRecognitionTask(Path filepath) {
        this.filepath = filepath;
    }

    @Override
    public void run() {
        RecognitionParam param = RecognitionParam.builder()
                .model("fun-asr-realtime")
                // The API keys for the Singapore and Beijing regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key.
                // If you have not configured an environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .format("wav")
                .sampleRate(16000)
                .build();
        Recognition recognizer = new Recognition();

        String threadName = Thread.currentThread().getName();

        ResultCallback<RecognitionResult> callback = new ResultCallback<RecognitionResult>() {
            @Override
            public void onEvent(RecognitionResult message) {
                if (message.isSentenceEnd()) {

                    System.out.println(TimeUtils.getTimestamp()+" "+
                            "[process " + threadName + "] Final Result:" + message.getSentence().getText());
                } else {
                    System.out.println(TimeUtils.getTimestamp()+" "+
                            "[process " + threadName + "] Intermediate Result: " + message.getSentence().getText());
                }
            }

            @Override
            public void onComplete() {
                System.out.println(TimeUtils.getTimestamp()+" "+"[" + threadName + "] Recognition complete");
            }

            @Override
            public void onError(Exception e) {
                System.out.println(TimeUtils.getTimestamp()+" "+
                        "[" + threadName + "] RecognitionCallback error: " + e.getMessage());
            }
        };

        try {
            recognizer.call(param, callback);
            // Please replace the path with your audio file path
            System.out.println(TimeUtils.getTimestamp()+" "+"[" + threadName + "] Input file_path is: " + this.filepath);
            // Read file and send audio by chunks
            FileInputStream fis = new FileInputStream(this.filepath.toFile());
            // A chunk of 3200 bytes is about 100 ms of 16 kHz, 16-bit, mono audio
            byte[] buffer = new byte[3200];
            int bytesRead;
            // Loop to read chunks of the file
            while ((bytesRead = fis.read(buffer)) != -1) {
                ByteBuffer byteBuffer;
                // Handle the last chunk which might be smaller than the buffer size
                System.out.println(TimeUtils.getTimestamp()+" "+"[" + threadName + "] bytesRead: " + bytesRead);
                if (bytesRead < buffer.length) {
                    byteBuffer = ByteBuffer.wrap(buffer, 0, bytesRead);
                } else {
                    byteBuffer = ByteBuffer.wrap(buffer);
                }

                recognizer.sendAudioFrame(byteBuffer);
                buffer = new byte[3200];
                Thread.sleep(100);
            }
            System.out.println(TimeUtils.getTimestamp()+" "+LocalDateTime.now());
            recognizer.stop();
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // Close the WebSocket connection after the task is complete.
            recognizer.getDuplexApi().close(1000, "bye");
        }

        System.out.println(
                "["
                        + threadName
                        + "][Metric] requestId: "
                        + recognizer.getLastRequestId()
                        + ", first package delay ms: "
                        + recognizer.getFirstPackageDelay()
                        + ", last package delay ms: "
                        + recognizer.getLastPackageDelay());
    }
}

Bidirectional streaming call: Flowable-based

Submit a single real-time speech-to-text task and receive real-time recognition results in a stream by implementing a Flowable workflow.

Flowable is a reactive stream class from the RxJava library (io.reactivex.Flowable) that supports backpressure. For more information about how to use Flowable, see the RxJava documentation.

Complete example

Directly call the streamCall method of the Recognition class to start recognition.

The streamCall method returns a Flowable<RecognitionResult> instance. Call methods of the Flowable instance, such as blockingForEach and subscribe, to process the recognition results. The recognition results are encapsulated in the RecognitionResult object.

The streamCall method requires two parameters:

  • A RecognitionParam instance (request parameters): Use this instance to set parameters for speech recognition, such as the model, sample rate, and audio format.

  • A Flowable<ByteBuffer> instance: You need to create an instance of the Flowable<ByteBuffer> type and implement a method to parse the audio stream within this instance.

import com.alibaba.dashscope.audio.asr.recognition.Recognition;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.utils.Constants;
import io.reactivex.BackpressureStrategy;
import io.reactivex.Flowable;

import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.TargetDataLine;
import java.nio.ByteBuffer;

public class Main {
    public static void main(String[] args) throws NoApiKeyException {
        // The following URL is for the Singapore region. If you use a model in the Beijing region, replace the URL with wss://dashscope.aliyuncs.com/api-ws/v1/inference.
        Constants.baseWebsocketApiUrl = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference";
        // Create a Flowable<ByteBuffer>.
        Flowable<ByteBuffer> audioSource =
                Flowable.create(
                        emitter -> {
                            new Thread(
                                    () -> {
                                        try {
                                            // Create an audio format.
                                            AudioFormat audioFormat = new AudioFormat(16000, 16, 1, true, false);
                                            // Match the default recording device based on the format.
                                            TargetDataLine targetDataLine =
                                                    AudioSystem.getTargetDataLine(audioFormat);
                                            targetDataLine.open(audioFormat);
                                            // Start recording.
                                            targetDataLine.start();
                                            ByteBuffer buffer = ByteBuffer.allocate(1024);
                                            long start = System.currentTimeMillis();
                                            // Record for 50 seconds and perform real-time transcription.
                                            while (System.currentTimeMillis() - start < 50000) {
                                                int read = targetDataLine.read(buffer.array(), 0, buffer.capacity());
                                                if (read > 0) {
                                                    buffer.limit(read);
                                                    // Send the recorded audio data to the streaming recognition service.
                                                    emitter.onNext(buffer);
                                                    buffer = ByteBuffer.allocate(1024);
                                                    // The recording rate is limited. Sleep for a short period to prevent high CPU usage.
                                                    Thread.sleep(20);
                                                }
                                            }
                                            // Notify that the transcription is complete.
                                            emitter.onComplete();
                                        } catch (Exception e) {
                                            emitter.onError(e);
                                        }
                                    })
                                    .start();
                        },
                        BackpressureStrategy.BUFFER);

        // Create a Recognizer.
        Recognition recognizer = new Recognition();
        // Create a RecognitionParam. The Flowable<ByteBuffer> created above is passed to streamCall together with this param.
        RecognitionParam param = RecognitionParam.builder()
                .model("fun-asr-realtime")
                // The API keys for the Singapore and Beijing regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key.
                // If you have not configured an environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .format("pcm")
                .sampleRate(16000)
                .build();

        // Call the streaming interface.
        recognizer
                .streamCall(param, audioSource)
                .blockingForEach(
                        result -> {
                            // Subscribe to the output result
                            if (result.isSentenceEnd()) {
                                System.out.println("Final Result: " + result.getSentence().getText());
                            } else {
                                System.out.println("Intermediate Result: " + result.getSentence().getText());
                            }
                        });
        // Close the WebSocket connection after the task is complete.
        recognizer.getDuplexApi().close(1000, "bye");
        System.out.println(
                "[Metric] requestId: "
                        + recognizer.getLastRequestId()
                        + ", first package delay ms: "
                        + recognizer.getFirstPackageDelay()
                        + ", last package delay ms: "
                        + recognizer.getLastPackageDelay());
        System.exit(0);
    }
}

High-concurrency calls

The DashScope Java SDK uses the connection pool technology of OkHttp3 to reduce the overhead of repeatedly establishing connections. For more information, see Real-time speech recognition in high-concurrency scenarios.
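
The following is a minimal sketch of such a concurrent setup. It reuses the RealtimeRecognitionTask class from the "Recognize a local audio file" example above; the class name, file names, and pool size here are illustrative placeholders, not values required by the SDK.

import com.alibaba.dashscope.utils.Constants;

import java.nio.file.Paths;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ConcurrentRecognitionDemo {
    public static void main(String[] args) throws InterruptedException {
        // The following URL is for the Singapore region. If you use a model in the Beijing region, replace the URL with wss://dashscope.aliyuncs.com/api-ws/v1/inference.
        Constants.baseWebsocketApiUrl = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference";
        // Run several recognition tasks in parallel. Each task creates its own Recognition
        // instance; the SDK reuses the underlying connections through its OkHttp3 connection pool.
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (String file : new String[] {"audio_1.wav", "audio_2.wav", "audio_3.wav", "audio_4.wav"}) {
            pool.submit(new RealtimeRecognitionTask(Paths.get(file)));
        }
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.MINUTES);
    }
}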

Request parameters

Use the chain methods of RecognitionParam to configure parameters, such as the model, sample rate, and audio format. Then, pass the configured parameter object to the call or streamCall methods of the Recognition class.

Example

RecognitionParam param = RecognitionParam.builder()
  .model("fun-asr-realtime")
  .format("pcm")
  .sampleRate(16000)
  .parameter("language_hints", new String[]{"zh", "en"})
  .build();

For each parameter, the type, whether it is required, and the default value (if any) are noted in parentheses.

model (String, required)

The model for real-time speech recognition. For more information, see Model list.

sampleRate (Integer, required)

The sample rate of the audio to be recognized, in Hz. fun-asr-realtime supports a sample rate of 16000 Hz.

format (String, required)

The format of the audio to be recognized. Supported audio formats: pcm, wav, mp3, opus, speex, aac, amr.

Important

opus/speex: Must use Ogg encapsulation.

wav: Must be PCM encoded.

amr: Only AMR-NB is supported.
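
For reference, the following FFmpeg command (a sketch; the input and output file names are placeholders) converts an arbitrary audio file into 16 kHz, 16-bit, mono PCM WAV, which satisfies both the format and sampleRate constraints of fun-asr-realtime:

# Convert any input audio to 16 kHz, 16-bit, mono PCM WAV.
ffmpeg -i input_audio.ext -ac 1 -ar 16000 -c:a pcm_s16le output_16k_mono.wav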

vocabularyId (String, optional; not set by default)

The ID of the custom vocabulary. For more information, see Custom vocabulary.

semantic_punctuation_enabled (boolean, optional; default: false)

Specifies whether to enable semantic punctuation:

  • true: Enables semantic punctuation and disables Voice Activity Detection (VAD) punctuation.

  • false (default): Enables VAD punctuation and disables semantic punctuation.

Semantic punctuation provides higher accuracy and is suitable for meeting transcription scenarios. VAD punctuation provides lower latency and is suitable for interactive scenarios.

By adjusting the semantic_punctuation_enabled parameter, you can flexibly switch the punctuation method for speech recognition to adapt to different scenarios.

Note

Set the semantic_punctuation_enabled parameter using the parameter or parameters method of the RecognitionParam instance:

Set using the parameter method

RecognitionParam param = RecognitionParam.builder()
 .model("fun-asr-realtime")
 .format("pcm")
 .sampleRate(16000)
 .parameter("semantic_punctuation_enabled", true)
 .build();

Set using the parameters method

RecognitionParam param = RecognitionParam.builder()
 .model("fun-asr-realtime")
 .format("pcm")
 .sampleRate(16000)
 .parameters(Collections.singletonMap("semantic_punctuation_enabled", true))
 .build();

max_sentence_silence (Integer, optional; default: 1300)

The silence duration threshold for VAD punctuation, in milliseconds. When the silence after a segment of speech exceeds this threshold, the system determines that the sentence has ended. Valid values: 200 to 6000.

This parameter takes effect only when the semantic_punctuation_enabled parameter is set to false (VAD punctuation).

Note

Set the max_sentence_silence parameter using the parameter or parameters method of the RecognitionParam instance:

Set using the parameter method

RecognitionParam param = RecognitionParam.builder()
 .model("fun-asr-realtime")
 .format("pcm")
 .sampleRate(16000)
 .parameter("max_sentence_silence", 800)
 .build();

Set using the parameters method

RecognitionParam param = RecognitionParam.builder()
 .model("fun-asr-realtime")
 .format("pcm")
 .sampleRate(16000)
 .parameters(Collections.singletonMap("max_sentence_silence", 800))
 .build();

multi_threshold_mode_enabled (boolean, optional; default: false)

When this switch is enabled (true), it prevents VAD from segmenting sentences that are too long. It is disabled by default.

This parameter takes effect only when the semantic_punctuation_enabled parameter is set to false (VAD punctuation).

Note

Set the multi_threshold_mode_enabled parameter using the parameter or parameters method of the RecognitionParam instance:

Set using the parameter method

RecognitionParam param = RecognitionParam.builder()
 .model("fun-asr-realtime")
 .format("pcm")
 .sampleRate(16000)
 .parameter("multi_threshold_mode_enabled", true)
 .build();

Set using the parameters method

RecognitionParam param = RecognitionParam.builder()
 .model("fun-asr-realtime")
 .format("pcm")
 .sampleRate(16000)
 .parameters(Collections.singletonMap("multi_threshold_mode_enabled", true))
 .build();

punctuation_prediction_enabled (boolean, optional; default: true)

Specifies whether to automatically add punctuation to the recognition result:

  • true (default): This setting cannot be modified.

Note

Set the punctuation_prediction_enabled parameter using the parameter or parameters method of the RecognitionParam instance:

Set using the parameter method

RecognitionParam param = RecognitionParam.builder()
 .model("fun-asr-realtime")
 .format("pcm")
 .sampleRate(16000)
 .parameter("punctuation_prediction_enabled", false)
 .build();

Set using the parameters method

RecognitionParam param = RecognitionParam.builder()
 .model("fun-asr-realtime")
 .format("pcm")
 .sampleRate(16000)
 .parameters(Collections.singletonMap("punctuation_prediction_enabled", false))
 .build();

heartbeat (boolean, optional; default: false)

Use this switch to control whether to maintain a persistent connection with the server:

  • true: The connection with the server is maintained even if silent audio is continuously sent.

  • false (default): The connection is disconnected due to a timeout after 60 seconds, even if silent audio is continuously sent.

    Silent audio refers to content in an audio file or data stream that has no sound signal. You can generate silent audio using various methods, such as using audio editing software such as Audacity or Adobe Audition, or command-line tools such as FFmpeg.

Note

To use this field, the SDK version must be 2.19.1 or later.

Set the heartbeat parameter using the parameter or parameters method of the RecognitionParam instance:

Set using the parameter method

RecognitionParam param = RecognitionParam.builder()
 .model("fun-asr-realtime")
 .format("pcm")
 .sampleRate(16000)
 .parameter("heartbeat", true)
 .build();

Set using the parameters method

RecognitionParam param = RecognitionParam.builder()
 .model("fun-asr-realtime")
 .format("pcm")
 .sampleRate(16000)
 .parameters(Collections.singletonMap("heartbeat", true))
 .build();

apiKey (String, optional)

Your API key.
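
The examples in this topic set a single optional parameter at a time with Collections.singletonMap. To set several optional parameters in one call, you can pass a regular Map to the parameters method. The following is a minimal sketch; the parameter values are illustrative, and it assumes parameters accepts a Map<String, Object>, as the singletonMap examples suggest.

import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;

import java.util.HashMap;
import java.util.Map;

public class ParamBuilderExample {
    public static RecognitionParam buildParam() {
        // Collect several optional parameters in one map (values are illustrative).
        Map<String, Object> extraParams = new HashMap<>();
        extraParams.put("semantic_punctuation_enabled", false); // keep VAD punctuation
        extraParams.put("max_sentence_silence", 800);           // end a sentence after 800 ms of silence
        extraParams.put("heartbeat", true);                     // keep the connection alive during long silence

        return RecognitionParam.builder()
                .model("fun-asr-realtime")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .format("pcm")
                .sampleRate(16000)
                .parameters(extraParams)
                .build();
    }
}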

Key interfaces

Recognition class

Import the Recognition class using "import com.alibaba.dashscope.audio.asr.recognition.Recognition;". The key interfaces of this class are described below:

public void call(RecognitionParam param, final ResultCallback<RecognitionResult> callback)

Performs callback-based streaming real-time recognition. This method has no return value and does not block the current thread.

public String call(RecognitionParam param, File file)

Performs a non-streaming call based on a local file and returns the recognition result. This method blocks the current thread until the entire audio file is read. The file to be recognized must have read permissions.

public Flowable<RecognitionResult> streamCall(RecognitionParam param, Flowable<ByteBuffer> audioFrame)

Performs Flowable-based streaming real-time recognition and returns a Flowable<RecognitionResult>.

public void sendAudioFrame(ByteBuffer audioFrame)

Pushes an audio stream. The audioFrame parameter is the binary audio data, of the ByteBuffer type. Each audio packet should have a duration of about 100 ms and a size between 1 KB and 16 KB. The recognition results are returned through the onEvent method of the ResultCallback callback.

public void stop()

Stops real-time recognition. This method blocks the current thread until the onComplete or onError method of the ResultCallback instance is called.

boolean getDuplexApi().close(int code, String reason)

Closes the WebSocket connection and returns true. The code parameter is the WebSocket close code, and reason is the reason for closing; for details about these two parameters, see The WebSocket Protocol specification. You must close the WebSocket connection after a task is complete, even if an exception occurs, to prevent connection leaks. To learn how to reuse connections to improve efficiency, see Real-time speech recognition in high-concurrency scenarios.

public String getLastRequestId()

Gets the request ID (requestId) of the current task. Call this method after a new task is started by calling call or streamCall.

Note

This method is available only in SDK versions 2.18.0 and later.

public long getFirstPackageDelay()

Gets the first-packet latency, which is the delay from when the first audio packet is sent to when the first recognition result is received, in milliseconds. Call this method after the task is complete.

Note

This method is available only in SDK versions 2.18.0 and later.

public long getLastPackageDelay()

Gets the last-packet latency, which is the delay from when the stop instruction is sent to when the last recognition result is received, in milliseconds. Call this method after the task is complete.

Note

This method is available only in SDK versions 2.18.0 and later.

Callback interface (ResultCallback)

When you make a bidirectional streaming call, the server uses a callback to return key process information and data to the client. You must implement a callback method to handle the information or data that is returned by the server.

Implement the callback methods by inheriting the ResultCallback abstract class. When you inherit this abstract class, specify the generic type as RecognitionResult. The RecognitionResult object encapsulates the data structure that is returned by the server.

Because Java supports connection reuse, there are no onClose or onOpen methods.

Example

ResultCallback<RecognitionResult> callback = new ResultCallback<RecognitionResult>() {
    @Override
    public void onEvent(RecognitionResult result) {
        System.out.println("RequestId is: " + result.getRequestId());
        // Implement the logic to process the speech recognition result here.
    }

    @Override
    public void onComplete() {
        System.out.println("Task complete");
    }

    @Override
    public void onError(Exception e) {
        System.out.println("Task failed: " + e.getMessage());
    }
};

  • public void onEvent(RecognitionResult result): Called back when the service responds. The result parameter contains the real-time recognition result (RecognitionResult).

  • public void onComplete(): Called back after the task is complete.

  • public void onError(Exception e): Called back when an exception occurs. The e parameter contains the exception information.

Response results

Real-time recognition result (RecognitionResult)

RecognitionResult represents the result of a single real-time recognition.

  • public String getRequestId(): Returns the request ID.

  • public boolean isSentenceEnd(): Indicates whether the given sentence has ended, meaning punctuation has occurred.

  • public Sentence getSentence(): Returns sentence information (Sentence), including timestamps and text.

Sentence information (Sentence)

  • public Long getBeginTime(): Returns the sentence start time, in milliseconds.

  • public Long getEndTime(): Returns the sentence end time, in milliseconds.

  • public String getText(): Returns the recognized text.

  • public List<Word> getWords(): Returns a list of word timestamp information (Word).

Word timestamp information (Word)

  • public long getBeginTime(): Returns the word start time, in milliseconds.

  • public long getEndTime(): Returns the word end time, in milliseconds.

  • public String getText(): Returns the recognized word.

  • public String getPunctuation(): Returns the punctuation.
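
The following callback sketch combines the methods above to print sentence-level timestamps and word-level details when a sentence ends. It follows the snippet style of the callback example earlier in this topic (imports omitted); the output formatting is illustrative.

ResultCallback<RecognitionResult> callback = new ResultCallback<RecognitionResult>() {
    @Override
    public void onEvent(RecognitionResult result) {
        if (result.isSentenceEnd()) {
            // Sentence-level timestamps and text.
            System.out.println("[" + result.getSentence().getBeginTime() + " ms - "
                    + result.getSentence().getEndTime() + " ms] " + result.getSentence().getText());
            // Word-level timestamps, text, and punctuation (punctuation may be empty).
            if (result.getSentence().getWords() != null) {
                result.getSentence().getWords().forEach(word ->
                        System.out.println("  " + word.getBeginTime() + "-" + word.getEndTime() + " ms: "
                                + word.getText() + " | punctuation: " + word.getPunctuation()));
            }
        }
    }

    @Override
    public void onComplete() {
        System.out.println("Task complete");
    }

    @Override
    public void onError(Exception e) {
        System.out.println("Task failed: " + e.getMessage());
    }
};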

Error codes

If an error occurs, see Error messages for troubleshooting.

If the problem persists, join the developer group to report the issue and provide the request ID to help us investigate.

FAQ

Features

Q: How do I maintain a persistent connection with the server during long periods of silence?

A: Set the heartbeat request parameter to true and continuously send silent audio to the server.

Silent audio refers to content in an audio file or data stream that has no sound signal. Generate silent audio using audio editing software, such as Audacity or Adobe Audition, or using command-line tools, such as FFmpeg.
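
For example, the following FFmpeg command (a sketch; the duration and file name are arbitrary) uses the anullsrc source to generate 5 seconds of 16 kHz, mono, 16-bit PCM silence that can be sent as heartbeat audio:

# Generate 5 seconds of 16 kHz, mono, 16-bit PCM silence.
ffmpeg -f lavfi -i anullsrc=r=16000:cl=mono -t 5 -c:a pcm_s16le silence_16k_mono.wav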

Q: How do I convert an audio format to a supported format?

A: Use the FFmpeg tool. For more information, see the official FFmpeg website.

# Basic conversion command (universal template)
# -i: input file path (example: audio.wav)
# -c:a: audio encoder (examples: aac, libmp3lame, pcm_s16le)
# -b:a: audio bit rate, which controls audio quality (examples: 192k, 320k)
# -ar: sample rate (examples: 44100 (CD), 48000, 16000)
# -ac: number of audio channels (1 = mono, 2 = stereo)
# -y: overwrite the output file if it exists (no value needed)
ffmpeg -i input_audio.ext -c:a encoder_name -b:a bit_rate -ar sample_rate -ac number_of_channels output.ext

# Example: WAV → MP3 (preserve original quality)
ffmpeg -i input.wav -c:a libmp3lame -q:a 0 output.mp3
# Example: MP3 → WAV (16-bit PCM standard format)
ffmpeg -i input.mp3 -c:a pcm_s16le -ar 44100 -ac 2 output.wav
# Example: M4A → AAC (extract/convert Apple audio)
ffmpeg -i input.m4a -c:a copy output.aac  # Directly extract without re-encoding
ffmpeg -i input.m4a -c:a aac -b:a 256k output.aac  # Re-encode to improve quality
# Example: FLAC lossless → Opus (high compression)
ffmpeg -i input.flac -c:a libopus -b:a 128k -vbr on output.opus

Q: How do I recognize a local file (recorded audio file)?

A: There are two ways to recognize a local file:

  • Directly pass the local file path: This method returns the complete recognition result after the recognition is complete. This method is not suitable for scenarios that require immediate feedback.

    Pass the file path to the call method of the Recognition class to directly recognize the audio file. For more information, see Non-streaming call.

  • Convert the local file into a binary stream for recognition: This method recognizes the file and streams the recognition results as they become available. It is suitable for scenarios that require immediate feedback. For more information, see the bidirectional streaming call examples in this topic.

Troubleshooting

Q: Why is speech not recognized (no recognition result)?

  1. Check whether the audio format (format) and sample rate (sampleRate or sample_rate) in the request parameters are correctly set and comply with the parameter constraints. The following are examples of common errors:

    • The audio file has a .wav file name extension but is actually MP3 encoded, and the format request parameter is set based on the file name extension. This is an incorrect setting because the format parameter must match the actual audio encoding.

    • The audio sample rate is 3600 Hz, but the sampleRate or sample_rate request parameter is set to 48000. This is an incorrect parameter setting.

    Use the ffprobe tool to obtain audio information, such as the container, encoding, sample rate, and sound channel:

    ffprobe -v error -show_entries format=format_name -show_entries stream=codec_name,sample_rate,channels -of default=noprint_wrappers=1 input.xxx
  2. Check if the language specified in language_hints is consistent with the actual language of the audio.

    For example, the audio is in Chinese, but language_hints is set to en (English).

  3. If all the preceding checks pass, use a custom vocabulary to improve the recognition of specific words. For more information, see Custom vocabulary.