
Alibaba Cloud Model Studio:Java SDK

Last Updated:Mar 24, 2026

This topic describes the interfaces and parameters of the Fun-ASR Java SDK for real-time speech recognition.

User guide: For an introduction to the models and selection recommendations, see Real-time speech recognition - Fun-ASR/Paraformer.

Prerequisites

Model availability

International

In the international deployment mode, endpoints and data storage are in the Singapore region. Model inference compute resources are dynamically scheduled globally, excluding Chinese Mainland.

| Model | Version | Unit price | Free quota (Note) |
| --- | --- | --- | --- |
| fun-asr-realtime (currently fun-asr-realtime-2025-11-07) | Stable | $0.00009/second | 36,000 seconds (10 hours), valid for 90 days |
| fun-asr-realtime-2025-11-07 | Snapshot | $0.00009/second | 36,000 seconds (10 hours), valid for 90 days |

  • Languages supported: Mandarin, Cantonese, Wu, Minnan, Hakka, Gan, Xiang, and Jin. Also supports Mandarin accents from Zhongyuan, Southwest, Jilu, Jianghuai, Lanyin, Jiaoliao, Northeast, Beijing, and Hong Kong–Taiwan regions—including Henan, Shaanxi, Hubei, Sichuan, Chongqing, Yunnan, Guizhou, Guangdong, Guangxi, Hebei, Tianjin, Shandong, Anhui, Nanjing, Jiangsu, Hangzhou, Gansu, and Ningxia. Also supports English and Japanese.

  • Sample rates supported: 16 kHz

  • Audio formats supported: pcm, wav, mp3, opus, speex, aac, amr

Chinese Mainland

In the Chinese Mainland deployment mode, endpoints and data storage are in the Beijing region. Model inference compute resources are limited to Chinese Mainland.

| Model | Version | Unit price | Free quota (Note) |
| --- | --- | --- | --- |
| fun-asr-realtime (currently fun-asr-realtime-2025-11-07) | Stable | $0.000047/second | No free quota |
| fun-asr-realtime-2026-02-28 | Snapshot | $0.000047/second | No free quota |
| fun-asr-realtime-2025-11-07 | Snapshot | $0.000047/second | No free quota |
| fun-asr-realtime-2025-09-15 | Snapshot | $0.000047/second | No free quota |
| fun-asr-flash-8k-realtime (currently fun-asr-flash-8k-realtime-2026-01-28) | Stable | $0.000032/second | No free quota |
| fun-asr-flash-8k-realtime-2026-01-28 | Snapshot | $0.000032/second | No free quota |

  • Languages supported:

    • fun-asr-realtime, fun-asr-realtime-2026-02-28, fun-asr-realtime-2025-11-07: Chinese (Mandarin, Cantonese, Wu, Minnan, Hakka, Gan, Xiang, and Jin. Also supports Mandarin accents from Zhongyuan, Southwest, Jilu, Jianghuai, Lanyin, Jiaoliao, Northeast, Beijing, and Hong Kong–Taiwan regions—including Henan, Shaanxi, Hubei, Sichuan, Chongqing, Yunnan, Guizhou, Guangdong, Guangxi, Hebei, Tianjin, Shandong, Anhui, Nanjing, Jiangsu, Hangzhou, Gansu, and Ningxia), English, and Japanese.

    • fun-asr-realtime-2025-09-15: Chinese (Mandarin), English

  • Sample rates supported:

    • fun-asr-flash-8k-realtime and fun-asr-flash-8k-realtime-2026-01-28: 8 kHz

    • All other models: 16 kHz

  • Audio formats supported: pcm, wav, mp3, opus, speex, aac, amr

Getting started

The Recognition class provides non-streaming and bidirectional streaming interfaces. Choose based on your requirements:

  • Non-streaming call: Recognizes a local file and returns the complete result at once. Suitable for pre-recorded audio.

  • Bidirectional streaming call: Recognizes audio streams with real-time output. Audio can originate from a microphone or local file. Suitable for scenarios requiring immediate feedback.

Non-streaming call

Submit a speech-to-text task with a local file and synchronously receive the transcription result. This call blocks the current thread.

To perform recognition, instantiate the Recognition class, and then call the call method with the request parameters and the file to be recognized.


The audio file used in the example is asr_example.wav.

import com.alibaba.dashscope.audio.asr.recognition.Recognition;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;
import com.alibaba.dashscope.utils.Constants;

import java.io.File;

public class Main {
    public static void main(String[] args) {
        // The following URL is for the Singapore region. If you use a model in the Beijing region, replace the URL with wss://dashscope.aliyuncs.com/api-ws/v1/inference.
        Constants.baseWebsocketApiUrl = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference";
        // Create a Recognition instance.
        Recognition recognizer = new Recognition();
        // Create a RecognitionParam.
        RecognitionParam param =
                RecognitionParam.builder()
                        .model("fun-asr-realtime")
                        // The API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key.
                        // If you have not configured an environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                        .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                        .format("wav")
                        .sampleRate(16000)
                        //.parameter("language_hints", new String[]{"zh"})
                        .build();

        try {
            System.out.println("Recognition result: " + recognizer.call(param, new File("asr_example.wav")));
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // Close the WebSocket connection after the task is complete.
            recognizer.getDuplexApi().close(1000, "bye");
        }
        System.out.println(
                "[Metric] requestId: "
                        + recognizer.getLastRequestId()
                        + ", first package delay ms: "
                        + recognizer.getFirstPackageDelay()
                        + ", last package delay ms: "
                        + recognizer.getLastPackageDelay());
        System.exit(0);
    }
}

Bidirectional streaming call: Callback-based

Submit a speech-to-text task and receive streaming recognition results by implementing a callback interface.

  1. Start streaming speech recognition

    Instantiate the Recognition class and call the call method to bind the request parameters and the callback interface (ResultCallback) to start streaming speech recognition.

  2. Stream audio

    Repeatedly call sendAudioFrame to send binary audio stream segments to the server. The stream can be from a local file or microphone.

    While you send audio data, the server returns real-time recognition results to the client through the onEvent method of the ResultCallback callback interface.

    Each audio segment should be about 100 ms long and between 1 KB and 16 KB in size.

  3. End processing

    Call the stop method of the Recognition class to stop speech recognition. This method blocks until the onComplete or onError callback of the ResultCallback is triggered.
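
For the segment size in step 2, a quick sanity check: at the 16 kHz, 16-bit mono format used in the examples, 100 ms of audio is 3200 bytes, comfortably inside the 1 KB to 16 KB range. A minimal sketch of the arithmetic:

```java
public class FrameSize {
    public static void main(String[] args) {
        // Audio settings assumed from the examples: 16 kHz sample rate, 16-bit samples, mono PCM.
        int sampleRate = 16000;
        int bytesPerSample = 2; // 16-bit
        int channels = 1;
        int frameMs = 100;

        // Bytes in one 100 ms frame: 16000 * 2 * 1 * 100 / 1000 = 3200 bytes (~3.1 KB).
        int frameBytes = sampleRate * bytesPerSample * channels * frameMs / 1000;
        System.out.println(frameBytes); // 3200
    }
}
```

For other sample rates or formats, the same formula gives the per-frame byte count to pass to sendAudioFrame.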


Recognize speech from a microphone

import com.alibaba.dashscope.audio.asr.recognition.Recognition;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionResult;
import com.alibaba.dashscope.common.ResultCallback;
import com.alibaba.dashscope.utils.Constants;

import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.TargetDataLine;

import java.nio.ByteBuffer;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class Main {
    public static void main(String[] args) throws InterruptedException {
        // The following URL is for the Singapore region. If you use a model in the Beijing region, replace the URL with wss://dashscope.aliyuncs.com/api-ws/v1/inference.
        Constants.baseWebsocketApiUrl = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference";
        ExecutorService executorService = Executors.newSingleThreadExecutor();
        executorService.submit(new RealtimeRecognitionTask());
        executorService.shutdown();
        executorService.awaitTermination(1, TimeUnit.MINUTES);
        System.exit(0);
    }
}

class RealtimeRecognitionTask implements Runnable {
    @Override
    public void run() {
        RecognitionParam param = RecognitionParam.builder()
                .model("fun-asr-realtime")
                // The API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key.
                // If you have not configured an environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .format("pcm") // Raw microphone audio is headerless PCM, not WAV.
                .sampleRate(16000)
                .build();
        Recognition recognizer = new Recognition();

        ResultCallback<RecognitionResult> callback = new ResultCallback<RecognitionResult>() {
            @Override
            public void onEvent(RecognitionResult result) {
                if (result.isSentenceEnd()) {
                    System.out.println("Final Result: " + result.getSentence().getText());
                } else {
                    System.out.println("Intermediate Result: " + result.getSentence().getText());
                }
            }

            @Override
            public void onComplete() {
                System.out.println("Recognition complete");
            }

            @Override
            public void onError(Exception e) {
                System.out.println("RecognitionCallback error: " + e.getMessage());
            }
        };
        try {
            recognizer.call(param, callback);
            // Create an audio format.
            AudioFormat audioFormat = new AudioFormat(16000, 16, 1, true, false);
            // Match the default recording device based on the format.
            TargetDataLine targetDataLine =
                    AudioSystem.getTargetDataLine(audioFormat);
            targetDataLine.open(audioFormat);
            // Start recording.
            targetDataLine.start();
            ByteBuffer buffer = ByteBuffer.allocate(1024);
            long start = System.currentTimeMillis();
            // Record for 50 seconds and perform real-time transcription.
            while (System.currentTimeMillis() - start < 50000) {
                int read = targetDataLine.read(buffer.array(), 0, buffer.capacity());
                if (read > 0) {
                    buffer.limit(read);
                    // Send the recorded audio data to the streaming recognition service.
                    recognizer.sendAudioFrame(buffer);
                    buffer = ByteBuffer.allocate(1024);
                    // The recording rate is limited. Sleep for a short period to prevent high CPU usage.
                    Thread.sleep(20);
                }
            }
            recognizer.stop();
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // Close the WebSocket connection after the task is complete.
            recognizer.getDuplexApi().close(1000, "bye");
        }

        System.out.println(
                "[Metric] requestId: "
                        + recognizer.getLastRequestId()
                        + ", first package delay ms: "
                        + recognizer.getFirstPackageDelay()
                        + ", last package delay ms: "
                        + recognizer.getLastPackageDelay());
    }
}

Recognize a local audio file

The audio file used in the example is asr_example.wav.

import com.alibaba.dashscope.api.GeneralApi;
import com.alibaba.dashscope.audio.asr.recognition.Recognition;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionResult;
import com.alibaba.dashscope.base.HalfDuplexParamBase;
import com.alibaba.dashscope.common.GeneralListParam;
import com.alibaba.dashscope.common.ResultCallback;
import com.alibaba.dashscope.protocol.GeneralServiceOption;
import com.alibaba.dashscope.protocol.HttpMethod;
import com.alibaba.dashscope.protocol.Protocol;
import com.alibaba.dashscope.protocol.StreamingMode;
import com.alibaba.dashscope.utils.Constants;

import java.io.FileInputStream;
import java.nio.ByteBuffer;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class TimeUtils {
    private static final DateTimeFormatter formatter =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS");

    public static String getTimestamp() {
        return LocalDateTime.now().format(formatter);
    }
}

public class Main {
    public static void main(String[] args) throws InterruptedException {
        // The following URL is for the Singapore region. If you use a model in the Beijing region, replace the URL with wss://dashscope.aliyuncs.com/api-ws/v1/inference.
        Constants.baseWebsocketApiUrl = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference";
        // In a real application, call this method only once at program startup.
        warmUp();

        ExecutorService executorService = Executors.newSingleThreadExecutor();
        executorService.submit(new RealtimeRecognitionTask(Paths.get(System.getProperty("user.dir"), "asr_example.wav")));
        executorService.shutdown();

        // Wait for all tasks to complete.
        executorService.awaitTermination(1, TimeUnit.MINUTES);
        System.exit(0);
    }

    public static void warmUp() {
        try {
            // Lightweight GET request to establish a connection.
            GeneralServiceOption warmupOption = GeneralServiceOption.builder()
                    .protocol(Protocol.HTTP)
                    .httpMethod(HttpMethod.GET)
                    .streamingMode(StreamingMode.OUT)
                    .path("assistants")
                    .build();

            warmupOption.setBaseHttpUrl(Constants.baseHttpApiUrl);
            GeneralApi<HalfDuplexParamBase> api = new GeneralApi<>();
            api.get(GeneralListParam.builder().limit(1L).build(), warmupOption);
        } catch (Exception e) {
            // Ignore warm-up failures; the actual request will establish the connection.
        }
    }
}

class RealtimeRecognitionTask implements Runnable {
    private Path filepath;

    public RealtimeRecognitionTask(Path filepath) {
        this.filepath = filepath;
    }

    @Override
    public void run() {
        RecognitionParam param = RecognitionParam.builder()
                .model("fun-asr-realtime")
                // The API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key.
                // If you have not configured an environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .format("wav")
                .sampleRate(16000)
                .build();
        Recognition recognizer = new Recognition();

        String threadName = Thread.currentThread().getName();

        ResultCallback<RecognitionResult> callback = new ResultCallback<RecognitionResult>() {
            @Override
            public void onEvent(RecognitionResult message) {
                if (message.isSentenceEnd()) {
                    System.out.println(TimeUtils.getTimestamp() + " " +
                            "[process " + threadName + "] Final Result: " + message.getSentence().getText());
                } else {
                    System.out.println(TimeUtils.getTimestamp()+" "+
                            "[process " + threadName + "] Intermediate Result: " + message.getSentence().getText());
                }
            }

            @Override
            public void onComplete() {
                System.out.println(TimeUtils.getTimestamp()+" "+"[" + threadName + "] Recognition complete");
            }

            @Override
            public void onError(Exception e) {
                System.out.println(TimeUtils.getTimestamp()+" "+
                        "[" + threadName + "] RecognitionCallback error: " + e.getMessage());
            }
        };

        try {
            recognizer.call(param, callback);
            // Replace the path with your audio file path.
            System.out.println(TimeUtils.getTimestamp()+" "+"[" + threadName + "] Input file_path is: " + this.filepath);
            // Read the file and send audio in chunks.
            FileInputStream fis = new FileInputStream(this.filepath.toFile());
            byte[] allData = new byte[fis.available()];
            int ret = fis.read(allData);
            fis.close();

            int sendFrameLength = 3200;
            for (int i = 0; i * sendFrameLength < allData.length; i ++) {
                int start = i * sendFrameLength;
                int end = Math.min(start + sendFrameLength, allData.length);
                ByteBuffer byteBuffer = ByteBuffer.wrap(allData, start, end - start);
                recognizer.sendAudioFrame(byteBuffer);
                Thread.sleep(100);
            }

            System.out.println(TimeUtils.getTimestamp() + " [" + threadName + "] Finished sending audio, stopping recognition");
            recognizer.stop();
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // Close the WebSocket connection after the task is complete.
            recognizer.getDuplexApi().close(1000, "bye");
        }

        System.out.println(
                "["
                        + threadName
                        + "][Metric] requestId: "
                        + recognizer.getLastRequestId()
                        + ", first package delay ms: "
                        + recognizer.getFirstPackageDelay()
                        + ", last package delay ms: "
                        + recognizer.getLastPackageDelay());
    }
}

Bidirectional streaming call: Flowable-based

Submit a speech-to-text task and receive streaming results through a Flowable workflow.

Flowable is a reactive stream type from the RxJava library supporting backpressure. See Flowable API reference for details.


Directly call the streamCall method of the Recognition class to start recognition.

The streamCall method returns a Flowable<RecognitionResult> instance. Call methods of the Flowable instance, such as blockingForEach and subscribe, to process the recognition results. The recognition results are encapsulated in the RecognitionResult object.

The streamCall method requires two parameters:

  • A RecognitionParam instance (request parameters): Use this instance to set parameters for speech recognition, such as the model, sample rate, and audio format.

  • A Flowable<ByteBuffer> instance: You need to create an instance of the Flowable<ByteBuffer> type and implement a method to parse the audio stream within this instance.

import com.alibaba.dashscope.audio.asr.recognition.Recognition;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.utils.Constants;
import io.reactivex.BackpressureStrategy;
import io.reactivex.Flowable;

import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.TargetDataLine;
import java.nio.ByteBuffer;

public class Main {
    public static void main(String[] args) throws NoApiKeyException {
        // The following URL is for the Singapore region. If you use a model in the Beijing region, replace the URL with wss://dashscope.aliyuncs.com/api-ws/v1/inference.
        Constants.baseWebsocketApiUrl = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference";
        // Create a Flowable<ByteBuffer>.
        Flowable<ByteBuffer> audioSource =
                Flowable.create(
                        emitter -> {
                            new Thread(
                                    () -> {
                                        try {
                                            // Create an audio format.
                                            AudioFormat audioFormat = new AudioFormat(16000, 16, 1, true, false);
                                            // Match the default recording device based on the format.
                                            TargetDataLine targetDataLine =
                                                    AudioSystem.getTargetDataLine(audioFormat);
                                            targetDataLine.open(audioFormat);
                                            // Start recording.
                                            targetDataLine.start();
                                            ByteBuffer buffer = ByteBuffer.allocate(1024);
                                            long start = System.currentTimeMillis();
                                            // Record for 50 seconds and perform real-time transcription.
                                            while (System.currentTimeMillis() - start < 50000) {
                                                int read = targetDataLine.read(buffer.array(), 0, buffer.capacity());
                                                if (read > 0) {
                                                    buffer.limit(read);
                                                    // Send the recorded audio data to the streaming recognition service.
                                                    emitter.onNext(buffer);
                                                    buffer = ByteBuffer.allocate(1024);
                                                    // The recording rate is limited. Sleep for a short period to prevent high CPU usage.
                                                    Thread.sleep(20);
                                                }
                                            }
                                            // Notify that the transcription is complete.
                                            emitter.onComplete();
                                        } catch (Exception e) {
                                            emitter.onError(e);
                                        }
                                    })
                                    .start();
                        },
                        BackpressureStrategy.BUFFER);

        // Create a Recognition instance.
        Recognition recognizer = new Recognition();
        // Create a RecognitionParam and pass the created Flowable<ByteBuffer> in the audioFrames parameter.
        RecognitionParam param = RecognitionParam.builder()
                .model("fun-asr-realtime")
                // The API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key.
                // If you have not configured an environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .format("pcm")
                .sampleRate(16000)
                .build();

        // Call the streaming interface.
        recognizer
                .streamCall(param, audioSource)
                .blockingForEach(
                        result -> {
                            // Subscribe to the output result.
                            if (result.isSentenceEnd()) {
                                System.out.println("Final Result: " + result.getSentence().getText());
                            } else {
                                System.out.println("Intermediate Result: " + result.getSentence().getText());
                            }
                        });
        // Close the WebSocket connection after the task is complete.
        recognizer.getDuplexApi().close(1000, "bye");
        System.out.println(
                "[Metric] requestId: "
                        + recognizer.getLastRequestId()
                        + ", first package delay ms: "
                        + recognizer.getFirstPackageDelay()
                        + ", last package delay ms: "
                        + recognizer.getLastPackageDelay());
        System.exit(0);
    }
}

High-concurrency calls

The DashScope Java SDK uses OkHttp3 connection pooling to reduce connection overhead. See Optimize Paraformer real-time speech recognition for high concurrency for details.

Request parameters

Use RecognitionParam builder methods to configure parameters (model, sample rate, audio format), then pass to call or streamCall methods.


RecognitionParam param = RecognitionParam.builder()
  .model("fun-asr-realtime")
  .format("pcm")
  .sampleRate(16000)
  //.parameter("language_hints", new String[]{"zh"})
  .build();

The following parameters are supported. Each entry lists the parameter name, type, whether it is required, and the default value.

model (String, required; no default)

The model for real-time speech recognition. For more information, see Model list.

sampleRate (Integer, required; no default)

Audio sample rate in Hz. The fun-asr-realtime models support 16000 Hz; the fun-asr-flash-8k-realtime models support 8000 Hz.

format (String, required; no default)

Audio format.

Supported: pcm, wav, mp3, opus, speex, aac, amr.

Important

opus/speex: Must use Ogg encapsulation.

wav: Must be PCM encoded.

amr: Only AMR-NB is supported.
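
Because wav input must be PCM encoded, it can help to sanity-check a file's header before sending it. The sketch below checks the 2-byte format tag of a canonical RIFF/WAVE header (tag 1 means PCM); it assumes the fmt chunk starts at byte 12, which holds for standard files but not for files with extra chunks before fmt:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class WavCheck {
    // Returns true if the WAV header declares PCM encoding (format tag 1).
    // Assumes the canonical layout: "RIFF" at offset 0, "WAVE" at 8, "fmt " at 12,
    // so the little-endian format tag sits at offset 20.
    static boolean isPcmWav(byte[] header) {
        if (header.length < 22) return false;
        ByteBuffer buf = ByteBuffer.wrap(header).order(ByteOrder.LITTLE_ENDIAN);
        boolean riff = header[0] == 'R' && header[1] == 'I' && header[2] == 'F' && header[3] == 'F';
        boolean wave = header[8] == 'W' && header[9] == 'A' && header[10] == 'V' && header[11] == 'E';
        return riff && wave && buf.getShort(20) == 1;
    }

    public static void main(String[] args) {
        // Build a minimal in-memory PCM header for demonstration.
        byte[] header = new byte[44];
        System.arraycopy("RIFF".getBytes(), 0, header, 0, 4);
        System.arraycopy("WAVE".getBytes(), 0, header, 8, 4);
        System.arraycopy("fmt ".getBytes(), 0, header, 12, 4);
        header[20] = 1; // format tag 1 = PCM (little-endian low byte)
        System.out.println(isPcmWav(header)); // true
    }
}
```

In practice, read the first 44 bytes of the file into the array before calling isPcmWav.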

semantic_punctuation_enabled (boolean, optional; default: false)

Specifies whether to enable semantic punctuation:

  • true: Enables semantic punctuation and disables Voice Activity Detection (VAD) punctuation.

  • false (default): Enables VAD punctuation and disables semantic punctuation.

Semantic punctuation: higher accuracy, suitable for meeting transcription. VAD punctuation: lower latency, suitable for interactive scenarios.

Use semantic_punctuation_enabled to switch punctuation methods.

Note

Set the semantic_punctuation_enabled parameter using the parameter or parameters method of the RecognitionParam instance:

Set using the parameter method

RecognitionParam param = RecognitionParam.builder()
 .model("fun-asr-realtime")
 .format("pcm")
 .sampleRate(16000)
 .parameter("semantic_punctuation_enabled", true)
 .build();

Set using the parameters method

RecognitionParam param = RecognitionParam.builder()
 .model("fun-asr-realtime")
 .format("pcm")
 .sampleRate(16000)
 .parameters(Collections.singletonMap("semantic_punctuation_enabled", true))
 .build();

max_sentence_silence (Integer, optional; default: 1300)

Silence duration threshold for VAD punctuation in ms.

When silence after speech exceeds this threshold, the sentence ends.

Range: 200-6000 ms. Default: 1300 ms.

Only applies when semantic_punctuation_enabled is false (VAD punctuation).

Note

Set the max_sentence_silence parameter using the parameter or parameters method of the RecognitionParam instance:

Set using the parameter method

RecognitionParam param = RecognitionParam.builder()
 .model("fun-asr-realtime")
 .format("pcm")
 .sampleRate(16000)
 .parameter("max_sentence_silence", 800)
 .build();

Set using the parameters method

RecognitionParam param = RecognitionParam.builder()
 .model("fun-asr-realtime")
 .format("pcm")
 .sampleRate(16000)
 .parameters(Collections.singletonMap("max_sentence_silence", 800))
 .build();

multi_threshold_mode_enabled (boolean, optional; default: false)

Prevents VAD from prematurely segmenting long sentences. Default: false.

Only applies when semantic_punctuation_enabled is false (VAD punctuation).

Note

Set the multi_threshold_mode_enabled parameter using the parameter or parameters method of the RecognitionParam instance:

Set using the parameter method

RecognitionParam param = RecognitionParam.builder()
 .model("fun-asr-realtime")
 .format("pcm")
 .sampleRate(16000)
 .parameter("multi_threshold_mode_enabled", true)
 .build();

Set using the parameters method

RecognitionParam param = RecognitionParam.builder()
 .model("fun-asr-realtime")
 .format("pcm")
 .sampleRate(16000)
 .parameters(Collections.singletonMap("multi_threshold_mode_enabled", true))
 .build();

punctuation_prediction_enabled (boolean, optional; default: true)

Specifies whether to automatically add punctuation to the recognition result:

  • true (default): Adds punctuation.

  • false: Does not add punctuation.

Note

Set the punctuation_prediction_enabled parameter using the parameter or parameters method of the RecognitionParam instance:

Set using the parameter method

RecognitionParam param = RecognitionParam.builder()
 .model("fun-asr-realtime")
 .format("pcm")
 .sampleRate(16000)
 .parameter("punctuation_prediction_enabled", false)
 .build();

Set using the parameters method

RecognitionParam param = RecognitionParam.builder()
 .model("fun-asr-realtime")
 .format("pcm")
 .sampleRate(16000)
 .parameters(Collections.singletonMap("punctuation_prediction_enabled", false))
 .build();

heartbeat (boolean, optional; default: false)

Specifies whether to maintain a persistent connection with the server:

  • true: The connection is kept alive as long as audio, including silent audio, is being sent.

  • false (default): The connection times out and closes after 60 seconds, even if silent audio is being sent continuously.

    Silent audio is audio that contains no sound signal. Generate it with editing software (Audacity, Adobe Audition) or FFmpeg.

Note

To use this field, the SDK version must be 2.19.1 or later.

Set the heartbeat parameter using the parameter or parameters method of the RecognitionParam instance:

Set using the parameter method

RecognitionParam param = RecognitionParam.builder()
 .model("fun-asr-realtime")
 .format("pcm")
 .sampleRate(16000)
 .parameter("heartbeat", true)
 .build();

Set using the parameters method

RecognitionParam param = RecognitionParam.builder()
 .model("fun-asr-realtime")
 .format("pcm")
 .sampleRate(16000)
 .parameters(Collections.singletonMap("heartbeat", true))
 .build();
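
Silent audio does not have to come from a file: in signed 16-bit PCM, all-zero samples are silence, so a zeroed buffer is a valid silent frame. A sketch that builds one 100 ms silent frame you could pass to sendAudioFrame (the 3200-byte size assumes the 16 kHz, 16-bit mono format used in the examples):

```java
import java.nio.ByteBuffer;

public class SilentFrame {
    public static void main(String[] args) {
        // 100 ms at 16 kHz, 16-bit mono: 16000 * 2 * 100 / 1000 = 3200 bytes.
        // A Java byte array is zero-initialized, so this buffer is pure silence.
        ByteBuffer silence = ByteBuffer.wrap(new byte[3200]);
        System.out.println(silence.remaining()); // 3200
        // With heartbeat set to true, periodically sending frames like this
        // keeps the WebSocket connection open during pauses in real speech:
        // recognizer.sendAudioFrame(silence);
    }
}
```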

language_hints (String[], optional; no default)

Sets the language codes for recognition. If the language is unknown in advance, leave this parameter unset and the model will identify it automatically.

The system reads only the first value in the array and ignores all other values.

Supported language codes by model:

  • fun-asr-realtime, fun-asr-realtime-2025-11-07:

    • zh: Chinese

    • en: English

    • ja: Japanese

  • fun-asr-realtime-2025-09-15:

    • zh: Chinese

    • en: English

Note

Set the language_hints parameter using the parameter or parameters method of the RecognitionParam instance:

Set using the parameter method

RecognitionParam param = RecognitionParam.builder()
 .model("fun-asr-realtime")
 .format("pcm")
 .sampleRate(16000)
 .parameter("language_hints", new String[]{"zh"})
 .build();

Set using the parameters method

RecognitionParam param = RecognitionParam.builder()
 .model("fun-asr-realtime")
 .format("pcm")
 .sampleRate(16000)
 .parameters(Collections.singletonMap("language_hints", new String[]{"zh", "en"}))
 .build();

speech_noise_threshold (float, optional; no default)

Adjusts the speech-noise detection threshold to control VAD sensitivity.

Range: [-1.0, 1.0].

Guidelines:

  • Near -1: Lowers the noise threshold — more noise may be transcribed as speech.

  • Near +1: Raises the noise threshold — some speech may be filtered out as noise.

Important

This is an advanced parameter. Adjustments can significantly affect recognition quality.

  • Test thoroughly before adjusting.

  • Make small adjustments (step size 0.1) based on your audio environment.

Note

Set the speech_noise_threshold using the parameter or parameters method of the RecognitionParam instance:

Set using the parameter method

RecognitionParam param = RecognitionParam.builder()
 .model("fun-asr-realtime")
 .format("pcm")
 .sampleRate(16000)
 .parameter("speech_noise_threshold", -0.5)
 .build();

Set using the parameters method

RecognitionParam param = RecognitionParam.builder()
 .model("fun-asr-realtime")
 .format("pcm")
 .sampleRate(16000)
 .parameters(Collections.singletonMap("speech_noise_threshold", -0.5))
 .build();

apiKey (String, optional; no default)

Your Model Studio API key. If not set, the SDK reads the API key from the DASHSCOPE_API_KEY environment variable.

Key interfaces

Recognition class

Import: com.alibaba.dashscope.audio.asr.recognition.Recognition. Key interfaces:

Each interface or method is listed with its parameters, return value, and description.

public void call(RecognitionParam param, final ResultCallback<RecognitionResult> callback)

  • param: Request parameters (RecognitionParam)

  • callback: Result callback (ResultCallback<RecognitionResult>)

None

Performs callback-based streaming recognition. Does not block the current thread.

public String call(RecognitionParam param, File file)

  • param: Request parameters (RecognitionParam)

  • file: The local audio file (File); the file must be readable

Recognition result

Performs non-streaming recognition with a local file. Blocks until the entire audio file is processed.

public Flowable<RecognitionResult> streamCall(RecognitionParam param, Flowable<ByteBuffer> audioFrame)

  • param: Request parameters (RecognitionParam)

  • audioFrame: The binary audio stream (Flowable<ByteBuffer>)

Flowable<RecognitionResult>

Performs Flowable-based streaming real-time recognition.

public void sendAudioFrame(ByteBuffer audioFrame)
  • audioFrame: The binary audio stream, of the ByteBuffer type

None

Sends an audio stream segment. Each packet should carry about 100 ms of audio and be 1-16 KB in size.

Results are retrieved via the onEvent method of the ResultCallback callback.

public void stop()

None

None

Stops recognition.

Blocks until onComplete or onError is called.

boolean getDuplexApi().close(int code, String reason)

code: The WebSocket close code

reason: The reason for closing

See The WebSocket Protocol specification for how to set these two parameters.

true

Close the WebSocket connection after the task completes, even if an exception occurred, to prevent connection leaks. See Optimize Paraformer real-time speech recognition for high concurrency for connection reuse.

public String getLastRequestId()

None

requestId

Gets the request ID. Use after calling call or streamCall.

Note

This method is available only in SDK versions 2.18.0 and later.

public long getFirstPackageDelay()

None

First-packet latency

Gets first-packet latency (delay from sending first audio packet to receiving first result). Use after task completion.

Note

This method is available only in SDK versions 2.18.0 and later.

public long getLastPackageDelay()

None

Last-packet latency

Gets last-packet latency (time from stop instruction to last result delivery). Use after task completion.

Note

This method is available only in SDK versions 2.18.0 and later.
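Putting the interfaces above together, the following is a hedged sketch of a full callback-based streaming session: start with call, pace audio in with sendAudioFrame, finish with stop, then close the connection. The file path, chunking, and pacing are illustrative, and the DashScope classes are assumed to be on the classpath:

```java
import com.alibaba.dashscope.audio.asr.recognition.Recognition;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionResult;
import com.alibaba.dashscope.common.ResultCallback;

import java.io.FileInputStream;
import java.nio.ByteBuffer;

public class StreamingSketch {
    public static void main(String[] args) throws Exception {
        RecognitionParam param = RecognitionParam.builder()
                .model("fun-asr-realtime")
                .format("pcm")
                .sampleRate(16000)
                .build();

        Recognition recognizer = new Recognition();
        recognizer.call(param, new ResultCallback<RecognitionResult>() {
            @Override
            public void onEvent(RecognitionResult result) {
                // Intermediate and final results arrive here.
                System.out.println(result.getSentence().getText());
            }
            @Override
            public void onComplete() { System.out.println("Task complete"); }
            @Override
            public void onError(Exception e) { e.printStackTrace(); }
        });

        // ~100 ms of 16 kHz, 16-bit, mono PCM per packet: 16000 * 2 * 0.1 = 3200 bytes.
        byte[] chunk = new byte[3200];
        try (FileInputStream in = new FileInputStream("audio.pcm")) { // illustrative path
            int n;
            while ((n = in.read(chunk)) > 0) {
                recognizer.sendAudioFrame(ByteBuffer.wrap(chunk, 0, n));
                Thread.sleep(100); // pace sending at roughly real time
            }
        }
        recognizer.stop(); // blocks until onComplete or onError fires
        System.out.println("requestId: " + recognizer.getLastRequestId());
        recognizer.getDuplexApi().close(1000, "bye"); // release the WebSocket
    }
}
```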

Callback interface (ResultCallback)

In bidirectional streaming calls, the server returns results via callbacks. Implement callback methods to handle returned data.

Implement callbacks by inheriting the ResultCallback abstract class. Specify the generic type as RecognitionResult. The RecognitionResult object encapsulates server-returned data.

Because the Java SDK supports connection reuse, the callback has no onOpen or onClose methods.

Example

ResultCallback<RecognitionResult> callback = new ResultCallback<RecognitionResult>() {
    @Override
    public void onEvent(RecognitionResult result) {
        System.out.println("RequestId is: " + result.getRequestId());
        // Implement the logic to process the speech recognition result here.
    }

    @Override
    public void onComplete() {
        System.out.println("Task complete");
    }

    @Override
    public void onError(Exception e) {
        System.out.println("Task failed: " + e.getMessage());
    }
};

Interface/Method

Parameters

Return value

Description

public void onEvent(RecognitionResult result)

result: Real-time recognition result (RecognitionResult)

None

Called when the server returns a recognition result.

public void onComplete()

None

None

Called when the recognition task completes successfully.

public void onError(Exception e)

e: Exception information

None

Called when an error occurs during recognition.

Response results

Real-time recognition result (RecognitionResult)

RecognitionResult represents a single recognition result.

Interface/Method

Parameters

Return value

Description

public String getRequestId()

None

requestId

Gets the request ID.

public boolean isSentenceEnd()

None

Whether sentence segmentation has completed

Returns whether the current sentence has ended (final result).

public Sentence getSentence()

None

Sentence

Gets sentence information, including timestamps and text.

Sentence information (Sentence)

Interface/Method

Parameters

Return value

Description

public Long getBeginTime()

None

Sentence start time, in ms

Returns the sentence start time.

public Long getEndTime()

None

Sentence end time, in ms

Returns the sentence end time.

public String getText()

None

Recognized text

Returns the recognized text.

public List<Word> getWords()

None

A List of word timestamp information (List<Word>)

Returns word timestamp information.

Word timestamp information (Word)

Interface/Method

Parameters

Return value

Description

public long getBeginTime()

None

Word start time, in ms

Returns the word start time.

public long getEndTime()

None

Word end time, in ms

Returns the word end time.

public String getText()

None

Recognized word

Returns the recognized word.

public String getPunctuation()

None

Punctuation

Returns the punctuation.
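As a sketch of how the accessors on RecognitionResult, Sentence, and Word compose inside onEvent (a plain illustration, assuming the Sentence and Word classes from the recognition package are imported; the output format is arbitrary):

```java
// Illustrative handling of a result inside ResultCallback.onEvent.
public void onEvent(RecognitionResult result) {
    if (result.isSentenceEnd()) {
        // Final result for this sentence: print sentence and word timestamps.
        Sentence sentence = result.getSentence();
        System.out.printf("[%d ms - %d ms] %s%n",
                sentence.getBeginTime(), sentence.getEndTime(), sentence.getText());
        if (sentence.getWords() != null) {
            for (Word word : sentence.getWords()) {
                System.out.printf("  %s%s (%d-%d ms)%n",
                        word.getText(), word.getPunctuation(),
                        word.getBeginTime(), word.getEndTime());
            }
        }
    } else {
        // Intermediate (partial) result for the sentence in progress.
        System.out.println("partial: " + result.getSentence().getText());
    }
}
```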

Error codes

If an error occurs, see Error messages for troubleshooting.

If the problem persists, join the developer group to report the issue. Include the request ID to help us investigate.

FAQ

Features

Q: How do I maintain a persistent connection with the server during long periods of silence?

A: Set heartbeat to true and continuously send silent audio.

Silent audio contains no sound signal. Generate it with audio editing software (such as Audacity or Adobe Audition) or command-line tools (such as FFmpeg).
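For 16-bit PCM, silence is simply zero-valued samples, so heartbeat frames can also be generated in code rather than read from a silence file. A minimal sketch, assuming 16 kHz, 16-bit, mono PCM and the ~100 ms packet guideline from the sendAudioFrame documentation:

```java
import java.nio.ByteBuffer;

public class SilenceFrame {
    // 100 ms of 16 kHz, 16-bit (2-byte), mono PCM: 16000 / 10 * 2 = 3200 bytes.
    static final int BYTES_PER_100MS = 16000 / 10 * 2;

    // A zero-filled buffer is valid PCM silence. While the microphone is idle
    // (and heartbeat is true), send one of these roughly every 100 ms via
    // recognizer.sendAudioFrame(...) to keep the connection alive.
    static ByteBuffer silentFrame() {
        return ByteBuffer.wrap(new byte[BYTES_PER_100MS]); // all zeros = silence
    }

    public static void main(String[] args) {
        ByteBuffer frame = silentFrame();
        System.out.println(frame.remaining()); // 3200
    }
}
```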

Q: How do I convert an audio format to a supported format?

A: Use FFmpeg to convert audio files. See the official FFmpeg website for details.

# Basic conversion command (universal template)
# -i, function: input file path, example: audio.wav
# -c:a, function: audio encoder, examples: aac, libmp3lame, pcm_s16le
# -b:a, function: bit rate (audio quality control), examples: 192k, 320k
# -ar, function: sample rate, examples: 44100 (CD), 48000, 16000
# -ac, function: number of sound channels, examples: 1 (mono), 2 (stereo)
# -y, function: overwrite existing file (no value needed)
ffmpeg -i input_audio.ext -c:a encoder_name -b:a bit_rate -ar sample_rate -ac number_of_channels output.ext

# Example: WAV → MP3 (preserve original quality)
ffmpeg -i input.wav -c:a libmp3lame -q:a 0 output.mp3
# Example: MP3 → WAV (16-bit PCM standard format)
ffmpeg -i input.mp3 -c:a pcm_s16le -ar 44100 -ac 2 output.wav
# Example: M4A → AAC (extract/convert Apple audio)
ffmpeg -i input.m4a -c:a copy output.aac  # Directly extract without re-encoding
ffmpeg -i input.m4a -c:a aac -b:a 256k output.aac  # Re-encode to improve quality
# Example: FLAC lossless → Opus (high compression)
ffmpeg -i input.flac -c:a libopus -b:a 128k -vbr on output.opus

Q: How do I recognize a local file (recorded audio file)?

A: There are two ways to recognize a local file:

  • Synchronous recognition: pass the file to the blocking call(RecognitionParam param, File file) method, which returns the full result after the entire file is processed.

  • Simulated real-time recognition: read the file in segments and send each segment with sendAudioFrame (or as a Flowable<ByteBuffer> through streamCall), receiving results as you would for a live audio stream.
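One way is the blocking call(RecognitionParam, File) overload documented in the Recognition class above. A minimal sketch (the file path and format are illustrative; the DashScope classes are assumed to be on the classpath):

```java
import com.alibaba.dashscope.audio.asr.recognition.Recognition;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;

import java.io.File;

public class FileRecognitionSketch {
    public static void main(String[] args) throws Exception {
        RecognitionParam param = RecognitionParam.builder()
                .model("fun-asr-realtime")
                .format("wav") // must match the file's actual encoding
                .sampleRate(16000)
                .build();

        Recognition recognizer = new Recognition();
        // Blocks until the whole file has been processed, then returns the result.
        String result = recognizer.call(param, new File("asr_example.wav")); // illustrative path
        System.out.println(result);
    }
}
```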

Troubleshooting

Q: Why is speech not recognized (no recognition result)?

  1. Check whether format and sampleRate/sample_rate in request parameters are correct and comply with constraints. Common errors:

The audio file has a .wav file name extension but is actually in MP3 format, and the format request parameter is set to wav based on the extension.

    • The actual audio sample rate is 3600 Hz, but the sampleRate or sample_rate request parameter is set to 48000.

    Use ffprobe to get audio info (container, encoding, sample rate, channels):

    ffprobe -v error -show_entries format=format_name -show_entries stream=codec_name,sample_rate,channels -of default=noprint_wrappers=1 input.xxx
  2. Check if language_hints matches the actual audio language.

    For example, the audio is in Chinese, but language_hints is set to en (English).