Alibaba Cloud Model Studio: Paraformer real-time speech recognition Java SDK

Last Updated: Dec 17, 2025

This topic describes the parameters and interfaces of the Paraformer real-time speech recognition Java SDK.

Important

This document applies only to the China (Beijing) region. To use the model, you must use an API key from the China (Beijing) region.

User guide: For model descriptions and selection guidance, see Real-time speech recognition - Fun-ASR/Paraformer.

Prerequisites

  • You have activated Alibaba Cloud Model Studio and created an API key. To prevent security risks, export the API key as an environment variable instead of hard-coding it in your code.

    Note

    To grant temporary access permissions to third-party applications or users, or if you want to strictly control high-risk operations such as accessing or deleting sensitive data, we recommend that you use a temporary authentication token.

    Compared with long-term API keys, temporary authentication tokens are more secure because they are short-lived (60 seconds). They are suitable for temporary call scenarios and can effectively reduce the risk of API key leakage.

    To use a temporary token, replace the API key used for authentication in your code with the temporary authentication token.

  • Install the latest version of the DashScope SDK.
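
    For reference, a minimal Maven dependency sketch is shown below. The group and artifact IDs are the SDK's published coordinates; the version is a placeholder that you should replace with the latest release.

    <dependency>
        <groupId>com.alibaba</groupId>
        <artifactId>dashscope-sdk-java</artifactId>
        <!-- Placeholder version: use the latest DashScope Java SDK release. -->
        <version>the-latest-version</version>
    </dependency>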

Model list

paraformer-realtime-v2

  • Scenarios: Live streaming, meetings, and similar scenarios.

  • Sample rate: Any.

  • Languages: Chinese (including Mandarin and various dialects), English, Japanese, Korean, German, French, and Russian. Supported Chinese dialects: Shanghainese, Wu, Minnan, Northeastern, Gansu, Guizhou, Henan, Hubei, Hunan, Jiangxi, Ningxia, Shanxi, Shaanxi, Shandong, Sichuan, Tianjin, Yunnan, and Cantonese.

  • Punctuation prediction: ✅ Supported by default. No configuration is required.

  • Inverse Text Normalization (ITN): ✅ Supported by default. No configuration is required.

  • Custom hotwords: ✅ For more information, see Customize hotwords.

  • Specify recognition language: ✅ Specified by the language_hints parameter.

paraformer-realtime-8k-v2

  • Scenarios: Recognition of 8 kHz audio, such as telephone customer service and voicemail.

  • Sample rate: 8 kHz.

  • Languages: Chinese.

  • Punctuation prediction: ✅ Supported by default. No configuration is required.

  • Inverse Text Normalization (ITN): ✅ Supported by default. No configuration is required.

  • Custom hotwords: ✅ For more information, see Customize hotwords.

  • Emotion recognition: ✅ Supported, with the following constraints:

      • Applies only to the paraformer-realtime-8k-v2 model.

      • You must disable semantic sentence segmentation. This is controlled by the semantic_punctuation_enabled request parameter. Semantic sentence segmentation is disabled by default.

      • The emotion recognition result is displayed only when the isSentenceEnd method of the real-time recognition result (RecognitionResult) returns true.

To get the emotion recognition results, call the getEmoTag and getEmoConfidence methods of the Sentence information (Sentence) object. These methods return the emotion and confidence level for the current sentence.
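
A minimal sketch of reading emotion results in a callback is shown below. It assumes the paraformer-realtime-8k-v2 model with 8 kHz WAV audio and the default VAD sentence segmentation; audio must still be streamed with sendAudioFrame as in the streaming examples later in this topic, and the class name is illustrative.

import com.alibaba.dashscope.audio.asr.recognition.Recognition;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionResult;
import com.alibaba.dashscope.common.ResultCallback;

public class EmotionExample {
    public static void main(String[] args) {
        // Emotion recognition requires paraformer-realtime-8k-v2 and 8 kHz audio.
        // Keep semantic sentence segmentation disabled (it is disabled by default).
        RecognitionParam param = RecognitionParam.builder()
                .model("paraformer-realtime-8k-v2")
                .format("wav")
                .sampleRate(8000)
                .build();

        ResultCallback<RecognitionResult> callback = new ResultCallback<RecognitionResult>() {
            @Override
            public void onEvent(RecognitionResult result) {
                // Emotion fields are populated only for complete sentences.
                if (result.isSentenceEnd()) {
                    System.out.println("Text: " + result.getSentence().getText());
                    System.out.println("Emotion: " + result.getSentence().getEmoTag()
                            + ", confidence: " + result.getSentence().getEmoConfidence());
                }
            }

            @Override
            public void onComplete() {
                System.out.println("Recognition complete");
            }

            @Override
            public void onError(Exception e) {
                System.out.println("RecognitionCallback error: " + e.getMessage());
            }
        };

        Recognition recognizer = new Recognition();
        try {
            recognizer.call(param, callback);
            // Stream 8 kHz audio with sendAudioFrame(...) here, exactly as in the
            // streaming examples below, and then stop.
            recognizer.stop();
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // Close the WebSocket connection after the task is complete.
            recognizer.getDuplexApi().close(1000, "bye");
        }
    }
}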

Getting started

The Recognition class provides interfaces for synchronous and streaming calls. Select a method based on your requirements:

  • Synchronous call: Recognizes a local file and returns the complete result in a single response. Suitable for processing pre-recorded audio.

  • Streaming call: Recognizes an audio stream directly and outputs the results in real time. The audio stream can originate from an external device, such as a microphone, or be read from a local file. Suitable for scenarios that require immediate feedback.

Synchronous call

Submit a real-time speech-to-text task for a local file and receive the transcription result in a single, synchronous response. This is a blocking operation.


Instantiate the Recognition class and call the call method with the request parameters and the file to be recognized. This action performs the recognition and returns the result.


The audio file used in the example is: asr_example.wav.

import com.alibaba.dashscope.audio.asr.recognition.Recognition;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;

import java.io.File;

public class Main {
    public static void main(String[] args) {
        // Create a Recognition instance.
        Recognition recognizer = new Recognition();
        // Create a RecognitionParam.
        RecognitionParam param =
                RecognitionParam.builder()
                        // If you have not set the API key as an environment variable, uncomment the following line and replace yourApikey with your API key.
                        // .apiKey("yourApikey")
                        .model("paraformer-realtime-v2")
                        .format("wav")
                        .sampleRate(16000)
                        // The "language_hints" parameter is supported only by the paraformer-realtime-v2 model.
                        .parameter("language_hints", new String[]{"zh", "en"})
                        .build();

        try {
            System.out.println("Recognition result: " + recognizer.call(param, new File("asr_example.wav")));
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // Close the WebSocket connection after the task is complete.
            recognizer.getDuplexApi().close(1000, "bye");
        }
        System.out.println(
                "[Metric] requestId: "
                        + recognizer.getLastRequestId()
                        + ", first package delay ms: "
                        + recognizer.getFirstPackageDelay()
                        + ", last package delay ms: "
                        + recognizer.getLastPackageDelay());
        System.exit(0);
    }
}

Streaming call: Based on callbacks

You can submit a real-time speech-to-text task and receive streaming recognition results by implementing a callback interface.

  1. Start streaming speech recognition

    Instantiate the Recognition class and call the call method with the request parameters and the callback interface (ResultCallback) to start streaming speech recognition.

  2. Stream audio

    Repeatedly call the sendAudioFrame method of the Recognition class to send segments of a binary audio stream to the server. The audio stream can be read from a local file or a device, such as a microphone.

    While you are sending audio data, the server returns recognition results to the client in real time through the onEvent method of the callback interface (ResultCallback).

    We recommend that you send each audio segment with a duration of about 100 ms and a data size between 1 KB and 16 KB. For example, 100 ms of 16 kHz, 16-bit mono PCM is 3,200 bytes, which is the chunk size used in the local-file example below.

  3. End processing

    Call the stop method of the Recognition class to stop speech recognition.

    This method blocks the current thread until the onComplete or onError callback of the callback interface (ResultCallback) is triggered.


Recognize speech from a microphone

import com.alibaba.dashscope.audio.asr.recognition.Recognition;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionResult;
import com.alibaba.dashscope.common.ResultCallback;

import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.TargetDataLine;

import java.nio.ByteBuffer;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class Main {
    public static void main(String[] args) throws InterruptedException {
        ExecutorService executorService = Executors.newSingleThreadExecutor();
        executorService.submit(new RealtimeRecognitionTask());
        executorService.shutdown();
        executorService.awaitTermination(1, TimeUnit.MINUTES);
        System.exit(0);
    }
}

class RealtimeRecognitionTask implements Runnable {
    @Override
    public void run() {
        RecognitionParam param = RecognitionParam.builder()
                // If you have not set the API key as an environment variable, uncomment the following line and replace yourApikey with your API key.
                // .apiKey("yourApikey")
                .model("paraformer-realtime-v2")
                // The microphone provides raw PCM samples, so use the pcm format.
                .format("pcm")
                .sampleRate(16000)
                // The "language_hints" parameter is supported only by the paraformer-realtime-v2 model.
                .parameter("language_hints", new String[]{"zh", "en"})
                .build();
        Recognition recognizer = new Recognition();

        ResultCallback<RecognitionResult> callback = new ResultCallback<RecognitionResult>() {
            @Override
            public void onEvent(RecognitionResult result) {
                if (result.isSentenceEnd()) {
                    System.out.println("Final Result: " + result.getSentence().getText());
                } else {
                    System.out.println("Intermediate Result: " + result.getSentence().getText());
                }
            }

            @Override
            public void onComplete() {
                System.out.println("Recognition complete");
            }

            @Override
            public void onError(Exception e) {
                System.out.println("RecognitionCallback error: " + e.getMessage());
            }
        };
        try {
            recognizer.call(param, callback);
            // Create an audio format.
            AudioFormat audioFormat = new AudioFormat(16000, 16, 1, true, false);
            // Match the default recording device based on the format.
            TargetDataLine targetDataLine =
                    AudioSystem.getTargetDataLine(audioFormat);
            targetDataLine.open(audioFormat);
            // Start recording.
            targetDataLine.start();
            ByteBuffer buffer = ByteBuffer.allocate(1024);
            long start = System.currentTimeMillis();
            // Record for 50s and perform real-time transcription.
            while (System.currentTimeMillis() - start < 50000) {
                int read = targetDataLine.read(buffer.array(), 0, buffer.capacity());
                if (read > 0) {
                    buffer.limit(read);
                    // Send the recorded audio data to the streaming recognition service.
                    recognizer.sendAudioFrame(buffer);
                    buffer = ByteBuffer.allocate(1024);
                    // The recording rate is limited. Sleep for a short period to prevent high CPU usage.
                    Thread.sleep(20);
                }
            }
            recognizer.stop();
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // Close the WebSocket connection after the task is complete.
            recognizer.getDuplexApi().close(1000, "bye");
        }

        System.out.println(
                "[Metric] requestId: "
                        + recognizer.getLastRequestId()
                        + ", first package delay ms: "
                        + recognizer.getFirstPackageDelay()
                        + ", last package delay ms: "
                        + recognizer.getLastPackageDelay());
    }
}

Recognize a local audio file

The audio file used in the example is: asr_example.wav.

import com.alibaba.dashscope.audio.asr.recognition.Recognition;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionResult;
import com.alibaba.dashscope.common.ResultCallback;

import java.io.FileInputStream;
import java.nio.ByteBuffer;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class TimeUtils {
    private static final DateTimeFormatter formatter =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS");

    public static String getTimestamp() {
        return LocalDateTime.now().format(formatter);
    }
}

public class Main {
    public static void main(String[] args) throws InterruptedException {
        ExecutorService executorService = Executors.newSingleThreadExecutor();
        executorService.submit(new RealtimeRecognitionTask(Paths.get(System.getProperty("user.dir"), "asr_example.wav")));
        executorService.shutdown();

        // wait for all tasks to complete
        executorService.awaitTermination(1, TimeUnit.MINUTES);
        System.exit(0);
    }
}

class RealtimeRecognitionTask implements Runnable {
    private Path filepath;

    public RealtimeRecognitionTask(Path filepath) {
        this.filepath = filepath;
    }

    @Override
    public void run() {
        RecognitionParam param = RecognitionParam.builder()
                // If you have not set the API key as an environment variable, uncomment the following line and replace yourApikey with your API key.
                // .apiKey("yourApikey")
                .model("paraformer-realtime-v2")
                .format("wav")
                .sampleRate(16000)
                // The "language_hints" parameter is supported only by the paraformer-realtime-v2 model.
                .parameter("language_hints", new String[]{"zh", "en"})
                .build();
        Recognition recognizer = new Recognition();

        String threadName = Thread.currentThread().getName();

        ResultCallback<RecognitionResult> callback = new ResultCallback<RecognitionResult>() {
            @Override
            public void onEvent(RecognitionResult message) {
                if (message.isSentenceEnd()) {

                    System.out.println(TimeUtils.getTimestamp()+" "+
                            "[process " + threadName + "] Final Result:" + message.getSentence().getText());
                } else {
                    System.out.println(TimeUtils.getTimestamp()+" "+
                            "[process " + threadName + "] Intermediate Result: " + message.getSentence().getText());
                }
            }

            @Override
            public void onComplete() {
                System.out.println(TimeUtils.getTimestamp()+" "+"[" + threadName + "] Recognition complete");
            }

            @Override
            public void onError(Exception e) {
                System.out.println(TimeUtils.getTimestamp()+" "+
                        "[" + threadName + "] RecognitionCallback error: " + e.getMessage());
            }
        };

        try {
            recognizer.call(param, callback);
            // Please replace the path with your audio file path.
            System.out.println(TimeUtils.getTimestamp()+" "+"[" + threadName + "] Input file_path is: " + this.filepath);
            // Read file and send audio by chunks.
            FileInputStream fis = new FileInputStream(this.filepath.toFile());
            // Send the audio in 100 ms chunks: 3,200 bytes of 16-bit mono PCM at a 16 kHz sample rate.
            byte[] buffer = new byte[3200];
            int bytesRead;
            // Loop to read chunks of the file.
            while ((bytesRead = fis.read(buffer)) != -1) {
                ByteBuffer byteBuffer;
                // Handle the last chunk which might be smaller than the buffer size.
                System.out.println(TimeUtils.getTimestamp()+" "+"[" + threadName + "] bytesRead: " + bytesRead);
                if (bytesRead < buffer.length) {
                    byteBuffer = ByteBuffer.wrap(buffer, 0, bytesRead);
                } else {
                    byteBuffer = ByteBuffer.wrap(buffer);
                }

                recognizer.sendAudioFrame(byteBuffer);
                buffer = new byte[3200];
                Thread.sleep(100);
            }
            System.out.println(TimeUtils.getTimestamp()+" "+LocalDateTime.now());
            recognizer.stop();
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // Close the WebSocket connection after the task is complete.
            recognizer.getDuplexApi().close(1000, "bye");
        }

        System.out.println(
                "["
                        + threadName
                        + "][Metric] requestId: "
                        + recognizer.getLastRequestId()
                        + ", first package delay ms: "
                        + recognizer.getFirstPackageDelay()
                        + ", last package delay ms: "
                        + recognizer.getLastPackageDelay());
    }
}

Streaming call: Based on Flowable

You can submit a real-time speech-to-text task and receive streaming recognition results through a Flowable-based reactive stream.

Flowable is a class in the RxJava library that represents a reactive stream with backpressure support; the code in this example uses io.reactivex.Flowable. For more information, see the Flowable API reference.


Directly call the streamCall method of the Recognition class to start recognition.

The streamCall method returns a Flowable<RecognitionResult> instance. Call methods of the Flowable instance, such as blockingForEach and subscribe, to process the recognition results. The results are encapsulated in RecognitionResult.

The streamCall method requires two parameters:

  • A RecognitionParam instance (request parameters): Use this instance to set parameters for speech recognition, such as the model, sample rate, and audio format.

  • A Flowable<ByteBuffer> instance: Create an instance of the Flowable<ByteBuffer> type and implement a method within the instance to parse the audio stream.

import com.alibaba.dashscope.audio.asr.recognition.Recognition;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;
import com.alibaba.dashscope.exception.NoApiKeyException;
import io.reactivex.BackpressureStrategy;
import io.reactivex.Flowable;

import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.TargetDataLine;
import java.nio.ByteBuffer;

public class Main {
    public static void main(String[] args) throws NoApiKeyException {
        // Create a Flowable<ByteBuffer>.
        Flowable<ByteBuffer> audioSource =
                Flowable.create(
                        emitter -> {
                            new Thread(
                                    () -> {
                                        try {
                                            // Create an audio format.
                                            AudioFormat audioFormat = new AudioFormat(16000, 16, 1, true, false);
                                            // Match the default recording device based on the format.
                                            TargetDataLine targetDataLine =
                                                    AudioSystem.getTargetDataLine(audioFormat);
                                            targetDataLine.open(audioFormat);
                                            // Start recording.
                                            targetDataLine.start();
                                            ByteBuffer buffer = ByteBuffer.allocate(1024);
                                            long start = System.currentTimeMillis();
                                            // Record for 50s and perform real-time transcription.
                                            while (System.currentTimeMillis() - start < 50000) {
                                                int read = targetDataLine.read(buffer.array(), 0, buffer.capacity());
                                                if (read > 0) {
                                                    buffer.limit(read);
                                                    // Emit the recorded audio data; streamCall forwards it to the recognition service.
                                                    emitter.onNext(buffer);
                                                    buffer = ByteBuffer.allocate(1024);
                                                    // The recording rate is limited. Sleep for a short period to prevent high CPU usage.
                                                    Thread.sleep(20);
                                                }
                                            }
                                            // Notify that the transcription is complete.
                                            emitter.onComplete();
                                        } catch (Exception e) {
                                            emitter.onError(e);
                                        }
                                    })
                                    .start();
                        },
                        BackpressureStrategy.BUFFER);

        // Create a Recognizer.
        Recognition recognizer = new Recognition();
        // Create a RecognitionParam. The Flowable<ByteBuffer> created above is passed to streamCall later.
        RecognitionParam param = RecognitionParam.builder()
                // If you have not set the API key as an environment variable, uncomment the following line and replace yourApikey with your API key.
                // .apiKey("yourApikey")
                .model("paraformer-realtime-v2")
                .format("pcm")
                .sampleRate(16000)
                // The "language_hints" parameter is supported only by the paraformer-realtime-v2 model.
                .parameter("language_hints", new String[]{"zh", "en"})
                .build();

        // Call the streaming interface.
        recognizer
                .streamCall(param, audioSource)
                .blockingForEach(
                        result -> {
                            // Subscribe to the output result.
                            if (result.isSentenceEnd()) {
                                System.out.println("Final Result: " + result.getSentence().getText());
                            } else {
                                System.out.println("Intermediate Result: " + result.getSentence().getText());
                            }
                        });
        // Close the WebSocket connection after the task is complete.
        recognizer.getDuplexApi().close(1000, "bye");
        System.out.println(
                "[Metric] requestId: "
                        + recognizer.getLastRequestId()
                        + ", first package delay ms: "
                        + recognizer.getFirstPackageDelay()
                        + ", last package delay ms: "
                        + recognizer.getLastPackageDelay());
        System.exit(0);
    }
}

High-concurrency calls

The DashScope Java SDK uses OkHttp3 connection pooling to reduce the overhead of repeatedly establishing connections. For more information, see Real-time speech recognition in high-concurrency scenarios.

Request parameters

Use the chained methods of RecognitionParam to configure parameters such as the model, sample rate, and audio format. Pass the configured parameter object to the call or streamCall method of the Recognition class.


RecognitionParam param = RecognitionParam.builder()
  .model("paraformer-realtime-v2")
  .format("pcm")
  .sampleRate(16000)
  // The "language_hints" parameter is supported only by the paraformer-realtime-v2 model.
  .parameter("language_hints", new String[]{"zh", "en"})
  .build();

Each parameter below is listed with its type, whether it is required, and its default value, followed by a description.

model (String, required)

The model for real-time speech recognition. For more information, see Model list.

sampleRate (Integer, required)

The audio sampling rate in Hz.

This parameter varies by model:

  • paraformer-realtime-v2 supports any sample rate.

  • paraformer-realtime-8k-v2 supports only an 8000 Hz sample rate.

format (String, required)

The format of the audio to be recognized.

Supported audio formats: pcm, wav, mp3, opus, speex, aac, and amr.

Important

opus/speex: Must be encapsulated in Ogg.

wav: Must be PCM encoded.

amr: Only the AMR-NB type is supported.

vocabularyId (String, optional)

The ID of the hotword vocabulary. Use this parameter to set the hotword vocabulary ID for v2 and later models; it takes effect only when set. The hotword information associated with this ID is applied to the speech recognition request. For more information, see Custom vocabulary.

disfluencyRemovalEnabled (boolean, optional, default: false)

Specifies whether to filter out disfluent words:

  • true

  • false (default)
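
A minimal sketch of setting these two parameters is shown below. It assumes that, like model, format, and sampleRate, the camelCase parameters vocabularyId and disfluencyRemovalEnabled are exposed as chained setters on RecognitionParam.builder(); the vocabulary ID is a placeholder.

RecognitionParam param = RecognitionParam.builder()
  .model("paraformer-realtime-v2")
  .format("pcm")
  .sampleRate(16000)
  // Apply a previously created hotword vocabulary (placeholder ID).
  .vocabularyId("vocab-xxxx")
  // Filter out disfluent words.
  .disfluencyRemovalEnabled(true)
  .build();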

language_hints (String[], optional, default: ["zh", "en"])

The language code of the language to be recognized. If you cannot determine the language in advance, you can leave this parameter unset. The model automatically detects the language.

Currently supported language codes:

  • zh: Chinese

  • en: English

  • ja: Japanese

  • yue: Cantonese

  • ko: Korean

  • de: German

  • fr: French

  • ru: Russian

This parameter applies only to models that support multiple languages. For more information, see Model list.

Note

Set language_hints using the parameter or parameters method of the RecognitionParam instance:

Set using the parameter method

RecognitionParam param = RecognitionParam.builder()
 .model("paraformer-realtime-v2")
 .format("pcm")
 .sampleRate(16000)
 .parameter("language_hints", new String[]{"zh", "en"})
 .build();

Set using the parameters method

RecognitionParam param = RecognitionParam.builder()
 .model("paraformer-realtime-v2")
 .format("pcm")
 .sampleRate(16000)
 .parameters(Collections.singletonMap("language_hints", new String[]{"zh", "en"}))
 .build();

semantic_punctuation_enabled (boolean, optional, default: false)

Specifies whether to enable semantic sentence segmentation. This feature is disabled by default.

  • true: Enables semantic sentence segmentation and disables Voice Activity Detection (VAD) sentence segmentation.

  • false (default): Enables VAD sentence segmentation and disables semantic sentence segmentation.

Semantic sentence segmentation provides higher accuracy and is suitable for meeting transcription scenarios. VAD sentence segmentation has lower latency and is suitable for interactive scenarios.

By adjusting the semantic_punctuation_enabled parameter, you can flexibly switch the sentence segmentation method to suit different scenarios.

This parameter is effective only for v2 and later models.

Note

Set semantic_punctuation_enabled using the parameter or parameters method of the RecognitionParam instance:

Set using the parameter method

RecognitionParam param = RecognitionParam.builder()
 .model("paraformer-realtime-v2")
 .format("pcm")
 .sampleRate(16000)
 .parameter("semantic_punctuation_enabled", true)
 .build();

Set using the parameters method

RecognitionParam param = RecognitionParam.builder()
 .model("paraformer-realtime-v2")
 .format("pcm")
 .sampleRate(16000)
 .parameters(Collections.singletonMap("semantic_punctuation_enabled", true))
 .build();

max_sentence_silence (Integer, optional, default: 800)

The silence duration threshold for VAD sentence segmentation, in ms.

If the silence duration after a speech segment exceeds this threshold, the system determines that the sentence has ended.

The parameter ranges from 200 ms to 6000 ms. The default value is 800 ms.

This parameter is effective only when the semantic_punctuation_enabled parameter is `false` (VAD segmentation) and the model is v2 or later.

Note

Set max_sentence_silence using the parameter or parameters method of the RecognitionParam instance:

Set using the parameter method

RecognitionParam param = RecognitionParam.builder()
 .model("paraformer-realtime-v2")
 .format("pcm")
 .sampleRate(16000)
 .parameter("max_sentence_silence", 800)
 .build();

Set using the parameters method

RecognitionParam param = RecognitionParam.builder()
 .model("paraformer-realtime-v2")
 .format("pcm")
 .sampleRate(16000)
 .parameters(Collections.singletonMap("max_sentence_silence", 800))
 .build();

multi_threshold_mode_enabled (boolean, optional, default: false)

If this parameter is set to `true`, it prevents VAD from segmenting sentences that are too long. This feature is disabled by default.

This parameter is effective only when the semantic_punctuation_enabled parameter is `false` (VAD segmentation) and the model is v2 or later.

Note

Set multi_threshold_mode_enabled using the parameter or parameters method of the RecognitionParam instance:

Set using the parameter method

RecognitionParam param = RecognitionParam.builder()
 .model("paraformer-realtime-v2")
 .format("pcm")
 .sampleRate(16000)
 .parameter("multi_threshold_mode_enabled", true)
 .build();

Set using the parameters method

RecognitionParam param = RecognitionParam.builder()
 .model("paraformer-realtime-v2")
 .format("pcm")
 .sampleRate(16000)
 .parameters(Collections.singletonMap("multi_threshold_mode_enabled", true))
 .build();

punctuation_prediction_enabled (boolean, optional, default: true)

Specifies whether to automatically add punctuation to the recognition results:

  • true (default)

  • false

This parameter is effective only for v2 and later models.

Note

Set punctuation_prediction_enabled using the parameter or parameters method of the RecognitionParam instance:

Set using the parameter method

RecognitionParam param = RecognitionParam.builder()
 .model("paraformer-realtime-v2")
 .format("pcm")
 .sampleRate(16000)
 .parameter("punctuation_prediction_enabled", false)
 .build();

Set using the parameters method

RecognitionParam param = RecognitionParam.builder()
 .model("paraformer-realtime-v2")
 .format("pcm")
 .sampleRate(16000)
 .parameters(Collections.singletonMap("punctuation_prediction_enabled", false))
 .build();

heartbeat (boolean, optional, default: false)

Controls whether to maintain a persistent connection with the server:

  • true: Maintains the connection with the server without interruption when you continuously send silent audio.

  • false (default): Even if silent audio is continuously sent, the connection will disconnect after 60 seconds due to a timeout.

    Silent audio refers to content in an audio file or data stream that has no sound signal. You can generate silent audio using various methods, such as using audio editing software such as Audacity or Adobe Audition, or command-line tools such as FFmpeg.

This parameter is effective only for v2 and later models.

Note

To use this field, your SDK version must be 2.19.1 or later.

Set heartbeat using the parameter or parameters method of the RecognitionParam instance:

Set using the parameter method

RecognitionParam param = RecognitionParam.builder()
 .model("paraformer-realtime-v2")
 .format("pcm")
 .sampleRate(16000)
 .parameter("heartbeat", true)
 .build();

Set using the parameters method

RecognitionParam param = RecognitionParam.builder()
 .model("paraformer-realtime-v2")
 .format("pcm")
 .sampleRate(16000)
 .parameters(Collections.singletonMap("heartbeat", true))
 .build();

inverse_text_normalization_enabled (boolean, optional, default: true)

Specifies whether to enable Inverse Text Normalization (ITN).

This feature is enabled by default (`true`). When enabled, Chinese numerals are converted to Arabic numerals.

This parameter is effective only for v2 and later models.

Note

Set inverse_text_normalization_enabled using the parameter or parameters method of the RecognitionParam instance:

Set using the parameter method

RecognitionParam param = RecognitionParam.builder()
 .model("paraformer-realtime-v2")
 .format("pcm")
 .sampleRate(16000)
 .parameter("inverse_text_normalization_enabled", false)
 .build();

Set using the parameters method

RecognitionParam param = RecognitionParam.builder()
 .model("paraformer-realtime-v2")
 .format("pcm")
 .sampleRate(16000)
 .parameters(Collections.singletonMap("inverse_text_normalization_enabled", false))
 .build();

apiKey (String, optional)

Your API key.

Key interfaces

Recognition class

Import the Recognition class using the statement import com.alibaba.dashscope.audio.asr.recognition.Recognition;. Its key interfaces are:

public void call(RecognitionParam param, final ResultCallback<RecognitionResult> callback)

Return value: none.

Performs streaming real-time recognition based on callbacks. This method does not block the current thread.

public String call(RecognitionParam param, File file)

Return value: the recognition result.

Performs a synchronous call based on a local file. This method blocks the current thread until the entire audio file is read. The file must have read permissions.

public Flowable<RecognitionResult> streamCall(RecognitionParam param, Flowable<ByteBuffer> audioFrame)

Return value: a Flowable<RecognitionResult>.

Performs streaming real-time recognition based on Flowable.

public void sendAudioFrame(ByteBuffer audioFrame)

Parameters: audioFrame, a binary audio stream of the ByteBuffer type. Return value: none.

Pushes an audio stream. Each pushed audio segment should be neither too large nor too small. We recommend that each audio packet has a duration of about 100 ms and a size between 1 KB and 16 KB.

The recognition result is obtained through the onEvent method of the callback interface (ResultCallback).

public void stop()

Parameters: none. Return value: none.

Stops real-time recognition.

This method blocks the current thread until the onComplete or onError method of the ResultCallback instance is called.

recognizer.getDuplexApi().close(int code, String reason)

Parameters: code, the WebSocket close code; reason, the reason for closing. For information about how to configure these two parameters, see The WebSocket Protocol document. Return value: true.

After the task ends, you must close the WebSocket connection to prevent connection leaks, regardless of whether an exception occurs. For more information about how to reuse connections to improve efficiency, see Real-time speech recognition in high-concurrency scenarios.

public String getLastRequestId()

Return value: the requestId.

Gets the request ID of the current task. Use this method after you start a new task by calling call or streamCall.

Note

This method is available in SDK versions 2.18.0 and later.

public long getFirstPackageDelay()

Return value: the first-packet latency.

Gets the first-packet latency, which is the delay from sending the first audio packet to receiving the first recognition result. Use this method after the task is complete.

Note

This method is available in SDK versions 2.18.0 and later.

public long getLastPackageDelay()

Return value: the last-packet latency.

Gets the last-packet latency, which is the time taken from sending the stop command to receiving the last recognition result packet. Use this method after the task is complete.

Note

This method is available in SDK versions 2.18.0 and later.

ResultCallback

When you make a streaming call, the server returns key process information and data to the client through callbacks. Implement callback methods to handle the information and data returned by the server.

To implement the callback methods, inherit the ResultCallback abstract class. When you inherit this class, you can specify the generic type as RecognitionResult. RecognitionResult encapsulates the data structure returned by the server.

Because the Java SDK reuses connections, the callback interface does not provide onOpen or onClose methods.

Example

ResultCallback<RecognitionResult> callback = new ResultCallback<RecognitionResult>() {
    @Override
    public void onEvent(RecognitionResult result) {
        System.out.println("RequestId is: " + result.getRequestId());
        // Implement the logic to process the speech recognition result here.
    }

    @Override
    public void onComplete() {
        System.out.println("Task complete");
    }

    @Override
    public void onError(Exception e) {
        System.out.println("Task failed: " + e.getMessage());
    }
};

public void onEvent(RecognitionResult result)

Parameters: result, the real-time recognition result (RecognitionResult). Return value: none.

This method is called when the server returns a response.

public void onComplete()

Parameters: none. Return value: none.

This method is called when the task is complete.

public void onError(Exception e)

Parameters: e, the exception information. Return value: none.

This method is called when an exception occurs.

Response

Real-time recognition result (RecognitionResult)

RecognitionResult represents the result of a single real-time recognition.

public String getRequestId()

Return value: the requestId.

Gets the request ID.

public boolean isSentenceEnd()

Return value: whether the sentence is complete, which means a sentence break has occurred.

Checks whether the given sentence has ended.

public Sentence getSentence()

Return value: the sentence information (Sentence).

Gets single-sentence information, including the timestamp and text.

Sentence information (Sentence)

public Long getBeginTime()

Return value: the sentence start time, in ms.

Returns the start time of the sentence.

public Long getEndTime()

Return value: the sentence end time, in ms.

Returns the end time of the sentence.

public String getText()

Return value: the recognized text.

Returns the recognized text.

public List<Word> getWords()

Return value: a list of word timestamp information (Word) objects.

Returns word timestamp information.

public String getEmoTag()

Return value: the emotion of the current sentence.

Returns the emotion of the current sentence:

  • positive: Positive emotion, such as happy or satisfied.

  • negative: Negative emotion, such as angry or dull.

  • neutral: No obvious emotion.

Emotion recognition has the following constraints:

  • Applies only to the paraformer-realtime-8k-v2 model.

  • You must disable semantic sentence segmentation. This is controlled by the semantic_punctuation_enabled request parameter. Semantic sentence segmentation is disabled by default.

  • The emotion recognition result is displayed only when the isSentenceEnd method of the real-time recognition result (RecognitionResult) returns true.

public Double getEmoConfidence()

Return value: the confidence level of the recognized emotion for the current sentence.

Returns the confidence level of the recognized emotion for the current sentence. The value ranges from 0.0 to 1.0. A larger value indicates a higher confidence level.

The same emotion recognition constraints listed for getEmoTag apply to this method.

Word timestamp information (Word)

public long getBeginTime()

Return value: the word start time, in ms.

Returns the start time of the word.

public long getEndTime()

Return value: the word end time, in ms.

Returns the end time of the word.

public String getText()

Return value: the recognized word.

Returns the recognized word.

public String getPunctuation()

Return value: the punctuation.

Returns the punctuation.
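
As an illustration of these accessors, the following fragment can be placed inside the onEvent method of the callbacks shown earlier. It uses only the methods documented above; the null check on getWords() is a defensive assumption.

// Inside ResultCallback<RecognitionResult>.onEvent(RecognitionResult result):
if (result.isSentenceEnd()) {
    // Sentence-level time range, in milliseconds.
    System.out.println("Sentence [" + result.getSentence().getBeginTime() + " ms - "
            + result.getSentence().getEndTime() + " ms]: " + result.getSentence().getText());
    // Word-level timestamps and punctuation.
    if (result.getSentence().getWords() != null) {
        result.getSentence().getWords().forEach(word ->
                System.out.println("  " + word.getText()
                        + " [" + word.getBeginTime() + " ms - " + word.getEndTime() + " ms]"
                        + " punctuation: " + word.getPunctuation()));
    }
}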

Error codes

If an error occurs, see Error messages for troubleshooting.

If the problem persists, join the developer group to report the issue and provide the request ID to help us investigate.

More examples

For more examples, see GitHub.

FAQ

Features

Q: How do I maintain a persistent connection with the server during long periods of silence?

You can set the heartbeat request parameter to true and continuously send silent audio to the server.

Silent audio refers to content in an audio file or data stream that has no sound signal. You can generate silent audio using various methods, such as using audio editing software such as Audacity or Adobe Audition, or command-line tools such as FFmpeg.
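
For example, the following FFmpeg command generates five seconds of silent, 16 kHz, mono, 16-bit PCM WAV audio from the anullsrc source; adjust the sample rate, channel layout, and duration to match your stream:

# Generate 5 seconds of 16 kHz, mono, 16-bit PCM silence.
ffmpeg -f lavfi -i anullsrc=r=16000:cl=mono -t 5 -c:a pcm_s16le silence.wav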

Q: How do I convert audio to one of the required formats?

You can use the FFmpeg tool. For more information, see the official FFmpeg website.

# Basic conversion command (universal template)
# -i: Specifies the input file path. Example: audio.wav
# -c:a: Specifies the audio encoder. Examples: aac, libmp3lame, pcm_s16le
# -b:a: Specifies the bit rate (controls audio quality). Examples: 192k, 320k
# -ar: Specifies the sample rate. Examples: 44100 (CD), 48000, 16000
# -ac: Specifies the number of sound channels. Examples: 1 (mono), 2 (stereo)
# -y: Overwrites an existing file (no value needed).
ffmpeg -i input_audio.ext -c:a encoder_name -b:a bit_rate -ar sample_rate -ac num_channels output.ext

# Example: WAV to MP3 (maintain original quality)
ffmpeg -i input.wav -c:a libmp3lame -q:a 0 output.mp3
# Example: MP3 to WAV (16-bit PCM standard format)
ffmpeg -i input.mp3 -c:a pcm_s16le -ar 44100 -ac 2 output.wav
# Example: M4A to AAC (extract/convert Apple audio)
ffmpeg -i input.m4a -c:a copy output.aac  # Directly extract without re-encoding
ffmpeg -i input.m4a -c:a aac -b:a 256k output.aac  # Re-encode to improve quality
# Example: FLAC lossless to Opus (high compression)
ffmpeg -i input.flac -c:a libopus -b:a 128k -vbr on output.opus

Q: Can I view the time range for each sentence?

Yes, you can. The speech recognition results include the start and end timestamps for each sentence. You can use these timestamps to determine the time range of each sentence.

Q: How do I recognize a local (recorded) audio file?

There are two ways to recognize a local file:

  • Pass the local file path directly: This method returns the complete recognition result after the entire file is processed. It is not suitable for scenarios that require immediate feedback.

    For more information, see Synchronous call. You can pass the file path to the call method of the Recognition class to directly recognize the audio file.

  • Convert the local file into a binary stream for recognition: This method provides real-time results as the file is streamed and recognized. It is suitable for scenarios that require immediate feedback.

Troubleshooting

Q: Why is there no recognition result?

  1. Check whether the audio format and sampleRate/sample_rate in the request parameters are set correctly and meet the parameter constraints. The following are common error examples:

    • The audio file has a .wav extension but is in MP3 format, and the format parameter is incorrectly set to `mp3`.

    • The audio sample rate is 3600 Hz, but the sampleRate/sample_rate parameter is incorrectly set to 48000.

    You can use the ffprobe tool to obtain audio information, such as the container, encoding, sample rate, and sound channels:

    ffprobe -v error -show_entries format=format_name -show_entries stream=codec_name,sample_rate,channels -of default=noprint_wrappers=1 input.xxx
  2. When you use the paraformer-realtime-v2 model, check whether the language set in language_hints matches the actual language of the audio.

    For example, the audio is in Chinese, but language_hints is set to en (English).

  3. If all the preceding checks pass, you can use custom hotwords to improve the recognition of specific words.