This topic describes the parameters and interfaces of the Paraformer real-time speech recognition Java SDK.
This document applies only to the China (Beijing) region. To use the model, you must use an API key from the China (Beijing) region.
User guide: For model descriptions and selection guidance, see Real-time speech recognition - Fun-ASR/Paraformer.
Prerequisites
You have activated Model Studio and created an API key. To prevent security risks, export the API key as an environment variable instead of hard-coding it in your code.
Note: To grant temporary access permissions to third-party applications or users, or if you want to strictly control high-risk operations such as accessing or deleting sensitive data, we recommend that you use a temporary authentication token.
Compared with long-term API keys, temporary authentication tokens are more secure because they are short-lived (60 seconds). They are suitable for temporary call scenarios and can effectively reduce the risk of API key leakage.
To use a temporary token, replace the API key used for authentication in your code with the temporary authentication token.
Model list
| Feature | paraformer-realtime-v2 | paraformer-realtime-8k-v2 |
| --- | --- | --- |
| Scenarios | Live streaming, meetings, and similar scenarios | Recognition of 8 kHz audio, such as telephone customer service and voicemail |
| Sample rate | Any | 8 kHz |
| Languages | Chinese (including Mandarin and various dialects), English, Japanese, Korean, German, French, and Russian. Supported Chinese dialects: Shanghainese, Wu, Minnan, Northeastern, Gansu, Guizhou, Henan, Hubei, Hunan, Jiangxi, Ningxia, Shanxi, Shaanxi, Shandong, Sichuan, Tianjin, Yunnan, and Cantonese | Chinese |
| Punctuation prediction | ✅ Supported by default. No configuration is required. | ✅ Supported by default. No configuration is required. |
| Inverse Text Normalization (ITN) | ✅ Supported by default. No configuration is required. | ✅ Supported by default. No configuration is required. |
| Custom hotwords | ✅ For more information, see Customize hotwords. | ✅ For more information, see Customize hotwords. |
| Specify recognition language | ✅ Specified by the language_hints parameter | ❌ |
| Emotion recognition | ❌ | ✅ |
Getting started
The Recognition class provides interfaces for synchronous and streaming calls. Select a method based on your requirements:
Synchronous call: Recognizes a local file and returns the complete result in a single response. Suitable for processing pre-recorded audio.
Streaming call: Recognizes an audio stream directly and outputs the results in real time. The audio stream can originate from an external device, such as a microphone, or be read from a local file. Suitable for scenarios that require immediate feedback.
Synchronous call
Submit a real-time speech-to-text task for a local file and receive the transcription result in a single, synchronous response. This is a blocking operation.
Instantiate the Recognition class and call the call method with the request parameters and the file to be recognized. This action performs the recognition and returns the result.
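The following is a minimal sketch of a synchronous call. The model, audio format, sample rate, file name (asr_example.wav), and the environment variable used for the API key are illustrative assumptions; adjust them to your audio and account, and verify the builder usage against your SDK version.

```java
import java.io.File;

import com.alibaba.dashscope.audio.asr.recognition.Recognition;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;

public class SyncRecognitionSketch {
    public static void main(String[] args) throws Exception {
        // Request parameters; the values below are illustrative and must match the audio file.
        RecognitionParam param = RecognitionParam.builder()
                .model("paraformer-realtime-v2")
                .format("wav")
                .sampleRate(16000)
                .apiKey(System.getenv("DASHSCOPE_API_KEY")) // read the key from an environment variable
                .build();

        Recognition recognition = new Recognition();
        // Blocks until the entire local file has been read and recognized.
        String result = recognition.call(param, new File("asr_example.wav"));
        System.out.println(result);
    }
}
```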
Streaming call: Based on callbacks
You can submit a real-time speech-to-text task and receive streaming recognition results by implementing a callback interface.
Follow these steps (a minimal sketch follows the list):

1. Start streaming speech recognition: Instantiate the Recognition class and call the call method with the request parameters and the callback interface (ResultCallback) to start streaming speech recognition.
2. Stream audio: Repeatedly call the sendAudioFrame method of the Recognition class to send segments of a binary audio stream to the server. The audio stream can be read from a local file or a device, such as a microphone. While you send audio data, the server returns recognition results to the client in real time through the onEvent method of the callback interface (ResultCallback). We recommend sending audio segments of about 100 ms each, with a data size between 1 KB and 16 KB.
3. End processing: Call the stop method of the Recognition class to stop speech recognition. This method blocks the current thread until the onComplete or onError callback of the callback interface (ResultCallback) is triggered.
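A minimal end-to-end sketch of the callback-based flow, assuming a 16 kHz, 16-bit mono PCM file named asr_example.pcm as the audio source. The chunk size (3200 bytes, roughly 100 ms at this format), the sleep interval, the import path of ResultCallback, and the result accessors (getSentence, getText) follow the DashScope sample code and should be verified against your SDK version.

```java
import java.io.FileInputStream;
import java.nio.ByteBuffer;
import java.util.Arrays;

import com.alibaba.dashscope.audio.asr.recognition.Recognition;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionResult;
import com.alibaba.dashscope.common.ResultCallback;

public class CallbackRecognitionSketch {
    public static void main(String[] args) throws Exception {
        RecognitionParam param = RecognitionParam.builder()
                .model("paraformer-realtime-v2")
                .format("pcm")
                .sampleRate(16000)
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .build();

        ResultCallback<RecognitionResult> callback = new ResultCallback<RecognitionResult>() {
            @Override
            public void onEvent(RecognitionResult result) {
                // Intermediate and final results arrive here in real time.
                System.out.println(result.getSentence().getText());
            }

            @Override
            public void onComplete() {
                System.out.println("Recognition complete.");
            }

            @Override
            public void onError(Exception e) {
                System.err.println("Recognition error: " + e.getMessage());
            }
        };

        Recognition recognition = new Recognition();
        recognition.call(param, callback); // start streaming recognition

        try (FileInputStream in = new FileInputStream("asr_example.pcm")) {
            byte[] buffer = new byte[3200]; // ~100 ms of 16 kHz, 16-bit mono PCM
            int read;
            while ((read = in.read(buffer)) > 0) {
                // Copy the chunk so the reused buffer cannot be modified after it is sent.
                recognition.sendAudioFrame(ByteBuffer.wrap(Arrays.copyOfRange(buffer, 0, read)));
                Thread.sleep(100); // pace the stream roughly in real time
            }
        }

        recognition.stop(); // blocks until onComplete or onError fires
    }
}
```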
Streaming call: Based on Flowable
You can submit a real-time speech-to-text task and receive streaming recognition results by implementing a Flowable workflow.
Flowable is a stream type from the open-source RxJava library (released under the Apache 2.0 license). It implements the Reactive Streams specification and represents a source of zero or more items with backpressure support. For more information, see the Flowable API reference.
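A sketch of the Flowable-based flow under the same assumptions as the callback example (a 16 kHz, 16-bit mono PCM file streamed in roughly 100 ms frames). The io.reactivex import path, the streamCall return type (a Flowable of RecognitionResult), and the result accessors are assumptions based on the DashScope samples; verify them against the RxJava version bundled with your SDK.

```java
import java.io.FileInputStream;
import java.nio.ByteBuffer;
import java.util.Arrays;

import com.alibaba.dashscope.audio.asr.recognition.Recognition;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;

import io.reactivex.BackpressureStrategy;
import io.reactivex.Flowable;

public class FlowableRecognitionSketch {
    public static void main(String[] args) throws Exception {
        RecognitionParam param = RecognitionParam.builder()
                .model("paraformer-realtime-v2")
                .format("pcm")
                .sampleRate(16000)
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .build();

        // Emit the local file as a stream of ~100 ms audio frames on a separate thread.
        Flowable<ByteBuffer> audioSource = Flowable.create(emitter -> new Thread(() -> {
            try (FileInputStream in = new FileInputStream("asr_example.pcm")) {
                byte[] buffer = new byte[3200]; // ~100 ms of 16 kHz, 16-bit mono PCM
                int read;
                while ((read = in.read(buffer)) > 0) {
                    emitter.onNext(ByteBuffer.wrap(Arrays.copyOfRange(buffer, 0, read)));
                    Thread.sleep(100); // pace the stream roughly in real time
                }
                emitter.onComplete();
            } catch (Exception e) {
                emitter.onError(e);
            }
        }).start(), BackpressureStrategy.BUFFER);

        Recognition recognition = new Recognition();
        // streamCall consumes the audio Flowable and returns a Flowable of recognition results.
        recognition.streamCall(param, audioSource)
                .blockingForEach(result -> System.out.println(result.getSentence().getText()));
    }
}
```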
High-concurrency calls
The DashScope Java SDK uses OkHttp3 connection pooling to reduce the overhead of repeatedly establishing connections. For more information, see Real-time speech recognition in high-concurrency scenarios.
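One pattern that benefits from connection pooling is keeping a single Recognition instance for sequential tasks instead of creating a new one per file. This is a sketch, not the linked guide's prescription; whether an instance can be reused across tasks depends on your SDK version, so treat the reuse shown here as an assumption and consult the high-concurrency guide for the authoritative approach.

```java
import java.io.File;
import java.util.List;

import com.alibaba.dashscope.audio.asr.recognition.Recognition;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;

public class ConnectionReuseSketch {
    // Processes several local files with one Recognition instance so that the
    // underlying pooled connection can be reused between sequential tasks.
    public static void recognizeAll(RecognitionParam param, List<File> files) {
        Recognition recognition = new Recognition();
        for (File file : files) {
            String result = recognition.call(param, file); // one synchronous task per file
            System.out.println(file.getName() + ": " + result);
        }
    }
}
```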
Request parameters
Use the chained methods of RecognitionParam to configure parameters such as the model, sample rate, and audio format. Pass the configured parameter object to the call or streamCall method of the Recognition class.
| Parameter | Type | Default value | Required | Description |
| --- | --- | --- | --- | --- |
| model | String | - | Yes | The model for real-time speech recognition. For more information, see Model list. |
| sampleRate | Integer | - | Yes | The audio sample rate in Hz. This parameter varies by model: paraformer-realtime-v2 supports any sample rate; paraformer-realtime-8k-v2 supports only 8000 Hz. |
| format | String | - | Yes | The format of the audio to be recognized. Supported audio formats: pcm, wav, mp3, opus, speex, aac, and amr. Important: opus and speex must be encapsulated in Ogg; wav must be PCM encoded; amr supports only the AMR-NB type. |
| vocabularyId | String | - | No | The ID of the hotword vocabulary. The vocabulary takes effect only when this parameter is set. Use this field to set the hotword ID for v2 and later models; the hotword information associated with this ID is applied to the speech recognition request. For more information, see Custom vocabulary. |
| disfluencyRemovalEnabled | boolean | false | No | Specifies whether to filter out disfluent words. true: filters out disfluent words. false (default): does not filter out disfluent words. |
| language_hints | String[] | ["zh", "en"] | No | The language codes of the languages to be recognized. If you cannot determine the language in advance, leave this parameter unset and the model automatically detects the language. Supported language codes include zh (Chinese), en (English), ja (Japanese), ko (Korean), de (German), fr (French), and ru (Russian), corresponding to the languages in the Model list. This parameter applies only to models that support multiple languages. Note: Set this parameter by using the parameter or parameters method of RecognitionParam. |
| semantic_punctuation_enabled | boolean | false | No | Specifies whether to enable semantic sentence segmentation. true: uses semantic sentence segmentation. false (default): uses VAD (Voice Activity Detection) sentence segmentation. Semantic sentence segmentation provides higher accuracy and is suitable for meeting transcription scenarios. VAD sentence segmentation has lower latency and is suitable for interactive scenarios. By adjusting this parameter, you can switch between the two segmentation modes. This parameter is effective only for v2 and later models. Note: Set this parameter by using the parameter or parameters method of RecognitionParam. |
| max_sentence_silence | Integer | 800 | No | The silence duration threshold for VAD sentence segmentation, in ms. If the silence after a speech segment exceeds this threshold, the system determines that the sentence has ended. Valid values range from 200 to 6000. The default value is 800. This parameter is effective only when VAD sentence segmentation is used, that is, when semantic_punctuation_enabled is false. Note: Set this parameter by using the parameter or parameters method of RecognitionParam. |
| multi_threshold_mode_enabled | boolean | false | No | If this parameter is set to true, VAD is prevented from producing sentences that are too long. This feature is disabled by default. This parameter is effective only when VAD sentence segmentation is used, that is, when semantic_punctuation_enabled is false. Note: Set this parameter by using the parameter or parameters method of RecognitionParam. |
| punctuation_prediction_enabled | boolean | true | No | Specifies whether to automatically add punctuation to the recognition results. true (default): adds punctuation. false: does not add punctuation. This parameter is effective only for v2 and later models. Note: Set this parameter by using the parameter or parameters method of RecognitionParam. |
| heartbeat | boolean | false | No | Specifies whether to maintain a persistent connection with the server during long periods of silence. true: keeps the connection alive, provided that you continuously send silent audio to the server (see FAQ). false (default): the connection may be closed after prolonged silence. This parameter is effective only for v2 and later models. Note: To use this field, your SDK version must be 2.19.1 or later. Set this parameter by using the parameter or parameters method of RecognitionParam. |
| inverse_text_normalization_enabled | boolean | true | No | Specifies whether to enable Inverse Text Normalization (ITN). This feature is enabled by default (true). When enabled, Chinese numerals are converted to Arabic numerals. This parameter is effective only for v2 and later models. Note: Set this parameter by using the parameter or parameters method of RecognitionParam. |
| apiKey | String | - | No | Your API key. |
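The following sketch shows one way to combine the chained builder methods with the parameter method for the snake_case parameters in the table above. The parameter(key, value) builder method and the specific values are assumptions to verify against your SDK version.

```java
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;

public class RecognitionParamSketch {
    public static RecognitionParam buildParam() {
        return RecognitionParam.builder()
                .model("paraformer-realtime-v2")
                .format("pcm")
                .sampleRate(16000)
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                // Parameters without dedicated chained methods are set by name;
                // the parameter(key, value) builder method is assumed here.
                .parameter("language_hints", new String[] {"zh", "en"})
                .parameter("semantic_punctuation_enabled", false)
                .parameter("max_sentence_silence", 800)
                .build();
    }
}
```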
Key interfaces
Recognition class
Import the Recognition class using the statement import com.alibaba.dashscope.audio.asr.recognition.Recognition;. Its key interfaces are:
| Interface/Method | Parameters | Return value | Description |
| --- | --- | --- | --- |
| call | Request parameters (RecognitionParam) and the callback interface (ResultCallback) | None | Performs streaming real-time recognition based on callbacks. This method does not block the current thread. |
| call | Request parameters (RecognitionParam) and the local file to recognize | Recognition result | Performs a synchronous call based on a local file. This method blocks the current thread until the entire audio file is read. The file must be readable. |
| streamCall | Request parameters (RecognitionParam) and a Flowable audio stream | Flowable of recognition results | Performs streaming real-time recognition based on Flowable. |
| sendAudioFrame | A segment of binary audio data | None | Pushes an audio stream. Each pushed segment should be neither too large nor too small. We recommend that each audio packet has a duration of about 100 ms and a size between 1 KB and 16 KB. The recognition results are obtained through the onEvent method of the callback interface (ResultCallback). |
| stop | None | None | Stops real-time recognition. This method blocks the current thread until the onComplete or onError callback of the callback interface (ResultCallback) is triggered. |
| | code: WebSocket close code. reason: the reason for closing. For information about how to configure these two parameters, see The WebSocket Protocol document. | true | After the task ends, you must close the WebSocket connection to prevent connection leaks, regardless of whether an exception occurs. For more information about how to reuse connections to improve efficiency, see Real-time speech recognition in high-concurrency scenarios. |
| | None | requestId | Gets the request ID of the current task. Use this method after you start a new task with the call or streamCall method. Note: This method is available in SDK versions 2.18.0 and later. |
| | None | First-packet latency | Gets the first-packet latency, which is the delay from sending the first audio packet to receiving the first recognition result. Use this method after the task is complete. Note: This method is available in SDK versions 2.18.0 and later. |
| | None | Last-packet latency | Gets the last-packet latency, which is the time from when audio sending stops (the stop method is called) to when the final recognition result is received. Note: This method is available in SDK versions 2.18.0 and later. |
ResultCallback
When you make a streaming call, the server returns key process information and data to the client through callbacks. Implement callback methods to handle the information and data returned by the server.
To implement the callback methods, extend the ResultCallback abstract class and specify RecognitionResult as the generic type. RecognitionResult encapsulates the data structure returned by the server.
Because the Java SDK supports connection reuse, there are no onOpen or onClose methods.
| Interface/Method | Parameters | Return value | Description |
| --- | --- | --- | --- |
| onEvent | The real-time recognition result (RecognitionResult) | None | Called when the server returns a recognition response. |
| onComplete | None | None | Called when the recognition task is complete. |
| onError | The exception that occurred | None | Called when an exception occurs. |
Response
Real-time recognition result (RecognitionResult)
RecognitionResult represents the result of a single real-time recognition.
| Interface/Method | Parameters | Return value | Description |
| --- | --- | --- | --- |
| getRequestId | None | requestId | Gets the request ID. |
| isSentenceEnd | None | Whether the sentence has ended (a sentence break has occurred) | Checks whether the given sentence has ended. |
| getSentence | None | Sentence information (Sentence) | Gets single-sentence information, including the timestamps and text. |
Sentence information (Sentence)
| Interface/Method | Parameters | Return value | Description |
| --- | --- | --- | --- |
| getBeginTime | None | Sentence start time, in ms | Returns the start time of the sentence. |
| getEndTime | None | Sentence end time, in ms | Returns the end time of the sentence. |
| getText | None | Recognized text | Returns the recognized text. |
| getWords | None | A list of word timestamp information (Word) objects | Returns word timestamp information. |
| | None | Emotion of the current sentence | Returns the emotion of the current sentence. For the models and conditions under which emotion recognition is available, see Model list. |
| | None | Confidence of the recognized emotion for the current sentence | Returns the confidence of the recognized emotion for the current sentence. The value ranges from 0.0 to 1.0; a larger value indicates higher confidence. For the models and conditions under which emotion recognition is available, see Model list. |
Word timestamp information (Word)
| Interface/Method | Parameters | Return value | Description |
| --- | --- | --- | --- |
| getBeginTime | None | Word start time, in ms | Returns the start time of the word. |
| getEndTime | None | Word end time, in ms | Returns the end time of the word. |
| getText | None | Recognized word | Returns the recognized word. |
| getPunctuation | None | Punctuation | Returns the punctuation. |
Error codes
If an error occurs, see Error messages for troubleshooting.
If the problem persists, join the developer group to report the issue. Provide the Request ID to help us investigate the issue.
More examples
For more examples, see GitHub.
FAQ
Features
Q: How do I maintain a persistent connection with the server during long periods of silence?
You can set the heartbeat request parameter to true and continuously send silent audio to the server.
Silent audio is audio content (in a file or a data stream) that contains no sound signal. You can generate silent audio with audio editing software, such as Audacity or Adobe Audition, or with command-line tools, such as FFmpeg.
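As an alternative to pre-generated silence files, you can keep a session alive by sending zero-filled PCM frames directly from the SDK. This is a sketch under the assumptions that the heartbeat request parameter is set to true, that an active Recognition session exists (see the callback example above), and that the frame size and interval shown (3200 bytes, 100 ms) match 16 kHz, 16-bit mono PCM.

```java
import java.nio.ByteBuffer;

import com.alibaba.dashscope.audio.asr.recognition.Recognition;

public class SilenceSender {
    /**
     * Sends zero-filled (silent) 16-bit PCM frames to keep the session alive.
     * Assumes recognition.call(param, callback) has already been invoked and
     * the heartbeat request parameter is set to true.
     */
    public static void sendSilence(Recognition recognition, int seconds) throws InterruptedException {
        byte[] silentFrame = new byte[3200]; // ~100 ms of 16 kHz, 16-bit mono PCM, all zeros
        for (int i = 0; i < seconds * 10; i++) {
            recognition.sendAudioFrame(ByteBuffer.wrap(silentFrame.clone()));
            Thread.sleep(100); // keep roughly real-time pacing
        }
    }
}
```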
Q: How do I convert audio to one of the required formats?
You can use the FFmpeg tool. For more information, see the official FFmpeg website.
# Basic conversion command (universal template)
# -i: Specifies the input file path. Example: audio.wav
# -c:a: Specifies the audio encoder. Examples: aac, libmp3lame, pcm_s16le
# -b:a: Specifies the bit rate (controls audio quality). Examples: 192k, 320k
# -ar: Specifies the sample rate. Examples: 44100 (CD), 48000, 16000
# -ac: Specifies the number of sound channels. Examples: 1 (mono), 2 (stereo)
# -y: Overwrites an existing file (no value needed).
ffmpeg -i input_audio.ext -c:a encoder_name -b:a bit_rate -ar sample_rate -ac num_channels output.ext
# Example: WAV to MP3 (maintain original quality)
ffmpeg -i input.wav -c:a libmp3lame -q:a 0 output.mp3
# Example: MP3 to WAV (16-bit PCM standard format)
ffmpeg -i input.mp3 -c:a pcm_s16le -ar 44100 -ac 2 output.wav
# Example: M4A to AAC (extract/convert Apple audio)
ffmpeg -i input.m4a -c:a copy output.aac # Directly extract without re-encoding
ffmpeg -i input.m4a -c:a aac -b:a 256k output.aac # Re-encode to improve quality
# Example: FLAC lossless to Opus (high compression)
ffmpeg -i input.flac -c:a libopus -b:a 128k -vbr on output.opus

Q: Can I view the time range for each sentence?
Yes, you can. The speech recognition results include the start and end timestamps for each sentence. You can use these timestamps to determine the time range of each sentence.
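A small sketch of reading those timestamps from a streaming result. The accessor names (isSentenceEnd, getSentence, getBeginTime, getEndTime, getText) follow DashScope sample code and should be verified against your SDK version.

```java
import com.alibaba.dashscope.audio.asr.recognition.RecognitionResult;

public class SentenceRangePrinter {
    // Call this from ResultCallback.onEvent to print the time range of each completed sentence.
    public static void printSentenceRange(RecognitionResult result) {
        if (result.isSentenceEnd()) {
            System.out.printf("[%d ms - %d ms] %s%n",
                    result.getSentence().getBeginTime(),
                    result.getSentence().getEndTime(),
                    result.getSentence().getText());
        }
    }
}
```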
Q: How do I recognize a local file (recorded audio file)?
There are two ways to recognize a local file:

- Pass the local file path directly: The complete recognition result is returned after the entire file is processed. This approach is not suitable for scenarios that require immediate feedback. Pass the file path to the call method of the Recognition class to recognize the audio file directly. For more information, see Synchronous call.
- Convert the local file into a binary stream for recognition: Results are returned in real time as the file is streamed and recognized. This approach is suitable for scenarios that require immediate feedback. Use the sendAudioFrame method of the Recognition class to send a binary stream to the server (see Streaming call: Based on callbacks), or use the streamCall method of the Recognition class (see Streaming call: Based on Flowable).
Troubleshooting
Q: Why is there no recognition result?

- Check whether the format and sampleRate/sample_rate request parameters are set correctly and meet the parameter constraints. Common errors include the following:
  - The audio file has a .wav extension but actually contains MP3 data, and the format parameter does not match the actual audio encoding.
  - The audio sample rate is 3600 Hz, but the sampleRate/sample_rate parameter is set to 48000.

  You can use the ffprobe tool to obtain audio information, such as the container, encoding, sample rate, and sound channels:

  ffprobe -v error -show_entries format=format_name -show_entries stream=codec_name,sample_rate,channels -of default=noprint_wrappers=1 input.xxx

- When you use the paraformer-realtime-v2 model, check whether the language set in language_hints matches the actual language of the audio. For example, the audio is in Chinese, but language_hints is set to en (English).
- If the preceding settings are correct but specific words are still recognized incorrectly, use custom hotwords to improve their recognition.