Use the DashScope Java SDK to call Qwen-ASR-Realtime.
User guide: For model overview, features, and complete sample code, see Real-time speech recognition - Qwen.
Prerequisites
DashScope SDK 2.22.5 or later (Install the SDK)
Understand the interaction flow between client and server
Interaction modes
Qwen-ASR-Realtime supports two modes for deciding when to process audio:
Mode | How it works |
VAD mode (default) | The server detects speech boundaries using voice activity detection (VAD) and decides when to commit the audio buffer for recognition. |
Manual mode | The client controls when to commit audio by calling |
For details on each mode, see VAD mode and Manual mode.
Request parameters
Connection parameters (OmniRealtimeParam)
Set these parameters with the chained methods of the OmniRealtimeParam class.
Parameter | Type | Required | Description |
|
| Yes | The model to use. Example: |
|
| Yes | The service endpoint. Chinese Mainland: |
|
| No | The API key. |
Session configuration (OmniRealtimeConfig)
Set these parameters with the chained methods of the OmniRealtimeConfig class.
Parameter | Type | Required | Description |
|
| Yes | Output modality. Fixed to |
|
| No | Enables server-side VAD. When disabled, call |
|
| No | VAD type. Fixed to |
|
| No | The VAD detection threshold. Recommended value: Default: Valid values: A lower threshold increases VAD sensitivity, which might cause background noise to be mistaken for speech. A higher threshold decreases sensitivity and helps reduce false triggers in noisy environments. |
|
| No | The VAD endpointing threshold in milliseconds (ms). A period of silence that exceeds this threshold is considered the end of a statement. Recommended value: Default: Valid values: A lower value, such as 300 ms, allows the model to respond faster but may cause unnatural segmentation at normal pauses. A higher value, such as 1200 ms, can better handle pauses within long sentences but increases the overall response latency. |
|
| No | Speech recognition settings. See Transcription parameters. |
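The two VAD settings above interact: the detection threshold decides which audio counts as speech, and the silence duration decides how long a pause must last before a statement is considered finished. The following is a toy model of that interaction, not the server's actual VAD. The per-frame speech probabilities, the frame length, and the class and method names are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy endpointing model for illustration only; NOT the server's VAD implementation. */
public class EndpointingSketch {

    /**
     * Returns the frame indices at which a segment would be closed.
     *
     * @param speechProb per-frame speech probability in [0, 1] (hypothetical input)
     * @param frameMs    duration of one frame in milliseconds
     * @param threshold  frames with probability >= threshold count as speech
     * @param silenceMs  silence strictly longer than this closes the current segment
     */
    public static List<Integer> segmentEnds(double[] speechProb, int frameMs,
                                            double threshold, int silenceMs) {
        List<Integer> ends = new ArrayList<>();
        boolean inSpeech = false;
        int silentMs = 0;
        for (int i = 0; i < speechProb.length; i++) {
            if (speechProb[i] >= threshold) {
                // Speech frame: (re)open the segment and reset the silence timer.
                inSpeech = true;
                silentMs = 0;
            } else if (inSpeech) {
                // Silent frame inside a segment: accumulate silence.
                silentMs += frameMs;
                if (silentMs > silenceMs) {
                    ends.add(i);
                    inSpeech = false;
                    silentMs = 0;
                }
            }
        }
        return ends;
    }

    public static void main(String[] args) {
        // 100 ms frames: speech, a 400 ms pause, speech, then 1300 ms of silence.
        double[] p = {
            0.9, 0.9,
            0.1, 0.1, 0.1, 0.1,
            0.9, 0.9,
            0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1
        };
        System.out.println(segmentEnds(p, 100, 0.5, 300));   // [5, 11]
        System.out.println(segmentEnds(p, 100, 0.5, 1200));  // [20]
        System.out.println(segmentEnds(p, 100, 0.05, 300));  // []
    }
}
```

With a 300 ms silence setting, the 400 ms mid-sentence pause splits the input into two segments. With 1200 ms, the pause is absorbed and only the trailing silence ends the segment, at the cost of a later result. The last call shows a very low threshold treating the quiet frames as speech, so no segment ever closes.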
Transcription parameters (OmniRealtimeTranscriptionParam)
Set these parameters with the setter methods of the OmniRealtimeTranscriptionParam class.
Parameter | Type | Required | Description |
|
| No | The source language of the audio.
|
|
| No | The audio sampling rate in Hz. Supported values are Default: If you set this parameter to |
|
| No | The audio format. Supported formats are Default: |
|
| No | Specifies the context. You can provide background text, entity vocabularies, and other reference information (context) during speech recognition to get customized results. Length limit: 10,000 tokens. For more information, see Contextual biasing. |
Key interfaces
OmniRealtimeConversation
Import: com.alibaba.dashscope.audio.omni.OmniRealtimeConversation
This class manages the WebSocket lifecycle: connecting to the server, sending audio, and ending the session.
Create a conversation
OmniRealtimeConversation conversation =
    new OmniRealtimeConversation(param, callback);
Creates a new conversation instance with the specified connection parameters and callback handler.
Connect to the server
conversation.connect();
Opens a WebSocket connection. The server responds with session.created and session.updated events.
Throws: NoApiKeyException, InterruptedException.
Configure the session
conversation.updateSession(config);
Updates the session configuration after the connection is established. The server responds with a session.updated event. If not called, the server uses default settings.
Send audio data
conversation.appendAudio(audioBase64);
Appends a Base64-encoded audio segment to the server-side audio buffer.
VAD mode (enableTurnDetection=true): The server detects speech boundaries and decides when to process the buffer.
Manual mode (enableTurnDetection=false): Audio accumulates in the buffer until you call commit() to trigger recognition.
Each event can contain up to 15 MiB of audio data.
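A minimal sketch of preparing audio for appendAudio, using only the Java standard library: it splits raw bytes into fixed-size chunks and Base64-encodes each one. The class name, the helper names, and the 16-bit mono PCM layout at 16000 Hz are assumptions for illustration. The text does not say whether the 15 MiB limit applies to raw or encoded bytes, so the guard conservatively checks the encoded size.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Base64;
import java.util.List;

/** Hypothetical helper; not part of the DashScope SDK. */
public class AudioChunker {
    /** Per-event limit stated in the documentation: 15 MiB. */
    static final int MAX_EVENT_BYTES = 15 * 1024 * 1024;

    /** Bytes needed for `millis` of 16-bit mono PCM at `sampleRate` Hz (assumed layout). */
    public static int pcmBytes(int sampleRate, int millis) {
        return sampleRate * 2 * millis / 1000;
    }

    /** Splits raw audio into chunks of at most `chunkBytes` and Base64-encodes each chunk. */
    public static List<String> toBase64Chunks(byte[] audio, int chunkBytes) {
        if (chunkBytes <= 0) {
            throw new IllegalArgumentException("chunkBytes must be positive");
        }
        List<String> out = new ArrayList<>();
        for (int off = 0; off < audio.length; off += chunkBytes) {
            int end = Math.min(audio.length, off + chunkBytes);
            String encoded = Base64.getEncoder()
                    .encodeToString(Arrays.copyOfRange(audio, off, end));
            // Conservative guard: compare the encoded size against the limit.
            if (encoded.length() > MAX_EVENT_BYTES) {
                throw new IllegalArgumentException("chunk exceeds 15 MiB per event");
            }
            out.add(encoded);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] audio = new byte[8000];        // placeholder for real PCM data
        int chunk = pcmBytes(16000, 100);     // 100 ms of audio = 3200 bytes
        List<String> chunks = toBase64Chunks(audio, chunk);
        // With a live connection, each element would be passed to
        // conversation.appendAudio(...).
        System.out.println(chunks.size() + " chunks");
    }
}
```

Small chunks of roughly 100 ms keep latency low in a streaming setting. The exact chunk duration is a design choice, not something the source prescribes.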
Commit the audio buffer
conversation.commit();
Submits the buffered audio for recognition. The server responds with an input_audio_buffer.committed event.
This method is only available in manual mode (enableTurnDetection=false). An error occurs if the audio buffer is empty.
End the session
conversation.endSession(); // synchronous
// or
conversation.endSessionAsync(); // asynchronous
Notifies the server to finish processing any remaining audio and end the session. The server responds with a session.finished event.
When to call:
VAD mode: After you finish sending audio.
Manual mode: After you call commit().
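The call order described above can be sketched without a network connection. RealtimeSession below is a hypothetical stand-in that mirrors only the names of three OmniRealtimeConversation methods, and RecordingSession is a fake that logs calls. Neither is part of the SDK, and the real methods may declare checked exceptions that this sketch omits.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SessionFlowSketch {

    /** Hypothetical stand-in for the conversation methods used in this flow. */
    interface RealtimeSession {
        void appendAudio(String audioBase64);
        void commit();
        void endSession();
    }

    /** Fake session that records call names so the order can be inspected. */
    static class RecordingSession implements RealtimeSession {
        final List<String> calls = new ArrayList<>();
        public void appendAudio(String audioBase64) { calls.add("appendAudio"); }
        public void commit() { calls.add("commit"); }
        public void endSession() { calls.add("endSession"); }
    }

    /** Sends all chunks, commits in manual mode only, then ends the session. */
    static void stream(RealtimeSession session, List<String> chunks, boolean manualMode) {
        for (String chunk : chunks) {
            session.appendAudio(chunk);
        }
        // commit() applies to manual mode only, and an empty buffer is an error,
        // so skip it when nothing was appended.
        if (manualMode && !chunks.isEmpty()) {
            session.commit();
        }
        session.endSession();
    }

    public static void main(String[] args) {
        List<String> chunks = Arrays.asList("AAAA", "AAAA"); // placeholder Base64 data

        RecordingSession manual = new RecordingSession();
        stream(manual, chunks, true);
        System.out.println(manual.calls); // [appendAudio, appendAudio, commit, endSession]

        RecordingSession vad = new RecordingSession();
        stream(vad, chunks, false);
        System.out.println(vad.calls);    // [appendAudio, appendAudio, endSession]
    }
}
```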
Close the connection
conversation.close();
Stops the task and closes the WebSocket connection immediately.
Get session and response IDs
String sessionId = conversation.getSessionId();
String responseId = conversation.getResponseId();
getSessionId() returns the session ID for the current task.
getResponseId() returns the response ID of the most recent server response.
OmniRealtimeCallback
Import: com.alibaba.dashscope.audio.omni.OmniRealtimeCallback
Extend this class and implement its callback methods to handle server events.
Method | Parameters | Triggered when |
| None | The WebSocket connection is established. |
|
| A server event is received. Parse the |
|
| The WebSocket connection is closed. |
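One way to organize the event-handling callback is to route each server event by its type to a dedicated handler. The sketch below uses only the standard library. It assumes each event is available as a raw JSON string with a top-level "type" field, and it extracts that field with a regular expression where a real JSON parser would normally be used. EventRouterSketch and its methods are hypothetical names, not SDK classes. The event names come from the sections above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical event router; not part of the DashScope SDK. */
public class EventRouterSketch {
    // Matches the first "type": "..." pair; a stand-in for proper JSON parsing.
    private static final Pattern TYPE =
            Pattern.compile("\"type\"\\s*:\\s*\"([^\"]+)\"");

    private final Map<String, Consumer<String>> handlers = new HashMap<>();

    /** Registers a handler for one event type and returns this router for chaining. */
    public EventRouterSketch on(String type, Consumer<String> handler) {
        handlers.put(type, handler);
        return this;
    }

    /** Returns the event type, or null if no "type" field is found. */
    static String eventType(String json) {
        Matcher m = TYPE.matcher(json);
        return m.find() ? m.group(1) : null;
    }

    /** Invokes the matching handler; returns false if the event type is unhandled. */
    public boolean dispatch(String json) {
        Consumer<String> handler = handlers.get(eventType(json));
        if (handler == null) {
            return false;
        }
        handler.accept(json);
        return true;
    }

    public static void main(String[] args) {
        EventRouterSketch router = new EventRouterSketch()
                .on("session.created", e -> System.out.println("session created"))
                .on("session.finished", e -> System.out.println("session finished"));

        router.dispatch("{\"type\":\"session.created\"}");
        boolean handled = router.dispatch("{\"type\":\"something.else\"}");
        System.out.println("handled unknown event: " + handled);
    }
}
```

Returning false for unknown types lets the caller log or ignore events it does not recognize instead of failing.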