Alibaba Cloud Model Studio: Qwen-ASR-Realtime Java SDK - API reference

Last Updated: Mar 02, 2026

Use the DashScope Java SDK to call Qwen-ASR-Realtime.

User guide: For a model overview, features, and complete sample code, see Real-time speech recognition - Qwen.

Prerequisites

Interaction modes

Qwen-ASR-Realtime supports two modes for deciding when to process audio:

| Mode | enableTurnDetection | How it works |
| --- | --- | --- |
| VAD mode (default) | true | The server detects speech boundaries using voice activity detection (VAD) and decides when to commit the audio buffer for recognition. |
| Manual mode | false | The client controls when to commit audio by calling commit(). This gives you full control over segmentation. |

For details on each mode, see VAD mode and Manual mode.

Request parameters

Connection parameters (OmniRealtimeParam)

Set these parameters with the chained methods of the OmniRealtimeParam class.

OmniRealtimeParam param = OmniRealtimeParam.builder()
        .model("qwen3-asr-flash-realtime")
        // Endpoint for the Singapore region.
        // For the Beijing region, use wss://dashscope.aliyuncs.com/api-ws/v1/realtime
        .url("wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime")
        // The API keys for the Singapore and Beijing regions are different.
        // To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
        // If you have not configured an environment variable, replace the following line with .apikey("sk-xxx").
        .apikey(System.getenv("DASHSCOPE_API_KEY"))
        .build();

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| model | String | Yes | The model to use. Example: qwen3-asr-flash-realtime. |
| url | String | Yes | The service endpoint. Chinese Mainland: wss://dashscope.aliyuncs.com/api-ws/v1/realtime. International (Singapore): wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime. |
| apikey | String | No | The API key. |
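Because the endpoint and the API key must belong to the same region, it can help to keep the two documented URLs in one place. The following is a minimal sketch, not part of the SDK: the class name AsrEndpoints and the region labels "cn" and "intl" are made up for illustration.

```java
// Hypothetical helper (not an SDK class): maps a region label to the documented endpoint.
final class AsrEndpoints {
    static final String CHINESE_MAINLAND = "wss://dashscope.aliyuncs.com/api-ws/v1/realtime";
    static final String INTERNATIONAL = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime";

    private AsrEndpoints() {}

    // "cn" and "intl" are labels invented for this sketch, not SDK constants.
    static String urlFor(String region) {
        switch (region) {
            case "cn":
                return CHINESE_MAINLAND;
            case "intl":
                return INTERNATIONAL;
            default:
                throw new IllegalArgumentException("Unknown region: " + region);
        }
    }
}
```

The returned string could then be passed to the url(...) builder method, together with an API key issued for the same region.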

Session configuration (OmniRealtimeConfig)

Set these parameters with the chained methods of the OmniRealtimeConfig class.

OmniRealtimeTranscriptionParam transcriptionParam = new OmniRealtimeTranscriptionParam();
transcriptionParam.setLanguage("zh");
transcriptionParam.setInputSampleRate(16000);
transcriptionParam.setInputAudioFormat("pcm");

OmniRealtimeConfig config = OmniRealtimeConfig.builder()
        .modalities(Collections.singletonList(OmniRealtimeModality.TEXT))
        .enableTurnDetection(true)
        .turnDetectionType("server_vad")
        .turnDetectionThreshold(0.0f)
        .turnDetectionSilenceDurationMs(400)
        .transcriptionConfig(transcriptionParam)
        .build();

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| modalities | List<OmniRealtimeModality> | Yes | Output modality. Fixed to [OmniRealtimeModality.TEXT]. |
| enableTurnDetection | boolean | No | Enables server-side VAD. When disabled, call commit() to trigger recognition manually. Default: true. |
| turnDetectionType | String | No | VAD type. Fixed to server_vad. |
| turnDetectionThreshold | float | No | The VAD detection threshold. Recommended value: 0.0. Default: 0.2. Valid values: [-1, 1]. A lower threshold increases VAD sensitivity, which might cause background noise to be mistaken for speech. A higher threshold decreases sensitivity and helps reduce false triggers in noisy environments. |
| turnDetectionSilenceDurationMs | int | No | The VAD endpointing threshold in milliseconds (ms). A period of silence that exceeds this threshold is considered the end of a statement. Recommended value: 400. Default: 800. Valid values: [200, 6000]. A lower value, such as 300 ms, allows the model to respond faster but may cause unnatural segmentation at normal pauses. A higher value, such as 1200 ms, can better handle pauses within long sentences but increases the overall response latency. |
| transcriptionConfig | OmniRealtimeTranscriptionParam | No | Speech recognition settings. See Transcription parameters. |
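The valid ranges for the two VAD tuning parameters can be checked on the client before the configuration is built. The sketch below is one way to do that, assuming you want out-of-range values to fail fast. VadSettings is a hypothetical helper class, not part of the SDK.

```java
// Hypothetical helper (not an SDK class): validates VAD settings against the documented ranges.
final class VadSettings {
    final float threshold;
    final int silenceDurationMs;

    VadSettings(float threshold, int silenceDurationMs) {
        // turnDetectionThreshold: valid values are [-1, 1].
        if (Float.isNaN(threshold) || threshold < -1.0f || threshold > 1.0f) {
            throw new IllegalArgumentException(
                    "turnDetectionThreshold must be in [-1, 1]: " + threshold);
        }
        // turnDetectionSilenceDurationMs: valid values are [200, 6000].
        if (silenceDurationMs < 200 || silenceDurationMs > 6000) {
            throw new IllegalArgumentException(
                    "turnDetectionSilenceDurationMs must be in [200, 6000]: " + silenceDurationMs);
        }
        this.threshold = threshold;
        this.silenceDurationMs = silenceDurationMs;
    }

    // The values this reference lists as recommended (0.0 and 400 ms).
    static VadSettings recommended() {
        return new VadSettings(0.0f, 400);
    }
}
```

The validated fields can then be passed to turnDetectionThreshold(...) and turnDetectionSilenceDurationMs(...) when building OmniRealtimeConfig.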

Transcription parameters (OmniRealtimeTranscriptionParam)

Set these parameters with the setter methods of the OmniRealtimeTranscriptionParam class.

OmniRealtimeTranscriptionParam transcriptionParam = new OmniRealtimeTranscriptionParam();
transcriptionParam.setLanguage("zh");
transcriptionParam.setInputSampleRate(16000);
transcriptionParam.setInputAudioFormat("pcm");

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| language | String | No | The source language of the audio. Supported values are listed after this table. |
| inputSampleRate | int | No | The audio sampling rate in Hz. Supported values are 16000 and 8000. Default: 16000. If you set this parameter to 8000, the server upsamples the audio to 16000 Hz before recognition. This might introduce a minor delay. Use this value only if the source audio is 8000 Hz, such as audio from a telephone line. |
| inputAudioFormat | String | No | The audio format. Supported formats are pcm and opus. Default: pcm. |
| corpusText | String | No | Specifies the context. You can provide background text, entity vocabularies, and other reference information (context) during speech recognition to get customized results. Length limit: 10,000 tokens. For more information, see Contextual biasing. |

Supported values for language:

  • zh: Chinese (Mandarin, Sichuanese, Minnan, and Wu)
  • yue: Cantonese
  • en: English
  • ja: Japanese
  • de: German
  • ko: Korean
  • ru: Russian
  • fr: French
  • pt: Portuguese
  • ar: Arabic
  • it: Italian
  • es: Spanish
  • hi: Hindi
  • id: Indonesian
  • th: Thai
  • tr: Turkish
  • uk: Ukrainian
  • vi: Vietnamese
  • cs: Czech
  • da: Danish
  • fil: Filipino
  • fi: Finnish
  • is: Icelandic
  • ms: Malay
  • no: Norwegian
  • pl: Polish
  • sv: Swedish

Key interfaces

OmniRealtimeConversation

Import: com.alibaba.dashscope.audio.omni.OmniRealtimeConversation

This class manages the WebSocket lifecycle: connecting to the server, sending audio, and ending the session.

Create a conversation

OmniRealtimeConversation conversation =
        new OmniRealtimeConversation(param, callback);

Creates a new conversation instance with the specified connection parameters and callback handler.

Connect to the server

conversation.connect();

Opens a WebSocket connection. The server responds with session.created and session.updated events.

Throws: NoApiKeyException, InterruptedException.

Configure the session

conversation.updateSession(config);

Updates the session configuration after the connection is established. The server responds with a session.updated event. If not called, the server uses default settings.

Send audio data

conversation.appendAudio(audioBase64);

Appends a Base64-encoded audio segment to the server-side audio buffer.

  • VAD mode (enableTurnDetection=true): The server detects speech boundaries and decides when to process the buffer.

  • Manual mode (enableTurnDetection=false): Audio accumulates in the buffer until you call commit() to trigger recognition. Each event can contain up to 15 MiB of audio data.
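Since appendAudio() takes a Base64 string, raw audio has to be split and encoded before it is sent. The sketch below shows one way to do this with the JDK only. It assumes 16-bit mono PCM (this reference does not state the bit depth) and an arbitrary chunk length of 100 ms. PcmChunker is a hypothetical helper, not an SDK class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Base64;
import java.util.List;

// Hypothetical helper (not an SDK class): splits raw PCM into Base64-encoded chunks.
final class PcmChunker {
    private PcmChunker() {}

    // Bytes in chunkMs milliseconds of audio, ASSUMING 16-bit (2-byte) mono samples.
    static int chunkBytes(int sampleRateHz, int chunkMs) {
        return sampleRateHz * 2 * chunkMs / 1000;
    }

    // Encodes each slice of at most chunkBytes bytes as one Base64 string.
    static List<String> toBase64Chunks(byte[] pcm, int chunkBytes) {
        if (chunkBytes <= 0) {
            throw new IllegalArgumentException("chunkBytes must be positive: " + chunkBytes);
        }
        List<String> chunks = new ArrayList<>();
        for (int offset = 0; offset < pcm.length; offset += chunkBytes) {
            int end = Math.min(offset + chunkBytes, pcm.length);
            byte[] slice = Arrays.copyOfRange(pcm, offset, end);
            chunks.add(Base64.getEncoder().encodeToString(slice));
        }
        return chunks;
    }
}
```

With these assumptions, 100 ms of 16000 Hz audio is 3200 bytes per chunk. Each returned string can be passed to conversation.appendAudio(...) in order.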

Commit the audio buffer

conversation.commit();

Submits the buffered audio for recognition. The server responds with an input_audio_buffer.committed event.

This method is only available in manual mode (enableTurnDetection=false). An error occurs if the audio buffer is empty.

End the session

conversation.endSession();  // synchronous
// or
conversation.endSessionAsync();  // asynchronous

Notifies the server to finish processing any remaining audio and end the session. The server responds with a session.finished event.

When to call:

  • VAD mode: After you finish sending audio.

  • Manual mode: After you call commit().

Close the connection

conversation.close();

Stops the task and closes the WebSocket connection immediately.
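The call order for manual mode described above (append, commit, end the session, close) can be sketched as follows. To keep the example runnable without the SDK, it uses a stand-in interface, AudioSession, that mirrors only the method names from this reference. It is not the SDK type, and the real methods may declare checked exceptions that this sketch ignores. It also assumes connect() and updateSession() have already been called.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for the subset of OmniRealtimeConversation used here; NOT the SDK type.
interface AudioSession {
    void appendAudio(String audioBase64);
    void commit();
    void endSession();
    void close();
}

// Hypothetical helper showing the manual-mode call order.
final class ManualModeFlow {
    private ManualModeFlow() {}

    static void run(AudioSession session, List<String> base64Chunks) {
        try {
            if (base64Chunks.isEmpty()) {
                // commit() on an empty buffer is an error, so there is nothing to submit.
                return;
            }
            for (String chunk : base64Chunks) {
                session.appendAudio(chunk);
            }
            session.commit();      // submit the buffered audio for recognition
            session.endSession();  // let the server finish remaining work
        } finally {
            session.close();       // always release the connection
        }
    }
}

// Test double that records which methods were called, in order.
final class RecordingSession implements AudioSession {
    final List<String> calls = new ArrayList<>();

    @Override public void appendAudio(String audioBase64) { calls.add("appendAudio"); }
    @Override public void commit() { calls.add("commit"); }
    @Override public void endSession() { calls.add("endSession"); }
    @Override public void close() { calls.add("close"); }
}
```

In VAD mode the same shape applies without the commit() call, because the server decides when to process the buffer.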

Get session and response IDs

String sessionId = conversation.getSessionId();
String responseId = conversation.getResponseId();

  • getSessionId() returns the session ID for the current task.

  • getResponseId() returns the response ID of the most recent server response.

OmniRealtimeCallback

Import: com.alibaba.dashscope.audio.omni.OmniRealtimeCallback

Extend this class and implement the callback methods to handle server events.

| Method | Parameters | Triggered when |
| --- | --- | --- |
| onOpen() | None | The WebSocket connection is established. |
| onEvent(JsonObject message) | message: A server event as JSON. Common event types: session.created, session.updated, input_audio_buffer.committed, conversation.item.input_audio_transcription.completed, session.finished. | A server event is received. Parse the type field to determine the event type. |
| onClose(int code, String reason) | code: Status code. reason: Reason for closing. | The WebSocket connection is closed. |
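Dispatching on the type field is easier to keep correct if the event names are mapped to an enum once. The sketch below covers only the common event types named in this reference and treats everything else as UNKNOWN. AsrEvent is a hypothetical type, not part of the SDK.

```java
// Hypothetical enum (not an SDK type): maps the "type" field of a server event to a constant.
enum AsrEvent {
    SESSION_CREATED("session.created"),
    SESSION_UPDATED("session.updated"),
    BUFFER_COMMITTED("input_audio_buffer.committed"),
    TRANSCRIPTION_COMPLETED("conversation.item.input_audio_transcription.completed"),
    SESSION_FINISHED("session.finished"),
    UNKNOWN("");

    private final String wireName;

    AsrEvent(String wireName) {
        this.wireName = wireName;
    }

    // Returns the matching constant, or UNKNOWN for null or unlisted types.
    static AsrEvent fromType(String type) {
        for (AsrEvent event : values()) {
            if (event != UNKNOWN && event.wireName.equals(type)) {
                return event;
            }
        }
        return UNKNOWN;
    }
}
```

Inside onEvent, and assuming message is a Gson JsonObject, the type string could be read with message.get("type").getAsString() and passed to AsrEvent.fromType(...) before a switch on the result.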