Key interfaces and request parameters for the Qwen-Omni-Realtime DashScope Java SDK.
Prerequisites
Ensure that your Java SDK version is 2.20.9 or later. Before you begin, see Real-time multimodal interaction flow.
Getting started
You can download the sample code from GitHub. The following three usage scenarios are provided:
- Audio conversation example: Captures real-time audio from a microphone, enables VAD mode (automatic voice activity detection), and supports voice interruption. Set the enableTurnDetection parameter to true. Use headphones for audio playback to prevent echoes from triggering voice interruption.
- Audio and video conversation example: Captures real-time audio and video from a microphone and camera, enables VAD mode, and supports voice interruption. Set the enableTurnDetection parameter to true. Use headphones for audio playback to prevent echoes from triggering voice interruption.
- Local call example: Uses local audio and images as input and enables Manual mode (manual control over the sending pace). Set the enableTurnDetection parameter to false.
Request parameters
Configure the following request parameters using the chained methods or setters of the OmniRealtimeParam object. Then, pass this object as a parameter to the OmniRealtimeConversation constructor.
| Parameter | Type | Description |
| --- | --- | --- |
| model | String | The name of the Qwen-Omni real-time model. For more information, see Model list. |
| url | String | The endpoint URL. |
Configure the following request parameters using the chained methods or setters of the OmniRealtimeConfig object. Then, pass this object as a parameter to the updateSession interface.
| Parameter | Type | Description |
| --- | --- | --- |
| modalities | `List<OmniRealtimeModality>` | The output modalities of the model. Set it to `[OmniRealtimeModality.TEXT]` for text-only output, or `[OmniRealtimeModality.TEXT, OmniRealtimeModality.AUDIO]` for audio and text output. |
| voice | String | The voice used for the model's audio output. For a list of supported voices, see Voice list. |
| inputAudioFormat | OmniRealtimeAudioFormat | The format of the user's input audio. Currently, only PCM_16000HZ_MONO_16BIT is supported. |
| outputAudioFormat | OmniRealtimeAudioFormat | The format of the model's output audio. Only one output format is currently supported. |
| smooth_output | Boolean | This parameter is supported only by the Qwen3-Omni-Flash-Realtime series. |
| instructions | String | A system message that sets the goal or role for the model. Example: "You are an AI customer service agent for a five-star hotel. Answer customer questions about room types, facilities, prices, and booking policies accurately and in a friendly manner. Always respond professionally and helpfully. Do not provide unverified information or information outside the scope of the hotel's services." |
| enableInputAudioTranscription | Boolean | Specifies whether to enable speech recognition for the input audio. |
| inputAudioTranscription | String | The speech recognition model used for transcribing input audio. Currently, only gummy-realtime-v1 is supported. |
| enableTurnDetection | Boolean | Specifies whether to enable voice activity detection (VAD). If disabled, you must manually submit audio to trigger a model response. |
| turnDetectionType | String | The server-side VAD type. Fixed value: "server_vad". |
| turnDetectionThreshold | Float | The VAD threshold. Increase this value in noisy environments and decrease it in quiet environments. Default value: 0.5. Value range: [-1.0, 1.0]. |
| turnDetectionSilenceDurationMs | Integer | The duration of silence (in milliseconds) that indicates the end of speech. The model triggers a response after this duration elapses. Default value: 800. Value range: [200, 6000]. |
| temperature | float | The sampling temperature, which controls the diversity of the generated content. A higher value produces more diverse content; a lower value produces more deterministic content. Value range: [0, 2). Because both temperature and top_p control diversity, we recommend setting only one of them. The default value depends on the model. |
| top_p | float | The probability threshold for nucleus sampling, which controls the diversity of the generated content. A higher value produces more diverse content; a lower value produces more deterministic content. Value range: (0, 1.0]. Because both temperature and top_p control diversity, we recommend setting only one of them. The default value depends on the model. |
| top_k | integer | The size of the candidate token set for sampling during generation. For example, a value of 50 means that only the 50 highest-scoring tokens are considered for random sampling in each generation step. A larger value increases randomness; a smaller value increases determinism. If the value is None or greater than 100, top_k sampling is disabled and only top_p sampling takes effect. The value must be 0 or greater. The default value depends on the model. |
| max_tokens | integer | The maximum number of tokens that can be returned in the response. The default and maximum values equal the model's maximum output length; see Model list. Use max_tokens when you need to limit the output length, such as when generating summaries or keywords, controlling costs, or reducing response latency. |
| repetition_penalty | float | Controls the repetition penalty for consecutive sequences during generation. A higher value reduces repetition; a value of 1.0 means no penalty is applied. There is no strict value range, but the value must be greater than 0. Default value: 1.05. |
| presence_penalty | float | Controls the likelihood of repeated tokens in the generated content. Positive values reduce repetition; negative values increase it. Default value: 0.0. Value range: [-2.0, 2.0]. A higher presence_penalty suits scenarios that call for diversity or creativity, such as creative writing or brainstorming. A lower presence_penalty suits scenarios that call for consistency or professional terminology, such as technical documents or other formal writing. |
| seed | integer | Makes the model's output more deterministic; typically used to ensure consistent results across runs. If you pass the same seed value in each call and keep the other parameters unchanged, the model returns the same result whenever possible. Value range: 0 to 2^31 - 1. Default value: -1. |
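As a client-side sanity check, the documented VAD ranges can be validated before sending a session update. The helper class below is a hypothetical illustration of those ranges, not part of the SDK:

```java
// Hypothetical client-side check of the documented VAD parameter ranges.
// Not part of the DashScope SDK; the server performs its own validation.
public class VadConfigCheck {
    public static boolean isValid(float threshold, int silenceDurationMs) {
        // turnDetectionThreshold: value range [-1.0, 1.0], default 0.5
        boolean thresholdOk = threshold >= -1.0f && threshold <= 1.0f;
        // turnDetectionSilenceDurationMs: value range [200, 6000], default 800
        boolean silenceOk = silenceDurationMs >= 200 && silenceDurationMs <= 6000;
        return thresholdOk && silenceOk;
    }

    public static void main(String[] args) {
        System.out.println(isValid(0.5f, 800));  // defaults: true
        System.out.println(isValid(1.5f, 800));  // threshold out of range: false
        System.out.println(isValid(0.5f, 100));  // silence duration too short: false
    }
}
```

Raising the threshold toward 1.0 makes detection less sensitive (useful in noisy rooms); lowering it makes it more sensitive.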
Key interfaces
OmniRealtimeConversation class
Import the OmniRealtimeConversation class using import com.alibaba.dashscope.audio.omni.OmniRealtimeConversation;.
| Method | Server response event (sent via callback) | Description |
| --- | --- | --- |
| connect | Session created; session configuration updated | Creates a connection to the server. |
| updateSession | Session configuration updated | Updates the default configuration for the current session. For parameter settings, see the Request parameters section. When you establish a connection, the server returns the default input and output configurations for the session. To update the default session configuration, we recommend calling this method immediately after the connection is established. After receiving the session.update event, the server validates the parameters. If the parameters are invalid, an error is returned; otherwise, the server-side session configuration is updated. |
| appendAudio | None | Appends Base64-encoded audio data to the cloud input audio buffer (temporary storage for writing data before submission). |
| appendVideo | None | Adds Base64-encoded image data to the cloud video buffer (local images or real-time video stream captures). |
| clearAppendedAudio | Audio received by the server is cleared | Clears the audio in the current cloud buffer. |
| commit | Server received the submitted audio | Submits audio and video previously added to the cloud buffer via append. Returns an error if the input audio buffer is empty. |
| createResponse | Response generation started; new output content available in the response; conversation item created; new output content added to the assistant message item; response.audio_transcript.delta (incrementally generated transcript text); incrementally generated audio from the model; response.audio_transcript.done (text transcription completed); audio generation completed; streaming of text or audio content for the assistant message is complete; streaming of the entire output item for the assistant message is complete; response completed | Instructs the server to create a model response. When the session has "turn_detection" enabled, the server automatically creates model responses. |
| cancelResponse | None | Cancels the in-progress response. Returns an error if no response exists to cancel. |
| close | None | Stops the task and closes the connection. |
| getSessionId | None | Gets the session ID of the current task. |
| getResponseId | None | Gets the response ID of the most recent response. |
| getFirstTextDelay | None | Gets the first-packet text latency of the most recent response. |
| getFirstAudioDelay | None | Gets the first-packet audio latency of the most recent response. |
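The buffer semantics described above (append audio, then commit, with an error on an empty buffer) can be sketched with a local stand-in. The class below only simulates the documented behavior; the method names mirror the SDK style but this is not the real OmniRealtimeConversation:

```java
import java.util.Base64;

// Local stand-in simulating the documented input-buffer semantics.
// Not the real SDK class; no network calls are made.
public class ConversationSketch {
    private int bufferedBytes = 0;

    // Append Base64-encoded PCM audio to the (simulated) input buffer.
    public void appendAudio(String audioB64) {
        bufferedBytes += Base64.getDecoder().decode(audioB64).length;
    }

    // Submit the buffer; errors if nothing was appended, as documented.
    public void commit() {
        if (bufferedBytes == 0) {
            throw new IllegalStateException("input audio buffer is empty");
        }
        bufferedBytes = 0; // buffer is consumed on submit
    }

    public int buffered() { return bufferedBytes; }

    public static void main(String[] args) {
        ConversationSketch conv = new ConversationSketch();
        // Manual mode order: append audio, then commit, then create the response.
        conv.appendAudio(Base64.getEncoder().encodeToString(new byte[3200]));
        conv.commit(); // succeeds: buffer was non-empty
        try {
            conv.commit(); // second commit: buffer is now empty
        } catch (IllegalStateException e) {
            System.out.println("commit on empty buffer rejected");
        }
    }
}
```

The same ordering applies against the real service: data reaches the model only after a commit, and an empty-buffer commit is rejected.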
Callback interface (OmniRealtimeCallback)
The server uses callbacks to return events and data to the client. Implement these callback methods to handle server-side events and data.
Import the interface using import com.alibaba.dashscope.audio.omni.OmniRealtimeCallback;.
| Method | Parameter | Return value | Description |
| --- | --- | --- | --- |
| onOpen | None | None | Called immediately after a connection to the server is established. |
| onEvent | message: the server response event. | None | Called when the server returns an event, including method call responses and model-generated text and audio. For details, see Server events. |
| onClose | code: the status code for closing the WebSocket. reason: the reason for closing the WebSocket. | None | Called after the connection to the server is closed. |
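A minimal sketch of how the three callbacks fit together, using a local stand-in interface. The real interface is com.alibaba.dashscope.audio.omni.OmniRealtimeCallback, and its event payload is a structured object rather than the plain String assumed here:

```java
import java.util.ArrayList;
import java.util.List;

// Local stand-in for the callback contract described above.
// The event payload is simplified to String for illustration.
interface RealtimeCallbackSketch {
    void onOpen();                          // connection established
    void onEvent(String message);           // server response event
    void onClose(int code, String reason);  // connection closed
}

public class CallbackDemo {
    // Feeds a simulated event sequence through a callback and logs it.
    public static List<String> run() {
        List<String> log = new ArrayList<>();
        RealtimeCallbackSketch cb = new RealtimeCallbackSketch() {
            public void onOpen() { log.add("open"); }
            public void onEvent(String message) { log.add("event:" + message); }
            public void onClose(int code, String reason) { log.add("close:" + code); }
        };
        cb.onOpen();                    // fired once after connect
        cb.onEvent("session.created");  // simulated server event
        cb.onClose(1000, "client closed");
        return log;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

In a real session, onEvent carries every server event (session updates, transcript deltas, audio deltas), so implementations typically dispatch on the event type.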
FAQ
Q: How are the input audio and images aligned?
A: The omni-realtime model uses the audio stream as a timeline and inserts images based on their send time. You can add images at any point on the timeline and enable or disable video input at any time.
Q: What is the recommended frequency for inputting images and audio?
A: For real-time interactions, send images at 1-2 fps and audio in 100 ms packets.
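For PCM_16000HZ_MONO_16BIT input, a 100 ms packet works out to 16,000 samples/s × 2 bytes × 0.1 s = 3,200 bytes, and Base64 encoding grows the payload by roughly a third before it is appended. A quick check:

```java
public class PacketSize {
    // Bytes in one PCM packet: sampleRate * bytesPerSample * (packetMs / 1000).
    public static int pcmBytes(int sampleRateHz, int bytesPerSample, int packetMs) {
        return sampleRateHz * bytesPerSample * packetMs / 1000;
    }

    public static void main(String[] args) {
        // PCM_16000HZ_MONO_16BIT: 16 kHz, mono, 2 bytes per sample, 100 ms packet.
        int bytes = pcmBytes(16000, 2, 100);
        System.out.println(bytes); // 3200
        // Base64 expands the payload by ~4/3 (with padding).
        int b64Len = java.util.Base64.getEncoder()
                .encodeToString(new byte[bytes]).length();
        System.out.println(b64Len); // 4268
    }
}
```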
Q: What is the difference between the two modes of the turn_detection switch?
A: Currently, only the server_vad mode is supported when turn_detection is enabled.
- When turn_detection is enabled:
  - Input state: Cloud-based VAD analyzes the input audio to detect the end of a sentence. The service then automatically calls the omni model for inference and returns text and audio responses.
  - Response state: Audio and video input continues without interruption while the model responds. After the response completes, the session returns to the input state and awaits further speech.
  - Interruption: If the user speaks during the model's response, the service immediately stops the current response and switches to the input state.
- When turn_detection is disabled:
  - You must determine when the input turn ends and manually trigger omni model inference by calling commit and create_response.
  - You must stop sending audio and video input while the model generates a response, and resume input only after the response completes.
  - You must call response_cancel to interrupt the model's response.

When turn_detection is enabled, you can still manually trigger responses using commit and create_response, and interrupt using response_cancel.
Q: Why do I need to select another model for input_audio_transcription?
A: The omni model is an end-to-end multimodal model: its text output responds to the input rather than transcribing it, so a separate ASR model is required for transcription.