Key interfaces and request parameters for the Qwen-Omni-Realtime DashScope Java SDK.
Prerequisites
Ensure that your Java SDK version is 2.20.9 or later. Before you begin, see Real-time multimodal interaction flow.
Getting started
You can download the sample code from GitHub. The following three usage scenarios are provided:
- Audio conversation example: Captures real-time audio from a microphone, enables VAD mode (automatic voice activity detection), and supports voice interruption. Set the enableTurnDetection parameter to true. Use headphones for audio playback to prevent echoes from triggering voice interruption.
- Audio and video conversation example: Captures real-time audio and video from a microphone and camera, enables VAD mode, and supports voice interruption. Set the enableTurnDetection parameter to true. Use headphones for audio playback to prevent echoes from triggering voice interruption.
- Local call example: Uses local audio and images as input and enables Manual mode (manual control over the sending pace). Set the enableTurnDetection parameter to false.
Request parameters
Configure the following request parameters using the chained methods or setters of the OmniRealtimeParam object. Then, pass this object as a parameter to the OmniRealtimeConversation constructor.
| Parameter | Type | Description |
| --- | --- | --- |
| model | String | The name of the Qwen-Omni real-time model. For more information, see Model list. |
| url | String | The endpoint URL. |
Configure the following request parameters using the chained methods or setters of the OmniRealtimeConfig object. Then, pass this object as a parameter to the updateSession interface.
| Parameter | Type | Description |
| --- | --- | --- |
| modalities | `List<OmniRealtimeModality>` | The output modalities of the model. Set it to `[OmniRealtimeModality.TEXT]` for text-only output, or `[OmniRealtimeModality.TEXT, OmniRealtimeModality.AUDIO]` for audio and text output. |
| voice | String | The voice used for the model's audio output. For a list of supported voices, see Voice list. |
| inputAudioFormat | OmniRealtimeAudioFormat | The format of the user's input audio. Currently, only PCM_16000HZ_MONO_16BIT is supported. |
| outputAudioFormat | OmniRealtimeAudioFormat | The format of the model's output audio. Only one output format is currently supported. |
| smooth_output | Boolean | This parameter is supported only by the Qwen3-Omni-Flash-Realtime series. |
| instructions | String | A system message that sets the goal or role for the model. Example: "You are an AI customer service agent for a five-star hotel. Answer customer questions about room types, facilities, prices, and booking policies accurately and in a friendly manner. Always respond professionally and helpfully. Do not provide unverified information or information outside the scope of the hotel's services." |
| enableInputAudioTranscription | Boolean | Specifies whether to enable speech recognition for the input audio. |
| inputAudioTranscription | String | The speech recognition model used for transcribing input audio. Currently, only gummy-realtime-v1 is supported. |
| enableTurnDetection | Boolean | Specifies whether to enable voice activity detection (VAD). If disabled, you must manually submit audio to trigger a model response. |
| turnDetectionType | String | The server-side VAD type. Fixed value: "server_vad". |
| turnDetectionThreshold | Float | The VAD threshold. Increase this value in noisy environments and decrease it in quiet environments. Default value: 0.5. Value range: [-1.0, 1.0]. |
| turnDetectionSilenceDurationMs | Integer | The duration of silence (in milliseconds) that indicates the end of speech. The model triggers a response after this duration elapses. Default value: 800. Value range: [200, 6000]. |
| temperature | float | The sampling temperature, which controls the diversity of the generated content. A higher value produces more diverse content; a lower value produces more deterministic content. Value range: [0, 2). Because both temperature and top_p control diversity, we recommend setting only one of them. The default value depends on the model. |
| top_p | float | The probability threshold for nucleus sampling, which controls the diversity of the generated content. A higher value produces more diverse content; a lower value produces more deterministic content. Value range: (0, 1.0]. Because both temperature and top_p control diversity, we recommend setting only one of them. The default value depends on the model. |
| top_k | integer | The size of the candidate token set for sampling during generation. For example, a value of 50 means that only the 50 highest-scoring tokens are considered for random sampling in each generation step. A larger value increases randomness; a smaller value increases determinism. If the value is None or greater than 100, top_k sampling is disabled and only top_p sampling takes effect. The value must be 0 or greater. The default value depends on the model. |
| max_tokens | integer | The maximum number of tokens that can be returned in the response. The default and maximum values equal the model's maximum output length; see Model list. Use max_tokens when you need to limit the output length, such as when generating summaries or keywords, controlling costs, or reducing response latency. |
| repetition_penalty | float | Controls the repetition penalty for consecutive sequences during generation. A higher value reduces repetition; a value of 1.0 means no penalty is applied. There is no strict value range, but the value must be greater than 0. Default value: 1.05. |
| presence_penalty | float | Controls the likelihood of repeated tokens in the generated content. Positive values reduce repetition; negative values increase it. Default value: 0.0. Value range: [-2.0, 2.0]. A higher presence_penalty suits scenarios that call for diversity or creativity, such as creative writing or brainstorming. A lower presence_penalty suits scenarios that call for consistency or professional terminology, such as technical documents or other formal writing. |
| seed | integer | Makes the model's output more deterministic; typically used to ensure consistent results across runs. If you pass the same seed value in each call and keep the other parameters unchanged, the model returns the same result whenever possible. Value range: 0 to 2^31 - 1. Default value: -1. |
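As a client-side sanity check, the documented VAD ranges can be validated before sending a session update. The helper class below is a hypothetical illustration of those ranges, not part of the SDK:

```java
// Hypothetical client-side check of the documented VAD parameter ranges.
// Not part of the DashScope SDK; the server performs its own validation.
public class VadConfigCheck {
    public static boolean isValid(float threshold, int silenceDurationMs) {
        // turnDetectionThreshold: value range [-1.0, 1.0], default 0.5
        boolean thresholdOk = threshold >= -1.0f && threshold <= 1.0f;
        // turnDetectionSilenceDurationMs: value range [200, 6000], default 800
        boolean silenceOk = silenceDurationMs >= 200 && silenceDurationMs <= 6000;
        return thresholdOk && silenceOk;
    }

    public static void main(String[] args) {
        System.out.println(isValid(0.5f, 800));  // defaults: true
        System.out.println(isValid(1.5f, 800));  // threshold out of range: false
        System.out.println(isValid(0.5f, 100));  // silence duration too short: false
    }
}
```

Raising the threshold toward 1.0 makes detection less sensitive (useful in noisy rooms); lowering it makes it more sensitive.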
Key interfaces
OmniRealtimeConversation class
Import the OmniRealtimeConversation class using import com.alibaba.dashscope.audio.omni.OmniRealtimeConversation;.
| Method | Server response event (sent via callback) | Description |
| --- | --- | --- |
| connect | Session created; session configuration updated | Creates a connection to the server. |
| updateSession | Session configuration updated | Updates the default configuration for the current session. For parameter settings, see the Request parameters section. When you establish a connection, the server returns the default input and output configurations for the session. To update the default session configuration, we recommend calling this method immediately after the connection is established. After receiving the session.update event, the server validates the parameters. If the parameters are invalid, an error is returned; otherwise, the server-side session configuration is updated. |
| appendAudio | None | Appends Base64-encoded audio data to the cloud input audio buffer (temporary storage for writing data before submission). |
| appendVideo | None | Adds Base64-encoded image data to the cloud video buffer (local images or real-time video stream captures). |
| clearAppendedAudio | Audio received by the server is cleared | Clears the audio in the current cloud buffer. |
| commit | Server received the submitted audio | Submits audio and video previously added to the cloud buffer via append. Returns an error if the input audio buffer is empty. |
| createResponse | Response generation started; new output content available in the response; conversation item created; new output content added to the assistant message item; response.audio_transcript.delta (incrementally generated transcript text); incrementally generated audio from the model; response.audio_transcript.done (text transcription completed); audio generation completed; streaming of text or audio content for the assistant message is complete; streaming of the entire output item for the assistant message is complete; response completed | Instructs the server to create a model response. When the session has "turn_detection" enabled, the server automatically creates model responses. |
| cancelResponse | None | Cancels the in-progress response. Returns an error if no response exists to cancel. |
| close | None | Stops the task and closes the connection. |
| getSessionId | None | Gets the session ID of the current task. |
| getResponseId | None | Gets the response ID of the most recent response. |
| getFirstTextDelay | None | Gets the first-packet text latency of the most recent response. |
| getFirstAudioDelay | None | Gets the first-packet audio latency of the most recent response. |
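The buffer semantics described above (append audio, then commit, with an error on an empty buffer) can be sketched with a local stand-in. The class below only simulates the documented behavior; the method names mirror the SDK style but this is not the real OmniRealtimeConversation:

```java
import java.util.Base64;

// Local stand-in simulating the documented input-buffer semantics.
// Not the real SDK class; no network calls are made.
public class ConversationSketch {
    private int bufferedBytes = 0;

    // Append Base64-encoded PCM audio to the (simulated) input buffer.
    public void appendAudio(String audioB64) {
        bufferedBytes += Base64.getDecoder().decode(audioB64).length;
    }

    // Submit the buffer; errors if nothing was appended, as documented.
    public void commit() {
        if (bufferedBytes == 0) {
            throw new IllegalStateException("input audio buffer is empty");
        }
        bufferedBytes = 0; // buffer is consumed on submit
    }

    public int buffered() { return bufferedBytes; }

    public static void main(String[] args) {
        ConversationSketch conv = new ConversationSketch();
        // Manual mode order: append audio, then commit, then create the response.
        conv.appendAudio(Base64.getEncoder().encodeToString(new byte[3200]));
        conv.commit(); // succeeds: buffer was non-empty
        try {
            conv.commit(); // second commit: buffer is now empty
        } catch (IllegalStateException e) {
            System.out.println("commit on empty buffer rejected");
        }
    }
}
```

The same ordering applies against the real service: data reaches the model only after a commit, and an empty-buffer commit is rejected.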
Callback interface (OmniRealtimeCallback)
The server uses callbacks to return events and data to the client. Implement these callback methods to handle server-side events and data.
Import the interface using import com.alibaba.dashscope.audio.omni.OmniRealtimeCallback;.
| Method | Parameter | Return value | Description |
| --- | --- | --- | --- |
| onOpen | None | None | Called immediately after a connection to the server is established. |
| onEvent | message: the server response event. | None | Called when the server returns an event, including method call responses and model-generated text and audio. For details, see Server events. |
| onClose | code: the status code for closing the WebSocket. reason: the reason for closing the WebSocket. | None | Called after the connection to the server is closed. |
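A minimal sketch of how the three callbacks fit together, using a local stand-in interface. The real interface is com.alibaba.dashscope.audio.omni.OmniRealtimeCallback, and its event payload is a structured object rather than the plain String assumed here:

```java
import java.util.ArrayList;
import java.util.List;

// Local stand-in for the callback contract described above.
// The event payload is simplified to String for illustration.
interface RealtimeCallbackSketch {
    void onOpen();                          // connection established
    void onEvent(String message);           // server response event
    void onClose(int code, String reason);  // connection closed
}

public class CallbackDemo {
    // Feeds a simulated event sequence through a callback and logs it.
    public static List<String> run() {
        List<String> log = new ArrayList<>();
        RealtimeCallbackSketch cb = new RealtimeCallbackSketch() {
            public void onOpen() { log.add("open"); }
            public void onEvent(String message) { log.add("event:" + message); }
            public void onClose(int code, String reason) { log.add("close:" + code); }
        };
        cb.onOpen();                    // fired once after connect
        cb.onEvent("session.created");  // simulated server event
        cb.onClose(1000, "client closed");
        return log;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

In a real session, onEvent carries every server event (session updates, transcript deltas, audio deltas), so implementations typically dispatch on the event type.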
FAQ
Q: How are the input audio and images aligned?
A: The omni-realtime model uses the audio stream as a timeline and inserts images based on their send time. You can add images at any point on the timeline and enable or disable video input at any time.
Q: What is the recommended frequency for inputting images and audio?
A: For real-time interactions, send images at 1-2 fps and audio in 100 ms packets.
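For PCM_16000HZ_MONO_16BIT input, a 100 ms packet works out to 16,000 samples/s × 2 bytes × 0.1 s = 3,200 bytes, and Base64 encoding grows the payload by roughly a third before it is appended. A quick check:

```java
public class PacketSize {
    // Bytes in one PCM packet: sampleRate * bytesPerSample * (packetMs / 1000).
    public static int pcmBytes(int sampleRateHz, int bytesPerSample, int packetMs) {
        return sampleRateHz * bytesPerSample * packetMs / 1000;
    }

    public static void main(String[] args) {
        // PCM_16000HZ_MONO_16BIT: 16 kHz, mono, 2 bytes per sample, 100 ms packet.
        int bytes = pcmBytes(16000, 2, 100);
        System.out.println(bytes); // 3200
        // Base64 expands the payload by ~4/3 (with padding).
        int b64Len = java.util.Base64.getEncoder()
                .encodeToString(new byte[bytes]).length();
        System.out.println(b64Len); // 4268
    }
}
```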
Q: What is the difference between the two modes of the turn_detection switch?
A: Currently, only the server_vad mode is supported when turn_detection is enabled.
- When turn_detection is enabled:
  - Input state: Cloud-based VAD analyzes the input audio to detect the end of a sentence. The service then automatically calls the omni model for inference and returns text and audio responses.
  - Response state: Audio and video input continues without interruption while the model responds. After the response completes, the session returns to the input state and awaits further speech.
  - Interruption: If the user speaks during the model's response, the service immediately stops the current response and switches to the input state.
- When turn_detection is disabled:
  - You must determine when the input turn ends and manually trigger omni model inference by calling commit and create_response.
  - You must stop sending audio and video input while the model generates a response, and resume input only after the response completes.
  - You must call response_cancel to interrupt the model's response.

When turn_detection is enabled, you can still manually trigger responses using commit and create_response, and interrupt using response_cancel.
Q: Why do I need to select another model for input_audio_transcription?
A: The omni model is an end-to-end multimodal model: its text output responds to the input rather than transcribing it, so a separate ASR model is required for transcription.