
Alibaba Cloud Model Studio: Java SDK

Last Updated: Mar 15, 2026

Key interfaces and request parameters for the Qwen-Omni-Realtime DashScope Java SDK.

Prerequisites

Ensure that your Java SDK version is 2.20.9 or later. Before you begin, see Real-time multimodal interaction flow.

Getting started

You can download the sample code from GitHub. The following three usage scenarios are provided:

  1. Audio conversation example: Captures real-time audio from a microphone, enables VAD mode (automatic voice activity detection), and supports voice interruption.

    Set the enableTurnDetection parameter to true.
    Use headphones for audio playback to prevent echoes from triggering voice interruption.
  2. Audio and video conversation example: Captures real-time audio and video from a microphone and camera, enables VAD mode, and supports voice interruption.

    Set the enableTurnDetection parameter to true.
    Use headphones for audio playback to prevent echoes from triggering voice interruption.
  3. Local call: Uses local audio and images as input and enables Manual mode (manual control over the sending pace).

    Set the enableTurnDetection parameter to false.

Request parameters

Configure the following request parameters using the chained methods or setters of the OmniRealtimeParam object. Then, pass this object as a parameter to the OmniRealtimeConversation constructor.

model (String)

The name of the Qwen-Omni real-time model. For more information, see Model list.

url (String)

The endpoint URL:

  • Singapore: wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime

  • China (Beijing) region: wss://dashscope.aliyuncs.com/api-ws/v1/realtime
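
The following sketch shows how these two parameters might be combined with an OmniRealtimeConversation. The model() and url() builder methods mirror the parameter names above; the builder itself, the apiKey() setter, and the callback argument of the constructor are assumptions based on common DashScope SDK conventions, so check the GitHub samples for the exact signatures.

import com.alibaba.dashscope.audio.omni.OmniRealtimeCallback;
import com.alibaba.dashscope.audio.omni.OmniRealtimeConversation;
import com.alibaba.dashscope.audio.omni.OmniRealtimeParam;
import com.google.gson.JsonObject;

// Connection parameters; apiKey() is an assumed setter for the DashScope API key.
OmniRealtimeParam param = OmniRealtimeParam.builder()
        .model("qwen3-omni-flash-realtime") // see Model list
        .url("wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime") // Singapore endpoint
        .apiKey(System.getenv("DASHSCOPE_API_KEY"))
        .build();

// A minimal callback; see the OmniRealtimeCallback section below for details.
OmniRealtimeCallback callback = new OmniRealtimeCallback() {
    @Override
    public void onOpen() { /* connected */ }
    @Override
    public void onEvent(JsonObject message) { /* handle server events */ }
    @Override
    public void onClose(int code, String reason) { /* connection closed */ }
};

// connect() throws NoApiKeyException and InterruptedException; declare or handle them in real code.
OmniRealtimeConversation conversation = new OmniRealtimeConversation(param, callback);
conversation.connect();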

Configure the following request parameters using the chained methods or setters of the OmniRealtimeConfig object. Then, pass this object as a parameter to the updateSession interface.

modalities (List<OmniRealtimeModality>)

The output modalities of the model. Set it to [OmniRealtimeModality.TEXT] for text output only, or [OmniRealtimeModality.TEXT, OmniRealtimeModality.AUDIO] for audio and text output.

voice (String)

The voice used for the model's audio output. For a list of supported voices, see Voice list.

Default voice:

  • Qwen3-Omni-Flash-Realtime: "Cherry"

  • Qwen-Omni-Turbo-Realtime: "Chelsie"

inputAudioFormat (OmniRealtimeAudioFormat)

The format of the user's input audio. Currently, only PCM_16000HZ_MONO_16BIT is supported.

outputAudioFormat (OmniRealtimeAudioFormat)

The format of the model's output audio. Currently, only pcm is supported.

smooth_output (Boolean)

This parameter is supported only by the Qwen3-Omni-Flash-Realtime series.

  • true: Conversational responses.

  • false: Formal responses. However, performance may be suboptimal if the content is difficult to read aloud.

  • null: Default value. The model automatically chooses between conversational and formal response styles.

Note

Set smooth_output using the parameters method of the OmniRealtimeConfig instance:

conversation.updateSession(OmniRealtimeConfig.builder()
        .modalities(Arrays.asList(OmniRealtimeModality.AUDIO, OmniRealtimeModality.TEXT))
        .voice("Cherry")
        .enableTurnDetection(true)
        .enableInputAudioTranscription(true)
        .parameters(Map.of("smooth_output", true))
        .build()
);

instructions (String)

A system message that sets the goal or role for the model.

Example: "You are an AI customer service agent for a five-star hotel. Answer customer questions about room types, facilities, prices, and booking policies accurately and in a friendly manner. Always respond professionally and helpfully. Do not provide unverified information or information outside the scope of the hotel's services."

Note

Set instructions using the parameters method of the OmniRealtimeConfig instance. The usage is the same as for smooth_output.
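
For example, following the same pattern as the smooth_output snippet above, the system message could be passed through the parameters map (the surrounding builder calls and values are illustrative):

conversation.updateSession(OmniRealtimeConfig.builder()
        .modalities(Arrays.asList(OmniRealtimeModality.AUDIO, OmniRealtimeModality.TEXT))
        .voice("Cherry")
        .parameters(Map.of(
                "instructions", "You are an AI customer service agent for a five-star hotel."))
        .build()
);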

enableInputAudioTranscription (Boolean)

Specifies whether to enable speech recognition for the input audio.

inputAudioTranscription (String)

The speech recognition model used for transcribing input audio. Currently, only gummy-realtime-v1 is supported.

enableTurnDetection (Boolean)

Specifies whether to enable voice activity detection (VAD). If disabled, you must manually submit audio to trigger a model response.

turnDetectionType (String)

The server-side VAD type. Fixed value: "server_vad".

turnDetectionThreshold (Float)

The VAD threshold. Increase this value in noisy environments and decrease it in quiet environments.

  • A value closer to -1 increases the probability that noise is detected as speech.

  • A value closer to 1 decreases the probability that noise is detected as speech.

Default value: 0.5. Value range: [-1.0, 1.0].

turnDetectionSilenceDurationMs (Integer)

The duration of silence (in milliseconds) that indicates the end of speech. The model triggers a response after this duration elapses. Default value: 800. Value range: [200, 6000].
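
For example, the VAD behavior described above could be tuned when updating the session. This is a sketch that assumes the OmniRealtimeConfig builder exposes methods named after the parameters in this table (turnDetectionThreshold, turnDetectionSilenceDurationMs); the values are illustrative.

conversation.updateSession(OmniRealtimeConfig.builder()
        .modalities(Arrays.asList(OmniRealtimeModality.AUDIO, OmniRealtimeModality.TEXT))
        .enableTurnDetection(true)             // server_vad mode
        .turnDetectionThreshold(0.3f)          // lower threshold for a quiet environment
        .turnDetectionSilenceDurationMs(600)   // end the turn after 600 ms of silence
        .build()
);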

temperature (float)

The sampling temperature, which controls the diversity of the generated content.

A higher temperature value results in more diverse content. A lower value results in more deterministic content.

Value range: [0, 2).

Because both temperature and top_p control content diversity, we recommend setting only one of them.

Default values of temperature:

  • qwen3-omni-flash-realtime series: 0.9

  • qwen-omni-turbo-realtime series: 1.0

The qwen-omni-turbo models do not support modifying this parameter.

Note

Set temperature using the parameters method of the OmniRealtimeConfig instance. The usage is the same as for smooth_output.

top_p (float)

The probability threshold for nucleus sampling, which controls the diversity of the generated content.

A higher top_p value results in more diverse content. A lower value results in more deterministic content.

Value range: (0, 1.0].

Because both temperature and top_p control content diversity, we recommend setting only one of them.

Default values of top_p:

  • qwen3-omni-flash-realtime series: 1.0

  • qwen-omni-turbo-realtime series: 0.01

The qwen-omni-turbo models do not support modifying this parameter.

Note

Set top_p using the parameters method of the OmniRealtimeConfig instance. The usage is the same as for smooth_output.

top_k (integer)

The size of the candidate token set for sampling during generation. For example, a value of 50 means that only the top 50 tokens with the highest scores are considered for random sampling in each generation step. A larger value increases randomness, while a smaller value increases determinism. If the value is null or greater than 100, top_k sampling is disabled and only top_p sampling takes effect.

The value must be 0 or greater.

Default values of top_k:

  • qwen3-omni-flash-realtime series: 50

  • qwen-omni-turbo-realtime series: 20

The qwen-omni-turbo models do not support modifying this parameter.

Note

Set top_k using the parameters method of the OmniRealtimeConfig instance. The usage is the same as for smooth_output.

max_tokens (integer)

The maximum number of tokens that can be returned in the response.

max_tokens does not affect the model's generation process. If the number of tokens generated by the model exceeds max_tokens, the returned content is truncated.

The default and maximum values are the maximum output length of the model. For the maximum output length of each model, see Model List.

Use the max_tokens parameter in scenarios where you need to limit the output length, such as generating summaries or keywords, controlling costs, or reducing response latency.

The qwen-omni-turbo models do not support modifying this parameter.

Note

Set max_tokens using the parameters method of the OmniRealtimeConfig instance. The usage is the same as for smooth_output.

repetition_penalty (float)

Controls the repetition penalty for consecutive sequences during model generation. A higher repetition_penalty value reduces repetition. A value of 1.0 means no penalty is applied. There is no strict value range, but the value must be greater than 0.

Default value: 1.05.

The qwen-omni-turbo models do not support modifying this parameter.

Note

Set repetition_penalty using the parameters method of the OmniRealtimeConfig instance. The usage is the same as for smooth_output.

presence_penalty (float)

Controls the likelihood of repeated tokens in the generated content.

Default value: 0.0. Value range: [-2.0, 2.0]. Positive values reduce repetition, while negative values increase it.

Use cases:

A higher presence_penalty is suitable for scenarios that require diversity, interest, or creativity, such as creative writing or brainstorming.

A lower presence_penalty is suitable for scenarios that require consistency or the use of professional terms, such as technical documents or other formal writing.

The qwen-omni-turbo models do not support modifying this parameter.

Note

Set presence_penalty using the parameters method of the OmniRealtimeConfig instance. The usage is the same as for smooth_output.

seed (integer)

Setting the seed parameter makes the model's output more deterministic. It is typically used to ensure consistent results across runs.

If you pass the same seed value in each model call and keep other parameters unchanged, the model returns the same result whenever possible.

Value range: 0 to 2³¹ - 1. Default value: -1.

The qwen-omni-turbo models do not support modifying this parameter.

Note

Set seed using the parameters method of the OmniRealtimeConfig instance. The usage is the same as for smooth_output.
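
As with smooth_output, the sampling parameters above are passed through the parameters map of OmniRealtimeConfig. The following is a sketch with illustrative values (the key names follow this table; only temperature is set because temperature and top_p should not be combined):

conversation.updateSession(OmniRealtimeConfig.builder()
        .modalities(Arrays.asList(OmniRealtimeModality.AUDIO, OmniRealtimeModality.TEXT))
        .voice("Cherry")
        .parameters(Map.<String, Object>of(
                "temperature", 0.7,
                "max_tokens", 1024,
                "repetition_penalty", 1.05,
                "seed", 1234))
        .build()
);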

Key interfaces

OmniRealtimeConversation class

Import the OmniRealtimeConversation class using import com.alibaba.dashscope.audio.omni.OmniRealtimeConversation;.

Each method is listed below together with the server response events that it triggers (sent to the client via callback) and a description.

public void connect() throws NoApiKeyException, InterruptedException

Server response events: session.created (Session created), session.updated (Session configuration updated)

Creates a connection to the server.

public void updateSession(OmniRealtimeConfig config)

Server response event: session.updated (Session configuration updated)

Updates the default configuration for the current session. For parameter settings, see the Request parameters section.

When you establish a connection, the server returns the default input and output configurations for the session. To update the default session configuration, we recommend calling this method immediately after the connection is established.

After receiving the session.update event, the server validates the parameters. If the parameters are invalid, an error is returned. Otherwise, the server-side session configuration is updated.

public void appendAudio(String audioBase64)

Server response event: none

Appends Base64-encoded audio data to the cloud input audio buffer (temporary storage for writing data before submission).

  • If "turn_detection" is enabled, the server uses the buffer to detect speech and decides when to submit it.

  • If "turn_detection" is disabled, the client controls audio amount per event (up to 15 MiB). Smaller blocks can improve VAD responsiveness.

public void appendVideo(String videoBase64)

Server response event: none

Adds Base64-encoded image data to the cloud video buffer (local images or real-time video stream captures). A usage sketch follows the limits below.

Image input limits:

  • Format: JPG or JPEG. Recommended resolution: 480p or 720p (max 1080p).

  • Size: Max 500 KB per image (before Base64 encoding).

  • Encoding: Must be Base64-encoded.

  • Frequency: 1 image per second.
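
A minimal sketch of feeding local media into the cloud buffers through appendAudio and appendVideo, assuming a raw 16 kHz/16-bit mono PCM file and a JPEG that already satisfies the limits above (file paths are placeholders; error handling is omitted):

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Base64;

// Read a local 16 kHz / 16-bit mono PCM file and append it in ~100 ms chunks
// (16000 samples/s x 2 bytes x 0.1 s = 3200 bytes per chunk).
byte[] pcm = Files.readAllBytes(Paths.get("input.pcm"));
int chunkSize = 3200;
for (int offset = 0; offset < pcm.length; offset += chunkSize) {
    int end = Math.min(offset + chunkSize, pcm.length);
    byte[] chunk = java.util.Arrays.copyOfRange(pcm, offset, end);
    conversation.appendAudio(Base64.getEncoder().encodeToString(chunk));
}

// Append one JPEG frame (max 500 KB before encoding, at most 1 image per second).
byte[] jpeg = Files.readAllBytes(Paths.get("frame.jpg"));
conversation.appendVideo(Base64.getEncoder().encodeToString(jpeg));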

public void clearAppendedAudio()

Server response event: input_audio_buffer.cleared (audio received by the server is cleared)

Clears the audio in the current cloud buffer.

public void commit()

Server response event: input_audio_buffer.committed (the server received the submitted audio)

Submits the audio and video previously added to the cloud buffers through appendAudio and appendVideo. Returns an error if the input audio buffer is empty.

  • If "turn_detection" is enabled, the server automatically submits the audio buffer (client does not need to send this event).

  • If "turn_detection" is disabled, the client must submit the audio buffer to create a user message item.

Note:

  1. If input_audio_transcription is configured, the system transcribes the audio.

  2. Submitting the buffer does not trigger a model response.
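
When enableTurnDetection is false (manual mode), the client decides when a turn ends. The following is a sketch of the sequence using the methods in this table; passing null instructions is an illustrative choice, not a documented default.

// Manual mode: append audio, then explicitly commit the buffer and request a response.
// Committing alone does not trigger inference.
conversation.appendAudio(audioBase64);   // audioBase64: Base64-encoded PCM chunk(s), as in the appendAudio sketch above
conversation.commit();                   // -> input_audio_buffer.committed
conversation.createResponse(null,
        Arrays.asList(OmniRealtimeModality.TEXT, OmniRealtimeModality.AUDIO));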

public void createResponse(String instructions, List<OmniRealtimeModality> modalities)

Server response events:

  • response.created: Server starts generating the response

  • response.output_item.added: New output content is available in the response

  • conversation.item.created: Conversation item created

  • response.content_part.added: New output content added to the assistant message item

  • response.audio_transcript.delta: Incrementally generated transcribed text

  • response.audio.delta: Incrementally generated audio from the model

  • response.audio_transcript.done: Text transcription completed

  • response.audio.done: Audio generation completed

  • response.content_part.done: Streaming of text or audio content for the assistant message is complete

  • response.output_item.done: Streaming of the entire output item for the assistant message is complete

  • response.done: Response completed

Instructs the server to create a model response.

When the session is configured with "turn_detection" mode enabled, the server automatically creates a model response.

public void cancelResponse()

Server response event: none

Cancels the in-progress response. Returns an error if no response exists to cancel.

public void close()

Server response event: none

Stops the task and closes the connection.

public String getSessionId()

Server response event: none

Gets the session ID of the current task.

public String getResponseId()

Server response event: none

Gets the response ID of the most recent response.

public long getFirstTextDelay()

Server response event: none

Gets the first-packet text latency of the most recent response.

public long getFirstAudioDelay()

Server response event: none

Gets the first-packet audio latency of the most recent response.

Callback interface (OmniRealtimeCallback)

The server uses callbacks to return events and data to the client. Implement these callback methods to handle server-side events and data.

Import the interface using import com.alibaba.dashscope.audio.omni.OmniRealtimeCallback;.

public void onOpen()

Parameters: none. Return value: none.

This method is called immediately after a connection is established to the server.

public abstract void onEvent(JsonObject message)

Parameters: message, the server response event. Return value: none.

The events include method call responses and model-generated text and audio. For details, see Server events.

public abstract void onClose(int code, String reason)

Parameters: code, the status code for closing the WebSocket; reason, the reason for closing the WebSocket. Return value: none.

This method is called after the connection to the server is closed.
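
A sketch of a callback implementation that logs connection events and collects streamed audio. It assumes that JsonObject is Gson's com.google.gson.JsonObject, that OmniRealtimeCallback can be extended as shown, and that response.audio.delta events carry the Base64-encoded audio in a "delta" field as described in Server events; adjust it to the actual payloads you receive.

import com.alibaba.dashscope.audio.omni.OmniRealtimeCallback;
import com.google.gson.JsonObject;
import java.io.ByteArrayOutputStream;
import java.util.Base64;

public class LoggingCallback extends OmniRealtimeCallback {
    private final ByteArrayOutputStream audio = new ByteArrayOutputStream();

    @Override
    public void onOpen() {
        System.out.println("connected");
    }

    @Override
    public void onEvent(JsonObject message) {
        String type = message.get("type").getAsString();
        if ("response.audio.delta".equals(type)) {
            // Incremental Base64-encoded audio from the model.
            audio.writeBytes(Base64.getDecoder().decode(message.get("delta").getAsString()));
        } else if ("response.done".equals(type)) {
            System.out.println("response finished, collected audio bytes: " + audio.size());
        }
    }

    @Override
    public void onClose(int code, String reason) {
        System.out.println("closed: code=" + code + ", reason=" + reason);
    }
}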

FAQ

Q: How are the input audio and images aligned?

A: The omni-realtime model uses the audio stream as a timeline, and images are inserted at the point when they are sent. You can add images at any point on the timeline and enable or disable video input at any time.

Q: What is the recommended frequency for inputting images and audio?

A: For real-time interactions, send images at 1-2 fps and audio in 100 ms packets.

Q: What is the difference between the two modes of the turn_detection switch?

A: Currently, only the server_vad mode is supported when turn_detection is enabled.

  • If turn_detection is enabled:

    • Input state: Cloud-based VAD analyzes the input audio to detect the end of a sentence. The service then automatically calls the omni model for inference and returns text and audio responses.

    • Response state: Audio and video input continues without interruption while the model responds. After the response is complete, the service returns to the input state and awaits further speech.

    • Interruption: If the user speaks during the model's response, the service immediately stops the current response and switches to the input state.

  • If turn_detection is disabled:

    • You must determine when the input turn ends and manually trigger omni model inference by calling commit and createResponse.

    • You must stop sending audio and video input while the model generates a response. Resume input only after the response is complete.

    • You must call cancelResponse to interrupt the model's response.

When turn_detection is enabled, you can still manually trigger responses using commit and createResponse, and interrupt using cancelResponse.

Q: Why do I need to select another model for input_audio_transcription?

A: The omni model is an end-to-end multimodal model. Its text output responds to the input rather than transcribing it. Therefore, a separate ASR model is required for transcription.
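
For example, input transcription could be enabled in the session configuration as sketched below. The enableInputAudioTranscription call appears in the snippet earlier in this topic, while the inputAudioTranscription builder method name is an assumption that mirrors the parameter listed above.

conversation.updateSession(OmniRealtimeConfig.builder()
        .modalities(Arrays.asList(OmniRealtimeModality.AUDIO, OmniRealtimeModality.TEXT))
        .enableInputAudioTranscription(true)
        .inputAudioTranscription("gummy-realtime-v1") // ASR model used to transcribe input audio
        .build()
);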