
Alibaba Cloud Model Studio: Client Events

Last Updated: Mar 31, 2026

This topic describes client events for the Qwen-Omni-Realtime API.

For more information, see Real-time (Qwen-Omni-Realtime).

session.update

After establishing a WebSocket connection, send this event first to update the session's default configuration. On receiving session.update, the server validates the parameters. If any parameter is invalid, the server returns an error; otherwise, it updates the configuration and returns the complete configuration.

type string (Required)

Event type. Set to session.update.

{
    "event_id": "event_ToPZqeobitzUJnt3QqtWg",
    "type": "session.update",
    "session": {
        "modalities": [
            "text",
            "audio"
        ],
        "voice": "Chelsie",
        "input_audio_format": "pcm",
        "output_audio_format": "pcm",
        "instructions": "You are an AI customer service specialist for a five-star hotel. Accurately and friendly answer customer inquiries about room types, facilities, prices, and booking policies. Always respond with a professional and helpful attitude. Do not provide unverified information or information outside the scope of hotel services.",
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.5,
            "silence_duration_ms": 800
        },
        "enable_search": true,
        "search_options": {
            "enable_source": true
        },
        "seed": 1314,
        "max_tokens": 16384,
        "repetition_penalty": 1.05,
        "presence_penalty": 0.0,
        "top_k": 50,
        "top_p": 1.0,
        "temperature": 0.9
    }
}
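The same event can be assembled programmatically. The following Python sketch builds a session.update message with the standard library only. The event_id format ("event_" plus a random hex string) is an assumption modeled on the sample IDs on this page, and the shortened instructions text is illustrative; sending the resulting string over the WebSocket is left to whichever client library you use.

```python
import json
import uuid


def new_event_id() -> str:
    # Hypothetical ID scheme: "event_" plus random hex, mirroring the sample IDs on this page.
    return "event_" + uuid.uuid4().hex


def build_session_update(session: dict) -> str:
    """Serialize a session.update client event to a JSON string."""
    event = {
        "event_id": new_event_id(),
        "type": "session.update",
        "session": session,
    }
    return json.dumps(event, ensure_ascii=False)


message = build_session_update({
    "modalities": ["text", "audio"],
    "voice": "Chelsie",
    "input_audio_format": "pcm",
    "output_audio_format": "pcm",
    "instructions": "You are an AI customer service specialist for a five-star hotel.",
    "turn_detection": {"type": "server_vad", "threshold": 0.5, "silence_duration_ms": 800},
})
print(message)
```

Fields you leave out of the session dictionary keep their server-side defaults.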

session object (Optional)

Session configuration.

Properties

modalities array (Optional)

Model output modality settings. Options:

  • ["text"]

    Output text only.

  • ["text","audio"] (default)

    Output text and audio.

voice string (Optional)

The voice the model uses for generated audio. For supported voices, see Voice Tones List.

Default voice:

  • Qwen3.5-Omni-Plus-Realtime: Tina

  • Qwen3-Omni-Flash-Realtime: Cherry

  • Qwen-Omni-Turbo-Realtime: Chelsie

input_audio_format string (Optional)

User input audio format. Currently, only pcm is supported.

output_audio_format string (Optional)

Model output audio format. Currently, only pcm is supported.

smooth_output boolean|null (Optional)

Applies only when you use the Qwen3-Omni-Flash-Realtime series models.

Controls whether the model responds in a colloquial style. Options:

  • true (default): Get colloquial responses.

  • false: Get more formal, written-style responses. Content that is difficult to read aloud may not be spoken well.

  • null: The model automatically chooses between a colloquial and a formal response style.

instructions string (Optional)

System message. Use it to set the model's goal or role.

turn_detection object (Optional)

Voice Activity Detection (VAD) configuration. Set to null to disable VAD; the client must then trigger model responses manually. If this field is omitted, the server enables VAD with the default parameters listed below.

Properties

type string (Optional)

Server-side VAD type. Set to server_vad, which is also the default.

threshold float (Optional)

VAD sensitivity. Lower values make VAD more sensitive, detecting faint sounds such as background noise as speech. Higher values make it less sensitive, requiring clearer, louder speech to trigger.

Range: [-1.0, 1.0]. Default value: 0.5.

silence_duration_ms integer (Optional)

Minimum duration of silence, in milliseconds, after speech ends. Once this much silence has elapsed, the server triggers a model response. Lower values make responses faster but may cause false triggers during short pauses in speech.

Default value: 800. Parameter range: [200, 6000].
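A client can check these values before sending session.update. The sketch below is a hypothetical helper (build_turn_detection is not part of any SDK) that returns either a turn_detection object within the documented ranges or None, which json.dumps writes as null and therefore disables VAD.

```python
from typing import Optional


def build_turn_detection(
    enabled: bool = True,
    threshold: float = 0.5,
    silence_duration_ms: int = 800,
) -> Optional[dict]:
    """Return a turn_detection value for session.update.

    None serializes to JSON null, which disables VAD (manual mode).
    """
    if not enabled:
        return None
    # Documented ranges: threshold in [-1.0, 1.0], silence_duration_ms in [200, 6000].
    if not -1.0 <= threshold <= 1.0:
        raise ValueError(f"threshold {threshold} outside [-1.0, 1.0]")
    if not 200 <= silence_duration_ms <= 6000:
        raise ValueError(f"silence_duration_ms {silence_duration_ms} outside [200, 6000]")
    return {
        "type": "server_vad",
        "threshold": threshold,
        "silence_duration_ms": silence_duration_ms,
    }


print(build_turn_detection(silence_duration_ms=1200))
print(build_turn_detection(enabled=False))
```

With VAD disabled, the client is responsible for committing the audio buffer and sending response.create itself.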

enable_search boolean (Optional)

Effective only when using the Qwen3.5-Omni-Realtime model.

Enables web search. Set to true to enable; the default is false. When enabled, the model decides on its own whether a web search is needed to answer the user's current question.

search_options object (Optional)

Web search options. These take effect only when enable_search is true.

Properties

enable_source boolean (Optional)

Return a list of search result sources. Set to true to enable.

temperature float (Optional)

Sampling temperature. Controls the diversity of model-generated content.

Higher temperature leads to more diverse content. Lower temperature leads to more deterministic content.

Range: [0, 2)

Because temperature and top_p both control content diversity, set only one of them.

Default temperature:

  • qwen3-omni-flash-realtime series: 0.9

  • qwen-omni-turbo-realtime series: 1.0

This parameter cannot be modified for qwen-omni-turbo models.
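To illustrate why temperature changes diversity, the following sketch applies the standard temperature-scaled softmax to a few made-up logits. This is the textbook definition; this page does not document how the service applies temperature internally, so treat the sketch as an illustration of the concept only.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to probabilities after dividing them by the temperature."""
    # The API also accepts 0; that case is not modeled in this sketch.
    if temperature <= 0:
        raise ValueError("this sketch requires temperature > 0")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max before exp() for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]
print(softmax_with_temperature(logits, 0.5))  # sharper: probability mass concentrates on the first token
print(softmax_with_temperature(logits, 1.5))  # flatter: sampling becomes more diverse
```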

top_p float (Optional)

Nucleus sampling probability threshold. Controls the diversity of model-generated content.

Higher top_p leads to more diverse content. Lower top_p leads to more deterministic content.

Range: (0, 1.0]

Because temperature and top_p both control content diversity, set only one of them.

Default top_p:

  • qwen3-omni-flash-realtime series: 1.0

  • qwen-omni-turbo-realtime series: 0.01

This parameter cannot be modified for qwen-omni-turbo models.
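Nucleus sampling keeps only the smallest set of most probable tokens whose cumulative probability reaches top_p. The sketch below shows that standard definition on a toy distribution; the token strings and probabilities are made up, and the service's internal implementation is not documented here.

```python
def top_p_filter(probs: dict[str, float], top_p: float) -> dict[str, float]:
    """Keep the highest-probability tokens until their cumulative probability
    reaches top_p, then renormalize over the kept set."""
    if not 0.0 < top_p <= 1.0:
        raise ValueError("top_p must be in (0, 1.0]")
    kept: dict[str, float] = {}
    cumulative = 0.0
    for token, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {token: p / total for token, p in kept.items()}


probs = {"room": 0.5, "suite": 0.3, "villa": 0.15, "tent": 0.05}
print(top_p_filter(probs, 0.75))  # keeps "room" and "suite"
```

A smaller top_p shrinks the kept set, which is why lower values make output more deterministic.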

top_k integer (Optional)

The size of the candidate set for sampling during generation. For example, with a value of 50, only the 50 highest-scoring tokens at each generation step are included in the candidate set for random sampling. Larger values increase the randomness of the generated content; smaller values make it more deterministic. A value of null, or any value greater than 100, disables the top_k policy, so only the top_p policy takes effect.

The value must be greater than or equal to 0.

Default top_k:

  • qwen3-omni-flash-realtime series: 50

  • qwen-omni-turbo-realtime series: 20

This parameter cannot be modified for qwen-omni-turbo models.
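The candidate-set behavior described above can be sketched as follows. The helper names are hypothetical, the probabilities are toy values, and the handling of null and values above 100 follows the description on this page; top_k = 0 is rejected because its meaning is not documented here.

```python
import random
from typing import Optional


def top_k_candidates(probs: dict[str, float], top_k: Optional[int]) -> list[str]:
    """Return the candidate tokens that top_k sampling would draw from."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    # Per the description above, None (null) or a value above 100 disables top_k.
    if top_k is None or top_k > 100:
        return ranked
    if top_k < 1:
        raise ValueError("this sketch does not model top_k = 0")
    return ranked[:top_k]


def sample_top_k(probs: dict[str, float], top_k: Optional[int], rng: random.Random) -> str:
    """Draw one token from the candidate set, weighted by its probability."""
    candidates = top_k_candidates(probs, top_k)
    weights = [probs[t] for t in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


probs = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}
print(top_k_candidates(probs, 2))  # ['a', 'b']
print(sample_top_k(probs, 2, random.Random(1314)))
```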

max_tokens integer (Optional)

Maximum number of tokens returned for this request.

max_tokens does not affect how the model generates content. If the model generates more tokens than max_tokens, the returned content is truncated.

Both the default and the maximum value equal the model's maximum output length. For more information, see Model List.

Use max_tokens when you need to limit output length (for example, when generating summaries or keywords), control costs, or reduce response time.

This parameter cannot be modified for qwen-omni-turbo models.

repetition_penalty float (Optional)

Controls repetition of consecutive sequences during generation. A higher repetition_penalty reduces repetition, and a value of 1.0 means no penalty. There is no strict range of valid values, but the value must be greater than 0.

Default value: 1.05.

This parameter cannot be modified for qwen-omni-turbo models.
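This page does not give the formula behind repetition_penalty. As an illustration only, the sketch below uses the multiplicative formulation found in Hugging Face transformers' RepetitionPenaltyLogitsProcessor: positive logits of previously generated tokens are divided by the penalty and negative ones multiplied by it. Treat this as an assumption about the general mechanism, not a description of how Model Studio implements it.

```python
def apply_repetition_penalty(
    logits: dict[str, float], generated: list[str], penalty: float
) -> dict[str, float]:
    """Penalize tokens that already appear in the generated output.

    Assumed formulation: dividing a positive logit or multiplying a negative
    one by penalty > 1.0 lowers it either way; penalty == 1.0 changes nothing.
    """
    if penalty <= 0:
        raise ValueError("repetition_penalty must be greater than 0")
    seen = set(generated)
    adjusted = {}
    for token, logit in logits.items():
        if token in seen:
            logit = logit / penalty if logit > 0 else logit * penalty
        adjusted[token] = logit
    return adjusted


logits = {"room": 2.0, "suite": -1.0, "villa": 0.5}
print(apply_repetition_penalty(logits, ["room", "suite"], 1.05))
```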

presence_penalty float (Optional)

Controls repetition in model-generated content.

Default value: 0.0. Range: [-2.0, 2.0]. Positive values reduce repetition. Negative values increase repetition.

Scenarios:

A higher presence_penalty suits scenarios that require diversity, interest, or creativity, such as creative writing or brainstorming.

A lower presence_penalty suits scenarios that require consistency or specialized terminology, such as technical documents or other formal documents.

This parameter cannot be modified for qwen-omni-turbo models.
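This page does not specify how presence_penalty is applied. A common formulation, and the one OpenAI documents for its parameter of the same name and range, subtracts a flat penalty from the logit of every token that has already appeared at least once. The sketch below assumes that formulation for illustration; it is not a statement of how this service computes it.

```python
def apply_presence_penalty(
    logits: dict[str, float], generated: list[str], presence_penalty: float
) -> dict[str, float]:
    """Subtract a flat penalty from every token that has appeared at least once.

    Assumed formulation: the penalty is applied once per token, no matter how
    many times it appeared. Negative values therefore encourage repetition.
    """
    if not -2.0 <= presence_penalty <= 2.0:
        raise ValueError("presence_penalty must be in [-2.0, 2.0]")
    seen = set(generated)
    return {
        token: logit - presence_penalty if token in seen else logit
        for token, logit in logits.items()
    }


print(apply_presence_penalty({"a": 1.0, "b": 1.0}, ["a", "a", "a"], 0.5))
```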

seed integer (Optional)

Setting the seed parameter makes model generation more deterministic, which helps keep results consistent across model runs.

If you pass the same seed value in each call and keep the other parameters unchanged, the model returns the same results whenever possible.

Range: 0 to 2^31−1 (2147483647). Default value: -1.

This parameter cannot be modified for qwen-omni-turbo models.
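Because the server returns an error event for invalid session.update parameters, it can help to catch range mistakes on the client first. The sketch below is a hypothetical helper (validate_sampling_params is not part of any SDK) that checks the sampling fields against the ranges documented above and skips any field you omit, since the server then applies its defaults.

```python
def validate_sampling_params(session: dict) -> list[str]:
    """Check sampling fields in a session.update payload against the documented ranges.

    Returns a list of problems; an empty list means every field present is in range.
    """
    problems = []
    t = session.get("temperature")
    if t is not None and not 0 <= t < 2:
        problems.append("temperature must be in [0, 2)")
    p = session.get("top_p")
    if p is not None and not 0 < p <= 1.0:
        problems.append("top_p must be in (0, 1.0]")
    k = session.get("top_k")
    if k is not None and k < 0:
        problems.append("top_k must be >= 0")
    rp = session.get("repetition_penalty")
    if rp is not None and rp <= 0:
        problems.append("repetition_penalty must be > 0")
    pp = session.get("presence_penalty")
    if pp is not None and not -2.0 <= pp <= 2.0:
        problems.append("presence_penalty must be in [-2.0, 2.0]")
    seed = session.get("seed")
    if seed is not None and not 0 <= seed <= 2**31 - 1:
        problems.append("seed must be in [0, 2^31 - 1]")
    return problems


print(validate_sampling_params({"temperature": 0.9, "top_p": 1.0, "seed": 1314}))  # []
```

The documented default seed of -1 lies outside the stated range, so this sketch flags an explicit -1; omit seed instead if you want the default. The client-side check complements, rather than replaces, the server's own validation.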

response.create

The response.create event instructs the server to create a model response. In VAD mode, the server creates model responses automatically, so do not send this event.

The server responds with a response.created event, one or more item and content events such as conversation.item.created and response.content_part.added, and finally a response.done event to indicate response completion.

type string (Required)

Event type. Set to response.create.

{
    "type": "response.create",
    "event_id": "event_1718624400000"
}

response.cancel

The client sends this event to cancel an ongoing response. If no response is available for cancellation, the server responds with an error event.

type string (Required)

Event type. Set to response.cancel.

{
    "event_id": "event_B4o9RHSTWobB5OQdEHLTo",
    "type": "response.cancel"
}

input_audio_buffer.append

Append audio bytes to the input audio buffer.

type string (Required)

Event type. Set to input_audio_buffer.append.

{
    "event_id": "event_B4o9RHSTWobB5OQdEHLTo",
    "type": "input_audio_buffer.append",
    "audio": "UklGR..."
}

audio string (Required)

Base64-encoded audio data.
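Audio is usually streamed as a series of append events rather than one large payload. The sketch below splits raw PCM bytes into chunks and Base64-encodes each one into an input_audio_buffer.append event. The 3200-byte chunk size equals 100 ms of audio only if the stream is 16 kHz, 16-bit mono, which is an assumption: this page specifies neither a sample rate nor a chunk size. The event_id scheme is also hypothetical.

```python
import base64
import json
import uuid


def audio_append_events(pcm: bytes, chunk_size: int = 3200) -> list[str]:
    """Split raw PCM bytes into input_audio_buffer.append events.

    chunk_size is illustrative: 3200 bytes is 100 ms of 16 kHz, 16-bit mono
    audio (an assumed format; this page does not state the sample rate).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    events = []
    for start in range(0, len(pcm), chunk_size):
        chunk = pcm[start:start + chunk_size]
        events.append(json.dumps({
            "event_id": "event_" + uuid.uuid4().hex,  # hypothetical ID scheme
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        }))
    return events


silence = bytes(6400)  # 6400 zero bytes of PCM silence
print(len(audio_append_events(silence)))  # 2
```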

input_audio_buffer.commit

Submit the user input audio buffer. This action creates a new user message item in the conversation. If the input audio buffer is empty, the server returns an error event.

  • VAD mode: The client does not need to send this event. The server automatically commits the audio buffer.

  • Manual mode: The client must commit the audio buffer to create a user message item.

Committing the input audio buffer does not create a response from the model. The server responds with an input_audio_buffer.committed event.

If the client has sent input_image_buffer.append events, input_audio_buffer.commit also commits the image buffer.

type string (Required)

Event type. Set to input_audio_buffer.commit.

{
    "event_id": "event_B4o9RHSTWobB5OQdEHLTo",
    "type": "input_audio_buffer.commit"
}
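In manual mode (turn_detection set to null), committing the buffer only creates the user message item; the client must follow it with response.create to get a reply. The sketch below assembles the client events for one such turn. The helper names and the event_id scheme are hypothetical, and the audio strings are placeholders for Base64-encoded PCM.

```python
import json
import uuid


def _event(event_type: str, **fields) -> str:
    # Hypothetical event_id scheme, used for illustration only.
    return json.dumps({"event_id": "event_" + uuid.uuid4().hex, "type": event_type, **fields})


def manual_turn(audio_chunks_b64: list[str]) -> list[str]:
    """Client events for one user turn when turn_detection is null (manual mode)."""
    if not audio_chunks_b64:
        # Committing an empty buffer makes the server return an error event.
        raise ValueError("append at least one audio chunk before committing")
    events = [_event("input_audio_buffer.append", audio=chunk) for chunk in audio_chunks_b64]
    events.append(_event("input_audio_buffer.commit"))  # creates the user message item
    events.append(_event("response.create"))  # asks the model to respond
    return events


for e in manual_turn(["AAAA", "BBBB"]):
    print(json.loads(e)["type"])
```

In VAD mode none of this is needed: the server commits the buffer and creates the response on its own.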

input_audio_buffer.clear

Clear audio bytes from the buffer. The server responds with an input_audio_buffer.cleared event.

type string (Required)

Event type. Set to input_audio_buffer.clear.

{
    "event_id": "event_xxx",
    "type": "input_audio_buffer.clear"
}

input_image_buffer.append

Add image data to the image buffer. Images can come from local files or be captured in real-time from a video stream.

Image input has the following limitations:

  • Image format must be JPG or JPEG. For optimal performance, a resolution of 480p or 720p is recommended, with a maximum of 1080p.

  • Single image size must not exceed 500 KB before Base64 encoding.

  • Image data must be Base64-encoded.

  • The recommended rate for sending images to the server is one image per second.

  • Before sending an input_image_buffer.append event, send at least one input_audio_buffer.append event.

The image buffer is committed along with the audio buffer via the input_audio_buffer.commit event.

type string (Required)

Event type. Set to input_image_buffer.append.

{
    "event_id": "event_xxx",
    "type": "input_image_buffer.append",
    "image": "xxx"
}

image string (Required)

Base64-encoded image data.
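Several of the image rules above can be enforced on the client. The sketch below is a hypothetical helper class (not part of any SDK) that refuses an image until at least one audio chunk has been appended, rejects images over 500 KB before encoding (assuming 1 KB = 1024 bytes), and checks for the JPEG start-of-image marker. It does not check resolution or the recommended one-image-per-second rate, and the event_id scheme is an assumption.

```python
import base64
import json
import uuid

MAX_IMAGE_BYTES = 500 * 1024  # 500 KB before Base64 encoding, assuming 1 KB = 1024 bytes


class InputBufferEvents:
    """Hypothetical helper that enforces some of the image input rules on this page."""

    def __init__(self) -> None:
        self.audio_appended = False

    def _event(self, event_type: str, **fields) -> str:
        return json.dumps({"event_id": "event_" + uuid.uuid4().hex, "type": event_type, **fields})

    def append_audio(self, pcm: bytes) -> str:
        self.audio_appended = True
        return self._event("input_audio_buffer.append", audio=base64.b64encode(pcm).decode("ascii"))

    def append_image(self, jpeg: bytes) -> str:
        if not self.audio_appended:
            raise RuntimeError("send at least one input_audio_buffer.append event first")
        if len(jpeg) > MAX_IMAGE_BYTES:
            raise ValueError("image exceeds 500 KB before Base64 encoding")
        if not jpeg.startswith(b"\xff\xd8"):
            # JPEG data starts with the start-of-image (SOI) marker FF D8.
            raise ValueError("image must be JPEG")
        return self._event("input_image_buffer.append", image=base64.b64encode(jpeg).decode("ascii"))
```

A typical sequence calls append_audio for each audio chunk, interleaves append_image calls for captured frames, and then sends input_audio_buffer.commit, which commits both buffers.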