Client events for the Realtime API - Alibaba Cloud Model Studio

This topic describes the client events for the Qwen-Omni-Realtime API.

session.update

After establishing a WebSocket connection, send this event to update the default session configuration. When the service receives the session.update event, it validates the parameters. If the parameters are valid, the service updates the session and returns the complete configuration. Otherwise, the service returns an error.

type string (Required)

The event type. Always session.update.

{
    "event_id": "event_ToPZqeobitzUJnt3QqtWg",
    "type": "session.update",
    "session": {
        "modalities": [
            "text",
            "audio"
        ],
        "voice": "Chelsie",
        "input_audio_format": "pcm16",
        "output_audio_format": "pcm24",
        "instructions": "You are an AI customer service agent for a five-star hotel. Please answer customer inquiries about room types, facilities, prices, and reservation policies accurately and in a friendly manner. Always respond with a professional and helpful attitude. Do not provide unconfirmed information or information beyond the scope of the hotel's services.",
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.5,
            "silence_duration_ms": 800
        },
        "seed": 1314,
        "max_tokens": 16384,
        "repetition_penalty": 1.05,
        "presence_penalty": 0.0,
        "top_k": 50,
        "top_p": 1.0,
        "temperature": 0.9
    }
}

session object (Optional)

The session configuration.

Properties

modalities array (Optional)

The output modalities of the model. Valid values:

["text"]
Outputs text only.
["text","audio"] (Default)
Outputs text and audio.

voice string (Optional)

The voice for the generated audio. For a list of supported voices, see Voice list.

Default voice:

Qwen3-Omni-Flash-Realtime: Cherry
Qwen-Omni-Turbo-Realtime: Chelsie

input_audio_format string (Optional)

The format of the user's input audio. Currently, only pcm16 is supported.

output_audio_format string (Optional)

The format of the output audio:

Qwen3-Omni-Flash-Realtime: Only supports pcm24
Qwen-Omni-Turbo-Realtime: Only supports pcm16

smooth_output boolean | null (Optional)

This parameter applies only to Qwen3-Omni-Flash-Realtime models.

Specifies whether to enable a conversational reply style. Valid values:

true (Default): Conversational replies.
false: More formal, written-style replies.
This may not work well for content that is difficult to read.
null: The model automatically selects a conversational or written reply style.

instructions string (Optional)

The system message that sets the goal or role for the model.

turn_detection object (Optional)

The voice activity detection (VAD) configuration. Set this to null to disable VAD and manually trigger model responses. If this parameter is not provided, the system enables VAD with the default parameters.

Properties

type string (Optional)

The server-side VAD type. Always server_vad. The default value is server_vad.

threshold float (Optional)

The sensitivity of VAD. A lower value makes VAD more sensitive and more likely to detect faint sounds, including background noise, as speech. A higher value makes it less sensitive and requires clearer, louder speech to trigger a detection.

The value ranges from [-1.0, 1.0]. The default value is 0.5.

silence_duration_ms integer (Optional)

The minimum duration of silence required after speech ends to trigger a model response. A lower value results in a faster response but may cause the model to respond incorrectly during brief pauses in speech.

The default value is 800. The parameter ranges from 200 to 6000.

temperature float (Optional)

The sampling temperature, which controls the diversity of the generated content.

A higher temperature value creates more diverse content. A lower value creates more deterministic content.

Value range: [0, 2).

Because both `temperature` and `top_p` control content diversity, set only one of them.

Default value:

qwen3-omni-flash-realtime models: 0.9
qwen-omni-turbo-realtime models: 1.0

top_p float (Optional)

The probability threshold for nucleus sampling, which controls the diversity of the generated content.

A higher `top_p` value creates more diverse content. A lower value creates more deterministic content.

Value range: (0, 1.0].

Because both `temperature` and `top_p` control content diversity, set only one of them.

Default value:

qwen3-omni-flash-realtime models: 1.0
qwen-omni-turbo-realtime models: 0.01

top_k integer (Optional)

The size of the candidate set for sampling during generation. For example, if you set this parameter to 50, only the 50 tokens with the highest scores in a single generation are used to form the candidate set for random sampling. A larger value increases randomness. A smaller value increases determinism. If the value is null or greater than 100, the top_k policy is not enabled, and only the top_p policy takes effect.

The value must be greater than or equal to 0.

Default value:

qwen3-omni-flash-realtime models: 50
qwen-omni-turbo-realtime models: 20

max_tokens integer (Optional)

The maximum number of tokens to return for the request.

The default and maximum values are the model's maximum output length. For more information about the maximum output length of each model, see Model list.

Use the `max_tokens` parameter for scenarios where you need to limit the word count, such as generating summaries or keywords, control costs, or reduce response times.

repetition_penalty float (Optional)

The penalty for repetition in consecutive sequences during model generation. Increasing the `repetition_penalty` value reduces the repetition of the generated content. A value of 1.0 means no penalty. There is no strict value range, but the value must be greater than 0.

The default value is 1.05.

presence_penalty float (Optional)

Controls the repetition of the generated content.

The default value is 0.0. The value must be in the range of [-2.0, 2.0]. A positive value reduces repetition, and a negative value increases repetition.

Scenarios:

A higher `presence_penalty` is suitable for scenarios that require diversity, creativity, or fun, such as creative writing or brainstorming.

A lower `presence_penalty` is suitable for scenarios that require consistency or the use of professional terms, such as technical documents or other formal documents.

seed integer (Optional)

Setting the `seed` parameter makes the model's generation process more deterministic. It is typically used to ensure that the model produces consistent results for each run.

If you pass the same `seed` value in each model call and keep other parameters unchanged, the model returns the same result.

The value must be in the range of 0 to 2³¹-1. The default value is -1.

response.create

The response.create event instructs the service to create a model response. In VAD mode, the service automatically creates model responses, so you do not need to send this event.

The service responds with a response.created event, one or more item and content events (such as conversation.item.created and response.content_part.added), and finally a response.done event to indicate that the response is complete.

type string (Required)

The event type. Always response.create.

{
    "type": "response.create",
    "event_id": "event_1718624400000"
}

response.cancel

The client sends this event to cancel an ongoing response. If there is no response to cancel, the service responds with an error event.

type string (Required)

The event type. Always response.cancel.

{
    "event_id": "event_B4o9RHSTWobB5OQdEHLTo",
    "type": "response.cancel"
}

input_audio_buffer.append

Appends audio bytes to the input audio buffer.

type string (Required)

The event type. Always input_audio_buffer.append.

{
    "event_id": "event_B4o9RHSTWobB5OQdEHLTo",
    "type": "input_audio_buffer.append",
    "audio": "UklGR..."
}

audio string (Required)

The Base64-encoded audio data.

input_audio_buffer.commit

Submits the user input audio buffer to create a new user message item in the conversation. If the input audio buffer is empty, the service returns an error event.

VAD mode: The client does not need to send this event. The service automatically submits the audio buffer.
Manual mode: The client must submit the audio buffer to create a user message item.

Submitting the input audio buffer does not create a response from the model. The service responds with an input_audio_buffer.committed event.

type string (Required)

The event type. Always input_audio_buffer.commit.

{
    "event_id": "event_B4o9RHSTWobB5OQdEHLTo",
    "type": "input_audio_buffer.commit"
}

input_audio_buffer.clear

Clears the audio bytes from the buffer. The service responds with an input_audio_buffer.cleared event.

type string (Required)

The event type. Always input_audio_buffer.clear.

{
    "event_id": "event_xxx",
    "type": "input_audio_buffer.clear"
}

input_image_buffer.append

Adds image data to the image buffer. The image can be from a local file or captured in real-time from a video stream.

The following limits apply to image inputs:

The image format must be JPG or JPEG. Recommended: 480p or 720p. Maximum: 1080p.
The size of a single image cannot exceed 500 KB before Base64 encoding.
The image data must be Base64-encoded.
Send images to the service at a frequency of 1 images per second.
Before you send an input_image_buffer.append event, you must send at least one input_audio_buffer.append event.

type string (Required)

The event type. Always input_image_buffer.append.

{
    "event_id": "event_xxx",
    "type": "input_image_buffer.append",
    "image": "xxx"
}

image string (Required)

The Base64-encoded image data.