Alibaba Cloud Model Studio: Server-side events

Last Updated: Nov 08, 2025

This topic describes the server-side events for the Qwen-Omni-Realtime API.

For more information, see Real-time multi-modal.

error

An error message returned by the server.

event_id string

The unique identifier for this event.

{
  "event_id": "event_RoUu4T8yExPMI37GKwaOC",
  "type": "error",
  "error": {
    "type": "invalid_request_error",
    "code": "invalid_value",
    "message": "Invalid modalities: ['audio']. Supported combinations are: ['text'] and ['audio', 'text'].",
    "param": "session.modalities"
  }
}

type string

The event type. This is always error.

error object

The detailed information about the error.

Properties

type string

The error type.

code string

The error code.

message string

The error message.

param string

The parameter related to the error, such as session.modalities.
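
A client will usually want to surface these fields together. The following is a minimal sketch, assuming the event arrives as a JSON string; it parses the sample event above with Python's standard json module. The helper name describe_error is hypothetical and not part of the API.

```python
import json

# Sample error event copied from this topic.
RAW_EVENT = """
{
  "event_id": "event_RoUu4T8yExPMI37GKwaOC",
  "type": "error",
  "error": {
    "type": "invalid_request_error",
    "code": "invalid_value",
    "message": "Invalid modalities: ['audio']. Supported combinations are: ['text'] and ['audio', 'text'].",
    "param": "session.modalities"
  }
}
"""


def describe_error(event: dict) -> str:
    """Build a one-line summary of an error event (hypothetical helper)."""
    if event.get("type") != "error":
        raise ValueError("not an error event")
    err = event.get("error", {})
    # Use .get() so that a missing field does not raise KeyError.
    summary = f"[{err.get('type')}/{err.get('code')}] {err.get('message')}"
    if err.get("param"):
        summary += f" (param: {err['param']})"
    return summary


event = json.loads(RAW_EVENT)
print(describe_error(event))
```

Logging the param field alongside the message makes it easier to see which part of the request the server rejected.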

session.created

After a client connects, this is the first event that the server returns. It contains the default configuration information for the session.

event_id string

The unique identifier for this event.

{
    "event_id": "event_RdvlSpbBb2ssyBjYrDHjt",
    "type": "session.created",
    "session": {
        "object": "realtime.session",
        "model": "qwen3-omni-flash-realtime",
        "modalities": [
            "text",
            "audio"
        ],
        "voice": "Cherry",
        "input_audio_format": "pcm16",
        "output_audio_format": "pcm24",
        "input_audio_transcription": {
            "model": "gummy-realtime-v1"
        },
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.5,
            "prefix_padding_ms": 300,
            "silence_duration_ms": 800,
            "create_response": true,
            "interrupt_response": true
        },
        "tools": [],
        "tool_choice": "auto",
        "temperature": 0.8,
        "id": "sess_Ov7GOXoNXhNjlxXtOGKQS"
    }
}

type string

The event type. This is always session.created.

session object

The configuration information for the session.

Properties

object string

This is always realtime.session.

model string

The model used.

modalities array

The output modality settings for the model.

voice string

The timbre of the audio generated by the model.

input_audio_format string

The input audio format. This is always pcm16.

output_audio_format string

The output audio format. This is always pcm24.

input_audio_transcription object

The configuration for speech transcription.

Properties

model string

The speech transcription model. This is always gummy-realtime-v1.

turn_detection object

The configuration for voice activity detection (VAD).

Properties

type string

The server-side VAD type. This is always server_vad.

threshold float

The VAD detection threshold.

silence_duration_ms integer

The duration of silence to detect the end of speech.

temperature float

The temperature parameter for the model.

session.updated

After receiving a user's session.update request, the server returns this event if the request is successful. If an error occurs, the server returns an error event.

event_id string

The unique identifier for this event.

{
    "event_id": "event_X1HsXS4b4uptp6yo1LgKd",
    "type": "session.updated",
    "session": {
        "id": "sess_Aih6vAcY5Ddt6jwFx1tCa",
        "object": "realtime.session",
        "model": "qwen3-omni-flash-realtime",
        "modalities": [
            "text",
            "audio"
        ],
        "instructions": "You are a personal assistant named Xiaoyun. Please answer user questions accurately and in a friendly manner, always responding with a helpful attitude.",
        "voice": "Cherry",
        "input_audio_format": "pcm16",
        "output_audio_format": "pcm24",
        "input_audio_transcription": {
            "model": "gummy-realtime-v1"
        },
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.1,
            "prefix_padding_ms": 500,
            "silence_duration_ms": 900,
            "create_response": true,
            "interrupt_response": true
        },
        "temperature": 0.8,
        "max_response_output_token": "inf",
        "max_tokens": 16384,
        "repetition_penalty": 1.05,
        "presence_penalty": 0.0,
        "top_k": 50,
        "top_p": 1.0,
        "seed": -1
    }
}

type string

The event type. This is always session.updated.

session object

The configuration information for the session.

Properties

temperature float

The temperature parameter for the model.

modalities array

The output modality settings for the model.

voice string

The timbre of the audio generated by the model.

instructions string

The objective and role of the model.

input_audio_format string

The input audio format. This is always pcm16.

output_audio_format string

The output audio format. This is always pcm24.

input_audio_transcription object

The configuration for speech transcription.

Properties

model string

The speech transcription model. This is always gummy-realtime-v1.

turn_detection object

The configuration for VAD.

Properties

type string

The server-side VAD type. This is always server_vad.

threshold float

The VAD detection threshold.

silence_duration_ms integer

The duration of silence to detect the end of speech.

top_p float

The probability threshold for nucleus sampling.

top_k integer

The size of the candidate set for sampling during model generation.

max_tokens integer

The maximum number of tokens that the model can return for the current request.

repetition_penalty float

Controls the degree of repetition in consecutive sequences during model generation.

presence_penalty float

Controls the degree of repetition when the model generates content.

seed integer

The random number seed used during generation. It affects how consistent the results are across requests.
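
To confirm which settings a session.update request actually changed, a client can compare the session object from session.created with the one from session.updated. The following is a rough sketch under that assumption. diff_session is a hypothetical helper, and the two dictionaries are trimmed copies of the samples in this topic.

```python
def diff_session(before: dict, after: dict, prefix: str = "") -> dict:
    """Return {dotted_key: (old, new)} for every value that differs.

    Hypothetical helper: nested objects such as turn_detection are
    compared field by field.
    """
    changes = {}
    for key in sorted(set(before) | set(after)):
        old, new = before.get(key), after.get(key)
        path = f"{prefix}{key}"
        if isinstance(old, dict) and isinstance(new, dict):
            changes.update(diff_session(old, new, prefix=f"{path}."))
        elif old != new:
            changes[path] = (old, new)
    return changes


# Trimmed from the session.created sample.
created = {
    "voice": "Cherry",
    "temperature": 0.8,
    "turn_detection": {
        "type": "server_vad",
        "threshold": 0.5,
        "prefix_padding_ms": 300,
        "silence_duration_ms": 800,
    },
}

# Trimmed from the session.updated sample.
updated = {
    "voice": "Cherry",
    "temperature": 0.8,
    "turn_detection": {
        "type": "server_vad",
        "threshold": 0.1,
        "prefix_padding_ms": 500,
        "silence_duration_ms": 900,
    },
}

changes = diff_session(created, updated)
for path, (old, new) in changes.items():
    print(f"{path}: {old} -> {new}")
```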

input_audio_buffer.speech_started

In VAD mode, the server returns this event when it detects the start of speech in the audio buffer.

This event might also be triggered each time audio is added to the buffer if the server has not yet detected speech.

event_id string

The unique identifier for this event.

{
    "event_id": "event_Pvp8nEhsQuGCQbFJ9x58n",
    "type": "input_audio_buffer.speech_started",
    "audio_start_ms": 3647,
    "item_id": "item_YbAiGvK2H7YaS34o4R6Ba"
}

type string

The event type. This is always input_audio_buffer.speech_started.

audio_start_ms integer

The time in milliseconds from when audio writing to the buffer begins until speech is first detected.

item_id string

The ID of the user message item that will be created when speech stops.

User message items are used to append user input to the conversation history for subsequent model inference and generation.

input_audio_buffer.speech_stopped

In VAD mode, the server returns this event when it detects the end of speech in the audio buffer.

At the same time, the server also returns a conversation.item.created event to create the corresponding user message item.

event_id string

The unique identifier for this event.

{
    "event_id": "event_UhQiqNVRsgUiq4KUS5Xb5",
    "type": "input_audio_buffer.speech_stopped",
    "audio_end_ms": 4453,
    "item_id": "item_YbAiGvK2H7YaS34o4R6Ba"
}

type string

The event type. This is always input_audio_buffer.speech_stopped.

audio_end_ms integer

The time in milliseconds from the start of the session until speech stops.

item_id string

The ID of the user message item that will be created.

input_audio_buffer.committed

This event is returned when the input audio buffer is committed.

  • In VAD mode, the server automatically commits the audio buffer and returns this event when it detects that the user has finished speaking.

  • In Manual mode, the server returns this event after the client sends an input_audio_buffer.commit event.

event_id string

The unique identifier for this event.

{
    "event_id": "event_Iy6sUzL1nmdFgshFYxJEz",
    "type": "input_audio_buffer.committed",
    "item_id": "item_YbAiGvK2H7YaS34o4R6Ba"
}

type string

The event type. This is always input_audio_buffer.committed.

item_id string

The ID of the user message item that will be created.
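
The speech_started, speech_stopped, and committed events for one user turn share the same item_id, so a client can group them on that key. The following is a minimal sketch of such bookkeeping. UserTurn and track_turn are hypothetical names, and the events are the samples from this topic, trimmed to the fields used.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class UserTurn:
    """Client-side record of one user message item (hypothetical type)."""
    item_id: str
    audio_start_ms: Optional[int] = None
    audio_end_ms: Optional[int] = None
    committed: bool = False


def track_turn(turns: Dict[str, UserTurn], event: dict) -> None:
    """Update the per-item record from an input_audio_buffer.* event."""
    etype = event.get("type", "")
    item_id = event.get("item_id")
    # Events without an item_id (for example, cleared) are skipped.
    if not etype.startswith("input_audio_buffer.") or item_id is None:
        return
    turn = turns.setdefault(item_id, UserTurn(item_id))
    if etype == "input_audio_buffer.speech_started":
        turn.audio_start_ms = event["audio_start_ms"]
    elif etype == "input_audio_buffer.speech_stopped":
        turn.audio_end_ms = event["audio_end_ms"]
    elif etype == "input_audio_buffer.committed":
        turn.committed = True


ITEM_ID = "item_YbAiGvK2H7YaS34o4R6Ba"
events = [
    {"type": "input_audio_buffer.speech_started", "audio_start_ms": 3647, "item_id": ITEM_ID},
    {"type": "input_audio_buffer.speech_stopped", "audio_end_ms": 4453, "item_id": ITEM_ID},
    {"type": "input_audio_buffer.committed", "item_id": ITEM_ID},
]

turns: Dict[str, UserTurn] = {}
for ev in events:
    track_turn(turns, ev)
print(turns[ITEM_ID])
```

The sketch stores the two timestamps without subtracting them, because this topic describes audio_start_ms and audio_end_ms relative to different starting points.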

input_audio_buffer.cleared

The server returns this event after the client sends an input_audio_buffer.clear event.

event_id string

The unique identifier for this event.

{
  "event_id": "event_RoUu4T8yExPMI37GKwaOC",
  "type": "input_audio_buffer.cleared"
}

type string

The event type. This is always input_audio_buffer.cleared.

conversation.item.created

This event is returned when a conversation item is created.

event_id string

The unique identifier for this event.

{
    "event_id": "event_JEfkrr9gO3Ny7Xcv9bGVd",
    "type": "conversation.item.created",
    "item": {
        "id": "item_YbAiGvK2H7YaS34o4R6Ba",
        "object": "realtime.item",
        "type": "message",
        "status": "in_progress",
        "role": "assistant",
        "content": [
            {
                "type": "input_audio"
            }
        ]
    }
}

type string

The event type. This is always conversation.item.created.

item object

The item to add to the conversation.

Properties

id string

The unique ID of the conversation item.

object string

This is always realtime.item.

status string

The status of the conversation item.

role string

The role of the message.

content array

The content of the message.

conversation.item.input_audio_transcription.completed

This event provides the transcription result that is generated after the user's audio is buffered. The transcription is processed by a separate speech recognition model, which is currently set to gummy-realtime-v1.

The transcribed text is for reference only. It comes from the speech recognition model and may differ from the text that the Qwen-Omni-Realtime model processes.

event_id string

The unique identifier for this event.

{
    "event_id": "event_FrrZcxiDfTB9LD9p4pVng",
    "type": "conversation.item.input_audio_transcription.completed",
    "item_id": "item_YbAiGvK2H7YaS34o4R6Ba",
    "content_index": 0,
    "transcript": "Hello."
}

type string

The event type. This is always conversation.item.input_audio_transcription.completed.

item_id string

The ID of the user message item.

content_index integer

The value is currently fixed to 0.

transcript string

The transcribed text content.

conversation.item.input_audio_transcription.failed

If input audio transcription is enabled and fails, the server returns this event. This event is independent of the error event to help the client identify the issue.

event_id string

The unique identifier for this event.

{
  "type": "conversation.item.input_audio_transcription.failed",
  "item_id": "<item_id>",
  "content_index": 0,
  "error": {
    "code": "<code>",
    "message": "<message>",
    "param": "<param>"
  }
}

type string

The event type. This is always conversation.item.input_audio_transcription.failed.

item_id string

The ID of the user message item.

content_index integer

The value is currently fixed to 0.

error object

The error message.

Properties

code string

The error code.

message string

The error message.

param string

The parameter related to the error.
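
Because a transcription failure arrives as its own event rather than as an error event, a client can handle both transcription outcomes in one place. The following is a small sketch of that idea. transcription_result is a hypothetical helper, and the failed event reuses the placeholder values from the sample above.

```python
from typing import Optional, Tuple

PREFIX = "conversation.item.input_audio_transcription."


def transcription_result(event: dict) -> Optional[Tuple[str, str, str]]:
    """Return (item_id, outcome, detail) for transcription events, else None.

    Hypothetical helper: outcome is "completed" or "failed".
    """
    etype = event.get("type", "")
    if etype == PREFIX + "completed":
        return event["item_id"], "completed", event["transcript"]
    if etype == PREFIX + "failed":
        err = event.get("error", {})
        detail = f"{err.get('code')}: {err.get('message')}"
        return event["item_id"], "failed", detail
    return None


completed = {
    "type": PREFIX + "completed",
    "item_id": "item_YbAiGvK2H7YaS34o4R6Ba",
    "content_index": 0,
    "transcript": "Hello.",
}
failed = {
    "type": PREFIX + "failed",
    "item_id": "<item_id>",
    "content_index": 0,
    "error": {"code": "<code>", "message": "<message>", "param": "<param>"},
}

print(transcription_result(completed))
print(transcription_result(failed))
```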

response.created

The server returns this event when it generates a new model response.

event_id string

The unique identifier for this event.

{
    "event_id": "event_XuDavMzQN3KKepqGu3KRh",
    "type": "response.created",
    "response": {
        "id": "resp_HaVOPdbmX6vifiV5pAfJY",
        "object": "realtime.response",
        "conversation_id": "conv_FjJaccpnvwHNo9cPVuzGc",
        "status": "in_progress",
        "modalities": [
            "text",
            "audio"
        ],
        "voice": "Cherry",
        "output_audio_format": "pcm24",
        "output": []
    }
}

type string

The event type. This is always response.created.

response object

The response object.

Properties

id string

The unique ID of the response.

conversation_id string

The unique ID of the current session.

object string

The object type. For this event, it is always realtime.response.

status string

The status of the response. Valid values: completed, failed, in_progress, and incomplete.

modalities array

The modalities of the response.

voice string

The timbre of the audio generated by the model.

output array

This is currently empty for this event.

response.done

After the response is generated, the server returns this event. The response object in the event contains all output items except for the raw audio data.

event_id string

The unique identifier for this event.

{
    "event_id": "event_CSaxRRYLvbrfexDXAEuDG",
    "type": "response.done",
    "response": {
        "id": "resp_HaVOPdbmX6vifiV5pAfJY",
        "object": "realtime.response",
        "conversation_id": "conv_FjJaccpnvwHNo9cPVuzGc",
        "status": "completed",
        "modalities": [
            "text",
            "audio"
        ],
        "voice": "Cherry",
        "output_audio_format": "pcm24",
        "output": [
            {
                "id": "item_Ls6MtCUWO7LM4E59QziNv",
                "object": "realtime.item",
                "type": "message",
                "status": "completed",
                "role": "assistant",
                "content": [
                    {
                        "type": "audio",
                        "transcript": "Hello! Is there anything I can help you with?"
                    }
                ]
            }
        ],
        "usage": {
            "total_tokens": 377,
            "input_tokens": 336,
            "output_tokens": 41,
            "input_tokens_details": {
                "text_tokens": 228,
                "audio_tokens": 108
            },
            "output_tokens_details": {
                "text_tokens": 9,
                "audio_tokens": 32
            }
        }
    }
}

type string

The event type. This is always response.done.

response object

The response object.

Properties

id string

The unique ID of the response.

conversation_id string

The unique ID of the current session.

object string

The object type. For this event, it is always realtime.response.

status string

The status of the response.

modalities array

The modalities of the response.

voice string

The timbre of the audio generated by the model.

output array

The output of the response.

Properties

id string

The ID corresponding to the response output.

type string

The type of the output item. The value is currently set to message.

object string

The object type of the output item. The value is currently set to realtime.item.

status string

The status of the output item.

role string

The role of the output item.

content array

The content of the output item.

Properties

type string

The type of the output content. The value is text if the output is plain text, or audio if the output includes audio.

text string

The output text content.

transcript string

The text content that is transcribed from the audio.

usage object

The token consumption information for this response.
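
In the sample above, total_tokens equals input_tokens plus output_tokens, and each details object sums to its parent count. This topic does not state that this holds for every response, so the sketch below reports the checks instead of enforcing them. summarize_usage is a hypothetical helper.

```python
def summarize_usage(usage: dict) -> dict:
    """Summarize the usage object of a response.done event (hypothetical helper)."""
    input_details = usage.get("input_tokens_details", {})
    output_details = usage.get("output_tokens_details", {})
    return {
        "total": usage["total_tokens"],
        "input": usage["input_tokens"],
        "output": usage["output_tokens"],
        # Consistency checks: assumed relationships, observed in the sample only.
        "totals_match": usage["total_tokens"]
        == usage["input_tokens"] + usage["output_tokens"],
        "input_details_match": usage["input_tokens"] == sum(input_details.values()),
        "output_details_match": usage["output_tokens"] == sum(output_details.values()),
    }


# The usage object from the response.done sample in this topic.
usage = {
    "total_tokens": 377,
    "input_tokens": 336,
    "output_tokens": 41,
    "input_tokens_details": {"text_tokens": 228, "audio_tokens": 108},
    "output_tokens_details": {"text_tokens": 9, "audio_tokens": 32},
}

summary = summarize_usage(usage)
print(summary)
```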

response.text.delta

When the output modality includes only text and the model incrementally generates new text, the server returns this event.

event_id string

The unique identifier for this event.

{
    "delta": "Hello",
    "event_id": "event_TH49MauuPmRo1RGaMSlP7",
    "type": "response.text.delta",
    "response_id": "resp_PrRSvPVpnCExdUOGHHLuP",
    "item_id": "item_L8IRm9kRXFpxoOjDqDC96",
    "output_index": 0,
    "content_index": 0
}

type string

The event type. This is always response.text.delta.

delta string

The incremental text returned.

response_id string

The ID of the response.

item_id string

The ID of the message item. You can use this ID to associate items from the same message.

output_index integer

The index of the output item in the response. The value is currently fixed to 0.

content_index integer

The index of the content part within the output item. The value is currently fixed to 0.

response.text.done

When the output modality includes only text and the model finishes generating text, the server returns this event.

This event is also returned when the response is interrupted, incomplete, or canceled.

event_id string

The unique identifier for this event.

{
  "event_id": "event_B1lIeE2Nac33zn5V7h2mm",
  "type": "response.text.done",
  "response_id": "resp_B1lIdtjF4Noqpn5NOjznj",
  "item_id": "item_B1lIdJsAJlJiFs8ztWpJt",
  "output_index": 0,
  "content_index": 0,
  "text": "How can I assist you today?"
}

type string

The event type. This is always response.text.done.

response_id string

The ID of the response.

item_id string

The ID of the message item.

output_index integer

The index of the output item in the response.

content_index integer

The index of the content part within the output item.

text string

The complete text output by the model.
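
A client that renders text as it streams can append each delta and then reconcile against the text field of response.text.done. The following is a minimal sketch of that pattern. TextAssembler is a hypothetical class, the IDs and final text come from the response.text.done sample above, and the two delta fragments are an illustrative split rather than real server output.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Key = Tuple[Optional[str], Optional[str]]


class TextAssembler:
    """Collect response.text.delta fragments per (response_id, item_id)."""

    def __init__(self) -> None:
        self._parts: Dict[Key, List[str]] = defaultdict(list)
        self.finished: Dict[Key, str] = {}

    def handle(self, event: dict) -> Optional[bool]:
        """Return True/False on done (streamed text matched or not), else None."""
        key = (event.get("response_id"), event.get("item_id"))
        etype = event.get("type")
        if etype == "response.text.delta":
            self._parts[key].append(event["delta"])
        elif etype == "response.text.done":
            streamed = "".join(self._parts.pop(key, []))
            # Keep the text from the done event as the final value.
            self.finished[key] = event["text"]
            return streamed == event["text"]
        return None


RESP = "resp_B1lIdtjF4Noqpn5NOjznj"
ITEM = "item_B1lIdJsAJlJiFs8ztWpJt"
events = [
    # Illustrative deltas (assumed split of the final text).
    {"type": "response.text.delta", "response_id": RESP, "item_id": ITEM, "delta": "How can I "},
    {"type": "response.text.delta", "response_id": RESP, "item_id": ITEM, "delta": "assist you today?"},
    {"type": "response.text.done", "response_id": RESP, "item_id": ITEM,
     "text": "How can I assist you today?"},
]

assembler = TextAssembler()
results = [assembler.handle(ev) for ev in events]
print(assembler.finished[(RESP, ITEM)])
```

Keying on both IDs keeps fragments from different responses apart if events from more than one response are processed by the same client.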

response.audio.delta

When the output modality includes audio and the model incrementally generates new audio data, the server returns this event.

event_id string

The unique identifier for this event.

{
  "event_id": "event_B1osWMZBtrEQbiIwW0qHQ",
  "type": "response.audio.delta",
  "response_id": "resp_P79OOMs8LnrXVpiIHUCKR",
  "item_id": "item_OFaPGtzfWCPyGzxnuEX9i",
  "output_index": 0,
  "content_index": 0,
  "delta": "{base64 audio}"
}

type string

The event type. This is always response.audio.delta.

response_id string

The ID of the response.

item_id string

The ID of the message item.

output_index integer

The index of the output item in the response.

content_index integer

The index of the content part within the output item.

delta string

The incremental audio data output by the model, encoded in Base64.
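
Because delta is Base64-encoded, the client must decode each chunk before playing or storing it. The following is a small sketch that uses Python's standard base64 module. append_audio is a hypothetical helper, and the payload is synthetic bytes rather than real audio. The sketch only collects raw bytes and makes no assumption about the sample rate or sample width.

```python
import base64


def append_audio(buffer: bytearray, event: dict) -> int:
    """Decode one response.audio.delta event into buffer; return bytes added."""
    if event.get("type") != "response.audio.delta":
        return 0
    chunk = base64.b64decode(event["delta"])
    buffer.extend(chunk)
    return len(chunk)


# Synthetic payload standing in for audio data (assumption for illustration).
fake_audio = bytes([0, 1, 2, 3])
event = {
    "type": "response.audio.delta",
    "response_id": "resp_P79OOMs8LnrXVpiIHUCKR",
    "item_id": "item_OFaPGtzfWCPyGzxnuEX9i",
    "output_index": 0,
    "content_index": 0,
    "delta": base64.b64encode(fake_audio).decode("ascii"),
}

audio = bytearray()
added = append_audio(audio, event)
print(added, len(audio))
```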

response.audio.done

When the output modality includes audio and the model finishes generating audio data, the server returns this event.

This event is also returned when the response is interrupted, incomplete, or canceled.

event_id string

The unique identifier for this event.

{
    "event_id": "event_Le1TDl7VfyHQxl47DtGxI",
    "type": "response.audio.done",
    "response_id": "resp_HaVOPdbmX6vifiV5pAfJY",
    "item_id": "item_Ls6MtCUWO7LM4E59QziNv",
    "output_index": 0,
    "content_index": 0
}

type string

The event type. This is always response.audio.done.

response_id string

The ID of the response.

item_id string

The ID of the message item.

output_index integer

The index of the output item in the response.

content_index integer

The index of the content part within the output item.

response.audio_transcript.delta

When the output modality includes audio and the model incrementally generates text corresponding to the new audio, the server returns a response.audio_transcript.delta event.

event_id string

The unique identifier for this event.

{
    "event_id": "event_BksW7fOwnyavZdDxIzZYM",
    "type": "response.audio_transcript.delta",
    "response_id": "resp_HaVOPdbmX6vifiV5pAfJY",
    "item_id": "item_Ls6MtCUWO7LM4E59QziNv",
    "output_index": 0,
    "content_index": 0,
    "delta": "Is there anything"
}

type string

The event type. This is always response.audio_transcript.delta.

response_id string

The ID of the response.

item_id string

The ID of the message item.

output_index integer

The index of the output item in the response.

content_index integer

The index of the content part within the output item.

delta string

The incremental text.

response.audio_transcript.done

When the output modality includes audio and the model finishes transcribing the audio, the server returns a response.audio_transcript.done event.

event_id string

The unique identifier for this event.

{
    "event_id": "event_X49tL2WerT4WjxcmH16lS",
    "type": "response.audio_transcript.done",
    "response_id": "resp_HaVOPdbmX6vifiV5pAfJY",
    "item_id": "item_Ls6MtCUWO7LM4E59QziNv",
    "output_index": 0,
    "content_index": 0,
    "transcript": "Hello! Is there anything I can help you with?"
}

type string

The event type. This is always response.audio_transcript.done.

response_id string

The ID of the response.

item_id string

The ID of the message item.

output_index integer

The index of the output item in the response.

content_index integer

The index of the content part within the output item.

transcript string

The complete text.

response.output_item.added

The server returns this event when a new item is created during response generation.

event_id string

The unique identifier for this event.

{
    "event_id": "event_DsCO341DEVtiATtCB6BUY",
    "type": "response.output_item.added",
    "response_id": "resp_HaVOPdbmX6vifiV5pAfJY",
    "output_index": 0,
    "item": {
        "id": "item_Ls6MtCUWO7LM4E59QziNv",
        "object": "realtime.item",
        "type": "message",
        "status": "in_progress",
        "role": "assistant",
        "content": []
    }
}

type string

The event type. This is always response.output_item.added.

response_id string

The ID of the response.

output_index integer

The index of the output item in the response.

item object

Information about the output item.

Properties

id string

The unique ID of the output item.

object string

This is always realtime.item.

status string

The status of the output item.

role string

The role of the message sender.

content array

The content of the message.

response.output_item.done

The server returns this event when the new item output is complete.

event_id string

The unique identifier for this event.

{
    "event_id": "event_MEu5nlLw1LsOguHiehIP8",
    "type": "response.output_item.done",
    "response_id": "resp_HaVOPdbmX6vifiV5pAfJY",
    "output_index": 0,
    "item": {
        "id": "item_Ls6MtCUWO7LM4E59QziNv",
        "object": "realtime.item",
        "type": "message",
        "status": "completed",
        "role": "assistant",
        "content": [
            {
                "type": "audio",
                "text": "Hello! Is there anything I can help you with?"
            }
        ]
    }
}

type string

The event type. This is always response.output_item.done.

response_id string

The ID of the response.

output_index integer

The index of the output item in the response.

item object

Information about the output item.

Properties

id string

The unique ID of the output item.

object string

This is always realtime.item.

status string

The status of the output item.

role string

The role of the message sender.

content array

The content of the message.

response.content_part.added

The server returns this event when a new content part is added to an assistant message item during response generation.

event_id string

The unique identifier for this event.

{
    "event_id": "event_AVBOmrgY3C8bjlRajfSUT",
    "type": "response.content_part.added",
    "response_id": "resp_HaVOPdbmX6vifiV5pAfJY",
    "item_id": "item_Ls6MtCUWO7LM4E59QziNv",
    "output_index": 0,
    "content_index": 0,
    "part": {
        "type": "audio",
        "text": ""
    }
}

type string

The event type. This is always response.content_part.added.

response_id string

The ID of the response.

item_id string

The ID of the message item.

output_index integer

The index of the output item in the response. The value is currently fixed to 0.

content_index integer

The index of the content part within the output item. The value is currently fixed to 0.

part object

Information about the content part.

Properties

type string

The type of the content part.

text string

The text of the content part.

response.content_part.done

The server returns this event when the streaming of a content part in an assistant message item is complete.

event_id string

The unique identifier for this event.

{
    "event_id": "event_Il8HD19v58Qr5IBkw7LtN",
    "type": "response.content_part.done",
    "response_id": "resp_HaVOPdbmX6vifiV5pAfJY",
    "item_id": "item_Ls6MtCUWO7LM4E59QziNv",
    "output_index": 0,
    "content_index": 0,
    "part": {
        "type": "audio",
        "text": "Hello! Is there anything I can help you with?"
    }
}

type string

The event type. This is always response.content_part.done.

response_id string

The ID of the response.

item_id string

The ID of the message item.

output_index integer

The index of the output item in the response. The value is currently fixed to 0.

content_index integer

The index of the content part within the content array of the item. The value is currently fixed to 0.

part object

Information about the content part.

Properties

type string

The type of the content part.

text string

The text of the content part.
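
Every event in this topic carries a type field, so a client can route incoming messages on that value. The following is a minimal sketch of such a router, assuming each message arrives as a JSON string. EventRouter and its methods are hypothetical names and not part of the API.

```python
import json
from typing import Callable, Dict, List, Optional

Handler = Callable[[dict], None]


class EventRouter:
    """Route server events to handlers by their type field (hypothetical)."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Handler] = {}
        self.unhandled: List[Optional[str]] = []

    def on(self, event_type: str) -> Callable[[Handler], Handler]:
        """Decorator that registers a handler for one event type."""
        def register(func: Handler) -> Handler:
            self._handlers[event_type] = func
            return func
        return register

    def dispatch(self, raw: str) -> None:
        event = json.loads(raw)
        handler = self._handlers.get(event.get("type"))
        if handler is None:
            # Record unknown or unregistered types instead of failing.
            self.unhandled.append(event.get("type"))
            return
        handler(event)


router = EventRouter()
session_ids: List[str] = []


@router.on("session.created")
def on_session_created(event: dict) -> None:
    session_ids.append(event["session"]["id"])


# Messages trimmed from the samples in this topic.
router.dispatch('{"type": "session.created", "session": {"id": "sess_Ov7GOXoNXhNjlxXtOGKQS"}}')
router.dispatch('{"event_id": "event_RoUu4T8yExPMI37GKwaOC", "type": "input_audio_buffer.cleared"}')
print(session_ids, router.unhandled)
```

Recording unhandled types rather than raising keeps the client running if the server sends an event type that the client does not recognize.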