This topic describes the server-side events for the Qwen-Omni-Realtime API.
For more information, see Real-time multi-modal.
error
An error message returned by the server.
event_id string The unique identifier for this event. | {
"event_id": "event_RoUu4T8yExPMI37GKwaOC",
"type": "error",
"error": {
"type": "invalid_request_error",
"code": "invalid_value",
"message": "Invalid modalities: ['audio']. Supported combinations are: ['text'] and ['audio', 'text'].",
"param": "session.modalities"
}
}
|
type string The event type. This is always error. |
error object The detailed information about the error. Properties type string The error type. code string The error code. message string The error message. param string The parameter related to the error, such as session.modalities. |
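A client usually needs to separate error events from normal ones before further processing. The following is a minimal sketch, assuming events arrive as JSON text; the `RealtimeServerError` class and the `raise_for_error` helper are hypothetical names used for illustration and are not part of the API.

```python
import json


class RealtimeServerError(Exception):
    """Hypothetical exception that wraps the `error` object of an error event."""

    def __init__(self, error: dict):
        self.error_type = error.get("type")
        self.code = error.get("code")
        self.param = error.get("param")
        super().__init__(f"[{self.code}] {error.get('message')} (param: {self.param})")


def raise_for_error(raw: str) -> dict:
    """Parse one server event and raise if its type is `error`."""
    event = json.loads(raw)
    if event.get("type") == "error":
        raise RealtimeServerError(event.get("error", {}))
    return event


# Sample payload shortened from the example above.
sample = json.dumps({
    "event_id": "event_RoUu4T8yExPMI37GKwaOC",
    "type": "error",
    "error": {
        "type": "invalid_request_error",
        "code": "invalid_value",
        "message": "Invalid modalities: ['audio'].",
        "param": "session.modalities",
    },
})

try:
    raise_for_error(sample)
except RealtimeServerError as exc:
    caught = exc
```

Keeping `param` on the exception lets the caller point at the offending setting, such as session.modalities.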
session.created
After a client connects, this is the first event that the server returns. It contains the default configuration information for the session.
event_id string The unique identifier for this event. | {
"event_id": "event_RdvlSpbBb2ssyBjYrDHjt",
"type": "session.created",
"session": {
"object": "realtime.session",
"model": "qwen3-omni-flash-realtime",
"modalities": [
"text",
"audio"
],
"voice": "Cherry",
"input_audio_format": "pcm16",
"output_audio_format": "pcm24",
"input_audio_transcription": {
"model": "gummy-realtime-v1"
},
"turn_detection": {
"type": "server_vad",
"threshold": 0.5,
"prefix_padding_ms": 300,
"silence_duration_ms": 800,
"create_response": true,
"interrupt_response": true
},
"tools": [],
"tool_choice": "auto",
"temperature": 0.8,
"id": "sess_Ov7GOXoNXhNjlxXtOGKQS"
}
}
|
type string The event type. This is always session.created. |
session object The configuration information for the session. Properties object string This is always realtime.session. model string The model used. modalities array The output modality settings for the model. voice string The timbre of the audio generated by the model. input_audio_format string The input audio format. This is always pcm16. output_audio_format string The output audio format. This is always pcm24. input_audio_transcription object The configuration for speech transcription. Properties model string The speech transcription model. This is always gummy-realtime-v1. turn_detection object The configuration for voice activity detection (VAD). Properties type string The server-side VAD type. This is always server_vad. threshold float The VAD detection threshold. silence_duration_ms integer The duration of silence to detect the end of speech. temperature float The temperature parameter for the model. |
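Because session.created carries the default configuration, a client can read it before deciding whether to send session.update. The sketch below assumes the event has already been parsed into a dictionary; `summarize_session` is a hypothetical helper, and the sample is a shortened copy of the example above.

```python
import json


def summarize_session(event: dict) -> dict:
    """Pick a few fields out of a session.created event."""
    if event.get("type") != "session.created":
        raise ValueError(f"unexpected event type: {event.get('type')}")
    session = event["session"]
    vad = session.get("turn_detection") or {}
    return {
        "session_id": session.get("id"),
        "model": session.get("model"),
        "voice": session.get("voice"),
        "audio_out": "audio" in session.get("modalities", []),
        "silence_duration_ms": vad.get("silence_duration_ms"),
    }


event = json.loads("""
{
  "event_id": "event_RdvlSpbBb2ssyBjYrDHjt",
  "type": "session.created",
  "session": {
    "object": "realtime.session",
    "model": "qwen3-omni-flash-realtime",
    "modalities": ["text", "audio"],
    "voice": "Cherry",
    "turn_detection": {"type": "server_vad", "threshold": 0.5, "silence_duration_ms": 800},
    "id": "sess_Ov7GOXoNXhNjlxXtOGKQS"
  }
}
""")
summary = summarize_session(event)
```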
session.updated
After receiving a user's session.update request, the server returns this event if the request is successful. If an error occurs, the server returns an error event.
event_id string The unique identifier for this event. | {
"event_id": "event_X1HsXS4b4uptp6yo1LgKd",
"type": "session.updated",
"session": {
"id": "sess_Aih6vAcY5Ddt6jwFx1tCa",
"object": "realtime.session",
"model": "qwen3-omni-flash-realtime",
"modalities": [
"text",
"audio"
],
"instructions": "You are a personal assistant named Xiaoyun. Please answer user questions accurately and in a friendly manner, always responding with a helpful attitude.",
"voice": "Cherry",
"input_audio_format": "pcm16",
"output_audio_format": "pcm24",
"input_audio_transcription": {
"model": "gummy-realtime-v1"
},
"turn_detection": {
"type": "server_vad",
"threshold": 0.1,
"prefix_padding_ms": 500,
"silence_duration_ms": 900,
"create_response": true,
"interrupt_response": true
},
"temperature": 0.8,
"max_response_output_token": "inf",
"max_tokens": 16384,
"repetition_penalty": 1.05,
"presence_penalty": 0.0,
"top_k": 50,
"top_p": 1.0,
"seed": -1
}
}
|
type string The event type. This is always session.updated. |
session object The configuration information for the session. Properties temperature float The temperature parameter for the model. modalities array The output modality settings for the model. voice string The timbre of the audio generated by the model. instructions string The objective and role of the model. input_audio_format string The input audio format. This is always pcm16. output_audio_format string The output audio format. This is always pcm24. input_audio_transcription object The configuration for speech transcription. Properties model string The speech transcription model. This is always gummy-realtime-v1. turn_detection object The configuration for VAD. Properties type string The server-side VAD type. This is always server_vad. threshold float The VAD detection threshold. silence_duration_ms integer The duration of silence to detect the end of speech. top_p float The probability threshold for nucleus sampling. top_k integer The size of the candidate set for sampling during model generation. max_tokens integer The maximum number of tokens that the model can return for the current request. repetition_penalty float Controls the degree of repetition in consecutive sequences during model generation. presence_penalty float Controls the degree of repetition when the model generates content. seed integer The random seed used for generation, which affects how consistent the results are across requests. |
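One way to use session.updated is to confirm that the server echoes back the values the client asked for. This is a sketch under that assumption; `unapplied_settings` is a hypothetical helper, and the requested temperature of 0.5 is invented here only to show a mismatch.

```python
def unapplied_settings(requested: dict, applied: dict, prefix: str = "") -> list:
    """Return dotted paths of requested settings that differ in the echoed session."""
    mismatches = []
    for key, want in requested.items():
        path = f"{prefix}{key}"
        got = applied.get(key)
        if isinstance(want, dict) and isinstance(got, dict):
            # Recurse into nested objects such as turn_detection.
            mismatches.extend(unapplied_settings(want, got, prefix=f"{path}."))
        elif got != want:
            mismatches.append(path)
    return mismatches


# Subset of the `session` object from the session.updated example above.
applied = {
    "voice": "Cherry",
    "temperature": 0.8,
    "turn_detection": {"type": "server_vad", "threshold": 0.1, "silence_duration_ms": 900},
}
# Hypothetical request: the temperature value is made up for illustration.
requested = {
    "voice": "Cherry",
    "temperature": 0.5,
    "turn_detection": {"threshold": 0.1, "silence_duration_ms": 900},
}
mismatches = unapplied_settings(requested, applied)
```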
input_audio_buffer.speech_started
In VAD mode, the server returns this event when it detects the start of speech in the audio buffer.
This event might also be triggered each time audio is added to the buffer if the server has not yet detected speech.
event_id string The unique identifier for this event. | {
"event_id": "event_Pvp8nEhsQuGCQbFJ9x58n",
"type": "input_audio_buffer.speech_started",
"audio_start_ms": 3647,
"item_id": "item_YbAiGvK2H7YaS34o4R6Ba"
}
|
type string The event type. This is always input_audio_buffer.speech_started. |
audio_start_ms integer The time, in milliseconds, from when audio is first written to the buffer until speech is first detected. |
item_id string The ID of the user message item that will be created when speech stops. User message items are used to append user input to the conversation history for subsequent model inference and generation. |
input_audio_buffer.speech_stopped
In VAD mode, the server returns this event when it detects the end of speech in the audio buffer.
At the same time, the server also returns a conversation.item.created event to create the corresponding user message item.
event_id string The unique identifier for this event. | {
"event_id": "event_UhQiqNVRsgUiq4KUS5Xb5",
"type": "input_audio_buffer.speech_stopped",
"audio_end_ms": 4453,
"item_id": "item_YbAiGvK2H7YaS34o4R6Ba"
}
|
type string The event type. This is always input_audio_buffer.speech_stopped. |
audio_end_ms integer The time in milliseconds from the start of the session until speech stops. |
item_id string The ID of the user message item that will be created. |
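The speech_started and speech_stopped events for one utterance share an item_id, so they can be paired. The sketch below estimates the length of the detected speech, assuming that audio_start_ms and audio_end_ms are measured from the same origin; the descriptions above word the two reference points differently, so treat the result as an approximation. `speech_duration_ms` is a hypothetical helper.

```python
def speech_duration_ms(started: dict, stopped: dict) -> int:
    """Estimate speech length from a paired speech_started / speech_stopped event."""
    if started["item_id"] != stopped["item_id"]:
        raise ValueError("events belong to different items")
    # Assumption: both timestamps use the same time origin.
    return stopped["audio_end_ms"] - started["audio_start_ms"]


started = {
    "type": "input_audio_buffer.speech_started",
    "audio_start_ms": 3647,
    "item_id": "item_YbAiGvK2H7YaS34o4R6Ba",
}
stopped = {
    "type": "input_audio_buffer.speech_stopped",
    "audio_end_ms": 4453,
    "item_id": "item_YbAiGvK2H7YaS34o4R6Ba",
}
duration = speech_duration_ms(started, stopped)
```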
input_audio_buffer.committed
This event is returned when the input audio buffer is committed.
In VAD mode, the server automatically commits the audio buffer and returns this event when it detects that the user has finished speaking.
In Manual mode, the server returns this event after the client sends an input_audio_buffer.commit event.
event_id string The unique identifier for this event. | {
"event_id": "event_Iy6sUzL1nmdFgshFYxJEz",
"type": "input_audio_buffer.committed",
"item_id": "item_YbAiGvK2H7YaS34o4R6Ba"
}
|
type string The event type. This is always input_audio_buffer.committed. |
item_id string The ID of the user message item that will be created. |
input_audio_buffer.cleared
The server returns this event after the client sends an input_audio_buffer.clear event.
event_id string The unique identifier for this event. | {
"event_id": "event_RoUu4T8yExPMI37GKwaOC",
"type": "input_audio_buffer.cleared"
}
|
type string The event type. This is always input_audio_buffer.cleared. |
conversation.item.created
This event is returned when a conversation item is created.
event_id string The unique identifier for this event. | {
"event_id": "event_JEfkrr9gO3Ny7Xcv9bGVd",
"type": "conversation.item.created",
"item": {
"id": "item_YbAiGvK2H7YaS34o4R6Ba",
"object": "realtime.item",
"type": "message",
"status": "in_progress",
"role": "assistant",
"content": [
{
"type": "input_audio"
}
]
}
}
|
type string The event type. This is always conversation.item.created. |
item object The item to add to the conversation. Properties id string The unique ID of the conversation item. object string This is always realtime.item. status string The status of the conversation item. role string The role of the message. content array The content of the message. |
conversation.item.input_audio_transcription.completed
This event provides the transcription result that is generated after the user's audio is buffered. The transcription is processed by a separate speech recognition model, which is currently set to gummy-realtime-v1.
The transcribed text from the speech recognition model may differ from the text that is processed by the Qwen-Omni-Realtime model and is for reference only.
event_id string The unique identifier for this event. | {
"event_id": "event_FrrZcxiDfTB9LD9p4pVng",
"type": "conversation.item.input_audio_transcription.completed",
"item_id": "item_YbAiGvK2H7YaS34o4R6Ba",
"content_index": 0,
"transcript": "Hello."
}
|
type string The event type. This is always conversation.item.input_audio_transcription.completed. |
item_id string The ID of the user message item. |
content_index integer The value is currently fixed to 0. |
transcript string The transcribed text content. |
conversation.item.input_audio_transcription.failed
If input audio transcription is enabled and fails, the server returns this event. This event is independent of the error event to help the client identify the issue.
event_id string The unique identifier for this event. | {
"type": "conversation.item.input_audio_transcription.failed",
"item_id": "<item_id>",
"content_index": 0,
"error": {
"code": "<code>",
"message": "<message>",
"param": "<param>"
}
}
|
type string The event type. This is always conversation.item.input_audio_transcription.failed. |
item_id string The ID of the user message item. |
content_index integer The value is currently fixed to 0. |
error object The error details. Properties code string The error code. message string The error message. param string The parameter related to the error. |
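Since transcription success and failure arrive as separate events keyed by item_id, a client can record both outcomes side by side. The following is a minimal sketch; `record_transcription` is a hypothetical helper, and the item ID, code, and message in the failed sample are invented placeholders because the example above does not show real values.

```python
PREFIX = "conversation.item.input_audio_transcription."


def record_transcription(event: dict, transcripts: dict, failures: dict) -> None:
    """Store the transcript or the failure reason under the user message item ID."""
    kind = event.get("type", "")
    if kind == PREFIX + "completed":
        transcripts[event["item_id"]] = event["transcript"]
    elif kind == PREFIX + "failed":
        err = event.get("error", {})
        failures[event["item_id"]] = f"{err.get('code')}: {err.get('message')}"


transcripts, failures = {}, {}
record_transcription({
    "type": PREFIX + "completed",
    "item_id": "item_YbAiGvK2H7YaS34o4R6Ba",
    "content_index": 0,
    "transcript": "Hello.",
}, transcripts, failures)
record_transcription({
    "type": PREFIX + "failed",
    "item_id": "item_hypothetical",  # invented for illustration
    "content_index": 0,
    "error": {"code": "example_code", "message": "example message", "param": None},
}, transcripts, failures)
```

Because the transcript is for reference only, a failure here does not need to stop the conversation.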
response.created
The server returns this event when it generates a new model response.
event_id string The unique identifier for this event. | {
"event_id": "event_XuDavMzQN3KKepqGu3KRh",
"type": "response.created",
"response": {
"id": "resp_HaVOPdbmX6vifiV5pAfJY",
"object": "realtime.response",
"conversation_id": "conv_FjJaccpnvwHNo9cPVuzGc",
"status": "in_progress",
"modalities": [
"text",
"audio"
],
"voice": "Cherry",
"output_audio_format": "pcm24",
"output": []
}
}
|
type string The event type. This is always response.created. |
response object The response object. Properties id string The unique ID of the response. conversation_id string The unique ID of the current session. object string The object type. For this event, it is always realtime.response. status string The status of the response. Valid values: completed, failed, in_progress, and incomplete. modalities array The modalities of the response. voice string The timbre of the audio generated by the model. output array The output items. This is currently empty for this event. |
response.done
After the response is generated, the server returns this event. The response object in the event contains all output items except for the raw audio data.
event_id string The unique identifier for this event. | {
"event_id": "event_CSaxRRYLvbrfexDXAEuDG",
"type": "response.done",
"response": {
"id": "resp_HaVOPdbmX6vifiV5pAfJY",
"object": "realtime.response",
"conversation_id": "conv_FjJaccpnvwHNo9cPVuzGc",
"status": "completed",
"modalities": [
"text",
"audio"
],
"voice": "Cherry",
"output_audio_format": "pcm24",
"output": [
{
"id": "item_Ls6MtCUWO7LM4E59QziNv",
"object": "realtime.item",
"type": "message",
"status": "completed",
"role": "assistant",
"content": [
{
"type": "audio",
"transcript": "Hello! Is there anything I can help you with?"
}
]
}
],
"usage": {
"total_tokens": 377,
"input_tokens": 336,
"output_tokens": 41,
"input_tokens_details": {
"text_tokens": 228,
"audio_tokens": 108
},
"output_tokens_details": {
"text_tokens": 9,
"audio_tokens": 32
}
}
}
}
|
type string The event type. This is always response.done. |
response object The response object. Properties id string The unique ID of the response. conversation_id string The unique ID of the current session. object string The object type. For this event, it is always realtime.response. status string The status of the response. modalities array The modalities of the response. voice string The timbre of the audio generated by the model. output array The output items of the response. Properties id string The ID corresponding to the response output. type string The type of the output item. The value is currently set to message. object string The object type of the output item. The value is currently set to realtime.item. status string The status of the output item. role string The role of the output item. content array The content of the output item. Properties type string The type of the output content. The value is text if the output is plain text, or audio if the output includes audio. text string The output text content. transcript string The text content that is transcribed from the audio. usage object The token consumption information for this response. |
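The response.done payload is a convenient place to collect the final text and the token usage in one step. The sketch below uses hypothetical helpers, `response_text` and `usage_adds_up`. The check that total_tokens equals input_tokens plus output_tokens holds for the sample above, but it is an observation about that sample rather than a documented guarantee.

```python
def response_text(response: dict) -> str:
    """Join the text of every content part, using `transcript` for audio parts."""
    parts = []
    for item in response.get("output", []):
        for part in item.get("content", []):
            parts.append(part.get("text") or part.get("transcript") or "")
    return "".join(parts)


def usage_adds_up(usage: dict) -> bool:
    """Sanity check observed in the sample: total = input + output."""
    return usage["total_tokens"] == usage["input_tokens"] + usage["output_tokens"]


# Shortened copy of the `response` object from the example above.
response = {
    "id": "resp_HaVOPdbmX6vifiV5pAfJY",
    "status": "completed",
    "output": [{
        "id": "item_Ls6MtCUWO7LM4E59QziNv",
        "type": "message",
        "role": "assistant",
        "content": [{
            "type": "audio",
            "transcript": "Hello! Is there anything I can help you with?",
        }],
    }],
    "usage": {"total_tokens": 377, "input_tokens": 336, "output_tokens": 41},
}
final_text = response_text(response)
consistent = usage_adds_up(response["usage"])
```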
response.text.delta
When the output modality includes only text and the model incrementally generates new text, the server returns this event.
event_id string The unique identifier for this event. | {
"delta": "Hello",
"event_id": "event_TH49MauuPmRo1RGaMSlP7",
"type": "response.text.delta",
"response_id": "resp_PrRSvPVpnCExdUOGHHLuP",
"item_id": "item_L8IRm9kRXFpxoOjDqDC96",
"output_index": 0,
"content_index": 0
}
|
type string The event type. This is always response.text.delta. |
delta string The incremental text returned. |
response_id string The ID of the response. |
item_id string The ID of the message item. You can use this ID to associate items from the same message. |
output_index integer The index of the output item in the response. The value is currently fixed to 0. |
content_index integer The index of the content part within the output item. The value is currently fixed to 0. |
response.text.done
When the output modality includes only text and the model finishes generating text, the server returns this event.
This event is also returned when the response is interrupted, incomplete, or canceled.
event_id string The unique identifier for this event. | {
"event_id": "event_B1lIeE2Nac33zn5V7h2mm",
"type": "response.text.done",
"response_id": "resp_B1lIdtjF4Noqpn5NOjznj",
"item_id": "item_B1lIdJsAJlJiFs8ztWpJt",
"output_index": 0,
"content_index": 0,
"text": "How can I assist you today?"
}
|
type string The event type. This is always response.text.done. |
response_id string The ID of the response. |
item_id string The ID of the message item. |
output_index integer The index of the output item in the response. |
content_index integer The index of the content part within the output item. |
text string The complete text output by the model. |
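The delta and done events for text can be combined into a small accumulator: deltas build a live preview, and the text field of response.text.done supplies the final value. This is a sketch, not a prescribed client design; `TextAccumulator` is a hypothetical class, and the way the sample sentence is split into two deltas is invented for illustration.

```python
from collections import defaultdict


class TextAccumulator:
    """Collect response.text.delta chunks per item until response.text.done arrives."""

    def __init__(self):
        self._buffers = defaultdict(list)
        self.finished = {}

    def handle(self, event: dict) -> None:
        kind = event.get("type")
        if kind == "response.text.delta":
            self._buffers[event["item_id"]].append(event["delta"])
        elif kind == "response.text.done":
            # Prefer the complete text from the server over the joined deltas.
            self._buffers.pop(event["item_id"], None)
            self.finished[event["item_id"]] = event["text"]

    def partial(self, item_id: str) -> str:
        """Text received so far for an item that is still streaming."""
        return "".join(self._buffers.get(item_id, []))


acc = TextAccumulator()
item = "item_B1lIdJsAJlJiFs8ztWpJt"
acc.handle({"type": "response.text.delta", "item_id": item, "delta": "How can I"})
acc.handle({"type": "response.text.delta", "item_id": item, "delta": " assist you today?"})
preview = acc.partial(item)
acc.handle({"type": "response.text.done", "item_id": item, "text": "How can I assist you today?"})
```

The same pattern applies to response.audio_transcript.delta and response.audio_transcript.done, with transcript in place of text.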
response.audio.delta
When the output modality includes audio and the model incrementally generates new audio data, the server returns this event.
event_id string The unique identifier for this event. | {
"event_id": "event_B1osWMZBtrEQbiIwW0qHQ",
"type": "response.audio.delta",
"response_id": "resp_P79OOMs8LnrXVpiIHUCKR",
"item_id": "item_OFaPGtzfWCPyGzxnuEX9i",
"output_index": 0,
"content_index": 0,
"delta": "{base64 audio}"
}
|
type string The event type. This is always response.audio.delta. |
response_id string The ID of the response. |
item_id string The ID of the message item. |
output_index integer The index of the output item in the response. |
content_index integer The index of the content part within the output item. |
delta string The incremental audio data output by the model, encoded in Base64. |
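Because delta is Base64-encoded, each chunk must be decoded before it is played or written to a file. The sketch below only decodes and concatenates the bytes per item; it makes no assumption about the sample rate or bit depth of the output format. `append_audio` is a hypothetical helper, and the four sample bytes are made up.

```python
import base64


def append_audio(event: dict, buffers: dict) -> int:
    """Decode one response.audio.delta chunk and append it to the item's buffer."""
    chunk = base64.b64decode(event["delta"])
    buffers.setdefault(event["item_id"], bytearray()).extend(chunk)
    return len(chunk)


# Invented payload standing in for real audio bytes.
fake_audio = bytes([0, 1, 2, 3])
event = {
    "type": "response.audio.delta",
    "item_id": "item_OFaPGtzfWCPyGzxnuEX9i",
    "delta": base64.b64encode(fake_audio).decode("ascii"),
}

buffers = {}
first = append_audio(event, buffers)
second = append_audio(event, buffers)
```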
response.audio.done
When the output modality includes audio and the model finishes generating audio data, the server returns this event.
This event is also returned when the response is interrupted, incomplete, or canceled.
event_id string The unique identifier for this event. | {
"event_id": "event_Le1TDl7VfyHQxl47DtGxI",
"type": "response.audio.done",
"response_id": "resp_HaVOPdbmX6vifiV5pAfJY",
"item_id": "item_Ls6MtCUWO7LM4E59QziNv",
"output_index": 0,
"content_index": 0
}
|
type string The event type. This is always response.audio.done. |
response_id string The ID of the response. |
item_id string The ID of the message item. |
output_index integer The index of the output item in the response. |
content_index integer The index of the content part within the output item. |
response.audio_transcript.delta
When the output modality includes audio and the model incrementally generates text corresponding to the new audio, the server returns a response.audio_transcript.delta event.
event_id string The unique identifier for this event. | {
"event_id": "event_BksW7fOwnyavZdDxIzZYM",
"type": "response.audio_transcript.delta",
"response_id": "resp_HaVOPdbmX6vifiV5pAfJY",
"item_id": "item_Ls6MtCUWO7LM4E59QziNv",
"output_index": 0,
"content_index": 0,
"delta": "Is there anything"
}
|
type string The event type. This is always response.audio_transcript.delta. |
response_id string The ID of the response. |
item_id string The ID of the message item. |
output_index integer The index of the output item in the response. |
content_index integer The index of the content part within the output item. |
delta string The incremental text. |
response.audio_transcript.done
When the output modality includes audio and the model finishes transcribing the audio, the server returns a response.audio_transcript.done event.
event_id string The unique identifier for this event. | {
"event_id": "event_X49tL2WerT4WjxcmH16lS",
"type": "response.audio_transcript.done",
"response_id": "resp_HaVOPdbmX6vifiV5pAfJY",
"item_id": "item_Ls6MtCUWO7LM4E59QziNv",
"output_index": 0,
"content_index": 0,
"transcript": "Hello! Is there anything I can help you with?"
}
|
type string The event type. This is always response.audio_transcript.done. |
response_id string The ID of the response. |
item_id string The ID of the message item. |
output_index integer The index of the output item in the response. |
content_index integer The index of the content part within the output item. |
transcript string The complete text. |
response.output_item.added
The server returns this event when a new item is created during response generation.
event_id string The unique identifier for this event. | {
"event_id": "event_DsCO341DEVtiATtCB6BUY",
"type": "response.output_item.added",
"response_id": "resp_HaVOPdbmX6vifiV5pAfJY",
"output_index": 0,
"item": {
"id": "item_Ls6MtCUWO7LM4E59QziNv",
"object": "realtime.item",
"type": "message",
"status": "in_progress",
"role": "assistant",
"content": []
}
}
|
type string The event type. This is always response.output_item.added. |
response_id string The ID of the response. |
output_index integer The index of the output item in the response. |
item object Information about the output item. Properties id string The unique ID of the output item. object string This is always realtime.item. status string The status of the output item. role string The role of the message sender. content array The content of the message. |
response.output_item.done
The server returns this event when the new item output is complete.
event_id string The unique identifier for this event. | {
"event_id": "event_MEu5nlLw1LsOguHiehIP8",
"type": "response.output_item.done",
"response_id": "resp_HaVOPdbmX6vifiV5pAfJY",
"output_index": 0,
"item": {
"id": "item_Ls6MtCUWO7LM4E59QziNv",
"object": "realtime.item",
"type": "message",
"status": "completed",
"role": "assistant",
"content": [
{
"type": "audio",
"text": "Hello! Is there anything I can help you with?"
}
]
}
}
|
type string The event type. This is always response.output_item.done. |
response_id string The ID of the response. |
output_index integer The index of the output item in the response. |
item object Information about the output item. Properties id string The unique ID of the output item. object string This is always realtime.item. status string The status of the output item. role string The role of the message sender. content array The content of the message. |
response.content_part.added
The server returns this event when a new content part is added to an assistant message item during response generation.
event_id string The unique identifier for this event. | {
"event_id": "event_AVBOmrgY3C8bjlRajfSUT",
"type": "response.content_part.added",
"response_id": "resp_HaVOPdbmX6vifiV5pAfJY",
"item_id": "item_Ls6MtCUWO7LM4E59QziNv",
"output_index": 0,
"content_index": 0,
"part": {
"type": "audio",
"text": ""
}
}
|
type string The event type. This is always response.content_part.added. |
response_id string The ID of the response. |
item_id string The ID of the message item. |
output_index integer The index of the output item in the response. The value is currently fixed to 0. |
content_index integer The index of the content part within the output item. The value is currently fixed to 0. |
part object Information about the content part. Properties type string The type of the content part. text string The text of the content part. |
response.content_part.done
The server returns this event when the streaming of a content part in an assistant message item is complete.
event_id string The unique identifier for this event. | {
"event_id": "event_Il8HD19v58Qr5IBkw7LtN",
"type": "response.content_part.done",
"response_id": "resp_HaVOPdbmX6vifiV5pAfJY",
"item_id": "item_Ls6MtCUWO7LM4E59QziNv",
"output_index": 0,
"content_index": 0,
"part": {
"type": "audio",
"text": "Hello! Is there anything I can help you with?"
}
}
|
type string The event type. This is always response.content_part.done. |
response_id string The ID of the response. |
item_id string The ID of the message item. |
output_index integer The index of the output item in the response. The value is currently fixed to 0. |
content_index integer The index of the content part within the content array of the item. The value is currently fixed to 0. |
part object Information about the completed content part. Properties type string The type of the content part. text string The text of the content part. |
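Every server-side event in this topic carries a type field, so a client can route events with a simple lookup table. The following is a minimal sketch of that idea, assuming events arrive as JSON text; `EventRouter` is a hypothetical class and not part of any SDK.

```python
import json
from typing import Callable, Dict, List

Handler = Callable[[dict], None]


class EventRouter:
    """Dispatch parsed server events to handlers registered by event type."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Handler] = {}
        self.unhandled: List[str] = []

    def on(self, event_type: str, handler: Handler) -> None:
        self._handlers[event_type] = handler

    def dispatch(self, raw: str) -> None:
        event = json.loads(raw)
        handler = self._handlers.get(event.get("type"))
        if handler is None:
            # Remember unknown types instead of failing, so new events do not break the client.
            self.unhandled.append(event.get("type"))
        else:
            handler(event)


router = EventRouter()
seen = []
router.on("response.content_part.done", lambda e: seen.append(e["part"]["text"]))

router.dispatch(json.dumps({
    "type": "response.content_part.done",
    "part": {"type": "audio", "text": "Hello! Is there anything I can help you with?"},
}))
router.dispatch(json.dumps({"type": "input_audio_buffer.cleared"}))
```

Recording unhandled types rather than raising keeps the client tolerant of events it has not registered a handler for.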