This topic describes the events that the server sends to the client during a WebSocket session with the Qwen-ASR-Realtime API.
error
Sent when the server detects a client or server error.
| Parameter | Type | Description |
| --- | --- | --- |
| type | string | The event type. Fixed to error. |
| event_id | string | The event ID. |
| error.type | string | The error type. |
| error.code | string | The error code. |
| error.message | string | The specific error message. For solutions, see Error messages. |
| error.param | string | The parameter related to the error. |
| error.event_id | string | The event ID related to the error. |

{
  "event_id": "event_B2uoU7VOt1AAITsPRPH9n",
  "type": "error",
  "error": {
    "type": "invalid_request_error",
    "code": "invalid_value",
    "message": "Invalid value: 'whisper-1xx'. Supported values are: 'whisper-1'.",
    "param": "session.input_audio_transcription.model",
    "event_id": "event_123"
  }
}
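As a minimal sketch of how a client might surface this event, the following Python snippet turns an error event into a one-line log message. The function name `describe_error` is hypothetical, and treating `error.param` and `error.event_id` as optional is an assumption, not something this reference states.

```python
import json


def describe_error(raw: str) -> str:
    """Turn a server `error` event (JSON text) into a one-line summary."""
    event = json.loads(raw)
    if event.get("type") != "error":
        raise ValueError(f"expected an error event, got {event.get('type')!r}")
    err = event.get("error") or {}
    summary = f"[{err.get('type')}/{err.get('code')}] {err.get('message')}"
    # Assumption: error.param and error.event_id may be absent or null.
    if err.get("param"):
        summary += f" (param: {err['param']})"
    if err.get("event_id"):
        summary += f" (caused by: {err['event_id']})"
    return summary


# The sample event from this section, serialized to JSON text.
sample = json.dumps({
    "event_id": "event_B2uoU7VOt1AAITsPRPH9n",
    "type": "error",
    "error": {
        "type": "invalid_request_error",
        "code": "invalid_value",
        "message": "Invalid value: 'whisper-1xx'. Supported values are: 'whisper-1'.",
        "param": "session.input_audio_transcription.model",
        "event_id": "event_123",
    },
})
print(describe_error(sample))
```

Logging `error.event_id` alongside the message makes it easier to match the error to the client event that triggered it.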
session.created
The first event that the server sends after a successful connection. It contains the default session configuration.
| Parameter | Type | Description |
| --- | --- | --- |
| type | string | The event type. Fixed to session.created. |
| event_id | string | The event ID. |
| session.id | string | The ID of the current WebSocket session. |
| session.object | string | Fixed to realtime.session. |
| session.model | string | The model name. |
| session.modalities | array[string] | The output modality of the model. Fixed to ["text"]. |
| session.input_audio_format | string | The input audio format. |
| session.input_audio_transcription | object | Configuration for speech recognition. See input_audio_transcription in the client's session.update event for details. |
| session.turn_detection | object | The Voice Activity Detection (VAD) configuration. |
| session.turn_detection.type | string | Fixed to server_vad. |
| session.turn_detection.threshold | float | The VAD detection threshold. |
| session.turn_detection.silence_duration_ms | integer | The VAD sentence-break detection threshold in milliseconds (ms). |

{
  "event_id": "event_1234",
  "type": "session.created",
  "session": {
    "id": "sess_001",
    "object": "realtime.session",
    "model": "qwen3-asr-flash-realtime",
    "modalities": ["text"],
    "input_audio_format": "pcm16",
    "input_audio_transcription": null,
    "turn_detection": {
      "type": "server_vad",
      "threshold": 0.5,
      "silence_duration_ms": 200
    }
  }
}
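A client may want to record the VAD settings that the server reports. The sketch below reads them from a session.created or session.updated event that has already been parsed into a dictionary. `VadSettings` and `vad_settings` are hypothetical names, and handling a null `turn_detection` is an assumption about sessions without VAD.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VadSettings:
    threshold: float
    silence_duration_ms: int


def vad_settings(event: dict) -> Optional[VadSettings]:
    """Extract the VAD configuration from a parsed session event."""
    if event.get("type") not in ("session.created", "session.updated"):
        raise ValueError(f"not a session event: {event.get('type')!r}")
    turn_detection = event["session"].get("turn_detection")
    # Assumption: turn_detection can be null when VAD is not in use.
    if turn_detection is None:
        return None
    return VadSettings(
        threshold=float(turn_detection["threshold"]),
        silence_duration_ms=int(turn_detection["silence_duration_ms"]),
    )


# The sample event from this section as a Python dictionary.
created = {
    "event_id": "event_1234",
    "type": "session.created",
    "session": {
        "id": "sess_001",
        "object": "realtime.session",
        "model": "qwen3-asr-flash-realtime",
        "modalities": ["text"],
        "input_audio_format": "pcm16",
        "input_audio_transcription": None,
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.5,
            "silence_duration_ms": 200,
        },
    },
}
print(vad_settings(created))
```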
session.updated
Sent after successfully processing the client's session.update event. On error, an error event is sent instead.
| Parameter | Type | Description |
| --- | --- | --- |
| type | string | The event type. Fixed to session.updated. |

For descriptions of the other parameters, see session.created.

{
  "event_id": "event_1234",
  "type": "session.updated",
  "session": {
    "id": "sess_001",
    "object": "realtime.session",
    "model": "qwen3-asr-flash-realtime",
    "modalities": ["text"],
    "input_audio_format": "pcm16",
    "input_audio_transcription": null,
    "turn_detection": {
      "type": "server_vad",
      "threshold": 0.5,
      "silence_duration_ms": 200
    }
  }
}
input_audio_buffer.speech_started
Sent in VAD mode when the server detects the start of speech in the audio buffer.
The server can send this event each time audio is added to the buffer, unless the start of speech has already been detected.
| Parameter | Type | Description |
| --- | --- | --- |
| type | string | The event type. Fixed to input_audio_buffer.speech_started. |
| event_id | string | The event ID. |
| audio_start_ms | integer | The time in milliseconds from when audio was first written to the buffer during the session until speech was first detected. |
| item_id | string | The ID of the user message item that will be created. |

{
  "event_id": "event_B1lV7FPbgTv9qGxPI1tH4",
  "type": "input_audio_buffer.speech_started",
  "audio_start_ms": 64,
  "item_id": "item_B1lV7jWLscp4mMV8hSs8c"
}
input_audio_buffer.speech_stopped
Sent in VAD mode when speech ends in the audio buffer.
After this event, the server immediately sends a conversation.item.created event containing the user message item created from the audio buffer.
| Parameter | Type | Description |
| --- | --- | --- |
| type | string | The event type. Fixed to input_audio_buffer.speech_stopped. |
| event_id | string | The event ID. |
| audio_end_ms | integer | The time in milliseconds from the start of the session until speech stopped. |
| item_id | string | The ID of the user message item that is created when speech stops. |

{
  "event_id": "event_B3GGEYh2orwNIdhUagZPz",
  "type": "input_audio_buffer.speech_stopped",
  "audio_end_ms": 28128,
  "item_id": "item_B3GGE8ry4yqbqJGzrVhEM"
}
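The two VAD events can be paired to recover the boundaries of each speech segment. The following sketch is one way to do this over a sequence of parsed events. The generator name `speech_segments` is hypothetical, and it assumes that `audio_start_ms` and `audio_end_ms` are measured on the same timeline, which the parameter descriptions suggest but do not guarantee.

```python
from typing import Iterable, Iterator, Tuple

STARTED = "input_audio_buffer.speech_started"
STOPPED = "input_audio_buffer.speech_stopped"


def speech_segments(events: Iterable[dict]) -> Iterator[Tuple[int, int]]:
    """Yield (start_ms, end_ms) for each started/stopped pair, in order."""
    start_ms = None
    for event in events:
        if event["type"] == STARTED:
            start_ms = event["audio_start_ms"]
        elif event["type"] == STOPPED and start_ms is not None:
            yield start_ms, event["audio_end_ms"]
            start_ms = None  # wait for the next speech_started event


# Values taken from the sample events in this topic.
stream = [
    {"type": STARTED, "audio_start_ms": 64, "item_id": "item_B1lV7jWLscp4mMV8hSs8c"},
    {"type": STOPPED, "audio_end_ms": 28128, "item_id": "item_B3GGE8ry4yqbqJGzrVhEM"},
]
for start, end in speech_segments(stream):
    print(f"speech from {start} ms to {end} ms ({end - start} ms long)")
```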
input_audio_buffer.committed
| Parameter | Type | Description |
| --- | --- | --- |
| type | string | The event type. Fixed to input_audio_buffer.committed. |
| event_id | string | The event ID. |
| previous_item_id | string | The ID of the previous conversation item. |
| item_id | string | The ID of the user conversation item to be created. |

{
  "event_id": "event_1121",
  "type": "input_audio_buffer.committed",
  "previous_item_id": "msg_001",
  "item_id": "msg_002"
}
conversation.item.created
Sent when a new conversation item is created.
| Parameter | Type | Description |
| --- | --- | --- |
| type | string | The type of the event. Fixed to conversation.item.created. |
| event_id | string | The event ID. |
| previous_item_id | string | The ID of the previous conversation item. |
| item | object | The item to add to the conversation. |
| item.id | string | The unique ID of the conversation item. |
| item.object | string | Fixed to realtime.item. |
| item.type | string | Fixed to message. |
| item.status | string | The status of the conversation item. |
| item.role | string | The role of the message sender. |
| item.content | array[object] | The content of the message. |
| item.content.type | string | Fixed to input_audio. |
| item.content.transcript | string | Fixed to null. The complete recognition result is provided in the conversation.item.input_audio_transcription.completed event. |

{
  "type": "conversation.item.created",
  "event_id": "event_B3GGKbCfBZTpqFHZ0P8vg",
  "previous_item_id": "item_B3GGE8ry4yqbqJGzrVhEM",
  "item": {
    "id": "item_B3GGEPlolCqdMiVbYIf5L",
    "object": "realtime.item",
    "type": "message",
    "status": "completed",
    "role": "user",
    "content": [
      {
        "type": "input_audio",
        "transcript": null
      }
    ]
  }
}
conversation.item.input_audio_transcription.text
Sent frequently with real-time recognition results.
| Parameter | Type | Description |
| --- | --- | --- |
| type | string | The type of the event. Fixed to conversation.item.input_audio_transcription.text. |
| event_id | string | The ID of the event. |
| item_id | string | The ID of the associated conversation item. |
| content_index | integer | The index of the content part that contains the audio. |
| language | string | The language of the recognized audio. If the language request parameter is specified, this value matches it. |
| emotion | string | The emotion detected in the audio. Supported values: surprised, neutral, happy, sad, disgusted, angry, fearful. |
| text | string | The confirmed text prefix: the part of the current sentence that the model has verified and will not change. |
| stash | string | The pre-recognized text suffix that follows the confirmed part. It is a temporary draft that the model is still processing and may revise. |

{
  "event_id": "event_R7Pfu8QVBfP5HmpcbEFSd",
  "type": "conversation.item.input_audio_transcription.text",
  "item_id": "item_MpJQPNQzqVRc9aC9zMwSj",
  "content_index": 0,
  "language": "en",
  "emotion": "neutral",
  "text": "",
  "stash": "Beijing's"
}
To get the complete sentence preview, concatenate: text + stash.
For example, assume a user says, "The weather is nice today, sunny and bright."
The following table shows the event stream you might receive and explains how to interpret it:
| Timestamp | User speech progress | API response (text and stash) | UI display (text + stash) |
| --- | --- | --- | --- |
| T1 | "The..." | text: ""; stash: "The" | The |
| T2 | "...weather is..." | text: ""; stash: "The weather is" | The weather is |
| T3 | "...nice today" | text: "The"; stash: "weather is nice today" | The weather is nice today (Note: "The" is confirmed and moved to the text field.) |
| T4 | (Short pause) | text: "The weather is nice today,"; stash: "" | The weather is nice today, (The first clause is fully confirmed.) |
| T5 | "...sunny and..." | text: "The weather is nice today,"; stash: "sunny and" | The weather is nice today, sunny and |
| T6 | "...bright." | text: "The weather is nice today,"; stash: "sunny and bright." | The weather is nice today, sunny and bright. |
| T7 | (User stops speaking) | - | Use the transcript field of the conversation.item.input_audio_transcription.completed event as the final result. |
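The text + stash rule can be sketched in Python as follows. The class `LivePreview` and its method names are hypothetical. The code concatenates the two fields exactly as described and makes no assumption about whitespace between them, because this topic does not specify it.

```python
class LivePreview:
    """Keep the latest display line for each conversation item."""

    def __init__(self) -> None:
        self.lines = {}  # item_id -> text currently shown in the UI

    def on_text(self, event: dict) -> str:
        # Interim result: confirmed prefix followed by the draft suffix.
        preview = event["text"] + event["stash"]
        self.lines[event["item_id"]] = preview
        return preview

    def on_completed(self, event: dict) -> str:
        # Final result: the transcript replaces any interim preview.
        self.lines[event["item_id"]] = event["transcript"]
        return event["transcript"]


ui = LivePreview()
interim = ui.on_text({
    "type": "conversation.item.input_audio_transcription.text",
    "item_id": "item_MpJQPNQzqVRc9aC9zMwSj",
    "text": "",
    "stash": "Beijing's",
})
print(interim)
```

Keying the preview by `item_id` lets the final transcript overwrite the draft for the same item without affecting other items.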
conversation.item.input_audio_transcription.completed
Sent with the final recognition result. This event marks the end of a conversation item.
| Parameter | Type | Description |
| --- | --- | --- |
| type | string | The event type. Fixed to conversation.item.input_audio_transcription.completed. |
| event_id | string | The event ID. |
| item_id | string | The ID of the associated conversation item. |
| content_index | integer | The index of the content part that contains the audio. |
| language | string | The language of the recognized audio. If the language request parameter is specified, this value matches it. |
| emotion | string | The emotion detected in the audio. Supported values: surprised, neutral, happy, sad, disgusted, angry, fearful. |
| transcript | string | The transcription result. |

{
  "event_id": "event_B3GGEjPT2sLzjBM74W6kB",
  "type": "conversation.item.input_audio_transcription.completed",
  "item_id": "item_B3GGC53jGOuIFcjZkmEQ9",
  "content_index": 0,
  "language": "en",
  "emotion": "neutral",
  "transcript": "What's the weather like today?"
}
conversation.item.input_audio_transcription.failed
Sent when input audio recognition fails. This event is separate from other error events so that the client can identify the specific item that failed.
| Parameter | Type | Description |
| --- | --- | --- |
| type | string | The event type. Fixed to conversation.item.input_audio_transcription.failed. |
| item_id | string | The ID of the associated conversation item. |
| content_index | integer | The index of the content part that contains the audio. |
| error.code | string | The error code. |
| error.message | string | The error message. |
| error.param | string | The parameter related to the error. |

{
  "type": "conversation.item.input_audio_transcription.failed",
  "item_id": "<item_id>",
  "content_index": 0,
  "error": {
    "code": "<code>",
    "message": "<message>",
    "param": "<param>"
  }
}
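Because both the completed and failed events carry an `item_id`, a client can record one outcome per conversation item. The sketch below shows one possible bookkeeping scheme. The function `record_outcome` and the `("ok", ...)` / `("failed", ...)` tuples are illustrative choices, not part of the API.

```python
PREFIX = "conversation.item.input_audio_transcription."


def record_outcome(results: dict, event: dict) -> None:
    """Store the final outcome of an item, keyed by item_id."""
    kind = event["type"]
    if kind == PREFIX + "completed":
        results[event["item_id"]] = ("ok", event["transcript"])
    elif kind == PREFIX + "failed":
        err = event.get("error") or {}
        detail = f"{err.get('code')}: {err.get('message')}"
        results[event["item_id"]] = ("failed", detail)
    # Any other event type is ignored by this helper.


results = {}
record_outcome(results, {
    "type": PREFIX + "completed",
    "item_id": "item_B3GGC53jGOuIFcjZkmEQ9",
    "transcript": "What's the weather like today?",
})
record_outcome(results, {
    "type": PREFIX + "failed",
    "item_id": "item_example",  # placeholder ID for illustration
    "content_index": 0,
    "error": {"code": "<code>", "message": "<message>", "param": "<param>"},
})
print(results)
```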
session.finished
Indicates that the session has finished and all audio recognition is complete.
Sent after the client sends the session.finish event. The client can disconnect after it receives this event.
| Parameter | Type | Description |
| --- | --- | --- |
| type | string | The event type. Fixed to session.finished. |
| event_id | string | The event ID. |

{
  "event_id": "event_2239",
  "type": "session.finished"
}
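Putting the events together, a client typically routes each incoming message by its `type` field and stops reading once session.finished arrives. The sketch below works on any iterable of JSON strings, so it does not depend on a particular WebSocket library. The function `run_until_finished` and the handler table are hypothetical names.

```python
import json
from typing import Callable, Dict, Iterable


def run_until_finished(
    messages: Iterable[str],
    handlers: Dict[str, Callable[[dict], None]],
) -> bool:
    """Dispatch server events by type; return True once session.finished is seen."""
    for raw in messages:
        event = json.loads(raw)
        handler = handlers.get(event["type"])
        if handler is not None:
            handler(event)
        if event["type"] == "session.finished":
            return True  # safe to close the connection now
    return False  # the stream ended without session.finished


seen = []
handlers = {
    "session.created": lambda e: seen.append(("created", e["session"]["id"])),
    "error": lambda e: seen.append(("error", e["error"]["code"])),
}
incoming = [
    json.dumps({"event_id": "event_1234", "type": "session.created",
                "session": {"id": "sess_001"}}),
    json.dumps({"event_id": "event_2239", "type": "session.finished"}),
]
finished = run_until_finished(incoming, handlers)
print(finished, seen)
```

Returning a boolean lets the caller distinguish a clean finish from a connection that dropped before the server confirmed completion.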