Server events for Qwen-ASR-Realtime - Alibaba Cloud Model Studio

This topic describes the events that the server sends to the client during a WebSocket session with the Qwen-ASR-Realtime API.

User guide: For an overview of the model, its features, and complete sample code, see Real-time speech recognition - Qwen.

error

This event is sent to the client when the server detects an error. The error can be a client or server error.

Parameter	Type	Description
type	string	The event type. Fixed to `error`.
event_id	string	The event ID.
error.type	string	The error type.
error.code	string	The error code.
error.message	string	The specific error message. For solutions, see Error messages.
error.param	string	The parameter related to the error.
error.event_id	string	The event ID related to the error.

{
  "event_id": "event_B2uoU7VOt1AAITsPRPH9n",
  "type": "error",
  "error": {
    "type": "invalid_request_error",
    "code": "invalid_value",
    "message": "Invalid value: 'whisper-1xx'. Supported values are: 'whisper-1'.",
    "param": "session.input_audio_transcription.model",
    "event_id": "event_123"
  }
}
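
A client usually wants to turn this event into a single log line. The following is a minimal sketch that uses only the fields documented above. The helper name `describe_error` is hypothetical, and the sketch assumes that any of the `error.*` fields may be missing.

```python
import json


def describe_error(raw: str) -> str:
    """Format a server `error` event as one log line.

    `describe_error` is an illustrative helper, not part of any SDK.
    """
    event = json.loads(raw)
    if event.get("type") != "error":
        raise ValueError("not an error event")
    err = event.get("error") or {}
    parts = [
        f"{err.get('type', 'unknown')}/{err.get('code', 'unknown')}",
        err.get("message", ""),
    ]
    if err.get("param"):
        parts.append(f"param={err['param']}")
    if err.get("event_id"):
        # error.event_id is the ID of the event related to the error.
        parts.append(f"related_event={err['event_id']}")
    return " | ".join(parts)
```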

session.created

This is the first event that the server sends after a client successfully connects. It contains the default configurations that the server sets for the session.

Parameter	Type	Description
type	string	The event type. Fixed to `session.created`.
event_id	string	The event ID.
session.id	string	The ID of the current WebSocket session.
session.object	string	Fixed to `realtime.session`.
session.model	string	The model name.
session.modalities	array[string]	The output modality of the model. Fixed to `["text"]`.
session.input_audio_format	string	The input audio format.
session.input_audio_transcription	object	Configuration parameters for speech recognition. For more information, see the `input_audio_transcription` parameter of the client's session.update event.
session.turn_detection	object	The Voice Activity Detection (VAD) configuration.
session.turn_detection.type	string	Fixed to `server_vad`.
session.turn_detection.threshold	float	The VAD detection threshold.
session.turn_detection.silence_duration_ms	integer	The VAD sentence-break detection threshold in milliseconds (ms).

{
    "event_id": "event_1234",
    "type": "session.created",
    "session": {
        "id": "sess_001",
        "object": "realtime.session",
        "model": "qwen3-asr-flash-realtime",
        "modalities": ["text"],
        "input_audio_format": "pcm16",
        "input_audio_transcription": null,
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.5,
            "silence_duration_ms": 200
        }
    }
}

session.updated

The server sends this event after it successfully processes a session.update event from the client. If an error occurs during processing, the server sends an error event instead.

Parameter	Type	Description
type	string	The event type. Fixed to `session.updated`.

For descriptions of the other parameters, see session.created.

{
    "event_id": "event_1234",
    "type": "session.updated",
    "session": {
        "id": "sess_001",
        "object": "realtime.session",
        "model": "qwen3-asr-flash-realtime",
        "modalities": ["text"],
        "input_audio_format": "pcm16",
        "input_audio_transcription": null,
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.5,
            "silence_duration_ms": 200
        }
    }
}
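
Because session.created and session.updated carry the same `session` object, one parser can serve both. The sketch below copies the documented fields into a small record. The names `SessionInfo` and `parse_session` are hypothetical, and the code assumes that `turn_detection` may be null or absent, which this topic does not confirm.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SessionInfo:
    session_id: str
    model: str
    input_audio_format: str
    vad_threshold: Optional[float]
    silence_duration_ms: Optional[int]


def parse_session(event: dict) -> SessionInfo:
    """Extract the documented session fields from a session event."""
    if event.get("type") not in ("session.created", "session.updated"):
        raise ValueError(f"unexpected event type: {event.get('type')!r}")
    session = event["session"]
    # Assumption: turn_detection can be null or missing.
    vad = session.get("turn_detection") or {}
    return SessionInfo(
        session_id=session["id"],
        model=session["model"],
        input_audio_format=session["input_audio_format"],
        vad_threshold=vad.get("threshold"),
        silence_duration_ms=vad.get("silence_duration_ms"),
    )
```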

input_audio_buffer.speech_started

This event is sent only in VAD mode. The server sends it when it detects the start of speech in the audio buffer.

This event can occur each time audio is added to the buffer, unless the start of speech has already been detected.

Parameter	Type	Description
type	string	The event type. Fixed to `input_audio_buffer.speech_started`.
event_id	string	The event ID.
audio_start_ms	integer	The time, in milliseconds, from when audio was first written to the buffer during the session to when speech was first detected.
item_id	string	The ID of the user message item that will be created.

{
  "event_id": "event_B1lV7FPbgTv9qGxPI1tH4",
  "type": "input_audio_buffer.speech_started",
  "audio_start_ms": 64,
  "item_id": "item_B1lV7jWLscp4mMV8hSs8c"
}

input_audio_buffer.speech_stopped

This event is sent only in VAD mode. The server sends it when it detects the end of speech in the audio buffer.

After this event is triggered, the server immediately sends a conversation.item.created event, which contains the user message item created from the audio buffer.

Parameter	Type	Description
type	string	The event type. Fixed to `input_audio_buffer.speech_stopped`.
event_id	string	The event ID.
audio_end_ms	integer	The elapsed time in milliseconds from the start of the session to when speech stopped.
item_id	string	The ID of the user message item that is created when speech stops.

{
  "event_id": "event_B3GGEYh2orwNIdhUagZPz",
  "type": "input_audio_buffer.speech_stopped",
  "audio_end_ms": 28128,
  "item_id": "item_B3GGE8ry4yqbqJGzrVhEM"
}
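
The two VAD events can be paired to record where each utterance starts and ends. The following sketch rests on two assumptions that this topic does not state: that the speech_started and speech_stopped events for one utterance carry the same `item_id`, and that `audio_start_ms` and `audio_end_ms` are measured from the same origin. The class name `SpeechTracker` is hypothetical.

```python
from typing import Dict, List, Tuple


class SpeechTracker:
    """Pairs speech_started and speech_stopped events by item_id."""

    def __init__(self) -> None:
        self._open: Dict[str, int] = {}
        # Each entry is (item_id, audio_start_ms, audio_end_ms).
        self.segments: List[Tuple[str, int, int]] = []

    def handle(self, event: dict) -> None:
        kind = event.get("type")
        if kind == "input_audio_buffer.speech_started":
            self._open[event["item_id"]] = event["audio_start_ms"]
        elif kind == "input_audio_buffer.speech_stopped":
            # Assumption: the stop event reuses the start event's item_id.
            start = self._open.pop(event["item_id"], None)
            if start is not None:
                self.segments.append(
                    (event["item_id"], start, event["audio_end_ms"])
                )
```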

input_audio_buffer.committed

VAD mode: The server sends this event after the client finishes sending audio data using the input_audio_buffer.append event.
Manual mode: The server sends this event after the client finishes sending audio data using the input_audio_buffer.append event and then sends an input_audio_buffer.commit event.

Parameter	Type	Description
type	string	The event type. Fixed to `input_audio_buffer.committed`.
event_id	string	The event ID.
previous_item_id	string	The ID of the previous conversation item.
item_id	string	The ID of the user conversation item to be created.

{
    "event_id": "event_1121",
    "type": "input_audio_buffer.committed",
    "previous_item_id": "msg_001",
    "item_id": "msg_002"
}

conversation.item.created

The server sends this event when a new conversation item is created.

Parameter	Type	Description
type	string	The type of the event. Fixed to `conversation.item.created`.
event_id	string	The event ID.
previous_item_id	string	The ID of the previous conversation item.
item	object	The item to add to the conversation.
item.id	string	The unique ID of the conversation item.
item.object	string	Fixed to `realtime.item`.
item.type	string	Fixed to `message`.
item.status	string	The status of the conversation item.
item.role	string	The role of the message sender.
item.content	array[object]	The content of the message.
item.content.type	string	Fixed to `input_audio`.
item.content.transcript	string	Fixed to `null`. The complete recognition result is provided in the conversation.item.input_audio_transcription.completed event.

{
  "type": "conversation.item.created",
  "event_id": "event_B3GGKbCfBZTpqFHZ0P8vg",
  "previous_item_id": "item_B3GGE8ry4yqbqJGzrVhEM",
  "item": {
    "id": "item_B3GGEPlolCqdMiVbYIf5L",
    "object": "realtime.item",
    "type": "message",
    "status": "completed",
    "role": "user",
    "content": [
      {
        "type": "input_audio",
        "transcript": null
      }
    ]
  }
}
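
The `previous_item_id` field makes it possible to rebuild the order of conversation items even if events are processed out of order. The sketch below assumes that each item has at most one successor and that the first item refers to a null or unknown `previous_item_id`. The function name `order_items` is hypothetical.

```python
from typing import Dict, List, Optional


def order_items(events: List[dict]) -> List[str]:
    """Return item IDs in conversation order using previous_item_id links."""
    next_of: Dict[Optional[str], str] = {}
    ids = set()
    for ev in events:
        if ev.get("type") != "conversation.item.created":
            continue
        item_id = ev["item"]["id"]
        ids.add(item_id)
        next_of[ev.get("previous_item_id")] = item_id
    # A chain starts at a predecessor that is not itself a known item.
    heads = [prev for prev in next_of if prev not in ids]
    ordered: List[str] = []
    for head in heads:
        current = next_of.get(head)
        while current is not None and current not in ordered:
            ordered.append(current)
            current = next_of.get(current)
    return ordered
```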

conversation.item.input_audio_transcription.text

This event is sent frequently to provide real-time recognition results.

Parameter	Type	Description
type	string	The type of the event. Fixed to `conversation.item.input_audio_transcription.text`.
event_id	string	The ID of the event.
item_id	string	The ID of the associated conversation item.
content_index	integer	The index of the content part that contains the audio.
language	string	The language of the recognized audio. If a language is specified in the `language` request parameter, this value matches the specified language. Possible values: `zh` (Chinese: Mandarin, Sichuanese, Minnan, and Wu), `yue` (Cantonese), `en` (English), `ja` (Japanese), `de` (German), `ko` (Korean), `ru` (Russian), `fr` (French), `pt` (Portuguese), `ar` (Arabic), `it` (Italian), `es` (Spanish), `hi` (Hindi), `id` (Indonesian), `th` (Thai), `tr` (Turkish), `uk` (Ukrainian), `vi` (Vietnamese).
emotion	string	The emotion detected in the audio. The following emotions are supported: `surprised`, `neutral`, `happy`, `sad`, `disgusted`, `angry`, `fearful`.
text	string	The confirmed text prefix. This is the part of the current sentence that the model has confirmed and will not change.
stash	string	The pre-recognized text suffix. This is a temporary draft that follows the confirmed part. The model is still processing this draft and may correct it.

{
  "event_id": "event_R7Pfu8QVBfP5HmpcbEFSd",
  "type": "conversation.item.input_audio_transcription.text",
  "item_id": "item_MpJQPNQzqVRc9aC9zMwSj",
  "content_index": 0,
  "language": "en",
  "emotion": "neutral",
  "text": "",
  "stash": "Beijing's"
}

At any time, you can obtain the most complete sentence preview by concatenating these two fields: real-time preview sentence = text + stash.

For example, assume a user says, "The weather is nice today, sunny and bright."

The following table shows the event stream you might receive and explains how to interpret it:

Timestamp	User speech progress	API response (text and stash)	UI display (text + stash)
T1	"The..."	text: "" stash: "The"	The
T2	"...weather is..."	text: "" stash: "The weather is"	The weather is
T3	"...nice today"	text: "The" stash: "weather is nice today"	The weather is nice today (Note: "The" is confirmed and moved to the text field.)
T4	(Short pause)	text: "The weather is nice today," stash: ""	The weather is nice today, (The first clause is fully confirmed.)
T5	"...sunny and..."	text: "The weather is nice today," stash: "sunny and"	The weather is nice today, sunny and
T6	"...bright."	text: "The weather is nice today," stash: "sunny and bright."	The weather is nice today, sunny and bright.
T7	(User stops speaking)	-	Use the content of the transcript from the conversation.item.input_audio_transcription.completed event as the final result.
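
The display rule in the table can be sketched as a small per-item caption store: each text event overwrites the preview with `text + stash`, and the completed event replaces it with the final `transcript`. The class name `Captions` is hypothetical. The concatenation is literal, so any spacing between the two fields is whatever the server sends, and ignoring text events that arrive after completion is a design choice of this sketch, not documented behavior.

```python
from typing import Dict, Set

TEXT_EVENT = "conversation.item.input_audio_transcription.text"
DONE_EVENT = "conversation.item.input_audio_transcription.completed"


class Captions:
    """Keeps the latest caption for each conversation item."""

    def __init__(self) -> None:
        self.lines: Dict[str, str] = {}
        self.final: Set[str] = set()

    def handle(self, event: dict) -> None:
        kind = event.get("type")
        item_id = event.get("item_id")
        if kind == TEXT_EVENT and item_id not in self.final:
            # Real-time preview sentence = text + stash.
            self.lines[item_id] = (event.get("text") or "") + (event.get("stash") or "")
        elif kind == DONE_EVENT:
            # The final transcript supersedes every earlier preview.
            self.lines[item_id] = event.get("transcript") or ""
            self.final.add(item_id)
```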

conversation.item.input_audio_transcription.completed

This event sends the final recognition result to the client. It marks the end of a conversation item.

Parameter	Type	Description
type	string	The event type. Fixed to `conversation.item.input_audio_transcription.completed`.
event_id	string	The event ID.
item_id	string	The ID of the associated conversation item.
content_index	integer	The index of the content part that contains the audio.
language	string	The language of the recognized audio. If a language is specified in the `language` request parameter, this value matches the specified language. Possible values: `zh` (Chinese: Mandarin, Sichuanese, Minnan, and Wu), `yue` (Cantonese), `en` (English), `ja` (Japanese), `de` (German), `ko` (Korean), `ru` (Russian), `fr` (French), `pt` (Portuguese), `ar` (Arabic), `it` (Italian), `es` (Spanish), `hi` (Hindi), `id` (Indonesian), `th` (Thai), `tr` (Turkish), `uk` (Ukrainian), `vi` (Vietnamese).
emotion	string	The emotion detected in the audio. The following emotions are supported: `surprised`, `neutral`, `happy`, `sad`, `disgusted`, `angry`, `fearful`.
transcript	string	The transcription result.

{
  "event_id": "event_B3GGEjPT2sLzjBM74W6kB",
  "type": "conversation.item.input_audio_transcription.completed",
  "item_id": "item_B3GGC53jGOuIFcjZkmEQ9",
  "content_index": 0,
  "language": "en",
  "emotion": "neutral",
  "transcript": "What's the weather like today?"
}

conversation.item.input_audio_transcription.failed

The server sends this event if recognition fails for the input audio. This event is handled separately from other error events to help the client identify the specific item that failed.

Parameter	Type	Description
type	string	The event type. Fixed to `conversation.item.input_audio_transcription.failed`.
item_id	string	The ID of the associated conversation item.
content_index	integer	The index of the content part that contains the audio.
error.code	string	The error code.
error.message	string	The error message.
error.param	string	The parameter related to the error.

{
  "type": "conversation.item.input_audio_transcription.failed",
  "item_id": "<item_id>",
  "content_index": 0,
  "error": {
    "code": "<code>",
    "message": "<message>",
    "param": "<param>"
  }
}
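
Because the completed and failed events both carry an `item_id`, a client can reduce them to one outcome per item. The following is a minimal sketch; the function name `item_outcome` is hypothetical, and treating a failed event as the last event for its item is an assumption.

```python
from typing import Optional, Tuple

COMPLETED = "conversation.item.input_audio_transcription.completed"
FAILED = "conversation.item.input_audio_transcription.failed"


def item_outcome(event: dict) -> Optional[Tuple[str, bool, str]]:
    """Return (item_id, ok, detail) for a completed or failed event.

    Returns None for every other event type.
    """
    kind = event.get("type")
    if kind == COMPLETED:
        return event["item_id"], True, event.get("transcript") or ""
    if kind == FAILED:
        err = event.get("error") or {}
        detail = f"{err.get('code', 'unknown')}: {err.get('message', '')}"
        return event["item_id"], False, detail
    return None
```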

session.finished

The server sends this event when the session is finished and all audio recognition in the current session is complete.

This event is sent only after the client sends the session.finish event. The client can disconnect after receiving this event.

Parameter	Type	Description
type	string	The event type. Fixed to `session.finished`.
event_id	string	The event ID.

{
  "event_id": "event_2239",
  "type": "session.finished"
}
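
Since the client can disconnect after receiving this event, a receive loop can use it as its stop condition. The sketch below is transport-agnostic: it takes any iterable of raw JSON messages, and how those messages are read from the WebSocket is left to the client library in use. The function name `events_until_finished` is hypothetical.

```python
import json
from typing import Iterable, Iterator


def events_until_finished(messages: Iterable[str]) -> Iterator[dict]:
    """Yield parsed server events, stopping after session.finished."""
    for raw in messages:
        event = json.loads(raw)
        yield event
        if event.get("type") == "session.finished":
            # Nothing further is expected; the caller can close the socket.
            return
```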