
Alibaba Cloud Model Studio: Server-side events for real-time speech recognition (Qwen-ASR-Realtime)

Last Updated: Oct 30, 2025

This document describes the events that the server sends to the client during a WebSocket session with the Qwen real-time speech recognition API.

User guide: For an introduction to the model, its features, and sample code, see Real-time speech recognition - Qwen.

error

The server sends this event to the client when it detects an error, which may originate on either the client side or the server side.

| Parameter | Type | Description |
| --- | --- | --- |
| type | string | The event type. Fixed to error. |
| event_id | string | The event ID. |
| error.type | string | The error type. |
| error.code | string | The error code. |
| error.message | string | The specific error message. For solutions, see Error messages. |
| error.param | string | The parameter related to the error. |
| error.event_id | string | The event ID related to the error. |

```json
{
  "event_id": "event_B2uoU7VOt1AAITsPRPH9n",
  "type": "error",
  "error": {
    "type": "invalid_request_error",
    "code": "invalid_value",
    "message": "Invalid value: 'whisper-1xx'. Supported values are: 'whisper-1'.",
    "param": "session.input_audio_transcription.model",
    "event_id": "event_123"
  }
}
```
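A client typically inspects the error payload to decide whether to retry or surface the problem. The sketch below is an illustrative helper (`handle_error` is not part of the API) that formats the relevant fields of a decoded error event; note that error.param may be null when the error is not tied to a request parameter.

```python
import json

def handle_error(event: dict) -> str:
    # Expects a decoded "error" event as documented above.
    err = event["error"]
    # error.param may be null when the error is not tied to a request parameter.
    param = err.get("param") or "<none>"
    return f"[{err['type']}/{err['code']}] {err['message']} (param: {param})"

# Decoding a raw WebSocket text frame and formatting it:
frame = ('{"event_id": "e1", "type": "error", "error": {"type": "invalid_request_error", '
         '"code": "invalid_value", "message": "Invalid value.", "param": null, "event_id": null}}')
print(handle_error(json.loads(frame)))
# prints: [invalid_request_error/invalid_value] Invalid value. (param: <none>)
```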

session.created

This is the first event that the server sends after a client successfully connects. It contains the default configurations that the server sets for the session.

| Parameter | Type | Description |
| --- | --- | --- |
| type | string | The event type. Fixed to session.created. |
| event_id | string | The event ID. |
| session.id | string | The ID of the current WebSocket session. |
| session.object | string | Fixed to realtime.session. |
| session.model | string | The model name. |
| session.modalities | array[string] | The output modality of the model. Fixed to ["text"]. |
| session.input_audio_format | string | The input audio format. |
| session.input_audio_transcription | object | Configuration parameters for speech recognition. For more information, see the input_audio_transcription parameter of the client's session.update event. |
| session.turn_detection | object | The Voice Activity Detection (VAD) configuration. |
| session.turn_detection.type | string | Fixed to server_vad. |
| session.turn_detection.threshold | float | The VAD detection threshold. |
| session.turn_detection.silence_duration_ms | integer | The VAD sentence-break detection threshold in milliseconds (ms). |

```json
{
    "event_id": "event_1234",
    "type": "session.created",
    "session": {
        "id": "sess_001",
        "object": "realtime.session",
        "model": "qwen3-asr-flash-realtime",
        "modalities": ["text"],
        "input_audio_format": "pcm16",
        "input_audio_transcription": null,
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.5,
            "silence_duration_ms": 200
        }
    }
}
```
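Because session.created carries the server-side defaults, a client usually records them before deciding whether to send a session.update. A minimal sketch (the `on_session_created` helper and the returned dictionary keys are illustrative, not part of the API):

```python
def on_session_created(event: dict) -> dict:
    """Snapshot the effective session configuration (illustrative helper)."""
    s = event["session"]
    # turn_detection may be absent or null if VAD is not in use.
    vad = s.get("turn_detection") or {}
    return {
        "session_id": s["id"],
        "model": s["model"],
        "input_audio_format": s["input_audio_format"],
        "vad_threshold": vad.get("threshold"),
        "vad_silence_ms": vad.get("silence_duration_ms"),
    }
```

Feeding it the sample event above yields the defaults the server chose (for example, a 200 ms silence threshold), which the client can compare against its desired settings.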

session.updated

The server sends this event after it successfully processes a session.update event from the client. If an error occurs during processing, the server sends an error event instead.

| Parameter | Type | Description |
| --- | --- | --- |
| type | string | The event type. The value is session.updated. |

For descriptions of the other parameters, see session.created.

```json
{
    "event_id": "event_1234",
    "type": "session.updated",
    "session": {
        "id": "sess_001",
        "object": "realtime.session",
        "model": "qwen3-asr-flash-realtime",
        "modalities": ["text"],
        "input_audio_format": "pcm16",
        "input_audio_transcription": null,
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.5,
            "silence_duration_ms": 200
        }
    }
}
```

input_audio_buffer.speech_started

This event is sent only in VAD mode. The server sends it when it detects the start of speech in the audio buffer.

This event can occur each time audio is added to the buffer, unless the start of speech has already been detected.

| Parameter | Type | Description |
| --- | --- | --- |
| type | string | The event type. Fixed to input_audio_buffer.speech_started. |
| event_id | string | The event ID. |
| audio_start_ms | integer | The elapsed time in milliseconds, measured from when audio first began being written to the buffer in this session, at which speech was detected. |
| item_id | string | The ID of the user message item that will be created. |

```json
{
  "event_id": "event_B1lV7FPbgTv9qGxPI1tH4",
  "type": "input_audio_buffer.speech_started",
  "audio_start_ms": 64,
  "item_id": "item_B1lV7jWLscp4mMV8hSs8c"
}
```

input_audio_buffer.speech_stopped

This event is sent only in VAD mode. The server sends it when it detects the end of speech in the audio buffer.

After this event is triggered, the server immediately sends a conversation.item.created event, which contains the user message item created from the audio buffer.

| Parameter | Type | Description |
| --- | --- | --- |
| type | string | The event type. Fixed to input_audio_buffer.speech_stopped. |
| event_id | string | The event ID. |
| audio_end_ms | integer | The elapsed time in milliseconds from the start of the session to when speech stopped. |
| item_id | string | The ID of the user message item that is created when speech stops. |

```json
{
  "event_id": "event_B3GGEYh2orwNIdhUagZPz",
  "type": "input_audio_buffer.speech_stopped",
  "audio_end_ms": 28128,
  "item_id": "item_B3GGE8ry4yqbqJGzrVhEM"
}
```
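Because speech_started and speech_stopped for the same item carry audio_start_ms and audio_end_ms on the same session clock, a client can pair them to measure each detected speech segment. A sketch of that bookkeeping (the `SpeechTracker` class is an illustrative helper, not part of the API):

```python
class SpeechTracker:
    """Pairs VAD start/stop events by item_id to measure speech segments
    (illustrative helper, not part of the API)."""

    def __init__(self):
        self.starts = {}    # item_id -> audio_start_ms
        self.segments = {}  # item_id -> (start_ms, end_ms)

    def on_event(self, event: dict):
        if event["type"] == "input_audio_buffer.speech_started":
            self.starts[event["item_id"]] = event["audio_start_ms"]
        elif event["type"] == "input_audio_buffer.speech_stopped":
            start = self.starts.pop(event["item_id"], 0)
            self.segments[event["item_id"]] = (start, event["audio_end_ms"])

    def duration_ms(self, item_id: str) -> int:
        start, end = self.segments[item_id]
        return end - start
```

For example, a segment that started at 64 ms and stopped at 28128 ms lasted 28064 ms of session time.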

input_audio_buffer.committed

The server sends this event when the audio in the buffer is committed, after which a user conversation item is created from the committed audio.

| Parameter | Type | Description |
| --- | --- | --- |
| type | string | The event type. Fixed to input_audio_buffer.committed. |
| event_id | string | The event ID. |
| previous_item_id | string | The ID of the previous conversation item. |
| item_id | string | The ID of the user conversation item to be created. |

```json
{
    "event_id": "event_1121",
    "type": "input_audio_buffer.committed",
    "previous_item_id": "msg_001",
    "item_id": "msg_002"
}
```

conversation.item.created

The server sends this event when a new conversation item is created.

| Parameter | Type | Description |
| --- | --- | --- |
| type | string | The event type. Fixed to conversation.item.created. |
| event_id | string | The event ID. |
| previous_item_id | string | The ID of the previous conversation item. |
| item | object | The item to add to the conversation. |
| item.id | string | The unique ID of the conversation item. |
| item.object | string | Fixed to realtime.item. |
| item.type | string | Fixed to message. |
| item.status | string | The status of the conversation item. |
| item.role | string | The role of the message sender. |
| item.content | array[object] | The content of the message. |
| item.content.type | string | Fixed to input_audio. |
| item.content.transcript | string | Fixed to null. The complete recognition result is provided in the conversation.item.input_audio_transcription.completed event. |

```json
{
  "type": "conversation.item.created",
  "event_id": "event_B3GGKbCfBZTpqFHZ0P8vg",
  "previous_item_id": "item_B3GGE8ry4yqbqJGzrVhEM",
  "item": {
    "id": "item_B3GGEPlolCqdMiVbYIf5L",
    "object": "realtime.item",
    "type": "message",
    "status": "completed",
    "role": "user",
    "content": [
      {
        "type": "input_audio",
        "transcript": null
      }
    ]
  }
}
```

conversation.item.input_audio_transcription.text

This event is sent frequently to provide real-time recognition results.

| Parameter | Type | Description |
| --- | --- | --- |
| type | string | The event type. Fixed to conversation.item.input_audio_transcription.text. |
| event_id | string | The event ID. |
| item_id | string | The ID of the associated conversation item. |
| content_index | integer | The index of the content part that contains the audio. |
| text | string | The final, confirmed portion of the recognition result. This value will not change. |
| stash | string | The provisional portion of the recognition result. This intermediate value may be corrected in subsequent events. |

```json
{
  "event_id": "event_R7Pfu8QVBfP5HmpcbEFSd",
  "type": "conversation.item.input_audio_transcription.text",
  "item_id": "item_MpJQPNQzqVRc9aC9zMwSj",
  "content_index": 0,
  "text": "",
  "stash": "Beijing's"
}
```
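Because text is confirmed and stash is provisional, the string to display at any moment is the concatenation of the two, and the stash portion must be overwritten (never appended) when the next event arrives. A sketch of that update rule (the `LiveTranscript` class is an illustrative helper, not part of the API):

```python
class LiveTranscript:
    """Keeps the latest displayable text per conversation item
    (illustrative helper, not part of the API)."""

    def __init__(self):
        self.display = {}  # item_id -> current displayable string

    def on_text(self, event: dict):
        if event["type"] != "conversation.item.input_audio_transcription.text":
            return
        # Confirmed text plus the provisional stash; later events may rewrite
        # the stash portion, so overwrite the whole entry each time.
        self.display[event["item_id"]] = event["text"] + event["stash"]
```

With the sample event above, the item first displays "Beijing's" even though text is still empty; a later event with a longer confirmed text simply replaces the entry.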

conversation.item.input_audio_transcription.completed

This event sends the final recognition result to the client. It marks the end of a conversation item.

| Parameter | Type | Description |
| --- | --- | --- |
| type | string | The event type. Fixed to conversation.item.input_audio_transcription.completed. |
| event_id | string | The event ID. |
| item_id | string | The ID of the associated conversation item. |
| content_index | integer | The index of the content part that contains the audio. |
| transcript | string | The transcription result. |

```json
{
  "event_id": "event_B3GGEjPT2sLzjBM74W6kB",
  "type": "conversation.item.input_audio_transcription.completed",
  "item_id": "item_B3GGC53jGOuIFcjZkmEQ9",
  "content_index": 0,
  "transcript": "What's the weather like today?"
}
```
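Since the completed event carries the authoritative transcript for an item, a client can discard any intermediate text for that item_id and keep only the final result. A minimal sketch (the `collect_final` helper is illustrative, not part of the API):

```python
def collect_final(events) -> dict:
    """Return {item_id: transcript} from completed events
    (illustrative helper, not part of the API)."""
    finals = {}
    for event in events:
        if event.get("type") == "conversation.item.input_audio_transcription.completed":
            # The final transcript replaces any earlier provisional text for the item.
            finals[event["item_id"]] = event["transcript"]
    return finals
```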

conversation.item.input_audio_transcription.failed

The server sends this event if recognition fails for the input audio. This event is handled separately from other error events to help the client identify the specific item that failed.

| Parameter | Type | Description |
| --- | --- | --- |
| type | string | The event type. Fixed to conversation.item.input_audio_transcription.failed. |
| item_id | string | The ID of the associated conversation item. |
| content_index | integer | The index of the content part that contains the audio. |
| error.code | string | The error code. |
| error.message | string | The error message. |
| error.param | string | The parameter related to the error. |

```json
{
  "type": "conversation.item.input_audio_transcription.failed",
  "item_id": "<item_id>",
  "content_index": 0,
  "error": {
    "code": "<code>",
    "message": "<message>",
    "param": "<param>"
  }
}
```
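All of the events above arrive as JSON text frames with a "type" field, so a client typically routes them through a dispatch table rather than a chain of string comparisons. A minimal sketch (the `dispatch` function and the handler registry are illustrative; how handlers are registered is entirely up to the client):

```python
import json

def dispatch(raw_frame: str, handlers: dict) -> bool:
    """Decode one server frame and route it by its "type" field.
    Returns True if a handler ran (illustrative helper, not part of the API)."""
    event = json.loads(raw_frame)
    handler = handlers.get(event.get("type"))
    if handler is None:
        return False  # silently ignore event types the client does not handle
    handler(event)
    return True
```

A client would populate the table with one entry per event type it cares about, for example mapping "error" to a logger and "conversation.item.input_audio_transcription.completed" to a transcript collector.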