Alibaba Cloud Model Studio: Server events for Qwen-ASR-Realtime

Last Updated: Mar 15, 2026

This topic describes the events that the server sends to the client during a WebSocket session with the Qwen-ASR-Realtime API.

User guide: For model overview, features, and complete sample code, see Real-time speech recognition - Qwen.

error

Sent when the server detects a client or server error.

| Parameter | Type | Description |
|---|---|---|
| type | string | The event type. Fixed to error. |
| event_id | string | The event ID. |
| error.type | string | The error type. |
| error.code | string | The error code. |
| error.message | string | The specific error message. For solutions, see Error messages. |
| error.param | string | The parameter related to the error. |
| error.event_id | string | The event ID related to the error. |

{
  "event_id": "event_B2uoU7VOt1AAITsPRPH9n",
  "type": "error",
  "error": {
    "type": "invalid_request_error",
    "code": "invalid_value",
    "message": "Invalid value: 'whisper-1xx'. Supported values are: 'whisper-1'.",
    "param": "session.input_audio_transcription.model",
    "event_id": "event_123"
  }
}
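As a minimal sketch of how a client might surface this event, the following Python function turns an error event into a one-line log message. The function name describe_error is hypothetical, and the code assumes that error.param and error.event_id can be missing or null, which this page does not state.

```python
import json


def describe_error(raw_message: str) -> str:
    """Return a one-line summary of an `error` server event (illustrative helper)."""
    event = json.loads(raw_message)
    if event.get("type") != "error":
        raise ValueError(f"not an error event: {event.get('type')!r}")
    err = event.get("error") or {}
    summary = f"[{err.get('type')}/{err.get('code')}] {err.get('message')}"
    # Assumption: `param` and `event_id` may be absent or null, so they are
    # appended only when they carry a value.
    if err.get("param"):
        summary += f" (param: {err['param']})"
    if err.get("event_id"):
        summary += f" (related event: {err['event_id']})"
    return summary
```

Logging error.event_id alongside the message makes it easier to match the failure to the client event that triggered it.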

session.created

The first event sent after the connection is established. It contains the default session configuration.

| Parameter | Type | Description |
|---|---|---|
| type | string | The event type. Fixed to session.created. |
| event_id | string | The event ID. |
| session.id | string | The ID of the current WebSocket session. |
| session.object | string | Fixed to realtime.session. |
| session.model | string | The model name. |
| session.modalities | array[string] | The output modality of the model. Fixed to ["text"]. |
| session.input_audio_format | string | The input audio format. |
| session.input_audio_transcription | object | Configuration for speech recognition. See input_audio_transcription in the client's session.update event for details. |
| session.turn_detection | object | The Voice Activity Detection (VAD) configuration. |
| session.turn_detection.type | string | Fixed to server_vad. |
| session.turn_detection.threshold | float | The VAD detection threshold. |
| session.turn_detection.silence_duration_ms | integer | The VAD sentence-break detection threshold in milliseconds (ms). |

{
    "event_id": "event_1234",
    "type": "session.created",
    "session": {
        "id": "sess_001",
        "object": "realtime.session",
        "model": "qwen3-asr-flash-realtime",
        "modalities": ["text"],
        "input_audio_format": "pcm16",
        "input_audio_transcription": null,
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.5,
            "silence_duration_ms": 200
        }
    }
}
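A client may want to read the default VAD settings from this event before deciding whether to send session.update. The sketch below does this in Python. The names VadConfig and vad_from_session_event are hypothetical, and the handling of a null turn_detection is an assumption that this page does not confirm.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VadConfig:
    """Illustrative container for the session.turn_detection fields."""
    type: str
    threshold: float
    silence_duration_ms: int


def vad_from_session_event(event: dict) -> Optional[VadConfig]:
    """Extract VAD settings from a session.created or session.updated event."""
    if event.get("type") not in ("session.created", "session.updated"):
        raise ValueError(f"unexpected event type: {event.get('type')!r}")
    turn_detection = event["session"].get("turn_detection")
    if turn_detection is None:
        # Assumption: turn_detection may be null; this page does not say so.
        return None
    return VadConfig(
        type=turn_detection["type"],
        threshold=float(turn_detection["threshold"]),
        silence_duration_ms=int(turn_detection["silence_duration_ms"]),
    )
```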

session.updated

Sent after successfully processing the client's session.update event. On error, an error event is sent instead.

| Parameter | Type | Description |
|---|---|---|
| type | string | The event type. The value is session.updated. |

For descriptions of the other parameters, see session.created.

{
    "event_id": "event_1234",
    "type": "session.updated",
    "session": {
        "id": "sess_001",
        "object": "realtime.session",
        "model": "qwen3-asr-flash-realtime",
        "modalities": ["text"],
        "input_audio_format": "pcm16",
        "input_audio_transcription": null,
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.5,
            "silence_duration_ms": 200
        }
    }
}
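Because session.updated carries the same session object as session.created, a client can compare the two to confirm which settings the server applied. The following Python sketch shows one way to do that. The function name changed_session_fields is hypothetical, and the comparison covers only top-level session fields.

```python
def changed_session_fields(created: dict, updated: dict) -> dict:
    """Map each changed top-level session field to a (before, after) tuple."""
    before = created["session"]
    after = updated["session"]
    return {
        key: (before.get(key), after.get(key))
        for key in sorted(set(before) | set(after))
        if before.get(key) != after.get(key)
    }
```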

input_audio_buffer.speech_started

Sent in VAD mode when the server detects the start of speech in the audio buffer.

This event can be sent each time audio is added to the buffer, unless the start of speech has already been detected.

| Parameter | Type | Description |
|---|---|---|
| type | string | The event type. Fixed to input_audio_buffer.speech_started. |
| event_id | string | The event ID. |
| audio_start_ms | integer | Milliseconds from when audio started writing to the buffer until speech was first detected in the session. |
| item_id | string | The ID of the user message item that will be created. |

{
  "event_id": "event_B1lV7FPbgTv9qGxPI1tH4",
  "type": "input_audio_buffer.speech_started",
  "audio_start_ms": 64,
  "item_id": "item_B1lV7jWLscp4mMV8hSs8c"
}

input_audio_buffer.speech_stopped

Sent in VAD mode when speech ends in the audio buffer.

After this event, the server immediately sends a conversation.item.created event containing the user message item created from the audio buffer.

| Parameter | Type | Description |
|---|---|---|
| type | string | The event type. Fixed to input_audio_buffer.speech_stopped. |
| event_id | string | The event ID. |
| audio_end_ms | integer | Milliseconds elapsed from session start to when speech stopped. |
| item_id | string | ID of the user message item created when speech stops. |

{
  "event_id": "event_B3GGEYh2orwNIdhUagZPz",
  "type": "input_audio_buffer.speech_stopped",
  "audio_end_ms": 28128,
  "item_id": "item_B3GGE8ry4yqbqJGzrVhEM"
}
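The two VAD events can be paired to estimate how long each utterance lasted. The Python sketch below rests on two assumptions that this page does not confirm: that speech_started and speech_stopped for the same utterance carry the same item_id, and that audio_start_ms and audio_end_ms are measured on the same timeline. The function name speech_segments is hypothetical.

```python
def speech_segments(events):
    """Pair VAD events by item_id and return (item_id, start_ms, end_ms, duration_ms) tuples."""
    starts = {}
    segments = []
    for event in events:
        event_type = event.get("type")
        if event_type == "input_audio_buffer.speech_started":
            starts[event["item_id"]] = event["audio_start_ms"]
        elif event_type == "input_audio_buffer.speech_stopped":
            # Assumption: the stop event reuses the item_id of its start event.
            start_ms = starts.pop(event["item_id"], None)
            if start_ms is not None:
                end_ms = event["audio_end_ms"]
                segments.append((event["item_id"], start_ms, end_ms, end_ms - start_ms))
    return segments
```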

input_audio_buffer.committed

| Parameter | Type | Description |
|---|---|---|
| type | string | The event type. Fixed to input_audio_buffer.committed. |
| event_id | string | The event ID. |
| previous_item_id | string | The ID of the previous conversation item. |
| item_id | string | The ID of the user conversation item to be created. |

{
    "event_id": "event_1121",
    "type": "input_audio_buffer.committed",
    "previous_item_id": "msg_001",
    "item_id": "msg_002"
}

conversation.item.created

Sent when a new conversation item is created.

| Parameter | Type | Description |
|---|---|---|
| type | string | The type of the event. Fixed to conversation.item.created. |
| event_id | string | The event ID. |
| previous_item_id | string | The ID of the previous conversation item. |
| item | object | The item to add to the conversation. |
| item.id | string | The unique ID of the conversation item. |
| item.object | string | Fixed to realtime.item. |
| item.type | string | Fixed to message. |
| item.status | string | The status of the conversation item. |
| item.role | string | The role of the message sender. |
| item.content | array[object] | The content of the message. |
| item.content.type | string | Fixed to input_audio. |
| item.content.transcript | string | Fixed to null. The complete recognition result is provided in the conversation.item.input_audio_transcription.completed event. |

{
  "type": "conversation.item.created",
  "event_id": "event_B3GGKbCfBZTpqFHZ0P8vg",
  "previous_item_id": "item_B3GGE8ry4yqbqJGzrVhEM",
  "item": {
    "id": "item_B3GGEPlolCqdMiVbYIf5L",
    "object": "realtime.item",
    "type": "message",
    "status": "completed",
    "role": "user",
    "content": [
      {
        "type": "input_audio",
        "transcript": null
      }
    ]
  }
}
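The previous_item_id field lets a client keep conversation items in order. A minimal Python sketch follows. The function name insert_item is hypothetical, and appending an item whose predecessor is unknown or null is an illustrative choice, not documented behavior.

```python
def insert_item(order: list, event: dict) -> None:
    """Insert the new item's ID into `order` right after its predecessor."""
    item_id = event["item"]["id"]
    previous_id = event.get("previous_item_id")
    if previous_id in order:
        order.insert(order.index(previous_id) + 1, item_id)
    else:
        # Unknown or null predecessor: fall back to appending at the end.
        order.append(item_id)
```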

conversation.item.input_audio_transcription.text

Sent repeatedly during recognition to deliver real-time, intermediate results.

| Parameter | Type | Description |
|---|---|---|
| type | string | The type of the event. Fixed to conversation.item.input_audio_transcription.text. |
| event_id | string | The ID of the event. |
| item_id | string | The ID of the associated conversation item. |
| content_index | integer | The index of the content part that contains the audio. |
| language | string | Language of the recognized audio. Matches the language request parameter if specified. Possible values: zh (Chinese: Mandarin, Sichuanese, Minnan, and Wu), yue (Cantonese), en (English), ja (Japanese), de (German), ko (Korean), ru (Russian), fr (French), pt (Portuguese), ar (Arabic), it (Italian), es (Spanish), hi (Hindi), id (Indonesian), th (Thai), tr (Turkish), uk (Ukrainian), vi (Vietnamese). |
| emotion | string | The emotion detected in the audio. Supported values: surprised, neutral, happy, sad, disgusted, angry, fearful. |
| text | string | Confirmed text prefix: the part of the current sentence that the model has verified and won't change. |
| stash | string | Pre-recognized text suffix that follows the confirmed part. This is a temporary draft that the model is still processing and may correct. |

{
  "event_id": "event_R7Pfu8QVBfP5HmpcbEFSd",
  "type": "conversation.item.input_audio_transcription.text",
  "item_id": "item_MpJQPNQzqVRc9aC9zMwSj",
  "content_index": 0,
  "language": "en",
  "emotion": "neutral",
  "text": "",
  "stash": "Beijing's"
}

To get the complete sentence preview, concatenate: text + stash.

For example, assume a user says, "The weather is nice today, sunny and bright."

The following table shows the event stream you might receive and explains how to interpret it:

| Timestamp | User speech progress | API response (text and stash) | UI display (text + stash) |
|---|---|---|---|
| T1 | "The..." | text: ""; stash: "The" | The |
| T2 | "...weather is..." | text: ""; stash: "The weather is" | The weather is |
| T3 | "...nice today" | text: "The"; stash: "weather is nice today" | The weather is nice today (Note: "The" is confirmed and moved to the text field.) |
| T4 | (Short pause) | text: "The weather is nice today,"; stash: "" | The weather is nice today, (The first clause is fully confirmed.) |
| T5 | "...sunny and..." | text: "The weather is nice today,"; stash: "sunny and" | The weather is nice today, sunny and |
| T6 | "...bright." | text: "The weather is nice today,"; stash: "sunny and bright." | The weather is nice today, sunny and bright. |
| T7 | (User stops speaking) | - | Use the content of the transcript from the conversation.item.input_audio_transcription.completed event as the final result. |
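The text + stash rule can be sketched in Python as follows. The function name render_preview is hypothetical. The code joins the two fields verbatim and inserts no separator, because this page does not specify how whitespace between text and stash is delivered.

```python
def render_preview(event: dict) -> dict:
    """Split a transcription.text event into confirmed and draft parts for display."""
    text = event.get("text") or ""    # confirmed prefix, not expected to change
    stash = event.get("stash") or ""  # draft suffix, may still be corrected
    return {"confirmed": text, "draft": stash, "display": text + stash}
```

A UI could render the confirmed part in a normal style and the draft part in a lighter style, then replace both with the transcript from the completed event.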

conversation.item.input_audio_transcription.completed

Sends the final recognition result, marking the end of a conversation item.

| Parameter | Type | Description |
|---|---|---|
| type | string | The event type. Fixed to conversation.item.input_audio_transcription.completed. |
| event_id | string | The event ID. |
| item_id | string | The ID of the associated conversation item. |
| content_index | integer | The index of the content part that contains the audio. |
| language | string | Language of the recognized audio. Matches the language request parameter if specified. Possible values: zh (Chinese: Mandarin, Sichuanese, Minnan, and Wu), yue (Cantonese), en (English), ja (Japanese), de (German), ko (Korean), ru (Russian), fr (French), pt (Portuguese), ar (Arabic), it (Italian), es (Spanish), hi (Hindi), id (Indonesian), th (Thai), tr (Turkish), uk (Ukrainian), vi (Vietnamese). |
| emotion | string | The emotion detected in the audio. Supported values: surprised, neutral, happy, sad, disgusted, angry, fearful. |
| transcript | string | The transcription result. |

{
  "event_id": "event_B3GGEjPT2sLzjBM74W6kB",
  "type": "conversation.item.input_audio_transcription.completed",
  "item_id": "item_B3GGC53jGOuIFcjZkmEQ9",
  "content_index": 0,
  "language": "en",
  "emotion": "neutral",
  "transcript": "What's the weather like today?"
}
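One way to combine the intermediate and final events is to keep a live preview per item and replace it once the completed event arrives. The Python sketch below illustrates this. The class name TranscriptStore and its attributes are hypothetical.

```python
TEXT_EVENT = "conversation.item.input_audio_transcription.text"
COMPLETED_EVENT = "conversation.item.input_audio_transcription.completed"


class TranscriptStore:
    """Track live previews and final transcripts, keyed by item_id."""

    def __init__(self):
        self.previews = {}  # item_id -> latest text + stash
        self.finals = {}    # item_id -> final transcript

    def handle(self, event: dict) -> None:
        event_type = event.get("type")
        if event_type == TEXT_EVENT:
            preview = (event.get("text") or "") + (event.get("stash") or "")
            self.previews[event["item_id"]] = preview
        elif event_type == COMPLETED_EVENT:
            # The final transcript supersedes any preview for the same item.
            self.previews.pop(event["item_id"], None)
            self.finals[event["item_id"]] = event["transcript"]
```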

conversation.item.input_audio_transcription.failed

Sent when recognition of the input audio fails. This event is separate from the generic error event so that the client can identify the specific item that failed.

| Parameter | Type | Description |
|---|---|---|
| type | string | The event type. Fixed to conversation.item.input_audio_transcription.failed. |
| item_id | string | The ID of the associated conversation item. |
| content_index | integer | The index of the content part that contains the audio. |
| error.code | string | The error code. |
| error.message | string | The error message. |
| error.param | string | The parameter related to the error. |

{
  "type": "conversation.item.input_audio_transcription.failed",
  "item_id": "<item_id>",
  "content_index": 0,
  "error": {
    "code": "<code>",
    "message": "<message>",
    "param": "<param>"
  }
}
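Because this event names the failed item, a client can record failures per item instead of treating them as session-level errors. The following Python sketch is illustrative only. The function name failed_item is hypothetical, and the code assumes that the error object or its fields might be missing.

```python
FAILED_EVENT = "conversation.item.input_audio_transcription.failed"


def failed_item(event: dict) -> tuple:
    """Return (item_id, error_code, error_message) for a failed transcription."""
    if event.get("type") != FAILED_EVENT:
        raise ValueError(f"unexpected event type: {event.get('type')!r}")
    # Assumption: `error` or its fields may be absent, so read them defensively.
    error = event.get("error") or {}
    return event["item_id"], error.get("code"), error.get("message")
```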

session.finished

Indicates that the session has finished and all audio has been recognized.

Sent after the client sends a session.finish event. The client can disconnect after receiving this event.

| Parameter | Type | Description |
|---|---|---|
| type | string | The event type. The value is session.finished. |
| event_id | string | The event ID. |

{
  "event_id": "event_2239",
  "type": "session.finished"
}
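Taken together, the events on this page suggest a simple receive loop: dispatch each message by its type and stop once session.finished arrives. The Python sketch below works on any iterable of JSON strings, so it does not depend on a particular WebSocket library. The function name run_until_finished and the handlers mapping are hypothetical.

```python
import json
from typing import Callable, Dict, Iterable


def run_until_finished(
    messages: Iterable[str],
    handlers: Dict[str, Callable[[dict], None]],
) -> bool:
    """Dispatch server events by type; return True once session.finished is seen."""
    for raw in messages:
        event = json.loads(raw)
        event_type = event.get("type")
        if event_type == "session.finished":
            # All audio has been recognized; the caller can now disconnect.
            return True
        handler = handlers.get(event_type)
        if handler is not None:
            handler(event)
    # The stream ended without a session.finished event.
    return False
```

Returning False when the stream ends early lets the caller distinguish a clean finish from a dropped connection.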