Real-time speech recognition (Qwen-ASR-Realtime) server events - Alibaba Cloud Model Studio

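This document describes the events that the server sends to the client during a WebSocket session with the Qwen-ASR Realtime API.

User guide: for the model overview, features, and complete sample code, see Real-time speech recognition - Qwen.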

error

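Sent by the server to the client when it detects an error, whether the error is caused by the client or by the server.

Parameter	Type	Description
type	string	The event type. Always `error`.
event_id	string	The event ID.
error.type	string	The error type.
error.code	string	The error code.
error.message	string	The detailed error message. Follow the resolution that the message describes.
error.param	string	The parameter related to the error.
error.event_id	string	The ID of the event related to the error.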

{
  "event_id": "event_B2uoU7VOt1AAITsPRPH9n",
  "type": "error",
  "error": {
    "type": "invalid_request_error",
    "code": "invalid_value",
    "message": "Invalid value: 'whisper-1xx'. Supported values are: 'whisper-1'.",
    "param": "session.input_audio_transcription.model",
    "event_id": "event_123"
  }
}
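
As an illustration of how a client might consume this event, the following minimal Python sketch flattens an `error` event into a small record. The `ServerError` class and the `parse_error_event` function are hypothetical names introduced here, not part of any SDK; the field names come from the table above.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class ServerError:
    """Flattened view of an `error` server event (hypothetical helper type)."""
    event_id: str
    error_type: str
    code: str
    message: str
    param: Optional[str]
    related_event_id: Optional[str]


def parse_error_event(raw: str) -> Optional[ServerError]:
    """Return a ServerError if `raw` is an `error` event, otherwise None."""
    event = json.loads(raw)
    if event.get("type") != "error":
        return None
    err = event.get("error") or {}
    return ServerError(
        event_id=event.get("event_id", ""),
        error_type=err.get("type", ""),
        code=err.get("code", ""),
        message=err.get("message", ""),
        param=err.get("param"),
        related_event_id=err.get("event_id"),
    )


# Sample payload shaped like the JSON example above.
sample = json.dumps({
    "event_id": "event_B2uoU7VOt1AAITsPRPH9n",
    "type": "error",
    "error": {
        "type": "invalid_request_error",
        "code": "invalid_value",
        "message": "Invalid value.",
        "param": "session.input_audio_transcription.model",
        "event_id": "event_123",
    },
})

if __name__ == "__main__":
    print(parse_error_event(sample))
```

Keeping `error.event_id` as a separate field lets the client trace the failure back to the client event that triggered it.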

session.created

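The first event that the server sends after the client connects successfully. It contains the default configuration that the server applied to this connection.

Parameter	Type	Description
type	string	The event type. Always `session.created`.
event_id	string	The event ID.
session.id	string	The ID of this WebSocket session.
session.object	string	Always `realtime.session`.
session.model	string	The name of the model being called.
session.modalities	array[string]	The output modalities of the model. Always `["text"]`.
session.input_audio_format	string	The input audio format.
session.input_audio_transcription	object	The speech recognition configuration. For details, see the `input_audio_transcription` parameter of the client session.update event.
session.turn_detection	object	The VAD (Voice Activity Detection) configuration.
session.turn_detection.type	string	Always `server_vad`.
session.turn_detection.threshold	float	The VAD detection threshold.
session.turn_detection.silence_duration_ms	integer	The VAD threshold for detecting a sentence break, in milliseconds.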

{
    "event_id": "event_1234",
    "type": "session.created",
    "session": {
        "id": "sess_001",
        "object": "realtime.session",
        "model": "qwen3-asr-flash-realtime",
        "modalities": ["text"],
        "input_audio_format": "pcm16",
        "input_audio_transcription": null,
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.5,
            "silence_duration_ms": 200
        }
    }
}
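
The following Python sketch shows one way to read the session configuration out of this event. The function name `describe_session` and the keys of the returned dictionary are illustrative assumptions. Treating a missing or `null` `turn_detection` as "VAD disabled" is also an assumption of this sketch, not something stated above.

```python
import json


def describe_session(raw: str) -> dict:
    """Summarize the session configuration carried by a session event."""
    event = json.loads(raw)
    if event.get("type") not in ("session.created", "session.updated"):
        raise ValueError(f"not a session event: {event.get('type')!r}")
    session = event["session"]
    # Assumption: a missing or null turn_detection means VAD is not active.
    vad = session.get("turn_detection") or {}
    return {
        "session_id": session["id"],
        "model": session["model"],
        "audio_format": session["input_audio_format"],
        "vad_enabled": vad.get("type") == "server_vad",
        "vad_threshold": vad.get("threshold"),
        "silence_duration_ms": vad.get("silence_duration_ms"),
    }


# Sample payload shaped like the JSON example above.
sample = json.dumps({
    "event_id": "event_1234",
    "type": "session.created",
    "session": {
        "id": "sess_001",
        "object": "realtime.session",
        "model": "qwen3-asr-flash-realtime",
        "modalities": ["text"],
        "input_audio_format": "pcm16",
        "input_audio_transcription": None,
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.5,
            "silence_duration_ms": 200,
        },
    },
})

if __name__ == "__main__":
    print(describe_session(sample))
```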

session.updated

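Sent by the server after it successfully processes a session.update event from the client. If an error occurs during processing, the server sends an error event instead.

Parameter	Type	Description
type	string	The event type. Always `session.updated`.

All other parameters have the same meaning as in session.created.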

{
    "event_id": "event_1234",
    "type": "session.updated",
    "session": {
        "id": "sess_001",
        "object": "realtime.session",
        "model": "qwen3-asr-flash-realtime",
        "modalities": ["text"],
        "input_audio_format": "pcm16",
        "input_audio_transcription": null,
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.5,
            "silence_duration_ms": 200
        }
    }
}

input_audio_buffer.speech_started

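Sent only in VAD mode, when the server detects the start of speech in the audio buffer.

This event can be sent each time audio is appended to the buffer, unless the start of speech has already been detected.

Parameter	Type	Description
type	string	The event type. Always `input_audio_buffer.speech_started`.
event_id	string	The event ID.
audio_start_ms	integer	The number of milliseconds, within the session, from when audio was first written to the buffer to when speech was first detected.
item_id	string	The ID of the user message item that will be created.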

{
  "event_id": "event_B1lV7FPbgTv9qGxPI1tH4",
  "type": "input_audio_buffer.speech_started",
  "audio_start_ms": 64,
  "item_id": "item_B1lV7jWLscp4mMV8hSs8c"
}

input_audio_buffer.speech_stopped

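Sent only in VAD mode, when the server detects the end of speech in the audio buffer.

Immediately after this event, the server sends a conversation.item.created event that contains the user message item created from the audio buffer.

Parameter	Type	Description
type	string	The event type. Always `input_audio_buffer.speech_stopped`.
event_id	string	The event ID.
audio_end_ms	integer	The number of milliseconds from the start of the session to when speech stopped.
item_id	string	The ID of the user message item that will be created when speech stops.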

{
  "event_id": "event_B3GGEYh2orwNIdhUagZPz",
  "type": "input_audio_buffer.speech_stopped",
  "audio_end_ms": 28128,
  "item_id": "item_B3GGE8ry4yqbqJGzrVhEM"
}
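
A client can pair the two VAD events to estimate how long each utterance lasted. The Python sketch below rests on two assumptions that the tables above do not guarantee: that the `speech_started` and `speech_stopped` events for one utterance carry the same `item_id`, and that `audio_start_ms` and `audio_end_ms` are measured from a comparable reference point. `SpeechSegmentTracker` is a hypothetical helper name.

```python
from typing import Dict, Optional


class SpeechSegmentTracker:
    """Pairs speech_started / speech_stopped events by item_id."""

    def __init__(self) -> None:
        # item_id -> audio_start_ms of utterances that have not stopped yet
        self._starts: Dict[str, int] = {}

    def on_event(self, event: dict) -> Optional[int]:
        """Return the estimated utterance length in ms when a segment closes."""
        etype = event.get("type")
        if etype == "input_audio_buffer.speech_started":
            self._starts[event["item_id"]] = event["audio_start_ms"]
            return None
        if etype == "input_audio_buffer.speech_stopped":
            start = self._starts.pop(event["item_id"], None)
            if start is None:
                # No matching start was seen; nothing to measure.
                return None
            return event["audio_end_ms"] - start
        return None


if __name__ == "__main__":
    tracker = SpeechSegmentTracker()
    tracker.on_event({
        "type": "input_audio_buffer.speech_started",
        "audio_start_ms": 64,
        "item_id": "item_a",
    })
    print(tracker.on_event({
        "type": "input_audio_buffer.speech_stopped",
        "audio_end_ms": 28128,
        "item_id": "item_a",
    }))
```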

input_audio_buffer.committed

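VAD mode: sent by the server after the client finishes sending audio data through input_audio_buffer.append events.
Non-VAD mode: sent by the server after the client finishes sending audio data through input_audio_buffer.append events and then sends an input_audio_buffer.commit event.

Parameter	Type	Description
type	string	The event type. Always `input_audio_buffer.committed`.
event_id	string	The event ID.
previous_item_id	string	The ID of the previous conversation item.
item_id	string	The ID of the user conversation item that will be created.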

{
    "event_id": "event_1121",
    "type": "input_audio_buffer.committed",
    "previous_item_id": "msg_001",
    "item_id": "msg_002"
}

conversation.item.created

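Sent when a conversation item is created.

Parameter	Type	Description
type	string	The event type. Always `conversation.item.created`.
event_id	string	The event ID.
previous_item_id	string	The ID of the previous conversation item.
item	object	The item to add to the conversation.
item.id	string	The unique ID of the conversation item.
item.object	string	Always `realtime.item`.
item.type	string	Always `message`.
item.status	string	The status of the conversation item.
item.role	string	The role that sent the message.
item.content	array[object]	The content of the message.
item.content.type	string	Always `input_audio`.
item.content.transcript	string	Always `null`. The complete recognition result is delivered by the conversation.item.input_audio_transcription.completed event.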

{
  "type": "conversation.item.created",
  "event_id": "event_B3GGKbCfBZTpqFHZ0P8vg",
  "previous_item_id": "item_B3GGE8ry4yqbqJGzrVhEM",
  "item": {
    "id": "item_B3GGEPlolCqdMiVbYIf5L",
    "object": "realtime.item",
    "type": "message",
    "status": "completed",
    "role": "user",
    "content": [
      {
        "type": "input_audio",
        "transcript": null
      }
    ]
  }
}

conversation.item.input_audio_transcription.text

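This event is sent at high frequency and carries the real-time recognition result for display.

Parameter	Type	Description
type	string	The event type. Always `conversation.item.input_audio_transcription.text`.
event_id	string	The event ID.
item_id	string	The ID of the associated conversation item.
content_index	integer	The index of the content part that contains the audio.
language	string	The language of the recognized audio. If the request parameter `language` specifies a language, this value matches it. Possible values: zh: Chinese (Mandarin, Sichuanese, Minnan, Wu); yue: Cantonese; en: English; ja: Japanese; de: German; ko: Korean; ru: Russian; fr: French; pt: Portuguese; ar: Arabic; it: Italian; es: Spanish; hi: Hindi; id: Indonesian; th: Thai; tr: Turkish; uk: Ukrainian; vi: Vietnamese.
emotion	string	The emotion of the recognized audio. Supported values: `surprised`, `neutral`, `happy`, `sad`, `disgusted`, `angry`, `fearful`.
text	string	The confirmed text prefix: the part of the current sentence that the model has finalized and will not change.
stash	string	The tentative text suffix: a provisional draft that follows the confirmed part, which the model is still processing and may revise.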

{
  "event_id": "event_R7Pfu8QVBfP5HmpcbEFSd",
  "type": "conversation.item.input_audio_transcription.text",
  "item_id": "item_MpJQPNQzqVRc9aC9zMwSj",
  "content_index": 0,
  "language": "zh",
  "emotion": "neutral",
  "text": "",
  "stash": "北京的"
}

At any moment, the most complete preview of the current sentence is the concatenation of the two fields: live preview = text + stash.


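Suppose the user is saying: “今天天氣不錯，陽光明媚。” (“The weather is nice today, and the sun is shining.”)

The following event stream shows what you might receive and how to interpret it:

Time	User speech progress	API response (text and stash)	Client UI should display (text + stash)
T1	“今天……”	text: "" stash: "今天"	今天
T2	“……天氣……”	text: "" stash: "今天天氣"	今天天氣
T3	“……不錯”	text: "今天" stash: "天氣不錯"	今天天氣不錯 (note that “今天” has been confirmed and moved into text)
T4	(brief pause)	text: "今天天氣不錯，" stash: ""	今天天氣不錯， (the first half of the sentence is fully confirmed)
T5	“……陽光……”	text: "今天天氣不錯，" stash: "陽光"	今天天氣不錯，陽光
T6	“……明媚。”	text: "今天天氣不錯，" stash: "陽光明媚。"	今天天氣不錯，陽光明媚。
T7	(speech ends)	-	Use the transcript field of conversation.item.input_audio_transcription.completed as the final result.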
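
The text + stash rule can be sketched in a few lines of Python. The function name `live_preview` and the `events` list are illustrative only; the events are reduced to the two fields that matter here and mirror rows T1 to T6 of the table.

```python
from typing import List


def live_preview(event: dict) -> str:
    """Return the sentence preview for one transcription.text event.

    The preview is the confirmed prefix (`text`) followed by the
    tentative suffix (`stash`). Missing fields are treated as empty.
    """
    return event.get("text", "") + event.get("stash", "")


# Events reduced to `text` and `stash`, following rows T1 to T6 above.
events: List[dict] = [
    {"text": "", "stash": "今天"},
    {"text": "", "stash": "今天天氣"},
    {"text": "今天", "stash": "天氣不錯"},
    {"text": "今天天氣不錯，", "stash": ""},
    {"text": "今天天氣不錯，", "stash": "陽光"},
    {"text": "今天天氣不錯，", "stash": "陽光明媚。"},
]

if __name__ == "__main__":
    for index, event in enumerate(events, start=1):
        print(f"T{index}: {live_preview(event)}")
```

Because `stash` may be revised by later events, a UI would typically replace the whole displayed preview on each event rather than append to it.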

conversation.item.input_audio_transcription.completed

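Delivers the final recognition result to the client. This event marks the end of a conversation item.

Parameter	Type	Description
type	string	The event type. Always `conversation.item.input_audio_transcription.completed`.
event_id	string	The event ID.
item_id	string	The ID of the associated conversation item.
content_index	integer	The index of the content part that contains the audio.
language	string	The language of the recognized audio. If the request parameter `language` specifies a language, this value matches it. Possible values: zh: Chinese (Mandarin, Sichuanese, Minnan, Wu); yue: Cantonese; en: English; ja: Japanese; de: German; ko: Korean; ru: Russian; fr: French; pt: Portuguese; ar: Arabic; it: Italian; es: Spanish; hi: Hindi; id: Indonesian; th: Thai; tr: Turkish; uk: Ukrainian; vi: Vietnamese.
emotion	string	The emotion of the recognized audio. Supported values: `surprised`, `neutral`, `happy`, `sad`, `disgusted`, `angry`, `fearful`.
transcript	string	The recognition result.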

{
  "event_id": "event_B3GGEjPT2sLzjBM74W6kB",
  "type": "conversation.item.input_audio_transcription.completed",
  "item_id": "item_B3GGC53jGOuIFcjZkmEQ9",
  "content_index": 0,
  "language": "zh",
  "emotion": "neutral",
  "transcript": "今天天氣怎麼樣"
}
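
One possible way to combine interim and final results on the client is sketched below in Python. `TranscriptStore` is a hypothetical helper: it keeps a preview per `item_id` from the `.text` events and replaces it with `transcript` once the `.completed` event for that item arrives.

```python
from typing import Dict, List

TEXT_EVENT = "conversation.item.input_audio_transcription.text"
COMPLETED_EVENT = "conversation.item.input_audio_transcription.completed"


class TranscriptStore:
    """Tracks the best available transcript for each conversation item."""

    def __init__(self) -> None:
        self._order: List[str] = []          # item IDs in arrival order
        self._preview: Dict[str, str] = {}   # interim text + stash
        self._final: Dict[str, str] = {}     # final transcript

    def on_event(self, event: dict) -> None:
        etype = event.get("type")
        if etype not in (TEXT_EVENT, COMPLETED_EVENT):
            return
        item_id = event["item_id"]
        if item_id not in self._order:
            self._order.append(item_id)
        if etype == TEXT_EVENT:
            self._preview[item_id] = event.get("text", "") + event.get("stash", "")
        else:
            # The final result supersedes any interim preview.
            self._final[item_id] = event.get("transcript", "")
            self._preview.pop(item_id, None)

    def lines(self) -> List[str]:
        """Return one line per item, preferring the final transcript."""
        return [
            self._final.get(item_id, self._preview.get(item_id, ""))
            for item_id in self._order
        ]


if __name__ == "__main__":
    store = TranscriptStore()
    store.on_event({"type": TEXT_EVENT, "item_id": "item_1",
                    "text": "", "stash": "今天天氣"})
    store.on_event({"type": COMPLETED_EVENT, "item_id": "item_1",
                    "transcript": "今天天氣怎麼樣"})
    print(store.lines())
```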

conversation.item.input_audio_transcription.failed

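Sent by the server when audio was received but recognition failed. It is kept separate from other error events so that the client can identify the specific conversation item involved.

Parameter	Type	Description
type	string	The event type. Always `conversation.item.input_audio_transcription.failed`.
item_id	string	The ID of the associated conversation item.
content_index	integer	The index of the content part that contains the audio.
error.code	string	The error code.
error.message	string	The error message.
error.param	string	The parameter related to the error.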

{
  "type": "conversation.item.input_audio_transcription.failed",
  "item_id": "<item_id>",
  "content_index": 0,
  "error": {
    "code": "<code>",
    "message": "<message>",
    "param": "<param>"
  }
}
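
Because this event carries an `item_id`, a client can report the failure against the affected item. The Python sketch below is one hedged way to do that; `describe_transcription_failure` is a hypothetical function, and the sample values are invented for illustration since the source gives only placeholders.

```python
FAILED_EVENT = "conversation.item.input_audio_transcription.failed"


def describe_transcription_failure(event: dict) -> str:
    """Build a log line that ties a recognition failure to its item."""
    if event.get("type") != FAILED_EVENT:
        raise ValueError(f"unexpected event type: {event.get('type')!r}")
    err = event.get("error") or {}
    return (
        f"item {event.get('item_id')} "
        f"(content {event.get('content_index')}): "
        f"[{err.get('code')}] {err.get('message')}"
    )


if __name__ == "__main__":
    # Invented sample values; real codes and messages come from the server.
    print(describe_transcription_failure({
        "type": FAILED_EVENT,
        "item_id": "item_1",
        "content_index": 0,
        "error": {"code": "example_code", "message": "example message",
                  "param": "example_param"},
    }))
```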

session.finished

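Marks the end of the session: all audio recognition in the current session is complete.

The server sends this event only after the client sends session.finish. After receiving it, the client can close the connection.

Parameter	Type	Description
type	string	The event type. Always `session.finished`.
event_id	string	The event ID.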

{
  "event_id": "event_2239",
  "type": "session.finished"
}
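
A receive loop can use this event as its stop condition. The Python sketch below dispatches events by `type` and stops at `session.finished`. It works on any iterable of JSON strings; in a real client that iterable would be the text messages read from the WebSocket connection, which is outside the scope of this sketch. `run_until_finished` and the `handlers` mapping are hypothetical names.

```python
import json
from typing import Callable, Dict, Iterable, List


def run_until_finished(
    messages: Iterable[str],
    handlers: Dict[str, Callable[[dict], None]],
) -> bool:
    """Dispatch server events by type until session.finished arrives.

    Returns True if session.finished was seen, which tells the caller
    that it can close the connection. Returns False if the stream ended
    without it.
    """
    for raw in messages:
        event = json.loads(raw)
        etype = event.get("type", "")
        handler = handlers.get(etype)
        if handler is not None:
            handler(event)
        if etype == "session.finished":
            return True
    return False


if __name__ == "__main__":
    finals: List[str] = []
    stream = [
        json.dumps({
            "type": "conversation.item.input_audio_transcription.completed",
            "item_id": "item_1",
            "transcript": "今天天氣怎麼樣",
        }),
        json.dumps({"event_id": "event_2239", "type": "session.finished"}),
    ]
    done = run_until_finished(stream, {
        "conversation.item.input_audio_transcription.completed":
            lambda e: finals.append(e["transcript"]),
    })
    print(done, finals)
```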