Qwen-ASR-Realtime server events - Alibaba Cloud Model Studio

This topic describes the events that the server sends to the client during a WebSocket session with the Qwen-ASR-Realtime API.

User guide: For a model overview, features, and complete sample code, see "Real-time speech recognition - Qwen".

error

The server sends this event to the client when it detects an error. The error can be either a client error or a server error.

Parameter	Type	Description
type	string	The event type. Fixed to `error`.
event_id	string	The event ID.
error.type	string	The error type.
error.code	string	The error code.
error.message	string	The specific error message. For solutions, see "Error messages".
error.param	string	The parameter related to the error.
error.event_id	string	The ID of the event related to the error.

{
  "event_id": "event_B2uoU7VOt1AAITsPRPH9n",
  "type": "error",
  "error": {
    "type": "invalid_request_error",
    "code": "invalid_value",
    "message": "Invalid value: 'whisper-1xx'. Supported values are: 'whisper-1'.",
    "param": "session.input_audio_transcription.model",
    "event_id": "event_123"
  }
}
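
As an illustration of how a client might consume this event, the following minimal Python sketch extracts the fields listed in the table above. The `ErrorInfo` class and the `parse_error_event` function are hypothetical names chosen for this example and are not part of the API. The sketch assumes that each WebSocket message arrives as one JSON string.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class ErrorInfo:
    """Hypothetical container for the fields of an `error` event."""
    type: str
    code: str
    message: str
    param: Optional[str] = None
    event_id: Optional[str] = None  # ID of the event related to the error


def parse_error_event(raw: str) -> Optional[ErrorInfo]:
    """Return an ErrorInfo if `raw` is an `error` event, otherwise None."""
    event = json.loads(raw)
    if event.get("type") != "error":
        return None
    err = event.get("error", {})
    return ErrorInfo(
        type=err.get("type", ""),
        code=err.get("code", ""),
        message=err.get("message", ""),
        param=err.get("param"),
        event_id=err.get("event_id"),
    )
```

Keeping `param` and `event_id` optional lets the same function handle errors that are not tied to a specific parameter or client event.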

session.created

This is the first event that the server sends after the client connects successfully. It contains the default configuration that the server set for the session.

Parameter	Type	Description
type	string	The event type. Fixed to `session.created`.
event_id	string	The event ID.
session.id	string	The ID of the current WebSocket session.
session.object	string	Fixed to `realtime.session`.
session.model	string	The model name.
session.modalities	array[string]	The output modalities of the model. Fixed to `["text"]`.
session.input_audio_format	string	The input audio format.
session.input_audio_transcription	object	The speech recognition configuration. For details, see the `input_audio_transcription` parameter of the client session.update event.
session.turn_detection	object	The voice activity detection (VAD) configuration.
session.turn_detection.type	string	Fixed to `server_vad`.
session.turn_detection.threshold	float	The VAD detection threshold.
session.turn_detection.silence_duration_ms	integer	The VAD threshold for detecting a sentence break, in milliseconds (ms).

{
    "event_id": "event_1234",
    "type": "session.created",
    "session": {
        "id": "sess_001",
        "object": "realtime.session",
        "model": "qwen3-asr-flash-realtime",
        "modalities": ["text"],
        "input_audio_format": "pcm16",
        "input_audio_transcription": null,
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.5,
            "silence_duration_ms": 200
        }
    }
}

session.updated

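The server sends this event after it successfully processes a session.update event from the client. If an error occurs during processing, the server sends an error event instead.

Parameter	Type	Description
type	string	The event type. Fixed to `session.updated`.

For descriptions of the other parameters, see session.created.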

{
    "event_id": "event_1234",
    "type": "session.updated",
    "session": {
        "id": "sess_001",
        "object": "realtime.session",
        "model": "qwen3-asr-flash-realtime",
        "modalities": ["text"],
        "input_audio_format": "pcm16",
        "input_audio_transcription": null,
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.5,
            "silence_duration_ms": 200
        }
    }
}
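
Because a session.update request is answered with either session.updated or an error event, a client may want to wait for one of the two before it continues. The sketch below shows one possible way to do that in Python. The function name `wait_for_session_updated` is hypothetical, and the messages are assumed to be available as an iterable of raw JSON strings, independent of any particular WebSocket library.

```python
import json
from typing import Iterable


def wait_for_session_updated(messages: Iterable[str]) -> dict:
    """Consume raw JSON messages until the update is confirmed or rejected.

    Returns the `session` object from `session.updated`, or raises
    RuntimeError with the server's message if an `error` event arrives first.
    """
    for raw in messages:
        event = json.loads(raw)
        if event.get("type") == "session.updated":
            return event["session"]
        if event.get("type") == "error":
            raise RuntimeError(event["error"]["message"])
        # Any other event (for example session.created) is skipped here.
    raise RuntimeError("stream ended before session.updated arrived")
```

A production client would probably also match `error.event_id` against the ID of the session.update event it sent; that check is omitted to keep the sketch short.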

input_audio_buffer.speech_started

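This event is sent only in VAD mode. The server sends it when it detects the start of speech in the audio buffer.

This event can occur each time audio is added to the buffer, unless the start of speech has already been detected.

Parameter	Type	Description
type	string	The event type. Fixed to `input_audio_buffer.speech_started`.
event_id	string	The event ID.
audio_start_ms	integer	The time, in milliseconds, from when audio started being written to the buffer in this session until speech was first detected.
item_id	string	The ID of the user message item that will be created.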

{
  "event_id": "event_B1lV7FPbgTv9qGxPI1tH4",
  "type": "input_audio_buffer.speech_started",
  "audio_start_ms": 64,
  "item_id": "item_B1lV7jWLscp4mMV8hSs8c"
}

input_audio_buffer.speech_stopped

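This event is sent only in VAD mode. The server sends it when it detects the end of speech in the audio buffer.

Immediately after this event is triggered, the server sends a conversation.item.created event, which contains the user message item created from the audio buffer.

Parameter	Type	Description
type	string	The event type. Fixed to `input_audio_buffer.speech_stopped`.
event_id	string	The event ID.
audio_end_ms	integer	The time elapsed, in milliseconds, from the start of the session until speech stopped.
item_id	string	The ID of the user message item that is created when speech stops.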

{
  "event_id": "event_B3GGEYh2orwNIdhUagZPz",
  "type": "input_audio_buffer.speech_stopped",
  "audio_end_ms": 28128,
  "item_id": "item_B3GGE8ry4yqbqJGzrVhEM"
}
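
The two VAD events can be paired to estimate how long each utterance lasted. The following Python sketch is one way to do this under two assumptions that this topic does not state explicitly: that `speech_started` and `speech_stopped` for the same utterance carry the same `item_id`, and that `audio_start_ms` and `audio_end_ms` are measured from the same origin. The class name `SpeechSegmentTracker` is hypothetical.

```python
from typing import Dict, Optional


class SpeechSegmentTracker:
    """Pairs speech_started and speech_stopped events by item_id."""

    def __init__(self) -> None:
        self._starts: Dict[str, int] = {}

    def on_event(self, event: dict) -> Optional[int]:
        """Return the segment length in ms when a segment closes, else None."""
        kind = event.get("type")
        if kind == "input_audio_buffer.speech_started":
            self._starts[event["item_id"]] = event["audio_start_ms"]
        elif kind == "input_audio_buffer.speech_stopped":
            # Assumption: both offsets share one time origin, so the
            # difference approximates the utterance duration.
            start = self._starts.pop(event["item_id"], None)
            if start is not None:
                return event["audio_end_ms"] - start
        return None
```

A stop event without a matching start is ignored and yields `None`, so the tracker tolerates events that arrive before it was attached.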

input_audio_buffer.committed

VAD mode: The server sends this event after the client finishes sending audio data with input_audio_buffer.append events.
Manual mode: The server sends this event after the client finishes sending audio data with input_audio_buffer.append events and then sends an input_audio_buffer.commit event.

Parameter	Type	Description
type	string	The event type. Fixed to `input_audio_buffer.committed`.
event_id	string	The event ID.
previous_item_id	string	The ID of the previous conversation item.
item_id	string	The ID of the user conversation item that will be created.

{
    "event_id": "event_1121",
    "type": "input_audio_buffer.committed",
    "previous_item_id": "msg_001",
    "item_id": "msg_002"
}
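
Since each committed event names both the new item and its predecessor, a client can rebuild the order of conversation items from these links. The Python sketch below illustrates the idea. The function name `order_items` is hypothetical, and the sketch assumes that every item has at most one successor and that the first item in a chain refers to a predecessor that does not appear as an `item_id` in the same batch (or has no `previous_item_id` at all).

```python
from typing import Dict, List, Optional


def order_items(committed_events: List[dict]) -> List[str]:
    """Return item IDs in conversation order by following previous_item_id."""
    # Map each predecessor ID to the item that follows it.
    next_of: Dict[Optional[str], str] = {
        e.get("previous_item_id"): e["item_id"] for e in committed_events
    }
    known = set(next_of.values())
    # A chain starts at a predecessor that none of these events created.
    heads = [prev for prev in next_of if prev not in known]
    ordered: List[str] = []
    for head in heads:
        current = next_of.get(head)
        while current is not None and current not in ordered:
            ordered.append(current)
            current = next_of.get(current)
    return ordered
```

This is useful when events are buffered or logged out of order and the original sequence has to be reconstructed afterwards.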

conversation.item.created

The server sends this event when a new conversation item is created.

Parameter	Type	Description
type	string	The event type. Fixed to `conversation.item.created`.
event_id	string	The event ID.
previous_item_id	string	The ID of the previous conversation item.
item	object	The item added to the conversation.
item.id	string	The unique ID of the conversation item.
item.object	string	Fixed to `realtime.item`.
item.type	string	Fixed to `message`.
item.status	string	The status of the conversation item.
item.role	string	The role of the message sender.
item.content	array[object]	The message content.
item.content.type	string	Fixed to `input_audio`.
item.content.transcript	string	Fixed to `null`. The complete recognition result is provided in the conversation.item.input_audio_transcription.completed event.

{
  "type": "conversation.item.created",
  "event_id": "event_B3GGKbCfBZTpqFHZ0P8vg",
  "previous_item_id": "item_B3GGE8ry4yqbqJGzrVhEM",
  "item": {
    "id": "item_B3GGEPlolCqdMiVbYIf5L",
    "object": "realtime.item",
    "type": "message",
    "status": "completed",
    "role": "user",
    "content": [
      {
        "type": "input_audio",
        "transcript": null
      }
    ]
  }
}

conversation.item.input_audio_transcription.text

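The server sends this event frequently to deliver recognition results in real time.

Parameter	Type	Description
type	string	The event type. Fixed to `conversation.item.input_audio_transcription.text`.
event_id	string	The event ID.
item_id	string	The ID of the associated conversation item.
content_index	integer	The index of the content part that contains the audio.
language	string	The language of the recognized speech. If a language is specified in the `language` request parameter, this value matches the specified language. Possible values: zh: Chinese (Mandarin, Sichuanese, Minnan, Wu); yue: Cantonese; en: English; ja: Japanese; de: German; ko: Korean; ru: Russian; fr: French; pt: Portuguese; ar: Arabic; it: Italian; es: Spanish; hi: Hindi; id: Indonesian; th: Thai; tr: Turkish; uk: Ukrainian; vi: Vietnamese.
emotion	string	The emotion detected in the speech. Supported emotions: `surprised`, `neutral`, `happy`, `sad`, `disgusted`, `angry`, `fearful`.
text	string	The confirmed text prefix. This is the part of the current sentence that the model has finalized and will not change.
stash	string	The provisionally recognized text suffix. This is a temporary draft that follows the confirmed part. The model is still processing it and may revise it.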

{
  "event_id": "event_R7Pfu8QVBfP5HmpcbEFSd",
  "type": "conversation.item.input_audio_transcription.text",
  "item_id": "item_MpJQPNQzqVRc9aC9zMwSj",
  "content_index": 0,
  "language": "en",
  "emotion": "neutral",
  "text": "",
  "stash": "Beijing's"
}

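At any time, you can concatenate these two fields to get the most complete preview of the sentence. The real-time preview is text + stash.

For example, suppose the user says, "The weather is nice today, sunny and bright."

The following table shows the event stream that you might receive and how to interpret it:

Timestamp	User speech progress	API response (text and stash)	UI display (text + stash)
T1	"The..."	text: "" stash: "The"	The
T2	"...weather..."	text: "" stash: "The weather"	The weather
T3	"...is nice today"	text: "The" stash: " weather is nice today"	The weather is nice today (Note: "The" has been confirmed and moved to the text field.)
T4	(short pause)	text: "The weather is nice today," stash: ""	The weather is nice today, (The first clause is fully confirmed.)
T5	"...sunny..."	text: "The weather is nice today," stash: " sunny"	The weather is nice today, sunny
T6	"...and bright."	text: "The weather is nice today," stash: " sunny and bright."	The weather is nice today, sunny and bright.
T7	(The user stops speaking)	-	Use the transcript from the conversation.item.input_audio_transcription.completed event as the final result.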
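
The concatenation rule described here can be sketched in a few lines of Python. The class name `TranscriptView` is hypothetical; the sketch only assumes that events have already been decoded from JSON into dictionaries. It keeps one display string per conversation item, updates it with text + stash on every interim event, and replaces it with the final transcript when the completed event arrives.

```python
from typing import Dict

TEXT_EVENT = "conversation.item.input_audio_transcription.text"
DONE_EVENT = "conversation.item.input_audio_transcription.completed"


class TranscriptView:
    """Holds the string to display for each conversation item."""

    def __init__(self) -> None:
        self.display: Dict[str, str] = {}

    def on_event(self, event: dict) -> None:
        kind = event.get("type")
        if kind == TEXT_EVENT:
            # Live preview: confirmed prefix followed by provisional suffix.
            preview = event.get("text", "") + event.get("stash", "")
            self.display[event["item_id"]] = preview
        elif kind == DONE_EVENT:
            # The final transcript supersedes any earlier preview.
            self.display[event["item_id"]] = event["transcript"]
```

Overwriting the whole preview on each event, rather than appending to it, matters because the stash part can be revised between events.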

conversation.item.input_audio_transcription.completed

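This event delivers the final recognition result to the client. It marks the end of the conversation item.

Parameter	Type	Description
type	string	The event type. Fixed to `conversation.item.input_audio_transcription.completed`.
event_id	string	The event ID.
item_id	string	The ID of the associated conversation item.
content_index	integer	The index of the content part that contains the audio.
language	string	The language of the recognized speech. If a language is specified in the `language` request parameter, this value matches the specified language. Possible values: zh: Chinese (Mandarin, Sichuanese, Minnan, Wu); yue: Cantonese; en: English; ja: Japanese; de: German; ko: Korean; ru: Russian; fr: French; pt: Portuguese; ar: Arabic; it: Italian; es: Spanish; hi: Hindi; id: Indonesian; th: Thai; tr: Turkish; uk: Ukrainian; vi: Vietnamese.
emotion	string	The emotion detected in the speech. Supported emotions: `surprised`, `neutral`, `happy`, `sad`, `disgusted`, `angry`, `fearful`.
transcript	string	The transcription result.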

{
  "event_id": "event_B3GGEjPT2sLzjBM74W6kB",
  "type": "conversation.item.input_audio_transcription.completed",
  "item_id": "item_B3GGC53jGOuIFcjZkmEQ9",
  "content_index": 0,
  "language": "en",
  "emotion": "neutral",
  "transcript": "What's the weather like today?"
}

conversation.item.input_audio_transcription.failed

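The server sends this event when recognition of the input audio fails. It is handled separately from other error events so that the client can more easily identify the specific item that failed.

Parameter	Type	Description
type	string	The event type. Fixed to `conversation.item.input_audio_transcription.failed`.
item_id	string	The ID of the associated conversation item.
content_index	integer	The index of the content part that contains the audio.
error.code	string	The error code.
error.message	string	The error message.
error.param	string	The parameter related to the error.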

{
  "type": "conversation.item.input_audio_transcription.failed",
  "item_id": "<item_id>",
  "content_index": 0,
  "error": {
    "code": "<code>",
    "message": "<message>",
    "param": "<param>"
  }
}
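
Because this event carries an `item_id`, a client can record failures per item instead of treating them as session-level errors. The Python sketch below shows one possible approach. The function name `record_failure` and the "code: message" storage format are choices made for this example only.

```python
from typing import Dict, Optional

FAILED_EVENT = "conversation.item.input_audio_transcription.failed"


def record_failure(event: dict, failures: Dict[str, str]) -> Optional[str]:
    """Store "<code>: <message>" under the failed item's ID.

    Returns the item ID if `event` is a transcription failure, else None.
    """
    if event.get("type") != FAILED_EVENT:
        return None
    err = event.get("error", {})
    failures[event["item_id"]] = f'{err.get("code", "")}: {err.get("message", "")}'
    return event["item_id"]
```

The caller owns the `failures` dictionary, so it can later decide per item whether to retry, skip, or surface the problem to the user.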

session.finished

The session has ended, and all speech recognition in the current session is complete.

This event is sent only after the client sends a session.finish event. The client can disconnect after receiving this event.

Parameter	Type	Description
type	string	The event type. Fixed to `session.finished`.
event_id	string	The event ID.

{
  "event_id": "event_2239",
  "type": "session.finished"
}
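
Taken together, the events in this topic suggest a simple receive loop: dispatch each event by its `type` and stop once session.finished arrives. The following Python sketch illustrates that pattern. The function name `run_until_finished` and the handler table are hypothetical, and the messages are assumed to be an iterable of raw JSON strings rather than a specific WebSocket client object.

```python
import json
from typing import Callable, Dict, Iterable


def run_until_finished(
    messages: Iterable[str],
    handlers: Dict[str, Callable[[dict], None]],
) -> bool:
    """Dispatch events by type; return True once session.finished is seen.

    Returns False if the stream ends without a session.finished event.
    """
    for raw in messages:
        event = json.loads(raw)
        handler = handlers.get(event.get("type"))
        if handler is not None:
            handler(event)
        if event.get("type") == "session.finished":
            # Safe point to disconnect: all recognition is complete.
            return True
    return False
```

For example, passing `{"error": on_error, "conversation.item.input_audio_transcription.completed": on_final}` as `handlers` would route those two event types to the caller's own functions and silently skip the rest.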