Qwen-Omni realtime models (Alibaba Cloud Model Studio)

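Qwen-Omni-Realtime is the realtime audio and video chat model in the Qwen series. It understands streaming audio and image input, such as consecutive image frames extracted from a video stream in real time, and it generates high-quality text and audio in real time.

How to use

1. Establish a connection

Qwen-Omni-Realtime models are accessed over the WebSocket protocol. You can establish a connection with the following Python code example or with the DashScope SDK.

Note

A single Qwen-Omni-Realtime WebSocket session lasts at most 30 minutes. When this limit is reached, the service closes the connection automatically.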
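
Because the service closes a session after 30 minutes, a long-running client may want to track the age of its session and reconnect shortly before the limit. The following is a minimal sketch only: `SessionClock` and the 60-second safety margin are hypothetical and are not part of the DashScope SDK.

```python
import time

# The 30-minute limit comes from the note above.
SESSION_LIMIT_S = 30 * 60
# Assumption: reconnect 60 seconds early. Choose a margin that suits your application.
SAFETY_MARGIN_S = 60


class SessionClock:
    """Hypothetical helper that tracks how long a WebSocket session has been open."""

    def __init__(self, now=time.monotonic):
        # `now` is injectable so the clock can be tested without waiting.
        self._now = now
        self._opened_at = now()

    def remaining(self) -> float:
        """Seconds left before the service is expected to close the session."""
        return max(0.0, SESSION_LIMIT_S - (self._now() - self._opened_at))

    def should_reconnect(self) -> bool:
        """True once the session is within the safety margin of the limit."""
        return self.remaining() <= SAFETY_MARGIN_S
```

One way to use it is to create a `SessionClock` in the `on_open` callback and check `should_reconnect()` between conversation turns.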

Native WebSocket connection

The following configuration items are required:

Configuration item	Description
Endpoint	China (Beijing): wss://dashscope.aliyuncs.com/api-ws/v1/realtime International (Singapore): wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime
Query parameter	The query parameter is `model`. Set it to the name of the model you want to access. Example: `?model=qwen3-omni-flash-realtime`
Request header	Authenticate with a Bearer token: Authorization: Bearer DASHSCOPE_API_KEY. DASHSCOPE_API_KEY is the API key that you obtained from Model Studio.

# pip install websocket-client
import json
import websocket
import os

API_KEY=os.getenv("DASHSCOPE_API_KEY")
API_URL = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime?model=qwen3-omni-flash-realtime"

headers = [
    "Authorization: Bearer " + API_KEY
]

def on_open(ws):
    print(f"Connected to server: {API_URL}")
def on_message(ws, message):
    data = json.loads(message)
    print("Received event:", json.dumps(data, indent=2))
def on_error(ws, error):
    print("Error:", error)

ws = websocket.WebSocketApp(
    API_URL,
    header=headers,
    on_open=on_open,
    on_message=on_message,
    on_error=on_error
)

ws.run_forever()

DashScope SDK

Python

# Requires SDK version 1.23.9 or later
import os
import json
from dashscope.audio.qwen_omni import OmniRealtimeConversation,OmniRealtimeCallback
import dashscope
# API keys differ between the Singapore and Beijing regions. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
# If you have not set the environment variable, change the next line to dashscope.api_key = "sk-xxx"
dashscope.api_key = os.getenv("DASHSCOPE_API_KEY")

class PrintCallback(OmniRealtimeCallback):
    def on_open(self) -> None:
        print("Connected Successfully")
    def on_event(self, response: dict) -> None:
        print("Received event:")
        print(json.dumps(response, indent=2, ensure_ascii=False))
    def on_close(self, close_status_code: int, close_msg: str) -> None:
        print(f"Connection closed (code={close_status_code}, msg={close_msg}).")

callback = PrintCallback()
conversation = OmniRealtimeConversation(
    model="qwen3-omni-flash-realtime",
    callback=callback,
    # The following is the URL for the Singapore region. To use a model in the Beijing region, replace the URL with wss://dashscope.aliyuncs.com/api-ws/v1/realtime
    url="wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime"
)
try:
    conversation.connect()
    print("Conversation started. Press Ctrl+C to exit.")
    conversation.thread.join()
except KeyboardInterrupt:
    conversation.close()

Java

// Requires SDK version 2.20.9 or later
import com.alibaba.dashscope.audio.omni.*;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.google.gson.JsonObject;
import java.util.concurrent.CountDownLatch;

public class Main {
    public static void main(String[] args) throws InterruptedException, NoApiKeyException {
        CountDownLatch latch = new CountDownLatch(1);
        OmniRealtimeParam param = OmniRealtimeParam.builder()
                .model("qwen3-omni-flash-realtime")
                .apikey(System.getenv("DASHSCOPE_API_KEY"))
                // The following is the URL for the Singapore region. To use a model in the Beijing region, replace the URL with wss://dashscope.aliyuncs.com/api-ws/v1/realtime
                .url("wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime")
                .build();

        OmniRealtimeConversation conversation = new OmniRealtimeConversation(param, new OmniRealtimeCallback() {
            @Override
            public void onOpen() {
                System.out.println("Connected Successfully");
            }
            @Override
            public void onEvent(JsonObject message) {
                System.out.println(message);
            }
            @Override
            public void onClose(int code, String reason) {
                System.out.println("connection closed code: " + code + ", reason: " + reason);
                latch.countDown();
            }
        });
        conversation.connect();
        latch.await();
        conversation.close(1000, "bye");
        System.exit(0);
    }
}

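2. Configure the session

Send the session.update client event:

{
    // The ID of this event, generated by the client.
    "event_id": "event_ToPZqeobitzUJnt3QqtWg",
    // The event type. Always session.update.
    "type": "session.update",
    // The session configuration.
    "session": {
        // The output modalities. Supported values are ["text"] (text only) and ["text", "audio"] (text and audio).
        "modalities": [
            "text",
            "audio"
        ],
        // The voice of the output audio.
        "voice": "Cherry",
        // The input audio format. Only pcm16 is supported.
        "input_audio_format": "pcm16",
        // The output audio format. Only pcm24 is supported.
        "output_audio_format": "pcm24",
        // The system message, used to set the model's goal or role.
        "instructions": "You are an AI customer service agent for a five-star hotel. Answer customer inquiries about room types, facilities, prices, and booking policies accurately and in a friendly manner. Always respond with a professional and helpful attitude. Do not provide unconfirmed information or information beyond the scope of the hotel's services.",
        // Specifies whether to enable voice activity detection (VAD). To enable it, pass a configuration object; the server then detects the start and end of speech automatically based on this object.
        // If set to null, the client decides when to trigger the model's response.
        "turn_detection": {
            // The VAD type. Must be server_vad.
            "type": "server_vad",
            // The VAD detection threshold. Increase it in noisy environments and decrease it in quiet ones.
            "threshold": 0.5,
            // The duration of silence used to detect the end of speech. When silence exceeds this value, the model's response is triggered.
            "silence_duration_ms": 800
        }
    }
}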
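
The annotated event above contains comments, which are not valid JSON, so a client has to build the payload programmatically. The following sketch shows one way to do this with the native WebSocket approach. `build_session_update` is a hypothetical helper, and the `event_id` format is an assumption, since the document only says that the client generates it.

```python
import json
import uuid


def build_session_update(instructions, voice="Cherry", use_vad=True,
                         threshold=0.5, silence_duration_ms=800):
    """Return a session.update event as a JSON string (hypothetical helper)."""
    # Passing use_vad=False sends turn_detection as null, which selects manual mode.
    turn_detection = None
    if use_vad:
        turn_detection = {
            "type": "server_vad",
            "threshold": threshold,
            "silence_duration_ms": silence_duration_ms,
        }
    event = {
        # Assumption: any client-generated unique string is acceptable as the ID.
        "event_id": "event_" + uuid.uuid4().hex,
        "type": "session.update",
        "session": {
            "modalities": ["text", "audio"],
            "voice": voice,
            "input_audio_format": "pcm16",
            "output_audio_format": "pcm24",
            "instructions": instructions,
            "turn_detection": turn_detection,
        },
    }
    return json.dumps(event, ensure_ascii=False)
```

With the `websocket-client` example from step 1, the result could be sent from `on_open`, for example `ws.send(build_session_update("You are a helpful assistant."))`.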

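3. Send audio and image input

The client sends Base64-encoded audio and image data to the server buffer with the input_audio_buffer.append and input_image_buffer.append events. Audio input is required. Image input is optional.

Images can come from local files or be captured from a video stream in real time.

When server-side voice activity detection (VAD) is enabled, the server automatically commits the buffered data and triggers a response when it detects the end of speech. When VAD is disabled (manual mode), the client must send the input_audio_buffer.commit event to commit the data.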
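
The following sketch shows how raw PCM audio could be split into Base64-encoded input_audio_buffer.append events, followed by a commit event for manual mode. The helper names are hypothetical, and the payload field name `audio` is an assumption that this document does not confirm, so check it against the API reference before relying on it.

```python
import base64
import json


def audio_append_events(pcm: bytes, chunk_size: int = 3200):
    """Yield one input_audio_buffer.append event (JSON string) per PCM chunk."""
    for i in range(0, len(pcm), chunk_size):
        chunk = pcm[i:i + chunk_size]
        yield json.dumps({
            "type": "input_audio_buffer.append",
            # Assumption: the Base64 payload is carried in a field named "audio".
            "audio": base64.b64encode(chunk).decode("ascii"),
        })


def commit_event() -> str:
    """Manual mode only: tell the server that the buffered input is complete."""
    return json.dumps({"type": "input_audio_buffer.commit"})
```

In manual mode a client would send every append event and then the commit event. The SDK manual-mode sample later in this document additionally calls `create_response()` after `commit()`.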

4. Receive the model's response

The format of the model's response depends on the configured output modalities.

Text only
Streaming text arrives in response.text.delta events. The complete text is available in the response.text.done event.
Text and audio
- Text: streaming text arrives in response.audio_transcript.delta events. The complete text is available in the response.audio_transcript.done event.
- Audio: Base64-encoded streaming audio output arrives in response.audio.delta events. The response.audio.done event indicates that audio generation is complete.
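
The events listed above can be routed by their `type` field. The following sketch accumulates streamed text and decoded audio. `ResponseCollector` is a hypothetical class. The `delta` and `transcript` field names follow the SDK samples later in this document, while the `delta` field on response.text.delta and the `text` field on response.text.done are assumptions.

```python
import base64


class ResponseCollector:
    """Accumulates streamed text and audio from server events (dicts parsed from JSON)."""

    def __init__(self):
        self.text_parts = []      # streamed text fragments, in arrival order
        self.audio = bytearray()  # decoded PCM bytes
        self.final_text = None    # complete text, set by a *.done event

    def handle(self, event: dict) -> None:
        etype = event.get("type")
        if etype in ("response.text.delta", "response.audio_transcript.delta"):
            self.text_parts.append(event["delta"])
        elif etype == "response.audio.delta":
            self.audio.extend(base64.b64decode(event["delta"]))
        elif etype == "response.audio_transcript.done":
            self.final_text = event["transcript"]
        elif etype == "response.text.done":
            # Assumption: the field is named "text"; fall back to the joined deltas.
            self.final_text = event.get("text", "".join(self.text_parts))
```

With the native WebSocket example, `collector.handle(json.loads(message))` could be called from `on_message`.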

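Model list

Qwen3-Omni-Flash-Realtime is the latest realtime multimodal model in the Qwen series. Compared with the previous-generation model Qwen-Omni-Turbo-Realtime, which is no longer updated, Qwen3-Omni-Flash-Realtime has the following advantages:

Supported languages
The number of supported languages has increased to 10: Chinese (Mandarin and dialects such as Shanghainese, Cantonese, and Sichuanese), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, and Korean. Qwen-Omni-Turbo-Realtime supports only two languages: Chinese (Mandarin) and English.
Supported voices
The number of supported voices has increased to 17. Qwen-Omni-Turbo-Realtime supports only 4. For details, see "Voice list".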

International (Singapore)

Model	Version	Context window (tokens)	Max input (tokens)	Max output (tokens)	Free quota (Note)
qwen3-omni-flash-realtime (equivalent to qwen3-omni-flash-realtime-2025-09-15)	Stable	65,536	49,152	16,384	1 million tokens each, regardless of modality; valid for 90 days after you activate Model Studio
qwen3-omni-flash-realtime-2025-09-15	Snapshot

Other models

Model	Version	Context window (tokens)	Max input (tokens)	Max output (tokens)	Free quota (Note)
qwen-omni-turbo-realtime (equivalent to qwen-omni-turbo-realtime-2025-05-08)	Stable	32,768	30,720	2,048	1 million tokens, regardless of modality; valid for 90 days after you activate Model Studio
qwen-omni-turbo-realtime-latest (always equivalent to the latest snapshot version)	Latest
qwen-omni-turbo-realtime-2025-05-08	Snapshot

China (Beijing)

Model	Version	Context window (tokens)	Max input (tokens)	Max output (tokens)	Free quota (Note)
qwen3-omni-flash-realtime (equivalent to qwen3-omni-flash-realtime-2025-09-15)	Stable	65,536	49,152	16,384	No free quota
qwen3-omni-flash-realtime-2025-09-15	Snapshot

Other models

Model	Version	Context window (tokens)	Max input (tokens)	Max output (tokens)	Free quota (Note)
qwen-omni-turbo-realtime (equivalent to qwen-omni-turbo-realtime-2025-05-08)	Stable	32,768	30,720	2,048	No free quota
qwen-omni-turbo-realtime-latest (always equivalent to the latest snapshot version)	Latest
qwen-omni-turbo-realtime-2025-05-08	Snapshot

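Getting started

Before you begin, complete the steps in "Preparation: get and configure an API key" and "Set an API key as an environment variable".

Choose a programming language you are familiar with and follow the steps below to quickly start a realtime conversation with a Qwen-Omni-Realtime model.

DashScope Python SDK

Prepare the runtime environment

Python 3.10 or later is required.

First, install pyaudio according to your operating system.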

macOS

brew install portaudio && pip install pyaudio

Debian/Ubuntu

If you are not using a virtual environment, you can install it directly with the system package manager:
```
sudo apt-get install python3-pyaudio
```
If you are using a virtual environment, first install the build dependencies:
```
sudo apt update
sudo apt install -y python3-dev portaudio19-dev
```
Then install pyaudio with pip inside the activated virtual environment:
```
pip install pyaudio
```

CentOS

sudo yum install -y portaudio portaudio-devel && pip install pyaudio

Windows

pip install pyaudio

After pyaudio is installed, install the remaining dependencies with pip:

pip install websocket-client dashscope

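Choose an interaction mode

VAD mode (voice activity detection; the start and end of speech are detected automatically)
The server automatically determines when the user starts and stops speaking and responds accordingly.
Manual mode (press to talk, release to send)
The client controls the start and end of speech. After the user finishes speaking, the client must explicitly send a message to the server.

VAD mode

Create a new Python file named vad_dash.py and copy the following code into it: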

vad_dash.py

# Dependencies: dashscope >= 1.23.9, pyaudio
import os
import base64
import time
import pyaudio
from dashscope.audio.qwen_omni import MultiModality, AudioFormat,OmniRealtimeCallback,OmniRealtimeConversation
import dashscope

# Configuration parameters: URL, API key, voice, model, and model role
# Specify the region: 'intl' for International (Singapore), 'cn' for China (Beijing).
region = 'intl'
base_domain = 'dashscope-intl.aliyuncs.com' if region == 'intl' else 'dashscope.aliyuncs.com'
url = f'wss://{base_domain}/api-ws/v1/realtime'
# Set the API key. If you have not set the environment variable, replace the next line with dashscope.api_key = "sk-xxx"
dashscope.api_key = os.getenv('DASHSCOPE_API_KEY')
# Specify the voice
voice = 'Cherry'
# Specify the model
model = 'qwen3-omni-flash-realtime'
# Specify the model role
instructions = "You are Xiaoyun, a personal assistant. Please answer the user's questions in a humorous and witty way."
class SimpleCallback(OmniRealtimeCallback):
    def __init__(self, pya):
        self.pya = pya
        self.out = None
    def on_open(self):
        # Initialize the audio output stream (24 kHz, mono, 16-bit)
        self.out = self.pya.open(
            format=pyaudio.paInt16,
            channels=1,
            rate=24000,
            output=True
        )
    def on_event(self, response):
        if response['type'] == 'response.audio.delta':
            # Play the audio
            self.out.write(base64.b64decode(response['delta']))
        elif response['type'] == 'conversation.item.input_audio_transcription.completed':
            # Print the transcribed user speech
            print(f"[User] {response['transcript']}")
        elif response['type'] == 'response.audio_transcript.done':
            # Print the assistant's reply text
            print(f"[LLM] {response['transcript']}")

# 1. Initialize the audio device
pya = pyaudio.PyAudio()
# 2. Create the callback and the session
callback = SimpleCallback(pya)
conv = OmniRealtimeConversation(model=model, callback=callback, url=url)
# 3. Establish the connection and configure the session
conv.connect()
conv.update_session(output_modalities=[MultiModality.AUDIO, MultiModality.TEXT], voice=voice, instructions=instructions)
# 4. Initialize the audio input stream
mic = pya.open(format=pyaudio.paInt16, channels=1, rate=16000, input=True)
# 5. Main loop that processes audio input
print("Conversation started. Speak into the microphone (Ctrl+C to exit)...")
try:
    while True:
        audio_data = mic.read(3200, exception_on_overflow=False)
        conv.append_audio(base64.b64encode(audio_data).decode())
        time.sleep(0.01)
except KeyboardInterrupt:
    # Clean up resources
    conv.close()
    mic.close()
    callback.out.close()
    pya.terminate()
    print("\nConversation ended")

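Run vad_dash.py to talk with the Qwen-Omni-Realtime model in real time through your microphone. The system detects the start and end of your speech and sends it to the server automatically, without manual intervention.

Manual mode

Create a new Python file named manual_dash.py and copy the following code into it: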

manual_dash.py

# Dependencies: dashscope >= 1.23.9, pyaudio
import os
import base64
import sys
import threading
import pyaudio
from dashscope.audio.qwen_omni import *
import dashscope

# If you have not set the environment variable, replace the next line with your API key: dashscope.api_key = "sk-xxx"
dashscope.api_key = os.getenv('DASHSCOPE_API_KEY')
voice = 'Cherry'

class MyCallback(OmniRealtimeCallback):
    """最小限のコールバック: 接続時にスピーカーを初期化し、返された音声をイベントで直接再生します。"""
    def __init__(self, ctx):
        super().__init__()
        self.ctx = ctx

    def on_open(self) -> None:
        # After the connection is established, initialize PyAudio and the speaker (24 kHz, mono, 16-bit).
        print('connection opened')
        try:
            self.ctx['pya'] = pyaudio.PyAudio()
            self.ctx['out'] = self.ctx['pya'].open(
                format=pyaudio.paInt16,
                channels=1,
                rate=24000,
                output=True
            )
            print('audio output initialized')
        except Exception as e:
            print('[Error] audio init failed: {}'.format(e))

    def on_close(self, close_status_code, close_msg) -> None:
        print('connection closed with code: {}, msg: {}'.format(close_status_code, close_msg))
        sys.exit(0)

    def on_event(self, response: dict) -> None:
        try:
            t = response['type']
            handlers = {
                'session.created': lambda r: print('start session: {}'.format(r['session']['id'])),
                'conversation.item.input_audio_transcription.completed': lambda r: print('question: {}'.format(r['transcript'])),
                'response.audio_transcript.delta': lambda r: print('llm text: {}'.format(r['delta'])),
                'response.audio.delta': self._play_audio,
                'response.done': self._response_done,
            }
            h = handlers.get(t)
            if h:
                h(response)
        except Exception as e:
            print('[Error] {}'.format(e))

    def _play_audio(self, response):
        # Decode the Base64 data and write it to the output stream for playback.
        if self.ctx['out'] is None:
            return
        try:
            data = base64.b64decode(response['delta'])
            self.ctx['out'].write(data)
        except Exception as e:
            print('[Error] audio playback failed: {}'.format(e))

    def _response_done(self, response):
        # Mark the current conversation turn as complete so that the waiting main loop can continue.
        if self.ctx['conv'] is not None:
            print('[Metric] response: {}, first text delay: {}, first audio delay: {}'.format(
                self.ctx['conv'].get_last_response_id(),
                self.ctx['conv'].get_last_first_text_delay(),
                self.ctx['conv'].get_last_first_audio_delay(),
            ))
        if self.ctx['resp_done'] is not None:
            self.ctx['resp_done'].set()

def shutdown_ctx(ctx):
    """オーディオと PyAudio リソースを安全に解放します。"""
    try:
        if ctx['out'] is not None:
            ctx['out'].close()
            ctx['out'] = None
    except Exception:
        pass
    try:
        if ctx['pya'] is not None:
            ctx['pya'].terminate()
            ctx['pya'] = None
    except Exception:
        pass


def record_until_enter(pya_inst: pyaudio.PyAudio, sample_rate=16000, chunk_size=3200):
    """Enter キーを押して録音を停止し、PCM バイトを返します。"""
    frames = []
    stop_evt = threading.Event()

    stream = pya_inst.open(
        format=pyaudio.paInt16,
        channels=1,
        rate=sample_rate,
        input=True,
        frames_per_buffer=chunk_size
    )

    def _reader():
        while not stop_evt.is_set():
            try:
                frames.append(stream.read(chunk_size, exception_on_overflow=False))
            except Exception:
                break

    t = threading.Thread(target=_reader, daemon=True)
    t.start()
    input()  # The user presses Enter again to stop recording.
    stop_evt.set()
    t.join(timeout=1.0)
    try:
        stream.close()
    except Exception:
        pass
    return b''.join(frames)


if __name__  == '__main__':
    print('Initializing ...')
    # Runtime context: holds the audio and session handles.
    ctx = {'pya': None, 'out': None, 'conv': None, 'resp_done': threading.Event()}
    callback = MyCallback(ctx)
    conversation = OmniRealtimeConversation(
        model='qwen3-omni-flash-realtime',
        callback=callback,
        # The following is the URL for the International (Singapore) region. To use a model in the China (Beijing) region, replace the URL with wss://dashscope.aliyuncs.com/api-ws/v1/realtime
        url="wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime",
    )
    try:
        conversation.connect()
    except Exception as e:
        print('[Error] connect failed: {}'.format(e))
        sys.exit(1)

    ctx['conv'] = conversation
    # Session configuration: enable text and audio output, and disable server-side VAD to switch to manual recording.
    conversation.update_session(
        output_modalities=[MultiModality.AUDIO, MultiModality.TEXT],
        voice=voice,
        input_audio_format=AudioFormat.PCM_16000HZ_MONO_16BIT,
        output_audio_format=AudioFormat.PCM_24000HZ_MONO_16BIT,
        enable_input_audio_transcription=True,
        # The model used to transcribe input audio. Only gummy-realtime-v1 is supported.
        input_audio_transcription_model='gummy-realtime-v1',
        enable_turn_detection=False,
        instructions="You are Xiaoyun, a personal assistant. Answer the user's questions accurately and in a friendly manner, and always be helpful."
    )

    try:
        turn = 1
        while True:
            print(f"\n--- Turn {turn} ---")
            print("Press Enter to start recording (enter q to exit)...")
            user_input = input()
            if user_input.strip().lower() in ['q', 'quit']:
                print("User requested to exit...")
                break
            print("Recording... Press Enter again to stop.")
            if ctx['pya'] is None:
                ctx['pya'] = pyaudio.PyAudio()
            recorded = record_until_enter(ctx['pya'])
            if not recorded:
                print("No valid audio was recorded. Please try again.")
                continue
            print(f"Successfully recorded audio: {len(recorded)} bytes. Sending...")

            # Send in 3200-byte chunks (100 ms of 16 kHz, 16-bit mono audio).
            chunk_size = 3200
            for i in range(0, len(recorded), chunk_size):
                chunk = recorded[i:i+chunk_size]
                conversation.append_audio(base64.b64encode(chunk).decode('ascii'))

            print("Sending complete. Waiting for model response...")
            ctx['resp_done'].clear()
            conversation.commit()
            conversation.create_response()
            ctx['resp_done'].wait()
            print('Audio playback complete.')
            turn += 1
    except KeyboardInterrupt:
        print("\nProgram interrupted by user.")
    finally:
        shutdown_ctx(ctx)
        print("Program exited.")

Run manual_dash.py, press Enter to start speaking, and press Enter again to stop recording and receive the model's audio response.
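
The 3200-byte chunk size used in the sample corresponds to 100 ms of 16 kHz, 16-bit mono PCM. The arithmetic can be checked with a small helper. `chunk_duration_ms` is a hypothetical function written for illustration and is not part of the SDK.

```python
def chunk_duration_ms(num_bytes: int, sample_rate: int = 16000,
                      bits_per_sample: int = 16, channels: int = 1) -> float:
    """Return the playback duration, in milliseconds, of a raw PCM chunk."""
    # Bytes consumed per second of audio: samples/s * bytes/sample * channels.
    bytes_per_second = sample_rate * (bits_per_sample // 8) * channels
    return num_bytes * 1000 / bytes_per_second


# Input audio: 16 kHz * 2 bytes * 1 channel = 32,000 bytes per second.
print(chunk_duration_ms(3200))                     # 100.0
# Output audio at 24 kHz needs 4800 bytes for the same 100 ms.
print(chunk_duration_ms(4800, sample_rate=24000))  # 100.0
```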

DashScope Java SDK

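Choose an interaction mode

VAD mode (voice activity detection; the start and end of speech are detected automatically)
The Realtime API automatically determines when the user starts and stops speaking and responds accordingly.
Manual mode (press to talk, release to send)
The client controls the start and end of speech. After the user finishes speaking, the client must explicitly send a message to the server.

VAD mode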

OmniServerVad.java

import com.alibaba.dashscope.audio.omni.*;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.google.gson.JsonObject;
import javax.sound.sampled.*;
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.Base64;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicBoolean;

public class OmniServerVad {
    static class SequentialAudioPlayer {
        private final SourceDataLine line;
        private final Queue<byte[]> audioQueue = new ConcurrentLinkedQueue<>();
        private final Thread playerThread;
        private final AtomicBoolean shouldStop = new AtomicBoolean(false);

        public SequentialAudioPlayer() throws LineUnavailableException {
            AudioFormat format = new AudioFormat(24000, 16, 1, true, false);
            line = AudioSystem.getSourceDataLine(format);
            line.open(format);
            line.start();

            playerThread = new Thread(() -> {
                while (!shouldStop.get()) {
                    byte[] audio = audioQueue.poll();
                    if (audio != null) {
                        line.write(audio, 0, audio.length);
                    } else {
                        try { Thread.sleep(10); } catch (InterruptedException ignored) {}
                    }
                }
            }, "AudioPlayer");
            playerThread.start();
        }

        public void play(String base64Audio) {
            try {
                byte[] audio = Base64.getDecoder().decode(base64Audio);
                audioQueue.add(audio);
            } catch (Exception e) {
                System.err.println("Audio decoding failed: " + e.getMessage());
            }
        }

        public void cancel() {
            audioQueue.clear();
            line.flush();
        }

        public void close() {
            shouldStop.set(true);
            try { playerThread.join(1000); } catch (InterruptedException ignored) {}
            line.drain();
            line.close();
        }
    }

    public static void main(String[] args) {
        try {
            SequentialAudioPlayer player = new SequentialAudioPlayer();
            AtomicBoolean userIsSpeaking = new AtomicBoolean(false);
            AtomicBoolean shouldStop = new AtomicBoolean(false);

            OmniRealtimeParam param = OmniRealtimeParam.builder()
                    .model("qwen3-omni-flash-realtime")
                    .apikey(System.getenv("DASHSCOPE_API_KEY"))
                    // The following is the URL for the International (Singapore) region. To use a model in the China (Beijing) region, replace the URL with wss://dashscope.aliyuncs.com/api-ws/v1/realtime
                    .url("wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime")
                    .build();

            OmniRealtimeConversation conversation = new OmniRealtimeConversation(param, new OmniRealtimeCallback() {
                @Override public void onOpen() {
                    System.out.println("Connection established");
                }
                @Override public void onClose(int code, String reason) {
                    System.out.println("Connection closed (" + code + "): " + reason);
                    shouldStop.set(true);
                }
                @Override public void onEvent(JsonObject event) {
                    handleEvent(event, player, userIsSpeaking);
                }
            });

            conversation.connect();
            conversation.updateSession(OmniRealtimeConfig.builder()
                    .modalities(Arrays.asList(OmniRealtimeModality.AUDIO, OmniRealtimeModality.TEXT))
                    .voice("Cherry")
                    .enableTurnDetection(true)
                    .enableInputAudioTranscription(true)
                    .parameters(Map.of("instructions",
                            "You are an AI customer service agent for a five-star hotel. Answer customer inquiries about room types, facilities, prices, and booking policies accurately and friendly. Always respond with a professional and helpful attitude. Do not provide unconfirmed information or information beyond the scope of the hotel's services."))
                    .build()
            );

            System.out.println("Please start speaking (automatic detection of speech start/end, press Ctrl+C to exit)...");
            AudioFormat format = new AudioFormat(16000, 16, 1, true, false);
            TargetDataLine mic = AudioSystem.getTargetDataLine(format);
            mic.open(format);
            mic.start();

            ByteBuffer buffer = ByteBuffer.allocate(3200);
            while (!shouldStop.get()) {
                int bytesRead = mic.read(buffer.array(), 0, buffer.capacity());
                if (bytesRead > 0) {
                    try {
                        // Encode only the bytes actually read, not the whole buffer.
                        conversation.appendAudio(Base64.getEncoder().encodeToString(Arrays.copyOf(buffer.array(), bytesRead)));
                    } catch (Exception e) {
                        if (e.getMessage() != null && e.getMessage().contains("closed")) {
                            System.out.println("Conversation closed. Stopping recording.");
                            break;
                        }
                    }
                }
                Thread.sleep(20);
            }

            conversation.close(1000, "Normal exit");
            player.close();
            mic.close();
            System.out.println("\nProgram exited.");

        } catch (NoApiKeyException e) {
            System.err.println("API KEY not found: Please set the DASHSCOPE_API_KEY environment variable.");
            System.exit(1);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    private static void handleEvent(JsonObject event, SequentialAudioPlayer player, AtomicBoolean userIsSpeaking) {
        String type = event.get("type").getAsString();
        switch (type) {
            case "input_audio_buffer.speech_started":
                System.out.println("\n[User started speaking]");
                player.cancel();
                userIsSpeaking.set(true);
                break;
            case "input_audio_buffer.speech_stopped":
                System.out.println("[User stopped speaking]");
                userIsSpeaking.set(false);
                break;
            case "response.audio.delta":
                if (!userIsSpeaking.get()) {
                    player.play(event.get("delta").getAsString());
                }
                break;
            case "conversation.item.input_audio_transcription.completed":
                System.out.println("User: " + event.get("transcript").getAsString());
                break;
            case "response.audio_transcript.delta":
                System.out.print(event.get("delta").getAsString());
                break;
            case "response.done":
                System.out.println("Response complete");
                break;
        }
    }
}

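Run the OmniServerVad.main() method to talk with the Qwen-Omni-Realtime model in real time through your microphone. The system detects the start and end of your speech and sends it to the server automatically, without manual intervention.

Manual mode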

OmniWithoutServerVad.java

// Requires DashScope Java SDK version 2.20.9 or later

import com.alibaba.dashscope.audio.omni.*;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.google.gson.JsonObject;
import javax.sound.sampled.*;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Arrays;
import java.util.Base64;
import java.util.HashMap;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;

public class OmniWithoutServerVad {
    // Start of the RealtimePcmPlayer class definition
    public static class RealtimePcmPlayer {
        private int sampleRate;
        private SourceDataLine line;
        private AudioFormat audioFormat;
        private Thread decoderThread;
        private Thread playerThread;
        private AtomicBoolean stopped = new AtomicBoolean(false);
        private Queue<String> b64AudioBuffer = new ConcurrentLinkedQueue<>();
        private Queue<byte[]> RawAudioBuffer = new ConcurrentLinkedQueue<>();

        // The constructor initializes the audio format and the audio line.
        public RealtimePcmPlayer(int sampleRate) throws LineUnavailableException {
            this.sampleRate = sampleRate;
            this.audioFormat = new AudioFormat(this.sampleRate, 16, 1, true, false);
            DataLine.Info info = new DataLine.Info(SourceDataLine.class, audioFormat);
            line = (SourceDataLine) AudioSystem.getLine(info);
            line.open(audioFormat);
            line.start();
            decoderThread = new Thread(new Runnable() {
                @Override
                public void run() {
                    while (!stopped.get()) {
                        String b64Audio = b64AudioBuffer.poll();
                        if (b64Audio != null) {
                            byte[] rawAudio = Base64.getDecoder().decode(b64Audio);
                            RawAudioBuffer.add(rawAudio);
                        } else {
                            try {
                                Thread.sleep(100);
                            } catch (InterruptedException e) {
                                throw new RuntimeException(e);
                            }
                        }
                    }
                }
            });
            playerThread = new Thread(new Runnable() {
                @Override
                public void run() {
                    while (!stopped.get()) {
                        byte[] rawAudio = RawAudioBuffer.poll();
                        if (rawAudio != null) {
                            try {
                                playChunk(rawAudio);
                            } catch (IOException e) {
                                throw new RuntimeException(e);
                            } catch (InterruptedException e) {
                                throw new RuntimeException(e);
                            }
                        } else {
                            try {
                                Thread.sleep(100);
                            } catch (InterruptedException e) {
                                throw new RuntimeException(e);
                            }
                        }
                    }
                }
            });
            decoderThread.start();
            playerThread.start();
        }

        // Plays an audio chunk and blocks until playback is nearly complete.
        private void playChunk(byte[] chunk) throws IOException, InterruptedException {
            if (chunk == null || chunk.length == 0) return;

            int bytesWritten = 0;
            while (bytesWritten < chunk.length) {
                bytesWritten += line.write(chunk, bytesWritten, chunk.length - bytesWritten);
            }
            int audioLength = chunk.length / (this.sampleRate*2/1000);
            // Wait for the buffered audio to finish playing. Clamp at 0 because Thread.sleep rejects negative values.
            Thread.sleep(Math.max(0, audioLength - 10));
        }

        public void write(String b64Audio) {
            b64AudioBuffer.add(b64Audio);
        }

        public void cancel() {
            b64AudioBuffer.clear();
            RawAudioBuffer.clear();
        }

        public void waitForComplete() throws InterruptedException {
            while (!b64AudioBuffer.isEmpty() || !RawAudioBuffer.isEmpty()) {
                Thread.sleep(100);
            }
            line.drain();
        }

        public void shutdown() throws InterruptedException {
            stopped.set(true);
            decoderThread.join();
            playerThread.join();
            if (line != null && line.isRunning()) {
                line.drain();
                line.close();
            }
        }
    } // End of the RealtimePcmPlayer class definition
    // Records audio until Enter is pressed, then sends it to the server.
    private static void recordAndSend(TargetDataLine line, OmniRealtimeConversation conversation) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[3200];
        AtomicBoolean stopRecording = new AtomicBoolean(false);

        // Start a thread that listens for the Enter key.
        Thread enterKeyListener = new Thread(() -> {
            try {
                System.in.read();
                stopRecording.set(true);
            } catch (IOException e) {
                e.printStackTrace();
            }
        });
        enterKeyListener.start();

        // Recording loop
        while (!stopRecording.get()) {
            int count = line.read(buffer, 0, buffer.length);
            if (count > 0) {
                out.write(buffer, 0, count);
            }
        }

        // Send the recorded data.
        byte[] audioData = out.toByteArray();
        String audioB64 = Base64.getEncoder().encodeToString(audioData);
        conversation.appendAudio(audioB64);
        out.close();
    }

    public static void main(String[] args) throws InterruptedException, LineUnavailableException {
        OmniRealtimeParam param = OmniRealtimeParam.builder()
                .model("qwen3-omni-flash-realtime")
                // API keys differ between the Singapore and Beijing regions. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
                // If you have not set the environment variable, replace the next line with your Model Studio API key: .apikey("sk-xxx")
                .apikey(System.getenv("DASHSCOPE_API_KEY"))
                // The following is the URL for the Singapore region. To use a model in the Beijing region, replace the URL with wss://dashscope.aliyuncs.com/api-ws/v1/realtime
                .url("wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime")
                .build();
        AtomicReference<CountDownLatch> responseDoneLatch = new AtomicReference<>(null);
        responseDoneLatch.set(new CountDownLatch(1));

        RealtimePcmPlayer audioPlayer = new RealtimePcmPlayer(24000);
        final AtomicReference<OmniRealtimeConversation> conversationRef = new AtomicReference<>(null);
        OmniRealtimeConversation conversation = new OmniRealtimeConversation(param, new OmniRealtimeCallback() {
            @Override
            public void onOpen() {
                System.out.println("connection opened");
            }
            @Override
            public void onEvent(JsonObject message) {
                String type = message.get("type").getAsString();
                switch(type) {
                    case "session.created":
                        System.out.println("start session: " + message.get("session").getAsJsonObject().get("id").getAsString());
                        break;
                    case "conversation.item.input_audio_transcription.completed":
                        System.out.println("question: " + message.get("transcript").getAsString());
                        break;
                    case "response.audio_transcript.delta":
                        System.out.println("got llm response delta: " + message.get("delta").getAsString());
                        break;
                    case "response.audio.delta":
                        String recvAudioB64 = message.get("delta").getAsString();
                        audioPlayer.write(recvAudioB64);
                        break;
                    case "response.done":
                        System.out.println("======RESPONSE DONE======");
                        if (conversationRef.get() != null) {
                            System.out.println("[Metric] response: " + conversationRef.get().getResponseId() +
                                    ", first text delay: " + conversationRef.get().getFirstTextDelay() +
                                    " ms, first audio delay: " + conversationRef.get().getFirstAudioDelay() + " ms");
                        }
                        responseDoneLatch.get().countDown();
                        break;
                    default:
                        break;
                }
            }
            @Override
            public void onClose(int code, String reason) {
                System.out.println("connection closed code: " + code + ", reason: " + reason);
            }
        });
        conversationRef.set(conversation);
        try {
            conversation.connect();
        } catch (NoApiKeyException e) {
            throw new RuntimeException(e);
        }
        OmniRealtimeConfig config = OmniRealtimeConfig.builder()
                .modalities(Arrays.asList(OmniRealtimeModality.AUDIO, OmniRealtimeModality.TEXT))
                .voice("Cherry")
                .enableTurnDetection(false)
                // Set the model's role.
                .parameters(new HashMap<String, Object>() {{
                    put("instructions","You are a personal assistant named Xiaoyun. Answer the user's questions accurately and in a friendly manner, and always respond helpfully.");
                }})
                .build();
        conversation.updateSession(config);

        // Add microphone recording.
        AudioFormat format = new AudioFormat(16000, 16, 1, true, false);
        DataLine.Info info = new DataLine.Info(TargetDataLine.class, format);

        if (!AudioSystem.isLineSupported(info)) {
            System.out.println("Line not supported");
            return;
        }

        TargetDataLine line = null;
        try {
            line = (TargetDataLine) AudioSystem.getLine(info);
            line.open(format);
            line.start();

            while (true) {
                System.out.println("Press Enter to start recording...");
                System.in.read();
                System.out.println("Recording started. Please speak... Press Enter again to stop recording and send.");
                recordAndSend(line, conversation);
                conversation.commit();
                conversation.createResponse(null, null);
                // Reset the latch for the next wait.
                responseDoneLatch.set(new CountDownLatch(1));
            }
        } catch (LineUnavailableException | IOException e) {
            e.printStackTrace();
        } finally {
            if (line != null) {
                line.stop();
                line.close();
            }
        }
    }}

Run the OmniWithoutServerVad.main() method. Press Enter to start recording. Press Enter again while recording to stop and send the audio. The model's response is then received and played back.

WebSocket (Python)

Prepare the runtime environment
Python 3.10 or later is required.
First, install pyaudio according to your operating system.
macOS
```
brew install portaudio && pip install pyaudio
```
Debian/Ubuntu
```
sudo apt-get install python3-pyaudio

# or

pip install pyaudio
```
We recommend using pip install pyaudio. If the installation fails, first install the portaudio dependency for your operating system.
CentOS
```
sudo yum install -y portaudio portaudio-devel && pip install pyaudio
```
Windows
```
pip install pyaudio
```
After the installation is complete, use pip to install the WebSocket dependency:
```
pip install websockets==15.0.1
```

Create the client

In a local directory, create a new Python file named omni_realtime_client.py and copy the following code into it:

omni_realtime_client.py

import asyncio
import websockets
import json
import base64
import time
from typing import Optional, Callable, List, Dict, Any
from enum import Enum

class TurnDetectionMode(Enum):
    SERVER_VAD = "server_vad"
    MANUAL = "manual"

class OmniRealtimeClient:

    def __init__(
            self,
            base_url,
            api_key: str,
            model: str = "",
            voice: str = "Ethan",
            instructions: str = "You are a helpful assistant.",
            turn_detection_mode: TurnDetectionMode = TurnDetectionMode.SERVER_VAD,
            on_text_delta: Optional[Callable[[str], None]] = None,
            on_audio_delta: Optional[Callable[[bytes], None]] = None,
            on_input_transcript: Optional[Callable[[str], None]] = None,
            on_output_transcript: Optional[Callable[[str], None]] = None,
            extra_event_handlers: Optional[Dict[str, Callable[[Dict[str, Any]], None]]] = None
    ):
        self.base_url = base_url
        self.api_key = api_key
        self.model = model
        self.voice = voice
        self.instructions = instructions
        self.ws = None
        self.on_text_delta = on_text_delta
        self.on_audio_delta = on_audio_delta
        self.on_input_transcript = on_input_transcript
        self.on_output_transcript = on_output_transcript
        self.turn_detection_mode = turn_detection_mode
        self.extra_event_handlers = extra_event_handlers or {}

        # Current response state
        self._current_response_id = None
        self._current_item_id = None
        self._is_responding = False
        # Print state for input/output transcripts
        self._print_input_transcript = True
        self._output_transcript_buffer = ""

    async def connect(self) -> None:
        """Establish a WebSocket connection to the Realtime API."""
        url = f"{self.base_url}?model={self.model}"
        headers = {
            "Authorization": f"Bearer {self.api_key}"
        }
        self.ws = await websockets.connect(url, additional_headers=headers)

        # Session configuration
        session_config = {
            "modalities": ["text", "audio"],
            "voice": self.voice,
            "instructions": self.instructions,
            "input_audio_format": "pcm16",
            "output_audio_format": "pcm24",
            "input_audio_transcription": {
                "model": "gummy-realtime-v1"
            }
        }

        if self.turn_detection_mode == TurnDetectionMode.MANUAL:
            session_config['turn_detection'] = None
            await self.update_session(session_config)
        elif self.turn_detection_mode == TurnDetectionMode.SERVER_VAD:
            session_config['turn_detection'] = {
                "type": "server_vad",
                "threshold": 0.1,
                "prefix_padding_ms": 500,
                "silence_duration_ms": 900
            }
            await self.update_session(session_config)
        else:
            raise ValueError(f"Invalid turn detection mode: {self.turn_detection_mode}")

    async def send_event(self, event) -> None:
        event['event_id'] = "event_" + str(int(time.time() * 1000))
        await self.ws.send(json.dumps(event))

    async def update_session(self, config: Dict[str, Any]) -> None:
        """Update the session configuration."""
        event = {
            "type": "session.update",
            "session": config
        }
        await self.send_event(event)

    async def stream_audio(self, audio_chunk: bytes) -> None:
        """Stream raw audio data to the API."""
        # Only 16-bit, 16 kHz, mono PCM is supported.
        audio_b64 = base64.b64encode(audio_chunk).decode()
        append_event = {
            "type": "input_audio_buffer.append",
            "audio": audio_b64
        }
        await self.send_event(append_event)

    async def commit_audio_buffer(self) -> None:
        """Commit the audio buffer to trigger processing."""
        event = {
            "type": "input_audio_buffer.commit"
        }
        await self.send_event(event)

    async def append_image(self, image_chunk: bytes) -> None:
        """Append image data to the image buffer.
        The image data can come from a local file or a real-time video stream.
        Notes:
            - The image format must be JPG or JPEG. A resolution of 480p or 720p is recommended. The maximum supported resolution is 1080p.
            - A single image must not exceed 500 KB.
            - Encode the image data in Base64 before sending.
            - Send images to the server at a rate of no more than 2 frames per second.
            - Send audio data at least once before sending image data.
        """
        image_b64 = base64.b64encode(image_chunk).decode()
        event = {
            "type": "input_image_buffer.append",
            "image": image_b64
        }
        await self.send_event(event)

    async def create_response(self) -> None:
        """Ask the API to generate a response (call this only in manual mode)."""
        event = {
            "type": "response.create"
        }
        await self.send_event(event)

    async def cancel_response(self) -> None:
        """Cancel the current response."""
        event = {
            "type": "response.cancel"
        }
        await self.send_event(event)

    async def handle_interruption(self):
        """Handle a user interruption of the current response."""
        if not self._is_responding:
            return
        # 1. Cancel the current response.
        if self._current_response_id:
            await self.cancel_response()

        self._is_responding = False
        self._current_response_id = None
        self._current_item_id = None

    async def handle_messages(self) -> None:
        try:
            async for message in self.ws:
                event = json.loads(message)
                event_type = event.get("type")
                if event_type == "error":
                    print(" Error: ", event['error'])
                    continue
                elif event_type == "response.created":
                    self._current_response_id = event.get("response", {}).get("id")
                    self._is_responding = True
                elif event_type == "response.output_item.added":
                    self._current_item_id = event.get("item", {}).get("id")
                elif event_type == "response.done":
                    self._is_responding = False
                    self._current_response_id = None
                    self._current_item_id = None
                elif event_type == "input_audio_buffer.speech_started":
                    print("Speech start detected")
                    if self._is_responding:
                        print("Handling interruption")
                        await self.handle_interruption()
                elif event_type == "input_audio_buffer.speech_stopped":
                    print("Speech end detected")
                elif event_type == "response.text.delta":
                    if self.on_text_delta:
                        self.on_text_delta(event["delta"])
                elif event_type == "response.audio.delta":
                    if self.on_audio_delta:
                        audio_bytes = base64.b64decode(event["delta"])
                        self.on_audio_delta(audio_bytes)
                elif event_type == "conversation.item.input_audio_transcription.completed":
                    transcript = event.get("transcript", "")
                    print(f"User: {transcript}")
                    if self.on_input_transcript:
                        await asyncio.to_thread(self.on_input_transcript, transcript)
                        self._print_input_transcript = True
                elif event_type == "response.audio_transcript.delta":
                    if self.on_output_transcript:
                        delta = event.get("delta", "")
                        if not self._print_input_transcript:
                            self._output_transcript_buffer += delta
                        else:
                            if self._output_transcript_buffer:
                                await asyncio.to_thread(self.on_output_transcript, self._output_transcript_buffer)
                                self._output_transcript_buffer = ""
                            await asyncio.to_thread(self.on_output_transcript, delta)
                elif event_type == "response.audio_transcript.done":
                    print(f"LLM: {event.get('transcript', '')}")
                    self._print_input_transcript = False
                elif event_type in self.extra_event_handlers:
                    self.extra_event_handlers[event_type](event)
        except websockets.exceptions.ConnectionClosed:
            print(" Connection closed")
        except Exception as e:
            print(" Error in message handling: ", str(e))

    async def close(self) -> None:
        """Close the WebSocket connection."""
        if self.ws:
            await self.ws.close()
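
The constraints listed in the append_image docstring can be checked on the client before a frame is sent. The following is a minimal sketch and not part of the client above: build_image_event and MAX_IMAGE_BYTES are hypothetical names, and treating 500 KB as 500 × 1024 bytes is an assumption. It checks only the JPEG start-of-image marker and the size; resolution and frame rate are not checked.

```python
import base64

# Assumption: "500 KB" is interpreted as 500 * 1024 bytes.
MAX_IMAGE_BYTES = 500 * 1024


def build_image_event(image_bytes: bytes) -> dict:
    """Validate one JPEG frame and wrap it in an input_image_buffer.append event."""
    # JPEG data starts with the start-of-image marker FF D8.
    if not image_bytes.startswith(b"\xff\xd8"):
        raise ValueError("image does not look like a JPEG")
    if len(image_bytes) > MAX_IMAGE_BYTES:
        raise ValueError(
            f"image is {len(image_bytes)} bytes, limit is {MAX_IMAGE_BYTES}"
        )
    return {
        "type": "input_image_buffer.append",
        "image": base64.b64encode(image_bytes).decode("ascii"),
    }
```

A caller could pass the returned dictionary to send_event, or run the same checks before calling append_image.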

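Select an interaction mode

VAD mode (voice activity detection: automatically detects the start and end of speech)
The Realtime API automatically determines when the user starts and stops speaking and responds accordingly.
Manual mode (press to talk, release to send)
The client controls the start and end of speech. After the user finishes speaking, the client must explicitly send a message to the server.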
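
The difference between the two modes comes down to the turn_detection field of the session.update event and to which events the client sends when the user stops speaking. The sketch below mirrors the values used in omni_realtime_client.py. The function names build_session_update and end_of_turn_events are hypothetical and exist only for illustration.

```python
import json


def build_session_update(mode: str, voice: str = "Cherry") -> dict:
    """Build a session.update event for "server_vad" or "manual" mode."""
    session = {
        "modalities": ["text", "audio"],
        "voice": voice,
        "input_audio_format": "pcm16",
        "output_audio_format": "pcm24",
    }
    if mode == "server_vad":
        # The server detects the start and end of speech.
        session["turn_detection"] = {
            "type": "server_vad",
            "threshold": 0.1,
            "prefix_padding_ms": 500,
            "silence_duration_ms": 900,
        }
    elif mode == "manual":
        # None is serialized as JSON null, which disables turn detection.
        session["turn_detection"] = None
    else:
        raise ValueError(f"unknown mode: {mode}")
    return {"type": "session.update", "session": session}


def end_of_turn_events(mode: str) -> list:
    """Events the client sends itself once the user has finished speaking."""
    if mode == "manual":
        return [{"type": "input_audio_buffer.commit"}, {"type": "response.create"}]
    return []  # In VAD mode the server decides when the turn ends.


if __name__ == "__main__":
    print(json.dumps(build_session_update("manual"), indent=2))
```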

VAD mode

In the same directory as omni_realtime_client.py, create another Python file named vad_mode.py and copy the following code into it:

vad_mode.py

# -*- coding: utf-8 -*-
import os, asyncio, pyaudio, queue, threading
from omni_realtime_client import OmniRealtimeClient, TurnDetectionMode

# Audio player class (handles interruptions)
class AudioPlayer:
    def __init__(self, pyaudio_instance, rate=24000):
        self.stream = pyaudio_instance.open(format=pyaudio.paInt16, channels=1, rate=rate, output=True)
        self.queue = queue.Queue()
        self.stop_evt = threading.Event()
        self.interrupt_evt = threading.Event()
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while not self.stop_evt.is_set():
            try:
                data = self.queue.get(timeout=0.5)
                if data is None: break
                if not self.interrupt_evt.is_set(): self.stream.write(data)
                self.queue.task_done()
            except queue.Empty: continue

    def add_audio(self, data): self.queue.put(data)
    def handle_interrupt(self): self.interrupt_evt.set(); self.queue.queue.clear()
    def stop(self): self.stop_evt.set(); self.queue.put(None); self.stream.stop_stream(); self.stream.close()

# Record from the microphone and send
async def record_and_send(client):
    p = pyaudio.PyAudio()
    stream = p.open(format=pyaudio.paInt16, channels=1, rate=16000, input=True, frames_per_buffer=3200)
    print("Recording started. Please speak...")
    try:
        while True:
            audio_data = stream.read(3200)
            await client.stream_audio(audio_data)
            await asyncio.sleep(0.02)
    finally:
        stream.stop_stream(); stream.close(); p.terminate()

async def main():
    p = pyaudio.PyAudio()
    player = AudioPlayer(pyaudio_instance=p)

    client = OmniRealtimeClient(
        # The following base_url is for the International (Singapore) region. The base_url for the China (Beijing) region is wss://dashscope.aliyuncs.com/api-ws/v1/realtime
        base_url="wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime",
        api_key=os.environ.get("DASHSCOPE_API_KEY"),
        model="qwen3-omni-flash-realtime",
        voice="Cherry",
        instructions="You are Xiaoyun, a witty and humorous assistant.",
        turn_detection_mode=TurnDetectionMode.SERVER_VAD,
        on_text_delta=lambda t: print(f"\nAssistant: {t}", end="", flush=True),
        on_audio_delta=player.add_audio,
    )

    await client.connect()
    print("Connection successful. Starting real-time conversation...")

    # Run concurrently
    await asyncio.gather(client.handle_messages(), record_and_send(client))

if __name__ == "__main__":
    try:
        asyncio.run(main())
    except KeyboardInterrupt:
        print("\nProgram exited.")

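Run vad_mode.py to talk with the Qwen-Omni-Realtime model in real time through your microphone. The system detects the start and end of your speech and sends it to the server automatically, with no manual intervention.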
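
When the user starts speaking over a reply in VAD mode, the client cancels the current response, but audio chunks that were already queued for playback would still be played unless the queue is emptied. The following is a minimal sketch of draining a queue.Queue through its public API only. drain_queue is a hypothetical helper and is not part of vad_mode.py.

```python
import queue


def drain_queue(q: queue.Queue) -> int:
    """Remove every pending item from q and return how many were dropped."""
    dropped = 0
    while True:
        try:
            q.get_nowait()
        except queue.Empty:
            return dropped
        dropped += 1
```

Using get_nowait keeps the operation behind the queue's own lock, so a playback thread that calls get at the same time sees a consistent queue.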

Manual mode

In the same directory as omni_realtime_client.py, create another Python file named manual_mode.py and copy the following code into it:

manual_mode.py

# -*- coding: utf-8 -*-
import os
import asyncio
import time
import threading
import queue
import pyaudio
from omni_realtime_client import OmniRealtimeClient, TurnDetectionMode


class AudioPlayer:
    """Real-time audio player class"""

    def __init__(self, sample_rate=24000, channels=1, sample_width=2):
        self.sample_rate = sample_rate
        self.channels = channels
        self.sample_width = sample_width  # 2 bytes for 16-bit samples
        self.audio_queue = queue.Queue()
        self.is_playing = False
        self.play_thread = None
        self.pyaudio_instance = None
        self.stream = None
        self._lock = threading.Lock()  # Lock for synchronized access
        self._last_data_time = time.time()  # Time the last data was received
        self._response_done = False  # Flag indicating that the response is complete
        self._waiting_for_response = False  # Flag indicating whether we are waiting for the server's response
        # Time the last data was written to the audio stream and the duration of the latest audio chunk, for more accurate end-of-playback detection
        self._last_play_time = time.time()
        self._last_chunk_duration = 0.0

    def start(self):
        """Start the audio player"""
        with self._lock:
            if self.is_playing:
                return

            self.is_playing = True

            try:
                self.pyaudio_instance = pyaudio.PyAudio()

                # Create the audio output stream
                self.stream = self.pyaudio_instance.open(
                    format=pyaudio.paInt16,  # 16-bit
                    channels=self.channels,
                    rate=self.sample_rate,
                    output=True,
                    frames_per_buffer=1024
                )

                # Start the playback thread
                self.play_thread = threading.Thread(target=self._play_audio)
                self.play_thread.daemon = True
                self.play_thread.start()

                print("Audio player started")
            except Exception as e:
                print(f"Failed to start audio player: {e}")
                self._cleanup_resources()
                raise

    def stop(self):
        """Stop the audio player"""
        with self._lock:
            if not self.is_playing:
                return

            self.is_playing = False

        # Clear the queue
        while not self.audio_queue.empty():
            try:
                self.audio_queue.get_nowait()
            except queue.Empty:
                break

        # Wait for the playback thread to exit (outside the lock, to avoid a deadlock)
        if self.play_thread and self.play_thread.is_alive():
            self.play_thread.join(timeout=2.0)

        # Reacquire the lock to clean up resources
        with self._lock:
            self._cleanup_resources()

        print("Audio player stopped")

    def _cleanup_resources(self):
        """Clean up audio resources (must be called while holding the lock)"""
        try:
            # Close the audio stream
            if self.stream:
                if not self.stream.is_stopped():
                    self.stream.stop_stream()
                self.stream.close()
                self.stream = None
        except Exception as e:
            print(f"Error closing audio stream: {e}")

        try:
            if self.pyaudio_instance:
                self.pyaudio_instance.terminate()
                self.pyaudio_instance = None
        except Exception as e:
            print(f"Error terminating PyAudio: {e}")

    def add_audio_data(self, audio_data):
        """Add audio data to the playback queue"""
        if self.is_playing and audio_data:
            self.audio_queue.put(audio_data)
            with self._lock:
                self._last_data_time = time.time()  # Update the time the last data was received
                self._waiting_for_response = False  # Data received, no longer waiting

    def stop_receiving_data(self):
        """Mark that no more new audio data will be received"""
        with self._lock:
            self._response_done = True
            self._waiting_for_response = False  # Response finished, no longer waiting

    def prepare_for_next_turn(self):
        """Reset the player state for the next conversation turn."""
        with self._lock:
            self._response_done = False
            self._last_data_time = time.time()
            self._last_play_time = time.time()
            self._last_chunk_duration = 0.0
            self._waiting_for_response = True  # Start waiting for the next response

        # Clear any audio data left over from the previous turn
        while not self.audio_queue.empty():
            try:
                self.audio_queue.get_nowait()
            except queue.Empty:
                break

    def is_finished_playing(self):
        """Check whether all audio data has been played"""
        with self._lock:
            queue_size = self.audio_queue.qsize()
            time_since_last_data = time.time() - self._last_data_time
            time_since_last_play = time.time() - self._last_play_time

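            # ---------------------- Smart end detection ----------------------
            # 1. Preferred: the server has marked the response as done and the playback queue is empty.
            #    Wait for the latest audio chunk to finish playing (chunk duration + 0.1 second tolerance).
            if self._response_done and queue_size == 0:
                min_wait = max(self._last_chunk_duration + 0.1, 0.5)  # Wait at least 0.5 seconds
                if time_since_last_play >= min_wait:
                    return True

            # 2. Fallback: no new data has been received for a long time and the playback queue is empty.
            #    This logic is a safeguard for the case where the server does not explicitly send `response.done`.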
            if not self._waiting_for_response and queue_size == 0 and time_since_last_data > 1.0:
                print("\n(No new audio received for a while, assuming playback is finished)")
                return True

            return False

    def _play_audio(self):
        """Worker thread that plays audio data"""
        while True:
            # Check whether we should stop
            with self._lock:
                if not self.is_playing:
                    break
                stream_ref = self.stream  # Take a reference to the stream

            try:
                # Get audio data from the queue, with a 0.1-second timeout
                audio_data = self.audio_queue.get(timeout=0.1)

                # Re-check the state and that the stream is still valid
                with self._lock:
                    if self.is_playing and stream_ref and not stream_ref.is_stopped():
                        try:
                            # Play the audio data
                            stream_ref.write(audio_data)
                            # Update the latest playback information
                            self._last_play_time = time.time()
                            self._last_chunk_duration = len(audio_data) / (
                                        self.channels * self.sample_width) / self.sample_rate
                        except Exception as e:
                            print(f"Error writing to audio stream: {e}")
                            break

                # Mark this data block as processed
                self.audio_queue.task_done()

            except queue.Empty:
                # Keep waiting if the queue is empty
                continue
            except Exception as e:
                print(f"Error playing audio: {e}")
                break


class MicrophoneRecorder:
    """Real-time microphone recorder"""

    def __init__(self, sample_rate=16000, channels=1, chunk_size=3200):
        self.sample_rate = sample_rate
        self.channels = channels
        self.chunk_size = chunk_size
        self.pyaudio_instance = None
        self.stream = None
        self.frames = []
        self._is_recording = False
        self._record_thread = None

    def _recording_thread(self):
        """Recording worker thread"""
        # Keep reading data from the audio stream while _is_recording is True
        while self._is_recording:
            try:
                # Use exception_on_overflow=False to avoid crashing on buffer overflow
                data = self.stream.read(self.chunk_size, exception_on_overflow=False)
                self.frames.append(data)
            except (IOError, OSError) as e:
                # Reading may raise an error once the stream has been closed
                print(f"Error reading from recording stream, it might be closed: {e}")
                break

    def start(self):
        """Start recording"""
        if self._is_recording:
            print("Recording is already in progress.")
            return

        self.frames = []
        self._is_recording = True

        try:
            self.pyaudio_instance = pyaudio.PyAudio()
            self.stream = self.pyaudio_instance.open(
                format=pyaudio.paInt16,
                channels=self.channels,
                rate=self.sample_rate,
                input=True,
                frames_per_buffer=self.chunk_size
            )

            self._record_thread = threading.Thread(target=self._recording_thread)
            self._record_thread.daemon = True
            self._record_thread.start()
            print("Microphone recording started...")
        except Exception as e:
            print(f"Failed to start microphone: {e}")
            self._is_recording = False
            self._cleanup()
            raise

    def stop(self):
        """Stop recording and return the audio data"""
        if not self._is_recording:
            return None

        self._is_recording = False

        # Wait for the recording thread to exit safely
        if self._record_thread:
            self._record_thread.join(timeout=1.0)

        self._cleanup()

        print("Microphone recording stopped.")
        return b''.join(self.frames)

    def _cleanup(self):
        """Safely clean up PyAudio resources"""
        if self.stream:
            try:
                if self.stream.is_active():
                    self.stream.stop_stream()
                self.stream.close()
            except Exception as e:
                print(f"Error closing audio stream: {e}")

        if self.pyaudio_instance:
            try:
                self.pyaudio_instance.terminate()
            except Exception as e:
                print(f"Error terminating PyAudio instance: {e}")

        self.stream = None
        self.pyaudio_instance = None


async def interactive_test():
    """
    Interactive test script: supports multi-turn conversation, sending audio and images in each turn.
    """
    # ------------------- 1. Initialization and connection (one time) -------------------
    # The API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
    api_key = os.environ.get("DASHSCOPE_API_KEY")
    if not api_key:
        print("Please set the DASHSCOPE_API_KEY environment variable.")
        return

    print("--- Real-time Multimodal Audio/Video Chat Client ---")
    print("Initializing audio player and client...")

    audio_player = AudioPlayer()
    audio_player.start()

    def on_audio_received(audio_data):
        audio_player.add_audio_data(audio_data)

    def on_response_done(event):
        print("\n(Received response end marker)")
        audio_player.stop_receiving_data()

    realtime_client = OmniRealtimeClient(
        # The following base_url is for the Singapore region. To use a model in the Beijing region, replace base_url with wss://dashscope.aliyuncs.com/api-ws/v1/realtime
        base_url="wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime",
        api_key=api_key,
        model="qwen3-omni-flash-realtime",
        voice="Ethan",
        instructions="You are a personal assistant named Xiaoyun. Answer the user's questions accurately and in a friendly manner, and always respond helpfully.", # Set the model's role
        on_text_delta=lambda text: print(f"Assistant reply: {text}", end="", flush=True),
        on_audio_delta=on_audio_received,
        turn_detection_mode=TurnDetectionMode.MANUAL,
        extra_event_handlers={"response.done": on_response_done}
    )

    message_handler_task = None
    try:
        await realtime_client.connect()
        print("Connected to the server. Enter 'q' or 'quit' to exit at any time.")
        message_handler_task = asyncio.create_task(realtime_client.handle_messages())
        await asyncio.sleep(0.5)

        turn_counter = 1
        # ------------------- 2. Multi-turn conversation loop -------------------
        while True:
            print(f"\n--- Turn {turn_counter} ---")
            audio_player.prepare_for_next_turn()

            recorded_audio = None
            image_paths = []

            # --- Get user input: record from the microphone ---
            loop = asyncio.get_running_loop()
            recorder = MicrophoneRecorder(sample_rate=16000)  # A 16 kHz sample rate is recommended for speech recognition

            print("Ready to record. Press Enter to start recording (or enter 'q' to exit)...")
            user_input = await loop.run_in_executor(None, input)
            if user_input.strip().lower() in ['q', 'quit']:
                print("User requested to exit...")
                return

            try:
                recorder.start()
            except Exception:
                print("Could not start recording. Please check your microphone permissions and device. Skipping this turn.")
                continue

            print("Recording... Press Enter again to stop.")
            await loop.run_in_executor(None, input)

            recorded_audio = recorder.stop()

            if not recorded_audio or len(recorded_audio) == 0:
                print("No valid audio was recorded. Please start this turn again.")
                continue

            # --- Get image input (optional) ---
            # The image input feature below is commented out and temporarily disabled. To enable it, uncomment the code below.
            # print("\nEnter the absolute path of an [image file] on each line (optional). When finished, enter 's' or press Enter to send the request.")
            # while True:
            #     path = input("Image path: ").strip()
            #     if path.lower() == 's' or path == '':
            #         break
            #     if path.lower() in ['q', 'quit']:
            #         print("User requested to exit...")
            #         return
            #
            #     if not os.path.isabs(path):
            #         print("Error: Please enter an absolute path.")
            #         continue
            #     if not os.path.exists(path):
            #         print(f"Error: File not found -> {path}")
            #         continue
            #     image_paths.append(path)
            #     print(f"Image added: {os.path.basename(path)}")

            # --- 3. Send data and get the response ---
            print("\n--- Input Confirmation ---")
            print(f"Audio to process: 1 (from microphone), Images: {len(image_paths)}")
            print("------------------")

            # 3.1 Send the recorded audio
            try:
                print(f"Sending microphone recording ({len(recorded_audio)} bytes)")
                await realtime_client.stream_audio(recorded_audio)
                await asyncio.sleep(0.1)
            except Exception as e:
                print(f"Failed to send microphone recording: {e}")
                continue

            # 3.2 Send all image files
            # The image sending code below is commented out and temporarily disabled.
            # for i, path in enumerate(image_paths):
            #     try:
            #         with open(path, "rb") as f:
            #             data = f.read()
            #         print(f"Sending image {i+1}: {os.path.basename(path)} ({len(data)} bytes)")
            #         await realtime_client.append_image(data)
            #         await asyncio.sleep(0.1)
            #     except Exception as e:
            #         print(f"Failed to send image {os.path.basename(path)}: {e}")

            # 3.3 Commit all inputs and wait for the response
            print("Submitting all inputs, requesting server response...")
            await realtime_client.commit_audio_buffer()
            await realtime_client.create_response()

            print("Waiting for and playing server response audio...")
            start_time = time.time()
            max_wait_time = 60
            while not audio_player.is_finished_playing():
                if time.time() - start_time > max_wait_time:
                    print(f"\nWait timed out ({max_wait_time} seconds). Moving to the next turn.")
                    break
                await asyncio.sleep(0.2)

            print("\nAudio playback for this turn is complete!")
            turn_counter += 1

    except (asyncio.CancelledError, KeyboardInterrupt):
        print("\nProgram was interrupted.")
    except Exception as e:
        print(f"An unhandled error occurred: {e}")
    finally:
        # ------------------- 4. Clean up resources -------------------
        print("\nClosing connection and cleaning up resources...")
        if message_handler_task and not message_handler_task.done():
            message_handler_task.cancel()

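        # `ws.close` is a bound method and therefore always truthy, so testing
        # `not realtime_client.ws.close` would always skip the cleanup.
        # Only check that a socket object exists before closing.
        if 'realtime_client' in locals() and realtime_client.ws: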
            await realtime_client.close()
            print("Connection closed.")

        audio_player.stop()
        print("Program exited.")


if __name__ == "__main__":
    try:
        asyncio.run(interactive_test())
    except KeyboardInterrupt:
        print("\nProgram was forcibly exited by the user.")

Run manual_mode.py, press Enter to start speaking, and press Enter again to receive the model's audio response.

Interaction flow

VAD mode

Set session.turn_detection to "server_vad" in the session.update event to enable VAD mode. In this mode, the server automatically detects the start and end of speech and responds accordingly. This mode suits voice-call scenarios.

The interaction flow is as follows:

The server detects the start of speech and sends an input_audio_buffer.speech_started event.
The client can send input_audio_buffer.append and input_image_buffer.append events at any time to add audio and images to the buffer.
Before sending an input_image_buffer.append event, the client must send at least one input_audio_buffer.append event.
The server detects the end of speech and sends an input_audio_buffer.speech_stopped event.
The server sends an input_audio_buffer.committed event to commit the audio buffer.
The server sends a conversation.item.created event, which contains the user message item created from the buffer.

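Lifecycle	Client events	Server events
Session initialization	session.update: session configuration	session.created: the session is created; session.updated: the session configuration is updated
User audio input	input_audio_buffer.append: add audio to the buffer; input_image_buffer.append: add an image to the buffer	input_audio_buffer.speech_started: speech start detected; input_audio_buffer.speech_stopped: speech end detected; input_audio_buffer.committed: the server has received the submitted audio
Server audio output	None	response.created: the server starts generating a response; response.output_item.added: new output content in the response; conversation.item.created: a conversation item is created; response.content_part.added: new output content is added to the assistant message; response.audio_transcript.delta: incrementally generated transcript text; response.audio.delta: incrementally generated audio from the model; response.audio_transcript.done: text transcription is complete; response.audio.done: audio generation is complete; response.content_part.done: streaming of the assistant message's text or audio content is complete; response.output_item.done: streaming of the entire output item of the assistant message is complete; response.done: the response is complete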
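
The VAD-mode flow above can be sketched as a small state tracker driven by the server event names listed in the table. This is only an illustration under stated assumptions, not the SDK's implementation: `build_vad_session_update` and `VadTurnTracker` are hypothetical names, and the nested `{"type": "server_vad"}` shape of `turn_detection` is an assumption that should be checked against the API reference.

```python
import json


def build_vad_session_update() -> str:
    """Return a session.update event (JSON text) that enables VAD mode."""
    event = {
        "type": "session.update",
        # Assumption: turn_detection is an object whose "type" is "server_vad".
        "session": {"turn_detection": {"type": "server_vad"}},
    }
    return json.dumps(event)


class VadTurnTracker:
    """Track the phase of a VAD-mode turn from server event types."""

    # Server event type -> phase name (phase names are illustrative only).
    TRANSITIONS = {
        "input_audio_buffer.speech_started": "user_speaking",
        "input_audio_buffer.speech_stopped": "speech_ended",
        "input_audio_buffer.committed": "committed",
        "response.created": "responding",
        "response.done": "idle",
    }

    def __init__(self) -> None:
        self.state = "idle"

    def on_server_event(self, event: dict) -> str:
        # Events that are not in the table leave the phase unchanged.
        self.state = self.TRANSITIONS.get(event.get("type"), self.state)
        return self.state
```

In VAD mode the client never sends a commit or response.create event itself; it only streams input and reacts to the phases reported by the server.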

Manual mode

Set session.turn_detection to null in the session.update event to enable manual mode. In this mode, the client explicitly sends input_audio_buffer.commit and response.create events to request a server response. This mode suits push-to-talk scenarios, such as sending voice messages in a chat application.

The interaction flow is as follows:

The client can send input_audio_buffer.append and input_image_buffer.append events at any time to add audio and images to the buffer.
Before sending an input_image_buffer.append event, the client must send at least one input_audio_buffer.append event.
The client sends an input_audio_buffer.commit event to commit the audio and image buffers. This tells the server that all user input for the current turn (including audio and images) has been sent.
The server responds with an input_audio_buffer.committed event.
The client sends a response.create event and waits for the model's output from the server.
The server responds with a conversation.item.created event.

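Lifecycle	Client events	Server events
Session initialization	session.update: session configuration	session.created: the session is created; session.updated: the session configuration is updated
User audio input	input_audio_buffer.append: add audio to the buffer; input_image_buffer.append: add an image to the buffer; input_audio_buffer.commit: submit the audio and images to the server; response.create: create a model response	input_audio_buffer.committed: the server has received the submitted audio
Server audio output	input_audio_buffer.clear: clear the audio from the buffer	response.created: the server starts generating a response; response.output_item.added: new output content in the response; conversation.item.created: a conversation item is created; response.content_part.added: new output content is added to the assistant message item; response.audio_transcript.delta: incrementally generated transcript text; response.audio.delta: incrementally generated audio from the model; response.audio_transcript.done: text transcription is complete; response.audio.done: audio generation is complete; response.content_part.done: streaming of the assistant message's text or audio content is complete; response.output_item.done: streaming of the entire output item of the assistant message is complete; response.done: the response is complete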
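
The manual-mode sequence above can be sketched as a helper that assembles the client events for one turn in the required order. This is a sketch under stated assumptions rather than a definitive client: `build_manual_turn` is a hypothetical name, and the `audio` and `image` payload fields carrying Base64 text are assumptions to verify against the API reference. Only the event types and their ordering come from the flow described here.

```python
import base64
import json

# Sent once per session: turn_detection set to null enables manual mode.
MANUAL_SESSION_UPDATE = json.dumps(
    {"type": "session.update", "session": {"turn_detection": None}}
)


def build_manual_turn(audio_chunks: list, images: list) -> list:
    """Return the JSON events for one manual-mode turn, in send order."""
    if images and not audio_chunks:
        # At least one audio append must precede any image append.
        raise ValueError("send at least one audio chunk before any image")
    events = []
    for chunk in audio_chunks:
        events.append({
            "type": "input_audio_buffer.append",
            # Assumption: the payload field is named "audio" and is Base64 text.
            "audio": base64.b64encode(chunk).decode("ascii"),
        })
    for image in images:
        events.append({
            "type": "input_image_buffer.append",
            # Assumption: the payload field is named "image" and is Base64 text.
            "image": base64.b64encode(image).decode("ascii"),
        })
    # Commit the buffers, then explicitly request the model response.
    events.append({"type": "input_audio_buffer.commit"})
    events.append({"type": "response.create"})
    return [json.dumps(event) for event in events]
```

Each returned string would be sent over the WebSocket connection in order; the server's input_audio_buffer.committed event is expected after the commit.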

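API reference

Billing and throttling

Billing rules

Qwen-Omni-Realtime models are billed by the number of tokens consumed across input modalities such as audio and images. For billing details, see "Model list".

Rules for converting audio and images to tokens

Audio

Qwen-Omni-Turbo-Realtime: total tokens = audio duration (seconds) × 25. Audio shorter than 1 second is counted as 1 second.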
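
As a quick illustration of the audio rule, the arithmetic can be written as a one-line estimate. `audio_tokens` is a hypothetical helper, and the sketch assumes fractional seconds above 1 second are not rounded, which the rule above does not specify.

```python
def audio_tokens(duration_seconds: float, tokens_per_second: int = 25) -> float:
    """Estimate audio tokens for Qwen-Omni-Turbo-Realtime (25 tokens per second)."""
    if duration_seconds < 0:
        raise ValueError("duration_seconds must not be negative")
    # Audio shorter than 1 second is counted as 1 second.
    billed_seconds = max(duration_seconds, 1.0)
    return billed_seconds * tokens_per_second


print(audio_tokens(0.4))  # counted as 1 second -> 25.0
print(audio_tokens(10))   # 10 seconds -> 250.0
```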

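Images

Qwen3-Omni-Flash-Realtime models: 1 token per 32×32 pixels
Qwen-Omni-Turbo-Realtime models: 1 token per 28×28 pixels

Each image requires at least 4 tokens and supports at most 1,280 tokens. You can use the following code to estimate the total number of tokens an image consumes: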

# Install the Pillow library with: pip install Pillow
from PIL import Image
import math

# For Qwen-Omni-Turbo-Realtime models, the scaling factor is 28.
# factor = 28
# For Qwen3-Omni-Flash-Realtime models, the scaling factor is 32.
factor = 32

def token_calculate(image_path='', duration=10):
    """
    :param image_path: Path to the image.
    :param duration: Duration of the session connection.
    :return: Total number of image tokens.
    """
    if not image_path:
        # Without an image there are no dimensions to scale.
        raise ValueError("image_path is required")
    # Open the specified image file.
    image = Image.open(image_path)
    # Get the original dimensions of the image.
    height = image.height
    width = image.width
    print(f"Image dimensions before scaling: height={height}, width={width}")
    # Round the height to an integer multiple of the factor.
    h_bar = round(height / factor) * factor
    # Round the width to an integer multiple of the factor.
    w_bar = round(width / factor) * factor
    # Lower limit for image tokens: 4 tokens.
    min_pixels = factor * factor * 4
    # Upper limit for image tokens: 1280 tokens.
    max_pixels = 1280 * factor * factor
    # Scale the image so that its total pixel count falls within [min_pixels, max_pixels].
    if h_bar * w_bar > max_pixels:
        # Compute the scaling coefficient beta so that the scaled image does not exceed max_pixels.
        beta = math.sqrt((height * width) / max_pixels)
        # Recompute the height as an integer multiple of the factor.
        h_bar = math.floor(height / beta / factor) * factor
        # Recompute the width as an integer multiple of the factor.
        w_bar = math.floor(width / beta / factor) * factor
    elif h_bar * w_bar < min_pixels:
        # Compute the scaling coefficient beta so that the scaled image is not smaller than min_pixels.
        beta = math.sqrt(min_pixels / (height * width))
        # Recompute the height as an integer multiple of the factor.
        h_bar = math.ceil(height * beta / factor) * factor
        # Recompute the width as an integer multiple of the factor.
        w_bar = math.ceil(width * beta / factor) * factor
    print(f"Image dimensions after scaling: height={h_bar}, width={w_bar}")
    # Image tokens: total pixel count divided by (factor * factor).
    token = int((h_bar * w_bar) / (factor * factor))
    print(f"Number of tokens after scaling: {token}")
    # Multiply the per-image token count by ceil(duration / 2).
    total_token = token * math.ceil(duration / 2)
    print(f"Total number of tokens: {total_token}")
    return total_token

if __name__ == "__main__":
    total_token = token_calculate(image_path="xxx/test.jpg", duration=10)
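
To check the scaling arithmetic without an image file, the same steps can be applied to raw pixel dimensions. The sketch below is a worked example only: `image_tokens` is a hypothetical helper that mirrors the per-image part of the calculation above and leaves out the duration multiplier.

```python
import math


def image_tokens(height: int, width: int, factor: int = 32) -> int:
    """Estimate the tokens for one image from its pixel dimensions."""
    min_pixels = 4 * factor * factor      # lower limit: 4 tokens
    max_pixels = 1280 * factor * factor   # upper limit: 1280 tokens
    # Round both sides to an integer multiple of the factor.
    h_bar = round(height / factor) * factor
    w_bar = round(width / factor) * factor
    if h_bar * w_bar > max_pixels:
        # Too large: shrink so the pixel count does not exceed max_pixels.
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = math.floor(height / beta / factor) * factor
        w_bar = math.floor(width / beta / factor) * factor
    elif h_bar * w_bar < min_pixels:
        # Too small: enlarge so the pixel count reaches min_pixels.
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return (h_bar * w_bar) // (factor * factor)


print(image_tokens(480, 640))    # 15 x 20 blocks of 32x32 pixels -> 300
print(image_tokens(1080, 1920))  # scaled down to 832 x 1504 -> 1222
```

The 1080×1920 case shows the upper limit in action: the rounded image would exceed 1,280 tokens, so it is scaled down to 26 × 47 blocks.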

Throttling

For details about the model throttling rules, see "Throttling".

Error codes

If a call fails, see "Error messages" for troubleshooting.

Voice list

Qwen3-Omni-Flash-Realtime

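Name	`voice` parameter	Voice effect	Description	Supported languages
Cherry	Cherry		A bright, friendly, and natural young woman's voice.	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai
Ethan	Ethan		Standard Mandarin with a slight northern accent. A bright, warm, and energetic voice.	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai
Nofish	Nofish		A designer who does not use retroflex sounds.	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai
Jennifer	Jennifer		A premium, cinematic American English female voice.	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai
Ryan	Ryan		A rhythmic, dramatic voice with realism and tension.	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai
Katerina	Katerina		A mature, rhythmic female voice.	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai
Elias	Elias		Explains complex topics with academic rigor and clear storytelling.	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai
Shanghai-Jada	Jada		A lively woman from Shanghai.	Chinese (Shanghainese), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai
Beijing-Dylan	Dylan		A teenage boy who grew up in Beijing's hutongs.	Chinese (Beijing dialect), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai
Sichuan-Sunny	Sunny		A sweet female voice from Sichuan.	Chinese (Sichuanese), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai
Nanjing-Li	Li		A patient yoga teacher.	Chinese (Nanjing dialect), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai
Shaanxi-Marcus	Marcus		A sincere, deep voice with a Shaanxi accent.	Chinese (Shaanxi dialect), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai
Minnan-Roy	Roy		A humorous, lively young male voice with a Minnan accent.	Chinese (Minnan), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai
Tianjin-Peter	Peter		A voice for the straight man in Tianjin crosstalk.	Chinese (Tianjin dialect), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai
Cantonese-Rocky	Rocky		A witty, humorous male voice for online chat.	Chinese (Cantonese), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai
Cantonese-Kiki	Kiki		A gentle best friend from Hong Kong.	Chinese (Cantonese), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai
Sichuan-Eric	Eric		An unconventional, refined male voice from Chengdu, Sichuan.	Chinese (Sichuanese), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai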

Qwen-Omni-Turbo-Realtime

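Name	`voice` parameter	Voice effect	Description	Supported languages
Cherry	Cherry		A bright, friendly, and sincere young woman.	Chinese, English
Serena	Serena		A kind young woman.	Chinese, English
Ethan	Ethan		Standard Mandarin with a slight northern accent. A bright, warm, and energetic voice.	Chinese, English
Chelsie	Chelsie		An anime-style virtual girlfriend voice.	Chinese, English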