Audio and video file translation – Qwen - Alibaba Cloud Model Studio

Prerequisites

Create an API key.
Set the API key as an environment variable.
(Optional) If you use the OpenAI SDK, install the SDK.

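Quick start

All examples use the OpenAI-compatible streaming API and set the source and target languages through translation_options. The default input is audio. To translate a video file instead, uncomment the video input block in the Python or Node.js example.

Specifying source_lang improves accuracy. If you omit it, automatic language detection is enabled.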

Python

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    # The following URL is for the Singapore region. Replace WorkspaceId with your actual workspace ID. The URL differs by region.
    base_url="https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/compatible-mode/v1",
)

# --- Audio input ---
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "input_audio",
                "input_audio": {
                    "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250211/tixcef/cherry.wav",
                    "format": "wav",
                },
            }
        ],
    }
]

# --- Video input (uncomment to use) ---
# messages = [
#     {
#         "role": "user",
#         "content": [
#             {
#                 "type": "video_url",
#                 "video_url": {
#                     "url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241115/cqqkru/1.mp4"
#                 },
#             }
#         ],
#     },
# ]

completion = client.chat.completions.create(
    model="qwen3-livetranslate-flash",
    messages=messages,
    modalities=["text", "audio"],
    audio={"voice": "Cherry", "format": "wav"},
    stream=True,
    stream_options={"include_usage": True},
    # translation_options is not a standard OpenAI parameter, so pass it through extra_body
    extra_body={"translation_options": {"source_lang": "zh", "target_lang": "en"}},
)

for chunk in completion:
    print(chunk)

Node.js

import OpenAI from "openai";

const client = new OpenAI({
    apiKey: process.env.DASHSCOPE_API_KEY,
    // The following URL is for the Singapore region. Replace WorkspaceId with your actual workspace ID. The URL differs by region.
    baseURL: "https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/compatible-mode/v1",
});

// --- Audio input ---
const messages = [
    {
        role: "user",
        content: [
            {
                type: "input_audio",
                input_audio: {
                    data: "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250211/tixcef/cherry.wav",
                    format: "wav",
                },
            },
        ],
    },
];

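// --- Video input (to use it, uncomment this block and comment out the audio block above, because messages is declared with const) ---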
// const messages = [
//     {
//         role: "user",
//         content: [
//             {
//                 type: "video_url",
//                 video_url: {
//                     url: "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241115/cqqkru/1.mp4",
//                 },
//             },
//         ],
//     },
// ];

async function main() {
    const completion = await client.chat.completions.create({
        model: "qwen3-livetranslate-flash",
        messages: messages,
        modalities: ["text", "audio"],
        audio: { voice: "Cherry", format: "wav" },
        stream: true,
        stream_options: { include_usage: true },
        translation_options: { source_lang: "zh", target_lang: "en" },
    });

    for await (const chunk of completion) {
        console.log(JSON.stringify(chunk));
    }
}

main();

curl

curl -X POST https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
    "model": "qwen3-livetranslate-flash",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "input_audio",
                    "input_audio": {
                        "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250211/tixcef/cherry.wav",
                        "format": "wav"
                    }
                }
            ]
        }
    ],
    "modalities": ["text", "audio"],
    "audio": {
        "voice": "Cherry",
        "format": "wav"
    },
    "stream": true,
    "stream_options": {
        "include_usage": true
    },
    "translation_options": {
        "source_lang": "zh",
        "target_lang": "en"
    }
}'

These examples use public file URLs. To use a local file, see "Input a Base64-encoded local file".

Request parameters

Input

The messages array must contain exactly one message, with role set to user. The content field holds the audio or video to translate.

Audio: Set type to input_audio. Specify a file URL or Base64-encoded data in input_audio.data, and the format (for example, wav) in input_audio.format.
Video: Set type to video_url. Specify the file URL in video_url.url.
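
The two input shapes above can be sketched as a small helper. This is only an illustration: the function name build_user_message and its kind argument are hypothetical and not part of the API; only the dictionary keys come from the description above.

```python
def build_user_message(url: str, kind: str, audio_format: str = "wav") -> list:
    """Return a messages array holding exactly one user message with one media item.

    build_user_message and `kind` are illustrative names, not part of the API.
    """
    if kind == "audio":
        item = {
            "type": "input_audio",
            "input_audio": {"data": url, "format": audio_format},
        }
    elif kind == "video":
        item = {"type": "video_url", "video_url": {"url": url}}
    else:
        raise ValueError(f"kind must be 'audio' or 'video', got {kind!r}")
    return [{"role": "user", "content": [item]}]
```

The returned list can be passed as the messages argument in the quick start examples.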

Translation options

Use the translation_options parameter to specify the source and target languages:

"translation_options": {"source_lang": "zh", "target_lang": "en"}

In the Python SDK, translation_options is not a standard OpenAI parameter. Pass it through extra_body:

extra_body={"translation_options": {"source_lang": "zh", "target_lang": "en"}}
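
Because source_lang may be omitted to enable automatic language detection, the extra_body value can be built conditionally. The helper below is a minimal sketch; build_extra_body is a hypothetical name introduced here for illustration.

```python
from typing import Optional


def build_extra_body(target_lang: str, source_lang: Optional[str] = None) -> dict:
    """Build the extra_body dict for the Python SDK call.

    source_lang is left out entirely when not given, which, per this page,
    enables automatic language detection.
    """
    options = {"target_lang": target_lang}
    if source_lang is not None:
        options["source_lang"] = source_lang
    return {"translation_options": options}
```

For example, build_extra_body("en", "zh") produces the same dictionary as the literal shown above.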

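Output modalities

Use the modalities parameter to control the output format.

`modalities` value	Output
`["text"]`	Translated text only
`["text", "audio"]`	Translated text and Base64-encoded synthesized audio

If the output includes audio, specify the voice in the audio parameter. For details, see "Supported voices".

Limitations

Single-turn only: The model processes one translation per request. Multi-turn conversations are not supported.
No system messages: The system role is not supported.
Streaming only: Only OpenAI-compatible streaming output is supported.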
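
A client-side check can catch violations of the first two limitations before a request is sent. The following is a sketch only: validate_messages is a hypothetical helper, and the server may apply additional rules that are not described here.

```python
def validate_messages(messages: list) -> None:
    """Raise ValueError if messages breaks the single-turn / no-system rules.

    Hypothetical pre-flight check; it mirrors only the limitations listed above.
    """
    if len(messages) != 1:
        raise ValueError(
            f"exactly one message is required (single-turn only), got {len(messages)}"
        )
    role = messages[0].get("role")
    if role != "user":
        raise ValueError(f"only the user role is supported, got {role!r}")
```

The streaming-only limitation is covered by always passing stream=True, as the examples on this page do.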

Parse the response

Each streaming chunk object contains the following:

Text: chunk.choices[0].delta.content
Audio: chunk.choices[0].delta.audio["data"] (Base64-encoded, 24 kHz sample rate)
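
The two fields above can be collected in one pass over the stream. The sketch below assumes that delta.audio is a dict (as the indexing above suggests) and that chunks with an empty choices list carry only usage data. collect_stream is a hypothetical name. Each fragment is decoded separately here so that the result does not depend on how Base64 padding falls across fragments.

```python
import base64


def collect_stream(chunks):
    """Return (translated_text, audio_bytes) accumulated from streamed chunks."""
    text_parts = []
    audio_parts = []
    for chunk in chunks:
        if not chunk.choices:
            # Assumed to be the trailing usage-only chunk; nothing to collect.
            continue
        delta = chunk.choices[0].delta
        content = getattr(delta, "content", None)
        if content:
            text_parts.append(content)
        audio = getattr(delta, "audio", None) or {}
        if audio.get("data"):
            audio_parts.append(base64.b64decode(audio["data"]))
    return "".join(text_parts), b"".join(audio_parts)
```

Passing the completion object from the quick start example as chunks yields the full translation text and the raw audio bytes once the stream ends.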

Save audio to a file

Concatenate all Base64 audio fragments from the stream, then decode and save the result after the stream completes.

Python

import os
from openai import OpenAI
import base64
import numpy as np
import soundfile as sf

client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    # The following URL is for the Singapore region. Replace WorkspaceId with your actual workspace ID. The URL differs by region.
    base_url="https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/compatible-mode/v1",
)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "input_audio",
                "input_audio": {
                    "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250211/tixcef/cherry.wav",
                    "format": "wav",
                },
            }
        ],
    }
]

completion = client.chat.completions.create(
    model="qwen3-livetranslate-flash",
    messages=messages,
    modalities=["text", "audio"],
    audio={"voice": "Cherry", "format": "wav"},
    stream=True,
    stream_options={"include_usage": True},
    extra_body={"translation_options": {"source_lang": "zh", "target_lang": "en"}},
)

# Concatenate the Base64 fragments and decode them after the stream completes
audio_string = ""
for chunk in completion:
    if chunk.choices:
        if hasattr(chunk.choices[0].delta, "audio"):
            try:
                audio_string += chunk.choices[0].delta.audio["data"]
            except Exception as e:
                print(chunk.choices[0].delta.audio["transcript"])
    else:
        print(chunk.usage)

wav_bytes = base64.b64decode(audio_string)
audio_np = np.frombuffer(wav_bytes, dtype=np.int16)
sf.write("output.wav", audio_np, samplerate=24000)
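
If installing numpy and soundfile is not convenient, the same file can be written with the standard library alone. This is a sketch under the assumption, implied by the example above, that the decoded bytes are raw 16-bit mono PCM at 24 kHz. save_pcm_as_wav is a hypothetical helper name.

```python
import wave


def save_pcm_as_wav(pcm_bytes: bytes, path: str, sample_rate: int = 24000) -> None:
    """Wrap raw 16-bit mono PCM bytes in a WAV container using the stdlib."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)        # mono (assumed)
        wav_file.setsampwidth(2)        # 2 bytes per sample = 16-bit
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm_bytes)
```

With the variables from the example above, the call would be save_pcm_as_wav(wav_bytes, "output.wav").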

Node.js

import OpenAI from "openai";
import { createWriteStream } from "node:fs";
import { Writer } from "wav";

const client = new OpenAI({
    apiKey: process.env.DASHSCOPE_API_KEY,
    // The following URL is for the Singapore region. Replace WorkspaceId with your actual workspace ID. The URL differs by region.
    baseURL: "https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/compatible-mode/v1",
});

const messages = [
    {
        role: "user",
        content: [
            {
                type: "input_audio",
                input_audio: {
                    data: "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250211/tixcef/cherry.wav",
                    format: "wav",
                },
            },
        ],
    },
];

const completion = await client.chat.completions.create({
    model: "qwen3-livetranslate-flash",
    messages: messages,
    modalities: ["text", "audio"],
    audio: { voice: "Cherry", format: "wav" },
    stream: true,
    stream_options: { include_usage: true },
    translation_options: { source_lang: "zh", target_lang: "en" },
});

// Concatenate the Base64 fragments and decode them after the stream completes
let audioString = "";
for await (const chunk of completion) {
    if (Array.isArray(chunk.choices) && chunk.choices.length > 0) {
        if (chunk.choices[0].delta.audio?.data) {
            audioString += chunk.choices[0].delta.audio.data;
        }
    } else {
        console.log(chunk.usage);
    }
}

// Save as a WAV file
async function saveAudio(base64Data, outputPath) {
    const wavBuffer = Buffer.from(base64Data, "base64");
    const writer = new Writer({
        sampleRate: 24000,
        channels: 1,
        bitDepth: 16,
    });
    const outputStream = createWriteStream(outputPath);
    writer.pipe(outputStream);
    writer.write(wavBuffer);
    writer.end();
    await new Promise((resolve, reject) => {
        outputStream.on("finish", resolve);
        outputStream.on("error", reject);
    });
    console.log(`Audio saved to ${outputPath}`);
}

saveAudio(audioString, "output.wav");

Real-time playback

Decode each Base64 fragment as it arrives and play it directly. This approach requires platform-specific audio libraries.

Python

First, install pyaudio:

Platform	Installation
macOS	`brew install portaudio && pip install pyaudio`
Ubuntu / Debian	`sudo apt-get install python-pyaudio python3-pyaudio` or `pip install pyaudio`
CentOS	`sudo yum install -y portaudio portaudio-devel && pip install pyaudio`
Windows	`python -m pip install pyaudio`

import os
from openai import OpenAI
import base64
import numpy as np
import pyaudio
import time

client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    # The following URL is for the Singapore region. Replace WorkspaceId with your actual workspace ID. The URL differs by region.
    base_url="https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/compatible-mode/v1",
)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "input_audio",
                "input_audio": {
                    "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250211/tixcef/cherry.wav",
                    "format": "wav",
                },
            }
        ],
    }
]

completion = client.chat.completions.create(
    model="qwen3-livetranslate-flash",
    messages=messages,
    modalities=["text", "audio"],
    audio={"voice": "Cherry", "format": "wav"},
    stream=True,
    stream_options={"include_usage": True},
    extra_body={"translation_options": {"source_lang": "zh", "target_lang": "en"}},
)

# Initialize PyAudio for real-time playback
p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16, channels=1, rate=24000, output=True)

for chunk in completion:
    if chunk.choices:
        if hasattr(chunk.choices[0].delta, "audio"):
            try:
                audio_data = chunk.choices[0].delta.audio["data"]
                wav_bytes = base64.b64decode(audio_data)
                audio_np = np.frombuffer(wav_bytes, dtype=np.int16)
                stream.write(audio_np.tobytes())
            except Exception as e:
                print(chunk.choices[0].delta.audio["transcript"])

time.sleep(0.8)
stream.stop_stream()
stream.close()
p.terminate()

Node.js

First, install the dependencies.

Platform	Installation
macOS	`brew install portaudio && npm install speaker`
Ubuntu / Debian	`sudo apt-get install libasound2-dev && npm install speaker`
Windows	`npm install speaker`

import OpenAI from "openai";
import Speaker from "speaker";

const client = new OpenAI({
    apiKey: process.env.DASHSCOPE_API_KEY,
    // The following URL is for the Singapore region. Replace WorkspaceId with your actual workspace ID. The URL differs by region.
    baseURL: "https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/compatible-mode/v1",
});

const messages = [
    {
        role: "user",
        content: [
            {
                type: "input_audio",
                input_audio: {
                    data: "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250211/tixcef/cherry.wav",
                    format: "wav",
                },
            },
        ],
    },
];

const completion = await client.chat.completions.create({
    model: "qwen3-livetranslate-flash",
    messages: messages,
    modalities: ["text", "audio"],
    audio: { voice: "Cherry", format: "wav" },
    stream: true,
    stream_options: { include_usage: true },
    translation_options: { source_lang: "zh", target_lang: "en" },
});

// Stream audio to the speaker in real time
const speaker = new Speaker({
    sampleRate: 24000,
    channels: 1,
    bitDepth: 16,
    signed: true,
});

for await (const chunk of completion) {
    if (Array.isArray(chunk.choices) && chunk.choices.length > 0) {
        if (chunk.choices[0].delta.audio?.data) {
            const pcmBuffer = Buffer.from(chunk.choices[0].delta.audio.data, "base64");
            speaker.write(pcmBuffer);
        }
    } else {
        console.log(chunk.usage);
    }
}

speaker.on("finish", () => console.log("Playback finished"));
speaker.end();

Billing

Audio

Each second of input or output audio consumes 12.5 tokens. Audio shorter than 1 second is billed as 1 second.
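
As a rough illustration of the rate above, the sketch below estimates audio tokens from a duration. The text only states the 1-second minimum; whether longer durations are rounded to whole seconds is not specified, so this sketch bills fractional seconds proportionally as an assumption. audio_tokens is a hypothetical helper name.

```python
AUDIO_TOKENS_PER_SECOND = 12.5


def audio_tokens(duration_seconds: float) -> float:
    """Estimate tokens for one audio clip (input or output).

    Assumption: durations above 1 second are billed proportionally;
    only the 1-second minimum is stated on this page.
    """
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    billable_seconds = max(duration_seconds, 1.0)
    return billable_seconds * AUDIO_TOKENS_PER_SECOND
```

For instance, a 0.4-second clip is treated as 1 second (12.5 tokens), and a 10-second clip comes to 125 tokens.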

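Video

Token consumption for video has two components.

Audio tokens: 12.5 tokens per second of audio. Audio shorter than 1 second is billed as 1 second.
Video tokens: Calculated from the frame count and resolution, using the following formula: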
```
  video_tokens = ceil(frame_count / 2) x (height / 32) x (width / 32) + 2
```
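where:
- Frames are sampled at 2 FPS, and the frame count is clamped to the range [4, 128].
- The height and width are adjusted to multiples of 32 pixels and dynamically scaled to fit within the total pixel limit.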
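
To make the formula concrete, here is a minimal worked example that applies it to values that have already been sampled and resized. It assumes the caller supplies the adjusted frame count and dimensions; video_tokens_from_adjusted is a hypothetical name, and the sampling and scaling steps are not reproduced here.

```python
import math


def video_tokens_from_adjusted(frame_count: int, height: int, width: int) -> int:
    """Apply the formula to an already sampled frame count and resized dimensions."""
    if height % 32 or width % 32:
        raise ValueError("height and width must already be multiples of 32")
    return math.ceil(frame_count / 2) * (height // 32) * (width // 32) + 2


# A 10-second clip sampled at 2 FPS gives 20 frames; at 352 x 640 after resizing:
# ceil(20 / 2) * (352 / 32) * (640 / 32) + 2 = 10 * 11 * 20 + 2 = 2202
print(video_tokens_from_adjusted(20, 352, 640))
```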

Python script to calculate video tokens

# Install: pip install opencv-python
import math
import cv2

FRAME_FACTOR = 2
IMAGE_FACTOR = 32
MAX_RATIO = 200
VIDEO_MIN_PIXELS = 128 * 32 * 32
VIDEO_MAX_PIXELS = 768 * 32 * 32
FPS = 2
FPS_MIN_FRAMES = 4
FPS_MAX_FRAMES = 128
VIDEO_TOTAL_PIXELS = 16384 * 32 * 32

def round_by_factor(number, factor):
    return round(number / factor) * factor

def ceil_by_factor(number, factor):
    return math.ceil(number / factor) * factor

def floor_by_factor(number, factor):
    return math.floor(number / factor) * factor

def get_video(video_path):
    cap = cv2.VideoCapture(video_path)
    frame_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    frame_height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    video_fps = cap.get(cv2.CAP_PROP_FPS)
    cap.release()
    return frame_height, frame_width, total_frames, video_fps

def smart_nframes(total_frames, video_fps):
    min_frames = ceil_by_factor(FPS_MIN_FRAMES, FRAME_FACTOR)
    max_frames = floor_by_factor(min(FPS_MAX_FRAMES, total_frames), FRAME_FACTOR)
    duration = total_frames / video_fps if video_fps != 0 else 0
    if duration - int(duration) > (1 / FPS):
        total_frames = math.ceil(duration * video_fps)
    else:
        total_frames = math.ceil(int(duration) * video_fps)
    nframes = total_frames / video_fps * FPS
    nframes = int(min(min(max(nframes, min_frames), max_frames), total_frames))
    if not (FRAME_FACTOR <= nframes <= total_frames):
        raise ValueError(f"nframes should in interval [{FRAME_FACTOR}, {total_frames}], but got {nframes}.")
    return nframes

def smart_resize(height, width, nframes, factor=IMAGE_FACTOR):
    min_pixels = VIDEO_MIN_PIXELS
    total_pixels = VIDEO_TOTAL_PIXELS
    max_pixels = max(min(VIDEO_MAX_PIXELS, total_pixels / nframes * FRAME_FACTOR), int(min_pixels * 1.05))
    if max(height, width) / min(height, width) > MAX_RATIO:
        raise ValueError(f"absolute aspect ratio must be smaller than {MAX_RATIO}, got {max(height, width) / min(height, width)}")
    h_bar = max(factor, round_by_factor(height, factor))
    w_bar = max(factor, round_by_factor(width, factor))
    if h_bar * w_bar > max_pixels:
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = floor_by_factor(height / beta, factor)
        w_bar = floor_by_factor(width / beta, factor)
    elif h_bar * w_bar < min_pixels:
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = ceil_by_factor(height * beta, factor)
        w_bar = ceil_by_factor(width * beta, factor)
    return h_bar, w_bar

def video_token_calculate(video_path):
    height, width, total_frames, video_fps = get_video(video_path)
    nframes = smart_nframes(total_frames, video_fps)
    resized_height, resized_width = smart_resize(height, width, nframes)
    video_token = int(math.ceil(nframes / FPS) * resized_height / 32 * resized_width / 32)
    video_token += 2
    return video_token

if __name__ == "__main__":
    video_path = "spring_mountain.mp4"  # Replace with the path to your own video
    video_token = video_token_calculate(video_path)
    print("video_tokens:", video_token)

For token pricing, see "Model list".

Model details

Model	Version	Context window	Max input	Max output
qwen3-livetranslate-flash	Stable	53,248 tokens	49,152 tokens	4,096 tokens
qwen3-livetranslate-flash-2025-12-01	Snapshot	53,248 tokens	49,152 tokens	4,096 tokens

Currently, qwen3-livetranslate-flash has the same capabilities as qwen3-livetranslate-flash-2025-12-01.

Supported languages

Use these language codes for source_lang and target_lang. Some target languages support text output only.

Language code	Language	Supported output
en	English	Audio, text
zh	Chinese	Audio, text
ru	Russian	Audio, text
fr	French	Audio, text
de	German	Audio, text
pt	Portuguese	Audio, text
es	Spanish	Audio, text
it	Italian	Audio, text
id	Indonesian	Text
ko	Korean	Audio, text
ja	Japanese	Audio, text
vi	Vietnamese	Text
th	Thai	Text
ar	Arabic	Text
yue	Cantonese	Audio, text
hi	Hindi	Text
el	Greek	Text
tr	Turkish	Text
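
Since some target languages are text-only, the modalities value can be chosen from the table above. The sketch below hard-codes the table as it appears on this page; modalities_for is a hypothetical helper, and the sets would need updating if the table changes.

```python
# Language codes copied from the table above.
AUDIO_OUTPUT_LANGS = {"en", "zh", "ru", "fr", "de", "pt", "es", "it", "ko", "ja", "yue"}
TEXT_ONLY_LANGS = {"id", "vi", "th", "ar", "hi", "el", "tr"}


def modalities_for(target_lang: str) -> list:
    """Return the richest modalities value the target language supports."""
    if target_lang in AUDIO_OUTPUT_LANGS:
        return ["text", "audio"]
    if target_lang in TEXT_ONLY_LANGS:
        return ["text"]
    raise ValueError(f"unsupported target_lang: {target_lang!r}")
```

For example, modalities_for("ja") allows audio output, while modalities_for("th") falls back to text only.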

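Supported voices

If the output includes synthesized audio, set the voice parameter in audio.

Voice name	`voice` parameter	Description	Supported languages
Cherry	Cherry	A cheerful, friendly, and sincere young woman.	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Nofish	Nofish	A designer who has trouble pronouncing retroflex consonants.	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Shanghai-Jada	Jada	A lively, energetic woman from Shanghai.	Chinese
Beijing-Dylan	Dylan	A young man who grew up in the hutongs of Beijing.	Chinese
Sichuan-Sunny	Sunny	A sweet girl from Sichuan.	Chinese
Tianjin-Peter	Peter	A voice in the style of a Tianjin crosstalk performer (the straight-man role).	Chinese
Cantonese-Kiki	Kiki	A sweet best friend from Hong Kong.	Cantonese
Sichuan-Eric	Eric	An unconventional man from Chengdu, Sichuan, who stands out from the crowd.	Chinese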
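
The table above can be turned into a lookup that lists the voices usable for a given target language. This is a sketch that copies the table as shown on this page. The mapping from language names to codes (for example, Chinese to zh and Cantonese to yue) follows the supported languages table, and voices_for is a hypothetical helper name.

```python
_MULTILINGUAL = {"zh", "en", "fr", "de", "ru", "it", "es", "pt", "ja", "ko"}

# Keys are `voice` parameter values; values are supported language codes.
VOICE_LANGS = {
    "Cherry": _MULTILINGUAL,
    "Nofish": _MULTILINGUAL,
    "Jada": {"zh"},
    "Dylan": {"zh"},
    "Sunny": {"zh"},
    "Peter": {"zh"},
    "Kiki": {"yue"},
    "Eric": {"zh"},
}


def voices_for(target_lang: str) -> list:
    """Return the voice parameter values listed for the target language."""
    return [voice for voice, langs in VOICE_LANGS.items() if target_lang in langs]
```

An empty result, as for text-only target languages, indicates that no listed voice covers that language.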

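FAQ

When I input a video file, which content is translated?

The model translates the video's audio track. Visual information improves translation accuracy.

For example, if the audio says "This is a mask":

If the video shows a medical mask, the model translates it as "This is a medical mask".
If the video shows a masquerade mask, the model translates it as "This is a masquerade mask".

API reference

For detailed input and output parameters, see "Audio and video translation - Qwen".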