
Alibaba Cloud Model Studio: Audio and Video File Translation – Qwen

Last Updated: Mar 17, 2026

qwen3-livetranslate-flash translates audio and video files across 18 languages. It accepts audio or video input and returns translated text, synthesized audio, or both via a streaming API. For video input, visual context improves translation accuracy (e.g., distinguishing "medical mask" vs. "masquerade mask" based on video frames).

Before you begin

  1. Create an API key.

  2. Configure the API key as an environment variable.

  3. (Optional) If you use the OpenAI SDK, install the SDK.

Quick start

All examples use the OpenAI-compatible streaming API. Set source and target languages via translation_options. The default input is audio; uncomment the video input block in each example to translate video files instead.

Specifying source_lang improves accuracy. Omit it to enable automatic language detection.

Python

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    # Singapore region. For Beijing, use: https://dashscope.aliyuncs.com/compatible-mode/v1
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

# --- Audio input ---
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "input_audio",
                "input_audio": {
                    "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250211/tixcef/cherry.wav",
                    "format": "wav",
                },
            }
        ],
    }
]

# --- Video input (uncomment to use) ---
# messages = [
#     {
#         "role": "user",
#         "content": [
#             {
#                 "type": "video_url",
#                 "video_url": {
#                     "url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241115/cqqkru/1.mp4"
#                 },
#             }
#         ],
#     },
# ]

completion = client.chat.completions.create(
    model="qwen3-livetranslate-flash",
    messages=messages,
    modalities=["text", "audio"],
    audio={"voice": "Cherry", "format": "wav"},
    stream=True,
    stream_options={"include_usage": True},
    # translation_options is not a standard OpenAI parameter; pass it through extra_body
    extra_body={"translation_options": {"source_lang": "zh", "target_lang": "en"}},
)

for chunk in completion:
    print(chunk)

Node.js

import OpenAI from "openai";

const client = new OpenAI({
    apiKey: process.env.DASHSCOPE_API_KEY,
    // Singapore region. For Beijing, use: https://dashscope.aliyuncs.com/compatible-mode/v1
    baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
});

// --- Audio input ---
const messages = [
    {
        role: "user",
        content: [
            {
                type: "input_audio",
                input_audio: {
                    data: "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250211/tixcef/cherry.wav",
                    format: "wav",
                },
            },
        ],
    },
];

// --- Video input (uncomment to use) ---
// const messages = [
//     {
//         role: "user",
//         content: [
//             {
//                 type: "video_url",
//                 video_url: {
//                     url: "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241115/cqqkru/1.mp4",
//                 },
//             },
//         ],
//     },
// ];

async function main() {
    const completion = await client.chat.completions.create({
        model: "qwen3-livetranslate-flash",
        messages: messages,
        modalities: ["text", "audio"],
        audio: { voice: "Cherry", format: "wav" },
        stream: true,
        stream_options: { include_usage: true },
        translation_options: { source_lang: "zh", target_lang: "en" },
    });

    for await (const chunk of completion) {
        console.log(JSON.stringify(chunk));
    }
}

main();

curl

curl -X POST https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
    "model": "qwen3-livetranslate-flash",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "input_audio",
                    "input_audio": {
                        "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250211/tixcef/cherry.wav",
                        "format": "wav"
                    }
                }
            ]
        }
    ],
    "modalities": ["text", "audio"],
    "audio": {
        "voice": "Cherry",
        "format": "wav"
    },
    "stream": true,
    "stream_options": {
        "include_usage": true
    },
    "translation_options": {
        "source_lang": "zh",
        "target_lang": "en"
    }
}'

These examples use a public file URL. To use a local file, see Input a Base64-encoded local file.
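To adapt the examples above for a local file, a small helper can produce the Base64 string for input_audio.data. This is a sketch: it assumes the service accepts a bare Base64 string; if a data-URI prefix is required, follow the linked guide.

```python
import base64

def encode_local_audio(path: str) -> str:
    """Return the Base64 string for a local audio file.

    Assumption: a bare Base64 string is accepted in input_audio.data.
    If a data-URI prefix is required, prepend it per the linked guide.
    """
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")
```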

Request parameters

Input

The messages array must contain exactly one message with role set to user. The content field holds the audio or video to translate:

  • Audio: Set type to input_audio. Provide the file URL or Base64-encoded data in input_audio.data, and specify the format (for example, wav) in input_audio.format.

  • Video: Set type to video_url. Provide the file URL in video_url.url.
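The two message shapes above can be packaged by a small helper. This is a convenience sketch, not part of the API; the helper name and signature are our own.

```python
def build_message(media_type: str, url: str, audio_format: str = "wav") -> dict:
    """Build the single user message for an audio or video request.

    media_type: "audio" or "video". Convenience helper, not part of the API.
    """
    if media_type == "audio":
        part = {"type": "input_audio",
                "input_audio": {"data": url, "format": audio_format}}
    elif media_type == "video":
        part = {"type": "video_url", "video_url": {"url": url}}
    else:
        raise ValueError(f"unsupported media_type: {media_type}")
    return {"role": "user", "content": [part]}
```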

Translation options

Specify the source and target languages in the translation_options parameter:

"translation_options": {"source_lang": "zh", "target_lang": "en"}

In the Python SDK, translation_options is not a standard OpenAI parameter. Pass it through extra_body:

extra_body={"translation_options": {"source_lang": "zh", "target_lang": "en"}}

Output modality

Control the output format with the modalities parameter:

modalities value Output
["text"] Translated text only
["text", "audio"] Translated text and Base64-encoded synthesized audio

When the output includes audio, set the voice in the audio parameter. See Supported voices for available options.

Constraints

  • Single-turn only: The model handles one translation per request. Multi-turn conversations are not supported.

  • No system message: The system role is not supported.

  • Streaming only: Only OpenAI-compatible streaming output is supported.

Parse the response

Each streaming chunk object contains:

  • Text: chunk.choices[0].delta.content

  • Audio: chunk.choices[0].delta.audio["data"] (Base64-encoded, 24 kHz sample rate)

Save audio to a file

Concatenate all Base64 audio fragments from the stream, then decode and save the result after the stream completes.

Python

import os
from openai import OpenAI
import base64
import numpy as np
import soundfile as sf

client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    # Singapore region. For Beijing, use: https://dashscope.aliyuncs.com/compatible-mode/v1
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "input_audio",
                "input_audio": {
                    "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250211/tixcef/cherry.wav",
                    "format": "wav",
                },
            }
        ],
    }
]

completion = client.chat.completions.create(
    model="qwen3-livetranslate-flash",
    messages=messages,
    modalities=["text", "audio"],
    audio={"voice": "Cherry", "format": "wav"},
    stream=True,
    stream_options={"include_usage": True},
    extra_body={"translation_options": {"source_lang": "zh", "target_lang": "en"}},
)

# Concatenate Base64 fragments, then decode after the stream completes
audio_string = ""
for chunk in completion:
    if chunk.choices:
        if hasattr(chunk.choices[0].delta, "audio"):
            try:
                audio_string += chunk.choices[0].delta.audio["data"]
            except Exception:
                # The final audio chunk carries a transcript instead of data
                print(chunk.choices[0].delta.audio["transcript"])
    else:
        print(chunk.usage)

wav_bytes = base64.b64decode(audio_string)
audio_np = np.frombuffer(wav_bytes, dtype=np.int16)
sf.write("output.wav", audio_np, samplerate=24000)

Node.js

import OpenAI from "openai";
import { createWriteStream } from "node:fs";
import { Writer } from "wav";

const client = new OpenAI({
    apiKey: process.env.DASHSCOPE_API_KEY,
    // Singapore region. For Beijing, use: https://dashscope.aliyuncs.com/compatible-mode/v1
    baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
});

const messages = [
    {
        role: "user",
        content: [
            {
                type: "input_audio",
                input_audio: {
                    data: "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250211/tixcef/cherry.wav",
                    format: "wav",
                },
            },
        ],
    },
];

const completion = await client.chat.completions.create({
    model: "qwen3-livetranslate-flash",
    messages: messages,
    modalities: ["text", "audio"],
    audio: { voice: "Cherry", format: "wav" },
    stream: true,
    stream_options: { include_usage: true },
    translation_options: { source_lang: "zh", target_lang: "en" },
});

// Concatenate Base64 fragments, then decode after the stream completes
let audioString = "";
for await (const chunk of completion) {
    if (Array.isArray(chunk.choices) && chunk.choices.length > 0) {
        if (chunk.choices[0].delta.audio?.data) {
            audioString += chunk.choices[0].delta.audio.data;
        }
    } else {
        console.log(chunk.usage);
    }
}

// Save as WAV file
async function saveAudio(base64Data, outputPath) {
    const wavBuffer = Buffer.from(base64Data, "base64");
    const writer = new Writer({
        sampleRate: 24000,
        channels: 1,
        bitDepth: 16,
    });
    const outputStream = createWriteStream(outputPath);
    writer.pipe(outputStream);
    writer.write(wavBuffer);
    writer.end();
    await new Promise((resolve, reject) => {
        outputStream.on("finish", resolve);
        outputStream.on("error", reject);
    });
    console.log(`Audio saved to ${outputPath}`);
}

saveAudio(audioString, "output.wav");

Real-time playback

Decode each Base64 fragment as it arrives and play it directly. This approach requires platform-specific audio libraries.

Python

Install pyaudio first:

Platform Installation
macOS brew install portaudio && pip install pyaudio
Ubuntu / Debian sudo apt-get install python-pyaudio python3-pyaudio or pip install pyaudio
CentOS sudo yum install -y portaudio portaudio-devel && pip install pyaudio
Windows python -m pip install pyaudio

import os
from openai import OpenAI
import base64
import numpy as np
import pyaudio
import time

client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    # Singapore region. For Beijing, use: https://dashscope.aliyuncs.com/compatible-mode/v1
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "input_audio",
                "input_audio": {
                    "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250211/tixcef/cherry.wav",
                    "format": "wav",
                },
            }
        ],
    }
]

completion = client.chat.completions.create(
    model="qwen3-livetranslate-flash",
    messages=messages,
    modalities=["text", "audio"],
    audio={"voice": "Cherry", "format": "wav"},
    stream=True,
    stream_options={"include_usage": True},
    extra_body={"translation_options": {"source_lang": "zh", "target_lang": "en"}},
)

# Initialize PyAudio for real-time playback
p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16, channels=1, rate=24000, output=True)

for chunk in completion:
    if chunk.choices:
        if hasattr(chunk.choices[0].delta, "audio"):
            try:
                audio_data = chunk.choices[0].delta.audio["data"]
                wav_bytes = base64.b64decode(audio_data)
                audio_np = np.frombuffer(wav_bytes, dtype=np.int16)
                stream.write(audio_np.tobytes())
            except Exception:
                # The final audio chunk carries a transcript instead of data
                print(chunk.choices[0].delta.audio["transcript"])

time.sleep(0.8)  # Let the buffered audio finish playing
stream.stop_stream()
stream.close()
p.terminate()

Node.js

Install dependencies first:

Platform Installation
macOS brew install portaudio && npm install speaker
Ubuntu / Debian sudo apt-get install libasound2-dev && npm install speaker
Windows npm install speaker

import OpenAI from "openai";
import Speaker from "speaker";

const client = new OpenAI({
    apiKey: process.env.DASHSCOPE_API_KEY,
    // Singapore region. For Beijing, use: https://dashscope.aliyuncs.com/compatible-mode/v1
    baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
});

const messages = [
    {
        role: "user",
        content: [
            {
                type: "input_audio",
                input_audio: {
                    data: "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250211/tixcef/cherry.wav",
                    format: "wav",
                },
            },
        ],
    },
];

const completion = await client.chat.completions.create({
    model: "qwen3-livetranslate-flash",
    messages: messages,
    modalities: ["text", "audio"],
    audio: { voice: "Cherry", format: "wav" },
    stream: true,
    stream_options: { include_usage: true },
    translation_options: { source_lang: "zh", target_lang: "en" },
});

// Stream audio to speaker in real time
const speaker = new Speaker({
    sampleRate: 24000,
    channels: 1,
    bitDepth: 16,
    signed: true,
});

for await (const chunk of completion) {
    if (Array.isArray(chunk.choices) && chunk.choices.length > 0) {
        if (chunk.choices[0].delta.audio?.data) {
            const pcmBuffer = Buffer.from(chunk.choices[0].delta.audio.data, "base64");
            speaker.write(pcmBuffer);
        }
    } else {
        console.log(chunk.usage);
    }
}

speaker.on("finish", () => console.log("Playback complete"));
speaker.end();

Billing

Audio

Each second of input or output audio consumes 12.5 tokens. Audio shorter than 1 second is billed as 1 second.
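The audio billing rule above can be expressed as a small function. This is a sketch: the document does not state how fractional durations above 1 second round, so this version simply rounds the token total up.

```python
import math

TOKENS_PER_SECOND = 12.5

def audio_tokens(duration_seconds: float) -> int:
    # Durations under 1 second are billed as 1 second; rounding of
    # fractional durations above 1 second is not stated in the doc,
    # so this sketch rounds the token total up.
    billable = max(duration_seconds, 1.0)
    return math.ceil(billable * TOKENS_PER_SECOND)
```

For example, a 60-second clip consumes 750 tokens under this estimate.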

Video

Video token consumption has two components:

  • Audio tokens: 12.5 tokens per second of audio. Audio shorter than 1 second is billed as 1 second.

  • Video tokens: Calculated based on frame count and resolution. The formula is:

      video_tokens = ceil(frame_count / 2) × (height / 32) × (width / 32) + 2

    Where:

    • Frames are sampled at 2 FPS, clamped to the range [4, 128].

    • Height and width are adjusted to multiples of 32 pixels and dynamically scaled to fit within the total pixel limit.
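As a simplified worked example of the formula, ignoring the frame clamping and dynamic rescaling that the full script below applies, take a 10-second clip sampled at 2 FPS whose frame size is already a multiple of 32:

```python
import math

# 10-second clip sampled at 2 FPS -> 20 frames; 704 x 1280 is already
# a multiple of 32, so no rescaling applies in this simplified case.
frame_count = 20
height, width = 704, 1280
video_tokens = math.ceil(frame_count / 2) * (height // 32) * (width // 32) + 2
# 10 * 22 * 40 + 2 = 8802
```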

Python script to calculate video tokens

# Install: pip install opencv-python
import math
import cv2

FRAME_FACTOR = 2
IMAGE_FACTOR = 32
MAX_RATIO = 200
VIDEO_MIN_PIXELS = 128 * 32 * 32
VIDEO_MAX_PIXELS = 768 * 32 * 32
FPS = 2
FPS_MIN_FRAMES = 4
FPS_MAX_FRAMES = 128
VIDEO_TOTAL_PIXELS = 16384 * 32 * 32

def round_by_factor(number, factor):
    return round(number / factor) * factor

def ceil_by_factor(number, factor):
    return math.ceil(number / factor) * factor

def floor_by_factor(number, factor):
    return math.floor(number / factor) * factor

def get_video(video_path):
    cap = cv2.VideoCapture(video_path)
    frame_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    frame_height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    video_fps = cap.get(cv2.CAP_PROP_FPS)
    cap.release()
    return frame_height, frame_width, total_frames, video_fps

def smart_nframes(total_frames, video_fps):
    min_frames = ceil_by_factor(FPS_MIN_FRAMES, FRAME_FACTOR)
    max_frames = floor_by_factor(min(FPS_MAX_FRAMES, total_frames), FRAME_FACTOR)
    duration = total_frames / video_fps if video_fps != 0 else 0
    if duration - int(duration) > (1 / FPS):
        total_frames = math.ceil(duration * video_fps)
    else:
        total_frames = math.ceil(int(duration) * video_fps)
    nframes = total_frames / video_fps * FPS
    nframes = int(min(min(max(nframes, min_frames), max_frames), total_frames))
    if not (FRAME_FACTOR <= nframes <= total_frames):
        raise ValueError(f"nframes should in interval [{FRAME_FACTOR}, {total_frames}], but got {nframes}.")
    return nframes

def smart_resize(height, width, nframes, factor=IMAGE_FACTOR):
    min_pixels = VIDEO_MIN_PIXELS
    total_pixels = VIDEO_TOTAL_PIXELS
    max_pixels = max(min(VIDEO_MAX_PIXELS, total_pixels / nframes * FRAME_FACTOR), int(min_pixels * 1.05))
    if max(height, width) / min(height, width) > MAX_RATIO:
        raise ValueError(f"absolute aspect ratio must be smaller than {MAX_RATIO}, got {max(height, width) / min(height, width)}")
    h_bar = max(factor, round_by_factor(height, factor))
    w_bar = max(factor, round_by_factor(width, factor))
    if h_bar * w_bar > max_pixels:
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = floor_by_factor(height / beta, factor)
        w_bar = floor_by_factor(width / beta, factor)
    elif h_bar * w_bar < min_pixels:
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = ceil_by_factor(height * beta, factor)
        w_bar = ceil_by_factor(width * beta, factor)
    return h_bar, w_bar

def video_token_calculate(video_path):
    height, width, total_frames, video_fps = get_video(video_path)
    nframes = smart_nframes(total_frames, video_fps)
    resized_height, resized_width = smart_resize(height, width, nframes)
    video_token = int(math.ceil(nframes / FPS) * resized_height / 32 * resized_width / 32)
    video_token += 2
    return video_token

if __name__ == "__main__":
    video_path = "spring_mountain.mp4"  # Replace with your video path
    video_token = video_token_calculate(video_path)
    print("video_tokens:", video_token)

For token pricing, see Model list.

Model details

Model Version Context window Max input Max output
qwen3-livetranslate-flash Stable 53,248 tokens 49,152 tokens 4,096 tokens
qwen3-livetranslate-flash-2025-12-01 Snapshot 53,248 tokens 49,152 tokens 4,096 tokens

qwen3-livetranslate-flash currently has the same capabilities as qwen3-livetranslate-flash-2025-12-01.

Supported languages

Use these language codes for source_lang and target_lang. Some target languages support text output only.

Language code Language Supported output
en English Audio, text
zh Chinese Audio, text
ru Russian Audio, text
fr French Audio, text
de German Audio, text
pt Portuguese Audio, text
es Spanish Audio, text
it Italian Audio, text
id Indonesian Text
ko Korean Audio, text
ja Japanese Audio, text
vi Vietnamese Text
th Thai Text
ar Arabic Text
yue Cantonese Audio, text
hi Hindi Text
el Greek Text
tr Turkish Text
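Because some target languages support text output only, a request with modalities set to ["text", "audio"] is only valid for a subset of targets. A small helper derived from the table above can pick a safe value; the set and function are our own convenience sketch, not part of the API.

```python
# Target languages that support synthesized audio output, per the
# table above; all listed languages support text output.
AUDIO_CAPABLE = {"en", "zh", "ru", "fr", "de", "pt", "es", "it", "ko", "ja", "yue"}

def modalities_for(target_lang: str) -> list:
    """Choose a modalities value valid for the given target language."""
    return ["text", "audio"] if target_lang in AUDIO_CAPABLE else ["text"]
```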

Supported voices

Set the voice parameter in audio when output includes synthesized audio.

Voice name voice parameter Description Supported languages
Cherry Cherry A cheerful, friendly, and genuine young woman. Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Nofish Nofish A designer who has difficulty pronouncing retroflex consonants. Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Shanghai-Jada Jada A bustling and energetic Shanghai lady. Chinese
Beijing-Dylan Dylan A young man who grew up in the hutongs of Beijing. Chinese
Sichuan-Sunny Sunny A sweet girl from Sichuan. Chinese
Tianjin-Peter Peter A voice in the style of a Tianjin crosstalk performer (the supporting role). Chinese
Cantonese-Kiki Kiki A sweet best friend from Hong Kong. Cantonese
Sichuan-Eric Eric A man from Chengdu, Sichuan, who is unconventional and stands out from the crowd. Chinese

FAQ

When I input a video file, what content is translated?

The model translates the video's audio track. Visual information improves translation accuracy.

For example, if the audio says "This is a mask":

  • When the video shows a medical mask, the model translates it as "This is a medical mask."

  • When the video shows a masquerade mask, the model translates it as "This is a masquerade mask."

API reference

For full input and output parameter details, see Audio and video translation - Qwen.