
Alibaba Cloud Model Studio: Audio and Video File Translation – Qwen

Last Updated: Mar 17, 2026

qwen3-livetranslate-flash translates audio and video files across 18 languages. It accepts audio or video input and returns translated text, synthesized audio, or both via a streaming API. For video input, visual context improves translation accuracy (e.g., distinguishing "medical mask" vs. "masquerade mask" based on video frames).

Before you begin

  1. Create an API key.

  2. Configure the API key as an environment variable.

  3. (Optional) If you use the OpenAI SDK, install the SDK.

Quick start

All examples use the OpenAI-compatible streaming API. Set source and target languages via translation_options. The default input is audio; uncomment the video input block in each example to translate video files instead.

Specifying source_lang improves accuracy. Omit it to enable automatic language detection.

Python

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    # Singapore region. For Beijing, use: https://dashscope.aliyuncs.com/compatible-mode/v1
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

# --- Audio input ---
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "input_audio",
                "input_audio": {
                    "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250211/tixcef/cherry.wav",
                    "format": "wav",
                },
            }
        ],
    }
]

# --- Video input (uncomment to use) ---
# messages = [
#     {
#         "role": "user",
#         "content": [
#             {
#                 "type": "video_url",
#                 "video_url": {
#                     "url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241115/cqqkru/1.mp4"
#                 },
#             }
#         ],
#     },
# ]

completion = client.chat.completions.create(
    model="qwen3-livetranslate-flash",
    messages=messages,
    modalities=["text", "audio"],
    audio={"voice": "Cherry", "format": "wav"},
    stream=True,
    stream_options={"include_usage": True},
    # translation_options is not a standard OpenAI parameter; pass it through extra_body
    extra_body={"translation_options": {"source_lang": "zh", "target_lang": "en"}},
)

for chunk in completion:
    print(chunk)

Node.js

import OpenAI from "openai";

const client = new OpenAI({
    apiKey: process.env.DASHSCOPE_API_KEY,
    // Singapore region. For Beijing, use: https://dashscope.aliyuncs.com/compatible-mode/v1
    baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
});

// --- Audio input ---
const messages = [
    {
        role: "user",
        content: [
            {
                type: "input_audio",
                input_audio: {
                    data: "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250211/tixcef/cherry.wav",
                    format: "wav",
                },
            },
        ],
    },
];

// --- Video input (uncomment to use) ---
// const messages = [
//     {
//         role: "user",
//         content: [
//             {
//                 type: "video_url",
//                 video_url: {
//                     url: "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241115/cqqkru/1.mp4",
//                 },
//             },
//         ],
//     },
// ];

async function main() {
    const completion = await client.chat.completions.create({
        model: "qwen3-livetranslate-flash",
        messages: messages,
        modalities: ["text", "audio"],
        audio: { voice: "Cherry", format: "wav" },
        stream: true,
        stream_options: { include_usage: true },
        translation_options: { source_lang: "zh", target_lang: "en" },
    });

    for await (const chunk of completion) {
        console.log(JSON.stringify(chunk));
    }
}

main();

curl

curl -X POST https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
    "model": "qwen3-livetranslate-flash",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "input_audio",
                    "input_audio": {
                        "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250211/tixcef/cherry.wav",
                        "format": "wav"
                    }
                }
            ]
        }
    ],
    "modalities": ["text", "audio"],
    "audio": {
        "voice": "Cherry",
        "format": "wav"
    },
    "stream": true,
    "stream_options": {
        "include_usage": true
    },
    "translation_options": {
        "source_lang": "zh",
        "target_lang": "en"
    }
}'

These examples use a public file URL. To use a local file, see Input a Base64-encoded local file.
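To adapt the examples above for a local file, a small helper can produce the Base64 string for input_audio.data. This is a sketch: it assumes the service accepts a bare Base64 string; if a data-URI prefix is required, follow the linked guide.

```python
import base64

def encode_local_audio(path: str) -> str:
    """Return the Base64 string for a local audio file.

    Assumption: a bare Base64 string is accepted in input_audio.data.
    If a data-URI prefix is required, prepend it per the linked guide.
    """
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")
```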

Request parameters

Input

The messages array must contain exactly one message with role set to user. The content field holds the audio or video to translate:

  • Audio: Set type to input_audio. Provide the file URL or Base64-encoded data in input_audio.data, and specify the format (for example, wav) in input_audio.format.

  • Video: Set type to video_url. Provide the file URL in video_url.url.
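The two message shapes above can be packaged by a small helper. This is a convenience sketch, not part of the API; the helper name and signature are our own.

```python
def build_message(media_type: str, url: str, audio_format: str = "wav") -> dict:
    """Build the single user message for an audio or video request.

    media_type: "audio" or "video". Convenience helper, not part of the API.
    """
    if media_type == "audio":
        part = {"type": "input_audio",
                "input_audio": {"data": url, "format": audio_format}}
    elif media_type == "video":
        part = {"type": "video_url", "video_url": {"url": url}}
    else:
        raise ValueError(f"unsupported media_type: {media_type}")
    return {"role": "user", "content": [part]}
```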

Translation options

Specify the source and target languages in the translation_options parameter:

"translation_options": {"source_lang": "zh", "target_lang": "en"}

In the Python SDK, translation_options is not a standard OpenAI parameter. Pass it through extra_body:

extra_body={"translation_options": {"source_lang": "zh", "target_lang": "en"}}

Output modality

Control the output format with the modalities parameter:

modalities value Output
["text"] Translated text only
["text", "audio"] Translated text and Base64-encoded synthesized audio

When the output includes audio, set the voice in the audio parameter. See Supported voices for available options.

Constraints

  • Single-turn only: The model handles one translation per request. Multi-turn conversations are not supported.

  • No system message: The system role is not supported.

  • Streaming only: Only OpenAI-compatible streaming output is supported.

Parse the response

Each streaming chunk object contains:

  • Text: chunk.choices[0].delta.content

  • Audio: chunk.choices[0].delta.audio["data"] (Base64-encoded, 24 kHz sample rate)

Save audio to a file

Concatenate all Base64 audio fragments from the stream, then decode and save the result after the stream completes.

Python

import os
from openai import OpenAI
import base64
import numpy as np
import soundfile as sf

client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    # Singapore region. For Beijing, use: https://dashscope.aliyuncs.com/compatible-mode/v1
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "input_audio",
                "input_audio": {
                    "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250211/tixcef/cherry.wav",
                    "format": "wav",
                },
            }
        ],
    }
]

completion = client.chat.completions.create(
    model="qwen3-livetranslate-flash",
    messages=messages,
    modalities=["text", "audio"],
    audio={"voice": "Cherry", "format": "wav"},
    stream=True,
    stream_options={"include_usage": True},
    extra_body={"translation_options": {"source_lang": "zh", "target_lang": "en"}},
)

# Concatenate Base64 fragments, then decode after the stream completes
audio_string = ""
for chunk in completion:
    if chunk.choices:
        if hasattr(chunk.choices[0].delta, "audio"):
            try:
                audio_string += chunk.choices[0].delta.audio["data"]
            except Exception:
                # The final audio chunk carries a transcript instead of data
                print(chunk.choices[0].delta.audio["transcript"])
    else:
        print(chunk.usage)

wav_bytes = base64.b64decode(audio_string)
audio_np = np.frombuffer(wav_bytes, dtype=np.int16)
sf.write("output.wav", audio_np, samplerate=24000)

Node.js

import OpenAI from "openai";
import { createWriteStream } from "node:fs";
import { Writer } from "wav";

const client = new OpenAI({
    apiKey: process.env.DASHSCOPE_API_KEY,
    // Singapore region. For Beijing, use: https://dashscope.aliyuncs.com/compatible-mode/v1
    baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
});

const messages = [
    {
        role: "user",
        content: [
            {
                type: "input_audio",
                input_audio: {
                    data: "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250211/tixcef/cherry.wav",
                    format: "wav",
                },
            },
        ],
    },
];

const completion = await client.chat.completions.create({
    model: "qwen3-livetranslate-flash",
    messages: messages,
    modalities: ["text", "audio"],
    audio: { voice: "Cherry", format: "wav" },
    stream: true,
    stream_options: { include_usage: true },
    translation_options: { source_lang: "zh", target_lang: "en" },
});

// Concatenate Base64 fragments, then decode after the stream completes
let audioString = "";
for await (const chunk of completion) {
    if (Array.isArray(chunk.choices) && chunk.choices.length > 0) {
        if (chunk.choices[0].delta.audio?.data) {
            audioString += chunk.choices[0].delta.audio.data;
        }
    } else {
        console.log(chunk.usage);
    }
}

// Save as WAV file
async function saveAudio(base64Data, outputPath) {
    const wavBuffer = Buffer.from(base64Data, "base64");
    const writer = new Writer({
        sampleRate: 24000,
        channels: 1,
        bitDepth: 16,
    });
    const outputStream = createWriteStream(outputPath);
    writer.pipe(outputStream);
    writer.write(wavBuffer);
    writer.end();
    await new Promise((resolve, reject) => {
        outputStream.on("finish", resolve);
        outputStream.on("error", reject);
    });
    console.log(`Audio saved to ${outputPath}`);
}

saveAudio(audioString, "output.wav");

Real-time playback

Decode each Base64 fragment as it arrives and play it directly. This approach requires platform-specific audio libraries.

Python

Install pyaudio first:

Platform Installation
macOS brew install portaudio && pip install pyaudio
Ubuntu / Debian sudo apt-get install python-pyaudio python3-pyaudio or pip install pyaudio
CentOS sudo yum install -y portaudio portaudio-devel && pip install pyaudio
Windows python -m pip install pyaudio

import os
from openai import OpenAI
import base64
import numpy as np
import pyaudio
import time

client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    # Singapore region. For Beijing, use: https://dashscope.aliyuncs.com/compatible-mode/v1
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "input_audio",
                "input_audio": {
                    "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250211/tixcef/cherry.wav",
                    "format": "wav",
                },
            }
        ],
    }
]

completion = client.chat.completions.create(
    model="qwen3-livetranslate-flash",
    messages=messages,
    modalities=["text", "audio"],
    audio={"voice": "Cherry", "format": "wav"},
    stream=True,
    stream_options={"include_usage": True},
    extra_body={"translation_options": {"source_lang": "zh", "target_lang": "en"}},
)

# Initialize PyAudio for real-time playback
p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16, channels=1, rate=24000, output=True)

for chunk in completion:
    if chunk.choices:
        if hasattr(chunk.choices[0].delta, "audio"):
            try:
                audio_data = chunk.choices[0].delta.audio["data"]
                wav_bytes = base64.b64decode(audio_data)
                audio_np = np.frombuffer(wav_bytes, dtype=np.int16)
                stream.write(audio_np.tobytes())
            except Exception:
                # The final audio chunk carries a transcript instead of data
                print(chunk.choices[0].delta.audio["transcript"])

time.sleep(0.8)  # Let the buffered audio finish playing
stream.stop_stream()
stream.close()
p.terminate()

Node.js

Install dependencies first:

Platform Installation
macOS brew install portaudio && npm install speaker
Ubuntu / Debian sudo apt-get install libasound2-dev && npm install speaker
Windows npm install speaker

import OpenAI from "openai";
import Speaker from "speaker";

const client = new OpenAI({
    apiKey: process.env.DASHSCOPE_API_KEY,
    // Singapore region. For Beijing, use: https://dashscope.aliyuncs.com/compatible-mode/v1
    baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
});

const messages = [
    {
        role: "user",
        content: [
            {
                type: "input_audio",
                input_audio: {
                    data: "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250211/tixcef/cherry.wav",
                    format: "wav",
                },
            },
        ],
    },
];

const completion = await client.chat.completions.create({
    model: "qwen3-livetranslate-flash",
    messages: messages,
    modalities: ["text", "audio"],
    audio: { voice: "Cherry", format: "wav" },
    stream: true,
    stream_options: { include_usage: true },
    translation_options: { source_lang: "zh", target_lang: "en" },
});

// Stream audio to speaker in real time
const speaker = new Speaker({
    sampleRate: 24000,
    channels: 1,
    bitDepth: 16,
    signed: true,
});

for await (const chunk of completion) {
    if (Array.isArray(chunk.choices) && chunk.choices.length > 0) {
        if (chunk.choices[0].delta.audio?.data) {
            const pcmBuffer = Buffer.from(chunk.choices[0].delta.audio.data, "base64");
            speaker.write(pcmBuffer);
        }
    } else {
        console.log(chunk.usage);
    }
}

speaker.on("finish", () => console.log("Playback complete"));
speaker.end();

Billing

Audio

Each second of input or output audio consumes 12.5 tokens. Audio shorter than 1 second is billed as 1 second.
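The audio billing rule above can be expressed as a small function. This is a sketch: the document does not state how fractional durations above 1 second round, so this version simply rounds the token total up.

```python
import math

TOKENS_PER_SECOND = 12.5

def audio_tokens(duration_seconds: float) -> int:
    # Durations under 1 second are billed as 1 second; rounding of
    # fractional durations above 1 second is not stated in the doc,
    # so this sketch rounds the token total up.
    billable = max(duration_seconds, 1.0)
    return math.ceil(billable * TOKENS_PER_SECOND)
```

For example, a 60-second clip consumes 750 tokens under this estimate.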

Video

Video token consumption has two components:

  • Audio tokens: 12.5 tokens per second of audio. Audio shorter than 1 second is billed as 1 second.

  • Video tokens: Calculated based on frame count and resolution. The formula is:

      video_tokens = ceil(frame_count / 2) × (height / 32) × (width / 32) + 2

    Where:

    • Frames are sampled at 2 FPS, clamped to the range [4, 128].

    • Height and width are adjusted to multiples of 32 pixels and dynamically scaled to fit within the total pixel limit.
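As a simplified worked example of the formula, ignoring the frame clamping and dynamic rescaling that the full script below applies, take a 10-second clip sampled at 2 FPS whose frame size is already a multiple of 32:

```python
import math

# 10-second clip sampled at 2 FPS -> 20 frames; 704 x 1280 is already
# a multiple of 32, so no rescaling applies in this simplified case.
frame_count = 20
height, width = 704, 1280
video_tokens = math.ceil(frame_count / 2) * (height // 32) * (width // 32) + 2
# 10 * 22 * 40 + 2 = 8802
```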

Python script to calculate video tokens

# Install: pip install opencv-python
import math
import cv2

FRAME_FACTOR = 2
IMAGE_FACTOR = 32
MAX_RATIO = 200
VIDEO_MIN_PIXELS = 128 * 32 * 32
VIDEO_MAX_PIXELS = 768 * 32 * 32
FPS = 2
FPS_MIN_FRAMES = 4
FPS_MAX_FRAMES = 128
VIDEO_TOTAL_PIXELS = 16384 * 32 * 32

def round_by_factor(number, factor):
    return round(number / factor) * factor

def ceil_by_factor(number, factor):
    return math.ceil(number / factor) * factor

def floor_by_factor(number, factor):
    return math.floor(number / factor) * factor

def get_video(video_path):
    cap = cv2.VideoCapture(video_path)
    frame_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    frame_height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    video_fps = cap.get(cv2.CAP_PROP_FPS)
    cap.release()
    return frame_height, frame_width, total_frames, video_fps

def smart_nframes(total_frames, video_fps):
    min_frames = ceil_by_factor(FPS_MIN_FRAMES, FRAME_FACTOR)
    max_frames = floor_by_factor(min(FPS_MAX_FRAMES, total_frames), FRAME_FACTOR)
    duration = total_frames / video_fps if video_fps != 0 else 0
    if duration - int(duration) > (1 / FPS):
        total_frames = math.ceil(duration * video_fps)
    else:
        total_frames = math.ceil(int(duration) * video_fps)
    nframes = total_frames / video_fps * FPS
    nframes = int(min(min(max(nframes, min_frames), max_frames), total_frames))
    if not (FRAME_FACTOR <= nframes <= total_frames):
        raise ValueError(f"nframes should in interval [{FRAME_FACTOR}, {total_frames}], but got {nframes}.")
    return nframes

def smart_resize(height, width, nframes, factor=IMAGE_FACTOR):
    min_pixels = VIDEO_MIN_PIXELS
    total_pixels = VIDEO_TOTAL_PIXELS
    max_pixels = max(min(VIDEO_MAX_PIXELS, total_pixels / nframes * FRAME_FACTOR), int(min_pixels * 1.05))
    if max(height, width) / min(height, width) > MAX_RATIO:
        raise ValueError(f"absolute aspect ratio must be smaller than {MAX_RATIO}, got {max(height, width) / min(height, width)}")
    h_bar = max(factor, round_by_factor(height, factor))
    w_bar = max(factor, round_by_factor(width, factor))
    if h_bar * w_bar > max_pixels:
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = floor_by_factor(height / beta, factor)
        w_bar = floor_by_factor(width / beta, factor)
    elif h_bar * w_bar < min_pixels:
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = ceil_by_factor(height * beta, factor)
        w_bar = ceil_by_factor(width * beta, factor)
    return h_bar, w_bar

def video_token_calculate(video_path):
    height, width, total_frames, video_fps = get_video(video_path)
    nframes = smart_nframes(total_frames, video_fps)
    resized_height, resized_width = smart_resize(height, width, nframes)
    video_token = int(math.ceil(nframes / FPS) * resized_height / 32 * resized_width / 32)
    video_token += 2
    return video_token

if __name__ == "__main__":
    video_path = "spring_mountain.mp4"  # Replace with your video path
    video_token = video_token_calculate(video_path)
    print("video_tokens:", video_token)

For token pricing, see Model list.

Model details

Model Version Context window Max input Max output
qwen3-livetranslate-flash Stable 53,248 tokens 49,152 tokens 4,096 tokens
qwen3-livetranslate-flash-2025-12-01 Snapshot 53,248 tokens 49,152 tokens 4,096 tokens

qwen3-livetranslate-flash currently has the same capabilities as qwen3-livetranslate-flash-2025-12-01.

Supported languages

Use these language codes for source_lang and target_lang. Some target languages support text output only.

Language code Language Supported output
en English Audio, text
zh Chinese Audio, text
ru Russian Audio, text
fr French Audio, text
de German Audio, text
pt Portuguese Audio, text
es Spanish Audio, text
it Italian Audio, text
id Indonesian Text
ko Korean Audio, text
ja Japanese Audio, text
vi Vietnamese Text
th Thai Text
ar Arabic Text
yue Cantonese Audio, text
hi Hindi Text
el Greek Text
tr Turkish Text
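Because some target languages support text output only, a request with modalities set to ["text", "audio"] is only valid for a subset of targets. A small helper derived from the table above can pick a safe value; the set and function are our own convenience sketch, not part of the API.

```python
# Target languages that support synthesized audio output, per the
# table above; all listed languages support text output.
AUDIO_CAPABLE = {"en", "zh", "ru", "fr", "de", "pt", "es", "it", "ko", "ja", "yue"}

def modalities_for(target_lang: str) -> list:
    """Choose a modalities value valid for the given target language."""
    return ["text", "audio"] if target_lang in AUDIO_CAPABLE else ["text"]
```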

Supported voices

Set the voice parameter in audio when output includes synthesized audio.

Voice name voice parameter Description Supported languages
Cherry Cherry A cheerful, friendly, and genuine young woman. Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Nofish Nofish A designer who has difficulty pronouncing retroflex consonants. Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Shanghai-Jada Jada A bustling and energetic Shanghai lady. Chinese
Beijing-Dylan Dylan A young man who grew up in the hutongs of Beijing. Chinese
Sichuan-Sunny Sunny A sweet girl from Sichuan. Chinese
Tianjin-Peter Peter A voice in the style of a Tianjin crosstalk performer (the supporting role). Chinese
Cantonese-Kiki Kiki A sweet best friend from Hong Kong. Cantonese
Sichuan-Eric Eric A man from Chengdu, Sichuan, who is unconventional and stands out from the crowd. Chinese

FAQ

When I input a video file, what content is translated?

The model translates the video's audio track. Visual information improves translation accuracy.

For example, if the audio says "This is a mask":

  • When the video shows a medical mask, the model translates it as "This is a medical mask."

  • When the video shows a masquerade mask, the model translates it as "This is a masquerade mask."

API reference

For full input and output parameter details, see Audio and video translation - Qwen.