Qwen3-LiveTranslate-Flash is an audio and video translation model. It supports translation between 18 languages, such as Chinese, English, Russian, and French. The model uses visual context to improve translation accuracy and can output both text and audio.
Procedure
Set languages: Set the source language (source_lang) and the target language (target_lang) in the translation_options parameter. See Supported languages. Omitting source_lang enables automatic detection, but specifying the language improves translation accuracy.
Input file: The messages array must contain exactly one message where the role is user. The content field must contain the URL or Base64-encoded data of the audio or video to be translated.
Control the output modality: Use the modalities parameter: ["text"] outputs only text; ["text", "audio"] outputs text and Base64-encoded audio. If the output includes audio, you must set the voice (voice) in the audio parameter. See Supported voices.
The following is the core code for using the OpenAI Python SDK:
# Import dependencies and create a client...
completion = client.chat.completions.create(
    model="qwen3-livetranslate-flash",  # Select the model
    # The messages array contains only one user message, and the content is the file to be translated
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "input_audio",
                    "input_audio": {
                        "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250211/tixcef/cherry.wav",
                        "format": "wav"
                    }
                }
            ]
        }
    ],
    # translation_options is not a standard OpenAI parameter and must be passed through extra_body
    extra_body={"translation_options": {"source_lang": "zh", "target_lang": "en"}},
    # Example: output text and audio
    modalities=["text", "audio"],
    audio={"voice": "Cherry", "format": "wav"},
    # Only streaming output is supported (see Limits)
    stream=True
)
Limits
Single-turn translation: The model is designed for translation tasks and does not support multi-turn conversations.
No system message: The model does not support setting global behavior using the system role.
Call method: Only streaming output that is compatible with the OpenAI protocol is supported.
Parsing the response
chunk is a streaming response object:
Text: Read the content from chunk.choices[0].delta.content.
Audio: The Base64-encoded audio data is in chunk.choices[0].delta.audio["data"]. The audio output from the model has a sample rate of 24 kHz.
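A minimal parsing sketch, assuming completion is the streaming response created by the code above and that the fields are laid out as described in this section (audio chunks are handled in more detail in Parse Base64 audio data below):
# Minimal sketch: collect translated text and Base64 audio from the stream.
text_parts = []
audio_b64 = ""
for chunk in completion:
    if not chunk.choices:
        continue  # the final chunk carries only usage statistics
    delta = chunk.choices[0].delta
    if getattr(delta, "content", None):
        text_parts.append(delta.content)  # translated text
    if hasattr(delta, "audio") and delta.audio and "data" in delta.audio:
        audio_b64 += delta.audio["data"]  # Base64-encoded audio (24 kHz)
print("".join(text_parts))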
Model availability
Model | Version | Context window (tokens) | Max input (tokens) | Max output (tokens)
qwen3-livetranslate-flash (currently has the same capabilities as qwen3-livetranslate-flash-2025-12-01) | Stable | 53,248 | 49,152 | 4,096
qwen3-livetranslate-flash-2025-12-01 | Snapshot | 53,248 | 49,152 | 4,096
Getting started
qwen3-livetranslate-flash supports audio or video input and outputs text and audio.
First, create an API key and export it as an environment variable. If you use the OpenAI SDK, install the SDK.
The following example uses audio input. To use video input, uncomment the corresponding code.
import os
from openai import OpenAI
client = OpenAI(
    # If you have not configured an environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx",
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    # The following is the base_url for the Singapore region. If you use a model in the Beijing region, replace the base_url with: https://dashscope.aliyuncs.com/compatible-mode/v1
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)
# ---------------- Audio input ----------------
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "input_audio",
                "input_audio": {
                    "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250211/tixcef/cherry.wav",
                    "format": "wav",
                },
            }
        ],
    }
]
# ---------------- Video input (uncomment to use) ----------------
# messages = [
#     {
#         "role": "user",
#         "content": [
#             {
#                 "type": "video_url",
#                 "video_url": {
#                     "url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241115/cqqkru/1.mp4"
#                 },
#             }
#         ],
#     },
# ]
completion = client.chat.completions.create(
    model="qwen3-livetranslate-flash",
    messages=messages,
    modalities=["text", "audio"],
    audio={"voice": "Cherry", "format": "wav"},
    stream=True,
    stream_options={"include_usage": True},
    extra_body={"translation_options": {"source_lang": "zh", "target_lang": "en"}},
)
for chunk in completion:
    print(chunk)
import OpenAI from "openai";
const client = new OpenAI({
// If you have not configured an environment variable, replace the following line with your Model Studio API key: apiKey: "sk-xxx",
apiKey: process.env.DASHSCOPE_API_KEY,
// The following is the base_url for the Singapore region. If you use a model in the Beijing region, replace the base_url with: https://dashscope.aliyuncs.com/compatible-mode/v1
baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
});
// ---------------- Audio input ----------------
const messages = [
{
role: "user",
content: [
{
type: "input_audio",
input_audio: {
data: "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250211/tixcef/cherry.wav",
format: "wav",
},
},
],
},
];
// ---------------- Video input (uncomment to use) ----------------
// const messages = [
// {
// role: "user",
// content: [
// {
// type: "video_url",
// video_url: {
// url: "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241115/cqqkru/1.mp4",
// },
// },
// ],
// },
// ];
async function main() {
const completion = await client.chat.completions.create({
model: "qwen3-livetranslate-flash",
messages: messages,
modalities: ["text", "audio"],
audio: { voice: "Cherry", format: "wav" },
stream: true,
stream_options: { include_usage: true },
translation_options: { source_lang: "zh", target_lang: "en" },
});
for await (const chunk of completion) {
console.log(JSON.stringify(chunk));
}
}
main();
# ======= Important =======
# The following is an example for the Singapore region. If you use a model in the Beijing region, replace the request URL with: https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions
# === Delete this comment before execution ===
curl -X POST https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3-livetranslate-flash",
"messages": [
{
"role": "user",
"content": [
{
"type": "input_audio",
"input_audio": {
"data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250211/tixcef/cherry.wav",
"format": "wav"
}
}
]
}
],
"modalities": ["text", "audio"],
"audio": {
"voice": "Cherry",
"format": "wav"
},
"stream": true,
"stream_options": {
"include_usage": true
},
"translation_options": {
"source_lang": "zh",
"target_lang": "en"
}
}'
This example uses a public file URL. For information about using a local file, see Input a Base64-encoded local file.
Parse Base64 audio data
The model outputs audio in a streaming Base64-encoded format. You can process the data in two ways:
Concatenate and decode: Concatenate all Base64 fragments from the stream. After the stream is complete, decode the result and save it as an audio file.
Real-time playback: Decode each Base64 fragment in real time and play it directly.
# Installation instructions for pyaudio:
# APPLE Mac OS X
# brew install portaudio
# pip install pyaudio
# Debian/Ubuntu
# sudo apt-get install python-pyaudio python3-pyaudio
# or
# pip install pyaudio
# CentOS
# sudo yum install -y portaudio portaudio-devel && pip install pyaudio
# Microsoft Windows
# python -m pip install pyaudio
import os
from openai import OpenAI
import base64
import numpy as np
import soundfile as sf
# Initialize the OpenAI client
client = OpenAI(
    # If you have not configured an environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx",
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    # The following is the base_url for the Beijing region. If you use a model in the Singapore region, replace the base_url with: https://dashscope-intl.aliyuncs.com/compatible-mode/v1
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "input_audio",
                "input_audio": {
                    "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250211/tixcef/cherry.wav",
                    "format": "wav",
                },
            }
        ],
    }
]
completion = client.chat.completions.create(
    model="qwen3-livetranslate-flash",
    messages=messages,
    modalities=["text", "audio"],
    audio={"voice": "Cherry", "format": "wav"},
    stream=True,
    stream_options={"include_usage": True},
    extra_body={"translation_options": {"source_lang": "zh", "target_lang": "en"}},
)
# Method 1: Decode after generation is complete
audio_string = ""
for chunk in completion:
    if chunk.choices:
        if hasattr(chunk.choices[0].delta, "audio"):
            try:
                audio_string += chunk.choices[0].delta.audio["data"]
            except Exception:
                # This chunk carries the text transcript instead of audio data
                print(chunk.choices[0].delta.audio["transcript"])
    else:
        print(chunk.usage)
wav_bytes = base64.b64decode(audio_string)
audio_np = np.frombuffer(wav_bytes, dtype=np.int16)
sf.write("audio_assistant_py.wav", audio_np, samplerate=24000)
# Method 2: Decode during generation (comment out the code for Method 1 to use Method 2)
# # Initialize PyAudio
# import pyaudio
# import time
# p = pyaudio.PyAudio()
# # Create an audio stream
# stream = p.open(format=pyaudio.paInt16,
#                 channels=1,
#                 rate=24000,
#                 output=True)
# for chunk in completion:
#     if chunk.choices:
#         if hasattr(chunk.choices[0].delta, "audio"):
#             try:
#                 audio_string = chunk.choices[0].delta.audio["data"]
#                 wav_bytes = base64.b64decode(audio_string)
#                 audio_np = np.frombuffer(wav_bytes, dtype=np.int16)
#                 # Play the audio data directly
#                 stream.write(audio_np.tobytes())
#             except Exception:
#                 print(chunk.choices[0].delta.audio["transcript"])
# time.sleep(0.8)
# # Clean up resources
# stream.stop_stream()
# stream.close()
# p.terminate()
// Preparations before running:
// For Windows/Mac/Linux:
// 1. Make sure Node.js is installed (version >= 14 is recommended).
// 2. Run the following command to install the necessary dependencies:
// npm install openai wav
//
// To use the real-time playback feature (Method 2), you also need to:
// Windows:
// npm install speaker
// Mac:
// brew install portaudio
// npm install speaker
// Linux (Ubuntu/Debian):
// sudo apt-get install libasound2-dev
// npm install speaker
import OpenAI from "openai";
const client = new OpenAI({
// If you have not configured an environment variable, replace the following line with your Model Studio API key: apiKey: "sk-xxx",
apiKey: process.env.DASHSCOPE_API_KEY,
// The following is the base_url for the Beijing region. If you use a model in the Singapore region, replace the base_url with: https://dashscope-intl.aliyuncs.com/compatible-mode/v1
baseURL: "https://dashscope.aliyuncs.com/compatible-mode/v1",
});
// ---------------- Audio input ----------------
const messages = [
{
role: "user",
content: [
{
type: "input_audio",
input_audio: {
data: "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250211/tixcef/cherry.wav",
format: "wav",
},
},
],
},
];
const completion = await client.chat.completions.create({
model: "qwen3-livetranslate-flash",
messages: messages,
modalities: ["text", "audio"],
audio: { voice: "Cherry", format: "wav" },
stream: true,
stream_options: { include_usage: true },
translation_options: { source_lang: "zh", target_lang: "en" },
});
// Method 1: Decode after generation is complete
// Requires installation: npm install wav
import { createWriteStream } from 'node:fs'; // node:fs is a built-in Node.js module and does not need to be installed
import { Writer } from 'wav';
async function convertAudio(audioString, audioPath) {
try {
// Decode the Base64 string into a Buffer
const wavBuffer = Buffer.from(audioString, 'base64');
// Create a WAV file write stream
const writer = new Writer({
sampleRate: 24000, // Sample rate
channels: 1, // Mono channel
bitDepth: 16 // 16-bit depth
});
// Create an output file stream and pipe the connection
const outputStream = createWriteStream(audioPath);
writer.pipe(outputStream);
// Write PCM data and end the stream
writer.write(wavBuffer);
writer.end();
// Use a Promise to wait for the file to finish writing
await new Promise((resolve, reject) => {
outputStream.on('finish', resolve);
outputStream.on('error', reject);
});
// Add extra wait time to ensure the audio is complete
await new Promise(resolve => setTimeout(resolve, 800));
console.log(`Audio file successfully saved as ${audioPath}`);
} catch (error) {
console.error('An error occurred during processing:', error);
}
}
let audioString = "";
for await (const chunk of completion) {
if (Array.isArray(chunk.choices) && chunk.choices.length > 0) {
if (chunk.choices[0].delta.audio) {
if (chunk.choices[0].delta.audio["data"]) {
audioString += chunk.choices[0].delta.audio["data"];
}
}
} else {
console.log(chunk.usage);
}
}
// Execute the conversion
convertAudio(audioString, "audio_assistant_mjs.wav");
// Method 2: Decode and play in real time during generation
// You must first install the necessary components according to the instructions for your system above.
// import Speaker from 'speaker'; // Import the audio playback library
// // Create a Speaker instance (configuration matches WAV file parameters)
// const speaker = new Speaker({
// sampleRate: 24000, // Sample rate
// channels: 1, // Number of channels
// bitDepth: 16, // Bit depth
// signed: true // Signed PCM
// });
// for await (const chunk of completion) {
// if (Array.isArray(chunk.choices) && chunk.choices.length > 0) {
// if (chunk.choices[0].delta.audio) {
// if (chunk.choices[0].delta.audio["data"]) {
// const pcmBuffer = Buffer.from(chunk.choices[0].delta.audio.data, 'base64');
// // Write directly to the speaker for playback
// speaker.write(pcmBuffer);
// }
// }
// } else {
// console.log(chunk.usage);
// }
// }
// speaker.on('finish', () => console.log('Playback complete'));
// speaker.end(); // Call based on the actual end of the API stream
Billing details
Audio
Each second of input or output audio consumes 12.5 tokens.
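For example, a 60-second audio clip consumes 60 × 12.5 = 750 tokens, whether it is input or output.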
Video
Tokens for video files are divided into video_tokens (visual) and audio_tokens (audio).
video_tokens
The following script is used to calculate video_tokens. The calculation process is as follows:
Frame sampling: Frames are sampled at a rate of 2 frames per second (FPS). The number of sampled frames is limited to the range [4, 128].
Size adjustment: The height and width are adjusted to be multiples of 32 pixels. The dimensions are dynamically scaled based on the number of frames to fit within the total pixel limit.
Token calculation: video_tokens = ceil(nframes / FPS) × (resized_height / 32) × (resized_width / 32) + 2, where FPS = 2 and the extra 2 tokens are the visual marker.
# Before use, install: pip install opencv-python
import math
import os
import cv2

# Fixed parameters
FRAME_FACTOR = 2
IMAGE_FACTOR = 32
# Aspect ratio of video frames
MAX_RATIO = 200
# Lower limit for video frame pixels
VIDEO_MIN_PIXELS = 128 * 32 * 32
# Upper limit for video frame pixels
VIDEO_MAX_PIXELS = 768 * 32 * 32
FPS = 2
# Minimum number of extracted frames
FPS_MIN_FRAMES = 4
# Maximum number of extracted frames
FPS_MAX_FRAMES = 128
# Maximum pixel value for video input
VIDEO_TOTAL_PIXELS = 16384 * 32 * 32

def round_by_factor(number, factor):
    return round(number / factor) * factor

def ceil_by_factor(number, factor):
    return math.ceil(number / factor) * factor

def floor_by_factor(number, factor):
    return math.floor(number / factor) * factor

def get_video(video_path):
    cap = cv2.VideoCapture(video_path)
    frame_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    frame_height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    video_fps = cap.get(cv2.CAP_PROP_FPS)
    cap.release()
    return frame_height, frame_width, total_frames, video_fps

def smart_nframes(total_frames, video_fps):
    min_frames = ceil_by_factor(FPS_MIN_FRAMES, FRAME_FACTOR)
    max_frames = floor_by_factor(min(FPS_MAX_FRAMES, total_frames), FRAME_FACTOR)
    duration = total_frames / video_fps if video_fps != 0 else 0
    if duration - int(duration) > (1 / FPS):
        total_frames = math.ceil(duration * video_fps)
    else:
        total_frames = math.ceil(int(duration) * video_fps)
    nframes = total_frames / video_fps * FPS
    nframes = int(min(min(max(nframes, min_frames), max_frames), total_frames))
    if not (FRAME_FACTOR <= nframes <= total_frames):
        raise ValueError(f"nframes should in interval [{FRAME_FACTOR}, {total_frames}], but got {nframes}.")
    return nframes

def smart_resize(height, width, nframes, factor=IMAGE_FACTOR):
    min_pixels = VIDEO_MIN_PIXELS
    total_pixels = VIDEO_TOTAL_PIXELS
    max_pixels = max(min(VIDEO_MAX_PIXELS, total_pixels / nframes * FRAME_FACTOR), int(min_pixels * 1.05))
    if max(height, width) / min(height, width) > MAX_RATIO:
        raise ValueError(f"absolute aspect ratio must be smaller than {MAX_RATIO}, got {max(height, width) / min(height, width)}")
    h_bar = max(factor, round_by_factor(height, factor))
    w_bar = max(factor, round_by_factor(width, factor))
    if h_bar * w_bar > max_pixels:
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = floor_by_factor(height / beta, factor)
        w_bar = floor_by_factor(width / beta, factor)
    elif h_bar * w_bar < min_pixels:
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = ceil_by_factor(height * beta, factor)
        w_bar = ceil_by_factor(width * beta, factor)
    return h_bar, w_bar

def video_token_calculate(video_path):
    height, width, total_frames, video_fps = get_video(video_path)
    nframes = smart_nframes(total_frames, video_fps)
    resized_height, resized_width = smart_resize(height, width, nframes)
    video_token = int(math.ceil(nframes / FPS) * resized_height / 32 * resized_width / 32)
    video_token += 2  # Visual marker
    return video_token

if __name__ == "__main__":
    video_path = "spring_mountain.mp4"  # Your video path
    video_token = video_token_calculate(video_path)
    print("video_tokens:", video_token)
audio_tokens
Each second of audio consumes 12.5 tokens. If the audio duration is less than 1 second, it is billed as 1 second.
For information about token costs, see Model list.
API reference
For the input and output parameters for qwen3-livetranslate-flash, see Audio and video translation - Qwen.
Supported languages
The language codes in the following table can be used to specify the source and target languages.
Some target languages support text output only (see the example after the table).
Language code | Language | Supported output modalities |
en | English | Audio, text |
zh | Chinese | Audio, text |
ru | Russian | Audio, text |
fr | French | Audio, text |
de | German | Audio, text |
pt | Portuguese | Audio, text |
es | Spanish | Audio, text |
it | Italian | Audio, text |
id | Indonesian | Text |
ko | Korean | Audio, text |
ja | Japanese | Audio, text |
vi | Vietnamese | Text |
th | Thai | Text |
ar | Arabic | Text |
yue | Cantonese | Audio, text |
hi | Hindi | Text |
el | Greek | Text |
tr | Turkish | Text |
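For example, when the target language supports text output only, such as Thai (th), request only the text modality. The following is a minimal sketch based on the audio-input example in Getting started; it reuses the client and messages defined there, and only target_lang and modalities change:
# Minimal sketch: translating into a text-only target language (Thai).
completion = client.chat.completions.create(
    model="qwen3-livetranslate-flash",
    messages=messages,
    modalities=["text"],  # text-only output; no audio parameter is needed
    stream=True,
    extra_body={"translation_options": {"source_lang": "zh", "target_lang": "th"}},
)
for chunk in completion:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")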
Supported voices
Voice name | voice parameter | Description | Supported languages
Cherry | Cherry | A cheerful, friendly, and genuine young woman. | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Nofish | Nofish | A designer who has difficulty pronouncing retroflex consonants. | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Shanghai-Jada | Jada | A bustling and energetic Shanghai lady. | Chinese
Beijing-Dylan | Dylan | A young man who grew up in the hutongs of Beijing. | Chinese
Sichuan-Sunny | Sunny | A sweet girl from Sichuan. | Chinese
Tianjin-Peter | Peter | A voice in the style of a Tianjin crosstalk performer (the supporting role). | Chinese
Cantonese-Kiki | Kiki | A sweet best friend from Hong Kong. | Cantonese
Sichuan-Eric | Eric | A man from Chengdu, Sichuan, who is unconventional and stands out from the crowd. | Chinese
FAQ
Q: When I input a video file, what content is translated?
A: The model translates the audio from the video. Visual information is used as context to improve accuracy.
Example:
If the audio content is "This is a mask":
If the video shows a medical mask, the translation is "This is a medical mask."
If the video shows a masquerade mask, the translation is "This is a masquerade mask."