qwen3-livetranslate-flash translates audio and video files across 18 languages. It accepts audio or video input and returns translated text, synthesized audio, or both via a streaming API. For video input, visual context improves translation accuracy (e.g., distinguishing "medical mask" vs. "masquerade mask" based on video frames).
Before you begin
- (Optional) If you use the OpenAI SDK, install the SDK.
Quick start
All examples use the OpenAI-compatible streaming API. Set source and target languages via translation_options. The default input is audio; uncomment the video input block in each example to translate video files instead.
Specifying source_lang improves accuracy. Omit it to enable automatic language detection.
Python

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    # Singapore region. For Beijing, use: https://dashscope.aliyuncs.com/compatible-mode/v1
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

# --- Audio input ---
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "input_audio",
                "input_audio": {
                    "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250211/tixcef/cherry.wav",
                    "format": "wav",
                },
            }
        ],
    }
]

# --- Video input (uncomment to use) ---
# messages = [
#     {
#         "role": "user",
#         "content": [
#             {
#                 "type": "video_url",
#                 "video_url": {
#                     "url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241115/cqqkru/1.mp4"
#                 },
#             }
#         ],
#     }
# ]

completion = client.chat.completions.create(
    model="qwen3-livetranslate-flash",
    messages=messages,
    modalities=["text", "audio"],
    audio={"voice": "Cherry", "format": "wav"},
    stream=True,
    stream_options={"include_usage": True},
    # translation_options is not a standard OpenAI parameter; pass it through extra_body
    extra_body={"translation_options": {"source_lang": "zh", "target_lang": "en"}},
)

for chunk in completion:
    print(chunk)

Node.js

import OpenAI from "openai";
const client = new OpenAI({
  apiKey: process.env.DASHSCOPE_API_KEY,
  // Singapore region. For Beijing, use: https://dashscope.aliyuncs.com/compatible-mode/v1
  baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
});

// --- Audio input ---
const messages = [
  {
    role: "user",
    content: [
      {
        type: "input_audio",
        input_audio: {
          data: "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250211/tixcef/cherry.wav",
          format: "wav",
        },
      },
    ],
  },
];

// --- Video input (uncomment to use) ---
// const messages = [
//   {
//     role: "user",
//     content: [
//       {
//         type: "video_url",
//         video_url: {
//           url: "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241115/cqqkru/1.mp4",
//         },
//       },
//     ],
//   },
// ];

async function main() {
  const completion = await client.chat.completions.create({
    model: "qwen3-livetranslate-flash",
    messages: messages,
    modalities: ["text", "audio"],
    audio: { voice: "Cherry", format: "wav" },
    stream: true,
    stream_options: { include_usage: true },
    translation_options: { source_lang: "zh", target_lang: "en" },
  });
  for await (const chunk of completion) {
    console.log(JSON.stringify(chunk));
  }
}
main();

curl

curl -X POST https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
  "model": "qwen3-livetranslate-flash",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "input_audio",
          "input_audio": {
            "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250211/tixcef/cherry.wav",
            "format": "wav"
          }
        }
      ]
    }
  ],
  "modalities": ["text", "audio"],
  "audio": {
    "voice": "Cherry",
    "format": "wav"
  },
  "stream": true,
  "stream_options": {
    "include_usage": true
  },
  "translation_options": {
    "source_lang": "zh",
    "target_lang": "en"
  }
}'

These examples use a public file URL. To use a local file, see Input a Base64-encoded local file.
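As a rough sketch of the local-file approach, you can Base64-encode the audio yourself and pass the encoded string in input_audio.data. The exact expected format (raw Base64 vs. a data URI prefix) is specified in the linked topic, so treat the `data:;base64,` prefix below as an assumption to verify:

```python
import base64


def audio_content_from_bytes(audio_bytes: bytes, fmt: str = "wav") -> dict:
    """Build an input_audio content item from raw audio bytes.

    Assumes the API accepts a data URI of the form "data:;base64,<encoded>";
    check the linked Base64 guide for the exact prefix your endpoint expects.
    """
    b64 = base64.b64encode(audio_bytes).decode("ascii")
    return {
        "type": "input_audio",
        "input_audio": {"data": f"data:;base64,{b64}", "format": fmt},
    }


# Usage with a local file:
# with open("local.wav", "rb") as f:
#     content = [audio_content_from_bytes(f.read(), "wav")]
```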
Request parameters
Input
The messages array must contain exactly one message with role set to user. The content field holds the audio or video to translate:
- Audio: Set type to input_audio. Provide the file URL or Base64-encoded data in input_audio.data, and specify the format (for example, wav) in input_audio.format.
- Video: Set type to video_url. Provide the file URL in video_url.url.
Translation options
Specify the source and target languages in the translation_options parameter:
"translation_options": {"source_lang": "zh", "target_lang": "en"}
In the Python SDK, translation_options is not a standard OpenAI parameter. Pass it through extra_body:
extra_body={"translation_options": {"source_lang": "zh", "target_lang": "en"}}
Output modality
Control the output format with the modalities parameter:
| modalities value | Output |
|---|---|
| ["text"] | Translated text only |
| ["text", "audio"] | Translated text and Base64-encoded synthesized audio |
When the output includes audio, set the voice in the audio parameter. See Supported voices for available options.
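For illustration, a text-only request differs from the quick-start examples only in its modalities value and the absence of the audio parameter. The helper below is a hypothetical convenience for assembling the request keyword arguments, not part of any SDK:

```python
def build_translation_request(messages, text_only=False, voice="Cherry",
                              source_lang="zh", target_lang="en"):
    """Assemble kwargs for client.chat.completions.create().

    With text_only=True the request asks for translated text alone;
    otherwise it also requests synthesized audio with the given voice.
    """
    kwargs = {
        "model": "qwen3-livetranslate-flash",
        "messages": messages,
        "modalities": ["text"] if text_only else ["text", "audio"],
        "stream": True,
        "stream_options": {"include_usage": True},
        # translation_options is passed through extra_body in the Python SDK
        "extra_body": {"translation_options": {
            "source_lang": source_lang, "target_lang": target_lang}},
    }
    if not text_only:
        kwargs["audio"] = {"voice": voice, "format": "wav"}
    return kwargs


# completion = client.chat.completions.create(
#     **build_translation_request(messages, text_only=True))
```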
Constraints
- Single-turn only: The model handles one translation per request. Multi-turn conversations are not supported.
- No system message: The system role is not supported.
- Streaming only: Only OpenAI-compatible streaming output is supported.
Parse the response
Each streaming chunk object contains:
- Text: chunk.choices[0].delta.content
- Audio: chunk.choices[0].delta.audio["data"] (Base64-encoded, 24 kHz sample rate)
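The two fields above can be drained with a small accumulator. This is a sketch rather than SDK code: the `_field` helper is hypothetical and exists only so the function works on both SDK delta objects and plain dicts:

```python
def _field(obj, name):
    """Read `name` from an SDK object attribute or a plain dict key."""
    if isinstance(obj, dict):
        return obj.get(name)
    return getattr(obj, name, None)


def collect_stream(chunks):
    """Accumulate translated text and Base64 audio from streaming chunks.

    Each chunk is expected to expose choices[0].delta with optional
    `content` (text) and `audio["data"]` (Base64 PCM) fields; a chunk
    with empty choices carries usage statistics instead.
    """
    text_parts, audio_parts = [], []
    for chunk in chunks:
        choices = _field(chunk, "choices")
        if not choices:
            continue  # usage-only chunk at the end of the stream
        delta = _field(choices[0], "delta")
        content = _field(delta, "content")
        if content:
            text_parts.append(content)
        audio = _field(delta, "audio")
        if audio and _field(audio, "data"):
            audio_parts.append(_field(audio, "data"))
    return "".join(text_parts), "".join(audio_parts)
```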
Save audio to a file
Concatenate all Base64 audio fragments from the stream, then decode and save the result after the stream completes.
Python
import os
from openai import OpenAI
import base64
import numpy as np
import soundfile as sf
client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    # Singapore region. For Beijing, use: https://dashscope.aliyuncs.com/compatible-mode/v1
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "input_audio",
                "input_audio": {
                    "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250211/tixcef/cherry.wav",
                    "format": "wav",
                },
            }
        ],
    }
]

completion = client.chat.completions.create(
    model="qwen3-livetranslate-flash",
    messages=messages,
    modalities=["text", "audio"],
    audio={"voice": "Cherry", "format": "wav"},
    stream=True,
    stream_options={"include_usage": True},
    extra_body={"translation_options": {"source_lang": "zh", "target_lang": "en"}},
)

# Concatenate Base64 fragments, then decode after the stream completes
audio_string = ""
for chunk in completion:
    if chunk.choices:
        if hasattr(chunk.choices[0].delta, "audio"):
            try:
                audio_string += chunk.choices[0].delta.audio["data"]
            except Exception:
                # This chunk carries the transcript instead of audio data
                print(chunk.choices[0].delta.audio["transcript"])
    else:
        print(chunk.usage)

wav_bytes = base64.b64decode(audio_string)
audio_np = np.frombuffer(wav_bytes, dtype=np.int16)
sf.write("output.wav", audio_np, samplerate=24000)
Node.js
import OpenAI from "openai";
import { createWriteStream } from "node:fs";
import { Writer } from "wav";
const client = new OpenAI({
  apiKey: process.env.DASHSCOPE_API_KEY,
  // Singapore region. For Beijing, use: https://dashscope.aliyuncs.com/compatible-mode/v1
  baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
});

const messages = [
  {
    role: "user",
    content: [
      {
        type: "input_audio",
        input_audio: {
          data: "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250211/tixcef/cherry.wav",
          format: "wav",
        },
      },
    ],
  },
];

const completion = await client.chat.completions.create({
  model: "qwen3-livetranslate-flash",
  messages: messages,
  modalities: ["text", "audio"],
  audio: { voice: "Cherry", format: "wav" },
  stream: true,
  stream_options: { include_usage: true },
  translation_options: { source_lang: "zh", target_lang: "en" },
});

// Concatenate Base64 fragments, then decode after the stream completes
let audioString = "";
for await (const chunk of completion) {
  if (Array.isArray(chunk.choices) && chunk.choices.length > 0) {
    if (chunk.choices[0].delta.audio?.data) {
      audioString += chunk.choices[0].delta.audio.data;
    }
  } else {
    console.log(chunk.usage);
  }
}

// Save as WAV file
async function saveAudio(base64Data, outputPath) {
  const wavBuffer = Buffer.from(base64Data, "base64");
  const writer = new Writer({
    sampleRate: 24000,
    channels: 1,
    bitDepth: 16,
  });
  const outputStream = createWriteStream(outputPath);
  writer.pipe(outputStream);
  writer.write(wavBuffer);
  writer.end();
  await new Promise((resolve, reject) => {
    outputStream.on("finish", resolve);
    outputStream.on("error", reject);
  });
  console.log(`Audio saved to ${outputPath}`);
}

await saveAudio(audioString, "output.wav");
Real-time playback
Decode each Base64 fragment as it arrives and play it directly. This approach requires platform-specific audio libraries.
Python
Install pyaudio first:
| Platform | Installation |
|---|---|
| macOS | brew install portaudio && pip install pyaudio |
| Ubuntu / Debian | sudo apt-get install python-pyaudio python3-pyaudio or pip install pyaudio |
| CentOS | sudo yum install -y portaudio portaudio-devel && pip install pyaudio |
| Windows | python -m pip install pyaudio |
import os
from openai import OpenAI
import base64
import numpy as np
import pyaudio
import time
client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    # Singapore region. For Beijing, use: https://dashscope.aliyuncs.com/compatible-mode/v1
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "input_audio",
                "input_audio": {
                    "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250211/tixcef/cherry.wav",
                    "format": "wav",
                },
            }
        ],
    }
]

completion = client.chat.completions.create(
    model="qwen3-livetranslate-flash",
    messages=messages,
    modalities=["text", "audio"],
    audio={"voice": "Cherry", "format": "wav"},
    stream=True,
    stream_options={"include_usage": True},
    extra_body={"translation_options": {"source_lang": "zh", "target_lang": "en"}},
)

# Initialize PyAudio for real-time playback
p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16, channels=1, rate=24000, output=True)

for chunk in completion:
    if chunk.choices:
        if hasattr(chunk.choices[0].delta, "audio"):
            try:
                audio_data = chunk.choices[0].delta.audio["data"]
                wav_bytes = base64.b64decode(audio_data)
                audio_np = np.frombuffer(wav_bytes, dtype=np.int16)
                stream.write(audio_np.tobytes())
            except Exception:
                # This chunk carries the transcript instead of audio data
                print(chunk.choices[0].delta.audio["transcript"])

# Allow buffered audio to finish playing before closing the stream
time.sleep(0.8)
stream.stop_stream()
stream.close()
p.terminate()
Node.js
Install dependencies first:
| Platform | Installation |
|---|---|
| macOS | brew install portaudio && npm install speaker |
| Ubuntu / Debian | sudo apt-get install libasound2-dev && npm install speaker |
| Windows | npm install speaker |
import OpenAI from "openai";
import Speaker from "speaker";
const client = new OpenAI({
  apiKey: process.env.DASHSCOPE_API_KEY,
  // Singapore region. For Beijing, use: https://dashscope.aliyuncs.com/compatible-mode/v1
  baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
});

const messages = [
  {
    role: "user",
    content: [
      {
        type: "input_audio",
        input_audio: {
          data: "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250211/tixcef/cherry.wav",
          format: "wav",
        },
      },
    ],
  },
];

const completion = await client.chat.completions.create({
  model: "qwen3-livetranslate-flash",
  messages: messages,
  modalities: ["text", "audio"],
  audio: { voice: "Cherry", format: "wav" },
  stream: true,
  stream_options: { include_usage: true },
  translation_options: { source_lang: "zh", target_lang: "en" },
});

// Stream audio to the speaker in real time
const speaker = new Speaker({
  sampleRate: 24000,
  channels: 1,
  bitDepth: 16,
  signed: true,
});

for await (const chunk of completion) {
  if (Array.isArray(chunk.choices) && chunk.choices.length > 0) {
    if (chunk.choices[0].delta.audio?.data) {
      const pcmBuffer = Buffer.from(chunk.choices[0].delta.audio.data, "base64");
      speaker.write(pcmBuffer);
    }
  } else {
    console.log(chunk.usage);
  }
}

speaker.on("finish", () => console.log("Playback complete"));
speaker.end();
Billing
Audio
Each second of input or output audio consumes 12.5 tokens. Audio shorter than 1 second is billed as 1 second.
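As a sketch, the rule above can be expressed as a small helper. Rounding of fractional durations beyond the 1-second minimum is an assumption here (the source only states the minimum), so verify the exact rounding against your usage records:

```python
import math

AUDIO_TOKENS_PER_SECOND = 12.5


def audio_tokens(duration_seconds: float) -> float:
    """Estimate billed audio tokens for a clip of the given duration.

    Clips shorter than 1 second are billed as 1 second; longer durations
    are assumed (unconfirmed) to be rounded up to whole seconds.
    """
    billed_seconds = max(1, math.ceil(duration_seconds))
    return billed_seconds * AUDIO_TOKENS_PER_SECOND


# audio_tokens(0.4) -> 12.5; audio_tokens(10) -> 125.0
```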
Video
Video token consumption has two components:
- Audio tokens: 12.5 tokens per second of audio. Audio shorter than 1 second is billed as 1 second.
- Video tokens: Calculated from frame count and resolution using the formula:
  video_tokens = ceil(frame_count / 2) x (height / 32) x (width / 32) + 2
  Where:
  - Frames are sampled at 2 FPS, clamped to the range [4, 128].
  - Height and width are adjusted to multiples of 32 pixels and dynamically scaled to fit within the total pixel limit.
Python script to calculate video tokens:
# Install: pip install opencv-python
import math
import cv2

FRAME_FACTOR = 2
IMAGE_FACTOR = 32
MAX_RATIO = 200
VIDEO_MIN_PIXELS = 128 * 32 * 32
VIDEO_MAX_PIXELS = 768 * 32 * 32
FPS = 2
FPS_MIN_FRAMES = 4
FPS_MAX_FRAMES = 128
VIDEO_TOTAL_PIXELS = 16384 * 32 * 32

def round_by_factor(number, factor):
    return round(number / factor) * factor

def ceil_by_factor(number, factor):
    return math.ceil(number / factor) * factor

def floor_by_factor(number, factor):
    return math.floor(number / factor) * factor

def get_video(video_path):
    cap = cv2.VideoCapture(video_path)
    frame_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    frame_height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    video_fps = cap.get(cv2.CAP_PROP_FPS)
    cap.release()
    return frame_height, frame_width, total_frames, video_fps

def smart_nframes(total_frames, video_fps):
    min_frames = ceil_by_factor(FPS_MIN_FRAMES, FRAME_FACTOR)
    max_frames = floor_by_factor(min(FPS_MAX_FRAMES, total_frames), FRAME_FACTOR)
    duration = total_frames / video_fps if video_fps != 0 else 0
    if duration - int(duration) > (1 / FPS):
        total_frames = math.ceil(duration * video_fps)
    else:
        total_frames = math.ceil(int(duration) * video_fps)
    nframes = total_frames / video_fps * FPS
    nframes = int(min(min(max(nframes, min_frames), max_frames), total_frames))
    if not (FRAME_FACTOR <= nframes <= total_frames):
        raise ValueError(f"nframes should be in interval [{FRAME_FACTOR}, {total_frames}], but got {nframes}.")
    return nframes

def smart_resize(height, width, nframes, factor=IMAGE_FACTOR):
    min_pixels = VIDEO_MIN_PIXELS
    total_pixels = VIDEO_TOTAL_PIXELS
    max_pixels = max(min(VIDEO_MAX_PIXELS, total_pixels / nframes * FRAME_FACTOR), int(min_pixels * 1.05))
    if max(height, width) / min(height, width) > MAX_RATIO:
        raise ValueError(f"absolute aspect ratio must be smaller than {MAX_RATIO}, got {max(height, width) / min(height, width)}")
    h_bar = max(factor, round_by_factor(height, factor))
    w_bar = max(factor, round_by_factor(width, factor))
    if h_bar * w_bar > max_pixels:
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = floor_by_factor(height / beta, factor)
        w_bar = floor_by_factor(width / beta, factor)
    elif h_bar * w_bar < min_pixels:
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = ceil_by_factor(height * beta, factor)
        w_bar = ceil_by_factor(width * beta, factor)
    return h_bar, w_bar

def video_token_calculate(video_path):
    height, width, total_frames, video_fps = get_video(video_path)
    nframes = smart_nframes(total_frames, video_fps)
    resized_height, resized_width = smart_resize(height, width, nframes)
    video_token = int(math.ceil(nframes / FPS) * resized_height / 32 * resized_width / 32)
    video_token += 2
    return video_token

if __name__ == "__main__":
    video_path = "spring_mountain.mp4"  # Replace with your video path
    video_token = video_token_calculate(video_path)
    print("video_tokens:", video_token)
For token pricing, see Model list.
Model details
| Model | Version | Context window | Max input | Max output |
|---|---|---|---|---|
| qwen3-livetranslate-flash | Stable | 53,248 tokens | 49,152 tokens | 4,096 tokens |
| qwen3-livetranslate-flash-2025-12-01 | Snapshot | 53,248 tokens | 49,152 tokens | 4,096 tokens |
qwen3-livetranslate-flash currently has the same capabilities as qwen3-livetranslate-flash-2025-12-01.
Supported languages
Use these language codes for source_lang and target_lang. Some target languages support text output only.
| Language code | Language | Supported output |
|---|---|---|
| en | English | Audio, text |
| zh | Chinese | Audio, text |
| ru | Russian | Audio, text |
| fr | French | Audio, text |
| de | German | Audio, text |
| pt | Portuguese | Audio, text |
| es | Spanish | Audio, text |
| it | Italian | Audio, text |
| id | Indonesian | Text |
| ko | Korean | Audio, text |
| ja | Japanese | Audio, text |
| vi | Vietnamese | Text |
| th | Thai | Text |
| ar | Arabic | Text |
| yue | Cantonese | Audio, text |
| hi | Hindi | Text |
| el | Greek | Text |
| tr | Turkish | Text |
Supported voices
Set the voice parameter in audio when output includes synthesized audio.
| Voice name | voice parameter | Description | Supported languages |
|---|---|---|---|
| Cherry | Cherry | A cheerful, friendly, and genuine young woman. | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean |
| Nofish | Nofish | A designer who has difficulty pronouncing retroflex consonants. | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean |
| Shanghai-Jada | Jada | A bustling and energetic Shanghai lady. | Chinese |
| Beijing-Dylan | Dylan | A young man who grew up in the hutongs of Beijing. | Chinese |
| Sichuan-Sunny | Sunny | A sweet girl from Sichuan. | Chinese |
| Tianjin-Peter | Peter | A voice in the style of a Tianjin crosstalk performer (the supporting role). | Chinese |
| Cantonese-Kiki | Kiki | A sweet best friend from Hong Kong. | Cantonese |
| Sichuan-Eric | Eric | A man from Chengdu, Sichuan, who is unconventional and stands out from the crowd. | Chinese |
FAQ
When I input a video file, what content is translated?
The model translates the video's audio track. Visual information improves translation accuracy.
For example, if the audio says "This is a mask":
- When the video shows a medical mask, the model translates it as "This is a medical mask."
- When the video shows a masquerade mask, the model translates it as "This is a masquerade mask."
API reference
For full input and output parameter details, see Audio and video translation - Qwen.