Alibaba Cloud Model Studio: Audio and video translation - Qwen

Last Updated: Dec 11, 2025

Qwen3-LiveTranslate-Flash is an audio and video translation model. It supports translation between 18 languages, such as Chinese, English, Russian, and French. The model uses visual context to improve translation accuracy and can output both text and audio.

Procedure

  1. Set languages: Set the source language (source_lang) and the target language (target_lang) in the translation_options parameter. For the full list, see Supported languages.

    Omitting source_lang enables automatic detection, but specifying the language improves translation accuracy.
  2. Input file: The messages array must contain exactly one message where the role is user. The content field must contain the URL or Base64-encoded data of the audio or video to be translated.

  3. Control the output modality: Use the modalities parameter:

    • ["text"]: Outputs only text.

    • ["text","audio"]: Outputs text and Base64-encoded audio.

      If the output includes audio, you must set the voice (voice) in the audio parameter. For the full list, see Supported voices.

The following is the core code for using the OpenAI Python SDK:

# Import dependencies and create a client...
completion = client.chat.completions.create(
    model="qwen3-livetranslate-flash",  # Select the model
    # The messages array contains only one user message; the content is the file to be translated
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "input_audio",
                    "input_audio": {
                        "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250211/tixcef/cherry.wav",
                        "format": "wav"
                    }
                }
            ]
        }
    ],
    # translation_options is not a standard OpenAI parameter and must be passed through extra_body
    extra_body={"translation_options": {"source_lang": "zh", "target_lang": "en"}},
    # Example: Output text and audio
    modalities=["text", "audio"],
    audio={"voice": "Cherry", "format": "wav"},
    # The model supports only streaming output (see Limits)
    stream=True,
)

Limits

  • Single-turn translation: The model is designed for translation tasks and does not support multi-turn conversations.

  • No system message: The model does not support setting the global behavior using the system role.

  • Call method: Only streaming output that is compatible with the OpenAI protocol is supported.

Parsing the response

chunk is a streaming response object:

  • Text: Read the content from chunk.choices[0].delta.content.

  • Audio: The Base64-encoded audio data is in chunk.choices[0].delta.audio["data"]. The audio output from the model has a sample rate of 24 kHz.
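
For example, a minimal sketch of a parsing loop (assuming completion is the streaming response returned by client.chat.completions.create with stream=True, as in the examples below):

# Minimal sketch: separate incremental text from Base64 audio fragments.
text_parts = []
audio_base64 = ""
for chunk in completion:
    if not chunk.choices:                  # the final chunk may carry only usage statistics
        continue
    delta = chunk.choices[0].delta
    if getattr(delta, "content", None):    # incremental translated text
        text_parts.append(delta.content)
    audio = getattr(delta, "audio", None)
    if isinstance(audio, dict) and audio.get("data"):
        audio_base64 += audio["data"]      # Base64-encoded audio at 24 kHz
print("".join(text_parts))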

Model availability

| Model | Version | Context window (tokens) | Max input (tokens) | Max output (tokens) |
| --- | --- | --- | --- | --- |
| qwen3-livetranslate-flash (currently has the same capabilities as qwen3-livetranslate-flash-2025-12-01) | Stable | 53,248 | 49,152 | 4,096 |
| qwen3-livetranslate-flash-2025-12-01 | Snapshot | 53,248 | 49,152 | 4,096 |

Getting started

qwen3-livetranslate-flash supports audio or video input and outputs text and audio.

First, create an API key and export it as an environment variable. If you use the OpenAI SDK, install the SDK.

The following example uses audio input. To use video input, uncomment the corresponding code.

Python

import os
from openai import OpenAI

client = OpenAI(
    # If you have not configured an environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx",
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    # The following is the base_url for the Singapore region. If you use a model in the Beijing region, replace the base_url with: https://dashscope.aliyuncs.com/compatible-mode/v1
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

# ---------------- Audio input ----------------
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "input_audio",
                "input_audio": {
                    "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250211/tixcef/cherry.wav",
                    "format": "wav",
                },
            }
        ],
    }
]

# ---------------- Video input (uncomment to use) ----------------
# messages = [
#     {
#         "role": "user",
#         "content": [
#             {
#                 "type": "video_url",
#                 "video_url": {
#                     "url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241115/cqqkru/1.mp4"
#                 },
#             }
#         ],
#     },
# ]

completion = client.chat.completions.create(
    model="qwen3-livetranslate-flash",
    messages=messages,
    modalities=["text", "audio"],
    audio={"voice": "Cherry", "format": "wav"},
    stream=True,
    stream_options={"include_usage": True},
    extra_body={"translation_options": {"source_lang": "zh", "target_lang": "en"}},
)

for chunk in completion:
    print(chunk)

Node.js

import OpenAI from "openai";

const client = new OpenAI({
    // If you have not configured an environment variable, replace the following line with your Model Studio API key: apiKey: "sk-xxx",
    apiKey: process.env.DASHSCOPE_API_KEY,
    // The following is the base_url for the Singapore region. If you use a model in the Beijing region, replace the base_url with: https://dashscope.aliyuncs.com/compatible-mode/v1
    baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
});

// ---------------- Audio input ----------------
const messages = [
    {
        role: "user",
        content: [
            {
                type: "input_audio",
                input_audio: {
                    data: "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250211/tixcef/cherry.wav",
                    format: "wav",
                },
            },
        ],
    },
];

// ---------------- Video input (uncomment to use) ----------------
// const messages = [
//     {
//         role: "user",
//         content: [
//             {
//                 type: "video_url",
//                 video_url: {
//                     url: "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241115/cqqkru/1.mp4",
//                 },
//             },
//         ],
//     },
// ];

async function main() {
    const completion = await client.chat.completions.create({
        model: "qwen3-livetranslate-flash",
        messages: messages,
        modalities: ["text", "audio"],
        audio: { voice: "Cherry", format: "wav" },
        stream: true,
        stream_options: { include_usage: true },
        translation_options: { source_lang: "zh", target_lang: "en" },
    });

    for await (const chunk of completion) {
        console.log(JSON.stringify(chunk));
    }
}

main();

curl

# ======= Important =======
# The following is an example for the Singapore region. If you use a model in the Beijing region, replace the request URL with: https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions
# === Delete this comment before execution ===
curl -X POST https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
    "model": "qwen3-livetranslate-flash",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "input_audio",
                    "input_audio": {
                        "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250211/tixcef/cherry.wav",
                        "format": "wav"
                    }
                }
            ]
        }
    ],
    "modalities": ["text", "audio"],
    "audio": {
        "voice": "Cherry",
        "format": "wav"
    },
    "stream": true,
    "stream_options": {
        "include_usage": true
    },
    "translation_options": {
        "source_lang": "zh",
        "target_lang": "en"
    }
}'

This example uses a public file URL. For information about using a local file, see Input a Base64-encoded local file.
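
As a rough illustration of the local-file case, the following minimal sketch Base64-encodes a local WAV file and passes it in the data field. The local_audio.wav path is hypothetical, and the data:;base64, prefix is an assumption; see Input a Base64-encoded local file for the authoritative format.

import base64

# Read a local WAV file and Base64-encode it (hypothetical local path)
with open("local_audio.wav", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("utf-8")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "input_audio",
                "input_audio": {
                    # Assumption: a Base64 data URI replaces the public file URL used above
                    "data": f"data:;base64,{encoded}",
                    "format": "wav",
                },
            }
        ],
    }
]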

Parse Base64 audio data

The model outputs audio in a streaming Base64-encoded format. You can process the data in two ways:

  • Concatenate and decode: Concatenate all Base64 fragments from the stream. After the stream is complete, decode the result and save it as an audio file.

  • Real-time playback: Decode each Base64 fragment in real time and play it directly.

Python

# Installation instructions for pyaudio:
# APPLE Mac OS X
#   brew install portaudio
#   pip install pyaudio
# Debian/Ubuntu
#   sudo apt-get install python-pyaudio python3-pyaudio
#   or
#   pip install pyaudio
# CentOS
#   sudo yum install -y portaudio portaudio-devel && pip install pyaudio
# Microsoft Windows
#   python -m pip install pyaudio

import os
from openai import OpenAI
import base64
import numpy as np
import soundfile as sf

# Initialize the OpenAI client
client = OpenAI(
    # If you have not configured an environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx",
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    # The following is the base_url for the Beijing region. If you use a model in the Singapore region, replace the base_url with: https://dashscope-intl.aliyuncs.com/compatible-mode/v1
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "input_audio",
                "input_audio": {
                    "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250211/tixcef/cherry.wav",
                    "format": "wav",
                },
            }
        ],
    }
]
completion = client.chat.completions.create(
    model="qwen3-livetranslate-flash",
    messages=messages,
    modalities=["text", "audio"],
    audio={"voice": "Cherry", "format": "wav"},
    stream=True,
    stream_options={"include_usage": True},
    extra_body={"translation_options": {"source_lang": "zh", "target_lang": "en"}},
)

# Method 1: Decode after generation is complete
audio_string = ""
for chunk in completion:
    if chunk.choices:
        if hasattr(chunk.choices[0].delta, "audio"):
            try:
                audio_string += chunk.choices[0].delta.audio["data"]
            except Exception as e:
                print(chunk.choices[0].delta.audio["transcript"])
    else:
        print(chunk.usage)

wav_bytes = base64.b64decode(audio_string)
audio_np = np.frombuffer(wav_bytes, dtype=np.int16)
sf.write("audio_assistant_py.wav", audio_np, samplerate=24000)

# Method 2: Decode during generation (comment out the code for Method 1 to use Method 2)
# # Initialize PyAudio
# import pyaudio
# import time
# p = pyaudio.PyAudio()
# # Create an audio stream
# stream = p.open(format=pyaudio.paInt16,
#                 channels=1,
#                 rate=24000,
#                 output=True)

# for chunk in completion:
#     if chunk.choices:
#         if hasattr(chunk.choices[0].delta, "audio"):
#             try:
#                 audio_string = chunk.choices[0].delta.audio["data"]
#                 wav_bytes = base64.b64decode(audio_string)
#                 audio_np = np.frombuffer(wav_bytes, dtype=np.int16)
#                 # Play the audio data directly
#                 stream.write(audio_np.tobytes())
#             except Exception as e:
#                 print(chunk.choices[0].delta.audio["transcript"])

# time.sleep(0.8)
# # Clean up resources
# stream.stop_stream()
# stream.close()
# p.terminate()

Node.js

// Preparations before running:
// For Windows/Mac/Linux:
// 1. Make sure Node.js is installed (version >= 14 is recommended).
// 2. Run the following command to install the necessary dependencies:
//    npm install openai wav
// 
// To use the real-time playback feature (Method 2), you also need to:
// Windows:
//    npm install speaker
// Mac:
//    brew install portaudio
//    npm install speaker
// Linux (Ubuntu/Debian):
//    sudo apt-get install libasound2-dev
//    npm install speaker

import OpenAI from "openai";

const client = new OpenAI({
    // If you have not configured an environment variable, replace the following line with your Model Studio API key: apiKey: "sk-xxx",
    apiKey: process.env.DASHSCOPE_API_KEY,
    // The following is the base_url for the Beijing region. If you use a model in the Singapore region, replace the base_url with: https://dashscope-intl.aliyuncs.com/compatible-mode/v1
    baseURL: "https://dashscope.aliyuncs.com/compatible-mode/v1",
});

// ---------------- Audio input ----------------
const messages = [
    {
        role: "user",
        content: [
            {
                type: "input_audio",
                input_audio: {
                    data: "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250211/tixcef/cherry.wav",
                    format: "wav",
                },
            },
        ],
    },
];

const completion = await client.chat.completions.create({
    model: "qwen3-livetranslate-flash",
    messages: messages,
    modalities: ["text", "audio"],
    audio: { voice: "Cherry", format: "wav" },
    stream: true,
    stream_options: { include_usage: true },
    translation_options: { source_lang: "zh", target_lang: "en" },
});

// Method 1: Decode after generation is complete
// Requires installation: npm install wav
import { createWriteStream } from 'node:fs';  // node:fs is a built-in Node.js module and does not need to be installed
import { Writer } from 'wav';

async function convertAudio(audioString, audioPath) {
    try {
        // Decode the Base64 string into a Buffer
        const wavBuffer = Buffer.from(audioString, 'base64');
        // Create a WAV file write stream
        const writer = new Writer({
            sampleRate: 24000,  // Sample rate
            channels: 1,        // Mono channel
            bitDepth: 16        // 16-bit depth
        });
        // Create an output file stream and pipe the connection
        const outputStream = createWriteStream(audioPath);
        writer.pipe(outputStream);

        // Write PCM data and end the stream
        writer.write(wavBuffer);
        writer.end();

        // Use a Promise to wait for the file to finish writing
        await new Promise((resolve, reject) => {
            outputStream.on('finish', resolve);
            outputStream.on('error', reject);
        });

        // Add extra wait time to ensure the audio is complete
        await new Promise(resolve => setTimeout(resolve, 800));

        console.log(`Audio file successfully saved as ${audioPath}`);
    } catch (error) {
        console.error('An error occurred during processing:', error);
    }
}

let audioString = "";
for await (const chunk of completion) {
    if (Array.isArray(chunk.choices) && chunk.choices.length > 0) {
        if (chunk.choices[0].delta.audio) {
            if (chunk.choices[0].delta.audio["data"]) {
                audioString += chunk.choices[0].delta.audio["data"];
            }
        }
    } else {
        console.log(chunk.usage);
    }
}
// Execute the conversion
convertAudio(audioString, "audio_assistant_mjs.wav");

// Method 2: Decode and play in real time during generation
// You must first install the necessary components according to the instructions for your system above.
// import Speaker from 'speaker'; // Import the audio playback library

// // Create a Speaker instance (configuration matches WAV file parameters)
// const speaker = new Speaker({
//     sampleRate: 24000,  // Sample rate
//     channels: 1,        // Number of channels
//     bitDepth: 16,       // Bit depth
//     signed: true        // Signed PCM
// });
// for await (const chunk of completion) {
//     if (Array.isArray(chunk.choices) && chunk.choices.length > 0) {
//         if (chunk.choices[0].delta.audio) {
//             if (chunk.choices[0].delta.audio["data"]) {
//                 const pcmBuffer = Buffer.from(chunk.choices[0].delta.audio.data, 'base64');
//                 // Write directly to the speaker for playback
//                 speaker.write(pcmBuffer);
//             }
//         }
//     } else {
//         console.log(chunk.usage);
//     }
// }
// speaker.on('finish', () => console.log('Playback complete'));
// speaker.end(); // Call based on the actual end of the API stream

Billing details

Audio

Each second of input or output audio consumes 12.5 tokens. For example, a 60-second audio clip consumes 60 × 12.5 = 750 tokens.

Video

Tokens for video files are divided into video_tokens (visual) and audio_tokens (audio).

  • video_tokens

    The following script is used to calculate video_tokens. The calculation process is as follows:

    1. Frame sampling: Frames are sampled at a rate of 2 frames per second (FPS). The number of frames is limited to a range of [4, 128].

    2. Size adjustment: The height and width are adjusted to be multiples of 32 pixels. The dimensions are dynamically scaled based on the number of frames to fit within the total pixel limit.

    3. Token calculation: video_tokens = ceil(nframes / FPS) × (resized_height / 32) × (resized_width / 32) + 2, where the extra 2 tokens are visual markers. This matches the script below.

    # Before use, install: pip install opencv-python
    import math
    import os
    import cv2
    
    # Fixed parameters
    FRAME_FACTOR = 2                        # Frame count is rounded to a multiple of this factor
    IMAGE_FACTOR = 32                       # Height and width are rounded to multiples of 32 pixels
    MAX_RATIO = 200                         # Maximum aspect ratio of a video frame
    VIDEO_MIN_PIXELS = 128 * 32 * 32        # Lower limit for pixels per video frame
    VIDEO_MAX_PIXELS = 768 * 32 * 32        # Upper limit for pixels per video frame
    FPS = 2                                 # Sampling rate: 2 frames per second
    FPS_MIN_FRAMES = 4                      # Minimum number of extracted frames
    FPS_MAX_FRAMES = 128                    # Maximum number of extracted frames
    VIDEO_TOTAL_PIXELS = 16384 * 32 * 32    # Upper limit for total pixels across all sampled frames
    
    def round_by_factor(number, factor):
        return round(number / factor) * factor
    
    def ceil_by_factor(number, factor):
        return math.ceil(number / factor) * factor
    
    def floor_by_factor(number, factor):
        return math.floor(number / factor) * factor
    
    def get_video(video_path):
        cap = cv2.VideoCapture(video_path)
        frame_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
        frame_height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
        total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        video_fps = cap.get(cv2.CAP_PROP_FPS)
        cap.release()
        return frame_height, frame_width, total_frames, video_fps
    
    def smart_nframes(total_frames, video_fps):
        min_frames = ceil_by_factor(FPS_MIN_FRAMES, FRAME_FACTOR)
        max_frames = floor_by_factor(min(FPS_MAX_FRAMES, total_frames), FRAME_FACTOR)
        duration = total_frames / video_fps if video_fps != 0 else 0
        if duration - int(duration) > (1 / FPS):
            total_frames = math.ceil(duration * video_fps)
        else:
            total_frames = math.ceil(int(duration) * video_fps)
        nframes = total_frames / video_fps * FPS
        nframes = int(min(min(max(nframes, min_frames), max_frames), total_frames))
        if not (FRAME_FACTOR <= nframes <= total_frames):
            raise ValueError(f"nframes should in interval [{FRAME_FACTOR}, {total_frames}], but got {nframes}.")
        return nframes
    
    def smart_resize(height, width, nframes, factor=IMAGE_FACTOR):
        min_pixels = VIDEO_MIN_PIXELS
        total_pixels = VIDEO_TOTAL_PIXELS
        max_pixels = max(min(VIDEO_MAX_PIXELS, total_pixels / nframes * FRAME_FACTOR), int(min_pixels * 1.05))
        if max(height, width) / min(height, width) > MAX_RATIO:
            raise ValueError(f"absolute aspect ratio must be smaller than {MAX_RATIO}, got {max(height, width) / min(height, width)}")
        h_bar = max(factor, round_by_factor(height, factor))
        w_bar = max(factor, round_by_factor(width, factor))
        if h_bar * w_bar > max_pixels:
            beta = math.sqrt((height * width) / max_pixels)
            h_bar = floor_by_factor(height / beta, factor)
            w_bar = floor_by_factor(width / beta, factor)
        elif h_bar * w_bar < min_pixels:
            beta = math.sqrt(min_pixels / (height * width))
            h_bar = ceil_by_factor(height * beta, factor)
            w_bar = ceil_by_factor(width * beta, factor)
        return h_bar, w_bar
    
    def video_token_calculate(video_path):
        height, width, total_frames, video_fps = get_video(video_path)
        nframes = smart_nframes(total_frames, video_fps)
        resized_height, resized_width = smart_resize(height, width, nframes)
        video_token = int(math.ceil(nframes / FPS) * resized_height / 32 * resized_width / 32)
        video_token += 2  # Visual marker
        return video_token
    
    if __name__ == "__main__":
        video_path = "spring_mountain.mp4"  # Your video path
        video_token = video_token_calculate(video_path)
        print("video_tokens:", video_token)
  • audio_tokens

    Each second of audio consumes 12.5 tokens. If the audio duration is less than 1 second, it is billed as 1 second.
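
    As a rough illustration, the audio_tokens for a local WAV file can be estimated from its duration. The following is a minimal sketch using soundfile (already used in the examples above); how fractional durations longer than 1 second are rounded is not specified here, so the ceiling below is an assumption.

    # Before use, install: pip install soundfile
    import math
    import soundfile as sf

    def audio_token_estimate(audio_path):
        info = sf.info(audio_path)
        duration = info.frames / info.samplerate  # Duration in seconds
        # Audio shorter than 1 second is billed as 1 second;
        # rounding up fractional seconds is an assumption, not documented behavior.
        billed_seconds = max(math.ceil(duration), 1)
        return billed_seconds * 12.5

    if __name__ == "__main__":
        print("audio_tokens:", audio_token_estimate("cherry.wav"))  # Your audio path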

For information about token costs, see Model list.

API reference

For the input and output parameters for qwen3-livetranslate-flash, see Audio and video translation - Qwen.

Supported languages

The language codes in the following table can be used to specify the source and target languages.

Some target languages support text output only.

| Language code | Language | Supported output modalities |
| --- | --- | --- |
| en | English | Audio, text |
| zh | Chinese | Audio, text |
| ru | Russian | Audio, text |
| fr | French | Audio, text |
| de | German | Audio, text |
| pt | Portuguese | Audio, text |
| es | Spanish | Audio, text |
| it | Italian | Audio, text |
| id | Indonesian | Text |
| ko | Korean | Audio, text |
| ja | Japanese | Audio, text |
| vi | Vietnamese | Text |
| th | Thai | Text |
| ar | Arabic | Text |
| yue | Cantonese | Audio, text |
| hi | Hindi | Text |
| el | Greek | Text |
| tr | Turkish | Text |

Supported voices

| Voice name | voice parameter | Description | Supported languages |
| --- | --- | --- | --- |
| Cherry | Cherry | A cheerful, friendly, and genuine young woman. | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean |
| Nofish | Nofish | A designer who has difficulty pronouncing retroflex consonants. | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean |
| Shanghai-Jada | Jada | A bustling and energetic Shanghai lady. | Chinese |
| Beijing-Dylan | Dylan | A young man who grew up in the hutongs of Beijing. | Chinese |
| Sichuan-Sunny | Sunny | A sweet girl from Sichuan. | Chinese |
| Tianjin-Peter | Peter | A voice in the style of a Tianjin crosstalk performer (the supporting role). | Chinese |
| Cantonese-Kiki | Kiki | A sweet best friend from Hong Kong. | Cantonese |
| Sichuan-Eric | Eric | A man from Chengdu, Sichuan, who is unconventional and stands out from the crowd. | Chinese |

FAQ

Q: When I input a video file, what content is translated?

A: The model translates the audio from the video. Visual information is used as context to improve accuracy.

Example:

If the spoken content is "This is a mask":

  • If the video shows a medical mask, the output is "This is a medical mask."

  • If the video shows a masquerade mask, the output is "This is a masquerade mask."