
Alibaba Cloud Model Studio: Audio and video translation - Qwen API reference

Last Updated: Mar 15, 2026

qwen3-livetranslate-flash translates audio and video through the OpenAI-compatible chat completions endpoint. All requests are streamed.

Note: The DashScope interface is not supported.

Supported models

  • qwen3-livetranslate-flash

  • qwen3-livetranslate-flash-2025-12-01

Prerequisites

Before you begin, complete the following:

  1. Create an API key

  2. Configure the API key as an environment variable

  3. Install the OpenAI SDK (for Python or Node.js)
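Step 2 can be done as follows on Linux or macOS (the key value is a placeholder for your actual API key; on Windows, use set or setx instead):

```shell
# Placeholder value: replace with your actual Model Studio API key.
export DASHSCOPE_API_KEY="your-api-key"
```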

Endpoints

Region SDK base_url HTTP endpoint
Singapore https://dashscope-intl.aliyuncs.com/compatible-mode/v1 POST https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions
Beijing https://dashscope.aliyuncs.com/compatible-mode/v1 POST https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions

Quick start

The following examples translate an audio file and return both translated text and audio through streaming. Replace the base_url if you use the Beijing region.

Python

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",  # Singapore
)

completion = client.chat.completions.create(
    model="qwen3-livetranslate-flash",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "input_audio",
                    "input_audio": {
                        "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250211/tixcef/cherry.wav",
                        "format": "wav",
                    },
                }
            ],
        }
    ],
    modalities=["text", "audio"],
    audio={"voice": "Cherry", "format": "wav"},
    stream=True,
    stream_options={"include_usage": True},
    extra_body={"translation_options": {"source_lang": "zh", "target_lang": "en"}},
)

for chunk in completion:
    print(chunk)

Node.js

import OpenAI from "openai";

const client = new OpenAI({
    apiKey: process.env.DASHSCOPE_API_KEY,
    baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1",  // Singapore
});

async function main() {
    const completion = await client.chat.completions.create({
        model: "qwen3-livetranslate-flash",
        messages: [
            {
                role: "user",
                content: [
                    {
                        type: "input_audio",
                        input_audio: {
                            data: "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250211/tixcef/cherry.wav",
                            format: "wav",
                        },
                    },
                ],
            },
        ],
        modalities: ["text", "audio"],
        audio: { voice: "Cherry", format: "wav" },
        stream: true,
        stream_options: { include_usage: true },
        translation_options: { source_lang: "zh", target_lang: "en" },
    });

    for await (const chunk of completion) {
        console.log(JSON.stringify(chunk));
    }
}

main();

curl

curl -X POST https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions \
  -H "Authorization: Bearer $DASHSCOPE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-livetranslate-flash",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "input_audio",
            "input_audio": {
              "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250211/tixcef/cherry.wav",
              "format": "wav"
            }
          }
        ]
      }
    ],
    "modalities": ["text", "audio"],
    "audio": {
      "voice": "Cherry",
      "format": "wav"
    },
    "stream": true,
    "stream_options": {
      "include_usage": true
    },
    "translation_options": {
      "source_lang": "zh",
      "target_lang": "en"
    }
  }'

Video input

To translate video instead of audio, set the content type to video_url:

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video_url",
                "video_url": {
                    "url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241115/cqqkru/1.mp4"
                },
            }
        ],
    },
]

All other parameters remain the same.

Request body

Required parameters

Parameter Type Description
model string Model name. Valid values: qwen3-livetranslate-flash, qwen3-livetranslate-flash-2025-12-01.
messages array An array of messages. Only one user message is supported.
stream boolean Must be set to true. The default is false, but the model supports only streaming output.
translation_options object Translation configuration. See Translation options. This is a non-standard OpenAI parameter. In the Python SDK, pass it inside extra_body. In Node.js or HTTP, pass it at the top level.

Optional parameters

Parameter Type Default Description
modalities array ["text"] Output modality. Set to ["text", "audio"] to receive both text and audio output, or ["text"] for text only.
audio object - Output audio configuration. Required when modalities includes "audio". See Audio output options.
stream_options object - Streaming configuration. See Stream options.
max_tokens integer Model maximum The maximum number of tokens to generate. Generation stops at this limit or when complete.
seed integer - Random seed for reproducibility. The same seed produces identical output for identical requests. Range: [0, 2^31-1].

Sampling parameters

For translation accuracy, keep these parameters at their default values.

Parameter Type Default Range Notes
temperature float 0.000001 [0, 2) Controls output diversity.
top_p float 0.8 (0, 1.0] Nucleus sampling threshold.
presence_penalty float 0 [-2.0, 2.0] Reduces repetition when positive.
top_k integer 1 >= 0 Candidate set size. If the value is None or greater than 100, top_k is disabled and only top_p takes effect. Non-standard OpenAI parameter. Python SDK: use extra_body.
repetition_penalty float 1.05 > 0 Penalizes repeated sequences. Non-standard OpenAI parameter. Python SDK: use extra_body.
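With the Python SDK, the two non-standard sampling parameters travel inside extra_body alongside translation_options. A minimal sketch (the values shown are the documented defaults, which you would normally omit):

```python
# Sketch: non-standard parameters go through extra_body in the Python SDK.
# The values shown are the documented defaults.
extra_body = {
    "translation_options": {"source_lang": "zh", "target_lang": "en"},
    "top_k": 1,                  # disabled when None or greater than 100
    "repetition_penalty": 1.05,  # > 0; higher values penalize repetition more
}
```

This dict is passed as the extra_body argument of client.chat.completions.create(...); in Node.js or HTTP requests, the same keys go at the top level of the request body.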

Message object

The messages array must contain exactly one object with role set to user.

Properties of content array items:

Field Type Required Description
type string Yes input_audio for audio input, video_url for video input.
input_audio object When type is input_audio Audio input. See below.
video_url object When type is video_url Video input. See below.

input_audio object:

Field Type Required Description
data string Yes URL of the audio file, or a Base64 data URL. For local files, see Input a Base64-encoded local file.
format string Yes Audio format, such as mp3 or wav.

video_url object:

Field Type Required Description
url string Yes Public URL of the video file, or a Base64 data URL. For local files, see Input a Base64-encoded local file.
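Both data (for audio) and url (for video) also accept Base64 data URLs. The following is a minimal sketch of building one from a local file, assuming the standard data: URL syntax; the exact format Model Studio expects is described in Input a Base64-encoded local file:

```python
import base64

def to_data_url(path, mime="audio/wav"):
    """Read a local file and return it as a data: URL (assumed format)."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime};base64,{b64}"
```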

Translation options

Field Type Required Description
source_lang string No Source language. Valid values are listed in Supported languages; the examples in this topic use codes such as zh. If omitted, the source language is auto-detected.
target_lang string Yes Target language. Valid values are listed in Supported languages; the examples in this topic use codes such as en.
Note: translation_options is a non-standard OpenAI parameter. In the Python SDK, pass it inside extra_body:
extra_body={"translation_options": {"source_lang": "zh", "target_lang": "en"}}
In Node.js or HTTP requests, pass it at the top level of the request body.

Audio output options

Required when modalities is ["text", "audio"].

Field Type Required Description
voice string Yes Voice for the output audio. See Supported voices.
format string Yes Output audio format. Only wav is supported.

Stream options

Field Type Default Description
include_usage boolean false When true, the final chunk includes token usage details.

Response

The API returns a series of streaming chunks, each as a chat.completion.chunk object. Chunks fall into three categories: text, audio, and token usage.

Text chunk

Contains incremental translated text in choices[0].delta.content:

{
  "id": "chatcmpl-c22a54b8-40cc-4a1d-988b-f84cdf86868f",
  "choices": [
    {
      "delta": {
        "content": " of",
        "role": null,
        "audio": null
      },
      "finish_reason": null,
      "index": 0
    }
  ],
  "created": 1764755440,
  "model": "qwen3-livetranslate-flash",
  "object": "chat.completion.chunk"
}

Audio chunk

Contains incremental Base64-encoded audio in choices[0].delta.audio.data:

{
  "id": "chatcmpl-c22a54b8-40cc-4a1d-988b-f84cdf86868f",
  "choices": [
    {
      "delta": {
        "content": null,
        "role": null,
        "audio": {
          "data": "///+//7////+////////////AAAAAAAAAAABA......",
          "expires_at": 1764755440,
          "id": "audio_c22a54b8-40cc-4a1d-988b-f84cdf86868f"
        }
      },
      "finish_reason": null,
      "index": 0
    }
  ],
  "created": 1764755440,
  "model": "qwen3-livetranslate-flash",
  "object": "chat.completion.chunk"
}

Token usage chunk

Returned as the final chunk when include_usage is true. The choices array is empty, and usage contains the token breakdown:

{
  "id": "chatcmpl-c22a54b8-40cc-4a1d-988b-f84cdf86868f",
  "choices": [],
  "created": 1764755440,
  "model": "qwen3-livetranslate-flash",
  "object": "chat.completion.chunk",
  "usage": {
    "completion_tokens": 242,
    "prompt_tokens": 415,
    "total_tokens": 657,
    "completion_tokens_details": {
      "accepted_prediction_tokens": null,
      "audio_tokens": 191,
      "reasoning_tokens": null,
      "rejected_prediction_tokens": null,
      "text_tokens": 51
    },
    "prompt_tokens_details": {
      "audio_tokens": 415,
      "cached_tokens": null,
      "text_tokens": 0,
      "video_tokens": null
    }
  }
}
Note: For video input, prompt_tokens_details.audio_tokens includes the audio tokens extracted from the video. video_tokens reports the video-specific token count.
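Putting the three chunk types together, a consumer loop might look like the following sketch. Chunks are shown as plain dicts shaped like the examples above; with the OpenAI Python SDK you would read the same fields from chunk.choices[0].delta and chunk.usage:

```python
import base64

def collect(chunks):
    """Assemble streamed chunks into (text, audio_bytes, usage)."""
    text_parts, audio_parts, usage = [], [], None
    for chunk in chunks:
        if not chunk["choices"]:               # final usage-only chunk
            usage = chunk.get("usage")
            continue
        delta = chunk["choices"][0]["delta"]
        if delta.get("content"):               # text chunk
            text_parts.append(delta["content"])
        audio = delta.get("audio")
        if audio and audio.get("data"):        # audio chunk: Base64 payload
            audio_parts.append(base64.b64decode(audio["data"]))
    return "".join(text_parts), b"".join(audio_parts), usage
```

The concatenated audio bytes can then be written to a file; depending on how the stream frames the fragments, you may need to prepend a WAV header before playback.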

Response fields

Field Type Description
id string The request identifier. Identical across all chunks.
choices array Generated content. Empty in the final usage chunk.
choices[].delta.content string Incremental translated text. null in audio chunks.
choices[].delta.audio object Incremental audio data. null in text chunks.
choices[].delta.audio.data string Base64-encoded audio segment.
choices[].delta.audio.id string Unique identifier for the output audio.
choices[].delta.audio.expires_at integer Timestamp when the request was created.
choices[].delta.role string Message role. Present only in the first chunk.
choices[].finish_reason string stop when generation completes normally, length when truncated by max_tokens, null while in progress.
choices[].index integer Always 0.
created integer Unix timestamp for the request. Identical across all chunks.
model string The model name.
object string Always chat.completion.chunk.
usage object Token consumption. Present only in the final chunk when include_usage is true.
usage.prompt_tokens integer Total input tokens.
usage.completion_tokens integer Total output tokens.
usage.total_tokens integer Sum of prompt_tokens and completion_tokens.
usage.completion_tokens_details.audio_tokens integer Output audio tokens.
usage.completion_tokens_details.text_tokens integer Output text tokens.
usage.prompt_tokens_details.audio_tokens integer Input audio tokens. For video input, this includes audio extracted from the video.
usage.prompt_tokens_details.text_tokens integer Input text tokens. Always 0.
usage.prompt_tokens_details.video_tokens integer Input video tokens. Present only for video input.

Fields fixed to null

The following fields are present in the response for OpenAI compatibility but always return null:

reasoning_content, function_call, refusal, tool_calls, logprobs, service_tier, system_fingerprint

Usage notes

  • Streaming only. Set stream to true. Non-streaming calls are unsupported.

  • Single message. The messages array accepts one user message only.

  • Non-standard parameters. translation_options, top_k, and repetition_penalty are not in the standard OpenAI API. Python SDK: pass in extra_body. Node.js/HTTP: include at top level.

  • Sampling defaults. Defaults for temperature, top_p, top_k, presence_penalty, and repetition_penalty are optimized for translation accuracy. Changing them may degrade quality.

  • Output audio format. Only wav is supported.

  • Automatic language detection. If source_lang is omitted, the input language is auto-detected.
