qwen3-livetranslate-flash-realtime is a vision-enhanced, real-time translation model from Qwen. It can simultaneously process streaming audio and image inputs, such as from a video stream. It uses visual context to improve translation accuracy and outputs high-quality translated text and audio in real time.
For an online demo, see One-click deployment using Function Compute.
How to use
1. Configure the connection
The qwen3-livetranslate-flash-realtime model connects using the WebSocket protocol. The connection requires the following configuration items:
Configuration item | Description |
Endpoint | China site: wss://dashscope.aliyuncs.com/api-ws/v1/realtime International site: wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime |
Query parameter | The query parameter is `model`. Set it to the name of the model that you want to access. Example: ?model=qwen3-livetranslate-flash-realtime |
Message header | Use a Bearer token for authentication: Authorization: Bearer DASHSCOPE_API_KEY, where DASHSCOPE_API_KEY is the API key that you obtain from Alibaba Cloud Model Studio. |
Use the following Python sample code to establish a connection.
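A minimal connection sketch using the websocket-client package (installed in the Getting Started section below) is shown here; the on_open and on_message handlers are placeholders, and the international endpoint is used:

import os
import json
import websocket  # provided by the websocket-client package

API_KEY = os.environ["DASHSCOPE_API_KEY"]
URL = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime?model=qwen3-livetranslate-flash-realtime"

def on_open(ws):
    print("[INFO] Connected")

def on_message(ws, message):
    # Every server event arrives as a JSON text frame
    event = json.loads(message)
    print("[EVENT]", event.get("type"))

ws = websocket.WebSocketApp(
    URL,
    header=[f"Authorization: Bearer {API_KEY}"],
    on_open=on_open,
    on_message=on_message,
)
ws.run_forever()  # blocks and dispatches incoming server events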
2. Configure the target language, output modality, and voice
To configure these settings, send the session.update client event. A complete example event follows the list below.

Target translation language
Use the session.translation.language parameter to set the target language. For more information, see Supported languages.

Output modality
Use the session.modalities parameter to set the output modality. You can set it to ["text"] (text-only output) or ["text", "audio"] (text and audio output).

Voice
Use the session.voice parameter to set the voice. For more information, see Supported voices.
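For example, a session.update event that requests Chinese output as both text and audio with the Cherry voice could look like the following sketch. The field nesting mirrors the parameter paths above (session.translation.language, session.modalities, session.voice); check the API reference for the exact schema:

import json

session_update = {
    "type": "session.update",
    "session": {
        "translation": {"language": "zh"},  # target language, see Supported languages
        "modalities": ["text", "audio"],    # use ["text"] for text-only output
        "voice": "Cherry",                  # see Supported voices
    },
}
ws.send(json.dumps(session_update))  # ws is the WebSocket connection from step 1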
3. Input audio and images
The client sends Base64-encoded audio and image data using the input_audio_buffer.append and input_image_buffer.append events. Audio input is required. Image input is optional.
Images can be from local files or captured in real time from a video stream.
The server automatically detects the start and end of the audio and triggers a model response.
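A sketch of the two append events is shown below. The event names are those given above; the payload field names ("audio" and "image") and the helper functions are illustrative assumptions, so verify them against the API reference:

import base64
import json

def send_audio_chunk(ws, pcm_bytes: bytes):
    # Audio is sent as Base64-encoded PCM in input_audio_buffer.append events
    ws.send(json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_bytes).decode("ascii"),
    }))

def send_image_frame(ws, image_bytes: bytes):
    # Optional image frames (for example, frames captured from a video stream)
    # go into input_image_buffer.append events
    ws.send(json.dumps({
        "type": "input_image_buffer.append",
        "image": base64.b64encode(image_bytes).decode("ascii"),
    }))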
4. Receive the model response
When the server detects the end of the audio, the model begins to respond. The response format depends on the configured output modality.
Text-only output
The server returns the complete translated text in a response.text.done event.
Text and audio output
Text: the complete translated text is returned in a response.audio_transcript.done event.
Audio: incremental, Base64-encoded audio data is returned in response.audio.delta events.
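A sketch of how these events might be dispatched on the client is shown below. It assumes event is the parsed JSON of one server message; the payload field names ("text", "transcript", "delta") are assumptions borrowed from common realtime-API conventions, so confirm them in the API reference:

import base64

def handle_event(event: dict, audio_player):
    etype = event.get("type")
    if etype == "response.text.done":
        # Text-only output: the full translation arrives in one event
        print("Translation:", event.get("text"))
    elif etype == "response.audio_transcript.done":
        # Text and audio output: full transcript of the translated audio
        print("Transcript:", event.get("transcript"))
    elif etype == "response.audio.delta":
        # Text and audio output: incremental Base64-encoded audio written to a playback stream
        audio_player.write(base64.b64decode(event.get("delta", "")))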
Supported models
qwen3-livetranslate-flash-realtime is a multilingual, real-time audio and video translation model. It can recognize 18 languages and translate them into audio in 10 languages in real time.
Core features:
Multilingual support: Supports 18 languages, such as Chinese, English, French, German, Russian, Japanese, and Korean, and 6 Chinese dialects, including Mandarin, Cantonese, and Sichuanese.
Vision enhancement: Uses visual content to improve translation accuracy. The model analyzes lip movements, actions, and on-screen text to enhance translation in noisy environments or for words with multiple meanings.
3-second latency: Achieves simultaneous interpretation latency as low as 3 seconds.
Lossless simultaneous interpretation: Uses semantic unit prediction technology to resolve word order issues between languages. The real-time translation quality is close to that of offline translation.
Natural voice: Generates natural, human-like speech. The model automatically adjusts its tone and emotion based on the source audio.
Model name | Version | Context length (tokens) | Max input (tokens) | Max output (tokens)
qwen3-livetranslate-flash-realtime (current capabilities are equivalent to qwen3-livetranslate-flash-realtime-2025-09-22) | Stable | 53,248 | 49,152 | 4,096
qwen3-livetranslate-flash-realtime-2025-09-22 | Snapshot | 53,248 | 49,152 | 4,096
Getting Started
Prepare the environment
Your Python version must be 3.10 or later.
First, install pyaudio.
macOS
brew install portaudio && pip install pyaudio

Debian/Ubuntu
sudo apt-get install python3-pyaudio or pip install pyaudio

CentOS
sudo yum install -y portaudio portaudio-devel && pip install pyaudio

Windows
pip install pyaudio

After the installation is complete, use pip to install the required WebSocket dependencies:
pip install websocket-client==1.8.0 websockets

Create the client
Create a new Python file locally, name it livetranslate_client.py, and copy the following code into the file:
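A minimal sketch of what such a client might contain, written against the events described earlier in this topic, is shown below. The attribute and method names match the main.py examples that follow; the sample rates, chunk size, and response field names are assumptions to check against the API reference, not the official sample code:

import base64
import json

import pyaudio
import websockets

# Default endpoint for the international site; use dashscope.aliyuncs.com for the China site.
DEFAULT_URL = (
    "wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime"
    "?model=qwen3-livetranslate-flash-realtime"
)


class LiveTranslateClient:
    def __init__(self, api_key: str, target_language: str = "en",
                 voice: str = "Cherry", audio_enabled: bool = True, url: str = DEFAULT_URL):
        self.api_key = api_key
        self.target_language = target_language
        self.voice = voice
        self.audio_enabled = audio_enabled
        self.url = url
        self.ws = None
        self.is_connected = False
        # Microphone capture parameters (assumed: 16 kHz, 16-bit mono PCM, 100 ms chunks)
        self.pyaudio_instance = pyaudio.PyAudio()
        self.input_format = pyaudio.paInt16
        self.input_channels = 1
        self.input_rate = 16000
        self.input_chunk = 1600
        self._player = None  # playback stream for translated audio (assumed: 24 kHz PCM)

    async def connect(self):
        # On websockets versions earlier than 14, pass the header with extra_headers instead.
        self.ws = await websockets.connect(
            self.url, additional_headers={"Authorization": f"Bearer {self.api_key}"}
        )
        self.is_connected = True
        # Configure the target language, output modality, and voice with session.update
        await self.ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "translation": {"language": self.target_language},
                "modalities": ["text", "audio"] if self.audio_enabled else ["text"],
                "voice": self.voice,
            },
        }))

    def start_audio_player(self):
        if self.audio_enabled and self._player is None:
            self._player = self.pyaudio_instance.open(
                format=pyaudio.paInt16, channels=1, rate=24000, output=True
            )

    async def send_audio_chunk(self, pcm_bytes: bytes):
        await self.ws.send(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(pcm_bytes).decode("ascii"),
        }))

    async def send_image_frame(self, image_bytes: bytes):
        await self.ws.send(json.dumps({
            "type": "input_image_buffer.append",
            "image": base64.b64encode(image_bytes).decode("ascii"),
        }))

    async def handle_server_messages(self, on_text):
        # Dispatch server events until the connection closes.
        try:
            async for message in self.ws:
                event = json.loads(message)
                etype = event.get("type", "")
                if etype in ("response.text.done", "response.audio_transcript.done"):
                    on_text((event.get("text") or event.get("transcript") or "") + "\n")
                elif etype == "response.audio.delta" and self._player is not None:
                    self._player.write(base64.b64decode(event.get("delta", "")))
        except websockets.ConnectionClosed:
            pass

    async def close(self):
        self.is_connected = False
        if self.ws is not None:
            await self.ws.close()
        if self._player is not None:
            self._player.stop_stream()
            self._player.close()
        self.pyaudio_instance.terminate()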
Interact with the model

In the same folder as livetranslate_client.py, create another Python file, name it main.py, and copy the following code into the file:
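A minimal main.py sketch that works with the client sketch above could look like this; the target language and voice are examples, and the program runs until you stop it with Ctrl+C:

import asyncio
import contextlib
import os

from livetranslate_client import LiveTranslateClient


async def main():
    client = LiveTranslateClient(
        api_key=os.environ["DASHSCOPE_API_KEY"],
        target_language="en",   # example target language
        voice="Cherry",         # example voice
        audio_enabled=True,
    )
    await client.connect()
    client.start_audio_player()
    # Handle server events in a separate task and print translated text as it arrives
    message_task = asyncio.create_task(
        client.handle_server_messages(lambda text: print(text, end="", flush=True))
    )

    # Stream microphone audio until the program is stopped with Ctrl+C
    pa = client.pyaudio_instance
    stream = pa.open(
        format=client.input_format,
        channels=client.input_channels,
        rate=client.input_rate,
        input=True,
        frames_per_buffer=client.input_chunk,
    )
    loop = asyncio.get_running_loop()
    try:
        while client.is_connected:
            data = await loop.run_in_executor(None, stream.read, client.input_chunk)
            await client.send_audio_chunk(data)
    finally:
        stream.stop_stream()
        stream.close()
        await client.close()
        message_task.cancel()
        with contextlib.suppress(asyncio.CancelledError):
            await message_task


if __name__ == "__main__":
    asyncio.run(main())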
Run main.py and speak the sentences you want to translate into the microphone. The model provides the translated audio and text in real time. The system automatically detects your speech and sends the audio to the server, so no manual action is required.
Use images to improve translation accuracy
The qwen3-livetranslate-flash-realtime model can accept image input to assist with audio translation. This is useful for scenarios involving homonyms or recognizing uncommon proper nouns. You can send a maximum of two images per second.
Download the following sample images to your local machine: mask_medical.png and mask_masquerade.png
Save the following code in the same folder as livetranslate_client.py and run it. Say "What is mask?" into the microphone. When you input the medical mask image, the model translates the phrase as “什么是口罩?” When you input the masquerade mask image, the model translates the phrase as “什么是面具?”
import os
import time
import json
import asyncio
import contextlib
import functools

from livetranslate_client import LiveTranslateClient

IMAGE_PATH = "mask_medical.png"
# IMAGE_PATH = "mask_masquerade.png"


def print_banner():
    print("=" * 60)
    print(" Powered by Qwen qwen3-livetranslate-flash-realtime — Single-turn interaction example (mask)")
    print("=" * 60 + "\n")


async def stream_microphone_once(client: LiveTranslateClient, image_bytes: bytes):
    # Open the microphone with the capture parameters defined by the client
    pa = client.pyaudio_instance
    stream = pa.open(
        format=client.input_format,
        channels=client.input_channels,
        rate=client.input_rate,
        input=True,
        frames_per_buffer=client.input_chunk,
    )
    print("[INFO] Recording started. Please speak...")
    loop = asyncio.get_event_loop()
    last_img_time = 0.0
    frame_interval = 0.5  # 2 fps
    try:
        while client.is_connected:
            # Read microphone audio in a worker thread so the event loop is not blocked
            data = await loop.run_in_executor(None, stream.read, client.input_chunk)
            await client.send_audio_chunk(data)
            # Append an image frame every 0.5 seconds
            now = time.time()
            if now - last_img_time >= frame_interval:
                await client.send_image_frame(image_bytes)
                last_img_time = now
    finally:
        stream.stop_stream()
        stream.close()


async def main():
    print_banner()
    api_key = os.environ.get("DASHSCOPE_API_KEY")
    if not api_key:
        print("[ERROR] First, configure the API KEY in the DASHSCOPE_API_KEY environment variable.")
        return

    client = LiveTranslateClient(api_key=api_key, target_language="zh", voice="Cherry", audio_enabled=True)

    def on_text(text: str):
        print(text, end="", flush=True)

    message_task = None  # defined before the try block so the finally clause can check it safely
    try:
        await client.connect()
        client.start_audio_player()
        message_task = asyncio.create_task(client.handle_server_messages(on_text))
        with open(IMAGE_PATH, "rb") as f:
            img_bytes = f.read()
        await stream_microphone_once(client, img_bytes)
        await asyncio.sleep(15)
    finally:
        await client.close()
        if message_task is not None and not message_task.done():
            message_task.cancel()
            with contextlib.suppress(asyncio.CancelledError):
                await message_task


if __name__ == "__main__":
    asyncio.run(main())

One-click deployment using Function Compute
The console does not currently support this demo. You can deploy it with one click as follows:
Open the Function Compute template, enter your API key, and click Create And Deploy Default Environment to try it online.
Wait for about one minute. In Environment Details > Environment Context, retrieve the endpoint. Change http to https in the endpoint (for example, https://qwen-livetranslate-flash-realtime-intl.fcv3.xxx.ap-southeast-1.fc.devsapp.net/) and use the link to interact with the model.

Important: This link uses a self-signed certificate and is for temporary testing only. When you first access it, your browser will display a security warning. This is expected behavior. Do not use this link in a production environment. To proceed, follow your browser's instructions, such as clicking "Advanced" → "Proceed to (unsafe)".
To enable Resource Access Management (RAM) permissions, follow the on-screen instructions.
You can view the project source code under Resource Information > Function Resources.
Function Compute and Alibaba Cloud Model Studio both offer a free quota for new users. This quota can cover the cost of simple testing. After the free quota is exhausted, you are charged on a pay-as-you-go basis. Charges are incurred only when the service is accessed.
Interaction flow
The interaction flow for real-time speech translation follows the standard WebSocket event-driven model, where the server automatically detects the start and end of speech and responds.
Lifecycle | Client event | Server event
Session initialization | session.update (configure the session) | session.created (session created); session.updated (session configuration updated)
User audio input | input_audio_buffer.append (add audio to the buffer); input_image_buffer.append (add an image to the buffer) | None
Server audio output | None | response.created (the server starts generating a response); response.output_item.added (a new output item is added to the response); response.content_part.added (a new content part is added to the assistant message); response.audio_transcript.text (incrementally generated transcript text); response.audio.delta (incrementally generated audio from the model); response.audio_transcript.done (text transcription complete); response.audio.done (audio generation complete); response.content_part.done (streaming of the text or audio content part is complete); response.output_item.done (streaming of the entire output item is complete); response.done (response complete)
API reference
For more information, see Real-time audio and video translation (Qwen-Livetranslate).
Billing
Audio
Each second of input audio consumes 25 tokens. Each second of output audio consumes 12.5 tokens.
Image
Each 28 × 28 pixel block of an input image consumes 0.5 tokens.
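As a rough illustration of this token math (the assumption that partial 28 × 28 blocks round up is ours, not part of the pricing rules above):

import math

def audio_tokens(input_seconds: float, output_seconds: float) -> float:
    # 25 tokens per second of input audio, 12.5 tokens per second of output audio
    return input_seconds * 25 + output_seconds * 12.5

def image_tokens(width: int, height: int) -> float:
    # 0.5 tokens per 28 x 28 pixel block; assumption: partial blocks are rounded up
    blocks = math.ceil(width / 28) * math.ceil(height / 28)
    return blocks * 0.5

print(audio_tokens(60, 60))    # 1 minute in, 1 minute out -> 2250.0 tokens
print(image_tokens(448, 448))  # 16 x 16 = 256 blocks -> 128.0 tokens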
For token pricing, see the Model List.
Supported languages
Language code | Language | Supported output modality |
en | English | Audio + Text |
zh | Chinese | Audio + Text |
ru | Russian | Audio + Text |
fr | French | Audio + Text |
de | German | Audio + Text |
pt | Portuguese | Audio + Text |
es | Spanish | Audio + Text |
it | Italian | Audio + Text |
id | Indonesian | Text |
ko | Korean | Audio + Text |
ja | Japanese | Audio + Text |
vi | Vietnamese | Text |
th | Thai | Text |
ar | Arabic | Text |
yue | Cantonese | Audio + Text |
hi | Hindi | Text |
el | Greek | Text |
tr | Turkish | Text |
Supported voices
Voice name | voice parameter | Description | Supported languages
Qianyue | Cherry | A sunny, positive, and friendly female voice. | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Nofish | Nofish | A designer who cannot pronounce retroflex consonants. | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Shanghai - Jada | Jada | A lively and energetic Shanghainese woman. | Chinese
Beijing - Dylan | Dylan | A young man who grew up in the hutongs of Beijing. | Chinese
Sichuan - Sunny | Sunny | A sweet-voiced Sichuanese girl. | Chinese
Tianjin - Peter | Peter | A Tianjin crosstalk performer, master of the supporting role. | Chinese
Cantonese - Kiki | Kiki | A sweet-voiced best friend from Hong Kong. | Cantonese
Sichuan - Chengchuan | Eric | A man from Chengdu, Sichuan, with a voice that stands out from the crowd. | Chinese