qwen3.5-livetranslate-flash-realtime is a vision-enhanced real-time translation model that translates between 60 languages (29 with audio + text output and 31 with text-only output). It processes both audio and image input from real-time video streams or local video files, leverages visual context to improve translation accuracy, and outputs translated text and audio in real time.
Try an online demo with one-click deployment using Function Compute.
Features
Multi-language support: Translates between 60 languages — 29 with audio and text output, 31 with text-only output — including Chinese, English, French, German, Russian, Japanese, Korean, Spanish, Portuguese, and Arabic.
Visual enhancement: Analyzes visual cues, such as lip movements, gestures, and on-screen text, to improve translation accuracy, especially in noisy environments or for ambiguous words.
3-second latency: Delivers simultaneous interpretation with latency as low as 3 seconds.
Lossless simultaneous interpretation: Uses semantic unit prediction to resolve word order differences between languages, delivering real-time translation quality comparable to offline translation.
Natural voice: Generates a natural voice by automatically matching the intonation and emotion of the source audio.
Hotword configuration: Lets you configure hotwords to improve translation accuracy for specific terms.
Voice cloning: Clones the speaker's voice for translated output, so the translation sounds like the speaker speaking in another language. Supports both server-side real-time cloning and pre-cloned voice profiles.
Procedure
1. Configure the connection
The qwen3.5-livetranslate-flash-realtime model uses the WebSocket protocol. The connection requires the following parameters:
| Parameter | Description |
| --- | --- |
| endpoint | Chinese mainland: wss://dashscope.aliyuncs.com/api-ws/v1/realtime. International: wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime |
| query parameter | The model query parameter must be set to the model name. Example: ?model=qwen3.5-livetranslate-flash-realtime |
| message header | Use a Bearer token for authentication: Authorization: Bearer DASHSCOPE_API_KEY, where DASHSCOPE_API_KEY is your API key from Model Studio. |
Sample Python code for establishing a connection:
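A minimal sketch, assuming only the parameters listed above: it assembles the WebSocket URL and the authentication header without opening a connection. The helper name `build_connection` and the `region` keys are hypothetical, and the commented call at the end assumes the `websockets` package, where the header keyword is `additional_headers` in recent releases and `extra_headers` in older ones.

```python
import os
from urllib.parse import urlencode

# Endpoints from the connection parameters above.
ENDPOINTS = {
    "cn": "wss://dashscope.aliyuncs.com/api-ws/v1/realtime",
    "intl": "wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime",
}


def build_connection(api_key: str,
                     model: str = "qwen3.5-livetranslate-flash-realtime",
                     region: str = "intl"):
    """Return (url, headers) for the realtime WebSocket connection.

    `build_connection` and the "cn"/"intl" keys are illustrative names,
    not part of the service API.
    """
    url = f"{ENDPOINTS[region]}?{urlencode({'model': model})}"
    headers = {"Authorization": f"Bearer {api_key}"}
    return url, headers


if __name__ == "__main__":
    url, headers = build_connection(os.environ.get("DASHSCOPE_API_KEY", "sk-placeholder"))
    print(url)
    # Assumption: with the `websockets` package the connection could then be
    # opened roughly as
    #   async with websockets.connect(url, additional_headers=headers) as ws: ...
```

Keeping URL construction separate from the network call makes it possible to check the query parameter and header before any traffic is sent.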
2. Configure language, modality, and voice
Send the session.update client event with the following parameters:
- Language
  - Source language: Configure using the session.input_audio_transcription.language parameter. The default value is en (English).
  - Target language: Configure using the session.translation.language parameter. The default value is en (English).
  See Supported languages.
- Output source language recognition results
  To enable this feature, set the session.input_audio_transcription.model parameter. When it is set to qwen3-asr-flash-realtime, the server returns both the translation and the speech recognition result (original text) for the input audio.
  When this feature is enabled, the server returns the following events:
  - conversation.item.input_audio_transcription.text: Streams the recognition results.
  - conversation.item.input_audio_transcription.completed: Returns the final result after recognition is complete.
- Output modality
  Set the session.modalities parameter to ["text"] (text only) or ["text","audio"] (text and audio).
- Voice
  Configure using the session.voice parameter. See Supported voices.
- Hotword
  Configure hotwords using the session.translation.corpus.phrases parameter. Hotwords are key-value pairs that map source terms to target translations, improving accuracy for specific terms.
  Example: Map "artificial intelligence" to "Artificial Intelligence".
- Voice cloning
  Configure using the session.enable_voice_clone, session.voice_clone_options.frequency, and session.voice parameters. Three modes are supported: a pre-cloned voice profile (frequency: never), a server-side clone made once at session start (once), or a real-time clone before each response (always). See Voice cloning.
3. Input audio and images
Send Base64-encoded audio and image data using the input_audio_buffer.append and input_image_buffer.append events. Audio input is required; image input is optional.
Images can be from a local file or captured in real time from a video stream.
The server automatically detects speech boundaries and triggers the model response.
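As an illustration of the Base64 step, the sketch below wraps raw bytes into the two append events. The event type names come from the text above; the payload field names `audio` and `image` are assumptions not stated in this document, and both helper names are hypothetical.

```python
import base64
import json


def audio_append_event(pcm_chunk: bytes) -> str:
    """Serialize one audio chunk as an input_audio_buffer.append event.

    Assumption: the Base64 payload is carried in a field named "audio".
    """
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_chunk).decode("ascii"),
    })


def image_append_event(image_bytes: bytes) -> str:
    """Serialize one image frame as an input_image_buffer.append event.

    Assumption: the Base64 payload is carried in a field named "image".
    """
    return json.dumps({
        "type": "input_image_buffer.append",
        "image": base64.b64encode(image_bytes).decode("ascii"),
    })


if __name__ == "__main__":
    # Two bytes of silence stand in for a real microphone chunk.
    print(audio_append_event(b"\x00\x00"))
```

Each returned string is one WebSocket text message. Audio events are sent continuously, while image events are optional.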
4. Receive the model response
When the server detects the end of audio input, the model responds. The response format depends on the configured output modality.
- Text-only output
  The server returns the complete translated text in a response.text.done event.
- Text and audio output
  - Text: The server returns the complete translated text in a response.audio_transcript.done event.
  - Audio: The server returns incremental, Base64-encoded audio data in response.audio.delta events.
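A client typically dispatches on the event type to collect these results. The sketch below shows one way to do that; `ResponseCollector` is a hypothetical class, and the payload field names `text`, `transcript`, and `delta` are assumptions that should be checked against the API reference.

```python
import base64
import json


class ResponseCollector:
    """Collect translated text and audio from server events (illustrative only)."""

    def __init__(self) -> None:
        self.text: str | None = None
        self.audio = bytearray()

    def handle(self, raw_message: str) -> None:
        event = json.loads(raw_message)
        kind = event.get("type")
        if kind == "response.text.done":
            # Text-only modality. Assumed field name: "text".
            self.text = event.get("text")
        elif kind == "response.audio_transcript.done":
            # Text + audio modality. Assumed field name: "transcript".
            self.text = event.get("transcript")
        elif kind == "response.audio.delta":
            # Incremental Base64 audio. Assumed field name: "delta".
            self.audio.extend(base64.b64decode(event["delta"]))
        # Other event types are ignored in this sketch.


if __name__ == "__main__":
    collector = ResponseCollector()
    collector.handle(json.dumps({"type": "response.audio_transcript.done",
                                 "transcript": "Hello"}))
    print(collector.text)
```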
Supported models
| Model | Version | Context window (tokens) | Max input (tokens) | Max output (tokens) |
| --- | --- | --- | --- | --- |
| qwen3.5-livetranslate-flash-realtime (alias for qwen3.5-livetranslate-flash-realtime-2026-05-19) | Stable | 53,248 | 49,152 | 4,096 |
| qwen3.5-livetranslate-flash-realtime-2026-05-19 | Snapshot | 53,248 | 49,152 | 4,096 |
| qwen3-livetranslate-flash-realtime (alias for qwen3-livetranslate-flash-realtime-2025-09-22) | Stable | 53,248 | 49,152 | 4,096 |
| qwen3-livetranslate-flash-realtime-2025-09-22 | Snapshot | 53,248 | 49,152 | 4,096 |
Getting started
- Prepare the environment
  Requires Python 3.10 or later.
  First, install pyaudio.
  macOS:
      brew install portaudio && pip install pyaudio
  Debian/Ubuntu:
      sudo apt-get install python3-pyaudio
      or
      pip install pyaudio
  CentOS:
      sudo yum install -y portaudio portaudio-devel && pip install pyaudio
  Windows:
      pip install pyaudio
  Then install the WebSocket dependencies:
      pip install websocket-client==1.8.0 websockets
- Create the client
  Create a file named livetranslate_client.py with the following code:
- Interact with the model
  In the same directory, create a file named main.py with the following code:
  Run main.py and speak into your microphone. The model outputs translated audio and text in real time. The system automatically detects speech and sends it to the server.
Voice cloning
The model clones the speaker's voice from the input audio and uses the cloned voice for translated output, so the translation sounds like the speaker delivering it in another language. Use a pre-cloned voice profile, or let the server clone the voice in real time. This is useful in scenarios where preserving the speaker's voice matters, such as conference interpreting, live streaming, and video dubbing.
Set the following parameters in session.update to enable voice cloning:
- session.enable_voice_clone: Set to true to enable voice cloning.
- session.voice_clone_options.frequency: Controls when voice cloning occurs. Accepted values:
  - never: Does not clone on the server. Uses a pre-cloned voice profile instead. Set session.voice to your custom cloned voice ID.
  - once: Clones the voice from the input audio once at session start, then reuses it for all subsequent output. Best for single-speaker scenarios. Set session.voice to default.
  - always: Clones the voice before each response, dynamically adapting to speaker changes. Best for multi-speaker conversations. Set session.voice to default.
- session.voice: Specifies the output voice. The value depends on the frequency setting:
  - default: Use with frequency set to once or always. The server clones the speaker's voice from the input audio. A default voice is used until cloning completes.
  - A custom cloned voice ID (for example, qwen-translate-vc-xxx-yyy-zzz): Use with frequency set to never. You must prepare the voice in advance using the Voice Cloning API with targetModel set to qwen3.5-livetranslate-flash-realtime.
When frequency is set to once or always, the voice parameter must be set to default. Any other value causes the server to return an error.
Voice cloning configuration examples
Pre-cloned voice profile (consistent quality; recommended when a stable voice identity is required):
{
  "type": "session.update",
  "session": {
    "modalities": ["text", "audio"],
    "voice": "qwen-translate-vc-xxx-yyy-zzz",
    "translation": {
      "language": "en"
    },
    "enable_voice_clone": true,
    "voice_clone_options": {
      "frequency": "never"
    }
  }
}
Server-side cloning, once per session (best for single-speaker scenarios):
{
  "type": "session.update",
  "session": {
    "modalities": ["text", "audio"],
    "voice": "default",
    "translation": {
      "language": "en"
    },
    "enable_voice_clone": true,
    "voice_clone_options": {
      "frequency": "once"
    }
  }
}
Server-side cloning, every response (best for multi-speaker conversations):
{
  "type": "session.update",
  "session": {
    "modalities": ["text", "audio"],
    "voice": "default",
    "translation": {
      "language": "en"
    },
    "enable_voice_clone": true,
    "voice_clone_options": {
      "frequency": "always"
    }
  }
}
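Because the server rejects a non-default voice when frequency is once or always, it can help to check the combination on the client before sending. The following sketch builds the same payloads as the examples above; `voice_clone_session` is a hypothetical helper, and rejecting default together with never is an extra client-side check inferred from the parameter descriptions, not a documented server rule.

```python
VALID_FREQUENCIES = ("never", "once", "always")


def voice_clone_session(frequency: str,
                        voice: str = "default",
                        target_language: str = "en") -> dict:
    """Build a voice-cloning session.update event after validating voice/frequency."""
    if frequency not in VALID_FREQUENCIES:
        raise ValueError(f"frequency must be one of {VALID_FREQUENCIES}")
    if frequency in ("once", "always") and voice != "default":
        # Documented rule: any other value makes the server return an error.
        raise ValueError("voice must be 'default' when frequency is 'once' or 'always'")
    if frequency == "never" and voice == "default":
        # Client-side assumption: 'never' needs a pre-cloned voice ID.
        raise ValueError("frequency 'never' requires a custom cloned voice ID")
    return {
        "type": "session.update",
        "session": {
            "modalities": ["text", "audio"],
            "voice": voice,
            "translation": {"language": target_language},
            "enable_voice_clone": True,
            "voice_clone_options": {"frequency": frequency},
        },
    }


if __name__ == "__main__":
    print(voice_clone_session("once"))
```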
Improve translation with images
The qwen3.5-livetranslate-flash-realtime model uses image input to improve audio translation, helping disambiguate homonyms and recognize uncommon proper nouns. Send no more than 2 images per second.
Download the following sample images: medical mask.png, masquerade mask.png
Save the following code in the same directory as livetranslate_client.py and run it. Say "What is mask?" into your microphone. The model uses the provided image to disambiguate the word "mask." For example, with medical mask.png the phrase is translated as "What is a medical mask?", while with masquerade mask.png it is translated as "What is a masquerade mask?".
import os
import time
import asyncio
import contextlib

from livetranslate_client import LiveTranslateClient

IMAGE_PATH = "medical mask.png"
# IMAGE_PATH = "masquerade mask.png"


def print_banner():
    print("=" * 60)
    print(" Powered by Qwen qwen3.5-livetranslate-flash-realtime - single-turn interaction example (mask)")
    print("=" * 60 + "\n")


async def stream_microphone_once(client: LiveTranslateClient, image_bytes: bytes):
    """Stream microphone audio and resend the same image at up to 2 frames per second."""
    pa = client.pyaudio_instance
    stream = pa.open(
        format=client.input_format,
        channels=client.input_channels,
        rate=client.input_rate,
        input=True,
        frames_per_buffer=client.input_chunk,
    )
    print("[INFO] Recording started. Please speak...")
    loop = asyncio.get_running_loop()
    last_img_time = 0.0
    frame_interval = 0.5  # 2 fps
    try:
        while client.is_connected:
            # Read in a worker thread so the blocking PyAudio call does not stall the event loop.
            data = await loop.run_in_executor(None, stream.read, client.input_chunk)
            await client.send_audio_chunk(data)
            # Append an image frame every 0.5 seconds
            now = time.time()
            if now - last_img_time >= frame_interval:
                await client.send_image_frame(image_bytes)
                last_img_time = now
    finally:
        stream.stop_stream()
        stream.close()


async def main():
    print_banner()
    api_key = os.environ.get("DASHSCOPE_API_KEY")
    if not api_key:
        print("[ERROR] Please set the DASHSCOPE_API_KEY environment variable.")
        return
    client = LiveTranslateClient(api_key=api_key, target_language="zh", audio_enabled=True)

    def on_text(text: str):
        print(text, end="", flush=True)

    # Defined before the try block so the cleanup code can run even if connect() fails.
    message_task = None
    try:
        await client.connect()
        client.start_audio_player()
        message_task = asyncio.create_task(client.handle_server_messages(on_text))
        with open(IMAGE_PATH, "rb") as f:
            img_bytes = f.read()
        await stream_microphone_once(client, img_bytes)
        # Keep the connection open for 15 seconds before closing.
        await asyncio.sleep(15)
    finally:
        await client.close()
        if message_task is not None and not message_task.done():
            message_task.cancel()
            with contextlib.suppress(asyncio.CancelledError):
                await message_task


if __name__ == "__main__":
    asyncio.run(main())
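The limit of 2 images per second can also be enforced by a small reusable helper instead of inline timestamp arithmetic. This is a sketch under that single assumption; `FrameThrottle` is a hypothetical name, and the injectable `clock` parameter exists only to make the behavior easy to test.

```python
import time
from typing import Callable


class FrameThrottle:
    """Allow at most `max_per_second` frames, spaced by a minimum interval."""

    def __init__(self, max_per_second: float = 2.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._last_sent: float | None = None

    def ready(self) -> bool:
        """Return True (and record the send time) if a frame may be sent now."""
        now = self._clock()
        if self._last_sent is None or now - self._last_sent >= self.min_interval:
            self._last_sent = now
            return True
        return False


if __name__ == "__main__":
    throttle = FrameThrottle()
    sent = 0
    deadline = time.monotonic() + 1.2
    while time.monotonic() < deadline:
        if throttle.ready():
            sent += 1  # a real client would send the image frame here
        time.sleep(0.05)
    print(f"frames sent in about 1.2 s: {sent}")
```

The sketch uses time.monotonic rather than time.time so that system clock adjustments cannot shorten or lengthen the interval between frames.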
One-click Function Compute deployment
To deploy the application:
- Open the Function Compute template, enter your API key, and click Create and Deploy Default Environment to test the application.
- Wait for about a minute. In Environment Details > Environment Context, retrieve the endpoint, change the protocol from http to https (for example, https://qwen-livetranslate-flash-realtime-intl.fcv3.xxx.ap-southeast-1.fc.devsapp.net/), and open the URL in a browser to interact with the model.
  Important: This endpoint uses a self-signed certificate and is for temporary testing only. Do not use this endpoint in a production environment. Your browser will display a security warning on your first visit. This is expected behavior. To proceed, follow the on-screen instructions (for example, click Advanced → Proceed to (unsafe)).
If you are prompted to configure Resource Access Management permissions, follow the on-screen instructions.
To view the project source code, go to Resource Information > Function Resources.
Both Function Compute and Model Studio provide a free quota for new users, sufficient for basic debugging. After the free quota is used up, pay-as-you-go billing applies.
Interaction flow
Real-time speech translation follows an event-driven WebSocket model. The server automatically detects speech boundaries and responds.
| Lifecycle | Client event | Server event |
| --- | --- | --- |
| Session initialization | session.update: Session configuration | session.created: Session created |
| | | session.updated: Session configuration updated |
| User audio input | input_audio_buffer.append: Append audio to the buffer | None |
| Server audio output | None | response.created: Signals that the server starts generating a response. |
| | | response.output_item.added: Signals that a new output item is available. |
| | | response.content_part.added: Signals that a new content part has been added to the assistant message. |
| | | response.audio_transcript.text: Contains an incremental update to the text transcript. |
| | | response.audio.delta: Contains an incremental chunk of the synthesized audio. |
| | | response.audio_transcript.done: Signals that the full text transcript is complete. |
| | | response.audio.done: Signals that the synthesized audio is complete. |
| | | response.content_part.done: Signals that a text or audio content part for the assistant message is complete. |
| | | response.output_item.done: Signals that the entire output item for the assistant message is complete. |
| | | response.done: Signals that the entire response is complete. |
API
Billing
Audio: Each second of audio input or output consumes 12.5 tokens.
Image: Every 28×28 pixels consumes 0.5 tokens.
Text: When source language speech recognition is enabled, the service returns a transcript of the input audio in addition to the translation. This transcript is billed as output text tokens.
For pricing, see Model list.
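As a worked example of the token rates above, the sketch below estimates token consumption for a stretch of audio and for one image. It applies the stated rates linearly; whether the service rounds image dimensions or token counts is not stated here, so the results are rough estimates, and both function names are hypothetical.

```python
AUDIO_TOKENS_PER_SECOND = 12.5   # per second of audio input or output
IMAGE_TOKENS_PER_PATCH = 0.5     # per 28x28-pixel patch
PATCH_PIXELS = 28 * 28


def audio_tokens(seconds: float) -> float:
    """Estimated tokens for `seconds` of audio."""
    return seconds * AUDIO_TOKENS_PER_SECOND


def image_tokens(width: int, height: int) -> float:
    """Estimated tokens for one image, with no rounding applied (assumption)."""
    return (width * height) / PATCH_PIXELS * IMAGE_TOKENS_PER_PATCH


if __name__ == "__main__":
    # One minute of input audio plus one 280x280 image.
    print(audio_tokens(60))        # 60 * 12.5 = 750.0
    print(image_tokens(280, 280))  # 100 patches * 0.5 = 50.0
```

At the maximum rate of 2 images per second, image tokens can exceed audio tokens quickly, so smaller frames or a lower frame rate reduce cost.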
Supported languages
Use the following language codes to specify the source and target languages.
Some target languages only support text. The legacy model qwen3-livetranslate-flash-realtime supports only the following 18 languages: en, zh, ru, fr, de, pt, es, it, id, ko, ja, vi, th, ar, yue, hi, el, tr.
| Language code | Language | Output |
| --- | --- | --- |
| zh | Chinese | Audio + text |
| en | English | Audio + text |
| ar | Arabic | Audio + text |
| de | German | Audio + text |
| fr | French | Audio + text |
| es | Spanish | Audio + text |
| pt | Portuguese | Audio + text |
| id | Indonesian | Audio + text |
| it | Italian | Audio + text |
| ko | Korean | Audio + text |
| ru | Russian | Audio + text |
| th | Thai | Audio + text |
| vi | Vietnamese | Audio + text |
| ja | Japanese | Audio + text |
| tr | Turkish | Audio + text |
| hi | Hindi | Audio + text |
| ms | Malay | Audio + text |
| nl | Dutch | Audio + text |
| ur | Urdu | Audio + text |
| nb | Norwegian Bokmål | Audio + text |
| sv | Swedish | Audio + text |
| da | Danish | Audio + text |
| he | Hebrew | Audio + text |
| fi | Finnish | Audio + text |
| pl | Polish | Audio + text |
| is | Icelandic | Audio + text |
| cs | Czech | Audio + text |
| fil | Filipino | Audio + text |
| fa | Persian | Audio + text |
| yue | Cantonese | Text |
| el | Greek | Text |
| af | Afrikaans | Text |
| ast | Asturian | Text |
| be | Belarusian | Text |
| bg | Bulgarian | Text |
| bn | Bengali | Text |
| bs | Bosnian | Text |
| ca | Catalan | Text |
| ceb | Cebuano | Text |
| et | Estonian | Text |
| gl | Galician | Text |
| gu | Gujarati | Text |
| hr | Croatian | Text |
| hu | Hungarian | Text |
| jv | Javanese | Text |
| kk | Kazakh | Text |
| kn | Kannada | Text |
| ky | Kyrgyz | Text |
| lv | Latvian | Text |
| mk | Macedonian | Text |
| ml | Malayalam | Text |
| mr | Marathi | Text |
| pa | Punjabi | Text |
| ro | Romanian | Text |
| sk | Slovak | Text |
| sl | Slovenian | Text |
| sw | Swahili | Text |
| tg | Tajik | Text |
| az | Azerbaijani | Text |
| uk | Ukrainian | Text |
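Since some target languages support text output only, a client may want to pick session.modalities from the target language code. The sketch below encodes the Output column of the language list as two sets; `modalities_for` is a hypothetical helper, and falling back to text-only output when audio is unavailable is a design choice of this sketch, not documented server behavior.

```python
# Language codes with "Audio + text" output in the supported-languages list.
AUDIO_TEXT_LANGUAGES = {
    "zh", "en", "ar", "de", "fr", "es", "pt", "id", "it", "ko",
    "ru", "th", "vi", "ja", "tr", "hi", "ms", "nl", "ur", "nb",
    "sv", "da", "he", "fi", "pl", "is", "cs", "fil", "fa",
}

# Language codes with "Text" output only.
TEXT_ONLY_LANGUAGES = {
    "yue", "el", "af", "ast", "be", "bg", "bn", "bs", "ca", "ceb",
    "et", "gl", "gu", "hr", "hu", "jv", "kk", "kn", "ky", "lv",
    "mk", "ml", "mr", "pa", "ro", "sk", "sl", "sw", "tg", "az", "uk",
}


def modalities_for(target_language: str, want_audio: bool = True) -> list[str]:
    """Return a session.modalities value that the target language supports."""
    if target_language in AUDIO_TEXT_LANGUAGES:
        return ["text", "audio"] if want_audio else ["text"]
    if target_language in TEXT_ONLY_LANGUAGES:
        # Audio output is unavailable, so fall back to text only.
        return ["text"]
    raise ValueError(f"unsupported language code: {target_language!r}")


if __name__ == "__main__":
    print(modalities_for("ja"))
    print(modalities_for("yue"))
```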
Supported voices
For supported voices and the corresponding voice parameter values, see Voice list.