qwen3-livetranslate-flash-realtime is a vision-enhanced, real-time translation model from Qwen. It can simultaneously process streaming audio and image inputs, such as from a video stream. It uses visual context to improve translation accuracy and outputs high-quality translated text and audio in real time.
For an online demo, see One-click deployment using Function Compute.
How to use
1. Configure the connection
The qwen3-livetranslate-flash-realtime model connects using the WebSocket protocol. The connection requires the following configuration items:
Configuration item | Description |
Endpoint | China site: wss://dashscope.aliyuncs.com/api-ws/v1/realtime International site: wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime |
Query parameter | The query parameter is `model`. Set it to the name of the model that you want to access. Example: ?model=qwen3-livetranslate-flash-realtime |
Message header | Use a Bearer token for authentication: Authorization: Bearer DASHSCOPE_API_KEY, where DASHSCOPE_API_KEY is the API key that you obtain from Alibaba Cloud Model Studio. |
Use the following Python sample code to establish a connection.
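A minimal connection sketch using the websocket-client package (installed in the Getting Started section below) is shown here; the on_open and on_message handlers are placeholders, and the international endpoint is used:

import os
import json
import websocket  # provided by the websocket-client package

API_KEY = os.environ["DASHSCOPE_API_KEY"]
URL = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime?model=qwen3-livetranslate-flash-realtime"

def on_open(ws):
    print("[INFO] Connected")

def on_message(ws, message):
    # Every server event arrives as a JSON text frame
    event = json.loads(message)
    print("[EVENT]", event.get("type"))

ws = websocket.WebSocketApp(
    URL,
    header=[f"Authorization: Bearer {API_KEY}"],
    on_open=on_open,
    on_message=on_message,
)
ws.run_forever()  # blocks and dispatches incoming server events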
2. Configure the target language, output modality, and voice
To configure these settings, send the session.update client event. A complete example event follows the list below.

Target translation language
Use the session.translation.language parameter to set the target language. For more information, see Supported languages.

Output modality
Use the session.modalities parameter to set the output modality. You can set it to ["text"] (text-only output) or ["text", "audio"] (text and audio output).

Voice
Use the session.voice parameter to set the voice. For more information, see Supported voices.
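For example, a session.update event that requests Chinese output as both text and audio with the Cherry voice could look like the following sketch. The field nesting mirrors the parameter paths above (session.translation.language, session.modalities, session.voice); check the API reference for the exact schema:

import json

session_update = {
    "type": "session.update",
    "session": {
        "translation": {"language": "zh"},  # target language, see Supported languages
        "modalities": ["text", "audio"],    # use ["text"] for text-only output
        "voice": "Cherry",                  # see Supported voices
    },
}
ws.send(json.dumps(session_update))  # ws is the WebSocket connection from step 1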
3. Input audio and images
The client sends Base64-encoded audio and image data using the input_audio_buffer.append and input_image_buffer.append events. Audio input is required. Image input is optional.
Images can be from local files or captured in real time from a video stream.
The server automatically detects the start and end of the audio and triggers a model response.
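A sketch of the two append events is shown below. The event names are those given above; the payload field names ("audio" and "image") and the helper functions are illustrative assumptions, so verify them against the API reference:

import base64
import json

def send_audio_chunk(ws, pcm_bytes: bytes):
    # Audio is sent as Base64-encoded PCM in input_audio_buffer.append events
    ws.send(json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_bytes).decode("ascii"),
    }))

def send_image_frame(ws, image_bytes: bytes):
    # Optional image frames (for example, frames captured from a video stream)
    # go into input_image_buffer.append events
    ws.send(json.dumps({
        "type": "input_image_buffer.append",
        "image": base64.b64encode(image_bytes).decode("ascii"),
    }))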
4. Receive the model response
When the server detects the end of the audio, the model begins to respond. The response format depends on the configured output modality.
Text-only output
The server returns the complete translated text in a response.text.done event.
Text and audio output
Text: the complete translated text is returned in a response.audio_transcript.done event.
Audio: incremental, Base64-encoded audio data is returned in response.audio.delta events.
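A sketch of how these events might be dispatched on the client is shown below. It assumes event is the parsed JSON of one server message; the payload field names ("text", "transcript", "delta") are assumptions borrowed from common realtime-API conventions, so confirm them in the API reference:

import base64

def handle_event(event: dict, audio_player):
    etype = event.get("type")
    if etype == "response.text.done":
        # Text-only output: the full translation arrives in one event
        print("Translation:", event.get("text"))
    elif etype == "response.audio_transcript.done":
        # Text and audio output: full transcript of the translated audio
        print("Transcript:", event.get("transcript"))
    elif etype == "response.audio.delta":
        # Text and audio output: incremental Base64-encoded audio written to a playback stream
        audio_player.write(base64.b64decode(event.get("delta", "")))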
Supported models
qwen3-livetranslate-flash-realtime is a multilingual, real-time audio and video translation model. It can recognize 18 languages and translate them into audio in 10 languages in real time.
Core features:
Multilingual support: Supports 18 languages, such as Chinese, English, French, German, Russian, Japanese, and Korean, and 6 Chinese dialects, including Mandarin, Cantonese, and Sichuanese.
Vision enhancement: Uses visual content to improve translation accuracy. The model analyzes lip movements, actions, and on-screen text to enhance translation in noisy environments or for words with multiple meanings.
3-second latency: Achieves simultaneous interpretation latency as low as 3 seconds.
Lossless simultaneous interpretation: Uses semantic unit prediction technology to resolve word order issues between languages. The real-time translation quality is close to that of offline translation.
Natural voice: Generates natural, human-like speech. The model automatically adjusts its tone and emotion based on the source audio.
Model name | Version | Context length (tokens) | Max input (tokens) | Max output (tokens)
qwen3-livetranslate-flash-realtime (current capabilities are equivalent to qwen3-livetranslate-flash-realtime-2025-09-22) | Stable | 53,248 | 49,152 | 4,096
qwen3-livetranslate-flash-realtime-2025-09-22 | Snapshot | 53,248 | 49,152 | 4,096
Getting Started
Prepare the environment
Your Python version must be 3.10 or later.
First, install pyaudio.
macOS
brew install portaudio && pip install pyaudio

Debian/Ubuntu
sudo apt-get install python3-pyaudio or pip install pyaudio

CentOS
sudo yum install -y portaudio portaudio-devel && pip install pyaudio

Windows
pip install pyaudio

After the installation is complete, use pip to install the required WebSocket dependencies:
pip install websocket-client==1.8.0 websockets

Create the client
Create a new Python file locally, name it livetranslate_client.py, and copy the following code into the file:
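A minimal sketch of what such a client might contain, written against the events described earlier in this topic, is shown below. The attribute and method names match the main.py examples that follow; the sample rates, chunk size, and response field names are assumptions to check against the API reference, not the official sample code:

import base64
import json

import pyaudio
import websockets

# Default endpoint for the international site; use dashscope.aliyuncs.com for the China site.
DEFAULT_URL = (
    "wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime"
    "?model=qwen3-livetranslate-flash-realtime"
)


class LiveTranslateClient:
    def __init__(self, api_key: str, target_language: str = "en",
                 voice: str = "Cherry", audio_enabled: bool = True, url: str = DEFAULT_URL):
        self.api_key = api_key
        self.target_language = target_language
        self.voice = voice
        self.audio_enabled = audio_enabled
        self.url = url
        self.ws = None
        self.is_connected = False
        # Microphone capture parameters (assumed: 16 kHz, 16-bit mono PCM, 100 ms chunks)
        self.pyaudio_instance = pyaudio.PyAudio()
        self.input_format = pyaudio.paInt16
        self.input_channels = 1
        self.input_rate = 16000
        self.input_chunk = 1600
        self._player = None  # playback stream for translated audio (assumed: 24 kHz PCM)

    async def connect(self):
        # On websockets versions earlier than 14, pass the header with extra_headers instead.
        self.ws = await websockets.connect(
            self.url, additional_headers={"Authorization": f"Bearer {self.api_key}"}
        )
        self.is_connected = True
        # Configure the target language, output modality, and voice with session.update
        await self.ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "translation": {"language": self.target_language},
                "modalities": ["text", "audio"] if self.audio_enabled else ["text"],
                "voice": self.voice,
            },
        }))

    def start_audio_player(self):
        if self.audio_enabled and self._player is None:
            self._player = self.pyaudio_instance.open(
                format=pyaudio.paInt16, channels=1, rate=24000, output=True
            )

    async def send_audio_chunk(self, pcm_bytes: bytes):
        await self.ws.send(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(pcm_bytes).decode("ascii"),
        }))

    async def send_image_frame(self, image_bytes: bytes):
        await self.ws.send(json.dumps({
            "type": "input_image_buffer.append",
            "image": base64.b64encode(image_bytes).decode("ascii"),
        }))

    async def handle_server_messages(self, on_text):
        # Dispatch server events until the connection closes.
        try:
            async for message in self.ws:
                event = json.loads(message)
                etype = event.get("type", "")
                if etype in ("response.text.done", "response.audio_transcript.done"):
                    on_text((event.get("text") or event.get("transcript") or "") + "\n")
                elif etype == "response.audio.delta" and self._player is not None:
                    self._player.write(base64.b64decode(event.get("delta", "")))
        except websockets.ConnectionClosed:
            pass

    async def close(self):
        self.is_connected = False
        if self.ws is not None:
            await self.ws.close()
        if self._player is not None:
            self._player.stop_stream()
            self._player.close()
        self.pyaudio_instance.terminate()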
Interact with the model

In the same folder as livetranslate_client.py, create another Python file, name it main.py, and copy the following code into the file:
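A minimal main.py sketch that works with the client sketch above could look like this; the target language and voice are examples, and the program runs until you stop it with Ctrl+C:

import asyncio
import contextlib
import os

from livetranslate_client import LiveTranslateClient


async def main():
    client = LiveTranslateClient(
        api_key=os.environ["DASHSCOPE_API_KEY"],
        target_language="en",   # example target language
        voice="Cherry",         # example voice
        audio_enabled=True,
    )
    await client.connect()
    client.start_audio_player()
    # Handle server events in a separate task and print translated text as it arrives
    message_task = asyncio.create_task(
        client.handle_server_messages(lambda text: print(text, end="", flush=True))
    )

    # Stream microphone audio until the program is stopped with Ctrl+C
    pa = client.pyaudio_instance
    stream = pa.open(
        format=client.input_format,
        channels=client.input_channels,
        rate=client.input_rate,
        input=True,
        frames_per_buffer=client.input_chunk,
    )
    loop = asyncio.get_running_loop()
    try:
        while client.is_connected:
            data = await loop.run_in_executor(None, stream.read, client.input_chunk)
            await client.send_audio_chunk(data)
    finally:
        stream.stop_stream()
        stream.close()
        await client.close()
        message_task.cancel()
        with contextlib.suppress(asyncio.CancelledError):
            await message_task


if __name__ == "__main__":
    asyncio.run(main())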
Run main.py and speak the sentences you want to translate into the microphone. The model provides the translated audio and text in real time. The system automatically detects your speech and sends the audio to the server, so no manual action is required.
Use images to improve translation accuracy
The qwen3-livetranslate-flash-realtime model can accept image input to assist with audio translation. This is useful for scenarios involving homonyms or recognizing uncommon proper nouns. You can send a maximum of two images per second.
Download the following sample images to your local machine: mask_medical.png and mask_masquerade.png
Save the following code in the same folder as livetranslate_client.py and run it. Say "What is mask?" into the microphone. When you input the medical mask image, the model translates the phrase as “什么是口罩?” When you input the masquerade mask image, the model translates the phrase as “什么是面具?”
import os
import time
import json
import asyncio
import contextlib
import functools

from livetranslate_client import LiveTranslateClient

IMAGE_PATH = "mask_medical.png"
# IMAGE_PATH = "mask_masquerade.png"


def print_banner():
    print("=" * 60)
    print(" Powered by Qwen qwen3-livetranslate-flash-realtime — Single-turn interaction example (mask)")
    print("=" * 60 + "\n")


async def stream_microphone_once(client: LiveTranslateClient, image_bytes: bytes):
    # Open the microphone with the capture parameters defined by the client
    pa = client.pyaudio_instance
    stream = pa.open(
        format=client.input_format,
        channels=client.input_channels,
        rate=client.input_rate,
        input=True,
        frames_per_buffer=client.input_chunk,
    )
    print("[INFO] Recording started. Please speak...")
    loop = asyncio.get_event_loop()
    last_img_time = 0.0
    frame_interval = 0.5  # 2 fps
    try:
        while client.is_connected:
            # Read microphone audio in a worker thread so the event loop is not blocked
            data = await loop.run_in_executor(None, stream.read, client.input_chunk)
            await client.send_audio_chunk(data)
            # Append an image frame every 0.5 seconds
            now = time.time()
            if now - last_img_time >= frame_interval:
                await client.send_image_frame(image_bytes)
                last_img_time = now
    finally:
        stream.stop_stream()
        stream.close()


async def main():
    print_banner()
    api_key = os.environ.get("DASHSCOPE_API_KEY")
    if not api_key:
        print("[ERROR] First, configure the API KEY in the DASHSCOPE_API_KEY environment variable.")
        return

    client = LiveTranslateClient(api_key=api_key, target_language="zh", voice="Cherry", audio_enabled=True)

    def on_text(text: str):
        print(text, end="", flush=True)

    message_task = None  # defined before the try block so the finally clause can check it safely
    try:
        await client.connect()
        client.start_audio_player()
        message_task = asyncio.create_task(client.handle_server_messages(on_text))
        with open(IMAGE_PATH, "rb") as f:
            img_bytes = f.read()
        await stream_microphone_once(client, img_bytes)
        await asyncio.sleep(15)
    finally:
        await client.close()
        if message_task is not None and not message_task.done():
            message_task.cancel()
            with contextlib.suppress(asyncio.CancelledError):
                await message_task


if __name__ == "__main__":
    asyncio.run(main())

One-click deployment using Function Compute
The console does not currently support this demo. You can deploy it with one click as follows:
Open the Function Compute template, enter your API key, and click Create And Deploy Default Environment to try it online.
Wait for about one minute. In Environment Details > Environment Context, retrieve the endpoint. Change http to https in the endpoint (for example, https://qwen-livetranslate-flash-realtime-intl.fcv3.xxx.ap-southeast-1.fc.devsapp.net/) and use the link to interact with the model.

Important: This link uses a self-signed certificate and is for temporary testing only. When you first access it, your browser will display a security warning. This is expected behavior. Do not use this link in a production environment. To proceed, follow your browser's instructions, such as clicking "Advanced" → "Proceed to (unsafe)".
To enable Resource Access Management (RAM) permissions, follow the on-screen instructions.
You can view the project source code under Resource Information > Function Resources.
Function Compute and Alibaba Cloud Model Studio both offer a free quota for new users. This quota can cover the cost of simple testing. After the free quota is exhausted, you are charged on a pay-as-you-go basis. Charges are incurred only when the service is accessed.
Interaction flow
The interaction flow for real-time speech translation follows the standard WebSocket event-driven model, where the server automatically detects the start and end of speech and responds.
Lifecycle | Client event | Server event
Session initialization | session.update (configure the session) | session.created (session created); session.updated (session configuration updated)
User audio input | input_audio_buffer.append (add audio to the buffer); input_image_buffer.append (add an image to the buffer) | None
Server audio output | None | response.created (the server starts generating a response); response.output_item.added (a new output item is added to the response); response.content_part.added (a new content part is added to the assistant message); response.audio_transcript.text (incrementally generated transcript text); response.audio.delta (incrementally generated audio from the model); response.audio_transcript.done (text transcription complete); response.audio.done (audio generation complete); response.content_part.done (streaming of the text or audio content part is complete); response.output_item.done (streaming of the entire output item is complete); response.done (response complete)
API reference
For more information, see Real-time audio and video translation (Qwen-Livetranslate).
Billing
Audio
Each second of input audio consumes 25 tokens. Each second of output audio consumes 12.5 tokens.
Image
Each 28 × 28 pixel block of an input image consumes 0.5 tokens.
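As a rough illustration of this token math (the assumption that partial 28 × 28 blocks round up is ours, not part of the pricing rules above):

import math

def audio_tokens(input_seconds: float, output_seconds: float) -> float:
    # 25 tokens per second of input audio, 12.5 tokens per second of output audio
    return input_seconds * 25 + output_seconds * 12.5

def image_tokens(width: int, height: int) -> float:
    # 0.5 tokens per 28 x 28 pixel block; assumption: partial blocks are rounded up
    blocks = math.ceil(width / 28) * math.ceil(height / 28)
    return blocks * 0.5

print(audio_tokens(60, 60))    # 1 minute in, 1 minute out -> 2250.0 tokens
print(image_tokens(448, 448))  # 16 x 16 = 256 blocks -> 128.0 tokens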
For token pricing, see the Model List.
Supported languages
Language code | Language | Supported output modality |
en | English | Audio + Text |
zh | Chinese | Audio + Text |
ru | Russian | Audio + Text |
fr | French | Audio + Text |
de | German | Audio + Text |
pt | Portuguese | Audio + Text |
es | Spanish | Audio + Text |
it | Italian | Audio + Text |
id | Indonesian | Text |
ko | Korean | Audio + Text |
ja | Japanese | Audio + Text |
vi | Vietnamese | Text |
th | Thai | Text |
ar | Arabic | Text |
yue | Cantonese | Audio + Text |
hi | Hindi | Text |
el | Greek | Text |
tr | Turkish | Text |
Supported voices
Voice name | voice parameter | Description | Supported languages
Qianyue | Cherry | A sunny, positive, and friendly female voice. | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Nofish | Nofish | A designer who cannot pronounce retroflex consonants. | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Shanghai - Jada | Jada | A lively and energetic Shanghainese woman. | Chinese
Beijing - Dylan | Dylan | A young man who grew up in the hutongs of Beijing. | Chinese
Sichuan - Sunny | Sunny | A sweet-voiced Sichuanese girl. | Chinese
Tianjin - Peter | Peter | A Tianjin crosstalk performer, master of the supporting role. | Chinese
Cantonese - Kiki | Kiki | A sweet-voiced best friend from Hong Kong. | Cantonese
Sichuan - Chengchuan | Eric | A man from Chengdu, Sichuan, with a voice that stands out from the crowd. | Chinese