Alibaba Cloud Model Studio: Real-time audio and video translation - Qwen

Last Updated: Nov 15, 2025

qwen3-livetranslate-flash-realtime is a vision-enhanced, real-time translation model from Qwen. It can simultaneously process streaming audio and image inputs, such as from a video stream. It uses visual context to improve translation accuracy and outputs high-quality translated text and audio in real time.

For an online demo, see One-click deployment using Function Compute.

How to use

1. Configure the connection

The qwen3-livetranslate-flash-realtime model connects using the WebSocket protocol. The connection requires the following configuration items:

  • Endpoint

    China site: wss://dashscope.aliyuncs.com/api-ws/v1/realtime

    International site: wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime

  • Query parameter

    The query parameter is `model`. Set it to the name of the model that you want to access. Example: ?model=qwen3-livetranslate-flash-realtime

  • Message header

    Use Bearer Token authentication: Authorization: Bearer DASHSCOPE_API_KEY

    DASHSCOPE_API_KEY is the API key that you obtain from Alibaba Cloud Model Studio.

Use the following Python sample code to establish a connection.

WebSocket connection Python sample code

# pip install websocket-client
import json
import websocket
import os

API_KEY = os.getenv("DASHSCOPE_API_KEY")
API_URL = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime?model=qwen3-livetranslate-flash-realtime"

headers = [
    "Authorization: Bearer " + API_KEY
]

def on_open(ws):
    print(f"Connected to server: {API_URL}")
def on_message(ws, message):
    data = json.loads(message)
    print("Received event:", json.dumps(data, indent=2))
def on_error(ws, error):
    print("Error:", error)

ws = websocket.WebSocketApp(
    API_URL,
    header=headers,
    on_open=on_open,
    on_message=on_message,
    on_error=on_error
)

ws.run_forever()

2. Configure the target language, output modality, and voice

To configure these settings, send the session.update client event:

  • Target translation language

    Use the session.translation.language parameter to set the target language. For more information, see Supported languages.

  • Output modality

    Use the session.modalities parameter to set the output modality. You can set it to ["text"] (text-only output) or ["text","audio"] (text and audio output).

  • Voice

    Use the session.voice parameter to set the voice. For more information, see Supported voices.
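
For illustration, the following is a minimal sketch of a session.update event. It assumes an already established WebSocket connection `ws` (as in the sample above) and uses Chinese ("zh") and the Cherry voice as example values.

import json
import time

# Example sketch: send the session configuration over an open connection `ws`
session_update = {
    "event_id": f"event_{int(time.time() * 1000)}",
    "type": "session.update",
    "session": {
        "modalities": ["text", "audio"],      # or ["text"] for text-only output
        "voice": "Cherry",                    # only used when audio output is enabled
        "input_audio_format": "pcm16",
        "output_audio_format": "pcm16",
        "translation": {
            "language": "zh"                  # target language code, see Supported languages
        }
    }
}
ws.send(json.dumps(session_update))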

3. Input audio and images

The client sends Base64-encoded audio and image data using the input_audio_buffer.append and input_image_buffer.append events. Audio input is required. Image input is optional.

Images can be from local files or captured in real time from a video stream.
The server automatically detects the start and end of the audio and triggers a model response.
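
As a minimal sketch of these two events (assuming an open connection `ws`, a chunk of raw 16-bit PCM audio in `pcm_chunk`, and a hypothetical image file named frame.png):

import base64
import json
import time

# Append a Base64-encoded audio chunk (required)
ws.send(json.dumps({
    "event_id": f"event_{int(time.time() * 1000)}",
    "type": "input_audio_buffer.append",
    "audio": base64.b64encode(pcm_chunk).decode()  # pcm_chunk: raw PCM bytes from the microphone
}))

# Append a Base64-encoded image frame (optional, at most two images per second)
with open("frame.png", "rb") as f:  # hypothetical file; could also be a frame captured from a video stream
    image_bytes = f.read()
ws.send(json.dumps({
    "event_id": f"event_{int(time.time() * 1000)}",
    "type": "input_image_buffer.append",
    "image": base64.b64encode(image_bytes).decode()
}))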

4. Receive the model response

When the server detects the end of the audio, the model begins to respond. The response format depends on the configured output modality.
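
For reference, a minimal on_message handler for the websocket-client sample above might distinguish the two modalities as follows. The event names match those listed in the Interaction flow section below.

import base64
import json

def on_message(ws, message):
    event = json.loads(message)
    event_type = event.get("type")
    if event_type == "response.audio.delta":
        # Audio output (only when modalities include "audio"): Base64-encoded PCM chunk
        pcm_bytes = base64.b64decode(event.get("delta", ""))
        # ... play or buffer pcm_bytes ...
    elif event_type == "response.audio_transcript.done":
        # Transcript of the translated audio (text + audio mode)
        print("Translated text:", event.get("transcript", ""))
    elif event_type == "response.text.done":
        # Translated text (text-only mode)
        print("Translated text:", event.get("text", ""))
    elif event_type == "response.done":
        print("Response complete.")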

Supported models

qwen3-livetranslate-flash-realtime is a multilingual, real-time audio and video translation model. It can recognize 18 languages and translate them into audio in 10 languages in real time.

Core features:

  • Multilingual support: Supports 18 languages, such as Chinese, English, French, German, Russian, Japanese, and Korean, and 6 Chinese dialects, including Mandarin, Cantonese, and Sichuanese.

  • Vision enhancement: Uses visual content to improve translation accuracy. The model analyzes lip movements, actions, and on-screen text to enhance translation in noisy environments or for words with multiple meanings.

  • 3-second latency: Achieves simultaneous interpretation latency as low as 3 seconds.

  • Lossless simultaneous interpretation: Uses semantic unit prediction technology to resolve word order issues between languages. The real-time translation quality is close to that of offline translation.

  • Natural voice: Generates natural, human-like speech. The model automatically adjusts its tone and emotion based on the source audio.

| Model name | Version | Context length (tokens) | Max input (tokens) | Max output (tokens) |
| --- | --- | --- | --- | --- |
| qwen3-livetranslate-flash-realtime (current capabilities are equivalent to qwen3-livetranslate-flash-realtime-2025-09-22) | Stable | 53,248 | 49,152 | 4,096 |
| qwen3-livetranslate-flash-realtime-2025-09-22 | Snapshot | 53,248 | 49,152 | 4,096 |

Getting Started

  1. Prepare the environment

    Your Python version must be 3.10 or later.

    First, install pyaudio.

    macOS

    brew install portaudio && pip install pyaudio

    Debian/Ubuntu

    sudo apt-get install python3-pyaudio
    
    or
    
    pip install pyaudio

    CentOS

    sudo yum install -y portaudio portaudio-devel && pip install pyaudio

    Windows

    pip install pyaudio

    After the installation is complete, use pip to install the required WebSocket dependencies:

    pip install websocket-client==1.8.0 websockets
  2. Create the client

    Create a new Python file locally, name it livetranslate_client.py, and copy the following code into the file:

    Client code - livetranslate_client.py

    import os
    import time
    import base64
    import asyncio
    import json
    import websockets
    import pyaudio
    import queue
    import threading
    import traceback
    
    class LiveTranslateClient:
        def __init__(self, api_key: str, target_language: str = "en", voice: str | None = "Cherry", *, audio_enabled: bool = True):
            if not api_key:
                raise ValueError("API key cannot be empty.")
                
            self.api_key = api_key
            self.target_language = target_language
            self.audio_enabled = audio_enabled
            self.voice = voice if audio_enabled else "Cherry"
            self.ws = None
            self.api_url = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime?model=qwen3-livetranslate-flash-realtime"
            
            # Audio input configuration (from microphone)
            self.input_rate = 16000
            self.input_chunk = 1600
            self.input_format = pyaudio.paInt16
            self.input_channels = 1
            
            # Audio output configuration (for playback)
            self.output_rate = 24000
            self.output_chunk = 2400
            self.output_format = pyaudio.paInt16
            self.output_channels = 1
            
            # State management
            self.is_connected = False
            self.audio_player_thread = None
            self.audio_playback_queue = queue.Queue()
            self.pyaudio_instance = pyaudio.PyAudio()
    
        async def connect(self):
            """Establish a WebSocket connection to the translation service."""
            headers = {"Authorization": f"Bearer {self.api_key}"}
            try:
                self.ws = await websockets.connect(self.api_url, additional_headers=headers)
                self.is_connected = True
                print(f"Successfully connected to the server: {self.api_url}")
                await self.configure_session()
            except Exception as e:
                print(f"Connection failed: {e}")
                self.is_connected = False
                raise
    
        async def configure_session(self):
            """Configure the translation session, setting the target language, voice, and other parameters."""
            config = {
                "event_id": f"event_{int(time.time() * 1000)}",
                "type": "session.update",
                "session": {
                    # 'modalities' controls the output type.
                    # ["text", "audio"]: Returns both translated text and synthesized audio (recommended).
                    # ["text"]: Returns only the translated text.
                    "modalities": ["text", "audio"] if self.audio_enabled else ["text"],
                    **({"voice": self.voice} if self.audio_enabled and self.voice else {}),
                    "input_audio_format": "pcm16",
                    "output_audio_format": "pcm16",
                    "translation": {
                        "language": self.target_language
                    }
                }
            }
            print(f"Sending session configuration: {json.dumps(config, indent=2, ensure_ascii=False)}")
            await self.ws.send(json.dumps(config))
    
        async def send_audio_chunk(self, audio_data: bytes):
            """Encode and send an audio data block to the server."""
            if not self.is_connected:
                return
                
            event = {
                "event_id": f"event_{int(time.time() * 1000)}",
                "type": "input_audio_buffer.append",
                "audio": base64.b64encode(audio_data).decode()
            }
            await self.ws.send(json.dumps(event))
    
        async def send_image_frame(self, image_bytes: bytes, *, event_id: str | None = None):
            # Send image data to the server
            if not self.is_connected:
                return
    
            if not image_bytes:
                raise ValueError("image_bytes cannot be empty")
    
            # Encode to Base64
            image_b64 = base64.b64encode(image_bytes).decode()
    
            event = {
                "event_id": event_id or f"event_{int(time.time() * 1000)}",
                "type": "input_image_buffer.append",
                "image": image_b64,
            }
    
            await self.ws.send(json.dumps(event))
    
        def _audio_player_task(self):
            stream = self.pyaudio_instance.open(
                format=self.output_format,
                channels=self.output_channels,
                rate=self.output_rate,
                output=True,
                frames_per_buffer=self.output_chunk,
            )
            try:
                while self.is_connected or not self.audio_playback_queue.empty():
                    try:
                        audio_chunk = self.audio_playback_queue.get(timeout=0.1)
                        if audio_chunk is None: # Termination signal
                            break
                        stream.write(audio_chunk)
                        self.audio_playback_queue.task_done()
                    except queue.Empty:
                        continue
            finally:
                stream.stop_stream()
                stream.close()
    
        def start_audio_player(self):
            """Start the audio player thread (only when audio output is enabled)."""
            if not self.audio_enabled:
                return
            if self.audio_player_thread is None or not self.audio_player_thread.is_alive():
                self.audio_player_thread = threading.Thread(target=self._audio_player_task, daemon=True)
                self.audio_player_thread.start()
    
        async def handle_server_messages(self, on_text_received):
            """Loop to process messages from the server."""
            try:
                async for message in self.ws:
                    event = json.loads(message)
                    event_type = event.get("type")
                    if event_type == "response.audio.delta" and self.audio_enabled:
                        audio_b64 = event.get("delta", "")
                        if audio_b64:
                            audio_data = base64.b64decode(audio_b64)
                            self.audio_playback_queue.put(audio_data)
    
                    elif event_type == "response.done":
                        print("\n[INFO] A round of response is complete.")
                        usage = event.get("response", {}).get("usage", {})
                        if usage:
                            print(f"[INFO] Token usage: {json.dumps(usage, indent=2, ensure_ascii=False)}")
                    elif event_type == "response.audio_transcript.done":
                        print("\n[INFO] Text translation complete.")
                        text = event.get("transcript", "")
                        if text:
                            print(f"[INFO] Translated text: {text}")
                    elif event_type == "response.text.done":
                        print("\n[INFO] Text translation complete.")
                        text = event.get("text", "")
                        if text:
                            print(f"[INFO] Translated text: {text}")
    
            except websockets.exceptions.ConnectionClosed as e:
                print(f"[WARNING] Connection closed: {e}")
                self.is_connected = False
            except Exception as e:
                print(f"[ERROR] An unknown error occurred while processing messages: {e}")
                traceback.print_exc()
                self.is_connected = False
    
        async def start_microphone_streaming(self):
            """Capture audio from the microphone and stream it to the server."""
            stream = self.pyaudio_instance.open(
                format=self.input_format,
                channels=self.input_channels,
                rate=self.input_rate,
                input=True,
                frames_per_buffer=self.input_chunk
            )
            print("Microphone is on. Start speaking...")
            try:
                while self.is_connected:
                    audio_chunk = await asyncio.get_event_loop().run_in_executor(
                        None, stream.read, self.input_chunk
                    )
                    await self.send_audio_chunk(audio_chunk)
            finally:
                stream.stop_stream()
                stream.close()
    
        async def close(self):
            """Gracefully close the connection and release resources."""
            self.is_connected = False
            if self.ws:
                await self.ws.close()
                print("WebSocket connection closed.")
            
            if self.audio_player_thread:
                self.audio_playback_queue.put(None) # Send termination signal
                self.audio_player_thread.join(timeout=1)
                print("Audio player thread stopped.")
                
            self.pyaudio_instance.terminate()
            print("PyAudio instance released.")
  3. Interact with the model

    In the same folder as livetranslate_client.py, create another Python file, name it main.py, and copy the following code into the file:

    main.py

    import os
    import asyncio
    from livetranslate_client import LiveTranslateClient
    
    def print_banner():
        print("=" * 60)
        print("  Powered by Qwen qwen3-livetranslate-flash-realtime")
        print("=" * 60 + "\n")
    
    def get_user_config():
        """Get user configuration"""
        print("Select a mode:")
        print("1. Voice + Text [Default] | 2. Text Only")
        mode_choice = input("Enter your choice (press Enter for Voice + Text): ").strip()
        audio_enabled = (mode_choice != "2")
    
        if audio_enabled:
            lang_map = {
                "1": "en", "2": "zh", "3": "ru", "4": "fr", "5": "de", "6": "pt",
                "7": "es", "8": "it", "9": "ko", "10": "ja", "11": "yue"
            }
            print("Select the target translation language (Voice + Text mode):")
            print("1. English | 2. Chinese | 3. Russian | 4. French | 5. German | 6. Portuguese | 7. Spanish | 8. Italian | 9. Korean | 10. Japanese | 11. Cantonese")
        else:
            lang_map = {
                "1": "en", "2": "zh", "3": "ru", "4": "fr", "5": "de", "6": "pt", "7": "es", "8": "it",
                "9": "id", "10": "ko", "11": "ja", "12": "vi", "13": "th", "14": "ar",
                "15": "yue", "16": "hi", "17": "el", "18": "tr"
            }
            print("Select the target translation language (Text Only mode):")
            print("1. English | 2. Chinese | 3. Russian | 4. French | 5. German | 6. Portuguese | 7. Spanish | 8. Italian | 9. Indonesian | 10. Korean | 11. Japanese | 12. Vietnamese | 13. Thai | 14. Arabic | 15. Cantonese | 16. Hindi | 17. Greek | 18. Turkish")
    
        choice = input("Enter your choice (defaults to the first option): ").strip()
        target_language = lang_map.get(choice, next(iter(lang_map.values())))
    
        voice = None
        if audio_enabled:
            print("\nSelect a speech synthesis voice:")
            voice_map = {"1": "Cherry", "2": "Nofish", "3": "Sunny", "4": "Jada", "5": "Dylan", "6": "Peter", "7": "Eric", "8": "Kiki"}
            print("1. Cherry (Female) [Default] | 2. Nofish (Male) | 3. Sunny (Sichuan Female) | 4. Jada (Shanghai Female) | 5. Dylan (Beijing Male) | 6. Peter (Tianjin Male) | 7. Eric (Sichuan Male) | 8. Kiki (Cantonese Female)")
            voice_choice = input("Enter your choice (press Enter for Cherry): ").strip()
            voice = voice_map.get(voice_choice, "Cherry")
        return target_language, voice, audio_enabled
    
    async def main():
        """Main program entry point"""
        print_banner()
        
        api_key = os.environ.get("DASHSCOPE_API_KEY")
        if not api_key:
            print("[ERROR] Set the DASHSCOPE_API_KEY environment variable.")
            print("  For example: export DASHSCOPE_API_KEY='your_api_key_here'")
            return
            
        target_language, voice, audio_enabled = get_user_config()
        print("\nConfiguration complete:")
        print(f"  - Target language: {target_language}")
        if audio_enabled:
            print(f"  - Synthesized voice: {voice}")
        else:
            print("  - Output mode: Text Only")
        
        client = LiveTranslateClient(api_key=api_key, target_language=target_language, voice=voice, audio_enabled=audio_enabled)
        
        # Define the callback function
        def on_translation_text(text):
            print(text, end="", flush=True)
    
        try:
            print("Connecting to the translation service...")
            await client.connect()
            
            # Start audio playback based on the mode
            client.start_audio_player()
            
            print("\n" + "-" * 60)
            print("Connection successful! Speak into the microphone.")
            print("The program will translate your speech in real time and play the result. Press Ctrl+C to exit.")
            print("-" * 60 + "\n")
    
            # Run message handling and microphone recording concurrently
            message_handler = asyncio.create_task(client.handle_server_messages(on_translation_text))
            tasks = [message_handler]
            # Audio must be captured from the microphone for translation, regardless of whether audio output is enabled
            microphone_streamer = asyncio.create_task(client.start_microphone_streaming())
            tasks.append(microphone_streamer)
    
            await asyncio.gather(*tasks)
    
        except KeyboardInterrupt:
            print("\n\nUser interrupted. Exiting...")
        except Exception as e:
            print(f"\nA critical error occurred: {e}")
        finally:
            print("\nCleaning up resources...")
            await client.close()
            print("Program exited.")
    
    if __name__ == "__main__":
        asyncio.run(main())

    Run main.py and speak the sentences you want to translate into the microphone. The model provides the translated audio and text in real time. The system automatically detects your speech and sends the audio to the server, so no manual action is required.

Use images to improve translation accuracy

The qwen3-livetranslate-flash-realtime model can accept image input to assist with audio translation. This is useful for scenarios involving homonyms or recognizing uncommon proper nouns. You can send a maximum of two images per second.

Download the following sample images to your local machine: mask_medical.png and mask_masquerade.png

Save the following code in the same folder as livetranslate_client.py and run it. Say "What is a mask?" into the microphone. When you input the medical mask image, the model translates the phrase to “什么是口罩?” When you input the masquerade mask image, the model translates the phrase to “什么是面具?”

import os
import time
import asyncio
import contextlib

from livetranslate_client import LiveTranslateClient

IMAGE_PATH = "mask_medical.png"
# IMAGE_PATH = "mask_masquerade.png"

def print_banner():
    print("=" * 60)
    print("  Powered by Qwen qwen3-livetranslate-flash-realtime — Single-turn interaction example (mask)")
    print("=" * 60 + "\n")

async def stream_microphone_once(client: LiveTranslateClient, image_bytes: bytes):
    pa = client.pyaudio_instance
    stream = pa.open(
        format=client.input_format,
        channels=client.input_channels,
        rate=client.input_rate,
        input=True,
        frames_per_buffer=client.input_chunk,
    )
    print("[INFO] Recording started. Please speak...")
    loop = asyncio.get_event_loop()
    last_img_time = 0.0
    frame_interval = 0.5  # 2 fps
    try:
        while client.is_connected:
            data = await loop.run_in_executor(None, stream.read, client.input_chunk)
            await client.send_audio_chunk(data)

            # Append an image frame every 0.5 seconds
            now = time.time()
            if now - last_img_time >= frame_interval:
                await client.send_image_frame(image_bytes)
                last_img_time = now
    finally:
        stream.stop_stream()
        stream.close()

async def main():
    print_banner()
    api_key = os.environ.get("DASHSCOPE_API_KEY")
    if not api_key:
        print("[ERROR] Set the DASHSCOPE_API_KEY environment variable first.")
        return

    client = LiveTranslateClient(api_key=api_key, target_language="zh", voice="Cherry", audio_enabled=True)

    def on_text(text: str):
        print(text, end="", flush=True)

    message_task = None
    try:
        await client.connect()
        client.start_audio_player()
        message_task = asyncio.create_task(client.handle_server_messages(on_text))
        with open(IMAGE_PATH, "rb") as f:
            img_bytes = f.read()
        await stream_microphone_once(client, img_bytes)
        await asyncio.sleep(15)
    finally:
        await client.close()
        if message_task and not message_task.done():
            message_task.cancel()
            with contextlib.suppress(asyncio.CancelledError):
                await message_task

if __name__ == "__main__":
    asyncio.run(main())

One-click deployment using Function Compute

The console does not currently support this demo. You can deploy it with one click as follows:

  1. Open the Function Compute template, enter your API key, and click Create And Deploy Default Environment to try it online.

  2. Wait for about one minute. In Environment Details > Environment Context, retrieve the endpoint. Change http to https in the endpoint (for example, https://qwen-livetranslate-flash-realtime-intl.fcv3.xxx.ap-southeast-1.fc.devsapp.net/) and use the link to interact with the model.

    Important

    This link uses a self-signed certificate and is for temporary testing only. When you first access it, your browser will display a security warning. This is expected behavior. Do not use this in a production environment. To proceed, follow your browser's instructions, such as clicking "Advanced" → "Proceed to (unsafe)".

  • To enable Resource Access Management (RAM) permissions, follow the on-screen instructions.

  • You can view the project source code under Resource Information > Function Resources.

  • Function Compute and Alibaba Cloud Model Studio both offer a free quota for new users. This quota can cover the cost of simple testing. After the free quota is exhausted, you are charged on a pay-as-you-go basis. Charges are incurred only when the service is accessed.

Interaction flow

The interaction flow for real-time speech translation follows the standard WebSocket event-driven model, where the server automatically detects the start and end of speech and responds.

  • Session initialization

    Client event: session.update (Session configuration)

    Server events: session.created (Session created), session.updated (Session configuration updated)

  • User audio input

    Client events: input_audio_buffer.append (Add audio to the buffer), input_image_buffer.append (Add an image to the buffer)

    Server event: None

  • Server audio output

    Client event: None

    Server events:

    response.created: The server starts generating a response

    response.output_item.added: New output content in the response

    response.content_part.added: New output content added to the assistant message

    response.audio_transcript.text: Incrementally generated transcript text

    response.audio.delta: Incrementally generated audio from the model

    response.audio_transcript.done: Text transcription complete

    response.audio.done: Audio generation complete

    response.content_part.done: Streaming of the text or audio content for the assistant message is complete

    response.output_item.done: Streaming of the entire output item for the assistant message is complete

    response.done: Response complete

API reference

For more information, see Real-time audio and video translation (Qwen-Livetranslate).

Billing

  • Audio

    Each second of input audio consumes 25 tokens. Each second of output audio consumes 12.5 tokens.

  • Image

    Each 28 × 28 pixel block of an input image consumes 0.5 tokens.
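
For example, assuming tokens are counted per second of audio and per 28 × 28 pixel block of each image, 60 seconds of input audio consumes 60 × 25 = 1,500 tokens, and a single 672 × 448 image consumes (672 / 28) × (448 / 28) × 0.5 = 24 × 16 × 0.5 = 192 tokens.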

For token pricing, see the Model List.

Supported languages

| Language code | Language | Supported output modality |
| --- | --- | --- |
| en | English | Audio + Text |
| zh | Chinese | Audio + Text |
| ru | Russian | Audio + Text |
| fr | French | Audio + Text |
| de | German | Audio + Text |
| pt | Portuguese | Audio + Text |
| es | Spanish | Audio + Text |
| it | Italian | Audio + Text |
| id | Indonesian | Text |
| ko | Korean | Audio + Text |
| ja | Japanese | Audio + Text |
| vi | Vietnamese | Text |
| th | Thai | Text |
| ar | Arabic | Text |
| yue | Cantonese | Audio + Text |
| hi | Hindi | Text |
| el | Greek | Text |
| tr | Turkish | Text |

Supported voices

| Voice name | voice parameter | Description | Supported languages |
| --- | --- | --- | --- |
| Qianyue | Cherry | A sunny, positive, and friendly female voice. | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean |
| Nofish | Nofish | A designer who cannot pronounce retroflex consonants. | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean |
| Shanghai - Jada | Jada | A lively and energetic Shanghainese woman. | Chinese |
| Beijing - Dylan | Dylan | A young man who grew up in the hutongs of Beijing. | Chinese |
| Sichuan - Sunny | Sunny | A sweet-voiced Sichuanese girl. | Chinese |
| Tianjin - Peter | Peter | Tianjin Crosstalk: the art of the supporting role. | Chinese |
| Cantonese - Kiki | Kiki | A sweet-voiced best friend from Hong Kong. | Cantonese |
| Sichuan - Chengchuan | Eric | A man from Chengdu, Sichuan, with a voice that stands out from the crowd. | Chinese |