Alibaba Cloud Model Studio: Real-time audio and video translation - Qwen

Last Updated: May 19, 2026

qwen3.5-livetranslate-flash-realtime is a vision-enhanced real-time translation model that translates between 60 languages (29 with audio + text output and 31 with text-only output). It processes both audio and image input from real-time video streams or local video files, leverages visual context to improve translation accuracy, and outputs translated text and audio in real time.

Try an online demo with one-click deployment using Function Compute.

Features

  • Multi-language support: Translates between 60 languages — 29 with audio and text output, 31 with text-only output — including Chinese, English, French, German, Russian, Japanese, Korean, Spanish, Portuguese, and Arabic.

  • Visual enhancement: Analyzes visual cues, such as lip movements, gestures, and on-screen text, to improve translation accuracy, especially in noisy environments or for ambiguous words.

  • 3-second latency: Delivers simultaneous interpretation with latency as low as 3 seconds.

  • Lossless simultaneous interpretation: Uses semantic unit prediction to resolve word order differences between languages, delivering real-time translation quality comparable to offline translation.

  • Natural voice: Generates a natural voice by automatically matching the intonation and emotion of the source audio.

  • Hotword configuration: Lets you configure hotwords to improve translation accuracy for specific terms.

  • Voice cloning: Clones the speaker's voice for translated output, so the translation sounds like the speaker speaking in another language. Supports both server-side real-time cloning and pre-cloned voice profiles.

Procedure

1. Configure the connection

The qwen3.5-livetranslate-flash-realtime model uses the WebSocket protocol. The connection requires the following parameters:

| Parameter | Description |
| --- | --- |
| endpoint | Chinese mainland: wss://dashscope.aliyuncs.com/api-ws/v1/realtime; International: wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime |
| query parameter | The model query parameter must be set to the model name. Example: ?model=qwen3.5-livetranslate-flash-realtime |
| message header | Use a Bearer Token for authentication: Authorization: Bearer DASHSCOPE_API_KEY. DASHSCOPE_API_KEY is your API key from Model Studio. |

Sample Python code for establishing a connection:

# pip install websocket-client
import json
import websocket
import os

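API_KEY = os.getenv("DASHSCOPE_API_KEY")
if not API_KEY:
    # Fail early: concatenating None into the header below would raise a TypeError.
    raise SystemExit("Set the DASHSCOPE_API_KEY environment variable first.")
API_URL = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime?model=qwen3.5-livetranslate-flash-realtime"

headers = [
    "Authorization: Bearer " + API_KEY
]

def on_open(ws):
    print(f"Connected to server: {API_URL}")

def on_message(ws, message):
    # Each server message is a JSON-encoded event.
    data = json.loads(message)
    print("Received event:", json.dumps(data, indent=2))

def on_error(ws, error):
    print("Error:", error)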

ws = websocket.WebSocketApp(
    API_URL,
    header=headers,
    on_open=on_open,
    on_message=on_message,
    on_error=on_error
)

ws.run_forever()
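
The connection parameters can also be assembled programmatically. The following is a minimal sketch, not part of the official sample: the helper name build_connection and the region keys "cn" and "intl" are hypothetical, while the endpoints, the model query parameter, and the Authorization header come from the table above.

```python
from urllib.parse import urlencode

# Endpoints copied from the connection parameters above.
# The keys "cn" and "intl" are illustrative labels, not official identifiers.
ENDPOINTS = {
    "cn": "wss://dashscope.aliyuncs.com/api-ws/v1/realtime",
    "intl": "wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime",
}


def build_connection(region: str, model: str, api_key: str) -> tuple[str, list[str]]:
    """Return (url, headers) in the form websocket-client's WebSocketApp accepts."""
    if region not in ENDPOINTS:
        raise ValueError(f"Unknown region: {region!r}")
    if not api_key:
        raise ValueError("API key is empty; set DASHSCOPE_API_KEY first.")
    # The model name is passed as the 'model' query parameter.
    url = f"{ENDPOINTS[region]}?{urlencode({'model': model})}"
    headers = [f"Authorization: Bearer {api_key}"]
    return url, headers
```

The returned pair can be passed to websocket.WebSocketApp(url, header=headers, ...) in the same way as the hard-coded values in the sample.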

2. Configure language, modality, and voice

Send the session.update client event with the following parameters:

  • Language

    • Source language: Configure using the session.input_audio_transcription.language parameter.

      The default value is en (English).
    • Target language: Configure using the session.translation.language parameter.

      The default value is en (English).

    See Supported languages.

  • Output source language recognition results

    To enable this feature, set the session.input_audio_transcription.model parameter. When set to qwen3-asr-flash-realtime, the server returns both the translation and the speech recognition result (original text) for the input audio.

    When this feature is enabled, the server returns the following events:

    • conversation.item.input_audio_transcription.text: Streams the recognition results.

    • conversation.item.input_audio_transcription.completed: Returns the final result after the recognition is complete.

  • Output modality

    Set the session.modalities parameter to ["text"] (text only) or ["text","audio"] (text and audio).

  • Voice

    Configure using the session.voice parameter. See Supported voices.

  • Hotword

    Configure hotwords using the session.translation.corpus.phrases parameter. Hotwords are key-value pairs that map source terms to target translations, improving accuracy for specific terms.

    Example: Map "artificial intelligence" to "Artificial Intelligence".

  • Voice cloning

    Configure using the session.enable_voice_clone, session.voice_clone_options.frequency, and session.voice parameters. Supports three modes: pre-cloned voice profile (frequency: never), server-side clone once at session start (once), or real-time clone before each response (always). See Voice cloning.

3. Input audio and images

Send Base64-encoded audio and image data using the input_audio_buffer.append and input_image_buffer.append events. Audio input is required; image input is optional.

Images can be from a local file or captured in real time from a video stream.
The server automatically detects speech boundaries and triggers the model response.
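
As a rough sketch of this step, the helpers below wrap raw bytes in the two append events. The audio format (16 kHz, 16-bit, mono PCM in 100 ms chunks) is an assumption taken from the sample client in Getting started, and the helper names audio_events and image_event are hypothetical.

```python
import base64

SAMPLE_RATE = 16000    # Hz; assumption, matches the sample client's microphone settings
BYTES_PER_SAMPLE = 2   # 16-bit PCM, mono
CHUNK_MS = 100         # chunk duration used by the sample client (1600 frames)


def audio_events(pcm: bytes, chunk_ms: int = CHUNK_MS):
    """Yield one input_audio_buffer.append event per consecutive PCM chunk."""
    chunk_bytes = SAMPLE_RATE * BYTES_PER_SAMPLE * chunk_ms // 1000
    for start in range(0, len(pcm), chunk_bytes):
        chunk = pcm[start:start + chunk_bytes]
        yield {
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        }


def image_event(image_bytes: bytes) -> dict:
    """Wrap one encoded image (for example, PNG bytes) in an input_image_buffer.append event."""
    if not image_bytes:
        raise ValueError("image_bytes cannot be empty.")
    return {
        "type": "input_image_buffer.append",
        "image": base64.b64encode(image_bytes).decode("ascii"),
    }
```

Each yielded dictionary is serialized with json.dumps and sent as one WebSocket message.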

4. Receive the model response

When the server detects the end of audio input, the model responds. The response format depends on the configured output modality.
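
One way to consume the response is to route each server event by its type. The collector below is a sketch under assumptions: the class name ResponseCollector is hypothetical, and the event names and payload fields (delta, transcript, text) are the ones used by the sample client in Getting started rather than a complete list.

```python
import base64


class ResponseCollector:
    """Accumulate translated audio and text from server events."""

    def __init__(self) -> None:
        self.audio = bytearray()            # decoded PCM from response.audio.delta
        self.translations: list[str] = []   # final translated text, one entry per response
        self.source_texts: list[str] = []   # source-language transcripts, if enabled

    def handle(self, event: dict) -> None:
        etype = event.get("type")
        if etype == "response.audio.delta":
            # Audio arrives as Base64-encoded chunks (text + audio modality).
            self.audio.extend(base64.b64decode(event.get("delta", "")))
        elif etype == "response.audio_transcript.done":
            # Final translated text when audio output is enabled.
            self.translations.append(event.get("transcript", ""))
        elif etype == "response.text.done":
            # Final translated text in text-only mode.
            self.translations.append(event.get("text", ""))
        elif etype == "conversation.item.input_audio_transcription.completed":
            self.source_texts.append(event.get("transcript", ""))
        # Other event types are ignored in this sketch.
```

Call handle(json.loads(message)) for every message received on the connection.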

Supported models

| Model | Version | Context window (tokens) | Max input (tokens) | Max output (tokens) |
| --- | --- | --- | --- | --- |
| qwen3.5-livetranslate-flash-realtime (alias for qwen3.5-livetranslate-flash-realtime-2026-05-19) | Stable | 53,248 | 49,152 | 4,096 |
| qwen3.5-livetranslate-flash-realtime-2026-05-19 | Snapshot | 53,248 | 49,152 | 4,096 |
| qwen3-livetranslate-flash-realtime (alias for qwen3-livetranslate-flash-realtime-2025-09-22) | Stable | 53,248 | 49,152 | 4,096 |
| qwen3-livetranslate-flash-realtime-2025-09-22 | Snapshot | 53,248 | 49,152 | 4,096 |

Getting started

  1. Prepare the environment

    Requires Python 3.10 or later.

    First, install pyaudio.

    macOS

    brew install portaudio && pip install pyaudio

    Debian/Ubuntu

    sudo apt-get install python3-pyaudio
    
    or
    
    pip install pyaudio

    CentOS

    sudo yum install -y portaudio portaudio-devel && pip install pyaudio

    Windows

    pip install pyaudio

    Then install the WebSocket dependencies:

    pip install websocket-client==1.8.0 websockets
  2. Create the client

    Create a file named livetranslate_client.py with the following code:

    import os
    import time
    import base64
    import asyncio
    import json
    import websockets
    import pyaudio
    import queue
    import threading
    import traceback
    
    class LiveTranslateClient:
        def __init__(self, api_key: str, target_language: str = "en", *, audio_enabled: bool = True):
            if not api_key:
                raise ValueError("API key cannot be empty.")
                
            self.api_key = api_key
            self.target_language = target_language
            self.audio_enabled = audio_enabled
            self.ws = None
            self.api_url = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime?model=qwen3.5-livetranslate-flash-realtime"
            
            # Audio input configuration (from microphone)
            self.input_rate = 16000
            self.input_chunk = 1600
            self.input_format = pyaudio.paInt16
            self.input_channels = 1
            
            # Audio output configuration (for playback)
            self.output_rate = 24000
            self.output_chunk = 2400
            self.output_format = pyaudio.paInt16
            self.output_channels = 1
            
            # State management
            self.is_connected = False
            self.audio_player_thread = None
            self.audio_playback_queue = queue.Queue()
            self.pyaudio_instance = pyaudio.PyAudio()
    
        async def connect(self):
            """Establish a WebSocket connection to the translation service."""
            headers = {"Authorization": f"Bearer {self.api_key}"}
            try:
                self.ws = await websockets.connect(self.api_url, additional_headers=headers)
                self.is_connected = True
                print(f"Successfully connected to the server: {self.api_url}")
                await self.configure_session()
            except Exception as e:
                print(f"Connection failed: {e}")
                self.is_connected = False
                raise
    
        async def configure_session(self):
            """Configure the translation session, setting the target language, voice, etc."""
            config = {
                "event_id": f"event_{int(time.time() * 1000)}",
                "type": "session.update",
                "session": {
                    # 'modalities' controls the output type.
                    # ["text", "audio"]: Returns both translated text and synthesized audio (recommended).
                    # ["text"]: Returns only the translated text.
                    "modalities": ["text", "audio"] if self.audio_enabled else ["text"],
                    "input_audio_format": "pcm",
                    "output_audio_format": "pcm",
                    # 'input_audio_transcription' configures source language recognition.
                    # Set 'model' to 'qwen3-asr-flash-realtime' to also output the source language recognition result.
                    # "input_audio_transcription": {
                    #     "model": "qwen3-asr-flash-realtime",
                    #     "language": "zh"  # source language, default 'en'
                    # },
                    "translation": {
                        "language": self.target_language,
                        # 'corpus' configures hotwords to improve the translation accuracy of specific terms.
                        # "corpus": {
                        #     "phrases": {
                        #         "Artificial Intelligence": "Artificial Intelligence",
                        #         "Machine Learning": "Machine Learning"
                        #     }
                        # }
                    }
                }
            }
            print(f"Sending session configuration: {json.dumps(config, indent=2, ensure_ascii=False)}")
            await self.ws.send(json.dumps(config))
    
        async def send_audio_chunk(self, audio_data: bytes):
            """Encode and send an audio chunk to the server."""
            if not self.is_connected:
                return
                
            event = {
                "event_id": f"event_{int(time.time() * 1000)}",
                "type": "input_audio_buffer.append",
                "audio": base64.b64encode(audio_data).decode()
            }
            await self.ws.send(json.dumps(event))
    
        async def send_image_frame(self, image_bytes: bytes, *, event_id: str | None = None):
            # Send an image frame to the server.
            if not self.is_connected:
                return
    
            if not image_bytes:
                raise ValueError("image_bytes cannot be empty.")
    
            # Encode to Base64
            image_b64 = base64.b64encode(image_bytes).decode()
    
            event = {
                "event_id": event_id or f"event_{int(time.time() * 1000)}",
                "type": "input_image_buffer.append",
                "image": image_b64,
            }
    
            await self.ws.send(json.dumps(event))
    
        def _audio_player_task(self):
            stream = self.pyaudio_instance.open(
                format=self.output_format,
                channels=self.output_channels,
                rate=self.output_rate,
                output=True,
                frames_per_buffer=self.output_chunk,
            )
            try:
                while self.is_connected or not self.audio_playback_queue.empty():
                    try:
                        audio_chunk = self.audio_playback_queue.get(timeout=0.1)
                        if audio_chunk is None: # Termination signal
                            break
                        stream.write(audio_chunk)
                        self.audio_playback_queue.task_done()
                    except queue.Empty:
                        continue
            finally:
                stream.stop_stream()
                stream.close()
    
        def start_audio_player(self):
            """Start the audio player thread (only when audio output is enabled)."""
            if not self.audio_enabled:
                return
            if self.audio_player_thread is None or not self.audio_player_thread.is_alive():
                self.audio_player_thread = threading.Thread(target=self._audio_player_task, daemon=True)
                self.audio_player_thread.start()
    
        async def handle_server_messages(self, on_text_received):
            """Handle incoming messages from the server in a loop."""
            try:
                async for message in self.ws:
                    event = json.loads(message)
                    event_type = event.get("type")
                    if event_type == "response.audio.delta" and self.audio_enabled:
                        audio_b64 = event.get("delta", "")
                        if audio_b64:
                            audio_data = base64.b64decode(audio_b64)
                            self.audio_playback_queue.put(audio_data)
    
                    elif event_type == "response.done":
                        print("\n[INFO] Response round complete.")
                        usage = event.get("response", {}).get("usage", {})
                        if usage:
                            print(f"[INFO] token usage: {json.dumps(usage, indent=2, ensure_ascii=False)}")
                    # Process source language recognition results (requires enabling input_audio_transcription.model)
                    # elif event_type == "conversation.item.input_audio_transcription.text":
                    #     stash = event.get("stash", "")  # Pending recognition text
                    #     print(f"[Recognizing] {stash}")
                    # elif event_type == "conversation.item.input_audio_transcription.completed":
                    #     transcript = event.get("transcript", "")  # Complete recognition result
                    #     print(f"[Source language] {transcript}")
                    elif event_type == "response.audio_transcript.done":
                        print("\n[INFO] Translation complete.")
                        text = event.get("transcript", "")
                        if text:
                            print(f"[INFO] Translated text: {text}")
                    elif event_type == "response.text.done":
                        print("\n[INFO] Translation complete.")
                        text = event.get("text", "")
                        if text:
                            print(f"[INFO] Translated text: {text}")
    
            except websockets.exceptions.ConnectionClosed as e:
                print(f"[WARNING] Connection closed: {e}")
                self.is_connected = False
            except Exception as e:
                print(f"[ERROR] An unexpected error occurred while processing messages: {e}")
                traceback.print_exc()
                self.is_connected = False
    
        async def start_microphone_streaming(self):
            """Capture audio from the microphone and stream it to the server."""
            stream = self.pyaudio_instance.open(
                format=self.input_format,
                channels=self.input_channels,
                rate=self.input_rate,
                input=True,
                frames_per_buffer=self.input_chunk
            )
            print("Microphone is on. Start speaking...")
            try:
                while self.is_connected:
                    audio_chunk = await asyncio.get_running_loop().run_in_executor(
                        None, stream.read, self.input_chunk
                    )
                    await self.send_audio_chunk(audio_chunk)
            finally:
                stream.stop_stream()
                stream.close()
    
        async def close(self):
            """Gracefully close the connection and release resources."""
            self.is_connected = False
            if self.ws:
                await self.ws.close()
                print("WebSocket connection closed.")
            
            if self.audio_player_thread:
                self.audio_playback_queue.put(None) # Send termination signal
                self.audio_player_thread.join(timeout=1)
                print("Audio player thread stopped.")
                
            self.pyaudio_instance.terminate()
            print("PyAudio instance released.")
  3. Interact with the model

    In the same directory, create a file named main.py with the following code:

    import os
    import asyncio
    from livetranslate_client import LiveTranslateClient
    
    def print_banner():
        print("=" * 60)
        print("  Powered by Qwen qwen3.5-livetranslate-flash-realtime")
        print("=" * 60 + "\n")
    
    def get_user_config():
        """Get user configuration."""
        print("Select a mode:")
        print("1. Voice + Text [Default] | 2. Text Only")
        mode_choice = input("Enter your choice (press Enter for Voice + Text): ").strip()
        audio_enabled = (mode_choice != "2")
    
        if audio_enabled:
            # Offer only target languages listed with audio + text output under "Supported languages".
            # Cantonese (yue) is listed there as text-only, so it appears only in Text Only mode.
            lang_map = {
                "1": "en", "2": "zh", "3": "ru", "4": "fr", "5": "de", "6": "pt",
                "7": "es", "8": "it", "9": "ko", "10": "ja"
            }
            print("Select the target language (Voice + Text mode):")
            print("1. English | 2. Chinese | 3. Russian | 4. French | 5. German | 6. Portuguese | 7. Spanish | 8. Italian | 9. Korean | 10. Japanese")
        else:
            lang_map = {
                "1": "en", "2": "zh", "3": "ru", "4": "fr", "5": "de", "6": "pt", "7": "es", "8": "it",
                "9": "id", "10": "ko", "11": "ja", "12": "vi", "13": "th", "14": "ar",
                "15": "yue", "16": "hi", "17": "el", "18": "tr"
            }
            print("Select the target language (Text Only mode):")
            print("1. English | 2. Chinese | 3. Russian | 4. French | 5. German | 6. Portuguese | 7. Spanish | 8. Italian | 9. Indonesian | 10. Korean | 11. Japanese | 12. Vietnamese | 13. Thai | 14. Arabic | 15. Cantonese | 16. Hindi | 17. Greek | 18. Turkish")
    
        choice = input("Enter your choice (defaults to the first option): ").strip()
        target_language = lang_map.get(choice, next(iter(lang_map.values())))
    
        return target_language, audio_enabled
    
    async def main():
        """Main program entry point."""
        print_banner()
        
        api_key = os.environ.get("DASHSCOPE_API_KEY")
        if not api_key:
            print("[ERROR] Please set the DASHSCOPE_API_KEY environment variable.")
            print("  For example: export DASHSCOPE_API_KEY='your_api_key_here'")
            return
            
        target_language, audio_enabled = get_user_config()
        print("\nConfiguration complete:")
        print(f"  - Target language: {target_language}")
        if not audio_enabled:
            print("  - Output mode: Text Only")
    
        client = LiveTranslateClient(api_key=api_key, target_language=target_language, audio_enabled=audio_enabled)
        
        # Define the callback function.
        def on_translation_text(text):
            print(text, end="", flush=True)
    
        try:
            print("Connecting to the translation service...")
            await client.connect()
            
            # Start audio playback based on the mode.
            client.start_audio_player()
            
            print("\n" + "-" * 60)
            print("Connection successful! Speak into the microphone.")
            print("The program will translate your speech in real time and play the translated audio. Press Ctrl+C to exit.")
            print("-" * 60 + "\n")
    
            # Run message handling and microphone recording concurrently.
            message_handler = asyncio.create_task(client.handle_server_messages(on_translation_text))
            tasks = [message_handler]
            # Capture audio from the microphone for translation, regardless of whether audio output is enabled.
            microphone_streamer = asyncio.create_task(client.start_microphone_streaming())
            tasks.append(microphone_streamer)
    
            await asyncio.gather(*tasks)
    
        except KeyboardInterrupt:
            print("\n\nUser interrupted. Exiting...")
        except Exception as e:
            print(f"\nA critical error occurred: {e}")
        finally:
            print("\nCleaning up resources...")
            await client.close()
            print("Program exited.")
    
    if __name__ == "__main__":
        asyncio.run(main())

    Run main.py and speak into your microphone. The client streams microphone audio to the server continuously, the server detects speech boundaries automatically, and the model returns translated audio and text in real time.

Voice cloning

The model clones the speaker's voice from the input audio and uses the cloned voice for translated output, so the translation sounds like the speaker delivering it in another language. Use a pre-cloned voice profile, or let the server clone the voice in real time. This is useful in scenarios where preserving the speaker's voice matters, such as conference interpreting, live streaming, and video dubbing.

Set the following parameters in session.update to enable voice cloning:

  • session.enable_voice_clone: Set to true to enable voice cloning.

  • session.voice_clone_options.frequency: Controls when voice cloning occurs. Accepted values:

    • never: Does not clone on the server. Uses a pre-cloned voice profile instead. Set session.voice to your custom cloned voice ID.

    • once: Clones the voice from the input audio once at session start, then reuses it for all subsequent output. Best for single-speaker scenarios. Set session.voice to default.

    • always: Clones the voice before each response, dynamically adapting to speaker changes. Best for multi-speaker conversations. Set session.voice to default.

  • session.voice: Specifies the output voice. The value depends on the frequency setting:

    • Set to default: Use with frequency set to once or always. The server clones the speaker's voice from the input audio. A default voice is used until cloning completes.

    • Set to a custom cloned voice ID (for example, qwen-translate-vc-xxx-yyy-zzz): Use with frequency set to never. You must prepare the voice in advance using the Voice Cloning API with targetModel set to qwen3.5-livetranslate-flash-realtime.

When frequency is set to once or always, the voice parameter must be set to default. Any other value causes the server to return an error.
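
These combinations can be checked on the client before the event is sent. The following is a hedged sketch: the function name validate_voice_clone is hypothetical, and the rules encode only what is stated above. The server remains the authority on what it accepts.

```python
def validate_voice_clone(session: dict) -> None:
    """Raise ValueError if the voice cloning fields of a session config conflict."""
    if not session.get("enable_voice_clone"):
        return  # cloning disabled: nothing to check in this sketch
    frequency = session.get("voice_clone_options", {}).get("frequency")
    voice = session.get("voice")
    if frequency not in ("never", "once", "always"):
        raise ValueError(f"Unsupported frequency: {frequency!r}")
    if frequency in ("once", "always") and voice != "default":
        # Server-side cloning requires the placeholder voice 'default'.
        raise ValueError("voice must be 'default' when frequency is 'once' or 'always'.")
    if frequency == "never" and (not voice or voice == "default"):
        # A pre-cloned profile requires a custom cloned voice ID.
        raise ValueError("voice must be a custom cloned voice ID when frequency is 'never'.")
```

Calling this on the session object before sending session.update surfaces a misconfiguration locally instead of as a server error.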

Voice cloning configuration examples

Pre-cloned voice profile (consistent quality; recommended when a stable voice identity is required):

{
    "type": "session.update",
    "session": {
        "modalities": ["text","audio"],
        "voice": "qwen-translate-vc-xxx-yyy-zzz",
        "translation": {
            "language": "en"
        },
        "enable_voice_clone": true,
        "voice_clone_options": {
            "frequency": "never"
        }
    }
}

Server-side cloning, once per session (best for single-speaker scenarios):

{
    "type": "session.update",
    "session": {
        "modalities": ["text","audio"],
        "voice": "default",
        "translation": {
            "language": "en"
        },
        "enable_voice_clone": true,
        "voice_clone_options": {
            "frequency": "once"
        }
    }
}

Server-side cloning, every response (best for multi-speaker conversations):

{
    "type": "session.update",
    "session": {
        "modalities": ["text","audio"],
        "voice": "default",
        "translation": {
            "language": "en"
        },
        "enable_voice_clone": true,
        "voice_clone_options": {
            "frequency": "always"
        }
    }
}

Improve translation with images

The qwen3.5-livetranslate-flash-realtime model uses image input to improve audio translation, helping disambiguate homonyms and recognize uncommon proper nouns. Send no more than 2 images per second.
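
To stay within the limit of 2 images per second when frames come from a camera or video at a higher frame rate, the sender can drop frames that arrive too early. The class below is a minimal sketch: FrameThrottle is a hypothetical name, and the injectable clock exists only to make the behavior easy to test.

```python
import time


class FrameThrottle:
    """Allow at most max_fps frames per second; the caller drops the rest."""

    def __init__(self, max_fps: float = 2.0, clock=time.monotonic) -> None:
        self.min_interval = 1.0 / max_fps
        self._clock = clock
        self._last_sent = None  # time of the last accepted frame

    def ready(self) -> bool:
        """Return True (and record the time) if a frame may be sent now."""
        now = self._clock()
        if self._last_sent is None or now - self._last_sent >= self.min_interval:
            self._last_sent = now
            return True
        return False
```

In a capture loop, call ready() for each new frame and send the frame with input_image_buffer.append only when it returns True.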

Download the following sample images: medical mask.png, masquerade mask.png

Download the following code to the same directory as livetranslate_client.py and run it. Say "What is mask?" into your microphone. The model uses the provided image to disambiguate the word "mask." For example, using the medical mask.png file translates the phrase as "What is a medical mask?", while using the masquerade mask.png file translates it as "What is a masquerade mask?".

import os
import time
import asyncio
import contextlib

from livetranslate_client import LiveTranslateClient

IMAGE_PATH = "medical mask.png"
# IMAGE_PATH = "masquerade mask.png"

def print_banner():
    print("=" * 60)
    print("  Powered by Qwen qwen3.5-livetranslate-flash-realtime — single-turn interaction example (mask)")
    print("=" * 60 + "\n")

async def stream_microphone_once(client: LiveTranslateClient, image_bytes: bytes):
    pa = client.pyaudio_instance
    stream = pa.open(
        format=client.input_format,
        channels=client.input_channels,
        rate=client.input_rate,
        input=True,
        frames_per_buffer=client.input_chunk,
    )
    print("[INFO] Recording started. Please speak...")
    loop = asyncio.get_running_loop()
    last_img_time = 0.0
    frame_interval = 0.5  # 2 fps
    try:
        while client.is_connected:
            data = await loop.run_in_executor(None, stream.read, client.input_chunk)
            await client.send_audio_chunk(data)

            # Append an image frame every 0.5 seconds
            now = time.time()
            if now - last_img_time >= frame_interval:
                await client.send_image_frame(image_bytes)
                last_img_time = now
    finally:
        stream.stop_stream()
        stream.close()

async def main():
    print_banner()
    api_key = os.environ.get("DASHSCOPE_API_KEY")
    if not api_key:
        print("[ERROR] Please set the DASHSCOPE_API_KEY environment variable.")
        return

    client = LiveTranslateClient(api_key=api_key, target_language="zh", audio_enabled=True)

    def on_text(text: str):
        print(text, end="", flush=True)

    message_task = None
    try:
        await client.connect()
        client.start_audio_player()
        message_task = asyncio.create_task(client.handle_server_messages(on_text))
        with open(IMAGE_PATH, "rb") as f:
            img_bytes = f.read()
        await stream_microphone_once(client, img_bytes)
        await asyncio.sleep(15)
    finally:
        await client.close()
        # message_task is still None if connect() failed before the task was created.
        if message_task is not None and not message_task.done():
            message_task.cancel()
            with contextlib.suppress(asyncio.CancelledError):
                await message_task

if __name__ == "__main__":
    asyncio.run(main())

One-click Function Compute deployment

To deploy the application:

  1. Open the Function Compute template, enter your API key, and click Create and Deploy Default Environment to test the application.

  2. Wait for about a minute. In Environment Details > Environment Context, retrieve the endpoint, change the protocol from http to https (for example, https://qwen-livetranslate-flash-realtime-intl.fcv3.xxx.ap-southeast-1.fc.devsapp.net/), and open the URL in a browser to interact with the model.

    Important

    This endpoint uses a self-signed certificate and is for temporary testing only. Your browser will display a security warning on your first visit. This is expected behavior. Do not use this endpoint in a production environment. To proceed, follow the on-screen instructions (for example, click Advanced → Proceed to (unsafe)).

If you are prompted to configure Resource Access Management permissions, follow the on-screen instructions.
To view the project source code, go to Resource Information > Function Resources.
Both Function Compute and Model Studio provide a free quota for new users, sufficient for basic debugging. After the free quota is used up, pay-as-you-go billing applies.

Interaction flow

Real-time speech translation follows an event-driven WebSocket model. The server automatically detects speech boundaries and responds.

| Lifecycle | Client event | Server event |
| --- | --- | --- |
| Session initialization | session.update: Session configuration | session.created: Session created |
| | | session.updated: Session configuration updated |
| User audio input | input_audio_buffer.append: Append audio to the buffer | None |
| Server audio output | None | response.created: Signals that the server starts generating a response. |
| | | response.output_item.added: Signals that a new output item is available. |
| | | response.content_part.added: Signals that a new content part has been added to the assistant message. |
| | | response.audio_transcript.text: Contains an incremental update to the text transcript. |
| | | response.audio.delta: Contains an incremental chunk of the synthesized audio. |
| | | response.audio_transcript.done: Signals that the full text transcript is complete. |
| | | response.audio.done: Signals that the synthesized audio is complete. |
| | | response.content_part.done: Signals that a text or audio content part for the assistant message is complete. |
| | | response.output_item.done: Signals that the entire output item for the assistant message is complete. |
| | | response.done: Signals that the entire response is complete. |
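
Because every response starts with response.created and ends with response.done, a client can split the event stream into one group per response. The sketch below assumes responses do not overlap, which this page does not state explicitly; the function name group_responses is hypothetical.

```python
def group_responses(events):
    """Group server events into lists, one per response.created ... response.done span."""
    groups = []
    current = None  # events of the response currently in progress, if any
    for event in events:
        etype = event.get("type", "")
        if etype == "response.created":
            current = [event]
        elif current is not None and etype.startswith("response."):
            current.append(event)
            if etype == "response.done":
                groups.append(current)
                current = None
        # Session events and anything outside a response are skipped.
    return groups
```

Grouping events this way makes it straightforward to log or replay a single translation turn.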

API

See Qwen-Livetranslate-Realtime.

Billing

  • Audio: Each second of audio input or output consumes 12.5 tokens.

  • Image: Every 28×28 pixels consumes 0.5 tokens.

  • Text: When source language speech recognition is enabled, the service returns a transcript of the input audio in addition to the translation. This transcript is billed as output text tokens.

For pricing, see Model list.
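
The token rates above can be turned into a rough usage estimate. This is a sketch only: the function names are hypothetical, and rounding partial 28×28 patches up is an assumption, since this page does not say how image dimensions are rounded or whether images are resized before counting.

```python
import math

AUDIO_TOKENS_PER_SECOND = 12.5  # applies to audio input and audio output
IMAGE_TOKENS_PER_PATCH = 0.5    # one patch = 28 x 28 pixels
PATCH_SIDE = 28


def audio_tokens(seconds: float) -> float:
    """Tokens consumed by the given duration of audio."""
    return seconds * AUDIO_TOKENS_PER_SECOND


def image_tokens(width: int, height: int) -> float:
    """Tokens consumed by one image; partial patches are rounded up (assumption)."""
    patches = math.ceil(width / PATCH_SIDE) * math.ceil(height / PATCH_SIDE)
    return patches * IMAGE_TOKENS_PER_PATCH


if __name__ == "__main__":
    print(audio_tokens(60))        # one minute of audio: 750.0 tokens
    print(image_tokens(560, 560))  # 20 x 20 patches: 200.0 tokens
```

Under these assumptions, one minute of input audio with two 560×560 images per second would come to 750 audio tokens plus 120 × 200 = 24,000 image tokens, so image size and frame rate dominate the estimate.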

Supported languages

Use the following language codes to specify the source and target languages.

Some target languages only support text. The legacy model qwen3-livetranslate-flash-realtime supports only the following 18 languages: en, zh, ru, fr, de, pt, es, it, id, ko, ja, vi, th, ar, yue, hi, el, tr.

| Language code | Language | Output |
| --- | --- | --- |
| zh | Chinese | Audio + text |
| en | English | Audio + text |
| ar | Arabic | Audio + text |
| de | German | Audio + text |
| fr | French | Audio + text |
| es | Spanish | Audio + text |
| pt | Portuguese | Audio + text |
| id | Indonesian | Audio + text |
| it | Italian | Audio + text |
| ko | Korean | Audio + text |
| ru | Russian | Audio + text |
| th | Thai | Audio + text |
| vi | Vietnamese | Audio + text |
| ja | Japanese | Audio + text |
| tr | Turkish | Audio + text |
| hi | Hindi | Audio + text |
| ms | Malay | Audio + text |
| nl | Dutch | Audio + text |
| ur | Urdu | Audio + text |
| nb | Norwegian Bokmål | Audio + text |
| sv | Swedish | Audio + text |
| da | Danish | Audio + text |
| he | Hebrew | Audio + text |
| fi | Finnish | Audio + text |
| pl | Polish | Audio + text |
| is | Icelandic | Audio + text |
| cs | Czech | Audio + text |
| fil | Filipino | Audio + text |
| fa | Persian | Audio + text |
| yue | Cantonese | Text |
| el | Greek | Text |
| af | Afrikaans | Text |
| ast | Asturian | Text |
| be | Belarusian | Text |
| bg | Bulgarian | Text |
| bn | Bengali | Text |
| bs | Bosnian | Text |
| ca | Catalan | Text |
| ceb | Cebuano | Text |
| et | Estonian | Text |
| gl | Galician | Text |
| gu | Gujarati | Text |
| hr | Croatian | Text |
| hu | Hungarian | Text |
| jv | Javanese | Text |
| kk | Kazakh | Text |
| kn | Kannada | Text |
| ky | Kyrgyz | Text |
| lv | Latvian | Text |
| mk | Macedonian | Text |
| ml | Malayalam | Text |
| mr | Marathi | Text |
| pa | Punjabi | Text |
| ro | Romanian | Text |
| sk | Slovak | Text |
| sl | Slovenian | Text |
| sw | Swahili | Text |
| tg | Tajik | Text |
| az | Azerbaijani | Text |
| uk | Ukrainian | Text |
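
The language list can drive the choice of output modality on the client. The sketch below copies the codes from this section; the names AUDIO_AND_TEXT, TEXT_ONLY, LEGACY_LANGUAGES, modalities_for, and legacy_supports are hypothetical, and the lookup reflects only what this page lists for qwen3.5-livetranslate-flash-realtime.

```python
# Target languages with audio + text output (29 codes from the list above).
AUDIO_AND_TEXT = {
    "zh", "en", "ar", "de", "fr", "es", "pt", "id", "it", "ko", "ru", "th",
    "vi", "ja", "tr", "hi", "ms", "nl", "ur", "nb", "sv", "da", "he", "fi",
    "pl", "is", "cs", "fil", "fa",
}
# Target languages with text-only output (31 codes from the list above).
TEXT_ONLY = {
    "yue", "el", "af", "ast", "be", "bg", "bn", "bs", "ca", "ceb", "et", "gl",
    "gu", "hr", "hu", "jv", "kk", "kn", "ky", "lv", "mk", "ml", "mr", "pa",
    "ro", "sk", "sl", "sw", "tg", "az", "uk",
}
# Languages supported by the legacy qwen3-livetranslate-flash-realtime model.
LEGACY_LANGUAGES = {
    "en", "zh", "ru", "fr", "de", "pt", "es", "it", "id", "ko", "ja", "vi",
    "th", "ar", "yue", "hi", "el", "tr",
}


def modalities_for(target_language: str) -> list[str]:
    """Return the richest session.modalities value listed for a target language."""
    if target_language in AUDIO_AND_TEXT:
        return ["text", "audio"]
    if target_language in TEXT_ONLY:
        return ["text"]
    raise ValueError(f"Unsupported target language: {target_language!r}")


def legacy_supports(language: str) -> bool:
    """Check a language code against the legacy model's 18-language list."""
    return language in LEGACY_LANGUAGES
```

For example, modalities_for("fa") selects text and audio output, while modalities_for("yue") falls back to text only.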

Supported voices

For supported voices and the corresponding voice parameter values, see Voice list.