Qwen3.5-LiveTranslate: From Sound to Sight, From Word to Right


Qwen3.5-LiveTranslate-Flash is the latest simultaneous interpretation model in the Qwen family, built on top of Qwen3.5-Omni. It delivers real-time, multimodal translation that not only hears and translates speech, but also sees and understands visual context to produce more accurate translations. Compared with its predecessor Qwen3-LiveTranslate, Qwen3.5-LiveTranslate-Flash brings major upgrades across language coverage, latency, voice cloning, and terminology handling, making it well-suited for international meetings, livestream localization, online classrooms, and business negotiations.

Key Highlights

Massively expanded language coverage: understands 18 → 60 languages, speaks 10 → 29 languages. Support for input audio and output text has grown from 18 to 60 languages, and output audio support from 10 to 29, covering far more cross-lingual combinations for multilingual interpretation in international meetings, livestream localization, online classrooms, and business negotiations.
Ultra-low latency: powered by Readable Unit technology for faster text and speech output. A novel Readable Unit real-time translation technique enables more aggressive streaming output while preserving translation readability and semantic consistency. Average speech-to-speech per-token latency is reduced to 2.8 seconds, which suits latency-sensitive scenarios such as livestreams, co-hosting, and press conferences.
Real-time voice cloning: one sentence to start, instantly “interpret in your voice”. During simultaneous interpretation, the system automatically replicates the speaker’s vocal characteristics, so the translated speech sounds like “the same person” across languages. This enhances immersion and identity consistency, which is especially important for streamers, guests, and hosts.
Hotword enhancement: proper nouns and industry terms “recognized right, written right, translated right”. Built-in Hotword capability prioritizes the recognition and translation of names, places, brand names, product models, and industry terminology. Hotwords can be dynamically configured and updated in real time per scenario, significantly reducing the risk of terminology mistranslation. This makes the feature well suited to technical launches, medical/legal/financial meetings, and enterprise training.
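
To make the hotword idea concrete, here is a minimal client-side sketch of a glossary that can be updated per scenario. It is not the Qwen3.5-LiveTranslate API: the `HotwordGlossary` class, its methods, and the source-term-to-translation mapping are assumptions for illustration. The only figure taken from this post is the limit of 1,000 hotwords.

```python
from dataclasses import dataclass, field

MAX_HOTWORDS = 1000  # limit quoted in this post's comparison table


@dataclass
class HotwordGlossary:
    """Hypothetical glossary mapping source terms to preferred translations."""

    entries: dict[str, str] = field(default_factory=dict)

    def update(self, new_entries: dict[str, str]) -> None:
        """Merge new terms; reject the update if it would exceed the limit."""
        merged = {**self.entries, **new_entries}
        if len(merged) > MAX_HOTWORDS:
            raise ValueError(f"too many hotwords: {len(merged)} > {MAX_HOTWORDS}")
        self.entries = merged

    def remove(self, term: str) -> None:
        """Drop a term if present; unknown terms are ignored."""
        self.entries.pop(term, None)


glossary = HotwordGlossary()
glossary.update({"通义千问": "Qwen", "同声传译": "simultaneous interpretation"})
# A later update for the same term overrides the earlier translation.
glossary.update({"同声传译": "simultaneous interpreting"})
print(len(glossary.entries))  # 2
```

Validating the merged size before committing keeps a rejected update from leaving the glossary half-applied.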

Performance

We evaluate Qwen3.5-LiveTranslate-Flash in both offline and real-time (streaming) settings.

Offline Translation

On public multilingual speech translation benchmarks (FLEURS, CoVoST2), Qwen3.5-LiveTranslate-Flash achieves higher translation accuracy than mainstream commercial large speech models and significantly surpasses its predecessor Qwen3-LiveTranslate-Flash, improving both language coverage and translation quality.

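Demo1: Overview English → X

[Figure: offline translation overview, English → X]

Demo2: Overview X → English

[Figure: offline translation overview, X → English]

Demo3: FLEURS English → X

[Figure: offline FLEURS results, English → X]

Demo4: FLEURS X → English

[Figure: offline FLEURS results, X → English]

Demo5: CoVoST2 English → X

[Figure: offline CoVoST2 results, English → X]

Demo6: CoVoST2 X → English

[Figure: offline CoVoST2 results, X → English]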

Real-Time Translation

With the Readable Unit streaming strategy, Qwen3.5-LiveTranslate-Flash reduces first-token latency by 3.45 s and per-token latency by 1.88 s compared to Qwen3-LiveTranslate-Flash, achieving an average speech-to-speech per-token latency of 2.8 s, with virtually no loss in translation quality.
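
As a quick sanity check on these figures, the arithmetic below derives the predecessor's implied per-token latency and the relative reduction. It assumes the 1.88 s reduction is measured on the same speech-to-speech metric as the 2.8 s average, which the post does not state explicitly. The variable names are illustrative.

```python
# Figures quoted in this post, in seconds.
new_per_token_latency = 2.8   # average speech-to-speech per-token latency
per_token_reduction = 1.88    # reduction versus Qwen3-LiveTranslate-Flash

# Assumption: both numbers refer to the same speech-to-speech metric.
implied_old_latency = round(new_per_token_latency + per_token_reduction, 2)
relative_reduction = per_token_reduction / implied_old_latency

print(f"implied predecessor latency: {implied_old_latency:.2f} s")  # 4.68 s
print(f"relative reduction: {relative_reduction:.0%}")              # 40%
```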

Demo1: Overview

[Figure: real-time (streaming) translation overview]

Model Architecture

Qwen3.5-LiveTranslate is a large translation model built on the Qwen3.5-Omni Thinker-Talker architecture. The Thinker receives interleaved visual and audio inputs and generates text translations, while the Talker takes the translated text and the source audio and produces speech with cross-lingual voice cloning. For real-time simultaneous interpretation, we adopt a chunk-wise streaming input mechanism and introduce Readable Unit tags to control the granularity of speech synthesis, which reduces interpretation latency. Meanwhile, dynamic cross-lingual voice cloning enables the model to preserve the speaker’s original vocal characteristics during real-time translation.

[Figure: Qwen3.5-LiveTranslate model architecture overview]
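
The role of Readable Unit tags in controlling synthesis granularity can be sketched as a small streaming segmenter. This is a toy illustration, not the model's actual mechanism. The `<ru>` marker, the `readable_units` function, and the sample token stream are all assumptions. The point is only that text can be handed to speech synthesis as soon as a unit boundary arrives, instead of waiting for the whole sentence.

```python
from typing import Iterable, Iterator

RU_TAG = "<ru>"  # hypothetical boundary marker; the real tag format is not given in this post


def readable_units(token_stream: Iterable[str]) -> Iterator[str]:
    """Yield each text span as soon as its closing boundary tag arrives."""
    buffer: list[str] = []
    for token in token_stream:
        if token == RU_TAG:
            text = "".join(buffer).strip()
            buffer = []
            if text:  # skip empty units, e.g. two tags in a row
                yield text
        else:
            buffer.append(token)
    text = "".join(buffer).strip()
    if text:  # flush trailing text once the stream ends
        yield text


# Simulated translated-text stream for one utterance.
tokens = ["Good", " morning", RU_TAG, " everyone", ",", RU_TAG, " let's", " begin", "."]
for unit in readable_units(tokens):
    print(unit)  # each unit could be sent to speech synthesis immediately
```

Because the function is a generator, a downstream synthesizer can start speaking "Good morning" while later tokens are still being produced.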

More Supported Languages

Compared to Qwen3-LiveTranslate, Qwen3.5-LiveTranslate significantly expands language coverage. Support for input audio and output text grows from 18 to 60 languages, and output audio support from 10 to 29 languages, enabling a far wider range of cross-lingual translation combinations across global scenarios.

	Qwen3-LiveTranslate	Qwen3.5-LiveTranslate
Input Modality	Audio / Video	Audio / Video
Inference Mode	Offline / Streaming	Offline / Streaming
Voice Cloning	✗	✓ (3 modes: pre-registered / clone-once / real-time)
Hotwords	Up to 1,000	Up to 1,000
Input Audio Languages & Output Text Languages	18 languages: Chinese, English, Russian, French, German, Portuguese, Spanish, Italian, Indonesian, Korean, Japanese, Vietnamese, Thai, Arabic, Cantonese, Hindi, Greek, Turkish	60 languages: Afrikaans, Arabic, Asturian, Azerbaijani, Basque, Belarusian, Bengali, Bosnian, Bulgarian, Cantonese, Catalan, Cebuano, Chinese, Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Filipino, Finnish, French, Galician, Georgian, German, Greek, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Interlingua, Italian, Japanese, Javanese, Kannada, Kazakh, Korean, Kyrgyz, Lingala, Latvian, Lithuanian, Macedonian, Malay, Malayalam, Maltese, Maori, Marathi, Mongolian, Norwegian Bokmål, Nynorsk, Odia, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Tajik, Tamil, Telugu, Thai, Turkish, Ukrainian, Urdu, Uyghur, Vietnamese
Output Audio Languages	10 languages: Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean	29 languages: Chinese, English, German, Italian, Portuguese, Spanish, Japanese, Korean, French, Russian, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Filipino, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
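
Because text output covers more languages than audio output, a client may want to check which output modalities a target language supports. The sketch below encodes the 29 output-audio languages from the table. The `output_modalities` helper is hypothetical, and it assumes the target is already one of the supported text-output languages.

```python
# The 29 output-audio languages listed in the table above.
SPEECH_OUTPUT_LANGUAGES = frozenset({
    "Chinese", "English", "German", "Italian", "Portuguese", "Spanish",
    "Japanese", "Korean", "French", "Russian", "Thai", "Indonesian",
    "Arabic", "Vietnamese", "Turkish", "Finnish", "Polish", "Hindi",
    "Dutch", "Czech", "Urdu", "Filipino", "Swedish", "Danish", "Hebrew",
    "Icelandic", "Malay", "Norwegian", "Persian",
})


def output_modalities(target_language: str) -> list[str]:
    """Return the output modalities for a target language.

    Assumes the language is already a supported text-output language.
    """
    modalities = ["text"]
    if target_language in SPEECH_OUTPUT_LANGUAGES:
        modalities.append("audio")
    return modalities


print(output_modalities("Thai"))     # ['text', 'audio']
print(output_modalities("Swahili"))  # ['text']
```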

See It in Action

International Meeting

A multilingual business meeting where participants speak in different languages and switch between them mid-sentence. Qwen3.5-LiveTranslate handles code-switching, diverse accents, and domain-specific terminology in real time — delivering fluent, natural translations without missing a beat.

Traveling Abroad

A real-world travel scenario powered by Qwen AI Glasses: a Chinese tourist orders food at a local restaurant in Thailand. The model performs live Thai-to-Chinese translation on-device, combining visual context from the menu with spoken dialogue to produce accurate, context-aware translations — making cross-language communication effortless on the go.

Livestream Scenarios

An e-commerce livestream translation scenario. Qwen3.5-LiveTranslate accurately translates product specifications and numerical information, ensuring precise cross-language delivery of product parameters.

Classical Chinese Translation

A scene from Romance of the Three Kingdoms narrated in classical Chinese (文言文). Qwen3.5-LiveTranslate accurately interprets and translates archaic Chinese prose into modern English, demonstrating its ability to handle literary and historical language beyond everyday speech.

Visual Disambiguation

Qwen3.5-LiveTranslate leverages visual context to resolve translation ambiguities. When a word or phrase has multiple possible meanings, the model uses what it sees — on-screen text, objects, or scene context — to select the correct interpretation, producing translations that are both accurate and contextually grounded.

Future Directions

We will continue exploring the capability boundaries of multimodal translation and focus on the following directions:

Lower latency: keep reducing end-to-end simultaneous interpretation latency toward real-time experience limits.
More languages and dialects: expand input/output coverage for low-resource languages, regional dialects, and cross-regional expressions.
Longer context and stronger consistency: maintain terminology, names, and context consistency in long meetings and multi-turn dialogues.
Higher-fidelity voice cloning: preserve speaker characteristics while restoring ambient sounds and on-site atmosphere more naturally.
Richer interaction modes: support multilingual, mixed-dialect expression, speaker separation, and joint multimodal modeling with gestures, lip movement, and expressions.

Citation

Feel free to cite the following article if you find Qwen3.5-LiveTranslate helpful:

@misc{qwen35livetranslateblog,
    title = {Qwen3.5-LiveTranslate: From Sound to Sight, From Word to Right},
    url = {https://qwen.ai/blog?id=qwen3.5-livetranslate},
    author = {Qwen Team},
    month = {May},
    year = {2026}
}
