Qwen3‑LiveTranslate: Real‑Time Multimodal Interpretation - See It, Hear It, Speak It!

This article introduces Qwen3‑LiveTranslate, a real‑time multimodal AI for fast, vision‑enhanced, high‑quality multilingual audio and video interpretation.

Qwen3‑LiveTranslate‑Flash delivers high‑precision, low‑latency, and reliable real‑time multilingual audio and video interpretation. Built on the capabilities of Qwen3‑Omni and trained on millions of hours of multimodal data, it supports both offline and live translation in 18 languages, making cross‑language communication seamless.

Key Features:

Multilingual and Dialect Coverage: Supports major official languages including Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Indonesian, Thai, Vietnamese, Arabic, Hindi, Greek, and Turkish, as well as translation of dialects and accents including Mandarin, Cantonese, Beijing, Wu, Sichuan, and Tianjin.
Vision‑Enhanced Comprehension: For the first time, Qwen3‑LiveTranslate‑Flash incorporates visual context augmentation, enabling it to not only understand what it hears but also understand what it sees. By detecting and interpreting lip movements, gestures, on‑screen text, and real‑world entities, the system robustly handles noisy audio environments and resolves ambiguities in terms with multiple meanings.
3s Latency: A lightweight mixture‑of‑experts architecture, coupled with dynamic sampling, enables simultaneous interpretation with latency as low as three seconds.
Lossless Interpretation: Utilizes semantic unit prediction to mitigate cross‑lingual reordering challenges in translation, achieving real‑time translation quality that is close to offline translation.
Natural Voice Quality: With training on massive speech datasets, the model delivers lifelike voices whose tone and expressiveness naturally follow the meaning of the source speech.
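
To make the idea of semantic units more concrete, here is a minimal Python sketch of how a streaming pipeline might group incoming words into units that can be translated without waiting for the full sentence. It is purely illustrative and is not the Qwen3‑LiveTranslate‑Flash implementation: the `semantic_units` function, the punctuation-based `BOUNDARY_TOKENS` rule, and the `max_len` cap are all hypothetical stand-ins for a learned boundary predictor.

```python
from typing import Iterable, Iterator, List

# Assumption: a toy boundary rule based on punctuation.
# A real system would use a learned predictor instead.
BOUNDARY_TOKENS = {",", ".", "?", "!"}


def semantic_units(tokens: Iterable[str], max_len: int = 8) -> Iterator[List[str]]:
    """Group a token stream into units that can be translated one at a time."""
    buffer: List[str] = []
    for token in tokens:
        buffer.append(token)
        # Emit a unit at a predicted boundary, or when the buffer
        # reaches max_len tokens (this bounds the waiting time).
        if token in BOUNDARY_TOKENS or len(buffer) >= max_len:
            yield buffer
            buffer = []
    # Flush whatever is left when the stream ends.
    if buffer:
        yield buffer


stream = "the meeting starts at nine , and the report is due on friday .".split()
for unit in semantic_units(stream):
    print(" ".join(unit))
```

Each yielded unit could be handed to a translator as soon as it is complete. In this sketch, a smaller `max_len` would lower the waiting time at the cost of giving the translator less context per unit.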

Performance

Qwen3‑LiveTranslate‑Flash achieves significantly higher accuracy than strong large-scale models, including Gemini‑2.5‑Flash, GPT‑4o‑Audio‑Preview, and Voxtral Small‑24B, on public benchmarks for Chinese, English and multilingual speech translation.

Qwen3‑LiveTranslate‑Flash consistently delivers leading translation performance across different domains and under challenging acoustic conditions.

Semantic unit prediction alleviates cross-lingual reordering issues, allowing real-time simultaneous interpretation to run at significantly lower latency while retaining over 94% of the accuracy achieved by non-real-time (offline) translation.
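
One way to read the 94% figure is as a retention ratio: the real-time score divided by the offline score on the same test set. The short Python sketch below shows that arithmetic. The `retention` helper and the example scores are hypothetical, and the source does not state which accuracy metric was used.

```python
def retention(streaming_score: float, offline_score: float) -> float:
    """Fraction of the offline score that the real-time system keeps."""
    if offline_score <= 0:
        raise ValueError("offline_score must be positive")
    return streaming_score / offline_score


# Hypothetical scores, for illustration only (not taken from the benchmarks).
ratio = retention(streaming_score=47.0, offline_score=50.0)
print(f"retained {ratio:.1%} of offline accuracy")
```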

Visual enhancement technology further improves Qwen3-LiveTranslate-Flash’s translation precision in challenging scenarios such as noisy audio, ambiguous word meanings, and proper noun translation. In real-time settings, visual information compensates for missing speech context, making its advantages even more pronounced.

Original source: https://qwen.ai/blog?id=b2de6ae8555599bf3b87eec55a285cdf496b78e4&from=research.latest-advancements-list
