
Qwen3-Omni-Flash-2025-12-01: Hear You. See You. Follow Smarter!

This article introduces the upgraded Qwen3-Omni-Flash-2025-12-01 model, which delivers smarter, more natural multimodal interaction across text, audio, images, and video.


Qwen3-Omni is a next-generation native multimodal large model capable of seamlessly processing multiple input modalities—including text, images, audio, and video—and generating both text and natural-sounding speech outputs simultaneously via real-time streaming responses. This version introduces multiple enhancements to improve model performance and efficiency.

Qwen3-Omni-Flash-2025-12-01 is a comprehensively upgraded iteration built upon Qwen3-Omni.
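For developers who want to try it out, here is a minimal sketch of a streaming multimodal call, assuming an OpenAI-compatible Alibaba Cloud Model Studio endpoint. The base URL, model name, voice name, image URL, and the exact layout of streamed audio fragments below are illustrative assumptions, so check the official documentation for the values supported by this release.

```python
import base64
from openai import OpenAI

# Assumed endpoint and credentials; verify against the Model Studio docs.
client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
    model="qwen3-omni-flash",  # placeholder model/snapshot name
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/scene.jpg"}},
                {"type": "text", "text": "Describe this scene out loud."},
            ],
        }
    ],
    modalities=["text", "audio"],                # request text and speech output
    audio={"voice": "Cherry", "format": "wav"},  # voice name is an assumption
    stream=True,                                 # Omni models stream their replies
)

text_parts, audio_parts = [], []
for chunk in completion:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    if delta.content:
        text_parts.append(delta.content)       # incremental text tokens
    audio = getattr(delta, "audio", None)      # base64 audio fragments, if present
    if audio and audio.get("data"):
        audio_parts.append(audio["data"])

print("".join(text_parts))
if audio_parts:
    # The decoded bytes may be raw PCM rather than a ready-to-play WAV,
    # depending on the service; this simply dumps them to disk.
    with open("reply_audio.bin", "wb") as f:
        f.write(base64.b64decode("".join(audio_parts)))
```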

Key highlights of this upgraded version include:

  • Greatly Enhanced Audio-Visual Interaction Experience: Dramatically improved understanding and execution of audio-visual instructions, effectively resolving the “intelligence drop” issue commonly seen in casual spoken scenarios. Multi-turn audio-visual conversations now achieve significantly higher stability and coherence, enabling more natural and seamless interactions.
  • Strengthened System Prompt Control: Full customization of system prompts is now supported, enabling precise control over model behavior. Whether it’s persona style (e.g., sweet, cool, anime-inspired), colloquial tone preferences, or output length constraints, every detail can be finely tuned, offering unprecedented command over response characteristics (see the sketch after this list).
  • More Reliable Multilingual Compliance: Supports text-based interaction in 119 languages, speech recognition in 19 languages, and speech synthesis in 10 languages. Language-following instability from the previous version has been fully addressed, ensuring accurate and consistent performance across diverse linguistic contexts.
  • More Human-Like and Fluent Speech Synthesis: Eliminates sluggish or robotic speech by significantly enhancing adaptive control over prosody. The model now intelligently adjusts speaking rate, pauses, and intonation based on textual context, delivering expressive, natural-sounding voice output that closely mimics real human speech.
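
As a small illustration of the system-prompt control called out above, the sketch below reuses the same assumed OpenAI-compatible endpoint and placeholder model name as the earlier example; the persona itself is made up. A single system message pins persona, colloquial tone, and output length together:

```python
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",  # assumed endpoint
)

# One system prompt fixes persona, tone, and output length in one place.
system_prompt = (
    "You are 'Momo', a cheerful anime-style assistant. "
    "Speak casually, keep every reply under two sentences, "
    "and answer in the same language the user uses."
)

completion = client.chat.completions.create(
    model="qwen3-omni-flash",  # placeholder model name
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "Is tomorrow a good day for a picnic?"},
    ],
    modalities=["text"],  # text-only here; add "audio" for spoken replies
    stream=True,
)

for chunk in completion:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```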


Performance

On objective benchmarks, Qwen3-Omni-Flash-2025-12-01 achieves substantial improvements across all modalities compared to Qwen3-Omni-Flash:

🧠 Stronger Text Understanding & Generation:

Major gains in logical reasoning (ZebraLogic +5.6), code generation (LiveCodeBench-v6 +9.3, MultiPL-E +2.7), and holistic writing quality (WritingBench +2.2), enabling more reliable execution of complex, multi-step instructions.

👂 More Accurate Speech Understanding:

Significantly lower word error rate on Fleurs-zh, along with a +3.2 improvement on VoiceBench, reflecting enhanced comprehension of spoken language in real-world dialogue scenarios.

🎙️ More Natural Speech Synthesis:

Higher-quality, human-like voice generation across multiple languages—especially in Chinese and multilingual contexts—with improved prosody, pacing, and pausing that closely mirror natural human speech.

👁️ Deeper Image Understanding:

Breakthrough performance on visual reasoning tasks, including +4.7 on MMMU, +4.8 on MMMU-Pro, and +2.2 on MathVision_full, demonstrating a stronger ability to “see,” interpret, and reason about complex visual content—from diagrams to mathematical figures.

🎬 More Coherent Video Understanding:

Steady improvement in video semantic comprehension (MLVU +1.6), further strengthened by tighter audio-visual synchronization, laying a solid foundation for seamless real-time video conversations.

With this upgrade, Qwen3-Omni-Flash-2025-12-01 truly embodies the vision of “Hear You. See You. Follow Smarter.”—delivering an AI interaction experience that is more natural, precise, and vivid than ever before.


What’s Next

We are eager to hear your feedback and see the innovative applications you create with Qwen3-Omni. In the near future, we will further advance the model along multiple axes, including multi-speaker ASR, video OCR, and audio–video proactive learning, and we will enhance support for agent-based workflows and function calling.

Citation

If you find our model helpful in your research, we’d appreciate a citation!

BibTeX
@misc{qwen3_omni_20251201,
  author  = {{Qwen Team, Alibaba}},
  title   = {{Qwen3-Omni-Flash-2025-12-01: Hear You. See You. Follow Smarter!}},
  year    = {2025},
  url     = {https://qwen.ai/blog?id=qwen3-omni-20251201},
  urldate = {2025-12-09}
}

