Qwen3-Omni: Natively Omni-Modal Foundation Models!

This article introduces Qwen3-Omni, an end-to-end multilingual, omni-modal foundation model that understands text, images, audio, and video and responds in real time with streaming text and natural speech.

Qwen3-Omni is a natively end-to-end, multilingual omni-modal model. It processes text, images, audio, and video, and delivers real-time streaming responses in both text and natural speech. We introduce several upgrades to improve performance and efficiency.

Key Features:

  • Natively Omni-Modal Pretraining: Qwen3-Omni is a natively end-to-end multilingual omni-modal model, with no performance degradation relative to single-modality models of the same size.
  • Powerful Performance: Across 36 audio and audio-visual benchmarks, Qwen3-Omni achieves open-source SOTA on 32 and overall SOTA on 22, outperforming strong closed-source models such as Gemini-2.5-Pro, Seed-ASR, and GPT-4o-Transcribe.
  • Multilingual Support: Qwen3-Omni supports text interaction in 119 languages, speech understanding in 19 languages, and speech generation in 10 languages.
  • Faster Response: Qwen3-Omni achieves latency as low as 211 ms in audio-only scenarios and as low as 507 ms in audio-video scenarios.
  • Longer Understanding: Qwen3-Omni supports audio understanding of up to 30 minutes.
  • Personalized Customization: Qwen3-Omni can be freely adapted via system prompts to modify response styles, personas, and behavioral attributes.
  • Tool Calling: Qwen3-Omni supports function calling, enabling seamless integration with external tools and services (see the sketch after this list).
  • Open-Source Universal Audio Captioner: Qwen3-Omni-30B-A3B-Captioner, a low-hallucination yet highly detailed universal audio captioning model, fills a gap in the open-source community.
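
The tool-calling interface and system-prompt customization can be exercised through a standard chat-completions API. Below is a minimal sketch that assumes Qwen3-Omni is served behind an OpenAI-compatible endpoint (for example via a local inference server); the base URL, API key, and the get_weather tool are illustrative placeholders rather than part of the release.

```python
# Hedged sketch: assumes a Qwen3-Omni deployment behind an OpenAI-compatible
# endpoint. The base_url, api_key, and get_weather tool are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# A hypothetical external tool exposed to the model via function calling.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="Qwen3-Omni-30B-A3B-Instruct",  # placeholder model identifier
    messages=[
        # The system prompt adjusts response style, persona, and behavior.
        {"role": "system", "content": "You are a concise, friendly voice assistant."},
        {"role": "user", "content": "Do I need an umbrella in Hangzhou today?"},
    ],
    tools=tools,
)

# If the model decides a tool is needed, it returns a structured tool call
# instead of free-form text.
print(response.choices[0].message.tool_calls)
```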

Architecture

Qwen3-Omni adopts the Thinker-Talker architecture. Thinker is responsible for text generation, while Talker generates streaming speech tokens by receiving high-level representations directly from Thinker. To achieve ultra-low-latency streaming, Talker autoregressively predicts a multi-codebook sequence: at each decoding step, an MTP module outputs the residual codebooks for the current frame, after which the Code2Wav renderer incrementally synthesizes the corresponding waveform, enabling frame-by-frame streaming generation. An illustrative sketch of this decoding loop follows the architecture notes below.

  • Innovative Architecture Design

AuT: The audio encoder adopts AuT, trained on 20 million hours of audio data, which provides strong general-purpose audio representations.

MoE: Both the Thinker and Talker adopt MoE architectures to support high concurrency and fast inference.

Multi-Codebook: The Talker adopts a multi-codebook autoregressive scheme in which it generates one codec frame per step, while the MTP module produces the remaining residual codebooks.

  • Non-Degradation Across Modalities

Mixing unimodal and cross-modal data during the early stage of text pretraining can achieve parity across all modalities—i.e., no modality-specific performance degradation—while markedly enhancing cross-modal capabilities.

  • Excellent Spoken Dialogue and Instruction Following

Qwen3-Omni achieves Gemini-2.5-Pro-level performance in speech recognition and instruction following tasks.

  • Real-Time Audio and Audiovisual Interaction

Substantial latency reduction across the entire pipeline, from the encoders to Thinker, Talker, and Code2Wav, enables fully streaming generation, with output starting from the first codec frame.
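
As a concrete illustration of the streaming loop described above, the following self-contained sketch mirrors the data flow from Thinker hidden states through the Talker, the MTP module, and the Code2Wav renderer. The module classes, dimensions, and stub renderer are hypothetical stand-ins chosen for readability, not the actual Qwen3-Omni implementation.

```python
# Illustrative sketch of the Thinker-Talker streaming loop described above.
# All module names and dimensions are hypothetical stand-ins; they only mirror
# the data flow: Thinker hidden states -> Talker first-codebook token ->
# MTP residual codebooks -> Code2Wav waveform chunk.
import torch
import torch.nn as nn

HIDDEN, CODEBOOKS, CODEBOOK_SIZE, SAMPLES_PER_FRAME = 256, 4, 1024, 1920

class TalkerStub(nn.Module):
    """Predicts the first codebook token from a Thinker hidden state."""
    def __init__(self):
        super().__init__()
        self.head = nn.Linear(HIDDEN, CODEBOOK_SIZE)

    def forward(self, h):
        return self.head(h).argmax(-1)          # token of codebook 0

class MTPStub(nn.Module):
    """Predicts the residual codebooks for the current frame in one shot."""
    def __init__(self):
        super().__init__()
        self.head = nn.Linear(HIDDEN, (CODEBOOKS - 1) * CODEBOOK_SIZE)

    def forward(self, h):
        logits = self.head(h).view(-1, CODEBOOKS - 1, CODEBOOK_SIZE)
        return logits.argmax(-1)                # tokens of codebooks 1..N-1

def code2wav_stub(codes):
    """Stand-in renderer: maps one frame of codec tokens to a waveform chunk."""
    return torch.randn(SAMPLES_PER_FRAME)

def stream_speech(thinker_hidden_states, talker, mtp):
    """Yield audio chunk-by-chunk as soon as each codec frame is complete."""
    for h in thinker_hidden_states:             # one high-level state per frame
        first = talker(h.unsqueeze(0))          # autoregressive step: codebook 0
        residual = mtp(h.unsqueeze(0))          # MTP fills codebooks 1..N-1
        frame = torch.cat([first.unsqueeze(-1), residual], dim=-1)
        yield code2wav_stub(frame)              # incremental waveform synthesis

if __name__ == "__main__":
    talker, mtp = TalkerStub(), MTPStub()
    hidden = torch.randn(8, HIDDEN)             # pretend Thinker output (8 frames)
    for i, chunk in enumerate(stream_speech(hidden, talker, mtp)):
        print(f"frame {i}: streamed {chunk.numel()} audio samples")
```

The point of this structure is that each frame's waveform chunk can be emitted as soon as its residual codebooks are filled in, which is what allows streaming output to begin from the first codec frame.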

Performance

We conducted a comprehensive evaluation of Qwen3-Omni, which matches the performance of same-sized single-modal models within the Qwen series and excels particularly on audio tasks. Across 36 audio and audio-visual benchmarks, Qwen3-Omni achieves open-source SOTA on 32 benchmarks and overall SOTA on 22, outperforming strong closed-source models such as Gemini-2.5-Pro, Seed-ASR, and GPT-4o-Transcribe.

What's Next

We are eager to hear your feedback and see the innovative applications you create with Qwen3-Omni. In the near future, we will further advance the model along multiple axes, including multi-speaker ASR, video OCR, audio-video proactive learning, and enhanced support for agent-based workflows and function calling.

Original source: https://qwen.ai/blog?id=fdfbaf2907a36b7659a470c77fb135e381302028&from=research.research-list
