
Alibaba Cloud Model Studio: Model releases and updates

Last Updated: Dec 19, 2025

International (Singapore)

Type

Listing date

Model

Feature description

Speech recognition

2025-12-17

qwen3-asr-flash-realtime, qwen3-asr-flash-realtime-2025-10-27

Speech recognition is now supported in nine additional languages, including Czech and Danish. Real-time speech recognition - Qwen

Speech recognition

2025-12-17

qwen3-asr-flash, qwen3-asr-flash-2025-09-08

Audio with any sample rate and any number of channels is now supported. Audio file recognition - Qwen

Speech recognition

2025-12-17

fun-asr-mtl, fun-asr-mtl-2025-08-25

Supports speech recognition in 31 languages, including Chinese, English, Japanese, and Korean, making it ideal for Southeast Asian scenarios. Audio file recognition - Fun-ASR/Paraformer

Voice design

2025-12-16

qwen-voice-design

Qwen's voice design model for generating customized voices from text descriptions. Use this model with qwen3-tts-vd-realtime-2025-12-16 to generate speech in 10 languages. Voice design

Speech synthesis

2025-12-16

qwen3-tts-vd-realtime-2025-12-16 (snapshot)

Qwen's real-time speech synthesis snapshot model that enables low-latency, high-stability real-time synthesis using designed voices, supports multi-language output, automatically adjusts the tone based on the text, and optimizes synthesis performance for complex text. Real-time speech synthesis - Qwen

Speech recognition

2025-12-12

fun-asr, fun-asr-2025-11-07

Feature updates for Fun-ASR audio file recognition.

Multimodal

2025-12-04

qwen3-omni-flash-2025-12-01

The latest Qwen Omni snapshot model increases the number of supported timbres to 49 and features a significant upgrade to its instruction-following capabilities, enabling it to efficiently understand text, images, audio, and video. Omni-modal

Real-time multimodal

2025-12-04

qwen3-omni-flash-realtime-2025-12-01

The latest snapshot model for the Qwen Omni real-time version offers low-latency multimodal interaction. The number of supported timbres is increased to 49, and the model's instruction-following ability and interactive experience are significantly upgraded. Real-time multimodal

Speech translation

2025-12-04

qwen3-livetranslate-flash, qwen3-livetranslate-flash-2025-12-01

Qwen3-LiveTranslate-Flash is an audio and video translation model that translates between 18 languages, such as Chinese, English, Russian, and French. It leverages visual context to improve translation accuracy and provides both text and speech output. Audio and video translation - Qwen

Multilingual translation

2025-12-02

qwen-mt-lite

A basic text translation model from Qwen that supports translation between 31 languages. It offers faster response times and lower costs than qwen-mt-flash, making it suitable for latency-sensitive scenarios. Translation capabilities (Qwen-MT).
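For orientation, a minimal sketch of calling qwen-mt-lite through the OpenAI-compatible endpoint. The translation_options field and the Singapore base_url follow the Qwen-MT documentation pattern, but treat both as assumptions and verify them against the linked page.

```python
# Sketch: basic translation with qwen-mt-lite via the
# OpenAI-compatible chat completions API.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DASHSCOPE_API_KEY",  # placeholder
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
    model="qwen-mt-lite",
    # For Qwen-MT, the user message carries the source text only.
    messages=[{"role": "user", "content": "Good morning. The meeting starts at nine."}],
    extra_body={
        "translation_options": {
            "source_lang": "auto",     # auto-detect the source language
            "target_lang": "Chinese",  # assumed language-name format
        }
    },
)
print(completion.choices[0].message.content)
```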

Voice cloning

2025-11-27

qwen-voice-enrollment

Qwen released a voice cloning model. It generates a highly similar voice from an audio clip of just 5 seconds or more. When used with the qwen3-tts-vc-realtime-2025-11-27 model, it can create a high-fidelity clone of a person's voice and output it in real time across 11 languages. Voice cloning.

Speech synthesis

2025-11-27

qwen3-tts-vc-realtime-2025-11-27 (snapshot)

Qwen real-time speech synthesis released a new snapshot model. It uses voices generated by Voice cloning for low-latency, high-stability, real-time synthesis. The model supports multilingual output. It automatically adjusts the tone based on the text and improves synthesis performance for complex text. Real-time speech synthesis - Qwen.

Speech synthesis

2025-11-27

qwen3-tts-flash-realtime-2025-11-27 (snapshot)

Qwen real-time speech synthesis released a new snapshot model. It features low latency and high stability. The model offers a richer selection of voices, and each voice supports multilingual output. It automatically adjusts the tone based on the text and enhances synthesis performance for complex text. Real-time speech synthesis - Qwen.

Speech synthesis

2025-11-27

qwen3-tts-flash-2025-11-27 (snapshot)

Qwen speech synthesis released a new snapshot model. It offers a richer selection of voices. Each voice supports multilingual output. The model adaptively adjusts the tone based on the text and has optimized synthesis capabilities for complex text. Speech synthesis - Qwen.

Text extraction

2025-11-21

qwen-vl-ocr-2025-11-20 (snapshot)

This snapshot of the Qwen text extraction model is based on the Qwen3-VL architecture and significantly improves document parsing and text localization. Text extraction

Speech recognition

2025-11-20

qwen3-asr-flash-filetrans, qwen3-asr-flash-filetrans-2025-11-17 (snapshot)

Qwen audio file recognition released a new model. It is designed for asynchronous transcription of audio files and supports recordings up to 12 hours long. Audio file recognition - Qwen.

Speech recognition

2025-11-19

fun-asr-2025-11-07 (snapshot)

Fun-ASR audio file recognition released a new snapshot model. It optimizes far-field voice activity detection (VAD) to improve recognition accuracy and stability. In addition to Chinese and English, the model now supports multiple Chinese dialects and Japanese. Audio file recognition - Fun-ASR/Paraformer.

Multilingual translation

2025-11-11

qwen-mt-flash

Compared to qwen-mt-turbo, this model supports streaming incremental output and offers improved overall performance. Translation capabilities (Qwen-MT).

Image-to-video

2025-11-10

wan2.2-animate-move

This model transfers the actions and expressions of a character from a template video to a single static image to generate a video of the character in motion. Wan - Image-to-motion.

Image-to-video

2025-11-10

wan2.2-animate-mix

This model replaces the main character in a reference video with a character from an image. It preserves the original video's scene, lighting, and tone for seamless character replacement. Wan - Video character replacement.

Reasoning model

2025-11-03

qwen3-max-preview

The thinking mode of the qwen3-max-preview model features significantly improved overall reasoning capabilities. It performs better in agent programming, common-sense reasoning, and math, science, and general-purpose tasks. Deep thinking.

Image editing

2025-10-31

qwen-image-edit-plus, qwen-image-edit-plus-2025-10-30

Built on qwen-image-edit, this model has optimized inference performance and system stability. It significantly reduces the response time for image generation and editing and supports returning multiple images in a single request. Image editing - Qwen.

Real-time speech recognition

2025-10-27

qwen3-asr-flash-realtime, qwen3-asr-flash-realtime-2025-10-27

The Qwen real-time speech recognition model features automatic language detection. It can identify 11 languages and provides accurate transcription in complex audio environments. Real-time speech recognition - Qwen.

Visual understanding

2025-10-21

qwen3-vl-32b-thinking, qwen3-vl-32b-instruct

A 32B dense model from the Qwen3-VL series. Its overall performance is second only to the Qwen3-VL-235B model. It excels in document recognition and understanding, spatial intelligence and object recognition, and 2D visual detection and spatial reasoning. This makes it suitable for complex perception tasks in general scenarios. Visual understanding.

Visual understanding

2025-10-16

qwen3-vl-flash, qwen3-vl-flash-2025-10-15

A small-scale visual understanding model from the Qwen3 series. It effectively combines thinking and non-thinking modes. Compared to the open-source Qwen3-VL-30B-A3B, it delivers better performance and a faster response time. Visual understanding.

Visual understanding

2025-10-14

qwen3-vl-8b-thinking, qwen3-vl-8b-instruct

An 8B dense model from the Qwen3-VL series. It uses less GPU memory and can perform multimodal understanding and inference. It supports ultra-long contexts such as long videos and documents, and visual 2D/3D positioning. It also has comprehensive spatial intelligence and object recognition capabilities. Visual understanding.

Visual understanding

2025-10-03

qwen3-vl-30b-a3b-thinking, qwen3-vl-30b-a3b-instruct

Based on the new generation of open-source Qwen3-VL models, this model has a fast response time. It features stronger multimodal understanding and inference, visual agent capabilities, and support for ultra-long contexts such as long videos and documents. It also has comprehensively upgraded spatial intelligence and object recognition capabilities, making it suitable for complex real-world tasks. Visual understanding.

Visual understanding

2025-09-23

qwen3-vl-plus, qwen3-vl-plus-2025-09-23, qwen3-vl-235b-a22b-thinking, qwen3-vl-235b-a22b-instruct

A visual understanding model from the Qwen3 series. It effectively combines thinking and non-thinking modes, and its visual agent capabilities are among the best in the world. This version features comprehensive upgrades in visual encoding, spatial intelligence, and multimodal thinking. Its visual perception and recognition capabilities are also significantly improved. Visual understanding.
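As a reference for the visual understanding entries in this list, a minimal sketch of sending an image to a Qwen3-VL model through the OpenAI-compatible endpoint. The image URL is a placeholder, and the base_url is an assumption to check against the Visual understanding docs.

```python
# Sketch: image + text input to a Qwen3-VL model via the
# OpenAI-compatible chat completions API.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DASHSCOPE_API_KEY",  # placeholder
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
    model="qwen3-vl-plus",
    messages=[{
        "role": "user",
        "content": [
            # Hypothetical image URL; replace with a reachable file.
            {"type": "image_url", "image_url": {"url": "https://example.com/invoice.png"}},
            {"type": "text", "text": "Describe the layout and extract the key fields."},
        ],
    }],
)
print(completion.choices[0].message.content)
```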

Text-to-image

2025-09-23

qwen-image-plus

This model excels at complex text rendering, especially for Chinese and English text. It can create complex mixed layouts of images and text and is more cost-effective than qwen-image. Text-to-image (Qwen-Image).

Code model

2025-09-23

qwen3-coder-plus-2025-09-23

Compared to the previous version (snapshot from July 22), this model has improved performance on downstream tasks and greater robustness in tool calling. It also features enhanced code security. Code capabilities (Qwen-Coder).

Reasoning model

2025-09-11

qwen-plus-2025-09-11

A model from the Qwen3 series. Compared to qwen-plus-2025-07-28, it has improved instruction following capabilities and provides more concise summary responses in thinking mode. Deep thinking. In non-thinking mode, its Chinese understanding and logical reasoning abilities are enhanced. Overview of text generation models.

Reasoning model

2025-09-11

qwen3-next-80b-a3b-thinking, qwen3-next-80b-a3b-instruct

A new generation of open-source models based on Qwen3. Compared to qwen3-235b-a22b-thinking-2507, the thinking model has improved instruction following capabilities and provides more concise summary responses. Deep thinking. Compared to qwen3-235b-a22b-instruct-2507, the instruct model has enhanced Chinese understanding, logical reasoning, and text generation capabilities. Overview of text generation models.

Text-to-text

2025-09-05

qwen3-max-preview

A preview version of the Qwen-Max model based on Qwen3. Compared to the Qwen 2.5 series, its overall general capabilities are greatly improved. It has significantly enhanced abilities in general text understanding in both Chinese and English, complex instruction following, subjective open-ended tasks, multilingual tasks, and tool calling. The model also has fewer knowledge hallucinations. Qwen-Max.

Image editing

2025-08-19

qwen-image-edit

Qwen image editing supports precise text editing in both Chinese and English, rendering intent, detail enhancement, style transfer, adding or removing objects, and changing positions and actions. It can perform complex image and text editing. Image editing - Qwen.

Visual understanding

2025-08-18

qwen-vl-plus-2025-08-15

A visual understanding model. It has significantly improved capabilities in object recognition and localization, and multilingual processing. Visual understanding.

Text-to-image

2025-08-14

qwen-image

The Qwen-Image model excels at complex text rendering, especially for Chinese and English text. It can create complex mixed layouts of images and text. Text-to-image (Qwen-Image).

Visual understanding

2025-08-13

qwen-vl-max-2025-08-13

A visual understanding model. It has comprehensively improved visual understanding metrics and significantly enhanced capabilities in math, reasoning, object recognition, and multilingual processing. Visual understanding.

Code model

2025-08-05

qwen3-coder-flash, qwen3-coder-flash-2025-07-28

The fastest and most cost-effective model in the Qwen-Coder series. Code capabilities (Qwen-Coder).

Reasoning model

2025-08-05

qwen-flash, qwen-flash-2025-07-28

The fastest and most cost-effective model in the Qwen series, suitable for simple tasks. Model List.

Reasoning model

2025-07-30

qwen-plus-2025-07-28

A model from the Qwen3 series. Compared to the previous version, it increases the context length to 1,000,000 tokens. For more information about thinking mode, see Deep thinking. For more information about non-thinking mode, see Overview of text generation models.

Reasoning model

2025-07-30

qwen3-30b-a3b-thinking-2507, qwen3-30b-a3b-instruct-2507

An upgraded version of qwen3-30b-a3b. The thinking model has improved logical, general, knowledge-enhanced, and creative capabilities. Deep thinking. The instruct model has improved creative capabilities and model security. Overview of text generation models.

Image-to-video

2025-07-28

wan2.2-i2v-plus

Compared to the 2.1 model, this new version has significantly improved image detail and motion stability. The generation speed is increased by 50%. First-frame-to-video.

Text-to-video

2025-07-28

wan2.2-t2v-plus

Compared to the 2.1 model, this new version has significantly improved image detail and motion stability. The generation speed is increased by 50%. Text-to-video.
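Video generation on Model Studio runs asynchronously: a request submits a task, and the result is fetched later by polling a task endpoint. The sketch below shows that submit-then-poll pattern; the endpoint path, async header, and payload fields mirror the common DashScope convention but are assumptions to verify against the Text-to-video documentation.

```python
# Sketch: asynchronous text-to-video generation (submit, then poll).
import time
import requests

API_KEY = "YOUR_DASHSCOPE_API_KEY"  # placeholder
BASE = "https://dashscope-intl.aliyuncs.com/api/v1"  # assumed Singapore endpoint

# 1. Submit the generation task; the async header returns a task ID
#    instead of blocking until the video is rendered.
resp = requests.post(
    f"{BASE}/services/aigc/video-generation/video-synthesis",  # assumed path
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "X-DashScope-Async": "enable",
        "Content-Type": "application/json",
    },
    json={
        "model": "wan2.2-t2v-plus",
        "input": {"prompt": "A sailboat crossing a calm bay at sunset"},
        "parameters": {"size": "1920*1080"},
    },
    timeout=30,
)
task_id = resp.json()["output"]["task_id"]

# 2. Poll the task endpoint until the job finishes.
while True:
    task = requests.get(
        f"{BASE}/tasks/{task_id}",
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=30,
    ).json()
    status = task["output"]["task_status"]
    if status in ("SUCCEEDED", "FAILED"):
        break
    time.sleep(10)

# On success, the output is expected to include a downloadable video URL.
print(task["output"].get("video_url", status))
```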

Text-to-image

2025-07-28

wan2.2-t2i-flash, wan2.2-t2i-plus

Compared to the 2.1 model, this new version is comprehensively upgraded in creativity, stability, and photorealism. The generation speed is increased by 50%. Text-to-image.

Reasoning model

2025-07-24

qwen3-235b-a22b-thinking-2507, qwen3-235b-a22b-instruct-2507

An upgraded version of qwen3-235b-a22b. The thinking model has greatly improved logical, general, knowledge-enhanced, and creative capabilities, making it suitable for high-difficulty scenarios that demand strong reasoning. Deep thinking. The instruct model has improved creative capabilities and model security. Overview of text generation models.

Code model

2025-07-23

qwen3-coder, qwen3-coder-plus-2025-07-22

A code generation model based on Qwen3. It has powerful coding agent capabilities and excels at tool calling and environment interaction. It combines excellent code capabilities with general-purpose abilities. Code capabilities (Qwen-Coder).

Visual understanding

2025-06-04

qwen-vl-plus-2025-05-07

A visual understanding model. The model has significantly improved capabilities in math, reasoning, and understanding content from surveillance videos. Visual understanding.

Text-to-image

2025-05-22

wan2.1-t2i-turbo, wan2.1-t2i-plus

Generates an image from a single sentence. The model supports generating images of any resolution and aspect ratio, up to 2 million pixels. It is available in turbo and professional (plus) versions. Text-to-image.

Visual understanding

2025-05-16

qwen-vl-max-2025-04-08

A visual understanding model. It has improved math and reasoning abilities. The response style is adjusted to align with human preferences, and the detail and format clarity of model responses are significantly improved. Visual understanding.

Visual understanding

2025-05-16

qwen-vl-plus-2025-01-25

A visual understanding model from the Qwen2.5-VL series. Compared to the previous version, it expands the context to 128K tokens, significantly enhancing image and video understanding capabilities.

Video editing

2025-05-19

wan2.1-vace-plus

A general-purpose video editing model. The model has multimodal input capabilities, combining images, videos, and text prompts. It can perform various tasks such as image-to-video (generating a video based on the subject or background of a reference image) and video repainting (extracting motion features from an input video to generate a new video). General-purpose video editing.

Reasoning model

2025-04-28

Qwen3 commercial models

qwen-plus-2025-04-28, qwen-turbo-2025-04-28

Qwen3 open-source models

qwen3-235b-a22b, qwen3-30b-a3b, qwen3-32b, qwen3-14b, qwen3-8b, qwen3-4b, qwen3-1.7b, qwen3-0.6b

Qwen3 models support both thinking and non-thinking modes. Switch between the two modes using the enable_thinking parameter (see the sketch after this list). In addition, the capabilities of Qwen3 models have been greatly enhanced:

  1. Reasoning capability: In evaluations for math, code, and logical reasoning, the models significantly outperform QwQ and other non-reasoning models of similar size, reaching top-tier industry levels.

  2. Human preference alignment: Capabilities for creative writing, role-playing, multi-turn conversation, and instruction following are greatly improved. The general capabilities significantly exceed those of models of a similar size.

  3. Agent capability: The models achieve industry-leading performance in both reasoning and non-reasoning modes. They can accurately call external tools.

  4. Multilingual capability: The models support over 100 languages and dialects. Capabilities for multilingual translation, instruction understanding, and common-sense reasoning are significantly improved.

  5. Response format fixes: This version fixes response format issues from previous versions, such as abnormal Markdown, mid-sentence truncation, and incorrect `boxed` output.

For more information about thinking mode, see Deep thinking. For more information about non-thinking mode, see Overview of text generation models.
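As a reference, a minimal sketch of the mode switch through the OpenAI-compatible endpoint. Because enable_thinking is a Model Studio-specific parameter, the OpenAI SDK passes it via extra_body; the base_url shown is the assumed Singapore compatible-mode endpoint and should be verified against the docs.

```python
# Sketch: toggling Qwen3 thinking mode with enable_thinking.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DASHSCOPE_API_KEY",  # placeholder
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

stream = client.chat.completions.create(
    model="qwen-plus-2025-04-28",
    messages=[{"role": "user", "content": "Which is larger, 9.11 or 9.9?"}],
    stream=True,  # thinking mode emits the reasoning trace incrementally
    extra_body={"enable_thinking": True},  # set False for non-thinking mode
)

for chunk in stream:
    delta = chunk.choices[0].delta
    # The reasoning trace arrives in reasoning_content and the final
    # answer in content; either field may be absent on a given chunk.
    reasoning = getattr(delta, "reasoning_content", None)
    if reasoning:
        print(reasoning, end="", flush=True)
    elif delta.content:
        print(delta.content, end="", flush=True)
```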

Text-to-video

2025-04-21

wan2.1-t2v-turbo, wan2.1-t2v-plus

  • Generates a video from a single sentence.

  • It has powerful instruction following capabilities, supports large and complex movements, and can replicate real-world physics. The generated videos feature rich artistic styles and cinematic-quality visuals. Wan - Text-to-video.

Image-to-video

2025-04-21

wan2.1-kf2v-plus, wan2.1-i2v-turbo, wan2.1-i2v-plus

  • Based on the input first and last frame images, the model can generate a smooth and fluid dynamic video according to the prompt. First-and-last-frame-to-video.

  • Uses an input image as the first frame of the video and then generates the video based on a prompt. First-frame-to-video.

Visual reasoning

2025-03-28

qvq-max, qvq-max-latest, qvq-max-2025-03-25

A visual reasoning model. It supports visual input and chain-of-thought output, demonstrating stronger capabilities in math, programming, visual analysis, creation, and general tasks. Visual reasoning.

Omni-modal

2025-03-26

qwen2.5-omni-7b

A new omni-modal understanding and generation model from Qwen. It supports text, image, speech, and video input, and outputs text and audio. It provides two natural conversational voices. Omni-modal.

Visual understanding

2025-03-24

qwen2.5-vl-32b-instruct 

A visual understanding model. Its ability to solve math problems is close to the level of Qwen2.5-VL-72B. The response style is greatly adjusted to align with human preferences. For objective questions such as math, logical reasoning, and knowledge Q&A, the detail and format clarity of model responses are significantly improved. Visual understanding.

Reasoning model

2025-03-06

qwq-plus

A QwQ reasoning model trained on Qwen2.5. Reinforcement learning greatly improves its reasoning ability. The model's core metrics for math and code (AIME 24/25, LiveCodeBench) and some general metrics (IFEval, LiveBench, and others) reach the level of the full DeepSeek-R1. Deep thinking.

Visual understanding

2025-01-27

qwen2.5-vl-3b-instruct

qwen2.5-vl-7b-instruct

qwen2.5-vl-72b-instruct

  • Compared to the Qwen2-VL model, this model has the following improvements:

    • Significantly improved capabilities in instruction following, mathematical calculations, code generation, and structured output (JSON output).

    • Supports unified parsing of visual content such as text, charts, and layouts in images. It also adds the ability to accurately locate visual elements, supporting both bounding boxes and coordinate points.

    • Supports understanding of long video files (up to 10 minutes) with second-level event localization, and can understand temporal order and speed.

  • For more information, see Visual understanding.

Text-to-text

2025-01-27

qwen-max-2025-01-25

qwen2.5-14b-instruct-1m

qwen2.5-7b-instruct-1m

  • The qwen-max-2025-01-25 model (also known as Qwen2.5-Max): The best-performing model in the Qwen series. It has significantly improved code writing and understanding, logical reasoning, and multilingual capabilities. The response style is greatly adjusted to align with human preferences. The detail and format clarity of model responses are significantly improved, with targeted enhancements in content creation, JSON format adherence, and role-playing. Overview of text generation models.

  • The qwen2.5-14b-instruct-1m and qwen2.5-7b-instruct-1m models: Compared to the qwen2.5-14b-instruct and qwen2.5-7b-instruct models, the context length is increased to 1,000,000 tokens. Overview of text generation models.

Text-to-text

2025-01-17

qwen-plus-2025-01-12

  • Compared to the qwen-plus-2024-12-20 model, this version has improved overall capabilities in both Chinese and English. It shows significant improvements in Chinese and English common sense and reading comprehension. The ability to switch naturally between different languages, dialects, and styles is also significantly improved, as is its Chinese instruction following capability. qwen-plus-2025-01-12.

Multilingual translation

2024-12-25

qwen-mt-plus

qwen-mt-turbo

  • The Qwen-MT model is a machine translation model optimized from the Qwen model. It excels at translation between Chinese and English, and between Chinese or English and 26 other languages, including Japanese, Korean, French, Spanish, German, Portuguese (Brazil), Thai, Indonesian, Vietnamese, and Arabic. In addition to multilingual translation, it provides capabilities such as terminology intervention, domain prompting, and translation memory to improve translation performance in complex application scenarios, as shown in the sketch below. Translation capabilities (Qwen-MT).
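A hedged sketch of terminology intervention: a glossary of term pairs is passed alongside the language options so the model keeps the specified renderings. The terms field follows the Qwen-MT documentation pattern but is an assumption to verify, and the glossary entry itself is a made-up example.

```python
# Sketch: Qwen-MT terminology intervention via translation_options.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DASHSCOPE_API_KEY",  # placeholder
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
    model="qwen-mt-plus",
    messages=[{"role": "user", "content": "The latency of the flash model is low."}],
    extra_body={
        "translation_options": {
            "source_lang": "English",
            "target_lang": "Chinese",
            # Hypothetical glossary entry: pin a fixed rendering of a term.
            "terms": [{"source": "latency", "target": "时延"}],
        }
    },
)
print(completion.choices[0].message.content)
```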

Visual understanding

2024-12-18

qwen2-vl-72b-instruct

  • Achieved state-of-the-art results in multiple visual understanding benchmarks, significantly enhancing the processing capabilities for multimodal tasks. Visual understanding.

China (Beijing)

Type

Listing date

Model

Feature description

Speech recognition

2025-12-17

qwen3-asr-flash-realtime, qwen3-asr-flash-realtime-2025-10-27

Speech recognition is now supported in nine additional languages, including Czech and Danish. Real-time speech recognition - Qwen

Speech recognition

2025-12-17

qwen3-asr-flash, qwen3-asr-flash-2025-09-08

Audio with any sample rate and any number of channels is now supported. Audio file recognition - Qwen

Speech recognition

2025-12-17

fun-asr-mtl, fun-asr-mtl-2025-08-25

Supports speech recognition in 31 languages, including Chinese, English, Japanese, and Korean, making it ideal for Southeast Asian scenarios. Audio file recognition - Fun-ASR/Paraformer

Voice design

2025-12-16

qwen-voice-design

Qwen's voice design model for generating customized voices from text descriptions. Use this model with qwen3-tts-vd-realtime-2025-12-16 to generate speech in 10 languages. Voice design

Speech synthesis

2025-12-16

qwen3-tts-vd-realtime-2025-12-16 (snapshot)

Qwen's real-time speech synthesis snapshot model that enables low-latency, high-stability real-time synthesis using designed voices, supports multi-language output, automatically adjusts the tone based on the text, and optimizes synthesis performance for complex text. Real-time speech synthesis - Qwen

Image editing

2025-12-15

qwen-image-edit-plus-2025-12-15

The latest snapshot of qwen-image-edit-plus improves on the previous version, offering enhanced character consistency, industrial design capabilities, and geometric reasoning. It also optimizes the alignment between the edited and original images in spatial layout, texture, and style, producing more precise edits. Image editing - Qwen

Speech recognition

2025-12-12

fun-asr, fun-asr-2025-11-07

Feature updates for Fun-ASR audio file recognition.

Speech synthesis

2025-12-11

cosyvoice-v3-flash, cosyvoice-v3-plus

  • The cosyvoice-v3-flash model now includes five new system voices: longanrou_v3, longyingjing_v3, longyingling_v3, longanling_v3, and longhan_v3. All these voices support the timestamp and SSML features. Voice list.

  • The voice cloning feature for the cosyvoice-v3-flash and cosyvoice-v3-plus models has been enhanced. The feature now supports timestamps and SSML, and provides improved prosody. To try the enhancement, create a new voice. CosyVoice voice cloning API.

Multimodal

2025-12-04

qwen3-omni-flash-2025-12-01

The latest Qwen Omni snapshot model increases the number of supported timbres to 49 and features a significant upgrade to its instruction-following capabilities, enabling it to efficiently understand text, images, audio, and video. Omni-modal

Real-time multimodal

2025-12-04

qwen3-omni-flash-realtime-2025-12-01

The latest snapshot model for the Qwen Omni real-time version offers low-latency multimodal interaction. The number of supported timbres is increased to 49, and the model's instruction-following ability and interactive experience are significantly upgraded. Real-time multimodal

Speech translation

2025-12-04

qwen3-livetranslate-flash, qwen3-livetranslate-flash-2025-12-01

Qwen3-LiveTranslate-Flash is an audio and video translation model that translates between 18 languages, such as Chinese, English, Russian, and French. It leverages visual context to improve translation accuracy and provides both text and speech output. Audio and video translation - Qwen

Reasoning model

2025-12-04

deepseek-v3.2

DeepSeek-V3.2 is the official DeepSeek model that introduces DeepSeek Sparse Attention, a sparse attention mechanism. It is also the first DeepSeek model to integrate thinking with tool use, and it supports tool calling in both thinking and non-thinking modes. DeepSeek.

Multilingual translation

2025-12-02

qwen-mt-lite

The Qwen basic text translation model can translate between 31 languages. This model delivers faster responses at a lower cost than qwen-mt-flash, making it a suitable choice for latency-sensitive scenarios. Machine translation (Qwen-MT)

Voice cloning

2025-11-27

qwen-voice-enrollment

Qwen released a voice cloning model. It can quickly generate a voice with high similarity from an audio clip of 5 seconds or more. Use this model with qwen3-tts-vc-realtime-2025-11-27 to clone a person's voice with high fidelity and output it in real time across 11 languages. Voice cloning.

Speech synthesis

2025-11-27

qwen3-tts-vc-realtime-2025-11-27 (snapshot)

Qwen real-time speech synthesis released a new snapshot model. It can use voices generated by Voice cloning for low-latency, high-stability, real-time synthesis. It supports multilingual output, automatically adjusts tone based on the text, and improves synthesis for complex text. Real-time speech synthesis - Qwen.

Speech synthesis

2025-11-27

qwen3-tts-flash-realtime-2025-11-27 (snapshot)

Qwen real-time speech synthesis released a new snapshot model. It offers low latency and high stability. It provides more voice options, and a single voice can support multilingual output. The model automatically adjusts tone based on the text and improves synthesis for complex text. Real-time speech synthesis - Qwen.

Speech synthesis

2025-11-27

qwen3-tts-flash-2025-11-27 (snapshot)

Qwen speech synthesis released a new snapshot model. It offers more voice options, and a single voice can support multilingual output. The model can adapt its tone to the text and improves synthesis for complex text. Speech synthesis - Qwen.

Text extraction

2025-11-21

qwen-vl-ocr-2025-11-20 (snapshot)

This snapshot of the Qwen text extraction model is based on the Qwen3-VL architecture and significantly improves document parsing and text localization. Text extraction

Speech recognition

2025-11-20

qwen3-asr-flash-filetrans, qwen3-asr-flash-filetrans-2025-11-17 (snapshot)

Qwen audio file transcription released a new model. It is designed for asynchronous transcription of audio files and supports recordings up to 12 hours long. Audio file transcription - Qwen.

Speech synthesis

2025-11-19

cosyvoice-v3-flash

This version improves pronunciation accuracy and voice similarity compared to previous versions. It also adds support for more languages, including German, Spanish, French, Italian, and Russian. Real-time speech synthesis - CosyVoice/Sambert.

Reasoning model

2025-11-11

kimi-k2-thinking

A thinking model from Moonshot AI. It has general agent and reasoning capabilities. It excels at deep reasoning and can solve complex problems through multi-step tool calling. Kimi.

Multilingual translation

2025-11-10

qwen-mt-flash

Compared to qwen-mt-turbo, this model supports streaming incremental output and has improved overall performance. Translation capabilities (Qwen-MT).

Image-to-video

2025-11-04

wan2.2-animate-mix

Replaces the main character in a reference video with the character from an input image, while preserving the original video's scene, lighting, and tone for a seamless character swap. Wan - Video character replacement.

Reasoning model

2025-11-03

qwen3-max-preview

The thinking mode of the qwen3-max-preview model shows significant improvements in overall reasoning capabilities, especially in agent programming, common-sense reasoning, and math, science, and general-knowledge tasks. Deep thinking.

Image-to-video

2025-11-03

wan2.2-animate-move

Transfers the actions and expressions of a character from a template video to a single static character image to generate a video of the character in motion. Wan - Image-to-motion.

Image editing

2025-10-31

qwen-image-edit-plus, qwen-image-edit-plus-2025-10-30

This model optimizes the inference performance and system stability of qwen-image-edit. It greatly reduces the response time for image generation and editing and supports returning multiple images in a single request. Image editing - Qwen.

Real-time speech recognition

2025-10-27

qwen3-asr-flash-realtime, qwen3-asr-flash-realtime-2025-10-27

The Qwen real-time speech recognition model features automatic language detection. It can recognize 11 languages and transcribe audio accurately in complex environments. Real-time speech recognition - Qwen.

Visual understanding

2025-10-21

qwen3-vl-32b-thinking, qwen3-vl-32b-instruct

A 32B dense model from the Qwen3-VL series. Its overall performance is second only to the Qwen3-VL-235B model. It excels in document recognition and understanding, spatial intelligence, object recognition, 2D visual detection, and spatial reasoning. It is suitable for complex perception tasks in general scenarios. Visual understanding.

Visual understanding

2025-10-16

qwen3-vl-flash, qwen3-vl-flash-2025-10-15

A small-sized visual understanding model from the Qwen3 series. It effectively combines thinking and non-thinking modes. Compared to the open-source Qwen3-VL-30B-A3B, it offers better performance and faster response speeds. Visual understanding.

Visual understanding

2025-10-14

qwen3-vl-8b-thinking, qwen3-vl-8b-instruct

An 8B dense open-source model from the Qwen3-VL series, available in both thinking and non-thinking versions. It uses less GPU memory and can perform multimodal understanding and reasoning. It supports ultra-long contexts such as long videos and documents, 2D/3D visual positioning, and comprehensive spatial intelligence and object recognition. Visual understanding.

Visual understanding

2025-10-03

qwen3-vl-30b-a3b-thinking, qwen3-vl-30b-a3b-instruct

Based on the new generation of open-source Qwen3-VL models, available in both thinking and non-thinking versions. It has a fast response speed and stronger capabilities for multimodal understanding, reasoning, and visual agent tasks. It also supports ultra-long contexts such as long videos and documents. Its spatial intelligence and object recognition capabilities are fully upgraded to handle complex real-world tasks. Visual understanding.

Reasoning model

2025-09-30

deepseek-v3.2-exp

A hybrid inference architecture model that supports both thinking and non-thinking modes. It introduces a sparse attention mechanism to improve training and inference efficiency for long texts. It is priced lower than deepseek-v3.1. DeepSeek.

Text-to-image

2025-09-23

qwen-image-plus

This model excels at rendering complex text, especially Chinese and English. It can create complex mixed-media layouts of images and text. It is more cost-effective than qwen-image. Text-to-image (Qwen-Image).

Visual understanding

2025-09-23

qwen3-vl-plus, qwen3-vl-plus-2025-09-23, qwen3-vl-235b-a22b-thinking, qwen3-vl-235b-a22b-instruct

A visual understanding model from the Qwen3 series. It effectively combines thinking and non-thinking modes, and its visual agent capabilities are world-class. This version features comprehensive upgrades in visual encoding, spatial intelligence, and multimodal thinking. Its visual perception and recognition capabilities are significantly improved. Visual understanding.

Code model

2025-09-23

qwen3-coder-plus-2025-09-23

Compared to the previous version (July 22 snapshot), it offers improved robustness in downstream tasks and tool calling, along with enhanced code security. Code capabilities (Qwen-Coder).

Reasoning model

2025-09-11

qwen-plus-2025-09-11

This model is part of the Qwen3 series. Compared to qwen-plus-2025-07-28, it has improved instruction-following capabilities and provides more concise summaries in thinking mode. Deep thinking. In non-thinking mode, its Chinese understanding and logical reasoning abilities are enhanced. Overview of text generation models.

Reasoning model

2025-09-11

qwen3-next-80b-a3b-thinking, qwen3-next-80b-a3b-instruct

A new generation of open-source models based on Qwen3. The thinking model has improved instruction-following capabilities and provides more concise summaries compared to qwen3-235b-a22b-thinking-2507. Deep thinking. The instruct model has enhanced Chinese understanding, logical reasoning, and text generation capabilities compared to qwen3-235b-a22b-instruct-2507. Overview of text generation models.

Text generation

2025-09-05

qwen3-max-preview

The Qwen-Max model (preview version) based on Qwen3. It offers a significant improvement in overall general capabilities compared to the Qwen 2.5 series. It has notably enhanced abilities in Chinese and English text understanding, complex instruction following, subjective open-ended tasks, multilingual tasks, and tool calling. The model also has fewer knowledge hallucinations. Qwen-Max.

Text, image, video, voice, etc.

2025-08-05

Models

This is the first release in the Beijing region.