Choose the right model for your use case, such as image analysis, video understanding, or OCR.
Image and video understanding
Start with qwen3.6-plus, the flagship Qwen model. It supports 1M context window, up to 2-hour videos, function calling, and built-in tools. Once your application is stable, you can switch to qwen3.6-flash to reduce costs. It offers near-flagship performance with the same context length and feature set.
Image resolution
Most models support up to 16 million pixels per image. Higher resolutions use more tokens. Token count per image: h x w / (32 x 32) + 2.
Video support
Up to 2 hours / 2 GB:
qwen3.6-plus,qwen3.6-flash,qwen3.5-plus,qwen3.5-flashUp to 1 hour / 2 GB:
qwen3-vl-plus,qwen3-vl-flashUp to 1 hour / 2 GB:
qwen3.5-omni-plus,qwen3.5-omni-flash(also supports audio input)
Function calling and built-in tools
Allows the model to perform actions based on image or video content.
Function calling: Supported by the Qwen3.6, Qwen3.5, and Qwen3-VL series.
Built-in tools (web search, code execution, no setup required): Available for
qwen3.6-plus,qwen3.6-flash,qwen3.5-plus, andqwen3.5-flash.
Structured output
Get valid JSON output from visual inputs, such as extracting product details from a photo.
Supported by the Qwen3.6, Qwen3.5, and Qwen3-VL series in non-thinking mode.
OCR and document extraction
qwen-vl-ocr is optimized for text extraction from documents, tables, exam papers, and handwritten content. For general text extraction from images, use qwen3.6-plus or qwen3.6-flash.
Recommended models
Model | Context | Max pixels/image | Max video duration | Max video size | Max images | Max videos | Function calling | Built-in tools | Structured output |
| 1M | 16M | 2 hours | 2 GB | 256 | 64 | |||
| 1M | 16M | 2 hours | 2 GB | 256 | 64 | |||
| 64k | -- | 1 hour | 2 GB | 2,048 | 512 | -- |
All models
Qwen3.6
Model ID | Input | Output | Context | Max output | Max images | Max videos | Function calling | Built-in tools | Structured output |
| Text, images, video | Text | 1M | 64k | 256 | 64 | |||
| Text, images, video | Text | 1M | 64k | 256 | 64 | |||
| Text, images, video | Text | 1M | 64k | 256 | 64 | |||
| Text, images, video | Text | 1M | 64k | 256 | 64 | |||
| Text, images, video | Text | 256k | 64k | 256 | 64 |
Qwen3.5
Model ID | Input | Output | Context | Max output | Max images | Max videos | Function calling | Built-in tools | Structured output |
| Text, images, video | Text | 1M | 64k | 256 | 64 | |||
| Text, images, video | Text | 1M | 64k | 256 | 64 | |||
| Text, images, video | Text | 1M | 64k | 256 | 64 | |||
| Text, images, video | Text | 1M | 64k | 256 | 64 | |||
| Text, images, video | Text | 32k | 8k | 256 | 64 | |||
| Text, images, video | Text | 32k | 8k | 256 | 64 | |||
| Text, images, video | Text | 32k | 8k | 256 | 64 | |||
| Text, images, video | Text | 32k | 8k | 256 | 64 |
Legacy and other models
These models are no longer recommended. For new projects, use the Qwen3.6 or Qwen3.5 series. For full model specifications, visit the Models page.
China (Beijing) | Singapore | U.S. | China (Hong Kong) | Germany (Frankfurt)