zen VL! zen VL! zen VL!
We release **zen-VL**, the new flagship vision-language model of Qwen and a significant leap from the previous zen-VL. To try the latest model, visit [Qwen Chat](https://chat.qwenlm.ai) and choose zen-VL-72B-Instruct. We also release both base and instruct models in three sizes (3B, 7B, and 72B) on both Hugging Face and ModelScope.
The key features include:
- Understanding things visually: zen-VL is not only proficient at recognizing common objects such as flowers, birds, fish, and insects, but is also highly capable of analyzing text, charts, icons, graphics, and layouts within images.
- Being agentic: zen-VL acts directly as a visual agent that can reason and dynamically direct tools, which makes it capable of computer use and phone use.
- Understanding long videos and capturing events: zen-VL can comprehend videos longer than one hour, and it can now capture events by pinpointing the relevant video segments.
- Visual localization in different formats: zen-VL can accurately localize objects in an image by generating bounding boxes or points, and it can provide stable JSON outputs for coordinates and attributes.
- Generating structured outputs: for data such as scans of invoices, forms, and tables, zen-VL supports structured output of their contents, which benefits uses in finance, commerce, and similar fields.
Performance
We evaluate our models against SOTA models as well as the best models of similar size. The flagship model, zen-VL-72B-Instruct, achieves competitive performance on a series of benchmarks covering a range of domains and tasks, including college-level problems, math, document understanding, general question answering, video understanding, and visual agent tasks. Notably, zen-VL achieves significant advantages in understanding documents and diagrams, and it can act as a visual agent without task-specific fine-tuning.

Among the smaller models, zen-VL-7B-Instruct outperforms GPT-4o-mini on a number of tasks, and zen-VL-3B, a solution for edge AI, even outperforms the 7B model of the previous zen-VL.


Model Capabilities
1. World-wide Image Recognition
zen-VL has significantly enhanced its general image recognition capabilities, greatly expanding the range of image categories it can recognize. These include not only plants, animals, and landmarks such as famous mountains and rivers, but also IPs from film and TV series and a wide variety of products.
{/* Interactive example: cases/recoAll_attractions.json */}
{/* Interactive example: cases/recoAll_birds.json */}
{/* Interactive example: cases/recoAll_cars.json */}
{/* Interactive example: cases/recoAll_celebrities.json */}
{/* Interactive example: cases/recoAll_foods.json */}
{/* Interactive example: cases/recoAll_products.json */}
2. Precise Object Grounding
zen-VL utilizes bounding boxes and point-based representations for grounding, enabling hierarchical positioning and standardized JSON output. This enhanced localization capability serves as a foundation for visual reasoning.
{/* Interactive example: cases/grounding_box_safety.json */}
{/* Interactive example: cases/grounding_point_athletes.json */}
{/* Interactive example: cases/grounding_counting_birds.json */}
{/* Interactive example: cases/grounding_counting_items.json */}
{/* Interactive example: cases/grounding_cupcakes_descriptions.json */}
{/* Interactive example: cases/grounding_brave.json */}
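As a rough illustration of consuming such JSON grounding output, the sketch below parses a response and keeps only boxes that lie inside the image, using absolute pixel coordinates. The key names `bbox_2d` and `label`, the sample response, and the helper `parse_boxes` are assumptions made for illustration; the text above does not specify the exact schema.

```python
import json
from dataclasses import dataclass


@dataclass
class Box:
    label: str
    x1: float
    y1: float
    x2: float
    y2: float

    @property
    def area(self) -> float:
        return (self.x2 - self.x1) * (self.y2 - self.y1)


def parse_boxes(response: str, width: int, height: int) -> list[Box]:
    """Parse a JSON list of boxes given in absolute pixel coordinates.

    Assumed (hypothetical) schema:
        [{"bbox_2d": [x1, y1, x2, y2], "label": "..."}, ...]
    Boxes that are malformed or fall outside the image are skipped.
    """
    boxes = []
    for item in json.loads(response):
        coords = item.get("bbox_2d")
        if not isinstance(coords, list) or len(coords) != 4:
            continue
        x1, y1, x2, y2 = coords
        inside = 0 <= x1 < x2 <= width and 0 <= y1 < y2 <= height
        if inside:
            boxes.append(Box(item.get("label", ""), x1, y1, x2, y2))
    return boxes


# Made-up response for a 640x480 image; the second box exceeds the image width.
sample = (
    '[{"bbox_2d": [10, 20, 110, 220], "label": "bird"},'
    ' {"bbox_2d": [600, 400, 700, 450], "label": "bird"}]'
)
boxes = parse_boxes(sample, width=640, height=480)
print(len(boxes), boxes[0].label, boxes[0].area)
```

Counting tasks then reduce to `len(boxes)` after validation, which is one reason a stable, machine-readable output format is convenient.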
3. Enhanced Text Recognition and Understanding
zen-VL has upgraded its OCR capabilities, with improved text recognition and text localization across multiple scenarios, languages, and text orientations. Its information extraction has also been significantly enhanced to meet the growing demand for digitalization and intelligent processing in areas such as qualification review and financial services.
{/* Interactive example: cases/ocr_vertical.json */}
{/* Interactive example: cases/ocr_arabic.json */}
{/* Interactive example: cases/ocr_grounding1.json */}
{/* Interactive example: cases/kie_receipt.json */}
{/* Interactive example: cases/kie_express.json */}
{/* Interactive example: cases/kie_table3.json */}
4. Powerful Document Parsing
zen-VL introduces a unique document parsing format called QwenVL HTML, which extracts layout information based on HTML. QwenVL HTML supports document parsing in various scenarios, such as magazines, research papers, web pages, and even mobile screenshots.
{/* Interactive example: cases/docparsing4.json */}
{/* Interactive example: cases/docparsing2.json */}
{/* Interactive example: cases/docparsing6.json */}
{/* Interactive example: cases/docparsing8.json */}
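To give a feel for working with HTML-based layout output, here is a minimal sketch that pulls (tag, box, text) triples out of an HTML string using only the Python standard library. The attribute name `data-bbox`, its space-separated `x1 y1 x2 y2` encoding, the sample markup, and the class `LayoutExtractor` are all assumptions for illustration; the exact QwenVL HTML schema is not described above.

```python
from html.parser import HTMLParser


class LayoutExtractor(HTMLParser):
    """Collect (tag, bbox, text) triples from elements that carry a box attribute.

    Simplification: assumes an element with a box does not contain another
    element of the same tag name.
    """

    def __init__(self, bbox_attr: str = "data-bbox"):
        super().__init__()
        self.bbox_attr = bbox_attr  # assumed attribute name, not confirmed by the text
        self.blocks = []            # finished (tag, bbox, text) triples
        self._stack = []            # open elements that carry a box

    def handle_starttag(self, tag, attrs):
        raw = dict(attrs).get(self.bbox_attr)
        if raw is not None:
            bbox = tuple(int(v) for v in raw.split())
            self._stack.append({"tag": tag, "bbox": bbox, "text": []})

    def handle_data(self, data):
        # Attribute text to the innermost open element that has a box.
        if self._stack and data.strip():
            self._stack[-1]["text"].append(data.strip())

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1]["tag"] == tag:
            block = self._stack.pop()
            self.blocks.append((block["tag"], block["bbox"], " ".join(block["text"])))


# Made-up markup in the assumed style.
sample = (
    '<h1 data-bbox="40 30 560 80">Quarterly Report</h1>'
    '<p data-bbox="40 100 560 300">Revenue grew <b>steadily</b> this quarter.</p>'
)
parser = LayoutExtractor()
parser.feed(sample)
parser.close()
for tag, bbox, text in parser.blocks:
    print(tag, bbox, text)
```

Because the layout is carried by ordinary HTML, a standard parser is enough to recover reading order and positions without a custom grammar.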
5. Enhanced Video Comprehension Ability
zen-VL's video comprehension capabilities have been comprehensively upgraded. For temporal processing, we have introduced dynamic frame rate (FPS) training and absolute time encoding. As a result, the model supports understanding of ultra-long videos on the scale of hours and also achieves second-level event localization. It can accurately comprehend content from videos spanning hours, search for specific events within a video, and summarize key points from different time segments, allowing users to quickly and efficiently extract crucial information from videos.
{/* Interactive example: cases/video_ocr.json */}
{/* Interactive example: cases/video_reasoning_zh.json */}
{/* Interactive example: cases/video_long_caption.json */}
{/* Interactive example: cases/video_livechat.json */}
{/* Interactive example: cases/video_grounding.json */}
{/* Interactive example: cases/video_structured_caption.json */}
6. Superior Computer and Mobile Agent
{/* Interactive example: cases/agent_booking_with_log.json */}
{/* Interactive example: cases/agent_qq_with_log.json */}
{/* Interactive example: cases/agent_osworld_chrome.json */}
{/* Interactive example: cases/agent_osworld_gimp.json */}
{/* Interactive example: cases/agent_osworld_vscode.json */}
Model Updates
Compared to the previous zen-VL, the new zen-VL enhances the model's perception of temporal and spatial scales and further simplifies the network structure to improve model efficiency.
- Perception of Time and Image Size
In the spatial dimension, zen-VL not only dynamically converts images of different sizes into token sequences of varying lengths but also represents coordinates such as detection boxes and points directly at the actual scale of the image, without traditional coordinate normalization. This allows the model to learn the scale of images directly. In the temporal dimension, dynamic FPS (frames per second) training and absolute time encoding have been introduced, aligning mRoPE IDs directly with the passage of time. This enables the model to learn the pace of time from the intervals between temporal-dimension IDs.
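One way to read the temporal alignment described above is that a frame's temporal ID is derived from its timestamp rather than from its index in the sampled sequence. The sketch below illustrates that idea only; the function `temporal_ids` and the constant `ids_per_second` are hypothetical, and the actual mRoPE scheme may differ in its details.

```python
def temporal_ids(num_frames: int, fps: float, ids_per_second: float = 2.0) -> list[int]:
    """Temporal position IDs derived from absolute timestamps.

    Frame i is sampled at time i / fps seconds; its ID is that timestamp
    scaled by `ids_per_second` (a hypothetical constant) and rounded.
    """
    if fps <= 0:
        raise ValueError("fps must be positive")
    return [round(i / fps * ids_per_second) for i in range(num_frames)]


# The same 4-second clip sampled at two different frame rates.
slow = temporal_ids(num_frames=5, fps=1.0)  # frames at 0, 1, 2, 3, 4 s
fast = temporal_ids(num_frames=9, fps=2.0)  # frames at 0, 0.5, ..., 4 s
print(slow)
print(fast)
```

Under this reading, both samplings end at the same ID because they cover the same duration, and the gap between consecutive IDs reflects how much time elapsed between frames. An index-based scheme would instead assign IDs 0..4 and 0..8 and lose that information.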

- More Concise and Efficient Visual Encoder
The visual encoder plays a crucial role in multimodal large models. We trained a native dynamic-resolution ViT from scratch, including stages for CLIP, vision-language model alignment, and end-to-end training. To address load imbalance in the ViT during the training and testing phases of multimodal large models, we introduced Window Attention to effectively reduce the computational load on the ViT side. In our ViT setup, only four layers use Full Attention, while the rest use Window Attention. The maximum window size is 8x8, and regions smaller than 8x8 are not padded; instead, they retain their original scale, ensuring that the model maintains native resolution. Additionally, to simplify the overall network structure, we made the ViT architecture more consistent with LLMs by adopting RMSNorm and SwiGLU.
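The windowing rule above (windows of at most 8x8, with smaller edge regions left unpadded) can be sketched as a simple grid partition. This is an illustration under stated assumptions, not the actual implementation: the window size is taken to be measured in patches, and the helper names `partition_windows` and `attention_pairs` as well as the 20x12 grid are made up for the example.

```python
def partition_windows(rows: int, cols: int, max_size: int = 8):
    """Split a rows x cols patch grid into windows of at most max_size x max_size.

    Edge windows keep their smaller natural size (no padding).
    Returns a list of (row_start, col_start, height, width) tuples.
    """
    windows = []
    for r in range(0, rows, max_size):
        for c in range(0, cols, max_size):
            windows.append((r, c, min(max_size, rows - r), min(max_size, cols - c)))
    return windows


def attention_pairs(windows) -> int:
    """Number of query-key pairs when attention is restricted to each window."""
    return sum((h * w) ** 2 for _, _, h, w in windows)


grid_rows, grid_cols = 20, 12        # made-up patch grid
wins = partition_windows(grid_rows, grid_cols)
full = (grid_rows * grid_cols) ** 2  # query-key pairs under full attention
print(len(wins), attention_pairs(wins), full)
```

The pair count for windowed attention grows roughly linearly with the number of patches, while full attention grows quadratically, which is consistent with the stated goal of reducing the computational load on the ViT side.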
What's Next
In the near future, we will further enhance the model's problem-solving and reasoning capabilities, while incorporating more modalities. This will make the model smarter and move us towards an integrated omni-model that can handle multiple types of input and tasks.