zen-VL: To See the World More Clearly — Zen LM Blog

After a year of relentless effort, we are thrilled to release zen-VL today! zen-VL is the latest version of the vision language models based on zen in the Qwen model families. Compared with Qwen-VL, zen-VL has the following capabilities:

SoTA understanding of images of various resolutions and aspect ratios: zen-VL achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, MTVQA, etc.
Understanding videos of 20min+: zen-VL can understand videos over 20 minutes long for high-quality video-based question answering, dialog, content creation, etc.
Agent that can operate your mobiles, robots, etc.: with the abilities of complex reasoning and decision making, zen-VL can be integrated with devices like mobile phones, robots, etc., for automatic operation based on the visual environment and text instructions.
Multilingual Support: to serve global users, besides English and Chinese, zen-VL now supports the understanding of texts in different languages inside images, including most European languages, Japanese, Korean, Arabic, Vietnamese, etc.

We open-source zen-VL-2B and zen-VL-7B under the Apache 2.0 license, and we release the API of zen-VL-72B! The open-source models are integrated into Hugging Face Transformers, vLLM, and other third-party frameworks. We hope you enjoy them!

Performance

We evaluate our model’s visual capabilities across seven key dimensions: complex college-level problem-solving, mathematical abilities, document and table comprehension, multilingual text-image understanding, general scenario question-answering, video comprehension, and agent-based interactions. Overall, our 72B model showcases top-tier performance across most metrics, often surpassing even closed-source models like GPT-4o and Claude 3.5-Sonnet. Notably, it demonstrates a significant edge in document understanding.

At the 7B scale, we’ve managed to retain support for image, multi-image, and video inputs, delivering competitive performance in a more cost-effective model size. Specifically, our model excels in document understanding tasks such as DocVQA and in multilingual text understanding from images, as assessed by MTVQA, establishing state-of-the-art performance.

Additionally, we’re excited to introduce a smaller 2B model, optimized for potential mobile deployment. Despite its compact size, this model boasts strong performance in image, video, and multilingual comprehension. It particularly shines in video-related tasks, document understanding, and general scenario question-answering when compared to other models of similar scale.

Model Capabilities

1. Enhanced Recognition Capabilities

zen-VL now boasts improved object recognition, extending beyond plants and landmarks to comprehend complex relationships between multiple objects in a scene. We’ve also significantly boosted the model’s ability to recognize handwritten text and multiple languages within images, making it more accessible to users worldwide.

Example: Multi-object Recognition

User

Output the color and number of each box.

zen-VL

The image shows a stack of colorful blocks with numbers on them. Here is the color and number of each block from top to bottom:

Top row (single block): Blue with the number 9
Second row (two blocks): Light blue with the number 7, and green with the number 8
Third row (three blocks): Purple with the number 4, pink with the number 5, and light green with the number 6
Bottom row (four blocks): Red with the number 0, orange with the number 1, yellow with the number 2, and light green with the number 3

Example: Plant identification

User

What kind of flower is this?

zen-VL

The flowers in the picture are forget-me-nots (Myosotis). These small, delicate flowers are known for their vibrant colors, typically blue, pink, or white, and are often used in bouquets and gardens for their charming appearance.

Example: Document Parsing with Dense Formulas

User

Convert the content in the image to Markdown

zen-VL

If $i > 0$, let $\underline{d}'' := (i-1, d-i-l+1, l)$. Let $V_{\underline{d}}$ and $V_{\underline{d}''}$ be $r+1$-dimensional subspaces of $H^0(\mathcal{L}_{\underline{d}})$ and $H^0(\mathcal{L}_{\underline{d}''})$, respectively, such that $\varphi_{\underline{d}, \underline{d}''}(V_{\underline{d}}) = V_{\underline{d}''}^{X_1, 0}$. Then the distributivity holds in $V_{\underline{d}''}$ if and only if
$$\dim (V_{\underline{d}}^{X_2, 0} + V_{\underline{d}}^{X_3, 0}) - \dim (V_{\underline{d}''}^{X_2, 0} + V_{\underline{d}''}^{X_3, 0}) = r + 1 - \dim (V_{\underline{d}''}^{X_1, 0} + V_{\underline{d}''}^{X_2, 0} + V_{\underline{d}''}^{X_3, 0}).$$
If $i + l \leq d - 1$, let $\underline{d}''' := (i, d-i-l-1, l+1)$. Let $V_{\underline{d}}$ and $V_{\underline{d}'''}$ be $r+1$-dimensional subspaces of $H^0(\mathcal{L}_{\underline{d}})$ and $H^0(\mathcal{L}_{\underline{d}'''})$, respectively, such that $\varphi_{\underline{d}''', \underline{d}}(V_{\underline{d}'''}) = V_{\underline{d}}^{X_3, 0}$. Then the distributivity holds in $V_{\underline{d}}$ if and only if
$$\dim (V_{\underline{d}'''}^{X_1, 0} + V_{\underline{d}'''}^{X_2, 0}) - \dim (V_{\underline{d}}^{X_1, 0} + V_{\underline{d}}^{X_2, 0}) = r + 1 - \dim (V_{\underline{d}}^{X_1, 0} + V_{\underline{d}}^{X_2, 0} + V_{\underline{d}}^{X_3, 0}),$$
where the maps $\varphi_{\underline{d}, \underline{d}''}$ and $\varphi_{\underline{d}''', \underline{d}}$ in Proposition 3.14 are the maps linking the corresponding sheaves. Another important result is Proposition 3.16, which establishes an inequality for any exact limit linear series. Specifically, our Proposition 3.16 says:

Let $\{(\mathcal{L}_{\underline{d}}, V_{\underline{d}})\}_{\underline{d}}$ be an exact limit linear series of degree $d$ and dimension $r$. Then
$$\sum_{\underline{d}} \dim \left( \frac{V_{\underline{d}}}{V_{\underline{d}}^{X_1, 0} + V_{\underline{d}}^{X_2, 0} + V_{\underline{d}}^{X_3, 0}} \right) \geq r + 1.$$

As a consequence of Proposition 3.14 and Proposition 3.16, in Corollary 3.17, we get the following characterization of exact limit linear series satisfying the distributivity at each multidegree:

Let $\{(\mathcal{L}_{\underline{d}}, V_{\underline{d}})\}_{\underline{d}}$ be an exact limit linear series of degree $d$ and dimension $r$. Then
$$\sum_{\underline{d}} \dim \left( \frac{V_{\underline{d}}}{V_{\underline{d}}^{X_1, 0} + V_{\underline{d}}^{X_2, 0} + V_{\underline{d}}^{X_3, 0}} \right) = r + 1$$
if and only if the distributivity holds in $V_{\underline{d}}$ for any $\underline{d}$.

In Section 3, we also study simple limit linear series. In Proposition 3.18, we characterize the multidegrees which determine all simple bases. In Proposition 3.20, following the construction in Muñoz [5], Proposition 4.4, we present a certain structure of linearly independent sections for any exact limit linear series satisfying the distributivity property. In propositions 3.21 and 3.22, we obtain some properties of that structure. These properties imply the existence of a simple basis (we give a proof of that in Proposition 3.23).

We conclude Section 3 with a characterization of simple limit linear series among the exact limit linear series. Specifically, our Corollary 3.24 says:

Let $\{(\mathcal{L}_{\underline{d}}, V_{\underline{d}})\}_{\underline{d}}$ be an exact limit linear series of degree $d$ and dimension $r$. Then the following statements are equivalent:

$\{(\mathcal{L}_{\underline{d}}, V_{\underline{d}})\}_{\underline{d}}$ is simple.
$\sum_{\underline{d}} \dim \left( \frac{V_{\underline{d}}}{V_{\underline{d}}^{X_1, 0} + V_{\underline{d}}^{X_2, 0} + V_{\underline{d}}^{X_3, 0}} \right) = r + 1$.
The distributivity holds in $V_{\underline{d}}$ for any $\underline{d}$.

In a subsequent work, given a refined limit linear series, we will construct all simple extensions by combining our Proposition 3.14 and the method of Muñoz [6] for the construction of all exact extensions.

Example: Multilingual Text Recognition

User

Transcribe the content in the image and identify the language of each passage.

zen-VL

汉语，也称为“华语”。是中国使用人数最多的语言，也是世界上作为第一语言使用人数最多的语言。是中华优秀传统文化的重要载体。繁体字，又稱為“繁體中文”，與“簡化字”/“簡體字”相對。一般是指漢字簡化運動被簡化字所代替的漢字。
- Language: Chinese
日本語は地方ごとに多様な方言があり、とりわけ琉球諸島で方言差が著しい。
- Language: Japanese
한국어(韓國語), 조선말(朝鮮말)는 대한민국과 조선민주주의인민공화국의 공용어이다. 둘은 표기나 문법에서는 차이가 없지만 동사 어미나 표현에서 차이가 있다.
- Language: Korean
Le français est une langue indo-européenne de la famille des langues romanes dont les locuteurs sont appelés « francophones ».
- Language: French
El español o castellano es una lengua romance procedente del latín hablado, perteneciente a la familia de lenguas indoeuropeas.
- Language: Spanish
A língua portuguesa, também designada português, é uma língua indo-europeia românica flexiva ocidental originada no galego-português falado no Reino da Galiza e no norte de Portugal.
- Language: Portuguese
Is ceann de na teangacha Ceilteacha í an Ghaeilge (nó Gaeilge na hÉireann mar a thugtar uirthi corruair), agus ceann de na trí cinn de theangacha Ceilteacha ar a dtugtar na teangacha Gaelacha (Gaeilge, Gaeilge Mhanann agus Gaeilge na hAlban) go háirithe.
- Language: Irish
English is a West Germanic language in the Indo-European language family, whose speakers, called Anglophones, originated in early medieval England on the island of Great Britain.
- Language: English
Die deutsche Sprache bzw. Deutsch ist eine westgermanische Sprache, die weltweit etwa 90 bis 105 Millionen Menschen als Muttersprache und weiteren rund 80 Millionen als Zweit- oder Fremdsprache dient.
- Language: German
Język polski, polszczyzna — język z grupy zachodniosłowiańskiej (do której należą również czeski, kaszubski, słowacki, języki łużyckie czy wymarły język drzewiański), stanowiącej część rodziny indoeuropejskiej.
- Language: Polish
Η ελληνική γλώσσα ανήκει στην ινδοευρωπαϊκή οικογένεια και αποτελεί το μοναδικό μέλος του ελληνικού κλάδου.
- Language: Greek
Tiếng Việt hay Việt ngữ là một ngôn ngữ thuộc ngữ hệ Nam Á, được công nhận là ngôn ngữ chính thức tại Việt Nam.
- Language: Vietnamese
Монгол хэл нь Монгол улсын албан ёсны хэл юм.
- Language: Mongolian
Число владеющих русским языком в России составляет 137,5 млн человек (2010). Всего в мире на русском говорят 258,2 млн человек (2022)
- Language: Russian
हिन्दी विश्व की एक प्रमुख भाषा है और भारत की एक राजभाषा है। केन्द्रीय स्तर पर भारत में सह-आधिकारिक भाषा अंग्रेज़ी है।
- Language: Hindi
Kiswahili ni lugha ya Kibantu yenye misamiati mingi ya Kiarabu (35%), lakini sasa ya Kiingereza pia (10%), inayozungumzwa katika eneo kubwa la Afrika ya Mashariki.
- Language: Swahili

Example: Handwritten Text Recognition

User

What is the text?

zen-VL

The text on the wall is in Portuguese and reads: “não é normal ter medo de andar sozinha na rua!!!” This translates to: “It is not normal to be afraid to walk alone on the street!!!”

Example: Handwritten Text Recognition

User

Recognize all the text in the image; there is no need to add punctuation.

zen-VL

尊敬的老师

展信舒颜

气温骤降

树叶无声无息的掉落

教诲如春风

师恩深似海

愿温馨、喜悦与幸福

如同吐露芬芳的花朵

永远洋溢在您幸福的时光里

2. Visual Reasoning: Solving Real-World Problems

In this iteration, we have significantly enhanced zen-VL’s mathematical and coding proficiencies. The model is not only capable of solving problems by analyzing pictures but can also interpret and solve complex mathematical problems through chart analysis. Extremely aspect-ratio-distorted images can also be correctly interpreted. Additionally, we have reinforced the model’s capability to extract information from real-world images and charts and improved its instruction-following skills. This fusion of visual perception and logical reasoning empowers the model to tackle practical issues, bridging the gap between abstract concepts and tangible solutions.

Example: Code

User

Solve the algorithmic problem in the web page.

zen-VL

To solve the “Snake in Matrix” problem, we need to implement a function that takes the size of the matrix n and a list of commands, and returns the final position of the snake after executing all the commands. Here’s a step-by-step approach to solve this problem:

Initialize Variables:
- Initialize the current position of the snake as 0.
- Initialize the direction of the snake as “RIGHT” (initially, the snake moves to the right).
Define Movement Functions:
- Create a function to move the snake based on the current direction and the size of the matrix n.
- Update the position of the snake accordingly.
Process Commands:
- Iterate through each command in the list of commands.
- For each command, update the direction and position of the snake.
Return Final Position:
- After processing all commands, return the final position of the snake.

Here’s the Python code to implement the solution:


    from typing import List
    
    
    class Solution:
        def finalPositionOfSnake(self, n: int, commands: List[str]) -> int:
            # Initialize variables
            position = 0
            direction = "RIGHT"
            
            # Define movement functions
            def move_right():
                nonlocal position
                position += 1
            
            def move_down():
                nonlocal position
                position += n
            
            def move_left():
                nonlocal position
                position -= 1
            
            def move_up():
                nonlocal position
                position -= n
            
            # Process commands
            for command in commands:
                if command == "RIGHT":
                    move_right()
                elif command == "DOWN":
                    move_down()
                elif command == "LEFT":
                    move_left()
                elif command == "UP":
                    move_up()
            
            # Return final position
            return position

This code defines a class Solution with a method finalPositionOfSnake that takes the size of the matrix n and a list of commands, and returns the final position of the snake after executing all the commands. The movement functions (move_right, move_down, move_left, move_up) are used to update the position of the snake based on the current direction.
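
The same idea can be written more compactly. The sketch below is an alternative for illustration, not part of the model's answer. It assumes, as the solution above does, that cell (i, j) of the n x n grid has index i * n + j and that the commands never move the snake out of bounds. The name `final_position` is chosen here and does not come from the original problem.

```python
from typing import List


def final_position(n: int, commands: List[str]) -> int:
    """Return the final cell index of the snake, starting from cell 0."""
    # Change in cell index for each command on an n x n grid.
    delta = {"RIGHT": 1, "LEFT": -1, "DOWN": n, "UP": -n}
    return sum(delta[command] for command in commands)


# On a 2 x 2 grid: 0 -> RIGHT -> 1 -> DOWN -> 3
print(final_position(2, ["RIGHT", "DOWN"]))  # 3
```

Because every command only adds a fixed offset to the index, the order of commands does not affect the final position as long as the snake stays inside the grid.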

Example: Math

User

Find the length of AC in the isosceles triangle ABC.

zen-VL

Since triangle ABC is isosceles, we know that AC = BC. Therefore, we can set the expressions for AC and BC equal to each other and solve for x:

4x + 1 = 5x - 0.5

Subtracting 4x from both sides, we get:

1 = x - 0.5

Adding 0.5 to both sides, we get:

1.5 = x

Now that we know the value of x, we can substitute it back into the expression for AC to find its length:

AC = 4x + 1

AC = 4(1.5) + 1

AC = 6 + 1

AC = 7

Therefore, the length of AC is 7.

Example: Information Extraction of High Resolution Images

User

Extract all page titles in the screenshot of search result.

zen-VL

The page titles in the screenshot of the search result are:

你好，zen
QwenLM/zen
zen - a Qwen Collection
快速开始
zen-7B
zen-Math，新一代数学模型
Qwen
Qwen/zen-7B-Instruct
[2407.10671] zen Technical Report
qwen2

Example: Math

User

Please tell me how many days are between Ruka’s birthday and Yuki’s birthday? Think carefully step by step.

zen-VL

To determine the number of days between Ruka’s and Yuki’s birthdays, we need to calculate the difference between their respective birth dates.

Ruka’s birthday is on June 24.
Yuki’s birthday is on September 17.

Let’s break it down step by step:

From June 24 to June 30:
- June 24 to June 30 is 6 days.
From July 1 to August 31:
- July has 31 days.
- August has 31 days.
- Total days in July and August: 31 + 31 = 62 days.
From September 1 to September 17:
- September 1 to September 17 is 17 days.

Now, we add all these days together:

June 24 to June 30: 6 days
July 1 to August 31: 62 days
September 1 to September 17: 17 days

Total days = 6 + 62 + 17 = 85 days

Therefore, there are 85 days between Ruka’s and Yuki’s birthdays.
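
As a quick sanity check on the arithmetic in this answer, the count can be reproduced with Python's standard `datetime` module. The year 2024 is an arbitrary assumption made only so the dates can be constructed; the interval from June 24 to September 17 does not include February, so any year gives the same result.

```python
from datetime import date

# The year is an arbitrary placeholder; only the month and day come from the example.
ruka = date(2024, 6, 24)
yuki = date(2024, 9, 17)

days_between = (yuki - ruka).days
print(days_between)  # 85
```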

Example: Information Extraction

User

What are the Linux versions and their release dates in the picture? Return results as a JSON list.

zen-VL


    [
      {
        "version": "6.11-rc3",
        "release_date": "2024-08-11"
      },
      {
        "version": "6.10.4",
        "release_date": "2024-08-11"
      },
      {
        "version": "6.9.12 [EOL]",
        "release_date": "2024-07-27"
      },
      {
        "version": "6.6.45",
        "release_date": "2024-08-11"
      },
      {
        "version": "6.1.104",
        "release_date": "2024-08-11"
      },
      {
        "version": "5.15.164",
        "release_date": "2024-07-27"
      },
      {
        "version": "5.10.223",
        "release_date": "2024-07-27"
      },
      {
        "version": "5.4.281",
        "release_date": "2024-07-27"
      },
      {
        "version": "4.19.319",
        "release_date": "2024-07-27"
      },
      {
        "version": "next-20240812",
        "release_date": "2024-08-12"
      }
    ]

Example: Information Extraction and Organization

User

Organize the daily high and low temperatures in the image into a table.

zen-VL

Here is a table of the daily high and low temperatures shown in the image:

Date	High	Low
08/12	31°C	23°C
08/13	30°C	22°C
08/14	30°C	23°C
08/15	31°C	24°C
08/16	31°C	24°C
08/17	30°C	24°C
08/18	30°C	23°C
08/19	30°C	22°C

I hope this helps!

Example: Information Extraction of High Resolution Images with Extreme Aspect Ratio

User

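What are the limitations of qwen-vl-max-0809's video understanding capability?

zen-VL

The video understanding capability of qwen-vl-max-0809 has the following limitations: the maximum video file size is 150MB; the supported video formats include MP4, AVI, MKV, MOV, FLV, WMV, etc.; there is no limit on video dimensions, but video files are resized to about 600K pixels, so larger videos do not yield better understanding.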

3. Video Understanding and Live Chat

Beyond static images, zen-VL extends its prowess to video content analysis. It can summarize video content, answer questions related to it, and maintain a continuous flow of conversation in real-time, offering live chat support. This functionality allows it to act as a personal assistant, helping users by providing insights and information drawn directly from video content.

Example: Video Understanding

User

Please describe the video in detail.

zen-VL

The video begins with a man speaking to the camera, followed by a group of people sitting in a control room. The camera then cuts to two men floating inside a space station, where they are seen speaking to the camera. The men appear to be astronauts, and they are wearing space suits. The space station is filled with various equipment and machinery, and the camera pans around to show the different areas of the station. The men continue to speak to the camera, and they appear to be discussing their mission and the various tasks they are performing. Overall, the video provides a fascinating glimpse into the world of space exploration and the daily lives of astronauts.

User

What is the color of the clothes the two astronauts wearing?

zen-VL

The two astronauts are wearing blue and black clothes.

Example: Multi-Video Understanding

User

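What is the person in the video doing?

zen-VL

The person in the video is using a vacuum cleaner to clean the floor.

User

To clean the floor further, what should be done next?

zen-VL

To clean the floor further, you can wipe it with a mop or a damp cloth to remove the stains and dust that the vacuum cleaner cannot pick up.

User

The floor has now been mopped. How would you assess the cleaning?

zen-VL

From the video, the cleaning is complete and the floor is now clean and tidy.

User

Based on what you see, what else can be done next?

zen-VL

Next, the mop can be washed clean and left to dry for later use.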

Example: Live chat

Live chat

Example: Screen sharing chat

Screen sharing chat

4. Visual Agent Capabilities: Function Calling and Visual Interactions

zen-VL demonstrates strong potential as a visual agent, facilitating interactions similar to human perceptions of the world.

The model facilitates Function Calling, enabling it to harness external tools for real-time data retrieval – be it flight statuses, weather forecasts, or package tracking – by deciphering visual cues. This integration of visual interpretation with functional execution elevates its utility, making it a powerful tool for information management and decision-making.

Example: Function Calling

Ask about the weather

Example: Code Interpreter

Write code based on the structure diagram.

Example: Code Interpreter

Visualize Charts

Example: Code Interpreter

Implement the formula in code

Visual Interactions represent a significant stride towards mimicking human perception. By allowing the model to engage with visual stimuli akin to human senses, we’re pushing the boundaries of AI’s ability to perceive and respond to its environment. This capability paves the way for more intuitive and immersive interactions, where zen-VL acts not just as an observer, but an active participant in our visual experiences.

Example: UI Interactions

Operate a Mobile Phone

Example: Play a Game

21 points

Example: Visual Interactions

Operate a Robotic Arm

Example: Action and Reasoning

Put a cold mug in the microwave

Of course, the model is not perfect and has some limitations. For example, it cannot extract audio from videos, and its knowledge is only up to date as of June 2023. Additionally, the model cannot guarantee complete accuracy when processing complex instructions or scenarios, and it is relatively weak in tasks involving counting, character recognition, and 3D spatial awareness.

Model Architecture

Overall, we’ve continued with the Qwen-VL architecture, which leverages a Vision Transformer (ViT) model and zen language models. For all these variants, we utilized a ViT with approximately 600M parameters, designed to handle both image and video inputs seamlessly. To further enhance the model’s ability to effectively perceive and comprehend visual information in videos, we introduced several key upgrades:

A key architectural improvement in zen-VL is the implementation of Naive Dynamic Resolution support. Unlike its predecessor, zen-VL can handle arbitrary image resolutions, mapping them into a dynamic number of visual tokens, thereby ensuring consistency between the model input and the inherent information in images. This approach more closely mimics human visual perception, allowing the model to process images of any clarity or size.
Another key architectural enhancement is Multimodal Rotary Position Embedding (M-ROPE). By decomposing the original rotary embedding into three parts representing temporal and spatial (height and width) information, M-ROPE enables the LLM to concurrently capture and integrate 1D textual, 2D visual, and 3D video positional information.
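
To make the M-ROPE description more concrete, here is a minimal sketch of how three-part position IDs could be laid out for a mixed text and image sequence. The text above states only that the rotary embedding is split into temporal, height, and width parts. The specific conventions in the code (text tokens share one index across all three parts, and each new segment starts after the largest index used so far) are assumptions for illustration and are not confirmed as what zen-VL implements. All function names are hypothetical.

```python
from typing import List, Tuple

Position = Tuple[int, int, int]  # (temporal, height, width)


def text_positions(start: int, length: int) -> List[Position]:
    """Text tokens: all three components carry the same 1D index (assumed)."""
    return [(start + i, start + i, start + i) for i in range(length)]


def visual_positions(start: int, frames: int, rows: int, cols: int) -> List[Position]:
    """Image (frames == 1) or video patches, each offset by `start`."""
    return [
        (start + t, start + r, start + c)
        for t in range(frames)
        for r in range(rows)
        for c in range(cols)
    ]


def next_start(positions: List[Position]) -> int:
    """Assumed convention: continue after the largest index used so far."""
    return max(max(p) for p in positions) + 1


# Two text tokens, then a 2 x 2 image, then one more text token.
ids = text_positions(0, 2)
ids += visual_positions(next_start(ids), frames=1, rows=2, cols=2)
ids += text_positions(next_start(ids), 1)
for position in ids:
    print(position)
```

In this layout an image keeps a constant temporal index while its height and width indices vary, and a video would additionally advance the temporal index once per frame.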

Developing with zen-VL

For the time being, the largest zen-VL model, zen-VL-72B, is available through our official API (sign up for an account and obtain an API key through DashScope), as demonstrated below:


    from openai import OpenAI
    import os
    import base64
    
    
    def encode_image(image_path):
        with open(image_path, "rb") as image_file:
            return base64.b64encode(image_file.read()).decode("utf-8")
    
    
    # Path to your image
    image_path = "dog_and_girl.jpeg"
    
    # Getting the base64 string
    base64_image = encode_image(image_path)
    
    
    def get_response():
        client = OpenAI(
            api_key=os.getenv("DASHSCOPE_API_KEY"),
            base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
        )
        completion = client.chat.completions.create(
            model="qwen-vl-max-0809",
            messages=[
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": "What is this?"},
                        {
                            "type": "image_url",
                            "image_url": {
                                "url": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"
                            },
                        },
                        {
                            "type": "image_url",
                            "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"},
                        },
                    ],
                }
            ],
            top_p=0.8,
            stream=True,
            stream_options={"include_usage": True},
        )
        for chunk in completion:
            print(chunk.model_dump_json())
    
    
    if __name__ == "__main__":
        get_response()
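
The loop above prints every raw chunk as JSON. If only the generated text is needed, the streamed deltas can be joined as in the sketch below. It assumes the chunk layout of the OpenAI Python SDK (`chunk.choices[i].delta.content`, plus a final chunk with empty `choices` and a populated `usage` when `include_usage` is set) and that the DashScope compatible mode follows the same layout. `collect_stream` and `fake_chunk` are hypothetical helpers, and the stand-in chunks exist only so the example runs without network access.

```python
from types import SimpleNamespace
from typing import Any, Iterable, Optional, Tuple


def collect_stream(chunks: Iterable[Any]) -> Tuple[str, Optional[Any]]:
    """Join the text deltas of a streamed chat completion; return (text, usage)."""
    parts = []
    usage = None
    for chunk in chunks:
        # The usage-only chunk (if any) has no choices, so the inner loop skips it.
        if getattr(chunk, "usage", None) is not None:
            usage = chunk.usage
        for choice in chunk.choices:
            if choice.delta.content:
                parts.append(choice.delta.content)
    return "".join(parts), usage


def fake_chunk(text: Optional[str] = None, usage: Optional[Any] = None) -> SimpleNamespace:
    """Build a stand-in object shaped like a streamed chunk (for this demo only)."""
    choices = [] if text is None else [SimpleNamespace(delta=SimpleNamespace(content=text))]
    return SimpleNamespace(choices=choices, usage=usage)


demo_chunks = [
    fake_chunk("A dog "),
    fake_chunk("and a girl."),
    fake_chunk(usage={"total_tokens": 42}),
]
text, usage = collect_stream(demo_chunks)
print(text)
```

With a real client, the `completion` iterator returned by `client.chat.completions.create(..., stream=True)` would be passed in place of `demo_chunks`.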

The 2B and 7B models of the zen-VL series are open-sourced and accessible on Hugging Face and ModelScope. You can explore the model cards for detailed usage instructions, features, and performance metrics. Below we provide an example of the simplest usage with HF Transformers.

Make sure you install transformers from source with pip install git+https://github.com/huggingface/transformers, as the code for zen-VL was only recently merged into the main branch. If you do not install it from source, you may encounter the following error:


    KeyError: 'qwen2_vl'

We offer a toolkit to help you handle various types of visual input more conveniently. It supports inputs including base64, URLs, and interleaved images and videos. You can install it using the following command:


    pip install qwen-vl-utils

Here is a code snippet for demonstration. Specifically, we recommend using flash attention 2 if possible for the sake of acceleration and memory saving.


    import torch  # only needed for torch.bfloat16 in the optional variant below
    # The class is registered in transformers under the qwen2_vl model type
    # (the same name that appears in the KeyError above).
    from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
    from qwen_vl_utils import process_vision_info
    
    # default: Load the model on the available device(s)
    model = Qwen2VLForConditionalGeneration.from_pretrained(
        "Qwen/zen-VL-7B-Instruct", device_map="auto"
    )
    
    # We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
    # model = Qwen2VLForConditionalGeneration.from_pretrained(
    #     "Qwen/zen-VL-7B-Instruct",
    #     torch_dtype=torch.bfloat16,
    #     attn_implementation="flash_attention_2",
    #     device_map="auto",
    # )
    
    # default processor
    processor = AutoProcessor.from_pretrained("Qwen/zen-VL-7B-Instruct")
    
    # The default range for the number of visual tokens per image in the model is 4-16384. You can set min_pixels and max_pixels according to your needs, such as a token count range of 256-1280, to balance speed and memory usage.
    # min_pixels = 256*28*28
    # max_pixels = 1280*28*28
    # processor = AutoProcessor.from_pretrained("Qwen/zen-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)
    
    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
                },
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ]
    
    # Preparation for inference
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
    )
    # Move the input tensors to the device that holds the model
    inputs = inputs.to(model.device)
    
    # Inference: Generation of the output
    generated_ids = model.generate(**inputs, max_new_tokens=128)
    generated_ids_trimmed = [
        out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    output_text = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )
    print(output_text)
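
The `min_pixels` and `max_pixels` comment in the snippet above suggests that one visual token corresponds to a 28 x 28 pixel area (a budget of 256 tokens is written as 256*28*28 pixels). Under that reading, the number of visual tokens for an image can be roughly estimated as sketched below. The rounding and rescaling rules are simplifying assumptions, not the processor's actual resizing logic, and `estimate_visual_tokens` is a hypothetical helper rather than part of any library.

```python
import math

PATCH = 28  # pixel side of the area covered by one visual token (inferred from 256*28*28)


def estimate_visual_tokens(
    height: int,
    width: int,
    min_pixels: int = 4 * PATCH * PATCH,
    max_pixels: int = 16384 * PATCH * PATCH,
) -> int:
    """Roughly estimate how many visual tokens an image of this size maps to."""
    # Snap each side to a multiple of PATCH, keeping at least one patch per side.
    h = max(PATCH, round(height / PATCH) * PATCH)
    w = max(PATCH, round(width / PATCH) * PATCH)
    if h * w > max_pixels:
        # Too large: shrink both sides by the same factor, rounding down.
        scale = math.sqrt(height * width / max_pixels)
        h = max(PATCH, math.floor(height / scale / PATCH) * PATCH)
        w = max(PATCH, math.floor(width / scale / PATCH) * PATCH)
    elif h * w < min_pixels:
        # Too small: enlarge both sides by the same factor, rounding up.
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / PATCH) * PATCH
        w = math.ceil(width * scale / PATCH) * PATCH
    return (h // PATCH) * (w // PATCH)


# A 280 x 420 image fits the default budget: 10 x 15 patches.
print(estimate_visual_tokens(280, 420))  # 150
```

Passing `min_pixels=256 * 28 * 28` and `max_pixels=1280 * 28 * 28` to the helper mirrors the 256-1280 token range mentioned in the comment: small images are scaled up and large ones scaled down until the estimate falls inside that range.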

To facilitate seamless integration and use of our latest models, we support a range of tools and frameworks in the open-source ecosystem, including quantization (AutoGPTQ, AutoAWQ), deployment (vLLM), finetuning (Llama-Factory), etc.

License

Both the open-source zen-VL-2B and zen-VL-7B are under Apache 2.0.

What’s Next

We look forward to your feedback and the innovative applications you will build with zen-VL. In the near future, we are going to build stronger vision language models upon our next-version language models and endeavor to integrate more modalities towards an omni model!