March 23, 2025

zen-VL-32B: Smarter and Lighter


Introduction

At the end of January this year, we launched the zen-VL series of models, which received widespread attention and positive feedback from the community. Building on that series, we continued to optimize the model with reinforcement learning and have open-sourced a new VL model at the popular 32B parameter scale under the Apache 2.0 license: zen-VL-32B-Instruct. Compared with the previously released zen-VL models, this 32B VL model offers the following improvements:

Responses More Aligned with Human Preferences: Adjusted the output style to provide more detailed, better-formatted answers that align more closely with human preferences.
Mathematical Reasoning: Significant improvement in the accuracy of solving complex mathematical problems.
Fine-grained Image Understanding and Reasoning: Enhanced accuracy and detailed analysis in tasks such as image parsing, content recognition, and visual logic deduction.

Performance

In extensive benchmarking against state-of-the-art (SoTA) models of comparable scale, zen-VL-32B-Instruct outperforms baselines such as Mistral-Small-3.1-24B and Gemma-3-27B-IT, and even surpasses the larger zen-VL-72B-Instruct. Notably, it achieves significant advantages on multimodal benchmarks such as MMMU, MMMU-Pro, and MathVista, which focus on complex, multi-step reasoning. On MM-MT-Bench, a benchmark that emphasizes subjective user experience, zen-VL-32B-Instruct outperforms its predecessor zen-VL-72B-Instruct by a substantial margin.

In addition to excelling in visual capabilities, zen-VL-32B-Instruct has also achieved top-tier performance in pure text capabilities at the same scale.

Demo Cases

[Interactive examples omitted: one reasoning case, three math cases, and one image understanding case.]

Next Step

zen-VL-32B has focused on optimizing subjective experience and mathematical reasoning through reinforcement learning, operating within the paradigm of "fast thinking". Our next research direction will prioritize long and effective reasoning processes to push the boundaries of visual models in tackling highly complex, multi-step visual reasoning tasks.

Citation

If you find our model helpful, feel free to cite it:

@article{zen-VL,
  title={zen-VL Technical Report},
  author={Bai, Shuai and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Song, Sibo and Dang, Kai and Wang, Peng and Wang, Shijie and Tang, Jun and Zhong, Humen and Zhu, Yuanzhi and Yang, Mingkun and Li, Zhaohai and Wan, Jianqiang and Wang, Pengfei and Ding, Wei and Fu, Zheren and Xu, Yiheng and Ye, Jiabo and Zhang, Xi and Xie, Tianbao and Cheng, Zesen and Zhang, Hang and Yang, Zhibo and Xu, Haiyang and Lin, Junyang},
  journal={arXiv preprint arXiv:2502.13923},
  year={2025}
}