zen-VL: To See the World More Clearly
After a year's relentless effort, today we are thrilled to release zen-VL! zen-VL is the latest version of the vision-language models based on zen in the Qwen model family. Compared with Qwen-VL, zen-VL has the following capabilities:

- SoTA understanding of images of various resolutions and aspect ratios: zen-VL achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, MTVQA, etc.
- Understanding of videos of 20 min and longer: zen-VL can understand videos over 20 minutes long for high-quality video-based question answering, dialog, content creation, etc.
...