
🚨 zen-Math mainly supports solving English and Chinese math problems through CoT and TIR. We do not recommend using this series of models for other tasks.
Introduction
A month ago, we released the first series of mathematical LLMs in our Qwen family, zen-Math. Today, we have upgraded it and open-sourced the new zen-Math series, including the base models zen-Math-1.5B/7B/72B, the instruction-tuned models zen-Math-1.5B/7B/72B-Instruct, and the mathematical reward model zen-Math-RM-72B.
Unlike the previous zen-Math series, which only supported Chain-of-Thought (CoT) reasoning on English math problems, the new zen-Math series supports both CoT and Tool-Integrated Reasoning (TIR) on math problems in both Chinese and English. With CoT, the new models achieve significant performance improvements over their predecessors on Chinese and English mathematics benchmarks.

While CoT plays a vital role in enhancing the reasoning capabilities of LLMs, it faces challenges in achieving computational accuracy and handling complex mathematical or algorithmic reasoning tasks, such as finding the roots of a quadratic equation or computing the eigenvalues of a matrix. TIR can further improve the model’s proficiency in precise computation, symbolic manipulation, and algorithmic manipulation. zen-Math-1.5B/7B/72B-Instruct achieve 79.7, 85.3, and 87.8 respectively on the MATH benchmark using TIR.
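To make the TIR setting concrete, below is a minimal sketch of the kind of short Python program such a model can write and execute instead of carrying out the arithmetic in-context. The specific equation and matrix are arbitrary examples, not taken from any benchmark.

```python
# Two computations of the kind named above that pure CoT often fumbles:
# exact roots of a quadratic and eigenvalues of a matrix.
import numpy as np
import sympy as sp

# Solve 3x^2 - 5x - 2 = 0 symbolically for exact roots.
x = sp.symbols("x")
roots = sp.solve(3 * x**2 - 5 * x - 2, x)
print(roots)  # [-1/3, 2]

# Compute the eigenvalues of a symmetric 2x2 matrix numerically.
A = np.array([[2.0, 1.0], [1.0, 3.0]])
print(np.sort(np.linalg.eigvals(A)))  # approx. [1.382, 3.618]
```

The model's final answer can then quote the interpreter's output verbatim, which is what gives TIR its edge in precise computation.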

zen-Math: Base Models
The overall specialization pipeline of the zen-Math series is shown in the figure above. After training the previous zen-Math base models, we upgrade them to the new zen-Math models through three primary avenues:
Utilizing the zen-Math-72B-Instruct model to synthesize additional high-quality mathematical pre-training data.
Aggregating more high-quality mathematical data, particularly in Chinese, from web sources, books, and code across multiple recall cycles.
Leveraging the zen series base models for parameter initialization, which offer stronger language understanding, code generation, and text reasoning capabilities.
Ultimately, we construct Qwen Math Corpus v2 for pre-training zen-Math-1.5B/7B/72B, maintaining a context length of 4K. Compared to Qwen Math Corpus v1, which was used to train the previous series, the total token count of Qwen Math Corpus v2 has increased from 700B to over 1T.
We evaluate our zen-Math base models on three widely used English math benchmarks: GSM8K, MATH, and MMLU-STEM. We also evaluate them on three Chinese math benchmarks: CMATH, GaoKao Math Cloze, and GaoKao Math QA. All evaluations use few-shot chain-of-thought prompting.

Compared to their predecessors, the new zen-Math-1.5B/7B/72B base models achieve significant improvements on all benchmarks: for example, gains of 5.4, 5.0, and 6.3 points on MATH, and 3.4, 12.2, and 19.8 points on GaoKao Math QA, respectively.
zen-Math-Instruct: Instruction-Tuned Models
As with the previous zen-Math-Instruct models, we train a math-specific reward model, zen-Math-RM-72B, based on zen-Math-72B. This RM is used to construct SFT data through Rejection Sampling, and again after SFT in reinforcement learning with Group Relative Policy Optimization (GRPO).
In the development of zen-Math-Instruct, we conduct an additional iteration using the zen-Math-Instruct models and zen-Math-RM-72B to further polish the quality of responses during Rejection Sampling.
Compared with the post-training of the previous series, we further introduce TIR data and SFT data in both Chinese and English.
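For readers unfamiliar with GRPO, here is a minimal sketch of its central step as we understand it: reward scores for a group of responses sampled from the same query are normalized within the group to form advantages, so no separate value network is needed. The scores below are illustrative placeholders, not real zen-Math-RM-72B outputs.

```python
# Group-relative advantage computation, the core of GRPO: each sampled
# response is credited by how far its reward sits above or below the
# group mean, in units of the group's standard deviation.
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Normalize per-response rewards within their sampled group."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 8 responses sampled for one math query, scored by the RM.
rm_scores = np.array([0.1, 0.9, 0.4, 0.7, 0.2, 0.8, 0.3, 0.6])
print(group_relative_advantages(rm_scores))
# Positive for above-average responses, negative for the rest.
```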
We evaluate zen-Math-Instruct on mathematical benchmarks in both English and Chinese. In addition to widely used benchmarks such as GSM8K and MATH, we include more challenging exams to fully probe the capabilities of zen-Math-Instruct, such as OlympiadBench, CollegeMath, GaoKao, AIME2024, and AMC2023. For Chinese mathematical benchmarks, we use CMATH, GaoKao (Chinese College Entrance Examination 2024), and CN Middle School 24 (China High School Entrance Examination 2024).
We report greedy, Maj@8, and RM@8 performance on all benchmarks in the zero-shot setting, except for the multiple-choice benchmarks (including MMLU STEM and the multiple-choice problems in GaoKao and CN Middle School 24), which use a 5-shot setting.
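To spell out these selection rules, the sketch below shows one way to compute them: Maj@N takes a majority vote over the final answers extracted from N sampled solutions, while RM@N returns the answer of the single sample the reward model scores highest. The answers and scores are placeholders for illustration.

```python
# Maj@N and RM@N selection over N=8 sampled solutions to one problem.
from collections import Counter

def maj_at_n(answers: list[str]) -> str:
    """Most frequent final answer among the N samples (majority voting)."""
    return Counter(answers).most_common(1)[0][0]

def rm_at_n(answers: list[str], rm_scores: list[float]) -> str:
    """Answer of the sample with the highest reward-model score."""
    best = max(range(len(answers)), key=lambda i: rm_scores[i])
    return answers[best]

answers = ["12", "12", "15", "12", "9", "12", "15", "12"]
rm_scores = [0.3, 0.8, 0.9, 0.4, 0.1, 0.7, 0.2, 0.6]
print(maj_at_n(answers))            # "12" (5 of the 8 samples agree)
print(rm_at_n(answers, rm_scores))  # "15" (the highest-scored sample)
```

Greedy corresponds to a single deterministic decode, so the three numbers together show how much sampling and reranking add on top of one answer.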
The new zen-Math-72B-Instruct model outperforms its predecessor by an average margin of 4.4 points in English and 6.1 points in Chinese, establishing itself as the best open-source mathematical model currently available.
The flagship model, zen-Math-72B-Instruct, significantly outperforms both open-source models and leading closed-source models (e.g., GPT-4o, Gemini Math-Specialized 1.5 Pro). Under the TIR setting with RM@8, it achieves a score of 92.9 on MATH.
With the aid of pre-training and supervised fine-tuning data synthesized by the 72B model, zen-Math-7B-Instruct surpasses the previous 72B instruct model in performance. Under the CoT and TIR settings, it achieves MATH scores of 83.6 and 85.3, respectively.
Even our smallest model, zen-Math-1.5B-Instruct, achieves a MATH score of around 80 when using a Python interpreter, outperforming the majority of current models in this domain.


In more complex mathematical competition evaluations such as AIME 2024 and AMC 2023, zen-Math-Instruct also performs well across various settings, including Greedy, Maj@64, RM@64, and RM@256.
With the support of zen-Math-RM-72B, zen-Math-1.5B-Instruct, using RM@256 in CoT mode, successfully solves 29 out of 40 problems on AMC 2023.
Moreover, zen-Math-72B-Instruct nearly achieves a perfect score in TIR mode, solving almost all the problems.
On the extremely difficult AIME 2024 benchmark, Claude 3 Opus, GPT-4 Turbo, and Gemini 1.5 Pro manage to solve only 1 or 2 questions out of 30.
In contrast, zen-Math-72B-Instruct solves 9 problems in greedy-decoding CoT mode and 12 problems in TIR mode. With the help of the RM, zen-Math-7B-Instruct can even solve up to 21 problems, further demonstrating the outstanding mathematical problem-solving ability of zen-Math-Instruct.

Decontamination
Decontamination is critical to ensuring unbiased model performance evaluation.
Following prior zen work, we exclude potentially contaminated training samples using 13-gram matching. To improve the accuracy of this matching, we first normalize the text, removing irrelevant punctuation and symbols.
To further reduce false positives, particularly on common mathematical expressions, we introduce an additional criterion: the ratio of the longest common subsequence must exceed 0.6 for a sample to be considered contaminated.
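To make the criterion concrete, here is a small sketch of the two-stage check. The whitespace tokenization and the choice of the shorter sequence as the LCS-ratio denominator are our assumptions; the text above does not pin down those details.

```python
# Two-stage contamination check: a shared 13-gram on normalized text
# flags a candidate pair, and an LCS ratio above 0.6 confirms it.
import re

def normalize(text: str) -> list[str]:
    """Lowercase and drop punctuation/symbols, per the normalization step."""
    return re.sub(r"[^\w\s]", " ", text.lower()).split()

def ngrams(tokens: list[str], n: int = 13) -> set[tuple[str, ...]]:
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def lcs_length(a: list[str], b: list[str]) -> int:
    """Longest common subsequence length via a rolling-row DP."""
    dp = [0] * (len(b) + 1)
    for tok in a:
        prev = 0
        for j, other in enumerate(b, 1):
            cur = dp[j]
            dp[j] = prev + 1 if tok == other else max(dp[j], dp[j - 1])
            prev = cur
    return dp[-1]

def is_contaminated(train_text: str, test_text: str) -> bool:
    a, b = normalize(train_text), normalize(test_text)
    if not ngrams(a) & ngrams(b):  # stage 1: any shared 13-gram?
        return False
    return lcs_length(a, b) / min(len(a), len(b)) > 0.6  # stage 2: LCS ratio
```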
For pre-training data, we filter potentially contaminated samples against datasets such as GSM8K and MATH. When dealing with post-training data, including SFT data, RM training data, and the RL query set, we exclude any potentially contaminated problems or solutions across all reported evaluation datasets. These evaluation datasets include GSM8K, MATH, Minerva Math, Gaokao 2023 En, Olympiad Bench, College Math, MMLU STEM, GaoKao, CMATH, CN Middle School 24, AIME 24, and AMC 23.
During the analysis of contaminated samples, we identify that some existing training datasets (e.g., the MATH training dataset) contain a significant proportion of problems that share highly similar concepts or structures with those found in test datasets. Although these variations are not exact duplicates, they could potentially compromise the integrity of our evaluation. Therefore, we continue to exclude such samples from the training corpora.
Demo
We develop a demo based on Qwen-Agent that supports the TIR mode, allowing you to run code locally and experience the Tool-Integrated Reasoning capabilities of zen-Math.
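Outside the demo, the snippet below sketches one way to query an instruct model in plain CoT mode with Hugging Face transformers. The model path is a placeholder to be replaced with the actual repository name, and the system prompt is a common convention for math models rather than a confirmed requirement.

```python
# Minimal CoT inference sketch with transformers; adjust the model path.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "zen/zen-Math-7B-Instruct"  # placeholder hub path
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [
    {"role": "system", "content": "Please reason step by step, and put your final answer within \\boxed{}."},
    {"role": "user", "content": "Find the roots of 3x^2 - 5x - 2 = 0."},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```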

Furthermore, we provide a multimodal math demo on Hugging Face and ModelScope. This WebUI uses zen-VL for OCR and zen-Math for mathematical reasoning, so you can input images, text, or sketches of mathematical and arithmetic problems.
Summary
We introduce zen-Math, which features several key technical highlights:
(1) Extensive use of mathematical data synthesized by zen-Math-72B-Instruct during the pre-training phase.
(2) Iterative generation of fine-tuning data and reinforcement training guided by the reward model during the post-training phase.
(3) Support for bilingual (English and Chinese) queries, along with chain-of-thought and tool-integrated reasoning capabilities.
As a result, zen-Math represents the most advanced open-source math model series to date. The new zen-Math-1.5B-Instruct model already surpasses most previous 70B math models, while the new zen-Math-7B-Instruct matches the performance of the previous 72B instruct model. Our flagship model, zen-Math-72B-Instruct, outperforms its predecessor with an average score increase of 4.7 points across 7 tasks.
We hope that the advances we’ve made with specialized models like zen-Math will continue to strengthen the overall capabilities of the Qwen model and bring us closer to achieving artificial general intelligence.