
Goedel-Prover-V2: New Developments In Efficient Automated Theorem Proving By Self-Correction And Stepwise Data Synthesis

3 main points
✔️ Goedel-Prover-V2 is a new open-source model that achieves high-performance automated theorem proving at small scale
✔️ Proposes a training recipe combining scaffolded data synthesis, verifier-guided self-correction, and model averaging
✔️ Demonstrates its efficiency by surpassing much larger prior models on MiniF2F and PutnamBench

Goedel-Prover-V2: Scaling Formal Theorem Proving with Scaffolded Data Synthesis and Self-Correction
written by Yong Lin, Shange Tang, Bohan Lyu, Ziran Yang, Jui-Hui Chung, Haoyu Zhao, Lai Jiang, Yihan Geng, Jiawei Ge, Jingruo Sun, Jiayun Wu, Jiri Gesi, Ximing Lu, David Acuna, Kaiyu Yang, Hongzhou Lin, Yejin Choi, Danqi Chen, Sanjeev Arora, Chi Jin
(Submitted on 5 Aug 2025)
Comments: 24 pages, 10 figures, 4 tables

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

The images used in this article are from the paper, the introductory slides, or were created based on them.

Overview

This paper proposes a new underlying model for Automated Theorem Proving (ATP), Goedel-Prover-V2.

Previous approaches have required extremely large models (hundreds of billions of parameters) and enormous inference-time computation, which has limited the performance of open-source models.
The authors therefore introduce new training methods and data-generation strategies to achieve efficient, high-performance theorem proving.
Specifically, they incorporate "Scaffolded Data Synthesis," which generates synthetic problems of progressively adjusted difficulty; "Verifier-guided Self-Correction," which leverages feedback from the Lean compiler; and "Model Averaging," which preserves output diversity.

As a result, even a model of only 32B parameters outperforms a conventional 671B model:
it solves 88.1% of the MiniF2F benchmark (90.4% with self-correction) and 86 problems on PutnamBench, setting a new state of the art among open-source models.

This demonstrates that advanced mathematical reasoning can be performed efficiently without relying on vast computational resources.

Proposed Method

The core of the proposed method is the combination of self-correction and stepwise data synthesis with theorem proving driven by long chains of thought.

First, Verifier-guided Self-Correction is a mechanism in which a proof generated by the model is run through the Lean compiler; the model then analyzes the resulting error messages and generates a corrected version.
This allows the model to learn from its mistakes and complete proofs, much as a human would iteratively refine a proof.
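The loop described above can be sketched as follows. This is a minimal illustration of the control flow only: `generate_proof`, `run_lean`, and `revise_proof` are hypothetical stand-ins for the model call and the Lean compiler, mocked here so the example runs end to end.

```python
# Minimal sketch of a verifier-guided self-correction loop.
# All three helper functions are illustrative mocks, not the paper's system.

def run_lean(proof: str) -> tuple[bool, str]:
    """Mock verifier: 'accepts' proofs that use the simp tactic."""
    if "by simp" in proof:
        return True, ""
    return False, "error: tactic 'rfl' failed"

def generate_proof(statement: str) -> str:
    """Mock first attempt by the prover model."""
    return f"theorem t : {statement} := by rfl"

def revise_proof(proof: str, error: str) -> str:
    """Mock revision; a real system conditions the model on the error message."""
    return proof.replace("by rfl", "by simp")

def self_correct(statement: str, max_rounds: int = 2) -> tuple[str, bool]:
    proof = generate_proof(statement)
    for _ in range(max_rounds):
        ok, err = run_lean(proof)
        if ok:
            return proof, True
        proof = revise_proof(proof, err)  # feed compiler feedback back in
    return proof, run_lean(proof)[0]

proof, ok = self_correct("1 + 1 = 2")
print(ok)  # True: the mocked revision succeeds on the second round
```

In the actual system, the revised proof is generated in the same long chain of thought as the original attempt, so the model reasons explicitly about the compiler error before emitting the fix.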

Next, Scaffolded Data Synthesis harvests unsolved subproblems and simpler variants from failed attempts at difficult problems, providing the model with an efficient learning signal.
This enables effective training along a hierarchy of difficulty levels.
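One way to picture this is as building a curriculum from the sub-goals left open in failed attempts. The sketch below is purely illustrative: the data layout and the `difficulty` field are assumptions, not the paper's format.

```python
# Illustrative sketch of scaffolded data synthesis: collect open sub-goals
# from failed proof attempts, deduplicate them, and order easiest-first.

def scaffold(failed_attempts):
    best = {}
    for attempt in failed_attempts:
        for goal in attempt["open_subgoals"]:
            stmt = goal["statement"]
            # keep the lowest difficulty estimate seen for each statement
            if stmt not in best or goal["difficulty"] < best[stmt]["difficulty"]:
                best[stmt] = goal
    return sorted(best.values(), key=lambda g: g["difficulty"])

attempts = [
    {"open_subgoals": [{"statement": "a + 0 = a", "difficulty": 1},
                       {"statement": "a * b = b * a", "difficulty": 3}]},
    {"open_subgoals": [{"statement": "a + 0 = a", "difficulty": 2}]},
]
curriculum = scaffold(attempts)
print([g["statement"] for g in curriculum])  # ['a + 0 = a', 'a * b = b * a']
```

The resulting easier problems give the prover solvable training targets that scaffold its way up to the original hard problem.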

In addition, Model Averaging is introduced: checkpoints from different training stages are combined by weighted averaging, improving performance while preserving output diversity.
Finally, a coherent training pipeline combining SFT (supervised fine-tuning), RL (reinforcement learning), and model averaging was designed, yielding highly efficient and robust theorem-proving models.
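Model averaging itself is mechanically simple: each parameter tensor is a weighted sum of the corresponding tensors across checkpoints. The sketch below uses plain Python dicts of scalars to keep it self-contained; the checkpoint names and weights are illustrative.

```python
# Sketch of model averaging over checkpoints from different training stages.
# Real checkpoints hold tensors; scalars keep this example dependency-free.

def average_checkpoints(checkpoints, weights=None):
    n = len(checkpoints)
    weights = weights or [1.0 / n] * n  # default: uniform average
    return {
        name: sum(w * ckpt[name] for w, ckpt in zip(weights, checkpoints))
        for name in checkpoints[0]
    }

sft_ckpt = {"layer.weight": 1.0}   # hypothetical SFT-stage checkpoint
rl_ckpt  = {"layer.weight": 3.0}   # hypothetical RL-stage checkpoint
merged = average_checkpoints([sft_ckpt, rl_ckpt])
print(merged)  # {'layer.weight': 2.0}
```

Averaging a later, sharper checkpoint with an earlier one is a common way to recover sample diversity that RL training tends to erode.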

Experiments

Evaluations were conducted primarily on MiniF2F and PutnamBench, confirming that the proposed model significantly outperforms prior methods.

On MiniF2F (high-school mathematics up to International Mathematical Olympiad level), Goedel-Prover-V2-32B reached 88.1% at pass@32, and 90.4% in self-correction mode, exceeding DeepSeek-Prover-V2-671B.
Furthermore, on PutnamBench (college-level mathematics competition problems), it solved 86 problems, a significant improvement over the previous best of 47.

Even the 8B model outperformed the 671B model on MiniF2F, strongly demonstrating the method's efficiency.
The effect of self-correction is also quantified: it yields an average improvement of about 2 points at pass@32.
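For readers unfamiliar with the pass@k metric used above: the standard unbiased estimator (introduced by Chen et al., 2021, for code generation, and widely used for theorem proving as well) computes, from n generated samples of which c are correct, the probability that at least one of k randomly drawn samples is correct. This is the usual definition; the paper's exact evaluation script is not shown here.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples generated, c correct,
    probability that at least one of k drawn samples is correct."""
    if n - c < k:
        return 1.0  # every size-k draw must include a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers: if exactly 1 of 64 samples is correct,
# a random draw of 32 contains it half the time.
print(pass_at_k(n=64, c=1, k=32))  # 0.5
```

Note that pass@32 with self-correction counts the correction rounds within the same budget of attempts, so the roughly 2-point gain comes from the feedback loop rather than from extra samples.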

Scaling analysis also confirms the high sample efficiency of the proposed method: it achieves high accuracy with only a small number of generated samples.

These results show that the proposed method enables state-of-the-art theorem-proving capability even in small models, supporting its usefulness as a foundation for future mathematical AI research.


If you have any suggestions for improvement of the content of the article,
please contact the AI-SCHOLAR editorial team through the contact form.
