
Goedel-Prover-V2: New Developments In Efficient Automated Theorem Proving By Self-Correction And Stepwise Data Synthesis

3 main points
✔️ Goedel-Prover-V2 is a new open-source model that achieves high-performance automated theorem proving at small scale
✔️ Proposes a training recipe combining scaffolded data synthesis, verifier-guided self-correction, and model averaging
✔️ Demonstrates its efficiency by surpassing much larger prior models on MiniF2F and PutnamBench

Goedel-Prover-V2: Scaling Formal Theorem Proving with Scaffolded Data Synthesis and Self-Correction
written by Yong Lin, Shange Tang, Bohan Lyu, Ziran Yang, Jui-Hui Chung, Haoyu Zhao, Lai Jiang, Yihan Geng, Jiawei Ge, Jingruo Sun, Jiayun Wu, Jiri Gesi, Ximing Lu, David Acuna, Kaiyu Yang, Hongzhou Lin, Yejin Choi, Danqi Chen, Sanjeev Arora, Chi Jin
(Submitted on 5 Aug 2025)
Comments: 24 pages, 10 figures, 4 tables

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

The images used in this article are from the paper, the introductory slides, or were created based on them.

Overview

This paper proposes a new underlying model for Automated Theorem Proving (ATP), Goedel-Prover-V2.

Previous approaches have required extremely large models (hundreds of billions of parameters) and enormous inference-time computation, which has limited the performance of open-source models.
The authors therefore introduce new training methods and data-generation strategies to achieve efficient, high-performance theorem proving.
Specifically, they incorporate "Scaffolded Data Synthesis," which generates synthetic problems of progressively adjusted difficulty; "Verifier-guided Self-Correction," which leverages feedback from the Lean compiler; and "Model Averaging," which preserves output diversity.

As a result, even a model of only 32B parameters outperforms a conventional 671B model:
it solves 88.1% of the MiniF2F benchmark (90.4% with self-correction) and 86 problems on PutnamBench, setting a new state of the art among open-source models.

This demonstrates that advanced mathematical reasoning can be performed efficiently without relying on vast computational resources.

Proposed Method

The core of the proposed method is the combination of self-correction and stepwise data synthesis with theorem proving driven by long chains of thought.

First, Verifier-guided Self-Correction is a mechanism in which a proof generated by the model is run through the Lean compiler; the model then analyzes the resulting error messages and generates a corrected version.
This allows the model to learn from its mistakes and complete proofs, much as a human would iteratively refine a proof.
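The loop described above can be sketched as follows. This is a minimal illustration of the control flow only: `generate_proof`, `run_lean`, and `revise_proof` are hypothetical stand-ins for the model call and the Lean compiler, mocked here so the example runs end to end.

```python
# Minimal sketch of a verifier-guided self-correction loop.
# All three helper functions are illustrative mocks, not the paper's system.

def run_lean(proof: str) -> tuple[bool, str]:
    """Mock verifier: 'accepts' proofs that use the simp tactic."""
    if "by simp" in proof:
        return True, ""
    return False, "error: tactic 'rfl' failed"

def generate_proof(statement: str) -> str:
    """Mock first attempt by the prover model."""
    return f"theorem t : {statement} := by rfl"

def revise_proof(proof: str, error: str) -> str:
    """Mock revision; a real system conditions the model on the error message."""
    return proof.replace("by rfl", "by simp")

def self_correct(statement: str, max_rounds: int = 2) -> tuple[str, bool]:
    proof = generate_proof(statement)
    for _ in range(max_rounds):
        ok, err = run_lean(proof)
        if ok:
            return proof, True
        proof = revise_proof(proof, err)  # feed compiler feedback back in
    return proof, run_lean(proof)[0]

proof, ok = self_correct("1 + 1 = 2")
print(ok)  # True: the mocked revision succeeds on the second round
```

In the actual system, the revised proof is generated in the same long chain of thought as the original attempt, so the model reasons explicitly about the compiler error before emitting the fix.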

Next, Scaffolded Data Synthesis harvests unsolved subproblems and simpler variants from failed attempts at difficult problems, providing the model with an efficient learning signal.
This enables effective training along a hierarchy of difficulty levels.
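One way to picture this is as building a curriculum from the sub-goals left open in failed attempts. The sketch below is purely illustrative: the data layout and the `difficulty` field are assumptions, not the paper's format.

```python
# Illustrative sketch of scaffolded data synthesis: collect open sub-goals
# from failed proof attempts, deduplicate them, and order easiest-first.

def scaffold(failed_attempts):
    best = {}
    for attempt in failed_attempts:
        for goal in attempt["open_subgoals"]:
            stmt = goal["statement"]
            # keep the lowest difficulty estimate seen for each statement
            if stmt not in best or goal["difficulty"] < best[stmt]["difficulty"]:
                best[stmt] = goal
    return sorted(best.values(), key=lambda g: g["difficulty"])

attempts = [
    {"open_subgoals": [{"statement": "a + 0 = a", "difficulty": 1},
                       {"statement": "a * b = b * a", "difficulty": 3}]},
    {"open_subgoals": [{"statement": "a + 0 = a", "difficulty": 2}]},
]
curriculum = scaffold(attempts)
print([g["statement"] for g in curriculum])  # ['a + 0 = a', 'a * b = b * a']
```

The resulting easier problems give the prover solvable training targets that scaffold its way up to the original hard problem.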

In addition, Model Averaging is introduced: checkpoints from different training stages are combined by weighted averaging, improving performance while preserving output diversity.
Finally, a coherent training pipeline combining SFT (supervised fine-tuning), RL (reinforcement learning), and model averaging was designed, yielding highly efficient and robust theorem-proving models.
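Model averaging itself is mechanically simple: each parameter tensor is a weighted sum of the corresponding tensors across checkpoints. The sketch below uses plain Python dicts of scalars to keep it self-contained; the checkpoint names and weights are illustrative.

```python
# Sketch of model averaging over checkpoints from different training stages.
# Real checkpoints hold tensors; scalars keep this example dependency-free.

def average_checkpoints(checkpoints, weights=None):
    n = len(checkpoints)
    weights = weights or [1.0 / n] * n  # default: uniform average
    return {
        name: sum(w * ckpt[name] for w, ckpt in zip(weights, checkpoints))
        for name in checkpoints[0]
    }

sft_ckpt = {"layer.weight": 1.0}   # hypothetical SFT-stage checkpoint
rl_ckpt  = {"layer.weight": 3.0}   # hypothetical RL-stage checkpoint
merged = average_checkpoints([sft_ckpt, rl_ckpt])
print(merged)  # {'layer.weight': 2.0}
```

Averaging a later, sharper checkpoint with an earlier one is a common way to recover sample diversity that RL training tends to erode.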

Experiments

Evaluations were conducted primarily on MiniF2F and PutnamBench, confirming that the proposed model significantly outperforms prior methods.

On MiniF2F (high-school mathematics up to International Mathematical Olympiad level), Goedel-Prover-V2-32B reached 88.1% at pass@32, and 90.4% in self-correction mode, exceeding DeepSeek-Prover-V2-671B.
Furthermore, on PutnamBench (college-level mathematics competition problems), it solved 86 problems, a significant improvement over the previous best of 47.

Even the 8B model outperformed the 671B model on MiniF2F, strongly demonstrating the method's efficiency.
The effect of self-correction is also quantified: it yields an average improvement of about 2 points at pass@32.
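For readers unfamiliar with the pass@k metric used above: the standard unbiased estimator (introduced by Chen et al., 2021, for code generation, and widely used for theorem proving as well) computes, from n generated samples of which c are correct, the probability that at least one of k randomly drawn samples is correct. This is the usual definition; the paper's exact evaluation script is not shown here.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples generated, c correct,
    probability that at least one of k drawn samples is correct."""
    if n - c < k:
        return 1.0  # every size-k draw must include a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers: if exactly 1 of 64 samples is correct,
# a random draw of 32 contains it half the time.
print(pass_at_k(n=64, c=1, k=32))  # 0.5
```

Note that pass@32 with self-correction counts the correction rounds within the same budget of attempts, so the roughly 2-point gain comes from the feedback loop rather than from extra samples.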

Scaling analysis also confirms the high sample efficiency of the proposed method: it achieves high accuracy with only a small number of generated samples.

These results show that the proposed method enables state-of-the-art theorem-proving capability even in small models, supporting its usefulness as a foundation for future mathematical AI research.


If you have any suggestions for improvement of the content of the article,
please contact the AI-SCHOLAR editorial team through the contact form.
