Enhanced LLM Code Generation With Property-Based Testing! New Framework PGS Breaks Self-Deception

3 main points
✔️ Detects and corrects LLM code generation errors with high accuracy using property-based testing (PBT)
✔️ The proposed PGS framework coordinates two LLM agents, one responsible for code generation and one for verification, in an iterative loop
✔️ In experiments, PGS achieved up to a 37.3% higher correct-response rate than conventional methods, proving especially effective on difficult problems

Use Property-Based Testing to Bridge LLM Code Generation and Validation
written by Lehan He, Zeren Chen, Zhe Zhang, Jing Shao, Xiang Gao, Lu Sheng
(Submitted on 23 Jun 2025)
Comments: Published on arXiv.
Subjects:  Software Engineering (cs.SE); Artificial Intelligence (cs.AI)

The images used in this article are from the paper, the introductory slides, or were created based on them.

Summary

LLMs are widely used to automatically generate code from natural-language problem statements, but it is still difficult to guarantee that the output will function correctly. Traditional test-driven development (TDD) attempts to verify code correctness against input/output examples, but this approach suffers from a shortage of high-quality test cases or, conversely, from models being misled by incorrect output examples.

In this paper, the authors propose a new framework, Property-Generated Solver (PGS), which leverages Property-Based Testing (PBT) to solve this problem. A "Generator," in charge of code generation, and a "Tester," in charge of defining and verifying properties, coordinate to iteratively generate and refine accurate, generalizable code.

The advantage of using PBT is that verification is based on more abstract, intrinsic program properties rather than concrete outputs, making it easier to break out of the "cycle of self-deception." Experimental results show that PGS achieves up to 37.3% higher correct-response rates than traditional TDD methods.

Proposed Method

PGS is a framework for code generation and repair with PBT at its core. In this methodology, two independent LLM-based agents, a "Generator" and a "Tester," work in concert.

First, the Generator produces initial code from the natural-language specification. In parallel, the Tester derives abstract properties from the problem statement (e.g., "outputs should be in ascending order" or "the product of the outputs equals the original input"), which are then converted into verification code. The PBT machinery then automatically generates diverse, specification-compliant test inputs to exercise the code.
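The two example properties above suggest a task like integer factorization. As a minimal sketch (not the paper's implementation), the following uses the Python PBT library Hypothesis to check both properties against a hypothetical Generator-produced `factorize` function:

```python
# Minimal PBT sketch with Hypothesis. The task and the factorize()
# function are hypothetical stand-ins for Generator output; the two
# properties mirror the examples quoted above.
import math
from hypothesis import given, strategies as st

def factorize(n: int) -> list[int]:
    """Hypothetical Generator output: prime factors of n, smallest first."""
    factors, d = [], 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors

# Property 1: outputs should be in ascending order.
@given(st.integers(min_value=2, max_value=10**6))
def test_output_is_sorted(n):
    out = factorize(n)
    assert out == sorted(out)

# Property 2: the product of the outputs equals the original input.
@given(st.integers(min_value=2, max_value=10**6))
def test_product_reconstructs_input(n):
    assert math.prod(factorize(n)) == n
```

Note that a PBT engine like Hypothesis also shrinks any failing input toward a minimal counterexample, which dovetails with the feedback-selection step described next.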

If any property is violated, the Tester selects the most concise and suggestive failing case and returns detailed feedback to the Generator, which repairs the code and tests it again, repeating the cycle up to five times, as sketched below.
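Schematically, the overall cycle might look like the following; all four callables are hypothetical stand-ins for the two agents' roles, since the paper does not prescribe this exact interface:

```python
from typing import Callable

MAX_ITERATIONS = 5  # the paper repeats the generate-verify-repair cycle up to five times

def pgs_loop(
    problem: str,
    generate: Callable[[str], str],            # Generator: spec -> initial code
    run_pbt: Callable[[str], list[str]],       # Tester: code -> property-violation reports
    pick_minimal: Callable[[list[str]], str],  # choose the most concise failing case
    repair: Callable[[str, str, str], str],    # Generator: (spec, code, feedback) -> fixed code
) -> str:
    """Schematic PGS cycle (hypothetical interface, not the paper's API)."""
    code = generate(problem)
    for _ in range(MAX_ITERATIONS):
        failures = run_pbt(code)
        if not failures:
            return code  # all properties hold; accept the code
        feedback = pick_minimal(failures)
        code = repair(problem, code, feedback)
    return code  # best effort once the iteration budget is exhausted
```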

In this way, PGS avoids the "misguidance based on wrong test examples" seen in traditional TDD and focuses on verification based on abstract properties derived from the specification, resulting in more robust code generation.

Experimentation

To validate the effectiveness of PGS, the authors evaluated it on three code generation benchmarks (HumanEval, MBPP, and LiveCodeBench), comparing it against traditional TDD methods and state-of-the-art debugging-assistance methods. Three LLMs of differing capability (DeepSeek-Coder-V2, Qwen2.5-Coder, and DeepSeek-R1-Distilled-32B) were used for validation.

The evaluation metrics were the percentage of problems whose first generation passed all tests (pass@1) and the percentage of initially incorrect code that was successfully repaired (Repair Success Rate, RSR). The results show that PGS performs best across all models and benchmarks, achieving average absolute improvements of 9.2% in pass@1 and 15.7% in RSR.
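For concreteness, the two metrics can be computed as in the sketch below; the `Result` record is hypothetical bookkeeping for illustration, not taken from the paper:

```python
# Hypothetical bookkeeping for the two evaluation metrics described above.
from dataclasses import dataclass

@dataclass
class Result:
    first_try_passed: bool  # did the initial generation pass all hidden tests?
    repaired: bool          # if not, did the repair loop eventually fix it?

def pass_at_1(results: list[Result]) -> float:
    """Fraction of problems solved by the very first generation."""
    return sum(r.first_try_passed for r in results) / len(results)

def repair_success_rate(results: list[Result]) -> float:
    """Among initially failing problems, the fraction the loop repaired."""
    failed = [r for r in results if not r.first_try_passed]
    return sum(r.repaired for r in failed) / len(failed) if failed else 0.0
```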

The study also revealed that the most effective feedback strategy is to choose the "shortest, briefest failing input," which increased the success rate of model repairs. In addition, the finding that LLMs are better at generating verification properties than complete code is consistent with PGS's property-driven approach to code repair.

