
How Many Debugging Attempts Does an LLM Stay Effective For? "DDI", a New Metric That Detects the Decay of Debugging Effectiveness
3 main points
✔️ LLM debugging effectiveness is confirmed to drop sharply after only a few attempts
✔️ The proposed DDI is an evaluation metric that models the decay of debugging effectiveness as an exponential function
✔️ A DDI-based regeneration strategy is demonstrated to improve accuracy efficiently
The Debugging Decay Index: Rethinking Debugging Strategies for Code LLMs
written by Muntasir Adnan, Carlos C. N. Kuhn
(Submitted on 23 Jun 2025 (v1), last revised 13 Jul 2025 (this version, v2))
Comments: Published on arXiv.
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Summary
This paper focuses on the "debugging decay phenomenon," in which the debugging capability of LLMs in code generation declines rapidly over repeated attempts, and proposes a new metric, the Debugging Decay Index (DDI), to evaluate this decay quantitatively.
Conventionally, code generation by LLMs has been evaluated with static metrics such as pass@k, which scores the result of a single generation.
This study instead focuses on "sequential debugging," which mirrors the actual development process, and models the exponential decay of its effectiveness; DDI combines initial performance (E₀), decay rate (λ), strategic intervention timing (tθ), and goodness of fit (R²) to evaluate a model's code generation and debugging capability from multiple perspectives.
Experimental results also show that accuracy improves significantly when a "fresh start" (regeneration) is performed once the decay reaches a certain threshold.
Proposed Method
The proposed method, DDI, is a mathematical model for quantitatively evaluating the sequential debugging capability of an LLM.
First, the effect of each debugging attempt is normalized and its change is modeled as an exponential decay function E(t) = E₀e^(-λt), where E₀ is the initial debugging success rate, λ is the decay rate, and t is the number of debugging attempts.
The number of attempts tθ required to reach a given effectiveness-decay threshold θ is then computed as tθ = ln(100 / (100 - θ)) / λ and serves as the criterion for strategically stopping debugging and regenerating. The DDI output is the four-tuple (E₀, λ, tθ, R²), whose components represent a model's initial performance, debugging persistence, optimal regeneration timing, and the goodness of fit of the decay model, respectively.
The method is designed not only to visualize how an LLM improves during the debugging process and where it reaches its limits, but also to improve overall accuracy by regenerating while potential for improvement still remains.
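As a concrete illustration, the DDI quantities above can be estimated from a sequence of per-attempt effectiveness measurements. The sketch below fits E(t) = E₀e^(-λt) by linear regression on the log scale and then computes tθ from the paper's formula; the effectiveness numbers are invented for illustration and are not taken from the paper.

```python
import numpy as np

# Hypothetical normalized effectiveness of each sequential debugging attempt
# (illustrative values only, not measurements from the paper).
attempts = np.arange(6)                                 # t = 0 .. 5
effectiveness = np.array([0.60, 0.33, 0.19, 0.11, 0.06, 0.03])

# Fit E(t) = E0 * exp(-lambda * t) via linear regression on the log scale:
# ln E(t) = ln E0 - lambda * t
slope, intercept = np.polyfit(attempts, np.log(effectiveness), 1)
lam = -slope
E0 = np.exp(intercept)

# Goodness of fit (R^2) of the decay model on the original scale
pred = E0 * np.exp(-lam * attempts)
ss_res = np.sum((effectiveness - pred) ** 2)
ss_tot = np.sum((effectiveness - effectiveness.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# Attempts until the effect has decayed by theta percent:
# t_theta = ln(100 / (100 - theta)) / lambda
theta = 90.0
t_theta = np.log(100.0 / (100.0 - theta)) / lam

print(f"E0={E0:.2f}, lambda={lam:.2f}, t_theta={t_theta:.1f}, R2={r2:.3f}")
```

With these toy numbers the fit gives λ ≈ 0.59, so roughly four debugging attempts exhaust 90% of the achievable effect, after which regeneration would be triggered.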
Experiments
In this study, we applied DDI to 18 state-of-the-art LLMs on the HumanEval dataset and analyzed the decay characteristics of their debugging capability.
For each model, the initial success rate (E₀), decay rate (λ), strategic regeneration timing (tθ), and goodness of fit to the exponential-decay model (R²) were calculated and compared across models.
We also tested the effectiveness of a "fresh start" (regeneration) strategy triggered at tθ against the traditional continuous-debugging strategy.
The results showed that regeneration improved the accuracy of every model, most noticeably llama3.1:8b (from 72.6% to 82.8%) and deepseek-coder-v2:16b (from 84.1% to 92.1%).
Thus, it is clear that strategic intervention is more efficient than simply increasing the number of trials.
The differences in λ and R² across models also suggest that there are model-specific trends in debugging persistence and response patterns.
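The fresh-start strategy can be sketched as a simple control loop: debug a candidate for at most tθ rounds, then discard it and regenerate. The `generate`, `debug`, and `passes` callables below are hypothetical stand-ins for LLM code generation, LLM-driven repair, and test execution; this is a reconstruction under those assumptions, not the paper's exact implementation.

```python
def debug_with_fresh_starts(generate, debug, passes, t_theta, budget):
    """DDI-guided sketch: debug each candidate for at most t_theta rounds,
    then regenerate from scratch while the attempt budget lasts."""
    attempts_used = 0
    while attempts_used < budget:
        code = generate()                  # fresh start
        if passes(code):
            return code
        for _ in range(t_theta):           # debug only while effect remains
            code = debug(code)             # one sequential debugging round
            attempts_used += 1
            if passes(code):
                return code
            if attempts_used >= budget:
                break
    return None                            # budget exhausted without success
```

The key design point matches the paper's finding: capping the inner loop at tθ spends the fixed attempt budget on fresh candidates instead of on debugging rounds whose expected effect has already decayed to near zero.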