
ToolTrain: A New Method For Repository Deep Search And Issue Localization With LLM

3 main points
✔️ Proposed ToolTrain to enable LLMs to effectively use repository search tools
✔️ Combined SFT with rejection sampling and reinforcement learning to improve tool invocation and inference accuracy
✔️ Achieved accuracy surpassing Claude-3.7 in experiments, improving both issue localization and fix success rates

Tool-integrated Reinforcement Learning for Repo Deep Search
written by Zexiong Ma, Chao Peng, Qunhong Zeng, Pengfei Gao, Yanzhen Zou, Bing Xie
(Submitted on 5 Aug 2025 (v1), last revised 6 Aug 2025 (this version, v2))
Comments: Published on arxiv.
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)

The images used in this article are from the paper, the introductory slides, or were created based on them.

Summary

This paper focuses on the issue of "Issue Localization" in software development.

Issue Localization is the process of identifying code in a repository that needs to be fixed, based on bug reports written in natural language.
This process is time-consuming and labor-intensive for large code bases, and can significantly reduce development efficiency.

In recent years, LLMs have achieved success in code generation and test generation, and automation by LLM agents combined with repository search tools is anticipated.
However, this requires a complex search process called "Repo Deep Search," which demands multi-step reasoning and advanced tool-invocation capabilities from LLMs.

Existing LLMs, however, suffer from inaccuracy caused by tool-invocation errors and inconsistent reasoning.
The authors therefore propose a new tool-integrated training framework called "ToolTrain," which enables LLMs to explore repositories while using tools efficiently.

Proposed Methodology

The proposed method, ToolTrain, consists of a two-stage learning process.

The first stage is "Rejection-sampled Supervised Fine-Tuning".
Here, among the search trajectories the LLM generates using the tools, only the high-quality trajectories that reach the correct code locations are selected as training data.
Through this, the model learns the basic format of the repository search task and how to invoke the tools.
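The selection step described above can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the `Trajectory` class and function names are assumptions, and quality is judged simply by whether any predicted location matches the ground truth.

```python
# Hypothetical sketch of rejection-sampled SFT data selection: keep only
# trajectories whose final answer hits a ground-truth code location.
# Trajectory, is_high_quality, and collect_sft_data are illustrative names.
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    steps: list                               # reasoning / tool-call records
    predicted: list                           # code locations the LLM returned
    gold: list = field(default_factory=list)  # ground-truth locations

def is_high_quality(traj: Trajectory) -> bool:
    # Accept a trajectory only if it reaches at least one correct location.
    return any(loc in traj.gold for loc in traj.predicted)

def collect_sft_data(trajectories):
    # Rejection sampling: discard trajectories that missed the target.
    return [t for t in trajectories if is_high_quality(t)]
```

Only the surviving trajectories are used for supervised fine-tuning, so the model imitates successful tool-use behavior rather than failed searches.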

The second stage is Tool-integrated Reinforcement Learning.
Here, the LLM performs trial-and-error search, and the outcome serves as the reward signal. The reward evaluates both whether the correct code locations were found and whether they are ranked appropriately.
This allows the model to avoid incorrect tool calls and to explore more efficiently and strategically.
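A reward of this kind might combine a hit-rate term with a ranking term, for example recall plus a reciprocal-rank bonus. The sketch below is an assumption for illustration; the paper's exact reward formula and weighting may differ.

```python
# Illustrative localization reward: equal-weight mix of recall over the
# gold locations and the reciprocal rank of the first correct prediction.
# The actual ToolTrain reward design may differ from this sketch.
def localization_reward(ranked_preds: list, gold: list) -> float:
    if not gold:
        return 0.0
    # Coverage term: fraction of ground-truth locations recovered.
    recall = sum(1 for g in gold if g in ranked_preds) / len(gold)
    # Ranking term: reciprocal rank of the first correct prediction.
    rr = 0.0
    for i, pred in enumerate(ranked_preds, start=1):
        if pred in gold:
            rr = 1.0 / i
            break
    return 0.5 * recall + 0.5 * rr
```

Because the ranking term rewards placing correct locations early, the policy is pushed toward precise, well-ordered answers rather than long unfocused candidate lists.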

In addition, the authors designed a lightweight search agent called "RepoSearcher" that provides a suite of tools (file structure retrieval, function search, class search, etc.).
With this design, the LLM avoids redundant searches and achieves highly accurate localization through step-by-step reasoning.
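Two of the tools in such a suite can be sketched as follows. This is a minimal approximation under assumed names and signatures; the paper's actual RepoSearcher tools may be implemented differently (e.g., index-backed rather than walking the tree on every call).

```python
# Minimal sketch of RepoSearcher-style tools: file-structure retrieval and
# function search. Tool names and signatures here are assumptions.
import ast
import os

def get_file_structure(repo_root: str) -> list:
    """File-structure retrieval: list Python files under the repository."""
    paths = []
    for dirpath, _, files in os.walk(repo_root):
        for name in files:
            if name.endswith(".py"):
                paths.append(os.path.join(dirpath, name))
    return sorted(paths)

def search_function(repo_root: str, name: str) -> list:
    """Function search: return (file, line) pairs defining `name`."""
    hits = []
    for path in get_file_structure(repo_root):
        try:
            with open(path, encoding="utf-8") as f:
                tree = ast.parse(f.read())
        except (SyntaxError, UnicodeDecodeError):
            continue  # skip files that fail to parse
        for node in ast.walk(tree):
            if isinstance(node, ast.FunctionDef) and node.name == name:
                hits.append((path, node.lineno))
    return hits
```

Exposing a few narrow, structured tools like these keeps each LLM call focused on one retrieval step, which is what enables the step-by-step search behavior described above.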

Experiments

To validate the effectiveness of the proposed method, the authors conducted experiments using SWE-Bench-Verified, an evaluation dataset based on real GitHub issues.

Issue localization accuracy is evaluated at both the file and function level using multidimensional ranking metrics such as Recall@k, MAP, MRR, and nDCG@5.
Comparisons include existing frameworks such as Agentless, LocAgent, and CoSIL, as well as commercial models such as GPT-4o and Claude-3.7.
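For reference, two of the reported metrics, Recall@k and MRR, are computed as below. These are the standard definitions applied to ranked localization outputs, not code from the paper.

```python
# Standard definitions of Recall@k and MRR for ranked localization results.
def recall_at_k(ranked: list, gold: list, k: int) -> float:
    """Fraction of gold locations that appear in the top-k predictions."""
    if not gold:
        return 0.0
    top = ranked[:k]
    return sum(1 for g in gold if g in top) / len(gold)

def mrr(ranked_lists: list, gold_lists: list) -> float:
    """Mean reciprocal rank of the first correct prediction per instance."""
    total = 0.0
    for ranked, gold in zip(ranked_lists, gold_lists):
        for i, pred in enumerate(ranked, start=1):
            if pred in gold:
                total += 1.0 / i
                break
    return total / len(ranked_lists)
```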

As a result, models trained with ToolTrain consistently show higher accuracy than similarly sized LLMs, and outperform Claude-3.7, especially in function-level localization.
The ToolTrain-7B model also showed better accuracy than other frameworks using 32B-scale models, demonstrating that effective reinforcement learning can improve performance even for small-scale models.

Furthermore, it was confirmed that improved localization accuracy translates directly into improved issue resolution (bug-fix success rate). In particular, ToolTrain-32B achieved 68.55 function-level Recall@5 and, combined with the patch-generation model, recorded the highest fix success rate of 31.6%.

These results demonstrate that ToolTrain is an effective way to dramatically improve LLM's repository exploration capabilities.


If you have any suggestions for improvement of the content of the article,
please contact the AI-SCHOLAR editorial team through the contact form.

Contact Us