LLM4Decompile, A Large-scale Language Model Specialized For Decompiling
3 main points
✔️ Develops LLM4Decompile, the first open-source model dedicated to decompilation
✔️ Introduces new learning objectives into the model to improve decompilation accuracy
✔️ Builds the first standardized benchmark for decompilation, focused on recompilability and re-executability
LLM4Decompile: Decompiling Binary Code with Large Language Models
written by Hanzhuo Tan, Qi Luo, Jing Li, Yuqun Zhang
(Submitted on 8 Mar 2024)
Comments: Published on arXiv.
Subjects: Programming Languages (cs.PL); Computation and Language (cs.CL)
code:
The images used in this article are from the paper, the introductory slides, or were created based on them.
Summary
Decompilation is the technique of converting already-compiled machine code or bytecode back into a high-level programming language. It is used to analyze the internal workings of software, especially when the source code is not available. Dedicated tools such as Ghidra and IDA Pro have been developed for this purpose, but they struggle to produce code in a human-readable form. The greatest challenge of decompilation is to fully reconstruct the structure of the original code, such as variable names, loops, and conditional statements, which are largely lost during compilation.
Recent advances in large-scale language models (LLMs) have drawn attention as a way to address this problem. These models treat programming languages as a kind of language system and use pre-training to tackle a variety of coding tasks. The results far outperform conventional methods and suggest that a similar approach is possible in the area of decompilation.
However, until now there have been few standard benchmarks or publicly available datasets for evaluating and comparing decompilation techniques. Researchers have used different datasets, making direct comparison of results difficult. This paper therefore develops an open-source large-scale language model dedicated to decompilation and builds the first decompilation benchmark focused on recompilability and re-executability. This is expected to unify evaluation criteria in the field of decompilation and facilitate further research. The figure below shows the steps involved in the paper's decompilation evaluation.
What is LLM4Decompile?
LLM4Decompile is a groundbreaking effort dedicated to decompiling programs. To build the pre-training data, the authors start from AnghaBench, a collection of one million publicly available, compilable C files, and use this rich dataset to create pairs of assembly code and source code. Specifically, each source file is first compiled into a binary object file on the x86 Linux platform, then disassembled into assembly code and paired with its source. The pipeline also accounts for the compiler optimization flags that programmers use to improve execution performance. Optimization transforms source code into faster, more efficient machine code, and the source is compiled at levels ranging from O0 (no optimization, the default) to O3 (aggressive optimization). Throughout this process, a dedicated prompt tells the model which optimization level was used:
"This is assembly code with [optimization state] optimization: [assembly code]. What is the source code?"
In this way, LLM4Decompile gains a deeper understanding of how assembly relates to source code and lays the groundwork for more accurate decompilation.
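As a rough illustration of this data-construction pipeline, the sketch below compiles a C file at each optimization level, disassembles the resulting object file, and pairs the assembly with the original source using the prompt template above. This is a minimal sketch under stated assumptions (gcc and objdump on an x86-64 Linux toolchain, a hypothetical `build_pairs` helper); the paper's actual preprocessing and filtering steps are omitted.

```python
import os
import subprocess
import tempfile

OPT_LEVELS = ["O0", "O1", "O2", "O3"]

PROMPT_TEMPLATE = (
    "This is assembly code with {opt} optimization: {asm}. What is the source code?"
)

def build_pairs(c_file: str) -> list[dict]:
    """Compile one C file at every optimization level, disassemble the object
    file, and pair the assembly text with the original source code."""
    with open(c_file) as f:
        source = f.read()

    pairs = []
    with tempfile.TemporaryDirectory() as tmp:
        for opt in OPT_LEVELS:
            obj_path = os.path.join(tmp, f"out_{opt}.o")
            # Compile to an object file (x86-64 Linux toolchain assumed).
            subprocess.run(["gcc", f"-{opt}", "-c", c_file, "-o", obj_path], check=True)
            # Disassemble the object file back into assembly text.
            asm = subprocess.run(
                ["objdump", "-d", obj_path],
                capture_output=True, text=True, check=True,
            ).stdout
            pairs.append({"prompt": PROMPT_TEMPLATE.format(opt=opt, asm=asm),
                          "completion": source})
    return pairs
```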
Next, LLM4Decompile's model setup uses the same architecture as DeepSeek-Coder and initializes the model from the corresponding DeepSeek-Coder checkpoints. The training objectives fall into two categories.
The first objective is next token prediction (NTP), which predicts the next token based on the given input. This approach plays a central role in the pre-training of most large-scale language models and aims to minimize the negative log-likelihood of the true next token, refining the model's parameters so that its predictions from the input sequence become more accurate.
The second objective is sequence-to-sequence (S2S) prediction, which predicts the expected output for an input sequence. This is the approach typically employed in neural machine translation models and focuses on minimizing the negative log-likelihood of the tokens in the C code: only the loss over the output sequence is computed, encouraging more accurate translations.
The main difference between these two objectives lies in how the input assembly affects the training loss: NTP computes the loss over all tokens, including the assembly input, whereas S2S computes it only over the output sequence. The paper performs ablation studies to identify how these objectives contribute to decompilation accuracy.
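The distinction can be made concrete with a small label-masking sketch, assuming the common Hugging Face convention that label positions set to -100 are ignored by the cross-entropy loss; the function name and tensor layout here are illustrative, not the paper's code.

```python
import torch

IGNORE_INDEX = -100  # positions with this label are excluded from the cross-entropy loss

def make_labels(input_ids: torch.Tensor, prompt_len: int, objective: str) -> torch.Tensor:
    """Build labels for a sequence laid out as [assembly prompt | C source output].

    "ntp": loss over every token, assembly included.
    "s2s": loss only over the C-source tokens; the assembly prompt is masked out.
    """
    labels = input_ids.clone()
    if objective == "s2s":
        labels[:prompt_len] = IGNORE_INDEX  # mask the assembly portion
    return labels

# Toy example: an 8-token sequence whose first 5 tokens are the assembly prompt.
ids = torch.arange(8)
print(make_labels(ids, prompt_len=5, objective="ntp"))  # all positions contribute to the loss
print(make_labels(ids, prompt_len=5, objective="s2s"))  # first 5 positions masked to -100
```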
Experimental results
The table below, which summarizes the results of the study, reveals several interesting findings. The base version of DeepSeek-Coder struggles to decompile binaries: its output sometimes compiles, but it rarely captures the meaning of the original program accurately. After fine-tuning, however, the LLM4Decompile models show a significant improvement in binary decompilation capability. In fact, 90% of the generated code compiles, suggesting a deeper understanding of code structure and syntax.
Notably, the 6B version of LLM4Decompile shows a clear advantage over the 1B version in its ability to produce executable code: 21% of the code decompiled by the 6B version accurately captures the semantics of the program and passes all test cases, compared to only 10% for the 1B version. This improvement underscores the benefit of a larger model size for capturing program semantics. On the other hand, increasing the model size to 33B yields only a small improvement in re-executability of less than one percentage point, which may reflect the difficulty of tuning models at that scale.
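Concretely, the two benchmark criteria, recompilability and re-executability, can be thought of along the lines of the sketch below: compile the decompiled function in isolation, then link it against a test harness and run its assertions. The gcc-based commands and the `evaluate_decompiled` helper are illustrative assumptions, not the paper's actual evaluation code.

```python
import os
import subprocess
import tempfile

def evaluate_decompiled(c_source: str, test_harness: str) -> dict:
    """Check a decompiled function for recompilability and re-executability.

    `test_harness` is assumed to be a C file whose main() calls the decompiled
    function and returns non-zero if any assertion fails (hypothetical harness).
    """
    results = {"recompilable": False, "re_executable": False}
    with tempfile.TemporaryDirectory() as tmp:
        src_path = os.path.join(tmp, "func.c")
        with open(src_path, "w") as f:
            f.write(c_source)

        # 1) Recompilability: does the decompiled code compile on its own?
        obj_path = os.path.join(tmp, "func.o")
        if subprocess.run(["gcc", "-c", src_path, "-o", obj_path],
                          capture_output=True).returncode != 0:
            return results
        results["recompilable"] = True

        # 2) Re-executability: link against the test harness and run the tests.
        test_path = os.path.join(tmp, "test.c")
        with open(test_path, "w") as f:
            f.write(test_harness)
        bin_path = os.path.join(tmp, "test_bin")
        if subprocess.run(["gcc", test_path, obj_path, "-o", bin_path],
                          capture_output=True).returncode == 0:
            try:
                run = subprocess.run([bin_path], capture_output=True, timeout=10)
                results["re_executable"] = (run.returncode == 0)
            except subprocess.TimeoutExpired:
                pass
    return results
```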
The table below, which summarizes the results on AnghaBench, shows that LLM4Decompile achieves particularly high BLEU and ES scores; the 6B model reaches a BLEU score of 0.82, meaning its output is nearly identical to the original source code. This surprisingly strong performance suggests significant data leakage within the test set: realistically, decompiled code with normalized variable names could not achieve such high N-gram/ES scores. This anomaly, like the high BLEU and ES scores reported in previous studies, highlights the importance of establishing an independent and reliable benchmark for evaluating decompilation.
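For reference, BLEU and ES are purely surface-similarity metrics, which is why near-perfect scores point to memorization or leakage rather than genuine decompilation quality. The sketch below shows one way such scores might be computed; whitespace tokenization and difflib's ratio are simplifying assumptions, and the paper's exact ES definition may differ.

```python
from difflib import SequenceMatcher
from nltk.translate.bleu_score import sentence_bleu

def surface_similarity(reference_src: str, decompiled_src: str) -> dict:
    """Token-level BLEU and a character-level similarity ratio between the
    original source and the decompiled output. Both compare surface form only."""
    ref_tokens = reference_src.split()
    hyp_tokens = decompiled_src.split()
    bleu = sentence_bleu([ref_tokens], hyp_tokens)
    # Edit-similarity proxy; the paper's ES metric may be defined differently.
    es = SequenceMatcher(None, reference_src, decompiled_src).ratio()
    return {"bleu": bleu, "es": es}
```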
The sequence-to-sequence (S2S) objective also delivers a step up in performance over the other training strategies. The key is that assembly code is excluded from the loss computation, which lets the model concentrate on generating the source code and better capture the patterns and structure underlying the decompiled code.
However, including assembly code in the loss computation reduces performance by about 4 percentage points, and this is especially true for the next token prediction (NTP) objective (see table below). The inherent complexity and low-level nature of assembly code make it difficult for the model to learn meaningful patterns; the S2S approach avoids this complexity and allows the model to focus on high-level source code patterns.
There is also an alternative strategy that first trains on both assembly and C code and then fine-tunes on the translation task (NTP+S2S), but it is not as effective as the S2S approach. These ablation studies clarify how LLM4Decompile handles the decompilation process and why certain training methods are superior.
Conclusion
This paper provides the first open-source, decompilation-focused large-scale language model together with a standardized recompilability/re-executability benchmark. Analysis on this diverse set of compiled C code reveals promising capabilities: LLM4Decompile-6B achieves 87% recompilability, indicating syntactic understanding, and 21% re-executability, suggesting semantic preservation. As an initial exploration of data-driven decompilation, the paper establishes an open benchmark to motivate future efforts. The released datasets, models, and analyses show considerable potential for advancing decompilation with these new techniques.