
AI To Unobfuscate Code

Deep Learning

3 main points
✔️ A new learning model for programming languages
✔️ Significant performance improvements compared to CodeBERT
✔️ Interesting applications in several tasks

DOBF: A Deobfuscation Pre-Training Objective for Programming Languages
written by Baptiste Roziere, Marie-Anne Lachaux, Marc Szafraniec, Guillaume Lample
(Submitted on 15 Feb 2021 (v1), last revised 16 Feb 2021 (this version, v2))
Comments: Accepted to arXiv.

Subjects: Computation and Language (cs.CL)

code: 

Introduction

Code obfuscation transforms source code so that its internal procedures, structure, and data become difficult for humans to understand. This hides the purpose of the code and helps protect it against analysis and tampering. Obfuscation can also compress the code to reduce the network payload size (e.g., for JavaScript). However, it is a one-way process, and it is difficult to restore the original source code. Even an experienced programmer can work out parts of obfuscated code after a thorough investigation, but doing so is time-consuming and cumbersome.
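
To make this concrete, here is a toy illustration (not taken from the paper) of what identifier obfuscation does to a small Python function: the behaviour is preserved, but the intent becomes hard to read.

# Original, human-readable function.
def average(values):
    total = 0
    for value in values:
        total += value
    return total / len(values)

# The same function after identifier obfuscation: it still runs identically,
# but the names no longer reveal what it computes.
def FUNC_0(VAR_0):
    VAR_1 = 0
    for VAR_2 in VAR_0:
        VAR_1 += VAR_2
    return VAR_1 / len(VAR_0)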

Language models like BERT primarily target natural language processing tasks: a portion of the input tokens (~15%) is masked, and the model is trained to predict the masked words. This is the Masked Language Modelling (MLM) learning objective. MLM can also be applied to programming languages, but it works only superficially there and is not optimal. This paper proposes a new learning objective for programming languages, DOBF, and shows that it can significantly outperform models trained with MLM (BERT, ALBERT).

MLM Issues in Programming Languages

In contrast to natural languages, programming languages are highly structured and syntactic; the degree of freedom in expression is quite limited. As a result, randomly masking input tokens as in MLM is not as challenging as a typical MLM task: predicting commas, brackets, and keywords (while, if, etc.) in source code is very easy. Moreover, MLM does not mask all occurrences of a variable ("queue", "graph"), so some information leaks to the model.

For example, since the variable 'queue' is not masked on line 4, it is easy to guess that the masked token on line 2 must be its declaration. In other words, because of the syntactic constraints, the answer can often be derived from surface rules alone, without any understanding of the code. This reduces what the model actually learns (even more so in a verbose language like Java). As a result, the model cannot "correctly" understand the meaning of the code, and its generalization performance suffers. We therefore need a different objective from MLM.
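
As a rough sketch of why this leakage happens, the snippet below (illustrative only, not the authors' implementation) applies MLM-style random masking to a tokenized line of code; because each token is masked independently, other occurrences of the same identifier usually remain visible and give the answer away.

import random

def random_token_mask(tokens, ratio=0.15, mask="<MASK>"):
    # MLM-style corruption: each token is masked independently,
    # so other occurrences of the same identifier stay visible.
    return [mask if random.random() < ratio else t for t in tokens]

tokens = "queue = [ root ] ; while queue :".split()
print(random_token_mask(tokens))
# Possible output: ['queue', '=', '[', 'root', ']', ';', 'while', '<MASK>', ':']
# The masked 'queue' is trivially recoverable from its earlier occurrence.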

Deobfuscation Objective (DOBF)

As shown in the figure above, DOBF obfuscates (replaces) all occurrences of a variable or function name with the same mask token (e.g., every occurrence of "queue" is replaced with "VAR_3"). The learning objective is to recover the original names from the obfuscated source code. Each identifier (class name, function name, variable name) is replaced by a special token with probability p_obf ∈ [0, 1].

If p_obf = 1, all identifiers are obfuscated; if p_obf = 0, exactly one randomly chosen identifier is obfuscated. The i-th obfuscated class, function, and variable are replaced by the tokens CLASS_i, FUNC_i, and VAR_i, respectively.
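
The following sketch shows one way such a corruption step could look in Python. It is a simplified illustration under the assumptions described above, not the authors' implementation: identifier extraction is assumed to have been done already, and every occurrence of a selected identifier is replaced by the same special token.

import random

def dobf_obfuscate(tokens, identifiers, p_obf):
    # `identifiers` maps each identifier name to its kind ('class', 'func', 'var'),
    # assumed to be ordered by first appearance in the code.
    # Returns the obfuscated token list and the mapping mask -> original name.
    names = list(identifiers)
    if p_obf == 0:
        selected = {random.choice(names)}            # exactly one identifier
    else:
        selected = {n for n in names if random.random() < p_obf}
        if not selected:
            selected = {random.choice(names)}        # obfuscate at least one
    prefix = {"class": "CLASS", "func": "FUNC", "var": "VAR"}
    counters = {"class": 0, "func": 0, "var": 0}
    name_to_mask = {}
    for name in names:
        if name in selected:
            kind = identifiers[name]
            name_to_mask[name] = f"{prefix[kind]}_{counters[kind]}"
            counters[kind] += 1
    obfuscated = [name_to_mask.get(t, t) for t in tokens]
    mask_to_name = {mask: name for name, mask in name_to_mask.items()}
    return obfuscated, mask_to_name

With p_obf = 1 every identifier is replaced, and with p_obf = 0 only a single random one is, which corresponds to the obfuscation levels used for training later in the article.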

Implementation

The DOBF objective is used to train a sequence-to-sequence model. Given obfuscated source code, the model predicts a dictionary mapping each mask token to its original name, serialized as a delimiter-separated sequence. For the example in the figure above, the prediction is represented as follows:

 

'FUNC_0 bfs | VAR_0 graph | VAR_1 root | VAR_2 visited | VAR_3 queue | VAR_4 neighbor | VAR_5 node'
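
A small helper like the one below (an illustrative sketch, not part of the paper's code) can turn this delimiter-separated prediction back into a dictionary, which can then be applied to the obfuscated code:

def parse_prediction(prediction):
    # Split "MASK name | MASK name | ..." into {mask: name}.
    mapping = {}
    for pair in prediction.split("|"):
        mask, name = pair.split()
        mapping[mask] = name
    return mapping

pred = "FUNC_0 bfs | VAR_0 graph | VAR_1 root | VAR_2 visited | VAR_3 queue | VAR_4 neighbor | VAR_5 node"
print(parse_prediction(pred))
# {'FUNC_0': 'bfs', 'VAR_0': 'graph', 'VAR_1': 'root', 'VAR_2': 'visited',
#  'VAR_3': 'queue', 'VAR_4': 'neighbor', 'VAR_5': 'node'}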

Experiment and Evaluation

The model architecture is a standard encoder-decoder. For training, a public GitHub dataset consisting of 19 GB of Python code and 26 GB of Java code is used. The model is trained at three obfuscation levels, p_obf ∈ {0, 0.5, 1}, i.e., a single identifier (0), half of the identifiers (0.5), and all identifiers (1).

Results

The table above shows the results of training with different degrees of obfuscation. Clearly, one obfuscation level does not transfer well to another: a model trained with p_obf = 0 performs poorly when evaluated at p_obf = 1, and vice versa. However, MLM+DOBF appears to be a good match for programming languages, as the models perform better when the two objectives are combined.

The figure above shows how DOBF compares with CodeBERT and standard transformers on several benchmarks and tasks. DOBF and DOBF+MLM outperform the other models in most cases, including Python↔Java code translation and Natural Language Code Search (NLCS). However, CodeBERT outperforms the other models on clone code detection (Clone Det).

The figure above shows an example of deobfuscation when the model is given a fully obfuscated breadth-first-search function. The model provides meaningful names, which makes the code much easier to understand.
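
The listing below is an illustrative reconstruction of that kind of example (not the exact figure from the paper): a fully obfuscated breadth-first-search function, followed by the same function after a predicted dictionary like the one shown earlier has been applied.

# Fully obfuscated input: every identifier has been replaced.
def FUNC_0(VAR_0, VAR_1):
    VAR_2 = set()
    VAR_3 = [VAR_1]
    while VAR_3:
        VAR_5 = VAR_3.pop(0)
        if VAR_5 not in VAR_2:
            VAR_2.add(VAR_5)
            for VAR_4 in VAR_0[VAR_5]:
                VAR_3.append(VAR_4)
    return VAR_2

# After applying the predicted names (FUNC_0 -> bfs, VAR_0 -> graph, ...),
# the code becomes readable again.
def bfs(graph, root):
    visited = set()
    queue = [root]
    while queue:
        node = queue.pop(0)
        if node not in visited:
            visited.add(node)
            for neighbor in graph[node]:
                queue.append(neighbor)
    return visited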

Applications of single-identifier deobfuscation

This method can be used by IDEs to suggest variable names based on the current source code. Some IDEs, such as PyCharm, already offer such suggestions, but their approach relies on simple heuristics.

Applications of deobfuscating all identifiers

Once the model has been trained to deobfuscate all identifiers, it can be used to restore files whose identifiers have been changed for compression or security reasons. This improves the readability of the code and makes it easier to understand.

Conclusion

There are several interesting use cases for this model, but it should also be noted that it can be misused. These deobfuscation models can revert modified code to readable code with meaningful identifier names, which makes it easier, for example, to analyze and tamper with code that was obfuscated for protection, or to adapt it in order to introduce malware into a system. Nevertheless, the DOBF objective is complementary to MLM and helps the model understand the semantic features of programming languages. Further details can be found in the original paper.

Thapa Samrat
I am a second-year international student from Nepal, currently studying in the Department of Electronic and Information Engineering at Osaka University. I am interested in machine learning and deep learning, and I write articles about them in my spare time.
