[SyntaxEval] Evaluating The Syntactic Capabilities Of Code Completion Models

Software Engineering 19/03/2024

3 main points
✔️ We propose a method for evaluating code completion models that also takes grammar rules into account.
✔️ We introduced SyntaxEval, a method to evaluate how accurately masked language models (MLMs) predict the structure of a program.
✔️ We found that the model learns some syntactic features well, but there is still room for improvement.

Which Syntactic Capabilities Are Statistically Learned by Masked Language Models for Code?
written by Alejandro Velasco, David N. Palacio, Daniel Rodriguez-Cardenas, Denys Poshyvanyk
(Submitted on 3 Jan 2024 (v1), last revised 21 Feb 2024 (this version, v2))
Comments: Published on arxiv.
Subjects: Software Engineering (cs.SE)

The images used in this article are from the paper, the introductory slides, or were created based on them.

Summary

This paper discusses how to evaluate models used for code completion. Code completion predicts what the programmer will type next, much as a smartphone keyboard predicts the next word when you type a message. These models are usually evaluated by checking whether the predicted tokens are correct, but that approach tends to miss whether the grammar rules of the programming language are being followed.

Therefore, this paper proposes a method for evaluating models that also takes grammar rules into account. The method automatically identifies syntactically important parts of a program and uses them to test the model. Specifically, it checks whether the model understands the structure of the program. Two popular models were tested with this method. The results showed that the tokens suggested by the models may not actually fit the structure of the program. This means that it is important to check not only whether the predictions are correct, but also whether they match the structure of the program.

Introduction

Large-scale language models have shown excellent performance in programming-related tasks. In particular, automatic code completion is important in the area of code generation. Code completion is a technique that completes an incomplete piece of code appropriately for its context. Previous research has attempted to use machine learning for code completion in a variety of ways.

Recently, masked language models (MLMs) have been used for code completion and have shown high accuracy. However, the extent to which these models understand the structure of programs has remained unclear. In this paper, we introduce SyntaxEval, a method for evaluating how accurately MLMs predict program structure. We then investigate the impact of syntactic features on MLM predictions.

Related Research

Accurate generation and understanding of code is important for programming-related tasks. Modern code generation techniques use probabilistic models that learn from large amounts of code and predict parts of the code (elements called tokens). Simply put, a code generation model should automatically produce a piece of code that follows the syntax rules of the programming language.

Proposed Method

To evaluate MLM performance, one must first understand the syntax (grammar) of the code. Code is written according to a particular structure, and the evaluation depends on capturing that structure. To do so, we use what is called an abstract syntax tree (AST), which is a tree representation of the structure of the code.
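As a rough illustration of what an AST exposes, the sketch below parses a tiny function and lists the node types it contains. It uses Python's built-in ast module as a stand-in; the parser actually used by SyntaxEval is not stated in this article, and the sample function is made up for the example.

```python
import ast

# A tiny, made-up code sample for illustration.
source = "def add(a, b):\n    return a + b\n"

# Parse the source text into an abstract syntax tree.
tree = ast.parse(source)

# Walk the tree and collect the type name of every node.
node_types = [type(node).__name__ for node in ast.walk(tree)]
print(node_types)
```

Node types such as FunctionDef, Return, and BinOp are the kind of syntactic categories that a syntax-aware evaluation can target.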

SyntaxEval Identifier AST Node Process

The following steps are then performed to evaluate MLM performance:

　 1. Generate an AST from the code.
　 2. Randomly hide a portion of the generated AST.
　 3. Present the code with the hidden portion to the MLM so that the MLM can predict the hidden portion.
　 4. Compare the predicted result with the correct answer for the hidden portion and evaluate the performance of the MLM.
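The four steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it uses Python's built-in ast module, the helper names mask_node_type and accuracy are hypothetical, the "<mask>" string is a placeholder for whatever mask token a given model expects, and the MLM call in step 3 is replaced by a hard-coded list of predictions.

```python
import ast
import random

MASK = "<mask>"  # placeholder; the real mask token depends on the model


def mask_node_type(source, node_type, rate=1.0, seed=0):
    """Hide source spans belonging to AST nodes of the given type.

    Returns the masked source and the hidden strings in source order.
    Assumes ASCII source, since ast column offsets count UTF-8 bytes.
    """
    rng = random.Random(seed)
    # Character offset at which each line starts.
    starts = [0]
    for line in source.splitlines(keepends=True):
        starts.append(starts[-1] + len(line))
    # Steps 1 and 2: build the AST and randomly select nodes to hide.
    spans = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, node_type) and rng.random() < rate:
            begin = starts[node.lineno - 1] + node.col_offset
            end = starts[node.end_lineno - 1] + node.end_col_offset
            spans.append((begin, end))
    # Drop spans that overlap an earlier one (nested nodes of the same type).
    kept = []
    for span in sorted(spans):
        if not kept or span[0] >= kept[-1][1]:
            kept.append(span)
    # Replace from right to left so earlier offsets stay valid.
    masked, hidden = source, []
    for begin, end in reversed(kept):
        hidden.append(source[begin:end])
        masked = masked[:begin] + MASK + masked[end:]
    hidden.reverse()
    return masked, hidden


def accuracy(predictions, hidden):
    """Step 4: fraction of hidden spans that were predicted exactly."""
    if not hidden:
        return 0.0
    return sum(p == h for p, h in zip(predictions, hidden)) / len(hidden)


masked, hidden = mask_node_type("x = a + b\n", ast.Name)
# Step 3 would send `masked` to an MLM; these predictions are made up.
predictions = ["x", "a", "c"]
score = accuracy(predictions, hidden)
print(masked, hidden, score)
```

Masking by node type rather than by random token position is what lets the evaluation ask which syntactic categories the model handles well.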

Causal interpretability is also used to understand the factors affecting MLM performance. It quantifies how strongly each factor influences the model's predictions.
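One common way to quantify such an effect is an average treatment effect: the difference in mean outcome between a treated group and a control group. The sketch below treats random masking as the control and syntax-based masking as the treatment. It is a naive difference in means that ignores confounders, so it should not be read as the estimator used in the paper, and the scores are made-up numbers for illustration only.

```python
from statistics import mean


def average_treatment_effect(control, treated):
    """Naive estimate: mean treated outcome minus mean control outcome."""
    return mean(treated) - mean(control)


# Made-up per-sample similarity scores, for illustration only.
random_mask_scores = [0.8, 0.7, 0.9, 0.6]  # control: random masking
syntax_mask_scores = [0.5, 0.6, 0.4, 0.7]  # treatment: syntax-based masking

ate = average_treatment_effect(random_mask_scores, syntax_mask_scores)
print(ate)
```

A negative value would mean that predictions get worse when the masked tokens are chosen by syntax rather than at random.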

Finally, this evaluation method was used in a code completion experiment using the Python programming language. In this experiment, Python code was used to evaluate MLM performance and to investigate the impact of code structure on MLM performance.

Experiment

Global results for syntax function performance

Here are the results of evaluating the performance of the machine learning models. Parts of the code were hidden (masked), and the models were evaluated on their ability to predict those parts.

The following figure evaluates the Jaccard similarity (a statistical measure of the similarity between two sets) of the nodes used by 𝑀1 between randomly masked tokens (𝑇0) and syntax-based masked tokens (𝑇1) in a per-node comparison. In other words, it measures the consistency of each node's predictions and captures the difference between a random mask and a syntax-based mask.
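For reference, Jaccard similarity is the size of the intersection of two sets divided by the size of their union. A minimal sketch, with made-up token lists and a convention (an assumption here) that two empty sets count as identical:

```python
def jaccard(a, b):
    """Jaccard similarity of two collections, compared as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets are identical
    return len(a & b) / len(a | b)


# Made-up predicted and reference tokens, for illustration only.
similarity = jaccard(["x", "y", "z"], ["y", "z", "w"])
print(similarity)
```

The two lists share two of four distinct tokens, so the similarity is 0.5.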

The following table evaluates the factors that affect model performance and examines their relationships. In other words, it shows which syntactic features the model relies on and to what extent they affect its performance. Jaccard, Levenshtein, and Sorensen-Dice are measures of similarity and string distance computed at the character level.
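The other two measures can be sketched in a few lines. Levenshtein distance counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another, and the Sorensen-Dice coefficient is twice the size of the intersection divided by the sum of the set sizes. Computing Dice over sets of characters is an assumption made here for simplicity; other variants use character bigrams or tokens.

```python
def levenshtein(s, t):
    """Edit distance between strings s and t (row-by-row dynamic programming)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def sorensen_dice(s, t):
    """Sorensen-Dice coefficient over the sets of characters in s and t."""
    a, b = set(s), set(t)
    if not a and not b:
        return 1.0  # convention chosen here: two empty strings are identical
    return 2 * len(a & b) / (len(a) + len(b))


distance = levenshtein("kitten", "sitting")
dice = sorensen_dice("night", "nacht")
print(distance, dice)
```

Levenshtein is a distance (lower is better), while Jaccard and Sorensen-Dice are similarities (higher is better), so they have to be read in opposite directions when compared side by side.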

The following diagram focuses on the syntactic features learned by the model and assesses how they are used to make predictions. In other words, we look to see if the model understands the form and structure of the code and to what extent this affects the accuracy of the model.

The results showed that the model adequately learned the syntactic features of the code. However, some specific parts of the code proved difficult for the model to predict. In other words, the results suggest that while the model does well with some forms of code, there is still room for improvement in other areas.

Causal evaluation results

Here, causal evaluation methods were used to investigate the factors affecting model performance. The results showed that the syntactic information of the code had a significant impact on the performance of the model. However, randomly masked portions of the code also affected the model's performance, indicating that prediction performance was degraded for certain portions of the code. In short, the results suggest that the model performs well in predicting some portions of the code, but there is still room for improvement.

Conclusion

This study focused on the ability of machine learning models to understand the syntax of programming languages and evaluated their performance. The results showed that the models learned some syntactic features well, but there is still room for improvement. In particular, prediction performance was worse for syntax-based masked tokens than for randomly masked ones. This suggests that the machine learning models may not fully understand the syntax of the programming language.

Future research should go further, developing a deeper understanding of methods for evaluating the semantic capabilities of models on programming languages and of why machine learning models are better at predicting randomly masked tokens. It is also important to consider factors related to the pre-training objectives of machine learning models.

Sasayama