[IRCoder] Intermediate Representations Make Language Models Robust Multilingual Code Generators
3 main points
✔️ Compiler intermediate representations may be useful for transferring information between different programming languages and for improving the accuracy of code generation.
✔️ Experiments show that grounding in IR improves both the understanding and the generation of code across heterogeneous programming languages.
✔️ The authors hope these results will stimulate further research on incorporating intermediate code representations, and that new methods and tools will follow to improve code understanding and generation.
IRCoder: Intermediate Representations Make Language Models Robust Multilingual Code Generators
written by Indraneil Paul, Jun Luo, Goran Glavaš, Iryna Gurevych
(Submitted on 6 Mar 2024)
Comments: Published on arxiv.
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Programming Languages (cs.PL)
code:
The images used in this article are from the paper, the introductory slides, or were created based on them.
Summary
This study focuses on the development of language models (LMs) in code generation. While traditional LMs have focused primarily on natural language text, new developments are emerging in code generation. In particular, compiler intermediate representations may be useful for transferring information between different programming languages and for improving the accuracy of code generation.
The study builds a large dataset called SLTrans to train code generation models on intermediate representations. SLTrans contains about 4 million self-contained source code files from different programming languages, each paired with its intermediate representation. The authors then take base Code-LMs of various sizes and continue their causal language modeling training on this data. The resulting models, called IRCoder, show improved prompt robustness and code understanding, and deliver consistent gains across a wide variety of multilingual code generation tasks.
Introduction
Traditionally, Code-LM benchmarks have been limited primarily to resource-rich languages such as Python, but in practice there is demand for code generation in all programming languages. Recent benchmarks have revealed gaps in Code-LM performance between programming languages. For example, performance on languages such as Bash lags behind performance on Python.
This problem is further exacerbated by the skewed language distribution of code corpora and by the rapid evolution of programming languages. Traditional approaches have difficulty accounting for this diversity, so new methods are needed. This study proposes compiler intermediate representation (IR) as a basis for code understanding across heterogeneous languages. Because IR is independent of both the source programming language and the target platform, it can serve as a shared representation for multilingual Code-LMs.
Related Research
This section reviews three lines of work that are important for the development of code generation models.
The first is the curation of high-quality pre-training data. Carefully selecting domain-specific data makes LM pre-training more efficient. For example, models such as Phi-1 have been trained with only 7 billion tokens and still show competitive performance.
The second is grounding in toolchain metadata. Here, information obtained from compiler output is used to help the model understand the source code. One example is the technique of encoding the abstract syntax tree (AST).
The third is cross-lingual transfer and alignment. Work in this area stresses that, when training multilingual models, changes in the number of training languages and in their resource ratios can degrade performance, and that effort is needed to avoid this. In particular, it is suggested that pre-training with compiler intermediate representations (IRs) may improve performance on downstream tasks by facilitating cross-lingual transfer between languages other than the dominant one.
Proposed Method
This section describes how SLTrans, a paired dataset of source code and its LLVM IR translation, is created. The table below shows the breakdown of SLTrans by programming language.
First, the dataset aims to provide parallel source-IR data from a mix of low-, medium-, and high-resource programming languages. This involves converting the source code to LLVM IR. LLVM was chosen as the intermediate representation because of its prevalence and its advantages, such as maintaining a human-readable IR standard. The table below outlines the parallel data sourcing and the training objectives.
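As a rough illustration of the source-to-IR conversion step, the sketch below builds (but does not run) a clang command line that lowers one self-contained C file to textual LLVM IR. The flags `-S -emit-llvm` are standard clang options. The mapping of the two flavours to `-Oz` and `-O3`, the function name, and the output naming scheme are assumptions made for this example, not details taken from the article.

```python
from pathlib import Path

# Assumed mapping from IR flavour to clang optimization level (illustrative only).
OPT_FLAGS = {"size": "-Oz", "performance": "-O3"}


def clang_ir_command(source: Path, flavour: str = "size") -> list[str]:
    """Build a clang command that would emit human-readable LLVM IR (.ll)."""
    if flavour not in OPT_FLAGS:
        raise ValueError(f"unknown flavour: {flavour!r}")
    # Hypothetical naming scheme: solution.c -> solution.size.ll
    output = source.with_name(f"{source.stem}.{flavour}.ll")
    return [
        "clang",
        "-S",            # emit text instead of an object file
        "-emit-llvm",    # emit LLVM IR rather than native assembly
        OPT_FLAGS[flavour],
        str(source),
        "-o",
        str(output),
    ]


if __name__ == "__main__":
    print(" ".join(clang_ir_command(Path("solution.c"), "size")))
```

The returned list can be passed to `subprocess.run` on a machine with clang installed. Other languages would need their own LLVM-based front ends.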
Obtaining LLVM IR poses several challenges. For example, it is difficult to track dependencies and to obtain code units that compile on their own. To solve these problems, the authors use self-contained compilation units, such as solutions to programming contest problems.
Each acquired source file was then compiled twice, once into a size-optimized IR and once into a performance-optimized IR, and both versions were collected. Finally, MinHash-based deduplication was performed to create the final SLTrans dataset. This dataset contains approximately 4 million samples across 12 programming languages, for a total of 26.2 billion tokens.
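To make the deduplication step concrete, here is a minimal sketch of MinHash-based near-duplicate removal using only the Python standard library. The shingle size, number of hash functions, similarity threshold, and the greedy pairwise comparison are all assumptions chosen for readability. A pipeline at the scale of SLTrans would normally add locality-sensitive hashing rather than compare every pair.

```python
import hashlib


def shingles(text: str, k: int = 3) -> set[str]:
    """Split text into overlapping k-token shingles."""
    tokens = text.split()
    if len(tokens) < k:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def minhash_signature(text: str, num_perm: int = 64, k: int = 3) -> tuple[int, ...]:
    """For each of num_perm salted hash functions, keep the minimum shingle hash."""
    items = [s.encode("utf-8") for s in shingles(text, k)]
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(hashlib.blake2b(item, digest_size=8, salt=salt).digest(), "big")
            for item in items
        ))
    return tuple(signature)


def estimated_jaccard(sig_a: tuple[int, ...], sig_b: tuple[int, ...]) -> float:
    """The fraction of matching signature slots estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def deduplicate(docs: list[str], threshold: float = 0.8) -> list[int]:
    """Greedy O(n^2) pass: keep a document only if no kept document is too similar."""
    kept: list[int] = []
    kept_sigs: list[tuple[int, ...]] = []
    for index, doc in enumerate(docs):
        sig = minhash_signature(doc)
        if all(estimated_jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(index)
            kept_sigs.append(sig)
    return kept
```

Calling `deduplicate` on a list of source files returns the indices of the files to keep, with the first occurrence of each near-duplicate cluster retained.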
Experiment
Settings and Data
First, LLVM IR is used to establish a common structure across heterogeneous languages and to facilitate cross-lingual transfer. In the experiments, header data and other extraneous information are stripped from the IRs. The size-optimized IR is then selected in 80% of the cases and the performance-optimized IR in the remaining 20%.
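The preprocessing described above can be sketched as follows. Which header lines are removed is an assumption here: the example drops the module ID, source filename, data layout, and target triple lines that typically open an LLVM IR file. The function names and the use of a seeded random generator are likewise illustrative.

```python
import random

# Assumed set of header lines to drop; they commonly open a textual LLVM IR module.
HEADER_PREFIXES = ("; ModuleID", "source_filename", "target datalayout", "target triple")


def strip_ir_header(ir_text: str) -> str:
    """Remove platform- and file-specific header lines from textual LLVM IR."""
    kept = [line for line in ir_text.splitlines() if not line.startswith(HEADER_PREFIXES)]
    return "\n".join(kept).strip()


def pick_ir_variant(size_ir: str, perf_ir: str, rng: random.Random, p_size: float = 0.8) -> str:
    """Return the size-optimized IR with probability p_size, else the performance-optimized IR."""
    return size_ir if rng.random() < p_size else perf_ir


if __name__ == "__main__":
    sample = (
        "; ModuleID = 'solution.c'\n"
        'source_filename = "solution.c"\n'
        'target triple = "x86_64-unknown-linux-gnu"\n'
        "\n"
        "define i32 @main() {\n"
        "  ret i32 0\n"
        "}\n"
    )
    print(strip_ir_header(sample))
```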
Next, the training corpus is prepared by subsampling at the token level with UniMax-1 sampling. In addition to the paired data, the training corpus includes 200 million tokens of open-domain IR code from TheStack, high-quality code and text data, and mathematical articles from the OpenWebMath dataset.
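The idea behind UniMax sampling is to split a token budget as evenly as possible across languages while capping how many times any small language is repeated. The sketch below follows the commonly described UniMax procedure and assumes that "UniMax-1" means a cap of one epoch per language. The language names and token counts in the usage example are made up for illustration.

```python
def unimax_budgets(token_counts: dict[str, int], total_budget: float,
                   max_epochs: float = 1.0) -> dict[str, float]:
    """Allocate a token budget uniformly across languages, capped at max_epochs passes each."""
    budgets: dict[str, float] = {}
    remaining = float(total_budget)
    ordered = sorted(token_counts, key=token_counts.get)  # smallest language first
    for i, lang in enumerate(ordered):
        fair_share = remaining / (len(ordered) - i)   # even split of what is left
        cap = token_counts[lang] * max_epochs         # never repeat data beyond the cap
        budgets[lang] = min(fair_share, cap)
        remaining -= budgets[lang]
    return budgets


def sampling_probs(budgets: dict[str, float]) -> dict[str, float]:
    """Normalize per-language budgets into sampling probabilities."""
    total = sum(budgets.values())
    return {lang: value / total for lang, value in budgets.items()}


if __name__ == "__main__":
    # Hypothetical token counts, not the real SLTrans breakdown.
    counts = {"low_resource": 10, "mid_resource": 100, "high_resource": 1000}
    print(unimax_budgets(counts, total_budget=300))
```

With these made-up counts, the low- and mid-resource languages hit their one-epoch cap, and the budget they leave unused flows to the high-resource language.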
With the token budget fixed, the impact of IR grounding is tested on 6 different Code-LMs from 3 different providers, ranging in size from 1.1B to 7.3B parameters. These models come from the StarCoderBase, DeepSeekCoder, and CodeLlama families.
Finally, the models are trained. This involves introducing two new sentinel tokens and initializing their embeddings randomly from a Gaussian distribution. Training relies on LoRA and uses DeepSpeed ZeRO Stage-2 to accelerate the training job. It uses the Adam optimizer and a maximum sequence length of 4096 tokens.
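One plausible way to picture the role of the two sentinel tokens is sketched below: each training sequence concatenates a source file and its IR, with a sentinel marking the direction of the translation. The token strings, the equal-probability choice of direction, the newline separators, and the standard deviation of 0.02 are assumptions for this example and are not stated in the article.

```python
import random

# Hypothetical sentinel token strings; the article does not name them.
SRC_TO_IR = "<src_to_ir>"
IR_TO_SRC = "<ir_to_src>"


def build_paired_sample(source: str, ir: str, rng: random.Random) -> str:
    """Concatenate source and IR in a random order, separated by a direction sentinel."""
    if rng.random() < 0.5:  # assumed 50/50 split between the two directions
        return f"{source}\n{SRC_TO_IR}\n{ir}"
    return f"{ir}\n{IR_TO_SRC}\n{source}"


def gaussian_embedding_rows(num_new_tokens: int, hidden_size: int,
                            rng: random.Random, std: float = 0.02) -> list[list[float]]:
    """Draw one embedding row per new token from a zero-mean Gaussian."""
    return [[rng.gauss(0.0, std) for _ in range(hidden_size)]
            for _ in range(num_new_tokens)]


if __name__ == "__main__":
    rng = random.Random(0)
    print(build_paired_sample("int main() { return 0; }",
                              "define i32 @main() { ret i32 0 }", rng))
    rows = gaussian_embedding_rows(num_new_tokens=2, hidden_size=4, rng=rng)
    print(len(rows), len(rows[0]))
```

In an actual training setup, the new rows would be appended to the model's embedding matrix after the tokenizer vocabulary is extended with the two sentinels.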
Results and Discussion
First, we investigated the importance of pairing source code with IR. To this end, we compared models trained on paired source code and IR against models trained on the same source code and IR without the pairing. The results suggest that unpaired data still yields some performance gain, but a smaller one than paired data. This indicates that a large portion of the gain comes from anchoring heterogeneous source languages to the same IR, not merely from exposure to IR.
Next, we investigated how grounding in IR affects the robustness of Code-LMs to prompt perturbations. The results show that grounding in IR improves robustness, especially to syntactic variations.
We then tested the multilingual code completion and comprehension capabilities of the models after grounding in IR. Results show that the IR-trained model significantly outperforms the base LM on all multilingual benchmarks.
Finally, we tested whether the benefits of grounding in IR carry over to instruction following. The results show that IR grounding still leads to performance gains after instruction tuning, and that these gains are most evident for the strongest base models.
Taken together, these results suggest that grounding in IR improves both the understanding and the generation of code across heterogeneous languages.
Conclusion
This study investigated how converting source code from different programming languages into a common intermediate representation, IR, affects the ability to understand and generate code.
The researchers hope that these results will stimulate extensive research on incorporating intermediate code representations, which may in turn lead to new methods and tools that improve code understanding and generation.