
LLM Revolutionizing Software Development: Validating Large-Scale Language Models In Integrated Development Environments (IDEs) Using A New Evaluation Harness


Large Language Models

3 main points
✔️ Validation of the usefulness of large language models in integrated development environments (IDEs): The authors validate how well large language models such as OpenAI's GPT-3.5, GPT-4, and Code Llama perform as programming assistants inside IDEs.

✔️ Varied evaluation scenarios: The contribution of large language models to software development is evaluated across five development scenarios: documentation generation, bug fixing, code generation, test case generation, and workspace understanding.

✔️ Evaluation harness proposal: New evaluation criteria and a Copilot evaluation harness enable more accurate measurement of large language model performance and of their contribution to cost optimization in the development process.

Copilot Evaluation Harness: Evaluating LLM-Guided Software Programming
written by Anisha Agarwal, Aaron Chan, Shubham Chandel, Jinu Jang, Shaun Miller, Roshanak Zilouchian Moghaddam, Yevhen Mohylevskyy, Neel Sundaresan, Michele Tufano
(Submitted on 22 Feb 2024)
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)


The images used in this article are from the paper, the introductory slides, or were created based on them.


Software development is constantly evolving, and developers are increasingly interested in adopting state-of-the-art technologies to improve their productivity. The use of large language models in integrated development environments (IDEs) is attracting particular attention: OpenAI's GPT-3.5 and GPT-4, as well as the open-source Code Llama, have the potential to serve as high-performance programming assistants. This paper presents an evaluation harness for measuring their usefulness as programming assistants in IDEs and examines their adaptability across a variety of programming scenarios and languages.

The validation covers five major development scenarios: documentation generation (doc), bug fixing (fix), code generation from natural language (generate), test case generation for code (test), and workspace understanding and query resolution (workspace). The harness's metrics are designed with practical considerations in mind, assessing these interactions for accuracy and efficiency, with the aim of properly capturing the complexity of generating functions in real-world code.

We also propose new evaluation criteria that account for the stochasticity of the output and the gaps in logic inherent to large language models. This allows for more accurate performance evaluation by automatically measuring the impact of prompt and parameter changes on the generated code.

Finally, the evaluation framework proposed in this paper is applied to a variety of LLMs, including GPT-3.5, GPT-4, and Code Llama, to evaluate their effectiveness in IDEs such as Visual Studio Code. It provides a comprehensive perspective on the capabilities and limitations of programming with large language models, addressing the broad needs and preferences of developers.

Evaluation of Software Programming with Large Language Models

To determine how useful modern large language models are for software engineering, a variety of metrics are used to measure their performance. In addition to the traditional HumanEval benchmark, match-based metrics such as BLEU and CodeBLEU are commonly used. These metrics serve as criteria for evaluating the quality of model output in tasks such as code generation and code translation. Output is also judged by large language models themselves, and new evaluation criteria focusing on functional correctness have been proposed.

Documentation generation (doc): This section examines the effectiveness of large language models for automatic documentation generation. For example, in the VS Code IDE, a developer asks the model to generate documentation for a Fibonacci function. The evaluation focuses on the correctness of the location, format, and coverage of the generated documentation, including whether the documentation is inserted without breaking the syntax of the code and whether the function is properly described.
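As a concrete illustration, the placement and syntax checks described here can be sketched in a few lines of Python. The function below is a hypothetical simplification of the harness's real checks: it verifies that a file still parses after documentation was inserted and that the target function now carries a docstring.

```python
import ast

def docstring_inserted_ok(source: str, function_name: str) -> bool:
    """Hypothetical success check: the file must still parse, and the
    named function must now have a docstring."""
    try:
        tree = ast.parse(source)  # syntax must remain intact
    except SyntaxError:
        return False
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name == function_name:
            return ast.get_docstring(node) is not None
    return False

# Example: documentation was added to a Fibonacci function
documented = '''
def fib(n):
    """Return the n-th Fibonacci number."""
    return n if n < 2 else fib(n - 1) + fib(n - 2)
'''
print(docstring_inserted_ok(documented, "fib"))  # True
```

A real harness would additionally check formatting conventions and coverage of parameters and return values, per language.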

Bug fixing (fix): The task of fixing bugs identified by static analysis tools is also validated. The goal is for the modified code to contain fewer errors overall than the original. The process uses static analyzers for a variety of languages, and a fix is judged successful if the corrected code is syntactically correct and the static-analysis warnings and errors have been eliminated; one example is correcting a misspelling of "yield" in the VS Code IDE.
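A minimal sketch of this success criterion, assuming warning counts come from some external static analyzer: a fix counts as successful only if the patched code still parses and the warning count strictly decreases.

```python
import ast

def syntactically_valid(source: str) -> bool:
    """Check the patched code with Python's own parser."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False

def fix_is_success(before_warnings: int, after_source: str, after_warnings: int) -> bool:
    # Hypothetical criterion mirroring the paper's description: the fix
    # must keep the code parseable and strictly reduce the number of
    # static-analysis warnings (so it cannot merely trade old errors for new ones).
    return syntactically_valid(after_source) and after_warnings < before_warnings

print(fix_is_success(3, "def f():\n    yield 1\n", 2))  # True
```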

Through these tasks, the potential value and limitations of large language models' contribution to the software development process are examined. In particular, when proposing fixes for bugs found by static analysis tools, it is important to solve the original problem without introducing new errors.

Code generation from natural language (generate): The ability to generate accurate code snippets from natural-language instructions is an important task and a major advance for large language models. This example visually demonstrates how the technique works in practice: the developer describes the task to the model, and the resulting code is displayed in the editor in a diff view.

For the generated code to be considered successful, two criteria must be met: it must be syntactically correct, and it must pass all relevant test cases. Syntactic correctness is verified with a language-specific parser, and the pass rate is verified by running the project's test suite; together these assess the utility and reliability of the generated code. The evaluation procedure starts from a set of repositories containing test cases and selects methods that meet specific criteria. For each selected method, its body is replaced with the comment "Your Code Here" and the large language model is asked to generate a new body. The generated body is placed back in its original position, and the modified code is evaluated against the test suite.
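The two-step check can be sketched as follows. This is a hypothetical, in-process simplification: the real harness runs the full repository test suite, whereas here a small assertion snippet stands in for it.

```python
import ast

def passes_harness(method_source: str, test_snippet: str) -> bool:
    """Sketch of the success criterion: the generated code must parse,
    and the accompanying test assertions must run without error."""
    try:
        ast.parse(method_source)  # criterion 1: syntactic correctness
    except SyntaxError:
        return False
    namespace = {}
    try:
        exec(method_source, namespace)  # define the generated function
        exec(test_snippet, namespace)   # criterion 2: tests must pass
    except Exception:
        return False
    return True

generated = "def fib(n):\n    return n if n < 2 else fib(n - 1) + fib(n - 2)\n"
tests = "assert fib(0) == 0\nassert fib(7) == 13\n"
print(passes_harness(generated, tests))  # True
```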

Through this process, we gain a clear picture of how efficiently and accurately large language models can generate code based on natural-language instructions.

Code test case generation (test): Advances in large language models have made it possible to automatically generate test cases for code. The example of a developer requesting test cases for a Fibonacci function in the VS Code IDE demonstrates how this technology can be applied in a real development environment.

There are two criteria for the success of generated test cases: they must be syntactically correct, and they must pass at runtime. This evaluation assumes that the code under test is correct. Syntactic correctness is checked using a language-specific parser, and the pass rate is determined by running the generated tests.

The evaluation procedure starts with a set of methods and has the large language model generate tests for each one. The method's signature, docstring, and body are provided to the model, with the original method body replaced by the "Your Code Here" placeholder.

After the tests are generated, they are added to the repository containing the method, and an attempt is made to run them. For JavaScript and TypeScript, tests are generated with the Jest and Mocha libraries, independent of the repository's existing test suites. When evaluating the generated tests, they are temporarily added to the method's file and executed in full to avoid import errors. If an obvious test case (e.g., a test that should always be true) returns false or raises an error, the results of the generated tests are considered unreliable.
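The runtime check and the always-true canary test can be sketched like this. It is a hypothetical simplification: the real harness runs Jest/Mocha or each language's own test runner rather than calling test functions directly.

```python
def run_tests(test_functions):
    """Run each zero-argument test function; record pass/fail by name."""
    results = {}
    for fn in test_functions:
        try:
            fn()
            results[fn.__name__] = True
        except Exception:
            results[fn.__name__] = False
    return results

def results_reliable(results):
    # If even the trivially-true canary test did not pass (e.g. due to
    # import or setup errors), the whole run is deemed unreliable.
    return results.get("test_canary", False)

def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

def test_canary():
    assert True  # an 'obvious' test that should always succeed

def test_fib():
    assert fib(6) == 8

outcome = run_tests([test_canary, test_fib])
print(outcome, results_reliable(outcome))
```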

Through this process, we examine how test cases generated by large language models can improve efficiency and accuracy during the testing phase of software development.

Workspace understanding and query resolution (workspace): The ability to identify relevant code snippets to answer a user's question is an important task that tests the comprehension of a large language model. This process tests the extent to which the model can understand both the user's natural-language query and a large amount of code. Specifically, the utility of this capability is demonstrated through the example of a test request for a Fibonacci function in the VS Code IDE.

The quality of snippets retrieved by the large language model is evaluated in two ways: mean reciprocal rank (MRR) and end-to-end keyword detection. Mean reciprocal rank is calculated from the position of the correct snippet in the list of snippets returned by the model, averaged across test cases. End-to-end keyword detection, on the other hand, evaluates the relevance of the retrieved snippets by checking whether the model's responses contain keywords relevant to the correct answer to the query.

The evaluation begins by providing the model with a user query and the full context of the associated code base and having it retrieve a list of relevant code snippets. Using mean reciprocal rank, we directly assess the quality of the retrieved snippets and measure how well the model finds the most relevant ones. We further evaluate the retrieved snippets end to end by providing the query and the snippets as context to the model and asking for a response to the original user query. From the model's final response, we determine whether the information needed to fully answer the question was found.
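Mean reciprocal rank itself is straightforward to compute; the sketch below follows the standard definition (1/rank of the first correct snippet, averaged over queries), independent of the harness's internals.

```python
def mean_reciprocal_rank(ranked_lists, correct_items):
    """MRR: for each query, take 1/rank of the correct snippet in the
    model's ranked list (0 if absent), then average across queries."""
    total = 0.0
    for ranking, correct in zip(ranked_lists, correct_items):
        rr = 0.0
        for rank, item in enumerate(ranking, start=1):
            if item == correct:
                rr = 1.0 / rank
                break
        total += rr
    return total / len(ranked_lists)

# Two queries: correct snippet at rank 1 and rank 3 -> (1 + 1/3) / 2
print(mean_reciprocal_rank([["a", "b"], ["x", "y", "z"]], ["a", "z"]))
```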

This approach allows an end-to-end evaluation of how effectively the model can retrieve code snippets that actually help answer the question at hand.

Data collection for evaluation harnesses

In this paper, a Copilot evaluation harness is developed to provide better evaluation metrics for programming. The harness is designed to evaluate code from various angles in order to improve its quality. This section describes one important step in building this evaluation system: data collection.

The data collection process gathers information from hundreds of public GitHub repositories covering six major programming languages: JavaScript, TypeScript, Python, Java, C/C++, and C#. From these repositories, methods that meet specific criteria are extracted and used as the basis for evaluation. Only repositories that can be built and tested are targeted; to that end, a special build agent is developed that attempts a variety of build and test strategies. This agent also enables evaluation with static analysis tools.

The criteria for selection from the GitHub repositories are strict, and each language has its own requirements. For example, JavaScript and TypeScript repositories are selected if a `package.json` file exists in the root directory and the project is managed with npm; for Java, projects must use Maven and build with JDK 1.8. For Python, only repositories whose dependencies can all be installed successfully in a virtual environment are selected, and for C/C++, a manually curated set of repositories that can be built and tested within a Docker image is used.

In selecting repositories, those that are smaller than 1 MB or larger than 100 MB, those that take more than 10 minutes to build or run tests, and those that contain no methods are excluded. In this way, a high-quality dataset is built.
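Taken together, these exclusion rules amount to a simple predicate. The sketch below is a hypothetical condensation of the criteria above (sizes in MB, build time in minutes); the real pipeline applies them during crawling and building.

```python
def repo_eligible(size_mb: float, build_minutes: float, method_count: int) -> bool:
    """Selection rule sketched from the criteria above: keep repositories
    between 1 MB and 100 MB that build and test within 10 minutes and
    contain at least one method."""
    return 1 <= size_mb <= 100 and build_minutes <= 10 and method_count > 0

print(repo_eligible(12.5, 4.0, 230))  # True
print(repo_eligible(0.4, 2.0, 50))    # False: smaller than 1 MB
print(repo_eligible(50.0, 25.0, 80))  # False: build exceeds 10 minutes
```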

Through this data collection process, we aim to build evaluation tools for a variety of programming languages and projects. Our goal is to contribute to improving the quality of software development and to help developers write code more efficiently and effectively.

Test case collection for evaluation harnesses

In this paper, we carefully select appropriate repositories for each programming language and develop a methodology to generate effective test cases from them. This process is an important step in measuring and improving code quality. Here we present the main approaches we employ in this paper.

Automatic documentation generation: Test cases are created from repositories containing methods of three lines or more that have not been minified or obfuscated. The coding assistants under evaluation are tasked with generating appropriate documentation strings for these methods. Generated documentation is considered successful if it is properly located, correctly formatted, and complete in content.

Bug fixes: Test cases are generated based on warnings and errors reported by the static analysis tool, excluding warnings related to imports and configuration. A generated fix is evaluated as successful if it is syntactically correct and reduces the number of static-analysis warnings. Crucially, correcting the original problem must not create new problems.

Code generation from natural language: This focuses on methods covered by existing tests and asks coding assistants to generate method bodies matching the specified method signatures. Generated code is considered successful if it is syntactically correct and passes all relevant tests.

Test Generation from Code: Asks the coding assistant to provide a working test for a method identified in the repository. If the provided test calls the target method and executes properly, it is evaluated as a success.

Workspace understanding and query resolution: Questions about the project workspace submitted by developers are collected and resolved by providing relevant code snippets. The quality of the provided snippets is evaluated using mean reciprocal rank (MRR).


Using the evaluation harness metrics and test cases described earlier, the performance of the OpenAI models GPT-3.5 and GPT-4, as well as Code Llama, is evaluated in documentation generation and bug fixing scenarios. The experiments use the large language model powered chat extension for the VS Code IDE, which has over 700,000 active users, as the code assistant.

This experiment aims to answer the following three questions:

  • RQ1. Model comparison: How do different large language models compare to each other when integrated with a coding assistant?
  • RQ2. Improving integration: What insights can the evaluation harness provide to engineers to improve the integration of large language models into the coding assistant?
  • RQ3. Data validity: To what extent do the evaluation test cases reflect real-world patterns of user interaction with the large language model via the IDE?

We begin with RQ1, model comparison: how do different large language models compare to each other when integrated with a coding assistant? Here, three models are compared: GPT-3.5, GPT-4, and Code Llama.

In documentation generation, GPT-4 outperforms the other two models, as shown in the table below. In Python, Code Llama produced results comparable to GPT-4, while in C/C++ its performance was significantly worse. This is likely because GPT-3.5 and GPT-4 are trained on a wide range of open-source code from the Internet, making it easier for them to recognize a wide variety of code patterns. The smaller Code Llama, in contrast, performs poorly in some scenarios due to its more limited exposure to specific code snippets.

In the bug fix tests, GPT-4 slightly outperformed GPT-3.5, followed closely by Code Llama. In the C# bug fixes, however, all three models struggled, with GPT-3.5 slightly ahead of the other two. Cases were observed where fixes proposed by GPT-4 were applied in the wrong place, and in some cases bugs could not be fixed despite the existence of potential solutions.

GPT-3.5 was found to often choose more basic, less effective solutions, while GPT-4 attempts more advanced and complex fixes. These differences are especially noticeable when resolving errors related to type specification, where GPT-4 tries to infer a more appropriate type, which may not match the context of the code. GPT-3.5, on the other hand, sidesteps the problem with a simpler approach, but this is not always best practice.

The comparison reveals that while GPT-4 performs better overall, Code Llama and GPT-3.5 may provide effective solutions in certain scenarios. By understanding the strengths and weaknesses of each model, it is believed that the integration and use of coding assistants can be further optimized.

Next, we turn to RQ2, improving integration: what insights can the Copilot evaluation harness provide to engineers to improve the integration of large language models into the coding assistant? Again, the focus is on the documentation generation and bug fixing processes, examining improvements that could address the issues observed.

Challenges and solutions in documentation generation: Four main error types were identified in the evaluation of documentation generation: code logic changes, syntax changes, incomplete documentation strings, and irrelevant documentation strings. In particular, GPT-4's stronger ability to follow instructions makes it more prone to suggesting "cleaner" code changes, which can cause the documentation generation task to fail.

To address this, adding a specific instruction to the coding assistant's prompt not to change the code in focus resulted in a marked improvement in evaluation results across all languages, with performance gains ranging from 5% in C++ to 11% in Java. Another case study shows how sensitive GPT-4 is to specific instructions.

In evaluating bug fixes, the error types detected by the static analyzer are carefully analyzed. Both models are capable of finding object namespaces and correcting type problems, but neither can yet correctly resolve the "has an 'any' type" error.

To solve this problem, the large language model powered chat extension should provide the model with additional context, such as the types and namespaces of the target variables. This would allow the model to correct the problem properly, avoiding the use of incorrect types or the hallucination of new types.

Through these experiments, challenges in integrating large language models into an integrated development environment have been identified, and specific improvements have been proposed to address them. Such findings and improvements are made possible by a robust and comprehensive evaluation system, which is expected to contribute to improving the quality of the development process.

The final question is RQ3, data validity: how well do the evaluation test cases reflect real-world patterns of user interaction with the large language model via the IDE?

This paper examines the extent to which user interaction with the large language model via the IDE reflects real-world usage patterns, using a dataset collected from actual git repositories. In particular, the focus is on usage of the documentation generation and bug fixing features of the large language model powered chat extension available in VS Code; the dataset is validated by collecting usage data from hundreds of Microsoft developers and comparing it to the test cases proposed in this paper.

For documentation generation, documented code snippets were embedded using OpenAI's ada embedding model and compared to Microsoft developers' usage data. This comparison showed that the dataset proposed in this paper is similar to actual developer usage patterns for the documentation generation feature.
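Embedding-based comparisons of this kind typically rest on cosine similarity between vectors. The sketch below shows the standard formula in plain Python; the vectors themselves are assumed to come from an embedding model such as ada, and the values here are toy examples.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors: the dot product
    divided by the product of the vector norms, in [-1, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (identical direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```

High similarity between test-case embeddings and real-usage embeddings supports the claim that the dataset resembles actual developer behavior.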

For bug fixes, code snippets containing bugs were embedded and plotted in two dimensions using PCA dimensionality reduction. This analysis confirms that the test cases proposed in this paper for the bug fixing feature lie in a space similar to that of actual usage.

The goal of this paper was not to make the test cases exactly match actual use cases, but to verify that the test cases fall within the range of practical use. The results of this analysis suggest that the dataset proposed in this paper is consistent with real-world usage for both documentation generation and bug fixing. This should provide important insights into the utility of development support tools that use large language models and contribute to improving the quality of the development process.


As developers use large language models more frequently for complex engineering tasks, the need for robust evaluation of the code they generate is growing. With many companies and products seeking to integrate large language models into their workflows, existing evaluation metrics alone cannot provide sufficient assurance of the quality and accuracy of automatically generated code. To address this, this paper proposes the Copilot evaluation harness and introduces five key evaluation tasks: method generation, test generation, docstring generation, bug fixing, and workspace understanding. It describes in detail how test cases were collected and how results were evaluated against these metrics, and shares preliminary results across multiple programming languages.

The purpose of developing the evaluation harness is to verify the quality of code generated by large language models. While machine learning techniques for code generation have advanced significantly, careful oversight and engineering effort are required to integrate large language models reliably and effectively into coding workflows. The goal is to provide a comprehensive evaluation suite that helps developers properly integrate large language models into their own coding processes. Using the Copilot evaluation harness, programmers can more systematically evaluate the impact of various parameters, such as changes in prompt wording, the order in which information is provided, and the context given to the model.

In addition, the Copilot evaluation harness can contribute to cost optimization. For example, it can show whether budget-conscious large language models provide satisfactory performance on tasks such as documentation generation. With this insight, developers can allocate resources wisely, using cost-effective models where they suffice while reserving high-performance models for more complex tasks.
