
Are Humans Or Large-scale Language Models (ChatGPT, GPT-4) Better Instructors For Teaching Beginner Programming?


Large Language Models

3 main points
✔️ GPT-4 significantly outperforms ChatGPT (GPT-3.5) in several programming education scenarios, such as "modifying programs" and "explaining programs," achieving performance comparable to humans.
✔️ On the other hand, a large performance gap still exists between GPT-4 and humans in the "feedback generation" and "creating exercise programs" scenarios, which require deeper program understanding.
✔️ Going forward, further validation through various approaches is expected, such as larger-scale experiments involving more experts, expansion to other programming languages, evaluation of the models in multilingual environments, and empirical studies with students.

Generative AI for Programming Education: Benchmarking ChatGPT, GPT-4, and Human Tutors
written by Tung Phung, Victor-Alexandru Pădurean, José Cambronero, Sumit Gulwani, Tobias Kohn, Rupak Majumdar, Adish Singla, Gustavo Soares
(Submitted on 29 Jun 2023 (v1), last revised 30 Jun 2023 (this version, v2))
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: This article is a full version of the poster (extended abstract) from ICER'23


The images used in this article are from the paper, the introductory slides, or were created based on them.

summary

Since the release of ChatGPT in 2022, many people have been surprised by the versatility of large-scale language models, and less than a year later a number of services built on them had already been released. As the name suggests, these models excel at applications that deal with "language," and that includes not only natural languages but also programming languages. They are used for a wide variety of purposes: proofreading, summarizing, and translating text, augmenting web search, supporting expert work such as that of lawyers and doctors, coaching and counseling, and so on. They can also support programming and learning to program, and are expected to be a technology that changes the next generation of computer science education.

The paper presented in this article examines how useful large-scale language models are in programming education for beginners. Over the past year, amid the excitement around ChatGPT, several studies have applied large-scale language models to various scenarios in programming education. However, most of these studies either used older models or validated only specific scenarios (e.g., generating explanations), and there had not yet been a systematic study that evaluates the latest models across a comprehensive set of programming-education scenarios.

Therefore, the paper presented here systematically evaluates ChatGPT (based on GPT-3.5) and GPT-4 and compares their performance with that of humans in various programming education scenarios. In other words, it measures how large the gap is between the older model (ChatGPT) and the newer model (GPT-4) in beginner programming education, and how large the gap is between these large-scale language models and humans. The paper examines six scenarios: (1) program modification, (2) hint generation, (3) evaluation feedback, (4) pair programming, (5) program explanation, and (6) creating exercise programs.

In this article, we introduce "(1) program modification," "(5) program explanation," and "(6) creating exercise programs."

Scenario 1: Program Modification

In this scenario, the models are tested on whether they can properly fix a student's buggy program, using the prompt shown below as input to ChatGPT and GPT-4. The prompt consists of a summary of the scenario being tested, a description of the problem {problem_description}, and the student's buggy program {buggy_program}. Given this prompt, the model outputs a repaired program.
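
As a concrete illustration, below is a minimal sketch of how such a prompt could be assembled and sent to the models through the OpenAI chat API (Python, openai>=1.0 client). The template wording, the helper name request_repair, and the model identifier are assumptions for illustration, not the paper's exact setup.

# Minimal sketch of the program-repair query, assuming the openai>=1.0 Python client.
# The prompt wording and model name are illustrative, not the paper's exact setup.
from openai import OpenAI

REPAIR_PROMPT = """You are a tutor helping a beginner fix a buggy Python program.

Problem description:
{problem_description}

Student's buggy program:
{buggy_program}

Return a corrected version of the program that solves the problem,
changing the student's code as little as possible."""

def request_repair(problem_description: str, buggy_program: str,
                   model: str = "gpt-4") -> str:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    prompt = REPAIR_PROMPT.format(problem_description=problem_description,
                                  buggy_program=buggy_program)
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep outputs stable for evaluation
    )
    return response.choices[0].message.content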


The output programs are evaluated with two indicators: "Correct," whether the program is a correct solution to the problem, and "EditTokens," the token-based edit distance between the output program and the buggy program.

"EditTokens" indicates how extensively the program was modified, and "Correct" is counted as 1 for a correct program and 0 for an incorrect one. Looking at the results in the figure below, the "Correct" rate is 88.0% for GPT-4, a significant improvement over 68.0% for ChatGPT and close to the 100.0% achieved by the human tutors (Tutor). On the other hand, "EditTokens" is 36.6 for GPT-4, a large value compared to 19.0 for the human tutors, indicating that GPT-4 edits more when fixing a buggy program. Humans can be said to be more efficient, since they reach a correct program with fewer edits.
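
The paper's exact tokenization behind "EditTokens" is not spelled out here, but the idea can be illustrated as follows: lex both programs into token sequences and compute a Levenshtein distance over them. The sketch below does this with Python's standard tokenize module; it is an approximation under stated assumptions, not the paper's implementation.

# Rough sketch of a token-level edit distance between two Python programs.
# The paper's exact tokenizer and distance definition may differ.
import io
import tokenize

def to_tokens(code: str) -> list[str]:
    """Lex a program into a flat list of token strings (layout tokens dropped)."""
    skip = (tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
            tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER)
    return [tok.string
            for tok in tokenize.generate_tokens(io.StringIO(code).readline)
            if tok.type not in skip]

def edit_tokens(buggy: str, fixed: str) -> int:
    """Levenshtein distance over token sequences (insert/delete/replace cost 1)."""
    a, b = to_tokens(buggy), to_tokens(fixed)
    dp = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(b) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                        # delete a[i-1]
                        dp[j - 1] + 1,                    # insert b[j-1]
                        prev + (a[i - 1] != b[j - 1]))    # substitute or match
            prev = cur
    return dp[-1]

print(edit_tokens("x = 1 + 2\n", "x = 1 * 2\n"))  # -> 1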

Scenario 2: Program Explanation

In this scenario, the models are tested on whether they can correctly explain a particular part of a program. Explaining a program that a student does not understand is one of the most fundamental skills in programming education. The prompt shown below is used as input to ChatGPT and GPT-4; it consists of an overview of the scenario, a description of the problem {problem_description}, a bug-free program {program}, and the specific part of that program the student is trying to understand {program_part_to_explain}.
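
For illustration only, one hypothetical way to fill in this template is sketched below: the part to explain is sliced out of the full program by line numbers and substituted together with the whole program. The placeholder names follow the article; the prompt wording and the line-number convention are assumptions.

# Hypothetical assembly of the program-explanation prompt (wording is illustrative).
EXPLAIN_PROMPT = """You are a tutor explaining a Python program to a beginner.

Problem description:
{problem_description}

Full program:
{program}

Explain what the following part does in the context of the whole program:
{program_part_to_explain}"""

def build_explanation_prompt(problem_description: str, program: str,
                             start_line: int, end_line: int) -> str:
    lines = program.splitlines()
    part = "\n".join(lines[start_line - 1:end_line])  # 1-indexed, inclusive range
    return EXPLAIN_PROMPT.format(problem_description=problem_description,
                                 program=program,
                                 program_part_to_explain=part)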

The output is rated "Correct" if the explanation contains accurate information about the specified part in the context of the overall program, "Complete" if it contains complete information in that context, and "Comprehensible" if it is easy to understand and read and is not verbose. In other words, it is evaluated for accuracy, completeness, and conciseness. In addition, "Overall" is counted only when the output explanation satisfies all three of the above indicators.
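
Concretely, each generated explanation can be recorded as three binary annotations, with "Overall" counted only when all three are positive; averaging over all evaluated outputs then gives the percentages reported below. A minimal sketch of this tabulation follows (field names are illustrative; in the paper the ratings come from two human experts).

# Minimal sketch of tabulating "Overall" from the three binary ratings.
from dataclasses import dataclass

@dataclass
class ExplanationRating:
    correct: bool         # accurate information about the selected part
    complete: bool        # no important information missing
    comprehensible: bool  # readable and not verbose

    @property
    def overall(self) -> bool:
        return self.correct and self.complete and self.comprehensible

def percentage(ratings: list[ExplanationRating], metric: str) -> float:
    """Share of outputs (in %) for which the given metric is positive."""
    return 100.0 * sum(getattr(r, metric) for r in ratings) / len(ratings)

ratings = [ExplanationRating(True, True, True),
           ExplanationRating(True, False, True)]
print(percentage(ratings, "correct"), percentage(ratings, "overall"))  # 100.0 50.0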

The figures below show the results for each indicator. Looking at "Overall," GPT-4 reaches 84.0%, again higher than ChatGPT's 72.0%, and close to the human tutors' 92.0%.

Scenario 3: Creating an Exercise Program

In this scenario, the models are tested on whether they can generate new exercises in which students find and fix bugs. In education it is important to practice on many problems, and being able to generate such problems is an important skill. The prompt shown below is used as input to ChatGPT and GPT-4; it consists of an overview of the scenario, a description of the problem {problem_description}, a buggy program {buggy_program}, and the fixed program given as line-level differences {line_diffs_with_fixed_program}.
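
The article does not show how the fixed program is serialized for the {line_diffs_with_fixed_program} placeholder; one straightforward possibility, sketched below as an assumption, is a unified line diff between the buggy and the fixed program produced with Python's difflib and substituted into the prompt.

# One possible (illustrative) way to produce the {line_diffs_with_fixed_program} input:
# a unified line diff between the buggy and the fixed program.
import difflib

def line_diffs_with_fixed_program(buggy_program: str, fixed_program: str) -> str:
    diff = difflib.unified_diff(
        buggy_program.splitlines(keepends=True),
        fixed_program.splitlines(keepends=True),
        fromfile="buggy_program.py",
        tofile="fixed_program.py",
    )
    return "".join(diff)

buggy = "def add(a, b):\n    return a - b\n"
fixed = "def add(a, b):\n    return a + b\n"
print(line_diffs_with_fixed_program(buggy, fixed))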

The output is rated on four indicators: "Correct," whether the new problem is correct and solvable with respect to its description and specification; "Simpler," whether the new problem is simpler than the input problem; "SimilarBugs," whether the new buggy program contains bugs similar to the student's bug; and "MinimalBugs," whether the new buggy program contains no other, unrelated bugs.

Since a problem that contains unnecessary extra bugs does not teach the intended lesson, "SimilarBugs" and "MinimalBugs" appear to be used as indicators of whether an appropriate similar problem has been created. "Overall" is counted only when the output problem and the new buggy program jointly satisfy all four of the above evaluations. The figure below shows the aggregated results.

For "Overall," GPT-4 achieves 22.0%, an improvement over ChatGPT's 10.0%, but still far below the human tutors' 74.0%. The breakdown shows a similar trend for "SimilarBugs," the measure of whether the new buggy program contains bugs similar to those in the input program. The main reason for the low "Overall" score may therefore be the difficulty of generating programs that contain similar bugs.

summary

The paper shows that in every scenario GPT-4 performs better than ChatGPT (based on GPT-3.5), and that in some scenarios it performs about as well as the human tutors. Although it has not been confirmed to be useful enough to replace humans, performance improves as large-scale language models are upgraded, suggesting that they could become useful tools in programming education. If realized, such a model could leverage its advanced language processing capabilities and extensive knowledge to act as a tutor, supporting programming learning and reducing the burden on educators.

Since only two experts participated as evaluators in this study, further validation through various approaches is expected, such as larger-scale evaluations involving more experts and empirical studies with students.

From the perspective of programming education, this paper shows the current possibilities and limitations of large-scale language models, and should lead to further improvements in future programming education that uses them.
