Is The Performance Of ChatGPT (GPT-3.5 And GPT-4) Changing? Stanford University And UC Berkeley Research Teams Investigate
3 main points
✔️ Results suggest that the performance of large-scale language models (GPT-3.5 and GPT-4) can change significantly over a short period of time.
✔️ Ongoing research is needed to understand medium- and long-term changes in the performance of large-scale language models.
✔️ To promote research on performance changes of large-scale language models, evaluation data from this experiment and ChatGPT responses were made publicly available.
How is ChatGPT's behavior changing over time?
written by Lingjiao Chen, Matei Zaharia, James Zou
(Submitted on 18 Jul 2023)
Comments: Published on arXiv.
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
code:
The images used in this article are from the paper, the introductory slides, or were created based on them.
summary
Since the release of ChatGPT in 2022, GPT-3.5 and GPT-4 have been the most widely used large-scale language models; AI-SCHOLAR readers may already be using them in a variety of settings. However, OpenAI does not announce when or how these models are updated. As a result, many people feel that integrating large language models into large-scale workflows and services is risky, and some clearly feel that the models' performance is declining.
Therefore, the paper presented here evaluates the March 2023 and June 2023 versions of GPT-3.5 and GPT-4 on four tasks (a. mathematical problem solving, b. answering sensitive/risky questions, c. code generation, and d. visual reasoning) to see how their performance changes over time. The figure below summarizes each task and its verification results.
Overall, the validation results indicate that the performance of large language models can change significantly in a relatively short period of time, underscoring the need for continuous monitoring of the quality of large language models.
Task 1: Math problem solving
This paper examines changes in mathematical problem-solving performance on a task that asks whether a given integer is prime. The task is easy for humans to understand and has an unambiguous answer, which makes it well suited for evaluating performance. It also requires reasoning: to determine whether a number is prime, one must work through multiple logical steps, such as dividing the number by candidate divisors to check whether any of them divides it evenly. Such reasoning is relevant not only to solving mathematical problems but also to general problem-solving ability, making the task a useful probe of the overall performance of large-scale language models.
The results are shown in the figure below. As shown in figure (a) below, accuracy on GPT-4 dropped sharply from 97.6% (March) to 2.4% (June), while GPT-3.5 improved significantly from 7.4% (March) to 86.8% (June). In addition, as shown in figure (b) below, GPT-4's responses became much more concise: the average number of generated characters fell from 821.2 (March) to 3.8 (June). By contrast, the number of characters in GPT-3.5's responses increased by approximately 40%. Surprisingly, the behavior of these large-scale language models changed substantially even on such a simple task.
One possible explanation for these large differences is the chain-of-thought behavior identified in this paper. The sample in figure (b) above shows that GPT-4 (March ver.) follows a chain of thought closely. To determine whether the integer 17,077 is prime, it first breaks the task down into steps: check whether 17,077 is even, find the square root of 17,077, list the prime numbers below that square root, and check whether 17,077 is divisible by any of them. It then carries out these steps and arrives at the correct answer that 17,077 is a prime number. In GPT-4 (June ver.), however, this chain of thought no longer appears to function.
The exact opposite change is observed in GPT-3.5: GPT-3.5 (March ver.) tends to generate the answer "No" first and only then run through the reasoning steps, whereas GPT-3.5 (June ver.) appears to have fixed this, writing out the reasoning steps first and finally producing the correct answer "Yes".
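To make the reasoning described above concrete, the following is a minimal Python sketch of the same procedure: check parity, take the square root, and trial-divide by the odd candidates below it (a slight simplification of listing only the primes). It illustrates the logic the model is expected to follow and is not code from the paper.

import math

def is_prime(n: int) -> bool:
    """Check primality following the steps spelled out by GPT-4 (March ver.):
    handle small/even cases, take the square root, then trial-divide."""
    if n < 2:
        return False
    if n == 2:
        return True
    if n % 2 == 0:                        # even numbers other than 2 are not prime
        return False
    limit = math.isqrt(n)                 # only divisors up to sqrt(n) matter
    for d in range(3, limit + 1, 2):      # trial division by odd candidates
        if n % d == 0:
            return False
    return True

print(is_prime(17077))  # True: 17,077 is prime, matching the example above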
Task 2: Answering sensitive/risky questions
It is known that asking sensitive questions to large-scale language models can elicit socially biased, personally identifying, or otherwise harmful text that can negatively impact users. This paper therefore examines how the models respond to sensitive questions and whether that behavior changed between versions.
The validation results are shown in the figure below: for GPT-4, the answer rate, i.e., the percentage of sensitive questions answered directly, decreased from 21.0% (March) to 5.0% (June), while for GPT-3.5 it increased from 2.0% (March) to 8.0% (June).
Another finding is that the number of characters in text generated by GPT-4 decreased from more than 600 to about 140: GPT-4 stopped explaining itself when refusing to answer and became more terse in its responses. As shown in figure (b) above, in March the model explains the reason for its refusal, but in June it responds only with "Sorry, but I can't assist with that." The same trend is observed in GPT-3.5. While these large-scale language models have become safer, they no longer explain why they refuse to answer certain questions.
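The answer rate above can be read as the fraction of sensitive prompts that receive a direct answer rather than a refusal. Below is a minimal, hypothetical sketch of such a check; the refusal phrases are illustrative assumptions, not the paper's actual filter.

# Hypothetical refusal detector: estimates how often a model answers a
# sensitive question instead of declining (the "answer rate" above).
REFUSAL_MARKERS = (
    "sorry, but i can't assist with that",   # phrasing quoted in figure (b)
    "i cannot help with",
    "as an ai language model",
)

def is_refusal(response: str) -> bool:
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def answer_rate(responses: list[str]) -> float:
    answered = sum(not is_refusal(r) for r in responses)
    return answered / len(responses)

# Two refusals and one direct answer give an answer rate of 1/3.
sample = [
    "Sorry, but I can't assist with that.",
    "As an AI language model, I cannot help with that request.",
    "Here is a detailed explanation ...",
]
print(answer_rate(sample))  # 0.333...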
Task 3: Code generation
Code generation is another typical application of large-scale language models. Many datasets exist for code generation, but using them to evaluate the code-generation ability of large-scale language models risks data contamination, so a new dataset was created for this paper. The dataset is a selection of the 50 most recent questions in the "easy" category on LeetCode, whose answers and explanations were first published in December 2022, making them suitable for assessing how well large-scale language models handle unseen problems. The code generated by the large-scale language model (i.e., the answer to each question) is sent to LeetCode's online judge for automatic evaluation. If the online judge accepts the generated code, i.e., the code runs without errors and produces the expected results, the code is counted as "directly executable". The results are shown in the figure below.
The proportion of directly executable code decreased from March to June. As shown in figure (a) above, more than 50% of the code generated by GPT-4 in March was directly executable; in June, this dropped to 10%. GPT-3.5 shows a similar trend. The verbosity of the generated output also increased slightly for both models.
As a likely primary cause, the paper points to "extra non-code text" added around the code generated by the June versions: the generated code is now wrapped in markdown-style code fences, with ```python inserted before the code and ``` after it, and more comments are added. These fences are markdown formatting, not Python syntax, so the raw output fails when executed as-is and is therefore no longer counted as directly executable. The authors also point out that such extra text and comments make problems harder to identify, especially when the generated code is used inside a large software pipeline.
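As a rough illustration, the wrapper text described above can be stripped off before judging the code. The sketch below is not part of the paper's pipeline; it simply removes a leading ```python fence and a trailing ``` fence and then runs the cleaned snippet as a local stand-in for the "directly executable" check.

import re

def strip_markdown_fence(generated: str) -> str:
    """Remove a leading ```python (or bare ```) line and a trailing ``` line,
    the kind of non-code wrapper described above, so the snippet can run."""
    text = generated.strip()
    text = re.sub(r"^```[a-zA-Z]*[ \t]*\n", "", text)  # opening fence, e.g. ```python
    text = re.sub(r"\n```[ \t]*$", "", text)           # closing fence
    return text

raw = "```python\ndef add(a, b):\n    return a + b\n```"
clean = strip_markdown_fence(raw)

namespace = {}
exec(clean, namespace)         # rough local stand-in for "directly executable"
print(namespace["add"](2, 3))  # 5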
Task 4: Visual Reasoning
Unlike the previous tasks, this one tests visual reasoning, the ability to draw logical conclusions from visual information, which requires more abstract reasoning. The ARC (Abstraction and Reasoning Corpus) dataset, designed to probe machine learning models for the kind of abstract thinking and reasoning humans perform, is used here. Each task in this dataset requires looking at a pattern of inputs (the "input grid") and generating the corresponding output pattern (the "output grid").
Figure (b) below is a sample. Given visual information such as colors and shapes, the large-scale language model must find the underlying pattern and output, for example, a 3x3 array of colors. The evaluation uses 467 samples from the ARC dataset and measures the percentage of correct answers.
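Since GPT-3.5 and GPT-4 take only text, each ARC grid must be serialized into a prompt. Below is a minimal, hypothetical way to do this; the format is an assumption for illustration, not the paper's actual prompt.

# Hypothetical serialization of an ARC-style task into a text prompt.
# Grids are small 2-D arrays of color indices (0-9).
Grid = list[list[int]]

def grid_to_text(grid: Grid) -> str:
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)

def build_prompt(train_pairs: list[tuple[Grid, Grid]], test_input: Grid) -> str:
    parts = []
    for i, (inp, out) in enumerate(train_pairs, 1):
        parts.append(f"Example {i} input:\n{grid_to_text(inp)}")
        parts.append(f"Example {i} output:\n{grid_to_text(out)}")
    parts.append(f"Test input:\n{grid_to_text(test_input)}")
    parts.append("Produce the test output grid in the same format.")
    return "\n\n".join(parts)

# Toy training pair: the output mirrors the input left to right.
train = [([[1, 0, 0], [1, 0, 0], [1, 0, 0]],
          [[0, 0, 1], [0, 0, 1], [0, 0, 1]])]
test = [[2, 0, 0], [2, 0, 0], [2, 0, 0]]
print(build_prompt(train, test))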
As shown in figure (a) below, there is a slight improvement in performance for both GPT-4 and GPT-3.5. However, despite the improvement in overall performance, some items that GPT-4 (March ver.) answered correctly are answered incorrectly by GPT-4 (June ver.), as shown in figure (b) below. In other words, even when overall performance does not change much, fine-grained changes can be hidden beneath it and may need to be monitored carefully, especially in critical applications.
summary
This paper shows that the performance of GPT-3.5 and GPT-4 can change significantly within a short period of time. Because the performance of large-scale language models is not stable, designers who incorporate such models into services and other applications will need to account for performance fluctuations through continuous monitoring. The research team plans to continue evaluating large-scale language models such as GPT-3.5 and GPT-4 on a regular basis. The evaluation data from this study and the ChatGPT responses are available on GitHub.