

The Power of the Large-Scale Language Model "GPT-4" in Mathematical Question Answering

Large Language Models

3 main points
✔️ GPT-4 outperforms other models by achieving higher scores on MSE questions
✔️ Large-scale language models perform well on natural language tasks, but still struggle with mathematical reasoning and show lower accuracy on complex questions
✔️ Six models, including GPT-4, were evaluated on the ArqMATH dataset to identify strengths and weaknesses through answer generation and question-answer comparison

Can LLMs Master Math? Investigating Large Language Models on Math Stack Exchange
written by Ankit Satpute, Noah Giessing, Andre Greiner-Petter, Moritz Schubotz, Olaf Teschke, Akiko Aizawa, Bela Gipp
(Submitted on 30 Mar 2024)
Comments: Accepted for publication at the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), July 14-18, 2024, Washington D.C., USA.
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)


code:  

The images used in this article are from the paper, the introductory slides, or were created based on them.

Summary

Large-scale language models (LLMs) are attracting attention for their ability to solve natural language tasks, in some cases with near-human accuracy. These models have excelled at tasks as diverse as translation, code writing, and passing professional exams, and are used in a variety of fields, including knowledge extraction, idea generation, and data processing and comparison. Large-scale language models have also been successful in question answering (QA) tasks, where they provide human-like answers to questions posed in natural language. Evaluating large-scale language models on QA has proven useful for verifying how accurately they generate answers and for finding cases of possible hallucination.

Another important issue is to evaluate how well large-scale language models can handle mathematical language, given the importance of mathematical content in the science, technology, engineering, and mathematics (STEM) fields that have received so much attention in recent years. Mathematics, with its rigorous logic and abstract concepts, is conveyed in a specialized language with complex combinations of symbols and syntax. Unlike natural language, mathematical expressions rely on unstated rules and assumptions that require explicit knowledge and a high degree of precision. This makes mathematical reasoning a major challenge even for modern language models.

This paper investigates the ability of large-scale language models to answer open-ended questions in mathematics (questions that cannot be answered with a simple yes/no). The diverse mathematical questions on the Math Stack Exchange (MSE) platform provide an ideal testing environment: MSE includes questions ranging from elementary to advanced mathematics, requiring the application of correct mathematical principles and the ability to explain complex reasoning in a clear and understandable manner. The focus on open-ended questions demands a deep understanding of mathematical concepts and provides a benchmark for measuring progress in the mathematical reasoning skills of large-scale language models.

We also evaluate the generated responses, identify challenges through case studies, and discuss future directions for large-scale language models to bridge the gap between their understanding of natural language and of mathematical language.

Datasets and Methodology

Manually validating answers to questions on the Math Stack Exchange (MSE) is not practical due to the diversity of questions and the expertise required. For this reason, we use the dataset from the ArqMATH competition. This dataset is a collection of MSE question-answer pairs, and Task 1 of the third competition focused on retrieving answers from MSE for 78 college-level mathematics questions. Assessments were made by students, with an average of 450 responses rated per topic.
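To make the dataset layout concrete, here is a minimal sketch of how one topic and its judged answers could be represented in code; the field names are illustrative assumptions, not the official ArqMATH schema.

```python
from dataclasses import dataclass, field

@dataclass
class JudgedAnswer:
    # Illustrative fields only; the real ArqMATH files ship in their own formats with their own identifiers.
    answer_id: str
    text: str        # answer body, math typically written in LaTeX
    relevance: int   # graded judgment, e.g. 0 (not relevant) to 3 (highly relevant)

@dataclass
class Topic:
    topic_id: str    # one of the 78 Task 1 topics
    question: str    # MSE question text
    judged_answers: list[JudgedAnswer] = field(default_factory=list)  # ~450 rated answers on average
```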

Six models (ToRA, LLeMa, GPT-4, MAmmoTH, MABOWDOR, and Mistral 7B) were used in this evaluation, and experiments were conducted in two scenarios.

In the first scenario (answer generation), a two-step procedure is used to answer the MSE questions with the selected large-scale language model. First, the 78 questions are given to the large-scale language model to generate answers (MABOWDOR is an exception: because it uses BERT-based Dense Passage Retrieval, it can only generate embeddings). The generated answers are then indexed as embeddings and searched against all other answers in ArqMATH to find the most similar answers, as sketched below.
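A minimal sketch of this generate-then-retrieve pipeline, assuming a pre-computed matrix of DPR-style embeddings for the ArqMATH answer pool; `generate_answer` and `embed` are placeholders for the selected large-scale language model and the encoder, not the authors' exact implementation.

```python
import numpy as np

def cosine_top_k(query_vec, answer_vecs, k=10):
    """Rank pooled answer embeddings by cosine similarity to the query embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    a = answer_vecs / np.linalg.norm(answer_vecs, axis=1, keepdims=True)
    scores = a @ q
    return np.argsort(-scores)[:k]

def answer_generation_run(questions, answer_pool_vecs, generate_answer, embed, k=10):
    """Scenario 1: generate an answer per question, embed it, retrieve similar ArqMATH answers."""
    results = {}
    for qid, question in questions.items():
        generated = generate_answer(question)   # step 1: the LLM produces a candidate answer
        query_vec = embed(generated)            # step 2: embed the generated answer
        results[qid] = cosine_top_k(query_vec, answer_pool_vecs, k)
    return results
```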

In the second scenario (question-answer comparison), the selected large-scale language model is used to generate embeddings for all potential ArqMATH answers as well as for the 78 questions. Finally, we find the most similar answers to each question.
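Under the same assumptions, the second scenario reuses the `cosine_top_k` helper from the previous sketch but embeds the question itself instead of a generated answer:

```python
def question_answer_comparison_run(questions, answer_pool_vecs, embed, k=10):
    """Scenario 2: compare question embeddings directly against the ArqMATH answer embeddings."""
    return {qid: cosine_top_k(embed(question), answer_pool_vecs, k)
            for qid, question in questions.items()}
```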

Benchmark

Here are the evaluation results using Mean Average Precision (mAP), Precision@10 (P@10), normalized Discounted Cumulative Gain (nDCG), and Binary Preference (BPref). All of these scores are derived from the evaluated responses in the ArqMATH dataset.
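As a rough illustration of two of these metrics, the sketch below computes P@10 and nDCG@10 from a ranked result list and graded relevance judgments. Note that the official ArqMATH evaluation uses prime variants of these measures over the judged pool, so this is only an approximation of the idea.

```python
import math

def precision_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the top-k retrieved answers judged relevant (binary)."""
    return sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids) / k

def ndcg_at_k(ranked_ids, graded_relevance, k=10):
    """nDCG with graded judgments: DCG of the ranking divided by DCG of the ideal ranking."""
    gains = [graded_relevance.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    ideal = sorted(graded_relevance.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```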

First, for answer generation, the six selected models are used to generate answers. The generated answers are used as queries to search for relevant answers from ArqMATH's answer pool. The search uses DPR vector embeddings and cosine similarity.

The table below shows the comparative results for all models. The results show that the models tuned specifically for the math task outperform the DPR benchmark. Notably, increasing the model size does not improve the results: the Mistral model, which scored the lowest on the MATH dataset, performs as well as Tora-7b. This suggests that models showing superior performance on the MATH dataset may be overfitting to certain tasks.

On the other hand, the GPT-4 generated responses outperform the DPR baseline in P@10 scores and outperform MABOWDOR, the current best approach in ArqMATH3 Task 1.

The question-answer comparison also focuses on using embeddings to match questions with the most relevant answers. Because these models were originally designed for prompt-based responses, they need to be adapted to produce embeddings. To this end, we use the embedding of the final token, prefaced by the prompt "In a nutshell, what does this text mean:". We also introduce three math-related sample responses to guide the large-scale language model, as follows (a sketch follows the examples).

The first is "This text: ' ' is in a word: 'expected value'".
The second is "This text: ' ' is in a word: 'circle'".
The third is "This text: 'the distance between the center of an ellipse and its two foci' is in a word: 'eccentricity'".
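A minimal sketch of this prompt-based embedding using the Hugging Face transformers API, taking the final-layer hidden state of the last prompt token as the text embedding. The model name is an assumption for illustration, and only the third sample response is included verbatim, since the formulas in the first two examples are not reproduced in the article.

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "mistralai/Mistral-7B-v0.1"   # assumed; any causal LM with accessible hidden states works
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

FEW_SHOT = (
    "This text: 'the distance between the center of an ellipse and its two foci' "
    "is in a word: 'eccentricity'\n"
)

def last_token_embedding(text: str) -> torch.Tensor:
    """Embed a text as the final-layer hidden state of the last prompt token."""
    prompt = FEW_SHOT + f"In a nutshell, what does this text mean: {text}"
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state[0, -1, :]   # embedding of the final token
```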

For re-ranking, we focus on the top 10 results returned by MABOWDOR for each query. Following the discussion by Zhong et al., and given Tora-7b's inferior performance compared to the average ArqMATH approach, the responses of all systems were re-ranked. LLeMa and MAmmoTH were not evaluated, as their re-ranking was expected to be less effective given that they fall behind Tora on the MATH and GSM benchmarks. The analysis reveals that Tora-7b's Precision@10 is lower than that of all the runs shown in the table above. This indicates that comparing question and answer embeddings may not solve the problem of retrieving relevant answers.
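For reference, a simple version of such a re-ranking step could score each query's first-stage candidates by cosine similarity between the question embedding and the candidate answer embeddings; this is a sketch of the general idea, not the exact procedure used in the paper.

```python
import numpy as np

def rerank_top_k(question_vec, candidate_ids, candidate_vecs, k=10):
    """Re-order first-stage candidates by cosine similarity to the question embedding."""
    q = question_vec / np.linalg.norm(question_vec)
    c = candidate_vecs / np.linalg.norm(candidate_vecs, axis=1, keepdims=True)
    scores = c @ q
    order = np.argsort(-scores)
    return [candidate_ids[i] for i in order[:k]]
```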

Case Study: Performance Comparison of GPT-4 and DPR

Here, two annotators with expertise in mathematics and computer science evaluate the performance of GPT-4 in generating answers to selected questions. For the evaluation, we refer to zbMATH Open, a reviewing and abstracting service for pure and applied mathematics. The main focus is on comparing the retrieval performance of GPT-4 and Dense Passage Retrieval (DPR). In particular, we focus on questions for which GPT-4 improved retrieval accuracy and questions for which DPR outperformed GPT-4.

The figure below shows the frequency of P@10 differences between DPR and GPT-4 (P@10 of GPT-4 minus P@10 of DPR). As the analysis indicates, GPT-4 improves accuracy on 38 of the 78 Math Stack Exchange (MSE) questions, demonstrating its ability to generate relevant answers to MSE's open-ended math questions.

For the question shown in the figure below, GPT-4's answer improves P@10 from 0.0 (DPR) to 0.6. The first search result returned by DPR does not include the binomial coefficient, whereas GPT-4's result includes at least the n expansions. While DPR cannot infer the meaning of the formula without context, GPT-4 appears to understand the underlying formula.

Furthermore, the P@10 for the question shown below drops from 0.5 for DPR to 0.1 for GPT-4. GPT-4's answer misses the point because it does not explain how the particular line the submitter is asking about is derived from the premise. A pattern was observed in which GPT-4's generated answer leads the retrieval system in the wrong direction: the retrieved answers only provide a general explanation of the concept of tangents to curves, which corresponds to a portion of GPT-4's answer. This indicates that GPT-4 struggles with questions about complex interactions between mathematical concepts.


In addition, we found that the responses obtained from the smaller models (essentially all but GPT-4) were of very low quality, tended to misunderstand the prompt format, and were inconsistent. For some questions they produced no output other than end-of-sequence tokens. Mistral, on the other hand, is better at maintaining a conversational tone than Tora, and its reasoning is more structured. However, its formulas are not written in LaTeX format and are of lower quality due to errors in the input variables.

Also, for the following question, we found that using Tora-7b-Code's answer improves Precision@10 from 0.5 to 0.8. However, as the figure below shows, Tora simulates a thread of the kind that typically appears on MSE rather than producing an actual answer; as an answer, it is inconsistent and logically incorrect.


The above is a summary of the case studies on the performance of GPT-4, DPR, and Tora-7b-Code. The paper highlights the strengths and weaknesses of each model and provides valuable insights for future improvements.

Summary

This paper examines how well large-scale language models handle questions from the Math Stack Exchange (MSE). First, we evaluate MSE's diverse, open-ended questions with state-of-the-art language models that show high performance on math question answering (MathQA) datasets. The results show that GPT-4 outperforms the other models, achieving an nDCG score of 0.48 and a Precision@10 (P@10) score of 0.37. In particular, GPT-4 shows remarkably strong performance on ArqMATH3 Task 1.

In addition, we conduct a case study to evaluate the effectiveness of GPT-4 in detail. Large-scale language models that performed well on traditional MathQA datasets were found to often generate inaccurate answers. GPT-4, on the other hand, is able to generate appropriate answers to simple math questions, but its accuracy drops on more complex questions that require specialized knowledge.

The authors have made the responses generated by the large-scale language model and the code used in the experiment available to the research community for further investigation and analysis.

Takumu
I have worked as a Project Manager/Product Manager and Researcher at internet advertising companies (DSP, DMP, etc.) and machine learning startups. Currently, I am a Product Manager for new business at an IT company. I also plan services utilizing data and machine learning, and conduct seminars related to machine learning and mathematics.
