
Deep Dive Into The Mathematical Functions Of ChatGPT



3 main points
✔️ Propose a new dataset, GHOSTS, to test the mathematical capabilities of LLMs.
✔️ Present how LLMs can be integrated into the work of mathematicians.
✔️ Experiments compare older and newer model versions and show the improved performance of GPT-4.

Mathematical Capabilities of ChatGPT
written by Simon Frieder, Luca Pinchetti, Alexis Chevalier, Ryan-Rhys Griffiths, Tommaso Salvatori, Thomas Lukasiewicz, Philipp Christian Petersen, Julius Berner
(Submitted on 31 Jan 2023 (v1), last revised 20 Jul 2023 (this version, v2))
Comments: Added further evaluations on another ChatGPT version and on GPT-4. The GHOSTS and miniGHOSTS datasets are available at this https URL

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)


The images used in this article are from the paper, the introductory slides, or were created based on them.


This paper examines the mathematical performance of two language models, ChatGPT and GPT-4. The authors tested them with a new methodology on both public and newly created datasets, focusing on mathematical problems expressed in natural language rather than the formal languages used in typical theorem-proving databases.

The researchers introduced two new datasets, GHOSTS and miniGHOSTS, to address the fact that existing datasets focus primarily on elementary mathematics or cover only a narrow area. The new datasets cover graduate-level mathematics and aim to distinguish between different aspects of mathematical ability.

These datasets mimic the everyday activities of mathematicians in order to assess how useful ChatGPT and GPT-4 are to professionals. The models were benchmarked against a range of fine-grained performance measures, making this one of the most detailed evaluations so far of how well such models understand advanced mathematics.

Results show that ChatGPT can function as a mathematical search engine and is useful as a mathematical assistant. GPT-4, on the other hand, proved capable at undergraduate-level mathematics but was not successful at graduate-level difficulty. Despite positive press reports about its question-answering ability, overall mathematical performance fell short of that of an average graduate student: the models would not pass a graduate-level mathematics exam.


ChatGPT is well known as a question-answering dialogue system and has performed strongly on a variety of tests, including medical licensing exams, psychology IQ tests, and operations management exams. GPT-4 outperforms ChatGPT on such tests as well.

In this paper, the authors introduce a new dataset, GHOSTS, and analyze ChatGPT's mathematical capabilities in detail. They also evaluate GPT-4 on a smaller dataset called miniGHOSTS. This tests the extent to which these models can contribute to our mathematical understanding.

Related Research

ChatGPT can perform mathematical reasoning in natural language, putting it in competition with traditional techniques. Methods for automating mathematical reasoning have been studied since 1959; more recently, supervised learning and large language models (LLMs) have increasingly been used to learn mathematics expressed in natural language.

Progress in state-of-the-art symbolic encodings of mathematics is seen as stagnating, so models such as ChatGPT are expected to perform mathematical reasoning directly and demonstrate advanced mathematical understanding. Comparisons with other state-of-the-art models and datasets suggest that ChatGPT can take on advanced mathematical tasks.

The study introduces the new GHOSTS dataset to assess how well ChatGPT understands mathematical reasoning. Compared with other models, ChatGPT has the potential to show a high degree of mathematical understanding, although other studies indicate that its mathematical understanding still needs improvement.


The dataset, called GHOSTS, covers a range of mathematical difficulty levels and problem types. Its sub-datasets are Grad-Text, Holes-in-Proofs, Olympiad-Problem-Solving, Symbolic-Integration, MATH, Search-Engine-Aspects, and Theorem-Proof-Completion.

  1. Grad-Text:
    Text-based exercises at the graduate level, of the kind found in graduate mathematics textbooks and courses.
  2. Holes-in-Proofs:
    A dataset of mathematical proofs containing gaps or flaws. Tasks include filling in missing steps and identifying logical errors.
  3. Olympiad-Problem-Solving:
    Problems posed in mathematical olympiads, together with their answers and solutions, measuring mathematical problem-solving skill.
  4. Symbolic-Integration:
    Symbolic integration tasks over a variety of mathematical functions and expressions, focusing on symbolic computation.
  5. MATH:
    Problems drawn from the public MATH benchmark of competition-style mathematics problems across a variety of topics.
  6. Search-Engine-Aspects:
    Tasks that use the model like a search engine for mathematics: queries and results for retrieving mathematical information.
  7. Theorem-Proof-Completion:
    Partial proofs of mathematical theorems and propositions that the model must complete, probing understanding of the structure and logic of proofs.
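To make the Symbolic-Integration task concrete: a proposed closed-form antiderivative can be spot-checked numerically by differentiating it and comparing against the integrand. The integrand below is an invented illustration, not an item from GHOSTS, and this is a sketch of one possible checking approach, not the paper's grading procedure.

```python
import math

def integrand(x):
    # Illustrative integrand: x * e^(x^2)
    return x * math.exp(x ** 2)

def proposed_antiderivative(x):
    # A model's claimed closed form: e^(x^2) / 2
    return math.exp(x ** 2) / 2

def numeric_derivative(f, x, h=1e-6):
    # Central-difference approximation of f'(x)
    return (f(x + h) - f(x - h)) / (2 * h)

# The claim passes if d/dx [antiderivative] matches the integrand
# at several sample points.
for x in [0.1, 0.5, 1.0, 1.5]:
    assert math.isclose(
        numeric_derivative(proposed_antiderivative, x),
        integrand(x),
        rel_tol=1e-5,
    )
print("antiderivative check passed")
```

A numerical check like this catches wrong answers cheaply, though only a symbolic comparison proves the antiderivative correct everywhere.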

The researchers evaluated the prompts and model outputs on data points manually labeled by mathematics experts. Creating the dataset required mathematical insight and detailed mathematical evaluation, took hundreds of hours, and yielded a total of 1636 prompts.
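As a rough sketch of how such expert-labeled prompts might be represented and averaged into a per-model rating (the field names and records below are invented for illustration, not the dataset's actual format):

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class RatedPrompt:
    subdataset: str   # e.g. "Grad-Text", "Symbolic-Integration"
    prompt: str       # the natural-language mathematics question
    rating: int       # expert score from 1 (worst) to 5 (best)

# Invented example records, not real GHOSTS entries.
records = [
    RatedPrompt("Grad-Text", "State the definition of a Banach space.", 5),
    RatedPrompt("Olympiad-Problem-Solving", "Prove the given inequality.", 2),
    RatedPrompt("Symbolic-Integration", "Integrate x*exp(x**2) dx.", 3),
]

# The overall score reported per model is an average over such ratings.
overall = mean(r.rating for r in records)
print(f"average rating: {overall:.2f}")  # average rating: 3.33
```

Averages like the 3.20 and 3.50 figures discussed below are aggregates of this kind, which is why per-subdataset breakdowns are more informative than a single number.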

This study was undertaken to determine how well ChatGPT can handle mathematical reasoning. The dataset was created to provide a comprehensive assessment of different aspects of mathematics, providing a wealth of information beyond existing datasets.

Experimental results

ChatGPT showed average performance on university-level mathematics, but struggled with difficult exercises and advanced problems such as those from mathematical olympiads. It generally performed well, however, on tasks that required stating basic mathematical facts.

The January 9, 2023 version of ChatGPT achieved an average rating of 3.20, struggling in particular with proof-based questions and complex symbolic computation. It excelled, however, at recognizing the context of a question and matching notation, and scored well on simple mathematical tasks.

GPT-4, on the other hand, performed better than ChatGPT, achieving an average of 3.50 on the miniGHOSTS dataset. This suggests that GPT-4 can handle more advanced mathematical problems.

Figure 1 shows the ratings for each model. In particular, it is highlighted that GPT-4 outperforms the other models, with an average rating of 4.15.

This Sankey diagram visually shows how the various models were evaluated. The flow from top to bottom represents the evolution of the ratings from ChatGPT on January 9, 2023, to ChatGPT on January 30, 2023, to GPT-4. Bands of different widths represent the share of each rating for each model (5 is the best score). There is some shuffling of grades between the January 9 and January 30 ChatGPT models, but their overall performance remains nearly identical.

GPT-4, by contrast, shows a significant increase in top ratings (score 5). The figure makes the evolution of the models and the changes in performance clear at a glance. In short, ChatGPT is well suited to basic mathematical tasks but limited on advanced problems; GPT-4, its successor, handles more difficult problems and improves overall mathematical competence.
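The band widths in such a Sankey diagram encode, for each model, the share of outputs receiving each score. A minimal sketch of computing those shares (the ratings below are invented placeholders, not the paper's actual distribution):

```python
from collections import Counter

# Hypothetical expert scores for one model's outputs (1 = worst, 5 = best).
ratings = [5, 5, 4, 3, 5, 2, 4, 5, 1, 5]

counts = Counter(ratings)
total = len(ratings)

# Each percentage would set the width of one band in the diagram.
for score in range(1, 6):
    share = 100 * counts.get(score, 0) / total
    print(f"score {score}: {share:.0f}%")
```

Comparing these per-score shares across models shows exactly where GPT-4's gains come from, namely the growth of the score-5 band.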


While ChatGPT is not yet reliable on mathematical tasks, it can produce surprisingly good answers. In particular, however, it struggles with advanced mathematics.

GPT-4 performs better than ChatGPT on the miniGHOSTS dataset, and its mathematical capabilities are expected to improve further. Although ChatGPT is not a specialized model and does not excel at specific mathematical tasks, it has the flexibility to search for mathematical objects.

The important point is that ChatGPT can be integrated as a mathematical assistant to speed up search tasks over mathematical objects. However, users need some mathematical knowledge of their own to accurately identify the objects it retrieves.

The study points out that the GHOSTS dataset is still inadequate as a mathematical benchmark and encourages future researchers to deepen and improve it. Ultimately, the goal is to extend the GPT-4 evaluation to the full GHOSTS dataset and establish it as a mathematical benchmark.

