
ChatGPT And GPT-4 Take The CFA Exam! Testing The Applicability Of Large-scale Language Models In The Financial Sector

Large Language Models

3 main points
✔️ Evaluating the Financial Reasoning Performance of Large-Scale Language Models: The study evaluated the usefulness and limitations of ChatGPT and GPT-4 on financial reasoning problems by having them solve mock CFA exam questions, which require specialized knowledge of finance.
✔️ Detailed Analysis of Test Performance: On Level I and Level II practice questions for the CFA exam, the large-scale language models performed well on certain financial topics (e.g., derivatives and equity investments) but struggled on others, such as financial reporting and portfolio management.

✔️ Suggestions for Improving Financial Expertise and Problem-Solving Capabilities: The paper suggests strategies and improvements for increasing the applicability of large-scale language models to finance, including adding financial expertise and improving numerical and tabular processing capabilities.

Can GPT models be Financial Analysts? An Evaluation of ChatGPT and GPT-4 on mock CFA Exams
written by Ethan Callanan, Amarachi Mbakwe, Antony Papadimitriou, Yulong Pei, Mathieu Sibue, Xiaodan Zhu, Zhiqiang Ma, Xiaomo Liu, Sameena Shah
(Submitted on 12 Oct 2023)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); General Finance (q-fin.GN)


The images used in this article are from the paper, the introductory slides, or were created based on them.


The field of natural language processing has been transformed by the emergence of large-scale language models. In particular, models such as OpenAI's ChatGPT and GPT-4, as well as LLaMA and PaLM, have attracted widespread attention for their easy-to-understand dialogue style. These models excel at a wide range of tasks, from text summarization to code generation to question answering. The financial sector is also increasingly using them to improve customer service and analyze sentiment. However, general-purpose natural language processing models still leave room for improvement on finance-specific tasks.

This paper uses mock exam questions from the Chartered Financial Analyst (CFA) program to explore how useful large-scale language models are for real-world financial reasoning problems. The CFA exam is known as a thorough and practical test of financial expertise, which makes it an ideal case study for assessing how well large-scale language models can understand and solve complex financial reasoning problems. The paper provides a detailed analysis of estimated performance on Levels I and II of the CFA exam.

The study shows that ChatGPT and GPT-4 can solve a portion of the financial reasoning problems, but it also reveals limitations on certain types of problems.

The paper also discusses strategies and improvements for increasing the applicability of large-scale language models in the financial field. These include suggestions for new research and development directions, such as incorporating financial expertise and improving problem-solving skills.

This study is the first comprehensive assessment of ChatGPT and GPT-4's capabilities for financial reasoning problems and aims to lay the groundwork for improving the applicability of large-scale language models to finance.


The CFA exam, from which this data set is drawn, consists of three levels covering a wide range of topics, from financial fundamentals to asset valuation, portfolio management, and wealth planning. It is taken by people with backgrounds in finance, accounting, economics, and business who want to pursue careers in the financial industry, and obtaining the CFA designation is an important qualification for professions such as investment management and risk management.

In addition, every question on the CFA exam, regardless of level, is associated with one of ten financial topics: ethics, quantitative methods, economics, financial statement analysis, corporate issuers, portfolio management, equity investments, fixed income, derivatives, and alternative investments. Level I consists of 180 independent multiple-choice questions, each with three choices. Level II consists of 22 item sets, each made up of a vignette (a case description with supporting evidence) and accompanying three-choice questions. Level III is a mixture of essay questions (essay and short answer) and item sets (three-choice questions).

Level   Test format
I       Multiple Choice
II      Item Set (3-choice questions)
III     50% Essay (short answer), 50% Item Set (3-choice questions)

However, since no official CFA exam questions are published, practice exams are used to benchmark models. This study focuses on Level I and II questions because Level III questions require plain-text answers. The authors collected five Level I practice exams and two Level II practice exams, and also used the example questions published by the CFA Institute. In this data set, each financial topic is represented in appropriate proportions, which reflects the structure of the questions and the weight of each topic at each level.

Below are sample questions for Levels I and II.

The table below also shows statistics for Level I and II questions.

Experimental procedure

Several prompting paradigms are examined when using the CFA practice exams to assess the financial reasoning skills of ChatGPT and GPT-4.

The first is zero-shot (ZS) prompting, which evaluates the model's inherent reasoning ability without providing any worked examples in the input.

The second is few-shot (FS) prompting. It gives the model prior examples of the expected behavior, which helps it acquire new knowledge for solving the question. Two different approaches to selecting FS examples are tested:

  • Random sampling from all question sets within a test level (2S, 4S, 6S)
  • Sampling of one question from each topic at each test level (10S)

The second approach aims to let the model identify the distinct attributes of each topic within each exam level. Because of the limited GPT-4 context window and the length of the Level II item sets (three-choice questions), the 6S and 10S prompts are not evaluated for GPT-4 at Level II.
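The two ways of selecting few-shot examples can be sketched as follows. This is a minimal illustration, not the authors' code: the Question record, the function names, and the placeholder question pool are all assumptions made for the example.

```python
import random
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    """Hypothetical record for one practice question (not from the paper)."""
    topic: str
    text: str
    answer: str  # "A", "B", or "C"


def sample_random(pool, k, seed=0):
    """Strategy 1: draw k distinct examples at random from the whole pool."""
    rng = random.Random(seed)
    return rng.sample(pool, k)


def sample_one_per_topic(pool, seed=0):
    """Strategy 2: draw one example from each topic present in the pool."""
    rng = random.Random(seed)
    by_topic = defaultdict(list)
    for question in pool:
        by_topic[question.topic].append(question)
    # Iterate topics in sorted order so a fixed seed gives a fixed result.
    return [rng.choice(by_topic[topic]) for topic in sorted(by_topic)]


TOPICS = [
    "ethics", "quantitative methods", "economics",
    "financial statement analysis", "corporate issuers",
    "portfolio management", "equity investments",
    "fixed income", "derivatives", "alternative investments",
]

# Placeholder pool: three dummy questions per topic.
pool = [Question(t, f"{t} question {i}", "A") for t in TOPICS for i in range(3)]

shots_4 = sample_random(pool, 4)       # corresponds to a 4S prompt
shots_10 = sample_one_per_topic(pool)  # corresponds to a 10S prompt
print(len(shots_4), len(shots_10))
```

With ten topics, the per-topic strategy always yields ten examples, which is why it is only available as a 10S setting.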

The third is chain-of-thought (CoT) prompting. For each exam level, the model is prompted to think through the input question step by step and to show its calculations. This has the added benefit of exposing the model's problem-solving process, which makes it possible to identify where and why mistakes were made.
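To make the difference between a direct prompt and a CoT prompt concrete, here is a small sketch. The instruction wording, the "Final answer:" convention, the helper names, and the sample question are assumptions for illustration; the article does not give the exact prompts used in the paper.

```python
import re

LABELS = "ABC"  # CFA Level I and II questions have three choices


def format_question(question, choices):
    """Render a question followed by its three labeled choices."""
    if len(choices) != len(LABELS):
        raise ValueError("expected exactly three choices")
    lines = [question]
    lines += [f"{label}. {choice}" for label, choice in zip(LABELS, choices)]
    return "\n".join(lines)


def build_prompt(question, choices, cot=False):
    """Build a zero-shot prompt, optionally asking for step-by-step reasoning."""
    if cot:
        instruction = (
            "Think step by step and show your calculations. "
            "End with a line of the form 'Final answer: <A, B, or C>'."
        )
    else:
        instruction = "Reply with a single line of the form 'Final answer: <A, B, or C>'."
    return format_question(question, choices) + "\n\n" + instruction


def extract_answer(response):
    """Return the letter after the last 'Final answer:' marker, or None."""
    matches = re.findall(r"Final answer:\s*([ABC])\b", response)
    return matches[-1] if matches else None


# Made-up sample question, used only to show the prompt layout.
prompt = build_prompt(
    "A call option with a strike of 100 expires when the underlying trades at 105. "
    "What is its payoff at expiry?",
    ["0", "5", "105"],
    cot=True,
)
print(prompt)
print(extract_answer("The payoff is max(0, 105 - 100) = 5.\nFinal answer: B"))
```

Asking for a fixed final line keeps grading simple: the reasoning text can be inspected for errors, while only the extracted letter is compared with the answer key.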

The models were accessed through OpenAI's ChatCompletion API (the gpt-3.5-turbo and gpt-4 models), with the temperature parameter set to zero to make the model outputs as deterministic as possible. To measure performance, the models' predictions are compared with the established answer key of each collected CFA mock exam. Throughout the experiments, accuracy is used as the evaluation metric.
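The accuracy-based evaluation can be sketched as follows. The function names and the dummy model are assumptions: ask_model is a stand-in for a call to a chat model with temperature set to zero, so the example runs without network access.

```python
def accuracy(predictions, gold):
    """Fraction of predictions that match the answer key."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and answer key differ in length")
    if not gold:
        raise ValueError("answer key is empty")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


def evaluate(questions, gold, ask_model):
    """Query the model once per question and score the answers."""
    predictions = [ask_model(question) for question in questions]
    return accuracy(predictions, gold)


def dummy_model(question):
    """Placeholder for a real API call; always answers 'A'."""
    return "A"


# Toy data, not taken from any CFA practice exam.
questions = ["q1", "q2", "q3", "q4"]
gold = ["A", "B", "A", "C"]

score = evaluate(questions, gold, dummy_model)
print(f"accuracy = {score:.2f}")  # 2 of 4 answers match the key
```

Passing the model as a function keeps the scoring logic independent of any particular API client, so the same code could score ChatGPT, GPT-4, or a cached set of responses.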

Experimental results

As mentioned above, this paper has large language models (LLMs) take CFA mock exams, which simulate a certification exam for financial analysts. The table below shows the accuracy of ChatGPT and GPT-4 on Level I.

The table below also shows the Accuracy of ChatGPT and GPT-4 for Level II.

From the two tables above, we can see that ChatGPT and GPT-4 have more difficulty with the Level II exam than with the Level I exam. This difference can be attributed to the more complex format and content of the Level II exam.

Level II prompts are on average about 10 times longer than Level I prompts. The greater length dilutes the relevant information, making it harder for the model to get to the heart of the question. In particular, Level II questions include more detailed case studies that reflect real-life situations, which increases the information-processing burden compared with more general questions.

Level II also includes more specialized and complex questions, with each item set focusing in depth on a specific financial topic. This is in contrast to the broad question format of Level I.

In addition, Level II has more computationally demanding questions and table-based questions. The inherent limitations of numerical and table processing in large-scale language models may contribute to the low accuracy at this level.

The two aforementioned tables also show that GPT-4 outperforms ChatGPT in almost all experiments, but both models struggle on certain financial topics.

In Level I, both models performed well, particularly on derivatives, alternative investments, corporate issuers, equity investments, and ethics. For derivatives and ethics, this may be because the questions were relatively easy: accurate answers required fewer calculations and less table comprehension. In addition, popular financial concepts such as options and arbitrage were stated explicitly in the question text for these topics, which may have reduced the difficulty.

On the other hand, both models perform relatively poorly on financial reporting and portfolio management. In particular, ChatGPT struggles with computationally intensive topics such as quantitative methods. The questions in these topics are more case-based, applied, and computational, and they contain CFA-specific content, which may have hurt performance.

In Level II, both models continue to perform well on derivatives, corporate issuers, and equity investments, while still struggling on financial reporting. Interestingly, both models show low accuracy on Level II ethics. This may be because Level II questions are more detailed and situational than Level I questions, which makes them particularly challenging.

We also observed that CoT (chain-of-thought) prompting improved consistently over ZS (zero-shot) prompting, but not as much as initially expected. The effect is limited as well when compared with FS (few-shot) prompting, notably for GPT-4 at Level II.

At Level I, CoT prompting improved GPT-4's performance by only 1% in relative terms, and ChatGPT's performance actually decreased. At Level II, CoT prompting produced a 7% relative improvement over ZS for GPT-4, but only 1% for ChatGPT. These small gains suggest that CoT is less effective than expected.
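A relative improvement is the change in accuracy divided by the baseline accuracy, which is different from the change in percentage points. The short sketch below shows the arithmetic; the two accuracy values are hypothetical and are not taken from the paper's tables.

```python
def relative_improvement(baseline, new):
    """Relative change of `new` over `baseline`, e.g. 0.07 for +7%."""
    if baseline <= 0:
        raise ValueError("baseline accuracy must be positive")
    return (new - baseline) / baseline


# Hypothetical accuracies chosen only to illustrate the calculation.
zs_accuracy = 0.50
cot_accuracy = 0.535

absolute_gain = cot_accuracy - zs_accuracy
relative_gain = relative_improvement(zs_accuracy, cot_accuracy)
print(f"absolute: {absolute_gain * 100:.1f} points, relative: {relative_gain:.1%}")
```

In this made-up case a 7% relative improvement corresponds to only 3.5 percentage points of accuracy, which is why relative figures should be read together with the baseline.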

The paper contains several other more detailed examinations.


This paper evaluates the performance of ChatGPT and GPT-4 on CFA Level I and Level II mock exams to assess the usefulness of large-scale language models in the financial domain. The results show that GPT-4 performs better than ChatGPT on almost all topics and levels. Based on estimated pass rates and self-reported scores, the authors also conclude that ChatGPT would be unlikely to pass Levels I and II of the CFA exam under any of the prompting settings tested. GPT-4, on the other hand, was found to be likely to pass Levels I and II when using few-shot (FS) or chain-of-thought (CoT) prompts.

While CoT prompting helped the models better understand the questions and the information provided, it also exposed them to errors caused by incorrect or missing domain-specific knowledge, reasoning errors, and calculation errors. In contrast, incorporating positive examples into the prompt through FS yielded the highest performance at both levels.

Based on these results, future systems are expected to improve performance further by making use of various tools. Knowledge errors, the main type of error under CoT, could be addressed by Retrieval-Augmented Generation (RAG) with an external knowledge base containing CFA-specific information. Calculation errors could be avoided by offloading computation to a function or an API such as Wolfram Alpha. The remaining errors, reasoning errors and inconsistencies, could be reduced by using a critique model to review and question the reasoning before an answer is given, or by combining FS and CoT to provide examples of the expected behavior.
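As one illustration of offloading calculations, the sketch below evaluates an arithmetic expression with a small local function instead of relying on the language model's own arithmetic. It is an assumption-laden stand-in for an external tool, not Wolfram Alpha and not the paper's method; the calculate function and the sample expression are hypothetical.

```python
import ast
import operator

# Whitelisted operators; anything else is rejected.
_BINARY = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
}
_UNARY = {ast.USub: operator.neg, ast.UAdd: operator.pos}


def calculate(expression):
    """Evaluate plain arithmetic by walking the syntax tree, without eval()."""

    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
            return _UNARY[type(node.op)](_eval(node.operand))
        raise ValueError(f"unsupported expression: {expression!r}")

    return _eval(ast.parse(expression, mode="eval"))


# Hypothetical use: the model emits the expression, the tool computes it.
# Present value of 1000 received in 5 years at a 6% annual rate.
present_value = calculate("1000 / (1 + 0.06) ** 5")
print(round(present_value, 2))
```

In such a setup the model only has to produce the right expression, while the numerical result comes from deterministic code, which targets the calculation errors described above.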
