
[FinBen] Benchmark To Assess The Capabilities And Limitations Of LLM In The Financial Domain


Large Language Models

3 main points
✔️ Introducing FinBen, a new benchmark: the first open-source, comprehensive evaluation benchmark aimed at addressing finance-specific challenges and assessing the capabilities and limitations of large-scale language models in the financial domain.
✔️ Key Findings: Evaluation on FinBen reveals that while GPT-4 performs well on many tasks, other models outperform it on specific tasks. While large-scale language models excel at basic tasks, there is room for improvement on tasks requiring more advanced cognitive abilities.

✔️ Implications for future research: Provides valuable insights into how large-scale language models can contribute to financial trading decisions and suggests new directions for the application and development of large-scale language models in the financial domain.

The FinBen: An Holistic Financial Benchmark for Large Language Models
written by Qianqian Xie, Weiguang Han, Zhengyu Chen, Ruoyu Xiang, Xiao Zhang, Yueru He, Mengxi Xiao, Dong Li, Yongfu Dai, Duanyu Feng, Yijing Xu, Haoqiang Kang, Ziyan Kuang, Chenhan Yuan, Kailai Yang, Zheheng Luo, Tianlin Zhang, Zhiwei Liu, Guojun Xiong, Zhiyang Deng, Yuechen Jiang, Zhiyuan Yao, Haohang Li, Yangyang Yu, Gang Hu, Jiajia Huang, Xiao-Yang Liu, Alejandro Lopez-Lira, Benyou Wang, Yanzhao Lai, Hao Wang, Min Peng, Sophia Ananiadou, Jimin Huang
(Submitted on 20 Feb 2024)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)


The images used in this article are from the paper, the introductory slides, or were created based on them.


In recent years, large-scale language models have transformed the natural language processing landscape, with models such as ChatGPT and GPT-4 demonstrating their capabilities in fields ranging from mathematics to medicine to law to finance. Despite these advances, however, much work remains to be done on the capabilities and limitations of these models in the financial field. In particular, while their potential in financial text analysis and forecasting tasks has received much attention, there is a lack of extensive evaluation of the adaptability of the models to the complexity and unique demands of the financial sector.

Existing assessment benchmarks for the finance domain focus on language comprehension skills, but do not adequately assess understanding of financial knowledge or the ability to solve real-world financial tasks. Newly released general domain benchmarks also do not address finance-specific issues, highlighting the need for systematic assessment benchmarks specific to the financial domain.

To address this need, the paper proposes a new benchmark called FinBen, the first open-source benchmark aimed at comprehensively evaluating the capabilities of large-scale language models in the financial domain. Covering 23 financial tasks and 35 datasets, the benchmark tests a wide range of skills, from language comprehension to numerical reasoning to text generation. It reveals the true capabilities and limitations of large-scale language models in finance and provides insights to improve their application in the financial sector.

The evaluation on FinBen revealed that while GPT-4 performs well on many tasks, other models outperform it on certain generative and predictive tasks, and that state-of-the-art large-scale language models excel on basic tasks but leave room for improvement on more advanced ones. These results are in line with the findings of previous studies and provide valuable insights into the application and development of large-scale language models in the financial domain.


Here we present FinBen, designed to assess the abilities of large-scale language models in the financial sector in a multidimensional manner. The framework is based on the Cattell-Horn-Carroll (CHC) theory and organizes cognitive abilities into three main spectrums. These range from basic quantitative reasoning, extraction, and numerical understanding, through more advanced cognitive processes assessed via generation and prediction tasks, up to the strategic decision-making capacity of large-scale language models on the most demanding financial challenges. In this way, the benchmark probes the financial analysis capabilities of large-scale language models across a wide range of cognitive demands. The specific tasks, the datasets used, and their statistics and metrics are presented in the figure and table below.

The figure below shows the FinBen evaluation data set.

The table below lists the tasks, datasets, data statistics, and evaluation metrics included in FinBen.

First, Spectrum I: Basic Tasks assesses the quantitative reasoning, extraction, and numerical understanding abilities of large-scale language models through 16 tasks over 20 datasets. Quantitative reasoning includes eight different classification tasks, such as information extraction from financial text and sentiment analysis. For example, the sentiment analysis task uses the Financial PhraseBank and FiQA-SA datasets to extract sentiment information from financial texts. The extraction tasks assess the ability to accurately retrieve specific information from financial documents, while the understanding tasks measure the ability of large-scale language models to interpret complex numerical data and statistics. Each task is assessed using accuracy and F1 scores. Together, these tasks show how effectively financial language models can address the variety of challenges encountered in real-world financial environments.
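To make these metrics concrete, here is a minimal sketch (not FinBen's actual evaluation code) of accuracy and macro-averaged F1 over a toy set of sentiment labels; the labels below are invented purely for illustration:

```python
def accuracy(y_true, y_pred):
    """Fraction of examples labeled correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    f1s = []
    for c in set(y_true) | set(y_pred):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)

# Toy sentiment labels (positive / negative / neutral), purely illustrative
gold = ["pos", "neg", "neu", "pos", "neg"]
pred = ["pos", "neg", "pos", "pos", "neu"]
print(accuracy(gold, pred))  # 0.6
print(macro_f1(gold, pred))
```

Macro-averaging weights each class equally, which matters in financial sentiment data where the neutral class often dominates.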

Spectrum II consists of six tasks and 14 datasets designed to probe generative (crystallized intelligence) and predictive (fluid intelligence) capabilities. The generation tasks evaluate how effectively the model produces coherent, information-rich, and relevant text output. In particular, the ECTSUM dataset is used for summarizing earnings calls and the EDTSUM dataset for summarizing financial news articles. For evaluation, ROUGE scores, BERTScore, and BARTScore quantitatively measure the quality of the generated summaries. The prediction tasks test how accurately the model can predict the future behavior of markets and investors. They comprise five tasks as diverse as stock price trend prediction, credit scoring, fraud detection, financial crisis identification, and claims analysis, evaluated using the F1 score and the Matthews correlation coefficient.
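As a rough illustration of the summarization metrics, the sketch below computes a simplified ROUGE-1 F-score (unigram overlap between a reference and a candidate summary). Real evaluations use the official ROUGE implementation or a library such as rouge-score, so treat this as an approximation; the example sentences are invented:

```python
from collections import Counter

def rouge1_f(reference, candidate):
    """Simplified ROUGE-1: F1 over clipped unigram overlap."""
    ref, cand = Counter(reference.split()), Counter(candidate.split())
    overlap = sum((ref & cand).values())  # clipped match counts
    if overlap == 0:
        return 0.0
    recall = overlap / sum(ref.values())
    precision = overlap / sum(cand.values())
    return 2 * precision * recall / (precision + recall)

ref = "net profit rose 12 percent in the quarter"
cand = "net profit rose in the quarter"
print(round(rouge1_f(ref, cand), 3))  # 0.857
```

BERTScore and BARTScore replace this surface-level token overlap with embedding-based similarity, which is why papers usually report all three.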

Spectrum III: General Intelligence, framed as a trading task, is set as the ultimate challenge for large-scale language models and assesses a model's ability to integrate diverse information to formulate and implement trading strategies. This places it at the pinnacle of cognitive capability in financial analysis: using the state-of-the-art financial LLM agent FinMem, the models are evaluated in a simulated real-world trading environment based on an independently collected dataset of seven major stocks. Performance is measured using cumulative returns, the Sharpe ratio, daily and annual volatility, and maximum drawdown, comprehensively assessing each model's profitability, risk management, and decision-making capabilities.

Through these advanced datasets and benchmarks, we aim to explore new horizons of cognitive capabilities in the financial analysis of large-scale language models, paving the way for future technological developments.

Experimental results

The table below shows that GPT-4 achieves the best average performance on the basic tasks, followed by ChatGPT and Gemini.

Among all open-source large language models, FinMA-7B, a financial large language model, performs well on several classification tasks such as FPB, even outperforming much larger models such as GPT-4. This is due to its dedicated instruction tuning on the training datasets.

Among general-purpose large language models, LLaMA2-70B leads in average performance thanks to its large model size. Among models specialized for Chinese, ChatGLM2-6B outperforms InternLM-7B in average performance, demonstrating its effectiveness on financial tasks. However, CFGPT sft-7B-Full, fine-tuned on Chinese financial data, shows only limited improvement over its base model InternLM-7B on some datasets, such as MultiFin, and even degrades on others. This trend suggests a language mismatch, highlighting that fine-tuning on Chinese data may hurt performance on English tasks and underscoring the complexity of cross-lingual adaptation in model training.

In particular, on quantitative datasets such as Headlines, other financially tuned large-scale language models, including Gemini and FinMA-7B, perform as well as or better than GPT-4. However, GPT-4 and ChatGPT significantly outperform other models on understanding-task datasets such as FinQA and ConvFinQA, highlighting the limitations of the numerical reasoning capabilities of models such as Gemini and LLaMA2-70B. On extraction datasets such as FinRED, CD, FNXL, and FSRL, which require complex information extraction and numerical labeling, all models, including GPT-4, face challenges, indicating the need for further enhancements in these areas.

In the text generation task, Gemini leads the pack on EDTSUM, demonstrating its ability to generate coherent summaries. Nevertheless, all models struggle with extractive summarization, which requires generating accurate label sequences for sentences. In the prediction tasks, Gemini distinguishes itself on most datasets, while GPT-4 shows superior performance on the Australian credit scoring dataset.

In addition, a comparative analysis of the performance of the large-scale language models in the complex task of stock trading, a task that requires a high degree of general intelligence, is conducted. This analysis reveals that all large-scale language models outperform traditional buy-and-hold strategies and have the ability to guide investors to more profitable trading decisions.

Among them, GPT-4 performed particularly well in optimizing profit over risk, achieving the best Sharpe Ratio (SR) of over 1. This result shows that GPT-4 provides a safer investment route for investors, offering a lower-risk and more effective way to limit losses.

In contrast, ChatGPT showed limitations in its financial decision-making capability, resulting in significantly lower performance indicators. Gemini, on the other hand, ranked second to GPT-4, maintaining high returns with low risk and low volatility. The open-source LLaMA2-70B earned the smallest profit, though with low volatility, balancing risk management against profitability.

It is also noted that small models with fewer than 7 billion parameters struggle to follow stock trading instructions consistently and have limited comprehension, limited extraction capability, and short context windows, posing obvious challenges in tasks that require complex financial reasoning and decision making.

This paper shows that large-scale language models embody general intelligence in the financial domain and can apply advanced cognitive skills to real-world financial tasks. It heralds a new era in financial analysis and decision making, suggesting that large-scale language models have remarkable potential to understand and navigate the complexities of financial markets, and a promising path forward for further development and application to tasks requiring advanced general intelligence.

Among open-source large language models, LLaMA2-70B stands out in text summarization, and LLaMA2-7B-chat excels in the prediction tasks. Despite instruction tuning on datasets such as BigData22 and ACL18, FinMA-7B lags behind Falcon-7B and others in predictive performance, highlighting the need for more effective improvement strategies.

CFGPT sft-7B-Full consistently performs worse than its underlying model, InternLM-7B. It is also important to recognize that none of the large-scale language models meets expectations on prediction, lagging behind traditional methods. This observation is consistent with existing studies (Feng et al., 2023; Xie et al., 2023b) and highlights a notable gap in the ability of large-scale language models to address advanced cognitive tasks as effectively as traditional methods.

This analysis reveals significant improvement potential in large-scale language models, including industry leaders such as GPT-4 and Gemini, especially in text generation and prediction tasks that require advanced cognitive skills.

In conclusion, state-of-the-art large-scale language models such as GPT-4 show strong performance across quantitative tasks, but clear gaps remain in numerical reasoning and complex information extraction, pointing to the need for further development. Instruction tuning has been shown to significantly improve performance, suggesting a valuable approach to strengthening model capability on specialized financial tasks. These results also highlight the complexity of cross-lingual model tuning and the importance of careful language considerations in improving the effectiveness of large-scale language models across diverse financial tasks.


The "FinBen" presented in this paper is a groundbreaking benchmark for measuring the abilities of large-scale language models in the financial domain. It comprises 35 datasets across 23 different tasks, a breadth that far exceeds previous financial benchmarks, and assesses quantitative reasoning, extraction, understanding, generation, and prediction. Of particular note is the introduction of an agent-based framework for evaluating trading directly.

Through an exhaustive analysis of 15 large-scale language models, it was found that GPT-4 performed most prominently on the quantitative reasoning, extraction, understanding, and trading tasks, while Gemini performed best on the generation and prediction tasks. These results indicate that while large-scale language models are highly capable on basic tasks, they remain limited on tasks that require more advanced cognition and general intelligence.

This paper highlights the potential for large-scale language models to contribute directly to financial trading decisions and suggests new directions for future research in this area. Going forward, it is hoped that FinBen can be extended to cover more languages and a broader range of financial trading tasks, further exploring the potential of large-scale language models in finance and advancing the field.
