Can Language Models Predict The Future At The Human Level?
3 main points
✔️ Collects question data from competitive forecasting platforms to evaluate LMs' forecasting capabilities.
✔️ Experiments show that the proposed system's performance on the test set is close to human predictions.
✔️ In the future, LM-based systems may be able to produce forecasts as accurate as those of competitive human forecasters.
Approaching Human-Level Forecasting with Language Models
written by Danny Halawi, Fred Zhang, Chen Yueh-Han, Jacob Steinhardt
(Submitted on 28 Feb 2024)
Comments: Published on arXiv.
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Summary
This study investigates whether language models (LMs) can predict future events. The authors develop a system that automatically retrieves information, generates predictions, and aggregates them. Question data is collected from competitive forecasting platforms to assess the LMs' forecasting ability. The results show that the system comes close to competitive human forecasters and, in some settings, exceeds them. This suggests that LM-based forecasting could provide useful input for organizational decision making.
Introduction
Predicting future events matters because governments and businesses rely on forecasts of economic and political trends to inform their decisions. Traditional forecasting relies on statistical methods and human judgment, both of which have limitations. In this study, the authors therefore developed a system that makes forecasts automatically using language models (LMs). The system collects information from news and other sources, makes predictions based on that information, and then combines multiple predictions into a final result.
Above is an overview of the search and inference system. The search system retrieves and summarizes news articles and feeds them into the inference system, which prompts the LM to reason and produce predictions. These predictions are then aggregated into a final prediction.
Related Research
Automatic forecasting systems play an important role in supporting human decision making. Past research has pitted machine learning systems against human forecasters using datasets of questions drawn from news articles. Recent studies using contest data through 2022 have shown that the forecasting accuracy of machine learning systems has improved, with some becoming comparable to human forecasters. However, such systems are still rare.
The most recent research focuses on questions from 2023-2024 and aims to further improve the accuracy of machine learning systems. Information retrieval (IR) is important for forecasting events, and using LMs improves question-answering capabilities. Forecast quality also depends on calibration and is evaluated with proper scoring rules.
Proposed Method
Search
The system generates search queries from the forecasting question and uses them to retrieve past news articles. An LM rates the relevance of each article, and less relevant articles are excluded. The remaining articles are summarized so that only the most relevant information is presented to the model.
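The four steps above (query generation, retrieval, relevance filtering, summarization) can be sketched as a small pipeline. This is only an illustration, not the authors' implementation: the `Article` type, the four callables, the `min_relevance` threshold, and `top_k` are all hypothetical names and values introduced here, and the toy lambdas stand in for real LM and news API calls.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Article:
    title: str
    text: str


def retrieve_context(
    question: str,
    generate_queries: Callable[[str], List[str]],
    search_news: Callable[[str], List[Article]],
    rate_relevance: Callable[[str, Article], float],
    summarize: Callable[[str, Article], str],
    min_relevance: float = 0.5,  # assumed threshold, not from the paper
    top_k: int = 3,              # assumed cutoff, not from the paper
) -> List[str]:
    """Return summaries of the most relevant articles for a question."""
    # 1. Generate search queries from the question.
    queries = generate_queries(question)

    # 2. Retrieve articles for every query, skipping duplicate titles.
    seen, articles = set(), []
    for query in queries:
        for article in search_news(query):
            if article.title not in seen:
                seen.add(article.title)
                articles.append(article)

    # 3. Rate relevance and drop articles below the threshold.
    scored = [(rate_relevance(question, a), a) for a in articles]
    kept = [(score, a) for score, a in scored if score >= min_relevance]

    # 4. Summarize the top_k highest-rated articles.
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [summarize(question, a) for _, a in kept[:top_k]]


# Toy stand-ins for the LM and the news API, for demonstration only.
corpus = [
    Article("Fed signals rate cut", "The Federal Reserve hinted at a cut."),
    Article("Local bakery opens", "A new bakery opened downtown."),
]
summaries = retrieve_context(
    "Will the Fed cut rates in March?",
    generate_queries=lambda q: ["Fed rate cut", "Federal Reserve March"],
    search_news=lambda query: corpus,
    rate_relevance=lambda q, a: 1.0 if "Fed" in a.title else 0.0,
    summarize=lambda q, a: a.title,
)
print(summaries)
```

Passing the LM and search calls in as functions keeps the sketch runnable without any external service; in a real system each callable would wrap a model or API request.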
Inference
Having the model reason explicitly about the forecasting question makes the basis for each forecast visible and improves the forecast itself. The model is asked to restate or expand on the question, consider the possible outcomes, discard weak arguments, and check its reasoning for bias. Both a base model and a fine-tuned model are used, and the predictions from each are collected.
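As a rough illustration of this kind of structured prompting, the sketch below builds a prompt that lists the reasoning steps and parses a probability from the model's reply. The prompt wording, the `Probability:` output convention, and the function names are assumptions made for this example; they are not the prompts used in the paper.

```python
import re
from typing import List, Optional

# Reasoning steps paraphrased from the description above (wording is assumed).
REASONING_STEPS = [
    "Restate the question and expand on it in your own words.",
    "List reasons the answer might be yes and reasons it might be no.",
    "Discard the arguments that are weak.",
    "Check the remaining reasoning for bias.",
    "End with a line of the form 'Probability: <number between 0 and 1>'.",
]


def build_prompt(question: str, summaries: List[str]) -> str:
    """Assemble a forecasting prompt from the question and article summaries."""
    articles = "\n".join(f"- {s}" for s in summaries) or "- (no articles retrieved)"
    steps = "\n".join(
        f"{i}. {step}" for i, step in enumerate(REASONING_STEPS, start=1)
    )
    return (
        f"Question: {question}\n\n"
        f"Article summaries:\n{articles}\n\n"
        f"Instructions:\n{steps}\n"
    )


def parse_probability(response: str) -> Optional[float]:
    """Extract the forecast from a reply, or None if it is missing or invalid."""
    match = re.search(r"Probability:\s*([01](?:\.\d+)?)", response)
    if match is None:
        return None
    value = float(match.group(1))
    return value if 0.0 <= value <= 1.0 else None


prompt = build_prompt("Will X happen by June?", ["Summary of article A."])
forecast = parse_probability("...reasoning...\nProbability: 0.35")
print(forecast)
```

Returning `None` for unparseable replies lets the caller drop or retry a sample instead of silently treating it as a forecast.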
Ensemble
Predictions from multiple models are combined to generate a more reliable final forecast. The best prompts and hyperparameters are selected and multiple forecasts are combined.
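One simple way to combine several forecasts is a trimmed mean, which discards the most extreme values before averaging. The sketch below assumes this aggregation rule for illustration; the article does not specify which rule the system uses, and the `trim` parameter is a hypothetical choice.

```python
from typing import Sequence


def trimmed_mean(forecasts: Sequence[float], trim: int = 1) -> float:
    """Average the forecasts after dropping the `trim` lowest and highest."""
    if not forecasts:
        raise ValueError("need at least one forecast")
    ordered = sorted(forecasts)
    # Only trim when at least one forecast would remain afterwards.
    if len(ordered) > 2 * trim:
        ordered = ordered[trim:len(ordered) - trim]
    return sum(ordered) / len(ordered)


# Five sampled forecasts for one question; the outliers 0.1 and 0.95 are dropped.
final = trimmed_mean([0.1, 0.6, 0.7, 0.8, 0.95])
print(round(final, 3))
```

Compared with a plain mean, trimming reduces the influence of a single badly reasoned sample on the final forecast.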
The figure above illustrates how training data is generated for self-supervised fine-tuning. Multiple candidate reasoning-prediction pairs are generated for each question, and the pairs whose predictions outperform the aggregate human forecast are selected and used to fine-tune the model.
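The selection step can be sketched as a filter that keeps a candidate only when its squared error against the resolved outcome is lower than that of the human forecast. The `Candidate` type, the dictionary layout, and the use of squared error as the comparison metric are assumptions for this example rather than details confirmed by the article.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Candidate:
    question_id: str
    reasoning: str
    prediction: float  # predicted probability that the answer is "yes"


def squared_error(prediction: float, outcome: int) -> float:
    """Per-question Brier term: lower is better."""
    return (prediction - outcome) ** 2


def select_training_pairs(
    candidates: List[Candidate],
    crowd_predictions: Dict[str, float],
    outcomes: Dict[str, int],
) -> List[Candidate]:
    """Keep candidates that beat the human forecast on their question."""
    selected = []
    for cand in candidates:
        outcome = outcomes[cand.question_id]
        crowd = crowd_predictions[cand.question_id]
        if squared_error(cand.prediction, outcome) < squared_error(crowd, outcome):
            selected.append(cand)
    return selected


candidates = [
    Candidate("q1", "reasoning A", 0.9),  # better than the crowd (0.7)
    Candidate("q1", "reasoning B", 0.4),  # worse than the crowd
    Candidate("q2", "reasoning C", 0.2),  # worse than the crowd (0.1)
]
selected = select_training_pairs(
    candidates,
    crowd_predictions={"q1": 0.7, "q2": 0.1},
    outcomes={"q1": 1, "q2": 0},
)
print([c.reasoning for c in selected])
```

The selected reasoning-prediction pairs would then serve as fine-tuning targets.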
Optimization
System optimization incorporates a variety of steps, such as fine-tuning inference models, adjusting hyperparameters, optimizing search queries, improving summarization during the inference process, and even introducing ensemble methods. This results in more accurate and reliable predictions and improves system performance. The system combines search and inference and effectively uses information from multiple models to improve the accuracy of forecasts.
Experiment
The experiments show that the proposed system's performance on the test set is close to that of human forecasters.
The table above shows the system's evaluation results by category (left) and platform (right). Averaged over all retrieval dates, the optimized system achieved a Brier score of 0.179 (human prediction: 0.149; lower is better) and an accuracy of 0.715 (human prediction: 0.770). The system thus comes close to, but does not match, human forecasters on both metrics, while outperforming previous studies and baseline models. The strengths and weaknesses of the system were also analyzed in detail, providing insight for future improvements.
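For reference, the Brier score of binary forecasts is the mean squared difference between the predicted probability and the outcome (1 if the event occurred, 0 otherwise), so always answering 0.5 scores 0.25. The sketch below computes both metrics on made-up numbers; the 0.5 accuracy threshold and the function names are assumptions for illustration.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecasts and 0/1 outcomes (lower is better)."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("probs and outcomes must be non-empty and equal length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def accuracy(
    probs: Sequence[float], outcomes: Sequence[int], threshold: float = 0.5
) -> float:
    """Fraction of questions where the thresholded forecast matches the outcome."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("probs and outcomes must be non-empty and equal length")
    hits = sum(int(p > threshold) == o for p, o in zip(probs, outcomes))
    return hits / len(probs)


# Made-up forecasts and resolutions, not data from the paper.
probs = [0.9, 0.2, 0.4]
outcomes = [1, 0, 1]
print(round(brier_score(probs, outcomes), 4), round(accuracy(probs, outcomes), 4))
```

Because the Brier score penalizes confident wrong answers heavily, it rewards calibration in a way that accuracy alone does not.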
(a) When provided with enough relevant articles, the system performs better. This demonstrates its ability to access information and process it appropriately.
(b) For questions on which humans are uncertain (predictions in the range of 0.3 to 0.7), the system outperforms them. In other words, the system makes better predictions than humans when the available information is uncertain. However, humans do better than the system on questions where their confidence is very high.
(c) The earlier the retrieval date, the better the system's Brier score. This suggests that the system can quickly retrieve and process the latest information.
The system also outperformed humans when it made predictions selectively, that is, only under the conditions where it performs well. Finally, the system was shown to complement human forecasts: combining the system's predictions with human predictions produced more reliable forecasts than human predictions alone. This demonstrates the potential value of the proposed system as a practical forecasting tool.
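Both ideas can be sketched in a few lines: a gate that only issues a system forecast under the favorable conditions listed in (a) and (b), and a weighted average that blends the system with the human forecast. The thresholds (`min_articles`, the 0.3 to 0.7 band as a hard rule) and the `system_weight` value are illustrative assumptions, not settings reported in the article.

```python
from typing import Optional, Tuple


def selective_forecast(
    system_p: float,
    crowd_p: float,
    n_relevant_articles: int,
    min_articles: int = 5,                              # assumed threshold
    uncertain_band: Tuple[float, float] = (0.3, 0.7),   # band from (b)
) -> Optional[float]:
    """Return the system forecast only when conditions favor it, else None."""
    low, high = uncertain_band
    crowd_uncertain = low <= crowd_p <= high
    enough_context = n_relevant_articles >= min_articles
    return system_p if crowd_uncertain and enough_context else None


def combine_with_crowd(
    system_p: float, crowd_p: float, system_weight: float = 0.25
) -> float:
    """Weighted average of the system and human forecasts."""
    if not 0.0 <= system_weight <= 1.0:
        raise ValueError("system_weight must be between 0 and 1")
    return system_weight * system_p + (1.0 - system_weight) * crowd_p


print(selective_forecast(0.6, 0.5, n_relevant_articles=6))  # system answers
print(selective_forecast(0.6, 0.9, n_relevant_articles=6))  # system abstains
print(combine_with_crowd(0.6, 0.8))
```

When the gate returns `None`, a deployment could simply fall back to the human forecast or to the blended value.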
Ablation
Three ablation experiments were performed. First, the system was evaluated using fine-tuned GPT-3.5 to check that its performance does not simply depend on the capabilities of GPT-4. Performance differed only slightly, suggesting that the results are not driven by GPT-4's capabilities alone. Next, to understand the benefit of fine-tuning, the system was evaluated using only the base GPT-4-1106-Preview model. Without fine-tuning, performance degraded slightly. Finally, the system was evaluated using only the base GPT-4-1106-Preview model without news retrieval. In this case, performance dropped to baseline levels. These results indicate that both fine-tuning and retrieval are important to the system's performance.
Conclusion
This study developed the first machine learning (ML) system capable of forecasting at a near-human level. It introduced a new search mechanism and a fine-tuning method for generating accurate predictions and reasoning. The authors also released a dataset drawn from five real-world prediction contests, providing a foundation for further research. Looking ahead, the following directions are highlighted.
Explore iterative self-supervision: repeatedly fine-tuning the model on its own outputs may promote self-improvement and raise performance.
Exploit larger training data: training LMs on larger corpora is expected to yield better predictive performance.
Domain-adaptive training: ways to use domain knowledge to fine-tune models so that they specialize in specific areas will be explored.
Use of state-of-the-art models: better performance is expected from using and fine-tuning state-of-the-art models.
These efforts may enable future LM-based systems to produce forecasts as accurate as those of competitive human forecasters.