When Should We Trust LLMs?
3 main points
✔️ Analyze the knowledge that LLMs store in parameters in a question-answer format
✔️ The percentage of correct answers to a question is positively correlated with the popularity of the question's subject
✔️ Achieve fast and high performance by applying external knowledge according to popularity
When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories
written by Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, Hannaneh Hajishirzi
(Submitted on 20 Dec 2022)
Comments: ACL 2023; Code and data available at this https URL
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Large language models (LLMs) such as GPT-3 have attracted attention for their remarkable performance. LLMs with many parameters, trained on large amounts of data, often retain knowledge such as facts and common sense in their parameters and can answer our questions to a certain degree. This knowledge is called parametric knowledge because it is embedded in the parameters.
However, LLMs also produce hallucinations (more recently also called confabulations), that is, plausible-sounding but false statements, and this has become a problem. Their knowledge is sometimes correct and sometimes wrong. When it is wrong, it must be supplemented with our own knowledge or with external information available on the Web, such as Wikipedia, before answering a given question. This kind of external knowledge, which is not embedded in the parameters, is called nonparametric knowledge.
This paper analyzes when LLMs' knowledge can and cannot be trusted and uses that analysis to decide when to fetch external knowledge, with the goal of building retrieval-augmented LLMs that apply external knowledge only when necessary.
Specifically, the authors address the following three research questions.
- To what extent do LLMs have factual knowledge and what influences their memory?
- How much can the acquisition of nonparametric knowledge complement parametric knowledge?
- Is it feasible to combine nonparametric and parametric knowledge as appropriate?
RQ1: Analysis of parametric knowledge in LLMs
The authors created a new dataset, PopQA, for analyzing parametric knowledge in LLMs. Each entry consists of a very simple question and its correct answer, both derived from a (Subject, Relation, Object) triplet obtained from Wikidata. In this section, PopQA is used to analyze LLMs' knowledge.
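As a rough illustration of how a triplet can become a question-answer pair, the sketch below fills a per-relation template with the subject and uses the object as the answer. The `Triple` type and the template strings are assumptions made for this example; they are not the actual PopQA templates.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Triple:
    subject: str
    relation: str
    obj: str


# Hypothetical templates, one per relation type (illustration only).
TEMPLATES = {
    "occupation": "What is {subject}'s occupation?",
    "country": "In what country is {subject}?",
}


def to_qa(triple: Triple) -> Tuple[str, str]:
    """Turn a (subject, relation, object) triple into a (question, answer) pair."""
    question = TEMPLATES[triple.relation].format(subject=triple.subject)
    return question, triple.obj
```

Because the question is generated only from the subject and the relation, any per-question statistic (such as popularity) can be attached to the subject entity.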
For each triplet in PopQA, we define the number of views per month on the subject's Wikipedia page as its popularity. This popularity is introduced as an indicator of how often the topic is discussed on the Web. In addition, a dataset called EntityQuestions is used in subsequent experiments. Both PopQA and EntityQuestions are datasets with a long tail distribution of popularity, as shown in Figure 3 below.
Correlation between popularity and percentage of correct answers
The experiments with PopQA use several representative LLMs. Prompts follow the format "Q: <question> A:". Models that are costly to query, such as those accessed through an API, are evaluated zero-shot, while the others are given 15 few-shot examples.
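The prompt format can be sketched as follows. This is a minimal example, assuming that demonstrations are simply prepended one per line in the same "Q: ... A: ..." form; the function name `build_prompt` and the line-per-example layout are choices made here, not details confirmed by the paper.

```python
from typing import List, Tuple


def build_prompt(question: str, demos: List[Tuple[str, str]]) -> str:
    """Format few-shot demonstrations followed by the target question.

    An empty `demos` list yields a zero-shot prompt.
    """
    lines = [f"Q: {q} A: {a}" for q, a in demos]
    # The target question ends with "A:" so the model completes the answer.
    lines.append(f"Q: {question} A:")
    return "\n".join(lines)
```

Passing 15 (question, answer) pairs as `demos` gives a 15-shot prompt, while an empty list gives the zero-shot variant.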
The figure above shows the experimental results. The upper graph shows the percentage of correct answers, and the lower graph shows, for each model, the correlation coefficient between the percentage of correct answers and popularity.
The upper graph shows that models with more parameters generally achieve a higher percentage of correct answers, indicating that more parameters allow more knowledge to be stored.
In the lower graph, we observe that, with the exception of a few relations, the larger the model, the stronger the correlation between the percentage of correct answers and the popularity of the question's subject. This indicates that how well LLMs memorize parametric knowledge depends on how often the topic is discussed on the Web.
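A correlation of this kind can be computed as in the sketch below. It assumes a Pearson correlation between the base-10 logarithm of monthly page views and a 0/1 correctness flag per question; the log transform and the function names are assumptions of this example rather than details taken from the article.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def popularity_correlation(page_views: Sequence[float],
                           correct: Sequence[bool]) -> float:
    """Correlate log10(page views) with per-question correctness (1 or 0).

    Page views must be positive because of the logarithm.
    """
    log_pop = [math.log10(v) for v in page_views]
    return pearson(log_pop, [1.0 if c else 0.0 for c in correct])
```

Running this separately for each relation and each model would give one coefficient per bar in a plot like the lower graph.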
The Impact of Relation
You will also notice that the results differ by relation. The authors point out that some relations can be easily "guessed" without memorizing the facts. Looking at the results again, country and sport have a higher percentage of correct answers than the other relations, yet their correlation with popularity is weak. The authors argue that this suggests the model answers such questions from superficial cues in the input rather than from knowledge about the subject. Their analysis also found that, for relations with such a gap between accuracy and correlation strength, the model tends to output the same few answers regardless of the question.
Scaling and tail knowledge
The graph above shows the relationship between the number of parameters, the percentage of correct answers, and popularity. The percentage of correct answers for the most popular entities (warm colors) increases with the number of parameters, while for the least popular entities (cool colors) it stays low across the entire range, regardless of the number of parameters. These results indicate that the so-called scaling law, under which performance improves as the number of parameters grows, does not necessarily apply to less popular entities.
RQ2: Complementing parametric knowledge with non-parametric knowledge
Next, the authors test the effect of extending the model with retrieved nonparametric knowledge (retrieval-augmented LLMs).
Effect of Retrieval
The experiment compares three methods: BM25, a term-matching method that retrieves passages from Wikipedia; Contriever, a pre-trained retriever that also retrieves from Wikipedia; and GenRead, which obtains supporting text by prompting an LLM instead of retrieving it. Note that nonparametric knowledge is provided for all questions in this experiment.
The results are shown in Figure 7 above, where we can see that the retrieval-augmented LLMs perform better than the model without Retrieval, i.e., without nonparametric knowledge (Vanilla).
External Knowledge and Popularity
A more detailed analysis of the results reveals the following interesting trends.
The figure above shows that the retrieval-augmented models (BM25, Contriever) outperform Vanilla on relatively less popular questions, while they are equal to or worse than Vanilla on more popular questions. Thus, retrieval does not appear to help for all inputs. GenRead generally outperforms Vanilla even though it relies only on the LLM's parametric knowledge. This may be an example of how prompting can effectively draw out knowledge stored in the parameters.
Non-parametric knowledge is not always effective
The authors then analyze in detail why the retrieval-augmented model is less accurate on highly popular inputs. Questions are divided into four groups according to whether the retrieval-augmented model answers correctly and whether the model without retrieval (GPT-3) answers correctly, and recall@1 is computed for each group, that is, whether the correct answer is contained in the top-1 retrieved document (the external knowledge).
The results are shown in Table 1 above. The values in parentheses indicate the share of each group in all of PopQA. Recall@1 is significantly lower for the questions that the model without retrieval answers correctly but the model with retrieval answers incorrectly (top right; 10% of all questions). This suggests that the model is misled by incorrectly retrieved external knowledge, which lowers its final performance.
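The grouping behind this kind of table can be sketched as follows. The example assumes recall@1 is judged by a case-insensitive substring match between any gold answer and the top-1 document, and the record keys (`vanilla_correct`, `retrieval_correct`, `top_doc`, `answers`) are hypothetical names chosen for this illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def hit_at_1(top_doc: str, answers: List[str]) -> bool:
    """True if any gold answer string appears in the top-1 retrieved document."""
    doc = top_doc.lower()
    return any(answer.lower() in doc for answer in answers)


def recall_by_outcome(records: List[dict]) -> Dict[Tuple[bool, bool], float]:
    """Average recall@1 per (vanilla_correct, retrieval_correct) group."""
    hits = defaultdict(list)
    for record in records:
        key = (record["vanilla_correct"], record["retrieval_correct"])
        hits[key].append(hit_at_1(record["top_doc"], record["answers"]))
    # Booleans sum as 0/1, so this is the fraction of hits in each group.
    return {key: sum(flags) / len(flags) for key, flags in hits.items()}
```

The group with key `(True, False)` corresponds to questions that the model answers correctly on its own but gets wrong once retrieval is added.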
RQ3: Adaptive Retrieval
Based on the results of the preceding experiments, the authors devise a method that answers questions using parametric and nonparametric knowledge as appropriate.
Adaptive Retrieval is a model that retrieves external knowledge to answer an input question if its popularity is below a threshold value. In the experiment, the threshold is set for each relation. BM25 is used for retrieval.
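The decision rule can be sketched as below. This is a minimal illustration: the two answering callables stand in for a closed-book LLM call and a retrieval-augmented call, and `pick_threshold` assumes the per-relation threshold is chosen to maximize accuracy on held-out examples, which is an assumption of this sketch rather than a procedure described in the article.

```python
from typing import Callable, Dict, List, Tuple


def adaptive_answer(question: str, relation: str, popularity: float,
                    thresholds: Dict[str, float],
                    answer_closed_book: Callable[[str], str],
                    answer_with_retrieval: Callable[[str], str]) -> str:
    """Retrieve only when the subject's popularity is below the relation's threshold."""
    if popularity < thresholds[relation]:
        return answer_with_retrieval(question)
    return answer_closed_book(question)


def pick_threshold(dev: List[Tuple[float, bool, bool]]) -> float:
    """Pick the threshold with the best accuracy on a non-empty held-out set.

    Each item is (popularity, closed_book_correct, retrieval_correct).
    The smallest popularity means "never retrieve"; infinity means "always retrieve".
    """
    candidates = sorted({pop for pop, _, _ in dev}) + [float("inf")]

    def accuracy(threshold: float) -> float:
        return sum(ret if pop < threshold else closed
                   for pop, closed, ret in dev) / len(dev)

    # On ties, max() keeps the first (lowest) candidate, i.e. less retrieval.
    return max(candidates, key=accuracy)
```

Calling `pick_threshold` once per relation yields the `thresholds` dictionary that `adaptive_answer` expects.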
The figure above shows the experimental results: Adaptive Retrieval (green) outperforms both the model without retrieval (blue) and the model that always retrieves (orange). However, for models with a large number of parameters (right side of the figure), the performance difference between Adaptive Retrieval and the other models is small. The authors' analysis of this phenomenon shows that the share of questions for which retrieval is performed varies with the size of the model.
The graph above shows this analysis. The vertical axis shows the percentage of questions for which nonparametric knowledge is used. In short, the smaller models rely mostly on external knowledge, while the larger models answer mostly from parametric knowledge.
Therefore, the authors conclude that the advantage of Adaptive Retrieval shrinks as the model gets larger, simply because retrieval is triggered less often and the behavior becomes closer to that of the model without retrieval.
In addition, the authors argue that Adaptive Retrieval also reduces the time needed to produce an answer, because retrieval is skipped when the popularity of the question's subject is high.
In this paper, the authors use popularity to determine when the knowledge in LLMs can be trusted and when it cannot, and build a retrieval-augmented system that relies on parametric knowledge by default but retrieves external knowledge as needed.
Because LLMs produce very natural-sounding output, that output should be handled with care. This research automatically judges how reliable an LLM's knowledge is likely to be, relying on internal knowledge where it is usable and on external information for areas the LLM does not cover. This is a very interesting paper on using powerful LLMs for inference and question answering.