A Method to Automatically Evaluate "the Factual Accuracy of LLMs' Long-Form Output" Was Created
3 main points
✔️ Created "LongFact," a dataset for evaluating the factuality and information accuracy of long-form responses
✔️ Proposed "SAFE," a method for automatically evaluating the factuality of long-form responses using an LLM
✔️ Introduced "F1@K," a metric to quantify the factuality of long-form responses
Long-form factuality in large language models
written by Jerry Wei, Chengrun Yang, Xinying Song, Yifeng Lu, Nathan Hu, Jie Huang, Dustin Tran, Daiyi Peng, Ruibo Liu, Da Huang, Cosmo Du, Quoc V. Le
(Submitted on 3 Apr 2024)
Comments: Published on arxiv.
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
code:
The images used in this article are from the paper, the introductory slides, or were created based on them.
LLMs' long-form factuality can now be automatically evaluated
This paper, from Google DeepMind, "proposes a new dataset, evaluation method, and metric for benchmarking the long-form factuality of LLMs," i.e., the accuracy of the information in their long-form output.
The key points of this study are as follows
- Challenge: No dataset, evaluation method, or metric existed to evaluate the factuality of LLMs' long-form output.
- Solution: Introduce the "LongFact" dataset, the "SAFE" automatic evaluation method, and the "F1@K" evaluation metric.
- Point 1: The above method allows us to quantify the "factuality of the LLM's long-form output".
- Point 2: Larger models were found to produce more factual long-form output.
In other words, this research makes it possible to automatically evaluate the accuracy of the long-form information output by LLMs and can support future LLM development.
Current status in LLM performance evaluation
The performance of LLMs has improved remarkably in recent years, but at the same time they suffer from problems such as "hallucination" and "lying."
In particular, inaccuracy when generating long-form output is a key issue.
One reason for this is that no dataset existed to evaluate the factuality of LLMs' long-form answers. Most existing datasets consist of Q&A pairs with short answers, which makes it difficult to evaluate the factuality of long-form responses.
Furthermore, no method or metric for quantifying the factuality of long-form output had been established, so it could not be evaluated properly.
Methods proposed in this study
As mentioned earlier, this study proposes the following three components for automatically evaluating the long-form factuality of LLMs.
- LongFact
- SAFE (Search-Augmented Factuality Evaluator)
- F1@K
Let's look at each of these in turn.
Dataset: LongFact
LongFact is the new Q&A dataset proposed in this paper.
The main features are as follows
- Consists of 2,280 fact-seeking questions across 38 topics
- Topics are grouped into four categories: STEM (Science, Technology, Engineering, and Mathematics), Social Sciences, Humanities, and Other
- Consists of questions that require long answers
- Generate questions using GPT-4
- Remove duplicates from the generated questions and randomly select 30 questions for each topic
The left side of the figure below shows the "percentage of question topics included in LongFact," and the right side shows a "comparison of existing datasets with LongFact."
Compared to existing datasets, LongFact covers the largest number of topics among datasets that can be used to evaluate long-form factuality.
Incidentally, LongFact is publicly available on GitHub and can be used by anyone. As such, it is expected to serve as the basis for future LLM research.
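As a rough illustration, here is a minimal sketch of how one might load a local copy of LongFact and sample a fixed number of prompts per topic. The assumed file layout (one JSON Lines file per topic with a "prompt" field) and the helper name sample_longfact_prompts are for illustration only; the actual repository format may differ.

```python
import json
import random
from pathlib import Path

def sample_longfact_prompts(longfact_dir: str, per_topic: int = 30, seed: int = 0):
    """Sample up to `per_topic` prompts from each topic file (illustrative sketch).

    Assumes one JSON Lines file per topic with a "prompt" field; adjust the
    glob pattern and key names to match the actual LongFact release.
    """
    rng = random.Random(seed)
    sampled = {}
    for topic_file in sorted(Path(longfact_dir).glob("*.jsonl")):
        with topic_file.open() as f:
            prompts = [json.loads(line)["prompt"] for line in f if line.strip()]
        prompts = list(dict.fromkeys(prompts))          # drop exact duplicates
        rng.shuffle(prompts)
        sampled[topic_file.stem] = prompts[:per_topic]  # keep at most 30 per topic
    return sampled
```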
Evaluation method: SAFE (Search-Augmented Factuality Evaluator)
SAFE (Search-Augmented Factuality Evaluator) is a method proposed in this paper to automatically evaluate the long-form factuality of LLMs.
A diagram outlining SAFE is shown below.
The evaluation by SAFE follows these steps
- Input Prompt to LLM and have it output Response
- Using LLM to break down the Response text into several "elements"
- Use LLM to determine whether "each decomposed element is relevant to the input Prompt"
- Generate Google search queries using LLM for "individual elements" determined to be relevant
- Google search with generated query
- Determine whether each "individual element" is correct information based on the Google search results (i.e., whether there is supporting evidence)
In short, as shown in the figure below, SAFE decomposes the output text into elements, generates search queries, and runs Google searches to find information in the search results that supports each fact.
Naturally, the higher the "number of correct elements of information," the more reliable the Response output by the LLM.
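To make the flow concrete, here is a heavily simplified sketch of such a pipeline in Python. The helpers call_llm(prompt) -> str and google_search(query) -> str are hypothetical placeholders for an LLM API and a search API, and the prompt wording is invented; the actual SAFE implementation in the official repository differs in its prompts, rating scheme, and iterative search logic.

```python
def safe_evaluate(prompt: str, response: str, max_queries: int = 3) -> dict:
    """Simplified SAFE-style pipeline sketch (not the official implementation)."""
    # 1. Split the response into individual, self-contained facts.
    facts = call_llm(f"Split the following text into individual facts, one per line:\n{response}")
    facts = [f.strip() for f in facts.splitlines() if f.strip()]

    supported, not_supported, irrelevant = 0, 0, 0
    for fact in facts:
        # 2. Keep only facts that are relevant to the original question.
        relevant = call_llm(
            f"Question: {prompt}\nFact: {fact}\nIs this fact relevant to answering the question? (yes/no)"
        )
        if not relevant.lower().startswith("yes"):
            irrelevant += 1
            continue

        # 3-4. Generate Google search queries for the fact and collect evidence.
        evidence = []
        for _ in range(max_queries):
            query = call_llm(f"Write a Google search query to verify this fact: {fact}")
            evidence.append(google_search(query))

        # 5. Judge whether the collected evidence supports the fact.
        verdict = call_llm(
            f"Fact: {fact}\nSearch results: {evidence}\nIs the fact supported? (supported / not supported)"
        )
        if "not supported" in verdict.lower():
            not_supported += 1
        else:
            supported += 1

    return {"supported": supported, "not_supported": not_supported, "irrelevant": irrelevant}
```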
When the authors assigned ground-truth labels to the 100 facts on which SAFE and human annotators disagreed, they found that SAFE was correct 76% of the time, while the human annotations were correct only 19% of the time. In addition, SAFE achieved this at less than one-twentieth the cost of human raters.
In other words, SAFE is both relatively low-cost and highly accurate.
Incidentally, the implementation code for SAFE is also available as open source on GitHub and can be used by anyone.
Evaluation index: F1@K
F1@K is a metric that takes into account both precision and recall. Specifically, they are defined as follows
- Precision $ Prec(y) $: the percentage of "correct elements of information" among all elements in output $ y $
- Recall $ R_K(y) $: defined as $ \min(S(y)/K, 1) $, where $ S(y) $ is the number of "correct elements of information" in output $ y $ and $ K $ is the number of such elements the user is assumed to prefer (i.e., the preferred response length measured in correct elements)
And $ F1@K $ combines precision and recall with the following equation
If $ S(y) > 0 $:
$ F1@K(y) = \frac{2 \cdot Prec(y) \cdot R_K(y)}{Prec(y) + R_K(y)} $
If $ S(y) = 0 $:
$ F1@K(y) = 0 $
In other words, F1@K takes values between 0 and 1, with closer to 1 indicating greater factuality of longer sentences.
K is a hyperparameter representing the length of output (the number of "correct elements of information") that the user prefers. It is assumed that users consider more "correct elements of information" to be better up to K, but are indifferent to correct elements beyond K.
For example, if K=64, the user considers the more "correct elements of information" up to 64, the better, but is indifferent about the 65th and beyond.
The value of K needs to be set according to the user's preference.
This allows us to evaluate not only whether it is factual, but also whether it contains a sufficient amount of information.
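As a minimal sketch, F1@K can be computed directly from the counts produced by SAFE. Here supported stands for $ S(y) $, the number of "correct elements of information," and total is the number of relevant elements in the response; the function name is an illustrative choice, not part of the paper's code.

```python
def f1_at_k(supported: int, total: int, k: int) -> float:
    """F1@K from the number of supported facts (S(y)) and the total fact count."""
    if supported == 0 or total == 0:
        return 0.0
    precision = supported / total        # Prec(y): fraction of supported facts
    recall = min(supported / k, 1.0)     # R_K(y): capped at 1 once K facts are supported
    return 2 * precision * recall / (precision + recall)

# Example: a response with 80 relevant facts, 70 of them supported, at K=64
print(round(f1_at_k(70, 80, 64), 3))  # precision 0.875, recall 1.0 -> F1@64 ≈ 0.933
```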
In fact, this paper uses F1@K to benchmark 13 LLMs and compare the factual performance of the models in long sentences.
Comparing LLM performance using the proposed dataset, evaluation method, and metric
Experimental Details
Thirteen LLMs (from the Gemini, GPT, Claude, and PaLM-2 families) are benchmarked on LongFact to examine the relationship between model size and long-form factuality.
Specifically, for 250 randomly selected questions from LongFact, outputs are generated using each model and evaluated with SAFE.
They then quantified and compared performance using F1@K (with K=64 and K=178).
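Conceptually, the benchmark loop then looks like the sketch below, which reuses the hypothetical safe_evaluate and f1_at_k helpers from above; generate(model, prompt) is a placeholder for each model's API call and is not part of the paper's code.

```python
def benchmark(models, prompts, k_values=(64, 178)):
    """Average F1@K per model over a set of LongFact prompts (conceptual sketch)."""
    scores = {}
    for model in models:
        per_k = {k: [] for k in k_values}
        for prompt in prompts:
            response = generate(model, prompt)                  # hypothetical model API
            counts = safe_evaluate(prompt, response)            # SAFE-style fact counts
            total = counts["supported"] + counts["not_supported"]
            for k in k_values:
                per_k[k].append(f1_at_k(counts["supported"], total, k))
        scores[model] = {k: sum(v) / len(v) for k, v in per_k.items()}
    return scores
```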
Results
The results of the experiment show that, in general, the larger the model, the more factual its long-form output.
For example, GPT-4-Turbo has higher factuality than GPT-4, and GPT-4 has higher factuality than GPT-3.5-Turbo. We also see that Gemini-Ultra has higher factuality than Gemini-Pro, and PaLM-2-L-IT-RLHF has higher factuality than PaLM-2-L-IT.
In addition, the three most factual models, regardless of K value, were GPT-4-Turbo, Gemini-Ultra, and PaLM-2-L-IT-RLHF.
Expect this research to serve as a foundation for future LLM development
This article introduced Google DeepMind's research on "methods for correctly evaluating the factuality and information accuracy of LLMs' long-form output."
In this study, the LongFact dataset, the automatic evaluation method SAFE, and the metric F1@K were proposed to evaluate the long-form factuality of LLMs.
These clarify the current state of long-form factuality in large-scale language models and provide a basis for future research.
Limitations of this study include the following
- LongFact and SAFE depend on an LLM, so the capability of the LLM used has a direct impact on their quality
- SAFE relies on Google Search and may not correctly evaluate some facts
- It has not been tested whether SAFE performs as well as or better than human expert-level annotators
Therefore, the authors plan to conduct future research on learning, fine-tuning, and the use of external tools to improve LLMs' long-form factuality.
They also state that they are planning methods to reduce SAFE's dependence on the language model and to evaluate the factual accuracy of long-form output against the LLM's internal knowledge.
Personal Opinion
Personally, I think this is an important study that tackles the critical issue of LLM evaluation head-on. Although there is still room for improvement in the proposed methodology, I feel that it has the potential to make a significant contribution to the future development of LLM research.
Further refinement of this method and dataset will make it possible to ensure accuracy when LLMs generate longer text. For example, LLMs could then be used in tasks where accuracy has so far been limited, such as writing a complete blog post or generating a full book.
Incidentally, the dataset and evaluation methods of this study are available on GitHub, and we recommend that those interested try them out.