FABLES, An Annotation Dataset For Book-Length Summarization Of Long Texts Of 100k Tokens Or More, Is Now Available!
3 main points
✔️ Building FABLES (Faithfulness Annotations for Book-Length Summarization), a dataset of annotations consisting of 26 book summaries and 3,158 claims
✔️ A three-step workflow significantly reduces the cost and time required to build the dataset
✔️ Statistical and qualitative analyses reveal the performance of multiple LLMs on book summarization
FABLES: Evaluating faithfulness and content selection in book-length summarization
code:
written by Yekyung Kim, Yapei Chang, Marzena Karpinska, Aparna Garimella, Varun Manjunatha, Kyle Lo, Tanya Goyal, Mohit Iyyer
(Submitted on 1 Apr 2024)
Comments: Published on arXiv.
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Introduction
Long-context large language models (LLMs) have attracted significant interest in recent years because of their ability to summarize book-length texts of over 100k tokens.
On the other hand, despite the importance of faithfulness (the fidelity of the output to the source) at the level of individual claims (the assertions made in a summary), recent studies in this area have focused only on input-independent aspects such as coherence.
This gap has been a major bottleneck in LLM research: the length and complexity of the input texts make hiring human annotators to read and understand them very expensive and time-consuming.
To solve these problems, this article describes a paper that opens up new possibilities for the LLM book summarization task by constructing FABLES (Faithfulness Annotations for Book-Length Summarization), a dataset of annotations on 26 book summaries and 3,158 LLM-generated claims, and by conducting comparative experiments with multiple LLMs.
FABLES (Faithfulness Annotations for Book-Length Summarization)
FABLES, the dataset newly constructed in this paper, consists of human annotations on the faithfulness and overall quality of book summaries generated by LLMs.
The premise is that a major bottleneck in building a large dataset of summaries and annotations has been the impossibility, in terms of both cost and time, of having annotators read texts of 100k tokens or more just to annotate LLM-generated summaries.
This paper solves the problem with a very simple approach: the dataset includes only books that the annotators have already read.
This reduces the time annotators need to understand the source material and makes it possible to deliberately include long texts of 100k tokens or more in the dataset.
In addition, the dataset in this paper was constructed in three steps, as shown in the figure below.
(a) Summarization
First, for the summarization step, electronic copies of 26 books published in 2023-2024 were prepared, as listed below.
As noted above, all of these books had already been read by the annotators. The average book length is 121k tokens, meaning the dataset deals with far longer texts than existing datasets.
To summarize the books, this paper adopts the existing hierarchical merging strategy (Chang et al., 2023) and uses GPT-3.5-Turbo, GPT-4, GPT-4-Turbo, Mixtral, and Claude-3-Opus as base models.
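To give an intuition for hierarchical merging, here is a minimal sketch based on the general description in Chang et al. (2023), not the authors' actual code. `llm_summarize` is a hypothetical stand-in for a prompted call to one of the base models above, and the character-based chunking is deliberately naive.

```python
# Minimal sketch of hierarchical merging; NOT the paper's implementation.

def llm_summarize(text: str) -> str:
    """Hypothetical wrapper around an LLM chat-completion call."""
    raise NotImplementedError("plug in your LLM client here")

def chunk(text: str, chunk_chars: int = 8000) -> list[str]:
    """Naive fixed-size chunking; the paper works in tokens, not characters."""
    return [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]

def hierarchical_merge(book_text: str, fan_in: int = 4) -> str:
    # Level 0: summarize each chunk of the book independently.
    summaries = [llm_summarize(c) for c in chunk(book_text)]
    # Higher levels: repeatedly merge groups of `fan_in` summaries
    # into one summary until a single book-level summary remains.
    while len(summaries) > 1:
        summaries = [
            llm_summarize("\n\n".join(summaries[i:i + fan_in]))
            for i in range(0, len(summaries), fan_in)
        ]
    return summaries[0]
```

This bottom-up structure is what makes book-length inputs tractable: each individual LLM call only ever sees a chunk or a small group of summaries, never the full 100k+ token text.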
(b) Claim Extraction
The next step decomposes each resulting summary into multiple claims to allow fine-grained annotation.
As an example, a summary generated by Claude-3-Opus and the claims extracted from it by GPT-4 are shown below.
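As a rough sketch of this step, the snippet below shows how a summary could be decomposed into atomic claims with an OpenAI-style chat API. The prompt text is an illustrative assumption; the paper's actual prompt is not reproduced here.

```python
# Sketch of claim extraction with GPT-4; the prompt is illustrative only.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "Decompose the following summary into a list of atomic claims. "
    "Each claim must be a single, self-contained factual statement. "
    "Return one claim per line.\n\nSummary:\n{summary}"
)

def extract_claims(summary: str, model: str = "gpt-4") -> list[str]:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(summary=summary)}],
    )
    text = response.choices[0].message.content
    # One claim per line, skipping any blank lines in the model output.
    return [line.strip() for line in text.splitlines() if line.strip()]
```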
(c) Human Evaluation
The final step is annotation by 14 annotators, all native English speakers.
The annotators were assigned to annotate all of the LLM-generated summaries, presented in random order. This step yielded a large dataset unparalleled in existing research: 130 summaries and 3,158 claim-level annotations across the 26 books in total.
It is also noteworthy that building this dataset cost $5.2k and took only about 11 hours, roughly $40 per summary, which is a major breakthrough for the construction of large annotated datasets.
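One way to picture the resulting records is the sketch below. The field names are assumptions for illustration, not the dataset's actual schema; the four faithfulness labels are the ones discussed in the analysis section that follows.

```python
# Illustrative schema for a FABLES-style annotation record;
# field names are assumptions, not the dataset's actual format.

from dataclasses import dataclass
from typing import Literal

# The four faithfulness labels assigned by the annotators.
Label = Literal["Faithful", "Unfaithful", "Partially supported", "Can't verify"]

@dataclass
class ClaimAnnotation:
    book_title: str    # one of the 26 books the annotator has already read
    model: str         # LLM that generated the summary (e.g., "Claude-3-Opus")
    claim: str         # atomic claim extracted from the summary
    label: Label       # the annotator's faithfulness judgment
    comment: str = ""  # optional free-text justification
```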
Analysis of summaries in FABLES
In addition, this paper provides a statistical and qualitative analysis of the 3,158 annotations in FABLES.
The table below shows the percentage of claims extracted from the LLM-generated summaries that the annotators rated as Faithful, Unfaithful, Partially supported, or Can't verify.
The table shows that Claude-3-Opus produces the most faithful summaries (Faithful = 90%), followed by GPT-4 and GPT-4-Turbo, whose scores are significantly lower.
These results indicate that there is a significant performance difference between Claude-3-Opus and the other models in the book summarization task.
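For concreteness, per-model percentages like those in the table could be computed from `ClaimAnnotation` records (sketched earlier) as follows; this is illustrative code, not the paper's analysis script.

```python
# Minimal sketch: percentage of each faithfulness label per model,
# computed from a list of ClaimAnnotation records (defined above).

from collections import Counter, defaultdict

def label_percentages(annotations):
    counts = defaultdict(Counter)
    for a in annotations:
        counts[a.model][a.label] += 1
    return {
        model: {label: round(100 * n / sum(c.values()), 1)
                for label, n in c.items()}
        for model, c in counts.items()
    }
```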
Additionally, the results of the qualitative analysis are shown in the figure below.
This analysis revealed that most of the claims annotated as Unfaithful concerned either a specific event (31.5%) or the state of a person or relationship (38.6%).
Summary
How was it? In this article, we described a paper that opens up new possibilities for the LLM book summarization task by constructing FABLES (Faithfulness Annotations for Book-Length Summarization), a dataset of annotations on 3,158 claims in LLM-generated summaries of 26 books, and by conducting comparative experiments with multiple LLMs.
The ingenious decision to employ annotators who had already read each book before the annotation task made it possible to construct an unprecedentedly large annotated dataset of long texts, and we believe this approach will become the standard for future dataset construction.
In addition, the analysis conducted in this paper offers great insight into why accuracy deteriorates in LLM book summarization tasks, and we look forward to the emergence of even more accurate LLMs building on this work.
Those interested can find the details of the dataset and experimental results in the paper.