Lessons Learned and Failures from Retrieval-Augmented Generation (RAG) System Case Studies
Three main points
✔️ Presents lessons learned and seven failure points from three case studies of retrieval-augmented generation (RAG) systems
✔️ Provides reference material for practitioners and a research roadmap for RAG systems
✔️ Contributes to the software engineering community by sharing the case studies
Seven Failure Points When Engineering a Retrieval Augmented Generation System
written by Scott Barnett, Stefanus Kurniawan, Srikanth Thudumu, Zach Brannelly, Mohamed Abdelrazek
(Submitted on 11 Jan 2024)
Comments: Published on arXiv.
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Summary
Retrieval-augmented generation (RAG) is increasingly being used to implement semantic search in applications. A RAG system works by finding documents that match the user's query and passing them to a large language model (LLM) such as ChatGPT so that it can generate a correct answer. This approach reduces misinformation (hallucinations) from large language models, lets answers be tied back to their sources, and removes the need to annotate documents with metadata.
However, RAG systems also have the limitations inherent to information retrieval systems as well as problems that stem from the large language model itself. The paper reports failure cases of RAG systems through case studies in three different domains: research, education, and biomedicine. The authors share the lessons learned and offer seven points to consider when designing a RAG system.
A key takeaway of the paper is that a RAG system can only be validated once it is actually in operation, and that its robustness evolves over time rather than being designed in from the start. The paper also offers future research directions on RAG systems for the software engineering community.
Advances in large language models allow software engineers to build new HCI solutions, complete complex tasks, summarize documents, answer questions, and generate new content. However, a large language model alone cannot cover up-to-date information or the expertise held inside an enterprise. There are two options for solving this problem: fine-tuning the large language model, or using a RAG system. Fine-tuning requires retraining the large language model on domain-specific data, which can be difficult to manage. A RAG system, on the other hand, generates answers from existing knowledge, which reduces the management effort.
A RAG system combines information retrieval and generation to provide accurate, up-to-date information that is contextually relevant to the user's query. It also shortens development time by removing the need to build knowledge graphs and curate data.
When software engineers build RAG systems, they must preprocess the domain knowledge appropriately, store it in a suitable data store, implement a strategy for matching queries against artifacts, and call the APIs of large language models to pass in the user's query and the context documents. Methods for building RAG systems are constantly evolving, but it still has to be worked out how they apply to a specific application.
The paper presents lessons learned and seven failure points from three case studies. It provides a reference for practitioners and presents a research roadmap for RAG systems. The authors also argue that the software engineering community has a responsibility to contribute knowledge on how to leverage large language models to build robust systems, and they hope this work is an important step toward robustness in the construction of RAG systems.
Retrieval-Augmented Generation (RAG) Overview
With the rapid proliferation of large language models such as ChatGPT, Claude, and Bard, it has become common to use them as question-answering systems. While these models show very high performance, two significant challenges remain: one is hallucination, in which the large language model produces a seemingly correct but wrong answer; the other is that there is no way to control or update the output content (other than through prompt engineering). Retrieval-augmented generation is designed to overcome these challenges.
First there is the indexing process. A RAG system converts natural language into numeric vectors (embeddings), which allow documents to be searched semantically. Documents are broken into smaller chunks, each of which is converted into an embedding and indexed in a database. During this process, software engineers must size the chunks appropriately: if a chunk is too small it will not answer some questions; if it is too large it may contain noise.
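As a concrete illustration, the sketch below shows a minimal indexing pipeline: documents are split into overlapping chunks, each chunk is embedded, and the vectors are kept for later similarity search. The chunk_text, embed, and build_index functions and the 500-character chunk size are illustrative assumptions, not the paper's implementation; embed() is a stand-in for a real embedding model.

```python
# Minimal indexing sketch: split documents into overlapping chunks, embed each
# chunk, and keep the vectors for later similarity search. embed() is a
# placeholder for a real embedding model, not part of the paper.
from typing import List, Tuple
import numpy as np

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into overlapping character windows (a simple heuristic)."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

def embed(text: str) -> np.ndarray:
    """Placeholder embedding: replace with a real sentence-embedding model."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(384)

def build_index(documents: List[str]) -> Tuple[np.ndarray, List[str]]:
    """Return an (n_chunks, dim) embedding matrix and the matching chunk texts."""
    chunks = [c for doc in documents for c in chunk_text(doc)]
    vectors = np.stack([embed(c) for c in chunks])
    return vectors, chunks
```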
Different types of documents require different chunking and processing. For example, video content must first be transcribed to convert the audio to text. The choice of embedding model is also important, and changing it may require re-indexing all chunks; the embedding is chosen based on its ability to retrieve semantically correct responses.
Next is the query process, which takes place at retrieval time. The natural language question is first converted into a general query, and its embedding is computed and used to search the database for related documents. The top-k most similar documents are retrieved using a measure such as cosine similarity, on the assumption that chunks semantically close to the query are more likely to contain the answer.
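A minimal sketch of this retrieval step is shown below, reusing the embed(), vectors, and chunks placeholders from the indexing sketch; the choice of k = 5 and the brute-force cosine search are illustrative assumptions (a production system would typically use a vector database).

```python
# Query-time sketch: embed the question and return the top-k chunks by cosine
# similarity. Reuses embed(), vectors, and chunks from the indexing sketch.
import numpy as np

def top_k_chunks(query: str, vectors: np.ndarray, chunks: list, k: int = 5) -> list:
    """Return (chunk, similarity) pairs for the k chunks closest to the query."""
    q = embed(query)
    # Cosine similarity between the query vector and every chunk vector.
    sims = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q) + 1e-9)
    best = np.argsort(-sims)[:k]
    return [(chunks[i], float(sims[i])) for i in best]
```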
The retrieved documents are then re-ranked so that the chunks most likely to contain the answer are placed at the top. In the next stage, the chunks are processed to respect the token and rate limits of the large language model: OpenAI's services, for example, limit how much text can be included in a prompt, and the system's latency is also constrained.
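The following sketch illustrates one way to implement that stage under stated assumptions: rerank_score() is a placeholder for a real re-ranker (for example a cross-encoder), and tokens are crudely approximated as four characters each; neither detail comes from the paper.

```python
# Context-assembly sketch: re-rank the retrieved chunks, then pack as many as
# fit inside a rough token budget. rerank_score() stands in for a real
# re-ranker, and tokens are approximated as ~4 characters each.
def rerank_score(query: str, chunk: str) -> float:
    """Placeholder re-ranker: fraction of query words that appear in the chunk."""
    q_words = set(query.lower().split())
    return len(q_words & set(chunk.lower().split())) / (len(q_words) + 1e-9)

def build_context(query: str, retrieved: list, max_tokens: int = 3000) -> str:
    """Join the highest-scoring chunks without exceeding the token budget."""
    ranked = sorted(retrieved, key=lambda ch: rerank_score(query, ch), reverse=True)
    parts, used = [], 0
    for chunk in ranked:
        cost = len(chunk) // 4  # crude token estimate
        if used + cost > max_tokens:
            break
        parts.append(chunk)
        used += cost
    return "\n\n".join(parts)
```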
At the end of the RAG pipeline, the answer is extracted from the generated text. The reader removes noise from the prompt and produces output for the query according to the formatting instructions. Implementing a RAG system therefore requires customizing multiple prompts to process questions and answers.
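A reader prompt might look like the hypothetical template below; the wording is illustrative and not taken from the paper, and the refusal instruction anticipates the missing-content failure discussed later.

```python
# Reader prompt sketch: the context and formatting instructions are combined
# into one prompt for the LLM. The wording is illustrative, not the paper's.
READER_PROMPT = """Answer the question using only the context below.
If the answer is not in the context, reply "Sorry, I don't know."

Context:
{context}

Question: {question}

Answer (follow any formatting requested in the question):"""

def build_reader_prompt(context: str, question: str) -> str:
    """Fill the template with the assembled context and the user's question."""
    return READER_PROMPT.format(context=context, question=question)
```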
Using large language models to answer questions over documents in real time opens up new application areas. However, RAG systems are difficult to test: because no test data exists up front, the system has to be piloted with synthetically generated data and minimal testing.
Case Studies
The study examines three case studies to identify the challenges faced when implementing a RAG system. A summary of each case study is provided in the table below.
The first case study is Cognitive Reviewer, a RAG system designed to assist in the analysis of scientific documents. The researcher specifies a question or objective and uploads the relevant research papers; the system ranks the documents against the stated objective and lets the researcher review them manually. Researchers can also ask questions directly across all of the documents. Cognitive Reviewer is used by PhD students at Deakin University to support literature reviews. The system relies on a robust data-processing pipeline to handle uploaded documents, with indexing performed at runtime, and it uses a ranking algorithm to sort the documents.
The second case study is AI Tutor, a RAG system that allows students to ask questions about a course unit and receive answers drawn from the learning content. Students can access a list of the sources behind each answer. AI Tutor is integrated into Deakin University's learning management system and indexes all content, including PDF documents, videos, and text documents; videos are transcribed with the deep learning model Whisper and then chunked. AI Tutor was developed between August and November 2023 and pilot-tested in a unit of 200 students at the end of October 2023. The pipeline includes a rewriter, which generalizes queries, and a chat interface that carries context over from previous interactions between the user and the AI Tutor. The rewriter reformulates the query to take this context into account and to resolve ambiguous requests.
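The sketch below shows one plausible shape for such a rewriter, assuming a generic chat_llm() call (a placeholder, not a specific provider API) and a history of (role, text) turns; the prompt wording is invented for illustration.

```python
# Rewriter sketch: fold recent chat turns into a standalone query before
# retrieval so ambiguous follow-ups can be resolved. chat_llm() is a
# placeholder for any chat-completion API call.
def chat_llm(prompt: str) -> str:
    """Placeholder: wire up your LLM provider here."""
    raise NotImplementedError

def rewrite_query(history: list, question: str) -> str:
    """Rewrite the latest question as a self-contained search query."""
    turns = "\n".join(f"{role}: {text}" for role, text in history[-6:])
    prompt = (
        "Rewrite the final user question as a self-contained search query, "
        "using the conversation for any missing context.\n\n"
        f"Conversation:\n{turns}\n\nQuestion: {question}\n\nRewritten query:"
    )
    return chat_llm(prompt)
```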
The third case study is biomedical question answering. While the case studies above focused on documents with relatively little content, here the authors build a RAG system on the BioASQ dataset to explore questions at a larger scale. The dataset contains questions, links to documents, and answers; the answers can be yes/no, text summaries, factoids, or lists. 4,017 open-access documents were downloaded from the BioASQ dataset, covering a total of 1,000 questions. All documents were indexed and the questions were posed to the RAG system.
The generated responses were evaluated with the OpenEvals technique implemented by OpenAI: 40 were inspected manually, along with every response flagged as incorrect by OpenEvals. In this domain the automated evaluation turned out to be more pessimistic than the human evaluators. There is a threat to the validity of this finding, however: BioASQ is a domain-specific dataset and the reviewers were not experts, so the large language model may know more than a non-expert.
Failure Points of Retrieval-Augmented Generation Systems
Based on the three case studies described above, the paper identifies the following seven failure points, which it treats as the main problems that arise when developing RAG systems.
The first is missing content (FP1). This occurs when the system receives a question that cannot be answered from the available documents. Ideally the response would be a refusal such as "Sorry, I don't know", but the system may instead generate a wrong answer to a related question.
The second is missing the top-ranked documents (FP2). This is when the answer exists in a document, but the document is not ranked highly enough to be returned to the user. In theory all documents are ranked, but in practice only the top k documents are returned, with k chosen based on performance.
The third is not in context (FP3). This is when a document containing the answer is retrieved but does not make it into the context used to generate the response. It happens when many documents are returned and the consolidation step fails to carry the relevant one through.
The fourth is not extracted (FP4). This is when the answer is present in the context but the large language model fails to extract it correctly. It occurs when there is too much noise or contradictory information in the context.
The fifth is wrong format (FP5). This is when a question requires information in a specific format, such as a table or list, and the large language model ignores the instruction.
The sixth is incorrect specificity (FP6). This is when an answer is returned but is too specific, or not specific enough, for the user's needs. It occurs especially when a bare answer is returned to a question that should include educational content, or when users are unsure how to ask a question and phrase it too generally.
Finally, the seventh is an incomplete response (FP7). An incomplete answer is not wrong, but it is missing some of the information that could have been extracted. For a question such as "What are the key points covered in documents A, B, and C?", it is better to ask about each document individually.
These failure points are important things to watch for when designing and implementing a RAG system; the sketch below illustrates one possible mitigation for the last of them.
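FP7 can sometimes be avoided by decomposing a multi-document question into one question per document and merging the results. The code below is a sketch of that idea, with ask_rag() as a placeholder for a full retrieve-and-generate round trip; it is not a technique prescribed by the paper.

```python
# Sketch of a mitigation for FP7 (incomplete responses): ask about each
# document separately instead of all at once, then merge the answers.
# ask_rag() is a placeholder for a full retrieve-and-generate round trip.
def ask_rag(question: str) -> str:
    """Placeholder: call your RAG pipeline here."""
    raise NotImplementedError

def key_points_per_document(doc_names: list) -> dict:
    """One focused question per document avoids partially answered prompts."""
    return {
        name: ask_rag(f"What are the key points covered in document {name}?")
        for name in doc_names
    }
```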
Lessons Learned and Future Research Directions
Based on the lessons learned from the three case studies, the paper summarizes considerations for engineering RAG systems and poses future research questions.
The first concerns chunking and embeddings. Chunking a document may look simple, but its quality has a large impact on the retrieval process; in particular, a chunk's embedding determines how well it matches user queries. There are two chunking approaches: heuristic chunking, which relies on punctuation, paragraph endings, and so on, and semantic chunking, which uses the meaning of the text to decide where a chunk starts and ends. Research is needed to explore the trade-offs between these methods and their effect on downstream processes such as embedding and similarity matching. An evaluation framework with metrics for query relevance and retrieval accuracy would contribute to this area.
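To make the contrast concrete, the sketch below compares a simple heuristic splitter with a naive semantic splitter that closes a chunk when the next sentence's embedding drifts away from the running chunk. The 0.6 similarity threshold, the sentence regex, and the reuse of the earlier embed() placeholder are all illustrative assumptions rather than methods from the paper.

```python
# Chunking comparison sketch: a heuristic splitter based on paragraph breaks
# versus a naive semantic splitter that closes a chunk when the next sentence's
# embedding drifts from the running chunk. Uses the embed() placeholder above;
# the 0.6 threshold is an illustrative assumption.
import re
import numpy as np

def heuristic_chunks(text: str) -> list:
    """Split on blank lines (paragraph boundaries)."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

def semantic_chunks(text: str, threshold: float = 0.6) -> list:
    """Group consecutive sentences while they stay semantically similar."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []
    chunks, current = [], [sentences[0]]
    for sent in sentences[1:]:
        a, b = embed(" ".join(current)), embed(sent)
        cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)
        if cos < threshold:  # topic shift: close the current chunk
            chunks.append(" ".join(current))
            current = [sent]
        else:
            current.append(sent)
    chunks.append(" ".join(current))
    return chunks
```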
Embedding is itself an active research area, including the generation of embeddings for multimedia and multimodal chunks such as tables, figures, and mathematical formulae. Chunk embeddings are typically generated once, during system development or when new documents are indexed. Query preprocessing also has a significant impact on the performance of a RAG system, especially for negative or ambiguous queries. The authors call for further research on architectural patterns and approaches that address the inherent limitations of embeddings.
The second concerns fine-tuning. Large language models are strong models trained on vast amounts of data, and they are further shaped by the fine-tuning that takes place before release. However, they are general-purpose models that may not know the details of a specific domain, and their knowledge has a cut-off date. Fine-tuning and retrieval-augmented generation are two customization approaches with different trade-offs.
Fine-tuning requires curating internal datasets and training the large language model, which raises security and privacy issues because the data becomes part of the model. Moreover, fine-tuning has to be repeated as the underlying model evolves and as new data arrives. A RAG system, by contrast, offers a practical alternative: data is chunked as needed, and the large language model generates answers using only the relevant chunks in context. This allows knowledge to be updated continuously with new documents and gives control over which chunks users can access. However, optimal strategies for chunk embedding, retrieval, and context fusion remain an active research area. Research is needed to systematically compare fine-tuning and retrieval-augmented generation in terms of accuracy, latency, operating cost, robustness, and other factors.
The third concerns testing and monitoring of RAG systems. Software engineering best practices for RAG systems are still emerging, and software testing and test case generation are among the areas with room for improvement. RAG systems need application-specific questions and answers for the indexed unstructured documents, which are often unavailable. Recent research has explored using large language models to generate questions from multiple documents, but how to generate realistic, domain-relevant questions and answers remains an open problem.
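One common pattern is to ask an LLM to produce a question and answer for each indexed chunk, yielding a small synthetic test set. The sketch below assumes the same chat_llm() placeholder as before and an "ANSWER:" separator in the reply; both are illustrative conventions, not part of the paper.

```python
# Test-data sketch: ask an LLM to produce a question-answer pair for each
# indexed chunk, giving a small synthetic evaluation set. chat_llm() is the
# same placeholder used in the rewriter sketch; "ANSWER:" is an arbitrary
# separator, not a convention from the paper.
def generate_qa_pairs(chunks: list) -> list:
    """Return (question, answer) pairs derived from each chunk."""
    pairs = []
    for chunk in chunks:
        prompt = (
            "Write one question that can be answered from the passage below, "
            "then the answer, separated by 'ANSWER:'.\n\n"
            f"Passage:\n{chunk}"
        )
        reply = chat_llm(prompt)
        question, _, answer = reply.partition("ANSWER:")
        pairs.append((question.strip(), answer.strip()))
    return pairs
```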
Once adequate test data is available, quality metrics are also needed to support quality trade-offs. Using large language models is expensive, raises latency concerns, and their performance characteristics change with each new release. These properties have been studied for machine learning systems before, but the adaptations needed to apply them to large language model-based systems such as retrieval-augmented generation have not yet been made. Another idea is to borrow concepts from self-adaptive systems to monitor and adapt RAG systems.
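Monitoring can start simply, for instance by logging latency and output size per query so that regressions show up when the model or the index changes. The sketch below assumes the ask_rag() placeholder from earlier and an illustrative JSON-lines log format; it is a starting point, not a monitoring design from the paper.

```python
# Monitoring sketch: log per-query latency and answer size so regressions show
# up when the model or index changes. Uses the ask_rag() placeholder from
# above; the JSON-lines log format and field names are illustrative.
import json
import time

def answer_with_telemetry(question: str, log_path: str = "rag_metrics.jsonl") -> str:
    """Answer a question and append simple quality/cost signals to a log file."""
    start = time.perf_counter()
    answer = ask_rag(question)
    record = {
        "question": question,
        "latency_s": round(time.perf_counter() - start, 3),
        "answer_chars": len(answer),
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return answer
```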
Summary
Retrieval-augmented generation systems are a new way of retrieving information that leverages large language models. Software engineers increasingly come into contact with RAG systems when implementing semantic search or through new code-dependent tasks.
The paper presents lessons learned from three case studies involving 15,000 documents and 1,000 questions. It also lays out the challenges of implementing a RAG system and offers guidelines for practitioners.
Systems built on large language models are expected to keep demonstrating new capabilities of interest to engineers and researchers, and this paper provides the first investigation of RAG systems from a software engineering perspective.